# Housing Price Prediction Using an XGBoost Model

### Introduction
This notebook predicts house sale prices on the Ames Housing dataset. Predictions are scored by the root-mean-squared error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price, so the goal is to make that error as small as possible.
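
As a minimal sketch of that metric, the function below computes the RMSE between log prices. The name `rmse_of_logs` and the prices are hypothetical and used only for illustration. The notebook itself trains on `np.log1p` of the price, which for prices of this magnitude differs negligibly from the plain logarithm.

```python
import numpy as np


def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Hypothetical prices, for illustration only
observed = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
print("RMSE of logs: {:.4f}".format(rmse_of_logs(observed, predicted)))

# A 10% overprediction costs the same at any price level
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([1_000_000], [1_100_000])
```

Because the error is taken on logs, it penalizes relative rather than absolute mistakes, so expensive and cheap houses contribute on the same footing.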

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import make_scorer, mean_squared_error
import xgboost as xgb

### Data Preprocessing
- Missing Values: Handled using `SimpleImputer`. Numerical features were filled with the median value, while categorical features were filled with the most frequent value.
- Feature Scaling: Numerical features were scaled using `StandardScaler`.
- Categorical Encoding: Categorical features were encoded using `OneHotEncoder`.

In [2]:
# Load the data
train = pd.read_csv('./Dataset/train.csv')
test = pd.read_csv('./Dataset/test.csv')
submission = pd.read_csv('./Dataset/sample_submission.csv')

# Target variable
y = np.log1p(train['SalePrice'])

# Drop target from training data
train.drop(columns=['SalePrice'], inplace=True)

### Feature Engineering
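- Combined train and test data so that numeric and categorical columns are identified consistently across both sets.
- Defined separate preprocessing pipelines for numeric and categorical features. The transformers are fit on the training rows only, because the full pipeline is later fit on the training split.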
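
The combine-and-split pattern can be sketched on toy frames. The column values below are hypothetical; only `pd.concat(..., keys=...)` and `DataFrame.xs` mirror what the notebook uses.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
train_toy = pd.DataFrame({"LotArea": [8450, 9600], "Street": ["Pave", "Grvl"]})
test_toy = pd.DataFrame({"LotArea": [11622.0, None], "Street": ["Pave", "Pave"]})

# keys= adds an outer index level that records which set each row came from
combined = pd.concat([train_toy, test_toy], keys=["train", "test"])

# The integer train column and the float test column now share one dtype
print(combined["LotArea"].dtype)

# xs() selects one outer key and drops that level, giving back the rows of
# the original set (dtypes follow the combined frame)
train_back = combined.xs("train")
test_back = combined.xs("test")
```

Deciding column types on the combined frame avoids a column being treated as integer in one set and float in the other when only the test set has missing values.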

In [3]:
# Combine train and test data for preprocessing
all_data = pd.concat([train, test], keys=['train', 'test'])

# Identify numeric and categorical columns
numeric_features = all_data.select_dtypes(include=['int64', 'float64']).columns
categorical_features = all_data.select_dtypes(include=['object']).columns

# Preprocessing for numerical data
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])



# Combine preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
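
To see what such a preprocessor produces, here is a small sketch on a toy frame with one missing value per column. The data and the names `toy`, `toy_preprocessor`, and `to_dense` are hypothetical; the transformers are configured the same way as in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, np.nan, 12000.0],
    "Street": pd.Series(["Pave", "Grvl", "Pave", np.nan], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())]), ["LotArea"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Street"]),
])


def to_dense(matrix):
    """Return a dense array whether the transformer output is sparse or not."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


# One scaled numeric column plus one indicator column per street type
encoded = to_dense(toy_preprocessor.fit_transform(toy))
print(encoded.shape)

# A category never seen during fit is encoded as all zeros, not an error
unseen_row = pd.DataFrame({
    "LotArea": [9000.0],
    "Street": pd.Series(["Dirt"], dtype=object),
})
unseen = to_dense(toy_preprocessor.transform(unseen_row))
```

The `handle_unknown='ignore'` setting is what lets the fitted encoder transform test rows whose categories did not appear in the training rows.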

### Model Selection
The XGBoost regressor was chosen because gradient-boosted trees tend to perform well on tabular data and can capture non-linear relationships and feature interactions. The following hyperparameters were used:

- `n_estimators=1000`
- `learning_rate=0.05`
- `max_depth=3`
- `subsample=0.7`
- `colsample_bytree=0.7`

In [4]:
# Create a pipeline that combines preprocessing and modeling
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', xgb.XGBRegressor(
        objective='reg:squarederror',
        n_estimators=1000,
        learning_rate=0.05,
        max_depth=3,
        subsample=0.7,
        colsample_bytree=0.7))])

# Split the data back into train and test sets
train_data = all_data.xs('train')
test_data = all_data.xs('test')

# Fit the model
model.fit(train_data, y)

### Model Evaluation
The model was evaluated using 5-fold cross-validation. The mean RMSE across folds was 0.1203 (standard deviation 0.0125), measured on the `log1p`-transformed sale price.

In [5]:
# Cross-validation to evaluate the model
rmse = np.sqrt(-cross_val_score(model, train_data, y, scoring='neg_mean_squared_error', cv=5))
print("RMSE: {:.4f} ({:.4f})".format(rmse.mean(), rmse.std()))


# Predict on the test data
predictions = model.predict(test_data)

# Prepare the submission
submission['SalePrice'] = np.expm1(predictions)
submission.to_csv('submission.csv', index=False)

RMSE: 0.1203 (0.0125)
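
The sign flip in the evaluation cell can be seen in isolation in the sketch below. It uses synthetic data and a `Ridge` model as stand-ins (both are assumptions chosen so the example runs quickly without the housing data or XGBoost); only the scoring logic matches the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the housing features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
neg_mse = cross_val_score(
    Ridge(alpha=1.0), X_demo, y_demo,
    scoring="neg_mean_squared_error", cv=5)

# Flip the sign, then take the square root to get one RMSE per fold
fold_rmse = np.sqrt(-neg_mse)
print("RMSE: {:.4f} ({:.4f})".format(fold_rmse.mean(), fold_rmse.std()))
```

The two printed numbers are the mean and standard deviation of the per-fold RMSE values, which is how the `0.1203 (0.0125)` line above should be read.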


### Conclusion
In this notebook, a pipeline combining imputation, scaling, one-hot encoding, and an XGBoost regressor predicted house prices with a cross-validated RMSE of about 0.12 on the log scale. Future improvements could include further hyperparameter tuning and additional feature engineering.
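
One possible shape for the hyperparameter tuning mentioned above is a grid search over the pipeline. The sketch below is not what the notebook ran: it swaps in scikit-learn's `GradientBoostingRegressor` and synthetic data so that it runs without XGBoost or the housing files, and the grid values are arbitrary. The `regressor__` prefix addresses parameters of the pipeline step named `regressor`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the preprocessed housing features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)

# Stand-in pipeline; the step name mirrors the notebook's 'regressor' step
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(n_estimators=50, random_state=0)),
])

# Arbitrary illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {
    "regressor__max_depth": [2, 3],
    "regressor__learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    demo_pipeline, param_grid,
    scoring="neg_root_mean_squared_error", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best CV RMSE: {:.4f}".format(-search.best_score_))
```

Searching over the whole pipeline rather than the bare regressor keeps preprocessing inside each cross-validation fold, so the validation rows never influence the fitted imputers or scalers.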

## Save model

In [6]:
import joblib

# Save the model
joblib.dump(model, 'fine-tune_xgboost_model.pkl')


['fine-tune_xgboost_model.pkl']

In [9]:
# Save the preprocessor separately (the saved pipeline already includes it as its first step)
joblib.dump(preprocessor, 'preprocessor.pkl')

['preprocessor.pkl']

## Load model

In [7]:
import joblib

# Load the model
model = joblib.load('fine-tune_xgboost_model.pkl')

# Load the preprocessor
preprocessor = joblib.load('preprocessor.pkl')

print(type(model))  # Expect <class 'sklearn.pipeline.Pipeline'>: the saved model is the full pipeline, not a bare XGBRegressor
print(type(preprocessor))  # Ensure it's <class 'sklearn.compose._column_transformer.ColumnTransformer'> or similar


<class 'sklearn.pipeline.Pipeline'>
<class 'sklearn.compose._column_transformer.ColumnTransformer'>
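
Since the loaded object is the whole pipeline, it can be checked with a save and load round trip. The sketch below uses a small stand-in pipeline (`StandardScaler` plus `Ridge`, on random data) and a temporary directory, all of which are assumptions for illustration; the `joblib.dump` and `joblib.load` calls are the same as above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem standing in for the housing data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

# Stand-in pipeline with the same step names as the notebook's model
demo_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    saved_files = joblib.dump(demo_model, path)  # returns the files written
    restored = joblib.load(path)

# The fitted preprocessing step travels inside the pipeline
restored_preprocessor = restored.named_steps["preprocessor"]
same_predictions = np.allclose(
    demo_model.predict(X_demo), restored.predict(X_demo))
print(type(restored_preprocessor).__name__, same_predictions)
```

For prediction, loading the pipeline file alone is therefore enough: calling `predict` on it applies the stored preprocessing before the regressor.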
