# Introduction
This notebook covers model development: comparing several regressors to find the one that best predicts house prices.
## My main goals in this notebook:
* Try different models on the provided data.
* Validate each model using K-fold cross-validation.
* Generate the final output for submission.

## Importing the libraries and the data that we will use
The first step is to import the libraries needed in this notebook. The most important is pandas, which loads the data into DataFrames. Matplotlib and seaborn are imported for visualizing the dataset, and NumPy for numerical work.

In [36]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Now I will import the preprocessed data from the previous notebook.

In [37]:
x_train = pd.read_csv('Data/final_train.csv')
x_test = pd.read_csv('Data/final_test.csv')
y_train = pd.read_csv('Data/target.csv')

In [38]:
x_train.head()

Unnamed: 0,LotFrontage,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterCond,BsmtCond,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,65.0,8450,5,7,5,2003,2003,196.0,3,3,...,0,0,0,1,0,0,0,0,1,0
1,80.0,9600,7,6,8,1976,1976,0.0,3,3,...,0,0,0,1,0,0,0,0,1,0
2,68.0,11250,5,7,5,2001,2002,162.0,3,3,...,0,0,0,1,0,0,0,0,1,0
3,60.0,9550,6,7,5,1915,1970,0.0,3,4,...,0,0,0,1,1,0,0,0,0,0
4,84.0,14260,10,8,5,2000,2000,350.0,3,3,...,0,0,0,1,0,0,0,0,1,0


In [39]:
x_test.head()

Unnamed: 0,LotFrontage,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterCond,BsmtCond,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,80.0,11622,3,5,6,1961,1961,0.0,3,3,...,0,0,0,1,0,0,0,0,1,0
1,81.0,14267,3,6,6,1958,1958,108.0,3,3,...,0,0,0,1,0,0,0,0,1,0
2,74.0,13830,5,5,5,1997,1998,0.0,3,3,...,0,0,0,1,0,0,0,0,1,0
3,78.0,9978,5,6,6,1998,1998,20.0,3,3,...,0,0,0,1,0,0,0,0,1,0
4,43.0,5005,9,8,5,1992,1992,0.0,3,3,...,0,0,0,1,0,0,0,0,1,0


In [40]:
y_train.head()

Unnamed: 0,SalePrice
0,208500
1,181500
2,223500
3,140000
4,250000


In [41]:
print('Training set dimensions: ', x_train.shape)
print('Test set dimensions: ', x_test.shape)

Training set dimensions:  (1456, 226)
Test set dimensions:  (1459, 226)


## Creating estimators from different models
In this step I will create several estimators (XGBoost, Ridge, Lasso, and SVR).
The Ridge, Lasso, and SVR estimators are each wrapped in a pipeline that applies a robust scaler before training; XGBoost is used directly. Every model is then evaluated with 10-fold cross-validation.
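To illustrate why the scaler lives inside the pipeline, here is a minimal sketch on synthetic data (the data, the plain `Ridge` model, and the `kf_demo` splitter are assumptions made for illustration, not part of this notebook's dataset). Because the scaler is part of the estimator, `cross_validate` re-fits it on each training fold only, so no statistics from the validation fold leak into training.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the housing data (assumption, not the real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# The scaler is part of the estimator, so it is re-fit on each training fold only
model = make_pipeline(RobustScaler(), Ridge(alpha=1.0))

# shuffle=True with a fixed seed is an illustrative choice; the notebook uses KFold(10)
kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_validate(model, X, y, cv=kf_demo,
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True)

# Scores are negated RMSE values, so flip the sign to report RMSE
train_rmse = -scores['train_score'].mean()
val_rmse = -scores['test_score'].mean()
print(f'train RMSE: {train_rmse:.3f}, validation RMSE: {val_rmse:.3f}')
```

Shuffling the folds can matter when the rows of a dataset are ordered in some meaningful way; whether that applies here depends on how the preprocessed files were written.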

In [42]:
from sklearn.model_selection import cross_val_score, KFold, cross_validate
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from mlxtend.regressor import StackingCVRegressor

In [43]:
#setting k-fold cross validation
kf = KFold(10)

# parameter for ridge and lasso regressors
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

#ridge regressor
ridge = make_pipeline(RobustScaler(), RidgeCV(alphas=alphas, cv=kf,))

#lasso regressor
lasso = make_pipeline(RobustScaler(), LassoCV(max_iter=100000, alphas=alphas, random_state=132, cv=kf))

#SVM regressor
svr = make_pipeline(RobustScaler(), SVR(C=21, epsilon=0.0099, gamma=0.00017))

#XGBoost regressor
xgboost = XGBRegressor(learning_rate=0.02, n_estimators=3000, max_depth=4, min_child_weight=0, subsample=0.7968,
                       colsample_bytree=0.4064, nthread=-1, scale_pos_weight=2, seed=42)

In [44]:
models = [ridge, lasso, svr, xgboost]
names = ['ridge', 'lasso', 'svr', 'xgboost']
results = []

# Cross-validate each model and store its mean train and validation scores (negative RMSE)
for name, model in zip(names, models):
    result = cross_validate(model, x_train, y_train, cv=kf, scoring='neg_root_mean_squared_error',
                             return_train_score=True, n_jobs=-1)
    results.append([result['train_score'].mean(), result['test_score'].mean(), name])

In [45]:
for result in results:
    print(result)

[-25200.790951345327, -30048.208203778056, 'ridge']
[-22147.378291031702, -31126.300739759772, 'lasso']
[-76628.97098168459, -76283.79040125947, 'svr']
[-2493.5593176014177, -21725.379104499425, 'xgboost']


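Each row above lists the mean training score, the mean validation score, and the model name. The scores are negative RMSE values, so values closer to zero are better. XGBoost has the best validation score (an RMSE of about 21,725), ahead of ridge (about 30,048) and lasso (about 31,126), while SVR is far behind (about 76,284). XGBoost's training RMSE is much lower (about 2,494), which suggests some overfitting, but it still generalizes best of the four.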
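The comparison is easier to read once the signs are flipped. The sketch below copies the scores printed above into a small table, converts them to RMSE, and adds the gap between validation and training error (the `summary` and `best_model` names are introduced here for illustration only).

```python
import pandas as pd

# Mean cross-validation scores copied from the output above (negated RMSE)
results = [
    [-25200.790951345327, -30048.208203778056, 'ridge'],
    [-22147.378291031702, -31126.300739759772, 'lasso'],
    [-76628.97098168459, -76283.79040125947, 'svr'],
    [-2493.5593176014177, -21725.379104499425, 'xgboost'],
]

summary = pd.DataFrame(results, columns=['train_score', 'val_score', 'model'])

# Flip the sign so that lower means better
summary['train_rmse'] = -summary['train_score']
summary['val_rmse'] = -summary['val_score']

# A large gap between validation and training RMSE hints at overfitting
summary['gap'] = summary['val_rmse'] - summary['train_rmse']

summary = (summary[['model', 'train_rmse', 'val_rmse', 'gap']]
           .sort_values('val_rmse')
           .reset_index(drop=True))
best_model = summary.loc[0, 'model']
print(summary.round(1))
```

Sorting by validation RMSE puts XGBoost first, and the `gap` column shows that it also has the largest difference between training and validation error.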

## Creating the submission
The final step is to train the XGBoost model on the whole training set, then predict on the test set and save the final submission.

In [46]:
estimator_full = XGBRegressor(learning_rate=0.02, n_estimators=3000, max_depth=4, min_child_weight=0, subsample=0.7968,
                       colsample_bytree=0.4064, nthread=-1, scale_pos_weight=2, seed=42)
estimator_full.fit(x_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.4064,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
             grow_policy='depthwise', importance_type=None,
             interaction_constraints='', learning_rate=0.02, max_bin=256,
             max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
             max_depth=4, max_leaves=0, min_child_weight=0, missing=nan,
             monotone_constraints='()', n_estimators=3000, n_jobs=-1,
             nthread=-1, num_parallel_tree=1, predictor='auto', ...)

In [50]:
y_pred = estimator_full.predict(x_test)
print(y_pred[:5])

[122347.35 162386.6  194208.66 198245.52 183772.66]


Next, I will create the Id column, build the submission DataFrame, and save it.

In [53]:
id_column = range(1461,2920)
submission = pd.DataFrame(list(zip(id_column, y_pred)), columns =['Id', 'SalePrice'])
submission.to_csv('Data/mysubmission.csv', index=False)
submission.head()

Unnamed: 0,Id,SalePrice
0,1461,122347.351562
1,1462,162386.59375
2,1463,194208.65625
3,1464,198245.515625
4,1465,183772.65625
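Before uploading, a few quick checks on the submission can catch an off-by-one in the Id range or missing predictions. This is a minimal sketch that uses placeholder predictions in place of `y_pred` (the `n_test`, `y_pred_demo`, and `submission_demo` names are assumptions for illustration); the row count comes from the test set shape printed earlier.

```python
import numpy as np
import pandas as pd

# Placeholder predictions (assumption); in the notebook this would be y_pred
n_test = 1459  # number of rows in x_test, from the shapes printed earlier
y_pred_demo = np.full(n_test, 180000.0)

# Ids start at 1461 and cover one row per test example
id_column = range(1461, 1461 + n_test)
submission_demo = pd.DataFrame({'Id': id_column, 'SalePrice': y_pred_demo})

# One row per test example, with unique Ids ending at 2919
assert len(submission_demo) == n_test
assert submission_demo['Id'].is_unique
assert submission_demo['Id'].iloc[-1] == 2919

# No missing or non-positive prices
assert submission_demo['SalePrice'].notna().all()
assert (submission_demo['SalePrice'] > 0).all()
```

Building the frame from a dictionary is equivalent to zipping the two columns together and keeps the column names next to their data.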


## Final note
I would like to gain more experience and further develop my skills, so I am open to any comments or suggestions on my work. If you liked this notebook, please don't forget to vote. Thank you!