## Price Prediction Models
> This notebook uses the information from the exploration step to perform the preprocessing and build the ML models.

### Goal
> predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.

This notebook summarizes all preprocessing steps, including feature selection based on the visualizations in the preparation notebook, tests of various machine learning algorithms, and a basic blended ML model. It serves as a baseline for exploring feature engineering, using GridSearch to find the best hyperparameters for each algorithm, and implementing more advanced blending or even stacking methods to improve the predictive score.


### Content
- [Preprocessing](#pre)
- [Modeling](#modeling)
- [Use Models to make predictions](#pred)

### Data
- <a href='https://www.kaggle.com/c/house-prices-advanced-regression-techniques'>Link</a> to Kaggle competition 
- <a href='https://amstat.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627#.X1ZIXy337UI'>Paper</a> on the data


### Score of final model
- blended model with XGBRegressor, LGBMRegressor and Ridge Regression (without any hyperparameter tuning or feature engineering): 
    - Kaggle Score: 0.13696
    - top 46%

In [3]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ml related imports
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import Ridge, Lasso



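# silence SettingWithCopyWarning
import warnings
try:
    # location in newer pandas releases
    from pandas.errors import SettingWithCopyWarning
except ImportError:
    # location in older pandas releases
    from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)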

In [2]:
# get the data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
train.shape, test.shape

((1460, 81), (1459, 80))

<a id='pre'></a>
### Preprocessing
- remove missing values
- transform `MSSubClass` and `MoSold` to a categorical feature
- drop 'unimportant' categorical features
- preprocess categorical features
- split all_data into train and test again
- transform numerical features
    - log transform `SalePrice`
    - drop numerical features whose correlation with `SalePrice` is below 0.3
    - drop outliers
- Identify columns to drop for test set

In [4]:
train_pre = train.copy()
test_pre = test.copy()

#### remove missing values
> For this I will concatenate the train and test sets. Since I am only doing basic missing value imputation (mean, median, and mode), the problem of data leakage is minimized. With more advanced methods like KNN imputation or missForest, one should definitely perform the imputation on each dataset separately, since otherwise information from the test set leaks into the training set.
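
One common way to avoid the leakage described above is to compute the fill values on the training rows only and reuse them for the test rows. The following is a minimal sketch under that assumption; the frames `toy_train` and `toy_test` and their values are made up for illustration and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# hypothetical toy frames standing in for the real train/test sets
toy_train = pd.DataFrame({"LotArea": [8000.0, np.nan, 10000.0],
                          "MSZoning": ["RL", "RL", None]})
toy_test = pd.DataFrame({"LotArea": [np.nan, 50000.0],
                         "MSZoning": [None, "RM"]})

# learn the fill values from the training set only
fill_values = {
    "LotArea": toy_train["LotArea"].median(),
    "MSZoning": toy_train["MSZoning"].mode()[0],
}

# apply the same values to both sets; test statistics are never used
train_filled = toy_train.fillna(fill_values)
test_filled = toy_test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

The extreme `LotArea` value in `toy_test` has no influence on the fill value, which is the point of keeping the two sets apart.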

In [5]:
all_data = pd.concat([train_pre, test_pre], ignore_index=True)

In [6]:
# many of the missing values are just encodings for the case that a specific feature isn't available
# list of features with wrong encoding for NA
feature_NA = ['Alley', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 
              'MiscFeature', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']

# assign NA to None to indicate the lack of a certain feature
all_data[feature_NA] = all_data[feature_NA].fillna('None')

In [7]:
# impute missing categorical features mostly with the mode
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data['Utilities'] = all_data['Utilities'].fillna(all_data['Utilities'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['MasVnrType'] = all_data['MasVnrType'].fillna('None')
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Functional'] = all_data['Functional'].fillna(all_data['Functional'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

In [8]:
# impute missing numerical features (most numerical features had only 1-2 missing values; in that case I just imputed 0)
all_data['MasVnrArea'] = all_data['MasVnrArea'].fillna(0)
all_data['BsmtFinSF1'] = all_data['BsmtFinSF1'].fillna(0)
all_data['BsmtFinSF2'] = all_data['BsmtFinSF2'].fillna(0)
all_data['BsmtUnfSF'] = all_data['BsmtUnfSF'].fillna(0)
all_data['TotalBsmtSF'] = all_data['TotalBsmtSF'].fillna(0)
all_data['BsmtFullBath'] = all_data['BsmtFullBath'].fillna(0)
all_data['BsmtHalfBath'] = all_data['BsmtHalfBath'].fillna(0)
all_data['GarageCars'] = all_data['GarageCars'].fillna(0)
all_data['GarageArea'] = all_data['GarageArea'].fillna(0)
all_data['GarageYrBlt'] = all_data['GarageYrBlt'].fillna(0)
# Neighborhood should impact the size of the street connected to the property
# code from https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
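
To make the group-wise imputation above concrete, here is a small sketch on made-up data (the two neighborhoods `A` and `B` and their frontage values are assumptions for illustration). Each missing `LotFrontage` is filled with the median of its own neighborhood.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: two neighborhoods with one missing frontage each
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 70.0, np.nan, 20.0, np.nan, 40.0],
})

# fill each missing value with the median of its own neighborhood
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

print(toy)
```

If every value in a neighborhood were missing, the group median would itself be NaN and the gaps would remain, so a check with `isna().sum()` afterwards is a sensible safeguard.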

#### transform `MSSubClass` and `MoSold` to a categorical feature

In [9]:
all_data['MSSubClass'] = all_data['MSSubClass'].astype('str')

# create dict to assign categories
MoSold_dict = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
all_data['MoSold'] = all_data['MoSold'].map(MoSold_dict)

#### drop 'unimportant' categorical features

In [10]:
# create DataFrame with relevant categorical feats
all_cat_feats = list(all_data.select_dtypes('object'))
# do the same for the numerical features before processing categorical features and creating dummies
all_num_feats = list(all_data.select_dtypes(exclude='object'))

important_cat_feats = ['Neighborhood', 'MSZoning', 'MSSubClass', 'HouseStyle', 'Foundation', 'ExterQual', 'KitchenQual', 
                       'GarageQual', 'CentralAir', 'HeatingQC', 'BsmtQual']

cat_feats_drop = list(set(all_cat_feats) - set(important_cat_feats))

all_data.drop(columns=cat_feats_drop, inplace=True)

#### preprocess categorical features

In [11]:
# create dummies for real categorical features with drop_first=True to reduce multicollinearity
list_dummies = ['MSSubClass', 'MSZoning', 'Neighborhood', 'HouseStyle', 'Foundation']
df_dummy = pd.get_dummies(all_data[list_dummies], drop_first=True)
all_data.drop(columns=list_dummies, inplace=True)
all_data = pd.concat([all_data, df_dummy], axis=1)

# convert categorical to ordinal features and use them as numeric input in the model
list_ord = ['ExterQual', 'KitchenQual', 'GarageQual', 'HeatingQC', 'BsmtQual']
map_dict_ord = {'None': 0, 'Po': 1, 'Fa': 2, 'TA':3, 'Gd':4, 'Ex': 5}
for ord_ in list_ord:
    all_data[ord_] = all_data[ord_].map(map_dict_ord)

# convert categorical to binary
all_data['CentralAir'] = all_data['CentralAir'].map({'Y': 1, 'N': 0})
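
The two encodings used above behave differently, which the following sketch on made-up rows illustrates (the toy frame is an assumption, the quality mapping is the one from the cell above). Nominal features become indicator columns, while quality grades become integers that keep their order.

```python
import pandas as pd

# hypothetical toy data with one nominal and one ordinal feature
toy = pd.DataFrame({
    "Foundation": ["PConc", "CBlock", "BrkTil", "PConc"],
    "KitchenQual": ["Gd", "TA", "Ex", "Fa"],
})

# nominal: k categories become k-1 indicator columns with drop_first=True
dummies = pd.get_dummies(toy[["Foundation"]], drop_first=True)

# ordinal: map quality labels to integers that preserve their order
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
kitchen_ord = toy["KitchenQual"].map(quality_order)

print(list(dummies.columns))
print(kitchen_ord.tolist())
```

Any label that is missing from the mapping dictionary turns into NaN under `map`, so it is worth checking the mapped columns for new missing values.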

#### Split all_data into train and test again

In [12]:
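# split all_data into train and test to perform more preprocessing and feature engineering separately (prevent data leakage)
# .copy() gives independent frames, so later in-place drops do not act on slices of all_data
train_pre = all_data.loc[:train.shape[0]-1].copy()
test_pre = all_data.loc[train.shape[0]:].copy()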

In [13]:
test_pre.reset_index(drop=True, inplace=True)

In [14]:
train_pre.shape, test_pre.shape

((1460, 97), (1459, 97))

#### transform numerical features

##### Log transform `SalePrice` and save in target

In [15]:
target = np.log(train_pre.SalePrice)
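
The competition scores submissions by the RMSE between the logarithms of predicted and observed prices, so training on `log(SalePrice)` lets an ordinary RMSE match that metric. A minimal sketch with made-up prices (all numbers are assumptions for illustration):

```python
import numpy as np

# hypothetical sale prices, true and predicted
prices = np.array([100000.0, 200000.0, 400000.0])
predicted = np.array([110000.0, 190000.0, 400000.0])

log_prices = np.log(prices)

# np.exp undoes the transform, which is needed before submitting
assert np.allclose(np.exp(log_prices), prices)

# RMSE on the log scale penalizes relative rather than absolute errors
rmse_log = np.sqrt(np.mean((np.log(predicted) - log_prices) ** 2))
print(round(rmse_log, 4))
```

On this scale a 10% error on a cheap house counts as much as a 10% error on an expensive one.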

##### drop numerical features that have a correlation below 0.3

In [16]:
# create matrix
corr_matrix = train_pre[all_num_feats]
corr_matrix = corr_matrix.corr()
corr_matrix = corr_matrix['SalePrice']

# filter matrix: drop num features with correlation below 0.3
num_feats_drop = corr_matrix[corr_matrix < 0.3].index
train_pre.drop(columns=num_feats_drop, inplace=True)
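
Note that the filter above compares the signed correlation, so a feature with a strong negative correlation would be dropped as well. The sketch below contrasts this with a filter on the absolute value; the correlation values and the feature `AgeAtSale` are hypothetical, and whether the difference matters for this dataset is not checked here.

```python
import pandas as pd

# hypothetical correlations with SalePrice
corr = pd.Series({"GrLivArea": 0.71, "PoolArea": 0.09, "AgeAtSale": -0.52})

# signed filter: drops weak correlations AND every negative one
drop_signed = list(corr[corr < 0.3].index)

# absolute-value filter: drops only the weakly correlated features
drop_abs = list(corr[corr.abs() < 0.3].index)

print(drop_signed)
print(drop_abs)
```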

##### drop outliers

In [17]:
train_pre = train_pre.drop(train_pre[(train_pre['GrLivArea']>4000) & (train_pre['SalePrice']<300000)].index)
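
The outlier rule above combines two conditions: a very large living area and an unusually low price. A small sketch on made-up rows (the values are assumptions) shows which rows such a mask removes:

```python
import pandas as pd

# hypothetical toy data; only row 2 is large AND cheap
toy = pd.DataFrame({
    "GrLivArea": [1500, 4500, 4700, 2000],
    "SalePrice": [180000, 750000, 160000, 250000],
})

# both conditions must hold for a row to count as an outlier
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)
toy_clean = toy.drop(toy[mask].index)  # same rows as toy[~mask]

print(toy_clean.index.tolist())
```

The remaining rows keep their original index labels, which is why a target series computed before the drop can still be assigned back by index afterwards.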

##### add target to train

In [18]:
train_pre['SalePrice_log'] = target

#### Identify columns to drop for test set

In [19]:
# save id 
Id = test_pre['Id']

In [20]:
# features to drop
to_drop = list(set(test_pre) - set(train_pre))

In [21]:
test_pre.drop(columns=to_drop, inplace=True)

In [43]:
test_pre.drop(columns='SalePrice', inplace=True)
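
An alternative to collecting the columns to drop is to reindex the test set to the training feature columns. This is only a sketch on made-up frames (`toy_train` and `toy_test` are assumptions), not the approach used in the cells above:

```python
import pandas as pd

# hypothetical frames: the test set still carries a column the train set lost
toy_train = pd.DataFrame({"GrLivArea": [1500], "OverallQual": [7], "SalePrice": [200000]})
toy_test = pd.DataFrame({"Id": [1461], "OverallQual": [5], "GrLivArea": [900]})

# keep exactly the training features, in the training column order
feature_cols = [c for c in toy_train.columns if c != "SalePrice"]
toy_test_aligned = toy_test.reindex(columns=feature_cols)

print(list(toy_test_aligned.columns))
```

`reindex` would create an all-NaN column for any training feature that is absent from the test set, so a missing-value check after alignment is advisable.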

<a id='modeling'></a>
### Modeling
- make x_train and y_train 
- define Cross Validation
- instantiate models

#### make x_train and y_train 

In [47]:
# use train set to make x_train and y_train
x_train = train_pre.drop(columns=['SalePrice_log', 'SalePrice'])
y_train = train_pre['SalePrice_log']

In [48]:
x_train.shape, y_train.shape, test_pre.shape

((1458, 78), (1458,), (1459, 78))

In [49]:
# check that x_train and test_pre have identical columns
list(set(x_train) - set(test_pre)), list(set(test_pre) - set(x_train))

([], [])

#### Cross Validation
> define cross validation method to evaluate different machine learning models on the training data before using a model on the test data.

In [50]:
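# code: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
n_folds = 5

def rmsle_cv(model):
    # note: get_n_splits() returns the integer n_folds, so cross_val_score
    # runs a plain 5-fold CV here and the shuffle settings are not applied
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    # RMSE on log(SalePrice), i.e. the RMSLE of the prices
    rmse = np.sqrt(-cross_val_score(model, x_train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
    return rmse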
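
In the function above, `get_n_splits` returns only the number of folds, so the `shuffle` and `random_state` settings never reach `cross_val_score`. A minimal sketch of a shuffled variant, assuming synthetic data in place of `x_train` and `y_train`, passes the `KFold` object itself as `cv`. Scores from this variant would differ slightly from the ones reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# hypothetical synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits only reports the number of folds as an integer
n_splits = kf.get_n_splits(X)

# passing the KFold object keeps shuffle and random_state in effect
scores = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=kf)
rmse = np.sqrt(-scores)

print(n_splits, rmse.mean())
```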

#### Models

In [51]:
# XGBoost
model_xgb = XGBRegressor()
# LightGBM
model_lgb = LGBMRegressor(objective='regression')
# Lasso Regression
model_lasso = Lasso()
# Ridge Regression
model_ridge = Ridge()


# list of models for cross validation
models_list = [model_xgb, model_lgb, model_lasso, model_ridge]
model_names = ['model_xgb', 'model_lgb', 'model_lasso', 'model_ridge']

In [52]:
for model, name in zip(models_list, model_names):
    print(name + ' rmsle score:')
    print(np.mean(rmsle_cv(model)))
    print('#'*30)

model_xgb rmsle score:
0.14445967712041968
##############################
model_lgb rmsle score:
0.13746055595980713
##############################
model_lasso rmsle score:
0.17146718486654478
##############################
model_ridge rmsle score:
0.1244083497141513
##############################


> basic Ridge Regression seems to be the best of the selected baseline models

<a id='pred'></a>
### Use models on test data
- use single model
- basic blended model

#### use LightGBM to make a submission
> the same code was used to make predictions with the other algorithms

In [58]:
# train on train data
model_lgb.fit(x_train, y_train)

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

In [59]:
lgb_preds = model_lgb.predict(test_pre)

In [60]:
submission = pd.DataFrame()
submission['Id'] = Id
# transform log of SalePrice
submission['SalePrice'] = np.exp(lgb_preds)

In [61]:
submission.to_csv('sub_lgb_basis.csv', index=False)

In [62]:
submission

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 121332.313427 |
| 1 | 1462 | 154402.770147 |
| 2 | 1463 | 187984.471281 |
| 3 | 1464 | 182457.513648 |
| 4 | 1465 | 196060.735397 |
| ... | ... | ... |
| 1454 | 2915 | 73066.162924 |
| 1455 | 2916 | 83065.032787 |
| 1456 | 2917 | 169410.979017 |
| 1457 | 2918 | 108783.834550 |


#### Basic Model Blending
> use XGBRegressor, LGBMRegressor and Ridge Regression to make a prediction, and calculate the mean

In [63]:
model_xgb.fit(x_train, y_train)
model_lgb.fit(x_train, y_train)
model_ridge.fit(x_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [64]:
xgb_preds = model_xgb.predict(test_pre)
lgb_preds = model_lgb.predict(test_pre)
ridge_preds = model_ridge.predict(test_pre)

In [67]:
preds = (xgb_preds + lgb_preds + ridge_preds)/3
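
Because the three models predict `log(SalePrice)`, their plain mean corresponds to the geometric mean of the predicted prices once `np.exp` is applied. The sketch below shows this on made-up predictions and writes the blend as a weighted sum; the equal weights reproduce the mean used above, while any other weighting would be an assumption to validate with cross-validation.

```python
import numpy as np

# hypothetical log-price predictions from three models
xgb_log = np.log(np.array([100000.0, 200000.0]))
lgb_log = np.log(np.array([110000.0, 190000.0]))
ridge_log = np.log(np.array([105000.0, 210000.0]))

stacked = np.vstack([xgb_log, lgb_log, ridge_log])

# equal weights reproduce the plain mean of the three models
weights = np.array([1 / 3, 1 / 3, 1 / 3])
blend_log = weights @ stacked

# back on the price scale this equals the geometric mean of the predictions
blend_price = np.exp(blend_log)
geo_mean = np.exp(stacked).prod(axis=0) ** (1 / 3)
assert np.allclose(blend_price, geo_mean)

print(blend_price)
```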

In [68]:
submission = pd.DataFrame()
submission['Id'] = Id
# transform log of SalePrice
submission['SalePrice'] = np.exp(preds)

In [69]:
submission.to_csv('blended_model_basis.csv', index=False)

In [70]:
submission

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 118955.879161 |
| 1 | 1462 | 157738.755660 |
| 2 | 1463 | 185493.887521 |
| 3 | 1464 | 191904.085786 |
| 4 | 1465 | 198065.073784 |
| ... | ... | ... |
| 1454 | 2915 | 73897.248834 |
| 1455 | 2916 | 84286.280698 |
| 1456 | 2917 | 172593.555022 |
| 1457 | 2918 | 110948.817135 |
