# Advanced House Prices - Regression Techniques Part 2 

##  - Using Plotly for Data Visualization

**Important notice: paste this notebook's link into [nbviewer](https://nbviewer.jupyter.org/) to view the Plotly graphs in this notebook, because GitHub does not render iframes at the moment.**

##  - Using predominantly Ensemble Models for Model training and  predictions

## Aim

**The aim of this project is to predict house prices from the 79 explanatory variables (features) in the provided data, which describe (almost) every aspect of residential homes in Ames, Iowa.**

## Tasks

1. Get the data
2. Prepare the data
3. Explore the training data
4. Feature engineering
5. Data Preprocessing
6. Evaluate your ensemble models using cross-validation
7. Tune your ensemble models using grid search
8. Check the correlation of the prediction of the ensemble models on the test data
9. Stack the models (with the lowest Root Mean Square Error) together
10. Make predictions for the test data

## File descriptions

[Data Description](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) is hosted on Kaggle Datasets. 

- **train.csv** - the training set
- **test.csv** - the test set
- **data_description.txt** - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
- **sample_submission.csv** - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

## Task 1: Get the data

1. Go to the [Kaggle Datasets](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) page, and click the **Download** button.
2. Unzip **`house-prices-advanced-regression-techniques.zip`**, and then move **`house-prices-advanced-regression-techniques`** to a directory where you can easily access it.

In [1]:
import sklearn

sklearn.__version__

'0.22'

In [2]:
# import necessary libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 105)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

from scipy import stats
from scipy.stats import skew
from math import sqrt

# plotly
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, HuberRegressor, Lasso, ElasticNet, BayesianRidge
from sklearn.kernel_ridge import KernelRidge

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

import xgboost as xgb
from xgboost import XGBRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor

from mlxtend.regressor import StackingRegressor

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
#warnings.filterwarnings("ignore")

## Task 2: Data Preparation

In [3]:
#load in the data

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [4]:
print(df_train.shape, '\n', '*'*10, '\n', df_test.shape)

(1460, 81) 
 ********** 
 (1459, 80)


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [6]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
Id               1459 non-null int64
MSSubClass       1459 non-null int64
MSZoning         1455 non-null object
LotFrontage      1232 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-

In [7]:
# drop Id, since it is not useful for predicting SalePrice (the test Ids are kept in id_test)

df_train = df_train.drop('Id', axis = 1)
id_test = df_test.Id
df_test = df_test.drop('Id', axis = 1)

## Task 3: Data Exploration

In [8]:
# generate descriptive statistics of the train dataset,
# starting with the numeric columns

df_train.describe(exclude=['object']).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MSSubClass,1460.0,56.89726,42.300571,20.0,20.0,50.0,70.0,190.0
LotFrontage,1201.0,70.049958,24.284752,21.0,59.0,69.0,80.0,313.0
LotArea,1460.0,10516.828082,9981.264932,1300.0,7553.5,9478.5,11601.5,215245.0
OverallQual,1460.0,6.099315,1.382997,1.0,5.0,6.0,7.0,10.0
OverallCond,1460.0,5.575342,1.112799,1.0,5.0,5.0,6.0,9.0
YearBuilt,1460.0,1971.267808,30.202904,1872.0,1954.0,1973.0,2000.0,2010.0
YearRemodAdd,1460.0,1984.865753,20.645407,1950.0,1967.0,1994.0,2004.0,2010.0
MasVnrArea,1452.0,103.685262,181.066207,0.0,0.0,0.0,166.0,1600.0
BsmtFinSF1,1460.0,443.639726,456.098091,0.0,0.0,383.5,712.25,5644.0
BsmtFinSF2,1460.0,46.549315,161.319273,0.0,0.0,0.0,0.0,1474.0


**Only three numeric features have missing values (as observed from the *count* column).**
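
The same check can be made directly instead of reading it off the *count* column. The following is a minimal sketch on a toy frame; the `toy` name and its values are made up for illustration and are not taken from the dataset. It counts missing values in the numeric columns only.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for df_train.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "MasVnrArea": [196.0, 0.0, np.nan, np.nan],
    "Street": ["Pave", "Pave", None, "Pave"],
})

# Restrict to numeric columns, count NaNs per column, keep only columns with gaps.
numeric_missing = toy.select_dtypes(include="number").isnull().sum()
numeric_missing = numeric_missing[numeric_missing > 0].sort_values(ascending=False)
print(numeric_missing)
```

Applied to `df_train`, the length of such a series would give the number of numeric features with missing values.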

In [9]:
#lets get the descriptive statistics for the categorical columns

df_train.describe(exclude = [np.number]).T

Unnamed: 0,count,unique,top,freq
MSZoning,1460,5,RL,1151
Street,1460,2,Pave,1454
Alley,91,2,Grvl,50
LotShape,1460,4,Reg,925
LandContour,1460,4,Lvl,1311
Utilities,1460,2,AllPub,1459
LotConfig,1460,5,Inside,1052
LandSlope,1460,3,Gtl,1382
Neighborhood,1460,25,NAmes,225
Condition1,1460,9,Norm,1260


In [10]:
# count the categorical features that have missing values

print('Total number of categorical features that have missing values :', (df_train.describe(exclude = [np.number]).T['count'] != len(df_train)).astype(int).sum())

Total number of categorical features that have missing values : 16


In [11]:
# build dataframes holding only the columns with missing values, sorted by missing count in descending order

df_train_null = pd.DataFrame()
df_train_null['missing'] = df_train.isnull().sum()[df_train.isnull().sum() > 0].sort_values(ascending = False)

df_test_null = pd.DataFrame(df_test.isnull().sum(), columns = ['missing'])
df_test_null = df_test_null.loc[df_test_null['missing'] > 0].sort_values(by = ['missing'], ascending = False)

In [12]:
df_train_null

Unnamed: 0,missing
PoolQC,1453
MiscFeature,1406
Alley,1369
Fence,1179
FireplaceQu,690
LotFrontage,259
GarageYrBlt,81
GarageType,81
GarageFinish,81
GarageQual,81


In [13]:
df_test_null

Unnamed: 0,missing
PoolQC,1456
MiscFeature,1408
Alley,1352
Fence,1169
FireplaceQu,730
LotFrontage,227
GarageCond,78
GarageYrBlt,78
GarageQual,78
GarageFinish,78


In [14]:
trace1 = go.Bar(x = df_train_null.index, 
                y = df_train_null['missing'],
                name="df_train", 
                text = df_train_null.index)

trace2 = go.Bar(x = df_test_null.index, 
                y = df_test_null['missing'],
                name="df_test", 
                text = df_test_null.index)

data = [trace1, trace2]

layout = dict(title = "NaN in test and train", 
              xaxis=dict(ticklen=10, zeroline= False),
              yaxis=dict(title = "number of rows", side='left', ticklen=10,),                                  
              legend=dict(orientation="v", x=1.05, y=1.0),
              autosize=False, width=750, height=500,
              barmode='stack'
              )

fig = dict(data = data, layout = layout)
iplot(fig)

In [15]:
# drop the features with lots of missing values

df_train = df_train.drop(['PoolQC', 'FireplaceQu', 'Fence', 'Alley', 'MiscFeature'], axis = 1)

df_test = df_test.drop(['PoolQC', 'FireplaceQu', 'Fence', 'Alley', 'MiscFeature'], axis = 1)
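
The five columns above were picked by hand from the missing-value counts. As a hedged alternative, the selection could be driven by a threshold on the share of missing values. In the sketch below, `drop_sparse_columns` and the 40% threshold are assumptions made for illustration and are not part of the original notebook. With the train counts listed earlier (FireplaceQu missing in 690 of 1460 rows, LotFrontage in 259), a 40% threshold would select the same five columns.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(train, test, threshold=0.4):
    """Drop columns whose share of missing values in `train` exceeds `threshold`.

    Hypothetical helper: the same columns are dropped from `test` so that
    both frames keep the same feature set.
    """
    missing_share = train.isnull().mean()
    to_drop = missing_share[missing_share > threshold].index.tolist()
    return train.drop(columns=to_drop), test.drop(columns=to_drop, errors="ignore"), to_drop


# Toy data (made-up values) standing in for df_train / df_test.
toy_train = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
toy_test = pd.DataFrame({
    "PoolQC": [np.nan, "Ex"],
    "LotFrontage": [80.0, 81.0],
    "LotArea": [11622, 14267],
})

new_train, new_test, dropped = drop_sparse_columns(toy_train, toy_test)
print(dropped)
```

Deciding on the train set alone and applying the result to both frames keeps the two feature sets aligned.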

In [16]:
#lets extract numeric and categorical columns from the df_train

numeric = df_train.select_dtypes(exclude=['object']).columns.tolist()
category = df_train.select_dtypes(exclude=['number']).columns.tolist()

print(numeric, '\n', '*'*20, '\n', category)

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice'] 
 ******************** 
 ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'GarageType', 'GarageFinish', 'GarageQua

In [17]:
# transform SalePrice with the natural logarithm
# this is commonly done when a feature is highly right-skewed (i.e. far from normally distributed)

df_train['SalePrice_Log'] = np.log(df_train['SalePrice'])
df_train['SalePrice_Log'].head()

0    12.247694
1    12.109011
2    12.317167
3    11.849398
4    12.429216
Name: SalePrice_Log, dtype: float64

In [18]:
fig = tools.make_subplots(rows=1, cols=2, print_grid=False, 
                          subplot_titles=["SalePrice", "SalePriceLog"])


trace_1 = go.Histogram(x=df_train["SalePrice"], name="SalePrice", nbinsx = 20, marker_color='#330C73')
trace_2 = go.Histogram(x=df_train["SalePrice_Log"], name="SalePriceLog", nbinsx = 20)

fig.append_trace(trace_1, 1, 1)
fig.append_trace(trace_2, 1, 2)

iplot(fig)

In [19]:
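# pandas' Series.skew() and Series.kurtosis() are used here
# (Series.kurtosis() reports excess kurtosis, which is 0 for a normal distribution)
# each printed row shows: skewness, kurtosis
print(df_train["SalePrice"].skew(),"   ", df_train["SalePrice"].kurtosis())
print(df_train["SalePrice_Log"].skew(),"  ", df_train["SalePrice_Log"].kurtosis())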

1.8828757597682129     6.536281860064529
0.12133506220520406    0.8095319958036296
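
To illustrate why the log transform helps, here is a small sketch on synthetic data. The log-normal sample below is an assumption chosen for illustration, not the Ames prices. It shows the skewness dropping after `np.log`, and that `np.exp` maps values back to the original scale, which is what predictions made on the log target would need.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices": a log-normal sample (assumed, not the real data).
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))

log_prices = np.log(prices)

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# Values on the log scale are mapped back to prices with the exponential.
restored = np.exp(log_prices)
```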


In [20]:
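# correlation of each numeric feature with the target feature (absolute values)
# one way to get it

df_corr = df_train.corr().abs()
cols_all = len(df_train)  # number of rows; larger than the number of features, so nlargest returns all of them, sorted
df_corrwith = df_corr.nlargest(cols_all, 'SalePrice')['SalePrice'][2:]  # [2:] skips SalePrice and SalePrice_Log
df_cor = df_corr.nlargest(cols_all, 'SalePrice')['SalePrice']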

In [21]:
#Getting correlation of the features with the target feature
#another way to get it

df_train.corrwith(df_train['SalePrice']).abs().sort_values(ascending=False)[2:]

OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
KitchenAbvGr     0.135907
EnclosedPorch    0.128578
ScreenPorch      0.111447
PoolArea         0.092404
MSSubClass       0.084284
OverallCond      0.077856
MoSold           0.046432
3SsnPorch        0.044584
YrSold           0.028923
LowQualFinSF     0.025606
MiscVal          0.021190
BsmtHalfBath     0.016844
BsmtFinSF2       0.011378
dtype: float64
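
Both cells above rely on the positional slice `[2:]` to skip `SalePrice` and `SalePrice_Log`. A hedged alternative is to drop the target columns by name, which does not depend on their position in the sorted result. The helper `rank_by_correlation` and the toy values are hypothetical and only sketch the idea.

```python
import pandas as pd


def rank_by_correlation(df, target, exclude=()):
    """Rank numeric features by absolute Pearson correlation with `target`.

    Hypothetical helper: `exclude` lists further columns (for example a
    transformed copy of the target) to leave out of the ranking.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corrwith(numeric[target]).abs()
    corr = corr.drop(labels=[target, *exclude])
    return corr.sort_values(ascending=False)


# Toy data (made-up values).
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "GrLivArea": [900, 1200, 1500, 1900, 2200],
    "MoSold": [3, 7, 1, 7, 3],
    "Street": ["Pave"] * 5,
})

ranking = rank_by_correlation(toy, "SalePrice")
print(ranking)
```

On the notebook's data the call would presumably look like `rank_by_correlation(df_train, 'SalePrice', exclude=['SalePrice_Log'])`.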

In [22]:
# get the correlation matrix of the columns, ordered by their correlation with SalePrice

norm_corr = np.corrcoef(df_train[df_cor.index].values.T)
print(norm_corr)


[[ 1.          0.94837373  0.7909816  ... -0.02118958 -0.01684415
  -0.01137812]
 [ 0.94837373  1.          0.81718442 ... -0.02002082 -0.00514909
   0.00483241]
 [ 0.7909816   0.81718442  1.         ... -0.03140621 -0.04015016
  -0.05911869]
 ...
 [-0.02118958 -0.02002082 -0.03140621 ...  1.         -0.00736652
   0.00493978]
 [-0.01684415 -0.00514909 -0.04015016 ... -0.00736652  1.
   0.07094813]
 [-0.01137812  0.00483241 -0.05911869 ...  0.00493978  0.07094813
   1.        ]]


In [23]:
df_cor.index

Index(['SalePrice', 'SalePrice_Log', 'OverallQual', 'GrLivArea', 'GarageCars',
       'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd',
       'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MasVnrArea', 'Fireplaces',
       'BsmtFinSF1', 'LotFrontage', 'WoodDeckSF', '2ndFlrSF', 'OpenPorchSF',
       'HalfBath', 'LotArea', 'BsmtFullBath', 'BsmtUnfSF', 'BedroomAbvGr',
       'KitchenAbvGr', 'EnclosedPorch', 'ScreenPorch', 'PoolArea',
       'MSSubClass', 'OverallCond', 'MoSold', '3SsnPorch', 'YrSold',
       'LowQualFinSF', 'MiscVal', 'BsmtHalfBath', 'BsmtFinSF2'],
      dtype='object')

In [24]:
#lets visualize the correlation to sale price using barchart 
#here, no normalized correlation matrix needed

data = go.Bar(x=df_corrwith.index, 
              y=df_corrwith.values )
       
layout = go.Layout(title = 'Correlation to Sale Price', 
                   xaxis = dict(title = ''), 
                   yaxis = dict(title = 'correlation'),
                   autosize=False, width=750, height=500,)

fig = dict(data = [data], layout = layout)
iplot(fig)

**Features that correlate strongly with SalePrice are examined in more detail below, both to identify outliers and to guide feature engineering.**

In [25]:
# features with an absolute correlation to SalePrice above 0.3 (treated here as strongly correlated)

stg_corr = df_corrwith[df_corrwith > 0.3]
print(stg_corr)

OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101
GarageYrBlt     0.486362
MasVnrArea      0.477493
Fireplaces      0.466929
BsmtFinSF1      0.386420
LotFrontage     0.351799
WoodDeckSF      0.324413
2ndFlrSF        0.319334
OpenPorchSF     0.315856
Name: SalePrice, dtype: float64


In [26]:
# among the area features, GrLivArea has the highest correlation with SalePrice
# notice the outliers: the points with very large GrLivArea (blue on this colorscale) but low SalePrice

fig = go.Figure(data=go.Scatter(x=df_train['GrLivArea'],  y=df_train['SalePrice'], mode='markers', 
                                marker=dict( size=16,color=df_train['GrLivArea'].values, colorbar=dict( title="Colorbar"), colorscale='RdBu')))

fig.update_layout(title="SalePrice vs GrLivArea", xaxis_title="GrLivArea", yaxis_title="SalePrice", font=dict(size=18, color="#7f7f7f"))



In [27]:
#lets get outliers for GrLivArea

out_Gv = df_train.loc[(df_train['GrLivArea'] > 4000.0) & (df_train['SalePrice'] < 250000.0)]

out_Gv[['GrLivArea', 'SalePrice']]


Unnamed: 0,GrLivArea,SalePrice
523,4676,184750
1298,5642,160000


In [28]:
df_train = df_train.drop(out_Gv.index)
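
The pattern used above (build a boolean mask, inspect the flagged rows, then drop them by index) recurs later in the notebook. A minimal sketch of it as a reusable function follows; `drop_rows` is a hypothetical helper and the frame is toy data.

```python
import pandas as pd


def drop_rows(df, mask):
    """Return (flagged rows, frame without them) for a boolean mask.

    Hypothetical helper: the flagged rows are returned so they can be
    inspected before the cleaned frame is used.
    """
    flagged = df.loc[mask]
    return flagged, df.drop(flagged.index)


# Toy data (illustrative values only).
toy = pd.DataFrame({
    "GrLivArea": [1710, 4676, 5642, 4316],
    "SalePrice": [208500, 184750, 160000, 755000],
})

# Very large living area but low price: the same rule as in the cell above.
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 250000)
flagged, cleaned = drop_rows(toy, mask)
print(flagged)
print(cleaned.shape)
```

Note that the large but expensive house is kept, since it follows the overall trend; only points that contradict it are removed.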

## Task 4: Feature Engineering

**GrLivArea: above grade (ground) living area in square feet** (see the data field description)

***The sum of the 1st-floor, 2nd-floor and low-quality finished square feet (1stFlrSF + 2ndFlrSF + LowQualFinSF) equals GrLivArea for every row in both datasets.***

In [29]:
# check that 1stFlrSF + 2ndFlrSF + LowQualFinSF equals GrLivArea (both mismatch counts should be 0)

df_train['sum_1SF_2SF_LowQualSF'] =  df_train['1stFlrSF'] + df_train['2ndFlrSF'] + df_train['LowQualFinSF']  
df_test['sum_1SF_2SF_LowQualSF'] =  df_test['1stFlrSF'] + df_test['2ndFlrSF'] + df_test['LowQualFinSF'] 
print(sum(df_train['sum_1SF_2SF_LowQualSF'] != df_train['GrLivArea']))
print(sum(df_test['sum_1SF_2SF_LowQualSF'] != df_test['GrLivArea']))

0
0


In [30]:
# drop the helper column, since it is identical to GrLivArea

df_train = df_train.drop('sum_1SF_2SF_LowQualSF',axis=1)
df_test = df_test.drop('sum_1SF_2SF_LowQualSF',axis=1)

In [31]:
# compare how different sums of these features correlate with SalePrice

print((df_train['GrLivArea'] + df_train['LowQualFinSF']).corr(df_train['SalePrice']))
print('\n')
print((df_train['1stFlrSF'] + df_train['2ndFlrSF']).corr(df_train['SalePrice']))
print('\n')
print((df_train['GrLivArea']).corr(df_train['SalePrice']))

0.7196306405635335


0.7440628397752198


0.7349681645359327


**An important point here: the sum of some related numeric features can correlate more strongly with the target feature than the individual features do. Such sums can be engineered as new features and used in the ML model.**
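
The mechanism can be sketched on synthetic data. Everything below is an assumption for illustration: the toy price is constructed to depend on the sum of two floor areas, so the example only shows why a summed feature can beat its parts and says nothing about the Ames data itself.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumed): price depends on the sum of two independent areas plus noise.
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "first_floor": rng.normal(1000, 200, n),
    "second_floor": rng.normal(600, 200, n),
})
toy["price"] = 100 * (toy["first_floor"] + toy["second_floor"]) + rng.normal(0, 5000, n)

# Engineered feature: the sum of the related area columns.
toy["total_sf"] = toy["first_floor"] + toy["second_floor"]

corr_to_price = toy.corr()["price"].drop("price")
print(corr_to_price.sort_values(ascending=False))
```

Each part explains only its own share of the price variation, while the sum captures both, which is why its correlation is higher here.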

In [32]:
#lets plot Saleprice versus area numeric features

target = 'SalePrice'
area_feat = ['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'MasVnrArea', 'GarageArea', 'LotArea', 'WoodDeckSF', 'OpenPorchSF', 'BsmtFinSF1']

In [33]:
len(area_feat)

9

In [34]:
row = range(1,4)
column = range(1,4)

for x in range(1,4):
    for y in range(1,4):
        print(x,y)
        

1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3


In [35]:
row = 3
column = 3

for x in range(1,4):
    for y in range(1,4):
        print((x-1) * (column) + y-1)

0
1
2
3
4
5
6
7
8
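
The two scratch cells above work out how a (row, column) pair in a 3 x 3 grid maps to a flat index from 0 to 8. As an aside, the inverse mapping can be written with Python's built-in `divmod`; the variable names below are illustrative only.

```python
n_rows, n_cols = 3, 3

# divmod(i, n_cols) yields the zero-based (row, col); subplot grids are one-based.
grid_positions = []
for flat_index in range(n_rows * n_cols):
    row0, col0 = divmod(flat_index, n_cols)
    grid_positions.append((row0 + 1, col0 + 1))

for flat_index, (row, col) in enumerate(grid_positions):
    # Same formula as in the cell above, checked against the flat index.
    assert (row - 1) * n_cols + col - 1 == flat_index
    print(flat_index, "->", (row, col))
```

Looping over the flat index directly is handy when the number of features does not fill the grid exactly.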


In [36]:
nr_rows=3
nr_cols=3

fig = tools.make_subplots(rows=nr_rows, cols=nr_cols, print_grid=False,
                          subplot_titles=area_feat)
                                                                
for row in range(1,nr_rows+1):
    for col in range(1,nr_cols+1): 
        
        i = (row-1) * nr_cols + col-1  # flat index 0-8 into area_feat
                   
        trace = go.Scatter(x = df_train[area_feat[i]], 
                           y = df_train[target], 
                           name=area_feat[i], 
                           mode="markers", 
                           opacity=0.8)

        fig.append_trace(trace, row, col,)
 
                                                                                                  
fig['layout'].update(height=700, width=900, showlegend=False,
                     title='SalePrice' + ' vs. Area features')
iplot(fig)                                         

In [37]:
#features with strong correlation with SalePrice

print(stg_corr)

OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101
GarageYrBlt     0.486362
MasVnrArea      0.477493
Fireplaces      0.466929
BsmtFinSF1      0.386420
LotFrontage     0.351799
WoodDeckSF      0.324413
2ndFlrSF        0.319334
OpenPorchSF     0.315856
Name: SalePrice, dtype: float64


In [38]:
for x in area_feat:
    if x in list(stg_corr.index):
        print(x)
        

TotalBsmtSF
1stFlrSF
2ndFlrSF
MasVnrArea
GarageArea
WoodDeckSF
OpenPorchSF
BsmtFinSF1


In [39]:
# sum all the surface-area features that are in stg_corr and check the correlation of the sum with the target feature
# after some experimenting, BsmtFinSF1 was left out of the sum, since excluding it increases the correlation significantly

df_train['all_Sf'] = df_train['TotalBsmtSF'] + df_train['1stFlrSF'] + df_train['2ndFlrSF'] + df_train['MasVnrArea'] + df_train['GarageArea']+ df_train['WoodDeckSF'] + df_train['OpenPorchSF']
df_train['all_Sf'].corr(df_train['SalePrice'])

0.8667329748305753

In [40]:
#lets visualize the saleprice versus surface areas


fig = go.Figure(data=go.Scatter(x=df_train['all_Sf'],  y=df_train['SalePrice'], mode='markers', 
                                marker=dict( size=16,color=df_train['all_Sf'].values, colorbar=dict( title="Colorbar"), colorscale='Viridis')))

fig.update_layout(title="SalePrice vs Surface areas", xaxis_title="Surface areas", yaxis_title="SalePrice", font=dict(size=18, color="#7f7f7f"))



In [41]:
#lets get the sum of the surface areas for the df_test

df_test['all_Sf'] = df_test['TotalBsmtSF'] + df_test['1stFlrSF'] + df_test['2ndFlrSF'] + df_test['MasVnrArea'] + df_test['GarageArea']+ df_test['WoodDeckSF'] + df_test['OpenPorchSF']


In [42]:
#outliers in saleprice vs all surface areas

out_sf = df_train[(df_train['all_Sf'] > 8000.00) & (df_train['SalePrice'] < 200000.0)]
out_sf[['all_Sf', 'SalePrice']]

Unnamed: 0,all_Sf,SalePrice


In [43]:
df_train.shape

(1458, 77)

In [44]:
#this right here, gives us the saleprices for each value of the 'OverallQual'
#we group the SalePrice by the OverallQual

for x, y in df_train[['SalePrice', 'OverallQual']].groupby('OverallQual'):
    print(x, ':', '\n', y['SalePrice'].values)

1 : 
 [61000 39300]
2 : 
 [60000 35311 60000]
3 : 
 [107400  85000  76500 126175  87500 120000  67000  52000  93500  37900
  91000  82000 139600  81000  92900  95000  72500  79000  58500 105000]
4 : 
 [ 90000  68500  40000  82000 113000  80000 129500  91000 135750 136500
 123600 109900  94750 128950 100000 100000 115000 103200 140000 141000
 107000  97000 141000  88000  82500  82000 134432 100000 150000 109008
  81000 118000 256000  89471  86000  34900 106250  86000 111250 108000
 108000 133000 118500  75500  84500 108000 137500 137500  87000  55000
 102776 129000 124500 135000 120500 103000  98000  96500 102000 107900
 109900  93000 128000 129000 132250 118500 106500 110000  75000 135000
  79900 150000 135000  85500 110000 121600  88000 176000  84000  97000
  80000  84900  83500 128000 112000 115000 135000  80000 108959 168000
 148000 116050 107000 113000 145000  80500 101800 161500  68400 119000
 111000  82500  55000 100000  52500 123000 108500 104900 105000 125500
 125500  90000 122

In [45]:
# visualize SalePrice versus OverallQual
# a boxplot is a common way to view the relationship between a categorical column and a numeric column

trace = []
for name, group in df_train[["SalePrice", "OverallQual"]].groupby("OverallQual"):
    trace.append( go.Box( y=group["SalePrice"].values, name=name ) )
    
layout = go.Layout(title="OverallQual", 
                   xaxis=dict(title='OverallQual',ticklen=5, zeroline= False),
                   yaxis=dict(title='SalePrice', side='left'),
                   autosize=False, width=750, height=500)

fig = go.Figure(data=trace, layout=layout)
iplot(fig)

**Notice the outliers in the boxplot; they are most easily identified for OverallQual 4, 8, 9 and 10.**

In [46]:
# extract the outliers first, then remove them

out_4 = df_train[(df_train['OverallQual'] == 4) & (df_train['SalePrice'] > 200000.0)]
out_8 = df_train[(df_train['OverallQual'] == 8) & (df_train['SalePrice'] > 500000.0)]
out_9 = df_train[(df_train['OverallQual'] == 9) & (df_train['SalePrice'] > 500000.0)]
out_10 = df_train[(df_train['OverallQual'] == 10) & (df_train['SalePrice'] > 700000.0)]

# concatenate the dataframes
out_qual = pd.concat([out_4, out_8, out_9, out_10])
out_qual.shape

(8, 77)

In [47]:
type(out_4)

pandas.core.frame.DataFrame

In [48]:
#lets see the outliers
out_qual[['OverallQual', 'SalePrice']]

Unnamed: 0,OverallQual,SalePrice
457,4,256000
769,8,538000
178,9,501837
803,9,582933
898,9,611657
1046,9,556581
691,10,755000
1182,10,745000


In [49]:
out_qual.index

Int64Index([457, 769, 178, 803, 898, 1046, 691, 1182], dtype='int64')

In [50]:
df_train = df_train.drop(out_qual.index)

In [51]:
df_train.shape

(1450, 77)

In [52]:
#refined features with strong correlation to the Sale Price

df_train.corr().abs()[['SalePrice', 'SalePrice_Log']].sort_values(by='SalePrice', ascending = False)[2:16]

Unnamed: 0,SalePrice,SalePrice_Log
all_Sf,0.86294,0.852501
OverallQual,0.8094,0.819386
GrLivArea,0.718054,0.712796
GarageCars,0.65231,0.677201
TotalBsmtSF,0.641396,0.633767
GarageArea,0.634881,0.64959
1stFlrSF,0.618476,0.604977
FullBath,0.555959,0.58606
YearBuilt,0.54055,0.58829
YearRemodAdd,0.526499,0.567779


In [53]:
#Scatter plots of SalePrice vs All_Surfaceareas based on OverallQual


fig = go.Figure(data=go.Scatter(x=df_train['all_Sf'],  y=df_train['SalePrice'], mode='markers', 
                                marker=dict( size=16,color=df_train['OverallQual'].values, colorbar=dict( title="OverallQual"), colorscale="Cividis")))

fig.update_layout(title="SalePrice vs Surface areas based on OverallQual", xaxis_title="Surface areas", yaxis_title="SalePrice", font=dict(size=16, color="#7f7f7f"))



### Important point: OverallQual (overall quality) correlates strongly with both the sale price and the total surface area of the houses.
**This means that houses with a larger surface area and a higher quality rating tend to sell for more, as can be seen in the graph above.**

In [54]:
#3D Scatter plots of SalePrice vs All_Surfaceareas based on OverallQual


fig = go.Figure(data=go.Scatter3d(x=df_train['all_Sf'],  y=df_train['OverallQual'], z = df_train['SalePrice'], mode='markers', 
                                marker=dict( size=12,color=df_train['OverallQual'].values, colorbar=dict( title="OverallQual"), colorscale='spectral', opacity = 0.8)))

fig.update_layout(title="SalePrice vs Surface areas based on OverallQual", font=dict(size=12, color="#7f7f7f"), 
                  scene = dict( xaxis_title='Surfacearea', yaxis_title='OverallQuality', zaxis_title='Saleprice'), margin=dict(r=2.5, b=0.5, l=0.5, t=0.5))



In [55]:
#colorscales:
#             ['aggrnyl', 'agsunset', 'algae', 'amp', 'armyrose', 'balance',
#              'blackbody', 'bluered', 'blues', 'blugrn', 'bluyl', 'brbg',
#              'brwnyl', 'bugn', 'bupu', 'burg', 'burgyl', 'cividis', 'curl',
#              'darkmint', 'deep', 'delta', 'dense', 'earth', 'edge', 'electric',
#              'emrld', 'fall', 'geyser', 'gnbu', 'gray', 'greens', 'greys',
#              'haline', 'hot', 'hsv', 'ice', 'icefire', 'inferno', 'jet',
#              'magenta', 'magma', 'matter', 'mint', 'mrybm', 'mygbm', 'oranges',
#              'orrd', 'oryel', 'peach', 'phase', 'picnic', 'pinkyl', 'piyg',
#              'plasma', 'plotly3', 'portland', 'prgn', 'pubu', 'pubugn', 'puor',
#              'purd', 'purp', 'purples', 'purpor', 'rainbow', 'rdbu', 'rdgy',
#              'rdpu', 'rdylbu', 'rdylgn', 'redor', 'reds', 'solar', 'spectral',
#              'speed', 'sunset', 'sunsetdark', 'teal', 'tealgrn', 'tealrose',
#              'tempo', 'temps', 'thermal', 'tropic', 'turbid', 'twilight',
#              'viridis', 'ylgn', 'ylgnbu', 'ylorbr', 'ylorrd']


**Let's move on to the categorical columns**

In [56]:
print(category)

['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition']


In [57]:
# group SalePrice by Neighborhood and compute the median SalePrice of each neighborhood,
# then list the neighborhoods from lowest to highest median

neigh_list = df_train['SalePrice'].groupby(df_train['Neighborhood']).median().sort_values().keys().tolist()
print(neigh_list)

['MeadowV', 'IDOTRR', 'BrDale', 'OldTown', 'Edwards', 'BrkSide', 'Sawyer', 'Blueste', 'SWISU', 'NAmes', 'NPkVill', 'Mitchel', 'SawyerW', 'Gilbert', 'NWAmes', 'Blmngtn', 'CollgCr', 'ClearCr', 'Crawfor', 'Veenker', 'Somerst', 'Timber', 'StoneBr', 'NoRidge', 'NridgHt']
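
Ordering by the median rather than the mean keeps a single unusually expensive house from moving a neighborhood up the list. The following sketch uses made-up prices (the neighborhood names are borrowed from the data only as labels) to show the two orderings disagreeing.

```python
import pandas as pd

# Toy data (made-up prices); MeadowV contains one extreme sale.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NridgHt", "NridgHt", "MeadowV", "MeadowV", "MeadowV"],
    "SalePrice": [140000, 150000, 300000, 320000, 90000, 88000, 400000],
})

grouped = toy.groupby("Neighborhood")["SalePrice"]
median_order = grouped.median().sort_values().index.tolist()
mean_order = grouped.mean().sort_values().index.tolist()

print("by median:", median_order)
print("by mean:  ", mean_order)
```

The resulting list can be passed to a plotting loop so that the boxes appear in order of increasing median.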


In [58]:
print(df_train.Neighborhood.value_counts().index.tolist())

['NAmes', 'CollgCr', 'OldTown', 'Edwards', 'Somerst', 'Gilbert', 'NridgHt', 'Sawyer', 'NWAmes', 'SawyerW', 'BrkSide', 'Crawfor', 'Mitchel', 'NoRidge', 'Timber', 'IDOTRR', 'ClearCr', 'SWISU', 'StoneBr', 'MeadowV', 'Blmngtn', 'BrDale', 'Veenker', 'NPkVill', 'Blueste']


In [59]:
df_train[df_train['Neighborhood'] == 'BrDale']['SalePrice']

225     112000
227     106000
232      94500
235      89500
363     118000
430      85400
432     122500
500     113000
655      88000
837     100000
1029    118000
1104    106000
1219     91500
1291    119500
1334    125000
1378     83000
Name: SalePrice, dtype: int64

**Boxplots are a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).**

**Boxplots are used below to analyze the relationship between a categorical feature and a continuous feature.**
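
The five-number summary can be computed by hand. The sketch below uses made-up values and assumes the common 1.5 x IQR convention for the whiskers, which is why "minimum" and "maximum" are in quotes: points beyond the fences are drawn as outliers instead. Plotting libraries may compute quartiles with a slightly different interpolation rule, so their boxes can differ a little from these numbers.

```python
import numpy as np

# Made-up values with one extreme point.
prices = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that still lie inside the fences.
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```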


In [60]:
#Notice that we sorted the boxplot by median values of the Saleprice 
#with respect to each value of the Neighborhood

trac = []
for name in neigh_list:
    trac.append( go.Box(y = df_train[df_train['Neighborhood'] == name]['SalePrice'], name=name ))
    
layout = go.Layout(title="Neighborhood", 
                   xaxis=dict(title='Neighborhood',ticklen=5, zeroline= False),
                   yaxis=dict(title='SalePrice', side='left'), font=dict(size=14, color="#7f7f7f"),
                   autosize=False, width=850, height=600)

fig = go.Figure(data=trac, layout=layout)
iplot(fig)

In [61]:
zon_list = df_train['SalePrice'].groupby(df_train['MSZoning']).median().sort_values().keys().tolist()
print(zon_list)

['C (all)', 'RM', 'RH', 'RL', 'FV']


In [62]:
# plot SalePrice vs MSZoning to check their relationship
# as observed, the SalePrice distribution varies across the MSZoning categories


colors = ['rgb(6,34,75)', 'rgb(7,40,89)', 'rgb(9,56,125)', 'rgb(8,81,156)', 'rgb(107,174,214)']

trac = []
for name, cls in zip(zon_list, colors):
    trac.append( go.Box(y = df_train[df_train['MSZoning'] == name]['SalePrice'], name=name, marker_color = cls))
    
layout = go.Layout(title="SalePrice vs MSZoning", 
                   xaxis=dict(title='MSZoning',ticklen=5, zeroline= False),
                   yaxis=dict(title='SalePrice', side='left'), font=dict(size=14, color="#7f7f7f"),
                   autosize=False, width=850, height=600)

fig = go.Figure(data=trac, layout=layout)
iplot(fig)

In [63]:
print(len(category))

38


## Task 5: Preprocessing and Pipeline

The purpose of creating pipelines is to **assemble several steps** that can be executed in sequential order.
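
As a rough illustration of that idea, the sketch below chains imputation, scaling and one-hot encoding with a model using scikit-learn's `make_pipeline` and `ColumnTransformer`. The toy frame, the column lists, the choice of imputation strategies and the `Ridge` model are all assumptions made for this example; it is not necessarily the pipeline this notebook builds.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (illustrative values) standing in for X_train and the log target.
X_toy = pd.DataFrame({
    "GrLivArea": [1710.0, 1262.0, np.nan, 1717.0, 2198.0, 1362.0],
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "MSZoning": ["RL", "RL", "RM", np.nan, "RL", "RM"],
})
y_toy = np.log([208500, 181500, 223500, 140000, 250000, 143000])

numeric_cols = ["GrLivArea", "OverallQual"]
category_cols = ["MSZoning"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
category_steps = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Route each column group through its own steps and join the results.
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", category_steps, category_cols),
])

# Preprocessing and model run in sequence as a single estimator.
model = make_pipeline(preprocess, Ridge(alpha=1.0))
model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
print(predictions.shape)
```

Because the preprocessing is fitted inside the pipeline, the same object can be handed to cross-validation or grid search without leaking statistics from the validation folds.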

In [64]:
#target feature is identified as y

y, y_log = df_train['SalePrice'], df_train['SalePrice_Log']

In [65]:
X_train = df_train.drop(['SalePrice', 'SalePrice_Log'], axis = 1).copy()

In [66]:
# import make_pipeline and ColumnTransformer

from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

In [67]:
# update the numeric and categorical column lists, in case new features have been added

numeric = X_train.select_dtypes(exclude=['object']).columns.tolist()
category = X_train.select_dtypes(include=['object']).columns.tolist()

In [68]:
X_train[numeric].head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,all_Sf
0,60,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008,3371.0
1,20,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007,3282.0
2,60,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008,3518.0
3,70,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006,3150.0
4,60,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008,4805.0


In [69]:
X_train[category].head()

Unnamed: 0,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
1,RL,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,Gable,CompShg,MetalSd,MetalSd,,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
2,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
3,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,Gable,CompShg,Wd Sdng,Wd Shng,,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,Gd,Typ,Detchd,Unf,TA,TA,Y,WD,Abnorml
4,RL,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal


In [70]:
# from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
# from sklearn.impute import SimpleImputer
# from sklearn.metrics import mean_squared_error
# from sklearn.model_selection import GridSearchCV
# from sklearn.model_selection import cross_val_score
# from sklearn.linear_model import LinearRegression, Ridge, HuberRegressor, Lasso, ElasticNet, BayesianRidge
# from sklearn.kernel_ridge import KernelRidge

In [71]:
#create a pipeline (sequential steps) called numeric transformer (num_trans) made of SimpleImputer
#(to replace missing values with the median of the remaining numeric values) and
#StandardScaler (to standardize the numeric values to mean = 0 and std dev = 1)

num_trans = make_pipeline(SimpleImputer(strategy = 'median'), StandardScaler())

#create a pipeline (sequential steps) called category transformer (cat_trans) made of SimpleImputer
#(to replace missing values with the constant 'absent') and
#OneHotEncoder (to turn each category into a binary indicator column, ignoring unseen categories)

cat_trans = make_pipeline(SimpleImputer(strategy = 'constant', fill_value = 'absent'), OneHotEncoder(handle_unknown='ignore'))


In [72]:
#ColumnTransformer applies each transformer to its own list of columns and
#combines the results, turning the input dataframe into a single array

preprocessor = ColumnTransformer(transformers = [('nums', num_trans, numeric),
                                                ('cats', cat_trans, category)])
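
To make the behaviour of the preprocessor concrete, here is a minimal sketch on a toy frame. The two-column frame `toy` and the names `toy_num`, `toy_cat` and `toy_pre` are made up for illustration; only the transformer settings mirror the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, np.nan, 9550.0],
    'MSZoning': pd.Series(['RL', 'RM', 'RL', np.nan], dtype=object),
})

# Same steps as num_trans and cat_trans, rebuilt here so the sketch is self-contained
toy_num = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
toy_cat = make_pipeline(SimpleImputer(strategy='constant', fill_value='absent'),
                        OneHotEncoder(handle_unknown='ignore'))
toy_pre = ColumnTransformer(transformers=[('nums', toy_num, ['LotArea']),
                                          ('cats', toy_cat, ['MSZoning'])])

out = toy_pre.fit_transform(toy)
# The result can be sparse or dense depending on its density, so handle both
dense = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)
print(dense.shape)
```

The output has one standardized numeric column followed by one indicator column per category, including a column for the `'absent'` placeholder.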

In [73]:
#transformation of the sparse matrix generated by the fit_transform method to dense matrix

#preprocessor.fit_transform(X_train).todense()

## Task 6:  Cross-validation of Ensemble models

**Join the already constructed preprocessing pipelines to the real models for cross-validation**

**In cross validation we will look at the 'RMSE score' of the different ensemble models using mostly their default parameters**


In [74]:
#Here we use the Pipeline class, which lets us name each step explicitly

from sklearn.pipeline import Pipeline

**Boosting: Converting Weak Models to Strong Ones**

The term “boosting” describes a family of algorithms that convert weak models into strong ones. A model is weak if it has a substantial error rate but still performs better than random guessing (random guessing gives an error rate of 0.5 in binary classification). Boosting builds an ensemble incrementally: each model is trained on the same dataset, but the weights of the instances are adjusted according to the errors of the previous model. The main idea is to force the models to focus on the instances that are hard to predict. Unlike bagging, boosting is a sequential method, so the individual models cannot be trained in parallel.

**For example:** Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision trees are used for modelling. Multiple sequential models are created, each correcting the errors from the last model. AdaBoost assigns weights to the observations which are incorrectly predicted and the subsequent model works to predict these values correctly.
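
The sequential correction described above can be observed directly. The following is a small sketch on synthetic data, not part of the house price workflow; `X_demo`, `y_demo` and `ada_demo` are names made up for illustration. It uses `staged_predict` to compare the training error after the first tree with the error of the full ensemble.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for illustration)
rng = np.random.RandomState(1)
X_demo = np.linspace(0, 6, 200).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + np.sin(6 * X_demo).ravel() + rng.normal(0, 0.1, 200)

# Shallow trees are the weak learners; AdaBoost fits them one after another
ada_demo = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                             n_estimators=100, random_state=1)
ada_demo.fit(X_demo, y_demo)

# staged_predict yields the ensemble prediction after 1, 2, ... fitted trees
stage_mse = [mean_squared_error(y_demo, p) for p in ada_demo.staged_predict(X_demo)]
print('MSE with the first tree only:', stage_mse[0])
print('MSE with the full ensemble  :', stage_mse[-1])
```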

In [75]:
#The ensemble models used below, mostly with their default parameters

#GradientBoostingRegressor

pipe_GBR = Pipeline(steps = [('preprocessor', preprocessor), ('GBR', GradientBoostingRegressor())])

#XGBRegressor

pipe_XGB = Pipeline(steps = [('preprocessor', preprocessor), ('XGB', XGBRegressor(objective = 'reg:squarederror', eval_metric = 'rmse', n_jobs = -1))])

#LGBM

pipe_LGBM = Pipeline(steps = [('preprocessor', preprocessor), ('LGBM', LGBMRegressor(objective = 'regression', metric = 'rmse'))])

#AdaBoostRegressor

pipe_ADA = Pipeline(steps = [('preprocessor', preprocessor), ('ADA', AdaBoostRegressor(DecisionTreeRegressor(), loss = 'exponential'))])

In [76]:
#let's get the cross-validated RMSE scores for each of the models

ensemble_pipes = [pipe_GBR, pipe_XGB, pipe_LGBM, pipe_ADA]

print('model', '\t', 'mean rmse', '\t', 'std', '\t\t', 'min_rmse')

for pipe in ensemble_pipes:
    scores = cross_val_score(pipe, X_train, y_log, scoring = 'neg_mean_squared_error', cv = 5)
    scores = np.sqrt(-scores)
    
    print(pipe.steps[1][0], '\t', '{:0.6f}'.format(np.mean(scores)), '\t',
         '{:0.6f}'.format(np.std(scores)), '\t', '{:0.6f}'.format(np.min(scores)))

model 	 mean rmse 	 std 		 min_rmse
GBR 	 0.121034 	 0.006641 	 0.111971
XGB 	 0.122824 	 0.005892 	 0.116251
LGBM 	 0.126041 	 0.006118 	 0.115503
ADA 	 0.136840 	 0.006401 	 0.127881


## Task 7: GridSearchCV - used to find the best hyperparameters for the Ensemble models

In [77]:
#GridSearch for GradientBoostingRegressor

params_GBR ={'GBR__learning_rate' : [0.1, 0.01, 1.0],
            'GBR__n_estimators' : [200, 400, 600, 800],
            'GBR__max_depth' : [4, 5, 6, 7],
            'GBR__max_features' : [6, 8, 10, 12]}

grid_GBR = GridSearchCV(pipe_GBR, params_GBR, scoring = 'neg_mean_squared_error', verbose = 1, cv = 5)

grid_GBR.fit(X_train, y_log)

Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 960 out of 960 | elapsed: 20.7min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('nums',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('simpleimputer',
                                                                                          SimpleImputer(add_indicator=False,
                                                                                                        copy=True,
                             

In [78]:
print('RMSE: ', np.sqrt(-grid_GBR.best_score_))
print('Best Parameters: ', grid_GBR.best_params_)

RMSE:  0.11795729974677081
Best Parameters:  {'GBR__learning_rate': 0.1, 'GBR__max_depth': 4, 'GBR__max_features': 10, 'GBR__n_estimators': 400}
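
The "192 candidates, totalling 960 fits" in the log above follows from the size of the grid. A quick sketch using scikit-learn's `ParameterGrid` to count the combinations before launching a long search (`n_candidates` and `n_fits` are illustrative names):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GradientBoostingRegressor search above
params_GBR = {'GBR__learning_rate': [0.1, 0.01, 1.0],
              'GBR__n_estimators': [200, 400, 600, 800],
              'GBR__max_depth': [4, 5, 6, 7],
              'GBR__max_features': [6, 8, 10, 12]}

n_candidates = len(ParameterGrid(params_GBR))  # 3 * 4 * 4 * 4 combinations
n_fits = n_candidates * 5                      # one fit per candidate per CV fold
print(n_candidates, n_fits)
```

The log also shows the search running with `n_jobs=1`. `GridSearchCV` accepts an `n_jobs` argument, so passing `n_jobs = -1` would likely shorten the run on a multi-core machine.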


In [79]:
#GridSearch for XGBoost

params_XGB = {'XGB__learning_rate' : [0.02, 0.04, 0.008],
             'XGB__max_depth' : [3, 4, 5],
             'XGB__n_estimators' : [1000, 800, 2000],
             'XGB__reg_lambda' : [1, 1.5, 1.6, 1.3]}

grid_XGB = GridSearchCV(pipe_XGB, params_XGB, scoring = 'neg_mean_squared_error', verbose = 1, cv = 5)

grid_XGB.fit(X_train, y_log)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed: 35.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('nums',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('simpleimputer',
                                                                                          SimpleImputer(add_indicator=False,
                                                                                                        copy=True,
                             

In [80]:
print('RMSE: ', np.sqrt(-grid_XGB.best_score_))
print('Best Parameters: ', grid_XGB.best_params_)

RMSE:  0.11539210940674478
Best Parameters:  {'XGB__learning_rate': 0.02, 'XGB__max_depth': 3, 'XGB__n_estimators': 2000, 'XGB__reg_lambda': 1.6}
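
One detail to keep in mind when comparing these numbers with the Task 6 table: Task 6 averages the per-fold RMSE values, while `np.sqrt(-best_score_)` takes the square root of the averaged per-fold MSE. The two are close but not identical. A sketch with hypothetical fold scores (the values in `fold_mse` are made up for illustration):

```python
import numpy as np

# Hypothetical per-fold MSE values, made up for illustration
fold_mse = np.array([0.0125, 0.0135, 0.0150, 0.0160, 0.0175])

mean_of_fold_rmse = np.mean(np.sqrt(fold_mse))  # how the Task 6 table is computed
rmse_of_mean_mse = np.sqrt(np.mean(fold_mse))   # how np.sqrt(-best_score_) is computed
print(mean_of_fold_rmse, rmse_of_mean_mse)
```

The second value is never smaller than the first, so a tuned model can look slightly worse than it is if the two kinds of score are compared directly.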


In [81]:
#GridSearch for LGBM

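#note: 'reg_lambda_l2' does not appear to be a documented LightGBM parameter name
#(the documented L2 names are reg_lambda / lambda_l2), so this entry may have no effect
params_LGBM = {'LGBM__num_leaves' : [4, 5, 6],
               'LGBM__max_depth' : [3, 4, 5],
               'LGBM__n_estimators' : [1000, 2000, 3000],
               'LGBM__reg_lambda_l2' : [1, 1.5, 1.6, 1.3],
               'LGBM__bagging_fraction' : [0.5, 0.7],
               'LGBM__bagging_freq' : [1, 2, 3]}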

grid_LGBM = GridSearchCV(pipe_LGBM, params_LGBM, scoring = 'neg_mean_squared_error', verbose = 1, cv = 5)

grid_LGBM.fit(X_train, y_log)

Fitting 5 folds for each of 648 candidates, totalling 3240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 3240 out of 3240 | elapsed: 95.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('nums',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('simpleimputer',
                                                                                          SimpleImputer(add_indicator=False,
                                                                                                        copy=True,
                             

In [82]:
print('LGBM: ', np.sqrt(-grid_LGBM.best_score_))
print('Best Parameters: ', grid_LGBM.best_params_)

LGBM:  0.12161702543122528
Best Parameters:  {'LGBM__bagging_fraction': 0.7, 'LGBM__bagging_freq': 3, 'LGBM__max_depth': 3, 'LGBM__n_estimators': 1000, 'LGBM__num_leaves': 4, 'LGBM__reg_lambda_l2': 1}


In [83]:
#GridSearch for ADABoost

params_ADA = {'ADA__n_estimators' : [1500, 2000],
             'ADA__learning_rate' : [2.0, 3.0],
             'ADA__base_estimator__max_depth' : [9, 11]}

grid_ADA = GridSearchCV(pipe_ADA, params_ADA, scoring = 'neg_mean_squared_error', verbose = 1, cv = 5)

grid_ADA.fit(X_train, y_log)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 73.5min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('nums',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('simpleimputer',
                                                                                          SimpleImputer(add_indicator=False,
                                                                                                        copy=True,
                             

In [84]:
print('ADA: ', np.sqrt(-grid_ADA.best_score_))
print('Best Parameters: ', grid_ADA.best_params_)

ADA:  0.13023554121322364
Best Parameters:  {'ADA__base_estimator__max_depth': 11, 'ADA__learning_rate': 3.0, 'ADA__n_estimators': 2000}
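
The `ADA__base_estimator__max_depth` key depends on the scikit-learn version. This notebook runs on 0.22, where the wrapped tree is the `base_estimator` parameter; newer releases (1.2 onward, as far as I know) call it `estimator`. A small sketch that picks whichever name the installed version exposes; `ada_check` and `tree_param` are names made up for illustration.

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Pass the tree positionally so the call works under either parameter name
ada_check = AdaBoostRegressor(DecisionTreeRegressor())

# Prefer the newer name when present, otherwise fall back to the older one
tree_param = 'estimator' if 'estimator' in ada_check.get_params() else 'base_estimator'

params_ADA = {'ADA__n_estimators': [1500, 2000],
              'ADA__learning_rate': [2.0, 3.0],
              'ADA__' + tree_param + '__max_depth': [9, 11]}
print(sorted(params_ADA))
```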


## Task 8: Check the correlation of the predictions of the ensemble models on the test data


In [85]:
Ensemble_models = [grid_GBR, grid_XGB, grid_LGBM, grid_ADA]

pred_GBR = grid_GBR.predict(df_test)
pred_XGB = grid_XGB.predict(df_test)
pred_LGBM = grid_LGBM.predict(df_test)
pred_ADA = grid_ADA.predict(df_test)

In [86]:
#let's put our predictions into a data frame


preds = {'GBR' : pred_GBR, 'XGB' : pred_XGB, 'LGBM' : pred_LGBM, 'ADA' : pred_ADA}

df_pred = pd.DataFrame(preds)
df_pred.corr()

Unnamed: 0,GBR,XGB,LGBM,ADA
GBR,1.0,0.987049,0.982327,0.982049
XGB,0.987049,1.0,0.988748,0.98622
LGBM,0.982327,0.988748,1.0,0.98115
ADA,0.982049,0.98622,0.98115,1.0


In [87]:
#let's visualize their correlation


import plotly.figure_factory as ff

z = df_pred.corr().values.tolist()

x = ['GradientBoost', 'XGBoost', 'Light GBM', 'ADABoost']
y = ['GradientBoost', 'XGBoost', 'Light GBM', 'ADABoost']

z_text = np.around(z, decimals = 2)

fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z_text, colorscale='magenta')
fig.show()

***The heatmap above shows that the predictions of the ensemble models are highly correlated (every pairwise correlation is above 0.98), despite the differences in their RMSE values.***
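
To read the weakest agreement off the matrix without scanning it by eye, here is a short sketch that looks only at pairs of distinct models. The numbers are copied from the `df_pred.corr()` output above; `pair_corr` and `lowest` are illustrative names.

```python
import numpy as np

# Correlation matrix copied from the df_pred.corr() output above
corr = np.array([[1.0,      0.987049, 0.982327, 0.982049],
                 [0.987049, 1.0,      0.988748, 0.98622],
                 [0.982327, 0.988748, 1.0,      0.98115],
                 [0.982049, 0.98622,  0.98115,  1.0]])
names = ['GBR', 'XGB', 'LGBM', 'ADA']

# The upper triangle (k=1) visits each pair of distinct models exactly once
rows, cols = np.triu_indices(len(names), k=1)
pair_corr = corr[rows, cols]
lowest = pair_corr.argmin()
print(names[rows[lowest]], names[cols[lowest]], pair_corr[lowest])
```

Even the least similar pair agrees very closely, which suggests that combining these models may add less diversity than combining models of different families would.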

## Task 9: Stacking

**Stacking: another ensemble method, where a new model is trained on the combined predictions of two (or more) previous models.**

Stacking regression is an ensemble learning technique to combine multiple regression models via a meta-regressor. The individual regression models are trained based on the complete training set; then, the meta-regressor is fitted based on the outputs -- meta-features -- of the individual regression models in the ensemble.
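
As a point of comparison, scikit-learn (0.22 and later) ships its own `StackingRegressor`. It is a different implementation from the one used in this notebook: it takes `estimators` and `final_estimator` arguments and fits the meta-regressor on cross-validated predictions of the base models. A minimal sketch on synthetic data; the dataset, the choice of base models and the `RidgeCV` meta-regressor are assumptions made purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression data (made up for illustration)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base models produce out-of-fold predictions; RidgeCV learns how to combine them
stack = StackingRegressor(
    estimators=[('gbr', GradientBoostingRegressor(random_state=0)),
                ('rf', RandomForestRegressor(n_estimators=50, random_state=0))],
    final_estimator=RidgeCV(),
    cv=5)
stack.fit(X_tr, y_tr)

preds = stack.predict(X_te)
r2 = stack.score(X_te, y_te)
print(preds.shape, r2)
```

Training the meta-regressor on out-of-fold predictions is one way to reduce the risk that it simply learns to trust base models that have memorized the training set.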


In [88]:
#Here we stack the 2 ensemble models (Gradient Boost and XGBoost) that had the lowest RMSE and
#use RandomForestRegressor as the final regressor to compute the final prediction
#Note: the GBR settings below differ from the grid-search best parameters reported in Task 7
#(learning_rate = 0.1, max_depth = 4, max_features = 10, n_estimators = 400)


new_GBR = GradientBoostingRegressor(learning_rate = 0.01, max_depth = 5, max_features = 12, n_estimators = 800)
new_XGB = XGBRegressor(learning_rate = 0.02, max_depth = 3, n_estimators = 2000, reg_lambda = 1.6, 
                       objective = 'reg:squarederror', eval_metric = 'rmse', n_jobs = -1)

reg = RandomForestRegressor(n_estimators = 12, max_depth = 3, n_jobs = -1)

stack_ensemble = StackingRegressor(regressors = [new_GBR, new_XGB], meta_regressor = reg)

stack_pipe = Pipeline(steps = [('preprocessor', preprocessor), ('stack_ensemble', stack_ensemble)])

stack_pipe.fit(X_train, y_log)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('nums',
                                                  Pipeline(memory=None,
                                                           steps=[('simpleimputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                      

## Task 10: Prediction on test data done by the stacked model

In [89]:
pred_stack = stack_pipe.predict(df_test)

In [90]:
print(pred_stack)

[11.75167507 11.96293106 12.11092158 ... 11.90862424 11.67555411
 12.31698973]


In [91]:
final_result = pd.DataFrame()
final_result['Id'] = id_test
final_result['SalePrice'] = np.exp(pred_stack)
final_result.head(10)

Unnamed: 0,Id,SalePrice
0,1461,126966.058345
1,1462,156832.095586
2,1463,181847.113913
3,1464,181847.113913
4,1465,181847.113913
5,1466,181847.113913
6,1467,181847.113913
7,1468,162386.641522
8,1469,181847.113913
9,1470,122630.743903
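
Because the models were trained on the log of the sale price, the predictions have to be mapped back to prices before submission. A short sketch of that step, using the first three log-scale predictions printed above. It assumes `SalePrice_Log` was created with `np.log`; if it was created with `np.log1p`, the matching inverse would be `np.expm1`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First three log-scale predictions printed above
log_pred = np.array([11.75167507, 11.96293106, 12.11092158])

# np.exp undoes np.log (assumption: the target was built with np.log)
prices = np.exp(log_pred)
submission = pd.DataFrame({'Id': [1461, 1462, 1463], 'SalePrice': prices})

# Without a path, to_csv returns the CSV text; pass a file name to write to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```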


## Summary

1. Data Preparation and Exploration were done to familiarize ourselves with the data and understand its underlying nature; data visualization was used to examine how some features relate to one another.


2. Feature Engineering: One feature was engineered by summing related numeric features, which yielded a higher correlation with the target variable. It was also visualized to observe the correlation trend and to remove outliers.


3. Data Preprocessing & Pipelines: numeric features had their missing values filled with the variable's median and were then standardized. Categorical features had their missing values filled with a constant placeholder and were then one-hot encoded, which turns each unique value of a categorical feature into a binary indicator column. Finally, ColumnTransformer combined both pipelines so that the input dataframe is transformed into a single array (a sparse matrix).


4. Model training: Cross-validation on Ensembles: four ensemble models were trained on the training data set, using cross-validation to evaluate the Root Mean Square Error of each model. As seen above, they all had similar RMSE values, apart from ADABoost, whose RMSE was somewhat higher. Furthermore, GridSearchCV was used to find the best hyperparameters for each of the cross-validated ensemble models.


5. Correlation: the predictions of the ensemble models on the test data were correlated to examine how closely their results agree.

 
6. Stacking: only the best models (those with the lowest RMSE) were stacked together, and the final regressor was trained on their predictions.


7. Final Results (Predicted House Prices)