### Using interpretable ML techniques, this notebook gives you a taste of how to impute missing data and process data accordingly

**Import all the libraries we need. Some are commented out because Kaggle doesn't support them.**

In [None]:
# Libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from datetime import date
pd.options.mode.chained_assignment = None  # silence the chained-assignment warning
import h2o
import seaborn as sns; sns.set()

# Uncomment to install h2o if it is missing
# !pip install h2o

from scipy.special import expit

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

from sklearn.model_selection import train_test_split
from h2o.estimators import H2OGradientBoostingEstimator
SEED  = 1111   # global random seed for better reproducibility

from sklearn.tree import export_graphviz
# from sklearn.externals.six import StringIO  
from IPython.display import Image  
# import pydotplus

h2o.init(max_mem_size='24G', nthreads=4) # start h2o with plenty of memory and threads
h2o.remove_all()                         # clears h2o memory
h2o.no_progress() 

Add data by clicking **File** in the top-left corner, then locate your data with:

* %cd ../input
* %ls

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv') 
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

## Basic Data Info

In [None]:
# Drop the id column from both test and training data
train.drop(['Id'],axis=1, inplace=True)
test.drop(['Id'],axis=1, inplace=True)

print('The shape of train data is {}'.format(train.shape))
print('The shape of test data is {}'.format(test.shape))

# Concatenate both datasets for easier cleaning
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
full = pd.concat([train, test], ignore_index=True)
print('The shape of full data is {}'.format(full.shape))

Plot the percentage of missing values for every column

In [None]:
pd.DataFrame(full.isna().sum()*100/full.shape[0]).plot.bar(figsize=(20,5))

* The plot above shows the percentage of missing values for every variable in the combined dataset.
* For variables with a large proportion of missing values (Alley, PoolQC, Fence, MiscFeature, etc.), it is appropriate to replace **NA** with **None**. **None** is itself an informative category: a PoolQC of NA means the house doesn't have a pool, which is true of most houses.
* Area variables with NA values are imputed with the same logic: a missing feature means an area of 0.
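
The two imputation rules above can be sketched on a toy frame. The column names mirror the dataset, but the values are made up for illustration and `toy` is a hypothetical stand-in for `full`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `full`; values are invented for illustration.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "GarageArea": [480.0, np.nan, 0.0, 550.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Percentage of missing values per column, largest first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

# NA in a quality column means "feature absent", so it becomes its own category.
toy["PoolQC"] = toy["PoolQC"].fillna("None")
# NA in an area column means there is no such area, so it becomes 0.
toy["GarageArea"] = toy["GarageArea"].fillna(0)
```

Sorting the percentages first makes it easy to decide which columns need a category such as `'None'` and which only need a constant.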

In [None]:
# NA is a meaningful category for these columns: replace it with 'None'
full.update(full[['BsmtCond','BsmtFinType2','BsmtFinType1','BsmtExposure','BsmtQual',
                  'GarageType','GarageQual','GarageFinish','GarageCond','FireplaceQu',
                  'MiscFeature','Fence','PoolQC','Alley','Electrical','MasVnrType']].fillna('None'))

# Numeric columns: replace NaN with the constant 0
full.update(full[['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','BsmtHalfBath',
                  'BsmtFullBath','GarageArea','GarageCars','MasVnrArea','TotalBsmtSF']].fillna(0)) 

# Replace the missing values with the mode for these variables
for col in ['Exterior1st','Exterior2nd','Functional','KitchenQual','MSZoning','SaleType','Utilities']:
    full[col] = full[col].fillna(full[col].value_counts().index[0])

# Drop irrelevant columns from the whole dataset, based on the EDA of the training dataset
# GarageQual is redundant: it has the same meaning as GarageCond
# PoolQC is mostly NA and won't provide much info, and we already have PoolArea
# MSSubClass is a combination of dwelling type and year
full= full.drop(['MoSold','GarageQual','PoolQC','MSSubClass'],axis=1)

#filled missing garage years
#It makes no sense to fill year with 0, so we assume the garage was built when the house was built
full['GarageYrBlt'] = full['GarageYrBlt'].fillna(full['YearBuilt'])

# Create new features that are easier to interpret
# Convert years into ages
currentYear = datetime.now().year
full['Age_House']=currentYear-full['YearBuilt']
full['Age_Renovation']=currentYear-full['YearRemodAdd']
full['Garage_age']=currentYear-full['GarageYrBlt']
full = full.drop(['YearBuilt','YearRemodAdd','GarageYrBlt'],axis=1)

# Convert OverallCond and YrSold to strings so they can be label encoded afterwards
# These are ordinal variables
full['OverallCond'] = full['OverallCond'].astype(str)
full['YrSold'] = full['YrSold'].astype(str)

Label encode some features and create a new variable **TotalSF**

In [None]:
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageCond', 'ExterQual', 
        'ExterCond','HeatingQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'OverallCond', 
        'YrSold')
# Process columns: apply LabelEncoder to the categorical features
for c in cols:
    lb = LabelEncoder()
    lb.fit(list(full[c].values))
    full[c] = lb.transform(list(full[c].values))

# Add a total square footage feature
full['TotalSF'] = full['TotalBsmtSF'] + full['1stFlrSF'] + full['2ndFlrSF']

After the processing above, take another look at the full data: one feature, **LotFrontage**, still has missing values.

In [None]:
pd.DataFrame(full.isna().sum()*100/full.shape[0]).plot.bar(figsize=(20,5))

To impute the missing values, we explore the relationship of LotFrontage with other features. It turns out to be related to LotArea, LotConfig, MSZoning and Neighborhood, so we build a random forest model on these features to impute the missing values.
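
This kind of model-based imputation can be sketched as a small reusable helper. The function `impute_with_forest` and the synthetic `toy` frame are hypothetical names introduced only for this illustration; the sketch assumes scikit-learn is available and uses fewer trees than the notebook to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_forest(df, target, predictors, seed=1):
    """Fill NaNs in `target` with random-forest predictions from `predictors`."""
    X = pd.get_dummies(df[predictors])      # one-hot encode categorical predictors
    known = df[target].notna()              # rows where the target is observed
    out = df.copy()
    if known.all():
        return out
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X[known], df.loc[known, target])
    out.loc[~known, target] = model.predict(X[~known])
    return out

# Synthetic data: frontage roughly proportional to lot area, with some gaps.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, size=60),
    "LotConfig": rng.choice(["Inside", "Corner"], size=60),
})
toy["LotFrontage"] = toy["LotArea"] / 150 + rng.normal(0, 2, size=60)
toy.loc[::7, "LotFrontage"] = np.nan

filled = impute_with_forest(toy, "LotFrontage", ["LotArea", "LotConfig"])
```

Because the dummy variables are built once on the whole frame, the rows with known and unknown values always share the same columns.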

In [None]:
# Split the data again
train = full[full['SalePrice'].notnull()]
test = full[full['SalePrice'].isnull()]
train_y = train['SalePrice']
train_x = train.drop(['SalePrice'],axis=1)
test_x = test.drop(['SalePrice'],axis=1)

In [None]:
# Get train LotFrontage dummy variables
LotFrontage_Dummies_df = pd.get_dummies(train_x[['LotFrontage', 'MSZoning', 'LotArea', 'LotConfig', 'Neighborhood']])

# Split the data into LotFrontage known and LotFrontage unknown
LotFrontageKnown = LotFrontage_Dummies_df[LotFrontage_Dummies_df["LotFrontage"].notnull()]
LotFrontageUnknown = LotFrontage_Dummies_df[LotFrontage_Dummies_df["LotFrontage"].isnull()]

# Training data with known LotFrontage
LotFrontage_Known_X = LotFrontageKnown.drop(["LotFrontage"], axis = 1)
LotFrontage_Known_y = LotFrontageKnown["LotFrontage"]
# Training data with unknown LotFrontage
LotFrontage_Unknown_X = LotFrontageUnknown.drop(["LotFrontage"], axis = 1)
# Build model using random forest
from sklearn.ensemble import RandomForestRegressor
rfr=RandomForestRegressor(random_state=1,n_estimators=500,n_jobs=-1)
rfr.fit(LotFrontage_Known_X, LotFrontage_Known_y)
# R^2 on the data the model was fit to (optimistic, not a held-out score)
rfr.score(LotFrontage_Known_X, LotFrontage_Known_y)

In [None]:
# Predict training data unknown LotFrontage
LotFrontage_Unknown_y = rfr.predict(LotFrontage_Unknown_X)
train_x.loc[train_x["LotFrontage"].isnull(), "LotFrontage"] = LotFrontage_Unknown_y

In [None]:
# Repeat the same process for the test data
# Get test LotFrontage dummy variables
LotFrontage_Dummies_df = pd.get_dummies(test_x[['LotFrontage', 'MSZoning', 'LotArea', 'LotConfig', 'Neighborhood']])

# Split the data into LotFrontage known and LotFrontage unknown
LotFrontageKnown = LotFrontage_Dummies_df[LotFrontage_Dummies_df["LotFrontage"].notnull()]
LotFrontageUnknown = LotFrontage_Dummies_df[LotFrontage_Dummies_df["LotFrontage"].isnull()]

# Testing data with known LotFrontage
LotFrontage_Known_X = LotFrontageKnown.drop(["LotFrontage"], axis = 1)
LotFrontage_Known_y = LotFrontageKnown["LotFrontage"]
# Testing data with unknown LotFrontage
LotFrontage_Unknown_X = LotFrontageUnknown.drop(["LotFrontage"], axis = 1)
# Build model using random forest
from sklearn.ensemble import RandomForestRegressor
rfr=RandomForestRegressor(random_state=1,n_estimators=500,n_jobs=-1)
rfr.fit(LotFrontage_Known_X, LotFrontage_Known_y)
rfr.score(LotFrontage_Known_X, LotFrontage_Known_y)

In [None]:
# Predict testing data unknown LotFrontage
LotFrontage_Unknown_y = rfr.predict(LotFrontage_Unknown_X)
test_x.loc[test_x["LotFrontage"].isnull(), "LotFrontage"] = LotFrontage_Unknown_y

In [None]:
train['LotFrontage'] = train_x['LotFrontage']
test['LotFrontage'] = test_x['LotFrontage']

## EDA

Take a quick look at the target column, SalePrice. It:
* deviates from the normal distribution;
* has appreciable positive skewness.

We could take the log of SalePrice to make its distribution closer to normal.
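
The effect of the log transform on skewness can be sketched on synthetic data. The `prices` series below is drawn from a log-normal distribution as a stand-in; it is not the real SalePrice column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal), not the real SalePrice column.
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))

skew_raw = prices.skew()          # clearly positive for log-normal data
skew_log = np.log(prices).skew()  # close to zero after the transform
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The same two calls on `train['SalePrice']` would show how much of the skew the transform removes on the real data.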

In [None]:
# Plot the distribution of the target (histplot replaces the deprecated distplot)
sns.histplot(train['SalePrice'], kde=True)

In [None]:
train.plot.scatter(x='Age_House', y='SalePrice', ylim=(0,800000))

The new house-age feature shows that age definitely affects SalePrice: the price shows a decreasing trend as the age increases.

In [None]:
# box plot MSZoning/SalePrice
var = 'MSZoning'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

The box plot shows the distribution of SalePrice for each MSZoning category. The medians differ across categories, so keeping this variable in the model seems meaningful.

In [None]:
result = pd.concat([train_x, train_y], axis=1)
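# Keep numeric columns only: corr() raises on string columns in pandas 2.x
Corr = result.select_dtypes(include='number').corr().iloc[:-1,-1]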

fig, ax_ = plt.subplots(figsize=(8, 10))
_ =  Corr.plot(kind='barh', ax=ax_, colormap='gnuplot')
_ = ax_.set_xlabel('Pearson Correlation for continuous variables')

The above graph gives us the correlation of the numerical variables with SalePrice:

    Positively Correlated
    - TotalSF
    - OverallQual
    - GrLivArea
    - 1stFlrSF
    
    Negatively Correlated
    - BsmtQual
    - ExterQual
    - KitchenQual
    - Age_House
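
A minimal sketch of how such a ranking can be read off numerically rather than from the bar chart. The `toy` frame is synthetic: the column names are borrowed from the dataset, but the relationship between them is invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
age = rng.normal(40, 15, n)
toy = pd.DataFrame({
    "TotalSF": area,
    "Age_House": age,
    # Invented relationship: price rises with area and falls with age.
    "SalePrice": 100 * area - 1500 * age + rng.normal(0, 20000, n),
})

# Pearson correlation of every feature with the target, most negative first.
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice").sort_values()
print(corr_with_target)
```

Sorting the series puts the most negatively correlated features at the top and the most positively correlated at the bottom.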

## MODEL BUILDING

In [None]:
train['SalePrice'] = np.log(train['SalePrice'])
test['SalePrice'] = np.log(test['SalePrice'])

train_y = train['SalePrice']
train_x = train.drop(['SalePrice'],axis=1)

test_y = test['SalePrice']
test_x = test.drop(['SalePrice'],axis=1)

### Elastic Net GLM

A GLM is not the most appropriate model here, but we build one as a benchmark.
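
For reference, the elastic net objective for a Gaussian response is usually written as below. This is the standard textbook formulation rather than something taken from this notebook: $N$ is the number of rows, $\beta_0$ the intercept, $\beta$ the coefficient vector, and $\alpha$ and $\lambda$ correspond to the `alpha` and `lambda` hyperparameters of an elastic net GLM.

$$
\min_{\beta_0,\,\beta}\;\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
\;+\;\lambda\left(\alpha\,\lVert\beta\rVert_1+\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
$$

With $\alpha = 1$ the penalty is the lasso, with $\alpha = 0$ it is ridge regression, and $\lambda = 0$ removes the penalty entirely.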

In [None]:
train_df = pd.get_dummies(train)
test_df = pd.get_dummies(test)

In [None]:
train_y_df = train_df['SalePrice']
train_x_df = train_df.drop('SalePrice', axis = 1)

Separate the data into the response name and the list of predictor names

In [None]:
r = 'SalePrice'
x = list(train_x_df.columns.values)

In [None]:
hf=h2o.H2OFrame(train_df)
gf=h2o.H2OFrame(test_df)

In [None]:
hyper_params = {'alpha': [0, .25, .5, .75, 1]
                ,'lambda':[1, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0]
               }

glm = H2OGeneralizedLinearEstimator(family = 'gaussian',standardize = True,lambda_search = True)

# build grid search with previously made GLM and hyperparameters
grid = H2OGridSearch(model = glm, hyper_params = hyper_params,
                     search_criteria = {'strategy': "Cartesian"})


grid.train(x = x, y = r, training_frame = hf,nfolds=5,seed=1)

In [None]:
sorted_grid = grid.get_grid(sort_by='RMSLE', decreasing=False)
best_model = sorted_grid.models[0]
best_model.cross_validation_metrics_summary()

In [None]:
pred_glm_tr =  best_model.predict(h2o.H2OFrame(train_x_df))
pred_glm_tr = pred_glm_tr.as_data_frame()
co = best_model.coef()

Feature importance for the continuous variables in elastic net glm

In [None]:
cc = [key for key in dict(train.dtypes) if dict(train.dtypes)[key] in ['float64', 'int64']]
cc.remove('SalePrice')

In [None]:
cont_coef = pd.DataFrame.from_dict(dict((k, co[k]) for k in cc),orient='index')

In [None]:
cont_coef = cont_coef.rename(columns={ 0: "Beta"})

In [None]:
cont_coef.plot.barh(figsize=(20, 20),color='orange')

From the GLM above, the numerical variables with the largest positive coefficients are Street and CentralAir, followed by FullBath, Fireplaces and OverallQual.

### GBM

The next step in complexity from the penalized GLM will be a GBM model. The GBM model can fit the data using arbitrarily complex stair-step patterns, as opposed to being locked into the regression function form.
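
The stair-step behavior can be sketched with a hand-rolled booster built from one-split regression stumps. This is a NumPy-only illustration on synthetic data, not H2O's implementation; `fit_stump` and `predict_stump` are hypothetical helper names.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so both sides are non-empty
        left = x <= threshold
        left_mean, right_mean = residual[left].mean(), residual[~left].mean()
        sse = (((residual[left] - left_mean) ** 2).sum()
               + ((residual[~left] - right_mean) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.1, 200)          # a shape no straight line can follow

learn_rate, n_trees = 0.1, 100
prediction = np.full_like(y, y.mean())           # start from the mean response
mse_history = [np.mean((y - prediction) ** 2)]
for _ in range(n_trees):
    stump = fit_stump(x, y - prediction)         # fit the current residuals
    prediction += learn_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))
```

Each stump adds one step to the fitted function, so the sum of many small steps can follow a curve that a linear model cannot.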

The goal is to compare the behavior of the GBM to the penalized GLM and the Pearson correlation coefficients, to make sure we trust and understand what the GBM is doing.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train_x, train_y, test_size=0.30, random_state=1111)

In [None]:
X_train = pd.concat([X_train, y_train], axis=1)
X_valid = pd.concat([X_valid, y_valid], axis=1)
X_train_hf = h2o.H2OFrame(X_train)
X_valid_hf = h2o.H2OFrame(X_valid)

SEED  = 1111   # global random seed for better reproducibility

In [None]:
y_name = 'SalePrice'
x_names = list(train.columns.drop('SalePrice'))

predictors = x_names
response = "SalePrice"

In [None]:
params = {'learn_rate': [0.01, 0.05, 0.1], 
          'max_depth': list(range(2,13,2)),
          'ntrees': [20, 50, 80, 110, 140, 170, 200],
          'sample_rate': [0.5,0.6,0.7,0.9,1], 
          'col_sample_rate': [0.2,0.4,0.5,0.6,0.8,1]
          }


# Prepare the grid object
grid = H2OGridSearch(model=H2OGradientBoostingEstimator,   # Model to be trained
                     grid_id='gbm_grid1',
                     hyper_params=params,              # Dictionary of parameters
                     search_criteria={"strategy": "RandomDiscrete", "max_models": 500}   # RandomDiscrete
                     )

# Train the Model
grid.train(x=predictors,y=response, 
           training_frame=X_train_hf, 
           validation_frame=X_valid_hf,
           seed = SEED) # random seed for reproducibility

In [None]:
# Identify the best model generated with least error
sorted_final_grid = grid.get_grid(sort_by='rmsle',decreasing = False)

In [None]:
best_model_id = sorted_final_grid.model_ids[0]
best_gbm_from_grid = h2o.get_model(best_model_id)
best_gbm_from_grid.summary()

Above is the summary of the best performing model based on the grid search.

In [None]:
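# Predict on the full training set so the rows line up with `train` in the residual analysis
preds_train = best_gbm_from_grid.predict(h2o.H2OFrame(train_x)).exp().as_data_frame()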

In [None]:
best_gbm_from_grid.model_performance(X_valid_hf)

In [None]:
X_test_hf = h2o.H2OFrame(test_x)
preds = best_gbm_from_grid.predict(X_test_hf)
final_preds = preds.exp()
final_preds = final_preds.as_data_frame()
pred_pandas=final_preds

In [None]:
raw_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
raw_id = raw_test['Id']
output = pd.concat([raw_id, final_preds], axis=1)
output = output.rename(columns={'exp(predict)': "SalePrice"})

The output is the final content for submission

## INTERPRETABILITY

The gradient boosting machine shows that the most important variable for our model is total square feet, followed by overall quality and then neighborhood.

In [None]:
best_gbm_from_grid.varimp_plot()

### SHAPLEY VALUES

In [None]:
contributions = best_gbm_from_grid.predict_contributions(X_test_hf)
#contributions.head(5)

In [None]:
import shap
shap.initjs()
contributions_matrix = contributions.as_data_frame().iloc[:,:].values

X = list(train.columns)
X.remove('SalePrice')
len(X)

In [None]:
shap_values = contributions_matrix[:,:76]
shap_values.shape

In [None]:
# The base value is the BiasTerm, the last column of the contributions frame (constant across rows)
expected_value = contributions_matrix[0, -1]
expected_value

In [None]:
shap.force_plot(expected_value, shap_values, X)

The above plot shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the final model output. Features pushing the prediction higher are shown in red and those pushing the prediction lower are in blue.

Features pushing the prediction above the base value:

- GarageArea
- OverallCond

Features pushing the prediction below the base value:
- OverallQual
- TotalSF
- GrLivArea
- Neighborhood
- MSZoning
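
The additive structure behind the force plot can be checked with a brute-force Shapley computation on a tiny model. This is a sketch under simplifying assumptions: `model` is a hypothetical three-feature function, not the GBM, and absent features are replaced by a single `baseline` point rather than averaged over the training data.

```python
from itertools import combinations
from math import factorial

def model(x):
    # Hypothetical toy model with an interaction term (not the GBM above).
    return 10.0 + 2.0 * x[0] + 3.0 * x[1] + 1.5 * x[0] * x[2]

def value(subset, x, baseline):
    """Model output when only the features in `subset` take their values from x."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(mixed)

def shapley_values(x, baseline):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i}, x, baseline)
                without_i = value(set(subset), x, baseline)
                phi[i] += weight * (with_i - without_i)
    return phi

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
base_value = model(baseline)          # plays the role of the force plot's base value
phi = shapley_values(x, baseline)
print(base_value, phi, model(x))
```

The base value plus the sum of the contributions reproduces the model output for the row, which is the property the red and blue segments of the force plot rely on.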



The Shapley force plot for the first row (X_test_hf[0,:]) is given below.

In [None]:
shap.force_plot(expected_value, shap_values[0,:], X)

In [None]:
shap.summary_plot(shap_values, X)

In [None]:
shap.summary_plot(shap_values, X, plot_type="bar")

Above is the summary plot of the Shapley values, which gives us the importance of the features in our model.
We can see that the most important variables in our model are:
- TotalSF
- OverallQual
- Neighborhood

This shows that our model does not depend heavily on any single variable.

### PARTIAL DEPENDENCE

Partial dependence can be interpreted as the estimated average output of a model across the values of an input feature of interest.
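
That definition can be sketched directly: force the feature to each grid value for every row, predict, and average. The `partial_dependence` helper and `toy_predict` function are hypothetical stand-ins written for this illustration, not the H2O `partial_plot` implementation.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite the feature for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

# Hypothetical stand-in for a fitted model: any function mapping rows to predictions.
def toy_predict(X):
    return 3.0 * X[:, 0] + np.minimum(X[:, 1], 5.0)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
grid = np.linspace(0, 10, 11)
pd_feature_1 = partial_dependence(toy_predict, X, feature=1, grid=grid)
print(np.round(np.diff(pd_feature_1), 3))
```

In this toy case the curve for the second feature rises until the value 5 and is flat afterwards, the same kind of plateau one looks for in the plots of the real model.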

We can see the PD for the important variables in the model.

In [None]:
Continuous = [key for key in dict(train.dtypes) if dict(train.dtypes)[key] in ['float64', 'int64']]

In [None]:
dd = ['TotalSF','OverallQual','1stFlrSF']

for i in dd:
    print(best_gbm_from_grid.partial_plot(data = X_train_hf, cols = [i], server=True, plot = True))

The partial dependence plots for the GBM show that the mean response rises once the total square footage of the house exceeds 2000 and becomes constant after the total square footage reaches 4000. We see a similar trend with overall quality: there is a steep jump in the mean response of the sale price once the quality of the house is greater than 5, and this trend continues until the quality reaches 8, after which it becomes constant. For 1stFlrSF, there is a slight increase in the mean response once the area is greater than 500, after which it becomes constant.

In [None]:
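# Surrogate model: a decision tree fit to the GBM's predictions
# (`tree` cannot be imported from sklearn.tree in current scikit-learn)
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.04,
                           random_state=SEED)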
pred_pandas = h2o.as_list(preds)
test_x_dummies = pd.get_dummies(test_x)

In [None]:
dt = dt.fit(test_x_dummies,np.exp(pred_pandas))

In [None]:
dt.score(test_x_dummies,np.exp(pred_pandas))

StringIO is not supported on Kaggle, so the tree visualization below is commented out.

In [None]:
# feature_cols = list(test_x_dummies.columns.values)

# dot_data = StringIO()
# export_graphviz(dt, out_file=dot_data,  
#                 filled=True, rounded=True,
#                 special_characters=True,feature_names = feature_cols)
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# Image(graph.create_png())
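
When graphviz-based rendering is unavailable, one alternative is to print the tree's rules as text with scikit-learn's `export_text`. This is a swapped-in technique, not what the commented-out cell does; the data below is synthetic and the feature names are just labels for the two toy columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
# Stand-in for the GBM predictions that a surrogate tree would be fit to.
y = np.where(X[:, 0] > 5, 200000.0, 100000.0)

surrogate = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
# Plain-text rendering of the split rules; needs neither graphviz nor pydotplus.
rules = export_text(surrogate, feature_names=["TotalSF", "OverallQual"])
print(rules)
```

For the notebook's tree, the same call would take `dt` and `list(test_x_dummies.columns)` as the feature names.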

### SAVING THE MODEL
* path = "./house-prices-data/model"
* best_gbm_from_grid.save_mojo(path)
* best_gbm = h2o.import_mojo(path)

### RESIDUAL ANALYSIS

In [None]:
residual = np.exp(train['SalePrice']).sub(preds_train['exp(predict)'], axis = 0).abs()

In [None]:
residual = pd.DataFrame(residual,columns=['Residual'])

In [None]:
residual['SalePrice']= np.exp(train['SalePrice'])

In [None]:
residual = residual.fillna(0)

In [None]:
df = pd.concat([residual,train_x],axis=1)

In [None]:
residual.mean()

In [None]:
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(20, 10))
plt.scatter(residual['SalePrice'],residual['Residual'],color='r')
plt.xlabel('SalePrice')
plt.ylabel('Residual')
plt.show()

We see that our model doesn't show any general trend in the residuals for sale prices below \$300,000, and that the residuals tend to increase as the sale price goes up. This shows that our model is trained mostly on the majority cluster, which lies between \$100,000 and \$300,000, and that it is sensitive to outlying sale prices.
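
The same observation can be made numerically by averaging the absolute residuals within price buckets. The sketch below uses synthetic prices and predictions whose error grows with price; the bucket edges are an assumption chosen to match the ranges mentioned above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic actual prices and predictions whose error grows with price.
actual = rng.uniform(50000, 500000, size=1000)
predicted = actual + rng.normal(0, 0.05 * actual)

residuals = pd.DataFrame({
    "SalePrice": actual,
    "Residual": np.abs(actual - predicted),
})
bins = [0, 100000, 300000, np.inf]
labels = ["<100k", "100k-300k", ">300k"]
residuals["Bucket"] = pd.cut(residuals["SalePrice"], bins=bins, labels=labels)

# Mean absolute residual per price bucket.
mean_by_bucket = residuals.groupby("Bucket", observed=True)["Residual"].mean()
print(mean_by_bucket)
```

Applied to the notebook's `residual` frame, a table like this gives a number to compare against the impression from the scatter plot.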

In [None]:
import seaborn as sns
sns.set(font_scale=0.9)                                         
sns.set_style('whitegrid') 

groups = df.groupby(x_names)

sorted_ = df.sort_values(by='Neighborhood') 

g=sns.FacetGrid(df, col="Neighborhood",col_wrap=5)
g= (g.map(plt.scatter, "SalePrice", "Residual").add_legend())

Splitting the residuals by neighborhood lets us see whether a trend in the residuals is specific to certain neighborhoods or holds in general.

For the neighborhood Edwards, the residuals grow as the sale price increases, and the same holds for SWISU. The other neighborhoods don't show a general trend in the residuals with respect to the sale price.

These are among the few neighborhoods where the GBM model struggles to predict the sale price accurately.

In [None]:
sns.set(font_scale=0.9)                                         
sns.set_style('whitegrid') 

groups = df.groupby(x_names)

sorted_ = df.sort_values(by='OverallCond') 

g=sns.FacetGrid(df, col="OverallCond",col_wrap=3)
g= (g.map(plt.scatter, "SalePrice", "Residual").add_legend())

The same plot can be made for another important input variable, overall quality. When plotted, we can see that the GBM struggles to predict the sale price accurately when overall quality is equal to 7.

## COMPARISON OF THE PERFORMANCE OF THE MODEL

In [None]:
fig, ax = plt.subplots(figsize=(20, 10)) 
plt.plot(df['SalePrice'])
plt.plot(preds_train['exp(predict)'],color='orange')  # GBM predictions on the training data
plt.plot(np.exp(pred_glm_tr['predict']),color='deeppink')
_ = ax.set_xlabel('Ranked Row Index')

In [None]:
fig, ax = plt.subplots(figsize=(20, 10)) 
plt.plot(df['SalePrice'],color='deeppink')
plt.plot(preds_train['exp(predict)'],color='orange')  # GBM predictions on the training data

In [None]:
fig, ax = plt.subplots(figsize=(20, 10)) 
plt.plot(df['SalePrice'],color='deeppink')
plt.plot(np.exp(pred_glm_tr['predict']))

We can also compare the performance of our best elastic net model and the GBM model. The first graph overlays the actual sale price and the predictions of both models.

The second graph compares the actual sale price with the predictions of the gradient boosting machine. We can see that the gradient boosting machine trains strongly around the majority values and is able to capture the effect of outliers, in contrast with the elastic net model, which fails to capture many outliers.

Overall, our gradient boosting machine seems to perform better and looks more reliable than the elastic net model.
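
A numeric complement to the visual comparison is the root mean squared error on the log scale. The arrays below are made-up log-prices and predictions from two hypothetical models, not results from this notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up log-prices and predictions from two hypothetical models.
log_price = np.array([12.0, 11.5, 12.3, 11.9])
model_a = np.array([12.1, 11.4, 12.2, 12.0])   # off by 0.1 on every row
model_b = np.array([12.3, 11.2, 12.6, 11.6])   # off by 0.3 on every row

rmse_a = rmse(log_price, model_a)
rmse_b = rmse(log_price, model_b)
print(f"model A: {rmse_a:.3f}, model B: {rmse_b:.3f}")
```

In the notebook the same function could be applied to `train['SalePrice']` and each model's log-scale predictions, ideally on held-out rows.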

