# [🏠 House Prices: What factors make people pay more? 🔎](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

Suppose you are a Data Scientist hired by a Real Estate Company known as `Abode Inc`. The CEO wants to identify the top factors that influence the house prices. They have given you a dataset and want you to come up with useful insights that can help grow their business and become more profitable. What suggestions would you give to the CEO of Abode Inc.? 

The focus of this notebook is to explore techniques for understanding and interpreting models, not to achieve a top leaderboard (LB) score. The aim is to simulate how one would tackle a real-world ML problem for a business use case. **This work is completely inspired by the [fastai machine learning](https://course18.fast.ai/ml.html) course.**

# 🏁 Preliminaries

In [None]:
# Install packages
! pip install -q dtreeviz
! pip install -q scikit-misc

In [None]:
# Import packages.
import scipy
from scipy.cluster import hierarchy as hc
import re, math, pathlib, numbers, functools, IPython, graphviz
import numpy as np 
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
from pathlib import Path
from pdpbox import pdp, get_dataset, info_plots
from concurrent.futures import ProcessPoolExecutor
from sklearn import metrics, ensemble, model_selection, tree
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import export_graphviz

# Settings
%matplotlib inline
%reload_ext autoreload
%autoreload 2
pd.options.display.max_columns=100
pd.options.display.max_rows=100
%config InlineBackend.figure_format = 'retina'

In [None]:
# Read data
path = Path('/kaggle/input/house-prices-advanced-regression-techniques/')
train = pd.read_csv(path/'train.csv')
test = pd.read_csv(path/'test.csv')
sample = pd.read_csv(path/'sample_submission.csv')
# Model the log of SalePrice: the competition is scored on the RMSE of log prices.
train.SalePrice = np.log(train.SalePrice)

# Utility Code

In this block, I define some utility functions that I will call repeatedly in the notebook. This allows me to reduce code redundancy and keep the notebook crisp and precise.

In [None]:
# Inspect the data.
def check_features(df):
    return pd.DataFrame({'unique_values': df.nunique(),'type': df.dtypes,'pct_missing': df.isna().sum()/len(df) * 100}).sort_values(by = 'pct_missing', ascending=False) 

# Function to get rmse.
def rmse(targets, preds): 
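    # Taking np.sqrt of the MSE works across scikit-learn versions (`squared=False` was removed in newer releases).
    return np.sqrt(metrics.mean_squared_error(targets, preds))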

# Impute missing values.
def missing_cats(df, cats): 
    return df[cats].fillna('#na#')

# Fit a model, then evaluate it on the train & validation sets.
def mfe(model, x_train, y_train, x_val, y_val):
    model.fit(x_train, y_train)
    preds_train = model.predict(x_train)
    preds_val = model.predict(x_val)
    # Calculate train & validation rmse.
    rmse_train = rmse(y_train, preds_train)
    rmse_val = rmse(y_val, preds_val)
    # Calculate train & validation R2.
    r2_train = metrics.r2_score(y_train, preds_train)
    r2_val = metrics.r2_score(y_val, preds_val)
    result = pd.DataFrame({'rmse_train': rmse_train, 'rmse_val': rmse_val, 'r2_train': r2_train, 'r2_val': r2_val}, index=['metrics'])
    display(result)
    
# Get permutation importance as a dataframe.   
def get_pi(model, x, y):
    imp = permutation_importance(model, x, y, scoring='neg_root_mean_squared_error', n_repeats=2, n_jobs=4, random_state=1)
    df_pi = pd.DataFrame({'features': x.columns, 'imp': imp.importances_mean}, index=None).astype({'imp': np.float64}).sort_values(by='imp', ascending=False).reset_index(drop=True)
    return df_pi    

# Get feature importance as a df.
def get_fi(model, x):
    return pd.DataFrame(np.stack([x.columns, model.feature_importances_], axis=1), columns=['features', 'imp']).astype({'imp':np.float32}).sort_values(by='imp', ascending=False).reset_index(drop=True) 

# Plot feature importance. 
def plot_fi(model, x, n=None, ax=None):
    if n is None: n = len(x.columns)
    df_fi = pd.DataFrame({'features':x.columns, 'imp':(model.feature_importances_)}).sort_values(by='imp', ascending=False).iloc[:n, :]    
    df_fi.sort_values(by='imp', ascending=False).plot.barh(x='features', y='imp', figsize=(10,6), ax=ax);

# Plot permutation importance.    
def plot_pi(model, x, y, n=None, ax=None):
    if n is None: n = len(x.columns)
    imp = permutation_importance(model, x, y, scoring='neg_root_mean_squared_error', n_repeats=2, n_jobs=4, random_state=1)
    df_imp = pd.DataFrame({'features': x.columns, 'imp': imp.importances_mean}, index=None).astype({'imp': np.float64}).sort_values(by='imp', ascending=False).iloc[:n]
    df_imp.sort_values(by='imp', ascending=False).plot.barh(x='features', y='imp', figsize=(10,6), ax=ax);

# Build the ColumnTransformer that preprocesses categorical & numeric features.
def preproc(cat_feats, cont_feats):
    # Pipeline for categorical feature transformation.
    cat_tfms = Pipeline(steps=[
        ('cat_ordenc', ce.OrdinalEncoder(return_df=True, handle_unknown='value', handle_missing='value'))
    ])
    # Pipeline for numeric feature transformation.
    cont_tfms = Pipeline(steps=[
        ('cont_imputer', SimpleImputer(missing_values=np.nan, strategy='median'))
    ])
    # Transform cat & cont features separately and concatenate the features
    ctf = ColumnTransformer(transformers=[
        ('cat_tfms', cat_tfms, cat_feats),
        ('cont_tfms', cont_tfms, cont_feats)
    ], remainder='passthrough')
    return ctf

# Write the submission file (np.exp undoes the log transform applied to SalePrice).
def submit(preds_test,fname=None):
    preds = np.exp(preds_test)
    df_preds = pd.DataFrame({'Id':test.Id , 'SalePrice': preds})
    df_preds.to_csv(fname ,index=False)

# Plot a dendrogram of the features clustered by Spearman rank correlation.
def cluster_feats(xs, fsize=(10,6)):
    corr = np.round(scipy.stats.spearmanr(xs).correlation, 4)
    corr_condensed = hc.distance.squareform(1-corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=fsize)
    dendrogram = hc.dendrogram(z, labels=xs.columns, orientation='left', leaf_font_size=12)
    plt.show()

# Data Exploration

In [None]:
train.head(3)

In [None]:
# Check number of train & test examples
train.shape, test.shape

In [None]:
# Check for missing values in train & test sets.
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
train.isnull().sum().to_frame(name='num_missing').query('num_missing>0').sort_values(by='num_missing', ascending=False).plot.barh(ax=ax1, figsize=(10,6));
test.isnull().sum().to_frame(name='num_missing').query('num_missing>0').sort_values(by='num_missing', ascending=False).plot.barh(ax=ax2, figsize=(10,6));
plt.tight_layout()

# Data Preprocessing

Here we create objects that store the predictors and the target variable. We also create lists that store the names of the numeric and categorical features. These will come in handy in later sections, since sklearn returns a numpy array after preprocessing the data and we need to convert it back to a pandas dataframe.

In [None]:
# Define feature types.
target = ['SalePrice']
cat_feats = train.drop(columns=['Id', 'SalePrice']).select_dtypes(include='object').columns.tolist()
cont_feats = train.drop(columns=['Id', 'SalePrice']).select_dtypes(include=np.number).columns.tolist()
all_feats = cat_feats + cont_feats
len(train.columns), len(cat_feats), len(cont_feats)

In this step we define `pipelines` for sequential preprocessing of our numeric and categorical features. Then we create a column transformer object to apply these preprocessing steps to the specific numeric and categorical features. We can then use our `ctf` object to transform our train, validation and test datasets in just one step. This makes our code cleaner and easier to maintain.

In [None]:
# Pipeline for categorical feature transformation.
cat_tfms = Pipeline(steps=[
    ('cat_ordenc', ce.OrdinalEncoder(return_df=True, handle_unknown='value', handle_missing='value'))
])

# Pipeline for numeric feature transformation.
cont_tfms = Pipeline(steps=[
    ('cont_imputer', SimpleImputer(missing_values=np.nan, strategy='median'))
])

# Transform cat & cont features separately and concatenate the features.
ctf = ColumnTransformer(transformers=[
    ('cat_tfms', cat_tfms, cat_feats),
    ('cont_tfms', cont_tfms, cont_feats)
], remainder='passthrough')

In [None]:
# Transform the data.
X = train[all_feats]
y = train.SalePrice

# Split the data.
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=.2, shuffle=True, random_state=42)

# Transform the train, valid & test sets.
x_train_tf = pd.DataFrame(ctf.fit_transform(x_train), columns=all_feats)
x_val_tf = pd.DataFrame(ctf.transform(x_val), columns=all_feats)
x_test_tf = pd.DataFrame(ctf.transform(test[all_feats]), columns=all_feats)

In [None]:
# Map the categorical features to encodings.
ordenc_map = dict()
for feat in cat_feats: ordenc_map[feat] = dict(zip(x_train[feat], x_train_tf[feat]))

# Fitting a Random Forest Model

Let's fit a Random Forest model to our training data and see how it performs. Here we use the `mfe()` utility function to fit the model and evaluate it on the training and validation sets.

In [None]:
# Define a random forest model.
rf1 = RandomForestRegressor(
    n_estimators=40, max_depth=None, min_samples_leaf=1, min_samples_split=2,
    max_features=.7, max_samples=None, n_jobs=-1, random_state=1)

# Call the utility function for model fitting & evaluation.
mfe(rf1, x_train_tf, y_train, x_val_tf, y_val)

# How does the model work?

Random Forests are based on the idea of bagging: if we build, say, five different models, each on a randomly sampled subset of our data, we get models that are each somewhat predictive but not strongly correlated with each other. Each of these five models has found different insights into the relationships in the data, so if we average them we are essentially combining the insights from each. This idea of averaging models is a technique called **ensembling**. Suppose we created a large number of big, deep, massively overfitting trees, each fit to only 10% of our data. They overfit terribly, but since they are all fit on different random samples, they overfit in different ways. In other words, they have errors, but the errors are random, and the average of a bunch of random errors tends towards zero. So, if we take the average of these trees trained on different random subsets, the errors largely average out and what is left is the true relationship. This is the essence of **Random Forests**.
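
The averaging argument can be checked with a toy simulation. The sketch below is not a real Random Forest: it assumes each "tree" returns the true value plus independent zero-mean noise (the `TRUE_LOG_PRICE` and `NOISE_SD` values are made up), and compares the typical error of a single tree with the error of the average.

```python
import random
import statistics

random.seed(0)

TRUE_LOG_PRICE = 12.0   # hypothetical true target for one house
NOISE_SD = 0.5          # assumed spread of each tree's independent error
N_TREES = 40

# Each "tree" overfits in its own way: true value plus independent noise.
tree_preds = [random.gauss(TRUE_LOG_PRICE, NOISE_SD) for _ in range(N_TREES)]

# Average absolute error of the individual trees.
mean_single_error = statistics.mean([abs(p - TRUE_LOG_PRICE) for p in tree_preds])

# Error of the ensemble, i.e. of the average of all tree predictions.
ensemble_pred = statistics.mean(tree_preds)
ensemble_error = abs(ensemble_pred - TRUE_LOG_PRICE)

print(f"mean single-tree error: {mean_single_error:.3f}")
print(f"ensemble error:         {ensemble_error:.3f}")
```

In a real forest the trees share data and features, so their errors are partly correlated and the averaged error shrinks without vanishing entirely.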

The building block of a Random Forest is the Decision Tree, so to understand how a forest works, let's first understand how a Decision Tree works. We will cover the regression context here, but classification is quite similar. Take the problem we are working on: we have 79 features in our dataset, all represented as numeric values after preprocessing. For every value that each of these 79 features takes, we try splitting our data on it and record a metric that tells us how good the split is. The feature and value that give the best value of the metric become our root node. Splitting the root node gives us a left and a right node, which can be split further using the same method. The splitting continues until we are restrained by a stopping criterion such as `max_depth`, `min_samples_leaf` or `max_leaf_nodes`.

The metric that we use to determine the quality of a split can be `mean_squared_error` or `root_mean_squared_error` in the regression setting. To score a split on a particular value, we take the average target value in each resulting node as that node's prediction and calculate the mean squared error within each node. The score of the split is then the average of the two `mse` values, weighted by the number of samples in each node. This weighting ensures that splits which isolate only a handful of samples are not favoured.
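
The split search described above can be sketched in a few lines of plain Python. This is a simplified illustration on made-up data for a single feature, not scikit-learn's implementation; the helper names `mse`, `split_score` and `best_split` are hypothetical.

```python
def mse(values):
    """Mean squared error when predicting the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_score(xs, ys, threshold):
    """Sample-weighted average MSE of the two nodes produced by `x <= threshold`."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * mse(left) + len(right) / n * mse(right)

def best_split(xs, ys):
    """Try the midpoint between each pair of adjacent distinct values."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_score(xs, ys, t))

# Made-up toy data: one feature and a target that jumps between 3 and 4.
feature = [1, 2, 3, 4, 5, 6]
target = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]

threshold = best_split(feature, target)
score = split_score(feature, target, threshold)
print(threshold, round(score, 4))
```

A full tree would repeat this search over every feature and then recurse into the two resulting nodes.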

Now let's say we built 40 Decision Trees in our Random Forest. To calculate the prediction for a given observation, we pass that observation through all 40 trees and take the average of the 40 outputs as our prediction. This was a very brief, high-level summary of how the model works. The plots below help us visualize how a single decision tree splits our data. We can see that the first split is made on the feature `OverallQual` at the value `6.5`. This results in two nodes with a lower `mse` of `0.08` and `0.09` respectively.

In [None]:
# Using the dtreeviz library to visualize how tree splits the data.
from dtreeviz.trees import *
regr = tree.DecisionTreeRegressor(max_depth=3)
regr.fit(x_train_tf, y_train)

dtreeviz(
    regr,
    x_train_tf,
    y_train,
    target_name='SalePrice',
    feature_names=x_train_tf.columns,
    orientation='LR')

In [None]:
# Visualizing a tree using the graphviz library
dot_data = tree.export_graphviz(regr, out_file=None, 
                                feature_names=x_train_tf.columns,
                                rotate=True,
                                filled=True)
graphviz.Source(dot_data, format="png") 

# How confident are we of the predictions generated by the model?

We can use the `predict()` method on our estimator and pass it the data to get the predictions. In a real-world scenario, along with the predictions, we also want to know how confident we are about each prediction. We would be less confident of a prediction if the model has not seen many observations of a similar kind. In such a scenario, we would not expect any of the trees to have a definite path designed to help us predict that observation correctly, so as the observation passes through the different trees in our RF, it is going to end up in very different places. For such an observation, it makes sense to look at the standard deviation of the predictions from the different trees. If the standard deviation is high, each tree is giving us a very different estimate of this observation's prediction. So the standard deviation of the predictions across the trees gives us at least a relative understanding of how confident we are of this prediction. You can use this measure of confidence for two main purposes:
* We can look at the average standard deviation by group to find out whether there are groups the model does not seem confident about.
* We can look at the confidence for specific observations in our data. For example, let's say we have put our House Price model in production and it says that a particular house can sell for a large amount, but the confidence is quite low. In that case, we might want to change the business decision and sell it at a lower price.

In [None]:
# Stack the predictions of each individual tree: shape (n_trees, n_val_observations).
preds = np.stack([t.predict(x_val_tf) for t in rf1.estimators_], axis=0)

In [None]:
preds.shape

In [None]:
# Mean and standard deviation of the tree predictions for the first validation observation.
preds[:, 0].mean(), preds[:, 0].std() 
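
The group-level use of this spread can be sketched as follows. The per-tree predictions and the group labels here are made up for illustration; in the notebook the same idea would apply to `preds` together with a categorical column such as `Neighborhood`.

```python
import statistics
from collections import defaultdict

# Made-up per-tree predictions: one row per tree, one column per house.
toy_tree_preds = [
    [12.0, 11.5, 12.9, 11.0],
    [12.1, 11.4, 12.2, 11.1],
    [11.9, 11.6, 13.4, 10.9],
]
# Hypothetical group label for each house.
house_groups = ["A", "A", "B", "A"]

# Transpose so that each entry holds all tree predictions for one house.
per_house = list(zip(*toy_tree_preds))
pred_mean = [statistics.mean(p) for p in per_house]
# pstdev is the population standard deviation, matching numpy's default `std()`.
pred_std = [statistics.pstdev(p) for p in per_house]

# Average the per-house standard deviation within each group.
by_group = defaultdict(list)
for group, std in zip(house_groups, pred_std):
    by_group[group].append(std)
group_std = {g: statistics.mean(s) for g, s in by_group.items()}

print(group_std)
```

Groups with a noticeably larger average spread are the ones where the trees disagree most, and are candidates for a closer look.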

# Which features are more important than others?

It's great to have a model that gives you the highest accuracy or the lowest error rate, but it's equally important to understand and interpret your model's predictions. In this case, knowing which features are most predictive of price helps us understand what factors people are willing to pay for. This can sometimes be more important in a business scenario, where the company can then focus on improving the things that matter more to their customers and make more profit. **Feature importance** proves to be a very handy tool when it comes to model interpretation. In this notebook, we will discuss two common ways of calculating feature importances and their pros and cons.

Random Forest Default Feature Importance
* scikit-learn uses the mean decrease in impurity mechanism to calculate feature importance. 
* The mean decrease in impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating the decision trees within the RF.
* This mechanism of computing feature importance can be biased as it tends to inflate the feature importances of continuous and high cardinality categorical variables.
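
As a rough sketch of the mean decrease in impurity, the snippet below computes it by hand for one small, hand-written tree. The node statistics are made up, and the layout of `toy_nodes` is a hypothetical structure chosen for readability rather than scikit-learn's internal tree format. For a forest, the per-tree importances would be averaged.

```python
# Each internal node records the feature it splits on, its sample count and MSE,
# and the sample counts and MSEs of its two children. All numbers are made up.
toy_nodes = [
    {"feature": "OverallQual", "n": 100, "mse": 0.16,
     "n_left": 60, "mse_left": 0.08, "n_right": 40, "mse_right": 0.09},
    {"feature": "GrLivArea", "n": 60, "mse": 0.08,
     "n_left": 30, "mse_left": 0.05, "n_right": 30, "mse_right": 0.06},
    {"feature": "OverallQual", "n": 40, "mse": 0.09,
     "n_left": 25, "mse_left": 0.06, "n_right": 15, "mse_right": 0.07},
]
n_total = toy_nodes[0]["n"]  # number of samples at the root

raw_importance = {}
for node in toy_nodes:
    # Sample-weighted impurity of the two children.
    children = (node["n_left"] * node["mse_left"]
                + node["n_right"] * node["mse_right"]) / node["n"]
    # Impurity decrease, weighted by the share of samples that reach this node.
    decrease = node["n"] / n_total * (node["mse"] - children)
    raw_importance[node["feature"]] = raw_importance.get(node["feature"], 0.0) + decrease

# Normalise so that the importances sum to 1.
total = sum(raw_importance.values())
mdi_importance = {feat: value / total for feat, value in raw_importance.items()}
print(mdi_importance)
```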

Permutation Importance
* The permutation importance of a feature is defined as the decrease in a model's score when the values of that single feature are randomly shuffled.
* In this method, we take a trained model and record its score on the validation data (the utility functions in this notebook use RMSE). Then we randomly shuffle one feature at a time and record how much the score worsens. This drop in score becomes the importance of that feature.
* The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.
* Permutation importance is less reliable when features are collinear: permuting one feature will have little effect on the model's performance because the model can get the same information from a correlated feature.
* One way to handle multicollinear features is by performing hierarchical clustering on the Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster. We will discuss this in the next section.

In the utility code section, I have provided the code for computing feature importance by both the techniques discussed above. Please take a look at the `plot_fi()` and `plot_pi()` functions. We will be using them below. For a more detailed study of this topic please refer to [this](https://explained.ai/rf-importance/#4) excellent resource.
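
To make the permutation mechanism concrete, here is a minimal from-scratch sketch in plain Python. It is an illustration, not what `sklearn.inspection.permutation_importance` does internally: it uses a single shuffle per feature, a stand-in "model" that only looks at the first column, and made-up data. The names `rmse_plain`, `toy_linear_model` and `permutation_importance_sketch` are hypothetical.

```python
import math
import random

def rmse_plain(targets, preds):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets))

def toy_linear_model(row):
    """Stand-in for a fitted model: it only ever looks at column 0."""
    return 3.0 * row[0]

def permutation_importance_sketch(model, rows, targets, seed=0):
    rng = random.Random(seed)
    baseline = rmse_plain(targets, [model(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        # Shuffle one column while leaving the others untouched.
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        # Importance = how much the error grows once the column is scrambled.
        permuted_error = rmse_plain(targets, [model(r) for r in permuted])
        importances.append(permuted_error - baseline)
    return importances

toy_rows = [[float(i), float((i * 7) % 50)] for i in range(50)]  # two made-up features
toy_targets = [3.0 * r[0] for r in toy_rows]                     # depends on column 0 only

toy_importances = permutation_importance_sketch(toy_linear_model, toy_rows, toy_targets)
print(toy_importances)
```

The column the model ignores gets an importance of exactly zero, while scrambling the column it relies on makes the error jump.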



In [None]:
fig, (ax1,ax2) = plt.subplots(1,2)
plot_fi(rf1, x_train_tf, n=30, ax=ax1) # Feature Importance plot
plot_pi(rf1, x_val_tf, y_val, n=30, ax=ax2) # Permutation Importance plot
fig.tight_layout()

# Which features are similar to each other (redundant features)?

We have already seen that variables which basically measure the same thing can confuse our variable importance. They can also make our Random Forest slightly worse, because it requires more computation to do the same thing and there are more columns to check. To find redundant features, we will use a technique called hierarchical (or agglomerative) clustering. Cluster analysis looks at a set of objects, which can be the rows or the columns of a dataset, and finds which ones are similar to each other. In agglomerative clustering, we look at every pair of objects and find the two that are closest. We then take that closest pair, remove them, and replace them with the midpoint of the two. Then we repeat that again and again. Since we keep removing points and replacing them with their averages, we gradually reduce the number of points by combining them pairwise. The cool thing is, you can plot that. I have defined a function called `cluster_feats()` in the utility code that helps us plot it.
The horizontal axis of the plot indicates how similar the two points being merged are: the further to the right they join, the more similar they are. We can see that `MiscVal` and `MiscFeature` are very close to each other and more or less measure the same thing. Similarly, `Exterior2nd` and `Exterior1st` are very close to each other. To move forward, we can pick a cut-off threshold and, for the clusters that form beyond it, either keep one feature per cluster or combine the similar features in some way.
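
The "pick a threshold and keep one feature per cluster" step can be sketched as below. This is a simplified stand-in, not what `cluster_feats()` does: it joins any two features whose Spearman distance `1 - correlation` falls below the threshold (single-linkage style, whereas the utility function uses scipy's average linkage), it assumes there are no tied values, and the columns and threshold are made up.

```python
import math

def ranks(values):
    """Rank positions of `values` (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

def group_features(columns, threshold):
    """Put two features in the same group when 1 - correlation < threshold."""
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if 1 - spearman(columns[a], columns[b]) < threshold:
                parent[find(b)] = find(a)

    grouped = {}
    for n in names:
        grouped.setdefault(find(n), []).append(n)
    return list(grouped.values())

# Made-up columns: `area_sq` is a monotone function of `area`, `noise` is not.
toy_columns = {
    "area": [1, 2, 3, 4, 5, 6],
    "area_sq": [1, 4, 9, 16, 25, 36],
    "noise": [3, 1, 6, 2, 5, 4],
}
feature_groups = group_features(toy_columns, threshold=0.1)
kept_features = [group[0] for group in feature_groups]  # one representative per group
print(feature_groups, kept_features)
```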


In [None]:
cluster_feats(x_train_tf, fsize=(12,18))

# Let's remove some features and see how it impacts our rmse.

Here we use the `get_pi()` utility function to get the permutation importance of the features as a dataframe sorted in descending order. Then we create a list `n_feats` to store the different numbers of features that we want to use to fit our model. We then fit the model for each of these feature counts and store the rmse for each iteration in the `errors` list. Finally, we plot the rmse against the number of features.

In [None]:
pi = get_pi(rf1, x_val_tf, y_val)
errors = []
n_feats = [5,8,13,17,20,25,28,30,40,50,60,70,79]
for n in n_feats:
    m = RandomForestRegressor(n_estimators=40, max_depth=None, min_samples_leaf=1, min_samples_split=2, max_features=None, max_samples=None, n_jobs=-1, random_state=1)
    f = pi[:n].features.tolist()
    m.fit(x_train_tf.loc[:, f], y_train)
    preds_val = m.predict(x_val_tf.loc[:, f])
    errors.append(rmse(y_val, preds_val))

We can observe from the plot below that the rmse decreases rapidly until we are using about 20 features, decreases a little further until about 28 features and then starts to increase. In a real-world scenario, depending on the use case, it is sometimes worth trading off a very small amount of model improvement in favor of a simpler model that has fewer features and is easier to maintain. So, in this case, we will select the top 20 features by permutation importance for further analysis.
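
One way to make this trade-off explicit is to pick the smallest feature count whose error is within a tolerance of the best one. The sketch below shows that rule on made-up numbers; they are not the results of this notebook, and both the `smallest_good_enough` helper and its `tolerance` default are assumptions.

```python
# Made-up validation errors for illustration; these are NOT the notebook's results.
toy_n_feats = [5, 10, 20, 30, 50, 79]
toy_errors = [0.190, 0.160, 0.146, 0.144, 0.145, 0.145]

def smallest_good_enough(n_feats, errors, tolerance=0.005):
    """Smallest feature count whose error is within `tolerance` of the best error."""
    best = min(errors)
    for n, err in sorted(zip(n_feats, errors)):
        if err - best <= tolerance:
            return n

chosen = smallest_good_enough(toy_n_feats, toy_errors)
print(chosen)
```

A tolerance of zero recovers the feature count with the lowest error, so the tolerance expresses how much accuracy we are willing to give up for simplicity.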

In [None]:
# RMSE vs Num Features.
fig, ax = plt.subplots(1, 1, figsize=(8,6))
ax.plot(n_feats, errors);
ax.set(title = "RMSE vs Num Features", xlabel = "Features", ylabel = "RMSE");

In [None]:
# Select the top 20 features by permutation importance.
pi = get_pi(rf1, x_val_tf, y_val)
to_keep = pi.iloc[:20].features.tolist()
x_train_imp = x_train_tf[to_keep]
x_val_imp = x_val_tf[to_keep]

In [None]:
rf2 = RandomForestRegressor(
    n_estimators=40, max_depth=None, min_samples_leaf=1, min_samples_split=2,
    max_features=.7, max_samples=None, n_jobs=-1, random_state=1)

mfe(rf2, x_train_imp, y_train, x_val_imp, y_val)

We can see that after removing the features the rmse has improved. That's quite surprising: we are getting a better score with just 20 features than when we were using all 79. Let's try to understand what these features mean and how we can find more relevant features. Let's plot the feature importances again to see which features come out on top.

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2)
plot_fi(rf2, x_train_tf[to_keep], ax=ax1) # Feature Importance plot
plot_pi(rf2, x_val_tf[to_keep], y_val, ax=ax2) # Permutation Importance plot
plt.tight_layout()

Here we plot the dendrogram again to check which of our 20 selected features are similar and can be combined or discarded.

In [None]:
cluster_feats(x_train_tf[to_keep], fsize=(8,8))

In [None]:
check_features(train[to_keep])

# Partial Dependence Plots

What is the relationship between `SalePrice` and `YearBuilt`, all other things being equal?

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(train.YearBuilt, train.SalePrice, alpha=.3);
ax.set(title='YearBuilt vs SalePrice');

In [None]:
from plotnine import *
ggplot(train, aes('YearBuilt', 'SalePrice'))+stat_smooth(se=True, method='loess')

In [None]:
# Create the partial dependence plot.
pdp_year_built = pdp.pdp_isolate(model=rf2, dataset=x_val_tf, model_features=to_keep, feature='YearBuilt')
pdp.pdp_plot(pdp_year_built, 'YearBuilt')
plt.show()
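
The "all other things being equal" idea behind a partial dependence plot can be sketched by hand: for each value on a grid, overwrite the feature with that value in every row, predict, and average. The snippet below is an illustration with a made-up stand-in model and made-up houses, not pdpbox's implementation; `toy_price_model` and `partial_dependence` are hypothetical names.

```python
import statistics

def toy_price_model(row):
    """Stand-in for a fitted model (made-up coefficients, not the notebook's `rf2`)."""
    return 0.01 * (row["YearBuilt"] - 1950) + 0.5 * row["OverallQual"]

def partial_dependence(model, rows, feature, grid):
    curve = []
    for value in grid:
        # Overwrite the feature for every row; keep all other columns as observed.
        modified = [{**row, feature: value} for row in rows]
        curve.append(statistics.mean([model(r) for r in modified]))
    return curve

# Made-up houses for illustration.
houses = [
    {"YearBuilt": 1960, "OverallQual": 5},
    {"YearBuilt": 1985, "OverallQual": 6},
    {"YearBuilt": 2005, "OverallQual": 8},
    {"YearBuilt": 1940, "OverallQual": 4},
]
year_grid = [1950, 1975, 2000]
pd_curve = partial_dependence(toy_price_model, houses, "YearBuilt", year_grid)
print(list(zip(year_grid, pd_curve)))
```

Because every row keeps its own `OverallQual`, the curve isolates the effect of `YearBuilt` as the model sees it, which a raw scatter plot of year against price cannot do.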

# Tree Interpreter: What happens to an observation as it passes through our model?

# Predict & Submit

In [None]:
x_test_tf

In [None]:
y_test = rf2.predict(x_test_tf[to_keep])
submit(y_test,fname='subm1.csv')

In [None]:
# import json
# !mkdir ~/.kaggle
# !touch ~/.kaggle/kaggle.json
# api_token = {"username": "<your-kaggle-username>", "key": "<your-kaggle-api-key>"}
# with open('/root/.kaggle/kaggle.json', 'w') as file:
#     json.dump(api_token, file)
# !chmod 600 ~/.kaggle/kaggle.json

In [None]:
# !kaggle competitions submit -c house-prices-advanced-regression-techniques -f sub3.csv -m "Message"