# Stacked Ensemble Models
This is my first attempt at a Kaggle competition.  
At the time of writing, it has an RMSLE of 0.11878 (top 3% on the public leaderboard).  
  
This kernel documents the steps I went through, and I sincerely hope it helps readers, especially beginners.  
**Feel free to comment if you have any suggestions or questions!**

# ***Initialization***

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from scipy.stats import skew, norm, probplot
import time
from sklearn.preprocessing import OneHotEncoder, RobustScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import Ridge, HuberRegressor, LinearRegression
from sklearn.svm import SVR
from sklearn.cluster import KMeans
import catboost as cb
from xgboost import XGBRegressor
from mlxtend.regressor import StackingCVRegressor

In [None]:
df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
y = df['SalePrice']
df = df.drop(['SalePrice'],axis=1)
df = df.set_index('Id')
test = test.set_index('Id')

# ***Missing-data Imputation***
Data are not always clean, and we cannot simply discard incomplete rows, as that would cause a huge loss of information.  
  
Imputation preserves those rows by replacing each missing value with an estimate based on the other available information.

In [None]:
null_list = []
for col in df.columns:
    null = df[col].isnull().sum()
    test_null = test[col].isnull().sum()
    if null != 0 or test_null != 0:
        null_list.append([col,null,test_null])
null_df = pd.DataFrame(null_list,columns=['Feature','Null','Test Null'])
null_df['Total Null'] = null_df['Null'] + null_df['Test Null']
print("-------------------------")
print("Total columns with null:")
print(len(null_df))
print("-------------------------")
print("Total null values:")
print(null_df['Total Null'].sum(axis=0))
print("-------------------------")
sns.set_palette(sns.color_palette("pastel"))
sns.barplot(data=null_df.sort_values(by='Total Null',ascending = False).head(10), x='Feature',y='Total Null')
plt.xticks(rotation = 70)
plt.title("Total Nulls in Feature")
plt.show()
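
As a side note, the same kind of summary can be built without an explicit loop. The snippet below is only a sketch on two made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins for `df` and `test`), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the training and test sets.
train_toy = pd.DataFrame({"Alley": [np.nan, "Grvl", np.nan], "LotArea": [8450, 9600, 11250]})
test_toy = pd.DataFrame({"Alley": ["Pave", np.nan, "Grvl"], "LotArea": [11622, 14267, 13830]})

# Count the nulls per column in each frame and line them up by column name.
null_summary = pd.DataFrame({
    "Null": train_toy.isnull().sum(),
    "Test Null": test_toy.isnull().sum(),
})
null_summary["Total Null"] = null_summary.sum(axis=1)

# Keep only the columns that actually contain missing values.
null_summary = null_summary[null_summary["Total Null"] > 0]
print(null_summary)
```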

We will impute the missing values feature by feature, choosing the most sensible method for each (a little overkill?).  
But it's nice to do some general EDA along the way too.

___
**MSZoning** : Identifies the general zoning classification of the sale.  
       A	Agriculture  
       C	Commercial  
       FV	Floating Village Residential  
       I	Industrial  
       RH	Residential High Density  
       RL	Residential Low Density  
       RP	Residential Low Density Park   
       RM	Residential Medium Density  
       
We join the training set and the test set together so that we can inspect both while we go through the process.

In [None]:
full = pd.concat([df,test],axis=0).reset_index(drop=True)

In [None]:
null = test[test['MSZoning'].isnull()][["Neighborhood","MSZoning"]]
display(null)
plot_data = pd.concat([full[full['Neighborhood'] == 'IDOTRR'],full[full['Neighborhood'] == 'Mitchel']],axis = 0)
sns.histplot(data = plot_data, x ='MSZoning', hue ='Neighborhood',multiple="dodge", shrink=.9)
plt.title("Distribution of Zoning Classification")
plt.show()

Since the general zoning classification usually depends on the neighborhood, we impute each missing value with the mode of its neighborhood.

In [None]:
test.loc[(test['Neighborhood'] == 'IDOTRR') & (test['MSZoning'].isnull()), 'MSZoning'] = 'RM'
test.loc[(test['Neighborhood'] == 'Mitchel') & (test['MSZoning'].isnull()), 'MSZoning'] = 'RL'
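
The two assignments above hard-code the modes read off the chart. With more neighborhoods, the same idea could be written generically. Here is a minimal sketch on a made-up frame; `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the competition data.
toy = pd.DataFrame({
    "Neighborhood": ["IDOTRR", "IDOTRR", "IDOTRR", "Mitchel", "Mitchel", "Mitchel"],
    "MSZoning": ["RM", "RM", np.nan, "RL", np.nan, "RL"],
})

# Most frequent zoning class per neighborhood (mode() skips NaN by default).
zoning_mode = toy.groupby("Neighborhood")["MSZoning"].agg(lambda s: s.mode().iloc[0])

# Fill each missing value with the mode of its own neighborhood.
toy["MSZoning"] = toy["MSZoning"].fillna(toy["Neighborhood"].map(zoning_mode))
print(toy)
```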

**LotFrontage** : Linear feet of street connected to property  
We expect LotFrontage to be somewhat correlated with LotArea. Hence, we use LinearRegression to impute the missing values.   
We also manually filter out outliers (LotFrontage above 150 or LotArea above 20,000) before fitting.

In [None]:
data = full[(~full['LotFrontage'].isnull()) & (full['LotFrontage'] <= 150) & (full['LotArea'] <= 20000)]
sns.lmplot(data=data,x="LotArea",y="LotFrontage", line_kws={'color': 'black'})
plt.ylabel("LotFrontage")
plt.xlabel("LotArea")
plt.title("LotArea vs LotFrontage")
plt.show()

In [None]:
area_vs_frontage = LinearRegression()
area_vs_frontage_X = data['LotArea'].values.reshape(-1, 1)
area_vs_frontage_y = data['LotFrontage'].values
area_vs_frontage.fit(area_vs_frontage_X,area_vs_frontage_y)
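for table in [df,test]:
    # Assign the result back: fillna(inplace=True) on a selected column may act on a copy.
    predicted_frontage = area_vs_frontage.intercept_ + table['LotArea'] * area_vs_frontage.coef_[0]
    table['LotFrontage'] = table['LotFrontage'].fillna(predicted_frontage)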
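
To make the regression-imputation step easier to follow, here is a small sketch on made-up numbers (the `toy` values are assumptions chosen to lie on an exact line, so the result is easy to check by hand). It fits on the rows where LotFrontage is known and predicts only the missing ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data where LotFrontage = 25 + 0.005 * LotArea exactly.
toy = pd.DataFrame({
    "LotArea": [5000.0, 8000.0, 10000.0, 12000.0, 9000.0],
    "LotFrontage": [50.0, 65.0, 75.0, 85.0, np.nan],
})

# Fit only on the rows where the target of the imputation is known.
known = toy[toy["LotFrontage"].notna()]
model = LinearRegression().fit(known[["LotArea"]], known["LotFrontage"])

# Predict the missing entries and write them back in place.
missing = toy["LotFrontage"].isna()
toy.loc[missing, "LotFrontage"] = model.predict(toy.loc[missing, ["LotArea"]])
print(toy)
```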

**Alley** : data description says NA means no alley access

In [None]:
for table in [df,test]:
    table['Alley'] = table['Alley'].fillna("None")

**Utilities** : Type of utilities available

In [None]:
full['Utilities'].value_counts()

Only one row uses NoSeWa, so we fill the missing values in the test set with AllPub.  
We also drop the NoSeWa row from our training set: the value never appears in the test set, so keeping it would only contribute to overfitting.

In [None]:
test['Utilities'] = test['Utilities'].fillna("AllPub")

In [None]:
df.drop(df[df['Utilities'] == 'NoSeWa'].index, inplace = True)

**Exterior1st**: Exterior covering on house  
**Exterior2nd**: Exterior covering on house (if more than one material)  

There are more than 10 types of materials in both features. However, the bar plots below show that vinyl siding (VinylSd) is the most common one. Hence, we will just fill the null values with the mode.

In [None]:
for metrics in ['Exterior1st','Exterior2nd']:
    table = full[metrics].value_counts(normalize=True).head()
    sns.barplot(x=table.index,y=table.values)
    plt.title("Distribution plot of "+metrics)
    plt.show()
    print("\n")

In [None]:
test['Exterior1st'] = test['Exterior1st'].fillna(full['Exterior1st'].mode()[0])
test['Exterior2nd'] = test['Exterior2nd'].fillna(full['Exterior2nd'].mode()[0])

**MasVnrType** : data description says NA means no Masonry veneer.  
However, we notice one row in the test set that has a veneer area but a missing type.

In [None]:
test[(test['MasVnrType'].isnull()) & (test['MasVnrArea'].notnull())][['MasVnrType','MasVnrArea']]

In [None]:
table = full['MasVnrType'].value_counts(normalize=True).head()
sns.barplot(x=table.index,y=table.values)
plt.title("Distribution plot of MasVnrType")
plt.show()
print("\n")

Around 60% of our data do not have Masonry veneer, so "None" is the mode and is used to fill the remaining null values. Row 2611 is the exception: it has a veneer area, so we set its type to BrkFace instead.

In [None]:
test.loc[2611, 'MasVnrType'] = "BrkFace"
test['MasVnrType'] = test['MasVnrType'].fillna(full['MasVnrType'].mode()[0])
test['MasVnrArea'] = test['MasVnrArea'].fillna(0)
df['MasVnrType'] = df['MasVnrType'].fillna(full['MasVnrType'].mode()[0])
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)

**Basement Metrics** : data description says BsmtFinType1 rates the finished basement area, while BsmtFinSF1 measures the Type 1 finished square feet.  
However, we can see a few rows in the test set that have basement metrics but "0" square feet.

In [None]:
for basement_metrics_cols in ['BsmtExposure','BsmtCond','BsmtQual']:
    if len(full[(full[basement_metrics_cols].notnull()) & (full['BsmtFinType1'].isnull())]) > 0 :
        print("Present with " + basement_metrics_cols + " but BsmtFinType1 undetected")
        display(full[(full[basement_metrics_cols].notnull()) & (full['BsmtFinType1'].isnull())])

In [None]:
for basement_metrics_cols in ['BsmtExposure','BsmtCond','BsmtQual']:
    if len(full[(full[basement_metrics_cols].isnull()) & (full['BsmtFinType1'].notnull())]) > 0 :
        print("\nPresent with BsmtFinType1 but " + basement_metrics_cols + " undetected")
        display(full[(full[basement_metrics_cols].isnull()) & (full['BsmtFinType1'].notnull())])

In [None]:
# We assume missing basement exposure of unfinished basement is "No".
df.loc[((df['BsmtExposure'].isnull()) & (df['BsmtFinType1'].notnull())), 'BsmtExposure'] = 'No'
test.loc[((test['BsmtExposure'].isnull()) & (test['BsmtFinType1'].notnull())), 'BsmtExposure'] = 'No'
# We impute missing basement condition with "mean" value of Typical.
test.loc[((test['BsmtCond'].isnull()) & (test['BsmtFinType1'].notnull())), 'BsmtCond'] = 'TA'
# We impute unfinished basement quality with "mean" value of Typical.
test.loc[((test['BsmtQual'].isnull()) & (test['BsmtFinType1'].notnull())), 'BsmtQual'] = 'TA'

One test row has missing basement square-footage values. Let's check it out too.

In [None]:
test[test['BsmtFinSF1'].isnull()]

This house does not have a basement. Hence, its square-footage features should be filled in with 0.

In [None]:
for square_feet_metrics in ['TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1']:
    # Use .loc to avoid chained assignment, which may silently modify a copy.
    test.loc[2121, square_feet_metrics] = 0

Two test rows have missing basement bathroom values. Let's check them out first too.

In [None]:
test[test['BsmtFullBath'].isnull()]

Neither of these two houses has a basement. Hence, their basement bathroom counts should also be filled in with 0.

In [None]:
for bathroom_metrics in ['BsmtFullBath','BsmtHalfBath']:
    # Use .loc to avoid chained assignment, which may silently modify a copy.
    test.loc[[2121, 2189], bathroom_metrics] = 0

The remaining rows with missing basement values are assumed to have no basement, so we fill them in with None.

In [None]:
for table in [df,test]:
    bsmt_cols = table.columns[table.columns.str.contains('Bsmt')]
    table[bsmt_cols] = table[bsmt_cols].fillna("None")

**Electrical, Functional and Kitchen Quality** : These three features will also be filled with their "average" (most common) values.

In [None]:
for metrics in ['Electrical','Functional','KitchenQual']:
    table = full[metrics].value_counts(normalize=True)
    sns.barplot(x=table.index,y=table.values)
    plt.title("Distribution plot of "+metrics)
    plt.show()
    print("\n")

These three features are safe to fill with their mode values.

In [None]:
df['Electrical'] = df['Electrical'].fillna('SBrkr')
test['Functional'] = test['Functional'].fillna('Typ')
test['KitchenQual'] = test['KitchenQual'].fillna('TA')

In [None]:
full[full['GarageCars'].isnull()]

Similarly, this test row does not have a garage, so we fill GarageArea and GarageCars with 0.

In [None]:
test['GarageCars'] = test['GarageCars'].fillna(0)
test['GarageArea'] = test['GarageArea'].fillna(0)

In [None]:
display(full[full['SaleType'].isnull()])
table = full['SaleType'].value_counts(normalize=True)
sns.barplot(x=table.index,y=table.values)
plt.title("Distribution plot of SaleType")
plt.show()

For the SaleType column, we will impute the missing value with the mode, since the mode's share is quite high too.

In [None]:
test['SaleType'] = test['SaleType'].fillna('WD')

It's now a good time to recheck all other remaining missing values.

In [None]:
null_list = []
for col in df.columns:
    null = df[col].isnull().sum()
    test_null = test[col].isnull().sum()
    if null != 0 or test_null != 0:
        null_list.append([col,null,test_null])
null_df = pd.DataFrame(null_list,columns=['Feature','Null','Test Null'])
null_df['Total Null'] = null_df['Null'] + null_df['Test Null']
print("-------------------------")
print("Total columns with null:")
print(len(null_df))
print("-------------------------")
print("Total null values:")
print(null_df['Total Null'].sum(axis=0))
print("-------------------------")
sns.set_palette(sns.color_palette("pastel"))
sns.barplot(data=null_df.sort_values(by='Total Null',ascending = False).head(10), x='Feature',y='Total Null')
plt.xticks(rotation = 70)
plt.title("Total Nulls in Feature")
plt.show()

We have nothing further to infer the remaining missing values from. Hence, we will treat them as "None", meaning the house does not have those items.

In [None]:
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
test['GarageYrBlt'] = test['GarageYrBlt'].fillna(0)
df.fillna("None", inplace=True)
test.fillna("None", inplace=True)

Let's check the total null value again.

In [None]:
df.isnull().sum().sum() + test.isnull().sum().sum()

In [None]:
# Shift the Id-based index (which starts at 1) so that it lines up with y's zero-based index.
df.index = df.index - 1

# ***Feature Engineering***
## Log-transformation of skewed target variable
Log-transformation is one of the many feature-transformation techniques that can be used so that values on very different scales are treated more equally.  

Why do we want models to treat them equally? When we feed a skewed variable to a model, there is a possibility that its larger values will influence the result more and hurt the model's performance. This is not something we want, as every row of data is equally important. 

We wouldn't want the model to prioritize predicting only the houses with higher sale prices. Hence, scaling and transforming is important for algorithms where the distance between data points matters.

We picked log-transformation here as it has the power to pull a skewed distribution towards normality. Below, you can observe how log-transforming the target changes its distribution and scale.

In [None]:
# Distribution plot
sns.distplot(y , fit=norm);

(mu, sigma) = norm.fit(y)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Raw string so that \m and \s are not treated as (invalid) escape sequences.
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# QQ-plot
fig = plt.figure()
res = probplot(y, plot=plt)
plt.show()

The first plot is a distribution plot where we compare the distribution of our target variable with a normal distribution.  
We can easily see it is right-skewed.  
  
The Q-Q plot below plots the quantiles of our target feature against the quantiles of a normal distribution.  
We can also easily see the skewness in the target feature.
  
Notice how both plots change after we apply the log transformation to our target.

In [None]:
y = np.log(y)

In [None]:
sns.distplot(y , fit=norm);
(mu, sigma) = norm.fit(y)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# Raw string so that \m and \s are not treated as (invalid) escape sequences.
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

fig = plt.figure()
res = probplot(y, plot=plt)
plt.show()

We can now see that the distribution is much closer to a normal distribution.  
  
The Q-Q plot also shows that the quantiles of our target and the quantiles of a normal distribution are much closer now.  
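
The same effect can be checked numerically. The sketch below uses synthetic log-normal "prices" (an assumption standing in for SalePrice, not the competition data) and compares the skewness before and after the transformation. It also shows that `np.exp` undoes `np.log`, which matters when predictions made on the log scale are converted back to prices.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal), standing in for SalePrice.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

log_prices = np.log(prices)
skew_before = skew(prices)
skew_after = skew(log_prices)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# Values on the log scale are mapped back to the original scale with np.exp.
restored = np.exp(log_prices)
print("round trip ok:", np.allclose(restored, prices))
```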

## Feature creation
In this short section we will construct some new (important) features from existing data that can be fed into our model later on. There are many ways to enrich our data; one of them is to create combinations or ratios of the most relevant raw variables.  
  
In this competition, I decided to add only a few extra features related to square footage, as I think the size of a house is the main factor in its price.  
  
We also convert some features that are supposedly categorical, but stored as numbers, into strings.

In [None]:
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
test['TotalSF'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']

In [None]:
for table in [df,test]:
    table['MSSubClass'] = table['MSSubClass'].apply(str)
    table['YrSold'] = table['YrSold'].astype(str)
    table['MoSold'] = table['MoSold'].astype(str)

## Feature Encoding Round 1 (Ordinal)
Many machine learning models prefer, or can only work with, numerical values. Hence, it is common practice to transform the categorical values of the relevant features into numerical ones.  
  
There are many ways to transform the features, one of which is ordinal encoding. We use this method whenever a feature's values have an order (A is better than B), so that we retain the information about that order.

In [None]:
qual_dict = {'None': 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
bsmt_fin_dict = {'None': 0, "Unf": 1, "LwQ": 2, "Rec": 3, "BLQ": 4, "ALQ": 5, "GLQ": 6}

for table in [df,test]:
    table["ExterQual"] = table["ExterQual"].map(qual_dict)
    table["ExterCond"] = table["ExterCond"].map(qual_dict)
    table["BsmtQual"] = table["BsmtQual"].map(qual_dict)
    table["BsmtCond"] = table["BsmtCond"].map(qual_dict)
    table["PoolQC"] = table["PoolQC"].map(qual_dict)
    table["HeatingQC"] = table["HeatingQC"].map(qual_dict)
    table["KitchenQual"] = table["KitchenQual"].map(qual_dict)
    table["FireplaceQu"] = table["FireplaceQu"].map(qual_dict)
    table["GarageQual"] = table["GarageQual"].map(qual_dict)
    table["GarageCond"] = table["GarageCond"].map(qual_dict)

    table["BsmtExposure"] = table["BsmtExposure"].map(
        {'None': 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}) 
    table["BsmtFinType1"] = table["BsmtFinType1"].map(bsmt_fin_dict)
    table["BsmtFinType2"] = table["BsmtFinType2"].map(bsmt_fin_dict)

    table["Functional"] = table["Functional"].map(
        {'None': 0, "Sal": 1, "Sev": 2, "Maj2": 3, "Maj1": 4, 
         "Mod": 5, "Min2": 6, "Min1": 7, "Typ": 8})

    table["GarageFinish"] = table["GarageFinish"].map(
        {'None': 0, "Unf": 1, "RFn": 2, "Fin": 3})

    table["Fence"] = table["Fence"].map(
        {'None': 0, "MnWw": 1, "GdWo": 2, "MnPrv": 3, "GdPrv": 4})
    
    table["CentralAir"] = table["CentralAir"].map(
        {'N': 0, "Y": 1})
    
    table["PavedDrive"] = table["PavedDrive"].map(
        {'N': 0, "P": 1, "Y": 2})

    
    table["Street"] = table["Street"].map(
        {'Grvl': 0, "Pave": 1})
    
    table["Alley"] = table["Alley"].map(
        {'None': 0, "Grvl": 1, "Pave": 2})
    
    table["LandSlope"] = table["LandSlope"].map(
        {'Gtl': 0, "Mod": 1, "Sev": 2})
    
    table["LotShape"] = table["LotShape"].map(
        {'Reg': 0, "IR1": 1, "IR2": 2, "IR3": 3})
    
modified_cols = ['ExterQual','ExterCond','BsmtQual','BsmtCond','HeatingQC','KitchenQual',
                 'FireplaceQu','GarageQual','GarageCond','BsmtExposure','BsmtFinType1',
                 'BsmtFinType2','Functional','GarageFinish','Fence','Street','Alley','LandSlope',
                 'PavedDrive','CentralAir','PoolQC','OverallQual','OverallCond','LotShape']

# Get the list of remaining categorical (object-typed) columns in the training set
s = (df.dtypes == 'object')
object_cols = list(s[s].index)
object_cols = [x for x in object_cols if x not in modified_cols]
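
One thing to watch with `.map()` is that any label missing from the dictionary silently becomes NaN. The sketch below shows a quick check on a made-up series; the label "Good" is a deliberately misspelled, hypothetical value, not something from the dataset.

```python
import pandas as pd

qual_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
# Hypothetical column; "Good" is an intentional typo to trigger the check.
toy = pd.Series(["TA", "Gd", "Ex", "None", "Good"], name="KitchenQual")

encoded = toy.map(qual_order)

# Labels that are not in the dictionary come back as NaN, so list them explicitly.
unmapped = toy[encoded.isna()].unique().tolist()
print(encoded.tolist())
print("Unmapped labels:", unmapped)
```

Running a check like this after each mapping is one way to confirm that every category in both tables was covered.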

After round 1 of encoding the obvious ordinal features, we can still go further and simplify the remaining ones. This is most useful when a feature is highly skewed: we can group its rare values into "Others" to reduce the number of columns created by one-hot encoding later on.
  
So, we will plot the distributions of the features and see how we should simplify them.

### **!! Long journey of charts and tables ahead, feel free to skip through**

In [None]:
full = pd.merge(left = df, right = y , left_index= True, right_index = True)
full['SalePrice'] = np.exp(full['SalePrice'])

for col in object_cols:
    if full[col].nunique()> 1:
        display(full.groupby(col)['SalePrice'].describe())
        print("\nSummary statistics and graph for "+ col)
        fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
        sns.countplot(data = full, x=col, ax= ax[0])
        ax[0].title.set_text("Count plot of " + col)
        sns.swarmplot(data=full,x=col,y='SalePrice', ax= ax[1])
        ax[1].title.set_text("Swarm plot of " + col +" versus Sale Price")
        if (full[col].nunique()>=15):
            ax[0].tick_params('x',labelrotation=70)
            ax[1].tick_params('x',labelrotation=70)
        fig.tight_layout()
        plt.show()
        

## Feature Encoding Round 2 (Simplification + Ordinal)
We can see that many of the features are highly skewed and that some values have very low counts.  
Hence, we will just group those rare values as "Others". For features that are left with only two values, we also do the one-hot encoding manually here.  
  
Those that have more than two unique values will be one-hot encoded below.

In [None]:
cond_1_keep = ['Norm','Feedr','Artery']
roof_style_keep = ['Gable','Hip']
foundation_keep = ['PConc','CBlock','BrkTil']
garage_keep = ['Attchd','Detchd','BuiltIn']
sale_keep = ['WD','New','COD']
sale_cond_keep = ['Normal','Abnorml','Partial']
peak_months = ['5','6','7']
lot_config_keep = ['Inside','Corner','CulDSac']
unfinished_style = ['1.5Unf','2.5Unf']
exter_remove = ['AsphShn','BrkComm','CBlock','ImStucc','Stone']
for table in [df,test]:
    table.loc[table['LandContour']!='Lvl','LandContour'] = 0
    table.loc[table['LandContour']!=0,'LandContour'] = 1
    
    table.loc[~table['Condition1'].isin(cond_1_keep),'Condition1'] = "Others"
    table.loc[table['Condition2']!="Norm",'Condition2'] = 0
    table.loc[table['Condition2']!= 0,'Condition2'] = 1
    
    table.loc[~table['RoofStyle'].isin(roof_style_keep),'RoofStyle'] = "Others"
    table.loc[table['RoofMatl']!='CompShg','RoofMatl'] = 0
    table.loc[table['RoofMatl']!=0,'RoofMatl'] = 1
    
    table.loc[~table['Foundation'].isin(foundation_keep),'Foundation'] = "Others"
    table.loc[table['Heating']!='GasA','Heating'] = 0
    table.loc[table['Heating']=='GasA','Heating'] = 1
    table.loc[table['Electrical']!='SBrkr','Electrical'] = 0
    table.loc[table['Electrical']!=0,'Electrical'] = 1
    
    table.loc[~table['GarageType'].isin(garage_keep),'GarageType'] = "Others"
    
    table.loc[~table['SaleType'].isin(sale_keep),'SaleType'] = "Others"
    table.loc[~table['SaleCondition'].isin(sale_cond_keep),'SaleCondition'] = "Others"
    
    table.loc[table['Exterior1st'].isin(exter_remove),'Exterior1st'] = "Others"
    table.loc[table['Exterior2nd'].isin(exter_remove),'Exterior2nd'] = "Others"
    
    table.loc[table['MoSold'].isin(peak_months),'PeakMonths'] = 1
    table.loc[table['PeakMonths']!=1,'PeakMonths'] = 0
    
    table.loc[~table['LotConfig'].isin(lot_config_keep),'LotConfig'] = "Others"
    
    # Flag the unfinished house styles (1.5Unf and 2.5Unf) with 1, everything else with 0.
    table.loc[table['HouseStyle'].isin(unfinished_style),'Unfinished'] = 1
    table.loc[table['Unfinished']!= 1 ,'Unfinished'] = 0
    table.loc[table['HouseStyle'].isin(['SFoyer','SLvl']),'IsSplit'] = 1
    table.loc[table['IsSplit']!= 1 ,'IsSplit'] = 0   
    table["HouseStyle"] = table["HouseStyle"].map(
        {'SFoyer': 0, "SLvl": 0, "1Story": 1, "1.5Fin": 2, "1.5Unf": 2, "2Story": 3, "2.5Fin": 4, "2.5Unf": 4})
    
    table.drop('Utilities', axis = 1 , inplace = True)

    
modified_cols_round_2 = ['HouseStyle','LandContour','Condition2','RoofMatl','Heating','Electrical','Utilities']
object_cols = [x for x in object_cols if x not in modified_cols_round_2]
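
In the cell above, the values to keep were picked by hand from the charts. As an alternative sketch, the same grouping could be driven by a count threshold. Both the `min_count` value and the toy series below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical SaleType column with two rare values.
toy = pd.Series(["WD"] * 8 + ["New"] * 4 + ["COD"] * 3 + ["Con", "Oth"], name="SaleType")

min_count = 3  # hypothetical threshold below which a value counts as rare
counts = toy.value_counts()
keep = counts[counts >= min_count].index

# Keep frequent values as they are and replace everything else with "Others".
simplified = toy.where(toy.isin(keep), "Others")
print(simplified.value_counts())
```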

## Feature Clustering
Before we go on to one-hot encode our categorical features, note that some of them still have a lot of unique values.  
  
This would give our final training data a lot of columns, as every unique value is encoded into one extra column. So we can go one step further and simplify these features using clusters.  
  
To do that, we will use an unsupervised learning method, K-Means, to identify suitable clusters.
  
I intend to group the neighborhoods into 5 clusters and the subclasses into 4 clusters.
  
We try to give K-Means as much information as possible about the feature that we want to cluster: we use .describe() to compute summary statistics of SalePrice for each category and feed them into the model.

In [None]:
neighborhood = full.groupby(['Neighborhood'])['SalePrice'].describe()
display(neighborhood.head())

In [None]:
neighborhood_cluster = KMeans(n_clusters=5, random_state = 927)
neighborhood_cluster.fit(neighborhood)

In [None]:
neigh_cluster_table = pd.DataFrame(zip(list(neighborhood.index),list(neighborhood.loc[:,'mean']),list(neighborhood_cluster.labels_)),columns = ['Neighborhood','MeanSalePrice','Neighborhood Cluster'])
for i  in range(len(neigh_cluster_table.groupby('Neighborhood Cluster')['Neighborhood'].unique())):
    print("Cluster " + str(i))
    print(neigh_cluster_table.groupby('Neighborhood Cluster')['Neighborhood'].unique()[i])
sns.scatterplot(data = neigh_cluster_table, x='Neighborhood',y = 'MeanSalePrice', hue='Neighborhood Cluster',palette=sns.color_palette("Set2",5))
plt.xticks(rotation=70)
plt.show()
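
To see the idea in isolation, here is a small sketch on synthetic data. The four neighborhoods "A" to "D" and their price levels are made up, and only two clusters are used to keep the output easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical neighborhoods: two cheap ones and two expensive ones.
centers = {"A": 100_000, "B": 110_000, "C": 300_000, "D": 320_000}
toy = pd.DataFrame({
    "Neighborhood": np.repeat(list(centers), 30),
    "SalePrice": np.concatenate([rng.normal(c, 5_000, 30) for c in centers.values()]),
})

# One row of summary statistics per neighborhood, as with .describe() above.
stats = toy.groupby("Neighborhood")["SalePrice"].describe()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(stats)
clusters = pd.Series(kmeans.labels_, index=stats.index, name="Cluster")
print(clusters)
```

Note that the cluster numbers themselves carry no order (cluster 1 is not "better" than cluster 0), which is one reason to one-hot encode them afterwards rather than treat them as ordinal.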

In [None]:
subclass = full.groupby(['MSSubClass'])['SalePrice'].describe()
display(subclass.head())

In [None]:
subclass_cluster = KMeans(n_clusters=4, random_state = 927)
subclass_cluster.fit(subclass)

In [None]:
mssub_cluster_table = pd.DataFrame(zip(list(subclass.index),list(subclass.loc[:,'mean']),list(subclass_cluster.labels_)),columns = ['MSSubClass','MeanSalePrice','MSSubClass Cluster'])
for i  in range(len(mssub_cluster_table.groupby('MSSubClass Cluster')['MSSubClass'].unique())):
    print("Cluster " + str(i))
    print(mssub_cluster_table.groupby('MSSubClass Cluster')['MSSubClass'].unique()[i])
sns.scatterplot(data = mssub_cluster_table, x='MSSubClass',y = 'MeanSalePrice', hue='MSSubClass Cluster',palette=sns.color_palette("Set2",4))
plt.xticks(rotation=70)
plt.show()

In [None]:
mssub_cluster_table.drop('MeanSalePrice', axis = 1 ,inplace = True)
neigh_cluster_table.drop('MeanSalePrice', axis = 1, inplace = True)

In [None]:
df = pd.merge(left = df.reset_index(), right = mssub_cluster_table, how='left', on ='MSSubClass').set_index('Id')
df = pd.merge(left = df.reset_index(), right = neigh_cluster_table, how='left', on ='Neighborhood').set_index('Id')
df.drop('MSSubClass', axis = 1 ,inplace = True)
df.drop('Neighborhood', axis = 1 ,inplace = True)

In [None]:
test = pd.merge(left = test.reset_index(), right = mssub_cluster_table, how='left', on ='MSSubClass').set_index('Id')
test = pd.merge(left = test.reset_index(), right = neigh_cluster_table, how='left', on ='Neighborhood').set_index('Id')
test.drop('MSSubClass', axis = 1 ,inplace = True)
test.drop('Neighborhood', axis = 1 ,inplace = True)

After merging the clusters into our training and test sets, we also update the list of remaining categorical variables that we want to one-hot encode below.

In [None]:
modified_cols.append('MSSubClass')
modified_cols.append('Neighborhood')

In [None]:
object_cols.append('MSSubClass Cluster')
object_cols.append('Neighborhood Cluster')
object_cols.remove('MSSubClass')
object_cols.remove('Neighborhood')

## Feature Encoding Round 3 (Categorical)
We perform one-hot encoding on the remaining categorical variables.

In [None]:
# One Hot Encoding for Other Columns
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse_output=False on newer versions.
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(df[object_cols]))
OH_cols.index = df.index
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
OH_cols.columns = OH_encoder.get_feature_names_out(object_cols)
df = df.drop(object_cols, axis=1)
df = pd.concat([df, OH_cols], axis=1)

OH_cols = pd.DataFrame(OH_encoder.transform(test[object_cols]))
OH_cols.index = test.index
OH_cols.columns = OH_encoder.get_feature_names_out(object_cols)
test = test.drop(object_cols, axis=1)
test = pd.concat([test, OH_cols], axis=1)

## Feature Transformation (Skewed Features)
We should also take care of the skewness of the features in our dataset. We use skew() from the scipy.stats module to identify which columns are skewed.  
  
Any skewness greater than 0.5 is already considered slightly skewed, hence we apply a log-transformation (np.log1p) to every feature whose skewness is greater than that.

In [None]:
skewed = df[df.columns[~df.columns.isin(list(OH_cols.columns) + modified_cols + object_cols)]].apply(lambda x: skew(x.dropna().astype(float)))
skewed = skewed[skewed > 0.5]
skewed = skewed.index

df[skewed] = np.log1p(df[skewed])
test[skewed] = np.log1p(test[skewed])
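
Here is the same selection rule on a toy frame, as a sketch only: the two columns are synthetic (one log-normal, one uniform) and merely borrow familiar column names.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(9.0, 0.8, 2000),    # strongly right-skewed
    "YearBuilt": rng.uniform(1950, 2010, 2000),  # roughly symmetric
})

# Skewness per column, then keep only the columns above the 0.5 threshold.
skewness = toy.apply(lambda col: skew(col.astype(float)))
skewed_cols = skewness[skewness > 0.5].index

# log1p(x) = log(1 + x), which is safe for zero-valued entries.
toy[skewed_cols] = np.log1p(toy[skewed_cols])
print(skewness.round(2))
print("Transformed:", list(skewed_cols))
```

Like the cell above, this only catches right-skewed (positive) columns; left-skewed ones are left untouched.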

## Feature Scaling
While log-transformation took care of the skewness in the features, we also want to scale the features to a standardized range.  
  
Of the many scaling choices, such as MinMaxScaler and StandardScaler, we picked RobustScaler.  
  
The reasoning is that our data seem to be quite skewed and therefore tend to have more outliers than a normally distributed dataset. RobustScaler deals with that easily, as it scales the data using statistics that are insensitive to outliers.
  
A robust scaler subtracts the median and divides by the interquartile range, neither of which is much affected by outliers.

In [None]:
for col in df[df.columns]:
    if col not in (list(OH_cols.columns) + modified_cols + object_cols):
        scaler = RobustScaler()
        df[col] = scaler.fit_transform(df[[col]])
        test[col] = scaler.transform(test[[col]])
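
To make the formula concrete, the sketch below computes (x - median) / IQR by hand on five made-up values and compares it with RobustScaler using its default settings.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# Manual version: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

scaled = RobustScaler().fit_transform(x)
print(scaled.ravel())  # -> [-1.  -0.5  0.   0.5 48.5]
```

The outlier stays far away from the rest, but it does not change how the four ordinary values are scaled, which is the point of using the median and the IQR.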

## Feature Selection
Feature selection is a simple way to remove redundant and irrelevant features from our dataset, some of which contribute close to nothing.  
Removing irrelevant features can improve learning accuracy and reduce computation time.
  
By removing redundant features, we can also reduce the chance of our model overfitting to the data.
  
There are several ways to perform feature selection, some of which you have probably studied before, such as Pearson’s correlation and analysis of variance (ANOVA). In this notebook, we will use mutual information regression to estimate the dependency between each feature and our target variable.  
  
Mutual information is a non-negative value that measures the dependency between two variables. A mutual information of 0 means that the two variables are completely independent, so it is a safe bet to remove such features. Note that mutual information is also known as information gain (you may have heard of it before).
  
Mutual information measures the amount of information one can obtain about one random variable given another. Source: Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
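
A small synthetic example may help show why mutual information is a useful complement to correlation. Everything below is made up for illustration: `x_signal` drives the target through a square, so its Pearson correlation is close to zero even though the dependency is strong, while `x_noise` is unrelated to the target.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x_signal = rng.uniform(-1, 1, n)   # drives the target non-linearly
x_noise = rng.uniform(-1, 1, n)    # unrelated to the target
target = x_signal ** 2 + rng.normal(0, 0.01, n)

X = np.column_stack([x_signal, x_noise])
# random_state fixes the small random jitter used by the estimator.
mi = mutual_info_regression(X, target, random_state=0)
pearson = np.corrcoef(x_signal, target)[0, 1]

print("MI (signal, noise):", mi.round(3))
print("Pearson r (signal):", round(pearson, 3))
```

Because the estimate is itself noisy, an unrelated feature can score slightly above zero, so an exact-zero filter is a fairly conservative choice.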

In [None]:
full = pd.merge(left = df, right = y , left_index= True, right_index = True)
# Keep the feature matrix separate so that feature names and scores line up exactly.
X_mi = full.drop('SalePrice', axis = 1)
mi = mutual_info_regression(X = X_mi, y = full['SalePrice'])
mi_df = pd.DataFrame(list(zip(X_mi.columns,mi)), columns =['Feature','Mutual Info'])
mi_df = mi_df.sort_values('Mutual Info',ascending=False)

In [None]:
low_mi_df = mi_df[abs(mi_df['Mutual Info']) == 0]
filter_feature = sorted(list(low_mi_df['Feature']))
print("Number of features with zero mutual information dropped: " + str(len(filter_feature)))
df = df.drop(filter_feature,axis=1)
test = test.drop(filter_feature,axis=1)

## Polynomial and Interaction Features
Another part of feature creation! In this part, we create new polynomial and interaction features from the high mutual information features to derive new combinations that might be useful to our model later on.  
  
Polynomial features allow our linear models to capture non-linearity in the features, and interaction features let us see whether there are interesting relationships between the features themselves.
  
We could generate polynomial and interaction features from all of our features (a very large set) and then cherry-pick the good ones. There may be hidden relationships to uncover there, but I am quite satisfied with using only the features with the highest dependency on the target.
  
To read more about interaction features: https://stattrek.com/multiple-regression/interaction.aspx

In [None]:
top_mi_list = list(mi_df.head(20)['Feature'])
top_mi_subset = df[top_mi_list]
index_copy = top_mi_subset.index

poly = PolynomialFeatures(2, interaction_only=True)
poly_features = pd.DataFrame(poly.fit_transform(top_mi_subset),columns=poly.get_feature_names_out(top_mi_list))
poly_features = poly_features.iloc[:,len(top_mi_list) + 1:]
poly_features.set_index(index_copy, inplace = True)
poly_and_price = pd.concat([y,poly_features],axis=1).dropna()
top_20_poly = abs(poly_and_price.corr()['SalePrice']).sort_values(ascending=False)[1:21]

In [None]:
df = pd.concat([df,poly_features[top_20_poly.index]],axis=1)

In [None]:
top_mi_subset = test[top_mi_list]
index_copy = top_mi_subset.index
poly_features = pd.DataFrame(poly.transform(top_mi_subset),columns=poly.get_feature_names_out(top_mi_list))
poly_features = poly_features.iloc[:,len(top_mi_list) + 1:]
poly_features.set_index(index_copy, inplace = True)
test = pd.concat([test,poly_features[top_20_poly.index]],axis=1)

In [None]:
top_20_poly.index

## Outlier Identification
Outliers: statistics textbooks like to assume that the data are normal and well behaved.
  
Too bad they usually are not. A bad outlier increases the variance of our model and reduces its ability to fit the rest of the data. Outliers cause regression models (especially linear ones) to learn a fit that is skewed towards the outlier.  
  
Isolation Forest, much like its name suggests, works to isolate a single tree in a huge forest. It randomly subsamples the data and builds binary decision trees that split on randomly selected features. An outlier needs fewer splits to be isolated. Conversely, a data point that is not an outlier requires many more splits to be isolated. 
[Read more about Isolation Forest on my article.](https://medium.com/@limyenwee_19946/unsupervised-outlier-detection-with-isolation-forest-eab398c593b2)
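
As a quick illustration, here is a small sketch on synthetic data. The data and all `toy_` names are made up for this example. One far-away point is added to a cluster, and we check how `IsolationForest` labels and scores it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Hypothetical data: 200 points clustered around the origin plus one distant point.
toy_X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[12.0, 12.0]]])

toy_forest = IsolationForest(random_state=0)
toy_labels = toy_forest.fit_predict(toy_X)    # 1 = inlier, -1 = outlier
toy_scores = toy_forest.score_samples(toy_X)  # lower score = easier to isolate

print("Label of the distant point:", toy_labels[-1])
print("Its score vs. the median score:", toy_scores[-1], np.median(toy_scores))
```

The distant point is isolated after very few splits, so it gets a low score and the label `-1`. Rows labelled `1` are the ones the notebook keeps.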

In [None]:
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(random_state=0)
# Index the labels by df itself so they always line up with the rows that were scored
df_without_outlier = pd.Series(iso_forest.fit_predict(df), index = df.index)
df = df.loc[df_without_outlier.index[df_without_outlier == 1],:]

Another way to identify outliers is to use standardized residuals from a linear model. Because the residuals are standardized, we can read them in units of standard deviations, which makes abnormal residuals easy to spot. Any residual more than 3 standard deviations from the mean is usually considered an outlier.
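
Here is a minimal sketch of the rule on synthetic data. Everything prefixed with `toy_` is made up for illustration. I fit a straight line, inject one gross error, and flag residuals beyond 3 standard deviations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Hypothetical data: a noisy straight line with one corrupted observation.
toy_x = np.linspace(0, 10, 100).reshape(-1, 1)
toy_y = 2.0 * toy_x.ravel() + 1.0 + rng.normal(0, 0.5, size=100)
toy_y[50] += 10.0  # inject one gross error at position 50

toy_fit = LinearRegression().fit(toy_x, toy_y)
toy_resid = toy_y - toy_fit.predict(toy_x)

# Standardize the residuals, then flag anything beyond 3 standard deviations.
toy_z = (toy_resid - toy_resid.mean()) / toy_resid.std()
toy_flagged = np.where(np.abs(toy_z) > 3)[0]
print(toy_flagged)
```

Note that NumPy's `std` uses the population formula while a pandas Series defaults to the sample formula, so the exact z-values differ slightly between the two. With this many rows the difference is negligible.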

In [None]:
full = pd.merge(left = df, right = y , left_index= True, right_index = True)
linear = LinearRegression()
Y = full['SalePrice']
linear.fit(full.drop(['SalePrice'],axis=1), Y)
Y_hat = linear.predict(full.drop(['SalePrice'],axis=1))
residuals = Y - Y_hat
y_vs_yhat_df = pd.DataFrame(zip(Y.values,Y_hat,residuals),columns=['y','yhat','residuals'],index=full.index)

r2 = r2_score(Y, Y_hat)
print("About " + str(round(r2 * 100,2)) + "% of variation in the Sale Price can be explained by the model.")

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=Y, y=Y_hat)
sns.lineplot(x=np.linspace(10.5,13.5), y=np.linspace(10.5,13.5), color='black', linewidth=2.5)
plt.show()

In [None]:
standard_residuals = (residuals - residuals.mean()) / residuals.std()
outliers = full[abs(standard_residuals) > 3]
y_vs_yhat_df.loc[y_vs_yhat_df.index.isin(outliers.index),'Outlier'] = 1
y_vs_yhat_df.loc[y_vs_yhat_df['Outlier'] != 1 ,'Outlier'] = 0

In [None]:
sns.scatterplot(data = y_vs_yhat_df, x='y', y='yhat',hue ='Outlier', palette = ['blue','red'])
sns.lineplot(x=np.linspace(10.5,13.5), y=np.linspace(10.5,13.5), color='black', linewidth=2.5)
plt.show()

In [None]:
df = df.loc[y_vs_yhat_df[y_vs_yhat_df['Outlier'] == 0].index,:]

In [None]:
df = df.drop(list(test.columns[test.nunique()== 1 ]),axis=1)
test = test.drop(list(test.columns[test.nunique()== 1]),axis=1)

# **Modelling**
For this part, we will be using Ridge, XGBoost, CatBoost, SVR, Huber and a stacked regression.  
Keep in mind that this notebook does not show the GridSearch step used to tune the hyperparameters, as that takes some time to run.
  
Note that you do not have to use as many models as I did. I tried out more models and decided to stick with them. The predictions of the models will later be averaged (an ensemble model), and we will also implement a stacked regressor at the same time.  
  
A stacked regressor is an ensemble model in which a second-level (meta) regressor learns to combine the predictions made by the different base models into the final output. You can read more about stacked models here:  
https://www.analyticsvidhya.com/blog/2020/12/improve-predictive-model-score-stacking-regressor/
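
To show the mechanics, here is a hand-rolled sketch of stacking that uses only scikit-learn on synthetic data. It is not how `StackingCVRegressor` is implemented internally, and the models, parameters and `toy_` names are my own assumptions for illustration. The key idea is that the meta-model is trained on out-of-fold predictions of the base models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

# Hypothetical synthetic regression problem.
toy_X, toy_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

toy_base = [Ridge(alpha=1.0), SVR(C=100.0)]

# Level 0: out-of-fold predictions, so each row is predicted by a model
# that did not see that row during training.
toy_meta_X = np.column_stack(
    [cross_val_predict(model, toy_X, toy_y, cv=5) for model in toy_base]
)

# Level 1: the meta-model learns how to combine the base predictions.
toy_meta = LinearRegression().fit(toy_meta_X, toy_y)

# For prediction time, refit every base model on all of the data.
for model in toy_base:
    model.fit(toy_X, toy_y)
toy_pred = toy_meta.predict(np.column_stack([m.predict(toy_X) for m in toy_base]))

print("Meta-model weights for each base model:", toy_meta.coef_)
```

With `use_features_in_secondary=True`, as used in this notebook, mlxtend additionally feeds the original features to the meta-regressor alongside the base predictions.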

In [None]:
full = pd.merge(left = df, right = y , left_index= True, right_index = True)
train_y = full['SalePrice']
train_X = full.drop(['SalePrice'],axis=1)

dev_train, dev_test = train_test_split(full, test_size=0.2 ,shuffle=True)
dev_train_y = dev_train['SalePrice']
dev_train_X = dev_train.drop(['SalePrice'],axis=1)
dev_test_y = dev_test['SalePrice']
dev_test_X = dev_test.drop(['SalePrice'],axis=1)

In [None]:
ridgemodel = Ridge(alpha=26)

xgbmodel = XGBRegressor(alpha= 3, colsample_bytree=0.5, reg_lambda=3, learning_rate= 0.01,\
           max_depth=3, n_estimators=10000, subsample=0.65)

svrmodel = SVR(C=8, epsilon=0.00005, gamma=0.0008)

hubermodel = HuberRegressor(alpha=30,epsilon=3,fit_intercept=True,max_iter=2000)

cbmodel = cb.CatBoostRegressor(loss_function='RMSE',colsample_bylevel=0.3, depth=2, \
          l2_leaf_reg=20, learning_rate=0.005, n_estimators=15000, subsample=0.3,verbose=False)

stackmodel = StackingCVRegressor(regressors=(ridgemodel, xgbmodel, svrmodel, hubermodel, cbmodel),
             meta_regressor=cbmodel, use_features_in_secondary=True)

We will first fit the models on the development train set and evaluate them on both the development train and test sets, to get a quick overview of model performance.

In [None]:
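start = time.time()
print("Recording Modelling Time")
for i in [ridgemodel,hubermodel,cbmodel,svrmodel,xgbmodel,stackmodel]:
    # Fit on the development train split only, so the development test
    # split stays unseen and its score is an honest estimate
    if i is stackmodel:
        # The stacked model is fitted on NumPy arrays
        i.fit(np.array(dev_train_X), np.array(dev_train_y))
    else:
        i.fit(dev_train_X, dev_train_y)
end = time.time()
print("Time Elapsed: " + str(round((end - start)/60,0)) + " minutes.")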

In [None]:
print("-----------------------------")
print("Overview of model performance")
print("-----------------------------")
for i in [ridgemodel,hubermodel,cbmodel,svrmodel,xgbmodel,stackmodel]:
    print("\n")
    print(i)
    print("RMSLE of Development train set: ")
    # np.sqrt of the MSE works across scikit-learn versions (the squared=False argument was deprecated)
    print(np.sqrt(mean_squared_error(dev_train_y, i.predict(dev_train_X))))
    print("RMSLE of Development test set: ")
    print(np.sqrt(mean_squared_error(dev_test_y, i.predict(dev_test_X))))
    print("\n")
print("-----------------------------")
print("RMSLE of Development test set using ensemble model: ")
# Simple average of the six models' predictions on the held-out development test set
fit = (svrmodel.predict(dev_test_X) + xgbmodel.predict(dev_test_X) + stackmodel.predict(dev_test_X) + ridgemodel.predict(dev_test_X) + hubermodel.predict(dev_test_X) + cbmodel.predict(dev_test_X)) / 6
print(np.sqrt(mean_squared_error(dev_test_y, fit)))
print("-----------------------------")
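
A short note on why the numbers above are labelled RMSLE even though they are computed with `mean_squared_error`. This assumes the target was log-transformed earlier in the notebook, which the `np.exp` in the submission step suggests. The RMSE between log prices is the same as the RMSLE between prices, as this sketch on made-up prices shows (all `toy_` names are hypothetical).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices.
toy_price = np.array([100000.0, 200000.0, 300000.0])
toy_pred_price = np.array([110000.0, 190000.0, 330000.0])

# RMSE computed on the log-transformed values...
rmse_on_logs = np.sqrt(mean_squared_error(np.log(toy_price), np.log(toy_pred_price)))

# ...equals the RMSLE written out directly on the prices.
rmsle_direct = np.sqrt(np.mean((np.log(toy_pred_price) - np.log(toy_price)) ** 2))

print(rmse_on_logs, rmsle_direct)
```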

This time we fit the models with all the data.

In [None]:
start = time.time()
print("Recording Modelling Time")
for i in [ridgemodel,hubermodel,cbmodel,svrmodel,xgbmodel,stackmodel]:
    if i is stackmodel:
        # Fit the stacked model once, on NumPy arrays, instead of fitting it twice
        i.fit(np.array(train_X), np.array(train_y))
    else:
        i.fit(train_X,train_y)
end = time.time()
print("Time Elapsed: " + str(round((end - start)/60,0)) + " minutes.")

# Submission
The predictions of the models are again averaged.  
The models are given different weights based on my confidence in them and my experience using them.  
Feel free to modify the weights to obtain different (maybe better) results.
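
As a small sketch of the weighting, with made-up numbers (the predictions, weights and `toy_` names here are hypothetical, not the ones used for the submission): each model's log-price predictions are converted back to prices first, and the weighted mean is taken afterwards.

```python
import numpy as np

# Hypothetical log-price predictions from three models for two houses.
toy_log_preds = np.array([
    [12.0, 11.5],  # model A
    [12.1, 11.4],  # model B
    [11.9, 11.6],  # model C
])
toy_weights = np.array([1.0, 3.0, 5.0])

# Back-transform to prices, then take the weighted mean across models.
toy_blend = np.average(np.exp(toy_log_preds), axis=0, weights=toy_weights)
print(toy_blend)
```

`np.average` divides by the sum of the weights, which is the same as writing out the weighted sum and dividing by the total weight by hand.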
  
---
Finally, submit it to Kaggle.  

Final score: **RMSLE of 0.11878** (Top 3% on Leaderboard)

In [None]:
final_prediction = (np.exp(ridgemodel.predict(test))+ 3 * np.exp(xgbmodel.predict(test)) \
+  5 * np.exp(stackmodel.predict(test)) + 4 * np.exp(svrmodel.predict(test)) \
+  np.exp(hubermodel.predict(test)) +  np.exp(cbmodel.predict(test))) / 15

In [None]:
submission = pd.DataFrame(final_prediction, index = test.index)

In [None]:
submission.reset_index(drop=False, inplace = True)
submission = submission.rename(columns={0 : 'SalePrice', 'index' : 'Id'})
submission.to_csv('submission.csv', index=False)

# Thanks!
Thanks for taking the time to read this kernel.  
If you have any questions, comments or suggestions, feel free to leave a comment. It would be much appreciated!

# References
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard/notebook by **SERIGNE**  
https://www.kaggle.com/humananalog/xgboost-lasso/script by **HUMAN ANALOG**