![png](./docs/header.jpg)

# House Prices -  `Model V8`
### Top 9% on Kaggle August 27th, 2019 

This approach combines a lot of rather rough **feature engineering**:
+ [X] imputing missing values
+ [X] deleting obvious outliers
+ [X] encoding categorical variables
+ [X] creating dummy features
+ [X] transforming skewed features
+ [X] adding new features

and **stacking different regression models** with `SkLearn`, `XGBoost` and `LightGBM`, simply taking the mean of their predictions as the final prediction.

---

This is the 8th iteration of my house price model. Of course I did a lot of data exploration as well, which can be found in the other notebook in this repository. Along the way I got lots of helpful information and tips from all kinds of blogs and notebooks (linked in the docs folder), most notably:

+ https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
+ https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

In [1]:
# import some modules (more later)
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew

In [2]:
# import data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape)
print(test.shape)

(1460, 81)
(1459, 80)


There are several features (up to nearly 40) that seem uncorrelated with the `SalePrice`. Nonetheless, they become very helpful for squeezing out the last bit needed to reach a top score! The `Id`, however, is irrelevant and will be dropped. So is `Utilities`, as it adds no information.

## Data reduction

In [3]:
# dropping features
test_ID = test['Id']
dropF = ["Id", "Utilities"]
for f in dropF:
    train = train.drop(f, axis = 1)
    test = test.drop(f, axis = 1)

(1460, 79)
(1459, 78)


I read about outlier detection in several notebooks, but I can't stress enough how **dangerous** automating outlier detection is! Although it seemed to bring some improvement for models with only a small subset of the given features, it worsened the results drastically once all the less correlated features were included. I will only drop some obvious outliers **manually**! (See ....)

In [4]:
# deleting outliers from GrLivArea
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
ntrain = train.shape[0]
ntest = test.shape[0]

(1458, 79)
(1459, 78)
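
The manual rule above can be tried out on a tiny frame first. This is only a sketch: the toy values are invented for illustration, and just the two thresholds (4000 and 300000) come from the cell above.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from train.csv)
toy = pd.DataFrame({
    "GrLivArea": [1500, 2400, 4700, 5600, 4300],
    "SalePrice": [180000, 250000, 185000, 160000, 750000],
})

# Same rule as above: very large living area but a low sale price
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)
cleaned = toy.drop(toy[mask].index)

print(mask.sum(), "rows flagged")
print(cleaned)
```

Note that the large but expensive house stays in the data: only the combination of a large area and a low price is treated as an outlier.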


Although XGBoost alone generally doesn't need any feature normalization (and it even worsened smaller models), transforming the skewed `SalePrice` helped when using a stacked model. Don't transform other features as there seem to be differences to the test set.

In [5]:
# correct skew
train["SalePrice"] = np.log1p(train["SalePrice"])

(1458, 79)
(1459, 78)
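
To get a feeling for what `np.log1p` does to a right-skewed target, here is a small sketch on synthetic data. The log-normal "prices" are an assumption made for illustration and are not the real `SalePrice` values; the point is only that the skew shrinks and that `np.expm1` undoes the transform.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (log-normal), an assumption for illustration
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)    # transform applied before training
restored = np.expm1(log_prices)  # inverse transform, needed for the submission

print("skew before:", skew(prices))
print("skew after: ", skew(log_prices))
print("round trip ok:", np.allclose(restored, prices))
```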


## Feature engineering

In [6]:
# merge train and test into a single data frame
y_train = train['SalePrice'].values
data = pd.concat((train, test)).reset_index(drop=True)
data = data.drop(['SalePrice'], axis=1)


(2917, 79)
(2917, 78)


In [7]:
# percentage of missing values per feature
dataNa = (data.isnull().sum() / len(data)) * 100
dataNa = dataNa.drop(dataNa[dataNa == 0].index).sort_values(ascending=False)[:30]
# how much data is missing now?
missing = pd.DataFrame({'Missing [%]':dataNa})
missing.head()

(2917, 78)


| | Missing [%] |
|---|---|
| PoolQC | 99.691464 |
| MiscFeature | 96.400411 |
| Alley | 93.212204 |
| Fence | 80.425094 |
| FireplaceQu | 48.680151 |


In [8]:
# Filling nonexisting data
nonFeature = ["PoolQC", "Alley", "Fence", "MiscFeature", "FireplaceQu", "GarageType", "GarageFinish",
              "GarageQual", "GarageCond", "MSSubClass", "BsmtQual", "BsmtCond", "BsmtExposure",
              "BsmtFinType1", "BsmtFinType2", "MasVnrType"]
for f in nonFeature:
    data[f] = data[f].fillna("None")

# Filling missing data with zeros
zeroFeature = ["MasVnrArea", "GarageYrBlt", "GarageArea", "GarageCars", "BsmtFinSF1", "BsmtFinSF2",
               "BsmtUnfSF","TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath"]
for f in zeroFeature:
    data[f] = data[f].fillna(0)

# Filling missing data with the median
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))

# Filling categorical NaN features
data["Functional"] = data["Functional"].fillna("Typ")
catFeature = ['MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st', 'Exterior2nd', 'SaleType']
for f in catFeature:
    data[f] = data[f].fillna(data[f].mode()[0])


(2917, 78)
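
The `LotFrontage` line is the least obvious of the imputation steps, so here is a minimal sketch of the same group-wise median fill on a toy frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 20.0, np.nan, 40.0],
})

# Fill each gap with the median of its own neighborhood
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda x: x.fillna(x.median()))
)
print(toy)
```

Each missing value is replaced by the median of the houses in the same neighborhood rather than by the global median.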


In [9]:
# percentage of missing values per feature
dataNa = (data.isnull().sum() / len(data)) * 100
dataNa = dataNa.drop(dataNa[dataNa == 0].index).sort_values(ascending=False)[:30]
# how much data is missing now?
missing = pd.DataFrame({'Missing [%]':dataNa})
missing.head()


(2917, 78)


| | Missing [%] |
|---|---|


(Here we check if anything is still missing.)


We will now encode some of the categorical features as integer labels using `SkLearn`. This might not be the best decision, but this way I avoided creating too much dimensionality.

In [10]:
from sklearn.preprocessing import LabelEncoder

catFeature = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
for f in catFeature:
    lbl = LabelEncoder() 
    lbl.fit(list(data[f].values)) 
    data[f] = lbl.transform(list(data[f].values))
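
A small sketch of what label encoding does compared to dummy features, using a toy column of quality grades (the column itself is invented for illustration). One thing worth keeping in mind is that `LabelEncoder` numbers the classes in sorted order, so the integer codes do not follow the quality ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with quality grades, invented for illustration
quality = pd.Series(["Ex", "Gd", "TA", "Fa", "Po", "None", "Gd"])

lbl = LabelEncoder()
codes = lbl.fit_transform(quality)

# Mapping from class to integer code (classes are sorted alphabetically)
print({c: i for i, c in enumerate(lbl.classes_)})
print(codes)

# Label encoding keeps a single column, dummies need one column per class
print(pd.get_dummies(quality).shape)
```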

Now the fun part! Let's add new features!

In [11]:
# total square footage feature
data["TotalSF"] = data["TotalBsmtSF"] + data["1stFlrSF"] + data["2ndFlrSF"]
# people (rough proxy: two per garage car space)
data["People"] = 2 * data["GarageCars"]

(2917, 79)
(2917, 80)


And let's remove the skew from a lot of numerical features:

In [12]:
numFeatures = data.dtypes[data.dtypes != "object"].index

# Check the skew of all numerical features
skewedFeatures = data[numFeatures].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewed = pd.DataFrame({'Skew': skewedFeatures})
skewed.head()

| | Skew |
|---|---|
| MiscVal | 21.939672 |
| PoolArea | 17.688664 |
| LotArea | 13.109495 |
| LowQualFinSF | 12.084539 |
| 3SsnPorch | 11.37208 |


Using the Box-Cox transformation we reduce the skew of the numerical features:

In [13]:
from scipy.special import boxcox1p

# list of highly skewed features
# keep only the rows whose absolute skew exceeds 1
skewed = skewed[abs(skewed['Skew']) > 1]
skewedFeatures = skewed.index

# transform these features
lamda = 0.2
for f in skewedFeatures:
    data[f] = boxcox1p(data[f], lamda)
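
For reference, `boxcox1p(x, lmbda)` computes `((1 + x)**lmbda - 1) / lmbda` for a non-zero `lmbda` and falls back to `log1p(x)` for `lmbda = 0`. The sketch below checks this on a few made-up values; `lam` is just a local name for the same 0.2 used above.

```python
import numpy as np
from scipy.special import boxcox1p

# Made-up values spanning several orders of magnitude
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
lam = 0.2  # same value as `lamda` above

transformed = boxcox1p(x, lam)
by_hand = ((1.0 + x) ** lam - 1.0) / lam  # definition for lam != 0

print(transformed)
print(np.allclose(transformed, by_hand))
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))
```

Large values are compressed much more strongly than small ones, which is what pulls in the long right tail of features like `LotArea`.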

And we split our categorical features into several dummy features (0, 1), increasing the number of features immensely:

In [41]:
print(data.shape)
data = pd.get_dummies(data)
print(data.shape)

(2917, 221)
(2917, 221)
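
How `pd.get_dummies` grows the feature count is easiest to see on a toy frame. The three rows below are invented for illustration; numeric columns are left untouched and every text column is replaced by one column per category.

```python
import pandas as pd

# Toy frame, invented for illustration: one numeric and two text columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RM", "RL"],
    "SaleType": ["WD", "New", "WD"],
})

dummies = pd.get_dummies(toy)
print(toy.shape, "->", dummies.shape)
print(list(dummies.columns))
```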


Now, before we go into modeling the processed data, let us split it again into a train and test set.

In [42]:
train = data[:ntrain]
test = data[ntrain:]
print(train.shape)
print(test.shape)

(1458, 221)
(1459, 221)


---
## Model

In [43]:
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

In [44]:
# Lasso Regression
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
# Elastic Net
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
# Kernel Ridge
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
# GB Regressor
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)
# XGB Regressor
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)
# Light GBM Regressor
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
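
The notebook imports `KFold` and `cross_val_score` but does not show how the models were compared. One way this could be done is sketched below. Everything here is an assumption for illustration: the helper name `rmse_cv`, the 5 folds, the shuffling seed and the synthetic data that stands in for `train` and `y_train`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def rmse_cv(model, X, y, n_folds=5):
    """Hypothetical helper: RMSE per fold from shuffled K-fold cross-validation."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    neg_mse = cross_val_score(model, X, y,
                              scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(-neg_mse)


# Synthetic stand-in for (train, y_train), not the real data
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

lasso_demo = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
scores = rmse_cv(lasso_demo, X, y)
print("RMSE: {:.4f} ({:.4f})".format(scores.mean(), scores.std()))
```

On the real data the same helper could be called with `train` and `y_train`; since the target is log-transformed, the score would then be an RMSE on the log scale.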

In [45]:
# make individual predictions
pred1 = lasso.fit(train,y_train).predict(test)
pred2 = ENet.fit(train,y_train).predict(test)
pred3 = KRR.fit(train,y_train).predict(test)
pred4 = GBoost.fit(train,y_train).predict(test)
pred5 = model_xgb.fit(train,y_train).predict(test)
pred6 = model_lgb.fit(train,y_train).predict(test)

# use the mean to make final predictions
pred = np.expm1(np.mean([pred1, pred2, pred3, pred4, pred5, pred6], axis=0))
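
Note that the blend above averages the predictions on the `log1p` scale and only then maps them back to prices. A small sketch with made-up numbers shows that this is not the same as averaging the prices themselves: because `expm1` is convex, the log-space blend is never larger than the plain average of the prices.

```python
import numpy as np

# Made-up log1p-scale predictions from three models for two houses
log_preds = np.array([
    [12.0, 11.5],
    [12.2, 11.4],
    [12.1, 11.6],
])

# Average per house in log space, then map back to prices (as done above)
blended = np.expm1(log_preds.mean(axis=0))

# For comparison: map back to prices first, then average
arithmetic = np.expm1(log_preds).mean(axis=0)

print(blended)
print(arithmetic)
```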

In [46]:
# save the predictions for submission
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = pred
sub.to_csv('submissionV8.csv',index=False)