# Unleashing the mystery of Kaggle
- How does it all work?
- Feature Engineering
- Parameter Tuning
- Ensembling & Stacking

## How does it all work - Should I trust the public leaderboard?

- Each Kaggle competition has a public and a private leaderboard. The public leaderboard is scored on only part of the test dataset; the private leaderboard is evaluated on the remaining part at the end of the competition.
- You can find how Kaggle calculates the public and private leaderboards [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard).
- If the competition has a large training set and a public test set that is small relative to the private test set, you can easily overfit the public test set. In this case, you **should not** trust the public leaderboard.
- If the training set and test set are collected from different time frames, you **must** trust the public leaderboard, since cross-validation on the training data may not reflect the test period.

![img](https://s3.amazonaws.com/nycdsabt01/s2-4.png)

### CV or LB?
- **TRUST YOUR CV!**
- Typical question on smaller datasets: 
 - “I’m doing proper cross-validation and see improvements on my CV score, but public leaderboard is so random and does not correlate at all!”
- Top kagglers’ pick most of the time:
 - Final Submission = $X*CV + (1-X)*LB$, typically $X=0.5$ is OK.
- Trusting CV is a hard thing to do
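
The weighted rule above can be sketched as follows. This is only an illustration: the submission names and scores are made up, and it assumes a metric where lower is better (such as RMSLE).

```python
# Hypothetical (CV score, public LB score) pairs; lower is better.
candidates = {
    'submission_a': (0.120, 0.131),
    'submission_b': (0.118, 0.140),
    'submission_c': (0.125, 0.128),
}

X = 0.5  # weight given to the CV score, as suggested above

# Combined score = X * CV + (1 - X) * LB for every candidate submission.
combined = {name: X * cv + (1 - X) * lb for name, (cv, lb) in candidates.items()}

# Pick the candidate with the best (lowest) combined score.
final_submission = min(combined, key=combined.get)
print(final_submission, combined[final_submission])
```

Raising `X` towards 1 makes the choice depend almost entirely on CV, which is the safer option when the public test set is small.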

### Preprocess the dataframes before we dive into the details.

In [28]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 100)

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [29]:
# Save the 'Id' column
train_ID = train_df['Id']
test_ID = test_df['Id']

# Now drop the 'Id' column since we cannot use it as a feature to train our model.
train_df.drop("Id", axis = 1, inplace = True)
test_df.drop("Id", axis = 1, inplace = True)

In [56]:
test_ID.shape

(1459,)

In [30]:
y_train = train_df['SalePrice']
X_train = train_df.drop('SalePrice', axis=1)
X_test = test_df.copy()

- Delete the dataframes that you do not need anymore to save memory.

In [31]:
del train_df, test_df

In [32]:
print(X_train.shape)
print(X_test.shape)

(1460, 79)
(1459, 79)


- Combine training and test dataframes before feature engineering.
- **This is not always the correct way.**
 - For categorical features, this is fine because you want to avoid categories that appear only in the test set, which would give the two sets different dimensions after dummifying.
 - If you want to perform any transformation (normalization, standardization, etc.) on the numerical features, you should **[fit on the training set and transform on the test set.](https://stats.stackexchange.com/a/174865)**
 - This also applies to how you perform cross-validation. See Section 7.10.2 of [ESL](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) (The Elements of Statistical Learning).
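
A minimal sketch of "fit on the training set, transform on the test set", assuming scikit-learn's `StandardScaler` as the transformation. The arrays are toy values made up for illustration, not the competition data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical feature (made-up values).
train_area = np.array([[850.0], [1200.0], [1710.0], [2200.0]])
test_area = np.array([[900.0], [3000.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only.
train_scaled = scaler.fit_transform(train_area)
# Reuse the training statistics on the test rows; do not call fit on them.
test_scaled = scaler.transform(test_area)

print(scaler.mean_, test_scaled.ravel())
```

Inside cross-validation the same rule applies per fold: fit the scaler on the fold's training part and only transform the fold's validation part.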

In [33]:
all_data = pd.concat([X_train, X_test], ignore_index=True)
all_data.shape

(2919, 79)

## Feature Engineering - most creative aspect of Data Science

### Categorical features
- Nearly always need some treatment
- High cardinality can create very sparse data

#### One-hot encoding
- One-of-K encoding on an array of length K
- Basic method: used with most linear algorithms
- Dropping the first column avoids collinearity
 - Encoding gender as two variables, **is_male** and **is_female**, produces two features that are perfectly negatively correlated
- Encode only categories appearing 3+ times
 - Reduces the training feature space with almost no loss of information.
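
One way to apply the "3+ times" rule is sketched below. It assumes that rare levels are merged into a shared `'Other'` bucket before one-hot encoding; that bucket name and the toy column are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Toy column (made-up values).
s = pd.Series(['RL', 'RL', 'RL', 'RM', 'RM', 'RM', 'FV', 'RH'], name='MSZoning')

min_count = 3  # threshold from the bullet above
counts = s.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent levels; every rare level falls into one shared 'Other' bucket.
s_grouped = s.where(s.isin(frequent), 'Other')

# drop_first=True removes one redundant column to avoid perfect collinearity.
dummies = pd.get_dummies(s_grouped, prefix='MSZoning', drop_first=True)
print(dummies.columns.tolist())
```

In a real competition the counts should come from the training rows only, for the same reason as any other fitted transformation.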

In [34]:
for c in all_data.columns:
    if all_data[c].dtype == 'object':
        print(c, len(all_data[c].value_counts()))

MSZoning 5
Street 2
Alley 2
LotShape 4
LandContour 4
Utilities 2
LotConfig 5
LandSlope 3
Neighborhood 25
Condition1 9
Condition2 8
BldgType 5
HouseStyle 8
RoofStyle 6
RoofMatl 8
Exterior1st 15
Exterior2nd 16
MasVnrType 4
ExterQual 4
ExterCond 5
Foundation 6
BsmtQual 4
BsmtCond 4
BsmtExposure 4
BsmtFinType1 6
BsmtFinType2 6
Heating 6
HeatingQC 5
CentralAir 2
Electrical 5
KitchenQual 4
Functional 7
FireplaceQu 5
GarageType 6
GarageFinish 3
GarageQual 5
GarageCond 5
PavedDrive 3
PoolQC 3
Fence 4
MiscFeature 4
SaleType 9
SaleCondition 6


In [35]:
one_hot_df = pd.get_dummies(all_data, drop_first=True, dummy_na=True)
one_hot_df.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,MSZoning_nan,Street_Pave,Street_nan,Alley_Pave,Alley_nan,LotShape_IR2,LotShape_IR3,LotShape_Reg,LotShape_nan,LandContour_HLS,...,FireplaceQu_TA,FireplaceQu_nan,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_nan,GarageFinish_RFn,GarageFinish_Unf,GarageFinish_nan,GarageQual_Fa,GarageQual_Gd,GarageQual_Po,GarageQual_TA,GarageQual_nan,GarageCond_Fa,GarageCond_Gd,GarageCond_Po,GarageCond_TA,GarageCond_nan,PavedDrive_P,PavedDrive_Y,PavedDrive_nan,PoolQC_Fa,PoolQC_Gd,PoolQC_nan,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_nan,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,MiscFeature_nan,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleType_nan,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_nan
0,60,65.0,8450,7,5,2003,2003,196.0,706.0,0.0,150.0,856.0,856,854,0,1710,1.0,0.0,2,1,3,1,8,0,2003.0,2.0,548.0,0,61,0,0,0,0,0,2,2008,0,0,1,0,0,1,0,0,1,0,0,1,0,0,...,0,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
1,20,80.0,9600,6,8,1976,1976,0.0,978.0,0.0,284.0,1262.0,1262,0,0,1262,0.0,1.0,2,0,3,1,6,1,1976.0,2.0,460.0,298,0,0,0,0,0,0,5,2007,0,0,1,0,0,1,0,0,1,0,0,1,0,0,...,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
2,60,68.0,11250,7,5,2001,2002,162.0,486.0,0.0,434.0,920.0,920,866,0,1786,1.0,0.0,2,1,3,1,6,1,2001.0,2.0,608.0,0,42,0,0,0,0,0,9,2008,0,0,1,0,0,1,0,0,1,0,0,0,0,0,...,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
3,70,60.0,9550,7,5,1915,1970,0.0,216.0,0.0,540.0,756.0,961,756,0,1717,1.0,0.0,1,0,3,1,7,1,1998.0,3.0,642.0,0,35,272,0,0,0,0,2,2006,0,0,1,0,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,60,84.0,14260,8,5,2000,2000,350.0,655.0,0.0,490.0,1145.0,1145,1053,0,2198,1.0,0.0,2,1,4,1,9,1,2000.0,3.0,836.0,192,84,0,0,0,0,0,12,2008,0,0,1,0,0,1,0,0,1,0,0,0,0,0,...,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0


#### Label encoding
- Give every level of a categorical variable a unique numerical ID
- Useful for non-linear tree-based algorithms
- Does not increase dimensionality

In [36]:
from sklearn.preprocessing import LabelEncoder

label_df = all_data.copy()

for c in label_df.columns:
    if label_df[c].dtype == 'object':
        le = LabelEncoder()
        # Need to convert the column type to string in order to encode missing values
        label_df[c] = le.fit_transform(label_df[c].astype(str))

In [37]:
label_df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,3,65.0,8450,1,2,3,3,0,4,0,5,2,2,0,5,7,5,2003,2003,1,1,12,13,1,196.0,2,4,2,2,3,3,2,706.0,5,0.0,150.0,856.0,1,0,1,4,856,854,0,1710,1.0,0.0,2,1,3,1,2,8,6,0,5,1,2003.0,1,2.0,548.0,4,4,2,0,61,0,0,0,0,3,4,4,0,2,2008,8,4
1,20,3,80.0,9600,1,2,3,3,0,2,0,24,1,2,0,2,6,8,1976,1976,1,1,8,8,2,0.0,3,4,1,2,3,1,0,978.0,5,0.0,284.0,1262.0,1,0,1,4,1262,0,0,1262,0.0,1.0,2,0,3,1,3,6,6,1,4,1,1976.0,1,2.0,460.0,4,4,2,298,0,0,0,0,0,3,4,4,0,5,2007,8,4
2,60,3,68.0,11250,1,2,0,3,0,4,0,5,2,2,0,5,7,5,2001,2002,1,1,12,13,1,162.0,2,4,2,2,3,2,2,486.0,5,0.0,434.0,920.0,1,0,1,4,920,866,0,1786,1.0,0.0,2,1,3,1,2,6,6,1,4,1,2001.0,1,2.0,608.0,4,4,2,0,42,0,0,0,0,3,4,4,0,9,2008,8,4
3,70,3,60.0,9550,1,2,0,3,0,0,0,6,2,2,0,5,7,5,1915,1970,1,1,13,15,2,0.0,3,4,0,3,1,3,0,216.0,5,0.0,540.0,756.0,1,2,1,4,961,756,0,1717,1.0,0.0,1,0,3,1,2,7,6,1,2,5,1998.0,2,3.0,642.0,4,4,2,0,35,272,0,0,0,3,4,4,0,2,2006,8,0
4,60,3,84.0,14260,1,2,0,3,0,2,0,15,2,2,0,5,8,5,2000,2000,1,1,12,13,1,350.0,2,4,2,2,3,0,2,655.0,5,0.0,490.0,1145.0,1,0,1,4,1145,1053,0,2198,1.0,0.0,2,1,4,1,2,9,6,1,4,1,2000.0,1,3.0,836.0,4,4,2,192,84,0,0,0,0,3,4,4,0,12,2008,8,4


### Categorical Features with many categories - rows:category ratio 20:1 or less

#### Label Count encoding
- Rank categorical variables by count in the **training** set and transform the test set
- Iterate counter for each CV fold - fit on the **new training set** and transform on the **new test set**
- Useful for both linear and non-linear algorithms

In [38]:
class LabelCountEncoder(object):
    def __init__(self):
        self.count_dict = {}
    
    def fit(self, column):
        # This gives you a dictionary with level as the key and counts as the value
        count = column.value_counts().to_dict()
        # We want to rank the key by its value and use the rank as the new value
        # Your code here
        # self.count_dict = 
    
    def transform(self, column):
        # If a category only appears in the test set, we will assign the value to zero.
        missing = 0
        # Your code here
    
    def fit_transform(self, column):
        self.fit(column)
        return self.transform(column)
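
The class above is left as an exercise. One possible completion is sketched here under a different, hypothetical name (`RankCountEncoder`) so that it does not replace the stub. It assumes that the rarest level gets rank 1 and that unseen levels and missing values map to 0.

```python
import pandas as pd


class RankCountEncoder:
    """One possible completion of the exercise (hypothetical name)."""

    def __init__(self):
        self.count_dict = {}

    def fit(self, column):
        counts = column.value_counts()
        # Rank levels by frequency: the rarest level gets 1,
        # the most common level gets the highest rank.
        ranks = counts.rank(method='first', ascending=True).astype(int)
        self.count_dict = ranks.to_dict()
        return self

    def transform(self, column):
        # Levels never seen during fit (and NaN) are mapped to 0.
        return column.map(self.count_dict).fillna(0).astype(int)

    def fit_transform(self, column):
        return self.fit(column).transform(column)


train_col = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
test_col = pd.Series(['a', 'c', 'z'])

enc = RankCountEncoder()
train_encoded = enc.fit_transform(train_col)
test_encoded = enc.transform(test_col)
print(train_encoded.tolist(), test_encoded.tolist())
```

With these toy columns the training values become `[3, 3, 3, 2, 2, 1]` and the test values `[3, 1, 0]`; the level `'z'` appears only in the test column, so it maps to 0.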

In [39]:
label_count_df = X_train.copy()

for c in label_count_df.columns:
    if label_count_df[c].dtype == 'object':
        lce = LabelCountEncoder()
        label_count_df[c] = lce.fit_transform(label_count_df[c])

In [40]:
label_count_df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,,65.0,8450,,,,,,,,,,,,,7,5,2003,2003,,,,,,196.0,,,,,,,,706,,0,150,856,,,,,856,854,0,1710,1,0,2,1,3,1,,8,,0,,,2003.0,,2,548,,,,0,61,0,0,0,0,,,,0,2,2008,,
1,20,,80.0,9600,,,,,,,,,,,,,6,8,1976,1976,,,,,,0.0,,,,,,,,978,,0,284,1262,,,,,1262,0,0,1262,0,1,2,0,3,1,,6,,1,,,1976.0,,2,460,,,,298,0,0,0,0,0,,,,0,5,2007,,
2,60,,68.0,11250,,,,,,,,,,,,,7,5,2001,2002,,,,,,162.0,,,,,,,,486,,0,434,920,,,,,920,866,0,1786,1,0,2,1,3,1,,6,,1,,,2001.0,,2,608,,,,0,42,0,0,0,0,,,,0,9,2008,,
3,70,,60.0,9550,,,,,,,,,,,,,7,5,1915,1970,,,,,,0.0,,,,,,,,216,,0,540,756,,,,,961,756,0,1717,1,0,1,0,3,1,,7,,1,,,1998.0,,3,642,,,,0,35,272,0,0,0,,,,0,2,2006,,
4,60,,84.0,14260,,,,,,,,,,,,,8,5,2000,2000,,,,,,350.0,,,,,,,,655,,0,490,1145,,,,,1145,1053,0,2198,1,0,2,1,4,1,,9,,1,,,2000.0,,3,836,,,,192,84,0,0,0,0,,,,0,12,2008,,


### Ordinal Features

- Label Count encoding is good in general, however, some of the features are ordinal in nature.
- For example, we usually consider Excellent > Good > Average/Typical > Fair > Poor
- We can construct a dictionary like the following and map it to those columns:
  ```python
  {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa':2, 'Po':1}
  ```
 
- You need a different dictionary for columns with different levels.

In [41]:
ord_cols = ['ExterQual', 'ExterCond','BsmtCond','HeatingQC', 'KitchenQual', 
           'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']
ord_dic = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa':2, 'Po':1}

In [42]:
ord_df = X_train.copy()

for col in ord_cols:
    ord_df[col] = ord_df[col].map(lambda x: ord_dic.get(x, 0))
ord_df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,4,3,PConc,Gd,3,No,GLQ,706,Unf,0,150,856,GasA,5,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,4,8,Typ,0,0,Attchd,2003.0,RFn,2,548,3,3,Y,0,61,0,0,0,0,0,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,3,3,CBlock,Gd,3,Gd,ALQ,978,Unf,0,284,1262,GasA,5,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,3,6,Typ,1,3,Attchd,1976.0,RFn,2,460,3,3,Y,298,0,0,0,0,0,0,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,4,3,PConc,Gd,3,Mn,GLQ,486,Unf,0,434,920,GasA,5,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,4,6,Typ,1,3,Attchd,2001.0,RFn,2,608,3,3,Y,0,42,0,0,0,0,0,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,3,3,BrkTil,TA,4,No,ALQ,216,Unf,0,540,756,GasA,4,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,4,7,Typ,1,4,Detchd,1998.0,Unf,3,642,3,3,Y,0,35,272,0,0,0,0,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,4,3,PConc,Gd,3,Av,GLQ,655,Unf,0,490,1145,GasA,5,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,4,9,Typ,1,3,Attchd,2000.0,RFn,3,836,3,3,Y,192,84,0,0,0,0,0,,,0,12,2008,WD,Normal


#### Interactions
- If interactions are natural for a problem, create them explicitly: ML models only approximate them, which is sub-optimal.
 - Start from interactions that make sense intuitively. 
 - Winners usually find something that most people struggle to see in data. **Not many people look at the data at all!**
 
|  GarageCond |   GarageType   | GarageCond * GarageType  |
| ------------|:--------------:| -----:|
|  Ex  | 2Types | Ex * 2Types |
|  Ex  | CarPort| Ex * CarPort|
|  TA  | Basement| TA * Basement|
|  Fa  | BuiltIn | Fa * BuiltIn |
 
 
- Test your method with all explicitly created possible 2-way interactions if you have enough computing power
- This is especially useful when dealing with **anonymous data** (column name unknown)
- If 2-way interactions help – go even further (3-way, 4-way, ...)
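
A small sketch of how explicit 2-way categorical interactions like the ones in the table can be built with pandas. The rows mirror the table but are made up, and the `_x_` column naming is an arbitrary choice.

```python
from itertools import combinations

import pandas as pd

# Toy rows mirroring the table above (made-up values).
df = pd.DataFrame({
    'GarageCond': ['Ex', 'Ex', 'TA', 'Fa'],
    'GarageType': ['2Types', 'CarPort', 'Basement', 'BuiltIn'],
})

cat_cols = ['GarageCond', 'GarageType']

# Every pair of categorical columns yields one new interaction column
# whose levels are the concatenation of the two original levels.
for a, b in combinations(cat_cols, 2):
    df[f'{a}_x_{b}'] = df[a].astype(str) + '_' + df[b].astype(str)

print(df['GarageCond_x_GarageType'].tolist())
```

Passing 3 instead of 2 to `combinations` and concatenating three columns extends the same idea to 3-way interactions.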

**How to deal with NAs depends on the situation. An NA is itself a piece of information! Usually a separate category is enough.**

### Numerical features
Feature transformations to consider:
- Scaling - min/max, N(0,1), root/power scaling, log scaling, Box-Cox, quantiles.
 - **[Fit on the training set and transform on the test set.](https://stats.stackexchange.com/a/174865)**
- Rounding (too much precision might be noise!)
- Interactions {+,-,*,/}
 - Since area-related features are very important in determining house prices, we can add one more feature: the total area of the basement, first floor and second floor of each house
 - `all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']`
- **Tree methods are almost invariant to scaling**
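
Two of the transformations listed above, sketched on a toy frame: log scaling of a skewed feature and the additive `TotalSF` interaction. The `LotArea` values are made up; the three area columns reuse the first rows shown earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 215245],   # made-up, with one large outlier
    'TotalBsmtSF': [856, 1262, 920, 756],
    '1stFlrSF': [856, 1262, 920, 961],
    '2ndFlrSF': [854, 0, 866, 756],
})

# Log scaling compresses the long right tail of area-like features.
# log1p(x) = log(1 + x), so zeros are handled safely.
df['LotArea_log'] = np.log1p(df['LotArea'])

# Additive interaction: total square footage of basement and both floors.
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

print(df[['LotArea_log', 'TotalSF']])
```

Neither step learns anything from the data, so both can be applied to the combined training and test frame without leakage.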

**Everything you need to know about feature engineering is [here](https://www.slideshare.net/HJvanVeen/feature-engineering-72376750?qid=14629b24-6d05-4275-acc9-ea0743605071&v=&b=&from_search=1).**

## Parameter Tuning

#### Basic approach: apply grid search over the whole parameter space
- Zero effort and no supervision
- Enormous parameter space
- Very time consuming
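
A minimal grid search sketch, assuming scikit-learn's `GridSearchCV`. The data is synthetic and the grid is deliberately tiny; the number of candidates is the product of the list lengths, which is why full grids become so expensive.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=60)

# 2 x 2 = 4 candidates; each one is fitted once per CV fold.
param_grid = {'max_depth': [2, 4], 'min_samples_leaf': [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
search.fit(X, y)

best_params = search.best_params_
n_candidates = len(search.cv_results_['params'])
print(n_candidates, best_params)
```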

#### Expert approach: experience + intuition + resources at hand
1. Pick a set of parameters from a Kaggle kernel, or reuse the "golden" parameters from a previous competition
2. Start with the parameter that doesn't affect the others too much
 - e.g. the learning rate $\eta$ in boosting methods doesn't influence the tuning of the other parameters (in my experience)
 - `max_depth`, `min_samples_split` and `min_samples_leaf` in random forest are highly correlated with each other
3. Iteratively tune the parameters that control overfitting/underfitting
 - If a parameter helps on CV, keep tuning it. Stop once the CV score converges.
 - You can use the public leaderboard as your K+1 fold to confirm the improvement.
4. Go back to step 2 and stop when you are satisfied with the result and won't regret not working harder.


#### [Bayesian optimization method](https://github.com/fmfn/BayesianOptimization/blob/master/examples/visualization.ipynb): a trade-off between the expert and grid search approaches
- Zero effort and no supervision
- Search space narrowed based on previous iterations' results (mimics expert decisions)
- Time consuming (still)
- Easy to integrate with the sklearn cross-validation functions. See this [example](https://github.com/fmfn/BayesianOptimization/blob/master/examples/sklearn_example.py).

#### Golden rule: finding the optimal configuration is rarely a good time investment!

## Ensemble

#### [Ensembling by voting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier)
```
1111111100 = 80% accuracy 
0111011101 = 70% accuracy 
1000101111 = 60% accuracy
```
**Majority Vote**
```
1111111101 = 90% accuracy
```
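
The toy example above can be reproduced with a few lines of NumPy. The sketch assumes, as the accuracies imply, that the true label is 1 at every position.

```python
import numpy as np

# Assumed ground truth: all ten labels are 1.
truth = np.ones(10, dtype=int)

# The three model predictions from the example above.
preds = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 1, 1],
])

# A position is voted 1 when more than half of the models predict 1.
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

individual_acc = (preds == truth).mean(axis=1)
ensemble_acc = (majority == truth).mean()
print(individual_acc, ensemble_acc)
```

The vote beats every single model here because the three models make their mistakes at mostly different positions.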

In [43]:
from sklearn.ensemble import VotingClassifier

#### Ensembling by averaging
- Let’s say we have N predictions from N different models: $y_1, y_2, ... , y_N$
- We want to make a single prediction using weighted average: $\beta_1*y_1+\beta_2*y_2+...+\beta_N*y_N$
- How do we find the best beta coefficients?
- A very common mistake is to select the weights based on leaderboard feedback
 - **inefficient & prone to leaderboard overfitting**
- Solve the problem using CV predictions with optimization algorithms 
 - $optim(\beta_1*y_1+\beta_2*y_2+...+\beta_N*y_N)$ with starting weights $\beta_i=1/N$
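
A sketch of finding the weights from CV predictions, assuming SciPy's `minimize` as the optimizer and RMSE as the metric. The out-of-fold predictions are simulated, and the constraints (non-negative weights that sum to 1) are an extra assumption, not something the text requires.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
y_true = rng.normal(size=200)

# Simulated out-of-fold predictions from three hypothetical models
# with different noise levels (one column per model).
oof_preds = np.column_stack([
    y_true + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 0.8)
])


def rmse(weights):
    # RMSE of the weighted average of the model predictions.
    return np.sqrt(np.mean((oof_preds @ weights - y_true) ** 2))


n_models = oof_preds.shape[1]
start = np.full(n_models, 1.0 / n_models)  # starting weights beta_i = 1/N

result = minimize(
    rmse,
    start,
    method='SLSQP',
    bounds=[(0.0, 1.0)] * n_models,
    constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1.0},
)
best_weights = result.x
print(best_weights, rmse(start), rmse(best_weights))
```

Because the weights are fitted on out-of-fold predictions rather than on leaderboard feedback, the public leaderboard stays available as an independent check.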

## Stacked Generalization

The procedure for 5-fold stacking may be described as follows:

1. Split the total training set into two disjoint sets (here train and holdout)

2. Train several base models on the first part (train)

3. Predict these base models on the second part (holdout)

4. Repeat steps 1-3 five times, each time with a different holdout fold, and use the holdout predictions as the inputs and the correct responses (target variable) as the outputs to train a higher-level learner called the meta-model.


- For the test set, we can either average the predictions that the fold-level base models make on the test data, or refit each base model on the whole training set and then predict. Generally speaking, either way is fine because the test set is never used in training.
- If we run 10 models using the same procedure, our meta-model will have 10 input features.
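
The procedure above can be sketched with scikit-learn's `KFold`. This is not the `stacking_regression` helper used later in this notebook; the data is synthetic, and the two base models and the names `oof` and `test_meta` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)
X_new = rng.rand(20, 4)  # plays the role of the test set

base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=3, random_state=0)]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

oof = np.zeros((X.shape[0], len(base_models)))            # meta-model training inputs
test_meta = np.zeros((X_new.shape[0], len(base_models)))  # meta-model test inputs

for j, model in enumerate(base_models):
    for train_idx, holdout_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        # Each training row is predicted by a model that never saw it.
        oof[holdout_idx, j] = model.predict(X[holdout_idx])
        # Average the five fold models' predictions on the test rows.
        test_meta[:, j] += model.predict(X_new) / kf.get_n_splits()

# The meta-model learns how to combine the base models' predictions.
meta = LinearRegression().fit(oof, y)
stacked_pred = meta.predict(test_meta)
print(oof.shape, stacked_pred.shape)
```

With two base models the meta-model has two input features, one column of out-of-fold predictions per model.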

![img](https://s3.amazonaws.com/nycdsabt01/stacking.jpg)

Borrowed from [Faron](https://www.kaggle.com/getting-started/18153#post103381)

As a quick note, one should try a few diverse models. In my experience, a good stacking solution is often composed of at least:
- 2 or 3 GBMs/XGBs/LightGBMs (one with low depth, one with medium and one with high)
- 1 or 2 Random Forests (again as diverse as possible–one low depth, one high)
- 1 linear model**!**

In [44]:
# Useful if you are debugging the function inside another .py script
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [45]:
from sklearn.linear_model import ElasticNet, LinearRegression as lr
from sklearn.ensemble import GradientBoostingRegressor as gbr, RandomForestRegressor as rfr
from preprocess import impute

In [46]:
all_data = impute(all_data)

In [47]:
all_data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,TotalSF
0,60,RL,65.0,8450,Pave,,Reg,Lvl,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706.0,Unf,0.0,150.0,856.0,GasA,Ex,Y,SBrkr,856,854,0,1710,1.0,0.0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2.0,548.0,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,2566.0
1,20,RL,80.0,9600,Pave,,Reg,Lvl,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978.0,Unf,0.0,284.0,1262.0,GasA,Ex,Y,SBrkr,1262,0,0,1262,0.0,1.0,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2.0,460.0,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,2524.0
2,60,RL,68.0,11250,Pave,,IR1,Lvl,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486.0,Unf,0.0,434.0,920.0,GasA,Ex,Y,SBrkr,920,866,0,1786,1.0,0.0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2.0,608.0,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,2706.0
3,70,RL,60.0,9550,Pave,,IR1,Lvl,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216.0,Unf,0.0,540.0,756.0,GasA,Gd,Y,SBrkr,961,756,0,1717,1.0,0.0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3.0,642.0,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,2473.0
4,60,RL,84.0,14260,Pave,,IR1,Lvl,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655.0,Unf,0.0,490.0,1145.0,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1.0,0.0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3.0,836.0,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,3343.0


In [59]:
train_index = len(X_train)
for col in all_data.columns:
    if col in ord_cols:
        all_data[col] = all_data[col].map(lambda x: ord_dic.get(x, 0))
    elif all_data[col].dtype == "object":
        lce = LabelCountEncoder()
        # Fit the encoder on the training rows only. `.loc` slices include the
        # end label, so the training rows are 0 .. train_index - 1.
        all_data.loc[:train_index - 1, col] = lce.fit_transform(all_data.loc[:train_index - 1, col])
        all_data.loc[train_index:, col] = lce.transform(all_data.loc[train_index:, col])

In [60]:
X_train = all_data.iloc[:train_index, :]
X_test = all_data.iloc[train_index:, :]

In [50]:
X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,TotalSF
0,,,65.0,8450,,,,,,,,,,,,7,,2003,2003,,,,,,196.0,4,3,,,3,,,706.0,,0.0,150.0,856.0,,5,,,856,854,0,1710,1.0,0.0,2,1,3,1,4,8,,0,0,,2003.0,,2.0,548.0,3,3,,0,61,0,0,0,0,0,,,0,,,,,2566.0
1,,,80.0,9600,,,,,,,,,,,,6,,1976,1976,,,,,,0.0,3,3,,,3,,,978.0,,0.0,284.0,1262.0,,5,,,1262,0,0,1262,0.0,1.0,2,0,3,1,3,6,,1,3,,1976.0,,2.0,460.0,3,3,,298,0,0,0,0,0,0,,,0,,,,,2524.0
2,,,68.0,11250,,,,,,,,,,,,7,,2001,2002,,,,,,162.0,4,3,,,3,,,486.0,,0.0,434.0,920.0,,5,,,920,866,0,1786,1.0,0.0,2,1,3,1,4,6,,1,3,,2001.0,,2.0,608.0,3,3,,0,42,0,0,0,0,0,,,0,,,,,2706.0
3,,,60.0,9550,,,,,,,,,,,,7,,1915,1970,,,,,,0.0,3,3,,,4,,,216.0,,0.0,540.0,756.0,,4,,,961,756,0,1717,1.0,0.0,1,0,3,1,4,7,,1,4,,1998.0,,3.0,642.0,3,3,,0,35,272,0,0,0,0,,,0,,,,,2473.0
4,,,84.0,14260,,,,,,,,,,,,8,,2000,2000,,,,,,350.0,4,3,,,3,,,655.0,,0.0,490.0,1145.0,,5,,,1145,1053,0,2198,1.0,0.0,2,1,4,1,4,9,,1,3,,2000.0,,3.0,836.0,3,3,,192,84,0,0,0,0,0,,,0,,,,,3343.0


In [51]:
from stacking import stacking_regression
from sklearn.metrics import mean_squared_error

In [52]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(np.log(y), np.log(y_pred)))

In [53]:
models = [
    # linear model, ElasticNet = lasso + ridge
    ElasticNet(random_state=0),
    
    # conservative random forest model
    rfr(random_state=0,
        n_estimators=1000, max_depth=6,  max_features='sqrt'),
    
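    # aggressive random forest model
    # max_features=1.0 uses all features, like the former 'auto' setting for
    # regressors, which recent scikit-learn releases no longer accept.
    rfr(random_state=0, 
        n_estimators=1000, max_depth=9,  max_features=1.0),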
    
    # conservative gbm model
    gbr(random_state=0, learning_rate = 0.005, max_features='sqrt',
        min_samples_leaf=15, min_samples_split=10, 
        n_estimators=3000, max_depth=3),
    
    # aggressive gbm model
    gbr(random_state = 0, learning_rate = 0.01, max_features='sqrt',
        min_samples_leaf=10, min_samples_split=5, 
        n_estimators = 1000, max_depth = 9)
    ]

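# `normalize=True` was removed in recent scikit-learn releases; for ordinary
# least squares the predictions do not depend on feature scaling.
meta_model = lr()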

In [61]:
X_train.shape

(1460, 79)

In [62]:
y_train.shape

(1460,)

In [63]:
X_test.shape

(1459, 79)

In [64]:
test_ID.shape

(1459,)

In [58]:
%%time
final_prediction = stacking_regression(models, meta_model, X_train.values, y_train.values, X_test.values,
                               transform_target=np.log1p, transform_pred = np.expm1, 
                               metric=rmsle, verbose=1)

metric: [rmsle]

model 0: [ElasticNet]


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

**Having more models than necessary in an ensemble may hurt.**


- Let's say we have a library of trained models. A greedy forward approach usually works well:
 - Start with an ensemble of a few well-performing models
 - Loop through every other model in the library and try adding it to the current ensemble
 - Keep the best-performing ensemble configuration
 - Repeat until the metric converges
- If you are using linear regression as the meta-model, make sure you have **diverse/uncorrelated** first-layer models

- During each loop iteration it is wise to consider only a subset of the library models, which acts as a regularization for model selection.

- Repeating the procedure a few times and bagging the results reduces the risk of overfitting caused by model selection.

- R users can use the `caretStack` function from the [caretEnsemble](https://github.com/zachmayer/caretEnsemble) package directly. A nice tutorial is [here](https://machinelearningmastery.com/machine-learning-ensembles-with-r/).

- The `stackedEnsemble()` function from the [H2O package](https://h2o-release.s3.amazonaws.com/h2o/rel-ueno/2/docs-website/h2o-docs/data-science/stacked-ensembles.html) is also a good choice. The downside is that it only takes H2O models as input.
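
For Python users, the greedy forward selection described above can be sketched in plain NumPy. The model library is simulated (hypothetical out-of-fold predictions), the ensemble is a simple average, and the loop starts from the single best model rather than from a small hand-picked ensemble; all of these are simplifying assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.normal(size=300)

# Hypothetical library of out-of-fold predictions, one entry per model.
library = {
    f'model_{i}': y_true + rng.normal(scale=s, size=300)
    for i, s in enumerate((0.4, 0.5, 0.6, 1.5))
}


def rmse(pred):
    return np.sqrt(np.mean((pred - y_true) ** 2))


def ensemble_score(names):
    # Score of the plain average of the selected models' predictions.
    return rmse(np.mean([library[n] for n in names], axis=0))


# Start from the single best model.
selected = [min(library, key=lambda n: rmse(library[n]))]
best_score = ensemble_score(selected)
history = [best_score]

while True:
    candidates = [n for n in library if n not in selected]
    if not candidates:
        break
    # Try adding each remaining model and keep the most helpful one.
    scores = {n: ensemble_score(selected + [n]) for n in candidates}
    best_name = min(scores, key=scores.get)
    if scores[best_name] >= best_score:
        break  # no candidate improves the metric any more: stop
    selected.append(best_name)
    best_score = scores[best_name]
    history.append(best_score)

print(selected, history)
```

Sampling a random subset of `candidates` in each iteration and repeating the whole loop several times gives the regularized, bagged variant mentioned above.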

### Success formula (personal opinion)

50% - feature engineering

30% - model diversity

10% - luck

10% - proper ensembling
 - Voting
 - Averaging
 - Stacking