## Kaggle Competition - House Prices: Advanced Regression Techniques
# Feature Encoding

The main purpose of this section is to review each column in the dataset individually and see whether it can be encoded more effectively for linear regression.

The code for this module is in: [src/models/FeatureEncoding.py](../src/models/FeatureEncoding.py)

In [1]:
import sys
import os
sys.path.append( os.path.abspath( os.path.join(os.getcwd(), ".." ))) 
from src.utils import reset_root_dir
reset_root_dir()

import pandas as pd
from pandas import DataFrame, Series
import cytoolz 
from cytoolz import groupby
import numpy as np
import itertools
from datetime import datetime
import math
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
import pydash
import simplejson
from sortedcontainers import SortedDict

from src.utils.Charts import Charts
from src.models import LinearRegressionModel, FeatureEncoding

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)

pp = pprint.PrettyPrinter(depth=6)

# Column Categorization and Cleanup

The baseline linear regression model removes non-numeric fields and fills in NaNs with 0.
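A minimal sketch of that baseline preprocessing, assuming plain pandas calls; the toy frame and variable names are illustrative and not taken from LinearRegressionModel:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (values are illustrative only)
raw = pd.DataFrame({
    "LotArea":     [11622, 14267, 13830],
    "LotFrontage": [80.0, np.nan, 74.0],
    "MSZoning":    ["RH", "RL", "RL"],
})

# Keep numeric columns only, then replace missing values with 0
baseline = raw.select_dtypes(include="number").fillna(0)
print(baseline)
```

The string column `MSZoning` is dropped entirely, which is the information the encodings below try to recover.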

Looking through the individual columns, there are several basic types of feature mapping:

Uncorrelated variables that can be removed
- Id: uncorrelated, remove

Linear numeric fields that are left unmodified
- LotFrontage: linear numeric
- LotArea: linear numeric
- OverallQual: linear numeric
- OverallCond: linear numeric
- MasVnrArea: linear numeric
- BsmtFinSF2: linear numeric
- BsmtUnfSF: linear numeric
- TotalBsmtSF: linear numeric
- 1stFlrSF: linear numeric
- 2ndFlrSF: linear numeric
- LowQualFinSF: linear numeric
- GrLivArea: linear numeric
- BsmtFullBath: linear numeric
- BsmtHalfBath: linear numeric
- FullBath: linear numeric
- HalfBath: linear numeric
- BedroomAbvGr: linear numeric
- KitchenAbvGr: linear numeric
- TotRmsAbvGrd: linear numeric
- WoodDeckSF: linear numeric
- OpenPorchSF: linear numeric
- EnclosedPorch: linear numeric
- 3SsnPorch: linear numeric
- ScreenPorch: linear numeric
- PoolArea: linear numeric
- MiscVal: linear numeric
- Fireplaces: linear numeric
- GarageCars: linear numeric
- GarageArea: linear numeric


Absolute years could be better converted to ages relative to the current year
- YearBuilt: numeric - convert to age
- YearRemodAdd: numeric - convert to age
- GarageYrBlt: numeric - convert to age
- YrSold: numeric - convert to age
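As a rough sketch of the year-to-age conversion: the helper name `years_to_ages` is hypothetical, and the reference year of 2010 is an assumption chosen to match the `YrSold_Age` of 0 that the encoded output shows for 2010 sales. How FeatureEncoding.py handles missing years is not covered here.

```python
import pandas as pd

def years_to_ages(df, year_columns, reference_year):
    """Return a copy of df where each year column is replaced by <column>_Age."""
    out = df.copy()
    for column in year_columns:
        # Age is the distance from the reference year
        out[column + "_Age"] = reference_year - out[column]
        out = out.drop(columns=column)
    return out

houses = pd.DataFrame({"YearBuilt": [1961, 1997], "YrSold": [2010, 2006]})
aged = years_to_ages(houses, ["YearBuilt", "YrSold"], reference_year=2010)
print(aged)
```

Because age is a linear function of the year, an unregularized linear model should fit the same predictions either way, as long as missing values are handled identically.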

Month sold is treated as categorical, assuming that house prices are seasonal
- MoSold: categorical

Categorical fields may need OneHotEncoding - https://scikit-learn.org/stable/modules/preprocessing.html
- MSSubClass: categorical
- MSZoning: categorical
- Street: categorical
- Alley: categorical
- LotShape: categorical
- LandContour: categorical
- Utilities: categorical
- LotConfig: categorical
- Neighborhood: categorical
- BldgType: categorical
- HouseStyle: categorical
- RoofStyle: categorical
- RoofMatl: categorical
- MasVnrType: categorical
- Foundation: categorical
- BsmtExposure: categorical
- Heating: categorical
- Functional: categorical (could alternatively be mapped to numeric)
- GarageFinish: categorical (could alternatively be mapped to numeric)
- Fence: categorical (could alternatively be mapped to numeric)
- SaleType: categorical
- SaleCondition: categorical
- MiscFeature: categorical
- Electrical: categorical
- GarageType: categorical
- PavedDrive: categorical (Yes / Partial / No)
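One way to one-hot encode such a column is `pandas.get_dummies`. This is a sketch on a toy frame, not necessarily how FeatureEncoding.py does it:

```python
import pandas as pd

houses = pd.DataFrame({
    "Street":  ["Pave", "Grvl", "Pave"],
    "LotArea": [11622, 14267, 13830],
})

# Each distinct value of Street becomes its own 0/1 column named Street_<value>
encoded = pd.get_dummies(houses, columns=["Street"], dtype=int)
print(encoded)
```

The training and test sets need the same dummy columns, so a category that appears in only one of them has to be added to the other as an all-zero column.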

Categorical fields with optional multiple values
- Condition1 + Condition2: multiple categorical
- Exterior1st + Exterior2nd: multiple categorical
- BsmtFinType1 + BsmtFinType2: multiple categorical
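Paired columns such as Condition1 and Condition2 draw from one category set, so they can share a single group of indicator columns. The sketch below is one possible approach; the helper name `onehot_multi` is hypothetical.

```python
import pandas as pd

def onehot_multi(df, columns, prefix):
    """One-hot encode several columns that draw from the same category set."""
    # Stack the columns into one long Series that keeps the original row index
    stacked = df[columns].stack()
    dummies = pd.get_dummies(stacked, prefix=prefix, dtype=int)
    # Collapse back to one row per house: 1 if the value appears in any column
    combined = dummies.groupby(level=0).max()
    # Rows with no values at all come back as all zeros
    return combined.reindex(df.index, fill_value=0)

houses = pd.DataFrame({
    "Condition1": ["Feedr", "Norm", "Artery"],
    "Condition2": ["Norm",  "Norm", "Feedr"],
})
conditions = onehot_multi(houses, ["Condition1", "Condition2"], prefix="Condition")
print(conditions)
```

This yields columns of the form `Condition_Artery`, `Condition_Feedr`, `Condition_Norm`, and a house with two different conditions gets a 1 in both.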

LabelEncoder Quality Categories: Ex:Excellent, Gd:Good, TA:Average/Typical, Fa:Fair, Po:Poor, NA
- ExterQual: categorical to numeric
- ExterCond: categorical to numeric
- BsmtQual: categorical to numeric
- BsmtCond: categorical to numeric
- HeatingQC: categorical to numeric
- KitchenQual: categorical to numeric
- FireplaceQu: categorical to numeric
- GarageQual: categorical to numeric
- GarageCond: categorical to numeric
- PoolQC: categorical to numeric
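scikit-learn's `LabelEncoder` assigns integer codes in sorted label order, so its codes do not follow the quality ranking. An explicit mapping is one way to keep the ordering. The scale below is a hypothetical choice for illustration, not the mapping used in FeatureEncoding.py:

```python
import pandas as pd

# Hypothetical ordinal scale: higher number means better quality, 0 means missing
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_quality(series, scale=QUALITY_SCALE, missing=0):
    """Map quality labels to integers; unknown or missing labels become `missing`."""
    return series.map(scale).fillna(missing).astype(int)

kitchen = pd.Series(["Gd", "TA", None, "Ex"], name="KitchenQual")
kitchen_numeric = encode_quality(kitchen)
print(kitchen_numeric.tolist())
```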

Boolean fields are Y or N, which can be converted to numeric
- CentralAir: boolean
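A short sketch of the Y/N conversion, assuming any value other than "Y" should count as 0:

```python
import pandas as pd

central_air = pd.Series(["Y", "N", "Y"], name="CentralAir")

# Comparison gives True/False, which casts to 1/0
central_air_numeric = (central_air == "Y").astype(int)
print(central_air_numeric.tolist())
```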




In [2]:
LinearRegressionModel().data['X_test']

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,1461,20,80.0,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,882.0,896,0,0,896,0.0,0.0,1,0,2,1,5,0,1961.0,1.0,730.0,140,0,0,0,120,0,0,6,2010
1,1462,20,81.0,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,1329.0,1329,0,0,1329,0.0,0.0,1,1,3,1,6,0,1958.0,1.0,312.0,393,36,0,0,0,0,12500,6,2010
2,1463,60,74.0,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,928.0,928,701,0,1629,0.0,0.0,2,1,3,1,6,1,1997.0,2.0,482.0,212,34,0,0,0,0,0,3,2010
3,1464,60,78.0,9978,6,6,1998,1998,20.0,602.0,0.0,324.0,926.0,926,678,0,1604,0.0,0.0,2,1,3,1,7,1,1998.0,2.0,470.0,360,36,0,0,0,0,0,6,2010
4,1465,120,43.0,5005,8,5,1992,1992,0.0,263.0,0.0,1017.0,1280.0,1280,0,0,1280,0.0,0.0,2,0,2,1,5,0,1992.0,2.0,506.0,0,82,0,0,144,0,0,1,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,21.0,1936,4,7,1970,1970,0.0,0.0,0.0,546.0,546.0,546,546,0,1092,0.0,0.0,1,1,3,1,5,0,0.0,0.0,0.0,0,0,0,0,0,0,0,6,2006
1455,2916,160,21.0,1894,4,5,1970,1970,0.0,252.0,0.0,294.0,546.0,546,546,0,1092,0.0,0.0,1,1,3,1,6,0,1970.0,1.0,286.0,0,24,0,0,0,0,0,4,2006
1456,2917,20,160.0,20000,5,7,1960,1996,0.0,1224.0,0.0,0.0,1224.0,1224,0,0,1224,1.0,0.0,1,0,4,1,7,1,1960.0,2.0,576.0,474,0,0,0,0,0,0,9,2006
1457,2918,85,62.0,10441,5,5,1992,1992,0.0,337.0,0.0,575.0,912.0,970,0,0,970,0.0,1.0,1,0,3,1,6,0,0.0,0.0,0.0,80,32,0,0,0,0,700,7,2006


After feature encoding, the number of features has grown from 37 to 254.

In [3]:
FeatureEncoding().data['X_test']

Unnamed: 0,Id,LotFrontage,LotArea,OverallQual,OverallCond,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,YearBuilt_Age,YearRemodAdd_Age,GarageYrBlt_Age,YrSold_Age,CentralAir_Numeric,ExterQual_Numeric,ExterCond_Numeric,BsmtQual_Numeric,BsmtCond_Numeric,HeatingQC_Numeric,KitchenQual_Numeric,FireplaceQu_Numeric,GarageQual_Numeric,GarageCond_Numeric,PoolQC_Numeric,MoSold_1,MoSold_2,MoSold_3,MoSold_4,MoSold_5,MoSold_6,MoSold_7,MoSold_8,MoSold_9,MoSold_10,MoSold_11,MoSold_12,LandSlope_Gtl,LandSlope_Mod,LandSlope_Sev,MSSubClass_20,MSSubClass_30,MSSubClass_40,MSSubClass_45,MSSubClass_50,MSSubClass_60,MSSubClass_70,MSSubClass_75,MSSubClass_80,MSSubClass_85,MSSubClass_90,MSSubClass_120,MSSubClass_150,MSSubClass_160,MSSubClass_180,MSSubClass_190,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Grvl,Street_Pave,Alley_Grvl,Alley_Pave,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,Utilities_AllPub,Utilities_NoSeWa,LotConfig_Corner,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_ClyTile,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,BsmtExposure_Av,BsmtExposure_Gd,BsmtExposure_Mn,BsmtExposure_No,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,Functional_Maj1,Functional_Maj2,Functional_Min1,Functional_Min2,Functional_Mod,Functional_Sev,Functional_Typ,GarageFinish_Fin,GarageFinish_RFn,GarageFinish_Unf,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Condition_Artery,Condition_Feedr,Condition_Norm,Condition_PosA,Condition_PosN,Condition_RRAe,Condition_RRAn,Condition_RRNe,Condition_RRNn,Exterior_AsbShng,Exterior_AsphShn,Exterior_Brk Cmn,Exterior_BrkComm,Exterior_BrkFace,Exterior_CBlock,Exterior_CemntBd,Exterior_CmentBd,Exterior_HdBoard,Exterior_ImStucc,Exterior_MetalSd,Exterior_Other,Exterior_Plywood,Exterior_Stone,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_Wd Shng,Exterior_WdShing,BsmtFinType_ALQ,BsmtFinType_BLQ,BsmtFinType_GLQ,BsmtFinType_LwQ,BsmtFinType_Rec,BsmtFinType_Unf
0,1461,80.0,11622,5,6,0.0,468.0,144.0,270.0,882.0,896,0,0,896,0.0,0.0,1,0,2,1,5,0,1.0,730.0,140,0,0,0,120,0,0,49,49,49.000000,0,1,5,5,5,5,5,5,3,5,5,3,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,1462,81.0,14267,6,6,108.0,923.0,0.0,406.0,1329.0,1329,0,0,1329,0.0,0.0,1,1,3,1,6,0,1.0,312.0,393,36,0,0,0,0,12500,52,52,52.000000,0,1,5,5,5,5,5,2,3,5,5,3,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,1463,74.0,13830,5,5,0.0,791.0,0.0,137.0,928.0,928,701,0,1629,0.0,0.0,2,1,3,1,6,1,2.0,482.0,212,34,0,0,0,0,0,13,12,13.000000,0,1,5,5,2,5,2,5,5,5,5,3,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
3,1464,78.0,9978,6,6,20.0,602.0,0.0,324.0,926.0,926,678,0,1604,0.0,0.0,2,1,3,1,7,1,2.0,470.0,360,36,0,0,0,0,0,12,12,12.000000,0,1,5,5,5,5,0,2,2,5,5,3,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,1465,43.0,5005,8,5,0.0,263.0,0.0,1017.0,1280.0,1280,0,0,1280,0.0,0.0,2,0,2,1,5,0,2.0,506.0,0,82,0,0,144,0,0,18,18,18.000000,0,1,2,5,2,5,0,2,3,5,5,3,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,21.0,1936,4,7,0.0,0.0,0.0,546.0,546.0,546,546,0,1092,0.0,0.0,1,1,3,1,5,0,0.0,0.0,0,0,0,0,0,0,0,40,40,32.278783,4,1,5,5,5,5,2,5,3,3,3,3,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1455,2916,21.0,1894,4,5,0.0,252.0,0.0,294.0,546.0,546,546,0,1092,0.0,0.0,1,1,3,1,6,0,1.0,286.0,0,24,0,0,0,0,0,40,40,40.000000,4,1,5,5,5,5,5,5,3,5,5,3,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1456,2917,160.0,20000,5,7,0.0,1224.0,0.0,0.0,1224.0,1224,0,0,1224,1.0,0.0,1,0,4,1,7,1,2.0,576.0,474,0,0,0,0,0,0,50,14,50.000000,4,1,5,5,5,5,0,5,5,5,5,3,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1457,2918,62.0,10441,5,5,0.0,337.0,0.0,575.0,912.0,970,0,0,970,0.0,1.0,1,0,3,1,6,0,0.0,0.0,80,32,0,0,0,0,700,18,18,32.278783,4,1,5,5,2,5,5,5,3,3,3,3,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Scoring

We can test each of the feature encodings individually, as well as in combination, and compare them against a baseline.

In [4]:
empty_config = { "X_feature_exclude": [], "X_feature_year_ages": [], "X_feature_label_encode": {}, "X_feature_onehot": [] }

config_keys = pydash.chain([
    itertools.combinations( empty_config.keys(), 1 ),
    itertools.combinations( empty_config.keys(), 2 ), 
    itertools.combinations( empty_config.keys(), 3 ),
]).map(list).flatten().uniq().value()

results = []
for keys in config_keys:
    # Start from the all-empty config and drop the keys under test
    test_config = dict(empty_config)
    test_config['comment'] = ", ".join(keys)
    for key in keys:
        del test_config[key]
    results.append( FeatureEncoding( **test_config ).execute() )
    
results += [
    LinearRegressionModel( **{ "comment": "baseline"     } ).execute(),
    FeatureEncoding(       **{ "comment": "all features" } ).execute(),
]

# Adding index/(10**6) avoids duplicate keys when two runs tie (it shifts displayed scores by up to len(results)/10**6)
sorted_results = SortedDict({ 
    round(result['scores']['RMSLE'] + index/(10**6), 6): result['class']+':  '+result['comment']
    for index, result in enumerate(results) 
})
print( simplejson.dumps( sorted_results, indent=4*' ' ) )

{
    "0.182139": "FeatureEncoding:  X_feature_label_encode, X_feature_onehot",
    "0.182142": "FeatureEncoding:  X_feature_exclude, X_feature_label_encode, X_feature_onehot",
    "0.18234": "FeatureEncoding:  X_feature_year_ages, X_feature_label_encode, X_feature_onehot",
    "0.182342": "FeatureEncoding:  all features",
    "0.182921": "FeatureEncoding:  X_feature_year_ages, X_feature_onehot",
    "0.182924": "FeatureEncoding:  X_feature_exclude, X_feature_year_ages, X_feature_onehot",
    "0.183078": "FeatureEncoding:  X_feature_onehot",
    "0.183081": "FeatureEncoding:  X_feature_exclude, X_feature_onehot",
    "0.189631": "FeatureEncoding:  X_feature_year_ages, X_feature_label_encode",
    "0.189634": "FeatureEncoding:  X_feature_exclude, X_feature_year_ages, X_feature_label_encode",
    "0.192749": "FeatureEncoding:  X_feature_label_encode",
    "0.19322": "FeatureEncoding:  X_feature_exclude, X_feature_label_encode",
    "0.194176": "LinearRegressionModel:  baseline",
    "0.1

Findings (RMSLE, lower is better):
- (0.1941 -> 0.2089) X_feature_year_ages    - encoding years as relative ages makes the score markedly worse | maybe polynomial correlation = leave absolute year as feature
- (0.1941 -> 0.1997) X_feature_exclude      - strange??? | minor correlation between Id and price = still remove Id on theoretical grounds
- (0.1941 -> 0.1927) X_feature_label_encode - this improves the score slightly compared to baseline
- (0.1941 -> 0.1830) X_feature_onehot       - this gives the single biggest score improvement
- (0.1941 -> 0.1823) all features           - all combined scores within 0.0003 of the best run (label encode + onehot, 0.1821); year encoding has a negligible effect once one-hot encoding is included
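For reference, the scores above are RMSLE values. How `src` computes them is not shown here, so the following is only a sketch of the usual definition (root mean squared error on log-transformed values, using `log1p`):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two arrays of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_error = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_error ** 2)))

print(rmsle([100000.0, 200000.0], [110000.0, 190000.0]))
```

Note that `log1p` is undefined for predictions at or below -1, so negative price predictions from a linear model would need clipping before scoring.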

## Submit to Kaggle
- https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions

```
$ kaggle competitions submit -c house-prices-advanced-regression-techniques -f data/submissions/FeatureEncoding.csv -m "FeatureEncoding.py - label + onehot encoding"
```
    
- Your submission scored 0.80406, which is not an improvement of your best score (0.20892). Keep trying!
- Kaggle Rank 3778 / 4375


Whilst FeatureEncoding is theoretically valid and improves the local score, it results in a major downgrade of the Kaggle score (0.80406 versus the previous best of 0.20892).

Ideas
- Possibly too many features (254) leading to overfitting on training data
- Features could be filtered based on correlation coefficients
- Investigate - Regularization: Ridge, Lasso and Elastic Net 
- Investigate - Adding polynomial features
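Two of these ideas can be sketched on synthetic data: a correlation filter and ridge regularization. The helper names `select_by_correlation` and `ridge_fit`, the threshold and the alpha values are all assumptions for illustration. The ridge solver is the textbook closed form written in NumPy, not a library implementation.

```python
import numpy as np
import pandas as pd

def select_by_correlation(X, y, threshold):
    """Keep columns whose absolute Pearson correlation with y reaches threshold."""
    correlations = X.corrwith(y).abs()
    return X.loc[:, correlations >= threshold]

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X_centered = X - x_mean
    # Solve (X'X + alpha * I) w = X'y on centered data
    gram = X_centered.T @ X_centered + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, X_centered.T @ (y - y_mean))
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data: one informative column and one pure-noise column
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
X = pd.DataFrame({"signal": signal, "noise": rng.normal(size=200)})
y = pd.Series(3.0 * signal + rng.normal(scale=0.1, size=200))

selected = select_by_correlation(X, y, threshold=0.5)
print(list(selected.columns))

# A larger alpha shrinks the coefficient vector towards zero
weak, _ = ridge_fit(X, y, alpha=0.01)
strong, _ = ridge_fit(X, y, alpha=1000.0)
print(np.linalg.norm(weak), np.linalg.norm(strong))
```

With 254 mostly sparse indicator columns, either approach reduces the freedom the model has to fit noise in the training set. Whether that closes the gap to the Kaggle score would need to be tested.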