### Iowa Housing Lab -- Data Encoding -- KFold + Pipelines

Welcome!! This lab will continue where we left off last class -- building a regression model, but this time with new features added in -- using cross validation to evaluate our scores, and building our encoding steps into pipelines.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in your data set**

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
import category_encoders as ce
from sklearn.pipeline import make_pipeline
df = pd.read_csv('../../data/iowa_train2.csv')

In [2]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,2ndFlrSF,GrLivArea.1,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars,SalePrice
0,1,60,RL,8450,CollgCr,7,5,2003,1710,856,854,1710,2,1,Attchd,2003.0,RFn,2,208500
1,2,20,RL,9600,Veenker,6,8,1976,1262,1262,0,1262,2,0,Attchd,1976.0,RFn,2,181500
2,3,60,RL,11250,CollgCr,7,5,2001,1786,920,866,1786,2,1,Attchd,2001.0,RFn,2,223500
3,4,70,RL,9550,Crawfor,7,5,1915,1717,961,756,1717,1,0,Detchd,1998.0,Unf,3,140000
4,5,60,RL,14260,NoRidge,8,5,2000,2198,1145,1053,2198,2,1,Attchd,2000.0,RFn,3,250000


In [8]:
X.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,2ndFlrSF,GrLivArea.1,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars
0,1,60,RL,8450,CollgCr,7,5,2003,1710,856,854,1710,2,1,Attchd,2003.0,RFn,2
1,2,20,RL,9600,Veenker,6,8,1976,1262,1262,0,1262,2,0,Attchd,1976.0,RFn,2
2,3,60,RL,11250,CollgCr,7,5,2001,1786,920,866,1786,2,1,Attchd,2001.0,RFn,2
3,4,70,RL,9550,Crawfor,7,5,1915,1717,961,756,1717,1,0,Detchd,1998.0,Unf,3
4,5,60,RL,14260,NoRidge,8,5,2000,2198,1145,1053,2198,2,1,Attchd,2000.0,RFn,3


In [30]:
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

**Step 2).  There are missing values throughout this dataset.  Fill them in appropriately**

We already covered this in class, but to give you a reminder:

 - Are the missing values random or not?
 - Encode them as missing if possible

In [9]:
def denote_null_values(df):
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    return df
df = denote_null_values(df)

In [10]:
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,...,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars,SalePrice,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing
0,1,60,RL,8450,CollgCr,7,5,2003,1710,856,...,2,1,Attchd,2003.0,RFn,2,208500,False,False,False
1,2,20,RL,9600,Veenker,6,8,1976,1262,1262,...,2,0,Attchd,1976.0,RFn,2,181500,False,False,False
2,3,60,RL,11250,CollgCr,7,5,2001,1786,920,...,2,1,Attchd,2001.0,RFn,2,223500,False,False,False
3,4,70,RL,9550,Crawfor,7,5,1915,1717,961,...,1,0,Detchd,1998.0,Unf,3,140000,False,False,False
4,5,60,RL,14260,NoRidge,8,5,2000,2198,1145,...,2,1,Attchd,2000.0,RFn,3,250000,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,7917,Gilbert,6,5,1999,1647,953,...,2,1,Attchd,1999.0,RFn,2,175000,False,False,False
1456,1457,20,RL,13175,NWAmes,6,6,1978,2073,2073,...,2,0,Attchd,1978.0,Unf,2,210000,False,False,False
1457,1458,70,RL,9042,Crawfor,7,9,1941,2340,1188,...,2,0,Attchd,1941.0,RFn,1,266500,False,False,False
1458,1459,20,RL,9717,NAmes,5,6,1950,1078,1078,...,1,0,Attchd,1950.0,Unf,1,142125,False,False,False


In [11]:
missing_cols_query = df.isnull().sum() > 0
missing_cols_num = df.loc[:, missing_cols_query].select_dtypes(include=np.number).columns.tolist()
missing_cols_cat = df.loc[:, missing_cols_query].select_dtypes(include=np.number).columns.tolist()
df[missing_cols_num] = df[missing_cols_num].fillna(0)
df[missing_cols_cat] = df[missing_cols_cat].fillna("None")

In [15]:
cat_cols = df.select_dtypes(include=np.object).columns.tolist()
df[cat_cols] = df[cat_cols].astype("category")
for col in cat_cols:
    df[col] = df[col].cat.codes

In [16]:
X = df.drop("SalePrice", axis=1)
y = df["SalePrice"]

In [17]:
gbm = GradientBoostingRegressor()


In [31]:
gbm.fit(X, y)

GradientBoostingRegressor()

In [32]:
gbm.score(X, y)

0.9434156830101283

**Step 3): Create A Pipeline With Your Model And The Column Encoder of Your Choice**

For now, you can choose which encoding technique you would want to use.  Later on you'll go back and check to see if it made a large difference.  

In [33]:
pipe = make_pipeline(ce.OrdinalEncoder(), gbm)

**Step 4).  Create A Training & Test Set**

Re-use the same settings that we've completed previously in class

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1985)

In [38]:
X_train

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,...,GrLivArea.1,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing
461,462,70,3,7200,18,7,9,1936,1135,575,...,1135,1,0,5,1971.0,1,2,False,False,False
373,374,20,3,10634,12,5,6,1953,1319,1319,...,1319,1,0,1,1953.0,2,1,False,False,False
1271,1272,20,3,9156,14,6,7,1968,1489,1489,...,1489,2,0,1,1968.0,1,2,False,False,False
634,635,90,3,6979,17,6,5,1980,1056,1056,...,1056,0,0,5,1980.0,2,2,False,False,False
1245,1246,80,3,12090,14,6,7,1984,1868,1140,...,1868,3,1,3,1984.0,0,2,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
446,447,20,3,16492,12,6,6,1966,1888,1888,...,1888,2,1,1,1966.0,0,2,False,False,False
1376,1377,30,3,6292,18,6,5,1930,790,790,...,790,1,0,5,1925.0,2,1,False,False,False
1078,1079,120,4,4435,5,6,5,2004,848,848,...,848,1,0,1,2004.0,1,2,False,False,False
709,710,20,3,7162,19,5,7,1966,904,904,...,904,1,0,1,1966.0,2,1,False,False,False


In [39]:
y_train

461     155000
373     123000
1271    185750
634     144000
1245    178000
         ...  
446     190000
1376     91000
1078    155900
709     109900
1317    208900
Name: SalePrice, Length: 1168, dtype: int64

In [40]:
gbm.fit(X_train, y_train)

GradientBoostingRegressor()

In [41]:
X_train.shape

(1168, 21)

In [42]:
y_train.shape

(1168,)

**Step 5).  Get An Initial 10 Fold Cross Validation Score**

This will be your initial baseline for improving your score.  Use your pipeline in this step.

In [43]:
cv_scores = cross_val_score(estimator=gbm, X=X_train, y=y_train, cv=10)

In [44]:
cv_scores

array([0.90477537, 0.88255199, 0.89010202, 0.87518004, 0.88157886,
       0.87406452, 0.93575517, 0.55366739, 0.83120813, 0.87453804])

**Step 6).  Do Parameter Exploration With Your Model To Find the Best Combination On Your Validation Set**

Use pipelines here to make processing easier.

Parameters to explore:

 - `n_estimators` (would not go above 1000 for now)
 - `max_depth`  (usually up to 5 levels deep is okay)
 - `learning_rate` (.001 - 0.1 is a good range)
 
It's a good idea to refer to previous lab exercises to see how best to do this.

Use 5 folds to get your validation score (this is for time)

**Hint:** Use the `steps` attribute in the pipeline to grab the `GradientBoostingRegressor()` in your pipeline and set its params.


In [None]:
# your answer here

**Step 7).  Take the *best* parameter versions and fit this on your *entire* training set**

In [None]:
# your answer here


**Step 8).  Score the model on your test set**

How did the two compare?

In [None]:
# your answer here