### Iowa Housing Lab -- Data Encoding

Welcome! This lab is a more advanced version of last class. We again build a regression model to predict housing prices, but this time the dataset has a more interesting mix: numeric and categorical columns, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1): Load in your dataset**

In [16]:
import pandas as pd
import numpy as np
df = pd.read_csv("/Users/imac/DAT07-28-AG/ClassMaterial/Unit3/data/iowa_train2.csv")

In [17]:
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,2ndFlrSF,GrLivArea.1,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars,SalePrice
0,1,60,RL,8450,CollgCr,7,5,2003,1710,856,854,1710,2,1,Attchd,2003.0,RFn,2,208500
1,2,20,RL,9600,Veenker,6,8,1976,1262,1262,0,1262,2,0,Attchd,1976.0,RFn,2,181500
2,3,60,RL,11250,CollgCr,7,5,2001,1786,920,866,1786,2,1,Attchd,2001.0,RFn,2,223500
3,4,70,RL,9550,Crawfor,7,5,1915,1717,961,756,1717,1,0,Detchd,1998.0,Unf,3,140000
4,5,60,RL,14260,NoRidge,8,5,2000,2198,1145,1053,2198,2,1,Attchd,2000.0,RFn,3,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,7917,Gilbert,6,5,1999,1647,953,694,1647,2,1,Attchd,1999.0,RFn,2,175000
1456,1457,20,RL,13175,NWAmes,6,6,1978,2073,2073,0,2073,2,0,Attchd,1978.0,Unf,2,210000
1457,1458,70,RL,9042,Crawfor,7,9,1941,2340,1188,1152,2340,2,0,Attchd,1941.0,RFn,1,266500
1458,1459,20,RL,9717,NAmes,5,6,1950,1078,1078,0,1078,1,0,Attchd,1950.0,Unf,1,142125


**Step 2): There are missing values throughout this dataset. Fill them in appropriately**

We already covered this in class, but as a reminder:

 - Decide whether the values are missing at random or for a reason.
 - Where possible, record the missingness itself (for example, with an indicator column) before filling the values in.
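One way to approach the first question is to check whether the missing entries line up with another column. The sketch below uses a small made-up frame, not the Iowa data; it only borrows the column names. It assumes that a missing garage field means the house has no garage, which would show up as `GarageCars == 0`. That assumption is worth checking against the real data rather than taking for granted.

```python
import numpy as np
import pandas as pd

# Toy frame: column names come from the lab data, the values are made up.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan, "Attchd"],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan, 1976.0],
    "GarageCars": [2, 0, 3, 0, 2],
})

# Flag the rows where each garage column is missing.
type_missing = toy["GarageType"].isnull()
year_missing = toy["GarageYrBlt"].isnull()

# Do the two columns go missing on exactly the same rows?
same_rows = bool((type_missing == year_missing).all())

# Do those rows coincide with houses that have no garage (assumed: GarageCars == 0)?
explained_by_no_garage = bool((type_missing == (toy["GarageCars"] == 0)).all())

print(same_rows, explained_by_no_garage)
```

In this toy frame both checks come out `True`, which would suggest the values are missing for a reason rather than at random.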

In [18]:
df.isnull().sum() > 0

Id              False
MSSubClass      False
MSZoning        False
LotArea         False
Neighborhood    False
OverallQual     False
OverallCond     False
YearBuilt       False
GrLivArea       False
1stFlrSF        False
2ndFlrSF        False
GrLivArea.1     False
FullBath        False
HalfBath        False
GarageType       True
GarageYrBlt      True
GarageFinish     True
GarageCars      False
SalePrice       False
dtype: bool

In [19]:
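def denote_null_values(df):
    """Add a boolean "<col>: missing" column for every column that has nulls."""
    # Columns with at least one missing value
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}: missing"
        df[col_name] = pd.isnull(df[col])
    # Note: df is modified in place and also returned
    return df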

In [20]:
denote_null_values(df)

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,...,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars,SalePrice,GarageType: missing,GarageYrBlt: missing,GarageFinish: missing
0,1,60,RL,8450,CollgCr,7,5,2003,1710,856,...,2,1,Attchd,2003.0,RFn,2,208500,False,False,False
1,2,20,RL,9600,Veenker,6,8,1976,1262,1262,...,2,0,Attchd,1976.0,RFn,2,181500,False,False,False
2,3,60,RL,11250,CollgCr,7,5,2001,1786,920,...,2,1,Attchd,2001.0,RFn,2,223500,False,False,False
3,4,70,RL,9550,Crawfor,7,5,1915,1717,961,...,1,0,Detchd,1998.0,Unf,3,140000,False,False,False
4,5,60,RL,14260,NoRidge,8,5,2000,2198,1145,...,2,1,Attchd,2000.0,RFn,3,250000,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,7917,Gilbert,6,5,1999,1647,953,...,2,1,Attchd,1999.0,RFn,2,175000,False,False,False
1456,1457,20,RL,13175,NWAmes,6,6,1978,2073,2073,...,2,0,Attchd,1978.0,Unf,2,210000,False,False,False
1457,1458,70,RL,9042,Crawfor,7,9,1941,2340,1188,...,2,0,Attchd,1941.0,RFn,1,266500,False,False,False
1458,1459,20,RL,9717,NAmes,5,6,1950,1078,1078,...,1,0,Attchd,1950.0,Unf,1,142125,False,False,False


In [21]:
df = denote_null_values(df)

In [22]:
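# Columns that still contain missing values, split by dtype
missing_cols_query = df.isnull().sum() > 0
missing_cols_num = df.loc[:, missing_cols_query].select_dtypes(include=np.number).columns.tolist()
# np.object was removed in NumPy 1.24; the string "object" selects the same columns
missing_cols_cat = df.loc[:, missing_cols_query].select_dtypes(include="object").columns.tolist()
# Fill numeric gaps with 0 and categorical gaps with the string "None"
df[missing_cols_num] = df[missing_cols_num].fillna(0)
df[missing_cols_cat] = df[missing_cols_cat].fillna("None")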

In [23]:
df[missing_cols_num]

Unnamed: 0,GarageYrBlt
0,2003.0
1,1976.0
2,2001.0
3,1998.0
4,2000.0
...,...
1455,1999.0
1456,1978.0
1457,1941.0
1458,1950.0



**Step 3): Encode Your Categorical Data**

For now, pick whichever encoding technique you prefer. Later you'll come back and check whether the choice made a large difference.
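As a reminder of what the choice looks like in practice, here is a minimal sketch of two common options, one-hot encoding and integer (ordinal-style) encoding. It assumes these are the two techniques in play and uses plain pandas on a made-up frame rather than any particular encoder from class, so treat it as an illustration and not as the lab answer.

```python
import pandas as pd

# Toy frame with made-up values; only the column names echo the lab data.
toy = pd.DataFrame({
    "GarageFinish": ["RFn", "Unf", "None", "RFn"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy, columns=["GarageFinish"])

# Ordinal-style encoding: a single column of integer codes.
# The codes follow the sorted order of the categories, which is arbitrary here.
ordinal = toy.copy()
ordinal["GarageFinish"] = ordinal["GarageFinish"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(ordinal["GarageFinish"].tolist())
```

One-hot encoding adds a column per category and implies no ordering, while integer codes keep a single column but impose an order the model may or may not be able to use.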

In [None]:
# your answer here

**Step 4):  Declare X & y, and fit your model**

In [None]:
# your code here

**Step 5):  Score your model, and look at your feature importances** 

In [None]:
# your code here

**Step 6):  (Time Permitting) Re-encode your categorical variables using the technique you did not use in Step 3, and observe whether it made a difference**

In [None]:
# your code here

If you've made it this far, you can stop.  We'll discuss Step 7 together to wrap up the class and lead into the next session.

**Step 7):  Score your model on your validation set**

How much did your results change?
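To make the comparison concrete, the sketch below computes R² on training rows and on held-out rows. Everything in it is an assumption for illustration: the data is synthetic, the 150/50 split is arbitrary, and a plain least-squares fit stands in for whatever model you trained. The R² formula is the same quantity that scikit-learn regressors report from `score`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 numeric features, a linear target plus noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out the last 50 rows as a validation set (rows are already in random order).
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

def add_intercept(features):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(features)), features])

# Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

def r_squared(features, target):
    """R^2 = 1 - SS_res / SS_tot for predictions made with the fitted coefficients."""
    residuals = target - add_intercept(features) @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

train_r2 = r_squared(X_train, y_train)
val_r2 = r_squared(X_val, y_val)
print(f"train R^2: {train_r2:.3f}, validation R^2: {val_r2:.3f}")
```

A validation score well below the training score is the usual sign that the model has fit noise in the training rows.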

In [None]:
# your answer here