### Iowa Housing Lab

Welcome! This lab is a more advanced version of yesterday's class. We'll again build a regression model to predict housing prices, but this time with a dataset that has a more interesting mix of data: ordinal and nominal features, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in both your training & test sets**

In [4]:
import pandas as pd
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            1460 non-null   int64  
 1   MSSubClass    1460 non-null   int64  
 2   MSZoning      1460 non-null   object 
 3   LotArea       1460 non-null   int64  
 4   Neighborhood  1460 non-null   object 
 5   OverallQual   1460 non-null   int64  
 6   OverallCond   1460 non-null   int64  
 7   YearBuilt     1460 non-null   int64  
 8   GrLivArea     1460 non-null   int64  
 9   1stFlrSF      1460 non-null   int64  
 10  2ndFlrSF      1460 non-null   int64  
 11  GrLivArea.1   1460 non-null   int64  
 12  FullBath      1460 non-null   int64  
 13  HalfBath      1460 non-null   int64  
 14  GarageType    1379 non-null   object 
 15  GarageYrBlt   1379 non-null   float64
 16  GarageFinish  1379 non-null   object 
 17  GarageCars    1460 non-null   int64  
 18  SalePrice     1460 non-null 

In [8]:
import numpy as np
y = np.log(train['SalePrice'])
train.drop('SalePrice', axis=1, inplace=True)

**Step 2).  There are missing values throughout this dataset.  For the time being, let's try and do a few things:**

 - Are these missing values likely to be occurring at random, or are they likely encoding something else?
 
If missing values encode something else, they usually coincide with missing values in related columns, or they stand for a particular rank in a hierarchy, for example 'None', 0, or 'Other'.

Take a look at the column descriptions and see what you think the missing values might mean.

 - When you've made your decisions, fill in the missing values in each column with the column's mean if it's numeric, or its mode if it's categorical.
 
*Make sure to perform this operation on the training and test set, using values from the training set for imputation.*
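
As a rough illustration of the rule above, here is a minimal sketch that computes the fill values from the training set only and then applies them to both sets. The frames `toy_train` and `toy_test` and their values are made up for the example; they are not the lab's data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the lab's train/test frames (values are made up).
toy_train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "MSZoning": ["RL", "RL", "RM", None],
})
toy_test = pd.DataFrame({
    "LotArea": [np.nan, 14260.0],
    "MSZoning": [None, "RM"],
})

# Compute the fill values once, from the training set only.
fill_values = {}
for col in toy_train.columns:
    if pd.api.types.is_numeric_dtype(toy_train[col]):
        fill_values[col] = toy_train[col].mean()
    else:
        # mode() returns a Series (ties are possible), so take the first entry.
        fill_values[col] = toy_train[col].mode()[0]

# Apply the same training-derived values to both sets.
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
```

Keeping the fill values in one dictionary makes it hard to accidentally impute the test set with its own statistics.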

In [2]:
train.isnull().sum()

Id               0
MSSubClass       0
MSZoning         0
LotArea          0
Neighborhood     0
OverallQual      0
OverallCond      0
YearBuilt        0
GrLivArea        0
1stFlrSF         0
2ndFlrSF         0
GrLivArea.1      0
FullBath         0
HalfBath         0
GarageType      81
GarageYrBlt     81
GarageFinish    81
GarageCars       0
SalePrice        0
dtype: int64
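
The three garage columns above each have 81 missing values, which hints that they may be missing on the same rows. One way to check this is sketched below on a made-up frame called `toy`; whether the pattern actually holds in the lab's data is something to verify on `train` itself.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern above (made-up values): the garage
# columns are missing together on the rows where GarageCars is 0.
toy = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd", None],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan],
    "GarageFinish": ["RFn", None, "Unf", None],
    "GarageCars": [2, 0, 3, 0],
})

garage_cols = ["GarageType", "GarageYrBlt", "GarageFinish"]
missing = toy[garage_cols].isnull()

# True if every row is either missing all three columns or none of them.
missing_together = bool((missing.all(axis=1) == missing.any(axis=1)).all())

# Number of garage spaces on the rows where the garage columns are missing.
cars_when_missing = toy.loc[missing.any(axis=1), "GarageCars"].unique()
```

If the same check on the real data shows the columns missing together on rows with zero garage spaces, that would support reading the missing values as "no garage" rather than as random gaps.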

In [18]:
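# Column selection returns a copy, so assign the filled result back instead of using inplace=True.
garage_cat_cols = ['GarageType', 'GarageFinish']
train[garage_cat_cols] = train[garage_cat_cols].fillna('Null')
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)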

In [3]:
test.isnull().sum()

Id               0
MSSubClass       0
MSZoning         4
LotArea          0
Neighborhood     0
OverallQual      0
OverallCond      0
YearBuilt        0
GrLivArea        0
1stFlrSF         0
2ndFlrSF         0
GrLivArea.1      0
FullBath         0
HalfBath         0
GarageType      76
GarageYrBlt     78
GarageFinish    78
GarageCars       1
dtype: int64

In [12]:
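# Assign results back: fillna on a column selection fills a copy, not `test` itself.
garage_cat_cols = ['GarageType', 'GarageFinish']
test[garage_cat_cols] = test[garage_cat_cols].fillna('Null')
test['GarageYrBlt'] = test['GarageYrBlt'].fillna(0)
# Impute with the training set's mode; mode() returns a Series, so take its first entry.
test['MSZoning'] = test['MSZoning'].fillna(train['MSZoning'].mode()[0])
test['GarageCars'] = test['GarageCars'].fillna(0)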

In [13]:
test['GarageCars']

0       1.0
1       1.0
2       2.0
3       2.0
4       2.0
       ... 
1454    0.0
1455    1.0
1456    2.0
1457    0.0
1458    3.0
Name: GarageCars, Length: 1459, dtype: float64

**Step 3): Ordinal vs Nominal Columns**

There are a number of categorical columns in this dataset, and each one could hold either ordinal or nominal data.  

Take a look at their descriptions, and decide which type each column is.
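
A quick first pass, sketched here on a made-up frame called `toy`, is to list every non-numeric column alongside its distinct values and compare them against the data description. Treat the dtype as a starting point only: integer-coded columns such as `MSSubClass` can also be categorical.

```python
import pandas as pd

# Toy frame with two string columns and one numeric column (made-up subset).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "GarageFinish": ["RFn", "Unf", "Fin"],
    "LotArea": [8450, 9600, 11250],
})

# Non-numeric columns are the candidates for ordinal/nominal treatment.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Collect each column's distinct values to compare against the data description.
levels = {col: sorted(toy[col].dropna().unique()) for col in categorical_cols}
for col, values in levels.items():
    print(col, values)
```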

In [15]:
train.head()

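| | Id | MSSubClass | MSZoning | LotArea | Neighborhood | OverallQual | OverallCond | YearBuilt | GrLivArea | 1stFlrSF | 2ndFlrSF | GrLivArea.1 | FullBath | HalfBath | GarageType | GarageYrBlt | GarageFinish | GarageCars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 8450 | CollgCr | 7 | 5 | 2003 | 1710 | 856 | 854 | 1710 | 2 | 1 | Attchd | 2003.0 | RFn | 2 |
| 1 | 2 | 20 | RL | 9600 | Veenker | 6 | 8 | 1976 | 1262 | 1262 | 0 | 1262 | 2 | 0 | Attchd | 1976.0 | RFn | 2 |
| 2 | 3 | 60 | RL | 11250 | CollgCr | 7 | 5 | 2001 | 1786 | 920 | 866 | 1786 | 2 | 1 | Attchd | 2001.0 | RFn | 2 |
| 3 | 4 | 70 | RL | 9550 | Crawfor | 7 | 5 | 1915 | 1717 | 961 | 756 | 1717 | 1 | 0 | Detchd | 1998.0 | Unf | 3 |
| 4 | 5 | 60 | RL | 14260 | NoRidge | 8 | 5 | 2000 | 2198 | 1145 | 1053 | 2198 | 2 | 1 | Attchd | 2000.0 | RFn | 3 |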


In [16]:
train['GarageType'].unique()

array(['Attchd', 'Detchd', 'BuiltIn', 'CarPort', nan, 'Basment', '2Types'],
      dtype=object)

In [17]:
train['GarageFinish'].unique()

array(['RFn', 'Unf', 'Fin', nan], dtype=object)

**Step 4):  Go Ahead and Change Your Ordinal Variables To Their Appropriate Values**

In [None]:
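# GarageFinish is ordinal. The ranks below are one reasonable choice (an assumption):
# 'Null' (no garage, the fill value from Step 2) < 'Unf' < 'RFn' < 'Fin'.
garage_map = {'Null': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}
train['GarageFinish'] = train['GarageFinish'].map(garage_map)
test['GarageFinish'] = test['GarageFinish'].map(garage_map)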

**Step 5):  Now, OneHot Encode Your Dataset For Your Remaining Categorical Columns** 

**Note:** Concatenate your training and test sets before encoding, so that both end up with the same dummy columns.  Split them apart again when you're finished.
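
One possible way to do the attach-and-detach step is sketched below on two made-up frames, `toy_train` and `toy_test`. Using `keys` in `pd.concat` is just one option for remembering which rows came from which set.

```python
import pandas as pd

# Toy frames (made-up values); 'FV' occurs only in the test set.
toy_train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"], "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"MSZoning": ["FV", "RL"], "LotArea": [9550, 14260]})

# Stack the two sets so get_dummies sees every category that occurs in either.
combined = pd.concat([toy_train, toy_test], keys=["train", "test"])
encoded = pd.get_dummies(combined, columns=["MSZoning"])

# Split back apart using the outer index level added by `keys`.
train_encoded = encoded.loc["train"]
test_encoded = encoded.loc["test"]
```

Encoding the two sets separately would leave `toy_train` without an `MSZoning_FV` column, so a model fit on it could not be applied to the test set without extra alignment work.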

In [1]:
pd.get_dummies(train[['MSZoning', 'Neighborhood']])

**Step 6): Standardize Your Data On Your Training and Test Sets**

**Remember:** Use the values from your training set to standardize your test set!  

Ask me if you have any questions on how to do this.
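
For reference, here is a minimal sketch of the idea using scikit-learn's `StandardScaler`. The arrays `X_train_toy` and `X_test_toy` are made-up numbers standing in for your encoded features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices: two numeric columns each.
X_train_toy = np.array([[8450.0, 7.0], [9600.0, 6.0], [11250.0, 7.0], [9550.0, 8.0]])
X_test_toy = np.array([[14260.0, 8.0], [9000.0, 5.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train_toy)
# transform reuses those training statistics; the scaler is never fit on the test set.
X_test_scaled = scaler.transform(X_test_toy)
```

After this, the training columns have mean 0 and standard deviation 1, while the test columns generally will not, which is expected.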

In [None]:
# your answer here

**Step 7):  Create a validation set out of your training set**

Since there is no time-based component, random shuffling is fine.
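
A random split can be sketched with scikit-learn's `train_test_split`. The arrays below are made up, and the 80/20 ratio and `random_state` are arbitrary choices for the example, not values prescribed by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features (10 rows, 2 columns) and target.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

# Shuffle the rows and hold out 20% of them for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, random_state=42
)
```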

In [None]:
# your answer here

**Step 8): Fit Linear Regression on your training set, and score it on your validation set to get a feel for how you did.**
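
The fit-then-score pattern can be sketched as follows on synthetic data. The features, coefficients, and noise level are invented for the example. Because the lab's target `y` is the log of `SalePrice`, an RMSE computed this way on the real data would be in log-price units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# Synthetic log-price-like target: a linear signal plus a little noise.
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + 12.0 + rng.normal(scale=0.05, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit on the training rows only, then evaluate on the held-out rows.
model = LinearRegression().fit(X_tr, y_tr)
val_r2 = model.score(X_val, y_val)  # R^2 on the validation set
val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
```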

In [None]:
# your answer here

**Step 9):  Finally, go ahead and make your predictions on your test set.**

Save the following columns to a CSV file: the ID of each row in your test set, and your prediction for that row.
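
A minimal sketch of the export step is shown below. The IDs, the predictions, the `submission.csv` file name, and the temporary-directory location are all placeholders. Since `y` was log-transformed earlier in the lab, the sketch assumes you want to undo that with `np.exp` before saving.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Placeholder IDs and log-scale predictions (made-up values).
toy_ids = pd.Series([1461, 1462, 1463], name="Id")
log_predictions = np.array([11.8, 12.1, 12.3])

submission = pd.DataFrame({
    "Id": toy_ids,
    # Undo the log transform so predictions are back on the original price scale.
    "SalePrice": np.exp(log_predictions),
})

# Hypothetical output location; index=False keeps the row index out of the file.
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
submission.to_csv(out_path, index=False)
```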

In [None]:
# your answer here

**Bonus:** Can you improve your score?

The first part of this lab was meant to be a walkthrough of the basics of prepping a dataset and getting it ready for modeling.

However, there's a lot that could be improved upon!  

Using validation scores as your guide, you could try and look at some of the following:

 - Removing outliers from the target variable, or applying log transformations to skewed columns to reduce the influence of extreme values
 - There are lots of highly correlated variables in this dataset.  Do the 4 different columns about the garage really tell you something that different from one another?  You can try averaging multiple columns into one if they're highly correlated, or removing some entirely to see if it improves anything.
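
One way to look for redundant columns is sketched below. The frame `toy` reuses the five rows shown by `train.head()` for four numeric columns, and the 0.9 cutoff is an arbitrary threshold chosen for the example.

```python
import numpy as np
import pandas as pd

# Values copied from the five rows shown by train.head() earlier in the lab.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "GrLivArea.1": [1710, 1262, 1786, 1717, 2198],
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "GarageCars": [2, 2, 2, 3, 3],
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]

# Drop the second column of each highly correlated pair.
reduced = toy.drop(columns=[second for _, second in high_pairs.index])
```

On these five rows, `GrLivArea.1` is an exact copy of `GrLivArea`, so it is the only column flagged. Use your validation score to judge whether dropping a column on the full data actually helps.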

In [None]:
# your answer here