### Iowa Housing Lab -- Solutions

Welcome! This lab is a more advanced version of yesterday's class. We'll again build a regression model to predict housing prices, but this time with a dataset that has a more interesting mix of data: ordinal and nominal features, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in both your training & test sets**

In [90]:
import pandas as pd

In [91]:
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

In [92]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,2ndFlrSF,GrLivArea.1,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars,SalePrice
0,1,60,RL,8450,CollgCr,7,5,2003,1710,856,854,1710,2,1,Attchd,2003.0,RFn,2,208500
1,2,20,RL,9600,Veenker,6,8,1976,1262,1262,0,1262,2,0,Attchd,1976.0,RFn,2,181500
2,3,60,RL,11250,CollgCr,7,5,2001,1786,920,866,1786,2,1,Attchd,2001.0,RFn,2,223500
3,4,70,RL,9550,Crawfor,7,5,1915,1717,961,756,1717,1,0,Detchd,1998.0,Unf,3,140000
4,5,60,RL,14260,NoRidge,8,5,2000,2198,1145,1053,2198,2,1,Attchd,2000.0,RFn,3,250000


Also, when you're cleaning training and test sets, it's usually a good idea to separate the column you're trying to predict from everything else.

For now, declare `y` to be the `SalePrice` column, and then remove it from the training set entirely.  You can drop the `Id` column too, since it encodes nothing meaningful.

**Step 2).  There are missing values throughout this dataset.  For the time being, let's try and do a few things:**

 - Decide whether these missing values are likely to occur at random, or whether they are likely encoding something else.
 
If missing values are encoding something else, their positions are usually highly correlated with the missing values in related columns, and/or they represent a particular rank in a hierarchy, e.g. 'None', 0, 'Other', etc.  In other words, the missing value stands for something specific that just isn't written down.

Take a look at the column descriptions, see what you think they might be.

 - If you think they are missing at random, fill in the missing values with the column mean (numeric columns) or mode (categorical columns).
 - If you think they are **not** missing at random, fill them in with a value that encodes what they are (0, 'Other', and 'None' are common choices).
 
**Hint:** You can encode null and non-null values as 0 and 1, respectively, and call the `corr()` method on the result. 
 
*If filling in missing values, make sure to perform this operation on the training and test set, using values from the training set for imputation.*
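
As a rough illustration of the hint above, here is a minimal sketch on a made-up toy frame (the values and the `toy` name are assumptions, not the lab data). If the null patterns of several columns correlate perfectly, the gaps are probably not random.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) where the garage columns are missing together.
toy = pd.DataFrame({
    "GarageType":   ["Attchd", np.nan, "Detchd", np.nan, "Attchd"],
    "GarageYrBlt":  [2003.0, np.nan, 1998.0, np.nan, 1976.0],
    "GarageFinish": ["RFn", np.nan, "Unf", np.nan, "RFn"],
    "LotArea":      [8450, 9600, 11250, 9550, 14260],
})

# 1 where a value is present, 0 where it is null.
present = toy.notnull().astype(int)

# Keep only the columns that actually have missing values; a column with
# no nulls is constant here, so its correlation would be NaN.
missing_corr = present.loc[:, present.min() == 0].corr()
print(missing_corr)
```

A correlation of 1.0 between two indicator columns means they are always missing on the same rows, which points towards "no garage" rather than a data-entry accident.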

In [93]:
y = train['SalePrice']
train = train.drop(['SalePrice', 'Id'], axis=1)

In [94]:
train.isnull().sum()

MSSubClass       0
MSZoning         0
LotArea          0
Neighborhood     0
OverallQual      0
OverallCond      0
YearBuilt        0
GrLivArea        0
1stFlrSF         0
2ndFlrSF         0
GrLivArea.1      0
FullBath         0
HalfBath         0
GarageType      81
GarageYrBlt     81
GarageFinish    81
GarageCars       0
dtype: int64

In [95]:
train['GarageType'] = train['GarageType'].fillna('NA')
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
train['GarageFinish'] = train['GarageFinish'].fillna('NA')


In [96]:
train['GarageFinish'].unique()

array(['RFn', 'Unf', 'Fin', 'NA'], dtype=object)

In [97]:
# import types
# import inspect
# def find_err(x):
#     print(type(x))
#     if inspect.ismethod(x):
#         print('YES')
#         x = train['MSZoning'].mode()[0]
        
#     return x
# test['MSZoning'] = test['MSZoning'].apply(find_err)

In [98]:
test['GarageType'] = test['GarageType'].fillna('NA')
test['GarageYrBlt'] = test['GarageYrBlt'].fillna(0)
test['GarageFinish'] = test['GarageFinish'].fillna('NA')
test['MSZoning'] = test['MSZoning'].fillna(train['MSZoning'].mode()[0])
test['GarageCars'] = test['GarageCars'].fillna(train['GarageCars'].mean())

**Step 3): Ordinal vs Nominal Data**

There are a number of categorical columns in this dataset, and they could represent either ordinal data (data that has a rank) or nominal data (data that doesn't have a rank).  

There is a file called `data_description.txt` that describes the values in each column and what they mean, if you want to do a little bit of research.

You can also find a brief description here:  https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data.

In [99]:
# your answer here (no real code required for this one)

**Step 4):  Go Ahead and Change Your Ordinal Variables To Their Appropriate Values, if they exist.**

**Hint:** The `map` method is useful for this.  

It goes like this:  `mapping = {'oldColVal1': 'NewColVal1',`
                                 `'oldColVal2': 'NewColVal2', etc}`
                                
`df['Col'] = df['Col'].map(mapping)`
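
One thing to watch with `map`: any value that isn't a key in the dictionary silently becomes `NaN`. The sketch below uses a toy column with a deliberately misspelled category (`'Rfn'`, made up for illustration) to show a quick sanity check.

```python
import pandas as pd

# Ranking for garage finish: no garage < unfinished < rough finished < finished.
garage_finish_order = {"NA": 0, "Unf": 1, "RFn": 2, "Fin": 3}

# Toy column with one misspelled category ("Rfn") to show the failure mode.
col = pd.Series(["RFn", "Unf", "Fin", "NA", "Rfn"])
mapped = col.map(garage_finish_order)

# Values absent from the dictionary come back as NaN rather than raising.
unmapped = col[mapped.isnull()].unique()
print(mapped.tolist())
print(unmapped)
```

If `unmapped` is not empty after mapping a real column, the dictionary is missing a category (or the column still contains nulls that were never filled).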

In [100]:
garfin = {
    'NA': 0,
    'Unf': 1,
    'RFn': 2,
    'Fin': 3
}
train['GarageFinish'] = train['GarageFinish'].map(garfin)
test['GarageFinish'] = test['GarageFinish'].map(garfin)

In [101]:
train['GarageFinish'].unique()

array([2, 1, 3, 0], dtype=int64)

In [102]:
test = test.drop('Id', axis=1)

**Step 5):  Now, OneHot Encode Your Dataset For Your Remaining Categorical Columns** 

**Note:** You want your training and your test sets attached for this one.  Detach them when you're finished.

**2nd Note:** Some columns are categorical, even if they're encoded as numbers.  The `MSSubClass` column is essentially a category (it identifies the type of dwelling), even though it's encoded as a number.  It's a good idea to convert these variables to strings using the `astype` method.

**3rd Note:** We'll discuss better ways to get around this, but the test set has a value in the `MSSubClass` column that is **not** in the training set.  For the time being just drop the column `MSSubClass_150` from both training and test sets before proceeding to the next step.
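
For reference, one common alternative to dropping the column by hand is to encode each set separately and then reindex the test set onto the training columns. This is only a sketch on made-up toy frames (`toy_train` and `toy_test` are hypothetical names), and not necessarily the approach we'll cover later.

```python
import pandas as pd

# Toy data: the test set contains a category ("150") the training set lacks.
toy_train = pd.DataFrame({"MSSubClass": ["20", "60", "20"],
                          "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"MSSubClass": ["60", "150"],
                         "LotArea": [9550, 14260]})

# Encode each set separately, then force the test set onto the training columns.
# Columns unseen in training are dropped; columns missing from test are filled with 0.
train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test).reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(test_enc)
```

The row with the unseen category ends up with zeros in every `MSSubClass_*` column, so the model treats it as "none of the known categories".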

In [103]:
train['MSSubClass'] = train['MSSubClass'].astype(str)
test['MSSubClass'] = test['MSSubClass'].astype(str)

In [104]:
# Stack the training rows on top of the test rows so both get the same dummy columns.
comb = pd.concat([train, test])

In [105]:
comb = pd.get_dummies(comb)
comb = comb.drop('MSSubClass_150', axis=1)

In [106]:
# The first len(train) rows of comb are the training rows; the rest are the test rows.
n_train = len(train)
train = comb.iloc[:n_train]
test = comb.iloc[n_train:]

In [107]:
train['GarageFinish'][:10]

0    2
1    2
2    2
3    1
4    2
5    1
6    2
7    2
8    1
9    2
Name: GarageFinish, dtype: int64

**Step 6): Standardize Your Data On Your Training and Test Sets**

**Remember:** Use the values from your training set to standardize your test set!  

Ask me if you have any questions on how to do this.
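
If you'd rather not do the arithmetic by hand, scikit-learn's `StandardScaler` follows the same idea: learn the statistics from the training set, then reuse them on the test set. A minimal sketch on made-up numbers (the toy frames are assumptions, not the lab data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames with a single numeric column (made-up values).
toy_train = pd.DataFrame({"LotArea": [8000.0, 10000.0, 12000.0]})
toy_test = pd.DataFrame({"LotArea": [9000.0, 14000.0]})

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training set only.
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train), columns=toy_train.columns)
# transform reuses those training statistics on the test set.
test_scaled = pd.DataFrame(scaler.transform(toy_test), columns=toy_test.columns)

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the two approaches give slightly different numbers. Either is fine as long as the same training statistics are applied to both sets.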

In [108]:
train.loc[:, train.isnull().sum() > 0]

0
1
2
3
4
...
1455
1456
1457
1458
1459


In [109]:
means = train.mean()
stds = train.std()
train = train - means
train = train / stds

test = (test - means) / stds

In [110]:
train.head()

Unnamed: 0,LotArea,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,2ndFlrSF,GrLivArea.1,FullBath,HalfBath,...,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_NA
0,-0.207071,0.651256,-0.517023,1.050634,0.370207,-0.793162,1.161454,0.370207,0.78947,1.227165,...,-0.131946,-0.163415,-0.087099,-0.064216,0.823223,-0.114788,-0.253172,-0.07873,-0.600353,-0.242277
1,-0.091855,-0.071812,2.178881,0.15668,-0.482347,0.257052,-0.794891,-0.482347,0.78947,-0.76136,...,-0.131946,-0.163415,11.473319,-0.064216,0.823223,-0.114788,-0.253172,-0.07873,-0.600353,-0.242277
2,0.073455,0.651256,-0.517023,0.984415,0.514836,-0.627611,1.188943,0.514836,0.78947,1.227165,...,-0.131946,-0.163415,-0.087099,-0.064216,0.823223,-0.114788,-0.253172,-0.07873,-0.600353,-0.242277
3,-0.096864,0.651256,-0.517023,-1.862993,0.383528,-0.521555,0.936955,0.383528,-1.025689,-0.76136,...,-0.131946,-0.163415,-0.087099,-0.064216,-1.213905,-0.114788,-0.253172,-0.07873,1.664545,-0.242277
4,0.37502,1.374324,-0.517023,0.951306,1.298881,-0.045596,1.617323,1.298881,0.78947,1.227165,...,-0.131946,-0.163415,-0.087099,-0.064216,0.823223,-0.114788,-0.253172,-0.07873,-0.600353,-0.242277


In [111]:
train['GarageFinish'].describe()

count    1.460000e+03
mean    -8.760116e-17
std      1.000000e+00
min     -1.921700e+00
25%     -8.016673e-01
50%      3.183655e-01
75%      3.183655e-01
max      1.438398e+00
Name: GarageFinish, dtype: float64

In [112]:
train.loc[: , train.isnull().sum() > 0].isnull().sum()

Series([], dtype: float64)

**Step 7):  Create a validation set out of your training set**

Since there is no time-based component, random shuffling is fine.  (You can use `train_test_split` for this, although homespun methods usually work equally well.)
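
For the curious, a homespun split might look something like the sketch below. The `simple_split` helper and the toy data are hypothetical, written only for illustration.

```python
import numpy as np
import pandas as pd

def simple_split(X, y, val_fraction=0.2, seed=2020):
    """Shuffle row positions and hold out val_fraction of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = len(X) - int(round(len(X) * val_fraction))
    train_idx, val_idx = order[:n_train], order[n_train:]
    return X.iloc[train_idx], X.iloc[val_idx], y.iloc[train_idx], y.iloc[val_idx]

# Toy features and target (made-up values).
toy_X = pd.DataFrame({"LotArea": np.arange(10)})
toy_y = pd.Series(np.arange(10) * 1000)

X_tr, X_va, y_tr, y_va = simple_split(toy_X, toy_y)
print(len(X_tr), len(X_va))
```

The important part is that `X` and `y` are indexed with the same shuffled positions, so each row stays paired with its target.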

In [144]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train, y, test_size=0.2, random_state=2020)

**Step 8): Initialize Linear Regression, fit it on your training set, and score it on your validation set to get a feel for how you did.**

In [145]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.coef_)
print(lr.score(X_val, y_val))

[ 5.19990039e+03  1.87164828e+04  7.51032639e+03  1.23755961e+04
 -3.32355720e+14  9.31188802e+03  8.71204679e+03  3.32355720e+14
 -3.02771638e+02  3.12728125e+03 -1.61819531e+03  6.06525391e+03
  1.16770352e+04  2.21774925e+15  1.90364527e+15  7.72683337e+14
  1.32906491e+15  4.51581276e+15  1.98795091e+15  4.89697886e+14
  8.45848240e+14  2.79335889e+15  3.78067417e+15  1.85976123e+15
  9.75351450e+14  1.82980812e+15  1.08896467e+15  1.73628313e+15
 -6.18825983e+15 -1.54749168e+16 -7.81138651e+15 -3.06479507e+16
 -2.67407500e+16 -9.73974569e+14 -3.35802365e+14 -9.45221473e+14
 -1.77328277e+15 -1.24520402e+15 -2.75658424e+15 -1.66698071e+15
 -2.29329247e+15 -2.05399970e+15 -1.42689923e+15 -9.73974569e+14
 -1.63512718e+15 -3.27804362e+15 -7.10632314e+14 -1.97874418e+15
 -1.49993734e+15 -2.02930088e+15 -2.42612409e+15 -1.17783904e+15
 -1.99153281e+15 -1.78786641e+15 -2.13763023e+15 -1.17783904e+15
 -1.44554491e+15 -7.85091954e+14 -5.33610357e+14 -4.09309716e+15
 -9.45312091e+14 -1.98511

**Step 9):  Finally, go ahead and make your predictions on your test set.**

Create a dataframe with the following columns: 

    `ID`: The original ID of each row in the test set.  Goes from 1461 - 2919
    `SalePrice`: The predicted Sale Price of the house in the test set. 

In [146]:
preds = lr.predict(test)
pred_df = pd.DataFrame({'ID': range(1461, 2920),
                       'SalePrice': preds})
pred_df.head(10)

Unnamed: 0,ID,SalePrice
0,1461,110212.385135
1,1462,156446.385135
2,1463,172558.385135
3,1464,189862.385135
4,1465,254474.385135
5,1466,183212.385135
6,1467,186734.385135
7,1468,174120.385135
8,1469,189348.385135
9,1470,121130.385135


Now, use the `to_csv()` method to write the predictions to a CSV file.  Make sure to use `index=False` as an argument.

In [147]:
pred_df.to_csv('../data/predictions.csv', index=False)

**Bonus:** Can you improve your score?

The first part of this lab was meant to be a walkthrough of the basics of prepping a dataset and getting it ready.

However, there's a lot that could be improved upon!  

Using validation scores as your guide, you could try and look at some of the following:

 - Removing outliers from the target variable, or using a log transformation to reduce its skew (this would be done via `y = np.log(y)`)
 - There are lots of highly correlated variables in this dataset.  Do the 4 different columns about the garage really tell you something that different from one another?  You can try averaging multiple columns into one if they're highly correlated, or removing some entirely to see if it improves anything.

In [149]:
import numpy as np
lr2 = LinearRegression()
ylog = np.log(y)
X_train, X_val, y_train, y_val = train_test_split(train, ylog, test_size=0.2,  random_state=2020)
lr2.fit(X_train, y_train)
print(lr2.score(X_val, y_val))

0.905202416037963
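
Keep in mind that a model fitted on `np.log(y)` predicts log prices, so its predictions need to go through `np.exp` before they can be reported as sale prices. A minimal sketch on made-up data (the toy arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: price grows exponentially with a single feature.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 100000.0 * np.exp(0.1 * X_toy[:, 0])

model = LinearRegression()
model.fit(X_toy, np.log(y_toy))      # fit on log prices

log_preds = model.predict(X_toy)     # predictions are on the log scale
price_preds = np.exp(log_preds)      # undo the log to get prices back
print(price_preds)
```

The validation score is also computed on the log scale here, so it isn't directly comparable to a score computed on raw prices.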


In [150]:
# Compute the correlation matrix once, then blank out everything at or below 0.5.
corr = train.corr()
corr[corr > 0.5]

Unnamed: 0,LotArea,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,2ndFlrSF,GrLivArea.1,FullBath,HalfBath,...,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_NA
LotArea,1.0,,,,,,,,,,...,,,,,,,,,,
OverallQual,,1.000000,,0.572323,0.593007,,,0.593007,0.550600,,...,,,,,,,,,,
OverallCond,,,1.0,,,,,,,,...,,,,,,,,,,
YearBuilt,,0.572323,,1.000000,,,,,,,...,,,,,,,,,,
GrLivArea,,0.593007,,,1.000000,0.566024,0.687501,1.000000,0.630012,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GarageType_Basment,,,,,,,,,,,...,,,,,,1.0,,,,
GarageType_BuiltIn,,,,,,,,,,,...,,,,,,,1.0,,,
GarageType_CarPort,,,,,,,,,,,...,,,,,,,,1.0,,
GarageType_Detchd,,,,,,,,,,,...,,,,,,,,,1.0,
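
A masked matrix like the one above is hard to scan when there are dozens of columns. One possible way to get a flat list of correlated pairs is sketched below; the `correlated_pairs` helper is hypothetical, and the toy frame reuses a few values from the first rows of the training set. It uses absolute values so that strong negative correlations show up too.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

toy = pd.DataFrame({
    "GrLivArea":   [1710, 1262, 1786, 1717, 2198],
    "GrLivArea.1": [1710, 1262, 1786, 1717, 2198],
    "OverallCond": [5, 8, 5, 5, 5],
})
result = correlated_pairs(toy, threshold=0.9)
print(result)
```

Pairs with a correlation of exactly 1.0, like `GrLivArea` and `GrLivArea.1`, are duplicates, and dropping one of them is a safe first experiment.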
