# Machine Learning Foundation

## Course 2, Part b: Regression Setup, Train-test Split LAB 


## Introduction

We will be working with a data set based on [housing prices in Ames, Iowa](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). It was compiled for educational use to be a modernized and expanded alternative to the well-known Boston Housing dataset. This version of the data set has had some missing values filled for convenience.

There is an extensive number of features; a few of them are described in the list below.

### Target

* SalePrice: The property's sale price in dollars. 

### Features

* MoSold: Month Sold
* YrSold: Year Sold   
* SaleType: Type of sale
* SaleCondition: Condition of sale
* MSSubClass: The building class
* MSZoning: The general zoning classification
* ...


In [33]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [34]:
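import requests

def download(url, filename):
    # Fetch the file; the response object is printed to show the HTTP status
    response = requests.get(url)
    print(f"content {response}")

    # Write the file to disk only if the request succeeded
    if response.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(response.content)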

In [35]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML240EN-SkillsNetwork/labs/data/Ames_Housing_Sales.csv"
download(path, "Ames_Housing_Sales.csv")

content <Response [200]>


In [36]:
import pandas as pd
import numpy as np

data = pd.read_csv("Ames_Housing_Sales.csv", keep_default_na=False)
data.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,856.0,854.0,0.0,,3,1Fam,TA,No,706.0,0.0,...,0.0,Pave,8,856.0,AllPub,0.0,2003,2003,2008,208500.0
1,1262.0,0.0,0.0,,3,1Fam,TA,Gd,978.0,0.0,...,0.0,Pave,6,1262.0,AllPub,298.0,1976,1976,2007,181500.0
2,920.0,866.0,0.0,,3,1Fam,TA,Mn,486.0,0.0,...,0.0,Pave,6,920.0,AllPub,0.0,2001,2002,2008,223500.0
3,961.0,756.0,0.0,,3,1Fam,Gd,No,216.0,0.0,...,0.0,Pave,7,756.0,AllPub,0.0,1915,1970,2006,140000.0
4,1145.0,1053.0,0.0,,4,1Fam,TA,Av,655.0,0.0,...,0.0,Pave,9,1145.0,AllPub,192.0,2000,2000,2008,250000.0


In [37]:
data.dtypes.value_counts()

object     43
float64    21
int64      16
Name: count, dtype: int64

## Question 2

A significant challenge, particularly when dealing with data that have many columns, is ensuring each column gets encoded correctly. 

This is particularly true for ordered categoricals (ordinals) versus unordered categoricals. Unordered categoricals should be one-hot encoded; however, this can significantly increase the number of features and creates features that are highly correlated with each other.

Determine how many total features would be present, relative to what currently exists, if all string (object) features are one-hot encoded. Recall that a feature with `n` categories needs only `n-1` one-hot encoded columns, because the remaining category is implied when all the others are zero.
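As a small illustration of the `n-1` counting rule, the sketch below uses a made-up two-column frame (the column names and values are hypothetical, not taken from the Ames data). It compares the predicted count with the number of columns that `pandas.get_dummies` produces when `drop_first=True`.

```python
import pandas as pd

# Hypothetical toy frame: two string columns with 3 and 2 categories.
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "L", "S", "L"],
})

# Predicted number of one-hot columns: (n - 1) per categorical column.
predicted = int((toy.nunique() - 1).sum())

# drop_first=True keeps n - 1 indicator columns per feature.
encoded = pd.get_dummies(toy, drop_first=True)

print(predicted, encoded.shape[1])
```

Both numbers should agree, which is the same bookkeeping the cells below apply to the 43 string columns.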


In [38]:
mask = (data.dtypes == object)
categorial_cols = data.columns[mask]
categorial_cols

Index(['Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd',
       'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC',
       'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig',
       'LotShape', 'MSZoning', 'MasVnrType', 'MiscFeature', 'Neighborhood',
       'PavedDrive', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition',
       'SaleType', 'Street', 'Utilities'],
      dtype='object')

In [39]:
num_ohc_cols = (data[categorial_cols]
                .apply(lambda x : x.nunique())
                .sort_values(ascending=False))
num_ohc_cols

Neighborhood     25
Exterior2nd      16
Exterior1st      14
Condition1        9
SaleType          9
RoofMatl          8
HouseStyle        8
Condition2        8
Functional        7
BsmtFinType2      7
GarageType        6
Heating           6
BsmtFinType1      6
FireplaceQu       6
Foundation        6
RoofStyle         6
SaleCondition     6
MiscFeature       5
MSZoning          5
LotConfig         5
BsmtExposure      5
HeatingQC         5
BsmtQual          5
Electrical        5
BldgType          5
GarageCond        5
Fence             5
GarageQual        5
KitchenQual       4
LandContour       4
PoolQC            4
LotShape          4
ExterQual         4
MasVnrType        4
ExterCond         4
BsmtCond          4
Alley             3
PavedDrive        3
LandSlope         3
GarageFinish      3
CentralAir        2
Street            2
Utilities         2
dtype: int64

In [40]:
actual_numh_oh_cols = num_ohc_cols.loc[num_ohc_cols > 1]
actual_numh_oh_cols

Neighborhood     25
Exterior2nd      16
Exterior1st      14
Condition1        9
SaleType          9
RoofMatl          8
HouseStyle        8
Condition2        8
Functional        7
BsmtFinType2      7
GarageType        6
Heating           6
BsmtFinType1      6
FireplaceQu       6
Foundation        6
RoofStyle         6
SaleCondition     6
MiscFeature       5
MSZoning          5
LotConfig         5
BsmtExposure      5
HeatingQC         5
BsmtQual          5
Electrical        5
BldgType          5
GarageCond        5
Fence             5
GarageQual        5
KitchenQual       4
LandContour       4
PoolQC            4
LotShape          4
ExterQual         4
MasVnrType        4
ExterCond         4
BsmtCond          4
Alley             3
PavedDrive        3
LandSlope         3
GarageFinish      3
CentralAir        2
Street            2
Utilities         2
dtype: int64

In [41]:
actual_numh_oh_cols -= 1

In [42]:
actual_numh_oh_cols.sum()

215

## Question 3

Let's create a new data set where all of the above categorical features will be one-hot encoded. We can fit this data and see how it affects the results.

* Use the dataframe `.copy()` method to create a completely separate copy of the dataframe for one-hot encoding.
* On this new dataframe, one-hot encode each of the appropriate columns and add it back to the dataframe. Be sure to drop the original column.
* For the data that are not one-hot encoded, drop the columns that are string categoricals.

For the first step, numerically encoding the string categoricals, either Scikit-learn's `LabelEncoder` or `DictVectorizer` can be used. However, the former is probably easier since it doesn't require specifying a numerical value for each category, and we are going to one-hot encode all of the numerical values anyway. (Can you think of a time when `DictVectorizer` might be preferred?)
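The two-step encoding described here can be sketched on a single made-up column before applying it to the whole dataframe. The values in `zoning` are illustrative only; the calls are the same `LabelEncoder` and `OneHotEncoder` used in the lab.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical string column (values are illustrative only).
zoning = np.array(["RL", "RM", "RL", "FV"])

# Step 1: map each category to an integer code (sorted order: FV=0, RL=1, RM=2).
codes = LabelEncoder().fit_transform(zoning)

# Step 2: one-hot encode the integer codes; the encoder expects a 2-D input
# and returns a sparse matrix by default, so convert it to a dense array.
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(codes)
print(onehot)
```

Each row of `onehot` contains a single 1 in the column that matches its integer code. Note that `OneHotEncoder` in scikit-learn 0.20 and later also accepts string input directly, so the `LabelEncoder` step is a choice made by this lab rather than a requirement.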


In [43]:
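from sklearn.preprocessing import OneHotEncoder, LabelEncoder

le = LabelEncoder()
ohc = OneHotEncoder()
data_ohc = data.copy()

for col in num_ohc_cols.index:
    # Show the original string column before encoding it
    print(data_ohc[col])

    # Integer-encode the string categories
    dat = le.fit_transform(data_ohc[col]).astype(int)

    # Remove the original column from the dataframe
    data_ohc = data_ohc.drop(col, axis=1)

    # One-hot encode the integer codes; the result is a sparse matrix
    new_dat = ohc.fit_transform(dat.reshape(-1, 1))

    # Create unique column names
    n_cols = new_dat.shape[1]
    col_names = ['_'.join([col, str(x)]) for x in range(n_cols)]

    # Build a dataframe from the dense one-hot array
    new_df = pd.DataFrame(new_dat.toarray(),
                          index=data_ohc.index,
                          columns=col_names)

    # Append the new columns to the dataframe
    data_ohc = pd.concat([data_ohc, new_df], axis=1)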

0       CollgCr
1       Veenker
2       CollgCr
3       Crawfor
4       NoRidge
         ...   
1374    Gilbert
1375     NWAmes
1376    Crawfor
1377      NAmes
1378    Edwards
Name: Neighborhood, Length: 1379, dtype: object
0       VinylSd
1       MetalSd
2       VinylSd
3       Wd Shng
4       VinylSd
         ...   
1374    VinylSd
1375    Plywood
1376    CmentBd
1377    MetalSd
1378    HdBoard
Name: Exterior2nd, Length: 1379, dtype: object
0       VinylSd
1       MetalSd
2       VinylSd
3       Wd Sdng
4       VinylSd
         ...   
1374    VinylSd
1375    Plywood
1376    CemntBd
1377    MetalSd
1378    HdBoard
Name: Exterior1st, Length: 1379, dtype: object
0        Norm
1       Feedr
2        Norm
3        Norm
4        Norm
        ...  
1374     Norm
1375     Norm
1376     Norm
1377     Norm
1378     Norm
Name: Condition1, Length: 1379, dtype: object
0       WD
1       WD
2       WD
3       WD
4       WD
        ..
1374    WD
1375    WD
1376    WD
1377    WD
1378    WD
Name: Sal

In [44]:
data_ohc.head()

Unnamed: 0,Utilities_0,Utilities_1,Street_0,Street_1,CentralAir_0,CentralAir_1,GarageFinish_0,GarageFinish_1,GarageFinish_2,LandSlope_0,...,OverallQual,PoolArea,ScreenPorch,TotRmsAbvGrd,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,7,0.0,0.0,8,856.0,0.0,2003,2003,2008,208500.0
1,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,6,0.0,0.0,6,1262.0,298.0,1976,1976,2007,181500.0
2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,7,0.0,0.0,6,920.0,0.0,2001,2002,2008,223500.0
3,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,7,0.0,0.0,7,756.0,0.0,1915,1970,2006,140000.0
4,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,8,0.0,0.0,9,1145.0,192.0,2000,2000,2008,250000.0


In [45]:
data_ohc.shape[1] - data.shape[1]

215

In [46]:
data = data.drop(num_ohc_cols.index, axis=1)
print(f"new shape: {data.shape[1]}")

new shape: 37


In [47]:
data.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,BedroomAbvGr,BsmtFinSF1,BsmtFinSF2,BsmtFullBath,BsmtHalfBath,BsmtUnfSF,EnclosedPorch,...,OverallQual,PoolArea,ScreenPorch,TotRmsAbvGrd,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,856.0,854.0,0.0,3,706.0,0.0,1,0,150.0,0.0,...,7,0.0,0.0,8,856.0,0.0,2003,2003,2008,208500.0
1,1262.0,0.0,0.0,3,978.0,0.0,0,1,284.0,0.0,...,6,0.0,0.0,6,1262.0,298.0,1976,1976,2007,181500.0
2,920.0,866.0,0.0,3,486.0,0.0,1,0,434.0,0.0,...,7,0.0,0.0,6,920.0,0.0,2001,2002,2008,223500.0
3,961.0,756.0,0.0,3,216.0,0.0,1,0,540.0,272.0,...,7,0.0,0.0,7,756.0,0.0,1915,1970,2006,140000.0
4,1145.0,1053.0,0.0,4,655.0,0.0,1,0,490.0,0.0,...,8,0.0,0.0,9,1145.0,192.0,2000,2000,2008,250000.0


In [48]:
data_ohc.head()

Unnamed: 0,Utilities_0,Utilities_1,Street_0,Street_1,CentralAir_0,CentralAir_1,GarageFinish_0,GarageFinish_1,GarageFinish_2,LandSlope_0,...,OverallQual,PoolArea,ScreenPorch,TotRmsAbvGrd,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,7,0.0,0.0,8,856.0,0.0,2003,2003,2008,208500.0
1,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,6,0.0,0.0,6,1262.0,298.0,1976,1976,2007,181500.0
2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,7,0.0,0.0,6,920.0,0.0,2001,2002,2008,223500.0
3,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,7,0.0,0.0,7,756.0,0.0,1915,1970,2006,140000.0
4,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,8,0.0,0.0,9,1145.0,192.0,2000,2000,2008,250000.0


## Question 4

* Create train and test splits of both data sets. To ensure the data gets split the same way, use the same `random_state` in each of the two splits.
* For each data set, fit a basic linear regression model on the training data. 
* Calculate the mean squared error on both the train and test sets for the respective models. Which model produces smaller error on the test data and why?
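The steps above can be sketched on synthetic data first. Everything in this example (the feature names `f1`, `f2`, `f3`, the coefficients, and the noise level) is an assumption made for illustration. It shows that reusing `random_state` selects the same rows from two frames that share an index, and how train and test errors are computed from a model fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y depends linearly on two features.
rng = np.random.default_rng(0)
X_a = pd.DataFrame(rng.normal(size=(100, 2)), columns=["f1", "f2"])
X_b = X_a.assign(f3=rng.normal(size=100))  # same rows, one extra feature
y = 3 * X_a["f1"] - 2 * X_a["f2"] + rng.normal(scale=0.1, size=100)

# The same random_state and test_size select the same rows from both frames.
Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X_a, y, test_size=0.3, random_state=42)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X_b, y, test_size=0.3, random_state=42)
same_rows = (Xa_tr.index == Xb_tr.index).all()

# Fit on the training split only, then score both splits.
model = LinearRegression().fit(Xa_tr, ya_tr)
mse_train = mean_squared_error(ya_tr, model.predict(Xa_tr))
mse_test = mean_squared_error(ya_te, model.predict(Xa_te))
print(same_rows, mse_train, mse_test)
```

On well-behaved data like this, the train and test errors should be of similar size; a large gap between them is the signal to look for in the Ames results.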


In [49]:
test_size = 0.3
random_state = 42
y_col = 'SalePrice'

In [50]:
feature_cols = [x for x in data.columns if x != y_col]
x_data = data[feature_cols]
y_data = data[y_col]

In [51]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_size, random_state=random_state)

In [52]:
one_hot_feature_cols = [x for x in data_ohc.columns if x != y_col]
x_data_ohc = data_ohc[one_hot_feature_cols]
y_data_ohc = data_ohc[y_col]

In [53]:
x_train_ohc, x_test_ohc, y_train_ohc, y_test_ohc = train_test_split(x_data_ohc, y_data_ohc, test_size=test_size, random_state=random_state) 

In [54]:
x_train

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,BedroomAbvGr,BsmtFinSF1,BsmtFinSF2,BsmtFullBath,BsmtHalfBath,BsmtUnfSF,EnclosedPorch,...,OverallCond,OverallQual,PoolArea,ScreenPorch,TotRmsAbvGrd,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
461,630.0,0.0,0.0,1,515.0,0.0,1,0,115.0,0.0,...,8,4,0.0,0.0,3,630.0,0.0,1970,2002,2009
976,845.0,0.0,0.0,3,0.0,0.0,0,0,0.0,0.0,...,3,4,0.0,0.0,5,0.0,186.0,1957,1957,2009
1128,728.0,728.0,0.0,3,0.0,0.0,0,0,728.0,0.0,...,5,6,0.0,0.0,8,728.0,100.0,2005,2005,2008
904,561.0,668.0,0.0,2,285.0,0.0,0,0,276.0,0.0,...,6,6,0.0,0.0,5,561.0,150.0,1980,1980,2009
506,1601.0,0.0,0.0,3,1358.0,0.0,1,0,223.0,0.0,...,5,8,0.0,0.0,6,1581.0,180.0,2001,2002,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,855.0,601.0,0.0,3,311.0,0.0,0,0,544.0,0.0,...,5,6,0.0,0.0,7,855.0,26.0,1978,1978,2010
1130,815.0,875.0,0.0,3,0.0,0.0,0,0,815.0,330.0,...,6,7,0.0,0.0,7,815.0,0.0,1916,1950,2006
1294,1661.0,0.0,0.0,3,831.0,0.0,1,0,161.0,0.0,...,6,6,0.0,178.0,8,992.0,0.0,1955,1996,2008
860,742.0,742.0,0.0,3,0.0,0.0,0,0,742.0,0.0,...,5,6,0.0,0.0,8,742.0,36.0,2005,2005,2009


In [55]:
x_train_ohc

Unnamed: 0,Utilities_0,Utilities_1,Street_0,Street_1,CentralAir_0,CentralAir_1,GarageFinish_0,GarageFinish_1,GarageFinish_2,LandSlope_0,...,OverallCond,OverallQual,PoolArea,ScreenPorch,TotRmsAbvGrd,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
461,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,8,4,0.0,0.0,3,630.0,0.0,1970,2002,2009
976,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,...,3,4,0.0,0.0,5,0.0,186.0,1957,1957,2009
1128,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,5,6,0.0,0.0,8,728.0,100.0,2005,2005,2008
904,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,6,6,0.0,0.0,5,561.0,150.0,1980,1980,2009
506,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,5,8,0.0,0.0,6,1581.0,180.0,2001,2002,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,5,6,0.0,0.0,7,855.0,26.0,1978,1978,2010
1130,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,6,7,0.0,0.0,7,815.0,0.0,1916,1950,2006
1294,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,6,6,0.0,178.0,8,992.0,0.0,1955,1996,2008
860,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,5,6,0.0,0.0,8,742.0,36.0,2005,2005,2009


In [56]:
(x_train_ohc.index == x_train.index).all()

True

In [57]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
error_df = list()

In [58]:
lr = lr.fit(x_train, y_train)
y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

In [59]:
error_df.append(pd.Series({
    'train': mean_squared_error(y_train, y_train_pred),
    'test': mean_squared_error(y_test, y_test_pred)},
                          name='no enc'))
error_df

[train    1.131507e+09
 test     1.372182e+09
 Name: no enc, dtype: float64]

In [60]:
lr = lr.fit(x_train_ohc, y_train_ohc)
y_train_pred_ohc = lr.predict(x_train_ohc)
y_test_pred_ohc = lr.predict(x_test_ohc)

In [61]:
y_test_pred_ohc

array([ 3.35766908e+05,  1.30701323e+05,  7.72701962e+04,  2.13621595e+05,
        1.87012150e+05,  6.97614704e+04,  2.17674681e+05,  2.01959640e+05,
        2.76231610e+09,  2.12377620e+05,  1.01046207e+05,  3.11242595e+05,
        1.05471076e+05,  3.47038652e+05,  1.58077467e+05,  1.43645895e+05,
        2.30907234e+05,  1.44310125e+05,  2.23649941e+05,  2.81810742e+05,
        1.21974403e+05,  1.49029566e+05,  1.46587914e+05,  2.48438769e+05,
        4.31507548e+05, -1.61950348e+07,  1.40525258e+05,  1.12218316e+05,
        3.25554052e+05,  1.47249031e+05,  2.54768366e+05, -8.46093530e+12,
        1.94191971e+05,  1.20625101e+05,  2.17101886e+05,  4.36921745e+05,
        9.29436553e+04,  1.10307193e+05,  2.26932089e+05,  1.83454636e+05,
        1.56857688e+05,  1.29990592e+05,  1.75709065e+05,  4.25479270e+05,
        1.56980370e+05,  2.71882599e+05,  9.12212529e+04,  2.65304508e+05,
        1.08923522e+05,  3.37926623e+05,  1.52221264e+05,  3.08552887e+04,
        1.20983131e+05,  

In [62]:
error_df.append(pd.Series({
    'train': mean_squared_error(y_train_ohc, y_train_pred_ohc),
    'test': mean_squared_error(y_test_ohc, y_test_pred_ohc)},
                          name="one-hot enc"))
error_df

[train    1.131507e+09
 test     1.372182e+09
 Name: no enc, dtype: float64,
 train    3.177294e+08
 test     1.729165e+23
 Name: one-hot enc, dtype: float64]

In [63]:
error_df = pd.concat(error_df, axis=1)
error_df

Unnamed: 0,no enc,one-hot enc
train,1131507000.0,317729400.0
test,1372182000.0,1.729165e+23
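The one-hot encoded model has the lower training error but a test error many orders of magnitude larger, which is the pattern of overfitting. The notebook does not state the cause. One plausible contributor, offered here as an assumption, is that the encoding keeps all `n` indicator columns per feature, so the design matrix is rank deficient once an intercept is included and the fitted coefficients are not uniquely determined. The sketch below checks this on a hypothetical toy column.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column, for illustration only.
toy = pd.DataFrame({"kind": ["a", "b", "c", "a", "b", "c"]})

# Full encoding: one indicator per category (n columns).
full = pd.get_dummies(toy, dtype=float)
# Reduced encoding: drop one category (n - 1 columns).
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)

def with_intercept(frame):
    """Prepend a column of ones, as a model with an intercept implicitly does."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

# The indicators of the full encoding sum to the intercept column,
# so that design matrix has fewer independent columns than total columns.
rank_full = np.linalg.matrix_rank(with_intercept(full))
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))
print(with_intercept(full).shape[1], rank_full)        # 4 columns, rank 3
print(with_intercept(reduced).shape[1], rank_reduced)  # 3 columns, rank 3
```

If this is indeed part of the problem, dropping one indicator per feature (for example with `OneHotEncoder(drop='first')`, available in scikit-learn 0.21 and later) would remove that particular redundancy. Categories that are rare in the training split may also contribute, and this sketch does not test for that.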
