In [158]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

from math import sqrt

import pandas as pd
import numpy as np
import os
from pathlib import Path
from IPython.display import display, FileLink
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# House prices predictions: linear model

This is a very simple introduction to Linear Regression with the Housing Prices Kaggle dataset. It aims to describe a very typical set of feature engineer steps for linear models, which lots of room to improve.

The high-level overview is as follows:

1. Download dataset.
2. Analyse and prepare dataset.
3. Train model.
4. Evaluate and submit results.

## 1. Download dataset

The dataset can be downloaded using the Kaggle CLI tool. You can find installation instructions in the [kaggle-api)](https://github.com/Kaggle/kaggle-api) repo. You'll also need to accept the terms and conditions of the project, which I've already done.

I can download it as follows:

In [2]:
PATH = Path('./data')
PATH.mkdir(exist_ok=True)

In [5]:
!kaggle competitions download -c house-prices-advanced-regression-techniques -p {PATH}

data_description.txt: Downloaded 13KB of 13KB to data
train.csv.gz: Downloaded 89KB of 89KB to data
train.csv: Downloaded 450KB of 450KB to data
test.csv.gz: Downloaded 82KB of 82KB to data
test.csv: Downloaded 441KB of 441KB to data
sample_submission.csv.gz: Downloaded 15KB of 15KB to data
sample_submission.csv: Downloaded 31KB of 31KB to data


## 2. Analyse and prepare dataset

My goal here is to do the bare minimum EDA to get started with the competition. Let's start by loading and examining the dataset.

I use the transpose Pandas method to convert columns into rows, to make it easier to see everything.

In [212]:
df_raw = pd.read_csv(PATH / 'train.csv')

In [213]:
with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
    display(df_raw.head().transpose())

Unnamed: 0,0,1,2,3,4
Id,1,2,3,4,5
MSSubClass,60,20,60,70,60
MSZoning,RL,RL,RL,RL,RL
LotFrontage,65,80,68,60,84
LotArea,8450,9600,11250,9550,14260
Street,Pave,Pave,Pave,Pave,Pave
Alley,,,,,
LotShape,Reg,Reg,IR1,IR1,IR1
LandContour,Lvl,Lvl,Lvl,Lvl,Lvl
Utilities,AllPub,AllPub,AllPub,AllPub,AllPub


### Extract and prepare target values

We know the target variable is going to be SalePrice, so I'll pop that off the data frame. The output expects the log of the predicted values, so we'll calculate that here too.

In [214]:
sale_price = df_raw.pop('SalePrice')
sale_price_log = np.log(sale_price)

In [215]:
sale_price_log.head()

0    12.247694
1    12.109011
2    12.317167
3    11.849398
4    12.429216
Name: SalePrice, dtype: float64

### Prepare columns

Since all inputs you pass into a machine learning model need to be numbers, we're going to need to make some decisions about how we prepare each column. With this dataset, there are really 3 types of column we need to consider:

* Continous columns: These are columns where each output is a unique number over a "continuous space", examples are lot area and lot frontage where each house will have a unique value. There are other continuous variables that may not be unique to each house, but nonetheless we can't categorise like year sold.
* Ordered categorical columns: These are columns that have small number of unique values but where the order matters. Good examples of these are `OverallQual`, which is a rating from 10 to 1.
* Unordered categorical columns: These are columns that are continuous but don't have any obvious ordering. `RoofStyle` might be a good example of this, though experienced property developers might be aware of some implicity ordering.

We'll start by removing the columns that aren't going to be useful for our model, like id which has nothing for the model to learn.

In [216]:
house_ids = df_raw.pop('Id')

We can easily find all continuous variables by collecting all columns that have a unique count above some threshold. 50 is a good place to start.

In [217]:
MAX_N_UNIQUE = 50

continuous_columns = set([
    col_name for col_name, col in df_raw.items()
    if len(col.unique()) > MAX_N_UNIQUE])

In [218]:
continuous_columns

{'1stFlrSF',
 '2ndFlrSF',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'EnclosedPorch',
 'GarageArea',
 'GarageYrBlt',
 'GrLivArea',
 'LotArea',
 'LotFrontage',
 'MasVnrArea',
 'OpenPorchSF',
 'ScreenPorch',
 'TotalBsmtSF',
 'WoodDeckSF',
 'YearBuilt',
 'YearRemodAdd'}

I can find a couple more columns that should be continuous by eyeing the description, so I'll add those in.

In [219]:
continuous_columns = list(continuous_columns | set([
    'LowQualFinSF', 'BsmtHalfBath', 'BsmtFullBath', 'FullBath',
    'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', '3SsnPorch',
    'PoolArea', 'MiscVal', 'YrSold', 'Fireplaces']))

The columns are all square feet measurements or years - sounds about right.

We can then treat all the remaining columns as categorical.

In [220]:
categorical_columns = [col for col in df_raw.columns if col not in continuous_columns]

In [221]:
assert len(df_raw.columns) == len(categorical_columns + continuous_columns)

### Prepare categorical

We prepare categorical variables by converting each using Panda's `astype('category')`:

In [222]:
for col_name, col in df_raw[categorical_columns].items():
    df_raw[col_name] = col.astype('category').cat.as_ordered()

We'll manually fix the ordering of any categories that needs it:

In [223]:
# Quality measures (Excellent, Good, Average, Fair, Poor)
df_raw.ExterQual.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.ExterCond.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.BsmtQual.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.BsmtExposure.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.BsmtFinType1.cat.set_categories(['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf'], ordered=True, inplace=True)
df_raw.BsmtFinType2.cat.set_categories(['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf'], ordered=True, inplace=True)
df_raw.HeatingQC.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.KitchenQual.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.FireplaceQu.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.GarageFinish.cat.set_categories(['Fin', 'Rfn', 'Unf'], ordered=True, inplace=True)
df_raw.GarageQual.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.GarageCond.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True, inplace=True)
df_raw.PoolQC.cat.set_categories(['Ex', 'Gd', 'TA', 'Fa'], ordered=True, inplace=True)

### Prepare continuous

The first thing we'll want to do is replace all NaNs with some value. It's common to just use the columns mean or median. We can also add a column `isna` that will tell the model whether the value was originally missing or not.

We'll also want to save the median values to be used to replace nas in the test set.

In [233]:
nas = {}

for col in continuous_columns:
    if not pd.isna(df_raw[col]).sum():
        continue
        
    median = df_raw[col].median()
        
    df_raw[f'{col}_na'] = pd.isna(df_raw[col])
    df_raw[col] = df_raw[col].fillna(median)
    
    nas[col] = median

In [140]:
nas

{'MasVnrArea': 0.0, 'LotFrontage': 69.0, 'GarageYrBlt': 1980.0}

Lastly, we'll scale any variable you pass into machine learning models so that the data has a mean of 0 and a unit variance. That can easily be achieved by subtracting the mean from each value in the column and dividing by the standard deviation. However, scikit learn has a tool that makes this quite easy.

In [141]:
df_raw['LotArea'].head()

0     8450
1     9600
2    11250
3     9550
4    14260
Name: LotArea, dtype: int64

In [142]:
scaler = StandardScaler()
df_raw[continuous_columns] = scaler.fit_transform(df_raw[continuous_columns])

In [143]:
df_raw['LotArea'].head()

0   -0.207142
1   -0.091886
2    0.073480
3   -0.096897
4    0.375148
Name: LotArea, dtype: float64

### Numericalise

Last thing we want to do is convert all categories into their numeric representation.

In [149]:
df_numeric = df_raw.copy()

for col_name in categorical_columns:
    # Use +1 to push the -1 NaN value to 0
    df_numeric[col_name] = df_numeric[col_name].cat.codes + 1

In [150]:
df_numeric.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,MasVnrArea_na,LotFrontage_na,GarageYrBlt_na
0,6,4,-0.220875,-0.207142,2,0,4,4,1,5,...,0,0,-0.087688,2,0.138777,9,5,False,False,False
1,1,4,0.46032,-0.091886,2,0,4,4,1,3,...,0,0,-0.087688,5,-0.614439,9,5,False,False,False
2,6,4,-0.084636,0.07348,2,0,1,4,1,5,...,0,0,-0.087688,9,0.138777,9,5,False,False,False
3,7,4,-0.44794,-0.096897,2,0,1,4,1,1,...,0,0,-0.087688,2,-1.367655,9,1,False,False,False
4,6,4,0.641972,0.375148,2,0,1,4,1,3,...,0,0,-0.087688,12,0.138777,9,5,False,False,False


### Create validation set

We want to create a validation set that's as similar as possible to the test set provided by Kaggle.

We can use the `train_test_split` function to do that.

In [182]:
X_train, X_val, y_train, y_val = train_test_split(df_numeric.values, sale_price_log, test_size=0.2, random_state=42)

## 3. Train evaluate model

Now comes the easy part. We can train the model in 2 lines:

In [183]:
model = LinearRegression()

In [184]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Basic evaluation

In [187]:
print(model.score(X_train, y_train))
print(model.score(X_val, y_val))

0.8887167972411366
0.8735851718883093


In [188]:
preds = model.predict(X_val)

In [189]:
rms = sqrt(((y_val - preds) ** 2).mean())
print(f'RMSE: {rms}')

RMSE: 0.15359269459393404


* A RMSE of 0.153 would put us somewhere toward the top 50%. Not amazing but a good start.

## 4. Submit predictions

We now need to prepare the test set using exactly the same preparation we used with the training data.

In [242]:
df_test_raw = pd.read_csv(PATH / 'test.csv')

In [243]:
house_ids = df_test_raw.pop('Id')

In [244]:
for col_name in categorical_columns:
    df_test_raw[col_name] = (
        pd.Categorical(df_test_raw[col_name], categories=df_raw[col_name].cat.categories, ordered=True))

In [245]:
for col in continuous_columns:
    if col not in nas:
        continue

    df_test_raw[f'{col}_na'] = pd.isna(df_test_raw[col])
    df_test_raw[col] = df_test_raw[col].fillna(nas[col])

In [246]:
# Handle any other nas
df_test_raw[continuous_columns] = df_test_raw[continuous_columns].fillna(df_test_raw[continuous_columns].median())

In [247]:
df_test_raw[continuous_columns] = scaler.transform(df_test_raw[continuous_columns])

In [248]:
df_test = df_test_raw.copy()

for col_name in categorical_columns:
    # Use +1 to push the -1 NaN value to 0
    df_test[col_name] = df_test[col_name].cat.codes + 1

In [249]:
test_preds = model.predict(df_test)

In [252]:
pd.DataFrame({'Id': house_ids, 'SalePrice': np.exp(test_preds)}).to_csv(f'{PATH}/sub.csv', index=False)

In [253]:
FileLink(PATH / 'sub.csv')

<img src="./images/submission-2.png">