# Guided Project: Predicting House Sale Prices

## Introduction

The goal of this project is to implement a linear regression model to predict house sale prices. To do that, we will create a pipeline of functions that will let us quickly iterate on different models.

These functions will be the following:
* `transform_features()` : to perform feature engineering
* `select_features()` : to select the appropriate features
* `train_and_test()` : to create, train and test the linear regression model 

## Libraries and Modules Import

In [203]:
import pandas as pd
pd.options.display.max_columns = 999
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## The Data

The data set we will be using to train and test our models is a collection of housing data for the city of Ames, Iowa, United States from 2006 to 2010.

More information on why the data were collected can be found [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627). Information about the different columns in the data can be found [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

Let's read the data set in!

In [204]:
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')

And print the head of the dataframe.

In [205]:
df.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,GasA,Ex,Y,SBrkr,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900


Let's now move on to feature engineering. This is a crucial step in preparing the data set before it is used to train our linear regression model.

## Feature Engineering

In this section, we will define some feature engineering tasks that we will later include in our `
transform_features()` function
* removing features with many missing values
* diving deeper into potential categorical features
* transforming text and numerical columns.

### Missing Values In All Columns

We start by dropping any column, regardless of its type, that contains 5% or more of missing values.

In [206]:
missing_cols = df.isnull().sum().sort_values(ascending=True)

cols_drop = missing_cols[missing_cols > (len(df)*0.05)]

print(cols_drop)

df = df.drop(cols_drop.index, axis=1)

Garage Type       157
Garage Finish     159
Garage Cond       159
Garage Yr Blt     159
Garage Qual       159
Lot Frontage      490
Fireplace Qu     1422
Fence            2358
Alley            2732
Misc Feature     2824
Pool QC          2917
dtype: int64


### Missing Values In Text Columns

We then drop any text column that contains 1 or more missing values.

In [207]:
text_df = df.select_dtypes(include=['object'])
missing_text_cols = text_df.isnull().sum().sort_values(ascending=True)
text_cols_drop = missing_text_cols[missing_text_cols > 0]

print(text_cols_drop)

df = df.drop(text_cols_drop.index, axis=1)

Electrical         1
Mas Vnr Type      23
Bsmt Cond         80
Bsmt Qual         80
BsmtFin Type 1    80
BsmtFin Type 2    81
Bsmt Exposure     83
dtype: int64


### Replacing Missing Values In Numerical Columns

For numerical columns with missing values, we fill in the missing values with the most common value in that column.

In [208]:
numerical_df = df.select_dtypes(include=['integer', 'float'])
missing_num_cols = numerical_df.isnull().sum().sort_values(ascending=True)
numerical_cols_fix = missing_num_cols[(missing_num_cols < len(df)/20) & (missing_num_cols > 0)]
numerical_cols_fix

Garage Area        1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
BsmtFin SF 1       1
Garage Cars        1
Bsmt Full Bath     2
Bsmt Half Bath     2
Mas Vnr Area      23
dtype: int64

And we calculate the mode of each of these columns for replacement.

In [209]:
replacement_values_dict = df[numerical_cols_fix.index].mode().to_dict(orient='record')[0]
print(replacement_values_dict)
df = df.fillna(replacement_values_dict)

{'Bsmt Unf SF': 0.0, 'Garage Area': 0.0, 'BsmtFin SF 2': 0.0, 'BsmtFin SF 1': 0.0, 'Total Bsmt SF': 0.0, 'Garage Cars': 2.0, 'Bsmt Half Bath': 0.0, 'Bsmt Full Bath': 0.0, 'Mas Vnr Area': 0.0}


We finally verify that each column has 0 missing values left.

In [210]:
df.isnull().sum().value_counts()

0    64
dtype: int64

We are now done dealing with missing values. We can now look at creating new features by combining others features.

### Features Creation

Combining existing features to create new ones is sometimes necessary to better capture information and gain in relevance.

In our data set, we spot three columns that can be combined to gain information:
* `Yr Sold`
* `Year Built`
* `Year Remod/Add`

We start by substracting `Year Built` to `Yr Sold` to create a new column that we name `Years Before Sale`.

In [211]:
df['Years Before Sale'] = df['Yr Sold'] - df['Year Built']

And we then substract `Yr Sold` to `Year Remod/Add` columns to create a new column named `Years Since Remod`. The number of years passed since the last remodelling of the house could be a good predictor of the house sale price.

In [212]:
df['Years Since Remod'] = df['Yr Sold'] - df['Year Remod/Add']

When substracting years, we want to ensure that we don't end up with some negative values that would make no temporal sense.

In [213]:
print(df['Years Before Sale'][df['Years Before Sale'] < 0])
print(df['Years Since Remod'][df['Years Since Remod'] < 0])

2180   -1
Name: Years Before Sale, dtype: int64
1702   -1
2180   -2
2181   -1
Name: Years Since Remod, dtype: int64


As we can see, we have 3 incoherent data points (one is shared across the two columns) over these two new features. We decide to drop them.

In [214]:
df = df.drop([1702, 2180, 2181], axis=0)

We can finally also drop the original columns.

In [215]:
df = df.drop(["Year Built", "Year Remod/Add"], axis = 1)

In [216]:
## Drop columns that aren't useful for ML
df = df.drop(["PID", "Order"], axis=1)

## Drop columns that leak info about the final sale
df = df.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)

We are now done with our feature engineering and we can group all the tasks we came up with in our `transform_features()` function.

In [217]:
def transform_features(df):
    # Drop any column that contains 5% or more of missing values
    missing_cols = df.isnull().sum().sort_values(ascending=True)
    cols_drop = missing_cols[missing_cols > (len(df)*0.05)]
    df = df.drop(cols_drop.index, axis=1)

    # Drop any text column that contains 1 or more missing values
    text_df = df.select_dtypes(include=['object'])
    missing_text_cols = text_df.isnull().sum().sort_values(ascending=True)
    text_cols_drop = missing_text_cols[missing_text_cols > 0]
    df = df.drop(text_cols_drop.index, axis=1)

    # Replace any missing value in numerical columns by their most common value
    numerical_df = df.select_dtypes(include=['integer', 'float'])
    missing_num_cols = numerical_df.isnull().sum().sort_values(ascending=True)
    numerical_cols_fix = missing_num_cols[(missing_num_cols < len(df)/20) & (missing_num_cols > 0)]
    replacement_values_dict = df[numerical_cols_fix.index].mode().to_dict(orient='record')[0]
    df = df.fillna(replacement_values_dict)

    # Create new features from existing ones
    years_sold = df['Yr Sold'] - df['Year Built']
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod

    # Drop rows with negative values (negative number of years) for both of these new features
    df = df.drop([1702, 2180, 2181], axis=0)

    # Drop original features
    df = df.drop(["Year Built", "Year Remod/Add"], axis=1)
    
    # Drop columns that aren't useful for ML
    df = df.drop(["PID", "Order"], axis=1)

    # Drop columns that leak info about the final sale
    df = df.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    
    return df

Let's now move on to features selection and to the creation of the `select_features()` function.

## Features Selection

As for the previous section, we will here come up with acto



### Selection of Numerical Columns

We want to only keep the most relevant numerical columns. We start by creating a new dataframe called `numerical_df` that contains only the numerical columns from our original dataframe.

In [218]:
numerical_df = df.select_dtypes(include=['integer','float'])
numerical_df.head()

Unnamed: 0,MS SubClass,Lot Area,Overall Qual,Overall Cond,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,SalePrice,Years Before Sale,Years Since Remod
0,20,31770,6,5,112.0,639.0,0.0,441.0,1080.0,1656,0,0,1656,1.0,0.0,1,0,3,1,7,2,2.0,528.0,210,62,0,0,0,0,0,215000,50,50
1,20,11622,5,6,0.0,468.0,144.0,270.0,882.0,896,0,0,896,0.0,0.0,1,0,2,1,5,0,1.0,730.0,140,0,0,0,120,0,0,105000,49,49
2,20,14267,6,6,108.0,923.0,0.0,406.0,1329.0,1329,0,0,1329,0.0,0.0,1,1,3,1,6,0,1.0,312.0,393,36,0,0,0,0,12500,172000,52,52
3,20,11160,7,5,0.0,1065.0,0.0,1045.0,2110.0,2110,0,0,2110,1.0,0.0,2,1,3,1,8,2,2.0,522.0,0,0,0,0,0,0,0,244000,42,42
4,60,13830,5,5,0.0,791.0,0.0,137.0,928.0,928,701,0,1629,0.0,0.0,2,1,3,1,6,1,2.0,482.0,212,34,0,0,0,0,0,189900,13,12


From numerical features, we only want to keep those who correlate well with our target feature `SalePrice`.
We calculate the absolute correlation coefficient between each numerical column and `SalePrice` and rank them by ascending order.

In [219]:
abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
abs_corr_coeffs

BsmtFin SF 2         0.006127
Misc Val             0.019273
3Ssn Porch           0.032268
Bsmt Half Bath       0.035875
Low Qual Fin SF      0.037629
Pool Area            0.068438
MS SubClass          0.085128
Overall Cond         0.101540
Screen Porch         0.112280
Kitchen AbvGr        0.119760
Enclosed Porch       0.128685
Bedroom AbvGr        0.143916
Bsmt Unf SF          0.182751
Lot Area             0.267520
2nd Flr SF           0.269601
Bsmt Full Bath       0.276258
Half Bath            0.284871
Open Porch SF        0.316262
Wood Deck SF         0.328183
BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: Sale

We decide to only keep columns with a correlation coefficient larger than 0.4. This is an arbitrary decision and we may experiment to include or drop some more columns later on. 

In [220]:
abs_corr_coeffs[abs_corr_coeffs > 0.4]

BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: SalePrice, dtype: float64

We are left 14 numerical columns. The others are dropped.

In [221]:
df = df.drop(abs_corr_coeffs[abs_corr_coeffs > 0.4].index, axis=1)

### Categorical Columns

Next, we dedice to keep only a selection of the categorical features. From the data set documentation, we come up with the following list.

In [222]:
categorical_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
len(categorical_features)

24

Some of these 24 columns may have already been dropped at an earlier stage of the workflow. We check for their occurence in the current state of the data set.

In [223]:
remaining_cat_cols = []
for col in categorical_features:
    if col in df.columns:
        remaining_cat_cols.append(col)
print(remaining_cat_cols)
print(len(remaining_cat_cols))

['MS SubClass', 'MS Zoning', 'Street', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Foundation', 'Heating', 'Central Air']
17


Seven of these columns have already been dropped.

As we did for numerical columns previously, we want to only keep the categorical features that are the most relevant and that will be the best at predicting the houses sale price.

A good indication of a categorical feature relevance is the number of unique values it contains. Two scenarios need to be considered particularly:
* a categorical column has hundreds of unique values (or categories) - When we dummy code this column, hundreds of columns will need to be added back to the data frame.
* a categorical column only has a few unique values but a high percentage of them  belong to the same category - This would be similar to a low variance numerical feature (no variability in the data for the model to capture).

Let's count the number of unique values for our categorical features.

In [224]:
## How many unique values in each categorical column?
uniqueness_counts = df[remaining_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
uniqueness_counts

Central Air      2
Street           2
Land Contour     4
Lot Config       5
Bldg Type        5
Heating          6
Roof Style       6
Foundation       6
MS Zoning        7
Condition 2      8
House Style      8
Roof Matl        8
Condition 1      9
Exterior 1st    16
MS SubClass     16
Exterior 2nd    17
Neighborhood    28
dtype: int64

We decide to go with an arbitrary cutoff of 10 unique values.

In [225]:
nonuniq_cols = uniqueness_counts[uniqueness_counts > 10].index
df = df.drop(nonuniq_cols, axis=1)

Then, we convert the remaining text columns to categorical.

In [226]:
text_cols = df.select_dtypes(include=['object'])
for col in text_cols:
    df[col] = df[col].astype('category')

Finally, we dummy code these categorical columns and drop the originals.

In [227]:
df = pd.concat([
    df, 
    pd.get_dummies(df.select_dtypes(include=['category']))
], axis=1).drop(text_cols,axis=1)

All the code we wrote for features selection in copied in the `select_features()` function.

In [228]:
def select_features(df, coeff_thresh=0.4, unique_thresh=10):
    ## Selection of the best correlating numerical columns
    numerical_df = df.select_dtypes(include=['integer','float'])
    abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
    df = df.drop(abs_corr_coeffs[abs_corr_coeffs < coeff_thresh].index, axis=1)
    
    ## Selection of the most promising categorical features
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    # Which of the categorical features from the above list are still in df?
    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)
    # Dropping features with number of unique values superior to threshold
    uniqueness_counts = df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
    nonuniq_cols = uniqueness_counts[uniqueness_counts > unique_thresh].index
    df = df.drop(nonuniq_cols, axis=1)
    # Conversion of text columns to category type
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        df[col] = df[col].astype('category')
    # Concatenation of dummies with original df
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(text_cols,axis=1)
    
    return df

In the next and final section, we create the `train_and_test()` function that instantiate, trains and evaluates a linear regression model.

## Training and Testing the model

As mentioned in the introduction, we want to use a linear regression model to predict houses sale price.

In this final section, we create a function named `train_and_test()` that:
* Splits the data into train and test sets
* Instantiate and trains a linear regression model
* Makes predictions on the test set
* Evaluates the model by calculating the RMSE
* Perform cross-validation
* Return the RMSE

The function will then need to accept two parameters as input: the dataframe and a parameter named `k` that controls the type of cross validation that occurs.

Let's build this function!

In [231]:
def train_and_test(df, k=0):
    numerical_df = df.select_dtypes(include=['integer','float'])
    features = numerical_df.columns.drop('SalePrice')   
    lr = LinearRegression()
    
    # Holdout validation
    if k == 0:
        train = numerical_df[:1460]
        test = numerical_df[1460:]

        lr.fit(train[features], train['SalePrice'])
        predictions = lr.predict(test[features])
        rmse_one = np.sqrt(mean_squared_error(test['SalePrice'], predictions))
        
        return rmse
    
    # Simple cross validation
    elif k == 1:
        # Randomize all rows (frac=1) from df
        shuffled_df = df.sample(frac=1, )
        train = shuffled_df[:1460]
        test = shuffled_df[1460:]
        
        lr.fit(train[features], train['SalePrice'])
        predictions = lr.predict(test[features])
        rmse_one = np.sqrt(mean_squared_error(test['SalePrice'], predictions))
    
        lr.fit(test[features], test['SalePrice'])
        predictions = lr.predict(train[features])
        rmse_two = np.sqrt(mean_squared_error(train['SalePrice'], predictions))

        avg_rmse = np.mean([rmse_one, rmse_two])
        
        return avg_rmse
    
    # k-fold validation
    else:
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train['SalePrice'])
            predictions = lr.predict(test[features])
            rmse = np.sqrt(mean_squared_error(test['SalePrice'], predictions))
            rmse_values.append(rmse)
            avg_rmse = np.mean(rmse_values)
    
    return avg_rmse

Now that we have the three functions of our pipeline, we can quickly iterate on different models and compare their performances at prediction.

In [234]:
df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
df = transform_features(df)
df = select_features(df, coeff_thresh=0.4, unique_thresh=10)
rmse = train_and_test(df, k=4)

rmse

29018.810132016784