# Predicting house prices using a Linear Regressor model
The goal of this project is to fit a linear regression model to a dataset of house sales and predict the sale price (the `SalePrice` column).

In [135]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

## Data read-in
A description of all the columns of the dataset can be found [here](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt).

In [136]:
data = pd.read_table("AmesHousing.tsv")
data.head()

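| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |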


## Data Engineering
In the first step of data engineering, features with over 50% null values are dropped outright.

In [137]:
null_nums = data.isnull().sum()
expunge = null_nums[null_nums > data.shape[0]*.5]
clean_data = data.drop(expunge.index, axis = 1)

Next, let's deal with any null values in the numeric columns.

In [138]:
numerics = clean_data.select_dtypes(include=["int64","float64"])
missing_count = numerics.isnull().sum()
missing = missing_count[missing_count > 0]
missing

Lot Frontage      490
Mas Vnr Area       23
BsmtFin SF 1        1
BsmtFin SF 2        1
Bsmt Unf SF         1
Total Bsmt SF       1
Bsmt Full Bath      2
Bsmt Half Bath      2
Garage Yr Blt     159
Garage Cars         1
Garage Area         1
dtype: int64

Many listings are missing the year in which the garage was built, most likely because those houses have no garage at all. The null values still have to be removed before modelling, so we fill them in with the mode of the column. We will try to convey the absence of a garage through an appropriate categorical column later on.

With that done, let's display the mode for all other columns with null values and decide whether to use these to fill in what's missing.

In [139]:
# Fill in 'Garage Yr Blt' missing values
clean_data["Garage Yr Blt"] = clean_data["Garage Yr Blt"].fillna(clean_data["Garage Yr Blt"].mode().iloc[0])
# Visualise mode of other columns with missing values
numerics = clean_data.select_dtypes(include=["int64","float64"])
missing_count = numerics.isnull().sum()
missing = missing_count[missing_count > 0]
numerics[missing.index].mode()

| | Lot Frontage | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Bsmt Full Bath | Bsmt Half Bath | Garage Cars | Garage Area |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |


Observations:
- Lot frontage indicates the length of street connected to property (in linear feet). This should reasonably be non-zero in most cases, at least assuming that all houses are connected to a street.
- Many of these columns have mode equal to zero, which most probably indicates the absence of the corresponding area.

In general we find no obvious reasons not to choose the mode to fill in the missing values. Let's do so.

In [140]:
clean_data[missing.index] = clean_data[missing.index].fillna(clean_data[missing.index].mode().iloc[0])
clean_data[missing.index].isnull().sum()

Lot Frontage      0
Mas Vnr Area      0
BsmtFin SF 1      0
BsmtFin SF 2      0
Bsmt Unf SF       0
Total Bsmt SF     0
Bsmt Full Bath    0
Bsmt Half Bath    0
Garage Cars       0
Garage Area       0
dtype: int64
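
To make the mode-filling step concrete, here is a small sketch on invented data (the column names `frontage` and `bsmt_sf` are placeholders). It relies on `DataFrame.mode()` returning a frame whose first row holds the first mode of each column, which is why the notebook takes `.iloc[0]`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "frontage": [60.0, 60.0, 80.0, np.nan],
    "bsmt_sf": [0.0, 0.0, np.nan, 500.0],
})

# mode() ignores NaN by default; ties would add extra rows, so keep row 0 only
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with its own mode
filled = toy.fillna(modes)
print(filled)
```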

In [141]:
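# Min-max rescale every numeric column except the target to [0, 1]
# (this runs before the year differences below, which are therefore computed on rescaled values)
rescale_cols = numerics.columns.drop("SalePrice")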
clean_data[rescale_cols] = (clean_data[rescale_cols]-clean_data[rescale_cols].min())/(clean_data[rescale_cols].max()-clean_data[rescale_cols].min())
clean_data[rescale_cols].head()

| | Order | PID | MS SubClass | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | ... | Garage Area | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Mo Sold | Yr Sold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.410959 | 0.14242 | 0.555556 | 0.5 | 0.637681 | 0.166667 | 0.07 | ... | 0.354839 | 0.147472 | 0.083558 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.363636 | 1.0 |
| 1 | 0.000341 | 0.000102 | 0.0 | 0.202055 | 0.048246 | 0.444444 | 0.625 | 0.644928 | 0.183333 | 0.0 | ... | 0.490591 | 0.098315 | 0.0 | 0.0 | 0.0 | 0.208333 | 0.0 | 0.0 | 0.454545 | 1.0 |
| 2 | 0.000683 | 0.000104 | 0.0 | 0.205479 | 0.060609 | 0.555556 | 0.625 | 0.623188 | 0.133333 | 0.0675 | ... | 0.209677 | 0.275983 | 0.048518 | 0.0 | 0.0 | 0.0 | 0.0 | 0.735294 | 0.454545 | 1.0 |
| 3 | 0.001024 | 0.000108 | 0.0 | 0.246575 | 0.046087 | 0.666667 | 0.5 | 0.695652 | 0.3 | 0.0 | ... | 0.350806 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.272727 | 1.0 |
| 4 | 0.001366 | 0.001672 | 0.235294 | 0.181507 | 0.058566 | 0.444444 | 0.5 | 0.905797 | 0.8 | 0.0 | ... | 0.323925 | 0.148876 | 0.045822 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.181818 | 1.0 |


Next, let's deal with any null values in the categorical columns.

In [142]:
strings = clean_data.select_dtypes(include=["object"])
missing_count = strings.isnull().sum()
missing = missing_count[missing_count > 0]
missing

Mas Vnr Type        23
Bsmt Qual           80
Bsmt Cond           80
Bsmt Exposure       83
BsmtFin Type 1      80
BsmtFin Type 2      81
Electrical           1
Fireplace Qu      1422
Garage Type        157
Garage Finish      159
Garage Qual        159
Garage Cond        159
dtype: int64

There's a lone missing value in the 'Electrical' column, which we'll fill with the mode for the column.

The missing counts in the other columns are consistent with one another and with the corresponding numeric columns (for example, 159 missing values in both 'Garage Finish' and 'Garage Yr Blt'), which suggests they mark houses that lack the feature altogether. We'll fill these in as "Not present" to create an explicit category.

In [143]:
# Fill in 'Electrical' missing value
clean_data["Electrical"] = clean_data["Electrical"].fillna(clean_data["Electrical"].mode().iloc[0])
# Fill in other missing values with "Not present"
strings = clean_data.select_dtypes(include=["object"])
missing_count = strings.isnull().sum()
missing = missing_count[missing_count > 0]
clean_data[missing.index] = clean_data[missing.index].fillna("Not present")
clean_data[missing.index].isnull().sum()

Mas Vnr Type      0
Bsmt Qual         0
Bsmt Cond         0
Bsmt Exposure     0
BsmtFin Type 1    0
BsmtFin Type 2    0
Fireplace Qu      0
Garage Type       0
Garage Finish     0
Garage Qual       0
Garage Cond       0
dtype: int64
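
As a small illustration of why a placeholder label is useful here, the sketch below fills nulls in an invented `garage_type` column (a stand-in, not a column of the dataset) and counts the resulting categories. The absence of the feature becomes a category in its own right instead of being folded into an existing one.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"garage_type": ["Attchd", np.nan, "Detchd", np.nan]})

# Replace nulls with an explicit label so "no garage" survives as information
toy["garage_type"] = toy["garage_type"].fillna("Not present")

counts = toy["garage_type"].value_counts()
print(counts)
```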

The columns containing values for the years of construction/renovation would be more meaningful if expressed as the number of years between being constructed/renovated and being sold.

In [144]:
clean_data["Grg Years Since Built"] = clean_data["Yr Sold"] - clean_data["Garage Yr Blt"]
clean_data["Years Since Built"] = clean_data["Yr Sold"] - clean_data["Year Built"]
clean_data["Years Since Renovation"] = clean_data["Yr Sold"] - clean_data["Year Remod/Add"]

We can now drop both superfluous and obsolete columns:
- "Order" and "PID" are record identifiers, while "Mo Sold", "Yr Sold", "Sale Type" and "Sale Condition" describe the sale rather than the house; we assume they have little effect on the final price and treat them as superfluous.
- "Garage Yr Blt", "Year Built" and "Year Remod/Add" have been converted into new variables, and are now obsolete.

In [145]:
to_be_dropped = ["Order","PID","Mo Sold","Yr Sold","Sale Type","Sale Condition","Garage Yr Blt","Year Built","Year Remod/Add"]
clean_data = clean_data.drop(to_be_dropped, axis = 1)

Finally, we rescale the numeric columns so that they all share the same [0, 1] range.

In [146]:
numerics = clean_data.select_dtypes(include=["int64","float64"])
rescale_cols = numerics.columns.drop("SalePrice")
clean_data[rescale_cols] = (clean_data[rescale_cols]-clean_data[rescale_cols].min())/(clean_data[rescale_cols].max()-clean_data[rescale_cols].min())
clean_data[rescale_cols].head()

| | MS SubClass | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | ... | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Grg Years Since Built | Years Since Built | Years Since Renovation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.410959 | 0.14242 | 0.555556 | 0.5 | 0.07 | 0.113218 | 0.0 | 0.188784 | 0.176759 | ... | 0.147472 | 0.083558 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.882569 | 0.684015 | 0.913793 |
| 1 | 0.0 | 0.202055 | 0.048246 | 0.444444 | 0.625 | 0.0 | 0.08292 | 0.094364 | 0.115582 | 0.144354 | ... | 0.098315 | 0.0 | 0.0 | 0.0 | 0.208333 | 0.0 | 0.0 | 0.880734 | 0.680297 | 0.905172 |
| 2 | 0.0 | 0.205479 | 0.060609 | 0.555556 | 0.625 | 0.0675 | 0.163536 | 0.0 | 0.173801 | 0.217512 | ... | 0.275983 | 0.048518 | 0.0 | 0.0 | 0.0 | 0.0 | 0.735294 | 0.886239 | 0.69145 | 0.931034 |
| 3 | 0.0 | 0.246575 | 0.046087 | 0.666667 | 0.5 | 0.0 | 0.188696 | 0.0 | 0.447346 | 0.345336 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.86789 | 0.654275 | 0.844828 |
| 4 | 0.235294 | 0.181507 | 0.058566 | 0.444444 | 0.5 | 0.0 | 0.140149 | 0.0 | 0.058647 | 0.151882 | ... | 0.148876 | 0.045822 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.814679 | 0.546468 | 0.586207 |
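
The rescaling used here is min-max normalisation, `(x - min) / (max - min)`, applied column by column. The sketch below wraps it in a hypothetical helper, `min_max_rescale`, and adds a guard for constant columns, where the plain formula would divide zero by zero; that guard is an assumption of this example and is not part of the notebook.

```python
import pandas as pd

def min_max_rescale(df):
    """Rescale each column to [0, 1]; constant columns become 0 instead of NaN."""
    span = df.max() - df.min()
    span = span.replace(0, 1)  # avoid 0/0 for columns with a single value
    return (df - df.min()) / span

toy = pd.DataFrame({
    "area": [100.0, 150.0, 200.0],
    "pool": [0.0, 0.0, 0.0],  # constant column
})
scaled = min_max_rescale(toy)
print(scaled)
```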


## Feature selection

We should get rid of variables that are dominated by a single value, since these offer very little predictive power. Similarly, we should drop categorical variables with dozens of distinct values, as they bog down the model once expanded into dummy columns. Concretely, a column is dropped if it has more than 30 distinct values or if its most common value accounts for more than 90% of the rows.

In [147]:
def expunge_list(df):
    cols = df.columns
    ret = []
    for c in cols:
        uniques = df[c].value_counts().shape[0]
        skew = df[c].value_counts().iloc[0]/df.shape[0]
        if uniques > 30 or skew > .9:
            ret.append(c)
    return ret

In [148]:
expunge_strings = expunge_list(strings)
expunge_strings

['Street',
 'Utilities',
 'Land Slope',
 'Condition 2',
 'Roof Matl',
 'Heating',
 'Central Air',
 'Electrical',
 'Functional',
 'Garage Cond',
 'Paved Drive']

In [149]:
clean_data = clean_data.drop(expunge_strings, axis = 1)
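
To see the two criteria at work, the sketch below computes the number of distinct values and the share of the most frequent value for two invented columns (`street` and `zoning` here are toy stand-ins with made-up frequencies), then applies the same thresholds as `expunge_list`.

```python
import pandas as pd

toy = pd.DataFrame({
    "street": ["Pave"] * 19 + ["Grvl"],   # 95% of rows share one value
    "zoning": ["RL"] * 12 + ["RM"] * 8,   # 60% of rows share one value
})

# For each column: (number of distinct values, share of the most frequent value)
summary = {
    col: (toy[col].nunique(), toy[col].value_counts(normalize=True).iloc[0])
    for col in toy.columns
}

flagged = [
    col for col, (n_unique, top_share) in summary.items()
    if n_unique > 30 or top_share > 0.9
]
print(flagged)
```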

We can now expand the categorical columns into dummy columns. Once that is done, the original columns may be dropped.

In [150]:
string_cols = clean_data.select_dtypes(include=["object"]).columns
for c in string_cols:
    clean_data = pd.concat([clean_data,pd.get_dummies(clean_data[c],prefix=c)], axis = 1)
    del clean_data[c]
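
As an alternative to the loop, `pd.get_dummies` can expand selected columns of a frame in a single call and drops the originals itself. A minimal sketch on toy data follows; passing `dtype=int` is a choice made for this example so the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "Lot Shape": ["Reg", "IR1", "Reg"],
    "Lot Area": [0.2, 0.5, 0.1],
})

# Expand the listed categorical columns; untouched columns are kept as they are
expanded = pd.get_dummies(toy, columns=["Lot Shape"], dtype=int)
print(expanded.columns.tolist())
```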

Instead of using all these features together, let's try to identify just the best among them. We can use each feature's absolute correlation coefficient with `SalePrice` to rank the features by their (linear) predictive strength.

In [151]:
feature_corr_coeffs = clean_data.corr()["SalePrice"].abs().sort_values()

In [152]:
feature_corr_coeffs.shape[0]

229

In [161]:
target = "SalePrice"
# Only select features whose absolute correlation with the target is > .4
best_features = [x for x in feature_corr_coeffs[feature_corr_coeffs > .4].index if x != target]
len(best_features)

27
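
The selection step can be illustrated on synthetic data where the answer is known in advance. In the sketch below, `quality` drives the target and `noise_feature` is unrelated to it; both names and all numbers are invented for the example. Note that this kind of filter only captures linear association with the target and does not account for features that are redundant with each other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
quality = rng.uniform(0, 1, n)
noise_feature = rng.uniform(0, 1, n)

toy = pd.DataFrame({
    "quality": quality,
    "noise_feature": noise_feature,
    # Target depends linearly on 'quality' plus some Gaussian noise
    "SalePrice": 100_000 + 150_000 * quality + rng.normal(0, 10_000, n),
})

# Absolute Pearson correlation of every feature with the target
corr = toy.corr()["SalePrice"].abs().drop("SalePrice")
selected = corr[corr > 0.4].index.tolist()
print(selected)
```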

## Predictions and performance test

In [162]:
def train_and_test(df,feats,target,k):
    # Reject anything that is not a positive integer number of folds
    if not (isinstance(k, int) and k > 0):
        print("Fold parameter must be a positive integer")
        return None
    lr = LinearRegression()
    # k == 1: plain holdout validation (first half trains, second half tests)
    if k == 1:
        length = df.shape[0]
        half_len = int(np.ceil(length/2))
        train = df.iloc[:half_len]
        test = df.iloc[half_len:]
        lr.fit(train[feats],train[target])
        preds = lr.predict(test[feats])
        return np.sqrt(mean_squared_error(test[target],preds))
    # In all other cases, we can use the KFold tools from sklearn
    else:
        kf = KFold(k, shuffle=True)
        mses = cross_val_score(lr, df[feats], df[target], scoring="neg_mean_squared_error", cv=kf)
        rmses = np.sqrt(np.absolute(mses))
        return np.mean(rmses), np.std(rmses)

In [163]:
rmse = train_and_test(clean_data,best_features,target,1)
print(rmse)

38301.81217395522


In [171]:
avg_rmse, std_rmse = train_and_test(clean_data,best_features,target,3)
print(avg_rmse,"\n",std_rmse)

31803.847684177585 
 4827.980187192256
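
Because the folds are shuffled without a fixed seed, the cross-validated RMSE will vary somewhat from run to run. One way to make it repeatable is to pass `random_state` to `KFold`. The sketch below shows this on synthetic data; the helper `cv_rmse`, the data, and the seed value are assumptions made for the example and are not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: three features, known coefficients, small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, 120)

def cv_rmse(X, y, k, seed):
    """Mean and standard deviation of the RMSE over k shuffled folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    neg_mse = cross_val_score(LinearRegression(), X, y,
                              scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(-neg_mse)  # scores are negated MSEs
    return rmses.mean(), rmses.std()

first = cv_rmse(X, y, k=3, seed=1)
second = cv_rmse(X, y, k=3, seed=1)
print(first == second)
```

With the same seed the two calls use identical folds, so the reported mean and standard deviation match exactly.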
