# Challenge

Machine learning is one of the subjects taught in Ironhack's data analytics program. This challenge is constructed in a way that lets you showcase your knowledge of feature engineering, dimensionality reduction, model selection and evaluation, and hyperparameter tuning. The data set chosen is purposefully small and has a variety of variables so that training doesn't take long and you can focus on these parts of the machine learning workflow. The goal of the challenge is not purely to optimize model performance but to demonstrate your thought process, knowledge, and creativity when working through a machine learning problem. So please document your thoughts in the notebook as you work through this challenge.

**Instructions:**

* Download the [housing prices data set](https://www.dropbox.com/sh/kzge2vi9wfajwy5/AACsjbLbvwnG65N8CKKU1YXja?dl=0).
* Using Python, analyze the features and determine which feature set to select for modeling.
* Train and cross-validate several regression models, attempting to accurately predict the SalePrice target variable.
* Evaluate all models and show a comparison of performance metrics.
* State your thoughts on model performance, which model(s) you would select, and why.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('housing_prices.csv')

In [3]:
data.head(10)

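| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |
| 5 | 6 | 50 | RL | 85.0 | 14115 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 6 | 7 | 20 | RL | 75.0 | 10084 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 8 | 2007 | WD | Normal | 307000 |
| 7 | 8 | 60 | RL | | 10382 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
| 8 | 9 | 50 | RM | 51.0 | 6120 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 9 | 10 | 190 | RL | 50.0 | 7420 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 1 | 2008 | WD | Normal | 118000 |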


In [4]:
data.shape

(1460, 81)

In [5]:
data.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
                  ...   
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64


We have a lot of fields in the data - several numeric and categorical (object) fields. My first idea is to one-hot encode the categorical fields so that they are numeric and then use various feature selection methods from scikit-learn to reduce the number of features.

In [6]:
transformed = pd.get_dummies(data)
transformed.shape

(1460, 290)
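
To make the jump from 81 to 290 columns concrete, here is a minimal sketch of what `pd.get_dummies` does, using a made-up three-row frame (the values are illustrative and not taken from the housing data). Numeric columns pass through unchanged, each category becomes its own indicator column, and a missing category yields zeros in every indicator column unless `dummy_na=True` is set.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'Alley': ['Grvl', np.nan, 'Pave'],
})

encoded = pd.get_dummies(toy)

# The numeric column is kept as-is; 'Alley' expands to one column per category.
print(list(encoded.columns))

# The row with a missing 'Alley' is 0 in every indicator column.
missing_row = encoded.loc[1, ['Alley_Grvl', 'Alley_Pave']].astype(int).tolist()
print(missing_row)
```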

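The first thing I'll do is eliminate fields that have very low variance (where the vast majority of values are just a single value), as those features are less likely to be informative to our models. With a threshold of 0.8 * (1 - 0.8) = 0.16, a one-hot encoded column is dropped when it takes the same value in more than about 80% of rows.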

In [7]:
from sklearn.feature_selection import VarianceThreshold

In [8]:
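# Fill missing values with 0 before computing variances.
transformed = transformed.fillna(0)

# Drop features whose variance does not exceed 0.8 * (1 - 0.8) = 0.16.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
features = sel.fit_transform(transformed)

# Recover the names of the retained columns.
selected = sel.get_support(indices=True)
colnames = transformed.columns[selected]
features = pd.DataFrame(features, columns=colnames)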
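
As a sanity check on that threshold, here is a minimal sketch with three made-up indicator columns. A 0/1 column that equals 1 in a fraction `p` of rows has variance `p * (1 - p)`, so only the balanced column should survive. The array below is invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three made-up indicator columns over 10 rows:
# column 0 is 1 in 90% of rows, column 1 in 50%, column 2 in 10%.
X = np.array([
    [1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 1, 0],
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1],
])

threshold = 0.8 * (1 - 0.8)
sel = VarianceThreshold(threshold=threshold)
sel.fit(X)

print(sel.variances_)     # p * (1 - p) for each column
print(sel.get_support())  # True where the variance exceeds the threshold
```

For non-binary numeric columns the variance depends on the column's units, so the same cut-off is only an "80%" rule for the indicator columns.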

In [9]:
features.head()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | HeatingQC_TA | KitchenQual_Gd | KitchenQual_TA | FireplaceQu_Gd | FireplaceQu_TA | GarageType_Attchd | GarageType_Detchd | GarageFinish_Fin | GarageFinish_RFn | GarageFinish_Unf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 60.0 | 65.0 | 8450.0 | 7.0 | 5.0 | 2003.0 | 2003.0 | 196.0 | 706.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 2.0 | 20.0 | 80.0 | 9600.0 | 6.0 | 8.0 | 1976.0 | 1976.0 | 0.0 | 978.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 3.0 | 60.0 | 68.0 | 11250.0 | 7.0 | 5.0 | 2001.0 | 2002.0 | 162.0 | 486.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 4.0 | 70.0 | 60.0 | 9550.0 | 7.0 | 5.0 | 1915.0 | 1970.0 | 0.0 | 216.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 5.0 | 60.0 | 84.0 | 14260.0 | 8.0 | 5.0 | 2000.0 | 2000.0 | 350.0 | 655.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |


We have reduced our features down to 67 columns now. Let's perform some initial modeling. I'm going to choose 4 regression models to compare:

* Linear Regression
* K Nearest Neighbors
* Decision Tree
* Random Forest

I'm going to perform k-fold cross-validation with k=12, calculate each model's average R-squared across the folds, and compare their performance.

In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [11]:
y = features['SalePrice']
x = features.drop('SalePrice', axis=1)

In [12]:
models = {'Linear Regression': LinearRegression(),
          'K-Nearest Neighbor': KNeighborsRegressor(),
          'Decision Tree': DecisionTreeRegressor(),
          'Random Forest': RandomForestRegressor()}

In [13]:
for name, model in models.items():
    scores = cross_val_score(model, x, y, cv=12)
    print(name + ':', np.mean(scores))

Linear Regression: 0.7803523076479765
K-Nearest Neighbor: 0.6563805170815439
Decision Tree: 0.7319940803789639
Random Forest: 0.8252583477888108
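
When `cv` is an integer, `cross_val_score` splits a regression problem into consecutive, unshuffled folds and scores with the estimator's default metric, which is R-squared for these regressors. The sketch below makes both choices explicit and also reports the spread across folds. It uses synthetic data from `make_regression` as a stand-in for the housing features; the data, the `random_state` values, and `n_estimators=50` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=240, n_features=10, noise=10.0, random_state=0)

# Shuffled 12-fold split with a fixed seed so the folds are reproducible.
cv = KFold(n_splits=12, shuffle=True, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    results[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Fixing the seeds also makes it easier to tell whether a later change in score comes from a preprocessing step or from run-to-run randomness.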


Another thing that often helps, particularly for distance-based models such as KNN, is scaling the features. Let's try that with the standard scaler and see if the performance of our models improves.

In [14]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
scaled = scaler.fit_transform(x)
scaled = pd.DataFrame(scaled, columns=x.columns)

for name, model in models.items():
    scores = cross_val_score(model, scaled, y, cv=12)
    print(name + ':', np.mean(scores))

Linear Regression: 0.7787191934571149
K-Nearest Neighbor: 0.734184360979827
Decision Tree: 0.7038446874912494
Random Forest: 0.8428763026304072
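
The cell above fits the scaler on all rows before cross-validating, so each held-out fold has already influenced the scaling statistics. One common alternative is to wrap the scaler and the model in a pipeline so that scaling is re-fit on the training part of every fold. The sketch below does this on synthetic data (an assumption, standing in for `x` and `y`), with one uninformative column deliberately put on a much larger scale to show why KNN is sensitive to it.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 3))
# The target depends only on the first two columns.
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=240)
# The third column is uninformative but on a much larger scale.
X[:, 2] *= 1000

raw_knn = KNeighborsRegressor()
# The scaler is fit only on the training rows of each fold.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor())

raw_score = cross_val_score(raw_knn, X, y, cv=12).mean()
scaled_score = cross_val_score(scaled_knn, X, y, cv=12).mean()

print('KNN, raw features:   ', raw_score)
print('KNN, scaled pipeline:', scaled_score)
```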


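It looks like the performance of the KNN model improved quite a bit (R-squared rose from about 0.66 to 0.73) and the random forest improved slightly, while linear regression was essentially unchanged and the decision tree scored a little lower. Tree-based models are insensitive to feature scaling, so the changes for the decision tree and random forest most likely reflect run-to-run randomness rather than a real effect. Let's try reducing our features even further, to just the best 30, using scikit-learn's SelectKBest.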

In [15]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# Keep the 30 features with the highest univariate F-scores against SalePrice.
sel = SelectKBest(f_regression, k=30)
x2 = sel.fit_transform(x, y)

# Recover the names of the retained columns.
selected = sel.get_support(indices=True)
colnames = x.columns[selected]
x2 = pd.DataFrame(x2, columns=colnames)

for name, model in models.items():
    scores = cross_val_score(model, x2, y, cv=12)
    print(name + ':', np.mean(scores))

Linear Regression: 0.7749152744706641
K-Nearest Neighbor: 0.7135081691543501
Decision Tree: 0.6869970728952127
Random Forest: 0.8369908895722031
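
`SelectKBest` with `f_regression` scores each feature on its own with a univariate F-test against the target and keeps the `k` highest. Here is a small sketch showing how to inspect the scores and the retained columns. The data is synthetic and assumed for illustration: the column names `f0` to `f4` are hypothetical, and only the first two columns drive the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=['f0', 'f1', 'f2', 'f3', 'f4'])
# Only f0 and f1 influence the target; the rest are noise.
y = 3 * X['f0'] - 2 * X['f1'] + rng.normal(scale=0.5, size=300)

sel = SelectKBest(f_regression, k=2).fit(X, y)

# F-statistic per feature, highest first.
ranking = pd.Series(sel.scores_, index=X.columns).sort_values(ascending=False)
print(ranking)

# Names of the columns that were kept.
kept = list(X.columns[sel.get_support()])
print(kept)
```

Because each feature is scored in isolation, several strongly correlated features can all rank highly even though they carry overlapping information.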


That didn't change our results significantly (note that this run used the unscaled features), but I'm just about out of time for this exercise. If I had more time, I'd go back and do some additional feature engineering. Feature engineering requires exploring and understanding the data, and because I had a limited amount of time, I chose some quick methods for feature selection so that I could get the data to a point where I could run some models on it.

I would also try some other models (e.g. support vector regression or a multilayer perceptron) to see how well they predict. My sense is that because we don't have a lot of examples (only 1,460) for the models to learn from, the models are limited in their performance. In the absence of more data, one thing we can do is bootstrap, that is, resample the existing rows with replacement to approximate what the wider population of houses might look like. The Random Forest does essentially this by training each tree on a bootstrap sample and averaging the trees (bagging), and I believe that is ultimately why it performs better than the other models in this case. I would like to try bagging with the other models as well and see if that improves their performance.
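
The bootstrap idea can be applied to other models with scikit-learn's `BaggingRegressor`, which fits copies of a base model on bootstrap resamples of the rows and averages their predictions. Here is a minimal sketch on synthetic data; the data, the choice of a decision tree as the base model, and `n_estimators=25` are assumptions for illustration. Any other regressor, such as `KNeighborsRegressor`, could be passed instead, and whether it helps on the housing data would need to be checked.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

single_tree = DecisionTreeRegressor(random_state=0)
# The base model is passed positionally; each copy is fit on a bootstrap
# sample of the rows and the predictions are averaged.
bagged_trees = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                                n_estimators=25, random_state=0)

single_score = cross_val_score(single_tree, X, y, cv=5).mean()
bagged_score = cross_val_score(bagged_trees, X, y, cv=5).mean()

print('Single tree: ', single_score)
print('Bagged trees:', bagged_score)
```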

I also didn't have time for hyperparameter tuning, which is time-consuming in its own right, so I would do that as well if I had more time.
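
As a starting point for that tuning, here is a minimal sketch using `GridSearchCV` on a random forest. The data is synthetic and the grid is hypothetical and deliberately small; sensible ranges for the housing data would have to be found by experiment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [None, 5],
    'min_samples_leaf': [1, 5],
}

# Every candidate is scored with 3-fold cross-validated R-squared.
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=3, scoring='r2')
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The cost grows with the product of the grid sizes and the number of folds, which is why larger searches often switch to `RandomizedSearchCV`.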