# Homework 3

### Due: Wed Nov. 22 @ 9pm

In this homework we will perform model evaluation, model selection, and feature selection in both regression and classification settings.

The data we will be looking at are a subset of home sales data from King County, Washington, as we might see on a real estate website.


## Instructions

Follow the comments below and fill in the blanks (____) to complete each cell.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Part 1: Regression

Here we try to build a model to predict adjusted sales price from a set of building features.

### Load and transform data

In [None]:
# Load data from file
infile_name = '../data/house_sales_subset.csv'
df = pd.read_csv(infile_name)

In [None]:
# Use a subset of the columns as features
X = df[['SqFtTotLiving','SqFtLot','Bathrooms','Bedrooms','BldgGrade']]

In [None]:
# Extract the target, adjusted sale price
# Note: the '_r' suffix distinguishes the regression target from the classification target used later
y_r = df.AdjSalePrice

In [None]:
# Split into 80% train and 20% test using train_test_split
from sklearn.model_selection import train_test_split

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(____, ____, ____)
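
To illustrate the call shape only, here is a minimal sketch of an 80/20 split on synthetic data. The `X_demo` and `y_demo` arrays and the `random_state` value are assumptions made up for this example and are not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
```

The function returns the pieces in the order train features, test features, train target, test target, which is the order the cell above unpacks them in.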

### Measure baseline performance

In [None]:
# Train dummy model using DummyRegressor on the training set
from sklearn.dummy import DummyRegressor

dummy_r = ____

In [None]:
# Calculate and print RMSE of the dummy model on the training set 
from sklearn.metrics import mean_squared_error

y_dummy_r = dummy_r.____
dummy_r_training_rmse = np.sqrt(mean_squared_error(____, ____))

print('dummy RMSE: {:.3f}'.format(dummy_r_training_rmse))

In [None]:
# Calculate and print R2 of the dummy model on training set
# hint: you can use the model's 'score' method
# note: why is this 0?
dummy_r_training_r2 = ____
print('dummy r2: {:.3f}'.format(dummy_r_training_r2))
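
One way to think about the question above: `DummyRegressor` with its default `strategy='mean'` always predicts the training mean, and R2 compares a model's squared error against exactly that constant prediction. The following is a minimal sketch on synthetic data. All `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))   # features are ignored by the dummy model
y_demo = rng.normal(loc=50.0, scale=10.0, size=100)

dummy_demo = DummyRegressor().fit(X_demo, y_demo)   # default strategy is 'mean'
pred_demo = dummy_demo.predict(X_demo)

rmse_demo = np.sqrt(mean_squared_error(y_demo, pred_demo))
r2_demo = dummy_demo.score(X_demo, y_demo)

# The RMSE of a mean predictor equals the population standard deviation of y,
# and R2 = 1 - SS_res / SS_tot is 0 because SS_res equals SS_tot
print(rmse_demo, np.std(y_demo), r2_demo)
```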

### Measure performance of Simple Linear Model

In [None]:
# Instantiate and train a simple LinearRegression model on the training set
from sklearn.linear_model import LinearRegression

lr_r = ____

In [None]:
# Calculate RMSE and R2 of simple linear model on training set
# Note the improvement over the dummy model
lr_r_rmse = ____
lr_r_r2 = ____
print('simple linear RMSE: {:.3f}'.format(lr_r_rmse))
print('simple linear r2: {:.3f}'.format(lr_r_r2))

In [None]:
# Calculate mean 5-fold Cross Validation R2 score of simple linear model on the training set using cross_val_score
# Note that in this case the difference between the mean CV R2 and the R2 on the full training set is small
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), ____, ____, cv=____)

print('simple linear mean cv r2: {:.3f}'.format(____))
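
As a rough illustration of what `cross_val_score` returns, the sketch below runs 5-fold cross validation on synthetic data. The data and the `_demo` names are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# With no 'scoring' argument, a regressor is scored with its own score method, i.e. R2
scores_demo = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

print(scores_demo)          # one R2 value per fold
print(scores_demo.mean())   # the mean is the usual single-number summary
```

Each fold is scored on data the model did not see while fitting, so the mean is typically a more honest estimate than the score on the full training set.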

### Model selection

In [None]:
# Create a pipeline using make_pipeline to generate polynomial models
# There should be two steps: PolynomialFeatures then LinearRegression
# Recall: using PolynomialFeatures, we do not need to fit an intercept in the LinearRegression step
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly_pipeline = make_pipeline(____,____)

In [None]:
# Perform GridSearch over different polynomial degrees in [1,2,3,4] using the training set
# To do this, instantiate and fit GridSearchCV on our poly_pipeline model and params set 
# hint: "polynomialfeatures__degree" is the parameter of interest
from sklearn.model_selection import GridSearchCV

params = {____:[1,2,3,4]}
gs = GridSearchCV(____,____).fit(____,____)

In [None]:
# Print out the best score found and the best parameter setting found
print('gs best r2 score : {:.3f}'.format(____))
print('gs best params   : {}'.format(____))
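
The sketch below shows how a pipeline step name turns into a grid-search parameter name. It uses a synthetic quadratic signal, and the data, the `_demo` names, and the explicit `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = x_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=200)   # quadratic signal

# PolynomialFeatures adds a constant column by default, so the intercept is turned off
pipe_demo = make_pipeline(PolynomialFeatures(), LinearRegression(fit_intercept=False))

# make_pipeline names each step after its lowercased class name
print(list(pipe_demo.named_steps))

# '<step name>__<parameter>' addresses a parameter of one step
params_demo = {'polynomialfeatures__degree': [1, 2, 3, 4]}
gs_demo = GridSearchCV(pipe_demo, params_demo, cv=5).fit(x_demo, y_demo)

print(gs_demo.best_params_, gs_demo.best_score_)
```

After fitting, `best_score_` holds the best mean cross-validated score and `best_params_` the setting that produced it.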

In [None]:
# Retrain poly_pipeline using the best parameter setting found above on the entire training set
# Print the RMSE and R2 of the new model on the training set
# hint: polynomialfeatures__degree is the parameter you need to set

poly_pipeline.set_params(____).____

poly_train_rmse = ____
poly_train_r2 = ____

print('polynomial training RMSE: {:.3f}'.format(poly_train_rmse))
print('polynomial training R2  : {:.3f}'.format(poly_train_r2))

In [None]:
# Using the newly trained model, get predictions on the full training set
y_hat = ____

In [None]:
# Plot predictions (x-axis) vs residuals (y-axis) of the training set
# recall: residual = y_hat - ground_truth_y

residuals = ____
_ = plt.scatter(____, ____, alpha=0.2)
_ = plt.xlabel('y predicted')
_ = plt.ylabel('residuals')
# if this were a real analysis, we might want to address the outliers we see here
# you should also see some signs of heteroskedasticity (the residual spread changes with the predicted value)

### Evaluate trained model on Test

In [None]:
# Using our trained model, calculate R2 and RMSE on the test set
# Note that error may have gone up slightly and r2 may have decreased slightly
print('test RMSE: {:.3f}'.format(____))
print('test r2  : {:.3f}'.format(____))

### Feature selection

In [None]:
# Select the 2 most informative features using SelectKBest with f_regression
# Note: f_regression is a univariate test on the data and does not use the trained model
# To do this, instantiate and fit SelectKBest on the training set
from sklearn.feature_selection import SelectKBest, f_regression

skb = SelectKBest(f_regression, ____).____

In [None]:
# Print out the selected features using skb.get_support() and the column names from X
# hint: get_support returns a boolean mask
kept_columns = ____
print('kept columns: {}'.format(kept_columns))
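
The boolean mask from `get_support()` can index a pandas column index directly. The following is a minimal sketch on synthetic data. The column names and `_demo` variables are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'noise'])
y_demo = 3.0 * X_demo['a'] + 2.0 * X_demo['b'] + rng.normal(scale=0.1, size=200)

skb_demo = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

mask_demo = skb_demo.get_support()       # boolean array, one entry per column
kept_demo = X_demo.columns[mask_demo]    # boolean indexing keeps the True columns
print(list(kept_demo))
```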

---

## Part 2: Classification

Here we try to build a model to predict low vs. high adjusted sales price.

### Create classification target

In [None]:
# Here we create a binary target by thresholding at the median of AdjSalePrice
y_c = (df.AdjSalePrice > df.AdjSalePrice.median()).astype(int)

# The mean of a 0/1 target is the proportion of the positive (high-price) class
print('proportion of high-price homes: {:.3f}'.format(sum(y_c)/float(len(y_c))))

In [None]:
# Split into 80% train and 20% test using train_test_split
# Use our new y_c target and the same X we used for regression
X_train_c, X_test_c, y_train_c, y_test_c = ____

### Measure baseline performance

In [None]:
# Train a dummy classification model on the training set
from sklearn.dummy import DummyClassifier

dummy_c = ____

In [None]:
# Calculate Training set Accuracy of the dummy classifier
# This should be close to 0.5, since thresholding at the median gives roughly balanced classes
dummy_c_acc = ____

print('dummy accuracy: {:.3f}'.format(dummy_c_acc))

### Measure performance of a Random Forest model

In [None]:
# Create, fit and calculate training set accuracy of a random forest with 5 trees
# Note: why is this so high?
from sklearn.ensemble import RandomForestClassifier

rf = ____
print('rf accuracy: {:.3f}'.format(____))

In [None]:
# Calculate mean 5-fold cross validation accuracy of a random forest with 5 trees on the training set
# Note that it should be less than the accuracy when trained on the full training set
scores = ____
print('mean cv accuracy: {:.3f}'.format(____))
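
A hedged illustration of the gap between training accuracy and cross-validated accuracy: the sketch below uses synthetic data with deliberately noisy labels. The dataset, the `flip_y` value, and the `_demo` names are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# flip_y=0.2 assigns random labels to about 20% of the samples,
# so no model can be perfect on unseen data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, flip_y=0.2, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_demo, y_demo)
train_acc_demo = rf_demo.score(X_demo, y_demo)   # scored on the data it was fit to

cv_acc_demo = cross_val_score(
    RandomForestClassifier(n_estimators=5, random_state=0), X_demo, y_demo, cv=5
).mean()                                          # scored on held-out folds

print(train_acc_demo, cv_acc_demo)
```

Fully grown trees can largely memorize their training rows, including the noisy labels, which is why the first number tends to be optimistic.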

### Model selection

In [None]:
# Perform cross validated grid search over the number of trees in [1,5,10] using the training set
params = {____:[1,5,10]}

gs = ____

In [None]:
# Print out the best score found and the best parameter setting found
print('gs best accuracy: {:.3f}'.format(____))
print('gs best params  : {}'.format(____))

In [None]:
# Retrain on the entire training set using the best number of trees found
rf = RandomForestClassifier(n_estimators=____).fit(X_train_c,y_train_c)

In [None]:
# Get P(y=1|x) for the entire training set
# hint: py_pos should only contain one column
py_pos = ____

In [None]:
# Plot Precision (y-axis) vs. Recall (x-axis) using the targets and py_pos 
# The plot should indicate a good fit at almost any classification threshold
from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(____, ____)

_ = plt.step(____,____)
_ = plt.xlabel('recall')
_ = plt.ylabel('precision')
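
The sketch below shows the shapes involved when extracting positive-class probabilities and passing them to `precision_recall_curve`. It uses synthetic data, all `_demo` names are hypothetical, and the plotting calls are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# predict_proba returns one column per class, ordered as in rf_demo.classes_
proba_demo = rf_demo.predict_proba(X_demo)
py_pos_demo = proba_demo[:, 1]    # second column corresponds to class 1

# Arguments are the true labels first, then the positive-class scores
precision_demo, recall_demo, thresholds_demo = precision_recall_curve(
    y_demo, py_pos_demo
)

print(proba_demo.shape, py_pos_demo.shape)
print(len(precision_demo), len(recall_demo), len(thresholds_demo))
```

The precision and recall arrays are one element longer than the thresholds array because a final point (precision 1, recall 0) is appended with no corresponding threshold.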

### Evaluate model performance on the Test set

In [None]:
# Calculate accuracy of the trained model on the test set
# Note that it should not be far from the cv training set accuracy
print('test accuracy: {:.3f}'.format(____))

### Feature selection

In [None]:
# Select the most informative features from the trained model using SelectFromModel with 'mean' as the threshold
# note: this may select more than 2 features
# note: we use prefit=True since the model is already trained
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(____, threshold=____, prefit=True)

In [None]:
# print out the selected features using sfm.get_support() and columns from X 
kept_columns = ____
print('kept columns: {}'.format(kept_columns))

In [None]:
# transform X_train_c into a new X containing only the selected features
X_train_c_fs = ____
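
The following is a minimal sketch of `SelectFromModel` with an already-fitted forest, on synthetic data where only the first column is informative. The feature names and `_demo` variables are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
names_demo = np.array(['signal', 'noise_1', 'noise_2', 'noise_3'])
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)   # only the first column carries information

rf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# prefit=True reuses the already-trained forest; threshold='mean' keeps features
# whose importance is at least the mean importance across all features
sfm_demo = SelectFromModel(rf_demo, threshold='mean', prefit=True)

mask_demo = sfm_demo.get_support()        # boolean mask over the input columns
X_sel_demo = sfm_demo.transform(X_demo)   # keeps only the selected columns

print(names_demo[mask_demo], X_sel_demo.shape)
```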

In [None]:
# Train a new model on the new X using the previously found best setting for n_estimators
rf_fs = ____

In [None]:
# Predict P(y=1|x) using the new model
py_pos_fs = ____

In [None]:
# Plot the ROC curves of the old model and the new model on the same plot
# Note that the full model is only a slight improvement on the model with fewer features
from sklearn.metrics import roc_curve

fpr,tpr,_ = roc_curve(____, ____)
fpr_fs,tpr_fs,_ = roc_curve(____, ____)

_ = plt.step(fpr,tpr,color='blue')
_ = plt.step(fpr_fs,tpr_fs,color='red')
_ = plt.xlabel('fpr')
_ = plt.ylabel('tpr')

In [None]:
# Confirm that the new and old models are similar by calculating their ROC AUC values on the training set
# hint: use the py_pos you predicted for both models
from sklearn.metrics import roc_auc_score

full_model_auc = ____
fs_model_auc = ____
print('full model auc: {:.3f}'.format(full_model_auc))
print('fs model auc  : {:.3f}'.format(fs_model_auc))
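
For intuition about what the AUC measures, here is a tiny hand-checkable sketch. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])   # hypothetical predicted P(y=1|x)

# Both functions take the true labels first, then the positive-class scores
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)
auc_demo = roc_auc_score(y_true_demo, scores_demo)

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75
print(auc_demo)
```

AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.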