In [None]:
# Initialize OK
from client.api.notebook import Notebook
ok = Notebook('lab09.ok')

# Lab 9: Feature Engineering & Cross-Validation
In this lab, you will practice using scikit-learn to do feature engineering and cross-validation to produce a model with low error on held-out data.

### Due Date 
This assignment is due on **Thursday, July 25 at 11:59pm**.

### Collaboration Policy
Data science is a collaborative activity. While you may talk with others about this assignment, we ask that you **write your solutions individually**. If you discuss the assignment with others, please **include their names** in the cell below.

**Collaborators:** *list names here*

In [1]:
# Run this cell to set up your notebook
import seaborn as sns
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.set_context("talk")

from IPython.display import display, Latex, Markdown

### Introduction

For this lab, we will use a toy dataset to predict the house prices in Boston with data provided by the `sklearn.datasets` package. There are more interesting datasets in the package if you want to explore them during your free time!

Run the following cell to load the data. `load_boston()` returns a dictionary-like object with the following keys:
- `data`: the covariates (X)
- `target`: the response vector (Y)
- `feature_names`: the column names
- `DESCR`: a full description of the data

Note that `load_boston` was removed in scikit-learn 1.2, so this cell requires an older version of the library.


In [2]:
from sklearn.datasets import load_boston

boston_data = load_boston()
print(boston_data.keys())

In [3]:
print(boston_data['DESCR'])

A look at the `DESCR` attribute tells us the data contains these features:

    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 USD
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of black 
                 residents by town
    13. LSTAT    % lower status of the population
    
Let's now convert this data into a pandas DataFrame. 

In [4]:
boston = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])
boston.head()

### Question 1

Let's model this housing price data! Before we can do this, however, we need to split the data into training and test sets. The held-out test points will be used to evaluate how well our model performs on data it has not seen. Remember that the response vector (housing prices) lives in the `target` attribute. A random seed is set here so that we can reproduce the same split later if we want to re-check our results or track down bugs.

Use the [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to hold out 10% of the data as the test set. Call the resulting splits `X_train`, `X_test`, `Y_train`, `Y_test`.
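
If the return order of `train_test_split` is unfamiliar, the sketch below shows it on a small made-up array (not the Boston data). The `toy_*` names and the 25% test size are assumptions chosen only for illustration. It passes `random_state` for reproducibility; when `random_state` is left unset, the function draws from NumPy's global random state, which is why seeding with `np.random.seed` also works.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features; row i is [2i, 2i + 1]
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20)

# Hold out 25% of the rows; the outputs come back as
# (X train, X test, y train, y test), in that order
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

print(toy_X_tr.shape, toy_X_te.shape)
```

The rows of the features and the labels are shuffled together, so each label still lines up with its original row after the split.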

<!--
BEGIN QUESTION
name: q1
-->

In [5]:
from sklearn.model_selection import train_test_split
np.random.seed(47)

X = boston
Y = pd.Series(boston_data['target'])

...

In [None]:
ok.grade("q1");

### Question 2

As a warmup, fit a linear model to describe the relationship between the housing price and all available covariates. We've imported `sklearn.linear_model` as `lm`, so you can use that instead of typing out the whole module name. Fill in the cells below to fit a linear regression model to the covariates and create a scatter plot for our predictions vs. the true prices.
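
As a reminder of the scikit-learn `fit`/`predict` pattern, here is a minimal sketch on made-up, noiseless data that follows $y = 2x + 1$. The `toy_*` names and the data are hypothetical and unrelated to the housing data.

```python
import numpy as np
import sklearn.linear_model as lm

# Hypothetical noiseless data following y = 2x + 1
toy_X = np.arange(10, dtype=float).reshape(-1, 1)  # shape (10, 1): one feature
toy_y = 2 * toy_X.ravel() + 1

toy_model = lm.LinearRegression()
toy_model.fit(toy_X, toy_y)  # learn the slope and intercept

# Predict for two new inputs, x = 10 and x = 11
toy_pred = toy_model.predict(np.array([[10.0], [11.0]]))
print(toy_model.coef_, toy_model.intercept_, toy_pred)
```

Note that `fit` and `predict` both expect a two-dimensional feature array (rows are observations, columns are features), even when there is only one feature.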

<!--
BEGIN QUESTION
name: q2
-->

In [10]:
import sklearn.linear_model as lm

linear_model = lm.LinearRegression()

# Fit your linear model
#linear_model.fit(...)

# Predict housing prices on the test set
Y_pred = ...

# Plot predicted vs true prices
plt.scatter(Y_test, Y_pred, alpha=0.5)
plt.xlabel("Prices")
plt.ylabel("Predicted Prices")
plt.title("Prices vs Predicted Prices");

Briefly analyze the scatter plot above. Do you notice any outliers? Write your answer in the cell below.

*Write your answer here, replacing this text.*

### Question 3

As the scatter plot shows, our model is not perfect. If it were, every point would lie on the identity line (i.e. the line $y = x$, which has slope 1 and passes through the origin). Compute the root mean squared error (RMSE) of the predicted responses: 

$$
\textbf{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 }
$$

Fill out the function below and compute the RMSE for our predictions on both the training data `X_train` and the test set `X_test`. Note that your implementation should not contain the word **"for"** (a Python loop would be very slow here).
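
To sanity-check your implementation, here is the formula worked by hand on three made-up values (these numbers are hypothetical, not taken from the housing data). Suppose the true responses are $y = (3, 5, 7)$ and the predictions are $\hat{y} = (2, 5, 9)$, so $n = 3$:

$$
\textbf{RMSE} = \sqrt{\frac{(3-2)^2 + (5-5)^2 + (7-9)^2}{3}} = \sqrt{\frac{1 + 0 + 4}{3}} = \sqrt{\frac{5}{3}} \approx 1.29
$$

Calling your function on these two arrays should return roughly the same value.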

<!--
BEGIN QUESTION
name: q3
-->

In [11]:
def rmse(actual_y, predicted_y):
    """
    Args:
        actual_y: an array of the ground truth labels
        predicted_y: an array of the predictions from the model
        
    Returns:
        The root mean squared error between the predictions and the ground truth
    """
    ...

train_error = ...
test_error = ...

print("Training RMSE:", train_error)
print("Test RMSE:", test_error)

In [None]:
ok.grade("q3");

Is your training error lower than the test error? If so, why could this be happening? Answer in the cell below.

*Write your answer here, replacing this text.*

## Cross Validation

Let's try building a simpler linear model with fewer features. While this may increase our training error, it may also decrease our test error and help prevent overfitting to the training set.

In the next section, we'll use $k$-fold cross-validation to select the best subset of features for our model. Recall the approach looks something like:

<img src="cv.png" width=500px>

**Warning**: Don't use the test set to perform the feature selection! We want to avoid using the test set too frequently, as we want to preserve some data to see how well our model truly performs. When selecting features or choosing hyper-parameters, we can split the training set further into train and validation sets. Then we can use the average validation error to help select hyper-parameters.

### Question 4

Scikit-learn has built-in support for cross-validation. However, to better understand how cross-validation works, complete the following function, which cross-validates a given model.

1. Use the [`KFold.split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function to get 4 splits on the training data. Note that `split` returns the indices of the data for that split.
2. For each split, select out the rows and columns based on the split indices and features.
3. Compute the RMSE on the validation split.
4. Return the average error across all cross validation splits.
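
To see what `KFold.split` produces in step 1, the sketch below runs it on a small made-up array of 8 rows. The `toy_X` and `folds` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 8 rows, 2 features
toy_X = np.arange(16).reshape(8, 2)

kf = KFold(n_splits=4)  # no shuffling by default
folds = list(kf.split(toy_X))

# Each fold is a pair of integer position arrays: (train rows, validation rows)
for train_idx, valid_idx in folds:
    print("train rows:", train_idx, "validation rows:", valid_idx)
```

Every row appears in exactly one validation fold. Because the indices are integer positions rather than index labels, positional selection (for example `.iloc` on a DataFrame) is the matching way to pick out the rows.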

<!--
BEGIN QUESTION
name: q4
-->

In [13]:
from sklearn.model_selection import KFold

def compute_CV_error(model, X_train, Y_train):
    '''
    Split the training data into 4 subsets.
    For each subset, 
        fit a model holding out that subset
        compute the RMSE on that subset (the validation set)
    You should be fitting 4 models total.
    Return the average RMSE of these 4 folds.

    Args:
        model: an sklearn model with fit and predict functions 
        X_train (DataFrame): Training data
        Y_train (Series): Labels

    Return:
        the average validation RMSE for the 4 splits.
    '''
    kf = KFold(n_splits=4)
    validation_errors = []
    
    for train_idx, valid_idx in kf.split(X_train):
        # split the data
        split_X_train, split_X_valid = ..., ...
        split_Y_train, split_Y_valid = ..., ...

        # Fit the model on the training split
        ...
        
        # Compute the RMSE on the validation split
        error = ...


        validation_errors.append(error)
        
    return np.mean(validation_errors)

In [None]:
ok.grade("q4");

### Question 5

We have defined four different feature sets, each containing three features (stored in `feature_sets` below). Use `compute_CV_error` to determine which feature set gets us the lowest average validation error. Then, fill in the variables `best_err_idx`, `best_err`, and `best_feature_set` below.

**Hint:** To find the index of the lowest error in `errors`, you may want to use [`np.argmin`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html).
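
A quick illustration of `np.argmin` on a made-up list (the `toy_*` names and values are hypothetical): it returns the position of the smallest element, which can then be used to index into any list of the same length.

```python
import numpy as np

# Hypothetical error values, one per candidate
toy_errors = [3.2, 1.5, 2.7]

toy_best_idx = np.argmin(toy_errors)  # position of the smallest value
print(toy_best_idx, toy_errors[toy_best_idx])
```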

<!--
BEGIN QUESTION
name: q5
-->

In [15]:
feature_sets = [
    ['TAX', 'INDUS', 'CRIM'], 
    ['RM', 'LSTAT', 'PTRATIO'], 
    ['RM', 'B', 'NOX'], 
    ['TAX', 'LSTAT', 'DIS']
]

errors = []
for feat in feature_sets:
    print("Trying features:", feat)
    model = lm.LinearRegression()
    # compute the cross validation error
    error = ...
    
    print("\tRMSE:", error)
    errors.append(error)

best_err_idx = ...
best_err = ...
best_feature_set = ...

for i in range(4):
    print('{}, error: {}'.format(feature_sets[i], errors[i]))

best_feature_set, best_err

In [None]:
ok.grade("q5");

Why is it logical to use the set of features that results in the smallest average root mean squared error when performing cross-validation? 

### Question 6

Finally, fit a linear model using your best feature set and predict housing prices for your original test set. You can also try to select your own features (on top of the given ones) to lower the RMSE. Compute the final train and test RMSEs for a linear model using your best feature set.

<!--
BEGIN QUESTION
name: q6
-->

In [18]:
# Fit your linear model
...

# Predict on the training and test sets and calculate the RMSE
train_rmse = ... 
test_rmse = ...

print("Train RMSE", train_rmse)
print("KFold Validation RMSE", best_err)
print("Test RMSE", test_rmse)

In [None]:
ok.grade("q6");

After seeing your training and test errors, it is often helpful to visualize them. Below, we plot the residual for each record in `X_test`. When the points in a residual plot are scattered randomly around the line y = 0 with no visible pattern, a linear regression model is likely a reasonable fit for the data.

In [20]:
plt.scatter(np.arange(len(X_test)), Y_test - model.predict(X_test[best_feature_set]))
plt.xlabel('x')
plt.ylabel('residual (true y - estimated y)')
plt.title('Residual vs x for Linear Model')
plt.axhline(y = 0, color='r');

Nice! You've used $k$-fold cross-validation to fit a linear regression model to the housing data.

In the future, you'd probably want to use something like [`cross_val_predict`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) to automatically perform cross-validation, but it's instructive to do it yourself at least once.
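
For reference, here is a minimal sketch of how `cross_val_predict` might be used, on made-up synthetic data rather than the housing data. The `toy_*`, `cv_pred`, and `pooled_rmse` names, the coefficients, and the noise level are all assumptions for illustration. Note that this computes a single RMSE pooled over all out-of-fold predictions, which is close to, but not identical to, the average of per-fold RMSEs computed earlier in the lab.

```python
import numpy as np
import sklearn.linear_model as lm
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical synthetic data: 40 rows, 3 features, linear signal plus small noise
rng = np.random.RandomState(0)
toy_X = rng.normal(size=(40, 3))
toy_y = toy_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

# Each row's prediction comes from a model fit on the other 3 folds
cv_pred = cross_val_predict(
    lm.LinearRegression(), toy_X, toy_y, cv=KFold(n_splits=4)
)

# RMSE pooled across all out-of-fold predictions
pooled_rmse = np.sqrt(np.mean((toy_y - cv_pred) ** 2))
print(pooled_rmse)
```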

## Congratulations!

You are finished with this assignment. Please don't forget to submit by 11:59pm on Thursday!

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

In [None]:
# Save your notebook first, then run this cell to submit.
ok.submit()