# Homework 8


In this homework you'll practice creating ordinary least squares, ridge, and LASSO regression models using the [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris). 

In [59]:
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from scipy.optimize import curve_fit
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Let's load in the data and do some minor preprocessing:

In [60]:
iris = pd.read_csv('iris.csv')
iris

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [61]:
# drop irrelevant columns
iris_data = iris.drop(columns = ['Id', 'Species'])
iris_data.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


Then we generate the train/test splits:

In [62]:
iris_train, iris_test = train_test_split(iris_data, test_size = 0.2, random_state = 0) # split into training set and test set
iris_train

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
137,6.4,3.1,5.5,1.8
84,5.4,3.0,4.5,1.5
27,5.2,3.5,1.5,0.2
127,6.1,3.0,4.9,1.8
132,6.4,2.8,5.6,2.2
...,...,...,...,...
9,4.9,3.1,1.5,0.1
103,6.3,2.9,5.6,1.8
67,5.8,2.7,4.1,1.0
117,7.7,3.8,6.7,2.2


**Exercise**: Now, create a regular linear regression model to predict the sepal length of iris flowers given the sepal width, petal length, and petal width using the process we've learned in lecture (fitting, predicting, finding the error/loss)!

In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X_train = iris_train.drop(columns = ["SepalLengthCm"]) # features: sepal width, petal length, petal width
y_train = iris_train[["SepalLengthCm"]] # target: sepal length

X_test = iris_test.drop(columns = ["SepalLengthCm"]) # same feature columns for the test set
y_test = iris_test[["SepalLengthCm"]]

# instantiate your model
ols_model = LinearRegression(fit_intercept=False)

# fit the model
ols_model.fit(X_train, y_train)
# make predictions on test set
y_pred = ols_model.predict(X_test)
# find mean squared error
ols_loss = mean_squared_error(y_test, y_pred)
ols_r2 = r2_score(y_test, y_pred)

print("Mean Squared Error of Linear Model: {:.3f}".format(ols_loss))
print("R^2 of Linear Model: {:.3f}".format(ols_r2))

Mean Squared Error of Linear Model: 0.143
R^2 of Linear Model: 0.720
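
For reference, both metrics printed above follow directly from their definitions. The sketch below recomputes them with NumPy on a few made-up values (the arrays `y_true` and `y_hat` are illustrative assumptions, not iris data), which may help when checking what `mean_squared_error` and `r2_score` report.

```python
import numpy as np

# Made-up values for illustration only (not taken from the iris data)
y_true = np.array([5.0, 6.0, 7.0, 6.0])
y_hat = np.array([5.5, 6.0, 6.5, 6.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.3f}, R^2: {r2:.3f}")
```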


### A Strategy for Hyperparameter Selection: K-Fold Cross Validation

Earlier in lecture, we discussed the K-fold cross validation method for selecting hyperparameters, which we will apply here to the Ridge and LASSO models. We will proceed by fitting K models per choice of hyperparameter, and to emphasize what K-fold cross validation actually means, we're going to carry out the procedure manually. Recall that the approach looks something like the figure below for 4-fold cross validation:

<img src="cv.png" width=500px>

When we use K-fold cross validation to select between various hyperparameters, we split the training set further into multiple temporary train and validation sets (each split is called a "fold", hence K-fold cross validation). We use the average validation error across all K folds to make our feature, model, and hyperparameter choices. In this example, we'll only use this procedure for hyperparameter selection, specifically to choose the best alpha for both the Ridge and LASSO models.
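
To make the fold bookkeeping concrete, here is a minimal sketch that prints the indices `KFold` produces. The 8-row array `X_demo` is a placeholder assumption; only the index pattern matters.

```python
import numpy as np
from sklearn.model_selection import KFold

# Eight placeholder rows, just to make the index bookkeeping visible
X_demo = np.arange(8).reshape(-1, 1)

kf = KFold(n_splits=4)  # no shuffling by default, so folds are contiguous blocks
folds = list(kf.split(X_demo))

for i, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Each row lands in the validation set exactly once and in the training set for the other three folds.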

**Exercise**: Scikit-learn has built-in support for cross validation. However, to better understand how cross validation works, complete the following function, which cross validates a given model.

1. Use the [`KFold.split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function to get 4 splits on the training data. Note that `split` returns the indices of the data for that split.
2. For **each** split:
    1. Select the training and validation rows of `X_train` and `y_train` based on the split indices.
    2. Fit the model on the training split.
    3. Compute the MSE on the validation split.
3. Return the average error across all cross validation splits.

In [64]:
def compute_CV_error(model, X_train, y_train):
    '''
    Split the training data into 4 subsets.
    For each subset, 
        fit a model holding out that subset
        compute the MSE on that subset (the validation set)
    You should be fitting 4 models total.
    Return the average MSE of these 4 folds.

    Args:
        model: an sklearn model with fit and predict functions 
        X_train (data_frame): Training data
        y_train (data_frame): output data 

    Return:
        the average validation MSE for the 4 splits.
    '''
    # Creating a KFold object that will produce 4 splits
    kf = KFold(n_splits=4)
    validation_errors = []
    
    for train_idx, valid_idx in kf.split(X_train):
        # split the data
        split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
        split_y_train, split_y_valid = y_train.iloc[train_idx], y_train.iloc[valid_idx]

        # Fit the model on the training split
        model.fit(split_X_train, split_y_train)
        
        # Compute the MSE on the validation split
        error = mean_squared_error(split_y_valid, model.predict(split_X_valid))
        
        validation_errors.append(error)
        
    return np.mean(validation_errors)
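
As a cross-check on the manual loop, scikit-learn's `cross_val_score` should give the same average. This is a sketch on synthetic stand-in data (`X_demo` and `y_demo` are assumptions, not the iris training set); note that the `"neg_mean_squared_error"` scorer returns negated MSEs, so the sign has to be flipped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the iris training set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

kf = KFold(n_splits=4)
model = LinearRegression()

# Manual loop, mirroring compute_CV_error
manual_errors = []
for train_idx, valid_idx in kf.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[valid_idx])
    manual_errors.append(mean_squared_error(y_demo[valid_idx], preds))
manual_cv_mse = np.mean(manual_errors)

# Built-in helper: scores are negated MSEs, so flip the sign
scores = cross_val_score(model, X_demo, y_demo, cv=kf,
                         scoring="neg_mean_squared_error")
builtin_cv_mse = -scores.mean()

print(manual_cv_mse, builtin_cv_mse)
```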

**Exercise**: Now, create a Ridge regression model for each of the 5 given lambda values and determine which model has the lowest CV error. Then, calculate the error of that model on the test set. (As a bonus, repeat the same for LASSO regression.)

In [65]:
from sklearn.linear_model import Ridge, Lasso

lambdas = [0.01, 0.1, 1, 10, 100]
min_err = np.inf
error_mapping = {}
for param in lambdas:
    model = Ridge(alpha=param, fit_intercept = False)
    model_err = compute_CV_error(model, X_train, y_train)
    error_mapping[model_err] = (model, param)
    if model_err < min_err:
        min_err = model_err
print(f"The Ridge regression model with lambda = {error_mapping[min_err][1]} has the lowest validation error!")    

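# NOTE: this model was last fit inside compute_CV_error, on the training portion of the final fold only
best_ridge_model = error_mapping[min_err][0]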
y_pred = best_ridge_model.predict(X_test)
ridge_loss = mean_squared_error(y_test, y_pred)
ridge_r2 = r2_score(y_test, y_pred)

print("Mean Squared Error of Ridge Model: {:.3f}".format(ridge_loss))
print("R^2 of Ridge Model: {:.3f}".format(ridge_r2))

The Ridge regression model with lambda = 0.1 has the lowest validation error!
Mean Squared Error of Ridge Model: 0.140
R^2 of Ridge Model: 0.725
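
One possible sketch of the LASSO bonus follows. It rests on several assumptions: it loads scikit-learn's bundled copy of iris via `load_iris` rather than `iris.csv`, it uses `cross_val_score` in place of `compute_CV_error`, and it keeps `fit_intercept=False` only to mirror the Ridge cell. It also refits the selected model on the full training set before scoring on the test set, which is a design choice rather than something the exercise requires.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Assumption: use scikit-learn's bundled copy of iris instead of iris.csv.
# Column 0 is sepal length (the target); columns 1-3 are the features.
data = load_iris().data
X, y = data[:, 1:], data[:, 0]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

lambdas = [0.01, 0.1, 1, 10, 100]
cv_errors = {}
for lam in lambdas:
    scores = cross_val_score(Lasso(alpha=lam, fit_intercept=False), X_tr, y_tr,
                             cv=KFold(n_splits=4),
                             scoring="neg_mean_squared_error")
    cv_errors[lam] = -scores.mean()  # scores are negated MSEs

best_lam = min(cv_errors, key=cv_errors.get)

# Refit on the full training set before scoring on the test set
best_lasso = Lasso(alpha=best_lam, fit_intercept=False).fit(X_tr, y_tr)
lasso_loss = mean_squared_error(y_te, best_lasso.predict(X_te))

print(f"best lambda: {best_lam}, test MSE: {lasso_loss:.3f}")
```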


**Exercise**: Compare the performance of the Ridge models using different values of lambda with the OLS solution. What do you notice about the validation error as lambda increases?

_Of the five values tried, lambda = 0.1 gives the lowest validation error, and larger values of lambda give a higher one: stronger regularization shrinks the coefficients, so the model becomes less complex and more biased. Still, with a well-chosen lambda, the Ridge regression model slightly outperforms ordinary least squares on the test set (MSE of 0.140 vs. 0.143)!_
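
The shrinking effect of regularization can be seen directly in the size of the fitted coefficients. The sketch below uses synthetic data (`X_demo` and `y_demo` are assumptions, not the iris data) and prints the Euclidean norm of the Ridge coefficient vector for each lambda.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (an assumption; not the iris data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=50)

lambdas = [0.01, 0.1, 1, 10, 100]
coef_norms = []
for lam in lambdas:
    ridge = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print(f"lambda={lam:>6}: ||w|| = {coef_norms[-1]:.3f}")
```

The norm drops as lambda grows, which is the loss of model complexity described above.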


For this dataset and demonstration purposes, cross validation is a viable option to use for hyperparameter selection. Cross validation can also be used to select between various features or even various model architectures!

Keep in mind that in situations where models are very memory intensive and time consuming to train (as in deep learning), you'll typically prefer cheaper methods for selecting hyperparameters, such as a holdout set (also known as simple cross validation, where you evaluate each candidate hyperparameter on a single held-out split of the training data).
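
A holdout-set version of the lambda search might look like the sketch below. The data are synthetic and the 75/25 split is an arbitrary assumption; the point is that each candidate is fit once instead of K times.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a training set (an assumption; not the iris data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)

# Carve a single validation split out of the training data
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

val_errors = {}
for lam in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=lam).fit(X_fit, y_fit)  # one fit per candidate, not K
    val_errors[lam] = mean_squared_error(y_val, model.predict(X_val))

best_lam = min(val_errors, key=val_errors.get)
print(f"best lambda on the holdout set: {best_lam}")
```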


Congrats! You have completed this homework! :)