# Homework 8


In this homework you'll practice creating ordinary least squares, ridge, and LASSO regression models using the [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris). 

In [None]:
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from scipy.optimize import curve_fit
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Let's load in the data and do some minor preprocessing:

In [None]:
iris = pd.read_csv('iris.csv')
iris

In [None]:
# drop irrelevant columns
iris_data = iris.drop(columns = ['Id', 'Species'])
iris_data.head()

Then we generate the train/test splits:

In [None]:
iris_train, iris_test = train_test_split(iris_data, test_size = 0.2, random_state = 0) # split into training set and test set
iris_train

**Exercise**: Now, create a regular linear regression model to predict the sepal length of iris flowers given the sepal width, petal length, and petal width using the process we've learned in lecture (fitting, predicting, finding the error/loss)!
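
As a reminder of the fit / predict / score workflow, here is a minimal sketch on synthetic data rather than the Iris data. The names `X_toy`, `y_toy`, `toy_model`, and the coefficients used to generate the data are made up for illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy data (NOT the Iris data): y = 2*x0 - x1 + 0.5 plus small noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

# Hold out the last 20 rows as a toy test set
X_toy_train, X_toy_test = X_toy[:80], X_toy[80:]
y_toy_train, y_toy_test = y_toy[:80], y_toy[80:]

# 1. instantiate, 2. fit on the training rows, 3. predict on the held-out rows
toy_model = LinearRegression()
toy_model.fit(X_toy_train, y_toy_train)
toy_pred = toy_model.predict(X_toy_test)

# 4. score the predictions against the true held-out values
toy_mse = mean_squared_error(y_toy_test, toy_pred)
toy_r2 = r2_score(y_toy_test, toy_pred)

print("Toy MSE: {:.3f}".format(toy_mse))
print("Toy R^2: {:.3f}".format(toy_r2))
```

Both `mean_squared_error` and `r2_score` take the true values first and the predictions second.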

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# X_train contains our features, y_train contains the values we're trying to predict
X_train = iris_train.drop(...)
y_train = iris_train[...]

X_test = iris_test.drop(...)
y_test = iris_test[...]

# instantiate your OLS model
ols_model = ...

# fit the model
ols_model.fit(..., ...)
# make predictions on test set
y_pred = ols_model.predict(...)
# find mean squared error
ols_loss = mean_squared_error(..., ...)
ols_r2 = r2_score(..., ...)

print("Mean Squared Error of Linear Model: {:.3f}".format(ols_loss))
print("R^2 of Linear Model: {:.3f}".format(ols_r2))

### A Strategy for Hyperparameter Selection: K-Fold Cross Validation

Earlier in lecture, we discussed the K-fold cross validation method for selecting hyperparameters for our model, namely the Ridge and LASSO models. We will proceed by fitting K models per choice of hyperparameter, and to emphasize what K-fold cross validation actually means, we're going to manually carry out the procedure. Recall the approach looks something like the figure below for 4-fold cross validation:

<img src="cv.png" width=500px>

When we use K-fold cross validation to select between hyperparameters, we split the training set further into multiple temporary train and validation sets (each split is called a "fold", hence K-fold cross validation). We use the average validation error across all K folds to make our feature, model, and hyperparameter choices. In this example, we'll only use this procedure for hyperparameter selection, specifically to choose the best alpha for both the Ridge and LASSO models.

**Exercise**: Scikit-learn has built-in support for cross validation. However, to better understand how cross validation works, complete the following function, which cross validates a given model.

1. Use the [`KFold.split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function to get 4 splits on the training data. Note that `split` returns the indices of the data for that split.
2. For **each** split:
    1. Select the training and validation rows based on the split indices.
    2. Fit the model on the training split.
    3. Compute the MSE on the validation split.
3. Return the average error across all cross validation splits.
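
To see what `KFold.split` actually yields, here is a small sketch on a made-up 8-row DataFrame (`toy_df`, `kf_demo`, and `folds` are illustrative names, not part of the assignment). It only inspects the indices and does not fit any model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical 8-row DataFrame, used only to inspect the split indices
toy_df = pd.DataFrame({"x": np.arange(8) * 10})

kf_demo = KFold(n_splits=4)
folds = []
for fold_number, (train_idx, valid_idx) in enumerate(kf_demo.split(toy_df)):
    # split() yields integer positions, so they are used with .iloc, not .loc
    valid_rows = toy_df.iloc[valid_idx]
    folds.append((train_idx.tolist(), valid_idx.tolist()))
    print("Fold {}: train positions {}, validation positions {}, validation x values {}".format(
        fold_number, train_idx.tolist(), valid_idx.tolist(), valid_rows["x"].tolist()))
```

With the default `shuffle=False`, each validation fold is a contiguous block of rows, and every row appears in exactly one validation fold.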

In [None]:
def compute_CV_error(model, X_train, y_train):
    '''
    Split the training data into 4 subsets.
    For each subset, 
        fit a model holding out that subset
        compute the MSE on that subset (the validation set)
    You should be fitting 4 models total.
    Return the average MSE of these 4 folds.

    Args:
        model: an sklearn model with fit and predict functions 
        X_train (DataFrame): training features
        y_train (Series): target values to predict

    Return:
        the average validation MSE for the 4 splits.
    '''
    # Creating a KFold object that will produce 4 splits
    kf = KFold(n_splits=4)
    validation_errors = []
    
    for train_idx, valid_idx in kf.split(X_train):
        # split the data
        split_X_train, split_X_valid = X_train.iloc[...], X_train.iloc[...]
        split_y_train, split_y_valid = y_train.iloc[...], y_train.iloc[...]

        # Fit the model on the training split
        model.fit(..., ...)
        
        # Compute the MSE on the validation split
        error = mean_squared_error(..., model.predict(...))
        
        validation_errors.append(...)
        
    return np.mean(...)

**Exercise**: Now, use the 5 given lambda values (passed to scikit-learn as `alpha`) to create Ridge regression models, and determine which model has the lowest CV error. Then, calculate the error of that model on the test set. (As a bonus, repeat the same for LASSO regression.) We've provided an outline for how you might want to structure your code for this question; feel free to use it or write your own solution however you'd like.

In [None]:
from sklearn.linear_model import Ridge, Lasso

lambdas = [0.01, 0.1, 1, 10, 100]
# feel free to add more lines of code here as necessary
...
for param in lambdas:
    model = Ridge(alpha=..., fit_intercept = False)
    model_err = compute_CV_error(..., ..., ...)
    ...

best_ridge_loss = mean_squared_error(..., ...)
ridge_r2 = r2_score(..., ...)

print("Mean Squared Error of Ridge Model: {:.3f}".format(best_ridge_loss))
print("R^2 of Ridge Model: {:.3f}".format(ridge_r2))

**Exercise**: Compare the performance of the Ridge models using different values of lambda with the OLS solution. What do you notice about the validation error as lambda increases?

_Your answer here_


For this dataset and demonstration purposes, cross validation is a viable option to use for hyperparameter selection. Cross validation can also be used to select between various features or even various model architectures!

Keep in mind that when models are very memory intensive and time consuming to train (as in deep learning), fitting K models per hyperparameter choice is often too expensive. In those situations you'll typically prefer a cheaper method such as a holdout set (also known as simple cross validation), where a single held-out split of the training data is used to compare the different hyperparameters.
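
For comparison, a holdout-set selection might look like the sketch below. It uses synthetic data rather than the Iris data, and the names `X_fit`, `X_holdout`, `holdout_errors`, and `best_alpha` are illustrative assumptions. Each candidate alpha is fit only once, instead of once per fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical toy regression data (NOT the Iris data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Carve a single holdout (validation) split out of the training data
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# Fit one model per candidate alpha and record its error on the holdout split
holdout_errors = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    candidate = Ridge(alpha=alpha)
    candidate.fit(X_fit, y_fit)
    holdout_errors[alpha] = mean_squared_error(y_holdout, candidate.predict(X_holdout))

# Pick the alpha with the smallest holdout error
best_alpha = min(holdout_errors, key=holdout_errors.get)
print("Holdout MSE per alpha:", holdout_errors)
print("Best alpha on the holdout split:", best_alpha)
```

The trade-off is that a single split gives a noisier error estimate than averaging over K folds, so the chosen alpha can depend more heavily on which rows landed in the holdout set.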


Congrats! You have completed this homework! :)