__NAME:__ __FULLNAME__   
__CS 5703: Machine Learning Practice__

# Homework 5: Regularization

## Assignment Overview
Follow the TODOs and read through and understand any provided code.  
For all plots, make sure all necessary axes and curves are clearly and 
accurately labeled. Include appropriate figure/plot titles, as well.

### Task
For this assignment you will be exploring regularization. Regularization
is a powerful tool in machine learning to impose rational constraints on 
models during the training process to mitigate overfitting to the training 
set and improve model generalization. By including one or more terms within
the cost (error) function to penalize the weights, the learning algorithm will try 
to fit the data while avoiding certain values for the weights that might 
overfit the data.
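
As a rough illustration of the penalty idea, the sketch below uses the closed-form ridge solution for the cost $\|y - Xw\|^2 + \alpha \|w\|^2$ on small synthetic data (not the BMI data). The helper `ridge_weights` and all variable names are hypothetical, and the intercept is omitted for brevity. It only shows that the weight norm shrinks as the penalty strength grows.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))            # synthetic inputs (assumption, not the BMI data)
y = X[:, 0] + 0.1 * rng.normal(size=30)  # only the first feature matters

alphas_demo = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_weights(X, y, a)) for a in alphas_demo]

# A larger alpha penalizes large weights more strongly, so the norm decreases
for a, n in zip(alphas_demo, norms):
    print(f"alpha={a:6.1f}  ||w||={n:.4f}")
```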


### Data set
The BMI (Brain Machine Interface) data are stored in a single pickle file; within this file, there
is one dictionary that contains all of the data.  The keys are: 'MI', 
'theta', 'dtheta', 'torque', and 'time'.  Each of these objects is a Python list of 20 
NumPy matrices; each matrix contains an independent fold of data, with rows representing 
different samples and columns representing different dimensions.  The samples are organized 
contiguously (one sample every 50 ms), but there are gaps in the data.
* _MI_ contains the data for 48 neurons.  Each row encodes the number of action potentials for 
each neuron at each of 20 different time bins (so, 48 x 20 = 960 columns).  
* _theta_ contains the angular position of the shoulder (in column 0) and the elbow 
(in column 1) for each sample.  
* _dtheta_ contains the angular velocity of the shoulder (in column 0) and the elbow 
(in column 1) for each sample.  
* _torque_ contains the torque of the shoulder (in column 0) and the elbow (in column 
1) for each sample.  
* _time_ contains the actual time stamp of each sample.

A fold is a subset of the available data.  Cutting the data into folds is useful for adjusting training, validation, and test 
set sizes, and for assessing the generality of a modelling approach.  Each fold contains independent time points.
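
The layout described above can be mimicked with a toy dictionary. This is only a sketch under stated assumptions: the values are random, `samples_per_fold` is a hypothetical small number, and the shape of each `time` entry (a single column) is a guess rather than something taken from the real file. It shows how a few folds combine into one array.

```python
import numpy as np

rng = np.random.default_rng(0)
n_folds, n_neurons, n_bins = 20, 48, 20
samples_per_fold = 5  # hypothetical; the real folds are much larger

# One list per key, one matrix per fold (rows = samples, columns = dimensions)
toy = {
    'MI': [rng.poisson(1.0, size=(samples_per_fold, n_neurons * n_bins))
           for _ in range(n_folds)],
    'theta': [rng.normal(size=(samples_per_fold, 2)) for _ in range(n_folds)],
    'dtheta': [rng.normal(size=(samples_per_fold, 2)) for _ in range(n_folds)],
    'torque': [rng.normal(size=(samples_per_fold, 2)) for _ in range(n_folds)],
    # Assumed shape: one column of time stamps spaced 50 ms apart
    'time': [0.05 * np.arange(samples_per_fold).reshape(-1, 1) + 10.0 * f
             for f in range(n_folds)],
}

# Combine two folds into a single matrix, stacking the samples row-wise
chosen = [12, 13]
X_demo = np.concatenate([toy['MI'][f] for f in chosen])
y_demo = np.concatenate([toy['torque'][f] for f in chosen])[:, 0]  # shoulder torque

print(X_demo.shape, y_demo.shape)
```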

This assignment utilizes code examples and concepts from the Regression lectures.

### Objectives
* Use and understand regularization in regression
* Learn to select hyper-parameters to tune model behavior, specifically:
    * Regularization parameters
    
### Notes
* Be sure to adequately label all the plots you generate.


### General References
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
* [Sci-kit Learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Sci-kit Learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)
* [JobLib](https://joblib.readthedocs.io/en/latest/)

## Hand-In Procedure
* Execute all cells so they are showing correct results
* Notebook (from Jupyter or Colab):
  + Submit this file (.ipynb) to the Gradescope Notebook HW5 dropbox
* Note: there is no need to submit a PDF file or to submit directly to Canvas

In [None]:
# PROVIDED
import pickle as pkl
import pandas as pd
import numpy as np
import os, re, fnmatch, time
import matplotlib.pyplot as plt
import joblib

from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.metrics import make_scorer

# Default figure parameters
plt.rcParams['figure.figsize'] = (10,5)
plt.rcParams['font.size'] = 12
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = True
plt.rcParams['axes.titlesize'] = 18
plt.rcParams['axes.labelsize'] = 12

#%matplotlib inline

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Important Note

In [None]:
# CoLab users:

# If the following error appears while generating plots:
# FileNotFoundError: [Errno 2] No such file or directory: 'latex': 'latex'
# uncomment the lines below and run them once. This might take some time.

#!sudo apt update
#!sudo apt install cm-super dvipng texlive-latex-extra texlive-latex-recommended

# Note that you will need to execute this every time you start a new virtual machine

# Other users:
# If you experience a related error, you will need to use your standard installation
# technique to add the same set of packages to your python environment


# LOAD DATA

In [None]:
""" TODO
Load the BMI data from all the folds
"""
# Local file name
# fname = '/home/fagg/datasets/bmi/bmi_dataset.pkl'
# CoLab file name
fname = '/content/drive/MyDrive/MLP_2022/datasets/bmi_dataset.pkl'

# Load the data
with open(fname, 'rb') as f:
  bmi = pkl.load(f)

# Extract the individual components
theta_folds = bmi['theta']
dtheta_folds = bmi['dtheta']
torque_folds = bmi['torque']
time_folds = bmi['time']
MI_folds = bmi['MI'] 

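# Create a list of (MI, theta, dtheta, torque, time) tuples, one per fold.
# list() materializes the zip iterator so it can be traversed more than once.
alldata_folds = list(zip(MI_folds, theta_folds, dtheta_folds, torque_folds, time_folds))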

nfolds = len(MI_folds)
nfolds

# REGULARIZED REGRESSION

In [None]:
""" TODO
Evaluate the training performance of an already trained model
"""
def mse_rmse(trues, preds):
    '''
    Compute MSE and rMSE for each column separately.
    '''
    mse = np.sum(np.square(trues - preds), axis=0) / trues.shape[0]
    rmse = np.sqrt(mse)
    return mse, rmse

def predict_score_eval(model, X, y):
    '''
    Compute the model predictions and corresponding scores.
    PARAMS:
        model: an already trained model
        X: feature data
        y: corresponding output
    RETURNS:
        mse: mean squared error for each column
        rmse: rMSE for each column
        score: score computed by the model's score() method
        preds: predictions of the model from X
    '''
    # TODO: place implementation from HW4 here
    
    return mse, rmse, score, preds


"""
Create scoring function object for gridsearch

This represents a more general way of creating a scoring mechanism than
what was discussed in the lectures.

GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
make_scorer: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html

"""
def rmse_scorer(trues, preds):
    '''
    Compute rMSE
    '''
    _, rmse = mse_rmse(trues, preds)
    return rmse 
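
The text above mentions a scoring object for grid search, so here is a minimal sketch of how `make_scorer` can wrap an rMSE function. It assumes a single output column and synthetic data; `rmse_1d` and `neg_rmse` are hypothetical names. Because `GridSearchCV` maximizes its score, `greater_is_better=False` makes the scorer return the negated rMSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer

def rmse_1d(trues, preds):
    """rMSE for a single output column (hypothetical helper)."""
    return float(np.sqrt(np.mean(np.square(trues - preds))))

# greater_is_better=False flips the sign so that "larger is better" still holds
neg_rmse = make_scorer(rmse_1d, greater_is_better=False)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                                      # synthetic data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

model = Ridge(alpha=1.0).fit(X, y)

# A scorer is called as scorer(estimator, X, y)
score = neg_rmse(model, X, y)
print(score)
```

A scorer built this way can be passed as the `scoring` argument of `GridSearchCV`.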



In [None]:
def get_data_set(data, folds):
    '''
    For the data provided, extract only the specified folds and concatenate them together
    
    :param data: Python list of numpy matrices (one list element per fold)
    :param folds: Python list of folds to extract
    '''
    # For each field in data, extract only the specified folds
    output = [np.concatenate([d[f] for f in folds]) for d in data]
    
    # Convert the list to a tuple
    return tuple(output)

In [None]:
""" PROVIDED
Construct training set to obtain best model and testing set for 
evaluation of that one model. The model will focus on predicting 
the shoulder torque.
"""
# Extract fold indices for the training, validation, and testing sets
trainset_fold_inds = [3] 
validationset_fold_inds = [12, 13] 
testset_fold_inds = [14, 15] 

# Combine the folds into singular numpy arrays

# Training set
# Named timetrain (not time) to avoid shadowing the imported time module
Xtrain, ytrain, timetrain = get_data_set([MI_folds, torque_folds, time_folds],
                          trainset_fold_inds)
ytrain = np.reshape(ytrain[:,0], newshape=(-1,))

# Validation set
Xval, yval, timeval = get_data_set([MI_folds, torque_folds, time_folds],
                          validationset_fold_inds)
yval = np.reshape(yval[:,0], newshape=(-1,))


# Testing set
Xtest, ytest, timetest = get_data_set([MI_folds, torque_folds, time_folds],
                          testset_fold_inds)
ytest = np.reshape(ytest[:,0], newshape=(-1,))


In [None]:
Xtrain.shape, ytrain.shape, Xval.shape, yval.shape, Xtest.shape, ytest.shape

## Linear Model

In [None]:
""" TODO
Construct and train a linear model using the training set.
Display the Training rmse. You can use the rmse_scorer for this.
"""
# Create and train the model
# model_lnr = # TODO

# Show train rmse
# TODO
rmse_scorer(ytrain, model_lnr.predict(Xtrain))

In [None]:
# TODO
# Compute the linear model predictions and display the rmse on the test data
preds_lnr = # TODO


## Ridge Regression

In [None]:
# TODO

# Create a Ridge Regression model
ridge = #TODO

# A set of alpha parameter values to try 
#  These are 20 values from 10^0 to 10^6, evenly spaced on a log scale

alphas = np.logspace(0, 6, base=10, num=20, endpoint=True)
alphas
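
As a small sanity check on what `np.logspace` produces, the sketch below (with hypothetical names `alphas_demo` and `coarse`) shows that 20 points between $10^0$ and $10^6$ share a constant ratio of $10^{6/19}$, so most of them are not integer powers of ten. Using `num=7` instead would give exactly the powers of ten.

```python
import numpy as np

# Same call as above: 20 values, endpoints 10^0 and 10^6 included
alphas_demo = np.logspace(0, 6, base=10, num=20, endpoint=True)

# Consecutive values differ by a constant multiplicative factor
ratios = alphas_demo[1:] / alphas_demo[:-1]
print(alphas_demo[:3])
print(ratios[0], 10 ** (6 / 19))

# With 7 points, the grid lands exactly on the powers of ten
coarse = np.logspace(0, 6, num=7)
print(coarse)
```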

In [None]:
# TODO

def hyper_loop(model, alphas, Xtrain, ytrain, Xval, yval):
    '''
    Loop over all possible alphas:
    - Set the model.alpha parameter to the specific alpha
    - Fit model to Xtrain/ytrain
    - Compute rmse for Xtrain/ytrain and Xval/yval & log these in python arrays (use rmse_scorer())
    Return the list of rmse's for both the training and validation sets
    
    :param model: ML model to fit
    :param alphas: List of alpha hyper-parameter values to try
    :param Xtrain: training set inputs
    :param ytrain: training set desired output
    :param Xval: validation set inputs
    :param yval: validation set desired output
    '''
    rmse_train = []
    rmse_valid = []
    # Loop over all possible alphas
    for a in alphas:
        # Set model.alpha
        # model.alpha = # TODO
        
        # Fit the model to the training set
        # TODO

        # Log rmse for both training and validation sets
        rmse_train.append(rmse_scorer(#TODO))
        rmse_valid.append(rmse_scorer(#TODO))
        
    # Return training and validation performance lists
    return rmse_train, rmse_valid
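
To make the shape of such a sweep concrete, here is a hedged sketch on synthetic data. It uses a closed-form ridge fit in plain NumPy rather than the scikit-learn model, and every name in it (`ridge_fit`, `rmse`, `Xtr`, `Xva`, and so on) is hypothetical. It is not the intended solution to the TODOs above. The training error can only grow as alpha increases, while the validation error is what guides the choice of alpha.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights without an intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def rmse(trues, preds):
    return float(np.sqrt(np.mean(np.square(trues - preds))))

rng = np.random.default_rng(1)
n_train, n_val, n_features = 40, 200, 30
w_true = np.zeros(n_features)
w_true[:3] = [1.0, -1.0, 0.5]            # only three informative features

Xtr = rng.normal(size=(n_train, n_features))
Xva = rng.normal(size=(n_val, n_features))
ytr = Xtr @ w_true + rng.normal(size=n_train)
yva = Xva @ w_true + rng.normal(size=n_val)

alphas_demo = np.logspace(-2, 4, num=13)
train_err, val_err = [], []
for a in alphas_demo:
    w = ridge_fit(Xtr, ytr, a)           # fit on the training set only
    train_err.append(rmse(ytr, Xtr @ w))
    val_err.append(rmse(yva, Xva @ w))   # evaluate on held-out data

# Pick the alpha with the lowest validation error
best = int(np.argmin(val_err))
print(f"best alpha: {alphas_demo[best]:.3g}, validation rMSE: {val_err[best]:.3f}")
```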

In [None]:
# TODO
# Call hyper_loop with the ridge regression model
rmse_train, rmse_valid = # TODO

print(f"rmse train: {rmse_train}")
print(f"rmse valid: {rmse_valid}")

In [None]:
# TODO
# Plot training and validation rmse as a function of alpha
plt.figure()

# TODO

plt.xscale('log')
plt.xlabel(r'$\alpha$')
plt.ylabel('RMSE')
plt.legend(['Train', 'Validation'])

In [None]:
# TODO
# Identify the index in rmse_valid that is smallest
idx = # TODO


idx

In [None]:
# TODO
# Show the alpha value that corresponds to the best index
# TODO


In [None]:
# TODO
# Set the ridge model alpha to the best value & refit the model to the training set data
# TODO
ridge.alpha = # TODO

In [None]:
# TODO
# Compute the predictions for the training data
predtrain = #TODO

# Report the rmse for the training data
# TODO


In [None]:
# TODO
# Compute the predictions for the test data

predtest = # TODO

# Report the rmse for the test data
# TODO


In [None]:
# TODO
# Plot: ground truth, best Ridge predictions and the Linear model predictions for 
#  time period 2102 to 2108

plt.figure()
# TODO

plt.legend(['Ground Truth', 'Ridge (best)', 'LMS'])
plt.xlim((2102, 2108))

In [None]:
""" TODO
Generate a plot that contains two overlapping histograms:
- Coefficients discovered by LinearRegression
- Coefficients discovered by the best Ridge model

Hint: the coefficients for the model are found in their coef_ property
"""
nbins = 50
start = -0.05
end = 0.05
incr = (end - start) / nbins
bins = np.arange(start, end, incr)

# Figure 1: use hist() with bins=nbins.  For each histogram,
#  this will separately determine the bins based on the data
plt.figure()
#TODO
plt.title("Model Coefficients")
plt.legend()

# Figure 2: use bins=bins.  This will use the exact same bins for
#   both histograms
plt.figure()
#TODO
plt.title("Model Coefficients")
plt.legend()
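
The difference between the two binning choices can be seen without plotting. The sketch below uses `np.histogram` on random stand-ins for two coefficient vectors (`coef_a` and `coef_b` are hypothetical and are not model outputs). Here `np.linspace` with `nbins + 1` edges is one way to obtain exactly `nbins` shared bins; it is a swapped-in alternative to the `np.arange` call above, not a required change.

```python
import numpy as np

rng = np.random.default_rng(0)
coef_a = rng.normal(scale=0.02, size=960)   # stand-in for one model's coef_
coef_b = rng.normal(scale=0.005, size=960)  # stand-in for a more shrunken model

nbins = 50
edges = np.linspace(-0.05, 0.05, nbins + 1)  # nbins + 1 edges give nbins bins

# Shared edges: both histograms are directly comparable bin by bin
counts_a, edges_a = np.histogram(coef_a, bins=edges)
counts_b, edges_b = np.histogram(coef_b, bins=edges)

# An integer bin count: each data set gets its own edges from its own range
_, auto_a = np.histogram(coef_a, bins=nbins)
_, auto_b = np.histogram(coef_b, bins=nbins)

print(np.array_equal(edges_a, edges_b))  # shared edges match
print(np.allclose(auto_a, auto_b))       # automatic edges generally do not
```

`plt.hist` accepts the same two forms of the `bins` argument, so the same comparison carries over to the figures.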


## Reflection
Respond to each of the following questions:
* How does the RMSE compare for both the training set and proper test set for the Linear model and the best Ridge Regression model?
* How do the timeseries predictions compare for the two models and ground truth?
* Explain the difference in coefficient distributions between these two models.

## Answers
TODO 
* 
