__NAME:__ __FULLNAME__  
__SECTION:__ __NUMBER__  
__CS 5970: Machine Learning Practices__

# Homework 5: Regularization

## Assignment Overview
Follow the TODOs and read through and understand any provided code.  
For all plots, make sure all necessary axes and curves are clearly and 
accurately labeled. Include figure/plot titles appropriately as well.

### Task
This assignment explores regularization, a powerful tool in machine learning that 
imposes constraints on a model during training to mitigate overfitting to the 
training set and improve generalization. By adding one or more terms to the cost 
function that penalize the weights, the learning algorithm tries to fit the data 
while avoiding weight values (typically large ones) that might overfit the data.
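
As a minimal sketch of this idea (on synthetic data, not the BMI data), the cell below compares an unregularized `LinearRegression` with an `ElasticNet`. scikit-learn documents the ElasticNet objective as `1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2`. The names `X_demo`, `y_demo`, `penalty`, and the chosen `alpha` and `l1_ratio` values are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet

# Synthetic data: 30 features, only the first 3 actually drive the output
rng = np.random.default_rng(0)
n_samples, n_features = 40, 30
X_demo = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
y_demo = X_demo @ true_w + rng.normal(scale=1.0, size=n_samples)

def penalty(w, alpha, l1_ratio):
    '''Penalty part of the ElasticNet objective (L1 term plus L2 term)'''
    l1_term = alpha * l1_ratio * np.abs(w).sum()
    l2_term = 0.5 * alpha * (1 - l1_ratio) * np.sum(w ** 2)
    return l1_term + l2_term

alpha, l1_ratio = 0.5, 0.5  # arbitrary illustrative values
lin = LinearRegression().fit(X_demo, y_demo)
enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000).fit(X_demo, y_demo)

# The regularized model trades some training fit for a smaller penalty
pen_lin = penalty(lin.coef_, alpha, l1_ratio)
pen_enet = penalty(enet.coef_, alpha, l1_ratio)
n_zero = int(np.sum(enet.coef_ == 0))  # the L1 term can zero out coefficients
print(pen_lin, pen_enet, n_zero)
```

With few samples relative to the number of features, the unregularized fit is free to assign sizable weights to irrelevant features; the penalty discourages this.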


### Data set
The BMI data will be utilized. Recall: 
* _MI_ files contain the number of action potentials for 48 neurons, at multiple 
time points, for a single fold. There are 20 folds (20 files), and each fold consists 
of over 1000 time points (the rows). At each time point, the number of activations 
of each neuron is recorded in 20 bins. Therefore, each time point has 48 * 20 = 960 
columns.  
* _theta_ files record the angular position of the shoulder (in column 0) and the elbow 
(in column 1) for each time point (in radians).  
* _dtheta_ files record the angular velocity of the shoulder (in column 0) and the elbow 
(in column 1) for each time point (in radians/sec).  
* _torque_ files record the torque of the shoulder (in column 0) and the elbow (in column 
1) for each time point (N-m).  
* _time_ files record the actual time stamp of each time point (seconds).  

This assignment utilizes code examples and concepts from the lectures on Sept 19 - Oct 1.

### Objectives
* Use and understand regularization in regression
* Learn to select hyper-parameters to tune model behavior, specifically:
    * Regularization parameters
    * Training set size

### Notes
* Do not save work within the ml_practices folder

### General References
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
* [Sci-kit Learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Sci-kit Learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

In [None]:
import pandas as pd
import numpy as np
import os, re, fnmatch, time
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.metrics import make_scorer

FIGWIDTH = 6
FIGHEIGHT = 6
FONTSIZE = 10

plt.rcParams['figure.figsize'] = (FIGWIDTH, FIGHEIGHT)
plt.rcParams['font.size'] = FONTSIZE

plt.rcParams['xtick.labelsize'] = FONTSIZE
plt.rcParams['ytick.labelsize'] = FONTSIZE

%matplotlib inline

# LOAD DATA

In [None]:
def read_bmi_file_set(directory, filebase):
    '''
    Read a set of whitespace-delimited files, one array per file
    :param directory: The directory in which to scan for the files
    :param filebase: A file specification that potentially includes wildcards
    :returns: A list of Numpy arrays (one for each fold)
    '''
    
    # The set of files in the directory that match filebase, in sorted order
    files = fnmatch.filter(os.listdir(directory), filebase)
    files.sort()

    # Read each matching file into a Numpy array. sep=r'\s+' splits on runs of
    # whitespace (the replacement for the deprecated delim_whitespace=True)
    lst = [pd.read_csv(os.path.join(directory, file), sep=r'\s+').values for file in files]
    
    # The folds are kept separate (not concatenated) so that they can be
    # combined into training and testing sets later
    return lst
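
The matching and sorting steps in `read_bmi_file_set()` can be sketched in isolation. The file names below are hypothetical (the actual names in the data directory may differ). The sketch shows two things: a pattern must match the whole name, so `'theta_fold*'` does not pick up `dtheta` files; and `sort()` is lexicographic, so unpadded fold numbers would not sort numerically.

```python
import fnmatch

# Hypothetical file names, for illustration only
names = ['theta_fold2.txt', 'dtheta_fold1.txt', 'theta_fold10.txt',
         'theta_fold1.txt', 'MI_fold1.txt']

# fnmatch patterns are anchored at both ends of the name
matched = fnmatch.filter(names, 'theta_fold*')

# Lexicographic sort: 'theta_fold10.txt' lands before 'theta_fold2.txt'
matched.sort()
print(matched)
```

If fold order matters, it is worth printing the sorted file list once to confirm that it matches the intended fold numbering.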

In [None]:
""" TODO
Load the BMI data from all the folds, using read_bmi_file_set()
"""
dir_name = '../ml_practices/imports/datasets/bmi/DAT6_08' # TODO: make sure to set this appropriately
MI_folds = read_bmi_file_set(dir_name, 'MI_fold*')
theta_folds = read_bmi_file_set(dir_name, 'theta_fold*')
dtheta_folds = read_bmi_file_set(dir_name, 'dtheta_fold*')
torque_folds = read_bmi_file_set(dir_name, 'torque_fold*')
time_folds = read_bmi_file_set(dir_name, 'time_fold*')

alldata_folds = zip(MI_folds, theta_folds, dtheta_folds, torque_folds, time_folds)

nfolds = len(MI_folds)
nfolds

In [None]:
"""
Print out the shape of all the data for each fold
"""
# The loop variable is named t (not time) to avoid shadowing the imported time module
for i, (MI, theta, dtheta, torque, t) in enumerate(alldata_folds):
    print("FOLD %2d " % i, MI.shape, theta.shape, dtheta.shape, torque.shape, t.shape)

In [None]:
""" 
Print out the first few examples of the theta data
for a few folds
"""
for i, theta in enumerate(theta_folds[:3]):
    print("FOLD %2d" % i)
    print(theta[:5, :])

In [None]:
"""
Check the data for any NaNs
"""
def anynans(X):
    return np.isnan(X).any()

alldata_folds = zip(MI_folds, theta_folds, dtheta_folds, torque_folds, time_folds)

# The loop variable is named t (not time) to avoid shadowing the imported time module
for i, (MI, theta, dtheta, torque, t) in enumerate(alldata_folds):
    print("FOLD %2d " % i, anynans(MI), anynans(theta), anynans(dtheta), anynans(torque), anynans(t))
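
The cell above rebuilds `alldata_folds` before looping because `zip()` returns a one-shot iterator in Python 3: once it has been consumed, iterating again yields nothing. A minimal sketch with toy lists (the names are illustrative only):

```python
a = [1, 2]
b = ['x', 'y']

pairs = zip(a, b)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing left to iterate over

# Wrapping in list() gives a sequence that can be traversed repeatedly
reusable = list(zip(a, b))
print(first_pass, second_pass, reusable)
```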

# REGULARIZED REGRESSION

In [None]:
""" TODO
Evaluate the training performance of an already trained model
"""
def mse_rmse(trues, preds):
    '''
    Compute MSE and rMSE for each column separately.
    '''
    mse = np.sum(np.square(trues - preds), axis=0) / trues.shape[0]
    rmse_rads = np.sqrt(mse)
    rmse_degs = rmse_rads * 180 / np.pi
    return mse, rmse_rads, rmse_degs

def predict_score_eval(model, X, y):
    '''
    Compute the model predictions and corresponding scores.
    PARAMS:
        model: an already trained model
        X: feature data
        y: corresponding output
    RETURNS:
        mse: mean squared error for each column
        rmse_rads: rMSE in radians
        rmse_degs: rMSE in degrees
        score: score computed by the model's score() method
        preds: predictions of the model from X
    '''
    
    # TODO: place implementation from HW4 here
    
    return mse, rmse_rads, rmse_degs, score, preds


""" TODO
Create scoring function object for gridsearch

This represents a more general way of creating a scoring mechanism than
what was discussed in the lectures.

GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
make_scorer: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html

"""
def rmse_deg_scorer(trues, preds):
    '''
    Compute rMSE in degrees
    '''
    _, _, rmse_degs = mse_rmse(trues, preds)
    return # TODO: return the rMSE in degrees 

# Make the scoring function for GridSearch
rmse_deg_scoring = make_scorer(rmse_deg_scorer, greater_is_better=False)
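
A note on `greater_is_better=False`: the scorer returned by `make_scorer` negates the metric so that larger values are always better, which is why the grid search scores later in this notebook need to be negated. The sketch below demonstrates the sign flip on synthetic data with a stand-in metric; `rmse`, `neg_rmse_scoring`, `X_demo`, and `y_demo` are hypothetical names and not part of the assignment code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer

def rmse(trues, preds):
    '''Plain root mean squared error (stand-in metric for this sketch)'''
    return float(np.sqrt(np.mean(np.square(trues - preds))))

# A scorer is called as scorer(model, X, y), not as scorer(trues, preds)
neg_rmse_scoring = make_scorer(rmse, greater_is_better=False)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
model = LinearRegression().fit(X_demo, y_demo)

raw = rmse(y_demo, model.predict(X_demo))         # positive error
scored = neg_rmse_scoring(model, X_demo, y_demo)  # same magnitude, negated
print(raw, scored)
```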


In [None]:
"""
Construct training set to obtain best model and testing set for 
evaluation. The model will focus on predicting the shoulder torque
(column 0 of the torque data, as selected below).
"""
# Extract fold indices for the training and testing sets
trainset_fold_inds = [5, 6] 
testset_fold_inds = [8, 9] 

# Combine the folds into single numpy arrays
# Training set
MI_trainset = [MI_folds[f] for f in trainset_fold_inds]
torque_trainset = [torque_folds[f] for f in trainset_fold_inds]
time_trainset = [time_folds[f] for f in trainset_fold_inds]

X = np.concatenate(MI_trainset, axis=0)
y = np.concatenate(torque_trainset, axis=0)[:,0]
time_train = np.concatenate(time_trainset, axis=0)  # named to avoid shadowing the time module

# Testing set
MI_testset = [MI_folds[f] for f in testset_fold_inds]
torque_testset = [torque_folds[f] for f in testset_fold_inds]
time_testset = [time_folds[f] for f in testset_fold_inds]

Xtest = np.concatenate(MI_testset, axis=0)
ytest = np.concatenate(torque_testset, axis=0)[:,0]
time_test = np.concatenate(time_testset, axis=0)

In [None]:
X.shape, y.shape, Xtest.shape, ytest.shape

## Linear Model

In [None]:
""" TODO
Construct and train the default linear model using the training set.
Display the Training and Testing rMSEs in degrees.
You can use the rmse_deg_scorer for this.
"""



## Grid Search and ElasticNet Model

In [None]:
""" TODO
Specify the parameter search grid as a dictionary, and display it
"""
alphas = np.logspace(-10, 9, base=2, num=9, endpoint=True)
l1_ratios = np.arange(0, 1.2, .2)
max_iters = [10000]  # integer: recent scikit-learn versions reject a float max_iter
nalphas = len(alphas)
nl1_ratios = len(l1_ratios)

param_grid = # TODO
param_grid
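
To get a feel for the search space, the sketch below reproduces the alpha spacing used above and shows how scikit-learn enumerates a grid with `ParameterGrid`. The small `demo_grid` is a hypothetical example and not the grid required by the TODO; `np.linspace` is shown as an alternative to `np.arange` that avoids floating-point step accumulation.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same spacing as the alphas above: 9 values from 2**-10 to 2**9,
# evenly spaced in the exponent
alphas_demo = np.logspace(-10, 9, base=2, num=9, endpoint=True)

# Evenly spaced ratios with exact endpoints
ratios_demo = np.linspace(0, 1, 6)

# Hypothetical small grid: every combination of values is one candidate
demo_grid = {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.25, 0.75]}
combos = list(ParameterGrid(demo_grid))
print(alphas_demo)
print(len(combos), combos[:2])
```

Keys are iterated in sorted order with the last key varying fastest, so here `l1_ratio` changes between consecutive candidates while `alpha` stays fixed.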

In [None]:
""" TODO
Perform the GridSearch using an ElasticNet model and the parameter grid 
constructed above. Use 10 cross validation folds, rmse_deg_scoring for 
the scoring function, return_train_score=True, and verbosity 1. Choose 
n_jobs to speed up the search without exhausting memory or timing out. 
Pass iid=False only on scikit-learn versions older than 0.24; the parameter 
was removed in 0.24. Execute the grid search using the training data.
"""



In [None]:
""" TODO
Get and display the best parameter set

Note: see the best_params_ property of the GridSearchCV object
"""



In [None]:
""" TODO
Get and fit the best estimator to the training data

Note: see the best_estimator_ property of the GridSearchCV object
"""



In [None]:
""" TODO
Get and display the first few lines of results from the gridsearch 

Note: see the cv_results_ property of GridSearchCV. And, remember that this dict
can be converted to a DataFrame
"""



In [None]:
""" TODO
Extract and negate the mean_train_score
"""



In [None]:
""" TODO
Extract and negate the mean_test_score

Note: although scikit-learn refers to this as a "test score," it is actually
a validation score.  Remember, you are not allowed to look at the test set
performance across a grid of parameter choices (only look at the one test score 
for the hyper parameter set that you select).
"""
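
The workflow in the preceding cells can be sketched end to end on synthetic data. This is an illustration under assumptions, not the assignment solution: it uses a tiny hypothetical grid, 3 folds, and scikit-learn's built-in `'neg_root_mean_squared_error'` scoring string in place of `rmse_deg_scoring`. All names (`X_demo`, `demo_grid`, `search`, `results`, `val_grid`) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=60)

demo_grid = {'alpha': [0.01, 0.1, 1.0], 'l1_ratio': [0.2, 0.8]}
search = GridSearchCV(ElasticNet(max_iter=10000), demo_grid, cv=3,
                      scoring='neg_root_mean_squared_error',
                      return_train_score=True)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, one entry per candidate parameter set
results = pd.DataFrame(search.cv_results_)

# Scores are negated errors; flip the sign to recover the rMSE
results['train_rmse'] = -results['mean_train_score']
results['val_rmse'] = -results['mean_test_score']

# Arrange the validation error as an alpha (rows) by l1_ratio (columns) table
val_grid = results.pivot(index='param_alpha', columns='param_l1_ratio',
                         values='val_rmse')
print(search.best_params_)
print(val_grid)
```

Pivoting on the `param_*` columns avoids relying on the order in which the candidates were evaluated.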



In [None]:
""" TODO
Display the Training and Testing rMSEs in degrees for the best estimator.
You can use rmse_deg_scorer for this
"""
# Train rmse


# Test rmse (note: this is the proper test set)


In [None]:
""" TODO
Plot the test set predictions for the best model compared with
the ground truth and the test set predictions from the linear model, 
for 1170 to 1187 seconds
"""



In [None]:
""" TODO
Plot the mean training and validation results from the grid search as a 
colormap, for the alpha (y-axis) vs the l1 ratio (x-axis). Use two subplots, 
subplot(1,2,1) for the training set performance and subplot(1,2,2) for the 
validation set performance. You can use imshow or matshow to display colormaps. 
Make sure to include appropriate labels, ticks, and colorbars.
"""
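
A minimal sketch of a labeled colormap with `imshow`, using a random matrix as a stand-in for grid search results. The values in `row_vals` and `col_vals` are hypothetical; the point is only how tick positions (cell indices) are mapped to parameter values.

```python
import numpy as np
import matplotlib.pyplot as plt

row_vals = np.array([0.01, 0.1, 1.0])      # hypothetical alpha values (rows)
col_vals = np.array([0.2, 0.5, 0.8, 1.0])  # hypothetical l1_ratio values (columns)
scores = np.random.default_rng(2).uniform(1.0, 3.0,
                                          size=(len(row_vals), len(col_vals)))

fig, ax = plt.subplots(figsize=(4, 3))
im = ax.imshow(scores, aspect='auto', origin='lower')

# imshow places cell (i, j) at x=j, y=i, so ticks sit at integer indices
ax.set_xticks(np.arange(len(col_vals)))
ax.set_xticklabels(['%.1f' % v for v in col_vals])
ax.set_yticks(np.arange(len(row_vals)))
ax.set_yticklabels(['%.2g' % v for v in row_vals])

ax.set_xlabel('l1_ratio')
ax.set_ylabel('alpha')
ax.set_title('Synthetic scores')
cbar = fig.colorbar(im, ax=ax)
cbar.set_label('rMSE (synthetic)')
```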



In [None]:
""" TODO
Generate a plot that contains two overlapping histograms:
- Coefficients discovered by LinearRegression
- Coefficients discovered by the best ElasticNet 
    (best is relative to the validation performance)
"""
nbins = 21
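
When overlaying two histograms, sharing one set of bin edges keeps the bars directly comparable. The sketch below uses random numbers as stand-ins for two sets of model coefficients; `coef_a`, `coef_b`, and `edges` are illustrative names only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
coef_a = rng.normal(scale=1.0, size=200)  # stand-in for widely spread coefficients
coef_b = rng.normal(scale=0.2, size=200)  # stand-in for shrunken coefficients

# Common bin edges computed over both sets together (21 bins -> 22 edges)
edges = np.histogram_bin_edges(np.concatenate([coef_a, coef_b]), bins=21)

fig, ax = plt.subplots()
counts_a, _, _ = ax.hist(coef_a, bins=edges, alpha=0.5, label='set A')
counts_b, _, _ = ax.hist(coef_b, bins=edges, alpha=0.5, label='set B')
ax.set_xlabel('coefficient value')
ax.set_ylabel('count')
ax.set_title('Overlapping histograms (synthetic)')
ax.legend()
```

The `alpha=0.5` transparency keeps both distributions visible where they overlap.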

