__NAME:__ __FULLNAME__  
 
__CS 5703: Machine Learning Practice__

# Homework 4: Linear Regression

## Assignment Overview
Follow the TODOs and read through and understand any provided code.  
For all plots, make sure all necessary axes and curves are clearly and 
accurately labeled. Include figure/plot titles appropriately, as well.

### Task
For this assignment you will construct regression models from training sets of 
different sizes and evaluate the training and validation performance of each model. 
Because it is good practice to have a high-level understanding of the data one is 
working with, we will also display aspects of the data after loading it. 

### Data set
The BMI (Brain Machine Interface) data are stored in a single pickle file; within this file, there
is one dictionary that contains all of the data.  The keys are: 'MI', 
'theta', 'dtheta', 'torque', and 'time'.  Each of these objects is a Python list of 20 
NumPy matrices; each matrix contains an independent fold of data, with rows representing 
different samples and columns representing different dimensions.  The samples are organized 
contiguously (one sample every 50 ms), but there are gaps in the data.
* _MI_ contains the data for 48 neurons.  Each row encodes the number of action potentials for 
each neuron at each of 20 different time bins (so, 48 x 20 = 960 columns).  
* _theta_ contains the angular position of the shoulder (in column 0) and the elbow 
(in column 1) for each sample.  
* _dtheta_ contains the angular velocity of the shoulder (in column 0) and the elbow 
(in column 1) for each sample.  
* _torque_ contains the torque of the shoulder (in column 0) and the elbow (in column 
1) for each sample.  
* _time_ contains the actual time stamp of each sample.
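
The `time` stamps make the gaps mentioned above easy to locate. The following is a minimal sketch that uses made-up time stamps rather than the real BMI data; it assumes the stamps are in seconds, and the `1.5 * dt` threshold is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical time stamps (seconds): 50 ms sampling with one gap inserted
dt = 0.05
time = np.concatenate([np.arange(0, 10) * dt, 2.0 + np.arange(0, 5) * dt])

# Flatten in case the stamps are stored as an N x 1 matrix
t = np.ravel(time)

# A gap is any step between consecutive samples noticeably larger than dt
steps = np.diff(t)
gap_idx = np.flatnonzero(steps > 1.5 * dt)

for idx in gap_idx:
    print(f"gap after sample {idx}: {t[idx]:.2f} s -> {t[idx + 1]:.2f} s")
```

The same check could be applied to each entry of the `time` list once the data are loaded.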

A fold is a subset of the available data.  Each fold contains independent time points.
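
To make the fold layout concrete, the sketch below builds a small synthetic stand-in for the dictionary and stacks several folds into one training set with `np.concatenate`. The name `fake_bmi`, the Poisson spike counts, and the per-fold sample counts are all invented for illustration; only the number of folds and the column counts follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
nfolds, n_neurons, n_bins = 20, 48, 20

# Made-up number of samples in each fold
n_samples = rng.integers(50, 80, size=nfolds)

# Synthetic stand-in for two of the dictionary entries
fake_bmi = {
    'MI': [rng.poisson(1.0, size=(n, n_neurons * n_bins)) for n in n_samples],
    'dtheta': [rng.normal(size=(n, 2)) for n in n_samples],
}

# Folds may differ in their number of rows but share the same columns
print(fake_bmi['MI'][0].shape, fake_bmi['dtheta'][0].shape)

# Stacking the first three folds row-wise gives one larger training set
X3 = np.concatenate(fake_bmi['MI'][:3])
y3 = np.concatenate(fake_bmi['dtheta'][:3])
print(X3.shape, y3.shape)
```

Because rows are samples, concatenating along the default axis 0 keeps each input row aligned with its output row.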

This assignment uses code examples and concepts from the lectures on regression.

### Objectives
* Understand the impact of the training set size
* Understand the essentials of linear regression:
  + Prediction
  + Multiple Regression
  + Performance Evaluation

### Notes
* Make sure to select pages for your submission on Gradescope
* Make sure to fill in your name on the assignment

### General References
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
* [Sci-kit Learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Sci-kit Learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)
* [Torque](https://en.wikipedia.org/wiki/Torque)
* [Velocity](https://en.wikipedia.org/wiki/Velocity)


### Hand-In Procedure
* Execute all cells so they are showing correct results
* Notebook (from Jupyter or Colab):
  + Submit this file (.ipynb) to the Gradescope Notebook HW4 dropbox
* Note: there is no need to submit a PDF file or to submit directly to Canvas
* Note II: this homework assignment will take some real time to execute.  Leave yourself time for this.

In [None]:
import pandas as pd
import numpy as np
import pickle as pkl
import scipy.stats as stats
import os, re, fnmatch
import matplotlib.pyplot as plt
import matplotlib.patheffects as peffects

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn import metrics
from sklearn.linear_model import Ridge


# Default figure parameters
plt.rcParams['figure.figsize'] = (10,5)
plt.rcParams['font.size'] = 12
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = True
plt.rcParams['axes.titlesize'] = 18
plt.rcParams['axes.labelsize'] = 12

%matplotlib inline



In [None]:
# Mount the Google Drive

from google.colab import drive
drive.mount('/content/drive')

# LOAD DATA

In [None]:
""" TODO
Load the BMI data from all the folds
"""
# Local file location
#fname = '/home/fagg/datasets/bmi/bmi_dataset.pkl'
# CoLab file location
fname = '/content/drive/MyDrive/MLP_2022/datasets/bmi_dataset.pkl'

with open(fname, 'rb') as f:
  bmi = pkl.load(f)

# TODO: finish extracting the MI data folds (other folds provided)
print(bmi.keys())
theta_folds = bmi['theta']
dtheta_folds = bmi['dtheta']
torque_folds = bmi['torque']
time_folds = bmi['time']
#MI_folds = # TODO

# Combine the data for the individual folds together into their own tuples
alldata_folds = zip(MI_folds, theta_folds, dtheta_folds, torque_folds, time_folds)

nfolds = len(MI_folds)
nfolds

In [None]:
"""
Print out the shape of all the data for each fold
"""
# TODO: finish by including shape of time data
for i, (MI, theta, dtheta, torque, time) in enumerate(alldata_folds):
  print(f"Fold {i} {MI.shape} {theta.shape} {dtheta.shape} {torque.shape} *** ")

In [None]:
""" PROVIDED
Print out the first few rows and columns of the MI data
for a few folds
"""
for i, MI in enumerate(MI_folds[:3]):
  print(f"Fold {i}")
  print(MI[:5,:20])

In [None]:
""" TODO
Check the data for any NaNs
"""
def anynans(X):
    return np.isnan(X).any()

alldata_folds = zip(MI_folds, theta_folds, dtheta_folds, torque_folds, time_folds)

# TODO: finish by checking the MI data for any NaNs
for i, (MI, theta, dtheta, torque, time) in enumerate(alldata_folds):
  print(f"FOLD {i}  *** ")

In [None]:
""" PROVIDED
For the first 4 folds, plot the data for the elbow and shoulder
and from one neuron
"""
f = 4
data_folds = zip(MI_folds[:f], theta_folds[:f], dtheta_folds[:f], 
                 torque_folds[:f], time_folds[:f])

for i, (MI, theta, dtheta, torque, time) in enumerate(data_folds):
    fig, axs = plt.subplots(4, 1)
    fig.subplots_adjust(hspace=.05)
    axs = axs.ravel()
    
    # Neural Activation Counts
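    # (use_line_collection is omitted: it is the default behavior in 
    # Matplotlib >= 3.3, and the argument was removed in later releases)
    axs[0].stem(time, MI[:,0], label='counts')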
    #axs[0].plot(time,MI[:,0], label='counts')
    axs[0].set_title(f"Fold {i}")
    axs[0].legend(loc='upper left')
    
    lgnd = ['shoulder', 'elbow']
    
    # Position
    axs[1].plot(time, theta)
    axs[1].set_ylabel(r"$\theta \;(rad)$")
    axs[1].legend(lgnd, loc='upper left')
    
    # Velocity
    axs[2].plot(time, dtheta)
    axs[2].set_ylabel(r"$d\theta\; /\; dt \;(rad/s)$")
    axs[2].legend(lgnd, loc='upper left')
    
    # Torque
    axs[3].plot(time, torque)
    axs[3].set_ylabel(r"$\tau \;(Nm)$")
    axs[3].legend(lgnd, loc='upper left')
    if i == (f-1): 
        axs[3].set_xlabel('Time (s)')

# MODEL OUTPUTS

In [None]:
""" PROVIDED
For fold 6, visualize the correlation between the shoulder
and elbow for the angular position, the angular velocity, and the 
torque
"""
f = 6

y_pos = theta_folds[f]
y_vel = dtheta_folds[f]
y_tor = torque_folds[f]
time = time_folds[f]

nrows = 3
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
fig.subplots_adjust(wspace=.3, hspace=.7)
axs = axs.ravel()
xlim = [970, 1000]

# POSITION
p = 0
axs[p].plot(time, y_pos)
axs[p].set_ylabel(r'$\theta \;(rad)$')
axs[p].legend(['shoulder', 'elbow'], loc='upper left')
axs[p].set_xlim(xlim)

p = 1
axs[p].plot(y_pos[:,0], y_pos[:,1])
axs[p].set_ylabel('elbow')
#axs[p].set_title(r'$\theta \; (rad)$')

# VELOCITY
p = 2
axs[p].plot(time, y_vel)
axs[p].set_ylabel(r'$d\theta\;/\;dt\;(rad/s)$')
#axs[p].set_title(r'$d\theta\;/\;dt\;(rad/s)$')
axs[p].legend(['shoulder', 'elbow'], loc='upper left')
axs[p].set_xlim(xlim)

p = 3
axs[p].plot(y_vel[:,0], y_vel[:,1])
axs[p].set_ylabel('elbow')
#axs[p].set_title(r'd$\theta\;/\;dt\;(rad/s)$')

# TORQUE
p = 4
axs[p].plot(time, y_tor)
axs[p].set_ylabel(r'$\tau \;(Nm)$')
#axs[p].set_title(r'$\tau$')
axs[p].legend(['shoulder', 'elbow'], loc='upper left')
axs[p].set_xlabel('Time (s)')
axs[p].set_xlim(xlim)

p = 5
axs[p].plot(y_tor[:,0], y_tor[:,1])
axs[p].set_xlabel('shoulder')
axs[p].set_ylabel('elbow')
#axs[p].set_title(r'$\tau \;(Nm)$')

# REGRESSION
Predict Velocity of the shoulder and the elbow from the neural activations
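
As a point of reference for the metrics used in this section, the sketch below computes per-column MSE and rMSE, plus the coefficient of determination, on made-up arrays (not the BMI data). For each output column, $R^2 = 1 - SS_{res} / SS_{tot}$. The final averaging step assumes that, for a multi-output model, the value reported by `score()` is the uniform average of the per-column $R^2$ values, which is the behavior in recent scikit-learn releases.

```python
import numpy as np

rng = np.random.default_rng(1)
trues = rng.normal(size=(200, 2))                 # made-up velocities (rad/s)
preds = trues + 0.1 * rng.normal(size=(200, 2))   # made-up predictions

# Per-column error metrics
mse = np.mean((trues - preds) ** 2, axis=0)
rmse_rads = np.sqrt(mse)
rmse_degs = np.degrees(rmse_rads)                 # same as * 180 / pi

# Per-column R^2 = 1 - SS_res / SS_tot, then averaged across columns
ss_res = np.sum((trues - preds) ** 2, axis=0)
ss_tot = np.sum((trues - trues.mean(axis=0)) ** 2, axis=0)
r2 = np.mean(1.0 - ss_res / ss_tot)

print(mse, rmse_rads, rmse_degs, r2)
```

Note that $R^2$ can be negative when the predictions are worse than always predicting the column mean, so it is not bounded to the range 0 to 1.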

In [None]:
""" TODO
Evaluate the training performance of an already trained model

PARAMS:
    trues: N x k numpy matrix of ground truth state (k = # dimensions
       that the model outputs; N = number of examples)
    preds: N x k numpy matrix of predictions
RETURNS:
    mse, rmse_rads: k numpy vectors
    rmse_degs: 1 x k numpy matrix
"""
def mse_rmse(trues, preds):
    '''
    Compute MSE and rMSE for each column separately.
    '''
    mse = np.sum(np.square(trues - preds), axis=0) / trues.shape[0]
    rmse_rads = np.sqrt(mse)
    rmse_degs = rmse_rads * 180 / np.pi
    return mse, rmse_rads, np.reshape(rmse_degs, (1, -1))

# TODO: finish implementation
def predict_score_eval(model, X, y):
    '''
    Compute the model predictions and corresponding scores.
    PARAMS:
        model: the trained model used to make predictions
        X: feature data
        y: corresponding output
    RETURNS:
        mse: mean squared error for each column (k vector)
        rmse_rads: rMSE in radians (k vector)
        rmse_deg: rMSE in degrees (1 x k matrix)
        score: score computed by the model's score() method (scalar)
        preds: predictions of the model from X (N x k matrix)
    '''
    preds = # TODO: use the model to predict the outputs from the input data
    
    
    # TODO: use the model to compute the score
    #       This can also be done using a function from  sklearn.metrics 
    #       but calling the model's score method will give us the default
    #       scoring method for that model. 
    #       For the LinearRegression model, this is the coefficient of 
    #       determination: R^2
    #       see the Sci-kit Learn documentation for LinearRegression for more details
    #       Also see: https://scikit-learn.org/stable/modules/model_evaluation.html
    score = # TODO
    
    mse, rmse_rads, rmse_deg = # TODO: use mse_rmse() to compute the mse and rmse

    return mse, rmse_rads, rmse_deg, score, preds


### Training

In [None]:
""" TODO
Extract the MI data from fold 6 as input and the velocity data from 
fold 6 as the output, for a multiple linear regression model (i.e.
the model will simultaneously predict shoulder and elbow velocity).
Create a LinearRegression() model and train it using fit() on the 
data from fold 6
"""
fold_idx = 6
X = MI_folds[fold_idx]
y = dtheta_folds[fold_idx]
time = time_folds[fold_idx]

# TODO

In [None]:
# Provided
# Execute this cell
X.shape, y.shape

In [None]:
""" TODO
Evaluate the training performance of the model, using predict_score_eval()
Print the results displaying MSE, rMSE in rads and degrees, and the 
correlation
"""
# TODO: call predict_score_eval() and get the corresponding outputs
mse, rmse_rads, rmse_degs, score, preds = # TODO

# TODO: print the results of predict_score_eval()


In [None]:
""" TODO
Plot the true velocity and the predicted velocity for the shoulder and 
elbow, over time. Use 2 subplots (one subplot per output).

Focus on the time range 980 to 990 seconds
"""
titles = ['Shoulder', 'Elbow']
xlim = [980,990]

# TODO: Generate the plots
fig = plt.figure()
for i in range(2):
    plt.subplot(1, 2, i+1)
    # TODO

### Testing

In [None]:
""" TODO
Evaluate the performance of the model on unseen data from fold 1.
Recall that your model was trained using data from fold 6.
Print the results displaying MSE, rMSE in rads and degrees, and 
the correlation
"""
ft = 1
Xtest = MI_folds[ft]
ytest = dtheta_folds[ft]
time_tst = time_folds[ft]

# TODO: call predict_score_eval() and get the corresponding outputs
(
    mse_test, 
    rmse_rads_test, rmse_degs_test, 
    score_test, 
    preds_test
) = # TODO 

# TODO: print the results of predict_score_eval()


In [None]:
""" TODO
Plot the true velocity and the predicted velocity over time, for the 
shoulder and the elbow. Use 2 subplots (one for the shoulder and 
the other for the elbow)

Focus on the time range 170 to 180 seconds
"""
titles = ['Shoulder', 'Elbow']
xlim = [170, 180]

# TODO: Generate the plots
fig = plt.figure()
for i in range(2):
    plt.subplot(1, 2, i+1)
    # TODO

### Evaluate Train vs Test

In [None]:
""" TODO
Compare the scores (MSE, RMSE rad, RMSE deg, correlation) on the train and test folds.
Hint: it may help to compare magnitudes using the absolute value, which is available as Python's built-in `abs`.
"""

# TODO

## Reflection
In 1-3 sentences, explain the meaning of the above comparison and the difference between the last two plots. Why does the prediction match the data on fold 6, but not on fold 1? How can we tell this from our score comparison?

# TODO

### Training Size Sensitivity
For this section, you will be training the model on a different number of folds, each time testing it on the same unseen data from another fold not used in the training procedure.
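
The general procedure can be sketched on synthetic data. Everything below is hypothetical: the data are random, the helper names `make_fold` and `column_rmse` are invented for illustration, and an ordinary least-squares fit via `np.linalg.lstsq` (without an intercept) stands in for the scikit-learn model used in this assignment.

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, n_outputs, fold_size = 30, 2, 40
W = rng.normal(size=(n_features, n_outputs))      # made-up "true" weights

def make_fold():
    """Generate one synthetic fold: noisy linear outputs."""
    X = rng.normal(size=(fold_size, n_features))
    y = X @ W + rng.normal(size=(fold_size, n_outputs))
    return X, y

folds = [make_fold() for _ in range(6)]
X_val, y_val = folds[-1]                          # last fold is held out

def column_rmse(X, y, coef):
    """rMSE of the linear prediction X @ coef, one value per output column."""
    return np.sqrt(np.mean((X @ coef - y) ** 2, axis=0))

results = {}
for k in [1, 2, 3, 5]:
    # Training set: the first k folds stacked row-wise
    X_tr = np.concatenate([f[0] for f in folds[:k]])
    y_tr = np.concatenate([f[1] for f in folds[:k]])

    # Least-squares fit (stand-in for a regression model's fit())
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

    results[k] = {'train': column_rmse(X_tr, y_tr, coef),
                  'val': column_rmse(X_val, y_val, coef)}
    print(k, results[k]['train'].round(2), results[k]['val'].round(2))
```

The held-out fold is never part of any training set, so the validation numbers for every training set size are measured on the same unseen data.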

In [None]:
""" TODO
Fill in the missing lines of code
"""
def training_set_size_loop(model, X, y, folds_inds, val_fold_idx):
    '''
    Train a model on multiple training set sizes
    
    PARAMS:
        model: object to train
        X: input data
        y: output data
        folds_inds: list of the number of folds to use for different 
                    training sets
        val_fold_idx: fold index to use as the validation set. This
                      must be greater than the max value of
                      folds_inds
    RETURNS:
        rmse: dict of train and validation RMSE lists
        corr: dict of train and validation R^2 lists
    '''
    # Create dictionaries to record performance metrics
    ncats = y[0].shape[1]
    rmse = {'train':np.empty((0, ncats)), 'val':np.empty((0, ncats))}
    corr = {'train':[], 'val':[]}
    
    # Data used for validation
    Xval = X[val_fold_idx]
    yval = y[val_fold_idx]
    
    # Loop over the different experiments
    for f in folds_inds:
        # Construct training set 
        Xtrain = np.concatenate(X[:f])
        ytrain = np.concatenate(y[:f])
        
        # TODO: Train the model
        
        # TODO: call predict_score_eval using the training data
        _, _, rmse_degs, score, _ = 
        # TODO: call predict_score_eval using the validation data
        _, _, rmse_degs_val, score_val, _ = 

        # Record the performance metrics for this experiment
        rmse['train'] = np.append(rmse['train'], rmse_degs, axis=0)
        corr['train'].append(score)
        rmse['val'] = np.append(rmse['val'], rmse_degs_val, axis=0)
        corr['val'].append(score_val)
        
    return rmse, corr

In [None]:
""" TODO 
Create a new linear model and train the model on different training set sizes, 
using training_set_size_loop() with training sets of sizes 1,2,3,5,8,13,18 
and use 19 as the val_fold_idx.
The input data is the MI data and the output data is the velocity for both the 
shoulder and elbow.
""" 
val_fold = 19
training_sizes = [1,2,3,5,8,13,18]

# TODO: Create a new LinearRegression model
model = #TODO

# TODO: get the list of rMSE and correlation values per training set fold, by
#       using training_set_size_loop 
X = # TODO
y = # TODO

rmse, corr = # TODO

In [None]:
""" TODO
Plot rMSE as a function of the training set size for
the shoulder and the elbow; also plot correlation as
a function of training set size. Use three subplots
(one for the shoulder rMSE, one for the elbow rMSE, 
and one with the correlation)
"""
titles = ['Shoulder', 'Elbow']

fig = plt.figure()
fig.subplots_adjust(hspace=.15)

# Shoulder
plt.subplot(3, 1, 1)
plt.plot(training_sizes, rmse['train'][:,0].T, label='Training')
plt.plot(training_sizes, rmse['val'][:,0].T, label='Validation')
plt.ylabel('Shoulder rMSE (deg/s)')
plt.legend()
plt.xticks([])

# Elbow
# TODO
    
# Correlation
# TODO

plt.xticks(training_sizes)

## Reflection
In 1-3 sentences, explain the results shown in the above graphs. How should we interpret them, and how much data do we really need for training?