NAME:__FULLNAME__  

# Homework 3: Classifiers

### Objectives
Follow the TODOs, and read through the provided code until you understand it.
In this assignment you will extract different types of labels, construct
predictive classifier models from these labels, and evaluate how well these
models generalize. It is also good practice to have a high-level understanding
of the data you are working with. After loading the data, we will display its
info and summary statistics, examine the head and tail, and check for missing
values (flagged as NaNs).

This assignment utilizes code examples from the lecture on classifiers.

* Pipelines
* Classification
  + Label extraction and construction
  + Prediction
  + Performance Evaluation
  + Utilization of Built-In Cross Validation Tools
* Do not save work within the MLP_2022 folder
  + create a folder in your home directory for assignments, and copy the templates there  

### General References
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
  + [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)
* [Sci-kit Learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Sci-kit Learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

### Hand-In Procedure
* Execute all cells so they are showing correct results
* Notebook (from Jupyter or Colab):
  + Submit this file (.ipynb) to the Gradescope Notebook HW3 dropbox
* Note: there is no need to submit a PDF file or to submit directly to Canvas

In [None]:
import pandas as pd
import numpy as np
import os, re, fnmatch
import matplotlib.pyplot as plt
import matplotlib.patheffects as peffects

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_error, confusion_matrix, roc_curve, auc
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Default figure parameters
plt.rcParams['figure.figsize'] = (8,4)
plt.rcParams['font.size'] = 12
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = True
plt.rcParams['axes.titlesize'] = 18
plt.rcParams['axes.labelsize'] = 16

%matplotlib inline

In [None]:
# Execute only if using CoLab
from google.colab import drive
drive.mount('/content/drive')

# LOAD DATA

In [None]:
""" TODO
Load data from subject k2 for week 05
Display info() for the data

These are data obtained from a baby on the SIPPC. 3D Position (i.e. kinematic)
data are collected at 50 Hz, for the x, y, and z positions in meters, for 
various joints such as the wrists, elbows, shoulders, etc.
"""
# Local file name
#fname = '~/datasets/baby1/subject_k2_w05.csv'

# File name if using CoLab
fname = '/content/drive/MyDrive/MLP_2022/datasets/baby1/subject_k2_w05.csv'
baby_data_raw = # TODO
#TODO 

In [None]:
""" TODO
Display the first few examples
"""


In [None]:
""" TODO
Display the last few examples
"""


In [None]:
""" TODO
Display the summary statistics
"""


In [None]:
""" TODO
Check the dataframe for any NaNs using pandas methods
isna() and any() for a summary of the missing data
"""

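As a rough illustration of the NaN check described above, the sketch below runs `isna()` and `any()` on a small toy frame. The frame and its column names are made up for the example and are not taken from the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only
df_demo = pd.DataFrame({
    "time": [0.00, 0.02, 0.04, 0.06],
    "left_wrist_x": [0.10, np.nan, 0.12, 0.13],
    "left_wrist_y": [0.20, 0.21, 0.22, 0.23],
})

# Per-column flag: True if the column holds at least one NaN
has_nan = df_demo.isna().any()
# Per-column count of missing values
nan_counts = df_demo.isna().sum()
# Single flag for the whole frame
any_nan = bool(df_demo.isna().any().any())

print(has_nan)
print(nan_counts)
print("Any missing values:", any_nan)
```

`isna()` returns a boolean frame of the same shape, and `any()` reduces it column by column, so the result is one flag per column.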

# Data Selection

In [None]:
""" PROVIDED
"""
## Support for identifying kinematic variable columns
def get_kinematic_properties(data):
    # Regular expression for finding kinematic fields
    regx = re.compile("_[xyz]$")

    # Find the list of kinematic fields
    fields = list(data)
    fieldsKin = [x for x in fields if regx.search(x)]
    return fieldsKin

def position_fields_to_velocity_fields(fields, prefix='d_'):
    '''
    Given a list of position columns, produce a new list
    of columns that include both position and velocity
    '''
    fields_new = [prefix + x for x in fields]
    return fields + fields_new


In [None]:
""" PROVIDED
Get the names of the sets of fields for the kinematic features and the 
velocities
"""
fieldsKin = get_kinematic_properties(baby_data_raw)
fieldsKinVel = position_fields_to_velocity_fields(fieldsKin)
print(fieldsKinVel)
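To see what the regular expression `_[xyz]$` selects, here is a small sketch on a made-up list of column names (the names are assumptions chosen to resemble the dataset, not its actual columns). Only names that end in `_x`, `_y`, or `_z` are kept.

```python
import re

# Hypothetical column names, for illustration only
columns_demo = ["time", "left_wrist_x", "left_wrist_y", "left_wrist_z",
                "robot_vel_l", "sippc_action", "x_offset"]

# '$' anchors the match to the end of the name
regx = re.compile("_[xyz]$")
kin_demo = [c for c in columns_demo if regx.search(c)]

# Velocity names are the position names with a 'd_' prefix
vel_demo = ["d_" + c for c in kin_demo]

print(kin_demo)
print(kin_demo + vel_demo)
```

Note that `x_offset` is not selected, because the pattern must match at the end of the name.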

# Construct Pipeline Components

In [None]:
""" PROVIDED
"""
# Pipeline component: select subsets of attributes
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribs):
        self.attribs = attribs
    def fit(self, x, y=None):
        return self
    def transform(self, X):
        return X[self.attribs]

# Pipeline component: drop all rows that contain invalid values
class DataSampleDropper(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, x, y=None):
        return self
    def transform(self, X):
        return X.dropna(how='any')

# Pipeline component: Compute derivatives
class ComputeDerivative(BaseEstimator, TransformerMixin):
    def __init__(self, attribs, dt=1.0, prefix='d_'):
        self.attribs = attribs
        self.dt = dt
        self.prefix = prefix
    def fit(self, x, y=None):
        return self
    def transform(self, X):
        # Compute derivatives
        Xout = X.copy()
        for field in self.attribs:
            # Extract the values for this field
            values = Xout[field].values
            # Compute the difference between subsequent values
            diff = values[1:] - values[0:-1]
            # Bring the length to be the same as original data
            diff = np.append(diff, 0)
            # Name of the new field
            name = self.prefix + field
            Xout[name] = pd.Series(diff / self.dt)
        return Xout
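The derivative step is a forward difference divided by the time step. The sketch below works through it on a toy position signal with a known constant velocity (the column name `pos_x` is made up). It also shows two details worth knowing: `np.append` returns a new array rather than modifying its argument, and assigning a shorter `pd.Series` to a DataFrame column aligns on the index, so the unmatched last row becomes NaN.

```python
import numpy as np
import pandas as pd

dt = 0.02                                   # seconds between samples (50 Hz)
t = np.arange(5) * dt
df_demo = pd.DataFrame({"pos_x": 3.0 * t})  # toy signal moving at 3 m/s

values = df_demo["pos_x"].values
diff = values[1:] - values[:-1]             # one element shorter than values

# Without padding: index alignment leaves NaN in the last row
df_demo["d_unpadded"] = pd.Series(diff / dt)

# With padding: np.append returns a NEW array, so keep the result
diff_padded = np.append(diff, 0)
df_demo["d_padded"] = diff_padded / dt

print(df_demo)
```

Either way, the last sample has no true forward difference; a later `dropna()` step removes it in the unpadded case, while the padded case keeps it with a velocity of 0.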


# Construct Pipelines

In [None]:
""" PROVIDED
Create four pipelines. 

The first pipeline computes the derivatives of select features
within the dataframe and then drops rows containing NaNs.

The second pipeline extracts the kinematic and velocity (derivative)
features from the dataframe.

The third pipeline extracts the time from the dataframe.

The fourth pipeline extracts the robot velocities (robot_vel_l and 
robot_vel_r) from the dataframe.
"""
# Sampling period: number of seconds between consecutive time samples
dt = .02

# Initial pre-processing
pipe_der_drop = Pipeline([
    ('derivative', ComputeDerivative(fieldsKin, dt=dt)),
    ('dropper', DataSampleDropper())
])

# Position, velocity selector
pipe_kin_vel = Pipeline([
    ('selector', DataFrameSelector(fieldsKinVel))
])

# Time selector
pipe_time = Pipeline([
    ('selector', DataFrameSelector(['time']))
])

# Robot velocity selector
pipe_robot_vel = Pipeline([
    ('selector', DataFrameSelector(['robot_vel_l', 'robot_vel_r']))
])
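For readers new to scikit-learn pipelines, the sketch below shows how custom transformer steps are chained and applied with `fit_transform()`. The two classes are simplified, hypothetical stand-ins written for this example, and the toy frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class RowDropper(TransformerMixin, BaseEstimator):
    """Hypothetical step: drop rows that contain any NaN."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.dropna(how="any")

class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical step: keep only the named columns."""
    def __init__(self, attribs):
        self.attribs = attribs
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribs]

df_demo = pd.DataFrame({"time": [0.00, 0.02, 0.04],
                        "a_x": [1.0, np.nan, 3.0],
                        "other": [7, 8, 9]})

# Steps run in order: drop incomplete rows, then select columns
pipe_demo = Pipeline([
    ("dropper", RowDropper()),
    ("selector", ColumnSelector(["time", "a_x"])),
])

out_demo = pipe_demo.fit_transform(df_demo)
print(out_demo)
print(out_demo.values.shape)
```

Because each step returns a DataFrame, `.values` can be used afterwards to obtain a NumPy array, and the number of rows is the first entry of its `shape`.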


## Pre-process and extract data

In [None]:
""" TODO
Use the pipelines to extract the data with kinematic and velocity features, 
the time, and the robot velocities.
See the lecture on classifiers for examples
"""
# TODO: use the first pipeline to perform an initial cleaning of the data
baby_data_prcd =  # TODO

# TODO: Use the result from the first pipeline to get the kinematic and 
#       velocity features by using the pipe_kin_vel pipeline
data_pos_vel = # TODO

# TODO: Use the result from the first pipeline to get the time by using
#       the pipe_time pipeline
data_time = # TODO


# TODO: Use the result from the first pipeline to get the robot velocity by using
#       the pipe_robot_vel pipeline
data_robot_vel = #TODO

# PROVIDED: Get the dataframes as numpy arrays
inputs_pos_vel = data_pos_vel.values
time = data_time.values
robot_vel = data_robot_vel.values

nsamples =  #TODO

## Examine Robot Velocity

In [None]:
""" TODO
Create a plot that contains both the linear velocity (robot_vel[:,0]) and
rotational velocity (robot_vel[:,1]).  The plot should contain appropriate labels

Note: units are m/s and rad/s, respectively
"""

# TODO
plt.figure()


In [None]:
""" PROVIDED
Create labels that correspond to "fast forward motion" and
"fast rotational motion"

"""
# Fast forward motion
labels_linear = robot_vel[:,0] > 0.0005

# Leftward turns
labels_rotational = (robot_vel[:,1]) > 0.004
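Thresholding a continuous signal yields a boolean array that can serve directly as class labels. The sketch below uses made-up velocity values and also computes the fraction of positive examples, which is useful context when judging a classifier later (a heavily imbalanced label set makes raw accuracy misleading).

```python
import numpy as np

# Made-up linear velocities in m/s, for illustration only
vel_demo = np.array([0.0, 0.0002, 0.0007, 0.0011, 0.0004, -0.0003])

# Boolean labels: True where the velocity exceeds the threshold
labels_demo = vel_demo > 0.0005

print(labels_demo.astype(int))
# The mean of a boolean array is the fraction of True values
print("Fraction positive:", labels_demo.mean())
```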

In [None]:
""" TODO
Augment the figure you created above to show the two newly-created
class labels.  Make sure that the resulting figure is easy to read
"""
# TODO
plt.figure()


## Classification Using Cross Validation
### Linear Velocity Labels

In [None]:
""" TODO
LINEAR VELOCITY

Create a SGDClassifier with random_state=42, max_iter=10000, tol=1e-3, and
that uses a log loss function. Fit the model using the position x, y, z
and velocity x, y, z for all limbs as the input features to the model. Use
the robot linear velocity labels as the output of the model.

Use cross_val_predict() to compute predictions for each sample and their
corresponding scores. Use 20 cross validation splits (i.e. cv=20).

NOTES:
- For older versions of scikit-learn (e.g., what is running in CoLab), use 'log' as the loss function.
- For modern versions of scikit-learn, use 'log_loss'
- Expect that this will take some time to compute

"""
# Model input
X = inputs_pos_vel
# Model output
y = labels_linear

# TODO: Create and fit the classifier
# loss='log_loss' works for version 1.1 and up. Sklearn in Google Colab is at 1.0.2. Check the version with sklearn.__version__
clf =  # TODO
clf.fit(X, y)

# TODO: use cross_val_predict() to compute the scores by setting the 'method'
#       parameter equal to 'decision_function'. Please see the reference 
#       links above
scores = # TODO

# TODO: use cross_val_predict() to compute the predicted labels by setting 
#       the 'method' parameter equal to 'predict'. Please see the reference 
#       links above
preds = # TODO
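The general pattern of out-of-fold scores and predictions can be sketched on synthetic data. Everything below is an illustration under stated assumptions: the data come from `make_classification`, the names ending in `_demo` are made up, only 5 folds are used to keep it fast, and the loss name is picked from the installed scikit-learn version.

```python
import re
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic binary problem (not the assignment data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)

# 'log_loss' exists from scikit-learn 1.1 on; older versions use 'log'
major, minor = map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

clf_demo = SGDClassifier(loss=loss_name, random_state=42,
                         max_iter=10000, tol=1e-3)

# Each sample is scored by a model that did not see it during training
scores_demo = cross_val_predict(clf_demo, X_demo, y_demo, cv=5,
                                method="decision_function")
preds_demo = cross_val_predict(clf_demo, X_demo, y_demo, cv=5,
                               method="predict")

print(scores_demo.shape, preds_demo.shape)
print("Out-of-fold accuracy:", np.mean(preds_demo == y_demo))
```

For a binary linear classifier, a positive decision-function score corresponds to a predicted label of 1, which is why a threshold of 0 separates the two predicted classes.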

In [None]:
# PROVIDED: Compare the true labels to the predicted labels and the scores

plt.figure()
plt.plot(time, y, 'b', label='Targets')
plt.plot(time, preds-2, 'r', label='Predictions')
plt.plot(time, scores-8, 'g', label='Scores')
plt.plot([0, time.max()], [-8, -8], 
         'k', label='threshold')
plt.xlabel("Time (s)")
plt.legend()

## Plotting Functions - Performance Results
## Linear Velocity
* Confusion Matrix Color Map
* K.S. Plot
* ROC Curve Plot

In [None]:
""" PROVIDED
"""
# Generate a color map plot for a confusion matrix
def confusion_mtx_colormap(mtx, xnames, ynames, cbarlabel=""):
    ''' 
    Generate a figure that plots a colormap of a matrix
    PARAMS:
        mtx: matrix of values
        xnames: list of x tick names
        ynames: list of the y tick names
        cbarlabel: label for the color bar
    RETURNS:
        fig, ax: the corresponding handles for the figure and axis
    '''
    nxvars = mtx.shape[1]
    nyvars = mtx.shape[0]
    
    # create the figure and plot the correlation matrix
    fig, ax = plt.subplots()
    im = ax.imshow(mtx, cmap='summer')
    if not cbarlabel == "":
        cbar = ax.figure.colorbar(im, ax=ax)
        cbar.ax.set_ylabel(cbarlabel, rotation=-90, va="bottom")
    
    # Specify the row and column ticks and labels for the figure
    ax.set_xticks(range(nxvars))
    ax.set_yticks(range(nyvars))
    ax.set_xticklabels(xnames)
    ax.set_yticklabels(ynames)
    ax.set_xlabel("Predicted Labels")
    ax.set_ylabel("Actual Labels")

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, 
             ha="right", rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    lbl = np.array([['TN', 'FP'], ['FN', 'TP']])
    for i in range(nyvars):
        for j in range(nxvars):
            text = ax.text(j, i, "%s = %.3f" % (lbl[i,j], mtx[i, j]),
                           ha="center", va="center", color="k")
            #text.set_path_effects([peffects.withStroke(linewidth=2, 
            #foreground='w')])

    return fig, ax

# Compute the ROC Curve and generate the KS plot
def ks_roc_plot(targets, scores, FIGWIDTH=12, FIGHEIGHT=6, FONTSIZE=16):
    ''' 
    Generate a figure that plots the ROC Curve and the distributions of the 
    TPR and FPR over a set of thresholds
    PARAMS:
        targets: list of true target labels
        scores: list of predicted labels or scores
    RETURNS:
        fpr: false positive rate
        tpr: true positive rate
        thresholds: thresholds used for the ROC curve
        auc: Area under the ROC Curve
        fig, axs: corresponding handles for the figure and axis
    '''
    fpr, tpr, thresholds = roc_curve(targets, scores)
    auc_res = auc(fpr, tpr)

    # Generate KS plot
    fig, ax = plt.subplots(1, 2, figsize=(FIGWIDTH,FIGHEIGHT))
    axs = ax.ravel()
    ax[0].plot(thresholds, tpr, color='b')
    ax[0].plot(thresholds, fpr, color='r')
    ax[0].plot(thresholds, tpr - fpr, color='g')
    ax[0].invert_xaxis()
    ax[0].set_xlabel('threshold', fontsize=FONTSIZE)
    ax[0].set_ylabel('fraction', fontsize=FONTSIZE)
    ax[0].legend(['TPR', 'FPR', 'K-S Distance'], fontsize=FONTSIZE)
    
    # Generate ROC Curve plot
    ax[1].plot(fpr, tpr, color='b')
    ax[1].plot([0,1], [0,1], 'r--')
    ax[1].set_xlabel('FPR', fontsize=FONTSIZE)
    ax[1].set_ylabel('TPR', fontsize=FONTSIZE)
    ax[1].set_aspect('equal', 'box')
    auc_text = ax[1].text(.05, .95, "AUC = %.4f" % auc_res, 
                          color="k", fontsize=FONTSIZE)
    print("AUC:", auc_res)

    return fpr, tpr, thresholds, auc_res, fig, axs
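The K-S distance plotted by the function above is the gap between TPR and FPR at each threshold; the threshold where that gap is largest is one common candidate for a decision threshold. A minimal sketch on four made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Tiny made-up example: two negatives, two positives
targets_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, thr_demo = roc_curve(targets_demo, scores_demo)
auc_demo = auc(fpr_demo, tpr_demo)

# K-S distance at each threshold, and where it peaks
ks_demo = tpr_demo - fpr_demo
best = np.argmax(ks_demo)

print("AUC:", auc_demo)
print("Max K-S distance:", ks_demo[best], "at threshold", thr_demo[best])
```

When several thresholds tie for the maximum, `np.argmax` returns the first one, so it is worth inspecting the full curve rather than relying on a single index.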


In [None]:
""" TODO

Compute the confusion matrix using sklearn's confusion_matrix() function and 
generate a color map using the provided confusion_mtx_colormap() for the model 
built using the linear velocity labels.
"""
label_names = ['slow', 'fast forward']

dist_confusion_mtx = # TODO
confusion_mtx_colormap( # TODO )

nneg = dist_confusion_mtx[0].sum()
npos = dist_confusion_mtx[1].sum()
npos, nneg
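As a reminder of how scikit-learn lays out a binary confusion matrix, the sketch below uses eight made-up labels. Rows are the actual classes and columns are the predicted classes, giving `[[TN, FP], [FN, TP]]`; dividing each row by its sum is one way (an assumption here, not a requirement of the assignment) to turn counts into per-class fractions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 1, 0])

mtx_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(mtx_demo)

# Row sums are the number of actual negatives and positives
nneg_demo = mtx_demo[0].sum()
npos_demo = mtx_demo[1].sum()

# Row-normalize so each row sums to 1
mtx_norm_demo = mtx_demo / mtx_demo.sum(axis=1, keepdims=True)
print(mtx_norm_demo)
```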

In [None]:
""" TODO
Plot histograms of the scores from the model built using the linear velocity labels,
comparing the distribution of scores for positive and negative examples.
Create one subplot of the distribution of all the scores. 
Create a second subplot overlaying the distribution of the scores of the 
positive examples (i.e. positive here means examples with a label of 1) with 
the distribution of the negative examples (i.e. negative here means examples 
with a label of 0). Use 41 as the number of bins.
See the lecture on classifiers for examples
"""
scores_pos = [scores[idx] for (idx, y_) in enumerate(y) if y_ > 0]
scores_neg = [scores[idx] for (idx, y_) in enumerate(y) if y_ == 0]

nbins = 41


plt.figure()
plt.subplot(1,2,1)
# TODO

plt.xlabel('score')
plt.ylabel('count')

plt.subplot(1,2,2)
# TODO

plt.xlabel('score')
plt.ylabel('count')
plt.legend(loc='upper right')
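When overlaying two histograms, sharing one set of bin edges keeps the bars comparable. The sketch below illustrates the idea with synthetic scores and `np.histogram`; all names ending in `_demo` are made up, and boolean-mask indexing is shown as an alternative way to split scores by label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels and scores (positives centered at +1, negatives at -1)
y_demo = rng.random(500) < 0.3
scores_demo = rng.normal(loc=np.where(y_demo, 1.0, -1.0), scale=1.0)

# Boolean masks split the scores by class
scores_pos_demo = scores_demo[y_demo]
scores_neg_demo = scores_demo[~y_demo]

# One shared set of edges: nbins bins need nbins + 1 edges
nbins = 41
edges = np.linspace(scores_demo.min(), scores_demo.max(), nbins + 1)

counts_pos, _ = np.histogram(scores_pos_demo, bins=edges)
counts_neg, _ = np.histogram(scores_neg_demo, bins=edges)

print(counts_pos.sum(), counts_neg.sum())
```

The same `edges` array can be passed as the `bins` argument of `plt.hist()`, and a partial `alpha` value helps both distributions stay visible where they overlap.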

In [None]:
""" TODO
LINEAR VELOCITY
Use ks_roc_plot() to plot the ROC curve and the KS plot for the model
constructed with the linear velocity labels
"""

fpr, tpr, thresholds, auc_res, ks_roc_fig, ks_roc_axs = # TODO

## Classification Using Cross Validation
### Rotational Velocity Labels

In [None]:
""" TODO
ROTATIONAL VELOCITY

Create a SGDClassifier with random_state=42, max_iter=10000, tol=1e-3, and
that uses a log loss function. Fit the model using the position x, y, z
and velocity x, y, z for all limbs as the input features to the model. Use
the robot rotational velocity labels as the output of the model.

Use cross_val_predict() to get predictions for each sample and their
corresponding scores. Use 20 cross validation splits (i.e. cv=20).

Plot the true labels, predictions, and the scores.
For more information, see the general references above.
"""
# Model input
X = inputs_pos_vel
# Model output
y = labels_rotational

# TODO: Create and fit the classifier
clf = #TODO

# TODO: use cross_val_predict() to compute the scores by setting the 'method'
#       parameter equal to 'decision_function'. Please see the reference 
#       links above
scores = # TODO

# TODO: use cross_val_predict() to compute the predicted labels by setting 
#       the 'method' parameter equal to 'predict'. Please see the reference 
#       links above
preds = # TODO

In [None]:
# PROVIDED: Compare the true labels to the predicted labels and the scores

plt.figure()
plt.plot(time, y, 'b', label='Targets')
plt.plot(time, preds-2, 'r', label='Predictions')
plt.plot(time, scores-8, 'g', label='Scores')
plt.plot([0, time.max()], [-8, -8], 
         'k', label='threshold')
plt.xlabel("Time (s)")
plt.legend()

## Plotting Functions - Performance Results
## Rotational Velocity
* Confusion Matrix Color Map
* K.S. Plot
* ROC Curve Plot

In [None]:
""" TODO

Compute the confusion matrix using sklearn's confusion_matrix() function and 
generate a color map using the provided confusion_mtx_colormap() for the model 
built using the rotational velocity labels.
"""
label_names = ['slow', 'fast rotation']

dist_confusion_mtx = # TODO
confusion_mtx_colormap(# TODO)

nneg = dist_confusion_mtx[0].sum()
npos = dist_confusion_mtx[1].sum()
npos, nneg

In [None]:
""" TODO
Plot histograms of the scores from the model built using the rotational velocity labels,
comparing the distribution of scores for positive and negative examples.
Create one subplot of the distribution of all the scores. 
Create a second subplot overlaying the distribution of the scores of the 
positive examples (i.e. positive here means examples with a label of 1) with 
the distribution of the negative examples (i.e. negative here means examples 
with a label of 0). Use 41 as the number of bins.
See the lecture on classifiers for examples
"""
scores_pos = [scores[idx] for (idx, y_) in enumerate(y) if y_ > 0]
scores_neg = [scores[idx] for (idx, y_) in enumerate(y) if y_ == 0]

nbins = 41


plt.figure()
plt.subplot(1,2,1)
# TODO

plt.xlabel('score')
plt.ylabel('count')

plt.subplot(1,2,2)
# TODO

plt.xlabel('score')
plt.ylabel('count')
plt.legend(loc='upper right')

In [None]:
""" TODO
ROTATIONAL VELOCITY
Use ks_roc_plot() to plot the ROC curve and the KS plot for the model
constructed with the rotational velocity labels
"""

# TODO
fpr, tpr, thresholds, auc_res, ks_roc_fig, ks_roc_axs = #TODO

## Reflection
Write a short paragraph that compares the results for the two classification problems that you have just solved (specifically, the linear vs. rotational label problems).  How well does each do?  For each, what is the best choice of score threshold?  And which problem does the SGDClassifier solve better?  Note that you do not need to make a statistical argument at this time.

### My answer
TODO: paragraph here
