__NAME:__ __FULLNAME__  
__SECTION:__ __NUMBER__  
__CS 5703: Machine Learning Practices__

# Homework 11: Dimensionality Reduction

## Assignment Overview
Follow the TODOs and read through and understand any provided code.  
For all plots, make sure all necessary axes and curves are clearly and 
accurately labeled. Include figure/plot titles appropriately as well.


### Task
For this assignment you will explore dimensionality reduction using
Principal Component Analysis (PCA). Having a large number of features 
can dramatically increase training times and the likelihood of overfitting.
Additionally, it is difficult to visualize and understand patterns in 
high-dimensional spaces. It is not uncommon for a lower-dimensional subspace
of the full feature space to characterize the trends in the data better.
PCA is one technique that locates such a subspace and projects
the data into it.
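
As a rough illustration of the idea (not part of the assignment code), the sketch below builds a synthetic data set whose 10 features are driven by 2 latent factors, fits scikit-learn's `PCA`, projects the data into a 2-dimensional subspace, and maps it back. The data, the `*_demo` variable names, and the choice of 2 components are all assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples, 10 features that are
# linear mixtures of 2 latent factors plus a small amount of noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Fit PCA and project the data into the 2-dimensional PC space
pca_demo = PCA(n_components=2)
Z_demo = pca_demo.fit_transform(X_demo)

# Map the projected data back into the original 10-dimensional space
X_demo_recon = pca_demo.inverse_transform(Z_demo)

# Reconstruction error: small here because the data are nearly 2-dimensional
rmse_demo = np.sqrt(np.mean((X_demo - X_demo_recon) ** 2))

print("Projected shape:", Z_demo.shape)
print("Reconstructed shape:", X_demo_recon.shape)
print(f"Variance explained by 2 PCs: {pca_demo.explained_variance_ratio_.sum():.4f}")
print(f"Reconstruction RMSE: {rmse_demo:.4f}")
```

On real data the variance is rarely concentrated this cleanly, which is why the assignment asks you to inspect how much variance each PC accounts for.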


### Data set   
Heart Arrhythmia: distinguishing Normal vs Abnormal arrhythmia.


### Objectives
Gain experience in using:
* Dimensionality Reduction
* Principal Component Analysis (PCA)
* PCA as a preprocessing step to a classifier


### General References
* [Guide to Jupyter](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Numpy Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [DataCamp: Matplotlib](https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
* [Sci-kit Learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Sci-kit Learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)
* [Sci-kit Learn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [Sci-kit Learn Preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
* [SciPy Paired t-test for Dependent Samples](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_rel.html)

### Hand-In Procedure
* Execute all cells so that they show correct results
* Notebook (from Jupyter or Colab):
  + Submit this file (.ipynb) to the Gradescope Notebook HW11 dropbox
* Note: there is no need to submit a PDF file or to submit directly to Canvas

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import metrics_plots
from pipeline_components import DataSampleDropper, DataFrameSelector


import pandas as pd
import numpy as np
import os
import time as timelib
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, Binarizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
import joblib

# Default figure parameters
plt.rcParams['figure.figsize'] = (6,5)
plt.rcParams['font.size'] = 10
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = False
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12


plt.style.use('ggplot')

In [None]:
# Only execute in Colab

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Only execute in Colab

## We've discovered a better way to do imports into colab
## Now, instead of executing the files, we will copy them
## into your colab VM, then import them as normal
import pathlib
import shutil
import re

# TODO: fill in the right folder location
folder = '/content/drive/My Drive/TODO'

for n in pathlib.Path(folder).iterdir():
  # Copy only files whose names end in .py (the original pattern also matched .pyc, etc.)
  if re.search(r'\.py$', n.name):
    shutil.copy(n, n.name)

In [None]:
# COLAB and Local execution
import metrics_plots
from pipeline_components import DataSampleDropper, DataFrameSelector, DataSampleSwapper

# LOAD DATA

In [None]:
filename = '/content/drive/My Drive/MLP_2022/datasets/heart_arrhythmia.csv'
#filename = 'heart_arrhythmia.csv'

heart = pd.read_csv(filename, delimiter=',', nrows=None)
heart.dataframeName = filename

nRows, nCols = heart.shape
print(f'There are {nRows} rows and {nCols} columns')

In [None]:
heart.columns

In [None]:
d=heart['diagnosis'].values
plt.hist(d)
np.sum(d==1)

# Classification

In [None]:
""" PROVIDED
Compute the root-mean-square error (RMSE) between two arrays, ignoring NaNs.

Used to evaluate the reconstruction error of a PCA model
"""
def compute_rmse(x, y):
    return np.sqrt(np.nanmean((x - y)**2))

""" PROVIDED
Evaluate the training performance of an already trained classifier model
"""
def predict_and_score(model, X, y):
    '''
    Compute the model predictions and corresponding scores.
    PARAMS:
        model: trained classifier
        X: feature data
        y: corresponding output
    RETURNS:
        preds: predictions of the model from X
        score: score computed by the model's score() method
        f1: F1 score
        
    '''
    preds = model.predict(X)

    f1 = f1_score(y, preds)
    score = model.score(X, y)
    
    return preds, score, f1

In [None]:
""" PROVIDED
Create a Pipeline to prepare the data
"""
# Features to keep in the analysis
feature_names_initial = heart.columns.drop(['J'])

# Features to keep as inputs to the model
feature_names = heart.columns.drop(['diagnosis', 'J'])

# Preprocessing pipeline will be a component of the input/output pipelines
pipe_pre = Pipeline([
    ("removeAttribs", DataFrameSelector(feature_names_initial)),
    ("Cleanup", DataSampleSwapper((('?', np.nan),))),
    ("NaNrowDropper", DataSampleDropper()),
])

# Input pipeline
pipe_X = Pipeline([
    ("pipe_pre", pipe_pre),
    ("selectAttribs", DataFrameSelector(feature_names)),
    ("scaler", RobustScaler())
])

# Output pipeline
pipe_y = Pipeline([
    ("pipe_pre", pipe_pre),
    ("selectAttribs", DataFrameSelector(['diagnosis'])),
    #("binarizer", Binarizer())
])

In [None]:
""" PROVIDED
Format the data to provide to the models
"""
X = pipe_X.fit_transform(heart)
y = pipe_y.fit_transform(heart).values.ravel()

# y is an int - convert to 1/0
# 1.0 = Normal; 0.0 = abnormal
y = (y == 1) + 0.0

In [None]:
X.shape, y.shape

In [None]:
plt.hist(y)

In [None]:
# Split the data into training and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, 
                                                test_size=0.5, random_state=42)

In [None]:
target_names = ['Abnormal', 'Normal']

# BENCHMARK
The task is to predict which patients have a Normal diagnosis.
We are going to compare the performance of a LogisticRegression model trained on the original data to that of a LogisticRegression model trained on PCA-transformed data.

## LogisticRegression Benchmarks

In [None]:
""" TODO
LogisticRegression benchmark for comparison.  

Do not use regularization.  Use as many iterations as you need 
to allow the model to converge to a solution.
"""
benchmark_lnr = LogisticRegression(TODO)
benchmark_lnr.fit(Xtrain, ytrain)

# Compute predictions on fully trained model for train set
preds, score, f1 = predict_and_score(benchmark_lnr, Xtrain, ytrain)
print("Train:\tF1: %.3f\tScore: %.3f" % (f1, score))
# Compute predictions on fully trained model for test set
preds, score, f1 = predict_and_score(benchmark_lnr, Xtest, ytest)
print("Test:\tF1: %.3f\tScore: %.3f" % (f1, score))

## Reflection #1

What are the F1 scores for both the training and test sets?

__TODO__

Train: 1.0
Test: 0.747 (could vary a little, depending on configuration)

# Principal Component Analysis

In [None]:
""" TODO
Train a PCA model using the training set with whiten=True
"""


In [None]:
""" TODO
Examine how much variance is accounted for by each PC.
"""


In [None]:
""" TODO
How many PCs are necessary to achieve a specified variance?
"""


## Reflection #2



How many PCs are necessary to account for 90% of the data variance?

__TODO__


How many PCs are necessary to account for 95%?

__TODO__


How many PCs are necessary to account for 99% of the data variance?

__TODO__
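
One common way to approach questions like these is to take the cumulative sum of `explained_variance_ratio_` from a PCA fit with all components and find the first position at which it reaches the target. The sketch below does this on synthetic data. The helper `n_components_for`, the synthetic features, and the `*_demo` names are hypothetical and not provided with the assignment.

```python
import numpy as np
from sklearn.decomposition import PCA


def n_components_for(pca, target):
    """Smallest number of PCs whose cumulative explained variance
    ratio reaches `target` (a fraction between 0 and 1)."""
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= target;
    # add 1 to turn the zero-based index into a count of components
    k = int(np.searchsorted(cumulative, target)) + 1
    # Guard against floating-point round-off when target is very close to 1
    return min(k, len(cumulative))


rng = np.random.default_rng(0)

# Synthetic data (assumption): 20 independent features with decreasing spread
scales_demo = 1.0 / np.arange(1, 21)
X_var_demo = rng.normal(size=(500, 20)) * scales_demo

# Fit PCA with all components so every variance ratio is available
pca_full_demo = PCA().fit(X_var_demo)

counts_demo = {t: n_components_for(pca_full_demo, t) for t in (0.90, 0.95, 0.99)}
for target, k in counts_demo.items():
    print(f"{target:.0%} of variance: {k} PCs")
```

A plot of the cumulative curve against the number of PCs conveys the same information visually and makes the trade-off easier to discuss.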


In [None]:
""" TODO
Using the number of PCs obtained for 99% variance, re-fit the PCA with
whiten=True, project the training data into PC space, and reconstruct it
in the original feature space as Xtrain_recon (used by the RMSE call below)
"""

#TODO

# PROVIDED: Compute the reconstruction error (rmse)
compute_rmse(Xtrain, Xtrain_recon)

In [None]:
""" TODO
Implement a model Pipeline. The first step of the pipeline is 
PCA with whiten=True and n_components set to the number of PCs, 
determined above, that account for 99% of the data variance.

The second step of the pipeline is LogisticRegression() with no regularization.
"""
# TODO: Create Pipeline model

# TODO: Fit model to entire train set
pca_model.fit(Xtrain, ytrain)

## Reflection #3

What are the F1 scores for both the training and test sets for the 99% variance case?

__TODO__


## GRIDSEARCH 
Use the GridSearchCV class to search for hyper-parameters for the Pipeline object that
you created above.  

The hyper-parameter you should vary for PCA is n_components.  When creating the hyper-parameter dictionary for a Pipeline object, the hyper-parameter names are of the form: A__B, where A is the Pipeline element name and B is the hyper-parameter name for that pipeline element.  Example: 'Classifier__max_iter'
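
To make the `A__B` naming concrete, here is a small sketch on synthetic data. The step names `'pca'` and `'clf'`, the data from `make_classification`, the grid values, and the `*_demo` names are assumptions for illustration only; substitute the step names from your own Pipeline. The classifier here keeps scikit-learn's default settings, which differ from the no-regularization requirement of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (assumption)
X_grid_demo, y_grid_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Two-step pipeline; the step names are what appear before '__' in the grid
pipe_demo = Pipeline([
    ("pca", PCA(whiten=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# '<step name>__<parameter name>' addresses a parameter of one pipeline step
param_grid_demo = {"pca__n_components": [2, 5, 10]}

search_demo = GridSearchCV(
    pipe_demo,
    param_grid_demo,
    scoring="f1",
    cv=5,
    return_train_score=True,  # needed for 'mean_train_score' in cv_results_
)
search_demo.fit(X_grid_demo, y_grid_demo)

print("Best parameters:", search_demo.best_params_)
print("Mean validation F1 per setting:", search_demo.cv_results_["mean_test_score"])
```

Note that `cv_results_` uses the word "test" for the held-out cross-validation folds, which play the role of validation sets here.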


In [None]:
# PROVIDED: List of number of PCs to try
components = np.linspace(1,100,num=40, dtype=int)
components

In [None]:
""" TODO
Create the GridSearchCV object (named search) using the PCA 
pipeline model created above, with cv=CV and scoring=opt_metric.
Set return_train_score=True so that the training scores are 
available for the plot below.
"""
# Grid Search Parameters
opt_metric = 'f1'
maximize_opt_metric = False
CV = 5

# GridSearch pipeline hyper-parameters can be specified 
# with '__' separated parameter names



In [None]:
""" TODO
Display the GridSearch results as a pandas dataframe 
"""
pd.DataFrame(search.cv_results_)

In [None]:
""" TODO
Plot the mean f1 score vs the number of PCs on the train and validation sets 
for each model, using the 'mean_train_score' and 'mean_test_score'
keys of the search.cv_results_ dictionary

"""


In [None]:
""" TODO
Train the best estimator with the full training set
"""


## Reflection #4

What is the optimal number of components with respect to the validation set?

__TODO__




What are the F1 scores for both the training and test sets for the best estimator?

__TODO__



# Logistic Regression with Regularization

TODO: 
1. Use GridSearchCV to find the most appropriate regularization parameter value for LogisticRegression (do not use PCA)

2. Refit the model with the best parameter choice on the full training set

3. Evaluate with respect to the training and test sets
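
The three steps above can be sketched roughly as follows, again on synthetic data. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. The grid of `C` values, the data, and the `*_reg` names are assumptions for this example; an appropriate range for the arrhythmia data may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (assumption)
X_reg, y_reg = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)
Xtr_reg, Xte_reg, ytr_reg, yte_reg = train_test_split(
    X_reg, y_reg, stratify=y_reg, test_size=0.5, random_state=42
)

# Step 1: search over C (inverse regularization strength, default L2 penalty)
param_grid_reg = {"C": np.logspace(-3, 3, 7)}
search_reg = GridSearchCV(
    LogisticRegression(max_iter=5000), param_grid_reg, scoring="f1", cv=5
)
search_reg.fit(Xtr_reg, ytr_reg)

# Step 2: with refit=True (the default), best_estimator_ has already been
# refit on the full training set using the best parameters
best_reg = search_reg.best_estimator_

# Step 3: evaluate on the training and test sets
f1_train_reg = f1_score(ytr_reg, best_reg.predict(Xtr_reg))
f1_test_reg = f1_score(yte_reg, best_reg.predict(Xte_reg))

print("Best C:", search_reg.best_params_["C"])
print(f"Train F1: {f1_train_reg:.3f}\tTest F1: {f1_test_reg:.3f}")
```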


## Reflection #5

Compare the test set performance for the best PCA-LR and the LR-with-regularization models.

__TODO__




