[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SmilodonCub/DS4VS/blob/master/Week11/DS4VS_week11_Supervised_Classification.ipynb)

## Week 11 Introduction to Supervised Learning: Categorical Targets

## A Brief Recap:

* Hello, how are you?
* How are you doing with Homework 3? Have you looked at Homework 4?
* A makeup date for SfN week.
* Today: Supervised Learning Methods for Categorical Targets
* Next 2 Weeks: Unsupervised Learning Methods

## Supervised Learning for Categorical Data

#### Fundamental Supervised Learning Classification Algorithms: 

* Logistic (& Multinomial) Regression
* Support Vector Machines (SVM)
* k-Nearest Neighbors (kNN)
* Decision Trees & Random Forest

We will spend the next 2 hours taking a tour of these methods.

## MNIST Dataset

* [MNIST](http://yann.lecun.com/exdb/mnist/) is a classic machine learning dataset
* a collection of labelled handwritten digits created in 1998
* images are grayscale, centered, 28x28 
* is made readily available from `scikit-learn`
    - 60k training
    - 10k test
* subject of [numerous studies](https://paperswithcode.com/sota/image-classification-on-mnist) some with accuracies higher than human performance.
    - most recent record: 0.18% error rate
* has spawned many other datasets
    - Fashion MNIST
    - EMNIST
    - and [many others](https://www.kaggle.com/datasets?search=MNIST&datasetsOnly=true)

Let's take a look!......

### Bringing MNIST into our environment

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
print( type( mnist ) )
mnist.keys()

### Getting to know MNIST

In [None]:
X, y = mnist["data"], mnist["target"]
print( 'X: ', X.shape, '\ny: ', y.shape )
print( 'some ys: ', y.iloc[0:10].values )

In [None]:
# most scikit-learn algos are expecting a numeric classification
y = y.astype( np.uint8 )
print( 'some ys: ', y.iloc[0:10].values )

In [None]:
# Visualize digit images
plt.figure(figsize=(15,12))  # plt.subplots() would also create an extra, empty axes behind the grid
plt.subplots_adjust( hspace=0.8 )
a = 5  # number of rows
b = 5  # number of columns

for character in range(0,a*b):
    some_digit = np.array( X.iloc[character] )
    plt.subplot( a,b,character + 1 )
    some_digit_image = some_digit.reshape( 28, 28 )
    plt.imshow( some_digit_image, cmap = "binary" )
    plt.title('label: %i\n' % y.iloc[character], fontsize = 14)
    plt.axis( 'off' )

In [None]:
# functionalize this plot for future use
def plot_digits( images, index_list, images_mat_dim=10):
    a = images_mat_dim  # number of rows
    b = images_mat_dim  # number of columns

    for idx, im_idx in enumerate( index_list ):
        some_digit = np.array( images.iloc[im_idx] )
        plt.subplot( a,b,idx + 1 )
        some_digit_image = some_digit.reshape( 28, 28 )
        plt.imshow( some_digit_image, cmap = "binary" )
        plt.axis( 'off' )
    plt.show()

In [None]:
mat_size = 5
plot_digits( X, list( range( 0, mat_size*mat_size ) ), mat_size)

### Train/Test splits

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20, random_state=42)

We'll work with the training set to train our model.  
Once we develop our model, we can evaluate with the test dataset.

## Binary Classification

Predicting a classification where only two outcomes are possible. 

### Logistic Regression

Uses a logistic function to model a binary target variable

### Remembering the Logistic Function

In [None]:
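# the 'seaborn-white' style was renamed 'seaborn-v0_8-white' in matplotlib 3.6
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')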
t = np.linspace(- 10, 10, 100)
sig = 1 / (1 + np.exp(-t))
plt.figure(figsize=(9, 3))
plt.plot([-10, 10], [0, 0], "k-")
plt.plot([-10, 10], [0.5, 0.5], "k:")
plt.plot([-10, 10], [1, 1], "k:") 
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \frac{1}{1 + e^{-t}}$")
plt.xlabel("t") 
plt.legend(loc="upper left", fontsize=20)
plt.axis([-10, 10, -0.1, 1.1])
plt.show()

* Logistic Function: $$\sigma(t) = \frac{e^t}{e^t+1} = \frac{1}{1 + e^{-t}}$$

* We assume a Linear relationship for our classification problem:  $$t = \beta_0 + \beta_1x + \cdots$$

* We can rewrite our Logistic Regression Function as: $$\sigma(t) = \frac{e^t}{e^t+1} = \frac{1}{1 + e^{-(\beta_0 + \beta_1x + \cdots)}}$$

* The coefficients $\beta_0, \beta_1, \ldots$ are fit to our data (e.g. with Gradient Descent methods) such that we minimize a cost function (typically the log-loss).

[a casual explanation](https://towardsdatascience.com/whats-linear-about-logistic-regression-7c879eb806ad)
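
To make the fitting step concrete, here is a minimal sketch of logistic regression fit by plain batch gradient descent on a one-feature toy problem. The helper name `fit_logistic_gd`, the learning rate, the iteration count and the synthetic data are all assumptions made for illustration; `scikit-learn` uses more sophisticated solvers than this.

```python
import numpy as np

def sigmoid(t):
    # the logistic function from the plot above
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_gd(x, y, lr=0.1, n_iter=5000):
    """Fit beta_0 and beta_1 by batch gradient descent on the mean log-loss."""
    X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept beta_0
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                   # predicted probabilities
        grad = X.T @ (p - y) / len(y)           # gradient of the mean log-loss
        beta -= lr * grad                       # step downhill
    return beta

# synthetic data (an assumption): class 0 centered at -2, class 1 centered at +2
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

beta = fit_logistic_gd(x, y)
pred = sigmoid(beta[0] + beta[1] * x) >= 0.5    # classify at probability 0.5
print('beta:', beta, 'training accuracy:', (pred == y).mean())
```

The slope $\beta_1$ comes out positive because larger values of `x` make class 1 more likely.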

### Rephrase MNIST for Binary Logistic Regression

Let's build a binary logistic classifier to determine whether a given digit is one particular digit (here, a 5) or not.  
[`scikit-learn` docs](https://scikit-learn.org/stable/modules/linear_model.html)

In [None]:
exp = 5
y_train_exp = (y_train == exp) # True where y==exp ; False everywhere else
y_test_exp = (y_test == exp) # "..."

In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate a Logistic Regression class object
log_reg = LogisticRegression(penalty='l1', solver='saga', tol=0.1) # saga supports the l1 penalty; lbfgs (the default solver) does not

In [None]:
# fit to the data
log_reg.fit(X_train, y_train_exp)

### `LogisticRegression()` predictions

evaluate the model's fit to the training data 

In [None]:
print( log_reg.predict(X_train[0:10]) )
print( y_train_exp.iloc[0:10].values )

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(log_reg, X_train, y_train_exp, cv=5, scoring="accuracy")

### `cross_val_score`

`cross_val_score` - evaluate a score by cross-validation

* **scoring** - [model selection and evaluation tools](https://scikit-learn.org/stable/modules/model_evaluation.html)
* **cross-validation** - a resampling method: a portion of the training data (a 'fold') is held out from model training and used to estimate model accuracy during model development. This is repeated so that each fold is held out once.
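
A small sketch of what the folds look like, using `KFold` on ten toy samples (the toy array and variable names are assumptions for illustration). Note that when `cross_val_score` is given an integer `cv` and a classifier, `scikit-learn` uses the stratified variant `StratifiedKFold`, which additionally keeps the class proportions similar in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)     # ten toy observations
kf = KFold(n_splits=5, shuffle=False)

held_out = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    # the model would be fit on train_idx and scored on val_idx
    print(f'fold {fold}: train on {train_idx}, validate on {val_idx}')
    held_out.extend(val_idx.tolist())
```

Every observation lands in exactly one validation fold, so each one is scored by a model that never saw it during training.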

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="60%" style="margin-left:auto; margin-right:auto">

### Evaluating a Classifier

Questions: What would chance performance be here?

**Dummy Classifier** - a type of classifier that does not generate any insight about the data and classifies by a simple rule without trying to find a pattern in the data. For example, classify everything as the most common class.

What **accuracy** do you expect from a classifier that classifies everything as `False`?
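
One way to check your answer is `DummyClassifier` from `sklearn.dummy`. The sketch below uses a perfectly balanced set of made-up digit labels, which is an assumption: the real MNIST classes are only roughly balanced, so the number will differ slightly on `y_train_exp`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

digits = np.repeat(np.arange(10), 100)    # 1000 made-up labels, 100 of each digit
is_five = (digits == 5)                   # binary target, like y_train_exp
X_dummy = np.zeros((len(digits), 1))      # the features are ignored by this strategy

dummy = DummyClassifier(strategy='most_frequent')   # always predicts the majority class
dummy.fit(X_dummy, is_five)
dummy_acc = dummy.score(X_dummy, is_five)
print('dummy accuracy:', dummy_acc)
```

Because 9 out of 10 labels are `False`, always answering `False` is right 90% of the time here. This is why accuracy alone is a weak measure for imbalanced targets.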

### Confusion Matrix

`cross_val_predict` - generate a set of cross-validated predictions. For each 'fold', the model is fit on the remainder of the data and then predicts the held-out fold, so every prediction comes from a model that never saw that observation.  

**confusion matrix**  

|                  |   **Predicted True**  |    **Predicted False**   |
|:----------------:|:---------------------:|:------------------------:|
|  **Actual True** | true positives (TP)  | false negatives (FN)  |
| **Actual False** |    false positives (FP)    |      true negatives (TN)     |

What would the confusion matrix for a perfect classifier look like?
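
A note on layout: `sklearn.metrics.confusion_matrix` orders rows and columns by sorted label, so for a boolean target `False` comes first and the result reads `[[TN, FP], [FN, TP]]`, the mirror image of the table above. The toy labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([True, True, True, False, False, False, False, False])
y_pred = np.array([True, True, False, True, False, False, False, False])

cm = confusion_matrix(y_true, y_pred)   # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()             # scikit-learn order: False first
print(cm)

# flip both axes to match the table above: [[TP, FN], [FP, TN]]
cm_table_order = cm[::-1, ::-1]
print(cm_table_order)
```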

#### Confusion matrix for our model...

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(log_reg, X_train, y_train_exp, cv=5)
confusion_matrix(y_train_exp, y_train_pred)

### Precision & Recall

* **precision** - what proportion of predicted positives are actual positives? $\frac{TP}{TP+FP}$

* **recall** (sensitivity) - what proportion of actual positives get correctly labeled as such? $\frac{TP}{TP+FN}$

* **$F_1 \mbox{score}$** - we can combine precision and recall into a single metric to evaluate our model. It is a harmonic mean: it will only be high if both precision and recall are high. $2 \cdot \frac{\mbox{precision}*\mbox{recall}}{\mbox{precision}+\mbox{recall}}$
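
A quick worked example with made-up confusion matrix counts (the numbers are assumptions chosen for easy arithmetic). Precision divides the true positives by everything predicted positive, recall divides them by everything actually positive, and $F_1$ is the harmonic mean of the two.

```python
# made-up counts for illustration
tp, fp, fn, tn = 80, 20, 40, 860

precision = tp / (tp + fp)      # of everything predicted positive, how much was right
recall = tp / (tp + fn)         # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print('precision:', precision)
print('recall:', recall)
print('f1:', f1)
print('accuracy:', accuracy)
```

Accuracy looks excellent here because true negatives dominate, while recall shows that a third of the actual positives were missed.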

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

print( 'precision: ', precision_score(y_train_exp, y_train_pred) )
print( 'recall: ', recall_score(y_train_exp, y_train_pred) )
print( 'f1_score: ', f1_score( y_train_exp, y_train_pred))

### Where is the Model Getting Confused?

Let's visualize some of the mislabeled digits

In [None]:
cl_T, cl_F = True, False
X_TT = X_train[(y_train_exp == cl_T) & (y_train_pred == cl_T)]
X_TF = X_train[(y_train_exp == cl_T) & (y_train_pred == cl_F)]
X_FT = X_train[(y_train_exp == cl_F) & (y_train_pred == cl_T)]
X_FF = X_train[(y_train_exp == cl_F) & (y_train_pred == cl_F)]
mat_size = 5

plot_digits( X_TT, list( range( 0, mat_size*mat_size ) ), mat_size)  #True Positives
#plot_digits( X_TF, list( range( 0, mat_size*mat_size ) ), mat_size)  #False Negative 
#plot_digits( X_FT, list( range( 0, mat_size*mat_size ) ), mat_size)  #False Positive
#plot_digits( X_FF, list( range( 0, mat_size*mat_size ) ), mat_size)  #True Negatives

We could visually evaluate model performance by plotting $$\mbox{precision} \sim \mbox{recall}$$  

However, it is more common to evaluate $$\mbox{true positive rate} \sim \mbox{false positive rate}, \quad \mbox{i.e.} \quad \mbox{recall} \sim 1-\mbox{specificity}$$
...in other words, the **ROC curve**
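
To see where the curve comes from, here is a minimal from-scratch sketch that sweeps a decision threshold over the scores and records the true positive rate $\frac{TP}{TP+FN}$ and the false positive rate $\frac{FP}{FP+TN}$ at each step. The helper `roc_points` and the four toy scores are assumptions for illustration; in practice use `roc_curve` as in the next cell.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) found by sweeping the threshold from high to low."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # start above every score (nothing predicted positive), then lower the bar
    thresholds = np.concatenate([[np.inf], np.sort(np.unique(scores))[::-1]])
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    fpr, tpr = [], []
    for thr in thresholds:
        pred = scores >= thr
        tpr.append((pred & y_true).sum() / n_pos)    # TP / (TP + FN)
        fpr.append((pred & ~y_true).sum() / n_neg)   # FP / (FP + TN)
    return np.array(fpr), np.array(tpr)

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(toy_labels, toy_scores)

# area under the curve by the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print('fpr:', fpr)
print('tpr:', tpr)
print('AUC:', auc)
```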

In [None]:
from sklearn.metrics import roc_curve


y_scores = cross_val_predict(log_reg, X_train, y_train_exp, cv=5,
                             method="decision_function")
log_reg_fpr, log_reg_tpr, thresholds = roc_curve( y_train_exp, y_scores )

In [None]:
# visualize the ROC
plt.figure(figsize=(8, 6))
plt.plot( log_reg_fpr, log_reg_tpr, linewidth = 2 )
plt.plot( [0,1], [0,1], 'k--')
plt.axis([0, 1, 0, 1])                                  
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate')   

plt.show()

### Area Under the Curve

**Comparing Models** - the Area Under the Curve (AUC) is often used as a measure to compare classifier performance. A perfect classifier will have an AUC = 1.  

What would be the AUC of a model performing at chance?

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_exp, y_scores)

### Support Vector Machines 

* a simple algorithm that finds a **decision boundary** to separate target classes
* decision boundary: classification will be assigned depending on what side of the boundary an observation appears
* goal: maximize the margin, i.e. the distance between the decision boundary and the nearest instances of each class
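
A tiny 2D sketch of these ideas with a linear `SVC`. The six toy points and the large `C` (chosen to approximate a hard margin) are assumptions for illustration. For a linear kernel the margin width is $2 / \lVert w \rVert$, and the sign of `decision_function` tells us which side of the boundary a point falls on.

```python
import numpy as np
from sklearn.svm import SVC

# two well-separated toy clusters
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_svm = SVC(kernel='linear', C=1e3)    # large C: (almost) no margin violations allowed
toy_svm.fit(X_toy, y_toy)

w = toy_svm.coef_[0]                     # normal vector of the boundary (linear kernel only)
margin_width = 2 / np.linalg.norm(w)

new_points = np.array([[1.5, 1.5], [4.5, 4.5]])
side = toy_svm.decision_function(new_points)   # negative: class 0 side, positive: class 1 side
labels_new = toy_svm.predict(new_points)

print('support vectors:\n', toy_svm.support_vectors_)
print('margin width:', margin_width)
print('decision function:', side, 'predicted:', labels_new)
```

Only the support vectors (the points nearest the boundary) determine where the boundary lies; moving the other points around would not change the fit.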


### The SVM decision boundary

<img src="https://miro.medium.com/max/809/1*GPFxwsE4cqcPxul4GWGp1A.png" width="80%" style="margin-left:auto; margin-right:auto">

Let's see SVM in action with [this demo](https://jgreitemann.github.io/svm-demo)

### Fitting an SVM model to MNIST

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform( X_train )
X_test_scaled = scaler.transform( X_test )
#svm_clf = SVC( kernel='linear', gamma='auto', probability=False )
#svm_clf

### SVM does not scale well to large, high-dimensional datasets.

Training time grows faster than linearly with the number of samples, so fitting on the full MNIST training set will take a very long time.  
If you would like to try, uncomment the code below and train your own SVM.  
There is an interesting new python library that will let you play Galaga while your model trains: [`TrainInvaders`](https://github.com/aporia-ai/TrainInvaders)

In [None]:
# TrainInvaders
# install the library
#!pip3 install train_invaders --upgrade
# import TrainInvaders
import train_invaders.start
# it will let you play until your model is done training.
# you can turn it off:
#import train_invaders.stop

In [None]:
# if you would like to train your own SVM

#svm_clf.fit( X_train_scaled, y_train_exp )
# but this will take a long time

In [None]:
# load a pre-trained model

import pickle
filename = 'svm_clf'
#pickle.dump(svm_clf, open(filename, 'wb'))  # to 'pickle' a Python object

# load the model from disk
svm_clf = pickle.load(open(filename, 'rb'))
svm_clf

### Evaluating Model Fit to the Training Data

In [None]:
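# the SVM is fit on scaled features (see the commented-out fit call), so predict on the scaled data
y_train_pred_svm = svm_clf.predict(X_train_scaled)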
confusion_matrix(y_train_exp, y_train_pred_svm)

In [None]:
print( 'precision: ', precision_score(y_train_exp, y_train_pred_svm) )
print( 'recall: ', recall_score(y_train_exp, y_train_pred_svm) )
print( 'f1_score: ', f1_score( y_train_exp, y_train_pred_svm))

### Comparing Classification Models

In [None]:
svm_y_scores = svm_clf.decision_function( X_train_scaled )  # scaled, to match how the SVM is fit
svm_clf_fpr, svm_clf_tpr, svm_thresholds = roc_curve( y_train_exp, svm_y_scores )

In [None]:
# visualize the ROCs for both approaches
plt.figure(figsize=(8, 6))
plt.plot( log_reg_fpr, log_reg_tpr, linewidth = 2, color = 'red' )
plt.plot( svm_clf_fpr, svm_clf_tpr, linewidth = 2, color = 'green' )
plt.plot( [0,1], [0,1], 'k--')
plt.axis([0, 1, 0, 1])                                  
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate')   

plt.show()

### Compare AUCs for both approaches

In [None]:
print( 'logistic AUC: ', roc_auc_score(y_train_exp, y_scores) )
print( 'SVM AUC: ', roc_auc_score(y_train_exp, svm_y_scores) )

### SVM has hyperparameters

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRYG48xX8xCzQZ-EdDa_gfFjGGVY3onfwIB_A&usqp=CAU" width="40%" style="margin-left:auto; margin-right:auto">

### Training SVM

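* **C** - the regularization parameter: smaller values give a wider, more tolerant margin (stronger regularization)
* **gamma** - the kernel coefficient for the `rbf`, `poly` and `sigmoid` kernels; it has no effect on a linear kernel

There are also different kernels to choose from (e.g. `linear`, `poly`, `rbf`, `sigmoid`)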

In [None]:
#from sklearn.pipeline import Pipeline
#from sklearn.model_selection import GridSearchCV

#steps = [('scaler', StandardScaler()), ('SVM', SVC(kernel='rbf'))]  # gamma is ignored by kernel='linear', so search it with a non-linear kernel
#pipeline = Pipeline(steps)
#parameters = {'SVM__C':[0.001, 0.1, 100, 10e5], 'SVM__gamma':[10,1,0.1,0.01]}

#grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
#grid.fit(X_train, y_train)


<img src="https://static.tvtropes.org/pmwiki/pub/images/intermission_3696.jpg" width="70%" style="margin-left:auto; margin-right:auto">

## Multinomial Classification

Distinguishing between more than two classes

* In their basic form, Logistic Regression and SVM are strictly binary classifiers.
* There are methods to adapt Logistic Regression and SVM for more than two classes:
    - **OvR** - one-vs-the-rest
    - **OvO** - one-vs-one
* There are other classification algorithms that can directly handle more than 2 classes:  
    - kNN, naive Bayes, Decision Trees, etc
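
To make the two strategies concrete, the sketch below wraps a binary linear SVM in `OneVsRestClassifier` and `OneVsOneClassifier` and counts how many binary models each one trains. It uses the small 8x8 `load_digits` dataset bundled with `scikit-learn` rather than MNIST so that it runs in seconds; that substitution and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X_small, y_small = load_digits(return_X_y=True)   # 10 digit classes, 8x8 images
X_small = X_small / 16.0                          # pixel values range from 0 to 16

# OvR: one binary model per class (this digit vs. everything else)
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X_small, y_small)
# OvO: one binary model per pair of classes
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X_small, y_small)

n_ovr, n_ovo = len(ovr.estimators_), len(ovo.estimators_)
print('OvR binary models:', n_ovr)
print('OvO binary models:', n_ovo)
```

With $N$ classes, OvR trains $N$ models and OvO trains $N(N-1)/2$, each on a smaller subset of the data. `SVC` applies the one-vs-one approach internally when it is given more than two classes.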

### Multinomial Logistic & SVM Classification

We are going to quickly train multinomial versions of the Logistic & SVM models that we built for the binary case above.  
The Python code looks remarkably similar, because `scikit-learn` detects the number of classes to adjust the algorithm. However, conceptually we are building very different classification models.

In [None]:
# Multinomial Logistic Regression
mlr_mod = LogisticRegression(penalty='l1', solver='saga', tol=0.1)
mlr_mod.fit(X_train, y_train) # not y_train_exp

In [None]:
mlr_y_train_pred = cross_val_predict(mlr_mod, X_train, y_train, cv=5) #not y_train_exp
mlr_conf_mat = confusion_matrix(y_train, mlr_y_train_pred)
print( mlr_conf_mat )

In [None]:
row_sums = mlr_conf_mat.sum(axis=1, keepdims=True)
norm_conf_mx = mlr_conf_mat / row_sums
np.fill_diagonal(norm_conf_mx, 0)
plt.figure(figsize=(8, 6))
plt.matshow(norm_conf_mx, cmap=plt.cm.gray, fignum=1)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print( classification_report( y_train, mlr_y_train_pred ) )

#### SVM multinomial classification

In [None]:
# Multinomial SVM
#multisvm_clf = SVC( kernel='linear', gamma='auto', probability=False )
#multisvm_clf.fit( X_train_scaled, y_train ) #not y_train_exp
# you can do this if you really want to ....or if you just want to play more TrainInvaders

In [None]:
filename = 'multisvm_clf'
#pickle.dump(multisvm_clf, open(filename, 'wb'))  # to 'pickle' a Python object

# load the model from disk
multisvm_clf = pickle.load(open(filename, 'rb'))
multisvm_clf

In [None]:
msvm_y_train_pred = multisvm_clf.predict(X_train_scaled) # scaled features, to match how the SVM is fit
msvm_conf_mat = confusion_matrix(y_train, msvm_y_train_pred)
print( msvm_conf_mat )

In [None]:
row_sums = msvm_conf_mat.sum(axis=1, keepdims=True)
norm_conf_mx = msvm_conf_mat / row_sums
np.fill_diagonal(norm_conf_mx, 0)
plt.figure(figsize=(8, 6))
plt.matshow(norm_conf_mx, cmap=plt.cm.gray, fignum=1)
plt.show()

In [None]:
print( classification_report( y_train, msvm_y_train_pred ) )

## k-Nearest Neighbors (kNN)

classify an observation based on the known labels of the nearest neighbors 

<img src="https://vitalflux.com/wp-content/uploads/2020/09/Screenshot-2020-09-22-at-2.34.57-PM.png" width="60%" style="margin-left:auto; margin-right:auto">


### kNN pseudocoded

kNN is a very simple algorithm!

* for each new observation to classify:  
    - calculate the distance to every training data point
        * what is 'distance': Euclidean, cosine, Chebyshev, Manhattan
    - sort by increasing distance
    - use the 'k' closest points to determine the label
        * winner-takes-all, or a weighted distance etc.

There are many resources that walk through building up [kNN from scratch](https://www.youtube.com/watch?v=n3RqsMz3-0A)
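
The pseudocode translates almost line for line into NumPy. This is a minimal sketch assuming Euclidean distance and a winner-takes-all vote; the function name `knn_predict` and the toy points are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_known, y_known, x_new, k=3):
    """Label x_new by majority vote among its k nearest labeled points."""
    distances = np.linalg.norm(X_known - x_new, axis=1)   # Euclidean distance to every known point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_known[nearest].tolist())            # winner-takes-all vote
    return votes.most_common(1)[0][0]

X_known = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_known = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

label_1 = knn_predict(X_known, y_known, np.array([0.5, 0.5]))
label_2 = knn_predict(X_known, y_known, np.array([5.5, 5.5]))
print(label_1, label_2)
```

Notice that there is no training step: kNN simply stores the labeled data and does all of its work at prediction time.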

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit( X_train, y_train )

In [None]:
# return the number of neighbors used by our classifier
knn_clf.n_neighbors

### Tuning 'k'

How do we determine the best value of 'k'?

In [None]:
from sklearn.model_selection import GridSearchCV
grid_params = {'n_neighbors': list( range( 1, 25, 2 ) ) }
grid_knn = GridSearchCV( knn_clf, grid_params, cv = 3 )
knn_grid_res = grid_knn.fit( X_train, y_train )

In [None]:
#filename = 'knn_grid'
#pickle.dump(knn_grid_res, open(filename, 'wb'))  # to 'pickle' a Python object

# load the model from disk
#knn_grid = pickle.load(open(filename, 'rb'))
#knn_grid

In [None]:
knn_grid_res.best_estimator_.get_params()

In [None]:
best_mnist_knn = knn_grid_res.best_estimator_

#knn_y_train_pred = best_mnist_knn.predict(X_train) #not y_train_exp
knn_y_train_pred = cross_val_predict(best_mnist_knn, X_train, y_train, cv=5) #not y_train_exp
knn_conf_mat = confusion_matrix(y_train, knn_y_train_pred)
print( knn_conf_mat )

In [None]:
row_sums = knn_conf_mat.sum(axis=1, keepdims=True)
norm_conf_mx = knn_conf_mat / row_sums
np.fill_diagonal(norm_conf_mx, 0)
plt.figure(figsize=(8, 6))
plt.matshow(norm_conf_mx, cmap=plt.cm.gray, fignum=1)
plt.show()

In [None]:
print( classification_report( y_train, knn_y_train_pred ) )

## Decision Trees

Similar to kNN, Decision Trees are another non-parametric algorithm.  
However, the classification approach is very different:  

* Decision trees split the data on a measure of purity
* `sklearn.tree.DecisionTreeClassifier()` uses a common approach: CART
* **CART** - Classification and Regression Tree
    - uses the **gini** index as a purity measure
    - a greedy algorithm - makes the locally optimal choice at each level
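
A minimal sketch of the purity idea on a single feature: compute the Gini impurity $1 - \sum_k p_k^2$ of a set of labels, then greedily pick the threshold that gives the lowest weighted impurity after the split. The helpers `gini` and `best_split` and the toy data are assumptions for illustration; a real CART implementation repeats this search over every feature at every node.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for a more mixed node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Try every midpoint between sorted unique values; keep the purest split."""
    best_thr, best_score = None, np.inf
    values = np.unique(feature)
    for lo, hi in zip(values[:-1], values[1:]):
        thr = (lo + hi) / 2
        left, right = labels[feature <= thr], labels[feature > thr]
        # impurity of the two children, weighted by their sizes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])

parent_impurity = gini(labels)
thr, child_impurity = best_split(feature, labels)
print('parent impurity:', parent_impurity)
print('best threshold:', thr, 'weighted child impurity:', child_impurity)
```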

### An MNIST illustration

<img src="https://www.researchgate.net/profile/Soham-Saha-2/publication/329318295/figure/fig1/AS:792545151946752@1565968897034/The-hierarchy-tree-for-MNIST-learned-by-our-proposed-Class2Str-network-and-Latent.png" width="70%" style="margin-left:auto; margin-right:auto">

In [None]:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier()

In [None]:
tree_clf.fit( X_train, y_train )

### Visualizing the Decision Tree

In [None]:
plt.figure(figsize=(12,12))
tree.plot_tree(tree_clf, max_depth=4, fontsize = 12)
plt.show()

### Evaluate the Decision Tree

In [None]:
tree_y_train_pred = cross_val_predict(tree_clf, X_train, y_train, cv=5) #not y_train_exp
tree_conf_mat = confusion_matrix(y_train, tree_y_train_pred)
print( tree_conf_mat )

In [None]:
row_sums = tree_conf_mat.sum(axis=1, keepdims=True)
norm_conf_mx = tree_conf_mat / row_sums
np.fill_diagonal(norm_conf_mx, 0)
plt.figure(figsize=(8, 6))
plt.matshow(norm_conf_mx, cmap=plt.cm.gray, fignum=1)
plt.show()

In [None]:
print( classification_report( y_train, tree_y_train_pred ) )

## Which Classification Method to use?

Which model are we most satisfied with and why?

## Python & Pandas 4 Penguins

Let's use an example dataset to learn about multinomial logistic regression.  
We will be using the [Palmer Penguins](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081) [dataset](https://github.com/allisonhorst/palmerpenguins). It's a newer alternative to the [classic Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [None]:
penguins = pd.read_csv("https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/penguins.csv")

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

## Take your time to dive into each classification method with the relatively simple Palmer Penguins dataset
<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">