# Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow

## Chapter 3: Classification

---
### Imports

In [None]:
from pathlib import Path
from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import (confusion_matrix, f1_score, precision_recall_curve,
    precision_score, recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from tensorflow_2.exceptions import InputError

### Variables

In [None]:
DATA_DIR = Path('../../data/ch3_classification')

---
## Load Data

In [None]:
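# as_frame=False returns NumPy arrays, which the slicing and reshape calls below rely on
mnist = fetch_openml('mnist_784', version=1, as_frame=False, data_home=DATA_DIR.parent)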
print(mnist['DESCR'])

In [None]:
mnist['target'] = mnist['target'].astype(np.uint8)
x_train, y_train = [mnist[k][:60000] for k in ('data', 'target')]
x_test, y_test = [mnist[k][60000:] for k in ('data', 'target')]

In [None]:
def plot_example(dset: str='train', idx: int=0, score: Optional[bool]=None):
    """
    Plot example from dataset.
    
    :param dset: choose either `train` or `test`
    :param idx: index of example
    :param score: model predicted score
    """
    if dset not in ('train', 'test'):
        raise InputError(
            f'dset={dset}',
            f'Valid inputs for dset are "train" or "test"')
    x = x_train if dset == 'train' else x_test
    y = y_train if dset == 'train' else y_test
    score = '' if score is None else f'   Predict: {score}'
    plt.imshow(x[idx].reshape(28, 28), cmap='binary')
    plt.title(f'Label: {y[idx]}{score}')
    plt.axis('off')
    plt.show()
    
    
plot_example('train', 10)

---
## Train Binary Classifier

Stochastic Gradient Descent (SGD) classifier
- capable of handling very large datasets efficiently
- evaluates training instances independently
    - suited for online learning
- relies on randomness during training
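
The online-learning point above can be sketched with `partial_fit()`, which updates the model one mini-batch at a time. This is a minimal illustration on synthetic stand-in data (the arrays `x_stream` and `y_stream` and the batch size of 100 are assumptions, not part of the MNIST workflow in this notebook).

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in data: 2 features, label is 1 when their sum is positive
rng = np.random.default_rng(42)
x_stream = rng.normal(size=(1000, 2))
y_stream = (x_stream[:, 0] + x_stream[:, 1] > 0).astype(int)

online_clf = SGDClassifier(random_state=42)

# Feed the data in mini-batches, as if it arrived over time.
# partial_fit needs the full list of classes because a single
# batch may not contain every class.
for start in range(0, len(x_stream), 100):
    batch = slice(start, start + 100)
    online_clf.partial_fit(x_stream[batch], y_stream[batch], classes=[0, 1])

online_accuracy = online_clf.score(x_stream, y_stream)
print(f'Accuracy after streaming all batches: {online_accuracy:.3f}')
```

Each call to `partial_fit()` continues from the current weights, so the whole dataset never has to be in memory at once.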

In [None]:
binary_value = 5
y_train_binary = y_train == binary_value
y_test_binary = y_test == binary_value

sgd_classifier = SGDClassifier(random_state=42)
sgd_classifier.fit(x_train, y_train_binary)

In [None]:
for n in range(10):
    score = sgd_classifier.predict([x_train[n]])[0]
    plot_example(dset='train', idx=n, score=score)

---
## Cross-Validation of Binary Classifier

Algorithm
1. Randomly split the training set into k distinct subsets called ***folds***.
1. Train the model on k-1 folds.
1. Evaluate the model on the one fold that was not included in training.
1. Repeat until all folds have been used as an evaluation set.
1. Average the results of all the trained folds.

### Example implementation of Cross-Validation
```python
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# shuffle=True is required when random_state is set
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skfolds.split(x_train, y_train_binary):
    clone_model = clone(sgd_classifier)  # fresh, unfitted copy for each fold
    x_train_folds = x_train[train_idx]
    y_train_folds = y_train_binary[train_idx]
    x_test_fold = x_train[test_idx]
    y_test_fold = y_train_binary[test_idx]
    clone_model.fit(x_train_folds, y_train_folds)
    predict = clone_model.predict(x_test_fold)
    n_correct = sum(predict == y_test_fold)
    scores.append(n_correct / len(predict))  # accuracy on the held-out fold
```

### Evaluate Cross-Validation Accuracy of Binary Classifier

In [None]:
cross_val_score(sgd_classifier, x_train, y_train_binary, cv=3, scoring='accuracy')

If the classifier predicted that a 5 never appeared in this dataset, it would still have an accuracy of about 90%, because only about 10% of the images are 5s.

<font color='red'>
    Accuracy is generally not the preferred performance measure for classifiers, especially when dealing with *skewed* datasets.
</font>
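
One way to see this effect is a baseline that always predicts the majority class. The sketch below uses Scikit-Learn's `DummyClassifier` on hypothetical skewed labels (exactly 10% positives); the arrays `x_demo` and `y_demo` are made up for illustration and are not the MNIST data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical skewed dataset: 100 positives out of 1000 instances
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1000, 5))
y_demo = np.arange(1000) % 10 == 0

# A "model" that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, x_demo, y_demo, cv=3, scoring='accuracy')
print(baseline_scores)
```

Every fold scores close to 0.9 even though the baseline never identifies a single positive instance, which is why accuracy alone says little on skewed data.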

### Confusion Matrix of Binary Classifier

In [None]:
y_train_pred = cross_val_predict(sgd_classifier, x_train, y_train_binary, cv=3)
c_matrix = confusion_matrix(y_train_binary, y_train_pred)
pd.DataFrame(c_matrix,
             columns=['Predicted False', 'Predicted True'],
             index=['Actual False', 'Actual True'])

<font color='red'>
    <b>
        Increasing precision reduces recall, and vice versa (Precision/Recall trade-off)
    </b>
</font>

<br>
<br>

<font color='green'>
    For the binary case: tn, fp, fn, tp = confusion_matrix().ravel()
</font>

#### Precision
$$precision = \frac{TP}{TP + FP}$$

#### Recall
$$recall = \frac{TP}{TP + FN}$$

In [None]:
tn, fp, fn, tp = c_matrix.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f'Precision: {precision}')
print(f'Recall: {recall}')

<font color='red'>
    Use Scikit-Learn functions for Precision and Recall
</font>

In [None]:
precision, recall = [f(y_train_binary, y_train_pred)
                     for f in (precision_score, recall_score)]

In [None]:
precision

In [None]:
recall

#### F1 Score

- Combination of precision and recall into a single metric.
- The harmonic mean of precision and recall.
- Metric gives much more weight to low values.
- A high F1 score requires *both* precision and recall to be high.

$$F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}$$

$$F_1 = 2 \left( \frac{precision \cdot recall}{precision + recall} \right)$$

$$F_1 = \frac{TP}{TP + \frac{FN + FP}{2}}$$
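
As a quick check that the three forms agree, here is a worked example with hypothetical counts (`tp_demo`, `fp_demo`, and `fn_demo` are made-up numbers, not results from the classifier above).

```python
import math

# Hypothetical confusion-matrix counts
tp_demo, fp_demo, fn_demo = 30, 10, 20

precision_demo = tp_demo / (tp_demo + fp_demo)  # 30 / 40 = 0.75
recall_demo = tp_demo / (tp_demo + fn_demo)     # 30 / 50 = 0.60

# The three equivalent forms of the F1 score
f1_harmonic = 2 / (1 / precision_demo + 1 / recall_demo)
f1_product = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
f1_counts = tp_demo / (tp_demo + (fn_demo + fp_demo) / 2)

# The arithmetic mean, for comparison with the harmonic mean
arithmetic_mean = (precision_demo + recall_demo) / 2

print(f1_harmonic, f1_product, f1_counts, arithmetic_mean)
```

All three forms give 2/3, slightly below the arithmetic mean of 0.675, because the harmonic mean is pulled toward the lower of the two values.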

In [None]:
f1_score(y_train_binary, y_train_pred)

### Precision vs. Recall

- Increasing the threshold decreases recall and generally improves precision (although precision can occasionally drop)
- Lowering the threshold increases recall and reduces precision

<font color='red'>
    For SGDClassifier, predict() uses a default threshold of zero on the decision_function() score.
</font>
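
The default threshold can be reproduced by hand. This is a minimal sketch on a synthetic dataset from `make_classification` (an assumption for illustration; the names `demo_clf`, `demo_scores`, and the 90th-percentile cut-off are hypothetical choices).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary problem with labels 0 and 1
x_demo, y_demo = make_classification(n_samples=500, random_state=42)
demo_clf = SGDClassifier(random_state=42).fit(x_demo, y_demo)

demo_scores = demo_clf.decision_function(x_demo)

# predict() labels an instance positive when its score is above zero
default_pred = demo_clf.predict(x_demo)
manual_pred = (demo_scores > 0).astype(int)
same_as_default = np.array_equal(default_pred, manual_pred)

# Raising the threshold keeps only the highest-scoring instances as positives
strict_pred = demo_scores > np.percentile(demo_scores, 90)
print(same_as_default, default_pred.sum(), strict_pred.sum())
```

Choosing a custom threshold therefore means working with `decision_function()` scores directly rather than calling `predict()`.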

In [None]:
decision_function_scores = cross_val_predict(
    sgd_classifier, x_train, y_train_binary, cv=3, method='decision_function'
)

precisions, recalls, thresholds = precision_recall_curve(
    y_train_binary, decision_function_scores
)

df = pd.DataFrame(
    np.c_[precisions[:-1], recalls[:-1]],
    index=thresholds,
    columns=['Precision', 'Recall'],
).rename_axis('Threshold')

fig = px.line(df, title='Precision & Recall vs Threshold')
fig.show()

fig = px.line(df, x='Recall', y='Precision', title='Precision vs Recall')
fig.show()

#### Find Threshold to Achieve 90% Precision

In [None]:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.9)]
y_train_pred_90 = decision_function_scores >= threshold_90_precision

precision, recall = [f(y_train_binary, y_train_pred_90)
                     for f in (precision_score, recall_score)]

In [None]:
precision

In [None]:
recall

### Random Forest Classifier

- `RandomForestClassifier` does not have a `decision_function()` method
- use the `predict_proba()` method
    - returns an array containing a row per instance and a column per class, each containing the probability that the given instance belongs to the given class.
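
To make the shape of that output concrete, here is a small sketch on synthetic data (the dataset from `make_classification` and the names `demo_rf` and `demo_probas` are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

x_demo, y_demo = make_classification(n_samples=200, random_state=42)
demo_rf = RandomForestClassifier(n_estimators=20, random_state=42).fit(x_demo, y_demo)

# One row per instance, one column per class (ordered as in demo_rf.classes_)
demo_probas = demo_rf.predict_proba(x_demo[:5])
print(demo_probas.shape)
print(demo_probas.sum(axis=1))  # each row sums to 1

# The positive-class column can serve as a score for threshold-based metrics
positive_scores = demo_probas[:, 1]
```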

In [None]:
rf_classifier = RandomForestClassifier(random_state=42)
y_probas_rf = cross_val_predict(rf_classifier, x_train, y_train_binary,
                                cv=3, method="predict_proba")
y_scores_rf = y_probas_rf[:, 1]

#### Find the Threshold to Achieve 90% Precision

In [None]:
precisions_rf, recalls_rf, thresholds_rf = precision_recall_curve(
    y_train_binary, y_scores_rf)
threshold_90_precision_rf = thresholds_rf[np.argmax(precisions_rf >= 0.9)]
y_train_pred_90_rf = y_scores_rf >= threshold_90_precision_rf

precision_rf, recall_rf = [f(y_train_binary, y_train_pred_90_rf)
                           for f in (precision_score, recall_score)]

In [None]:
precision_rf

In [None]:
recall_rf

### Receiver Operating Characteristic (ROC)

- similar to the precision vs recall curve
- true positive rate (TPR) vs false positive rate (FPR)
    - TPR is another name for ***Recall***
    - FPR is the ratio of negative instances that are incorrectly classified as positive.
    - FPR is equal to 1 - true negative rate (TNR)
- Sensitivity vs 1 - Specificity

$$TPR = Recall = \frac{TP}{TP + FN}$$
$$FPR = 1 - TNR = 1 - Specificity = 1 - \frac{TN}{TN + FP}$$
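
A worked example of these two rates, using hypothetical confusion-matrix counts (the four `*_demo` numbers are made up for illustration):

```python
import math

# Hypothetical counts: 950 actual negatives, 50 actual positives
tn_demo, fp_demo, fn_demo, tp_demo = 900, 50, 10, 40

tpr_demo = tp_demo / (tp_demo + fn_demo)  # recall / sensitivity: 40 / 50
tnr_demo = tn_demo / (tn_demo + fp_demo)  # specificity: 900 / 950
fpr_demo = fp_demo / (fp_demo + tn_demo)  # share of negatives flagged as positive

print(f'TPR: {tpr_demo:.3f}  FPR: {fpr_demo:.3f}  1 - TNR: {1 - tnr_demo:.3f}')
```

Computing the FPR directly from the false positives gives the same value as 1 - TNR, which is the identity stated in the formula above.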


In [None]:
fpr, tpr, thresholds = roc_curve(y_train_binary, decision_function_scores)
auc = roc_auc_score(y_train_binary, decision_function_scores)
# FPR at the first ROC point whose TPR reaches the recall found at 90% precision.
# An exact equality test can come back empty because roc_curve drops intermediate points.
threshold = fpr[[np.argmax(tpr >= recall)]]

fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_train_binary, y_scores_rf)
auc_rf = roc_auc_score(y_train_binary, y_scores_rf)
threshold_rf = fpr_rf[[np.argmax(tpr_rf >= recall_rf)]]

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name='SGD', hovertext=f'AUC: {auc}'))
fig.add_trace(go.Scatter(x=threshold, y=[recall], mode='markers', name='Threshold',
    marker=dict(size=12,), showlegend=False))

fig.add_trace(go.Scatter(x=fpr_rf, y=tpr_rf, mode='lines', name='RF', hovertext=f'AUC: {auc_rf}'))
fig.add_trace(go.Scatter(x=threshold_rf, y=[recall_rf], mode='markers', name='Threshold',
    marker=dict(size=12,), showlegend=False))

fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Random',
    line=dict(color='black', dash='dash'), showlegend=False))

fig.update_layout(
    title='Receiver Operating Characteristic (ROC)',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
)
fig.show()

### Binary Classification Process Summary

1. Create an instance of a model
1. Create binary labels for the training data
1. Fit the model to the training data
1. Run Cross-Validation on the training data (`cross_val_score`)
1. Plot Confusion Matrix
1. Calculate Precision (`precision_score`)
1. Calculate Recall (`recall_score`)
1. Calculate F1 (`f1_score`)
1. Plot Precision vs Recall
1. Choose a Threshold
1. Plot ROC Curve with threshold

---
## Multiclass Classification

- also called **Multinomial Classifiers**

| Model                       | Binary Classification | Multiclass Classification |
|:--------------------------- |:---------------------:|:-------------------------:|
| k Nearest Neighbors         | X                     | X                         |
| naive Bayes                 | X                     | X                         |
| Random Forest               | X                     | X                         |
| Stochastic Gradient Descent | X                     | X                         |
| Logistic Regression         | X                     |                           |
| Support Vector Machine      | X                     |                           |

- OvR -> one-versus-the-rest: one binary classifier per class
- OvO -> one-versus-one: one binary classifier per pair of classes

Scikit-Learn will use OvR or OvO when a binary classifier is asked to perform multiclass classification.
- `from sklearn.multiclass import OneVsRestClassifier`
- `from sklearn.multiclass import OneVsOneClassifier`

<br>
<font color='green'>
    The list of target classes is stored in the classes_ attribute.
</font>
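
The two strategies can be compared by counting the binary classifiers each wrapper trains. The sketch below uses Scikit-Learn's small built-in `load_digits` dataset (10 classes) as a fast stand-in for MNIST; that substitution and the names `ovo_demo` and `ovr_demo` are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Small 10-class digits dataset bundled with Scikit-Learn (not MNIST)
x_digits, y_digits = load_digits(return_X_y=True)

ovo_demo = OneVsOneClassifier(SVC()).fit(x_digits, y_digits)
ovr_demo = OneVsRestClassifier(SVC()).fit(x_digits, y_digits)

# OvO trains one classifier per pair of classes: 10 * 9 / 2 = 45
n_ovo = len(ovo_demo.estimators_)
# OvR trains one classifier per class: 10
n_ovr = len(ovr_demo.estimators_)

print(n_ovo, n_ovr)
print(ovr_demo.classes_)
```

OvO trains more models, but each one only sees the instances of its two classes, which tends to suit algorithms that scale poorly with training-set size.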

### Support Vector Machine Classifier

<font color='red'>
    Warning:<br>
      45 binary classifiers (one per pair of the 10 classes) will be trained for the default OvO strategy.
      Scikit-Learn's SVC runs on the CPU only, so the following cell will take a while to execute.
</font>

In [None]:
svm_classifier = SVC()
svm_classifier.fit(x_train, y_train)
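# SVC trains its OvO classifiers internally and has no `estimators_` attribute;
# the number of pairwise classifiers is n_classes * (n_classes - 1) / 2
n_classes = len(svm_classifier.classes_)
n_classes * (n_classes - 1) // 2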

In [None]:
svm_classifier.predict([x_train[0]])

In [None]:
svm_classifier.classes_

In [None]:
svm_scores = svm_classifier.decision_function([x_train[0]])
svm_scores

#### Use OvR Strategy

In [None]:
ovr_svm = OneVsRestClassifier(SVC())
ovr_svm.fit(x_train, y_train)
len(ovr_svm.estimators_)

In [None]:
ovr_svm.predict([x_train[0]])

In [None]:
ovr_svm.classes_

### Stochastic Gradient Descent

In [None]:
sgd_classifier.fit(x_train, y_train)
sgd_classifier.predict([x_train[0]])

In [None]:
sgd_classifier.decision_function([x_train[0]])

#### SGD with Cross-Validation

<br>
<font color='red'>
    <b>
        Make sure to scale the inputs!
    </b>
</font>

In [None]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train.astype(np.float64))
cross_val_score(sgd_classifier, x_train_scaled, y_train, cv=3,
                scoring='accuracy')

### Random Forest

In [None]:
cross_val_score(rf_classifier, x_train_scaled, y_train, cv=3,
                scoring='accuracy')

---
## Error Analysis

- Plot standardized Confusion Matrix
    - Divide each value in the confusion matrix by the number of images in the corresponding class (rows) to view error rates instead of absolute numbers of errors.
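
The row normalization described above can be sketched on a tiny hypothetical example (the label arrays `y_true_demo` and `y_pred_demo` are made up; in the notebook they would come from `cross_val_predict`).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 0, 1, 2, 2, 2, 0, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Divide each row by the number of instances in that actual class
row_sums = cm_demo.sum(axis=1, keepdims=True)
norm_cm_demo = cm_demo / row_sums

# Zero the diagonal so that only the error rates remain
error_rates = norm_cm_demo.copy()
np.fill_diagonal(error_rates, 0)

print(norm_cm_demo)
print(error_rates)
```

Passing `normalize='true'` to `confusion_matrix` produces the same row-normalized matrix in one step.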

---
## Multilabel Classification

- Output a binary vector indicating each class as either present or absent.
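
A minimal sketch of a multilabel setup, assuming two illustrative labels per digit ("large", meaning 7 or above, and "odd") and using the small built-in `load_digits` dataset as a stand-in for MNIST. `KNeighborsClassifier` is used here because it accepts a two-dimensional target array; the label definitions and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

x_digits, y_digits = load_digits(return_X_y=True)

# Two binary labels per instance: is the digit large (>= 7)? is it odd?
y_multilabel = np.c_[y_digits >= 7, y_digits % 2 == 1]

knn_demo = KNeighborsClassifier().fit(x_digits, y_multilabel)

# The prediction is one binary vector per instance, one entry per label
demo_prediction = knn_demo.predict(x_digits[:1])
print(y_digits[0], demo_prediction)
```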