# Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow

## Chapter 3: Classification

---
### Imports

In [None]:
from pathlib import Path
from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, cross_val_score

from tensorflow_2.exceptions import InputError

### Variables

In [None]:
DATA_DIR = Path('../../data/ch3_classification')

---
## Load Data

In [None]:
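# as_frame=False returns NumPy arrays, which the slicing and reshape calls below rely on.
mnist = fetch_openml('mnist_784', version=1, data_home=DATA_DIR.parent, as_frame=False)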
print(mnist['DESCR'])

In [None]:
mnist['target'] = mnist['target'].astype(np.uint8)
x_train, y_train = [mnist[k][:60000] for k in ('data', 'target')]
x_test, y_test = [mnist[k][60000:] for k in ('data', 'target')]

In [None]:
def plot_example(dset: str='train', idx: int=0, score: Optional[bool]=None):
    """
    Plot example from dataset.
    
    :param dset: choose either `train` or `test`
    :param idx: index of example
    :param score: model predicted score
    """
    if dset not in ('train', 'test'):
        raise InputError(
            f'dset={dset}',
            f'Valid inputs for dset are "train" or "test"')
    x = x_train if dset == 'train' else x_test
    y = y_train if dset == 'train' else y_test
    score = '' if score is None else f'   Predict: {score}'
    plt.imshow(x[idx].reshape(28, 28), cmap='binary')
    plt.title(f'Label: {y[idx]}{score}')
    plt.axis('off')
    plt.show()
    
    
plot_example('train', 10)

---
## Train Binary Classifier

Stochastic Gradient Descent (SGD) classifier
- capable of handling very large datasets efficiently
- processes training instances independently, one at a time
    - suited for online learning
- relies on randomness during training
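
The online-learning point above can be sketched with `partial_fit`, which updates the model one mini-batch at a time. This is a minimal illustration on hypothetical toy data (two Gaussian blobs, not MNIST); the names `x_toy`, `y_toy`, and `online_classifier` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data: two well-separated Gaussian blobs in 2-D.
rng = np.random.default_rng(42)
x_neg = rng.normal(loc=-2.0, scale=1.0, size=(500, 2))
x_pos = rng.normal(loc=2.0, scale=1.0, size=(500, 2))
x_toy = np.vstack([x_neg, x_pos])
y_toy = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]

# Shuffle so that every mini-batch contains a mix of both classes.
order = rng.permutation(len(x_toy))
x_toy, y_toy = x_toy[order], y_toy[order]

online_classifier = SGDClassifier(random_state=42)
for start in range(0, len(x_toy), 100):
    batch = slice(start, start + 100)
    # All possible labels must be passed via `classes` on the first call.
    online_classifier.partial_fit(x_toy[batch], y_toy[batch], classes=np.array([0, 1]))

toy_accuracy = online_classifier.score(x_toy, y_toy)
print(f'Accuracy after streaming 10 mini-batches: {toy_accuracy:.3f}')
```

Because each call only sees one batch, the same loop could in principle consume data that arrives over time or does not fit in memory.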

In [None]:
binary_value = 4
y_train_binary = y_train == binary_value
y_test_binary = y_test == binary_value

sgd_classifier = SGDClassifier(random_state=42)
sgd_classifier.fit(x_train, y_train_binary)

In [None]:
for n in range(10):
    score = sgd_classifier.predict([x_train[n]])[0]
    plot_example(dset='train', idx=n, score=score)

---
## Cross-Validation of Binary Classifier

Algorithm
1. Randomly split the training set into k distinct subsets called ***folds***.
1. Train the model on k-1 folds.
1. Evaluate the model on the one fold that was not included in training.
1. Repeat until all folds have been used as an evaluation set.
1. Average the results of all the trained folds.

### Example implementation of Cross-Validation
```python
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

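# Recent Scikit-Learn versions require shuffle=True when random_state is set.
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skfolds.split(x_train, y_train_binary):
    # Train a fresh, unfitted copy of the classifier on each split.
    clone_model = clone(sgd_classifier)
    x_train_folds = x_train[train_idx]
    y_train_folds = y_train_binary[train_idx]
    x_test_fold = x_train[test_idx]
    y_test_fold = y_train_binary[test_idx]
    clone_model.fit(x_train_folds, y_train_folds)
    predict = clone_model.predict(x_test_fold)
    # Accuracy on the held-out fold: fraction of correct predictions.
    n_correct = sum(predict == y_test_fold)
    scores.append(n_correct / len(predict))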
```

### Evaluate Cross-Validation Accuracy of Binary Classifier

In [None]:
cross_val_score(sgd_classifier, x_train, y_train_binary, cv=3, scoring='accuracy')

If the classifier predicted that a 4 never appeared in this dataset, it would still reach an accuracy of roughly 90%, because only about 10% of the images are 4s.

<font color='red'>
    Accuracy is generally not the preferred performance measure for classifiers, especially when dealing with *skewed* datasets.
</font>
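
To make the skew argument concrete, here is a small sketch using hypothetical labels (uniformly random digits rather than MNIST). The names `toy_digits`, `toy_is_four`, and `never_four_pred` are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical labels: digits 0-9 drawn uniformly, so about 10% are 4s.
toy_digits = rng.integers(0, 10, size=10_000)
toy_is_four = toy_digits == 4

# A "classifier" that never predicts the positive class.
never_four_pred = np.zeros_like(toy_is_four, dtype=bool)

baseline_accuracy = (never_four_pred == toy_is_four).mean()
print(f'Accuracy of always predicting "not 4": {baseline_accuracy:.3f}')
```

A model that learns nothing still scores close to 90% here, which is why accuracy alone says little on skewed data.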

### Confusion Matrix of Binary Classifier

In [None]:
y_train_pred = cross_val_predict(sgd_classifier, x_train, y_train_binary, cv=3)
c_matrix = confusion_matrix(y_train_binary, y_train_pred)
pd.DataFrame(c_matrix,
             columns=['Predicted False', 'Predicted True'],
             index=['Actual False', 'Actual True'])

<font color='red'>
    <b>
        Increasing precision reduces recall, and vice versa (Precision/Recall trade-off)
    </b>
</font>

<br>
<br>

<font color='green'>
    For the binary case: tn, fp, fn, tp = confusion_matrix().ravel()
</font>
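
As a quick check of the unpacking order, the following sketch runs `confusion_matrix` on eight hypothetical instances (the `toy_*` names are made up for this example).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for eight instances.
toy_true = np.array([False, False, False, False, True, True, True, False])
toy_pred = np.array([False, True, False, False, True, False, True, False])

# Rows are actual classes and columns are predicted classes, both ordered [False, True],
# so ravel() flattens the matrix to: tn, fp, fn, tp.
toy_tn, toy_fp, toy_fn, toy_tp = confusion_matrix(toy_true, toy_pred).ravel()
print(toy_tn, toy_fp, toy_fn, toy_tp)
```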

#### Precision
$$precision = \frac{TP}{TP + FP}$$

#### Recall
$$recall = \frac{TP}{TP + FN}$$

In [None]:
tn, fp, fn, tp = c_matrix.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f'Precision: {precision}')
print(f'Recall: {recall}')

<font color='red'>
    Use Scikit-Learn functions for Precision and Recall
</font>

In [None]:
precision, recall = [f(y_train_binary, y_train_pred)
                     for f in (precision_score, recall_score)]

In [None]:
precision

In [None]:
recall

#### F1 Score

- Combination of precision and recall into a single metric.
- The harmonic mean of precision and recall.
- Compared with the arithmetic mean, the harmonic mean gives much more weight to low values.
- A high F1 score requires *both* precision and recall to be high.

$$F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}$$

$$F_1 = 2 \left( \frac{precision \cdot recall}{precision + recall} \right)$$

$$F_1 = \frac{TP}{TP + \frac{FN + FP}{2}}$$

In [None]:
f1_score(y_train_binary, y_train_pred)
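
The three forms of $F_1$ given above are algebraically equivalent. The sketch below checks this numerically on hypothetical counts (`toy_tp`, `toy_fp`, `toy_fn` are chosen only for illustration) and compares the result with the arithmetic mean.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
toy_tp, toy_fp, toy_fn = 30, 10, 20

toy_precision = toy_tp / (toy_tp + toy_fp)
toy_recall = toy_tp / (toy_tp + toy_fn)

# The three equivalent forms of the F1 score.
f1_harmonic = 2 / (1 / toy_precision + 1 / toy_recall)
f1_product = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)
f1_counts = toy_tp / (toy_tp + (toy_fn + toy_fp) / 2)

# The arithmetic mean is never lower than the harmonic mean.
arithmetic_mean = (toy_precision + toy_recall) / 2
print(f1_harmonic, f1_product, f1_counts, arithmetic_mean)
```

The harmonic mean sits below the arithmetic mean whenever precision and recall differ, which is how the lower of the two values gets more weight.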

### Precision vs. Recall

- Raising the threshold decreases recall and generally improves precision (although precision can occasionally dip)
- Lowering the threshold increases recall and reduces precision

<font color='red'>
    <code>SGDClassifier.predict()</code> applies a default threshold of zero to the decision function scores.
</font>
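
The role of the threshold can be sketched on a hypothetical toy problem (generated with `make_classification`, not MNIST). The example assumes a binary `SGDClassifier`; `higher_threshold` is an arbitrary value picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Hypothetical toy problem standing in for the MNIST binary task.
x_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)
toy_classifier = SGDClassifier(random_state=42).fit(x_toy, y_toy)

toy_scores = toy_classifier.decision_function(x_toy)

# For a binary linear classifier, predict() matches thresholding the scores at zero.
pred_at_zero = (toy_scores > 0).astype(int)
same_as_predict = np.array_equal(pred_at_zero, toy_classifier.predict(x_toy))

# Raising the threshold can only remove positive predictions, so recall cannot go up.
higher_threshold = 2.0  # arbitrary value for illustration
pred_at_higher = (toy_scores > higher_threshold).astype(int)
print(same_as_predict, pred_at_zero.sum(), pred_at_higher.sum())
```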

In [None]:
decision_function_scores = cross_val_predict(
    sgd_classifier, x_train, y_train_binary, cv=3, method='decision_function'
)

precisions, recalls, thresholds = precision_recall_curve(
    y_train_binary, decision_function_scores
)

df = pd.DataFrame(
    np.c_[precisions[:-1], recalls[:-1]],
    index=thresholds,
    columns=['Precision', 'Recall'],
)
df.index.name='Threshold'

fig = px.line(df, title='Precision & Recall vs Threshold')
fig.show()

fig = px.line(df, x='Recall', y='Precision', title='Precision vs Recall')
fig.show()

#### Find Threshold to Achieve 90% Precision

In [None]:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.9)]
y_train_pred_90 = decision_function_scores >= threshold_90_precision

precision, recall = [f(y_train_binary, y_train_pred_90)
                     for f in (precision_score, recall_score)]

In [None]:
precision

In [None]:
recall
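
One caveat with the `np.argmax` lookup: if no threshold reaches the target precision, `np.argmax` silently returns index 0. The sketch below wraps the lookup in a hypothetical helper, `threshold_for_precision` (not part of Scikit-Learn), and tries it on made-up scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_precision(y_true, scores, target):
    """Hypothetical helper: lowest threshold with precision >= target, else None."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, scores)
    # precisions has one more entry than thresholds; drop the final value of 1.
    reached = precisions[:-1] >= target
    if not reached.any():
        return None
    return thresholds[np.argmax(reached)]


# Made-up labels and decision scores, sorted by score for readability.
toy_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
toy_scores = np.array([-3.0, -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 3.0])

toy_threshold = threshold_for_precision(toy_true, toy_scores, target=0.75)
unreachable = threshold_for_precision(toy_true, toy_scores, target=1.1)
print(toy_threshold, unreachable)
```

Returning `None` makes the unreachable case explicit instead of quietly falling back to the lowest threshold.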