# Classification

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-exploration" data-toc-modified-id="Data-exploration-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data exploration</a></span></li><li><span><a href="#Logistic-regression" data-toc-modified-id="Logistic-regression-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Logistic regression</a></span><ul class="toc-item"><li><span><a href="#1-predictor" data-toc-modified-id="1-predictor-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>1 predictor</a></span><ul class="toc-item"><li><span><a href="#Model" data-toc-modified-id="Model-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Model</a></span></li><li><span><a href="#Accuracy-score" data-toc-modified-id="Accuracy-score-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Accuracy score</a></span></li></ul></li><li><span><a href="#Many-predictors" data-toc-modified-id="Many-predictors-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Many predictors</a></span><ul class="toc-item"><li><span><a href="#Model" data-toc-modified-id="Model-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Model</a></span></li><li><span><a href="#Accuracy-score" data-toc-modified-id="Accuracy-score-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Accuracy score</a></span></li><li><span><a href="#predict_proba" data-toc-modified-id="predict_proba-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span><code>predict_proba</code></a></span></li><li><span><a href="#Setting-threshold-manually" data-toc-modified-id="Setting-threshold-manually-2.2.4"><span class="toc-item-num">2.2.4&nbsp;&nbsp;</span>Setting threshold manually</a></span></li></ul></li><li><span><a href="#Under-the-hood" data-toc-modified-id="Under-the-hood-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Under the hood</a></span></li></ul></li><li><span><a href="#Metrics" data-toc-modified-id="Metrics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Metrics</a></span><ul 
class="toc-item"><li><span><a href="#Accuracy" data-toc-modified-id="Accuracy-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Accuracy</a></span></li><li><span><a href="#Recall" data-toc-modified-id="Recall-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Recall</a></span></li><li><span><a href="#Precision" data-toc-modified-id="Precision-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Precision</a></span></li><li><span><a href="#F1-score" data-toc-modified-id="F1-score-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>F1 score</a></span></li><li><span><a href="#F_beta-score" data-toc-modified-id="F_beta-score-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>F_beta score</a></span></li></ul></li><li><span><a href="#Choosing-the-best-threshold" data-toc-modified-id="Choosing-the-best-threshold-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Choosing the best threshold</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Summary</a></span></li><li><span><a href="#Further-materials" data-toc-modified-id="Further-materials-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Further materials</a></span></li></ul></div>

Since this is an introductory notebook, I am not using a train/test split for metric evaluation  
But you always should!

In [None]:
import pandas as pd
import seaborn as sns

 * Regression models are used when the target variable is **quantitative**: 
  - salaries
  - gas emissions
  - age of person in a picture
  - ...
 * **Classification** models are used when the target variable is **qualitative**: 
  - surviving (or not) the Titanic
  - paying back (or not) a loan
  - identifying a dog (or not) in a picture
  - deciding which of 3 plant species a given plant belongs to
  - ...

Metrics are especially important in classification problems!

You must understand the business goal in order to choose the appropriate metric.

## Data exploration

In [None]:
df = pd.read_csv("../datasets/breast_cancer.csv")

In [None]:
df.shape

In [None]:
df.sample(5)

In [None]:
df.columns

Documentation:  
a) radius (mean of distances from center to points on the perimeter)  
b) texture (standard deviation of gray-scale values)  
c) perimeter  
d) area  
e) smoothness (local variation in radius lengths)  
f) compactness (perimeter^2 / area - 1.0)  
g) concavity (severity of concave portions of the contour)  
h) concave points (number of concave portions of the contour)  
i) symmetry  
j) fractal dimension ("coastline approximation" - 1)  

Target variable is `is_cancer`  
It is a categorical variable, taking possible values $0$ and $1$

In [None]:
sns.countplot(x=df.is_cancer)

## Logistic regression

Logistic regression is one of the simplest classification ML models

### 1 predictor

#### Model

Let's try to predict `is_cancer` using `mean_radius` as the only predictor

In [None]:
sns.scatterplot(x=df.mean_radius, y=df.is_cancer)

In [None]:
df2 = df[["mean_radius", "is_cancer"]].copy()

In [None]:
df2.sample(10).sort_values("mean_radius")

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log = LogisticRegression()

In [None]:
log.fit(
    X=df2[["mean_radius"]],
    y=df2.is_cancer
)

A trained model is a *prediction machine*

How would it have performed on the first 5 patients?

In [None]:
df2[["mean_radius"]][:5]

In [None]:
df.is_cancer[:5]

In [None]:
log.predict(df2[["mean_radius"]][:5])

It would have gotten 4 out of 5 right

In [None]:
df2["prediction_cancer"] = log.predict(df2[["mean_radius"]])

In [None]:
df2.head(10)

#### Accuracy score

How many predictions were correct?

In [None]:
df2["correct"] = (df2.is_cancer == df2.prediction_cancer)

In [None]:
df2.head(10)

Accuracy is the proportion of correct predictions

In [None]:
accuracy = df2.correct.sum() / df2.shape[0]

In [None]:
accuracy

How many True/False Positives/Negatives do we have?

We build the confusion matrix

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/conf_matrix.png">

In [None]:
df.shape

In [None]:
pd.crosstab(
    df2.is_cancer,
    df2.prediction_cancer
)

Equivalent to:

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(
    y_true=df2.is_cancer,
    y_pred=df2.prediction_cancer
)

Graphical interpretation

In [None]:
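import numpy as np  # first use of NumPy in this notebook
x = np.linspace(df.mean_radius.min(), df.mean_radius.max(), 100)
# wrap the grid in a DataFrame so the column name matches the one used in fit
y = log.predict(pd.DataFrame({"mean_radius": x}))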

In [None]:
import matplotlib.pyplot as plt

In [None]:
sns.scatterplot(x=df.mean_radius, y=df.is_cancer)
plt.plot(x, y, c= "red")

### Many predictors

#### Model

More iterations mean more training time

In [None]:
# with the default max_iter, the solver may fail to converge (scikit-learn warns you)
log = LogisticRegression(max_iter=10000)

We use all variables **but** `is_cancer` to try and predict `is_cancer`

In [None]:
df3 = df.copy()

In [None]:
X = df3.drop("is_cancer", axis=1)
y = df3.is_cancer

In [None]:
log.fit(X, y)

In [None]:
df3["prediction_cancer"] = log.predict(X)

In [None]:
df3b = df3[["is_cancer", "prediction_cancer"]].copy()

In [None]:
df3b["correct"] = (df3b.is_cancer == df3b.prediction_cancer)

In [None]:
df3b.sample(10)

Confusion matrix

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/conf_matrix.png">

In [None]:
pd.crosstab(
    df3b.is_cancer,
    df3b.prediction_cancer
)

In [None]:
accuracy

In [None]:
log.score(X, y)

In [None]:
acc = (348 + 197) / df3.shape[0]

In [None]:
acc

Using the train/test split paradigm:  
 - Split the dataset into train and test sets.
 - Use the train set to fit the model.
 - Predict on the test set.
 - Evaluate accuracy by comparing the predictions with the real test labels.
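The steps above can be sketched as follows. This is only a minimal sketch on synthetic data from `make_classification` (an assumption, standing in for the breast cancer dataset); the names `X_demo`, `y_demo` and `model`, the 25% test size and the `random_state` values are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic binary classification data (stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# 2. use the train set to fit the model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# 3. predict on the test set (data the model has never seen)
y_test_pred = model.predict(X_test)

# 4. compare the predictions with the real test labels
test_accuracy = accuracy_score(y_test, y_test_pred)
print(test_accuracy)
```

The accuracy printed here is measured on unseen rows, so it is a fairer estimate than the in-sample accuracy used in the rest of this notebook.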

#### Accuracy score

Accuracy is the proportion of good predictions

In [None]:
accuracy = df3b.correct.sum() / df3b.shape[0]

In [None]:
accuracy

We got better accuracy using many predictors instead of one (as expected!)

#### `predict_proba`

Logistic regression actually predicts probabilities

We, as data scientists, want more precise information than just the *discrete* prediction 0, 1

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
# let's see what our model predicts for 5 patients
log.predict(X)[35:40]

In [None]:
log.predict_proba(X)[35:40].round(3)[:, 1]

In [None]:
df3["prediction_proba_cancer"] = log.predict_proba(X)[:, 1]

In [None]:
df3b["prediction_proba_cancer"] = df3.prediction_proba_cancer

In [None]:
df3b.sample(10)

Graphical interpretation in the case of ONE PREDICTOR

The curve gives the probability (Y) of cancer given the mean_radius (X)

<img src="https://static.javatpoint.com/tutorial/machine-learning/images/logistic-regression-in-machine-learning.png">

#### Setting threshold manually

For a binary problem, the default `predict` is equivalent to checking whether `predict_proba` for the positive class is > 0.5
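A quick way to convince yourself of this is to apply the 0.5 threshold by hand and compare with `predict`. The sketch below uses synthetic data from `make_classification` (an assumption, not the notebook's dataset), and the names `model`, `proba_positive` and `manual_prediction` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic binary classification data (stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# probability of the positive class (second column of predict_proba)
proba_positive = model.predict_proba(X_demo)[:, 1]

# apply the 0.5 threshold by hand
manual_prediction = (proba_positive > 0.5).astype(int)

# compare with what predict returns
same_as_predict = np.array_equal(manual_prediction, model.predict(X_demo))
print(same_as_predict)
```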

In [None]:
df3b.head()

Remember the original confusion matrix

In [None]:
pd.crosstab(
    df3b.is_cancer,
    df3b.prediction_cancer
)

Let's change the default threshold of 0.5 and see how the results change

Setting the threshold to 0.1 *invites* predictions to be Positive

Will False Positives increase?  
Will False Negatives increase?

**Example 1: low threshold**

In [None]:
threshold = 0.1

In [None]:
pd.crosstab(
    df3b.is_cancer,
    df3b.prediction_proba_cancer > threshold
)

In [None]:
(319 + 211) / df.shape[0]

Almost all positives were found (higher **recall**)

More false positives appeared (lower **precision**)

**Example 2: high threshold**

Recall the crosstab for the 0.5 threshold

In [None]:
pd.crosstab(
    df3b.is_cancer,
    df3b.prediction_cancer
)

In [None]:
threshold = 0.8

In [None]:
pd.crosstab(
    df3b.is_cancer,
    df3b.prediction_proba_cancer > threshold
)

Almost all predicted positives are true (higher **precision**)

We predicted 28 real cancers as False (low **recall**)

### Under the hood

The logistic function is a mathematical function defined as
$$\sigma(z)=\frac{1}{1 + e^{-z}}$$

**Exercise**: 
 * $\sigma(0)=$

 * $\lim_{z \rightarrow \infty} \sigma(z)=$

 * $\lim_{z \rightarrow -\infty} \sigma(z)=$

In [None]:
z = np.linspace(-10, 10, 100)
sigma = 1 / (1 + np.exp(-z))

In [None]:
plt.plot(z, sigma, "-")

**Exercise**: show that $$\sigma'(z)=\sigma(z)(1-\sigma(z))$$
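Before proving the identity, it can be sanity-checked numerically. The sketch below compares a central finite difference of the logistic function with $\sigma(z)(1-\sigma(z))$; the names `sigmoid` and `z_grid` and the step size `h` are illustrative choices, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    # logistic function, as defined above
    return 1 / (1 + np.exp(-z))

z_grid = np.linspace(-10, 10, 2001)
h = 1e-5  # small step for the finite difference

# numerical approximation of the derivative
numerical_derivative = (sigmoid(z_grid + h) - sigmoid(z_grid - h)) / (2 * h)

# right-hand side of the identity
identity_rhs = sigmoid(z_grid) * (1 - sigmoid(z_grid))

max_abs_difference = np.max(np.abs(numerical_derivative - identity_rhs))
print(max_abs_difference)
```

A tiny difference is consistent with the identity, but it is not a proof: the exercise still asks for the derivation.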

In [None]:
log.coef_

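Logistic regression finds the **optimal** parameters $\beta_0$, ..., $\beta_n$ such that the predictions 
$$\hat{y}=\sigma(\beta _{0}+\beta _{1}x_1 + ... + \beta_n x_n)=\frac{1}{1 + e^{-(\beta _{0}+\beta _{1}x_1 + ... + \beta_n x_n)}}$$ have the least error, measured by the log loss (cross-entropy) between $\hat{y}$ and the true labels. By default, scikit-learn also adds an L2 regularization penalty on the coefficients.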
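To make the formula concrete, the sketch below rebuilds the predicted probabilities by hand from a fitted model's `intercept_` ($\beta_0$) and `coef_` ($\beta_1$, ..., $\beta_n$) and compares them with `predict_proba`. It uses synthetic data from `make_classification` (an assumption, not the notebook's dataset); `model`, `linear_score` and `manual_proba` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic binary classification data (stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=2)
model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# linear part: beta_0 + beta_1 * x_1 + ... + beta_n * x_n
linear_score = model.intercept_[0] + X_demo @ model.coef_[0]

# logistic function applied to the linear part
manual_proba = 1 / (1 + np.exp(-linear_score))

# should agree with the probability of the positive class
matches = np.allclose(manual_proba, model.predict_proba(X_demo)[:, 1])
print(matches)
```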

## Metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score

Remember: 
 * a metric should be chosen a priori, with a deep understanding of the problem. Then several models are trained and the one with the best metric result is chosen
 * here we present different metrics for the same model as an exercise. The results (0.94, 0.92, 0.95) should by no means be compared!!

On the left rectangle, Real Positives  
On the right rectangle, Real Negatives

Inside the circle, Predicted Positives  
Outside the circle, Predicted Negatives

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/buckets.png">

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/conf_matrix.png">

### Accuracy

 * It represents the proportion of samples predicted correctly
 * The most common metric for classification
 * Useful when:
  - dataset has balanced classes (similar proportion of True and False)
  - there is symmetry between True and False (for example, predicting "male" or "female")
 * **Often misused!!** since:
  - many problems are not symmetric (for example, cancer vs no cancer)
  - many problems have imbalanced classes (for example, terrorist vs no terrorist)
 * I do not like the accuracy metric. Be alert!
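The problem with imbalanced classes is easy to reproduce. In the sketch below, the labels are hypothetical (990 negatives, 10 positives) and the "model" simply predicts the negative class for everyone; all names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# hypothetical imbalanced labels: 990 negatives and 10 positives
y_true_demo = np.array([0] * 990 + [1] * 10)

# a useless "model" that always predicts the negative class
y_pred_demo = np.zeros_like(y_true_demo)

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)  # 990 / 1000
demo_recall = recall_score(y_true_demo, y_pred_demo)      # 0 / 10

print(demo_accuracy, demo_recall)
```

The accuracy looks excellent even though not a single positive case was found, which is why accuracy alone can be misleading here.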

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/accuracy.png">

In [None]:
df3b.head()

In [None]:
accuracy_score(
    y_true=df3b.is_cancer,
    y_pred=df3b.prediction_cancer
)

### Recall

Recall (also known as sensitivity) is the fraction of positive events that you predicted correctly, as shown below

It treats identifying the actual positives as **crucial** (useful for cancer detection or terrorist detection at airports)

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/recall.png">

In [None]:
recall_score(
    y_true=df3b.is_cancer,
    y_pred=df3b.prediction_cancer
)
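Recall can also be computed by hand from the confusion matrix as TP / (TP + FN). A minimal sketch on hypothetical labels (the arrays below are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# hypothetical labels, for illustration only
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# for binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# fraction of real positives that were found
manual_recall = tp / (tp + fn)
print(manual_recall, recall_score(y_true_demo, y_pred_demo))
```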

### Precision

Precision is the fraction of predicted positive events that are actually positive, as shown below

It treats **avoiding** False Positives as **crucial** (for example, very aggressive treatments for not very serious diseases)

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/precision.png">

In [None]:
precision_score(
    y_true=df3b.is_cancer,
    y_pred=df3b.prediction_cancer
)
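Another way to read precision: keep only the samples that were predicted Positive and check what fraction of them is really positive. A minimal sketch on hypothetical labels (made up for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import precision_score

# hypothetical labels, for illustration only
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# keep only the samples predicted as positive
predicted_positive = y_pred_demo == 1

# fraction of those that are really positive
manual_precision = y_true_demo[predicted_positive].mean()
print(manual_precision, precision_score(y_true_demo, y_pred_demo))
```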

### F1 score

 * **recall** is, in general, more important than precision but...
 * raising recall by lowering the threshold usually lowers precision
 * a tradeoff should be found

The F1 score is the harmonic mean of recall and precision:

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/f1_score.png">

In [None]:
f1_score(
    y_true=df3b.is_cancer,
    y_pred=df3b.prediction_cancer
)
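Why a harmonic mean rather than a plain average? The harmonic mean is dragged down by the smaller of the two values, so a model cannot get a good F1 score by excelling at only one of them. A small sketch with hypothetical precision and recall values (`harmonic_mean` is an illustrative helper, not a library function):

```python
def harmonic_mean(a, b):
    # harmonic mean of two positive numbers
    return 2 * a * b / (a + b)

# hypothetical model: perfect precision, very poor recall
precision_demo, recall_demo = 1.0, 0.1

arithmetic_demo = (precision_demo + recall_demo) / 2
f1_demo = harmonic_mean(precision_demo, recall_demo)

print(arithmetic_demo, f1_demo)
```

The arithmetic mean still looks acceptable, while the harmonic mean stays close to the poor recall.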

### F_beta score

 * **recall** is, in general, more important than precision but...
 * raising recall by lowering the threshold usually lowers precision
 * a tradeoff should be found

The $F_1$ score is the harmonic mean of recall and precision

The $F_\beta$ score is a weighted harmonic mean of recall and precision

The `beta` parameter determines the weight of **recall** in the combined score

We, together with doctors (technicians) and government (money, time), decide that $\beta=3$ is a good choice for breast cancer detection:
 * we value finding real positives...
 * 3 times more than...
 * losing time with false positives

In [None]:
fbeta_score(
    y_true=df3b.is_cancer,
    y_pred=df3b.prediction_cancer,
    beta=3
)
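For reference, the usual definition of the score is

$$F_\beta = (1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}$$

The sketch below checks this formula against `fbeta_score` on hypothetical labels (made up for illustration, not the notebook's data); `f_beta_manual` is an illustrative helper, not a library function.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

def f_beta_manual(precision, recall, beta):
    # weighted harmonic mean; beta > 1 gives more weight to recall
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# hypothetical labels, for illustration only
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

precision_demo = precision_score(y_true_demo, y_pred_demo)  # 3 / 5
recall_demo = recall_score(y_true_demo, y_pred_demo)        # 3 / 4

# beta=1 is the F1 score, beta=3 is the choice made above
for beta in (1, 3):
    manual = f_beta_manual(precision_demo, recall_demo, beta)
    library = fbeta_score(y_true_demo, y_pred_demo, beta=beta)
    print(beta, round(manual, 4), round(library, 4))
```

With $\beta=3$ the score sits closer to the recall than the $F_1$ score does, which is the intended effect.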

## Choosing the best threshold

Let's choose the threshold that maximizes the $F_3$ score

In [None]:
df3b.head()

In [None]:
import numpy as np

In [None]:
results = []

In [None]:
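y_true = df3b.is_cancer

for threshold in np.arange(0, 1.01, 0.05):
    # predict Positive when the predicted probability exceeds the threshold
    y_pred = df3b.prediction_proba_cancer > threshold

    result = {
        "threshold": threshold,
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0: score 0 without a warning when nothing is predicted Positive
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred),
        "f3": fbeta_score(y_true, y_pred, beta=3, zero_division=0)
    }

    results.append(result)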

In [None]:
res = pd.DataFrame(results)
res

In [None]:
optimal_threshold = res.loc[res.f3.idxmax(), "threshold"]
optimal_threshold

<img src="https://github.com/andrewwlong/classification_metrics_sklearn/raw/541a0d065ffb8b3ff705161f6d16088d434b2ea7/img/conf_matrix.png">

In [None]:
pd.crosstab(df3b.is_cancer, df3b.prediction_proba_cancer > optimal_threshold)

In [None]:
# if I were to use a ridiculously low threshold
# I would have 100% recall
# but very low precision
pd.crosstab(df3b.is_cancer, df3b.prediction_proba_cancer > 0.000001)

<img width=500 src="https://2.bp.blogspot.com/-EvSXDotTOwc/XMfeOGZ-CVI/AAAAAAAAEiE/oePFfvhfOQM11dgRn9FkPxlegCXbgOF4QCLcBGAs/s1600/confusionMatrxiUpdated.jpg">

## Summary

 * Classification predicts qualitative outputs (classes)
 * Logistic regression is one of the most commonly used classification algorithms (not because it is the best, but because it is explainable and classic)
 * Logistic regression predicts probabilities (`.predict_proba`, between 0 and 1) and can apply the default 0.5 threshold for you (`.predict`)

 * We have different classification metrics
 * A metric should be decided before training different models! Human criteria translate into the choice of metric
 * A metric lets us choose the best model (here we chose the best threshold for our final model)
 * For choosing a metric, look at 
  - class balance
  - class symmetry

## Further materials

[ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)