# Exercises Notebook - SLU13 - Validation Metrics for Classification
Associated presentation [here](https://docs.google.com/presentation/d/1lEE9BUWsUKryXzGCLyysX7d78XL3ylANTU-fMKtIeYE/edit?usp=sharing). This notebook only covers validation metrics for **binary classification**.

----
*By: [Hugo Lopes](https://www.linkedin.com/in/hugodlopes/)  
LDSA - SLU13*

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
%matplotlib inline 

In [None]:
# Import the classification metrics
from sklearn.metrics import accuracy_score, precision_score, \
    recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix
    
from utils import plot_roc_curve

# 1. Load Data!

First, let us load some data to fit into a classifier.

In [None]:
# RUN this cell
df = pd.read_csv('exercise_dataset_SLU13.csv')
print('Shape:', df.shape)
df.head()

As we can see, we have our target named `label` and 10 other columns, which are our features.

# Exercise 1: Dataset Imbalance
Some performance metrics are not recommended for highly imbalanced datasets. Check the imbalance of your dataset. What can you tell about it?

In [None]:
def class_imbalance(labels):
    """
    Calculate the class imbalance
    """
    # Calculate the class imbalance, i.e., the ratio of 1s (ones)
    # in the dataset
    # ratio_1s = ...
    ### BEGIN SOLUTION
    ratio_1s = labels.mean()
    ### END SOLUTION
    
    return ratio_1s

In [None]:
print('Ratio of 1s (imbalance):', class_imbalance(df['label']))

Expected output:
    
    Ratio of 1s (imbalance): 0.0702

In [None]:
### BEGIN TESTS
assert np.isclose(class_imbalance(df['label']), 0.0702, atol=1e-3)
### END TESTS

So this result should already put us on alert: only about 7% of the samples are 1s, which means metrics such as accuracy can be misleading here.
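To see why accuracy in particular is risky, here is a minimal sketch on made-up labels with a similar 7% positive rate (the toy array is an assumption, not the notebook's data). A trivial "classifier" that always predicts 0 already reaches 93% accuracy while finding none of the positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 7 positives out of 100 samples (7% ones)
y_true = np.array([1] * 7 + [0] * 93)

# Ratio of 1s: the mean of a 0/1 array is the fraction of ones
ratio_1s = y_true.mean()

# A trivial "model" that always predicts the majority class (0)
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)  # correct on every negative
baseline_recall = recall_score(y_true, y_pred)      # misses every positive

print('Ratio of 1s:', ratio_1s)
print('Accuracy of always-0:', baseline_accuracy)
print('Recall of always-0:', baseline_recall)
```

Any real model should at least beat this majority-class baseline, which is why accuracy alone says little on imbalanced data.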

## Divide into Train and Test sets
Remember: always keep a part of your data separate for the final evaluation of your model's performance. Time to do that:
- X_train: train data  
- y_train: target of train data  
- X_test: test data  
- y_test: target of test data
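The split in the next cell is purely random. With so few positives, you may also want to pass `stratify` to `train_test_split` so that both sets keep the same ratio of 1s. The notebook does not do this (and stratifying would change the expected outputs below), so treat this sketch on synthetic data as an optional variation; the data shapes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical data: 1000 rows, 10 features, exactly 70 positives (7%)
X = rng.normal(size=(1000, 10))
y = np.array([1] * 70 + [0] * 930)

# stratify=y keeps the proportion of 1s (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print('Train ratio of 1s: %.4f' % y_tr.mean())
print('Test ratio of 1s:  %.4f' % y_te.mean())
```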

In [None]:
# RUN this cell
X_train, X_test, y_train, y_test = train_test_split(df.drop('label', axis=1), 
                                                    df['label'], 
                                                    test_size=0.33, 
                                                    random_state=42)

## Fit the Logistic Regression with Train Set
Let's fit the Logistic Regression on our training data.

In [None]:
# RUN cell:
clf = LogisticRegression(random_state=123, tol=1e-10).fit(X_train, y_train)

# Exercise 2: Getting predictions

In [None]:
def calc_probas(clf, X_test):
    """
    Get the predictions (probas) for a test set 'X_test' with a fitted classifier 'clf'
    
    Inputs:
        clf: Logistic Regression classifier (sklearn classifier)
        X_test: test dataset (pandas.DataFrame) (Num_rows, num_features)
    
    Output:
        probas: predicted probabilities (numpy.array, of shape (Num_rows,))
    """
    # Get predictions on the test set, i.e., get the _probabilities of being of class 1_ 
    # for the Test set (`X_test`) by making use of the method 
    # `predict_proba` of your classifier. Assign it to the variable `probas`
    # NOTE: don't forget to extract only the second column.
    # probas = ...
    ### BEGIN SOLUTION
    probas = clf.predict_proba(X_test)[:, 1]
    ### END SOLUTION
    
    return probas
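Why only the second column? `predict_proba` returns one column per class, ordered as in the fitted classifier's `classes_` attribute, so for 0/1 labels column 1 holds the probability of class 1. A small check on a hypothetical toy dataset (not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 rows, 3 features, label driven by the first feature
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf_toy = LogisticRegression().fit(X, y)
proba_matrix = clf_toy.predict_proba(X)   # shape (n_rows, n_classes)

print('Classes order:', clf_toy.classes_)  # column j corresponds to classes_[j]
print('Shape:', proba_matrix.shape)
print('Rows sum to 1:', np.allclose(proba_matrix.sum(axis=1), 1.0))

# Column index of class 1, looked up instead of hard-coded
pos_col = list(clf_toy.classes_).index(1)
probas_pos = proba_matrix[:, pos_col]
```

Looking the column up via `classes_` is a defensive choice; with 0/1 labels it always resolves to index 1.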

In [None]:
probas = calc_probas(clf, X_test)
print('Probabilities:', probas)

Expected output:

    Probabilities: [0.07192596 0.06016595 0.06723508 ... 0.08183457 0.05332582 0.0543554 ]

In [None]:
### BEGIN TESTS
assert np.isclose(probas[0], 0.07192596, atol=1e-3)
assert len(probas) == 3300, "The length of the variable 'probas' is expected to be 3300."
assert type(probas) == np.ndarray
### END TESTS

# Exercise 3: Binarize Predictions and Confusion matrix
By now you should have `probas` (your array of probabilities of being 1). At some point you will have to transform your predictions in the range [0, 1] (e.g., `[0.11582418 0.04812204]`) into discrete classes such as 0 and 1, or Yellow and Blue. This means you will have _to take a decision_, and that decision should take the business context into account. 

For example, if you want to raise a warning when a person might have cancer, you probably don't want to raise it only when the predicted probability exceeds 50%, right? In this case a False Positive has milder consequences than a False Negative, and the person will thank you for that.

This action of _taking a decision_ is generally done by setting a **threshold** on your predictions: predictions above that threshold are set to `1`, and those at or below it are set to `0`.
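The cancer example boils down to a trade-off: lowering the threshold flags more samples, so you catch more true positives (recall goes up) at the cost of more false alarms (precision usually goes down). A minimal sketch on ten hypothetical samples (labels and probabilities are made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities for 10 samples
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
probas_toy = np.array([0.05, 0.10, 0.20, 0.25, 0.30, 0.45, 0.55, 0.60, 0.80, 0.90])

results = {}
for threshold in (0.5, 0.15):
    # Strictly above the threshold -> 1, otherwise 0
    preds = (probas_toy > threshold).astype(int)
    results[threshold] = (precision_score(y_true, preds), recall_score(y_true, preds))
    print('threshold=%.2f  positives=%d  precision=%.3f  recall=%.3f'
          % (threshold, preds.sum(), *results[threshold]))
```

Here dropping the threshold from 0.5 to 0.15 raises recall from 0.6 to 1.0 while precision falls from 0.75 to 0.625.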

Let's do it...

In [None]:
def binarize_probas(probas, threshold):
    """
    Transform probas to 0 or 1 depending on the threshold.
    
    Inputs:
        probas: predicted probabilities (numpy.array, of shape (N,))
        threshold: threshold to convert probas in binary vector (float)
    
    Output:
        predictions (numpy.array, of shape (N,)), dtype=int
    """
    # Transform your float array of `probas` to an int array where
    # the value 0 is below or equal to 'threshold' and 1 is above the 
    # 'threshold'
    # predictions = ...
    ### BEGIN SOLUTION
    predictions = (probas > threshold).astype(int)
    ### END SOLUTION
    
    return predictions

In [None]:
my_threshold = 0.15

predictions = binarize_probas(probas, my_threshold)
print('Array of predictions:', predictions[-36:])
print('Number of 1s (above threshold):', predictions.sum())

Expected output:

    Array of predictions: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
    Number of 1s (above threshold): 168

In [None]:
### BEGIN TESTS
assert predictions.sum() == 168
assert len(predictions) == 3300, "The length of the variable 'predictions' is expected to be 3300."
assert predictions[-1] == 0
### END TESTS

# Exercise 4: Get the TP, FP, TN, FN
The TP, FP, TN and FN can be obtained by using the [confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix) of sklearn. Let's use it (check its documentation if you need to). We need these counts to calculate the accuracy, precision and recall in the next exercise.

**Important**: you must have completed the previous exercise correctly.
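A hint on the layout: in sklearn's binary confusion matrix, rows are the true labels and columns are the predicted labels, both sorted (0 then 1), so flattening it row by row yields TN, FP, FN, TP. A small sketch on hypothetical arrays; passing `labels=[0, 1]` keeps the matrix 2x2 even if one class is missing from the data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and binary predictions
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1])

# Note the argument order: true labels first, predictions second
# Layout: [[TN, FP],
#          [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Flattening row by row gives the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print('TN=%d FP=%d FN=%d TP=%d' % (tn, fp, fn, tp))
```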

In [None]:
def get_confmat(predictions, y_true):
    """
    Calculate the TP, FP, TN, FN using the sklearn confusion matrix.
    
    Inputs:
        predictions: predictions (0 or 1) (numpy.ndarray)
        y_true: true labels (0 or 1) (numpy.ndarray)
        
    Output:
        Dictionary with TP, FP, TN and FN values
    """
    # Get the TP, FP, TN, FN from `confusion_matrix(...)` of sklearn
    # Assign to the following variables:
    # tn, fp, fn, tp = ...
    ### BEGIN SOLUTION
    tn, fp, fn, tp = confusion_matrix(y_true, predictions).ravel()
    ### END SOLUTION
    
    return {'TP': tp, 'FP': fp, 'TN': tn, 'FN': fn}

In [None]:
confmat = get_confmat(predictions, y_test.values)

print(confmat)

Expected output:
    
    {'TP': 64, 'FP': 104, 'TN': 2971, 'FN': 161}

In [None]:
### BEGIN TESTS
assert confmat['TP'] == 64
assert confmat['FP'] == 104
assert confmat['TN'] == 2971
assert confmat['FN'] == 161
### END TESTS

# Exercise 5: Calculating Accuracy, Precision and Recall by hand
The best way to learn how things work is to do them by hand. Let's implement the following three simple metrics by hand: 
- **Accuracy** is the fraction of correct predictions (sklearn's `accuracy_score` returns the count instead when `normalize=False`). It is given by:  

$$ A = \frac{TP + TN}{TP + TN + FP + FN} $$


- **Precision** is the ability of the classifier not to label as positive a sample that is negative (i.e., a measure of result relevancy).
$$ P = \frac{TP}{TP+FP} $$  
  
  
- **Recall** is the ability of the classifier to find all the positive samples (i.e., a measure of how many truly relevant results are returned).
$$ R = \frac{TP}{TP+FN} $$  
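As a worked example, plugging in the counts from the expected output of Exercise 4 (TP=64, FP=104, TN=2971, FN=161) gives the numbers below. The sketch also rebuilds label and prediction arrays with exactly those counts (a construction made up purely for this cross-check) to confirm that sklearn's functions agree with the formulas.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts taken from the expected output of Exercise 4
tp, fp, tn, fn = 64, 104, 2971, 161

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 3035 / 3300
precision = tp / (tp + fp)                   # 64 / 168
recall = tp / (tp + fn)                      # 64 / 225

# Rebuild label/prediction arrays with exactly these counts to cross-check sklearn
y_true = np.array([1] * tp + [0] * fp + [0] * tn + [1] * fn)
y_pred = np.array([1] * tp + [1] * fp + [0] * tn + [0] * fn)

assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
print('Accuracy: %.4f, Precision: %.4f, Recall: %.4f' % (accuracy, precision, recall))
```

Note how the 2971 true negatives dominate the accuracy, while precision and recall ignore them entirely.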
  

In [None]:
def calc_metrics(confmat):
    """
    Calculate Accuracy, Precision and Recall performance metrics.
    DO NOT use sklearn - Implementation by hand.
    
    Inputs:
        confmat: Dictionary with TP, FP, TN and FN values (dict)
        
    Output:
        Dictionary with accuracy, precision and recall metrics
    """
    # Extracting the needed metrics - to ease readability
    tn = confmat['TN']
    fp = confmat['FP']
    fn = confmat['FN']
    tp = confmat['TP']
    
    # Calculate Accuracy and assign it to the variable 'accuracy'
    # accuracy = ...
    ### BEGIN SOLUTION
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    ### END SOLUTION
    
    # Calculate Precision and assign it to the variable 'precision'
    # precision = ...
    ### BEGIN SOLUTION
    precision = tp / (tp + fp)
    ### END SOLUTION
    
    # Calculate Recall and assign it to the variable 'recall'
    # recall = ...
    ### BEGIN SOLUTION
    recall = tp / (tp + fn)
    ### END SOLUTION
    
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall}

In [None]:
metrics = calc_metrics(confmat)
print('Accuracy: %.2f' % metrics['accuracy'])
print('Precision: %.2f' % metrics['precision'])
print('Recall: %.2f' % metrics['recall'])

Expected output:

    Accuracy: 0.92
    Precision: 0.38
    Recall: 0.28

In [None]:
### BEGIN TESTS
assert np.isclose(metrics['accuracy'], 0.9196969696969697, atol=1e-4)
assert np.isclose(metrics['precision'], 0.38095238095238093, atol=1e-4)
assert np.isclose(metrics['recall'], 0.28444444444444444, atol=1e-4)
### END TESTS

# Exercise 6: Calculate AU ROC curve using Sklearn
The Receiver Operating Characteristic (ROC) curve is a very common (and important) metric for **binary classification problems**. 

**Formally**, it is created by plotting the fraction of true positives out of the positives (TPR = true positive rate, a.k.a., sensitivity) vs. the fraction of false positives out of the negatives (FPR = false positive rate, or 1-specificity), at various threshold settings.  
- The [**`roc_auc_score`**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) function computes the Area Under the ROC curve (AUROC). I.e., the curve information is summarized in one number.  
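A useful way to read the AUROC: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties counting as half), so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below checks this pairwise view against `roc_auc_score` on hypothetical labels and scores; comparing all pairs costs O(n_pos × n_neg) memory, so it is meant for intuition, not for large datasets.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score (broadcasting -> matrix)
diff = pos[:, None] - neg[None, :]
# Fraction of (positive, negative) pairs ranked correctly; ties count as half
auc_by_pairs = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print('Pairwise AUC:', auc_by_pairs)
print('sklearn AUC: ', roc_auc_score(y_true, scores))
```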

Let's check its value.

In [None]:
def get_auc(probas, y_true):
    """
    Get the AU ROC taking the inputs:
    - probas: your predictions (e.g., probabilities)
    - y_true: the actual outcomes (0 or 1)
    """
    
    # Calculate the Area Under ROC Curve. Use the sklearn implementation
    # 'roc_auc_score(...)'
    # auc = ...
    ### BEGIN SOLUTION
    auc = roc_auc_score(y_true, probas)
    ### END SOLUTION
    
    return auc

In [None]:
auc = get_auc(probas, y_test)
print('Area Under ROC curve: %.4f' % auc)

Expected output:

    Area Under ROC curve: 0.6948

In [None]:
### BEGIN TESTS
assert np.isclose(auc, 0.6948, atol=1e-3)
### END TESTS

Looks like the accuracy metric was somewhat misleading, right? Despite an accuracy of about 92%, an AUC of about 0.69 (where 0.5 would be random ranking) shows that our classifier is not that good.

# [EXTRA]: Taking a look at the ROC curve
Taking a visual look at the ROC curve is also important to diagnose model problems. For example, if the curve crosses the diagonal of the chart (which corresponds to random guessing), you might have a problem. So it is recommended to combine the AUROC summary metric with a visualization of the curve.

Let's take a look at it.
You can use the [**`roc_curve`**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve) of sklearn to compute Receiver Operating Characteristic (ROC) curve points.
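Each point returned by `roc_curve` is just an (FPR, TPR) pair obtained by predicting 1 when the score is greater than or equal to one of the returned thresholds (note the `>=`, whereas `binarize_probas` above uses `>`). A sketch on hypothetical data that recomputes those points by hand:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

fpr_toy, tpr_toy, thresholds = roc_curve(y_true, scores)

# Recompute each point by hand: predict 1 when score >= threshold
n_pos = (y_true == 1).sum()
n_neg = (y_true == 0).sum()
manual_tpr = np.array([((scores >= t) & (y_true == 1)).sum() / n_pos for t in thresholds])
manual_fpr = np.array([((scores >= t) & (y_true == 0)).sum() / n_neg for t in thresholds])

print('thresholds:', thresholds)
print('sklearn FPR:', fpr_toy, ' manual FPR:', manual_fpr)
print('sklearn TPR:', tpr_toy, ' manual TPR:', manual_tpr)
```

The first threshold sklearn returns is above every score, which produces the (0, 0) starting point of the curve.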

In [None]:
# Get the False Positive Rate and the True Positive Rate values
fpr, tpr, _ = roc_curve(y_test, probas)

In [None]:
# Call a handy plotting function (you can look at its code in the utils.py file)
plot_roc_curve(auc, fpr, tpr)

Looking good, right?
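If you don't have `utils.py` at hand, a minimal stand-in might look like the sketch below. It is a hypothetical `plot_roc_sketch`, not the actual `plot_roc_curve` implementation: it draws the ROC curve together with the random-guess diagonal mentioned above. In the notebook you could call it as `plot_roc_sketch(auc, fpr, tpr)`; here it runs on made-up data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score


def plot_roc_sketch(auc, fpr, tpr, ax=None):
    """Hypothetical stand-in for utils.plot_roc_curve: ROC curve plus random-guess diagonal."""
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    ax.plot(fpr, tpr, label='ROC curve (AUC = %.4f)' % auc)
    # The diagonal is what a random classifier would produce
    ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='Random guess')
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1.05)
    ax.set_xlabel('False Positive Rate (1 - specificity)')
    ax.set_ylabel('True Positive Rate (sensitivity)')
    ax.legend(loc='lower right')
    return ax


# Hypothetical labels and scores, just to exercise the function
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])
fpr_toy, tpr_toy, _ = roc_curve(y_true, scores)
ax = plot_roc_sketch(roc_auc_score(y_true, scores), fpr_toy, tpr_toy)
```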