# <font color='#eb3483'> Evaluating Classification Models </font>
In this module, we'll explore classification models in more depth, focusing on how to evaluate them to see how well they perform.

In [None]:
from IPython.display import Image
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

import seaborn as sns
sns.set(rc={'figure.figsize':(6,6)}) #Set our seaborn aesthetics (we're going to customize our figure size)

import warnings
warnings.simplefilter("ignore")

## <font color='#eb3483'> Preparing our Data </font>

We'll keep using our breast cancer dataset from the last module. We'll quickly get it into a format that will facilitate exploring model evaluation.

In [None]:
# load the dataset 
cancer = ...

# create a new dataframe called "cancer_df" using the data we loaded.
# the columns are from the "feature_names" key
cancer_df = pd.DataFrame(...)

# add a new column, "target", from the "target" key
cancer_df["target"] = ...

cancer_df.head()

In [None]:
cancer["target_names"]

We see that this dataset encodes 0 as malignant and 1 as benign.

**We are going to swap the two labels (0 becomes 1 and 1 becomes 0), so the positive class is malignant. We do this because usually a positive test means detection of cancer.**

In [None]:
# swap the labels in the "target" column: 0s become 1s and 1s become 0s
cancer_df["target"] = ...

We can see the percentage of cases for each class (positive and negative).

In [None]:
# check the proportion of each class in the "target" column. Hint: use value_counts(normalize=True)

We see there are 62.7% negative cases (benign) and 37.3% positives (malignant).

## <font color='#eb3483'> Training a Model </font>

We will start by training a simple Logistic Regression.

In [None]:
# import train_test_split from scikit-learn
from sklearn.model_selection import ...

# import the LogisticRegression class from scikit-learn
from sklearn.linear_model import ...

# import metrics from scikit-learn
from sklearn import ...

In [None]:
# split data into training and testing
X = ...
y = ...

# use 30% of the dataset as the testing set, and set random_state to 42
X_train, X_test, y_train, y_test = ...

We fit the model and generate predicted labels and prediction probabilities.

In [None]:
# create and fit the logistic regression

model = ...

model.fit(...)

# get the predictions
predictions = ...
true_classes = y_test
prediction_probabilities = model.predict_proba(X_test)

We create an auxiliary function that returns a list of true target values and their predicted labels

# <font color='#eb3483'> Binary Classification Concepts </font>

In binary classification we have *negative cases* (class 0, which in the Breast Cancer dataset are the benign samples) and *positive cases* (class 1, the malignant samples).

- Positive Cases: Cases of class 1 (malignant)
- Negative Cases: Cases of class 0 (benign)

These 2 classes combined with the predictions bring us to 4 possible combinations:

- True positives (TP), would be the samples that are malignant and are correctly classified as malignant. 
- False positives (FP), would be the benign samples that are incorrectly classified as malignant.
- True Negatives (TN), would be the benign samples that are correctly classified as benign.
- False Negatives (FN), would be the malignant samples that are incorrectly classified as benign.
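
The four outcomes above can be counted directly from two label arrays. The sketch below uses a small set of made-up labels (not the breast cancer data) and plain NumPy comparisons; the variable names are only illustrative.

```python
import numpy as np

# Made-up labels for illustration: 1 = malignant (positive), 0 = benign (negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Each outcome is a combination of the true class and the predicted class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # malignant, predicted malignant
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # benign, predicted malignant
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # benign, predicted benign
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # malignant, predicted benign

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```

The four counts always add up to the number of observations, which is a quick sanity check when computing them by hand.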

![title](media/classification_errors.png)

## <font color='#eb3483'> **Confusion Matrix** </font>

We can use a confusion matrix to easily compare how a classifier has classified each one of the classes.

![title](media/confusion_matrix.png)

In [None]:
# import confusion_matrix from sklearn.metrics 

# create a confusion matrix using the true classes and predictions
confusion_matrix(...)
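
As a small illustration on made-up labels (not the notebook's model), the sketch below shows how scikit-learn lays out the matrix: rows are the true classes and columns are the predicted classes, both in sorted label order. That orientation may differ from the one drawn in a given figure, so it is worth checking before reading the cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = malignant (positive), 0 = benign (negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For binary labels 0/1 the layout is [[TN, FP], [FN, TP]],
# so ravel() unpacks the cells in that order
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```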

# <font color='#eb3483'> Classification Metrics </font>

<font color='#eb3483'> **Accuracy** </font>

Accuracy is a general measure of the model's performance. It simply measures the percentage of cases correctly classified.

$$Accuracy=\frac{\text{Number of correctly classified observations}}{\text{Total Number of observations}}= \frac{TP+TN}{TP+TN+FP+FN}$$

Sklearn has a function that calculates the accuracy

In [None]:
# get the accuracy using the true classes and predictions
metrics.accuracy_score(...)
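
To connect the formula with the function, here is a minimal sketch on made-up labels (not the notebook's predictions) that computes accuracy from the four counts and compares it with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)
```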

<font color='#eb3483'> **Precision** </font>

Precision measures how trustworthy the model's positive predictions are: of all the cases the model classified as positive, what fraction are actually positive.

$$Precision=\frac{\text{Number of positive cases correctly classified}}{\text{Number of cases classified as positive}}= \frac{TP}{TP+FP}$$

In [None]:
# get the precision using the true classes and predictions
metrics.precision_score(...)
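
The same idea applied to precision, again on made-up labels: only the cases predicted as positive enter the calculation.

```python
from sklearn.metrics import precision_score

# Made-up labels for illustration: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted positive and truly positive
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted positive but truly negative

# Precision = TP / (TP + FP)
manual_precision = tp / (tp + fp)
sklearn_precision = precision_score(y_true, y_pred)
print(manual_precision, sklearn_precision)
```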

<img src="media/precision_accuracy.png" style="width:30em;">

<font color='#eb3483'> **Recall (True Positive Rate, TPR)** </font>
 
Recall gives us an idea of the model's ability to find (detect) all positive cases.

$$Recall=\frac{\text{Number of positive cases correctly classified}}{\text{Number of positive cases}}= \frac{TP}{TP+FN}$$


![title](media/precision_recall.png)

In [None]:
# get the recall using the true classes and predictions
metrics.recall_score(...)
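
A quick sketch, on made-up labels, of why recall should not be read in isolation: a hypothetical classifier that labels every case as positive finds all the positives (recall of 1) but its precision drops to the share of positives in the data.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Recall = TP / (TP + FN): 3 of the 4 malignant cases are detected
recall = recall_score(y_true, y_pred)

# A hypothetical classifier that predicts "malignant" for everything
y_pred_all_positive = [1] * len(y_true)
recall_all = recall_score(y_true, y_pred_all_positive)
precision_all = precision_score(y_true, y_pred_all_positive)

print(recall, recall_all, precision_all)
```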

<font color='#eb3483'> **F1 Score** </font>

The F1 score balances recall (which rewards finding as many positive cases as possible) against precision (which rewards classifying as positive only the real positive cases, limiting false positives).

It is defined as the harmonic mean of precision and recall, so it is only high when both are high.

$$F1=2*\frac{1}{\frac{1}{precision}+\frac{1}{recall}}=2*\frac{precision*recall}{precision+recall}$$

The F1 score is available in scikit-learn.

In [None]:
# get the F1 score using the true classes and predictions
metrics.f1_score(...)
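
To see what the harmonic mean does, consider a hypothetical classifier on made-up labels that predicts positive for every case. Its precision is low and its recall is perfect; the arithmetic mean would look flattering, while the harmonic mean is pulled toward the weaker of the two.

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration: 4 malignant (1) and 6 benign (0) cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1] * len(y_true)  # hypothetical "always positive" classifier

precision = 4 / 10  # TP / (TP + FP)
recall = 4 / 4      # TP / (TP + FN)

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)
sklearn_f1 = f1_score(y_true, y_pred)

print(arithmetic_mean, harmonic_mean, sklearn_f1)
```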

###  <font color='#eb3483'> How does a model classify? </font>

An algorithm like logistic regression predicts by measuring distances to a "decision boundary" that are then transformed into class probabilities.
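
A minimal sketch of that transformation, assuming a binary `LogisticRegression` fitted on synthetic data from `make_classification` (not the breast cancer data). `decision_function` returns a signed score that is proportional to the distance from the boundary (0 means exactly on it), and passing that score through the sigmoid function gives the probability of the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Signed score: positive on the class-1 side of the boundary, negative on the class-0 side
scores = clf.decision_function(X)

# The sigmoid squashes the score into the (0, 1) range
sigmoid_of_scores = 1 / (1 + np.exp(-scores))

# Columns of predict_proba follow the order of clf.classes_
positive_probabilities = clf.predict_proba(X)[:, 1]

print(clf.classes_)
print(np.allclose(sigmoid_of_scores, positive_probabilities))
```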

But at the end of the day we need to know which class to assign to a new observation, and not just the predicted probabilities. Classifiers do that by defining a *threshold* and then assigning a negative class to all those cases with probabilities lower than the threshold and positive those above it.

![title](media/threshold.png)

We usually use the method `model.predict` to predict a target variable. However, some models also have a `predict_proba` method that returns the probability the model assigns to the observation belonging to each of the classes.

For the binary classification case, `predict_proba` will predict for each observation the probabilities of it being a negative and the probabilities of being a positive case.

In [None]:
# get the first 5 prediction probabilities
prediction_probabilities[...]

In [None]:
df = pd.DataFrame({"true_class":true_classes,
                   "pred_class": predictions,
                   "probabilities_0":model.predict_proba(X_test)[:,0],
                    "probabilities_1":model.predict_proba(X_test)[:,1],
                  })

df["sum_probas"] = df.probabilities_0 + df.probabilities_1

df.sum_probas.head()

We see that for each row, the sum of the probabilities is 1 (which makes sense since it's the whole sample space).

In [None]:
df.sample(10)

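How does the scikit-learn classifier choose a threshold? Because it has no additional information, it simply sets the threshold to 0.5. The two queries below look for rows that would contradict this rule, so both should come back empty.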

In [None]:
df.query("probabilities_1>0.5 & pred_class==0")

In [None]:
df.query("probabilities_0>0.5 & pred_class==1")

## <font color='#eb3483'> Area Under the Curve (ROC-AUC) </font>

The Receiver Operating Characteristic [(ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) is a curve used to evaluate how recall (the true positive rate, TPR) and the false positive rate (FPR) change with the threshold. It shows how the model balances two opposing goals: correctly classifying all positive cases while keeping false positives low.

Just to see how the predictions change based on our threshold level - let's create a function that takes the class probabilities and a desired threshold and returns the predicted class based off that threshold

In [None]:
def probabilities_to_classes(prediction_probabilities, threshold=0.5):
    predictions = np.zeros([len(prediction_probabilities), ])
    predictions[prediction_probabilities[:,1]>=threshold] = 1
    return predictions

In [None]:
# get the first 10 prediction probabilities
prediction_probabilities[...]

Now we can easily convert those probabilities to predictions.

In [None]:
probabilities_to_classes(prediction_probabilities, threshold=0.5)[:10]

If the threshold is closer to 0, more observations will be predicted as positive.

In [None]:
probabilities_to_classes(prediction_probabilities, threshold=0.00001)[:10]

And if the desired threshold is closer to 1, fewer observations will be predicted as positive (only those where the model is really sure about them being positive).

In [None]:
probabilities_to_classes(prediction_probabilities, threshold=0.99999)[:10]
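
Each threshold produces one (FPR, TPR) pair, and the ROC curve is simply those pairs plotted for every possible threshold. The sketch below works this out by hand for four made-up scores; the data and the chosen thresholds are illustrative only.

```python
import numpy as np

# Made-up true classes and positive-class probabilities
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

roc_points = []
for threshold in [0.0, 0.3, 0.5, 0.9]:
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    tpr = float(tp / (tp + fn))  # recall: share of positives that were detected
    fpr = float(fp / (fp + tn))  # share of negatives wrongly flagged as positive
    roc_points.append((fpr, tpr))
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point toward the top-right corner (everything is flagged), and raising it moves the point toward the bottom-left corner (nothing is flagged).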

The area under the ROC curve (that is, the part of the chart below the curve) is called **Area under the Curve (ROC-AUC or simply AUC)** and is one of the most common metrics in classification problems. It ranges from 0 to 1: 0.5 corresponds to a random classifier and 1 to a perfect classifier, while values below 0.5 mean the model ranks cases worse than chance.

In [None]:
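# AUC is computed from the positive-class probabilities, not from the hard 0/1 predictions
metrics.roc_auc_score(true_classes, prediction_probabilities[:, 1])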
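
A small sketch, on made-up scores, of two points about the AUC: it is literally the area under the points returned by `metrics.roc_curve`, and it should be computed from scores or probabilities. Passing hard 0/1 labels collapses the curve to a single threshold and generally gives a different number.

```python
from sklearn import metrics

# Made-up true classes and positive-class probabilities
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.7, 0.6, 0.8, 0.9]

# Area under the ROC points, computed with the trapezoidal rule
fpr, tpr, _ = metrics.roc_curve(y_true, scores)
auc_from_curve = metrics.auc(fpr, tpr)

# Same quantity in one call
auc_from_scores = metrics.roc_auc_score(y_true, scores)

# What happens if hard labels (threshold 0.5) are passed instead of scores
hard_labels = [int(s >= 0.5) for s in scores]
auc_from_labels = metrics.roc_auc_score(y_true, hard_labels)

print(auc_from_curve, auc_from_scores, auc_from_labels)
```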

We can use `sklearn.metrics.roc_curve` to generate the false positive rate and true positive rate values automatically, and then plot them next to a random classifier for comparison. *(No need to learn all of it; it's mostly messy matplotlib code.)*

In [None]:
def roc_curve(true_classes, predictions, prediction_probabilities):
    fpr, tpr, _ = metrics.roc_curve(true_classes, prediction_probabilities[:,1])
    # use the positive-class probabilities so the area matches the plotted curve
    roc_auc = metrics.roc_auc_score(true_classes, prediction_probabilities[:,1])

    sns.mpl.pyplot.fill_between(fpr, tpr, step='post', alpha=0.2,color='b')
    sns.lineplot(x=fpr, y=tpr, linestyle='--', label='ROC Curve(area = %0.2f)' % roc_auc)
    sns.lineplot(x=[0,1], y= [0,1], linestyle='--', label = 'Random Classifier')
    
    sns.mpl.pyplot.xlabel('FPR')
    sns.mpl.pyplot.ylabel('TPR (recall)')
    sns.mpl.pyplot.title('ROC Curve')

roc_curve(true_classes, predictions, prediction_probabilities)

## <font color='#eb3483'> Precision-Recall Curve </font>

The precision-recall curve gives us an idea of how precision and recall vary depending on the threshold value.

We can use scikit-learn's `metrics.precision_recall_curve` to calculate the steps for the curve directly.

In [None]:
sns.lineplot?

In [None]:
def precision_recall_curve(true_classes, prediction_probabilities):
    precision_, recall_, _ = metrics.precision_recall_curve(
        true_classes, prediction_probabilities[:,1])

    # pass x and y as keywords; recent seaborn versions no longer accept them positionally
    sns.lineplot(x=recall_, y=precision_, drawstyle='steps-pre', ci=None)
    sns.mpl.pyplot.fill_between(recall_, precision_, step='post', alpha=0.2,
                 color='b')

    sns.mpl.pyplot.xlabel('Recall')
    sns.mpl.pyplot.ylabel('Precision')
    sns.mpl.pyplot.title('Precision-Recall Curve');


precision_recall_curve(true_classes, prediction_probabilities)
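
To see the numbers behind such a curve, the sketch below calls `metrics.precision_recall_curve` on four made-up scores and prints one row per threshold. The function returns one more precision/recall value than thresholds: the final pair (precision 1, recall 0) is an end point with no threshold attached.

```python
from sklearn import metrics

# Made-up true classes and positive-class probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = metrics.precision_recall_curve(y_true, scores)

# zip stops at the shortest sequence, so the extra end point is left out of the table
for t, p, r in zip(thresholds, precision, recall):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")

print("end point:", precision[-1], recall[-1])
```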

And now the obvious question, **why do we need so many metrics, can't we just use accuracy?** Check out the advanced exercises for an answer :)