# Classification Model Evaluation

Common ways of evaluating a __classification__ model's performance.
> A model is an algorithm/classifier that is fit to the training set.
 https://docs.aws.amazon.com/machine-learning/latest/dg/training-ml-models.html
 
1. __Confusion matrix__: a cross-tabulation of a model's predictions against the actual outcomes.
- A confusion matrix describes the performance of a classification model by counting each combination of predicted and actual class.
https://en.wikipedia.org/wiki/Confusion_matrix

In [39]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

In [3]:
# This is a simplified version of model evaluation, meant to teach the
# fundamental tools used to evaluate models.
df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee',
               'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee',
                   'coffee', 'coffee', 'coffee', 'no coffee',
                   'no coffee'],
})
print("Our model predicts whether or not someone likes coffee.")
df

Our model predicts whether or not someone likes coffee.


| |actual|prediction|
|:---|:---|:---|
|0|coffee|no coffee|
|1|no coffee|no coffee|
|2|no coffee|coffee|
|3|coffee|coffee|
|4|coffee|coffee|
|5|coffee|coffee|
|6|no coffee|no coffee|
|7|coffee|no coffee|


In [8]:
# Look at pd.crosstab docs to understand kwargs.
# pd.crosstab?

In [11]:
# This is a confusion matrix
pd.crosstab(df.prediction,
            df.actual,
            margins=True,
            margins_name='total')

|prediction|coffee (actual)|no coffee (actual)|total|
|:---|:---|:---|:---|
|coffee|3|1|4|
|no coffee|2|2|4|
|total|5|3|8|


In [23]:
# The function accepts actual outcome, predicted outcome.
# Actual values first, predicted values second.
confusion_M = confusion_matrix(df.actual, df.prediction)
confusion_M

array([[3, 2],
       [1, 2]])

> Working through this simple example, I understand the contents and layout of a confusion matrix!

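The positions below follow the `pd.crosstab` table above (rows = prediction, columns = actual). The array returned by `confusion_matrix` puts the actual values on the rows, so its off-diagonal cells are swapped: there, the top right holds the 2 false negatives and the bottom left holds the 1 false positive.

|Crosstab Position|Outcome|Prediction|Actual|# of People|
|:---|:---|:---|:---|:---|
|Top Left|True Positive|coffee|coffee|3|
|Bottom Right|True Negative|no coffee|no coffee|2|
|Top Right|False Positive/Type I Error|coffee|no coffee|1|
|Bottom Left|False Negative/Type II Error|no coffee|coffee|2|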


|Outcome|English|IRL outcome if put into production|
|:---|:---|:---|
|True Positive|Jarvis predicts a person likes coffee and they do like coffee.|Customer gets coffee. OK.|
|True Negative|Jarvis predicts a person does not like coffee and they do not like coffee.|Customer does not get coffee. OK.|
|False Positive|Jarvis predicts a person likes coffee and they do not like coffee.|Customer gets coffee they didn't ask for. Awkward...|
|False Negative|Jarvis predicts a person does not like coffee and they do like coffee.|Customer doesn't get coffee they wanted. Karen transforms into Godzilla.|
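
To make the cell layout explicit, the array from `confusion_matrix` can be wrapped in a labelled DataFrame. This is a minimal sketch that rebuilds the example data inline; the names `labels`, `matrix`, and `labeled` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

actual = ['coffee', 'no coffee', 'no coffee', 'coffee',
          'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee',
              'coffee', 'coffee', 'no coffee', 'no coffee']

# Listing the positive class first makes the array read [[TP, FN], [FP, TN]].
labels = ['coffee', 'no coffee']
matrix = confusion_matrix(actual, prediction, labels=labels)

# Rows are the actual class, columns are the predicted class.
labeled = pd.DataFrame(
    matrix,
    index=[f'actual: {name}' for name in labels],
    columns=[f'predicted: {name}' for name in labels],
)
print(labeled)
```

Passing `labels` explicitly avoids relying on the alphabetical ordering that scikit-learn applies to string classes by default.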

## Baseline model
__DummyClassifier__ is a classifier that makes predictions using simple rules. This classifier is useful as a simple baseline to compare with other (real) classifiers.

https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier
<div class="alert alert-block alert-danger">
Do not use it for real problems.
</div>

In [6]:
# Create a dummy classifier with the 'most_frequent' strategy.
# It always predicts the most frequent ACTUAL class/grouping.
baseline_classifier = DummyClassifier(strategy='most_frequent')

# Fit the dummy classifier. It ignores the feature values (here df.prediction)
# and only learns the most frequent class in the target, df.actual.
baseline_classifier.fit(df.prediction, df.actual)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

In [7]:
# The dummy classifier pokemon evolves into its final form, Dummy Model.
# The model predicts the most frequent ACTUAL class, not the most
# frequent prediction: df.actual has 5 'coffee' and 3 'no coffee',
# so the classifier predicts that everyone likes coffee. If this
# model were used in an ML product it would give everyone coffee.
# Oprah would be proud.
baseline_classifier.predict(df.prediction)  # EVERYONE LIKES COFFEE!!!

array(['coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'coffee',
       'coffee', 'coffee'], dtype='<U6')

In [22]:
# But IRL it only gets the prediction right 5/8 times, or 62.5%.
# The score returned is the model's accuracy.
accuracy = baseline_classifier.score(df.prediction, df.actual)
print(f"The baseline model's accuracy is {accuracy:.2%}")

The baseline model's accuracy is 62.50%
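
The baseline accuracy can also be checked without scikit-learn, since a `most_frequent` baseline is right exactly as often as the majority class occurs. A small sketch, assuming the same eight actual outcomes as above (variable names are illustrative):

```python
import pandas as pd

actual = pd.Series(['coffee', 'no coffee', 'no coffee', 'coffee',
                    'coffee', 'coffee', 'no coffee', 'coffee'])

# Share of each class among the actual outcomes.
class_share = actual.value_counts(normalize=True)

# The majority class and how often always predicting it would be correct.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()
print(f"Always predicting '{majority_class}' is right {baseline_accuracy:.2%} of the time")
```

Any real classifier should beat this number before it is worth considering.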


# Classification Model Evaluation Metrics

https://www.ritchieng.com/machine-learning-evaluate-classification-model/

### 1. Classification Accuracy
Of the _total_ outcomes, how many times does the classification model make __correct__ predictions?

The accuracy formula is derived as:

|Correct Predictions| = |True Positives + True Negatives|
|:---|:---|:---|
|Total Number of Predictions| = |True Positives + True Negatives + False Positives + False Negatives|


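Using the accuracy evaluation metric: Jarvis's predictions in `df.prediction` are correct 5 times out of 8 (3 true positives + 2 true negatives), or 62.5%. This happens to equal the baseline model's accuracy.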

In [28]:
# Longhand method
# ravel() flattens the matrix row by row. Because 'coffee' sorts before
# 'no coffee', the order here is TP, FN, FP, TN (positive class = 'coffee').
tp, fn, fp, tn = confusion_M.ravel()

# Use the accuracy formula above to calculate the model's accuracy
accuracy = (tp + tn) / sum([tp, tn, fp, fn])
print(f"The model's accuracy is {accuracy:.2%}")

The model's accuracy is 62.50%
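
As a cross-check on the longhand calculation, scikit-learn's `accuracy_score` computes the same ratio directly from the two label lists. A minimal sketch with the example data rebuilt inline:

```python
from sklearn.metrics import accuracy_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee',
          'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee',
              'coffee', 'coffee', 'no coffee', 'no coffee']

# Fraction of predictions that match the actual outcome.
model_accuracy = accuracy_score(actual, prediction)
print(f"accuracy_score gives {model_accuracy:.2%}")
```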


### 2. Classification Error
> Also known as the _misclassification rate_

Of the total outcomes, how many times does the classification model make __incorrect__ predictions?

The error formula is derived as:

|Incorrect Predictions| = |False Positives + False Negatives|
|:---|:---|:---|
|Total Number of Predictions| = |True Positives + True Negatives + False Positives + False Negatives|

The misclassification rate is also calculated as:

Classification Error = 1 - accuracy_rate

In [36]:
error = (fn + fp) / sum([tp, fp, tn, fn])
print(f"The model's error is {error:.2%}")

classification_error = 1 - accuracy
print(f"Error rate can be calculated as: 1 - accuracy = {classification_error:.2%}")

The model's error is 37.50%
Error rate can be calculated as: 1 - accuracy = 37.50%
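
The misclassification rate can also be computed straight from the labels, without unpacking the confusion matrix. The sketch below compares the two columns element by element and, as a cross-check, calls scikit-learn's `zero_one_loss`, which returns the fraction of misclassified samples by default. Variable names are illustrative.

```python
import pandas as pd
from sklearn.metrics import zero_one_loss

actual = pd.Series(['coffee', 'no coffee', 'no coffee', 'coffee',
                    'coffee', 'coffee', 'no coffee', 'coffee'])
prediction = pd.Series(['no coffee', 'no coffee', 'coffee', 'coffee',
                        'coffee', 'coffee', 'no coffee', 'no coffee'])

# Fraction of rows where the prediction disagrees with the actual label.
error_by_hand = (actual != prediction).mean()

# zero_one_loss reports the same fraction.
error_sklearn = zero_one_loss(actual, prediction)

print(f"By hand: {error_by_hand:.2%}, zero_one_loss: {error_sklearn:.2%}")
```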


### 3. Sensitivity
> Also known as __Recall__ or the _True Positive Rate_. It is not the same as __Precision__, which is TP / (TP + FP).

> Sensitivity (recall) is a metric that should be maximized.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html?highlight=recall#sklearn.metrics.recall_score

Of the outcomes that are actually positive, how many does the model correctly predict as __True Positives__?

The sensitivity formula is derived as:

|Correct TP Predictions| = |True Positives|
|:---|:---|:---|
|All Actual Positives| = |True Positives + False Negatives|

In [48]:
# Recall divides by all actual positives: true positives + false negatives.
recall = tp / (tp + fn)
print(f"The model's recall is {recall:.2%}")

sensitivity = sklearn.metrics.recall_score(df.actual, df.prediction, average="binary", pos_label="coffee")
print(f"The model's sensitivity is {sensitivity:.2%}")

The model's recall is 60.00%
The model's sensitivity is 60.00%
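
Recall and precision are easy to mix up, so it can help to compute both side by side on the same data. A minimal sketch using scikit-learn's `recall_score` and `precision_score` with `'coffee'` as the positive class:

```python
from sklearn.metrics import precision_score, recall_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee',
          'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee',
              'coffee', 'coffee', 'no coffee', 'no coffee']

# Recall: of the 5 people who actually like coffee, how many were found?
recall_value = recall_score(actual, prediction, pos_label='coffee')

# Precision: of the 4 'coffee' predictions, how many were correct?
precision_value = precision_score(actual, prediction, pos_label='coffee')

print(f"recall = {recall_value:.2%}, precision = {precision_value:.2%}")
```

On this data recall is 3/5 and precision is 3/4, which shows that the two metrics answer different questions.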


### 4. Specificity
> Also known as the _True Negative Rate_.

Of the outcomes that are actually negative, how often does the model correctly predict them as __True Negatives__?
> Specificity is a metric that should be maximized.

The specificity formula is derived as:

|Correct TN Predictions| = |True Negatives|
|:---|:---|:---|
|All Actual Negatives| = |True Negatives + False Positives|

In [50]:
# Specificity divides by all actual negatives: true negatives + false positives.
specificity = tn / (tn + fp)
print(f"The model's specificity is {specificity:.2%}")

The model's specificity is 66.67%
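
scikit-learn has no dedicated specificity function that this notebook uses, but specificity equals recall computed with the negative class as `pos_label`. The sketch below shows that, and contrasts it with the negative predictive value, TN / (TN + FN), which is a different quantity. Variable names are illustrative.

```python
from sklearn.metrics import precision_score, recall_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee',
          'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee',
              'coffee', 'coffee', 'no coffee', 'no coffee']

# Specificity: recall of the negative class, TN / (TN + FP).
specificity_value = recall_score(actual, prediction, pos_label='no coffee')

# Negative predictive value: precision of the negative class, TN / (TN + FN).
negative_predictive_value = precision_score(actual, prediction, pos_label='no coffee')

print(f"specificity = {specificity_value:.2%}")
print(f"negative predictive value = {negative_predictive_value:.2%}")
```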


In [53]:
print(sklearn.metrics.classification_report(df.actual, df.prediction))

              precision    recall  f1-score   support

      coffee       0.75      0.60      0.67         5
   no coffee       0.50      0.67      0.57         3

    accuracy                           0.62         8
   macro avg       0.62      0.63      0.62         8
weighted avg       0.66      0.62      0.63         8
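
When the numbers in the report are needed for further calculations rather than for reading, `classification_report` can return a nested dictionary via `output_dict=True`. A minimal sketch with the example data rebuilt inline:

```python
from sklearn.metrics import classification_report

actual = ['coffee', 'no coffee', 'no coffee', 'coffee',
          'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee',
              'coffee', 'coffee', 'no coffee', 'no coffee']

# A dict keyed by class name, plus 'accuracy', 'macro avg', 'weighted avg'.
report = classification_report(actual, prediction, output_dict=True)

coffee_recall = report['coffee']['recall']           # sensitivity
no_coffee_recall = report['no coffee']['recall']     # specificity
overall_accuracy = report['accuracy']

print(f"sensitivity = {coffee_recall:.2%}")
print(f"specificity = {no_coffee_recall:.2%}")
print(f"accuracy = {overall_accuracy:.2%}")
```

The recall of the `no coffee` row is the specificity when `coffee` is treated as the positive class.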

