# Classification Model Evaluation

## Examples:

Imagine you're bringing coffee to a meeting, and you need to predict whether each person at the meeting will want a coffee or not. Which metric should you choose? It depends. First, let's define our outcomes.

Outcomes:

- FP: Buy a coffee for someone who won't drink it
- FN: Don't buy a coffee for someone who wanted one
- TP: Buy a coffee for someone who will drink it
- TN: Don't buy a coffee for someone who wouldn't drink it anyway


Now that our outcomes are defined, we can weigh which outcomes are less desirable than others. Naturally, TP and TN are desirable, but if we had to choose between FP and FN, which is worse? It may depend on the scenario:

Scenarios:

- **Revolucion: good coffee, but expensive**

The cost of a false positive is higher than a false negative

Precision is the best metric to evaluate my predictions
- We want to be sure about our positive predictions (positive predictive value, PPV)

- **Taco Cabana: bad coffee, but cheap**

The cost of a false negative is higher than a false positive

Recall is the best metric to evaluate my predictions

- We want to make sure that everyone who wants a coffee gets a coffee, even if that means we give coffee where it's not wanted

- **Meeting with a super important client**

Cost of a false positive? The client doesn't drink the coffee. Maybe I'm out $5.

Cost of a false negative? The client throws papers in my face, storms out in anger, and the deal fails.

Recall is the best metric
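One way to make this trade-off concrete is to attach a cost to each kind of error and add them up. The sketch below is illustrative only: the helper `total_cost`, the error counts, and the dollar figures are all assumptions made up for the example, not measurements.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a set of predictions, given a cost per error type."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical buying strategies for the same meeting
cautious = {"fp": 0, "fn": 3}   # buys few coffees: no wasted cups, some people miss out
generous = {"fp": 3, "fn": 0}   # buys many coffees: nobody misses out, some cups wasted

# Hypothetical costs per error (in dollars)
expensive_shop = {"cost_fp": 5.00, "cost_fn": 1.00}  # wasted cups hurt more
cheap_shop = {"cost_fp": 1.00, "cost_fn": 3.00}      # missed coffees hurt more

for shop_name, costs in [("expensive", expensive_shop), ("cheap", cheap_shop)]:
    for strategy_name, errors in [("cautious", cautious), ("generous", generous)]:
        print(shop_name, strategy_name, total_cost(**errors, **costs))
```

Under these assumed costs, the cautious (precision-oriented) strategy is cheaper at the expensive shop, and the generous (recall-oriented) strategy is cheaper at the cheap shop.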

**What if you buy coffee for everyone or just don't buy any coffee?**

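This describes a baseline model: one that always predicts the same class, which gives us a reference point that any real model should beat.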
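To see what these two extremes do to precision and recall, here is a small sketch in plain Python. The list `wants_coffee` is made up for illustration, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for boolean labels; None when undefined."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    # precision is undefined when nothing was predicted positive
    precision = tp / (tp + fp) if (tp + fp) else None
    # recall is undefined when there are no actual positives
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Illustrative data: True means the person wants a coffee
wants_coffee = [True, False, False, True, True, True, False, True]

buy_for_everyone = [True] * len(wants_coffee)
buy_for_no_one = [False] * len(wants_coffee)

print(precision_recall(wants_coffee, buy_for_everyone))  # (0.625, 1.0)
print(precision_recall(wants_coffee, buy_for_no_one))    # (None, 0.0)
```

Buying for everyone gives perfect recall at the expense of precision, while buying for no one makes no positive predictions at all, so its precision is undefined and its recall is zero.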

### Mini Exercise

Scenario: Build a classifier to predict whether a given face should unlock the iPhone.

- What is the positive and negative case?

In [3]:
#Positive: Face should unlock the iPhone
#Negative: Face should not unlock the iPhone

- What are our possible outcomes?

In [4]:
#TP: We predict the face should unlock the phone, and the actual is that the face should unlock the phone.
#TN: We predict the face should not unlock the phone, and the actual is that the face should not unlock the phone.
#FP: We predict the face should unlock the phone, and the actual is that the face should NOT unlock the phone.
#FN: We predict the face should not unlock the phone, and the actual is that the face should unlock the phone.

- What are the costs of the outcomes?

In [None]:
#FP: Inappropriate phone access granted - bad bad bad
#FN: This is annoying, why doesn't my phone recognize my face - not quite as bad

- Which metric should we use?

In [5]:
#Precision

Scenario: Predict whether an email is spam or not. Emails marked as spam skip the inbox and go to the spam folder.

- What is the positive and negative case?

In [6]:
#Positive: Email is spam.
#Negative: Email is not spam.

- What are our possible outcomes?

In [8]:
#TP: Predict email is spam, and email is spam
#TN: Predict email is not spam, and email is not spam
#FP: Predict email is spam, and email is not spam
#FN: Predict email is not spam, and email is spam

- What are the costs of the outcomes?

In [9]:
#FP: We took a legit email and dumped it in the spam folder
#FN: We let spam in the inbox

- Which metric should we use?

In [10]:
# Let's go with accuracy

Scenario: Predict whether an email is a phishing attempt. When we predict positive, show an additional banner warning the user that this might be a phishing email.

- What is the positive and negative case?

In [11]:
#Positive: Email is a phishing attempt
#Negative: Email is not a phishing attempt

- What are our possible outcomes?

In [12]:
#TP: Predict email is phishing attempt, and it is.
#TN: Predict email is not phishing attempt, and it is not.
#FP: Predict email is phishing attempt, and it is not.
#FN: Predict email is not phishing attempt, and it is.

- What are the costs?

In [13]:
#FP: We show a warning banner on an innocent email
#FN: We fail to put a banner on a phishing attempt.

- What metric should we use?

In [14]:
#Recall

## Python Implementation

In [15]:
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})
df

|   | actual | prediction |
|---|--------|------------|
| 0 | coffee | no coffee |
| 1 | no coffee | no coffee |
| 2 | no coffee | coffee |
| 3 | coffee | coffee |
| 4 | coffee | coffee |
| 5 | coffee | coffee |
| 6 | no coffee | no coffee |
| 7 | coffee | no coffee |


In [17]:
pd.crosstab(df.prediction, df.actual)

| prediction | actual: coffee | actual: no coffee |
|------------|----------------|-------------------|
| coffee | 3 | 1 |
| no coffee | 2 | 2 |


Our choice of positive and negative is arbitrary. Here we treat `coffee` as the positive class:

- TP: predicted coffee, actual is coffee
- FP: predicted coffee, but actual is no coffee
- FN: predicted no coffee, but actual is coffee
- TN: predicted no coffee, actual is no coffee
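These four counts can also be computed directly with boolean masks instead of being read off the crosstab. A minimal sketch, assuming `coffee` is the positive class; the DataFrame is rebuilt here so the block runs on its own, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})

positive = 'coffee'  # assumption: 'coffee' is the positive class

predicted_pos = df.prediction == positive
actual_pos = df.actual == positive

tp = (predicted_pos & actual_pos).sum()    # predicted coffee, actual coffee
fp = (predicted_pos & ~actual_pos).sum()   # predicted coffee, actual no coffee
fn = (~predicted_pos & actual_pos).sum()   # predicted no coffee, actual coffee
tn = (~predicted_pos & ~actual_pos).sum()  # predicted no coffee, actual no coffee

print(tp, fp, fn, tn)  # 3 1 2 2
```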

## Metrics

- **accuracy**: (TP + TN) / (TP + TN + FP + FN)
    - (3 + 2) / (3 + 2 + 1 + 2) = 62.5%
    - the share of all predictions that are correct

- **precision**: TP / (TP + FP)
    - 3 / (3 + 1) = 75%
    - use when FP is more costly than FN

- **recall**: TP / (TP + FN)
    - 3 / (3 + 2) = 60%
    - use when FN is more costly than FP
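The three formulas translate directly into a few lines of Python. This is a sketch only: `classification_metrics` is a hypothetical helper name, and it does not guard against a zero denominator.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return accuracy, precision, recall

# Counts from the coffee example: TP=3, TN=2, FP=1, FN=2
accuracy, precision, recall = classification_metrics(tp=3, tn=2, fp=1, fn=2)
print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.625 precision=0.75 recall=0.60
```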

In [20]:
df.actual.value_counts()

coffee       5
no coffee    3
Name: actual, dtype: int64

In [21]:
# baseline: always predict the most common class, which is 'coffee'
df['baseline'] = 'coffee'

In [22]:
df

|   | actual | prediction | baseline |
|---|--------|------------|----------|
| 0 | coffee | no coffee | coffee |
| 1 | no coffee | no coffee | coffee |
| 2 | no coffee | coffee | coffee |
| 3 | coffee | coffee | coffee |
| 4 | coffee | coffee | coffee |
| 5 | coffee | coffee | coffee |
| 6 | no coffee | no coffee | coffee |
| 7 | coffee | no coffee | coffee |


In [24]:
#model accuracy
(df.actual == df.prediction).mean()

0.625

In [26]:
#baseline accuracy
(df.actual == df.baseline).mean()

0.625
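The model and the baseline have the same accuracy here, so accuracy alone cannot separate them. As a sketch of how the other two metrics differ for the baseline, the block below applies the same subsetting idea to the `baseline` column; the DataFrame is rebuilt so the block runs on its own, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
})
df['baseline'] = 'coffee'  # always predict the most common class

# precision: of the rows where the baseline predicts coffee, how many are actually coffee?
predicted_positive = df[df.baseline == 'coffee']
baseline_precision = (predicted_positive.baseline == predicted_positive.actual).mean()

# recall: of the rows that are actually coffee, how many did the baseline predict as coffee?
actual_positive = df[df.actual == 'coffee']
baseline_recall = (actual_positive.baseline == actual_positive.actual).mean()

print(baseline_precision, baseline_recall)  # 0.625 1.0
```

Because the baseline predicts coffee for everyone, it never misses a coffee drinker (recall of 1.0), but its precision is only the share of coffee drinkers in the data.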

In [29]:
# model precision
subset = df[df.prediction == 'coffee']
subset

|   | actual | prediction | baseline |
|---|--------|------------|----------|
| 2 | no coffee | coffee | coffee |
| 3 | coffee | coffee | coffee |
| 4 | coffee | coffee | coffee |
| 5 | coffee | coffee | coffee |


In [31]:
(subset.prediction == subset.actual).mean()

0.75

In [34]:
#model recall
subset = df[df.actual == 'coffee']
subset

|   | actual | prediction | baseline |
|---|--------|------------|----------|
| 0 | coffee | no coffee | coffee |
| 3 | coffee | coffee | coffee |
| 4 | coffee | coffee | coffee |
| 5 | coffee | coffee | coffee |
| 7 | coffee | no coffee | coffee |


In [38]:
(subset.prediction == subset.actual).mean()

0.6
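The hand-computed values can be cross-checked against scikit-learn's metric functions. A minimal sketch, assuming `coffee` is the positive class; it is passed through `pos_label` because the labels are strings rather than 0/1. The two lists simply repeat the columns of `df` so the block runs on its own.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

model_accuracy = accuracy_score(actual, prediction)
model_precision = precision_score(actual, prediction, pos_label='coffee')
model_recall = recall_score(actual, prediction, pos_label='coffee')

print(model_accuracy)   # 0.625
print(model_precision)  # 0.75
print(model_recall)     # 0.6
```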

In [40]:
from sklearn.metrics import classification_report

print(classification_report(df.actual, df.prediction))

              precision    recall  f1-score   support

      coffee       0.75      0.60      0.67         5
   no coffee       0.50      0.67      0.57         3

    accuracy                           0.62         8
   macro avg       0.62      0.63      0.62         8
weighted avg       0.66      0.62      0.63         8
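When the individual numbers are needed in code rather than as printed text, `classification_report` can return a nested dictionary via `output_dict=True`. A brief sketch, with the lists repeating the columns of `df` so the block runs on its own:

```python
from sklearn.metrics import classification_report

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

# Same report as above, returned as a dict keyed by class label
report = classification_report(actual, prediction, output_dict=True)

print(report['coffee']['precision'])  # 0.75
print(report['coffee']['recall'])
print(report['accuracy'])             # 0.625
```

Each class label maps to its own `precision`, `recall`, `f1-score`, and `support` entries, so the row for whichever class you treat as positive can be pulled out directly.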

