# Classification Model Evaluation

## Examples:

Imagine you're bringing coffee to a meeting, and you need to predict whether each person at the meeting will want a coffee or not. Which metric should you choose? It depends on what each kind of mistake costs.

Outcomes:

- FP: Buy a coffee for someone who won't drink it
- FN: Don't buy a coffee for someone who wanted one
- TP: Buy a coffee for someone who will drink it
- TN: Don't buy a coffee for someone who wouldn't drink it anyway

Scenarios:

- revolucion: good coffee, but expensive
    - cost of a FP is higher than FN
    - precision is better here because buying a cup of coffee for someone who won't drink it is expensive
    - we want to be sure about our positive predictions
- taco cabana: bad coffee, but cheap
    - cost of a FN is higher than FP
    - optimize for recall: the coffee is cheap, so buying one for someone who won't drink it is not a big loss; it's worse to skip someone who wanted one
- meeting with super important client
    - cost of a FN is higher, because they might be offended if we don't get them coffee
    - cost of FN == not signing a contract
    - optimize for recall
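The trade-off in the scenarios above can be made concrete by attaching a cost to each kind of error. This is a minimal sketch: the function `total_cost`, the two models, and every dollar figure are hypothetical and only meant to illustrate the reasoning.

```python
# Hypothetical per-mistake costs in dollars; these numbers are illustrative
# assumptions, not figures from the scenarios above.
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given the cost of each error type."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical models evaluated on the same meeting:
# model A is cautious (few FPs, more FNs), model B is generous (more FPs, few FNs).
model_a = {'fp': 1, 'fn': 4}
model_b = {'fp': 4, 'fn': 1}

# Expensive coffee: a wasted cup (FP) costs more than a missed one (FN).
expensive = {'cost_fp': 6.0, 'cost_fn': 2.0}
# Cheap coffee: a wasted cup costs little, so a missed cup is relatively worse.
cheap = {'cost_fp': 1.0, 'cost_fn': 2.0}

for name, costs in [('expensive', expensive), ('cheap', cheap)]:
    cost_a = total_cost(**model_a, **costs)
    cost_b = total_cost(**model_b, **costs)
    print(f'{name}: model A = {cost_a:.2f}, model B = {cost_b:.2f}')
```

With these made-up numbers, the cautious model A is cheaper when coffee is expensive, and the generous model B is cheaper when coffee is cheap.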

What if we buy coffee for everyone, or don't buy any coffee at all? That is a baseline model.

### Mini Exercise

Scenario: Build a classifier to predict whether a given face should unlock the iPhone.

- What is the positive and negative case?
- What are the possible outcomes?
- What are the costs of the outcomes? cost of a FP is higher than FN
- Which metric should we use? Precision

Scenario: Predict whether an email is spam or not. Emails marked as spam skip the inbox and go to the spam folder.

- What is the positive and negative case?
    * + = avoid spam, - = mark as spam
    * + = mark as spam, - = not spam
- What are the possible outcomes?
- What are the costs of the outcomes? sending a real message to the spam folder is worse than a spam message getting through
- Which metric should we use? precision

Scenario: Predict whether an email is a phishing attempt. When we predict positive, show an additional banner warning the user that this might be a phishing email.

- What is the positive and negative case?
- What are the possible outcomes?
- What are the costs of the outcomes?
- Which metric should we use? recall

## Python Implementation

In [1]:
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})
df

      actual prediction
0     coffee  no coffee
1  no coffee  no coffee
2  no coffee     coffee
3     coffee     coffee
4     coffee     coffee
5     coffee     coffee
6  no coffee  no coffee
7     coffee  no coffee


## Confusion Matrix

In [2]:
pd.crosstab(df.prediction, df.actual)

actual      coffee  no coffee
prediction
coffee           3          1
no coffee        2          2


- TP: predicted coffee + actual is coffee
- FP: predicted coffee, but they didn't like coffee
- FN: predicted no coffee, but really they liked coffee
- TN: predicted no coffee, actual no coffee

Note:

- our choice of positive and negative is arbitrary
- the labels / layout of the confusion matrix varies
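Because the layout of a confusion matrix varies, it can help to count the four outcomes directly. The following is a sketch that rebuilds the same `df` and treats `'coffee'` as the positive class; the variable names `pred_pos` and `actual_pos` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})

positive = 'coffee'  # our (arbitrary) choice of positive class

# Boolean masks: which rows were predicted positive / are actually positive
pred_pos = df.prediction == positive
actual_pos = df.actual == positive

tp = int((pred_pos & actual_pos).sum())    # predicted +, actual +
fp = int((pred_pos & ~actual_pos).sum())   # predicted +, actual -
fn = int((~pred_pos & actual_pos).sum())   # predicted -, actual +
tn = int((~pred_pos & ~actual_pos).sum())  # predicted -, actual -

print(f'TP={tp} FP={fp} FN={fn} TN={tn}')  # TP=3 FP=1 FN=2 TN=2
```

Changing `positive` to `'no coffee'` swaps the roles of TP and TN, and of FP and FN.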

## Metrics

- **accuracy**: (TP + TN) / (TP + TN + FP + FN)
    - (3 + 2) / (3 + 2 + 1 + 2) = 62.5%
- **precision**: TP / (TP + FP)
    - 3 / (3 + 1) = 75%
    - use when a FP is more costly than a FN
- **recall**: TP / (TP + FN)
    - 3 / (3 + 2) = 60%
    - use when a FN is more costly than a FP
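The formulas above translate directly into small functions. This is a minimal sketch; the function names are our own, and the counts are the ones from the confusion matrix with `coffee` as the positive class.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)

# Counts from the coffee example: TP=3, FP=1, FN=2, TN=2
tp, fp, fn, tn = 3, 1, 2, 2

print(f'accuracy:  {accuracy(tp, tn, fp, fn):.1%}')  # 62.5%
print(f'precision: {precision(tp, fp):.1%}')          # 75.0%
print(f'recall:    {recall(tp, fn):.1%}')             # 60.0%
```

Note that `precision` raises a `ZeroDivisionError` if the model makes no positive predictions, and `recall` does the same if there are no actual positives.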

In [3]:
df.actual.value_counts()

coffee       5
no coffee    3
Name: actual, dtype: int64

In [4]:
df['baseline'] = 'coffee'

In [5]:
df

      actual prediction baseline
0     coffee  no coffee   coffee
1  no coffee  no coffee   coffee
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee
6  no coffee  no coffee   coffee
7     coffee  no coffee   coffee


### Python Implementation

In [6]:
# model accuracy
(df.actual == df.prediction).mean()

0.625

In [7]:
# baseline accuracy
(df.actual == df.baseline).mean()

0.625
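The baseline column was filled in by hand after looking at `value_counts()`. A sketch of deriving it from the data instead, assuming the baseline is defined as "always predict the most common actual class" (the name `most_common` is ours):

```python
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})

# idxmax() returns the index label (here, the class name) with the highest count
most_common = df.actual.value_counts().idxmax()
df['baseline'] = most_common

baseline_accuracy = (df.actual == df.baseline).mean()
print(f'baseline predicts {most_common!r}, accuracy = {baseline_accuracy:.2%}')
```

By construction, the baseline's accuracy equals the share of the most common class, which is why it matches 5 / 8 here.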

In [8]:
# precision -- how good are our positive predictions
# precision -- model performance | pred +
subset = df[df.prediction == 'coffee']
print(subset)
(subset.prediction == subset.actual).mean()

      actual prediction baseline
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee


0.75

In [9]:
# recall -- how often do we get the actual positive cases
# recall -- model performance | actual +
subset = df[df.actual == 'coffee']
print(subset)
(subset.prediction == subset.actual).mean()

   actual prediction baseline
0  coffee  no coffee   coffee
3  coffee     coffee   coffee
4  coffee     coffee   coffee
5  coffee     coffee   coffee
7  coffee  no coffee   coffee


0.6

What are the precision and recall of our baseline model, which always predicts the positive class?

In [10]:
# precision
subset = df[df.baseline == 'coffee']
print(subset)
(subset.baseline == subset.actual).mean()

      actual prediction baseline
0     coffee  no coffee   coffee
1  no coffee  no coffee   coffee
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee
6  no coffee  no coffee   coffee
7     coffee  no coffee   coffee


0.625

In [11]:
# recall
subset = df[df.actual == 'coffee']
print(subset)
(subset.baseline == subset.actual).mean()

   actual prediction baseline
0  coffee  no coffee   coffee
3  coffee     coffee   coffee
4  coffee     coffee   coffee
5  coffee     coffee   coffee
7  coffee  no coffee   coffee


1.0

## What does "positive" mean?

In [13]:
positive = 'no coffee'

# accuracy -- overall hit rate
model_accuracy = (df.prediction == df.actual).mean()
baseline_accuracy = (df.baseline == df.actual).mean()

# precision -- how good are our positive predictions?
# precision -- model performance | predicted positive
subset = df[df.prediction == positive]
model_precision = (subset.prediction == subset.actual).mean()
subset = df[df.baseline == positive]
baseline_precision = (subset.baseline == subset.actual).mean()

# recall -- how good are we at detecting actual positives?
# recall -- model performance | actual positive
subset = df[df.actual == positive]
model_recall = (subset.prediction == subset.actual).mean()
baseline_recall = (subset.baseline == subset.actual).mean()


print(f'   model accuracy: {model_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')
print()
print(f'   model recall: {model_recall:.2%}')
print(f'baseline recall: {baseline_recall:.2%}')
print()
print(f'model precision: {model_precision:.2%}')
print(f'baseline precision: {baseline_precision:.2%}')

   model accuracy: 62.50%
baseline accuracy: 62.50%

   model recall: 66.67%
baseline recall: 0.00%

model precision: 50.00%
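baseline precision: nan%

The baseline precision is `nan` because the baseline never predicts `no coffee`: the subset of positive predictions is empty, and the mean of an empty Series is undefined (0 / 0).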
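To compare both choices of positive class side by side, the subset logic can be wrapped in a helper. This is a sketch under our own naming (`precision_recall` is a hypothetical helper, not part of pandas); it returns `nan` explicitly when a metric is undefined.

```python
import pandas as pd

def precision_recall(actual, predicted, positive):
    """Precision and recall for one choice of positive class.

    Returns nan for a metric whose denominator is zero.
    """
    predicted_pos = predicted == positive
    actual_pos = actual == positive
    tp = (predicted_pos & actual_pos).sum()
    precision = tp / predicted_pos.sum() if predicted_pos.sum() > 0 else float('nan')
    recall = tp / actual_pos.sum() if actual_pos.sum() > 0 else float('nan')
    return precision, recall

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})
df['baseline'] = 'coffee'

# Evaluate the model and the baseline under each choice of positive class
for positive in ['coffee', 'no coffee']:
    for col in ['prediction', 'baseline']:
        p, r = precision_recall(df.actual, df[col], positive)
        print(f'positive={positive!r:12} {col:10} precision={p:.2%} recall={r:.2%}')
```

Accuracy is unaffected by the choice of positive class, but precision and recall change with it, so the positive class should be stated whenever these metrics are reported.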


## Recap

In short:

- accuracy doesn't tell the whole story
- optimize for **precision** when you want to be sure about your positive predictions
- optimize for **recall** when you don't want to miss positive cases
- the baseline model predicts the most common class (not necessarily the positive class)
- + / - are somewhat arbitrary, but generally a positive prediction means taking action