<a id="advanced-classification-metrics"></a>
## Classification Metrics

---

When we evaluate the performance of a logistic regression (or any classifier model), the standard metric to use is accuracy: How many class labels did we guess correctly? However, accuracy is only one of several metrics we could use when evaluating a classification model.

$$\text{Accuracy} = \frac{\text{total predicted correct}}{\text{total predicted}}$$

Accuracy alone doesn’t always give us a full picture.

If we know a model is 75% accurate, it doesn’t provide any insight into why the 25% was wrong.

Consider a binary classification problem where we have 165 observations/rows of people who are either smokers or nonsmokers.

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
</tr>

</table>

There are 60 observations in class 0 (nonsmokers) and 105 observations in class 1 (smokers).

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
</tr>

</table>

We have 55 predictions of class 0 (predicted nonsmokers) and 110 predictions of class 1 (predicted smokers).

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center"></td>
    <td style="text-align: center"></td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>

- **True positives (TP):** These are cases in which we predicted yes (smokers), and they actually are smokers.
- **True negatives (TN):** We predicted no, and they are nonsmokers.
- **False positives (FP):** We predicted yes, but they were not actually smokers. (This is also known as a "Type I error.")
- **False negatives (FN):** We predicted no, but they are smokers. (This is also known as a "Type II error.")

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center">TN = 50</td>
    <td style="text-align: center">FP = 10</td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center">TP = 100</td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>

**Categorize these as TP, TN, FP, or FN:**

Try not to look at the answers above.
    
- We predict nonsmoker, but the person is a smoker.
- We predict nonsmoker, and the person is a nonsmoker.
- We predict smoker and the person is a smoker.
- We predict smoker and the person is a nonsmoker.

<!--ANSWER
- FN
- TN
- TP
- FP
-->
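
The four outcomes can also be tallied directly from paired labels. The sketch below is a minimal plain-Python illustration rather than part of the original lesson: the `actual` and `predicted` lists are rebuilt from the smoker table (1 = smoker, 0 = nonsmoker), and the helper name `count_outcomes` is hypothetical.

```python
def count_outcomes(actual, predicted):
    """Tally TP, TN, FP, FN for binary labels where 1 is the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1
        elif p == 0 and a == 0:
            counts["TN"] += 1
        elif p == 1 and a == 0:
            counts["FP"] += 1
        else:  # predicted 0, actually 1
            counts["FN"] += 1
    return counts

# Rebuild the 165 observations from the table: 50 TN, 10 FP, 5 FN, 100 TP.
actual    = [0] * 50 + [0] * 10 + [1] * 5 + [1] * 100
predicted = [0] * 50 + [1] * 10 + [0] * 5 + [1] * 100

outcomes = count_outcomes(actual, predicted)
print(outcomes)
```

The four counts sum to 165, the number of observations in the table.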

<a id="accuracy-true-positive-rate-and-false-negative-rate"></a>
### Accuracy, True Positive Rate, and False Positive Rate

**Accuracy:** Overall, how often is the classifier correct?

<span>
    (<span style="color: green">TP</span>+<span style="color: red">TN</span>)/<span style="color: blue">total</span> = (<span style="color: green">100</span>+<span style="color: red">50</span>)/<span style="color: blue">165</span> = 0.91
</span>

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom; color: blue">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center; background-color: red">TN = 50</td>
    <td style="text-align: center">FP = 10</td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center; background-color: green">TP = 100</td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>

**True positive rate (TPR)**, also called recall or sensitivity, asks, “Out of all observations that actually belong to the target class, how many did we correctly predict as belonging to it?”

For example, given a medical exam that tests for cancer, how often does it correctly identify patients with cancer?

<span>
<span style="color: green">TP</span>/<span style="color: blue">actual yes</span> = <span style="color: green">100</span>/<span style="color: blue">105</span> = 0.95
</span>

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center">TN = 50</td>
    <td style="text-align: center">FP = 10</td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center;background-color: green">TP = 100</td>
    <td style="text-align: center;color: blue">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>

**False positive rate (FPR)** asks, “Out of all observations that do not belong to the target class, how many did we incorrectly predict as belonging to it?”

For example, given a medical exam that tests for cancer, how often does it trigger a “false alarm” by incorrectly saying a patient has cancer?

<span>
<span style="color: orange">FP</span>/<span style="color: blue">actual no</span> = <span style="color: orange">10</span>/<span style="color: blue">60</span> = 0.17
</span>

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center">TN = 50</td>
    <td style="text-align: center;background-color: orange">FP = 10</td>
    <td style="text-align: center;color:blue">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center">TP = 100</td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>
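
The three rates above follow directly from the four cell counts. A minimal sketch, using the counts from the smoker table (the variable names are illustrative):

```python
# Cell counts from the smoker confusion matrix.
tp, tn, fp, fn = 100, 50, 10, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
tpr = tp / (tp + fn)                        # true positives / actual yes
fpr = fp / (fp + tn)                        # false positives / actual no

print(f"accuracy = {accuracy:.2f}, TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

The printed values match the hand calculations: 0.91, 0.95, and 0.17.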

**Can you see that we might weigh TPR and FPR differently depending on the situation?**

- Give an example when we care about TPR, but not FPR.
- Give an example when we care about FPR, but not TPR.

<!--
ANSWER:
- During an initial medical diagnosis, we want to be sensitive. We want initial screens to come up with a lot of true positives, even if we get a lot of false positives.
- If we are doing spam detection, we want to be precise. Anything that we remove from an inbox must be spam, which may mean accepting fewer true positives.
-->

**More Trade-Offs**

The true positive and false positive rates give us a much clearer picture of where predictions begin to fall apart.

This allows us to adjust our models accordingly.

**Below we will load in some data on admissions to college.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [3]:
# Read in your admissions data...
admissions = pd.read_csv('admissions.csv')

In [4]:
admissions.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


In [5]:
# Drop the missing rows...
admissions.dropna(subset = ['gre','gpa','prestige'], inplace = True)

In [6]:
admissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 399
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   admit     397 non-null    int64  
 1   gre       397 non-null    float64
 2   gpa       397 non-null    float64
 3   prestige  397 non-null    float64
dtypes: float64(3), int64(1)
memory usage: 15.5 KB


In [7]:
# Get dummy variables for prestige. Code provided...
admissions = admissions.join(pd.get_dummies(admissions['prestige'], prefix='prestige'))

In [8]:
admissions

Unnamed: 0,admit,gre,gpa,prestige,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,3.0,0,0,1,0
1,1,660.0,3.67,3.0,0,0,1,0
2,1,800.0,4.00,1.0,1,0,0,0
3,1,640.0,3.19,4.0,0,0,0,1
4,0,520.0,2.93,4.0,0,0,0,1
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2.0,0,1,0,0
396,0,560.0,3.04,3.0,0,0,1,0
397,0,460.0,2.63,2.0,0,1,0,0
398,0,700.0,3.65,2.0,0,1,0,0


In [39]:
# Class imbalances
admissions['admit'].value_counts()

0    271
1    126
Name: admit, dtype: int64

**We can predict the `admit` class from `gre`, `gpa`, and `prestige` and use a train-test split to evaluate the performance of our model on a held-out test set.**

In [40]:
X = admissions[['gre','gpa', 'prestige']] #we will test the performance with other predictors later
y = admissions['admit']

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state = 42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((317, 3), (80, 3), (317,), (80,))

In [42]:
# Create or instantiate your model object
logreg = LogisticRegression()

**Recall that our "baseline" accuracy is the proportion of the majority class label.**

In [43]:
1. - y_train.mean()

0.7003154574132492
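
The expression `1. - y_train.mean()` gives the baseline only because class 0 happens to be the majority. A more general sketch counts the labels and takes the largest share. The label counts below (222 zeros and 95 ones) are inferred from the printed baseline and the training-set size of 317, so treat them as an assumption rather than as output from the notebook.

```python
from collections import Counter

# Stand-in for y_train: 222 rejections and 95 admissions (inferred, see above).
labels = [0] * 222 + [1] * 95

class_counts = Counter(labels)
majority_class, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

Any model worth keeping should beat this number on held-out data.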

In [44]:
#fit the logistic regression model on the X and y training sets
logreg.fit(X_train, y_train)

LogisticRegression()

In [45]:
y_preds_train = logreg.predict(X_train)
y_preds_test = logreg.predict(X_test)

In [46]:
#accuracy of training set
accuracy_score(y_train,y_preds_train )

0.7287066246056783

In [47]:
#accuracy of testing set
accuracy_score(y_test, y_preds_test)

0.6

**Create a confusion matrix of predictions on our test set using `confusion_matrix` from `sklearn.metrics`.**

In [48]:
confusion_matrix(y_test, y_preds_test)

array([[42,  7],
       [25,  6]])

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom; color: blue">n = ?</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center; ">TN = ?</td>
    <td style="text-align: center">FP = ?</td>
    <td style="text-align: center">?</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = ?</td>
    <td style="text-align: center; ">TP = ?</td>
    <td style="text-align: center">?</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">?</td>
    <td style="text-align: center">?</td>
</tr>

</table>
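
scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, in sorted label order, so for 0/1 labels the array has the same layout as the tables in this lesson. The sketch below shows how to unpack that layout. It uses a nested list with the smoker counts as a stand-in for the returned array, not the admissions results.

```python
# Stand-in for the array returned by confusion_matrix (smoker example counts).
matrix = [[50, 10],   # actual 0: [TN, FP]
          [5, 100]]   # actual 1: [FN, TP]

(tn, fp), (fn, tp) = matrix

n = tn + fp + fn + tp
actual_no, actual_yes = tn + fp, fn + tp        # row totals
predicted_no, predicted_yes = tn + fn, fp + tp  # column totals

print(n, actual_no, actual_yes, predicted_no, predicted_yes)
```

With a real NumPy array, `tn, fp, fn, tp = confusion_matrix(y_test, y_preds_test).ravel()` performs the same unpacking in one line.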

In [49]:
# Get probability predictions.
logreg_pred_proba = logreg.predict_proba(X_test)

In [54]:
logreg_pred_proba

array([[0.61079667, 0.38920333],
       [0.88057748, 0.11942252],
       [0.83294208, 0.16705792],
       [0.89966919, 0.10033081],
       [0.88892847, 0.11107153],
       [0.44447433, 0.55552567],
       [0.55280659, 0.44719341],
       [0.57502186, 0.42497814],
       [0.76416511, 0.23583489],
       [0.54699615, 0.45300385],
       [0.47664108, 0.52335892],
       [0.51505665, 0.48494335],
       [0.74838233, 0.25161767],
       [0.63906066, 0.36093934],
       [0.64395154, 0.35604846],
       [0.46113017, 0.53886983],
       [0.69265174, 0.30734826],
       [0.49896535, 0.50103465],
       [0.85499428, 0.14500572],
       [0.92053434, 0.07946566],
       [0.81723616, 0.18276384],
       [0.8570354 , 0.1429646 ],
       [0.56581109, 0.43418891],
       [0.84182579, 0.15817421],
       [0.50143527, 0.49856473],
       [0.73416486, 0.26583514],
       [0.61695228, 0.38304772],
       [0.52020779, 0.47979221],
       [0.69616597, 0.30383403],
       [0.63948939, 0.36051061],
       [0.

In [50]:
from sklearn.metrics import classification_report

In [51]:
# Wrap the call in print() so the report renders as a table rather than a raw string.
print(classification_report(y_test, y_preds_test))

              precision    recall  f1-score   support

           0       0.63      0.86      0.72        49
           1       0.46      0.19      0.27        31

    accuracy                           0.60        80
   macro avg       0.54      0.53      0.50        80
weighted avg       0.56      0.60      0.55        80
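
The class 1 row of the report can be reproduced by hand from the test-set confusion matrix printed earlier. A minimal sketch, assuming the usual layout of rows as actual classes and columns as predicted classes:

```python
# Counts from the test-set confusion matrix: [[42, 7], [25, 6]].
tn, fp, fn, tp = 42, 7, 25, 6

precision = tp / (tp + fp)   # of the predicted admits, how many were admitted
recall = tp / (tp + fn)      # of the actual admits, how many we found (the TPR)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision = {precision:.2f}, recall = {recall:.2f}, f1 = {f1:.2f}")
```

These round to 0.46, 0.19, and 0.27, matching the report's row for class 1.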

**Answer the following:**

- What is our accuracy?
- True positive rate?
- False positive rate?

<!--
ANSWER: This will depend on the data
-->

In [11]:
# Answer here:

A "good" classifier would have a true positive rate approaching 1 and a false positive rate approaching 0.

In our smoking problem, this model would accurately predict the majority of the smokers as smokers and not predict too many of the nonsmokers as smokers.

In [56]:
from sklearn.metrics import roc_auc_score





In [62]:
logreg_pred_proba[:,1]

array([0.38920333, 0.11942252, 0.16705792, 0.10033081, 0.11107153,
       0.55552567, 0.44719341, 0.42497814, 0.23583489, 0.45300385,
       0.52335892, 0.48494335, 0.25161767, 0.36093934, 0.35604846,
       0.53886983, 0.30734826, 0.50103465, 0.14500572, 0.07946566,
       0.18276384, 0.1429646 , 0.43418891, 0.15817421, 0.49856473,
       0.26583514, 0.38304772, 0.47979221, 0.30383403, 0.36051061,
       0.21217716, 0.5567453 , 0.28441651, 0.1904356 , 0.47979221,
       0.3366624 , 0.26561947, 0.50462156, 0.28764488, 0.27483127,
       0.67715943, 0.12201922, 0.43418891, 0.33751906, 0.30490494,
       0.21168205, 0.40063032, 0.34694975, 0.14275152, 0.18821518,
       0.16932535, 0.12272395, 0.40229243, 0.50585662, 0.16906427,
       0.37340032, 0.30919433, 0.13573255, 0.37632743, 0.41001382,
       0.10378606, 0.32969065, 0.39899903, 0.68393027, 0.2532957 ,
       0.14039315, 0.63640794, 0.47766152, 0.22580851, 0.4793579 ,
       0.3919971 , 0.17490597, 0.32952448, 0.53674748, 0.40271

In [61]:
roc_auc_score(y_test, logreg_pred_proba[:,1])

0.619815668202765
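
One way to read the ROC AUC score is as the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one, with ties counting as half. The sketch below computes that quantity pairwise on a tiny made-up dataset. The helper `pairwise_auc` is hypothetical, and the labels and scores are illustrative rather than taken from the admissions data.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. This is O(n^2) and meant for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of the positive class.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(toy_labels, toy_scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

A score of 0.5 corresponds to random ranking, so the value of about 0.62 above is only modestly better than chance.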

### Trading True Positives and True Negatives

By default, and with respect to the underlying assumptions of logistic regression, we predict a positive class when the probability of the class is greater than .5 and predict a negative class otherwise.

What if we decide to use .3 as a threshold for picking the positive class? Is that even allowed?

This turns out to be a useful strategy. By setting a lower probability threshold, we predict the positive class more often, which means more true positives (and more false positives) but fewer true negatives.

Making this trade-off is important in applications that have imbalanced penalties for misclassification.

The most popular example is medical diagnostics, where we want as many true positives as feasible. For example, when diagnosing cancer, we would rather accept a false positive (predicting cancer when there is none), which a more specific follow-up test can correct, than miss a real case.

We do this in machine learning by setting a low threshold for predicting positives, which increases the number of both true positives and false positives but lets us balance the costs of being correct and incorrect.

**We can vary the classification threshold for our model to get different predictions.**

In [2]:
# We'll do this together in class and discuss...
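
As a starting point for that discussion, here is a minimal sketch of how the threshold changes TPR and FPR. The probabilities are the first ten positive-class values printed above. The true labels are invented for this illustration, because the notebook does not display `y_test`, and the helper `rates_at_threshold` is hypothetical.

```python
def rates_at_threshold(labels, probs, threshold):
    """Return (TPR, FPR) when predicting positive for prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 0)
    return tp / (tp + fn), fp / (fp + tn)

# First ten positive-class probabilities from the output above.
probs = [0.38920333, 0.11942252, 0.16705792, 0.10033081, 0.11107153,
         0.55552567, 0.44719341, 0.42497814, 0.23583489, 0.45300385]
# Hypothetical true labels, invented for this illustration.
labels = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

for threshold in (0.5, 0.3):
    tpr, fpr = rates_at_threshold(labels, probs, threshold)
    print(f"threshold = {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

On these made-up labels, lowering the threshold from 0.5 to 0.3 raises the TPR from 0.33 to 1.00 and the FPR from 0.00 to 0.29. With the fitted model from this lesson, the equivalent prediction step would be `(logreg.predict_proba(X_test)[:, 1] >= 0.3).astype(int)`.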


### The Accuracy Paradox

Accuracy is a very intuitive metric — it's a lot like an exam score where you get total correct/total attempted. However, accuracy is often a poor metric in application. There are many reasons for this:
- Imbalanced problems: a problem with 95% positives in the baseline will reach 95% accuracy even with a model that has no predictive power.
  - This is the paradox: pursuing accuracy often means predicting the most common class rather than doing the most useful work.
- Applications often have uneven penalties and rewards for true positives and false positives.
- Ranking predictions in the correct order can be more important than getting each one correct.
- In many cases we need to know the exact probability of positives and negatives.
  - To calculate an expected return.
  - To triage observations that are borderline positive.
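
The paradox is easy to see in numbers. The sketch below uses 100 hypothetical observations with 95 positives and a "model" that always predicts the majority class.

```python
# Hypothetical imbalanced data: 95 positives, 5 negatives.
actual = [1] * 95 + [0] * 5
# A "model" with no predictive power: always predict the majority class.
predicted = [1] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Predictions made for the observations that are actually negative.
negatives = [p for a, p in zip(actual, predicted) if a == 0]
fpr = sum(negatives) / len(negatives)  # share of actual negatives predicted positive

print(f"accuracy = {accuracy:.2f}, FPR = {fpr:.2f}")
```

Accuracy comes out at 0.95 even though every negative is misclassified (FPR of 1.00), which is why accuracy should be read alongside the other metrics in this lesson.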