# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Measuring Classifier Performance
What are our learning objectives for this lesson?
* Measure and evaluate classifier performance using different metrics

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Beyond Accuracy: Additional Performance Evaluation Metrics
In Bramer Chapter 12, there is a nice table summarizing commonly used performance metrics for a classifier:

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/bramer_perf_measures.png" width="600">

Beyond accuracy, let's look at a few more of these in detail. 

### Error Rate
Error Rate: 1 - accuracy
$$ErrorRate = \frac{FP + FN}{P + N}$$
* Has same issues as accuracy (unbalanced labels)
* For multi-class classification, can take the average error rate per class
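
The imbalance issue can be sketched with a small calculation. The counts below are hypothetical (5 positive and 95 negative instances) and `error_rate()` is an illustrative helper, not part of any library:

```python
def error_rate(tp, fp, tn, fn):
    """Fraction of all instances that were misclassified: (FP + FN) / (P + N)."""
    total = tp + fp + tn + fn
    return (fp + fn) / total

# Hypothetical imbalanced test set: 5 positive and 95 negative instances.
# A classifier that always predicts "negative" makes no false positives,
# but it misses every positive instance.
tp, fp, tn, fn = 0, 0, 95, 5
rate = error_rate(tp, fp, tn, fn)

print(f"error rate: {rate:.2f}")
print(f"accuracy:   {1 - rate:.2f}")
print(f"positives found: {tp} of {tp + fn}")
```

The error rate looks low even though the classifier never identifies a positive instance, which is why the metrics below are worth checking as well.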

### Precision
Precision (AKA positive predictive value): Proportion of instances classified as positive that are really positive
$$Precision = \frac{TP}{TP + FP}$$
* A measure of "exactness"
* When a classifier predicts positive, it is correct $precision$ percent of the time
* A classifier with no false positives has a precision of 1

### Recall
Recall (AKA true positive rate (TPR) AKA sensitivity): The proportion of positive instances that are correctly classified as positive (i.e. labeled correctly)
$$Recall = \frac{TP}{P} = \frac{TP}{TP + FN}$$
* A measure of "completeness"
* A classifier correctly classifies $recall$ percent of all positive cases
* A classifier with no false negatives has a recall of 1
* Used with the false positive rate to create receiver operating characteristic (ROC) graphs and curves
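
As a minimal sketch of the two definitions, precision and recall can be computed by counting TP, FP, and FN over parallel lists of actual and predicted labels. The labels and the helper names (`count_outcomes()`, `precision()`, `recall()`) are made up for illustration, and returning 0.0 when a denominator is zero is one possible convention, not the only one:

```python
def count_outcomes(y_true, y_pred, positive):
    """Count TP, FP, and FN for one positive label from parallel lists."""
    tp = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
    return tp, fp, fn

def precision(tp, fp):
    # Assumed convention: 0.0 when nothing was predicted positive
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # Assumed convention: 0.0 when there are no positive instances
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical test set: 4 actual "yes" instances and 6 actual "no" instances
y_true = ["yes"] * 4 + ["no"] * 6
y_pred = ["yes", "yes", "yes", "no"] + ["yes", "yes", "no", "no", "no", "no"]

tp, fp, fn = count_outcomes(y_true, y_pred, positive="yes")
print("TP, FP, FN:", tp, fp, fn)
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
```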

Note: There is a trade-off between precision and recall. For a class-balanced dataset, a model that predicts positive for most instances will have a high recall and a low precision.

Q: How can we get a high recall score?
* Label everything as positive
* Note that precision helps keep us honest

Q: What about for precision?
* Be conservative with our positive labels
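
The two answers above can be sketched with a score threshold. The scores, labels, and thresholds below are hypothetical, and `precision_recall_at()` is an illustrative helper. A threshold of 0 labels everything as positive, while a high threshold is conservative with positive labels:

```python
# Hypothetical classifier scores (higher means "more likely positive")
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
# Hypothetical actual labels (1 = positive, 0 = negative)
actual = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute (precision, recall)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in (0.0, 0.45, 0.85):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With these made-up scores, lowering the threshold raises recall at the cost of precision, and raising it does the opposite.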

### F1 Score 
F1-Score (AKA F-Measure): combines precision and recall into a single score by taking their harmonic mean:
$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
* Summarizes a classifier in a single number (however, it is best practice to still investigate precision and recall, as well as other evaluation metrics)
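* Alternatively, we can weight precision and recall unequally with $\beta$ ($\beta < 1$ emphasizes precision, $\beta > 1$ emphasizes recall, and $\beta = 1$ gives the F1 score):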
$$F_\beta = \frac{(1+\beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$$
* Helps deal with class imbalance problem
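
A minimal sketch of the two formulas above, using made-up precision and recall values. The helper name `f_beta()` and the choice to return 0.0 when the denominator is zero are assumptions for illustration:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the F1 score (harmonic mean)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when precision and recall are both 0
    return (1 + beta ** 2) * precision * recall / denominator

# Hypothetical values
precision, recall = 0.5, 0.8

f1 = f_beta(precision, recall)             # equal weight
f2 = f_beta(precision, recall, beta=2)     # leans toward recall
f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision

print(f"arithmetic mean: {(precision + recall) / 2:.3f}")
print(f"F1:   {f1:.3f}")
print(f"F2:   {f2:.3f}")
print(f"F0.5: {f_half:.3f}")
```

The harmonic mean is never larger than the arithmetic mean, so F1 is pulled toward the weaker of the two values.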

Note: scikit-learn's [`classification_report()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) returns multi-class precision, recall, f1-score, and support given parallel lists of actual and predicted values.

### Lab Task 1
What are the precision, recall, and F-measure for the win-lose (binary) example?
<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/accuracy_exercise.png" width="300">

```python
# check the desk-trace calculations with sklearn
from sklearn.metrics import classification_report

# build parallel lists to represent win/lose predictions
y_true = ["win"] * 20 + ["lose"] * 20
y_pred = ["win"] * 18 + ["lose"] * 2   # predictions for the 20 actual "win" instances
y_pred += ["win"] * 12 + ["lose"] * 8  # predictions for the 20 actual "lose" instances

# note that "support" is P, the number of instances in the test set with that label
# classification_report() reports metrics with "win" as the positive class
# and with "lose" as the positive class

# nice text/table form
print(classification_report(y_true, y_pred))
# returned dictionary form
report_dict = classification_report(y_true, y_pred, output_dict=True)
print(report_dict)
```

```
              precision    recall  f1-score   support

        lose       0.80      0.40      0.53        20
         win       0.60      0.90      0.72        20

    accuracy                           0.65        40
   macro avg       0.70      0.65      0.63        40
weighted avg       0.70      0.65      0.63        40

{'lose': {'precision': 0.8, 'recall': 0.4, 'f1-score': 0.5333333333333333, 'support': 20}, 'win': {'precision': 0.6, 'recall': 0.9, 'f1-score': 0.7200000000000001, 'support': 20}, 'accuracy': 0.65, 'macro avg': {'precision': 0.7, 'recall': 0.65, 'f1-score': 0.6266666666666667, 'support': 40}, 'weighted avg': {'precision': 0.7, 'recall': 0.65, 'f1-score': 0.6266666666666667, 'support': 40}}
```


### Precision, Recall, and F-Measure for Multi-class Classification
"Micro" average $\mu$
* Pools the true positives, false positives, and false negatives over all labels, then computes each metric once
    * E.g. sum TP and FP (or FN) over all the labels to compute precision (or recall)
* Micro-averaging favors bigger classes

$$Precision_\mu = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L} (TP_i + FP_i)}$$

$$Recall_\mu = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L}(TP_i + FN_i)}$$

$$F_\mu = \frac{2 \times Precision_\mu \times Recall_\mu}{Precision_\mu + Recall_\mu}$$
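
A minimal sketch of micro averaging, assuming hypothetical per-label counts for a 3-class problem (the labels "a", "b", "c" and their counts are made up). In single-label multi-class classification every false positive for one label is a false negative for another, so the pooled FP and FN totals match and micro precision equals micro recall:

```python
# Hypothetical per-label counts (supports: a=10, b=20, c=70)
counts = {
    "a": {"TP": 8, "FP": 7, "FN": 2},
    "b": {"TP": 15, "FP": 6, "FN": 5},
    "c": {"TP": 60, "FP": 4, "FN": 10},
}

# Pool the counts over all labels first...
tp = sum(c["TP"] for c in counts.values())
fp = sum(c["FP"] for c in counts.values())
fn = sum(c["FN"] for c in counts.values())

# ...then compute each metric once from the pooled counts
precision_micro = tp / (tp + fp)
recall_micro = tp / (tp + fn)
f_micro = 2 * precision_micro * recall_micro / (precision_micro + recall_micro)

print(f"micro precision: {precision_micro:.3f}")
print(f"micro recall:    {recall_micro:.3f}")
print(f"micro F:         {f_micro:.3f}")
```

Because label "c" contributes 60 of the 83 pooled true positives, it dominates the result, which is the sense in which micro averaging favors bigger classes.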

"Macro" averaging $M$
* Takes the unweighted mean of the per-label metrics
    * E.g. compute each label's precision (or recall), then average over the number of labels
* Macro-averaging treats all classes equally
$$Precision_M = \frac{\sum_{i=1}^{L}\frac{TP_i}{TP_i + FP_i}}{L}$$

$$Recall_M = \frac{\sum_{i=1}^{L}\frac{TP_i}{TP_i + FN_i}}{L}$$

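$$F_M = \frac{\sum_{i=1}^{L} \frac{2 \times Precision_i \times Recall_i}{Precision_i + Recall_i}}{L}$$

where $Precision_i = \frac{TP_i}{TP_i + FP_i}$ and $Recall_i = \frac{TP_i}{TP_i + FN_i}$ are the precision and recall of label $i$.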
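
A minimal sketch of macro averaging, assuming hypothetical per-label counts for a 3-class problem (the labels, counts, and helper names are made up for illustration). Each label's metric is computed first, and every label then gets an equal vote in the mean:

```python
# Hypothetical per-label counts (supports: a=10, b=20, c=70)
counts = {
    "a": {"TP": 8, "FP": 7, "FN": 2},
    "b": {"TP": 15, "FP": 6, "FN": 5},
    "c": {"TP": 60, "FP": 4, "FN": 10},
}

def label_precision(c):
    return c["TP"] / (c["TP"] + c["FP"])

def label_recall(c):
    return c["TP"] / (c["TP"] + c["FN"])

def label_f1(c):
    p, r = label_precision(c), label_recall(c)
    return 2 * p * r / (p + r)

num_labels = len(counts)

# Unweighted mean of the per-label metrics
precision_macro = sum(label_precision(c) for c in counts.values()) / num_labels
recall_macro = sum(label_recall(c) for c in counts.values()) / num_labels
f_macro = sum(label_f1(c) for c in counts.values()) / num_labels

print(f"macro precision: {precision_macro:.3f}")
print(f"macro recall:    {recall_macro:.3f}")
print(f"macro F:         {f_macro:.3f}")
```

Here the small label "a" has the lowest precision, and it pulls the macro precision down as much as a large label would.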

"Weighted" macro averaging $W$
* Takes the support-weighted mean of the per-label metrics
    * E.g. like macro average, but weight each label's precision (or recall) by its count $P_i$ (AKA support) and divide by the total number of instances
$$Precision_W = \frac{\sum_{i=1}^{L}P_i \times \frac{TP_i}{TP_i + FP_i}}{P + N}$$

$$Recall_W = \frac{\sum_{i=1}^{L}P_i \times \frac{TP_i}{TP_i + FN_i}}{P + N}$$

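$$F_W = \frac{\sum_{i=1}^{L} P_i \times \frac{2 \times Precision_i \times Recall_i}{Precision_i + Recall_i}}{P + N}$$

where $Precision_i = \frac{TP_i}{TP_i + FP_i}$ and $Recall_i = \frac{TP_i}{TP_i + FN_i}$ are the precision and recall of label $i$.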
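
A minimal sketch of support-weighted averaging, assuming hypothetical per-label counts for a 3-class problem (the labels, counts, and the `weighted_mean()` helper are made up for illustration). Each label's support is $P_i = TP_i + FN_i$, and the supports sum to the total number of instances:

```python
# Hypothetical per-label counts (supports: a=10, b=20, c=70)
counts = {
    "a": {"TP": 8, "FP": 7, "FN": 2},
    "b": {"TP": 15, "FP": 6, "FN": 5},
    "c": {"TP": 60, "FP": 4, "FN": 10},
}

def label_precision(c):
    return c["TP"] / (c["TP"] + c["FP"])

def label_recall(c):
    return c["TP"] / (c["TP"] + c["FN"])

def label_f1(c):
    p, r = label_precision(c), label_recall(c)
    return 2 * p * r / (p + r)

def support(c):
    # P_i: the number of instances whose actual label is i
    return c["TP"] + c["FN"]

total = sum(support(c) for c in counts.values())

def weighted_mean(metric):
    """Weight each label's metric by its support, then divide by the total."""
    return sum(support(c) * metric(c) for c in counts.values()) / total

precision_weighted = weighted_mean(label_precision)
recall_weighted = weighted_mean(label_recall)
f_weighted = weighted_mean(label_f1)

print(f"weighted precision: {precision_weighted:.3f}")
print(f"weighted recall:    {recall_weighted:.3f}")
print(f"weighted F:         {f_weighted:.3f}")
```

Since $P_i \times Recall_i = TP_i$, the weighted recall reduces to the total number of correct predictions divided by the total number of instances, i.e. the accuracy.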

### Lab Task 2
What are the precision, recall, and F-measure for the coffee acidity (multi-class) example?
1. Using the "Micro" average approach
1. Using the "Macro" average approach
1. Using the "Weighted" macro average approach

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/multi_class_accuracy_exercise.png" width="400">

```python
# check the desk-trace calculations with sklearn
from sklearn.metrics import classification_report

# build parallel lists to represent coffee acidity predictions
# (each y_pred line holds the predictions for one block of actual labels)
y_true = ["dry"] * 25 + ["sharp"] * 20 + ["moderate"] * 30 + ["dull"] * 30
y_pred = ["dry"] * 20 + ["sharp"] * 2 + ["moderate"] * 2 + ["dull"] * 1     # actual "dry"
y_pred += ["dry"] * 0 + ["sharp"] * 15 + ["moderate"] * 1 + ["dull"] * 4    # actual "sharp"
y_pred += ["dry"] * 1 + ["sharp"] * 3 + ["moderate"] * 18 + ["dull"] * 8    # actual "moderate"
y_pred += ["dry"] * 4 + ["sharp"] * 10 + ["moderate"] * 4 + ["dull"] * 12   # actual "dull"

# nice text/table form
print(classification_report(y_true, y_pred, digits=3))
# returned dictionary form
report_dict = classification_report(y_true, y_pred, output_dict=True)
print(report_dict)
```

```
              precision    recall  f1-score   support

         dry      0.800     0.800     0.800        25
        dull      0.480     0.400     0.436        30
    moderate      0.720     0.600     0.655        30
       sharp      0.500     0.750     0.600        20

    accuracy                          0.619       105
   macro avg      0.625     0.638     0.623       105
weighted avg      0.629     0.619     0.616       105

{'dry': {'precision': 0.8, 'recall': 0.8, 'f1-score': 0.8000000000000002, 'support': 25}, 'dull': {'precision': 0.48, 'recall': 0.4, 'f1-score': 0.4363636363636364, 'support': 30}, 'moderate': {'precision': 0.72, 'recall': 0.6, 'f1-score': 0.6545454545454547, 'support': 30}, 'sharp': {'precision': 0.5, 'recall': 0.75, 'f1-score': 0.6, 'support': 20}, 'accuracy': 0.6190476190476191, 'macro avg': {'precision': 0.625, 'recall': 0.6375000000000001, 'f1-score': 0.6227272727272728, 'support': 105}, 'weighted avg': {'precision': 0.6285714285714286, 'recall': 0.61904
```

### False Positive Rate 
False Positive Rate (FPR): The proportion of negative instances that are erroneously classified as positive
$$FalsePositiveRate = \frac{FP}{N} = \frac{FP}{TN + FP}$$
* Used with the true positive rate to create receiver operating characteristic (ROC) graphs and curves

### False Negative Rate 
False Negative Rate (FNR): The proportion of positive instances that are erroneously classified as negative = 1 − True Positive Rate
$$FalseNegativeRate = \frac{FN}{P} = \frac{FN}{TP + FN}$$

### True Negative Rate 
True Negative Rate (TNR AKA specificity): The proportion of negative instances that are correctly classified as negative
$$TrueNegativeRate = \frac{TN}{N} = \frac{TN}{TN + FP}$$
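
A minimal sketch of how these rates relate, using hypothetical confusion-matrix counts. The helper name `rates()` is made up for illustration. TPR and FNR split the positive instances between them, and FPR and TNR split the negative instances, so each pair sums to 1:

```python
def rates(tp, fp, tn, fn):
    """Compute the four rates from binary confusion-matrix counts."""
    p = tp + fn  # number of actual positive instances
    n = tn + fp  # number of actual negative instances
    return {
        "TPR": tp / p,  # recall / sensitivity
        "FNR": fn / p,
        "FPR": fp / n,
        "TNR": tn / n,  # specificity
    }

# Hypothetical counts: 50 positive and 100 negative instances
r = rates(tp=40, fp=10, tn=90, fn=10)

for name, value in r.items():
    print(f"{name}: {value:.2f}")
print("TPR + FNR =", r["TPR"] + r["FNR"])
print("FPR + TNR =", r["FPR"] + r["TNR"])
```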