# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Estimating Classifier Accuracy
What are our learning objectives for this lesson?
* Divide a dataset into training and testing sets using different approaches
* Evaluate classifier performance using accuracy

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
Open ClassificationFun/main.py
1. Review the PA4 starter code from last class. You can grab mine from Github/U4/ClassificationFun.
1. Let's make sure [scikit-learn's `KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) finds the same `k=3` closest neighbors of `[2, 3]` as the starter code computes:
```python
from sklearn.neighbors import KNeighborsClassifier
...
knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_clf.fit(X_train, y_train)
print(knn_clf.kneighbors([test_instance]))
```

## Today
* Announcements
    * Nice job getting PA3 done! On to...
    * PA4 is due one week from tomorrow. A few notes:
        * All unit tests in `test_myclassifiers.py` are desk calculations. If a unit test should test against a library, this will be described.
        * See my docstring note for `MyKNeighborsClassifier`: `Assumes data has been properly normalized before use.` It is the calling code's responsibility to normalize data before use with `MyKNeighborsClassifier`.
        * Questions?
* Estimating classifier accuracy
    * Dividing a dataset into training and test sets
    * Lab tasks
* IQ4 last ~15 mins of class

## Warm-up Task(s)
* TBD

## Today 
* Announcements
    * Let's go over IQ4, nice job!
    * The career fair is coming up on 10/24 12-4!
    * MA7 and MA8 are posted.
    * PA4 is due Friday. Questions?
    * PA5 is posted. Please read it before next class and come with questions
* Overview of Naive Bayes
    * Lab tasks
    * MA7

## Training and Testing
Building a classifier starts with a learning (training) phase
* Based on predefined set of examples (AKA the training set)

The classifier is then evaluated for predictive accuracy (% of test instances correctly classified by the classifier)
* Based on another set of examples (AKA the testing set)
* We use the actual labels of the examples to test the predictions

In general, we want to try to avoid overfitting
* That is, encoding particular characteristics/anomalies of the training set into the classifier
* The opposite problem is "underfitting" (too simple of a model, e.g., linear instead of polynomial)

We are going to discuss different ways to select training and testing sets
1. The Holdout method
2. Random Subsampling
3. $k$-Fold Cross Validation and Variants
4. Bootstrap Method

### Holdout Method
In the holdout method, the dataset is divided into two sets, the training and the testing set. The training set is used to build the model and the testing set is used to evaluate the model (e.g. the model's accuracy).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)
(image from https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)

Approaches to the holdout method
* Randomly divide data set into a training and test set
* Partition evenly or, e.g., $\frac{2}{3}$ to $\frac{1}{3}$ (2:1) training to test set
* This is random selection without replacement

Q: Write a function to do a 2:1 partition in Python...
```python
import random

def compute_holdout_partitions(table):
    # randomize the table
    randomized = table[:] # copy the table
    n = len(table)
    for i in range(n - 1):
        # Fisher-Yates shuffle: swap row i with a row chosen from the
        # not-yet-placed rows, so every ordering is equally likely
        j = random.randrange(i, n) # random int in [i,n)
        randomized[i], randomized[j] = randomized[j], randomized[i]
    # return train and test sets
    split_index = 2 * n // 3 # first 2/3 of randomized table is train, remaining 1/3 is test
    return randomized[0:split_index], randomized[split_index:]
```

### Random Subsampling Method
* Repeat the holdout method $k$ times
* Accuracy estimate is the average of the accuracy of each iteration
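
As a rough sketch of the two bullets above, the loop below repeats a 2:1 holdout split $k$ times and averages the per-split accuracies. The names `holdout_split`, `random_subsampling_accuracy`, and `majority_fit_predict` are hypothetical helpers made up for this illustration, and the sketch assumes the class label is the last column of each row.

```python
import random

def holdout_split(table, rng, train_fraction=2 / 3):
    """Shuffle a copy of the table and split it into (train, test)."""
    shuffled = table[:]
    rng.shuffle(shuffled)
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]

def random_subsampling_accuracy(table, fit_predict, k=10, seed=None):
    """Average holdout accuracy over k random train/test splits.

    fit_predict(train, test) is an assumed callable that returns one
    predicted label per test row.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(k):
        train, test = holdout_split(table, rng)
        predictions = fit_predict(train, test)
        num_correct = sum(pred == row[-1] for pred, row in zip(predictions, test))
        accuracies.append(num_correct / len(test))
    return sum(accuracies) / k

def majority_fit_predict(train, test):
    """Stand-in classifier: always predict the most common training label."""
    labels = [row[-1] for row in train]
    majority = max(sorted(set(labels)), key=labels.count)
    return [majority] * len(test)

table = [[3, 2, "no"], [6, 6, "yes"], [4, 1, "no"], [4, 4, "no"],
         [1, 2, "yes"], [2, 0, "no"], [0, 3, "yes"], [1, 6, "yes"]]
print(random_subsampling_accuracy(table, majority_fit_predict, k=5, seed=0))
```

Any classifier with the same `fit_predict` shape could be swapped in for the majority-label stand-in.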

### k-Fold Cross-Validation Method
One shortcoming of the holdout method is that the evaluation of the model depends heavily on which examples are selected for training versus testing. $k$-fold cross-validation is a model evaluation approach that addresses this shortcoming.
* Initial dataset partitioned into $k$ subsets ("folds") $D_1, D_2,..., D_k$
* Each fold is approximately the same size
* Training and testing is performed $k$ times:
    * In iteration $i$, $D_i$ is used as the test set
    * And $D_1 \cup ... \cup D_{i-1} \cup D_{i+1} \cup ... \cup D_k$ used as training set
* Note each subset is used exactly once for testing
* Accuracy estimate is number of correct classifications over the $k$ iterations, divided by total number of rows (i.e., test instances) in the initial dataset
* Alternatively, average accuracy by label
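
The procedure above could be sketched as follows. This is only one possible way to build folds (shuffle the row indexes, then deal them round-robin), and `kfold_split`, `cross_val_accuracy`, and the `fit_predict` callable are hypothetical names introduced for illustration.

```python
import random

def kfold_split(n_rows, k, seed=None):
    """Return k (train_indexes, test_indexes) pairs over range(n_rows)."""
    indexes = list(range(n_rows))
    random.Random(seed).shuffle(indexes)
    # deal indexes round-robin so fold sizes differ by at most one
    folds = [indexes[i::k] for i in range(k)]
    splits = []
    for i, test_fold in enumerate(folds):
        # training set is the union of every fold except fold i
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test_fold))
    return splits

def cross_val_accuracy(X, y, fit_predict, k, seed=None):
    """Total correct predictions over all k test folds, divided by len(X).

    fit_predict(X_train, y_train, X_test) is an assumed callable that
    returns one predicted label per test instance.
    """
    num_correct = 0
    for train, test in kfold_split(len(X), k, seed):
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        X_test = [X[i] for i in test]
        predictions = fit_predict(X_train, y_train, X_test)
        num_correct += sum(pred == y[i] for pred, i in zip(predictions, test))
    return num_correct / len(X)

for train, test in kfold_split(8, 4, seed=0):
    print("train:", sorted(train), "test:", sorted(test))
```

Because every row lands in exactly one test fold, the final division by `len(X)` matches the "total number of rows" denominator described above.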


![](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)


(image from https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)

### Variants of Cross-Validation
Leave-one-out method
* Special case of cross-validation where $k$ is the number of instances

Stratified Cross-Validation method
* Class distribution within folds is approximately the same as in the initial data

Q: How might you go about generating stratified folds for cross validation?
* One approach:
    * Randomize the dataset
    * Partition dataset so each subset contains only rows of a specific class
        * e.g., if class label is "yes" or "no"
        * Then one partition has all "yes" rows
        * And the other all "no" rows
        * Note: this is a group by class label
    * Generate folds by:
        * Iterating through each partition
        * And distributing the partition (roughly) equally to each fold
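
A minimal sketch of the approach just outlined, assuming the labels are given as a list `y` parallel to the rows; `stratified_kfold_folds` is a hypothetical helper name, and it returns row indexes rather than the rows themselves.

```python
import random

def stratified_kfold_folds(y, k, seed=None):
    """Return k folds (lists of row indexes) with similar label distributions."""
    indexes = list(range(len(y)))
    random.Random(seed).shuffle(indexes)  # randomize the dataset
    # group by class label: one partition of row indexes per label
    groups = {}
    for idx in indexes:
        groups.setdefault(y[idx], []).append(idx)
    # deal each partition round-robin across the folds; the fold counter
    # carries over between partitions so fold sizes stay balanced
    folds = [[] for _ in range(k)]
    next_fold = 0
    for group in groups.values():
        for idx in group:
            folds[next_fold].append(idx)
            next_fold = (next_fold + 1) % k
    return folds

y = ["no", "yes", "no", "no", "yes", "no", "yes", "yes"]
for fold in stratified_kfold_folds(y, k=4, seed=0):
    print(sorted(fold), [y[i] for i in sorted(fold)])
```

With four "yes" and four "no" rows and $k = 4$, each fold ends up with one row of each class.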
        
### Lab Task 1
Consider the following dataset:

|att1|att2|result|
|-|-|-|
|3|2|no|
|6|6|yes|
|4|1|no|
|4|4|no|
|1|2|yes|
|2|0|no|
|0|3|yes|
|1|6|yes|

1. Assume we want to perform k-fold cross validation of our NN classifier for $k$ = 4. Create corresponding folds (partitions) for the dataset.
2. Describe how these $k$ folds would be used to perform cross validation. That is, show how the $k$ test runs are performed.
3. Repeat steps 1 and 2 with *stratified* k-fold cross validation.

    
### The Bootstrap Method
* Like random subsampling but with replacement
* Usually used for small datasets
* The basic ".632" approach:
    * Given a dataset with $D$ rows
    * Randomly select $D$ rows with replacement (i.e., might select same row)
    * This gives a "bootstrap sample" (training set) of $D$ rows
    * The remaining rows (not selected; AKA out of bag instances) form the test set
    * On average, 63.2% of original rows will end up in the training set
    * And 36.8% will end up in the test set
* Why these percentages?
    * On each draw, each row has a $1/D$ chance of being selected
    * So each row has a $(1 - 1/D)$ chance of not being selected on that draw
    * We draw $D$ times, so the probability a row is never chosen is $(1 - 1/D)^D$
    * For large $D$, this probability approaches $e^{-1} \approx 0.368$ (for $e = 2.718$...)
* The sampling procedure is repeated $k$ times
    * Each iteration uses the test set for an accuracy estimate
    * Since each test set can be a different size based on the sampling, the accuracy is the weighted average accuracy over the $k$ bootstrap samples
        * See Bramer 7.2.2 for more details
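
The sampling step and the $(1 - 1/D)^D$ argument can be checked with a short sketch. The helper names are hypothetical, and weighting each sample's accuracy by its test-set size (which reduces to total correct over total tested) is an assumption about what "weighted average" means here, not a detail confirmed by these notes.

```python
import math
import random

def bootstrap_sample(table, seed=None):
    """Draw len(table) rows with replacement; unselected rows form the test set."""
    rng = random.Random(seed)
    n = len(table)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    train = [table[i] for i in chosen]
    test = [table[i] for i in range(n) if i not in chosen_set]  # out of bag
    return train, test

def prob_never_selected(d):
    """Probability that a given row is missed by all d draws."""
    return (1 - 1 / d) ** d

def weighted_bootstrap_accuracy(results):
    """results: one (num_correct, test_size) pair per bootstrap sample.

    Assumed weighting: each sample's accuracy is weighted by its test-set
    size, which is the same as total correct divided by total tested.
    """
    total_correct = sum(correct for correct, _ in results)
    total_tested = sum(size for _, size in results)
    return total_correct / total_tested

train, test = bootstrap_sample(list(range(1000)), seed=0)
print(len(test) / 1000)            # fraction of rows left out of bag
print(prob_never_selected(1000))   # close to the limit below
print(math.exp(-1))
```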

## Evaluating Classifier Performance
Divide data set into a Training Set and a Test Set
* "Build" classifier on training set
* Test performance on the test set (try to predict their labels)
* For the test set you know the "ground truth"

Assume we have 2-valued class labels (e.g., "yes" and "no")
* As an example, the Titanic data set (more later)

|status |age |gender |survived|
|-|-|-|-|
|crew |adult |female |yes|
|first |adult |male |no|
|crew |child |female |no|
|second |adult |male |yes|

* We want to predict/classify survival (i.e., survived is the class)
    * Positive instances: instances of the "main" class of interest (e.g., yes label)
    * Negative instances: all the other instances
* $P$ = the # of positive instances in our test set
* $N$ = the # of negative instances in our test set
* $TP$ (True Positives) = # of positive instances we classified as positive
* $TN$ (True Negatives) = # of negative instances we classified as negative
    * Combined, these are our "successful" predictions
* $FP$ (False Positives) = # of negative instances we classified as positive
* $FN$ (False Negatives) = # of positive instances we classified as negative
    * Combined, these are our "failed" predictions

A generalized "confusion matrix" for (binary) classification:  
<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/binary_confusion_matrix.png" width="400">

Note: scikit-learn's [`confusion_matrix()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) returns a confusion matrix given parallel lists of actual and predicted values.
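
To make the four counts concrete, here is a small from-scratch sketch that tallies them from parallel lists of actual and predicted labels. `binary_confusion_counts` is a hypothetical helper, and the predicted labels in the example are made up for illustration (the actual labels are the `survived` column of the table above).

```python
def binary_confusion_counts(y_true, y_pred, positive="yes"):
    """Return (TP, FP, TN, FN) for parallel lists of actual/predicted labels."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # positive instance classified as positive
            else:
                fn += 1  # positive instance classified as negative
        else:
            if predicted == positive:
                fp += 1  # negative instance classified as positive
            else:
                tn += 1  # negative instance classified as negative
    return tp, fp, tn, fn

y_true = ["yes", "no", "no", "yes"]   # survived column from the table
y_pred = ["yes", "yes", "no", "no"]   # made-up predictions
tp, fp, tn, fn = binary_confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
```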

## Confusion Matrix Application: Computing Accuracy
Let's see how confusion matrices can help us calculate accuracy and interpret classifier performance. We will start with binary classification, then we will see how accuracy (and other metrics) can be adapted to multi-class classification (e.g. 3 or more class labels).

### Accuracy
Accuracy: The proportion of instances that are correctly classified
$$Accuracy = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + FP + TN + FN}$$
* Sometimes called "recognition rate"
* Referred to as "predictive accuracy" in the textbook
* Warning: can be skewed if unbalanced distribution of class labels
    * e.g., lots of negative cases that are easily detected (e.g. 99% accuracy when 99% of the dataset is the negative class)
    * shadows performance on positive cases
* Note: scikit-learn's [`accuracy_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) returns prediction accuracy given parallel lists of actual and predicted values.
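
The warning about unbalanced labels can be illustrated with a short sketch. The `accuracy` helper and the 99-to-1 label lists are made up for this example: a classifier that always predicts "no" scores 99% accuracy while finding none of the positive instances.

```python
def accuracy(y_true, y_pred):
    """Proportion of instances whose predicted label matches the actual label."""
    num_correct = sum(actual == pred for actual, pred in zip(y_true, y_pred))
    return num_correct / len(y_true)

# made-up unbalanced test set: 99 negative instances, 1 positive instance
y_true = ["no"] * 99 + ["yes"]
y_pred = ["no"] * 100  # a classifier that always predicts the negative class

print(accuracy(y_true, y_pred))  # 0.99
positives_found = sum(a == "yes" and p == "yes" for a, p in zip(y_true, y_pred))
print(positives_found)  # 0
```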
    
### Lab Task 2
What is the accuracy for the following binary classification confusion matrix?
<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/accuracy_exercise.png" width="300">

[//]: # ($$Accuracy = \frac{TP + TN}{P + N} = \frac{18 + 8}{40} = 65\%$$)
    
### Accuracy for Multi-class Classification
Q: How do we adapt/apply accuracy to the MPG data set (lots of classes)?
* This is called "multi-class classification" (vs "binary" classification)

Multi-Class Accuracy Example: Assume labels $L = \{a, b, c\}$ and $R = \#$ of total instances
<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/multi_confusion_matrix.png" width="400">

* One approach (the micro approach):
    * Number of correctly classified divided by total number of instances $Accuracy = \frac{TP + TN}{P+N} = \frac{R_a^a + R_b^b + R_c^c}{R}$
    * Easily skewed by certain classes
* Another approach (the macro approach):
    * Average accuracy per class label
    * Basically one binary confusion matrix per label (then average of these)
    * If $|L|$ is the number of labels: $$\frac{\sum_{i=1}^{|L|}\frac{TP_i+TN_i}{P_i+N_i}}{|L|}$$
    * Have to be careful of empty classes in the test set (don't count them in $|L|$)
    
To compute the accuracy of label "a": $$\begin{aligned}Accuracy_a &= \frac{TP_a + TN_a}{P_a+N_a}\\&= \frac{R_a^a + (R_b^b + R_b^c +R_c^b+R_c^c)}{R_a + (R_b + R_c)}\\&=\frac{R - (FN_a +FP_a)}{R}\\&=\frac{R - (R_a^b+R_a^c+R_b^a+R_c^a)}{R}\end{aligned}$$

We could do this for each label, then average the results
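
Both approaches can be sketched from a confusion matrix stored as a nested list. The sketch assumes rows are actual labels and columns are predicted labels (so `matrix[i][j]` plays the role of $R_i^j$), and the 3x3 matrix values are made up for illustration.

```python
def micro_accuracy(matrix):
    """Correctly classified instances (the diagonal) over all instances."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def macro_accuracy(matrix):
    """Average of the per-label binary accuracies (R - (FN + FP)) / R."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    per_label = []
    for i in range(n):
        actual_count = sum(matrix[i])
        if actual_count == 0:
            continue  # skip labels with no instances in the test set
        fn = actual_count - matrix[i][i]                          # actual i, predicted other
        fp = sum(matrix[r][i] for r in range(n)) - matrix[i][i]   # actual other, predicted i
        per_label.append((total - fn - fp) / total)
    return sum(per_label) / len(per_label)

# made-up confusion matrix for labels a, b, c (rows actual, columns predicted)
matrix = [[4, 1, 0],
          [1, 3, 1],
          [0, 0, 5]]
print(micro_accuracy(matrix))
print(macro_accuracy(matrix))
```

For this made-up matrix the micro accuracy is 12/15 and the macro accuracy is the average of 13/15, 12/15, and 14/15.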

### Lab Task 3
What is the accuracy for the following multi-class classification confusion matrix?

Coffee acidity labels: dry, sharp, moderate, or dull
<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/multi_class_accuracy_exercise.png" width="400">

1. 1st Approach (percent correctly classified; AKA micro approach):
1. 2nd Approach (average accuracy per label; AKA macro approach). First let's do the label dry:

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/master/figures/multi_class_exercise_matrix.png" width="400">

Then finish the approach for the remaining labels:  