# [CPSC 322]() Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/) |
[Sophina Luitel](https://www.gonzaga.edu/school-of-engineering-applied-science/faculty/detail/sophina-luitel-phd-0dba6a9d)

---

# Estimating Classifier Accuracy
What are our learning objectives for this lesson?
* Divide a dataset into training and testing sets using different approaches
* Evaluate classifier performance using accuracy

Content used in this lesson is based upon information in the following sources:  
* Dr. Gina Sprint's Data Science Algorithm notes

## Today
* Announcements
    * Midterm this Friday (Oct 17): covers all topics up to KNN. Review notebooks and IQs.
* Estimating classifier accuracy
    * Dividing a dataset into training and test sets
    * Lab tasks

## Training and Testing
Building a classifier starts with a learning (training) phase
* Based on a predefined set of examples (the training set)

The classifier is then evaluated for predictive accuracy (% of test instances correctly classified by the classifier)
* Based on another set of examples (testing set)
* We use the actual labels of the examples to test the predictions
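
As a minimal sketch of the accuracy computation described above (the function name `compute_accuracy` and the example labels are illustrative assumptions, not part of the course code):

```python
def compute_accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for actual, predicted in zip(y_true, y_pred)
                  if actual == predicted)
    return correct / len(y_true)

# hypothetical labels for four test instances
actual = ["yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "yes"]
print(compute_accuracy(actual, predicted))  # 3 of 4 correct -> 0.75
```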

In general, we want to try to avoid overfitting  
* This occurs when the model learns specific details or noise from the training data, capturing patterns that don’t generalize to new data.
* A related issue is underfitting, which happens when the model is too simple to capture the true relationships in the data (e.g., using a linear model for nonlinear patterns).

We are going to discuss different ways to select training and testing sets
1. The Holdout method
2. Random Subsampling
3. $k$-Fold Cross Validation and Variants
4. Bootstrap Method

### Holdout Method
In the holdout method, the dataset is divided into two sets, the training and the testing set. The training set is used to build the model and the testing set is used to evaluate the model (e.g. the model's accuracy).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)
(image from https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)

Approaches to the holdout method
* Randomly divide the dataset into a training set and a test set
* Partition the data evenly or use a predefined ratio, e.g., $\frac{2}{3}$ for training and $\frac{1}{3}$ for testing (2:1 ratio)
* The division is done through random selection without replacement, meaning each data point appears in only one of the two sets.


Q: Write a function to do a 2:1 partition in Python...
```python
import random

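def compute_holdout_partitions(table):
    """Return (train, test) with a random 2:1 split of the table's rows."""
    # randomize a copy of the table so the caller's table is left unchanged
    randomized = table[:]
    n = len(table)
    # Fisher-Yates shuffle: swap position i with a random position in [0, i]
    # (swapping with any position in [0, n) does not give uniform shuffles)
    for i in range(n - 1, 0, -1):
        j = random.randrange(0, i + 1) # random int in [0, i]
        randomized[i], randomized[j] = randomized[j], randomized[i]
    # first 2/3 of the randomized table is train, the remaining 1/3 is test
    split_index = (2 * n) // 3 # integer arithmetic avoids float rounding
    return randomized[0:split_index], randomized[split_index:]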
```

### Random Subsampling Method
* Repeat the holdout method $k$ times, each time with a new random split
* The accuracy estimate is the average of the accuracies from the $k$ iterations
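
A rough sketch of random subsampling under some assumptions: rows are lists whose last column is the class label, `holdout_split` is a simplified 2:1 split, and `evaluate(train, test)` is a hypothetical callback that trains a classifier and returns its accuracy. The majority-class "classifier" here is only a placeholder so the example runs on its own.

```python
import random

def holdout_split(table, rng):
    """Shuffle a copy of the table, then split it 2:1 into (train, test)."""
    shuffled = table[:]
    rng.shuffle(shuffled)
    split_index = (2 * len(shuffled)) // 3
    return shuffled[:split_index], shuffled[split_index:]

def random_subsampling(table, evaluate, k=10, seed=None):
    """Average the accuracy of k independent holdout runs."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(k):
        train, test = holdout_split(table, rng)
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k

def majority_class_accuracy(train, test):
    """Placeholder classifier: always predict the most common training label."""
    labels = [row[-1] for row in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for row in test if row[-1] == majority) / len(test)

# small example table: [att1, att2, class label]
table = [[3, 2, "no"], [6, 6, "yes"], [4, 1, "no"], [4, 4, "no"],
         [1, 2, "yes"], [2, 0, "no"], [0, 3, "yes"], [1, 6, "yes"]]
print(random_subsampling(table, majority_class_accuracy, k=5, seed=0))
```

Passing a `seed` makes the runs repeatable, which helps when comparing classifiers on the same splits.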

### k-Fold Cross-Validation Method
One shortcoming of the holdout method is that the evaluation of the model depends heavily on which examples are selected for training versus testing. $k$-fold cross-validation is a model evaluation approach that addresses this shortcoming.
* Initial dataset partitioned into $k$ subsets ("folds") $D_1, D_2,..., D_k$
* Each fold is approximately the same size
* Training and testing is performed $k$ times:
    * In iteration $i$, $D_i$ is used as the test set
    * And $D_1 \cup ... \cup D_{i-1} \cup D_{i+1} \cup ... \cup D_k$ is used as the training set
* Note each subset is used exactly once for testing
* Accuracy estimate is number of correct classifications over the $k$ iterations, divided by total number of rows (i.e., test instances) in the initial dataset
* Alternatively, average accuracy by label
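
One possible way to generate the folds and the $k$ train/test runs, working with row indexes rather than rows. The function names `kfold_indexes` and `cross_validation_splits` are assumptions made for this sketch.

```python
import random

def kfold_indexes(n, k, seed=None):
    """Return k lists of row indexes (folds) covering range(n) exactly once."""
    indexes = list(range(n))
    random.Random(seed).shuffle(indexes)
    folds = [[] for _ in range(k)]
    for position, index in enumerate(indexes):
        folds[position % k].append(index)  # deal indexes out like cards
    return folds

def cross_validation_splits(folds):
    """Yield (train_indexes, test_indexes); each fold is the test set once."""
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

folds = kfold_indexes(n=10, k=3, seed=0)
for train, test in cross_validation_splits(folds):
    print("train:", sorted(train), "test:", sorted(test))
```

Dealing the shuffled indexes out one at a time keeps fold sizes within one row of each other. To get the accuracy estimate, count the correct predictions in each test fold and divide the total by `n`.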


![](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)


(image from [wikimedia](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg))

### Variants of Cross-Validation
Leave-one-out method
* Special case of cross-validation where $k$ is the number of instances

Stratified Cross-Validation method
* Class distribution within folds is approximately the same as in the initial data

Q: How might you go about generating stratified folds for cross validation?
* One approach:
    * Randomize the dataset
    * Partition the dataset so each subset contains rows of a specific class
        * e.g., if class label is "yes" or "no"
        * Then one partition has all "yes" rows
        * And the other all "no" rows
        * Note: this is a group by class label
    * Generate folds by:
        * Iterating through each partition
        * And distributing the partition (roughly) equally to each fold
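
The approach above could be sketched as follows. This is one interpretation, not the required lab solution: the name `stratified_kfold_indexes` and the example labels are assumptions, and the function works on row indexes given a list of class labels.

```python
import random

def stratified_kfold_indexes(labels, k, seed=None):
    """Return k folds of row indexes with roughly the original class mix."""
    indexes = list(range(len(labels)))
    random.Random(seed).shuffle(indexes)  # randomize the dataset

    # group by class label: one partition of indexes per label
    partitions = {}
    for index in indexes:
        partitions.setdefault(labels[index], []).append(index)

    # deal each partition out to the folds in turn
    folds = [[] for _ in range(k)]
    next_fold = 0
    for partition in partitions.values():
        for index in partition:
            folds[next_fold].append(index)
            next_fold = (next_fold + 1) % k
    return folds

labels = ["a"] * 6 + ["b"] * 3  # hypothetical labels with a 2:1 class ratio
for fold in stratified_kfold_indexes(labels, k=3, seed=0):
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

The fold counter carries over from one partition to the next rather than restarting at fold 0, which keeps the overall fold sizes balanced when a class does not divide evenly by $k$.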
        
### Lab Task 1
Consider the following dataset:

|att1|att2|result|
|-|-|-|
|3|2|no|
|6|6|yes|
|4|1|no|
|4|4|no|
|1|2|yes|
|2|0|no|
|0|3|yes|
|1|6|yes|

1. Assume we want to perform k-fold cross validation of our NN classifier for $k$ = 4. Create corresponding folds (partitions) for the dataset.
2. Describe how these $k$ folds would be used to perform cross validation. That is, show how the $k$ test runs are performed.
3. Repeat steps 1 and 2 with *stratified* k-fold cross validation.

    
### The Bootstrap Method
* Like random subsampling but with replacement
* Usually used for small datasets
* The basic ".632" approach:
    * Given a dataset with $D$ rows
    * Randomly select $D$ rows with replacement (i.e., the same row may be selected more than once)
    * This gives a "bootstrap sample" (training set) of $D$ rows
    * The remaining rows (not selected; AKA out of bag instances) form the test set
    * On average, 63.2% of original rows will end up in the training set
    * And 36.8% will end up in the test set
* Why these percentages?
    * On each draw, a given row has a $1/D$ chance of being selected
    * And a $(1 - 1/D)$ chance of not being selected
    * We draw $D$ times, so the probability that a row is never chosen is $(1 - 1/D)^D$
    * For large $D$, this probability approaches $e^{-1} \approx 0.368$ (for $e = 2.718...$)
* The sampling procedure is repeated $k$ times
    * Each iteration uses the test set for an accuracy estimate
    * Since each test set can be a different size depending on the sampling, the accuracy is the weighted average accuracy over the $k$ bootstrap samples
        * See Bramer 7.2.2 for more details
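
A minimal sketch of one bootstrap sample and of the weighted average, assuming each iteration's accuracy is weighted by its test set size (the notes do not spell out the weights; see Bramer for the exact formulation). The function names and the stand-in table are illustrative.

```python
import random

def bootstrap_sample(table, rng):
    """Draw len(table) rows with replacement; unselected rows form the test set."""
    n = len(table)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    train = [table[i] for i in chosen]
    test = [table[i] for i in range(n) if i not in chosen_set]
    return train, test

def weighted_average_accuracy(results):
    """results: list of (accuracy, test_set_size) pairs, one per bootstrap sample."""
    total = sum(size for _, size in results)
    return sum(acc * size for acc, size in results) / total

rng = random.Random(0)
table = list(range(1000))  # stand-in rows; real rows would be attribute lists
train, test = bootstrap_sample(table, rng)
print(len(train), len(test) / len(table))  # out-of-bag fraction should be near 0.368
print((1 - 1 / 1000) ** 1000)              # close to e**-1 for D = 1000
```

Weighting by test set size is the same as dividing the total number of correct predictions by the total number of test instances across all iterations.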