# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Classifier Evaluation
What are our learning objectives for this lesson?
* Evaluate classifier performance using different metrics
* Divide a dataset into training and testing sets using different approaches

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
1. Download the shirt_sizes_long.csv file from GitHub (it is in the ClassificationFun folder)
    * In kNN.ipynb, load this dataset into a dataframe
    * We are going to use this slightly larger dataset to explore different ways to divide a dataset into training and testing sets
1. Take a look at this tutorial to see an overview of how decision trees work: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
    * Read about [sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
    * We are going to see how this classifier performs compared to kNN

## Today
* Announcements
    * Work on your project (Cleaning, EDA, and at least 1 hypothesis test)
        * Check-in due sometime after we come back from Easter break (in class on 5/29 at the latest)
        * Bonus points for demoing early (week of 4/22-4/25) during office hours
    * Let's go over DA7 and take questions
    * Game dev talk tonight 6pm Bollier 120 -- Pizza and snacks!!
    * Have a great Easter weekend 🐣
* Today
    * ClassificationFun
        * Closing thoughts on kNN
        * Evaluating classifier performance
        * (if time) Decision tree example
    * IQ9 last ~15 mins

## Evaluating Classifier Performance
Divide data set into a Training Set and a Test Set
* "Build" classifier on training set
* Test performance on the test set (try to predict their labels)
* For the test set you know the "ground truth"
    * More on how to do this later

Assume we have 2-valued class labels (e.g., "yes" and "no")
* As an example, the titanic data set (more later)

|status |age |gender |survived|
|-|-|-|-|
|crew |adult |female |yes|
|first |adult |male |no|
|crew |child |female |no|
|second |adult |male |yes|

* We want to predict/classify survival (i.e., survived is the class)
    * Positive instances: instances of the "main" class of interest (e.g., yes label)
    * Negative instances: all the other instances
* $P$ = the # of positive instances in our test set
* $N$ = the # of negative instances in our test set
* $TP$ (True Positives) = # of positive instances we classified as positive
* $TN$ (True Negatives) = # of negative instances we classified as negative
    * Combined, these are our "successful" predictions
* $FP$ (False Positives) = # of negative instances we classified as positive
* $FN$ (False Negatives) = # of positive instances we classified as negative
    * Combined, these are our "failed" predictions

A generalized "confusion matrix" for (binary) classification

<img src="https://raw.githubusercontent.com/GonzagaCPSC222/U7-Machine-Learning-NLP/master/figures/binary_confusion_matrix.png" width="400">
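
As a minimal sketch of the counts defined above, the four cells of a binary confusion matrix can be computed with `sklearn.metrics.confusion_matrix`. The `y_true` and `y_pred` lists are hypothetical labels made up for illustration (they are not from the titanic data), with "yes" assumed to be the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels for 8 test instances
y_true = ["yes", "no", "no", "yes", "yes", "no", "yes", "no"]
y_pred = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]

# labels=["no", "yes"] fixes the row/column order: negative class first, positive second.
# With that order, flattening the 2x2 matrix yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["no", "yes"]).ravel()

P = tp + fn  # positive instances in the test set
N = tn + fp  # negative instances in the test set
print(f"TP={tp} TN={tn} FP={fp} FN={fn} P={P} N={N}")
```

Passing `labels` explicitly is a design choice here: without it, the row and column order follows the sorted label values, which is easy to misread.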

## Metrics
### Accuracy
Accuracy: % of test instances correctly classified by the classifier
$$Accuracy = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + FP + TN + FN}$$
* Sometimes called "recognition rate"
* Use the [KNeighborsClassifier.score(X, y)](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score) method to determine the accuracy of a kNN classifier
    * Description: "Return the mean accuracy on the given test data and labels."
    * Note: for other classifiers, check the documentation for the `score()` method to see what the default evaluation metric is
* Warning: accuracy can be misleading when the distribution of class labels is unbalanced
    * e.g., lots of negative cases that are easily detected (a classifier can reach 99% accuracy when 99% of the dataset is the negative class)
    * This masks poor performance on the positive cases
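
The warning about unbalanced labels can be illustrated with a small sketch. The labels below are made up for illustration: 99 negative instances, 1 positive instance, and a hypothetical "classifier" that always predicts the negative class.

```python
from sklearn.metrics import accuracy_score

# Hypothetical unbalanced test set: 99 negatives and 1 positive
y_true = ["no"] * 99 + ["yes"]
# A trivial classifier that always predicts the majority (negative) class
y_pred = ["no"] * 100

accuracy = accuracy_score(y_true, y_pred)

# Count how many positive instances were actually found
tp = sum(t == "yes" and p == "yes" for t, p in zip(y_true, y_pred))

print(f"accuracy = {accuracy:.2f}, true positives = {tp}")
```

The accuracy looks excellent even though not a single positive instance is detected, which is why accuracy alone is not enough for unbalanced data.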
    
### Example
What is the accuracy for the following binary classification confusion matrix?
<img src="https://raw.githubusercontent.com/GonzagaCPSC222/U6-Machine-Learning/master/figures/accuracy_exercise.png" width="300">

[//]: # ($$Accuracy = \frac{TP + TN}{P + N} = \frac{18 + 8}{40} = 65\%$$)

### Error Rate
Error Rate: 1 - accuracy
$$ErrorRate = \frac{FP + FN}{P + N}$$
* Has same issues as accuracy (unbalanced labels)
* For multi-class classification, can take the average error rate per class
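
A short sketch of the relationship between the two metrics, using hypothetical confusion matrix counts chosen only for illustration:

```python
# Hypothetical confusion matrix counts (not taken from the lesson's datasets)
tp, tn, fp, fn = 3, 3, 1, 1
total = tp + tn + fp + fn  # same as P + N

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total

# Every test instance is either classified correctly or incorrectly,
# so the two metrics always sum to 1
print(accuracy, error_rate, accuracy + error_rate)
```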

### More Classifier Evaluation Metrics to Look Into
* Precision: measure of "exactness"
* Recall (AKA sensitivity): measure of "completeness"
* F-Measure (AKA F1 score): combine the two via the harmonic mean of precision and recall
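
The notes do not give formulas for these metrics; the sketch below assumes the standard definitions, $Precision = \frac{TP}{TP + FP}$ and $Recall = \frac{TP}{TP + FN}$, and uses the corresponding functions in `sklearn.metrics`. The labels are hypothetical, with "yes" assumed to be the positive class.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground-truth and predicted labels (TP=3, FP=1, FN=1, TN=3)
y_true = ["yes", "no", "no", "yes", "yes", "no", "yes", "no"]
y_pred = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]

# pos_label tells sklearn which label counts as the positive class
precision = precision_score(y_true, y_pred, pos_label="yes")  # TP / (TP + FP)
recall = recall_score(y_true, y_pred, pos_label="yes")        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred, pos_label="yes")                # harmonic mean of the two

print(precision, recall, f1)
```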

## Training and Testing
Building a classifier starts with a learning (training) phase
* Based on predefined set of examples (AKA the training set)

The classifier is then evaluated for predictive accuracy
* Based on another set of examples (AKA the testing set)
* We use the actual labels of the examples to test the predictions

In general, we want to try to avoid overfitting
* That is, encoding particular characteristics/anomalies of the training set into the classifier
* The opposite problem is "underfitting" (a model that is too simple to capture the pattern, e.g., linear instead of polynomial)
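
One way to spot overfitting is to compare accuracy on the training set with accuracy on the test set. The sketch below is an illustration under assumptions, not part of the lesson's notebook: it uses synthetic data from `make_classification` with some deliberately noisy labels (`flip_y`) and a kNN classifier with $k = 1$, which memorizes the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary dataset; flip_y randomly reassigns some labels to add noise
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# With k=1, each training instance is its own nearest neighbor,
# so the classifier reproduces the training labels, noise included
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

A large gap between the two numbers suggests the classifier has encoded anomalies of the training set rather than a pattern that generalizes.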

We are going to discuss different ways to select training and testing sets
1. The Holdout method
2. Random Subsampling
3. $k$-Fold Cross Validation and Variants
4. Bootstrap Method

### Holdout Method
In the holdout method, the dataset is divided into two sets, the training and the testing set. The training set is used to build the model and the testing set is used to evaluate the model (e.g. the model's accuracy).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)
(image from https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)

Approaches to the holdout method
* Randomly divide data set into a training and test set
* Partition evenly or, e.g., $\frac{2}{3}$ to $\frac{1}{3}$ (2:1) training to test set
* This is random selection without replacement
* Use the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to apply the holdout method to a dataset
    * Description: "Split arrays or matrices into random train and test subsets"
    * `test_size` parameter: "If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25"
    * `random_state` parameter: "Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls."
    * `stratify` parameter: "If not None, data is split in a stratified fashion, using this as the class labels."
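
A minimal sketch of the holdout method with a 2:1 split. It assumes the iris dataset bundled with scikit-learn as a stand-in for the lesson's dataset, and $k = 3$ is an arbitrary choice for the kNN classifier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset: 150 instances, 3 class labels
X, y = load_iris(return_X_y=True)

# 2:1 training to test split; random_state makes the split reproducible,
# stratify=y keeps the class distribution similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y)

# "Build" the classifier on the training set only
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate on the held-out test set; score() returns mean accuracy
accuracy = knn.score(X_test, y_test)
print(len(X_train), len(X_test), accuracy)
```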

### Random Subsampling Method
* Repeat the holdout method $k$ times
* Accuracy estimate is the average of the accuracy of each iteration
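
Random subsampling can be sketched as a loop around `train_test_split()`. As assumptions for illustration, the iris dataset stands in for the lesson's dataset, $k = 10$ repetitions are used, and each iteration gets a different `random_state` so that the splits differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k = 10  # number of times to repeat the holdout method
accuracies = []
for i in range(k):
    # A different random_state per iteration gives a different random split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=i, stratify=y)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))

# The accuracy estimate is the average over the k iterations
mean_accuracy = np.mean(accuracies)
print(mean_accuracy)
```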

### k-Fold Cross-Validation Method
One of the shortcomings of the holdout method is that the evaluation of the model depends heavily on which examples are selected for training versus testing. $k$-fold cross-validation is a model evaluation approach that addresses this shortcoming.
* Initial dataset partitioned into $k$ subsets ("folds") $D_1, D_2,..., D_k$
* Each fold is approximately the same size
* Training and testing is performed $k$ times:
    * In iteration $i$, $D_i$ is used as the test set
    * And $D_1 \cup ... \cup D_{i-1} \cup D_{i+1} \cup ... \cup D_k$ used as training set
* Note each subset is used exactly once for testing
* Accuracy estimate is number of correct classifications over the $k$ iterations, divided by total number of rows (i.e., test instances) in the initial dataset
    * Alternatively, average accuracy by label
![](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)
(image from https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)
* Use the [cross_val_score(estimator, X, y)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) function to apply k-fold cross-validation to a dataset
    * Description: "Evaluate a score by cross-validation"
    * `cv` parameter: "Determines the cross-validation splitting strategy." By default it is set to 5-fold cross validation, but you can pass in an int to specify the number of folds in a (Stratified)KFold
    * `scoring` parameter: "A str (see model evaluation documentation) or a scorer callable object". By default it is set to the estimator's default scorer (if available)
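
A minimal sketch of 5-fold cross-validation, again assuming the iris dataset and a kNN classifier with $k = 3$ neighbors as stand-ins. `cross_val_score()` returns one accuracy per fold; `cross_val_predict()` is used here to pool the predictions so that accuracy can also be computed as correct classifications divided by the total number of rows, as described above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# One accuracy score per fold; each fold is the test set exactly once
scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())

# Pooled estimate: each instance is predicted while its fold is held out,
# then all predictions are compared against the true labels at once
y_pred = cross_val_predict(knn, X, y, cv=5)
pooled_accuracy = accuracy_score(y, y_pred)
print(pooled_accuracy)
```

When every fold has the same size, as with 150 instances and 5 folds here, the mean of the fold scores and the pooled accuracy agree; with unequal folds they can differ slightly.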

### More Approaches to Building a Test Set to Look Into
* Leave-one-out cross-validation method
    * Special case of cross-validation where $k$ is the number of instances
* Stratified Cross-Validation method
    * Class distribution within folds is approximately the same as in the initial data
* The Bootstrap Method
    * Like random subsampling but with replacement
    * Usually used for small datasets
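
The three approaches above can be sketched with scikit-learn as follows. The iris dataset and kNN with $k = 3$ are stand-ins chosen for illustration. For the bootstrap, testing on the instances that were never drawn (often called "out-of-bag" instances) is an assumption of this sketch; the notes do not specify how the test set is formed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Leave-one-out: k equals the number of instances, so each score is 0 or 1
loo_scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())

# Stratified 5-fold: each fold keeps roughly the original class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf_scores = cross_val_score(knn, X, y, cv=skf)
print("stratified 5-fold accuracy:", skf_scores.mean())

# Bootstrap: draw training row indices WITH replacement
indices = np.arange(len(X))
train_idx = resample(indices, replace=True, n_samples=len(X), random_state=0)
# Assumption: rows that were never drawn form the test set
test_idx = np.setdiff1d(indices, train_idx)

knn.fit(X[train_idx], y[train_idx])
bootstrap_accuracy = knn.score(X[test_idx], y[test_idx])
print("bootstrap accuracy:", bootstrap_accuracy)
```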