# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# PA5 Estimating Classifier Accuracy (75 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Estimate classifier accuracy using train/test sets sampled using
    * Random sub-sampling
    * k-fold cross validation
    * Stratified k-fold cross validation
    * Bootstrap method
* Create and interpret confusion matrices

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Implement test-driven development
* Implement simple classifiers
* Evaluate classifiers against a dummy classifier
* Tell a data science story using Jupyter Notebook
* Understand Bramer Chapter 7 (Estimating the Predictive Accuracy of a Classifier)

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* Part 1 (`mysklearn`): The [scikit-learn](https://scikit-learn.org/stable/) machine learning library for Python
* Part 2 (auto dataset classification): Dr. Shawn Bowers' Data Mining HW3

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA5 link to accept the assignment and create your private repository in GitHub Classroom: https://classroom.github.com/a/qPjdZ46E

Your repo, for example, will be named GonzagaCPSC322/pa5-yourusername (where yourusername is your GitHub username). I highly recommend committing and pushing regularly so your work is always backed up.

## Overview and Requirements
This assignment involves implementing simple classification and classification evaluation algorithms. It has two main parts:
1. `mysklearn`: Test and implement general and re-usable classification and evaluation algorithms
1. Auto Dataset Classification (pa5.ipynb): Write a Jupyter Notebook that uses `mysklearn` to perform data classification tasks on an automobile dataset

I highly encourage you to design functions that are generic and re-usable for future programming assignments and data mining tasks.

Note: we are learning data science from scratch! The only non-standard Python libraries you should need for this assignment are `tabulate`, `numpy` (math functions, random number generation, etc.), and `scipy` (sparingly). This means you should not `pip install` anything beyond what is included in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker image, and you should not use `pandas`, `sklearn`, etc. (exceptions are made for testing purposes only!).

## Part 1: `mysklearn` (45 pts)
Complete the `mysklearn.myevaluation` functions and test your code for functional correctness against the provided unit tests in `test_myevaluation.py`:
1. `train_test_split(X, y, test_size=0.33, random_state=None, shuffle=True)`
    * Function inspiration: scikit-learn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
1. `kfold_split(X, n_splits=5, random_state=None, shuffle=False)`
    * Function inspiration: scikit-learn's [KFold split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
1. `bootstrap_sample(X, y, n_samples=None, random_state=None)`
    * Function inspiration: scikit-learn's [resample()](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html)
1. `accuracy_score(y_true, y_pred, normalize=True)`
    * Function inspiration: scikit-learn's [accuracy_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
1. `confusion_matrix(y_true, y_pred, labels)`
    * Function inspiration: scikit-learn's [confusion_matrix()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
    
Note: you can run all the unit tests in a project with pytest by omitting the name of the test module: `pytest --verbose`. The `-s` flag is also useful because it lets you see print statement output from your test execution.
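
To make the expected behavior of the two metric functions more concrete, here is a minimal sketch on toy labels. The helper names `sketch_accuracy` and `sketch_confusion_matrix` are hypothetical, and the sketch assumes every label in `y_true` and `y_pred` also appears in `labels`. It follows the convention in the scikit-learn documentation (rows are true labels, columns are predicted labels); the provided unit tests remain the authority on what your functions must return.

```python
def sketch_accuracy(y_true, y_pred, normalize=True):
    """Fraction (or raw count, if normalize is False) of matching labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if normalize else correct


def sketch_confusion_matrix(y_true, y_pred, labels):
    """Cell [i][j] counts instances with true labels[i] predicted as labels[j]."""
    # Assumption: every observed label is listed in `labels`.
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


y_true = ["yes", "no", "yes", "yes", "no"]
y_pred = ["yes", "yes", "yes", "no", "no"]
print(sketch_accuracy(y_true, y_pred))                         # 0.6
print(sketch_accuracy(y_true, y_pred, normalize=False))        # 3
print(sketch_confusion_matrix(y_true, y_pred, ["yes", "no"]))  # [[2, 1], [1, 1]]
```

Note that the diagonal of the matrix (2 + 1) equals the unnormalized accuracy count, which is a handy sanity check when debugging.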

## Part 2: 🚗 Auto Classification 🚗 (20 pts)
Write a Jupyter Notebook (pa5.ipynb) that uses your `mysklearn` package to evaluate the simple kNN and Dummy classifiers for the "pre-processed" automobile dataset (auto-data-removed-NA.txt) you created for PA2. Create classifiers **that predict DOE mpg ratings using the number of cylinders, weight, and acceleration attributes** (like in PA4). In the Notebook, describe the steps, log any assumptions and/or issues you had in doing the steps, and provide insights on the step results. All re-usable utility functions should be separate from your Notebook in an appropriate module.

### Step 1 Train/Test Sets: Random Sub-sampling
Compute the predictive accuracy and error rate for each classifier using random sub-sampling with k = 10 and a 2:1 train/test split. To solve this, I highly recommend writing a function called `random_subsample()` that calls your `train_test_split()` in a loop.

Your output should look something like this (where the ??'s should be replaced by actual values):

```
===========================================
STEP 1: Predictive Accuracy
===========================================
Random Subsample (k=10, 2:1 Train/Test)
k Nearest Neighbors Classifier: accuracy = 0.??, error rate = 0.??
Dummy Classifier: accuracy = 0.??, error rate = 0.??
```
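
The random sub-sampling loop from Step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: `sketch_split` is a hypothetical stand-in for your own `train_test_split()`, `majority_label` stands in for a real classifier, the signature of `random_subsample` shown here is made up, and averaging the k per-split accuracies is one reasonable way to combine the results.

```python
import math
import random
from collections import Counter


def sketch_split(X, y, test_size, rng):
    """Shuffle the row indices, then hold out a test_size fraction for testing."""
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    train_idx, test_idx = indices[n_test:], indices[:n_test]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


def majority_label(y_train):
    """Stand-in for a dummy classifier: always predict the most common label."""
    return Counter(y_train).most_common(1)[0][0]


def random_subsample(X, y, k=10, test_size=0.33, seed=0):
    """Average the accuracy over k independent random train/test splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(k):
        _, _, y_train, y_test = sketch_split(X, y, test_size, rng)
        prediction = majority_label(y_train)
        correct = sum(1 for label in y_test if label == prediction)
        accuracies.append(correct / len(y_test))
    accuracy = sum(accuracies) / k
    return accuracy, 1 - accuracy


X = [[i] for i in range(12)]       # toy feature rows
y = ["a"] * 8 + ["b"] * 4          # toy class labels
accuracy, error_rate = random_subsample(X, y)
print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```

In your Notebook you would pass in the classifier under evaluation (kNN or Dummy) instead of hard-coding the majority-label rule.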

### Step 2 Train/Test Sets: Cross Validation
Compute the predictive accuracy and error rate for each classifier using k-fold cross validation (or if you do the bonus, stratified k-fold cross validation) with k = 10. To solve this, I highly recommend writing a function called `cross_val_predict()` that calls your `kfold_split()` (or, if you do the bonus, your `stratified_kfold_split()` when a keyword argument `stratify` is `True`).

Your output should look something like this (where the ??'s should be replaced by actual values):

```
===========================================
STEP 2: Predictive Accuracy
===========================================
10-Fold Cross Validation
k Nearest Neighbors Classifier: accuracy = 0.??, error rate = 0.??
Dummy Classifier: accuracy = 0.??, error rate = 0.??

(BONUS) Stratified 10-Fold Cross Validation
k Nearest Neighbors Classifier: accuracy = 0.??, error rate = 0.??
Dummy Classifier: accuracy = 0.??, error rate = 0.??
```
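
As a rough illustration of the cross validation in Step 2, the sketch below builds unshuffled folds and pools one prediction per instance across all folds. The names `sketch_kfold_indices` and `sketch_cross_val_predict` are hypothetical, a majority-label rule stands in for a real classifier, and the fold-size rule (the first `n_samples % n_splits` folds get one extra sample) is the one documented for scikit-learn's `KFold`.

```python
from collections import Counter


def sketch_kfold_indices(n_samples, n_splits):
    """Partition 0..n_samples-1 into n_splits consecutive (train, test) folds."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # early folds absorb the remainder
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train_idx, test_idx))
        start += size
    return folds


def sketch_cross_val_predict(y, n_splits):
    """Pool one majority-label prediction per test instance across all folds."""
    y_true, y_pred = [], []
    for train_idx, test_idx in sketch_kfold_indices(len(y), n_splits):
        majority = Counter(y[i] for i in train_idx).most_common(1)[0][0]
        y_true.extend(y[i] for i in test_idx)
        y_pred.extend(majority for _ in test_idx)
    return y_true, y_pred


folds = sketch_kfold_indices(10, 3)
print([test for _, test in folds])  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

y = ["a", "b", "a", "a", "b", "a", "a", "b", "a", "a"]
y_true, y_pred = sketch_cross_val_predict(y, 3)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, error rate = {1 - accuracy:.2f}")
# accuracy = 0.70, error rate = 0.30
```

Because every instance lands in exactly one test fold, the pooled `y_true` and `y_pred` lists can also feed directly into a confusion matrix.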

### Step 3 Train/Test Sets: Bootstrap Method
Compute the predictive accuracy and error rate for each classifier using the bootstrap method with k = 10. To solve this, I highly recommend writing a function called `bootstrap_method()` that calls your `bootstrap_sample()` in a loop.

Your output should look something like this (where the ??'s should be replaced by actual values):

```
===========================================
STEP 3: Predictive Accuracy
===========================================
k=10 Bootstrap Method
k Nearest Neighbors Classifier: accuracy = 0.??, error rate = 0.??
Dummy Classifier: accuracy = 0.??, error rate = 0.??
```
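
A minimal sketch of the bootstrap procedure in Step 3 follows. It assumes the common convention that the sampled instances form the training set and the instances never drawn (the "out-of-bag" instances) form the test set; this write-up does not spell that out, so check it against the course notes. `sketch_bootstrap_indices` is a hypothetical stand-in for your `bootstrap_sample()`, the `bootstrap_method` signature is made up, the majority-label rule stands in for a real classifier, and pooling the correct counts over all k samples is just one way to combine results.

```python
import random
from collections import Counter


def sketch_bootstrap_indices(n, rng):
    """Draw n indices with replacement; indices never drawn are out of bag."""
    sampled = [rng.randrange(n) for _ in range(n)]
    drawn = set(sampled)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return sampled, out_of_bag


def bootstrap_method(y, k=10, seed=0):
    """Pool majority-label predictions on the out-of-bag instances of k samples."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(k):
        sampled, out_of_bag = sketch_bootstrap_indices(len(y), rng)
        majority = Counter(y[i] for i in sampled).most_common(1)[0][0]
        correct += sum(1 for i in out_of_bag if y[i] == majority)
        total += len(out_of_bag)
    accuracy = correct / total if total else 0.0
    return accuracy, 1 - accuracy


y = ["a"] * 14 + ["b"] * 6         # toy class labels
accuracy, error_rate = bootstrap_method(y)
print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```

Since sampling is done with replacement, the training sample usually contains duplicates, which is why some instances are left over for testing.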

### Step 4 Confusion Matrices
Create confusion matrices for each classifier based on the 10-fold cross validation results (or if you do the bonus, use your stratified 10-fold cross validation results).

You can use the `tabulate` package to display your confusion matrices (it is also okay to format the table manually). Here is an example:

```
===========================================
STEP 4: Confusion Matrices
===========================================
```

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/PAs/master/figures/stratified_mpg_confusion_matrix.png" width="700">

Note: your output should have columns properly aligned for readability. I took a screenshot of the stratified table so you could see the alignment.
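
If you format the table manually, column alignment can be handled along these lines. This is a sketch only: the function name `format_confusion_matrix`, the "MPG Ranking" header text, the extra "Total" and "Recognition (%)" columns, and the counts are all made up for illustration.

```python
def format_confusion_matrix(matrix, labels, title="MPG Ranking"):
    """Return an aligned text table with per-row totals and recognition rates."""
    header = [title] + [str(label) for label in labels] + ["Total", "Recognition (%)"]
    rows = []
    for i, (label, counts) in enumerate(zip(labels, matrix)):
        total = sum(counts)
        # Recognition: share of this row's instances that sit on the diagonal.
        recognition = 100 * counts[i] / total if total else 0.0
        rows.append([str(label)] + [str(c) for c in counts]
                    + [str(total), f"{recognition:.1f}"])
    table = [header] + rows
    # Each column is as wide as its widest cell, then right-justified.
    widths = [max(len(row[col]) for row in table) for col in range(len(header))]
    lines = ["  ".join(cell.rjust(width) for cell, width in zip(row, widths))
             for row in table]
    return "\n".join(lines)


matrix = [[5, 1, 0], [2, 6, 2], [0, 1, 3]]  # made-up counts for illustration
print(format_confusion_matrix(matrix, labels=[1, 2, 3]))
```

If you prefer `tabulate`, passing the same rows and header to `tabulate(rows, headers=header)` should give a comparable table with less code.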

## Bonus (5 pts)
Complete the `stratified_kfold_split(X, y, n_splits=5, random_state=None, shuffle=False)` function in `mysklearn.myevaluation` and test your code for functional correctness against the provided unit test in `test_myevaluation.py`.
* Note: Function inspiration: scikit-learn's [StratifiedKFold split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)

Then add commentary to Part 2 Step 2 to answer the question: Does stratification improve the classifier's performance?
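
One way to think about stratification is to group the instance indices by class label and then deal them out to the folds like cards, so each fold gets roughly the same class proportions as the full dataset. The sketch below illustrates that idea only; `sketch_stratified_folds` is a hypothetical helper, and its fold ordering is not guaranteed to match what the provided unit test or scikit-learn's `StratifiedKFold` expects.

```python
from collections import Counter, defaultdict


def sketch_stratified_folds(y, n_splits):
    """Deal the indices of each class round-robin into n_splits test folds."""
    by_label = defaultdict(list)
    for index, label in enumerate(y):
        by_label[label].append(index)
    folds = [[] for _ in range(n_splits)]
    position = 0  # keeps dealing where the previous class left off
    for indices in by_label.values():
        for index in indices:
            folds[position % n_splits].append(index)
            position += 1
    return folds


y = ["a"] * 6 + ["b"] * 3
stratified = sketch_stratified_folds(y, 3)
for fold in stratified:
    print(fold, dict(Counter(y[i] for i in fold)))
# [0, 3, 6] {'a': 2, 'b': 1}
# [1, 4, 7] {'a': 2, 'b': 1}
# [2, 5, 8] {'a': 2, 'b': 1}
```

Each test fold here keeps the 2:1 class ratio of the full label list, which is the property stratification is meant to preserve.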

## Submitting Assignments
1. Turn in your assignment files via a Github Classroom repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this.
    1. Your repo should contain all of the files needed to run and test your solution (e.g. .py file(s), input files, etc.). 
    1. Double-check that this is the case by "pretending to be the grader": clone (or download a zip) your submission repo and run your code in a fresh [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container like we will when we grade your code.
1. Submit this PA’s associated assignment in Canvas to mark your PA as "done" and ready for grading. We will then pull your Github repo and grade your PA as soon as possible. The date and time you submit the PA assignment in Canvas will be used for marking your assignment as "late" or "on-time."

## Grading Guidelines
This assignment is worth 75 points + 5 points bonus. Your assignment will be evaluated based on a successful execution in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container and adherence to the program requirements. We will grade according to the following criteria:
* 5 pts for correct part 1 (pass `train_test_split()` unit test)
* 15 pts for correct part 1 (pass `kfold_split()` unit test)
* 10 pts for correct part 1 (pass `bootstrap_sample()` unit test)
* 5 pts for correct part 1 (pass `accuracy_score()` unit test)
* 10 pts for correct part 1 (pass `confusion_matrix()` unit test)
* 5 pts for correct part 2 step 1 (random subsampling)
* 5 pts for correct part 2 step 2 (cross validation)
* 5 pts for correct part 2 step 3 (bootstrap method)
* 5 pts for correct part 2 step 4 (confusion matrices)
* 10 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC322/PAs/blob/master/Coding%20Standard.ipynb), including data storytelling (narrative is clear and grammatically correct, Notebook is organized with headers, formulas are typeset with Latex, code receives a "good" `pylint` rating, etc.).
    * See the [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC322/PAs/blob/master/Coding%20Standard.ipynb) for details on how to run `pylint` from the command line
    * The `pylint` scoring portion of these 10 points is 5 pts, computed as 1/2 of the `pylint` rating and rounded to the nearest half point, meaning an 8/10 `pylint` rating would score 4/5 pts
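
The `pylint` arithmetic above can be sketched as a small function. The name `pylint_points` is hypothetical, and rounding exact ties upward is an assumption; the grading note does not say how ties are handled.

```python
import math


def pylint_points(rating, max_points=5.0):
    """Scale a 0-10 pylint rating to max_points, rounded to the nearest 0.5."""
    scaled = rating / 10 * max_points
    # Assumption: ties (e.g. 3.75) round up to the next half point.
    return math.floor(scaled * 2 + 0.5) / 2


print(pylint_points(8.0))  # 4.0
print(pylint_points(9.3))  # 4.5
```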