# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# PA6 Naive Bayes and Measuring Performance (100 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Implement a Naive Bayes classifier
* Measure classifier performance using precision, recall, and F1 score
* Calculate conditional probabilities using a Gaussian distribution (bonus)

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Implement test-driven development
* Implement a $k$NN classifier
* Evaluate classifiers using train/test sets
* Tell a data science story using Jupyter Notebook
* Understand Bramer Chapter 3 (Intro to Classification: Naive Bayes and Nearest Neighbour) and Chapter 12 (Measuring the Performance of a Classifier)

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* Part 2: Dr. Shawn Bowers' Data Mining HW4

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA6 link to accept the assignment and create a private repository for your assignment in GitHub Classroom: https://classroom.github.com/a/Pc805GIb

Your repo, for example, will be named GonzagaCPSC322/pa6-yourusername (where yourusername is your GitHub username). I highly recommend committing and pushing regularly so your work is always backed up.

## Overview and Requirements
This assignment involves implementing a Naive Bayes classifier and further exploring how to evaluate the performance of a classifier. It has two main parts:
1. `mysklearn`: Test and implement general and re-usable classification and evaluation algorithms
1. Titanic Dataset Classification (pa6.ipynb): Write a Jupyter Notebook that uses `mysklearn` to perform data classification tasks on a titanic survival dataset

I highly encourage you to design functions that are generic and re-usable for future programming assignments and data mining tasks.

Note: we are learning data science from scratch! The only non-standard Python libraries you should need to use for this assignment are `tabulate`, `numpy` (math functions, random number generation, etc.), and `scipy` (sparingly). This means you should not `pip install` any libraries beyond what is included in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker image, and you should not use `pandas`, `sklearn`, etc. (exceptions are made for testing purposes only!).

## Part 1: `mysklearn` (65 pts)
### Step 1: Implement Naive Bayes Unit Tests for `myclassifiers.py`
Finish the Naive Bayes unit tests in `test_myclassifiers.py` for `MyNaiveBayesClassifier` (Class API design inspiration: Sci-kit Learn's [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html))
1. `fit(X_train, y_train)`
    1. Finish the test function `test_naive_bayes_classifier_fit()` by implementing the following test cases
        1. Use the 8 instance training set example traced in class on the iPad, asserting against our desk check of the priors and posteriors
        1. Use the 15 instance training set example from MA7, asserting against your desk check of the priors and posteriors
        1. Use Bramer 3.2 Figure 3.1 *train* dataset example, asserting against the priors and posteriors solution in Figure 3.2. 
1. `predict(X_test)`
    1. Finish the test function `test_naive_bayes_classifier_predict()` by implementing the following test cases
        1. Use the 8 instance training set example traced in class on the iPad, asserting against our desk check prediction
        1. Use the 15 instance training set example from MA7, asserting against your desk check predictions for the two test instances
        1. Use Bramer 3.2 unseen instance `["weekday", "winter", "high", "heavy"]` and Bramer 3.6 Self-assessment exercise 1 unseen instances, asserting against the solution prediction on pg. 28-29 and the exercise solution predictions in Bramer Appendix E
        
For convenience, I've provided the datasets as Python lists below.

```python
# in-class Naive Bayes example (lab task #1)
header_inclass_example = ["att1", "att2"]
X_train_inclass_example = [
    [1, 5], # yes
    [2, 6], # yes
    [1, 5], # no
    [1, 5], # no
    [1, 6], # yes
    [2, 6], # no
    [1, 5], # yes
    [1, 6] # yes
]
y_train_inclass_example = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]

# MA7 (fake) iPhone purchases dataset
header_iphone = ["standing", "job_status", "credit_rating"]
X_train_iphone = [
    [1, 3, "fair"],
    [1, 3, "excellent"],
    [2, 3, "fair"],
    [2, 2, "fair"],
    [2, 1, "fair"],
    [2, 1, "excellent"],
    [2, 1, "excellent"],
    [1, 2, "fair"],
    [1, 1, "fair"],
    [2, 2, "fair"],
    [1, 2, "excellent"],
    [2, 2, "excellent"],
    [2, 3, "fair"],
    [2, 2, "excellent"],
    [2, 3, "fair"]
]
y_train_iphone = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no", "yes"]

# Bramer 3.2 train dataset
header_train = ["day", "season", "wind", "rain"]
X_train_train = [
    ["weekday", "spring", "none", "none"],
    ["weekday", "winter", "none", "slight"],
    ["weekday", "winter", "none", "slight"],
    ["weekday", "winter", "high", "heavy"],
    ["saturday", "summer", "normal", "none"],
    ["weekday", "autumn", "normal", "none"],
    ["holiday", "summer", "high", "slight"],
    ["sunday", "summer", "normal", "none"],
    ["weekday", "winter", "high", "heavy"],
    ["weekday", "summer", "none", "slight"],
    ["saturday", "spring", "high", "heavy"],
    ["weekday", "summer", "high", "slight"],
    ["saturday", "winter", "normal", "none"],
    ["weekday", "summer", "high", "none"],
    ["weekday", "winter", "normal", "heavy"],
    ["saturday", "autumn", "high", "slight"],
    ["weekday", "autumn", "none", "heavy"],
    ["holiday", "spring", "normal", "slight"],
    ["weekday", "spring", "normal", "none"],
    ["weekday", "spring", "normal", "slight"]
]
y_train_train = ["on time", "on time", "on time", "late", "on time", "very late", "on time",
                 "on time", "very late", "on time", "cancelled", "on time", "late", "on time",
                 "very late", "on time", "on time", "on time", "on time", "on time"]
```

### Step 2: `fit()` and `predict()`
Complete the `mysklearn.myclassifiers.MyNaiveBayesClassifier` methods `fit()` and `predict()` and test your code for functional correctness against the above unit tests.
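
As a rough illustration of the counting behind Naive Bayes, the sketch below tallies priors and per-attribute conditional probabilities (the values the test cases call posteriors) for the 8-instance in-class example, then scores one instance. The helper names `count_probabilities` and `score_instance`, the nested-dictionary layout, and the instance `[1, 5]` are assumptions made for this example; they are not the required design of `MyNaiveBayesClassifier`, and ties between labels are not handled.

```python
from collections import Counter
from fractions import Fraction

X_train = [[1, 5], [2, 6], [1, 5], [1, 5], [1, 6], [2, 6], [1, 5], [1, 6]]
y_train = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]

def count_probabilities(X, y):
    """Hypothetical helper: estimate P(label) and P(att=value | label) by counting."""
    label_counts = Counter(y)
    priors = {label: Fraction(n, len(y)) for label, n in label_counts.items()}
    # conditionals[label][att_index][value] -> P(att_index=value | label)
    conditionals = {label: {} for label in label_counts}
    for label, n_label in label_counts.items():
        rows = [row for row, row_label in zip(X, y) if row_label == label]
        for att_index in range(len(X[0])):
            value_counts = Counter(row[att_index] for row in rows)
            conditionals[label][att_index] = {
                value: Fraction(n, n_label) for value, n in value_counts.items()
            }
    return priors, conditionals

def score_instance(priors, conditionals, instance):
    """Hypothetical helper: P(label) times the product of P(att=value | label)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for att_index, value in enumerate(instance):
            # A value never seen with this label contributes probability 0
            score *= conditionals[label][att_index].get(value, Fraction(0))
        scores[label] = score
    return scores

priors, conditionals = count_probabilities(X_train, y_train)
scores = score_instance(priors, conditionals, [1, 5])
prediction = max(scores, key=scores.get)

print(priors["yes"], priors["no"])   # 5/8 3/8
print(conditionals["yes"][0][1])     # P(att1=1 | yes) = 4/5
print(scores["yes"], scores["no"])   # 1/5 1/6
print(prediction)                    # yes
```

Using `Fraction` keeps the values exact, which makes them easy to compare against a hand-computed desk check; a real implementation would more likely store floats and compare with a tolerance.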

### Step 3: `mysklearn.myevaluation` Functions
Complete the `mysklearn.myevaluation` functions and test your code for functional correctness against the provided unit tests in `test_myevaluation.py`:
1. `binary_precision_score(y_true, y_pred, labels=None, pos_label=None)`
    * Function inspiration: Sci-kit Learn's [`precision_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)
    * Note: our implementation of `precision_score()` only supports binary classification
1. `binary_recall_score(y_true, y_pred, labels=None, pos_label=None)`
    * Function inspiration: Sci-kit Learn's [`recall_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)
    * Note: our implementation of `recall_score()` only supports binary classification
1. `binary_f1_score(y_true, y_pred, labels=None, pos_label=None)`
    * Function inspiration: Sci-kit Learn's [`f1_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
    * Note: our implementation of `f1_score()` only supports binary classification
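
To make the three metrics concrete, here is a minimal sketch that computes them from true positive, false positive, and false negative counts. The helpers `binary_counts` and `safe_divide`, the made-up win/lose labels, and the choice to return 0.0 on division by zero are assumptions for illustration; check the provided docstrings and unit tests for the behavior actually expected of the `binary_*_score()` functions.

```python
def binary_counts(y_true, y_pred, pos_label):
    """Hypothetical helper: count true positives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == pos_label and t == pos_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == pos_label and t != pos_label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != pos_label and t == pos_label)
    return tp, fp, fn

def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero (an assumption)."""
    return numerator / denominator if denominator else 0.0

# Made-up labels for illustration only
y_true = ["win", "lose", "win", "win", "lose", "lose"]
y_pred = ["win", "win", "win", "lose", "lose", "lose"]

tp, fp, fn = binary_counts(y_true, y_pred, pos_label="win")
precision = safe_divide(tp, tp + fp)                           # TP / (TP + FP)
recall = safe_divide(tp, tp + fn)                              # TP / (TP + FN)
f1 = safe_divide(2 * precision * recall, precision + recall)   # harmonic mean

print(tp, fp, fn)  # 2 1 1
print(round(precision, 3), round(recall, 3), round(f1, 3))
```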

## Part 2: 🚢 Titanic Classification 🚢 (25 pts)
The titanic dataset (included in the `input_data` directory) consists of instances representing passengers aboard the Titanic, the ship that sank in the North Atlantic Ocean on 15 April 1912. The dataset has three attributes describing a passenger (class, age, sex) and a binary class label (survived; 711 "yes" and 1490 "no") denoting whether the passenger survived the shipwreck or not.

Write a Jupyter Notebook (pa6.ipynb) that uses your `mysklearn` package to build Naive Bayes, $k$-nearest neighbor, and dummy classifiers to predict survival from the titanic dataset **using k-fold cross validation (with k = 10)** (use stratified k-fold cross validation if you implemented it for the PA5 bonus). Your classifiers should use the class, age, and sex attributes to determine the survival value. Note that since class, age, and sex are categorical attributes, you will need to update your $k$NN implementation to properly compute the distance between categorical attributes. See the [B Nearest Neighbors Classification](https://github.com/GonzagaCPSC322/U4-Supervised-Learning/blob/master/B%20Nearest%20Neighbor%20Classification.ipynb) notes on Github for how to go about doing this.
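
One common way to handle categorical attributes in a distance computation is to treat each attribute's difference as 0 when the values match and 1 when they differ. The sketch below assumes that convention; the course notes linked above remain the authoritative reference, and the function name `mixed_euclidean_distance` and the sample attribute values are illustrative only.

```python
import math

def mixed_euclidean_distance(v1, v2):
    """Hypothetical helper: Euclidean distance where categorical (str) attributes
    contribute 0 if equal and 1 otherwise."""
    total = 0.0
    for a, b in zip(v1, v2):
        if isinstance(a, str) or isinstance(b, str):
            diff = 0.0 if a == b else 1.0
        else:
            diff = a - b
        total += diff ** 2
    return math.sqrt(total)

# Two instances that differ in one attribute
print(mixed_euclidean_distance(["first", "adult", "male"], ["first", "adult", "female"]))  # 1.0
# Two instances that differ in all three attributes: sqrt(3)
print(mixed_euclidean_distance(["crew", "adult", "male"], ["first", "child", "female"]))
```

With only three categorical attributes, many training instances end up at exactly the same distance from a test instance, so how your $k$NN breaks ties can noticeably affect the results.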

How well do $k$NN, Dummy, and Naive Bayes classify the titanic dataset? For each classifier, report the following:
1. Accuracy and error rate
1. Precision, recall, and F1 measure
1. Confusion matrices
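
The sketch below shows how accuracy, error rate, and a confusion matrix relate to one another for a single list of predictions. It assumes the predictions from all cross validation folds have already been collected into one `y_pred` list parallel to `y_true`; the helper name `confusion_matrix_counts` and the tiny made-up labels are illustrative, not part of the provided starter code.

```python
def confusion_matrix_counts(y_true, y_pred, labels):
    """Hypothetical helper: matrix[i][j] counts instances whose true label is
    labels[i] and whose predicted label is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up labels for illustration only
y_true = ["yes", "no", "no", "yes", "no", "no"]
y_pred = ["yes", "no", "yes", "no", "no", "no"]
labels = ["yes", "no"]

matrix = confusion_matrix_counts(y_true, y_pred, labels)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal = correct predictions
accuracy = correct / len(y_true)
error_rate = 1 - accuracy

print(matrix)  # [[1, 1], [1, 3]]
print(round(accuracy, 3), round(error_rate, 3))
```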

In the Notebook, describe the steps, log any assumptions and/or issues you had in doing the steps, and provide insights on the results. All re-usable utility functions should be separate from your Notebook in an appropriate module.

## Bonus `classification_report()` (5 pts)
Add support to part 1 (`mysklearn`) and part 2 (dataset results) to compute and display a "classification report" like Sci-kit Learn's [`classification_report()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html). To do this: 
1. (2 pts) Write a unit test for `classification_report()`
    * The docstring for this function is provided below. Note that our implementation of `classification_report()` will have all the same parameters as Sci-kit Learn's, except:
        * Omit `target_names`
        * Omit `sample_weight`
        * Omit `digits`
        * Omit `zero_division` (always use 0.0 when division by zero occurs)
    * Put the unit test in a file called `test_bonus.py`
    * Use test cases from in-class lab tasks (e.g. binary win-lose and multi-class coffee acidity) and from Sci-kit Learn's [`classification_report()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) documentation page examples
1. (2 pts) Implement `classification_report()`
    * Put the function in `myevaluation.py`
    * Use `tabulate` to produce a nicely formatted classification report table
1. (1 pt) Add classification reports to all results reported in part 2

```python
"""Build a text report and a dictionary showing the main classification metrics.

    Args:
        y_true(list of obj): The ground_truth target y values
            The shape of y is n_samples
        y_pred(list of obj): The predicted target y values (parallel to y_true)
            The shape of y is n_samples
        labels(list of obj): The list of possible class labels. If None, defaults to
            the unique values in y_true
        output_dict(bool): If True, return output as dict instead of a str

    Returns:
        report(str or dict): Text summary of the precision, recall, F1 score for each class.
            Dictionary returned if output_dict is True. Dictionary has the following structure:
                {'label 1': {'precision':0.5,
                            'recall':1.0,
                            'f1-score':0.67,
                            'support':1},
                'label 2': { ... },
                ...
                }
            The reported averages include macro average (averaging the unweighted mean per label) and
            weighted average (averaging the support-weighted mean per label).
            Micro average (averaging the total true positives, false negatives and false positives)
            is only shown for multi-label or multi-class with a subset of classes, because it
            corresponds to accuracy otherwise and would be the same for all metrics.

    Notes:
        Loosely based on sklearn's classification_report():
            https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    """
```
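
The difference between the macro and weighted averages mentioned in the docstring can be seen in a short sketch. The per-label numbers and the helper names `macro_average` and `weighted_average` below are made up for illustration and are not part of the required `classification_report()` API.

```python
# Made-up per-label metrics in the same shape as the output_dict structure
per_label = {
    "label 1": {"precision": 0.5, "recall": 1.0, "support": 1},
    "label 2": {"precision": 1.0, "recall": 0.5, "support": 3},
}

def macro_average(report, metric):
    """Unweighted mean of a metric across labels."""
    values = [scores[metric] for scores in report.values()]
    return sum(values) / len(values)

def weighted_average(report, metric):
    """Mean of a metric across labels, weighted by each label's support."""
    total_support = sum(scores["support"] for scores in report.values())
    if total_support == 0:
        return 0.0  # mirror the "always use 0.0 on division by zero" rule
    weighted_sum = sum(scores[metric] * scores["support"] for scores in report.values())
    return weighted_sum / total_support

print(macro_average(per_label, "precision"))     # 0.75
print(weighted_average(per_label, "precision"))  # 0.875
```

Because "label 2" has three times the support of "label 1", its precision dominates the weighted average, while the macro average treats both labels equally.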

## Submitting Assignments
1. Turn in your assignment files via a Github Classroom repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this.
    1. Your repo should contain all of the files needed to run and test your solution (e.g. .py file(s), input files, etc.). 
    1. Double-check that this is the case by "pretending to be the grader": clone (or download a zip) your submission repo and run your code in a fresh [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container like we will when we grade your code.
1. Submit this PA’s associated assignment in Canvas to mark your PA as "done" and ready for grading. We will then pull your Github repo and grade your PA as soon as possible. The date and time you submit the PA assignment in Canvas will be used for marking your assignment as "late" or "on-time."

## Grading Guidelines
This assignment is worth 100 points + 5 points bonus. Your assignment will be evaluated based on a successful execution in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container and adherence to the program requirements. We will grade according to the following criteria:
* 10 pts for correct part 1 step 1 (define `MyNaiveBayesClassifier` unit tests)
* 20 pts for correct part 1 step 2 (finish `MyNaiveBayesClassifier` `fit()` and pass test)
* 10 pts for correct part 1 step 2 (finish `MyNaiveBayesClassifier` `predict()` and pass test)
* 10 pts for correct part 1 step 3 (finish `myevaluation.py` `binary_precision_score()` and pass test)
* 10 pts for correct part 1 step 3 (finish `myevaluation.py` `binary_recall_score()` and pass test)
* 5 pts for correct part 1 step 3 (finish `myevaluation.py` `binary_f1_score()` and pass test)
* 25 pts for correct part 2 (titanic classification)
* 10 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC322/PAs/blob/master/Coding%20Standard.ipynb), including data storytelling (narrative is clear and grammatically correct, Notebook is organized with headers, formulas are typeset with Latex, code receives a "good" `pylint` rating, etc.).
    * See the [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC322/PAs/blob/master/Coding%20Standard.ipynb) for details on how to run `pylint` from the command line
    * The `pylint` scoring portion of these 10 points is 5 pts, computed as 1/2 of the `pylint` rating and rounded to the nearest half point. For example, an 8/10 `pylint` rating would score 4/5 pts