# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)
# PA5 Naive Bayes (100 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Implement a Naive Bayes classifier
* Calculate conditional probabilities using a Gaussian distribution
* Implement a Random and Zero-R classifier

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Implement a $k$NN classifier
* Evaluate classifiers using train/test sets

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining HW4

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA5 link to accept the assignment and create your private repository in GitHub Classroom: https://classroom.github.com/a/VohZiPzY

Your repo, for example, will be named GonzagaCPSC310/pa5-yourusername (where yourusername is your Github username). I highly recommend committing/pushing regularly so your work is always backed up. We will grade your most recent commit, even if that commit is after the due date (your work will be marked late if this is the case).

## Overview and Requirements
Write a program (`pa5.py`) that builds Naive Bayes classifiers for two datasets: the "pre-processed" automobile dataset (auto-data.txt) you created for PA1 and the titanic dataset (titanic.txt). For the automobile dataset, pick one method to deal with missing values for this assignment (e.g., eliminate rows with missing values, use means or medians, etc.). Download the titanic.txt dataset from https://github.com/GonzagaCPSC310/PAs/tree/master/files

For this assignment you will need to perform the following steps and turn in your source code and a log of any assumptions and/or issues you had in doing the steps. Your log needs to be written separately from your .py file and may be written in a .txt or a .md (markdown) file.

Note: as you write solutions for the following steps, I highly encourage you to design functions that are generic and re-usable for future programming assignments and data mining tasks.

Note: we are learning data mining from scratch! The only libraries you should need to use for this assignment include numpy (sparingly), csv (if you'd like to use it for file I/O), and tabulate (if you'd like to use it for pretty printing). This means no pandas...

## Step 1
Create a Naive Bayes classifier for the "train" dataset from section 3.3 in Bramer. Your classifier should predict the class values from the day, season, wind, and rain attributes. Check your prior probabilities and posterior probabilities with Figure 3.2 and check your classifier's posterior probabilities for each class for the test instance (weekday, winter, high, heavy, ????). Only after you are confident that your Naive Bayes classifier is working correctly, move on to the next step. For convenience, I've provided the column names and dataset as Python lists below: 

```python
col_names = ["day", "season", "wind", "rain", "class"]
table = [
    ["weekday", "spring", "none", "none", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "high", "heavy", "late"],
    ["saturday", "summer", "normal", "none", "on time"],
    ["weekday", "autumn", "normal", "none", "very late"],
    ["holiday", "summer", "high", "slight", "on time"],
    ["sunday", "summer", "normal", "none", "on time"],
    ["weekday", "winter", "high", "heavy", "very late"],
    ["weekday", "summer", "none", "slight", "on time"],
    ["saturday", "spring", "high", "heavy", "cancelled"],
    ["weekday", "summer", "high", "slight", "on time"],
    ["saturday", "winter", "normal", "none", "late"],
    ["weekday", "summer", "high", "none", "on time"],
    ["weekday", "winter", "normal", "heavy", "very late"],
    ["saturday", "autumn", "high", "slight", "on time"],
    ["weekday", "autumn", "none", "heavy", "on time"],
    ["holiday", "spring", "normal", "slight", "on time"],
    ["weekday", "spring", "normal", "none", "on time"],
    ["weekday", "spring", "normal", "slight", "on time"]
]
```
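
As a rough illustration of the counting involved in Step 1, the sketch below estimates priors and conditional probabilities from a list-of-lists table and multiplies them for one unseen instance. The function names (`fit_naive_bayes`, `score_instance`), the toy data, and the choice to leave zero counts unsmoothed are all assumptions made for this example, not a required design.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, class_index):
    """Count-based priors P(C) and the counts needed for P(value | C)."""
    class_counts = Counter(row[class_index] for row in rows)
    priors = {label: count / len(rows) for label, count in class_counts.items()}
    # value_counts[label][attribute index][value] = number of matching rows
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        label = row[class_index]
        for i, value in enumerate(row):
            if i != class_index:
                value_counts[label][i][value] += 1
    return priors, class_counts, value_counts

def score_instance(instance, priors, class_counts, value_counts):
    """Return P(C) * product of P(value_i | C) for every class label C.

    Assumes the instance lists attribute values in column order and that
    the class column comes after all attribute columns.
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        for i, value in enumerate(instance):
            # An unseen (value, class) pair has count 0, so the score becomes 0
            score *= value_counts[label][i][value] / class_counts[label]
        scores[label] = score
    return scores

# Toy data (hypothetical, not the Bramer table): two attributes, then the class
toy = [
    ["sunny", "hot", "no"],
    ["sunny", "mild", "no"],
    ["rainy", "mild", "yes"],
    ["rainy", "hot", "yes"],
    ["sunny", "mild", "yes"],
]
priors, class_counts, value_counts = fit_naive_bayes(toy, class_index=2)
scores = score_instance(["sunny", "mild"], priors, class_counts, value_counts)
prediction = max(scores, key=scores.get)  # class label with the largest score
```

Applied to `table` with `class_index=4` and the instance `["weekday", "winter", "high", "heavy"]`, the per-class scores can be compared against Bramer's worked example.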

## Step 2
Create a Naive Bayes classifier that predicts mpg values (and then maps these to the DOE classification/ranking (given in PA2)) from the cylinders, weight, and model year attributes. For this step, treat cylinders and model year as categorical values and use the following discretization (based on NHTSA vehicle sizes) to convert weight to a categorical attribute:

|Ranking |Range|
|-|-|
|5 |$\geq$ 3500|
|4 |3000-3499|
|3 |2500-2999|
|2 |2000-2499|
|1 |$\leq$ 1999|

Test your classifier by repeating steps 2-5 from PA4 using your Naive Bayes classifier.
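
One possible way to apply the weight discretization from the table above is sketched here. The function name `weight_to_ranking` is hypothetical, and treating any weight below 2000 (including non-integer values between 1999 and 2000) as ranking 1 is an assumption.

```python
def weight_to_ranking(weight):
    """Map a vehicle weight to the 1-5 ranking from the discretization table."""
    cutoffs = [2000, 2500, 3000, 3500]  # lower bounds of rankings 2, 3, 4, 5
    ranking = 1
    for cutoff in cutoffs:
        # Each cutoff the weight reaches moves it up one ranking
        if weight >= cutoff:
            ranking += 1
    return ranking

example_rankings = [weight_to_ranking(w) for w in [1999, 2000, 2999, 3499, 3500]]
```

Keeping the cutoffs in a list means the same loop could be reused for other discretizations by passing in different bounds.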

## Step 3
Create a Naive Bayes classifier as in Step 2, but leave weight as a continuous attribute and calculate its conditional probability using the Gaussian distribution function from class. As in Step 2, test your classifier by repeating steps 2-5 from PA4.
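
For reference, the Gaussian density is commonly written as $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the attribute among training instances of a given class. The sketch below is one way to code it. The helper `mean_and_sdev` and its use of the population (divide by $n$) standard deviation are assumptions, so follow whichever convention was used in class.

```python
import math

def gaussian(x, mean, sdev):
    """Gaussian probability density of x for the given mean and standard deviation."""
    first = 1 / (math.sqrt(2 * math.pi) * sdev)
    second = math.e ** (-((x - mean) ** 2) / (2 * sdev ** 2))
    return first * second

def mean_and_sdev(values):
    """Mean and population standard deviation (assumption: divide by n, not n - 1)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

# Hypothetical attribute values for the training instances of one class
mean, sdev = mean_and_sdev([2, 4, 4, 4, 5, 5, 7, 9])
density = gaussian(5, mean, sdev)  # used in place of a counted P(value | class)
```

In the classifier, the density returned by `gaussian` would stand in for the conditional probability of the continuous attribute, while the categorical attributes are still handled by counting.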

## Step 4
Use Naive Bayes and $k$-nearest neighbor to create two different classifiers to predict survival from the titanic dataset (titanic.txt). Note that the first line of the dataset lists the name of each attribute:
* class
* age
* sex
* survived

Your classifiers should use the class, age, and sex attributes to determine the survival value. Be sure to write down any assumptions you make in creating the classifiers. Evaluate the performance of your classifiers using stratified k-fold cross validation (with k = 10) and generate confusion matrices for the two classifiers. As in PA4, report both accuracy and error rate for the two approaches.
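
Stratified k-fold cross validation keeps the class distribution of each fold close to that of the whole dataset. A minimal sketch of one way to build such folds follows. The function name `stratified_folds`, the `seed` parameter, and the round-robin dealing strategy are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(rows, class_index, k=10, seed=0):
    """Split rows into k folds with roughly equal class proportions per fold."""
    rng = random.Random(seed)
    # Group the rows by class label
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_index]].append(row)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        # Deal this class's rows across the folds like cards, continuing
        # from where the previous class stopped so fold sizes stay balanced
        for row in group:
            folds[position % k].append(row)
            position += 1
    return folds

# Hypothetical data: 30 "yes" rows and 20 "no" rows
toy_rows = [[i, "yes"] for i in range(30)] + [[i, "no"] for i in range(20)]
folds = stratified_folds(toy_rows, class_index=1, k=10)
```

Each fold then serves once as the test set while the other nine folds are combined into the training set, and the predictions from all ten rounds feed a single confusion matrix.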

## Step 5
How well do Naive Bayes and $k$NN classify the titanic dataset? A common approach to evaluate the performance of a classifier is to compare its results to "baseline" classifiers on the same dataset. Two common baseline classifiers are:
1. Random classifier: classifies an instance by randomly choosing a class label (each class label's probability of being chosen is weighted by its frequency in the training set)
1. Zero-R: classifies an instance using "zero rules"... it always predicts the most common class label in the training set. For example, if 99% of the dataset is positive instances, it always predicts positive.

Create a Random classifier and a Zero-R classifier to predict survival from the titanic dataset. Compare your results for all four classifiers (Naive Bayes, $k$NN, Random, Zero-R). Be sure to write down any insights you find in your log.
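
The two baselines can be sketched in a few lines each. The function names and the optional `rng` parameter below are hypothetical, and the labels are made-up toy data. Choosing uniformly from the list of training labels is one simple way to get frequency-weighted random predictions, since a label that appears more often is proportionally more likely to be drawn.

```python
import random
from collections import Counter

def zero_r_predict(train_labels):
    """Zero-R: always predict the most frequent label in the training set."""
    # On a tie, most_common returns the label that was encountered first
    return Counter(train_labels).most_common(1)[0][0]

def random_predict(train_labels, rng=random):
    """Random: draw a label with probability equal to its training frequency."""
    return rng.choice(train_labels)

# Hypothetical training labels: 70% "no", 30% "yes"
labels = ["no"] * 7 + ["yes"] * 3
zero_r_label = zero_r_predict(labels)
random_label = random_predict(labels, rng=random.Random(0))
```

Neither baseline looks at the attribute values of the instance being classified, which is what makes them useful lower bounds for Naive Bayes and $k$NN.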

## Submitting Assignments
1. Use Github classroom to submit your assignment via a Github repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this. You must commit your solution by the due date and time.
1. Your repo should contain only your .py file(s), your input .csv/.txt file(s), and your log file(s) (.txt or .md). Double check that this is the case by cloning (or downloading a zip of) your submission repo and running your code from the command line as we will when we grade your code.

## Grading Guidelines
This assignment is worth 100 points. Your assignment will be evaluated based on a successful run from the command line (using the Anaconda Python Distribution v3.7) and adherence to the program requirements. We will grade according to the following criteria:
* 20 pts for correct step 1
* 20 pts for correct step 2
* 15 pts for correct step 3
* 15 pts for correct step 4
* 15 pts for correct step 5
* 5 pts for quantity and quality of Github commit messages
* 10 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC310/PAs/blob/master/Coding%20Standard.ipynb)