# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, unlike your student id, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing". You'll then need to have a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second, a dict with a key for every possible class label and the associated *normalized* probability. For example, if we have given the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
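A minimal sketch of such a helper, assuming the probabilities arrive as a plain dict keyed by class label. The name `best_label` is hypothetical and not part of the assignment:

```python
from typing import Dict


def best_label(probabilities: Dict[str, float]) -> str:
    """Return the class label with the highest probability."""
    # max() with key=probabilities.get compares the labels by their probability
    return max(probabilities, key=probabilities.get)


print(best_label({"e": 0.98, "p": 0.02}))  # e
print(best_label({"e": 0.34, "p": 0.66}))  # p
```

If two labels tie, `max` returns whichever it meets first in the dict's iteration order, so a real implementation may want an explicit tie-breaking rule.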

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A is the set of attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of those unnormalized probabilities.
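The normalization step can be sketched as follows. This is only an illustration under the assumption that the unnormalized values are held in a dict keyed by class label; the function name `normalize` and the numbers are made up:

```python
from typing import Dict


def normalize(unnormalized: Dict[str, float]) -> Dict[str, float]:
    """Divide each numerator P(A|C)P(C) by the sum of all the numerators."""
    total = sum(unnormalized.values())
    if total == 0:
        # every class scored 0, so there is nothing to scale; return a copy unchanged
        return dict(unnormalized)
    return {label: value / total for label, value in unnormalized.items()}


# made-up numerators for illustration only
result = normalize({"e": 0.03, "p": 0.01})
print(result)
```

After normalization the values sum to 1 (up to floating point error), which is what lets them be read as probabilities.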

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
    # returns the Naive Bayes Classifier (NBC).
```

The `smoothing` value defaults to True. You should handle both cases.

`classify` takes an NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode, which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).
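As a small illustration of the formula, here is a sketch that computes the error rate from two parallel lists of labels. The function name `error_rate` and the sample labels are assumptions made for the example, not a required interface:

```python
from typing import List


def error_rate(actual: List[str], predicted: List[str]) -> float:
    """Fraction of predictions that disagree with the true labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute an error rate for zero observations")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


# made-up labels: 1 of the 4 predictions is wrong
rate = error_rate(["e", "p", "e", "e"], ["e", "p", "p", "e"])
print(f"{rate * 100}%")  # 25.0%
```

The accuracy for the same labels would be 75%, which is why the two metrics must not be confused.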

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
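One possible shape for a reusable `cross_validate` is sketched below. It is not the required signature: the parameters `train_fn`, `classify_fn`, `evaluate_fn`, `k` and `seed` are assumptions, and the majority-class functions are toy stand-ins so that the block runs on its own.

```python
import random
from statistics import mean, variance
from typing import Callable, List


def cross_validate(data: List[List], train_fn: Callable, classify_fn: Callable,
                   evaluate_fn: Callable, k: int = 10, seed: int = 0) -> List[float]:
    """Shuffle, split into k folds, and report the error rate of each fold."""
    rows = list(data)
    random.Random(seed).shuffle(rows)        # shuffle a copy before creating folds
    folds = [rows[i::k] for i in range(k)]   # k folds of roughly equal size
    rates = []
    for i, test in enumerate(folds):
        # every fold except fold i becomes training data
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)
        rates.append(evaluate_fn(test, classify_fn(model, test)))
        print(f"fold {i} error rate: {rates[-1] * 100:.2f}%")
    print(f"mean error rate: {mean(rates) * 100:.2f}%")
    print(f"variance: {variance(rates):.6f}")  # sample variance of the raw rates
    return rates


# toy stand-ins: always predict the most common label in the training data
def train_majority(training):
    labels = [row[0] for row in training]
    return max(set(labels), key=labels.count)


def classify_majority(model, observations):
    return [(model, {model: 1.0}) for _ in observations]


def evaluate(test, results):
    errors = sum(1 for row, (label, _) in zip(test, results) if row[0] != label)
    return errors / len(test)


toy = [["e", "x"]] * 30 + [["p", "y"]] * 10
rates = cross_validate(toy, train_majority, classify_majority, evaluate, k=5)
```

Because the training step is passed in as a function, a variant such as `functools.partial(train, smoothing=False)` could be supplied as `train_fn` without changing `cross_validate` itself.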

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*: once with `smoothing=True` and once with `smoothing=False`. You should follow up with a brief explanation for the similarities or differences in the results.

In [105]:
import math
import random
from typing import List, Dict, Tuple, Callable
from copy import deepcopy


def parse_data(file_name: str) -> List[List]:
    data = []
    # the context manager closes the file even if parsing fails
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            data.append(datum)
    random.shuffle(data)
    return data

data = parse_data("agaricus-lepiota.data")

len(data[0])

len(data)



def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

folds = create_folds(data, 10)

len(folds)

def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

training, test = create_train_test(folds, 0)

len(training)

len(test)

813

### <a id="count_attributes"></a> count_attributes

Formal Parameters:

**data** The data to train the nbc on

**returns** A dictionary mapping each attribute to how many times it appears with `'e'` and `'p'`

used in [classifier_maker](#classifier_maker)

In [106]:
def count_attributes(data):
    count_dict = {}
    for j in data:
        # j[0] is the class label; the attributes start at index 1
        for i in range(1,len(j)):
            if (i,j[i]) not in count_dict:
                count_dict[(i,j[i])] = {'e':0,'p':0}
            count_dict[(i,j[i])][j[0]] += 1
    return count_dict

In [107]:
test_data1 = [['e','a','b','c']]

assert(count_attributes(test_data1) == {(1,'a'):{'e':1,'p':0},\
                                       (2,'b'):{'e':1,'p':0},\
                                       (3,'c'):{'e':1,'p':0}})

test_data2 = [
    ['e','a','b','c'],
    ['p','a','b','c']]

assert(count_attributes(test_data2) == {(1,'a'):{'e':1,'p':1},\
                                       (2,'b'):{'e':1,'p':1},\
                                       (3,'c'):{'e':1,'p':1}}) 
    
test_data3 = [
    ['e','a','b','c'],
    ['p','a','b','c'],
    ['e','a','b','c'],
    ['e','b','c','a'],
]
assert(count_attributes(test_data3) == {(1,'a'):{'e':2,'p':1},
                                        (2,'b'):{'e':2,'p':1},
                                        (3,'c'):{'e':2,'p':1},
                                       (1,'b'):{'e':1,'p':0},
                                       (2,'c'):{'e':1,'p':0},
                                       (3,'a'):{'e':1,'p':0}})

### <a id="count_classes"></a> count_classes

Formal Parameters:

**data** The data to train the nbc on

**returns** the tuple (count_dict: a dict counting each class, total: the total amount of observations)

used in [classifier_maker](#classifier_maker)

In [108]:
def count_classes(data):
    count_dict = {}
    total = 0
    for j in data:
        total+=1
        if j[0] in count_dict.keys():
            count_dict[j[0]]+=1
        else:
            count_dict[j[0]]=1
    return count_dict, total

### <a id="classifier_maker"></a> classifier_maker

Formal Parameters:

**data** The data to train the nbc on

**smoothing** Whether to apply +1 smoothing to the counts

**returns** classify (not to be confused with [classify](#classify)), a nested function meant to be the nbc.  
Its formal parameters are:
        **data_points** a nested list with each inner list representing an observation
        **labeled** a boolean that determines whether the data_points are labeled, helping to clean the data
        **returns** a List of Tuples. Each Tuple has the best class in the first position and
        a dict with a key for every possible class label and the associated normalized probability

The nbc is trained on the data, but is only half evaluated. Precomputing a result for every possible observation would take space exponential in the number of attributes, so I chose to make the nbc a higher order function that stores only the counts and lazily classifies data when called, saving time and space.

In [109]:
def classifier_maker(data,smoothing):
    class_dict,total = count_classes(data)
    total_e = class_dict['e']
    total_p = class_dict['p']
    if smoothing:
        total_e+=1
        total_p+=1
    count_dict = count_attributes(data)
    def classify(data_points,labeled=True):
        if not labeled:
            # prepend a placeholder label so attribute indices line up with the training data
            data_points = [[''] + list(data_point) for data_point in data_points]
        l = []
        for data_point in data_points:
            prob_e = total_e/total
            prob_p = total_p/total
            for i in range(1,len(data_point)):
                # an attribute value never seen in training counts as zero for both classes
                counts = count_dict.get((i,data_point[i]),{'e':0,'p':0})
                curr_p = counts['p']
                curr_e = counts['e']
                if smoothing:
                    curr_p+=1
                    curr_e+=1
                prob_e*=(curr_e/total_e)
                prob_p*=(curr_p/total_p)
            prob_total = prob_e+prob_p
            # skip normalization when both products are zero to avoid dividing by zero
            if prob_total > 0:
                prob_e = prob_e/prob_total
                prob_p = prob_p/prob_total
            if prob_e > prob_p:
                l.append(('e',{'e':prob_e,'p':prob_p}))
            else:
                l.append(('p',{'e':prob_e,'p':prob_p}))
        return l   
    return classify
    

### <a id="train"></a> train

Formal Parameters:

**data** The data to train the nbc on

**smoothing** Whether to apply +1 smoothing to the counts

**returns** A call to [classifier_maker](#classifier_maker), which returns the nbc as a higher order function.

The nbc is trained on the data, but is only half evaluated. Precomputing a result for every possible observation would take space exponential in the number of attributes, so I chose to make the nbc a higher order function that stores only the counts and lazily classifies data when called, saving time and space.

In [110]:
def train(data, smoothing = True):
    return classifier_maker(data,smoothing)
    

In [116]:
test_data3 = [
    ['e','a','b','c'],
    ['p','a','b','c'],
    ['e','a','b','c'],
    ['e','b','c','a'],
]
nbc = train(test_data3,False)
print(nbc(test_data3))
nbc = train(test_data3,True)
print(nbc(test_data3))

[('p', {'e': 0.47058823529411764, 'p': 0.5294117647058824}), ('p', {'e': 0.47058823529411764, 'p': 0.5294117647058824}), ('p', {'e': 0.47058823529411764, 'p': 0.5294117647058824}), ('e', {'e': 1.0, 'p': 0.0})]
[('p', {'e': 0.4576271186440678, 'p': 0.5423728813559322}), ('p', {'e': 0.4576271186440678, 'p': 0.5423728813559322}), ('p', {'e': 0.4576271186440678, 'p': 0.5423728813559322}), ('e', {'e': 0.6666666666666666, 'p': 0.3333333333333333})]
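The last tuple in each printed list shows what the flag changes: the observation `['e','b','c','a']` contains values that never occur with `'p'` in `test_data3`, so without smoothing the `'p'` product collapses to 0. The sketch below redoes that arithmetic by hand, mirroring how `classifier_maker` adds 1 to both the counts and the class totals. The helper name `posterior` is hypothetical and exists only for this illustration:

```python
# counts taken from test_data3: 3 edible rows, 1 poisonous row, 4 rows in total
total = 4
class_totals = {"e": 3, "p": 1}
# the observation ['e', 'b', 'c', 'a'] has three attribute values, each seen
# once with 'e' and never with 'p'
value_counts = [{"e": 1, "p": 0}] * 3


def posterior(smoothing: bool) -> dict:
    scores = {}
    extra = 1 if smoothing else 0
    for label, class_total in class_totals.items():
        denominator = class_total + extra   # same adjustment as classifier_maker
        score = denominator / total         # prior, as classifier_maker computes it
        for counts in value_counts:
            score *= (counts[label] + extra) / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


print(posterior(smoothing=False))  # {'e': 1.0, 'p': 0.0}
print(posterior(smoothing=True))   # roughly e: 0.667, p: 0.333
```

With smoothing, the zero counts become 1, so `'p'` keeps a nonzero probability instead of being ruled out by a single unseen value.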


### <a id="classify"></a> classify

Formal Parameters:

**nbc** The nbc, a higher order function

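**observations**  a list of observations, each a list of attribute values

**labeled** Whether the observations are labeled with a classification, defaults to `True`

**returns** a List of Tuples, one per observation. Each Tuple has the best class (`'p'` for poisonous or `'e'` for edible) in the first position and a dict with the normalized probability of each class in the second.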

This classifies observations based on a nbc.  The labeled parameter helps determine if we are testing or not, and helps the function clean the data.

In [112]:
def classify(nbc,observations,labeled=True):
    return nbc(observations,labeled)

### <a id="evaluate"></a> evaluate

Formal Parameters:

**nbc** The nbc, a higher order function

**test**  A list of lists of test data

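**model** A higher order function, but actually just [classify](#classify)

**labeled** Whether the test data is labeled; passed through to the model

**returns** The error rate: the number of errors divided by the number of test observations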

This determines the error rate of the nbc on the test data.

In [113]:
def evaluate(nbc,test,model,labeled):
    total = len(test)
    error = 0
    for data_point in test:
        prediction = model(nbc,[data_point],labeled)
        prediction = prediction[0][0]
        actual = data_point[0]
        if actual!=prediction:
            error+=1
    
    rate = error/total
    return rate


### <a id="cross_validate"></a> cross_validate

Formal Parameters:

**folds** A list of folds, each a list of observations, as produced by `create_folds`

**smoothing**  Whether the nbc should apply +1 smoothing

**labeled** Whether the data is labeled or not.  Used by [classify](#classify)

**returns** The average error rate of 10 folds of cross validation, as a percentage

**prints** The error rate of each fold, then the average error rate and the variance

This determines the average error rate and variance of the nbc over 10 folds of cross validation.

In [114]:
def cross_validate(folds,smoothing = True,labeled=True):
    print("nbc evaluation with smoothing = "+ str(smoothing))
    total = 0
    rates = []
    variance = 0
    for i in range(10):
        training, test = create_train_test(folds, i)
        nbc = train(training,smoothing)
        error = evaluate(nbc,test,classify,labeled)
        print("fold "+str(i)+ " error rate: " + str(100*error)+"%")
        total += error
        rates.append(error)
      
    mean = total/10
    for r in rates:
        variance += (r-mean)**2
        
    variance /=9
        
        

    print("mean error rate: " + str(100*mean)+"%")
    print("variance: " + str(10000*variance) +"%")
    return 100*total/10

In [115]:
folds = create_folds(data,10)
cross_validate(folds)
cross_validate(folds,smoothing=False)

nbc evaluation with smoothing = True
fold 0 error rate: 4.797047970479705%
fold 1 error rate: 4.551045510455105%
fold 2 error rate: 5.289052890528905%
fold 3 error rate: 4.674046740467404%
fold 4 error rate: 4.433497536945813%
fold 5 error rate: 4.1871921182266005%
fold 6 error rate: 3.9408866995073892%
fold 7 error rate: 4.1871921182266005%
fold 8 error rate: 5.0492610837438425%
fold 9 error rate: 4.556650246305419%
mean error rate: 4.566587291488679%
variance: 0.1685584432940179%
nbc evaluation with smoothing = False
fold 0 error rate: 0.4920049200492005%
fold 1 error rate: 0.7380073800738007%
fold 2 error rate: 0.24600246002460024%
fold 3 error rate: 0.24600246002460024%
fold 4 error rate: 0.12315270935960591%
fold 5 error rate: 0.24630541871921183%
fold 6 error rate: 0.3694581280788177%
fold 7 error rate: 0.0%
fold 8 error rate: 0.12315270935960591%
fold 9 error rate: 0.49261083743842365%
mean error rate: 0.3076697023127867%
variance: 0.04792399804009961%


0.3076697023127867

We can clearly see that without smoothing, the nbc has about a 15x lower mean error rate (0.31% versus 4.57%) and about 3.5x lower variance (0.048 versus 0.169). I assume this is because smoothing is used to handle missing data, and this particular data set has relatively few missing values, all of them for one attribute.

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.