## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag controls whether or not you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second, a Dict with a key for every possible class label and the associated *normalized* probability. For example, if we give the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
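
A minimal sketch of such a helper, assuming the Dict maps each class label to its normalized probability as in the example above. The name `best_label` is hypothetical and not part of the required interface.

```python
from typing import Dict


def best_label(probabilities: Dict[str, float]) -> str:
    """Return the class label with the highest probability.

    `best_label` is a hypothetical helper name; ties go to the first key.
    """
    return max(probabilities, key=probabilities.get)


results = [("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
predicted = [best_label(probs) for _, probs in results]
print(predicted)  # ['e', 'p']
```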

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
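
Written out under the naive (conditional independence) assumption, with $a_1, \dots, a_n$ denoting the attribute values of one instance and $c'$ ranging over all class labels (notation introduced here only for illustration), the normalized probability would be:

$$P(c|a_1,\dots,a_n) = \frac{P(c)\prod_{i=1}^{n}P(a_i|c)}{\sum_{c'}P(c')\prod_{i=1}^{n}P(a_i|c')}$$

For instance, with hypothetical unnormalized values of 0.03 for "e" and 0.01 for "p", the normalizer is 0.04 and the normalized probabilities are 0.75 and 0.25.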

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the Naive Bayes Classifier (NBC).
```

The `smoothing` value defaults to True. You should handle both cases.
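
As a rough sketch of what the flag changes, the helper below computes a single conditional probability estimate the way the `p_feature` function later in this notebook does: with smoothing, 1 is added to both the count and the class total. The function name and the counts are made up for illustration. Note that other definitions of +1 (Laplace) smoothing add the number of distinct attribute values to the denominator instead.

```python
def conditional_estimate(count: int, class_total: int, smoothing: bool = True) -> float:
    """Estimate P(attribute=value | class) from raw counts.

    Hypothetical helper; mirrors the (count + 1) / (class_total + 1) form
    used by this notebook's p_feature when smoothing is on.
    """
    if smoothing:
        return (count + 1) / (class_total + 1)
    return count / class_total


# Made-up counts: a value never seen with a class of 99 training instances.
print(conditional_estimate(0, 99, smoothing=False))  # 0.0
print(conditional_estimate(0, 99, smoothing=True))   # 0.01
```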

`classify` takes an NBC produced by the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as in the pseudocode, which classifies only one instance at a time; it can call that function, though.)

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!)
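
A small sketch of the metric, using made-up labels, to show how it differs from accuracy. The function name `error_rate` is an assumption for illustration and is separate from the `evaluate` function required above.

```python
from typing import List


def error_rate(actual: List[str], predicted: List[str]) -> float:
    """Fraction of predictions that differ from the true labels."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


# Made-up labels: 1 of the 4 predictions is wrong.
actual = ["e", "p", "e", "p"]
predicted = ["e", "p", "p", "p"]
print(error_rate(actual, predicted))      # 0.25
print(1 - error_rate(actual, predicted))  # 0.75 (accuracy, not what is asked for)
```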

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.
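
One possible way to get that flexibility is sketched below with `functools.partial`: extra arguments are bound up front so the generic step only ever sees functions of the form `train(data)` and `classify(model, data)`. Everything here (`run_fold`, `majority_train`, `majority_classify`, the toy data) is hypothetical and only illustrates the pattern.

```python
from functools import partial
from typing import Callable, List


def majority_train(data: List[List[str]], class_index: int) -> str:
    """Toy 'model': the most common label in the chosen column."""
    labels = [row[class_index] for row in data]
    return max(sorted(set(labels)), key=labels.count)


def majority_classify(model: str, data: List[List[str]]) -> List[str]:
    """Toy classifier: predict the stored label for every row."""
    return [model for _ in data]


def run_fold(train_fn: Callable, classify_fn: Callable, train_data, test_data):
    """Generic step: works with any train/classify pair of this form."""
    model = train_fn(train_data)
    return classify_fn(model, test_data)


# Bind the extra argument once; run_fold never needs to know about it.
train_fn = partial(majority_train, class_index=0)
train_data = [["e", "x"], ["e", "y"], ["p", "x"]]
test_data = [["p", "y"], ["e", "x"]]
print(run_fold(train_fn, majority_classify, train_data, test_data))  # ['e', 'e']
```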

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates as percentages (i.e., multiply the error rate by 100 and add "%" to the end).
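
The reporting step might look like the sketch below, which assumes the per-fold error rates have already been collected as fractions. The name `summarize`, the two-decimal formatting, and the use of the population variance are illustrative choices, not requirements.

```python
from typing import List, Tuple


def summarize(error_rates: List[float]) -> Tuple[float, float]:
    """Print each fold's error rate as a percent; return (mean, variance)."""
    percents = [100 * rate for rate in error_rates]
    for fold, percent in enumerate(percents, start=1):
        print(f"Fold {fold}: {percent:.2f}%")
    mean = sum(percents) / len(percents)
    # Population variance (divide by the number of folds).
    variance = sum((p - mean) ** 2 for p in percents) / len(percents)
    print(f"Mean: {mean:.2f}%, variance: {variance:.2f}")
    return mean, variance


# Made-up error rates for three folds.
mean, variance = summarize([0.25, 0.5, 0.75])
```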

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

In [1]:
import random
import numpy as np
from typing import List, Dict, Tuple, Any

In [2]:
def parse_data(file_name: str) -> List[List]:
    data = []
    # Use a context manager so the file is closed after reading.
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            data.append(datum)
    # Shuffle once here so the folds created later are randomized.
    random.shuffle(data)
    return data

In [3]:
data = parse_data("data/agaricus-lepiota.data")

In [4]:
len(data)

8124

In [5]:
attributes = [
    "cap-shape",
    "cap-surface",
    "cap-color",
    "bruises?",
    "odor",
    "gill-attachment",
    "gill-spacing",
    "gill-size",
    "gill-color",
    "stalk-shape",
    "stalk-root",
    "stalk-surface-above-ring",
    "stalk-surface-below-ring",
    "stalk-color-above-ring",
    "stalk-color-below-ring",
    "veil-type",
    "veil-color",
    "ring-number",
    "ring-type",
    "spore-print-color",
    "population",
    "habitat",
]


In [6]:
len(attributes)

22

<a id="extract_column"></a>
### extract_column

`extract_column` extracts a column into a list from a list of lists. **Used by**:[c_class](#c_class), [evaluate](#evaluate)

* **data** List[List]: the list of lists
* **column** int: determines which column to extract

**return**: List: the extracted column as a list

In [7]:
def extract_column(data: List[List], column: int) -> List:
    # Collect the value at position `column` from every row.
    return [row[column] for row in data]

<a id="create_folds"></a>
### create_folds

`create_folds` creates folds of data to use for cross validation. **Used by**:[cross_validate](#cross_validate)

* **xs** List: the list to create folds with
* **n** int: number of folds

**return**: List[List[List]]: list of folds stored as list of lists

In [8]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    return list(xs[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(n))
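
To see how the slicing distributes a remainder, the sketch below computes only the fold sizes that the same `divmod` logic implies: the first `m` folds get one extra item each. `fold_sizes` is a hypothetical helper, not part of the notebook.

```python
from typing import List


def fold_sizes(n_items: int, n_folds: int) -> List[int]:
    """Sizes of the folds produced by divmod-based slicing."""
    k, m = divmod(n_items, n_folds)
    # The first m folds absorb the remainder, one extra item each.
    return [k + 1 if i < m else k for i in range(n_folds)]


print(fold_sizes(8124, 10))
# [813, 813, 813, 813, 812, 812, 812, 812, 812, 812]
```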

<a id="create_train_test"></a>
### create_train_test

`create_train_test` creates a training set and a test set from folded data. **Used by**: [cross_validate](#cross_validate)

* **folds** List[List[List]]: the list of folded data
* **index** int: which fold to use as test data

**return**: Tuple[List[List], List[List]]: the training data (all remaining folds combined) and the test data (the fold at `index`)

In [9]:
def create_train_test(
    folds: List[List[List]], index: int
) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test


<a id="train"></a>
### train

`train` Runs the training algorithm and returns the model. **Used by**: [cross_validate](#cross_validate) **Uses**: [clf_nb](#clf_nb)

* **training_data** List[List]: a list of training data
* **attributes** List: list of attributes that the data represents
* **smoothing** Bool: determines whether to use +1 smoothing or not

**return**: Dict: the model of probabilities that can be used to classify data

In [10]:
def train(training_data, attributes, smoothing=True) -> Dict:
    model = clf_nb(training_data, attributes, smoothing)
    return model

<a id="classify"></a>
### classify

`classify` Runs the classification algorithm to attach labels to data based on the Naive Bayes model. **Used by**:[cross_validate](#cross_validate) **Uses**: [classify_nb](#classify_nb)

* **nbc** Dict: the Naive Bayes model
* **observations** List[List]: The list of data to be classified
* **attributes** List: list of attributes that the data represents
* **labeled** Bool: determines whether the data is labeled or only contains features

**return**: List[Tuple]: The list of data labeled using the model

In [11]:
def classify(nbc, observations, attributes, labeled=True) -> List:
    result = []
    if isinstance(observations[0], list):
        for obs in observations:
            instance_result = classify_nb(nbc, obs, attributes, labeled)
            result.append(instance_result)
    else:
        instance_result = classify_nb(nbc, observations, attributes, labeled)
        result.append(instance_result)
    return result

<a id="classify_nb"></a>
### classify_nb

`classify_nb` Runs the classification algorithm to attach labels to an instance of data based on the Naive Bayes model. **Used by**:[classify](#classify) **Uses**:[probability_of](#probability_of), [normalize](#normalize)

* **probs** Dict: the Naive Bayes model
* **instance** List: The list of data to be classified
* **attributes** List: list of attributes that the data represents
* **labeled** Bool: determines whether the data is labeled or only contains features

**return**: Tuple: The best value found and the probabilities of the class labels

In [12]:
def classify_nb(probs, instance, attributes, labeled):
    result = {}
    class_labels = list(probs["class"].keys())
    # Unnormalized probability of the instance under each class label.
    for label in class_labels:
        result[label] = probability_of(instance, label, probs, attributes, labeled)
    result = normalize(result)
    # The best label is the one with the highest normalized probability.
    best_val = max(result, key=result.get)
    return (best_val, result)

<a id="probability_of"></a>
### probability_of

`probability_of` computes the *unnormalized* probability of an instance for one class label: the product of the conditional probabilities of its attribute values given the label, times the prior probability of the label. **Used by**:[classify_nb](#classify_nb)

* **instance** List: The list of data to be classified
* **label** str: the class label to be evaluated for
* **probs** Dict: the Naive Bayes model
* **attributes** List: list of attributes that the data represents
* **labeled** Bool: determines whether the data is labeled or only contains features

**return**: float: the unnormalized probability of the instance for the class label

In [13]:
def probability_of(instance, label, probs, attributes, labeled):
    probability = 1
    for i, a in enumerate(instance):
        if labeled is True:
            if i == 0:
                continue
            probability = probability * probs[attributes[i - 1]][a][label]
        else:
            probability = probability * probs[attributes[i]][a][label]
    probability = probability * probs["class"][label]
    return probability

<a id="normalize"></a>
### normalize

`normalize` normalizes the probability for each class label. **Used by**:[classify_nb](#classify_nb) 

* **result** Dict: the probabilities for each class label for the instance

**return**: Dict: the normalized probabilities for each class label for the instance

In [14]:
def normalize(result):
    norm = 0
    for k in result:
        norm += result[k]
    for k in result:
        result[k] = result[k] / norm
    return result
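
The version above modifies `result` in place and raises `ZeroDivisionError` if every class has probability 0, which could happen without smoothing. A possible variant, shown only as a sketch, returns a new Dict and leaves an all-zero input unchanged. The name `normalize_safe` and the choice of fallback are assumptions, not part of the assignment.

```python
from typing import Dict


def normalize_safe(result: Dict[str, float]) -> Dict[str, float]:
    """Return a normalized copy of `result` without mutating it.

    Hypothetical variant: if the total is 0, the values are returned as-is.
    """
    total = sum(result.values())
    if total == 0:
        return dict(result)
    return {label: value / total for label, value in result.items()}


raw = {"e": 0.375, "p": 0.125}
print(normalize_safe(raw))                   # {'e': 0.75, 'p': 0.25}
print(normalize_safe({"e": 0.0, "p": 0.0}))  # {'e': 0.0, 'p': 0.0}
```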

<a id="evaluate"></a>
### evaluate

`evaluate` compares the classification results against the actual labels and computes the error rate. **Used by**:[cross_validate](#cross_validate) **Uses**: [extract_column](#extract_column)

* **data_set** List[List]: the labeled data set that was classified (class label in column 0)
* **classification_result** List[Tuple]: list of classifications for the data_set

**return**: float: the percentage of data that was incorrectly classified (the error rate times 100)

In [15]:
def evaluate(data_set, classification_result) -> float:
    errors = 0
    actual = extract_column(data_set, 0)
    for i, value in enumerate(actual):
        # Count every prediction that disagrees with the true label.
        if classification_result[i][0] != value:
            errors += 1
    # Error rate expressed as a percentage.
    return 100 * errors / len(actual)

<a id="clf_nb"></a>
### clf_nb

`clf_nb` runs the Naive Bayes algorithm to collect the prior probabilities of the class labels and the conditional probabilities of the features given each class label. **Used by**:[train](#train) **Uses**:[p_class](#p_class), [p_feature](#p_feature)

* **data** List[List]: the dataset as a list of lists
* **attributes** List[Str]: list of attributes in the dataset 
* **smoothing** Bool: determines if smoothing is turned on or off

**return**: Dict: the class label probabilities (under the key `"class"`) and, for each attribute, the probability of each feature value given each class label

In [16]:
def clf_nb(data, attributes, smoothing=True) -> Dict:
    p_c = p_class(data, 0)
    p_fc = p_feature(data, attributes, 0, smoothing)
    nb_dict = {"class": p_c, **p_fc}
    return nb_dict

<a id="p_feature"></a>
### p_feature

`p_feature` finds the probability of occurrence of each feature value given each class label in the dataset. **Used by**:[clf_nb](#clf_nb)  **Uses**:[c_class](#c_class)

* **data** List[List]: the dataset as a list of lists
* **attributes** List[Str]: list of attributes in the dataset 
* **class_loc** int: determines which column to extract as the class labels
* **smoothing** Bool: determines if smoothing is turned on or off

**return**: Dict: nested Dict mapping attribute, then feature value, then class label to the probability of that value given the class label

In [17]:
def p_feature(data, attributes, class_loc, smoothing):
    p_dict = {}
    c_counts = c_class(data, class_loc)
    for i, a in enumerate(attributes):
        p_dict[a] = {}
        a_counts = c_class(data, i + 1)
        for k2 in a_counts:
            p_dict[a][k2] = {}
            for k in c_counts:
                p_dict[a][k2][k] = 0
                for d in data:
                    if d[i + 1] == k2 and d[0] == k:
                        p_dict[a][k2][k] += 1
                if smoothing is True:
                    p_dict[a][k2][k] = (p_dict[a][k2][k] + 1) / (c_counts[k] + 1)
                else:
                    p_dict[a][k2][k] = p_dict[a][k2][k] / c_counts[k]
    return p_dict

<a id="p_class"></a>
### p_class

`p_class` finds the probability of occurrence for each class label in the dataset. **Used by**:[clf_nb](#clf_nb)  **Uses**:[c_class](#c_class)

* **data** List[List]: the dataset as a list of lists
* **class_loc** int: determines which column to extract as the class labels

**return**: Dict: the class labels and their probability of occurrence

In [18]:
def p_class(data, class_loc) -> Dict:
    class_dict = c_class(data, class_loc)
    for k in class_dict:
        class_dict[k] = class_dict[k] / len(data)
    return class_dict

<a id="c_class"></a>
### c_class

`c_class` counts the occurrences of each distinct value (class label or feature value) in one column of the dataset. **Used by**: [p_class](#p_class), [p_feature](#p_feature) **Uses**:[extract_column](#extract_column)

* **data** List[List]: the dataset as a list of lists
* **class_loc** int: determines which column to extract and count

**return**: Dict: the distinct values in the column and their number of occurrences

In [19]:
def c_class(data, class_loc) -> Dict:
    class_dict = {}
    class_col = extract_column(data, class_loc)
    unique = np.unique(class_col)
    for u in unique:
        class_dict[u] = 0
        for c in class_col:
            if c == u:
                class_dict[u] += 1
    return class_dict

<a id="cross_validate"></a>
### cross_validate

`cross_validate` runs the 10 fold cross validation of the Naive Bayes model. **Uses**:[create_folds](#create_folds), [create_train_test](#create_train_test), [train](#train), [classify](#classify), [evaluate](#evaluate)

* **data** List[List]: the dataset as a list of lists
* **attributes** List[Str]: list of attributes in the dataset 
* **classify_function** Callable: determines which classify function to run
* **eval_function** Callable: determines which evaluation function to run
* **smoothing** Bool: determines if smoothing is turned on or off

**return**: float: the average error rate (as a percentage) across the folds

In [20]:
def cross_validate(
    data, attributes, classify_function, eval_function, smoothing=True
) -> float:
    folds = create_folds(data, 10)
    eval_clf = []
    for i in range(len(folds)):
        # Hold out fold i as the test set and train on the rest.
        train_data, test = create_train_test(folds, i)
        model = train(train_data, attributes, smoothing)
        clf = classify_function(model, test, attributes, True)
        eval_clf.append(eval_function(test, clf))
        print(f"Fold {i+1} error rate {eval_clf[i]}%")
    # Mean and (population) variance of the per-fold error rates.
    total = sum(eval_clf) / len(eval_clf)
    res = sum((rate - total) ** 2 for rate in eval_clf) / len(eval_clf)
    print(f"Average Value: {total}, Variance: {res}")
    return total


In [21]:
avg_error_unsmoothed = cross_validate(data, attributes, classify, evaluate, False)

Fold 1 error rate 0.12300123001230012%
Fold 2 error rate 0.24600246002460024%
Fold 3 error rate 0.12300123001230012%
Fold 4 error rate 0.24600246002460024%
Fold 5 error rate 0.49261083743842365%
Fold 6 error rate 0.49261083743842365%
Fold 7 error rate 0.6157635467980296%
Fold 8 error rate 0.49261083743842365%
Fold 9 error rate 0.3694581280788177%
Fold 10 error rate 0.24630541871921183%
Average Value: 0.3447366985985131, Variance: 0.026718583698396175


In [22]:
avg_error_smoothed = cross_validate(data, attributes, classify, evaluate, True)

Fold 1 error rate 4.797047970479705%
Fold 2 error rate 4.059040590405904%
Fold 3 error rate 4.182041820418204%
Fold 4 error rate 4.305043050430505%
Fold 5 error rate 4.187192118226601%
Fold 6 error rate 3.8177339901477834%
Fold 7 error rate 4.926108374384237%
Fold 8 error rate 5.0492610837438425%
Fold 9 error rate 5.41871921182266%
Fold 10 error rate 4.679802955665025%
Average Value: 4.542199116572446, Variance: 0.2326946516402139


The model without smoothing had a lower average error rate (about 0.34% versus about 4.54% with smoothing). This is likely because the data has many feature values that occur with only one class label. Without smoothing, such a value gives the other class label a probability of 0, so it effectively decides the classification on its own. With smoothing, that probability is small but no longer 0, so the remaining features can outweigh it. Smoothing will likely make the model more robust against overfitting if more data is collected that is more diverse in its feature distribution among the classes.
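
The effect can be illustrated with made-up numbers (they are not taken from the Mushroom data): a single zero conditional probability forces a class's product to 0 no matter what the other attributes say, while a smoothed estimate only shrinks it. The `posterior` helper and all values below are hypothetical.

```python
from math import prod

# Hypothetical conditional probabilities P(attribute value | class) for one instance.
# Without smoothing, the second value was never seen with class "p".
unsmoothed = {"e": [0.5, 0.25], "p": [0.5, 0.0]}
# With smoothing the zero becomes a positive number (made up here).
smoothed = {"e": [0.5, 0.25], "p": [0.5, 0.125]}
priors = {"e": 0.5, "p": 0.5}


def posterior(likelihoods, priors):
    """Normalized class probabilities from per-attribute likelihoods."""
    raw = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    total = sum(raw.values())
    return {c: raw[c] / total for c in raw}


print(posterior(unsmoothed, priors))  # {'e': 1.0, 'p': 0.0}
print(posterior(smoothed, priors))    # "p" now keeps a nonzero share
```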