## Naive Bayes Classifier

In [1]:
from typing import Dict, List, Any, Tuple, Callable
import numpy as np
import random

Download the [mushroom dataset](http://archive.ics.uci.edu/ml/datasets/Mushroom). 

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10-fold cross-validation, with the error rate as the evaluation metric. Test with and without smoothing.

<a id="create_folds"></a>
## create_folds
*The create_folds function is a helper function that splits the data into n contiguous folds of near-equal size. Each fold serves once as the test set.* **Used by**: [k_fold_validation](#k_fold_validation)

* **data** List[Any]: the dataset
* **n** int: the number of folds

**returns** List[List[List]].

In [2]:
# function by S. Butcher, 2022
def create_folds(data: List[Any], n: int) -> List[List[List]]:
    k, m = divmod(len(data), n)
    # be careful of generators...
    return list(data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
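
To see how the `divmod` arithmetic distributes rows, the sketch below recomputes only the slice boundaries. The helper name `fold_bounds` is hypothetical and not part of the notebook; it mirrors the index expression in create_folds and suggests that the first `m` folds each receive one extra row.

```python
from typing import List, Tuple

def fold_bounds(n_items: int, n_folds: int) -> List[Tuple[int, int]]:
    # k is the base fold size, m is the number of folds that get one extra row
    k, m = divmod(n_items, n_folds)
    return [(i * k + min(i, m), (i + 1) * k + min(i + 1, m)) for i in range(n_folds)]

# 15 rows split into 10 folds (hypothetical sizes chosen for illustration)
bounds = fold_bounds(15, 10)
sizes = [stop - start for start, stop in bounds]
print(bounds)
print(sizes)
```

With 15 rows and 10 folds, `k` is 1 and `m` is 5, so the first five folds hold two rows each and the remaining five hold one row each.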

<a id="creat_train_test"></a>
## create_train_test
*The create_train_test function is a helper that returns the training and test splits for the provided fold index.* **Used by**: [k_fold_validation](#k_fold_validation)

* **folds** List[List[List[str]]]: the folds created by [create_folds](#create_folds)
* **index** int: the index of the fold to use for the test set

**returns** Tuple[List[List], List[List]].

In [3]:
# function by S. Butcher, 2022
def create_train_test(folds: List[List[List[str]]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

<a id="split_labels"></a>
## split_labels
*The split_labels function is a helper function to separate the labels from the data.* **Used by**: [k_fold_validation](#k_fold_validation)

* **data** List[List[str]]: the data with labels
* **label_index** int: the index of the label

**returns** Tuple[List[List[str]], List[str]]: the rows with the label removed, followed by the labels.

In [4]:
def split_labels(data: List[List[str]], label_index: int) -> Tuple[List[List[str]], List[str]]:
    values = [row[:label_index] + row[label_index+1:] for row in data]
    labels = [row[label_index] for row in data]
    return values, labels

<a id="train_naive_bayes"></a>
## train_naive_bayes
*The train_naive_bayes function calculates the probabilities for a Naive Bayes Classifier and stores them in nested dictionaries. The outer dictionary is keyed by class label. Each value is a dictionary in which the key "c" holds the class probability and every other key is an attribute name. Each attribute name maps to a dictionary from attribute value to the conditional probability of that value given the class label. The class probability is the prior, and the conditional probabilities are the likelihoods; their product, once normalized, gives the posterior used for prediction. The [classify_observation](#classify_observation) function uses these class probabilities and conditional probabilities to classify unseen examples. See the unit tests below for an example of the output format.*

**Used by**: [k_fold_validation](#k_fold_validation)

* **data** List[List[str]]: nested lists containing rows of training data with the labels removed
* **labels** List[str]: the labels for the training data
* **attributes** List[str]: the dataset attributes with the label name removed
* **smoothing** bool: an optional boolean specifying whether the train function should apply +1 smoothing, which adds 1 to both the value count and the class count when calculating each conditional probability. The default value is True.

**returns** Dict[str, Dict[str, Any]].

In [5]:
def train_naive_bayes(data: List[List[str]], labels: List[str], attributes: List[str], smoothing: bool = True) -> Dict[str, Dict[str, Any]]:
    # +1 smoothing adds 1 to both the numerator and the denominator below
    smoothing_factor = 1 if smoothing else 0
    unique_labels = set(labels)
    output = {}

    for class_type in unique_labels:
        class_count = labels.count(class_type)
        # prior probability of the class
        output[class_type] = {"c": class_count / len(labels)}
        # calculate the conditional probabilities for each attribute/value combination given the current class
        for index, attribute in enumerate(attributes):
            output[class_type][attribute] = {}
            # every value the attribute takes anywhere in the training data
            unique_values = {row[index] for row in data}
            for value in unique_values:
                value_count = sum(1 for row, label in zip(data, labels)
                                  if row[index] == value and label == class_type)
                prob = (value_count + smoothing_factor) / (class_count + smoothing_factor)
                output[class_type][attribute][value] = prob
    return output
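
As a small worked check of the +1 rule, the sketch below recomputes the "shape" probabilities for class "no" from the toy data in the unit tests (8 "no" rows, of which 3 are round and 5 are square). For comparison only, it also shows standard add-one (Laplace) smoothing, which adds the number of distinct attribute values to the denominator; that variant is not what train_naive_bayes does. The function names are hypothetical.

```python
from typing import Dict

def plus_one_estimate(value_count: int, class_count: int) -> float:
    # the notebook's rule: add 1 to both the numerator and the denominator
    return (value_count + 1) / (class_count + 1)

def laplace_estimate(value_count: int, class_count: int, n_values: int) -> float:
    # standard add-one smoothing: add the number of distinct values to the denominator
    return (value_count + 1) / (class_count + n_values)

# counts for class "no" in the toy data: 8 rows, 3 round and 5 square
shape_counts_no: Dict[str, int] = {"round": 3, "square": 5}
class_count_no = 8

plus_one = {v: plus_one_estimate(c, class_count_no) for v, c in shape_counts_no.items()}
laplace = {v: laplace_estimate(c, class_count_no, len(shape_counts_no))
           for v, c in shape_counts_no.items()}

print(plus_one, sum(plus_one.values()))
print(laplace, sum(laplace.values()))
```

The +1 estimates are 4/9 and 6/9, which agree with the expected output in the unit tests but sum to 10/9 rather than 1. The Laplace estimates sum to 1. Because classify_observation normalizes the class scores afterwards, the +1 rule still yields a usable ranking, but the smoothed values are not a proper distribution over attribute values.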

In [6]:
# assertions/unit tests
test_data_1 = [
    ["round", "large", "blue"],
    ["square", "large", "green"],  
    ["square", "small", "red"],  
    ["round", "large", "red"],  
    ["square", "small", "blue"],  
    ["round", "small", "blue"],  
    ["round", "small", "red"],  
    ["square", "small", "green"],  
    ["round", "large", "green"],  
    ["square", "large", "green"],  
    ["square", "large", "red"],  
    ["square", "large", "green"],  
    ["round", "large", "red"],  
    ["square", "small", "red"],  
    ["round", "small", "green"]
]

test_labels_1 = ["no", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "no"]

expected_output = {'no': {'c': 0.5333333333333333, 
                          'shape': {'round': 0.4444444444444444, 'square': 0.6666666666666666}, 
                          'size': {'small': 0.7777777777777778, 'large': 0.3333333333333333}, 
                          'color': {'blue': 0.4444444444444444, 'red': 0.4444444444444444, 'green': 0.3333333333333333}}, 
                   'yes': {'c': 0.4666666666666667, 
                           'shape': {'round': 0.625, 'square': 0.5}, 
                           'size': {'small': 0.25, 'large': 0.875}, 
                           'color': {'blue': 0.125, 'red': 0.5, 'green': 0.625}}}

assert train_naive_bayes(test_data_1, test_labels_1, ["shape", "size", "color"]) == expected_output

test_data_2 = [
    ["round", "large", "blue"],
    ["square", "large", "green"],  
    ["square", "small", "red"],  
]

test_labels_2 = ["no", "yes", "no"]

expected_output_2 = {'yes': 
                     {'c': 0.3333333333333333, 
                      'shape': {'round': 0.5, 'square': 1.0}, 
                      'size': {'large': 1.0, 'small': 0.5}, 
                      'color': {'blue': 0.5, 'green': 1.0, 'red': 0.5}}, 
                     'no': 
                     {'c': 0.6666666666666666, 
                      'shape': {'round': 0.6666666666666666, 'square': 0.6666666666666666}, 
                      'size': {'large': 0.6666666666666666, 'small': 0.6666666666666666}, 
                      'color': {'blue': 0.6666666666666666, 'green': 0.3333333333333333, 'red': 0.6666666666666666}}}

assert train_naive_bayes(test_data_2, test_labels_2, ["shape", "size", "color"]) == expected_output_2

test_data_3 = []
test_labels_3 = []

assert train_naive_bayes(test_data_3, test_labels_3, ["shape", "size", "color"]) == {}

<a id="normalize"></a>
## normalize
*The normalize function normalizes the probabilities calculated by the classify_observation function.*
**Used by**: [classify_observation](#classify_observation)

* **results** Dict[str, float]: a dictionary of predicted probabilities for each class label

**returns** Dict[str, float].

In [7]:
def normalize(results: Dict[str, float]) -> Dict[str, float]:
    # compute the total once rather than on every iteration
    total = sum(results.values())
    return {key: value / total for key, value in results.items()}

In [8]:
# assertions/unit tests
test_results_1 = {"yes": 0.0006, "no": 0.0432}
result_1 = normalize(test_results_1)
assert round(sum(list(result_1.values())), 2) == 1.00

test_results_2 = {"yes": 0.0001, "no": 0}
result_2 = normalize(test_results_2)
assert round(sum(list(result_2.values())), 2) == 1.00

test_results_3 = {"yes": 0.0000, "no": 0.99}
result_3 = normalize(test_results_3)
assert round(sum(list(result_3.values())), 2) == 1.00

<a id="classify_observation"></a>
## classify_observation
*The classify_observation function uses the probabilities from the Naive Bayes Classifier to classify a single observation. For each class, it computes an unnormalized score by multiplying the class probability by the conditional probability, given that class, of each attribute value in the observation. The scores are then normalized so that they sum to 1.*

*Ex: ("round", "small", "red") would produce the following calculations for class labels yes and no:*

**probability of yes**: p(yes) * p(round | yes) * p(small | yes) * p(red | yes)

**probability of no**: p(no) * p(round | no) * p(small | no) * p(red | no)

*The function then returns a tuple containing the prediction in the first position (the class label with the highest probability) and a dictionary of probabilities for each class label in the second position.*

**Used by**: [k_fold_validation](#k_fold_validation)

* **probs** Dict[str, Any]: the probabilities returned by the [train_naive_bayes](#train_naive_bayes) function.
* **attributes** List[str]: the attributes in the dataset (excluding class label)
* **observation** List[str]: the observation to classify, given as one value per attribute in the same order as **attributes**, with the class label removed

**returns** Tuple[str, Dict[str, float]].

In [9]:
def classify_observation(probs: Dict[str, Any], attributes: List[str], observation: List[str]) -> Tuple[str, Dict[str, float]]:
    results = {}
    for label in probs.keys():
        prob_class = probs[label]["c"]
        for index, value in enumerate(observation):
            prob_class *= probs[label][attributes[index]][value]
        # store the score once all attribute values have been multiplied in
        results[label] = prob_class
    norm_results = normalize(results)
    # get the dictionary key with the highest value
    best = max(norm_results, key=norm_results.get)
    return (best, norm_results)
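
With many attributes, the product of probabilities can become very small. One common alternative, which is not what the function above does, is to sum logarithms and normalize afterwards. The sketch below assumes the same nested-dictionary layout for `probs` and assumes that at least one class has a nonzero score; the names `log_scores` and `normalize_log_scores` are hypothetical.

```python
import math
from typing import Any, Dict, List

def log_scores(probs: Dict[str, Any], attributes: List[str], observation: List[str]) -> Dict[str, float]:
    scores = {}
    for label, table in probs.items():
        factors = [table["c"]] + [table[attr][value] for attr, value in zip(attributes, observation)]
        # log(0) is undefined, so a zero factor is mapped to negative infinity
        scores[label] = -math.inf if 0 in factors else sum(math.log(f) for f in factors)
    return scores

def normalize_log_scores(scores: Dict[str, float]) -> Dict[str, float]:
    # subtract the largest score before exponentiating to keep the values in range
    top = max(scores.values())
    weights = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

# rounded probabilities, same layout as the output of train_naive_bayes
toy_probs = {
    "no": {"c": 0.533,
           "shape": {"round": 0.444, "square": 0.667},
           "size": {"small": 0.778, "large": 0.333},
           "color": {"blue": 0.444, "red": 0.444, "green": 0.333}},
    "yes": {"c": 0.467,
            "shape": {"round": 0.625, "square": 0.5},
            "size": {"small": 0.25, "large": 0.875},
            "color": {"blue": 0.125, "red": 0.5, "green": 0.625}},
}

result = normalize_log_scores(
    log_scores(toy_probs, ["shape", "size", "color"], ["square", "large", "red"]))
print(max(result, key=result.get), result)
```

Up to floating-point rounding, the normalized values should agree with those from multiplying the probabilities directly, so exact equality checks against the product-based output are best avoided.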

In [10]:
# assertions/unit tests
test_probs = {'no': {'c': 0.533, 
                          'shape': {'round': 0.444, 'square': 0.667}, 
                          'size': {'small': 0.778, 'large': 0.333}, 
                          'color': {'blue': 0.444, 'red': 0.444, 'green': 0.333}}, 
               'yes': {'c': 0.467, 
                       'shape': {'round': 0.625, 'square': 0.5}, 
                       'size': {'small': 0.25, 'large': 0.875}, 
                       'color': {'blue': 0.125, 'red': 0.5, 'green': 0.625}}}

test_instance_1 = ["square", "large", "red"]
results_1 = classify_observation(test_probs, ["shape", "size", "color"], test_instance_1)
assert results_1 == ('yes', {'no': 0.33973153417458696, 'yes': 0.6602684658254131})
# confirm that the normalized probabilities equal 1
assert round(results_1[1]["yes"] + results_1[1]["no"], 2) == 1.00

test_instance_2 = ["round", "small", "blue"]
results_2 = classify_observation(test_probs, ["shape", "size", "color"], test_instance_2)
assert results_2 == ('no', {'no': 0.8996228935625692, 'yes': 0.10037710643743075})
# confirm that the normalized probabilities equal 1
assert round(results_2[1]["yes"] + results_2[1]["no"], 2) == 1.00

# unseen example
test_instance_3 = ["square", "large", "blue"]
results_3 = classify_observation(test_probs, ["shape", "size", "color"], test_instance_3)
assert results_3 == ('no', {'no': 0.673004045771441, 'yes': 0.326995954228559})
# confirm that the normalized probabilities equal 1
assert round(results_3[1]["yes"] + results_3[1]["no"], 2) == 1.00

<a id="classify"></a>
## classify
*The classify function takes a list of observations and returns a prediction for each one by calling [classify_observation](#classify_observation).*
**Used by**: [k_fold_validation](#k_fold_validation)

* **probs** Dict[str, Any]: the probabilities for the Naive Bayes Classifier represented as nested dictionaries
* **observations** List[List[str]]: the observations to classify, each with the class label removed
* **attributes** List[str]: the attributes in the dataset (excluding class label)

**returns** List[Tuple[str, Dict[str, float]]].

In [11]:
def classify(probs: Dict[str, Any], observations: List[List[str]], attributes: List[str]) -> List[Tuple[str, Dict[str, float]]]:
    classifications = []
    for observation in observations:
        prediction = classify_observation(probs, attributes, observation)
        classifications.append(prediction)
    return classifications 

In [12]:
# assertions/unit tests
test_probs = {'no': {'c': 0.533, 
                          'shape': {'round': 0.444, 'square': 0.667}, 
                          'size': {'small': 0.778, 'large': 0.333}, 
                          'color': {'blue': 0.444, 'red': 0.444, 'green': 0.333}}, 
               'yes': {'c': 0.467, 
                       'shape': {'round': 0.625, 'square': 0.5}, 
                       'size': {'small': 0.25, 'large': 0.875}, 
                       'color': {'blue': 0.125, 'red': 0.5, 'green': 0.625}}}

test_instances_1 = [["square", "large", "red"], ["round", "large", "blue"], ["square", "small", "blue"]]
expected_output_1 = [('yes', 
                    {'no': 0.33973153417458696, 
                     'yes': 0.6602684658254131}), 
                   ('no', 
                    {'no': 0.5229075788819071, 
                     'yes': 0.4770924211180929}), 
                   ('no', 
                    {'no': 0.9439140906419522, 
                     'yes': 0.05608590935804781})]

assert classify(test_probs, test_instances_1, ["shape", "size", "color"]) == expected_output_1

# test with one example
test_instances_2 = [["square", "large", "red"]]
expected_output_2 = [('yes', 
                    {'no': 0.33973153417458696, 
                     'yes': 0.6602684658254131})]

assert classify(test_probs, test_instances_2, ["shape", "size", "color"]) == expected_output_2

# test with no examples
assert classify(test_probs, [], ["shape", "size", "color"]) == []

<a id="error_rate"></a>
## error_rate
*The error_rate function takes a list of predictions and a list of true values and returns the error rate of the predictions using the formula incorrect_predictions/total_predictions.*
**Used by**: [k_fold_validation](#k_fold_validation)

* **predictions** List[str]: the predictions for a classification task
* **labels** List[str]: the true class values

**returns** float.

In [13]:
def error_rate(predictions: List[str], labels: List[str]) -> float:
    total = len(predictions)
    incorrect = sum(pred != true_value for pred, true_value in zip(predictions, labels))
    return incorrect / total

In [14]:
# assertions/unit tests
classifications_1 = ["yes", "no", "yes"]
labels_1 = ["yes", "no", "yes"]

assert error_rate(classifications_1, labels_1) == 0.00

classifications_2 = ["yes", "yes", "yes"]
labels_2 = ["no", "no", "no"]

assert error_rate(classifications_2, labels_2) == 1.00

classifications_3 = ["no", "no", "yes", "yes"]
labels_3 = ["no", "no", "no", "no"]

assert error_rate(classifications_3, labels_3) == 0.50

<a id="k_fold_validation"></a>
## k_fold_validation

*The k_fold_validation function applies k-fold cross-validation to a list of precomputed folds. For each of the k runs, one fold is held out as the test set and the remaining folds are combined into the training set; the model is trained on the training set and evaluated on both sets. The function prints the training and test error for each run, then prints and returns the average error and the error variance across runs for the training and test sets. Because every observation is used for testing exactly once, the estimate makes full use of small datasets.*

* **model_function** Callable: the algorithm for the model
* **classify_function** Callable: the classification function
* **eval_function** Callable: the evaluation function
* **label_index** int: the label index in the dataset
* **folds** List[List[List[str]]]: the k folds to evaluate
* **attributes** List[str]: the data attributes (excluding class label)
* **smoothing** bool: specifies whether the training function for the Naive Bayes Classifier should use +1 smoothing

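**returns** Tuple[float, float, float, float]: the average training error, training error variance, average test error, and test error variance.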

In [15]:
def k_fold_validation(model_function: Callable, classify_function: Callable, eval_function: Callable, label_index: int, folds: List[List[List[str]]], attributes: List[str], smoothing: bool = True) -> Tuple[float, float, float, float]:
    total_train_loss, total_test_loss = 0, 0
    train_losses, test_losses = [], []
    for index in range(len(folds)):
        # split data and labels
        train, test = create_train_test(folds, index)
        train_examples, train_labels = split_labels(train, label_index)
        test_examples, test_labels = split_labels(test, label_index)
        # train model
        model = model_function(train_examples, train_labels, attributes, smoothing)
        # classify
        train_predictions = classify_function(model, train_examples, attributes)
        train_loss = eval_function([row[0] for row in train_predictions], train_labels)
        test_predictions = classify_function(model, test_examples, attributes)
        test_loss = eval_function([row[0] for row in test_predictions], test_labels)
        print(f"Fold: {index+1}, train loss: {round(train_loss*100, 2)}%, test loss: {round(test_loss*100, 2)}%")
        total_train_loss += train_loss
        train_losses.append(train_loss)
        total_test_loss += test_loss
        test_losses.append(test_loss)
    print("---------------------------------------------------")
    print(f"Average Train Loss: {round((total_train_loss/len(folds)) * 100, 2)}%, Train Loss Variance: {np.var(train_losses)}")
    print(f"Average Test Loss: {round((total_test_loss/len(folds)) * 100, 2)}%, Test Loss Variance: {np.var(test_losses)}")
    return total_train_loss/len(folds), np.var(train_losses), total_test_loss/len(folds), np.var(test_losses)

In [16]:
# assertions/unit tests
test_data = [
    ["round", "large", "blue", "no"],
    ["square", "large", "green", "yes"],  
    ["square", "small", "red", "no"],  
    ["round", "large", "red", "yes"],  
    ["square", "small", "blue", "no"],  
    ["round", "small", "blue", "no"],  
    ["round", "small", "red", "yes"],  
    ["square", "small", "green", "no"],  
    ["round", "large", "green", "yes"],  
    ["square", "large", "green", "yes"],  
    ["square", "large", "red", "no"],  
    ["square", "large", "green", "yes"],  
    ["round", "large", "red", "yes"],  
    ["square", "small", "red", "no"],  
    ["round", "small", "green", "no"]
]
test_attributes = ["shape", "size", "color"]

folds = create_folds(test_data, 10)
average_train_error, train_variance,  average_test_error, test_variance = k_fold_validation(train_naive_bayes, classify, error_rate, 3, folds, test_attributes)

Fold: 1, train loss: 15.38%, test loss: 50.0%
Fold: 2, train loss: 15.38%, test loss: 0.0%
Fold: 3, train loss: 23.08%, test loss: 0.0%
Fold: 4, train loss: 7.69%, test loss: 50.0%
Fold: 5, train loss: 15.38%, test loss: 0.0%
Fold: 6, train loss: 14.29%, test loss: 100.0%
Fold: 7, train loss: 14.29%, test loss: 0.0%
Fold: 8, train loss: 14.29%, test loss: 0.0%
Fold: 9, train loss: 14.29%, test loss: 0.0%
Fold: 10, train loss: 14.29%, test loss: 100.0%
---------------------------------------------------
Average Train Loss: 14.84%, Train Loss Variance: 0.0012136215432918733
Average Test Loss: 30.0%, Test Loss Variance: 0.16
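
The variance reported above comes from `np.var`, which by default computes the population variance (`ddof=0`). As a small check, the sketch below recomputes the test figures from the per-fold test error rates printed above. The `ddof=1` line is shown only for comparison and is not used by k_fold_validation.

```python
import numpy as np

# per-fold test error rates from the toy run printed above
test_errors = [0.5, 0.0, 0.0, 0.5, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]

mean_error = np.mean(test_errors)
population_variance = np.var(test_errors)       # ddof=0: divides by n
sample_variance = np.var(test_errors, ddof=1)   # divides by n - 1

print(f"mean: {mean_error:.2f}")
print(f"population variance: {population_variance:.4f}")
print(f"sample variance: {sample_variance:.4f}")
```

With only ten folds the two variance estimates differ noticeably, so it helps to state which one is being reported.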


# Naive Bayes Classifier with the Mushroom Dataset

In [17]:
# column labels
col_names = ["label", "cap-shape", "cap-surface", "cap-color", "bruises?", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", 
             "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring",
            "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat"]
# read in data
def read_data(filename, delimiter):
    with open(filename, 'r') as f:
        data = [line.strip().split(delimiter) for line in f]
    random.shuffle(data)
    return data

data = read_data('Datasets/agaricus-lepiota.data', ",")
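
Because read_data shuffles with the global random state, the fold contents and the per-fold error rates can change from run to run. If repeatable folds are wanted, one option is a dedicated, seeded generator. The helper `shuffled_copy`, the sample rows, and the seed value below are hypothetical and not part of the notebook.

```python
import random
from typing import List

def shuffled_copy(rows: List[List[str]], seed: int) -> List[List[str]]:
    # a private generator leaves the global random state untouched
    rng = random.Random(seed)
    shuffled = list(rows)  # shallow copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled

# made-up rows standing in for lines of the data file
rows = [["p", "x"], ["e", "b"], ["e", "f"], ["p", "k"]]
first = shuffled_copy(rows, seed=42)
second = shuffled_copy(rows, seed=42)
print(first == second)
```

The same seed gives the same ordering on every call, which makes fold-level results comparable across runs.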

In [18]:
new_folds = create_folds(data, 10)
print("10-fold Validation with +1 Smoothing")
print("---------------------------------------------------")
average_train_error, train_variance, average_test_error, test_variance = k_fold_validation(train_naive_bayes, classify, error_rate, 0, new_folds, col_names[1:])

10-fold Validation with +1 Smoothing
---------------------------------------------------
Fold: 1, train loss: 4.43%, test loss: 5.04%
Fold: 2, train loss: 4.36%, test loss: 4.67%
Fold: 3, train loss: 4.47%, test loss: 4.31%
Fold: 4, train loss: 4.45%, test loss: 4.8%
Fold: 5, train loss: 4.5%, test loss: 4.68%
Fold: 6, train loss: 4.47%, test loss: 4.31%
Fold: 7, train loss: 4.28%, test loss: 3.57%
Fold: 8, train loss: 4.53%, test loss: 4.56%
Fold: 9, train loss: 4.58%, test loss: 4.68%
Fold: 10, train loss: 4.55%, test loss: 4.56%
---------------------------------------------------
Average Train Loss: 4.46%, Train Loss Variance: 7.242277321830525e-07
Average Test Loss: 4.52%, Test Loss Variance: 1.4177094278540854e-05


In [19]:
print("10-fold Validation without Smoothing")
print("---------------------------------------------------")
average_train_error, train_variance, average_test_error, test_variance = k_fold_validation(train_naive_bayes, classify, error_rate, 0, new_folds, col_names[1:], False)

10-fold Validation without Smoothing
---------------------------------------------------
Fold: 1, train loss: 0.26%, test loss: 0.62%
Fold: 2, train loss: 0.27%, test loss: 0.12%
Fold: 3, train loss: 0.27%, test loss: 0.25%
Fold: 4, train loss: 0.26%, test loss: 0.37%
Fold: 5, train loss: 0.3%, test loss: 0.25%
Fold: 6, train loss: 0.3%, test loss: 0.37%
Fold: 7, train loss: 0.37%, test loss: 0.37%
Fold: 8, train loss: 0.33%, test loss: 0.37%
Fold: 9, train loss: 0.27%, test loss: 0.37%
Fold: 10, train loss: 0.3%, test loss: 0.12%
---------------------------------------------------
Average Train Loss: 0.29%, Train Loss Variance: 1.0559665571808295e-07
Average Test Loss: 0.32%, Test Loss Variance: 1.8767830513330512e-06


## Explanation of Results
The Naive Bayes Classifier performed better without smoothing than with it. With +1 smoothing, the average test error was 4.52% (about 95.5% accuracy); without smoothing, it was 0.32% (about 99.7% accuracy). These results could indicate that the training and test sets were very similar for the mushroom dataset. Smoothing is beneficial when the test set contains attribute values that were never observed with a given class in the training set, because a single conditional probability of 0 zeroes out the entire product in the classification equation and discards the other, possibly valuable, feature probabilities. If the attribute values in the test set are already well represented in the training set, smoothing could instead introduce noise. The smoothed model might perform better than the unsmoothed model on a test set containing mushrooms that were not represented in the training set.
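
The effect of a zero probability on the product can be seen with a few made-up numbers. The factors below are illustrative only and are not taken from the mushroom results.

```python
import math

# hypothetical factors for one class: a prior followed by three conditional probabilities
unsmoothed = [0.5, 0.9, 0.0, 0.8]   # one attribute value was never seen with this class
smoothed = [0.5, 0.9, 0.01, 0.8]    # same factors with the zero replaced by a small value

unsmoothed_score = math.prod(unsmoothed)
smoothed_score = math.prod(smoothed)
print(unsmoothed_score, smoothed_score)
```

The unsmoothed score is exactly zero no matter how large the other factors are, while the smoothed score stays small but positive and can still be compared against the other classes.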