# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED ID, as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name), not your student ID, which is just letters and numbers.
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, Data Classes, etc. as your abstract data type (ADT).
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second position, a dict with a key for every possible class label and the associated *normalized* probability. For example, if we have given the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
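As a rough illustration of the normalization step, the sketch below turns unnormalized class scores into normalized probabilities and then picks the most probable label. The helper names `normalize` and `best_label` and the example numbers are assumptions made for illustration; they are not part of the assignment.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Divide each unnormalized score P(A|C)P(C) by the sum of all scores."""
    total = sum(scores.values())
    if total == 0:
        # Every class scored zero; return zeros rather than dividing by zero.
        return {label: 0.0 for label in scores}
    return {label: score / total for label, score in scores.items()}


def best_label(probabilities: Dict[str, float]) -> str:
    """Return the class label with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical unnormalized scores for one observation.
unnormalized = {"e": 0.03, "p": 0.01}
normalized = normalize(unnormalized)
print(normalized, best_label(normalized))
```

Because the normalizer is the sum of the numerators, the normalized values always sum to 1 whenever at least one class has a nonzero score.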

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the "classifier" (however you decided to represent the probability tables).
```

The `smoothing` value defaults to True. You should handle both cases.

`classify` takes an NBC produced by the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as in the pseudocode, which classifies only one instance at a time; it can call that function, though.)

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 5x2 cross validation (from Module 2!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.
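One possible way to keep `cross_validate` independent of a particular learner is to pass the training, classification, and evaluation functions in as arguments and to fix options such as `smoothing` with `functools.partial`. The sketch below assumes a hypothetical `run_fold` helper and toy stand-in functions (`train_majority`, `classify_majority`, `error_rate`); none of these names come from the assignment.

```python
from functools import partial
from typing import Callable, List


def run_fold(train_fn: Callable, classify_fn: Callable, evaluate_fn: Callable,
             train_data: List[list], test_data: List[list]) -> float:
    """Train on one split, classify the other, and return the metric."""
    model = train_fn(train_data)
    predictions = classify_fn(model, test_data)
    return evaluate_fn(test_data, predictions)


# Toy stand-ins so the sketch runs on its own (assumptions, not the real NBC).
def train_majority(data: List[list], smoothing: bool = True) -> str:
    labels = [row[0] for row in data]
    return max(set(labels), key=labels.count)


def classify_majority(model: str, data: List[list]) -> List[str]:
    return [model for _ in data]


def error_rate(data: List[list], predictions: List[str]) -> float:
    errors = sum(1 for row, pred in zip(data, predictions) if row[0] != pred)
    return errors / len(data)


# Partial application fixes smoothing=False without changing run_fold.
train_no_smoothing = partial(train_majority, smoothing=False)
train_rows = [["e", "x"], ["e", "y"], ["p", "x"]]
test_rows = [["e", "x"], ["p", "y"]]
result = run_fold(train_no_smoothing, classify_majority, error_rate,
                  train_rows, test_rows)
print(result)
```

With this shape, swapping in a different learner only means passing different functions; the fold-handling code stays the same.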

Following Module 2's materials, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
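The reporting described above could look roughly like the following sketch. The `summarize` helper and the example error rates are hypothetical, and the variance here is the sample variance from Python's `statistics` module, which is one reasonable choice rather than a requirement stated in the assignment.

```python
from statistics import mean, variance
from typing import List


def summarize(error_rates: List[float]) -> str:
    """Format per-fold error rates (fractions in [0, 1]) plus mean and variance."""
    lines = [f"Fold {i + 1}: {rate * 100:.2f}%" for i, rate in enumerate(error_rates)]
    percents = [rate * 100 for rate in error_rates]
    lines.append(f"Mean: {mean(percents):.2f}%")
    # statistics.variance is the sample variance (divides by n - 1).
    lines.append(f"Variance: {variance(percents):.4f}")
    return "\n".join(lines)


report = summarize([0.02, 0.04, 0.03])  # hypothetical error rates
print(report)
```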

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 5x2 cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief hypothesis/explanation for the similarities or differences in the results. You may also compare the results to the Decision Tree and why you think they're different (if they are).

### Provided Functions

You do not need to document these.

You can use this function to read the data file.

In [1]:
import random #Had to add this, code was crashing later

def parse_data(file_name: str) -> list[list]:
    data = []
    # Use a context manager so the file is closed after reading.
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            data.append(datum)
    random.shuffle(data)
    return data

You can use this function to create 10 folds for 5x2 cross validation.

In [2]:
def create_folds(xs: list, n: int) -> list[list[list]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
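To see how `create_folds` distributes a remainder, here is a small usage sketch. The function body is repeated (with a list comprehension instead of `list(...)` around a generator) only so the example runs on its own; the input list and the variable names `test` and `train_rows` are made up for illustration.

```python
def create_folds(xs: list, n: int) -> list:
    # Same logic as the provided function, repeated so this sketch runs alone.
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


folds = create_folds(list(range(10)), 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Use fold 0 as the test set and flatten the remaining folds for training.
test = folds[0]
train_rows = [x for j, fold in enumerate(folds) if j != 0 for x in fold]
print(test, train_rows)
```

The first `len(xs) % n` folds each receive one extra element, so fold sizes differ by at most one.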

Put your code after this line:

-----

In [3]:
from typing import Dict, List
from collections import Counter

## `train` <a id="train"></a>

**Description:**  
This function builds and trains a Naive Bayes Classifier with optional Laplace (+1) smoothing. It computes the class probabilities and the conditional probabilities of each attribute value given the class, which are the tables that `classify` later uses. Without it there would be no classifier at all.

**Parameters**:
- `training_data`: List of lists where each inner list represents an observation and the first element is the class label.
- `smoothing`: Boolean indicating whether to use Laplace smoothing.
    
**Returns**:
- A dictionary representing the Naive Bayes Classifier with calculated probabilities.

In [4]:
def train(training_data: List[List[str]], smoothing: bool = True) -> Dict:
    # Prior P(C): relative frequency of each class label (column 0).
    class_counts = Counter([data[0] for data in training_data])
    total_count = len(training_data)
    class_probs = {cls: count / total_count for cls, count in class_counts.items()}

    # cond_probs[cls][attribute_index][value] = P(value | cls)
    cond_probs: Dict[str, Dict[int, Dict[str, float]]] = {}

    for cls in class_counts:
        # Attribute columns of the rows that belong to this class.
        subset = [data[1:] for data in training_data if data[0] == cls]
        cond_probs[cls] = {}
        for i in range(len(subset[0])):
            attr_values = [data[i] for data in subset]
            attr_counts = Counter(attr_values)
            total_attr_count = len(attr_values)

            cond_probs[cls][i] = {}
            for attr_val, count in attr_counts.items():
                if smoothing:
                    # +1 smoothing over the values observed with this class only.
                    cond_probs[cls][i][attr_val] = (count + 1) / (total_attr_count + len(attr_counts))
                else:
                    cond_probs[cls][i][attr_val] = count / total_attr_count

    nbc = {"class_probs": class_probs, "cond_probs": cond_probs}
    return nbc

In [5]:
data1 = [['A', 'x'], ['B', 'y'], ['A', 'x'], ['B', 'y']]
nbc = train(data1)
assert(isinstance(nbc, dict))  # NBC should be a dictionary

data2 = [['A', 'x'], ['A', 'y'], ['A', 'z']]
nbc2 = train(data2)
assert(nbc2['class_probs'] == {'A': 1.0})  # All labels are A, so the dict should just have A as 1

data3 = [['A', 'x', 'y'], ['B', 'x', 'n'], ['A', 'z', 'y'], ['B', 'z', 'n']]
nbc3 = train(data3)
assert(nbc3['class_probs'] == {'A': .5, 'B':.5}) # Should be an even split for classes
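The `train` function above applies the `+1` only to attribute values that were actually seen with a class, so a value never seen with that class still ends up with probability 0 in `classify`. A common formulation of +1 smoothing instead collects each attribute's domain from all training rows and assigns every value in it `(count + 1) / (n + |domain|)`. The sketch below illustrates that variant for a single attribute; the helper name `smoothed_probs` and the toy data are assumptions for illustration, not the submitted implementation.

```python
from collections import Counter
from typing import Dict, List


def smoothed_probs(rows: List[List[str]], cls: str, attr: int) -> Dict[str, float]:
    """+1-smoothed P(value | cls) for the attribute at column `attr`.

    Column 0 holds the class label; the domain is taken from all rows.
    """
    domain = sorted({row[attr] for row in rows})
    in_class = [row[attr] for row in rows if row[0] == cls]
    counts = Counter(in_class)
    denominator = len(in_class) + len(domain)
    # Counter returns 0 for unseen values, so they get 1 / denominator.
    return {value: (counts[value] + 1) / denominator for value in domain}


toy = [["A", "x"], ["A", "x"], ["B", "y"], ["B", "z"]]
probs_a = smoothed_probs(toy, "A", 1)
print(probs_a)
```

In this toy data, `y` and `z` never occur with class `A`, yet both receive a small nonzero probability and the distribution still sums to 1.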

## `classify` <a id="classify"></a>

**Description:**  
This function takes a Naive Bayes Classifier and a list of observations and returns, for each observation, the best class label together with the normalized probability of every class. The NBC dictionary is traversed to compute the probability of each class given the observed attribute values, and the class with the highest probability is selected. This function is what actually lets the NBC classify data points.

**Parameters**:  
- `nbc` (`dict`): The Naive Bayes Classifier, represented as a dictionary with class_probs for class probabilities and cond_probs for conditional probabilities of attribute values given the class.
- `observations` (`List[List[str]]`): A list of observations to classify, where each observation is a list of attribute values.
- `labeled` (`bool`): Indicates whether the first element of each observation is the true class label. If `True`, it skips the first element during classification. Defaults to `True`.

**Returns**:  
- A list of tuples, one per observation: the predicted class label and a dictionary mapping each class label to its normalized probability.

In [6]:
def classify(nbc: Dict, observations: List[List[str]], labeled: bool = True):
    predictions = []

    for observation in observations:
        if labeled:
            observation = observation[1:]  

        class_scores = {}
        for cls, cls_prob in nbc["class_probs"].items():
            prob = cls_prob 
            for i, attr_val in enumerate(observation):
                if attr_val in nbc["cond_probs"][cls][i]:
                    prob *= nbc["cond_probs"][cls][i][attr_val]
                else:
                    prob *= 0 
            class_scores[cls] = prob

        total_score = sum(class_scores.values())
        normalized_probs = {cls: score / total_score if total_score > 0 else 0.0
                            for cls, score in class_scores.items()}
        
        best_class = max(normalized_probs, key=normalized_probs.get)
        
        predictions.append((best_class, normalized_probs))
    
    return predictions

In [7]:
nbc = {'class_probs': {'A': 0.5, 'B': 0.5}, 'cond_probs': {'A': {0: {'x': 0.5, 'p': 0.5}, 1: {'y': 1.0}}, 'B': {0: {'y': 0.5, 'z': 0.5}, 1: {'n': 1.0}}}}

observations1 = [['A', 'x'], ['B', 'y']]
assert(classify(nbc, observations1) == [('A', {'A': 1.0, 'B': 0.0}), ('B', {'A': 0.0, 'B': 1.0})])  # 'x' only occurs with A and 'y' only with B, so each class gets probability 1.0

observations2 = [['x'], ['y']]
assert(classify(nbc, observations2, labeled=False) == [('A', {'A': 1.0, 'B': 0.0}), ('B', {'A': 0.0, 'B': 1.0})])  # Should classify correctly without labels

observations3 = [['A', 'z']]
assert(classify(nbc, observations3) == [('B', {'A': 0.0, 'B': 1.0})])  # 'z' never occurs with A, so the observation is classified as B
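Multiplying many small probabilities can underflow on data sets with many attributes. One common alternative, which is not what the `classify` function above does, is to sum logarithms and normalize with the log-sum-exp trick. The sketch below assumes the same `nbc` dictionary layout as above; `classify_one_log` and `toy_nbc` are hypothetical names used only for illustration.

```python
import math
from typing import Dict, List, Tuple


def classify_one_log(nbc: Dict, observation: List[str]) -> Tuple[str, Dict[str, float]]:
    """Score one unlabeled observation in log space, then normalize."""
    log_scores = {}
    for cls, prior in nbc["class_probs"].items():
        total = math.log(prior) if prior > 0 else -math.inf
        for i, value in enumerate(observation):
            p = nbc["cond_probs"][cls][i].get(value, 0.0)
            total += math.log(p) if p > 0 else -math.inf
        log_scores[cls] = total
    top = max(log_scores.values())
    if top == -math.inf:
        # Every class has probability zero; nothing to normalize.
        probs = {cls: 0.0 for cls in log_scores}
    else:
        # Subtract the max before exponentiating (log-sum-exp trick).
        exps = {cls: math.exp(s - top) for cls, s in log_scores.items()}
        norm = sum(exps.values())
        probs = {cls: e / norm for cls, e in exps.items()}
    return max(probs, key=probs.get), probs


toy_nbc = {
    "class_probs": {"A": 0.5, "B": 0.5},
    "cond_probs": {"A": {0: {"x": 0.8, "y": 0.2}}, "B": {0: {"x": 0.2, "y": 0.8}}},
}
label, probs = classify_one_log(toy_nbc, ["x"])
print(label, probs)
```

For a data set the size of the mushroom data, plain multiplication is unlikely to underflow, so this is an optional refinement rather than a needed fix.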

## `evaluate` <a id="evaluate"></a>

**Description:**  
This function calculates the error rate by comparing the actual class labels from the dataset with the predicted labels. It counts the number of incorrect predictions and returns the error rate, expressed as a percentage of wrong predictions. It is needed to test how well our Naive Bayes Classifier is performing.

**Parameters**:  
- `data_set` (`List[List[str]]`): The dataset where each inner list represents a data point, and the first element is the actual class label.
- `predictions` (`List[tuple]`): A list of tuples, one per observation in the dataset, each holding the predicted class label and the dictionary of normalized class probabilities.

**Returns**:  
- A float representing the error rate as a percentage (0 to 100): the number of incorrect predictions divided by the total number of predictions, multiplied by 100.

In [8]:
def evaluate(data_set: List[List[str]], predictions: List[tuple]) -> float:
    errors = 0
    total = len(data_set)
    
    for actual, (predicted_class, _) in zip(data_set, predictions):
        if actual[0] != predicted_class:
            errors += 1

    error_rate = errors / total if total > 0 else 0.0
    return error_rate * 100

In [9]:
data1 = [['A', 'x'], ['B', 'y'], ['C', 'z']]
predictions1 = [('A', {'A': 1.0}), ('B', {'B': 1.0}), ('C', {'C': 1.0})]
assert(evaluate(data1, predictions1) == 0.0)  # No errors, error rate should be 0%

data2 = [['A', 'x'], ['B', 'y'], ['C', 'z']]
predictions2 = [('B', {'B': 0.0}), ('C', {'C': 0.0}), ('B', {'B': 0.0})]
assert(evaluate(data2, predictions2) == 100.0)  # All wrong, error rate should be 100%

data3 = [['A', 'x'], ['B', 'y'], ['C', 'z'], ['D', 'w']]
predictions3 = [('A', {'A': 1.0}), ('B', {'B': 1.0}), ('C', {'C': 1.0}), ('X', {'X': 0.0})]
assert(evaluate(data3, predictions3) == 25.0)  # One wrong, error rate should be 25%

## `cross_validate` <a id="cross_validate"></a>

**Description:**  
This function performs repeated k-fold cross-validation on a Naive Bayes Classifier. The data is shuffled and split into `k` folds twice; within each repetition, every fold is used once as the test set while the other folds are used for training. The function trains a Naive Bayes Classifier on the training folds and evaluates its error rate on the test fold, which gives a more reliable picture of how the classifier performs than a single split.

**Parameters**:  
- `data` (`list[list[str]]`): The dataset, where each inner list represents a data point and the first element of each data point is the class label.
- `k` (`int`): The number of folds to split the data into. Defaults to 5.
- `smoothing` (`bool`): Whether `train` should use +1 smoothing. Defaults to `True`.

**Returns**:  
- A float representing the average error rate, as a percentage, across all folds of both repetitions.

In [10]:
def cross_validate(data: List[List[str]], k: int = 5, smoothing=True) -> float:
    error_rates = []

    for _ in range(2):  
        random.shuffle(data) 
        folds = create_folds(data, k)  

        for i in range(k):
            test_fold = folds[i]
            train_folds = [fold for j, fold in enumerate(folds) if j != i]
            train_data = [item for sublist in train_folds for item in sublist]  

            nbc = train(train_data, smoothing)

            predictions = classify(nbc, test_fold, labeled=True)
            error_rate = evaluate(test_fold, predictions)

            error_rates.append(error_rate)
            print(f"Fold {i+1} repetition error rate: {error_rate:.2f}%")

    average_error_rate = sum(error_rates) / len(error_rates)
    return average_error_rate

In [11]:
# Unlike Decision Trees, this is a little more random, so we can't just assert X == 0 percent error as in the last assignment
data1 = [['A', 'x'], ['B', 'y'], ['A', 'x'], ['B', 'y']]
result = cross_validate(data1, k=2)
assert(isinstance(result, float)) # Should be a float

assert(0 <= cross_validate(data1, k=2) <= 100) # Error rate should never be below 0 or above 100

data3 = [['A', 'x'], ['A', 'x'], ['A', 'x'], ['A', 'x']]
assert(cross_validate(data3, k=2) == 0) # All labels being the exact same is the only way to ensure 0% error rate

Fold 1 repetition error rate: 100.00%
Fold 2 repetition error rate: 100.00%
Fold 1 repetition error rate: 100.00%
Fold 2 repetition error rate: 100.00%
Fold 1 repetition error rate: 0.00%
Fold 2 repetition error rate: 0.00%
Fold 1 repetition error rate: 100.00%
Fold 2 repetition error rate: 100.00%
Fold 1 repetition error rate: 0.00%
Fold 2 repetition error rate: 0.00%
Fold 1 repetition error rate: 0.00%
Fold 2 repetition error rate: 0.00%


In [12]:
# Load the data
data = parse_data('data/agaricus-lepiota.data')

# Perform 5x2 cross-validation with smoothing=True
print("Cross-validation with Laplace smoothing enabled (smoothing=True):")
average_error_rate_smoothing = cross_validate(data, k=5)
print(f"Average error rate with smoothing=True: {average_error_rate_smoothing:.2f}%\n")

# Perform 5x2 cross-validation with smoothing=False
print("Cross-validation with Laplace smoothing disabled (smoothing=False):")
average_error_rate_no_smoothing = cross_validate(data, k=5, smoothing=False)
print(f"Average error rate with smoothing=False: {average_error_rate_no_smoothing:.2f}%\n")


Cross-validation with Laplace smoothing enabled (smoothing=True):
Fold 1 repetition error rate: 0.43%
Fold 2 repetition error rate: 0.25%
Fold 3 repetition error rate: 0.68%
Fold 4 repetition error rate: 0.37%
Fold 5 repetition error rate: 0.06%
Fold 1 repetition error rate: 0.37%
Fold 2 repetition error rate: 0.37%
Fold 3 repetition error rate: 0.12%
Fold 4 repetition error rate: 0.68%
Fold 5 repetition error rate: 0.12%
Average error rate with smoothing=True: 0.34%

Cross-validation with Laplace smoothing disabled (smoothing=False):
Fold 1 repetition error rate: 0.62%
Fold 2 repetition error rate: 0.06%
Fold 3 repetition error rate: 0.18%
Fold 4 repetition error rate: 0.49%
Fold 5 repetition error rate: 0.25%
Fold 1 repetition error rate: 0.43%
Fold 2 repetition error rate: 0.12%
Fold 3 repetition error rate: 0.43%
Fold 4 repetition error rate: 0.25%
Fold 5 repetition error rate: 0.55%
Average error rate with smoothing=False: 0.34%



## Conclusions

### Smoothing True vs False
Looking only at the averages, smoothing does not appear to do much beyond bringing the average error rate down slightly. Over a few runs, the average was usually around 2-4% lower with smoothing=True, although sometimes it was higher. After running it a few times, however, I noticed something interesting. Both settings would every once in a while produce a very high error rate, close to 80%, but only smoothing=True could consistently get the error rate below 10%, while smoothing=False would only occasionally produce a fold under 10%. Apart from that, both settings give fairly similar results, which leads me to believe that smoothing is not strictly necessary for this dataset.

### Decision Tree vs Naive Bayes
Overall, I can wholeheartedly say that the decision tree performed much better on this dataset than Naive Bayes did. This makes sense: the dataset seems fairly consistent, which lets the tree build a pretty simple path to a decision. Naive Bayes does not work this way, which leads to more error coming out of the model. Most models have pros and cons depending on the dataset, and this dataset is simply better suited to something like a decision tree.

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.