# Module 8 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, not your student ID, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from copy import deepcopy
import random
import math
from collections import Counter
from typing import List, Dict, Any, Callable

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

**Just in case** the UCI repository is down, which happens from time to time, I have included the data and name files on Canvas.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the tree and nodes.
    </p>
</div>

One of the things we did not talk about in the lectures was how to deal with missing values. There are two aspects of the problem here. What do we do with missing values in the training data? What do we do with missing values when doing classification?

There are a lot of different ways that we can handle this.
A common algorithm is to use something like kNN to impute the missing values.
We can use conditional probability as well.
There are also clever modifications to the Decision Tree algorithm itself that one can make.

We're going to do something simpler, given the size of the data set: remove the observations with missing values ("?").
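
As a rough illustration of that removal step, here is a minimal sketch. The helper name `drop_missing` and the toy rows are hypothetical and not part of the assignment; the only detail taken from the text above is that `"?"` marks a missing value.

```python
from typing import List

MISSING = "?"  # marker used for missing values, per the text above


def drop_missing(data: List[List[str]]) -> List[List[str]]:
    """Keep only the rows in which no field equals the missing-value marker."""
    return [row for row in data if MISSING not in row]


# Toy rows (made up for illustration): label first, then attribute values.
rows = [
    ["e", "x", "s"],
    ["p", "?", "s"],  # dropped: contains a missing value
    ["e", "b", "y"],
]
clean_rows = drop_missing(rows)
print(len(rows), len(clean_rows))  # 3 2
```

Filtering once, right after the file is parsed, keeps the rest of the pipeline free of special cases for `"?"`.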

You must implement the following functions:

`train` takes training_data and returns the Decision Tree as a data structure.

```
def train(training_data):
   # returns the Decision Tree.
```

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data).

```
def classify(tree, observations, labeled=True):
    # returns a list of classifications
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 5x2 fold cross validation (from Module 2!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application).

Following Module 2's material (course notes), `cross_validate` should print out a table in exactly the same format. What you are looking for here is a consistent evaluation metric across the folds. Print the error rate to 4 decimal places. **Do not convert to a percentage.**
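
To make the higher-order-function idea concrete, here is a hedged sketch. It assumes one common reading of 5x2 cross validation (five repetitions of a shuffled two-way split, with each half tested once); Module 2's exact procedure and table format are not reproduced here. All names (`cross_validate_5x2`, `train_majority`, `classify_constant`, `error_rate`) are hypothetical stand-ins, and the learner is a trivial majority-class model rather than a decision tree.

```python
import random
from functools import partial
from typing import List

Row = List[str]


def error_rate(data: List[Row], predictions: List[str]) -> float:
    """Fraction of rows whose label (first element) differs from the prediction."""
    errors = sum(1 for row, predicted in zip(data, predictions) if row[0] != predicted)
    return errors / len(data)


def cross_validate_5x2(data, train_fn, classify_fn, evaluate_fn, seed=0):
    """Five repetitions of a shuffled two-way split; each half is tested once."""
    rng = random.Random(seed)
    rates = []
    for _ in range(5):
        shuffled = data[:]  # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        middle = len(shuffled) // 2
        halves = [shuffled[:middle], shuffled[middle:]]
        # First pass trains on half 0 and tests on half 1; second pass swaps them.
        for train_half, test_half in (halves, halves[::-1]):
            model = train_fn(train_half)
            predictions = classify_fn(model, test_half)
            rates.append(evaluate_fn(test_half, predictions))
    return rates


# Stand-in learner (hypothetical): always predict the most common training label.
def train_majority(rows):
    labels = [row[0] for row in rows]
    return max(set(labels), key=labels.count)


def classify_constant(model, rows, labeled=True):
    return [model for _ in rows]


toy = [["e", "x"]] * 6 + [["p", "y"]] * 2
rates = cross_validate_5x2(
    toy, train_majority, partial(classify_constant, labeled=True), error_rate
)
for i, rate in enumerate(rates, start=1):
    print(f"{i:>2}  {rate:.4f}")  # error rate to 4 decimal places, not a percentage
```

Because the training, classification, and evaluation steps are passed in as functions, the same loop works for any learner whose `classify` has the same shape; `partial` fixes the extra keyword argument up front.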

```
def pretty_print_tree(tree):
    # pretty prints the tree
```

This should be a text representation of a decision tree trained on the entire data set (no train/test).

To summarize...

Apply the Decision Tree algorithm to the Mushroom data set using 5x2 cross validation and the error rate as the evaluation metric. When you are done, apply the Decision Tree algorithm to the entire data set and print out the resulting tree.

**Note** Because this assignment has a natural recursive implementation, you should consider using `deepcopy` at the appropriate places.
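
The reason for that note can be seen in a small sketch: a recursive function that mutates a list changes it for the caller as well, unless it is handed a copy. The function `consume` and the attribute names are illustrative only.

```python
from copy import deepcopy


def consume(attributes, depth=0):
    """Remove one attribute per level, mimicking a recursive tree builder."""
    if not attributes or depth == 2:
        return
    attributes.pop(0)  # mutates the list in place
    consume(attributes, depth + 1)


shared = ["cap-shape", "odor", "habitat"]
consume(shared)  # the caller's list is changed by the recursion
print(shared)  # ['habitat']

original = ["cap-shape", "odor", "habitat"]
consume(deepcopy(original))  # the recursion works on an independent copy
print(original)  # ['cap-shape', 'odor', 'habitat']
```

In a tree builder this matters because sibling branches should each start from the same set of remaining attributes, not from whatever an earlier branch left behind.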


### Provided Functions

You do not need to document these.

You can use this function to read the data file.

In [2]:
def parse_data(file_name: str) -> list[list]:
    data = []
    # The context manager closes the file even if parsing fails.
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            data.append(datum)
    random.shuffle(data)
    return data

You can use this function to create 10 folds for 5x2 cross validation.

In [3]:
def create_folds(xs: list, n: int) -> list[list[list]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

Put your code after this line:

-----

## `entropy` <a id="entropy"></a>

**Description:**
This function calculates the entropy of a dataset. Entropy is a measure of impurity or disorder within a set. This function is important as it helps determine the optimal splits for the dataset.

**Parameters**:
- `data` (`list[list[str]]`): where each inner list represents a data point, and the first element of each data point is the class label.

**Returns**:
- A float representing the entropy of the dataset.
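
For reference, the quantity described here is the Shannon entropy of the class labels. The notation is an assumption made for this note: $S$ is the dataset and $p_c$ is the fraction of rows in $S$ with class $c$.

$$H(S) = -\sum_{c} p_c \log_2 p_c$$

As a worked example, for the labels A, A, B, B, C the fractions are $2/5$, $2/5$, and $1/5$, so

$$H(S) = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5}\right) \approx 1.5219$$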

In [4]:
def entropy(data:  List[List[str]]):
    labels = [datapoint[0] for datapoint in data]
    label_counts = Counter(labels)
    total_count = len(data)
    
    ent = 0
    for count in label_counts.values():
        prob = count / total_count
        ent -= prob * math.log2(prob)
    return ent

In [5]:
data1 = [['A'], ['B'], ['A'], ['B']]
assert(entropy(data1) == 1.0) #Equal distribution should be 1

data2 = [['A'], ['A'], ['A'], ['A']]
assert(entropy(data2) == 0.0) # Single class should be a 0 entropy

data3 = [['A'], ['A'], ['B'], ['B'], ['C']]
expected_entropy = -(2/5 * math.log2(2/5) + 2/5 * math.log2(2/5) + 1/5 * math.log2(1/5))
assert(abs(entropy(data3) - expected_entropy) < 1e-9) #Entropy with 3 classes, did self calculation to compare it to

## `information_gain` <a id="information_gain"></a>

**Description:**  
This function calculates the information gain of splitting the dataset based on a given attribute. Information gain helps measure the reduction in entropy after splitting a dataset and is critical for building decision trees.

**Parameters**:  
- `data` (`list[list[str]]`): The dataset where each inner list represents a data point.
- `attr_index` (`int`): The index of the attribute in the dataset to split on.

**Returns**:  
- A float representing the information gain for the specified attribute.
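
Written as a formula (notation assumed for this note: $H$ is the entropy of the class labels, $S$ is the dataset, and $S_v$ is the subset of $S$ whose attribute $A$ has value $v$):

$$Gain(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$$

As a worked example, take four rows with labels A, A, B, C and attribute values x, x, y, y. Then $H(S) = 1.5$, the x subset is pure ($H = 0$), and the y subset is an even two-class split ($H = 1$), so

$$Gain(S, A) = 1.5 - \left(\tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 1\right) = 1.0$$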

In [6]:
def information_gain(data: List[List[str]], attr_index: int):
    total_entropy = entropy(data)
    total_count = len(data)
    
    attr_values = [datapoint[attr_index] for datapoint in data]
    value_counts = Counter(attr_values)
    
    weighted_entropy = 0
    for value, count in value_counts.items():
        subset = [datapoint for datapoint in data if datapoint[attr_index] == value]
        prob = count / total_count
        weighted_entropy += prob * entropy(subset)
    
    return total_entropy - weighted_entropy

In [7]:
data1 = [['A', 'x'], ['A', 'x'], ['B', 'y'], ['B', 'y']]
assert(abs(information_gain(data1, 1)) == 1)  # Perfect split, information gain should be 1.0

data2 = [['A', 'x'], ['A', 'y'], ['A', 'x'], ['A', 'y']]
assert(abs(information_gain(data2, 1)) == 0)  # No class change, information gain should be 0

data3 = [['A', 'x'], ['A', 'x'], ['B', 'y'], ['C', 'y']]
expected_gain = entropy(data3) - (2/4 * entropy([['A', 'x'], ['A', 'x']]) + 2/4 * entropy([['B', 'y'], ['C', 'y']]))
assert(abs(information_gain(data3, 1) - expected_gain) < 1e-9)  # Imperfect split, manually calculated expected gain to check it is returning the correct gain (tolerance avoids exact float comparison)

## `majority_class` <a id="majority_class"></a>

**Description:**  
This function determines the most common class (label) in the dataset. It is useful in decision tree algorithms to assign a class when no further splitting is possible or necessary (i.e., when reaching a leaf node).

**Parameters**:  
- `data` (`list[list[str]]`): The dataset where each inner list represents a data point, and the first element of each data point is the class label.

**Returns**:  
- The most frequent class label (a string or integer).

In [8]:
def majority_class(data: List[List[str]]):
    labels = [datapoint[0] for datapoint in data]
    return Counter(labels).most_common(1)[0][0]

In [9]:
data1 = [['A'], ['B'], ['A'], ['B'], ['A']]
assert(majority_class(data1) == 'A')  # 'A' appears most frequently

data2 = [['B'], ['B'], ['B']]
assert(majority_class(data2) == 'B')  # Only one class, 'B'

data3 = [['A'], ['A'], ['A'], ['B'], ['C'], ['A']]
assert(majority_class(data3) == 'A')  # 'A' is the majority class

## `train` <a id="train"></a>

**Description:**  
This function recursively builds a decision tree using the ID3 algorithm. It selects the attribute that provides the highest information gain at each step and splits the dataset based on that attribute until it reaches a base case. The base cases occur when all examples have the same class or when there are no attributes left.

**Parameters**:  
- `training_data` (`list[list[str]]`): A list of data points, where each data point is a list that contains the class label as the first element, followed by attribute values.

**Returns**:  
- A decision tree as a nested dictionary, where internal nodes are attribute indices and leaf nodes are class labels.
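
To illustrate the nested-dictionary shape, here is a small hand-written tree (not produced by `train`) and a hypothetical helper, `count_leaves`, that walks it. The layout assumed is `{attribute_index: {attribute_value: subtree_or_label}}`.

```python
# Hypothetical toy tree: {attribute_index: {attribute_value: subtree_or_label}}
toy_tree = {
    1: {
        "x": "A",                        # leaf: class label
        "y": {2: {"n": "B", "m": "A"}},  # internal node: split on column 2
    }
}


def count_leaves(tree) -> int:
    """Count the class-label leaves of a nested-dictionary tree."""
    if not isinstance(tree, dict):
        return 1  # anything that is not a dict is treated as a leaf
    return sum(
        count_leaves(subtree)
        for branches in tree.values()
        for subtree in branches.values()
    )


print(count_leaves(toy_tree))  # 3
```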


In [10]:
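def train(training_data: List[List[str]], attributes: List[int] = None):
    # `attributes` is only used by the recursion; callers pass just the data.
    # On the first call, every column after the label is a candidate.
    if attributes is None:
        attributes = list(range(1, len(training_data[0])))
    
    # Base case: all examples share one class.
    labels = [datapoint[0] for datapoint in training_data]
    if labels.count(labels[0]) == len(labels):
        return labels[0]
    
    # Base case: no attributes left to split on.
    if not attributes:
        return majority_class(training_data)
    
    gains = [information_gain(training_data, attr_index) for attr_index in attributes]
    best_attr = attributes[gains.index(max(gains))]
    # Drop the chosen attribute so it is not reused further down this branch;
    # otherwise rows with identical attributes but different labels recurse forever.
    remaining = [attr for attr in attributes if attr != best_attr]
    
    tree = {best_attr: {}}
    feature_values = set(datapoint[best_attr] for datapoint in training_data)
    
    for value in feature_values:
        subset = [datapoint for datapoint in training_data if datapoint[best_attr] == value]
        tree[best_attr][value] = train(subset, remaining)
    
    return tree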

In [11]:
data1 = [['A', 'x'], ['B', 'y'], ['A', 'x'], ['B', 'y']]
tree1 = train(data1)
assert(isinstance(tree1, dict))  # Tree should be a dictionary

data2 = [['A', 'x'], ['A', 'y'], ['A', 'z']]
tree2 = train(data2)
assert(tree2 == 'A')  # All labels are A, so the tree should return just A

data3 = [['A', 'x', 'y'], ['B', 'x', 'n'], ['A', 'z', 'y'], ['B', 'z', 'n']]
tree3 = train(data3)
assert(tree3[2] is not None) # A split is required, so the root should be attribute 2; if it is missing, the tree did not split correctly

## `classify` <a id="classify"></a>

**Description:**  
This function takes a decision tree and a list of observations and returns a list of predicted class labels, one per observation. The tree is traversed based on the attribute values in each observation. If the `labeled` parameter is set to `True`, the first element of each observation (assumed to be the actual label) is skipped during classification. If an observation has an attribute value with no matching branch in the tree, its prediction is `None`. This function is integral to the decision tree process because it is what allows us to predict classes for new, unlabeled data (`labeled` set to `False`).

**Parameters**:  
- `tree` (`dict`): The decision tree, represented as a nested dictionary where internal nodes are attribute indices and leaf nodes are class labels.
- `observations` (`list[list]`): A list of observations to classify, where each observation is a list of attribute values.
- `labeled` (`bool`): Indicates whether the first element of each observation is the true class label. If `True`, it skips the first element during classification. Defaults to `True`.

**Returns**:  
- A list of predicted class labels for each observation.
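
The traversal for a single observation can be sketched as follows. This is an illustrative variant, not the implementation in the next cell: the helper `classify_one` is hypothetical and assumes each row keeps a label slot at index 0, so that attribute index `i` in the tree refers to `row[i]`.

```python
def classify_one(tree, row):
    """Follow branches until a leaf; return None when a value has no branch."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))  # each node has exactly one key
        tree = branches.get(row[attr])  # None if the value was never seen in training
    return tree


toy_tree = {1: {"x": "A", "y": {2: {"n": "B", "m": "A"}}}}
print(classify_one(toy_tree, [None, "y", "n"]))  # B
print(classify_one(toy_tree, [None, "z", "n"]))  # None
```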


In [12]:
def classify(tree, observations, labeled=True):
    results = []
    
    for observation in observations:
        if labeled:
            observation = observation[1:]
        
        current_tree = tree
        
        while isinstance(current_tree, dict):
            for attr, branches in current_tree.items():
                value = observation[attr - 1]
                if value in branches:
                    current_tree = branches[value]
                else:
                    current_tree = None  
        
        results.append(current_tree)
    
    return results

In [13]:
tree1 = {1: {'x': 'A', 'y': 'B'}}
observations1 = [['A', 'x'], ['B', 'y']]
assert(classify(tree1, observations1) == ['A', 'B'])  # Should match tree labels 'A' and 'B'

observations2 = [['x'], ['y']]
assert(classify(tree1, observations2, labeled=False) == ['A', 'B'])  # Should classify correctly without labels

tree2 = {1: {'x': 'A'}}
observations3 = [['A', 'z']]  # 'z' is not in the tree
assert(classify(tree2, observations3) == [None])  # Should return None for missing branch

## `evaluate` <a id="evaluate"></a>

**Description:**  
This function calculates the error rate by comparing actual class labels from the dataset with the predicted labels. It counts the number of incorrect predictions and returns the error rate, which is the proportion of wrong predictions. This is necessary to test how well our decision tree is performing.

**Parameters**:  
- `data_set` (`list[list]`): The dataset where each inner list represents a data point, and the first element is the actual class label.
- `predictions` (`list[str]`): A list of predicted class labels corresponding to the observations in the dataset.

**Returns**:  
- A float representing the error rate, calculated as the proportion of incorrect predictions out of the total number of predictions.
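
A small worked example of the arithmetic, using made-up labels, shows how the error rate relates to accuracy:

```python
actual = ["e", "p", "e", "e"]
predicted = ["e", "e", "e", "e"]

# Count positions where the prediction differs from the actual label.
errors = sum(1 for a, p in zip(actual, predicted) if a != p)
n = len(actual)

error_rate = errors / n      # 1 wrong out of 4
accuracy = 1 - error_rate    # the complement, not the same number
print(f"{error_rate:.4f}")   # 0.2500
print(f"{accuracy:.4f}")     # 0.7500
```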

In [14]:
def evaluate(data_set: List[List[str]], predictions: List[str]) -> float:
    errors = 0
    total = len(data_set)
    
    for actual, predicted in zip(data_set, predictions):
        if actual[0] != predicted: 
            errors += 1
            
    return errors / total if total > 0 else 0.0

In [15]:
data1 = [['A', 'x'], ['B', 'y'], ['C', 'z']]
predictions1 = ['A', 'B', 'C']
assert(evaluate(data1, predictions1) == 0.0)  # No errors, error rate should be 0%

data2 = [['A', 'x'], ['B', 'y'], ['C', 'z']]
predictions2 = ['B', 'C', 'A']
assert(evaluate(data2, predictions2) == 1.0)  # All wrong, error rate should be 100%

data3 = [['A', 'x'], ['B', 'y'], ['C', 'z'], ['D', 'w']]
predictions3 = ['A', 'B', 'C', 'X']
assert(evaluate(data3, predictions3) == 0.25)  # One wrong, error rate should be 25%

## `pretty_print_tree` <a id="pretty_print_tree"></a>

**Description:**  
This function recursively prints a decision tree in a human readable format. Each decision node is represented by an attribute and its corresponding branches, and each leaf node is represented by the class label.

**Parameters**:  
- `tree` (`dict`): The decision tree, represented as a nested dictionary where internal nodes are attribute indices and leaf nodes are class labels.
- `depth` (`int`): The current depth of the tree. Used internally to manage indentation levels during the recursive printing. Defaults to 0.

In [16]:
def pretty_print_tree(tree, depth=0):
    if isinstance(tree, dict):
        for attr, branches in tree.items():
            for value, branch in branches.items():
                print(f"{'    ' * depth} |-- {attr} == {value}")
                pretty_print_tree(branch, depth + 1)
    else:
        print(f"{'    ' * depth} |-- {tree}")

In [17]:
# Don't really know how to test this with asserts so I will instead just print out a few trees and check they look right

tree1 = {1: {'x': 'A', 'y': 'B'}} #Simple Tree
pretty_print_tree(tree1)

print("\n")
tree2 = {1: {'x': {2: {'y': 'A', 'z': 'B'}}, 'y': 'C'}} #Nested Tree
pretty_print_tree(tree2)

print("\n")
tree3 = 'A' # Just a single leaf node
pretty_print_tree(tree3)

 |-- 1 == x
     |-- A
 |-- 1 == y
     |-- B


 |-- 1 == x
     |-- 2 == y
         |-- A
     |-- 2 == z
         |-- B
 |-- 1 == y
     |-- C


 |-- A


## `cross_validate` <a id="cross_validate"></a>

**Description:**  
This function performs k-fold cross-validation on a decision tree. The data is split into `k` folds; in each iteration one fold is used as the test set and the remaining folds are used for training. The function trains a decision tree on the training folds, evaluates its error rate on the test fold, and repeats this for every fold to get a more reliable reading of how the decision tree is performing.

**Parameters**:  
- `data` (`list[list]`): The dataset, where each inner list represents a data point and the first element of each data point is the class label.
- `k` (`int`): The number of folds to split the data into. Defaults to 5.

**Returns**:  
- A float representing the average error rate across all folds, expressed as a percentage.


In [18]:
def cross_validate(data: List[List[str]], k: int = 5) -> float:
    folds = create_folds(data, k)
    
    error_rates = []

    for i in range(k):
        # Hold out fold i for testing and train on all the other folds.
        test_fold = folds[i]
        train_folds = [fold for j, fold in enumerate(folds) if j != i]
        train_data = [item for sublist in train_folds for item in sublist]  
        
        decision_tree = train(train_data)
        
        # Strip the labels so the tree only sees attribute values.
        observations = [datapoint[1:] for datapoint in test_fold]
        
        predictions = classify(decision_tree, observations, labeled=False)
        
        # Error rate for this fold, expressed as a percentage.
        error_rate = evaluate(test_fold, predictions) * 100
        error_rates.append(error_rate)
        print(f"Fold {i+1} error rate: {error_rate:.2f}%")
    
    # The returned value is the mean error rate (not accuracy) across folds.
    average_error_rate = sum(error_rates) / k
    return average_error_rate


In [19]:
data1 = [['A', 'x'], ['B', 'y'], ['A', 'x'], ['B', 'y']]
assert(cross_validate(data1, k=2) == 0)  # Error rate should be 0%: every A is x and every B is y

data2 = [['A', 'x'], ['B', 'y'], ['A', 'y'], ['B', 'x']]
assert(cross_validate(data2, k=2) == 100)  # Each fold contradicts the other, so the error rate should be 100%

data3 = [['A', 'x'], ['B', 'y'], ['A', 'y']]
assert(cross_validate(data3, k=2) == 75)  # Error rate is 50% on one fold and 100% on the other, so the average is 75%


Fold 1 error rate: 0.00%
Fold 2 error rate: 0.00%
Fold 1 error rate: 100.00%
Fold 2 error rate: 100.00%
Fold 1 error rate: 50.00%
Fold 2 error rate: 100.00%


In [20]:
# Load the data 
data = parse_data('data/agaricus-lepiota.data')

# Perform 5-fold cross-validation
average_error_rate = cross_validate(data, k=5)
print(f"Average error rate across 5 folds: {average_error_rate:.2f}%")

# Train the decision tree model on all data
decision_tree = train(data)

# Pretty print the resulting decision tree
pretty_print_tree(decision_tree)

Fold 1 error rate: 0.25%
Fold 2 error rate: 0.00%
Fold 3 error rate: 0.00%
Fold 4 error rate: 0.00%
Fold 5 error rate: 0.00%
Average error rate across 5 folds: 0.05%
 |-- 5 == f
     |-- p
 |-- 5 == y
     |-- p
 |-- 5 == p
     |-- p
 |-- 5 == s
     |-- p
 |-- 5 == m
     |-- p
 |-- 5 == l
     |-- e
 |-- 5 == c
     |-- p
 |-- 5 == a
     |-- e
 |-- 5 == n
     |-- 20 == o
         |-- e
     |-- 20 == k
         |-- e
     |-- 20 == h
         |-- e
     |-- 20 == y
         |-- e
     |-- 20 == n
         |-- e
     |-- 20 == b
         |-- e
     |-- 20 == w
         |-- 22 == d
             |-- 8 == n
                 |-- p
             |-- 8 == b
                 |-- e
         |-- 22 == p
             |-- e
         |-- 22 == l
             |-- 3 == n
                 |-- e
             |-- 3 == w
                 |-- p
             |-- 3 == c
                 |-- e
             |-- 3 == y
                 |-- p
         |-- 22 == g
             |-- e
         |-- 22 == w
    

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.