# Decision Trees

In [1]:
from copy import deepcopy
from typing import List, Dict, Callable, Tuple, Any
from math import log2
import numpy as np
import random

Download the mushrooms [dataset](http://archive.ics.uci.edu/ml/datasets/Mushroom).

The following notebook implements a custom decision tree using the ID3 algorithm and tests the model on the mushroom dataset.

## Helpers

<a id="create_folds"></a>
## create_folds
*The create_folds function is a helper function that splits the data into n test sets.* **Used by**: [k_fold_validation](#k_fold_validation)

* **data** List[Any]: the dataset
* **n** int: the number of folds

**returns** List[List[Any]].

In [2]:
# function by S. butcher, 2022
def create_folds(data: List[Any], n: int) -> List[List[List]]:
    k, m = divmod(len(data), n)
    # be careful of generators...
    return list(data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
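
When the data does not divide evenly, the slicing above gives the first `m` folds one extra row each. The sketch below reuses the same slicing expression on a hypothetical 15-item list split into 4 folds; the `items` list and the fold count are illustrative assumptions, not values used elsewhere in the notebook.

```python
from typing import Any, List


def create_folds(data: List[Any], n: int) -> List[List[Any]]:
    # same slicing as the helper above: the first m folds get one extra item
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


items = list(range(15))  # hypothetical stand-in for 15 dataset rows
folds = create_folds(items, 4)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)  # [4, 4, 4, 3]
```

The folds are contiguous slices, so the data should be shuffled beforehand if its original order is meaningful.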

<a id="create_train_test"></a>
## create_train_test
*The create_train_test function is a helper that returns the training and test splits for the provided fold index.* **Used by**: [k_fold_validation](#k_fold_validation)

* **folds** List[List[List[str]]]: the folds created by [create_folds](#create_folds)
* **index** int: the index of the fold to use for the test set

**returns** Tuple[List[List], List[List]].

In [3]:
# function by S. butcher, 2022
def create_train_test(folds: List[List[List[str]]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

<a id="split_labels"></a>
## split_labels
*The split_labels function is a helper function to separate the labels from the data.* **Used by**: [k_fold_validation](#k_fold_validation)

* **data** List[List[str]]: the data with labels
* **label_index** int: the index of the label

**returns** Tuple[List[List[str]], List[str]].

In [4]:
def split_labels(data: List[List[str]], label_index: int) -> Tuple[List[List[str]], List[str]]:
    values = [row[:label_index] + row[label_index+1:] for row in data]
    labels = [row[label_index] for row in data]
    return values, labels

<a id="majority_label"></a>
## majority_label
*The majority_label function takes a list of labels and returns the value that occurs most frequently.* **Used by**: [id3](#id3)

* **data_labels** List[str]: the data labels

**returns** str.

In [5]:
def majority_label(data_labels: List[str]) -> str:
    majority = ""
    max_count = 0
    classes = set(data_labels)
    for element in classes:
        count = data_labels.count(element)
        if count > max_count:
            majority = element
            max_count = count
    return majority

In [6]:
#assertions/unit tests
test_labels_1 = ["y", "n", "y", "n", "y"]

assert majority_label(test_labels_1) == "y"

# even split, either label is acceptable
test_labels_2 = ["y", "n", "y", "n"]

assert majority_label(test_labels_2) in ("y", "n")

assert majority_label(["n"]) == "n"

assert majority_label([]) == ""

<a id="get_subset"></a>
## get_subset
*The get_subset function takes a dataset, a list of labels, the attribute index, and a value. The function returns a subset containing values for which the attribute index is equal to the value. The function also returns the updated labels for the subset. The function removes the current attribute from the subset because the id3 algorithm will not use it for calculations on the subset.* **Used by**: [id3](#id3)

* **data** List[List[str]]: the data
* **labels** List[str]: the data labels
* **attribute_index** int: the index of the attribute for the subset
* **value** str: the attribute value for the subset

**returns** Tuple[List[List[str]], List[str]].

In [7]:
def get_subset(data: List[List[str]], labels: List[str], attribute_index: int, value: str) -> Tuple[List[List[str]], List[str]]:
    subset = []
    subset_labels = []
    for index, row in enumerate(data):
        if row[attribute_index] == value:
            subset.append(row[:attribute_index] + row[attribute_index+1:])
            subset_labels.append(labels[index])
    return [subset, subset_labels]

In [8]:
test_data = [
    ["red", "round", "large"],
    ["blue", "round", "large"],
    ["green", "square", "small"],
    ["red", "square", "small"],
    ["green", "round", "large"],
]

test_labels = ["yes", "no", "yes", "no", "no"]

# should remove the best attribute from the subset
assert get_subset(test_data, test_labels, 0, "red") == [[["round", "large"], ["square", "small"]], ["yes", "no"]]
assert get_subset(test_data, test_labels, 1, "square") == [[["green", "small"], ["red", "small"]], ["yes", "no"]]
assert get_subset(test_data, test_labels, 2, "small") == [[["green", "square"], ["red", "square"]], ["yes", "no"]]

<a id="get_entropy"></a>
## get_entropy
*The get_entropy function takes a list of labels and returns the entropy. Entropy describes how evenly distributed the data is in terms of class labels. This value is used to calculate the information gain of each attribute. The information gain describes how well an attribute separates classes.* **Used by**: [get_best_attribute](#get_best_attribute), [get_weighted_entropy](#get_weighted_entropy)

* **labels** List[str]: the data labels

**returns** float.

In [9]:
def get_entropy(labels: List[str]) -> float:
    unique_classes = set(labels)
    entropy = 0
    # get the entropy for each class type and subtract from the total entropy
    for class_type in unique_classes:
        class_count = labels.count(class_type)
        if class_count == len(labels) or class_count == 0:
            # entropy is 0 for either case, no need to add to the calculation
            continue
        # subtract the class entropy from the total entropy
        entropy -= class_count/len(labels) * log2(class_count/len(labels))
    return entropy

In [10]:
# assertions/unit tests
test_labels_1 = ["y", "n", "y", "n", "y", "y", "y", "y"]

assert round(get_entropy(test_labels_1), 2) == 0.81

test_labels_2 = ["n", "n"]

assert get_entropy(test_labels_2) == 0

test_labels_3 = ["y", "n"]

assert get_entropy(test_labels_3) == 1
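
For reference, the quantity computed above is the sum over classes of `-p * log2(p)`, where `p` is the proportion of labels in that class. The sketch below is a compact alternative that counts every class in one pass with `collections.Counter`; the name `entropy_from_counts` is hypothetical and is not used elsewhere in the notebook.

```python
from collections import Counter
from math import log2
from typing import List


def entropy_from_counts(labels: List[str]) -> float:
    # hypothetical helper: count each class once instead of calling list.count per class
    total = len(labels)
    return sum(-(count / total) * log2(count / total) for count in Counter(labels).values())


# 6 "y" and 2 "n": -(6/8) * log2(6/8) - (2/8) * log2(2/8)
print(round(entropy_from_counts(["y", "n", "y", "n", "y", "y", "y", "y"]), 2))  # 0.81
```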

<a id="get_weighted_entropy"></a>
## get_weighted_entropy
*The get_weighted_entropy function takes a subset containing data for a specific attribute, the attribute index, and the label index. The function returns the weighted entropy of the attribute. The weighted entropy is the sum, over the attribute's values, of the entropy of each value's subset multiplied by the proportion of rows that have that value. For example, if there are 3 red examples (1 positive, 2 negative) and 5 blue examples (3 positive, 2 negative), the algorithm would calculate weighted entropy as follows:*

* Entropy Red: -1/3 * log2(1/3) - 2/3 * log2(2/3)
* Entropy Blue: -3/5 * log2(3/5) - 2/5 * log2(2/5)
* Weighted Entropy: 3/8 * (Entropy Red) + 5/8 * (Entropy Blue)

*The weighted entropy is used to calculate the information gain for a feature.*
**Used by**: [get_best_attribute](#get_best_attribute)

* **attribute_data** List[List[str]]: the attribute subset
* **attribute_index** int: the index of the attribute in the subset
* **label_index** int: the index of the label in the subset
  
**returns** float.

In [11]:
def get_weighted_entropy(attribute_data: List[List[str]], attribute_index: int, label_index: int) -> float:
    unique_values = set([row[attribute_index] for row in attribute_data])
    weighted_entropy = 0
    for value_type in unique_values:
        value_subset = [row for row in attribute_data if row[attribute_index] == value_type]
        # calculate the entropy for each value in the attribute subset
        entropy = get_entropy([row[label_index] for row in value_subset])
        # calculate the weighted entropy by summing the proportion * entropy for each value
        weighted_entropy += entropy * len(value_subset)/len(attribute_data)   
    return weighted_entropy

In [12]:
# assertions/unit tests
test_attribute_data_1 = [["no", "blue"], ["yes", "green"], ["no", "red"], ["yes", "red"], 
                        ["no", "blue"], ["no", "blue"], ["yes", "red"], ["no", "green"], 
                        ["yes", "green"], ["yes", "green"], ["no", "red"], ["yes", "green"],
                        ["yes", "red"], ["no", "red"], ["no", "green"]]

assert round(get_weighted_entropy(test_attribute_data_1, 1, 0), 2) == 0.77

# homogeneous data split, should return 0
test_attribute_data_2 = [["yes", "blue"], ["no", "green"], ["no", "red"], ["no", "red"], 
                        ["yes", "blue"], ["yes", "blue"], ["no", "red"], ["no", "green"]]

assert round(get_weighted_entropy(test_attribute_data_2, 1, 0), 2) == 0.00

# data evenly distributed, should return 1
test_attribute_data_3 = [["no", "blue"], ["no", "blue"], ["yes", "red"], ["yes", "red"], 
                        ["yes", "blue"], ["yes", "blue"], ["no", "red"], ["no", "red"]]

assert round(get_weighted_entropy(test_attribute_data_3, 1, 0), 2) == 1.00
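
The red/blue example from the description can be worked through numerically. The sketch below is a standalone calculation, assuming a hypothetical `binary_entropy` helper that takes class counts directly rather than the notebook's row format.

```python
from math import log2


def binary_entropy(positive: int, negative: int) -> float:
    # hypothetical helper: entropy of a two-class split given the class counts
    total = positive + negative
    return sum(-(c / total) * log2(c / total) for c in (positive, negative) if c > 0)


entropy_red = binary_entropy(1, 2)   # 3 red examples: 1 positive, 2 negative
entropy_blue = binary_entropy(3, 2)  # 5 blue examples: 3 positive, 2 negative
weighted_entropy = 3 / 8 * entropy_red + 5 / 8 * entropy_blue
# the 8 examples are 4 positive and 4 negative before the split
information_gain = binary_entropy(4, 4) - weighted_entropy

print(round(entropy_red, 3), round(entropy_blue, 3))  # 0.918 0.971
print(round(weighted_entropy, 3), round(information_gain, 3))  # 0.951 0.049
```

The gain is small here because both color subsets remain nearly as mixed as the full set.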

<a id="get_best_attribute"></a>
## get_best_attribute
*The get_best_attribute function calculates the information gain for each attribute by subtracting the attribute's weighted entropy from the initial entropy of the dataset or subset. The information gain describes how well each attribute separates class labels. An attribute with perfect information gain has homogeneous subgroups for its values. An example of an attribute with perfect information gain is color (3 red yes, 3 blue no). In this example, color perfectly classifies the data because all red values return yes, and all blue values return no. The function returns the index and name of the attribute with the highest information gain.*
**Used by**: [id3](#id3)

* **data** List[List[str]]: the dataset or subset
* **labels** List[str]: the labels for the dataset or subset
* **attributes** List[str]: the available attributes in the data or subset (excluding labels)
  
**returns** Tuple[int, str].

In [13]:
def get_best_attribute(data: List[List[str]], labels: List[str], attributes: List[str]) -> Tuple[int, str]:
    information_gains = []
    # the entropy before splitting is the same for every attribute
    initial_entropy = get_entropy(labels)
    # calculate the information gain for each attribute and add to the list
    for index, attribute in enumerate(attributes):
        # pair each row's label with its value for the current attribute
        attribute_data = [[labels[row_index], row[index]] for row_index, row in enumerate(data)]
        weighted_entropy = get_weighted_entropy(attribute_data, 1, 0)
        information_gains.append(initial_entropy - weighted_entropy)
    # get the best attribute and best attribute index (the first attribute wins a tie)
    best_attribute_gain = max(information_gains)
    best_attribute_index = information_gains.index(best_attribute_gain)
    return best_attribute_index, attributes[best_attribute_index]

In [14]:
# assertions/unit tests
test_data_1 = [
    ["round", "large", "blue"],
    ["square", "large", "green"],  
    ["square", "small", "red"],  
    ["round", "large", "red"],  
    ["square", "small", "blue"],  
    ["round", "small", "blue"],  
    ["round", "small", "red"],  
    ["square", "small", "green"],  
    ["round", "large", "green"],  
    ["square", "large", "green"],  
    ["square", "large", "red"],  
    ["square", "large", "green"],  
    ["round", "large", "red"],  
    ["square", "small", "red"],  
    ["round", "small", "green"]
]
test_labels_1 = ["no", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "no"]
test_attributes_1 = ["shape", "size", "color"]

assert get_best_attribute(test_data_1, test_labels_1, test_attributes_1) == (1, "size")

# subset small
test_data_2 = [
    ["square", "red"],  
    ["square", "blue"],  
    ["round", "blue"],  
    ["round", "red"],  
    ["square", "green"],  
    ["square", "red"],  
    ["round", "green"]
]
test_labels_2 = ["no", "no", "no", "yes", "no", "no", "no"]
test_attributes_2 = ["shape", "color"]

# choose first attribute for a tie
assert get_best_attribute(test_data_2, test_labels_2, test_attributes_2) == (0, "shape")

# subset large
test_data_3 = [
    ["round", "blue"],
    ["square", "green"],  
    ["round", "red"],   
    ["round", "green"],  
    ["square","green"],  
    ["square", "red"],  
    ["square", "green"],  
    ["round", "red"] 
]
test_labels_3 = ["no", "yes", "yes", "yes", "yes", "no", "yes", "yes"]
test_attributes_3 = ["shape", "color"]

assert get_best_attribute(test_data_3, test_labels_3, test_attributes_3) == (1, "color")
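
The "perfect information gain" case from the description (3 red yes, 3 blue no) can be checked directly. The sketch below is a standalone illustration: the `information_gain` helper and the `shape` column are hypothetical and work on a single attribute column rather than the notebook's row format.

```python
from collections import Counter
from math import log2
from typing import List


def entropy(labels: List[str]) -> float:
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values: List[str], labels: List[str]) -> float:
    # hypothetical helper: entropy before the split minus the weighted entropy after it
    weighted = 0.0
    for value in set(values):
        subset = [label for v, label in zip(values, labels) if v == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted


labels = ["yes", "yes", "yes", "no", "no", "no"]
color = ["red", "red", "red", "blue", "blue", "blue"]  # separates the classes perfectly
shape = ["round", "square", "round", "square", "round", "square"]  # hypothetical second attribute

print(information_gain(color, labels))  # 1.0
print(round(information_gain(shape, labels), 3))  # 0.082
```

Because color has the larger gain, an ID3-style split would choose it first.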

<a id="id3"></a>
## id3
*The id3 algorithm (Iterative Dichotomiser 3) builds a decision tree using the training data. The id3 algorithm is a recursive algorithm that builds the tree by calculating the information gain of each remaining attribute to determine the best attribute for the next node (See [get_best_attribute](#get_best_attribute)). Classification functions can use this decision tree to predict future values. The function takes an optional depth_limit parameter. Applying a depth limit can help correct overfitting. The function returns the tree as nested dictionaries.*
**Used by**: [k_fold_validation](#k_fold_validation)

* **data** List[List[str]]: the training dataset
* **data_labels** List[str]: the labels for the dataset
* **attributes** List[str]: the attributes in the dataset (excluding class label)
* **default** str: the majority class label
* **current_depth** int: the current depth of the tree
* **depth_limit** int: the depth limit

**returns** Dict[str, Any].

In [15]:
def id3(data: List[List[str]], data_labels: List[str], attributes: List[str], default: str, current_depth: int, depth_limit: int = None) -> Dict[str, Any]:
    if len(data) == 0:
        return default
    if len(set(data_labels)) == 1:
        return data_labels[0]
    if len(attributes) == 0 or current_depth == depth_limit:
        return majority_label(data_labels)
    # get best attribute, majority class, and attribute domain
    best_attribute_index, best_attribute = get_best_attribute(data, data_labels, attributes)
    node = {best_attribute: {}}
    default_label = majority_label(data_labels)
    attribute_domain = set([row[best_attribute_index] for row in data])
    # call id3 recursively for each value in the best attribute's domain
    for value in attribute_domain:
        value_subset, subset_labels = get_subset(data, data_labels, best_attribute_index, value)
        new_attributes = deepcopy(attributes)
        new_attributes.remove(best_attribute)
        child = id3(value_subset, subset_labels, new_attributes, default_label, current_depth+1, depth_limit)
        node[best_attribute][value] = child
    return node

In [16]:
# assertions/unit tests
test_data_1 = [
    ["round", "large", "blue"],
    ["square", "large", "green"],  
    ["square", "small", "red"],  
    ["round", "large", "red"],  
    ["square", "small", "blue"],  
    ["round", "small", "blue"],  
    ["round", "small", "red"],  
    ["square", "small", "green"],  
    ["round", "large", "green"],  
    ["square", "large", "green"],  
    ["square", "large", "red"],  
    ["square", "large", "green"],  
    ["round", "large", "red"],  
    ["square", "small", "red"],  
    ["round", "small", "green"]
]

test_labels_1 = ["no", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "no"]
test_attributes = ["shape", "size", "color"]

expected_tree_1 = {'size': 
                   {'small': 
                    {'shape': 
                     {'round': 
                      {'color': 
                       {'green': 'no', 
                        'blue': 'no', 
                        'red': 'yes'}}, 
                      'square': 'no'}}, 
                    'large': 
                    {'color': 
                     {'green': 'yes', 
                      'blue': 'no', 
                      'red': 
                      {'shape': 
                       {'round': 'yes', 
                        'square': 'no'}}}}}}

assert id3(test_data_1, test_labels_1, test_attributes, "no", 0, None) == expected_tree_1
# test with depth limit of 0
assert id3(test_data_1, test_labels_1, test_attributes, "no", 0, 0) == 'no'
# test with depth limit of 1
assert id3(test_data_1, test_labels_1, test_attributes, "no", 0, 1) == {'size': {'small': 'no', 'large': 'yes'}}
# test with depth limit of 2
expected_tree_2 = {'size': 
                   {'small': 
                    {'shape': 
                     {'round': 'no', 
                      'square': 'no'}}, 
                    'large': {
                        'color': 
                          {'green': 'yes', 
                           'blue': 'no', 
                           'red': 'yes'}}}}
assert id3(test_data_1, test_labels_1, test_attributes, "no", 0, 2) == expected_tree_2

<a id="classify_observation"></a>
## classify_observation
*The classify_observation algorithm searches a decision tree recursively to find the classification for a single observation.*
**Used by**: [k_fold_validation](#k_fold_validation)

* **current_node** Dict[str, Any] | str: the current tree or subtree to search
* **observation** List[str]: the labeled or unlabeled observation to classify
* **attributes** List[str]: the attributes in the dataset (excluding class label)

**returns** str.

In [17]:
def classify_observation(current_node: Dict[str, Any] | str, observation: List[str], attributes: List[str]) -> str:
    # return when hitting a leaf node 
    if not isinstance(current_node, dict):
        return current_node

    # get the current attribute and its values from the decision tree
    attribute, attribute_values = list(current_node.items())[0]
    attribute_index = attributes.index(attribute)
    
    # get the value of the attribute for this observation
    attribute_value = observation[attribute_index]
    
    # get the subtree for this attribute value
    subtree = attribute_values.get(attribute_value)

    return classify_observation(subtree, observation, attributes)

In [18]:
test_tree = {'size': 
                   {'small': 
                    {'shape': 
                     {'round': 
                      {'color': 
                       {'green': 'no', 
                        'blue': 'no', 
                        'red': 'yes'}}, 
                      'square': 'no'}}, 
                    'large': 
                    {'color': 
                     {'green': 'yes', 
                      'blue': 'no', 
                      'red': 
                      {'shape': 
                       {'round': 'yes', 
                        'square': 'no'}}}}}}

test_attributes = ["shape", "size", "color"]
# assertions/unit tests
assert classify_observation(test_tree, ["round", "large", "green"], test_attributes) == "yes"
assert classify_observation(test_tree, ["round", "small", "green"], test_attributes) == "no"
assert classify_observation(test_tree, ["square", "large", "green"], test_attributes) == "yes"
# unseen example with the knowledge that blue is always no
assert classify_observation(test_tree, ["square", "large", "blue"], test_attributes) == "no"
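
Because the subtree lookup uses `dict.get`, an observation whose attribute value has no branch in the tree yields `None` rather than a label. One possible way to handle that case, sketched below under the assumption that falling back to a caller-supplied default label is acceptable, is a variant that checks for the branch first. The name `classify_with_default` and the small `tree` are hypothetical.

```python
from typing import Any, Dict, List, Union

Tree = Union[Dict[str, Any], str]


def classify_with_default(node: Tree, observation: List[str], attributes: List[str], default: str) -> str:
    # hypothetical variant of classify_observation: fall back to `default`
    # when the observation has an attribute value the tree has no branch for
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        value = observation[attributes.index(attribute)]
        if value not in branches:
            return default
        node = branches[value]
    return node


tree = {"size": {"small": "no", "large": {"color": {"green": "yes", "blue": "no"}}}}
attributes = ["shape", "size", "color"]

print(classify_with_default(tree, ["round", "large", "green"], attributes, "no"))  # yes
print(classify_with_default(tree, ["round", "large", "purple"], attributes, "no"))  # no
print(classify_with_default(tree, ["round", "medium", "green"], attributes, "yes"))  # yes
```

A natural choice for `default` would be the majority label of the training data, but that is a design decision rather than something the notebook prescribes.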

<a id="classify"></a>
## classify
*The classify function takes a list of labeled or unlabeled data and returns predictions by calling [classify_observation](#classify_observation) for each value.*
**Used by**: [k_fold_validation](#k_fold_validation)

* **decision_tree** Dict[str, Any] | str: a decision tree represented as nested dictionaries
* **observations** List[List[str]]: the labeled or unlabeled observations to classify
* **attributes** List[str]: the attributes in the dataset (excluding class label)

**returns** List[str].

In [19]:
def classify(decision_tree: Dict[str, Any], observations: List[List[str]], attributes: List[str]) -> List[str]:
    classifications = []
    for observation in observations:
        prediction = classify_observation(decision_tree, observation, attributes)
        classifications.append(prediction)
    return classifications

In [20]:
# assertions/unit tests
test_tree = {'size': 
                   {'small': 
                    {'shape': 
                     {'round': 
                      {'color': 
                       {'green': 'no', 
                        'blue': 'no', 
                        'red': 'yes'}}, 
                      'square': 'no'}}, 
                    'large': 
                    {'color': 
                     {'green': 'yes', 
                      'blue': 'no', 
                      'red': 
                      {'shape': 
                       {'round': 'yes', 
                        'square': 'no'}}}}}}

test_attributes = ["shape", "size", "color"]
predictions_1 = classify(test_tree, [["round", "large", "green"], ["round", "small", "green"], ["square", "large", "green"]], test_attributes)
assert predictions_1 == ["yes", "no" , "yes"]

predictions_2 = classify(test_tree, [["round", "large", "blue"], ["square", "large", "green"]], test_attributes)
assert predictions_2 == ["no" , "yes"]

predictions_3 = classify(test_tree, [["square", "large", "blue"], ["round", "small", "blue"]], test_attributes)
assert predictions_3 == ["no" , "no"]

<a id="error_rate"></a>
## error_rate
*The error_rate function takes a list of predictions and a list of true values and returns the error rate of the predictions using the formula incorrect_predictions/total_predictions.*
**Used by**: [k_fold_validation](#k_fold_validation)

* **predictions** List[str]: the predictions for a classification task
* **labels** List[str]: the true class values

**returns** float.

In [21]:
def error_rate(predictions: List[str], labels: List[str]) -> float:
    total = len(predictions)
    incorrect = sum(pred != true_value for pred, true_value in zip(predictions, labels))
    return incorrect / total

In [22]:
# assertions/unit tests
classifications_1 = ["yes", "no", "yes"]
labels_1 = ["yes", "no", "yes"]

assert error_rate(classifications_1, labels_1) == 0.00

classifications_2 = ["yes", "yes", "yes"]
labels_2 = ["no", "no", "no"]

assert error_rate(classifications_2, labels_2) == 1.00

classifications_3 = ["no", "no", "yes", "yes"]
labels_3 = ["no", "no", "no", "no"]

assert error_rate(classifications_3, labels_3) == 0.50

<a id="k_fold_validation"></a>
## k_fold_validation

*The k_fold_validation function applies k-fold validation to the dataset. For each fold, it trains a model on the remaining folds, evaluates the model on both the training data and the held-out fold, and prints the two error rates. After the last fold, it prints and returns the average error and the error variance for the training and test sets. K-fold validation works by partitioning the dataset into k disjoint folds, using each fold once as the test set and the rest of the data as the training set. This algorithm helps evaluate small datasets.*

* **model_function** Callable: the algorithm for the model
* **classify_function** Callable: the classification function
* **eval_function** Callable: the evaluation function
* **label_index** int: the label index in the dataset
* **folds** List[List[List[str]]]: the k folds to evaluate
* **attributes** List[str]: the data attributes (excluding class label)
* **depth_limit** int: (optional) the depth limit for the decision tree (defaults to None)

**returns** Tuple[float, float, float, float].

In [23]:
def k_fold_validation(model_function: Callable, classify_function: Callable, eval_function: Callable, label_index: int, folds: List[List[List[str]]], attributes: List[str], depth_limit: int = None) -> Tuple[float, float, float, float]:
    total_train_loss, total_test_loss = 0, 0
    train_losses, test_losses = [], []
    for index in range(len(folds)):
        # split data and labels
        train, test = create_train_test(folds, index)
        train_examples, train_labels = split_labels(train, label_index)
        test_examples, test_labels = split_labels(test, label_index)
        # create decision tree with the training data
        default_train_label = majority_label(train_labels)
        decision_tree = model_function(train_examples, train_labels, attributes, default_train_label, 0, depth_limit)

        train_predictions = classify_function(decision_tree, train_examples, attributes)
        train_loss = eval_function(train_predictions, train_labels)
        test_predictions = classify_function(decision_tree, test_examples, attributes)
        test_loss = eval_function(test_predictions, test_labels)
        print(f"Fold: {index+1}, train loss: {train_loss*100}%, test loss: {test_loss*100}%")
        total_train_loss  += train_loss
        train_losses.append(train_loss)
        total_test_loss  += test_loss
        test_losses.append(test_loss)
    print("---------------------------------------------------")
    print(f"Average Train Loss: {(total_train_loss/len(folds)) * 100}%, Train Loss Variance: {np.var(train_losses)}")
    print(f"Average Test Loss: {(total_test_loss/len(folds)) * 100}%, Test Loss Variance: {np.var(test_losses)}")
    return total_train_loss/len(folds), np.var(train_losses), total_test_loss/len(folds), np.var(test_losses)

In [24]:
# assertions/unit tests
test_data = [
    ["round", "large", "blue", "no"],
    ["square", "large", "green", "yes"],  
    ["square", "small", "red", "no"],  
    ["round", "large", "red", "yes"],  
    ["square", "small", "blue", "no"],  
    ["round", "small", "blue", "no"],  
    ["round", "small", "red", "yes"],  
    ["square", "small", "green", "no"],  
    ["round", "large", "green", "yes"],  
    ["square", "large", "green", "yes"],  
    ["square", "large", "red", "no"],  
    ["square", "large", "green", "yes"],  
    ["round", "large", "red", "yes"],  
    ["square", "small", "red", "no"],  
    ["round", "small", "green", "no"]
]

folds = create_folds(test_data, 10)
average_train_error, train_variance,  average_test_error, test_variance = k_fold_validation(id3, classify, error_rate, 3, folds, test_attributes)

# assertions
assert average_train_error == 0.00
assert average_test_error == 0.25
assert train_variance == 0.00
assert test_variance ==  0.1125

Fold: 1, train loss: 0.0%, test loss: 50.0%
Fold: 2, train loss: 0.0%, test loss: 0.0%
Fold: 3, train loss: 0.0%, test loss: 50.0%
Fold: 4, train loss: 0.0%, test loss: 50.0%
Fold: 5, train loss: 0.0%, test loss: 0.0%
Fold: 6, train loss: 0.0%, test loss: 100.0%
Fold: 7, train loss: 0.0%, test loss: 0.0%
Fold: 8, train loss: 0.0%, test loss: 0.0%
Fold: 9, train loss: 0.0%, test loss: 0.0%
Fold: 10, train loss: 0.0%, test loss: 0.0%
---------------------------------------------------
Average Train Loss: 0.0%, Train Loss Variance: 0.0
Average Test Loss: 25.0%, Test Loss Variance: 0.1125
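
The summary figures can be reproduced from the per-fold test errors printed above. With 15 rows split into 10 folds, each test fold holds only one or two rows, so a fold's error can only be 0%, 50%, or 100%. The sketch below uses the standard library's `statistics` module instead of NumPy; `pvariance` is the population variance, which matches the default behavior of `np.var`.

```python
from statistics import mean, pvariance

# per-fold test error rates from the run above
test_errors = [0.5, 0.0, 0.5, 0.5, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

average_error = mean(test_errors)
# population variance: the mean squared deviation from the average error
error_variance = pvariance(test_errors)

print(round(average_error, 4), round(error_variance, 4))  # 0.25 0.1125
```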


## 10-Fold Validation on the Mushroom Dataset

In [25]:
# column labels
col_names = ["label", "cap-shape", "cap-surface", "cap-color", "bruises?", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", 
             "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring",
            "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat"]
# read in data
def read_data(filename, delimiter):
    with open(filename, 'r') as f:
        data = [line.strip().split(delimiter) for line in f]
    random.shuffle(data)
    return data

data = read_data('Datasets/agaricus-lepiota.data', ",")

In [26]:
new_folds = create_folds(data, 10)
average_train_error, train_variance, average_test_error, test_variance = k_fold_validation(id3, classify, error_rate, 0, new_folds, col_names[1:])

Fold: 1, train loss: 0.0%, test loss: 0.0%
Fold: 2, train loss: 0.0%, test loss: 0.0%
Fold: 3, train loss: 0.0%, test loss: 0.0%
Fold: 4, train loss: 0.0%, test loss: 0.0%
Fold: 5, train loss: 0.0%, test loss: 0.0%
Fold: 6, train loss: 0.0%, test loss: 0.0%
Fold: 7, train loss: 0.0%, test loss: 0.0%
Fold: 8, train loss: 0.0%, test loss: 0.0%
Fold: 9, train loss: 0.0%, test loss: 0.0%
Fold: 10, train loss: 0.0%, test loss: 0.0%
---------------------------------------------------
Average Train Loss: 0.0%, Train Loss Variance: 0.0
Average Test Loss: 0.0%, Test Loss Variance: 0.0


## Decision Tree Using the Entire Dataset

In [27]:
def pretty_print_tree(tree, indent=""):
    if isinstance(tree, dict):
        for key, value in tree.items():
            print(indent + "|--" + key)
            pretty_print_tree(value, indent + "   ")
    else:
        print(indent + "  " + tree)

In [28]:
new_data = [row[1:] for row in data]
labels = [row[0] for row in data]
tree = id3(new_data, labels, col_names[1:], "p", 0)
pretty_print_tree(tree)

|--odor
   |--p
        p
   |--f
        p
   |--a
        e
   |--l
        e
   |--n
      |--spore-print-color
         |--o
              e
         |--r
              p
         |--n
              e
         |--y
              e
         |--k
              e
         |--w
            |--habitat
               |--p
                    e
               |--l
                  |--cap-color
                     |--w
                          p
                     |--n
                          e
                     |--y
                          p
                     |--c
                          e
               |--g
                    e
               |--w
                    e
               |--d
                  |--gill-size
                     |--n
                          p
                     |--b
                          e
         |--b
              e
         |--h
              e
   |--y
        p
   |--m
        p
   |--s
        p
   |--c
        p


As shown above, the decision tree learned to classify the mushroom dataset perfectly. However, this model might not generalize well to mushroom datasets with different features or to new species of mushrooms. Setting a depth limit can help prevent the model from overfitting. Validation curves could help identify the best depth limit for the id3 algorithm on the mushroom classification problem. A validation curve is built by running k-fold validation for a range of depth limits (e.g., 1-10) and plotting the training and test error for each depth limit.
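
A depth-limit sweep of this kind could be sketched as follows. This is a minimal illustration, not part of the notebook: `depth_limit_sweep`, `best_depth_limit`, and `toy_evaluate` are hypothetical names, and the error rates in `toy_evaluate` are made up purely to exercise the code.

```python
from typing import Callable, Iterable, List, Tuple

# (depth_limit, average_train_error, average_test_error)
CurvePoint = Tuple[int, float, float]


def depth_limit_sweep(evaluate: Callable[[int], Tuple[float, float]], depth_limits: Iterable[int]) -> List[CurvePoint]:
    # hypothetical helper: `evaluate(depth)` is assumed to run k-fold validation
    # with that depth limit and return (average train error, average test error)
    return [(depth, *evaluate(depth)) for depth in depth_limits]


def best_depth_limit(curve: List[CurvePoint]) -> int:
    # lowest average test error wins; ties go to the shallower tree
    return min(curve, key=lambda point: (point[2], point[0]))[0]


def toy_evaluate(depth: int) -> Tuple[float, float]:
    # made-up error rates for illustration only, not results from the mushroom data
    errors = {1: (0.20, 0.22), 2: (0.05, 0.08), 3: (0.00, 0.03), 4: (0.00, 0.03)}
    return errors[depth]


curve = depth_limit_sweep(toy_evaluate, range(1, 5))
print(best_depth_limit(curve))  # 3
```

In this notebook, `evaluate` could wrap `k_fold_validation` with the given `depth_limit` and return its first and third results (the average training and test errors); the points in `curve` could then be plotted against the depth limit.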