In [1]:
import random
from copy import deepcopy
from pprint import pprint
import numpy as np
from typing import List, Dict, Tuple, Any

## Decision Trees

You will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

**Just in case** the UCI repository is down, which happens from time to time, I have included the data and name files on Blackboard.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can used Dicts, NamedTuples, etc. as your abstract data type (ADT) for the the tree and nodes.
    </p>
</div>

One of the things we did not talk about in the lectures was how to deal with missing values. There are two aspects of the problem here. What do we do with missing values in the training data? What do we do with missing values when doing classifcation?

For the first problem, C4.5 handled missing values in an interesting way. Suppose we have identifed some attribute *B* with values {b1, b2, b3} as the best current attribute. Furthermore, assume there are 5 observations with B=?, that is, we don't know the attribute value. In C4.5, those 5 observations would be added to *all* of the subsets created by B=b1, B=b2, B=b3 with decreased weights. Note that the observations with missing values are not part of the information gain calculation.

This doesn't quite help us if we have missing values when we use the model. What happens if we have missing values during classification? One approach is to prepare for this advance. When you train the tree, you need to add an implicit attribute value "?" at every split. For example, if the attribute was "size" then the domain would be ["small", "medium", "large", "?"]. The "?" value gets all the data (because ? is now a wildcard). However, there is an issue with this approach. "?" becomes the worst possible attribut value because it has no classification value. What to do? There are several options:

1. Never recurse on "?" if you do not also recurse on at least one *real* attribute value.
2. Limit the depth of the tree.

There are good reasons, in general, to limit the depth of a decision tree because they tend to overfit.
Otherwise, the algorithm *will* exhaust all the attributes trying to fulfill one of the base cases.

You must implement the following functions:

`train` takes training_data and returns the Decision Tree as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, depth_limit=None):
   # returns the Decision Tree.
```

The `depth_limit` value defaults to None. (What technique would we use to determine the best parameter value for `depth_limit` hint: Module 3!)

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data).

```
def classify(tree, observations, labeled=True):
    # returns a list of classifications
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as evaluation metric or the submission will be deemed incomplete, ie, an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application).

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric cross the folds. You should print the error rates in terms of percents (ie, multiply the error rate by 100 and add "%" to the end).

```
def pretty_print_tree(tree):
    # pretty prints the tree
```

This should be a text representation of a decision tree trained on the entire data set (no train/test).

To summarize...

Apply the Decision Tree algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. When you are done, apply the Decision Tree algorithm to the entire data set and print out the resulting tree.

**Note** Because this assignment has a natural recursive implementation, you should consider using `deepcopy` at the appropriate places.

-----

In [2]:
def parse_data(file_name: str) -> List[List]:
    data = []
    file = open(file_name, "r")
    for line in file:
        datum = [str(value) for value in line.rstrip().split(",")]
        data.append(datum)
    random.shuffle(data)
    return data

In [3]:
data = parse_data("data/agaricus-lepiota.data")

In [4]:
len(data)

8124

In [5]:
attributes = [
    "cap-shape",
    "cap-surface",
    "cap-color",
    "bruises?",
    "odor",
    "gill-attachment",
    "gill-spacing",
    "gill-size",
    "gill-color",
    "stalk-shape",
    "stalk-root",
    "stalk-surface-above-ring",
    "stalk-surface-below-ring",
    "stalk-color-above-ring",
    "stalk-color-below-ring",
    "veil-type",
    "veil-color",
    "ring-number",
    "ring-type",
    "spore-print-color",
    "population",
    "habitat",
]


In [6]:
len(attributes)

22

<a id="extract_column"></a>
### extract_column

`extract_column` extracts a column into a list from a list of lists. **Used by**: 

* **data** List[List]: the list of lists
* **column** int: determines which column to extract

**return**: List: the extracted column as a list

In [7]:
def extract_column(data: List[List], column) -> List:
    extract = []
    for i, value in enumerate(data):
        extract.append(data[i][column])
    return extract

In [8]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    return list(xs[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(n))

In [9]:
def create_train_test(
    folds: List[List[List]], index: int
) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test


<a id="id3"></a>
### id3

`id3` runs the id3 algorithm to build a decision tree. **Uses**: 

* **data** List[List]: the list cantaining the data
* **attributes** List: identifies the attributes in the data

**return**: List: the extracted column as a list

In [10]:
def id3(data, attributes, default) -> Dict:
    if len(data) == 0:
        return default
    h = is_homogenous(data)
    if h is not None:
        return h
    if attributes == []:
        label = majority_label(data)
        return label
    best_attr = pick_best_attribute(data, attributes)
    attr = best_attr[0]
    attr_index = attributes.index(best_attr[0]) + 1
    domain = np.unique(extract_column(data, attr_index))
    np.append(domain, "?")
    node = new_node(attr, domain)
    default_value = majority_label(data)

    for value in domain:
        subset = attr_value(data, value, attr_index)
        _attributes = deepcopy(attributes)
        del _attributes[attr_index - 1]
        child = id3(subset, _attributes, default_value)
        node[attr][value] = child
    return node

<a id="attr_value"></a>
### attr_value

`attr_value` creates a subset of data for the recursive calls. **Used by**: 

* **data** List[List]: the list cantaining the data
* **value** List: identifies which value is being used for the subset

**return**: List: subset as a list

In [11]:
def attr_value(data, value, attr_index):
    subset = []
    if value == "?":
        for i in data:
            subset.append(i[:attr_index] + i[attr_index + 1 :])
        return subset
    for i in data:
        if i[attr_index] == value:
            subset.append(i[:attr_index] + i[attr_index + 1 :])
    return subset

<a id="new_node"></a>
### new_node

`new_node` creates a node for the tree. **Used by**: 

* **attribute** str: the attribute of interest
* **domain** List: the unique values in the desired attribute

**return**: Dict: dictionary containing the node

In [12]:
def new_node(attribute, domain):
    for i in domain:
        tree = {attribute: {i: None}}
    return tree

<a id="is_homogenous"></a>
### is_homogenous

`is_homogenous` determines if an attribute is homogenous. **Used by**: 

* **data** List[List]: the list cantaining the data

**return**: str: returns the class label if homogenous or none

In [13]:
def is_homogenous(data):
    col = extract_column(data, 0)
    if col.count("e") == len(data):
        return "e"
    if col.count("p") == len(data):
        return "p"
    else:
        return None

<a id="majority_label"></a>
### majority_label

`majority_label` determines the majority class label. **Used by**: 

* **data** List[List]: the list cantaining the data

**return**: str: returns the class label

In [14]:
def majority_label(data):
    col = extract_column(data, 0)
    if col.count("e") > col.count("p"):
        majority = "e"
    else:
        majority = "p"
    return majority

<a id="pick_best_attribute"></a>
### pick_best_attribute

`pick_best_attribute` determines the attribute with the most gain. **Used by**: 

* **data** List[List]: the list cantaining the data
* **attributes** List: list of attributes

**return**: List: returns the attribute with the most gain and the gain value

In [15]:
def pick_best_attribute(data, attributes) -> List:
    col_class = extract_column(data, 0)
    e_total = entropy(col_class)
    a_dict = create_attribute_dict(data, attributes)
    e_dict = entropy_dict(a_dict)
    i_dict = information_gain(e_dict, e_total, data)
    max_gain = (0, 0)
    for key in i_dict:
        if i_dict[key]["gain"] > max_gain[1]:
            max_gain = (key, i_dict[key]["gain"])
    return max_gain

<a id="create_attribute_dict"></a>
### create_attribute_dict

`create_attribute_dict` creates a dictionary from the attributes to count the class labels for each attribute. **Used by**: 

* **data** List[List]: the list cantaining the data
* **attributes** List: list of attributes

**return**: Dict: returns a dictionary of attributes

In [16]:
def create_attribute_dict(data, attributes) -> Dict:
    a_dict = {}
    for i, a in enumerate(attributes):
        a_dict[a] = {}
        col = extract_column(data, i + 1)
        unique = np.unique(col)
        for u in unique:
            a_dict[a][u] = {"p": 0, "e": 0}
            for d in data:
                if d[i + 1] == u:
                    if d[0] == "p":
                        a_dict[a][u]["p"] += 1
                    if d[0] == "e":
                        a_dict[a][u]["e"] += 1
    return a_dict

<a id="entropy"></a>
### entropy

`entropy` Calculates the entropy of a list of values **Used by**: [pick_best_attribute](#pick_best_attribute)

* **data** List[List]: the list values to calculate entropy from

**return**: Float: the resulting entropy number

In [17]:
def entropy(data: List) -> float:
    p = []
    e = 0
    d = np.unique(data)
    for i, v in enumerate(d):
        c = data.count(v) / len(data)
        p.append(c)
    for i in p:
        e += i * (np.log2(i))
    return abs(e)

In [18]:
def entropy_dict(_a_dict):
    a_dict = deepcopy(_a_dict)
    for key in a_dict:
        for key2 in a_dict[key]:
            t_e = a_dict[key][key2]["e"]
            t_p = a_dict[key][key2]["p"]
            t = t_p + t_e
            if t_e == t or t_p == t:
                a_dict[key][key2]["entropy"] = 0
            else:
                entr = abs((t_p / t) * np.log2(t_p / t) + (t_e / t) * np.log2(t_e / t))
                a_dict[key][key2]["entropy"] = entr
    return a_dict

In [19]:
def information_gain(_a_dict, e_total, data):
    a_dict = deepcopy(_a_dict)
    s = len(data)
    for key in a_dict:
        g_partial = 0
        for key2 in a_dict[key]:
            t_e = a_dict[key][key2]["e"]
            t_p = a_dict[key][key2]["p"]
            t = t_p + t_e
            g_partial += (t / s) * a_dict[key][key2]["entropy"]
        g = e_total - g_partial
        a_dict[key]["gain"] = g
    return a_dict

Creates the tree

In [20]:
def train(training_data, attributes, depth_limit=None) -> Dict:
    model = id3(training_data, attributes, "e")
    return model

Tries to classify data using tree

In [21]:
def classify(tree, attributes, observations, labeled=True) -> List:
    result = []
    attr_index = []
    list_keys = []
    classification = []
    for x in get_keys(tree):
        list_keys.append(x)
    for point in observations:
        for i in list_keys:
            if i in attributes:
                attr_index = attributes.index(i) + 1
                classification.append((i, point[attr_index]))
        value = tree["odor"]
        for a in classification:
            if a[0] in attributes:
                if isinstance(value[a[1]], dict):
                    value = value[a[1]]
                    k_list = list(value)
                    value = value[k_list[0]]
                else:
                    value = value[a[1]]
                    break
        result.append(value)
    return result

In [22]:
def get_keys(dictionary):
    for key, value in dictionary.items():
        yield key
        if isinstance(value, dict):
            yield from get_keys(value)

Classification error rate: errors/number of samples

In [23]:
def evaluate(data_set, classification_result) -> float:
    correct = 0
    actual = extract_column(data_set, 0)
    for i, value in enumerate(actual):
        if classification_result[i] == actual[i]:
            correct += 1
    result = correct / len(actual)
    return result

Cross validation (train, classify, evaluate)

In [24]:
def cross_validate(data, attributes, classify_function, eval_function) -> Any:
    folds = create_folds(data, 10)
    for i, fold in enumerate(folds):
        train_data, test = create_train_test(folds, i)
        model = train(train_data, attributes, None)
        clf = classify_function(model, attributes, test)
        eval_clf = eval_function(test, clf)
        print(f"Fold {i+1} error rate {eval_clf}")
    return eval_clf

In [25]:
model = cross_validate(data, attributes, classify, evaluate)

Fold 1 error rate 0.45264452644526443
Fold 2 error rate 0.5079950799507995
Fold 3 error rate 0.4858548585485855
Fold 4 error rate 0.5043050430504306
Fold 5 error rate 0.5147783251231527
Fold 6 error rate 0.46798029556650245


Fold 7 error rate 0.47167487684729065
Fold 8 error rate 0.4876847290640394
Fold 9 error rate 0.4827586206896552
Fold 10 error rate 0.5012315270935961


In [26]:
b = id3(data, attributes, "e")
pprint(b)

{'odor': {'a': 'e',
          'c': 'p',
          'f': 'p',
          'l': 'e',
          'm': 'p',
          'n': {'spore-print-color': {'b': 'e',
                                      'h': 'e',
                                      'k': 'e',
                                      'n': 'e',
                                      'o': 'e',
                                      'r': 'p',
                                      'w': {'habitat': {'d': {'gill-size': {'b': 'e',
                                                                            'n': 'p'}},
                                                        'g': 'e',
                                                        'l': {'cap-color': {'c': 'e',
                                                                            'n': 'e',
                                                                            'w': 'p',
                                                                            'y': 'p'}},
                          