# PA 6 - Decision Trees

## Step 1 - Decision Tree Classifier

This assignment focuses on using decision trees to classify instances, so our first step is to write a function that generates these trees.

We will need the `math` and `operator` modules later in this code, so we import them here.

In [1]:
import math
import operator

Then, we will need several helper functions that will be used in the `tdidt()` classifier.

The first set consists of functions developed in class. We will pull `groupBy()` and `get_column()` from previous repositories, as well as `partition_instances()` from the starter code for this assignment, located in the `DecisionTreeFun` repository.

In [3]:
# @Gina's Repo
def groupBy(table, column_index, include_only_column_index=None):
    # first identify unique values in the column
    group_names = sorted(list(set(get_column(table, column_index))))

    # now, we need a list of subtables
    # each subtable corresponds to a value in group_names
    # parallel arrays
    groups = [[] for name in group_names]
    for row in table:
        # which group does it belong to?
        group_by_value = row[column_index]
        index = group_names.index(group_by_value)
        if include_only_column_index is None:
            groups[index].append(row.copy()) # note: shallow copy
        else:
            groups[index].append(row[include_only_column_index])

    return group_names, groups

# takes a table and a column index
# returns the values found at that index, one per row (values are not converted)
def get_column(table, column_index):
    column = []
    for item in table:
        column.append(item[column_index])
    return column

def partition_instances(instances, att_index, att_domain):
    # this is a group by att_domain, not by att_values in instances
    partition = {}
    for att_value in att_domain:
        subinstances = []
        for instance in instances:
            # check if this instance has att_value at att_index
            if instance[att_index] == att_value:
                subinstances.append(instance)
        partition[att_value] = subinstances
    return partition
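
The comment in `partition_instances()` notes that it groups by the attribute's domain rather than by the values present in `instances`. A minimal sketch of what that means in practice, using a made-up three-row table and a condensed re-implementation (`partition_by_domain()` is a hypothetical name, not part of the starter code): a domain value that no instance has still gets an entry, mapped to an empty list.

```python
def partition_by_domain(instances, att_index, att_domain):
    # one entry per domain value, even if no instance has that value
    return {
        att_value: [row for row in instances if row[att_index] == att_value]
        for att_value in att_domain
    }

# hypothetical toy data: [level, class]
rows = [["Senior", "False"], ["Mid", "True"], ["Senior", "True"]]

partition = partition_by_domain(rows, 0, ["Senior", "Mid", "Junior"])
for value, subset in partition.items():
    print(value, len(subset))
```

Grouping by observed values, as `groupBy()` does, would produce no "Junior" group at all; the empty list is what lets a tree builder notice a branch with no training instances.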

We will also need to develop several helper functions of our own to use. These are defined below:

* `check_all_same_att()`
    * **Parameters**:
        * `instances` - A list of the current partitioned instances to check
        * `index` - The index to query
    * **Returns**:
        * `True` if all instances in `instances` have the same value at `index`; `False` otherwise.
* `check_all_same_class()`
    * **Parameters**:
        * `instances` - A list of the current partitioned instances to check
        * `class_index` - The index of the classifying variable
    * **Returns**:
        * `True` if all instances in `instances` have the same class; `False` otherwise.
* `select_attribute()`
    * **Parameters**:
        * `instances` - A list of the current partitioned instances
        * `att_indexes` - A list of valid indices to split on
        * `class_index` - The index of the classifying attribute
    * **Returns**:
        * The index of the attribute to split on next: the one whose partitions have the lowest weighted entropy (and therefore the highest information gain), using the Entropy and Information Gain calculations discussed in class
* `handle_clash()`
    * **Parameters**:
        * `instances` - A list of the current partitioned instances
        * `class_index` - The index of the classifying attribute
    * **Returns**:
        * A Leaf Node holding the most frequent class value in `instances`, chosen by majority voting to resolve clashes in the decision tree
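
The weighted entropy that `select_attribute()` compares can be sketched on its own. This is a minimal illustration on a made-up four-row table, not the implementation used below; the names `entropy()` and `weighted_entropy()` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # -sum(p * log2(p)) over the class proportions in labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def weighted_entropy(rows, att_index, class_index):
    # group class labels by attribute value, then weight each group's
    # entropy by the fraction of rows that fall into it
    groups = defaultdict(list)
    for row in rows:
        groups[row[att_index]].append(row[class_index])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

# hypothetical toy data: two attributes and a class label
rows = [
    ["a", "x", "yes"],
    ["a", "y", "yes"],
    ["b", "x", "no"],
    ["b", "y", "no"],
]
print(weighted_entropy(rows, 0, 2))  # attribute 0 separates the classes perfectly
print(weighted_entropy(rows, 1, 2))  # attribute 1 says nothing about the class
```

Attribute 0 has the lower weighted entropy here, so it would be the one chosen to split on.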

In [4]:
def check_all_same_att(instances, index):
    base = instances[0][index]
    for elem in instances:
        if elem[index] != base:
            return False
    return True

def check_all_same_class(instances, class_index):
    base = instances[0][class_index]
    for elem in instances:
        if elem[class_index] != base:
            return False
    return True

def select_attribute(instances, att_indexes, class_index):
    # maps each candidate attribute index to its weighted entropy (E_new)
    entropies = {}
    for index in att_indexes:
        E_new = 0
        _, partitions = groupBy(instances, index)
        for partition in partitions:
            # count how often each class label occurs in this partition
            counts = {}
            for instance in partition:
                label = instance[class_index]
                counts[label] = counts.get(label, 0) + 1
            total = len(partition)
            # entropy of this partition: -sum(p * log2(p))
            E = 0
            for label in counts:
                p = counts[label] / total
                E -= p * math.log(p, 2)
            # weight by the fraction of instances that fall in this partition
            E_new += (total / len(instances)) * E
        entropies[index] = E_new

    # the lowest weighted entropy corresponds to the highest information gain;
    # ties go to the earliest index in att_indexes
    min_i = att_indexes[0]
    for index in att_indexes:
        if entropies[index] < entropies[min_i]:
            min_i = index
    return min_i

def handle_clash(instances, class_index):
    votes = {}
    for instance in instances:
        if instance[class_index] not in votes:
            votes[instance[class_index]] = 1
        else:
            votes[instance[class_index]] += 1
    # Referenced from https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
    sorted_x = sorted(votes.items(), reverse=True, key=operator.itemgetter(1))
    return ["Leaf", sorted_x[0][0]]
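
As an aside, the majority vote in `handle_clash()` could also be written with `collections.Counter` from the standard library. The sketch below is an alternative under that assumption, with `majority_leaf()` as a hypothetical name; it is not the version used in the rest of this notebook.

```python
from collections import Counter

def majority_leaf(instances, class_index):
    # tally the class labels and keep the most common one
    votes = Counter(instance[class_index] for instance in instances)
    label, _ = votes.most_common(1)[0]
    return ["Leaf", label]

# hypothetical toy data: [id, class]
rows = [["a", "True"], ["b", "False"], ["c", "True"]]
print(majority_leaf(rows, 1))
```

On a tie, both this sketch and the sorted-dictionary approach should return the label that was encountered first, since `most_common()` and a stable sort both preserve first-seen order among equal counts.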


Now that we have these helper functions, we can construct our `tdidt()` decision tree generator, as well as the `classify_tdidt()` classifier.

* `tdidt()`
    * **Parameters**:
        * `instances` - The currently partitioned instances. On the first call, this is the entire dataset.
        * `att_indexes` - A list of all valid indices to split on. On the first call, this contains every attribute index.
        * `att_domains` - A list, indexed by attribute index, of the valid values (the domain) of each attribute in the dataset.
        * `class_index` - The index of the classifying attribute.
    * **Returns**:
        * A Decision Tree, represented using nested lists
* `classify_tdidt()`
    * **Parameters**:
        * `tree` - A decision tree, generated by `tdidt()`
        * `instance` - The unseen instance to classify
    * **Returns**:
        * A predicted classification using the decision tree
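
To make the nested-list representation concrete, here is a small hand-built tree in the same `["Attribute", ...]` / `["Value", ...]` / `["Leaf", ...]` shape, with an iterative walk over it. The tree is written by hand for illustration (it is not output from `tdidt()`), and `walk()` is a hypothetical helper that returns `None` when no branch matches.

```python
# hand-built tree in the nested-list format (hypothetical, not tdidt() output)
tree = ["Attribute", 0,
    ["Value", "Mid", ["Leaf", "True"]],
    ["Value", "Junior",
        ["Attribute", 3,
            ["Value", "no", ["Leaf", "True"]],
            ["Value", "yes", ["Leaf", "False"]],
        ],
    ],
]

def walk(tree, instance):
    # follow matching "Value" branches until a leaf is reached
    while tree[0] != "Leaf":
        att_index = tree[1]
        for branch in tree[2:]:
            if branch[1] == instance[att_index]:
                tree = branch[2]
                break
        else:
            # no branch matches this attribute value
            return None
    return tree[1]

print(walk(tree, ["Junior", "Java", "no", "yes"]))
print(walk(tree, ["Senior", "Java", "no", "no"]))
```

The first instance follows the "Junior" branch and then the "yes" branch of attribute 3; the second has no matching branch at the root, so the walk gives up.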

In [6]:
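def tdidt(instances, att_indexes, att_domains, class_index):
    # base case: every instance has the same class, so make a leaf
    if check_all_same_class(instances, class_index):
        return ["Leaf", instances[0][class_index]]
    # base case: no attributes left to split on, so resolve by majority vote
    if att_indexes == []:
        return handle_clash(instances, class_index)
    index = select_attribute(instances, att_indexes, class_index)
    # copy before removing so sibling branches keep their own attribute list
    new_indexes = att_indexes[:]
    new_indexes.remove(index)
    if check_all_same_att(instances, index):
        # splitting on this attribute would not separate anything; skip it
        return tdidt(instances, new_indexes, att_domains, class_index)
    else:
        tree = ["Attribute", index]
        partitions = partition_instances(instances, index, att_domains[index])
        for val in partitions:
            if (partitions[val] == []):
                # a domain value has no instances: discard this attribute node
                # and replace it with a majority-vote leaf
                return handle_clash(instances, class_index)
            tree.append(["Value", val, tdidt(partitions[val], new_indexes, att_domains, class_index)])
        return tree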

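def classify_tdidt(tree, instance):
    if tree[0] == 'Leaf':
        return tree[1]
    else:
        # tree is ["Attribute", index, ["Value", val, subtree], ...],
        # so the branches start at position 2
        i = 2
        # note: raises IndexError if no branch matches the instance's value
        while (instance[tree[1]] != tree[i][1]):
            i += 1
        return classify_tdidt(tree[i][2], instance)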


To test our classifier, we will make use of the "interview" dataset provided in class.

In [7]:
table = [
        ["Senior", "Java", "no", "no", "False"],
        ["Senior", "Java", "no", "yes", "False"],
        ["Mid", "Python", "no", "no", "True"],
        ["Junior", "Python", "no", "no", "True"],
        ["Junior", "R", "yes", "no", "True"],
        ["Junior", "R", "yes", "yes", "False"],
        ["Mid", "R", "yes", "yes", "True"],
        ["Senior", "Python", "no", "no", "False"],
        ["Senior", "R", "yes", "no", "True"],
        ["Junior", "Python", "yes", "no", "True"],
        ["Senior", "Python", "yes", "yes", "True"],
        ["Mid", "Python", "no", "yes", "True"],
        ["Mid", "Java", "yes", "no", "True"],
        ["Junior", "Python", "no", "yes", "False"]
    ]

We will first generate a decision tree using `tdidt()` and then classify two instances: `("Senior", "Java", "no", "yes")`, which exists in the dataset and should be classified as `"False"`; and `("Junior", "Java", "no", "no")`, which is not in the dataset but, based on our in-class tree, should be classified as `"True"`.

In [9]:
tree = tdidt(table, [0, 1, 2, 3], [["Senior", "Mid", "Junior"], ["Java", "Python", "R"], ["no", "yes"], ["no", "yes"]], 4)

print(classify_tdidt(tree, ["Senior", "Java", "no", "yes"]))
print(classify_tdidt(tree, ["Junior", "Java", "no", "no"]))

False
True
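
The attribute domains passed to `tdidt()` above are written out by hand. One possible way to derive them from the table instead is sketched below; `build_att_domains()` is a hypothetical helper, and it assumes the class label is the last column so that list positions line up with attribute indices.

```python
def build_att_domains(table, class_index):
    # collect the distinct values seen in each non-class column
    # (assumes class_index is the last column)
    domains = []
    for col in range(len(table[0])):
        if col == class_index:
            continue
        domains.append(sorted({row[col] for row in table}))
    return domains

# hypothetical toy data: [level, lang, class]
rows = [
    ["Senior", "Java", "False"],
    ["Mid", "Python", "True"],
    ["Junior", "Python", "True"],
]
print(build_att_domains(rows, 2))
```

Two caveats: sorting changes the order in which branches are added to the tree, and a value that never appears in the table will be missing from the derived domain.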


## Step 2 - 