# 4.3: Decision Tree Implementation

## Import Statements and Starter Code

In [1]:
import importlib
import numpy as np

In [2]:
header = ["level", "lang", "tweets", "phd"]
attribute_domains = {"level": ["Senior", "Mid", "Junior"], 
        "lang": ["R", "Python", "Java"],
        "tweets": ["yes", "no"], 
        "phd": ["yes", "no"]}
X_train = [
        ["Senior", "Java", "no", "no"],
        ["Senior", "Java", "no", "yes"],
        ["Mid", "Python", "no", "no"],
        ["Junior", "Python", "no", "no"],
        ["Junior", "R", "yes", "no"],
        ["Junior", "R", "yes", "yes"],
        ["Mid", "R", "yes", "yes"],
        ["Senior", "Python", "no", "no"],
        ["Senior", "R", "yes", "no"],
        ["Junior", "Python", "yes", "no"],
        ["Senior", "Python", "yes", "yes"],
        ["Mid", "Python", "no", "yes"],
        ["Mid", "Java", "yes", "no"],
        ["Junior", "Python", "no", "yes"]
    ]

y_train = ["False", "False", "True", "True", "True", "False", "True", "False", "True", "True", "True", "True", "True", "False"]

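## How Do We Represent Trees in Python?
1. Nested data structure (e.g. lists)
2. OOP (e.g. `MyTree` class)

For our purposes, we will use the nested list representation. Each node is a list:
* At elem 0, we have the "data type" (`"Attribute"`, `"Value"`, or `"Leaf"`)
* At elem 1, we have the "data" (attribute name, attribute value, or class label)
* From elem 2 on, it depends on the data type:
    * an Attribute node holds one Value subtree per attribute value
    * a Value node holds the subtree that the value leads to
    * a Leaf node holds two counts (in the example below, these appear to be the number of instances at the leaf and the number of instances in the partition that was split)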

##### Example: Tree Solution for the Interview Dataset

This is based on the decision tree that we made in Lab Task #1 of 4.2!!!

In [3]:
interview_tree_solution = ["Attribute", "level", 
                            ["Value", "Senior", 
                                ["Attribute", "tweets", 
                                    ["Value", "yes", 
                                        ["Leaf", "True", 2, 5]
                                    ],
                                    ["Value", "no", 
                                        ["Leaf", "False", 3, 5]
                                    ]
                                ]
                            ],
                            ["Value", "Mid", 
                                ["Leaf", "True", 4, 14]
                            ],
                            ["Value", "Junior", 
                                ["Attribute", "phd", 
                                    ["Value", "yes", 
                                        ["Leaf", "False", 2, 5]
                                    ],
                                    ["Value", "no", 
                                        ["Leaf", "True", 3, 5]
                                    ]
                                ]
                            ]
                        ]
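
To make the nested list layout concrete, here is a minimal sketch of walking such a tree to classify one instance. The names `tdidt_predict`, `example_tree`, and `example_header` are hypothetical (they are not part of the starter code), and the small tree is only the "Junior" branch of the solution above.

```python
# Hypothetical two-level tree in the same nested list format as above
example_header = ["level", "lang", "tweets", "phd"]
example_tree = ["Attribute", "phd",
                   ["Value", "yes", ["Leaf", "False", 2, 5]],
                   ["Value", "no", ["Leaf", "True", 3, 5]]]

def tdidt_predict(tree, instance, header):
    """Recursively follow the Value subtree that matches the instance."""
    if tree[0] == "Leaf":
        return tree[1]  # elem 1 of a Leaf is the class label
    # otherwise this is an Attribute node; elem 1 is the attribute name
    att_index = header.index(tree[1])
    for value_subtree in tree[2:]:  # elems 2+ are the Value subtrees
        if value_subtree[1] == instance[att_index]:
            # elem 2 of a Value node is the subtree it leads to
            return tdidt_predict(value_subtree[2], instance, header)
    return None  # assumption: an attribute value not in the tree gives no prediction

print(tdidt_predict(example_tree, ["Junior", "Java", "yes", "no"], example_header))
```

Returning `None` for an unseen value is just one possible choice; a majority vote fallback would be another.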

In [4]:
# HELPER FUNCTIONS

def select_attribute(instances, attributes):
    # TODO: USE ENTROPY TO COMPUTE/CHOOSE ATTRIBUTE WITH THE SMALLEST E_new
    # for now... just choose randomly
    return np.random.choice(attributes) # my code

    # OR...
    # rand_index = np.random.randint(0, len(attributes))
    # return attributes[rand_index]

def partition_instances(instances, split_attribute):
    # let's use a dictionary
    partitions = {} # key (string) : value (subtable)
    att_index = header.index(split_attribute) # e.g. 0 for level
    # NOTE: this lookup relies on the index-keyed attribute_domains defined in the next cell
    att_domain = attribute_domains[att_index] # e.g. ['Junior', 'Mid', 'Senior']
    for att_value in att_domain:
        partitions[att_value] = []
        for instance in instances:
            if instance[att_index] == att_value:
                partitions[att_value].append(instance)
    return partitions

In [5]:
attribute_domains = {
                        0: ["Junior", "Mid", "Senior"],
                        1: ["Java", "Python", "R"],
                        2: ["no", "yes"],
                        3: ["no", "yes"]
                    }
                    # NOTE: can also use strings here instead of integers

def tdidt(current_instances, available_attributes):
    # basic approach (uses recursion!!):
    print("available attributes:", available_attributes)
    
    # select an attribute to split on
    attribute = select_attribute(current_instances, available_attributes)
    print("splitting on attribute", attribute)
    available_attributes.remove(attribute)
    tree = ["Attribute", attribute]
    # group data by attribute domains (creates pairwise disjoint partitions)
    partitions = partition_instances(current_instances, attribute)
    print('partitions:', partitions)
    # for each partition, repeat unless one of the following occurs (base case)
    for att_value, att_partition in partitions.items():
        print('current attribute value:', att_value, len(att_partition))
        value_subtree = ["Value", att_value]
        # CASE 1: all class labels of the partition are the same => make a leaf node
        # NOTE: all_same_class() is a helper you still need to write (see the PA7 tips)
        if len(att_partition) > 0 and all_same_class(att_partition):
            print("CASE 1: all same class")
            # TODO: make a leaf node
        # CASE 2: no more attributes to select (clash) => handle clash w/majority vote leaf node
        elif len(att_partition) > 0 and len(available_attributes) == 0:
            print("CASE 2: no more attributes")
            # TODO: make a majority vote leaf node
        # CASE 3: no more instances to partition (empty partition) => backtrack and replace attribute node with majority vote leaf node
        elif len(att_partition) == 0:
            print('CASE 3: empty partition')
            # TODO: backtrack to replace the attribute node
            # with a majority vote leaf node
            # tree = ["Attribute", attribute] <- replace this with a majority vote leaf node
        else: # previous conditions all false so THIS IS THE RECURSIVE STEP
            # pass a copy so each branch gets its own list of remaining attributes
            subtree = tdidt(att_partition, available_attributes.copy())
            # TODO: append subtree to value_subtree, and value_subtree to tree

    return None # TODO: return tree once the cases above build it

## Some Notes on `fit()`

* Only takes `X_train` and `y_train`; it does NOT take a header
    * ... as such, we need to generate the header ourselves (e.g. `["att0", "att1", ...]`)
* Additionally, you need to extract the attribute domains from `X_train`
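
As a rough sketch of the two notes above, the header and the domains could be derived from `X_train` alone. The function name `extract_header_and_domains` is hypothetical, and keying the domains by attribute name (rather than by index) is an assumption, not a requirement of the assignment.

```python
def extract_header_and_domains(X_train):
    """Build a generic header and the set of values seen for each attribute."""
    n_attributes = len(X_train[0])
    # generic attribute names: "att0", "att1", ...
    header = [f"att{i}" for i in range(n_attributes)]
    attribute_domains = {}
    for i, att_name in enumerate(header):
        # sorted so the domain order is the same on every run
        attribute_domains[att_name] = sorted({row[i] for row in X_train})
    return header, attribute_domains

small_X = [["Senior", "no"], ["Mid", "yes"], ["Senior", "yes"]]
small_header, small_domains = extract_header_and_domains(small_X)
print(small_header)
print(small_domains)
```

Sorting the domain values keeps the resulting tree deterministic, which makes unit tests easier to write.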

In [7]:
def fit_starter_code():
    # would advise that X_train and y_train get stitched together
    train = [X_train[i] + [y_train[i]] for i in range(len(X_train))]
    # next make a copy of your header since tdidt will modify the list
    available_attributes = header.copy()
    tree = tdidt(train, available_attributes)
    print("tree", tree)

fit_starter_code()

available attributes: ['level', 'lang', 'tweets', 'phd']
splitting on attribute level
tree None


## General $E_{new}$ Algorithm

* For each available attribute:
    * For each attribute value in the domain:
        * Compute the entropy of that value partition (e.g. proportion and log for each class)
    * Compute $E_{new}$ by taking the weighted sum of the partition entropies (each weight is the fraction of instances that fall in that partition)
* Choose to split on the attribute with the smallest $E_{new}$
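
The steps above can be sketched in Python as follows. This is one possible layout, not the required PA7 solution: the function names are hypothetical, the class label is assumed to be the last element of each instance (as in the stitched `train` table), and the domains are assumed to be keyed by attribute name.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def compute_e_new(instances, att_index, att_domain):
    """Weighted sum of the partition entropies for one attribute."""
    total = len(instances)
    e_new = 0.0
    for att_value in att_domain:
        # class labels (last column) of the instances in this value partition
        partition_labels = [row[-1] for row in instances if row[att_index] == att_value]
        if partition_labels:  # an empty partition contributes nothing
            e_new += (len(partition_labels) / total) * entropy(partition_labels)
    return e_new

def select_attribute_by_entropy(instances, attributes, header, attribute_domains):
    """Pick the available attribute with the smallest E_new."""
    return min(attributes,
               key=lambda att: compute_e_new(instances, header.index(att),
                                             attribute_domains[att]))

# Toy data: "tweets" separates the classes perfectly, so its E_new is 0
toy_header = ["tweets", "phd"]
toy_domains = {"tweets": ["yes", "no"], "phd": ["yes", "no"]}
toy_train = [["yes", "no", "True"], ["yes", "yes", "True"],
             ["no", "no", "False"], ["no", "yes", "False"]]
print(select_attribute_by_entropy(toy_train, toy_header, toy_header, toy_domains))
```

On ties, `min` keeps the first attribute in the list, which is an arbitrary but deterministic tie-break.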


## Some Tips on PA7

1. `all_same_class()`
2. append subtree to `value_subtree` and to tree appropriately
3. work on CASE 1, then CASE 2, then CASE 3 (write helper functions!)
4. finish the TODOs in `fit_starter_code()`
5. replace random w/ entropy (compare tree w/ `interview_tree_solution`)
6. Implement a unit test for `fit()` and move the starter code over to OOP
7. move on to `predict()`
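
For tips 1 and 3, here is a minimal sketch of two helper functions. It assumes the class label is the last element of each instance and that a leaf has the form `["Leaf", label, count, total]` seen in `interview_tree_solution`. The name `majority_vote_leaf` and the alphabetical tie-break are assumptions, not part of the starter code.

```python
from collections import Counter

def all_same_class(instances):
    """True if every instance carries the same class label (last column)."""
    first_label = instances[0][-1]
    return all(instance[-1] == first_label for instance in instances)

def majority_vote_leaf(instances, total):
    """Build a leaf labeled with the most frequent class in the partition."""
    label_counts = Counter(instance[-1] for instance in instances)
    # assumption: break ties alphabetically so results are reproducible
    label, _ = min(label_counts.items(), key=lambda item: (-item[1], item[0]))
    return ["Leaf", label, len(instances), total]

clash = [["Senior", "True"], ["Senior", "False"], ["Senior", "True"]]
print(all_same_class(clash))
print(majority_vote_leaf(clash, 5))
```

`all_same_class()` assumes a non-empty partition, which matches how `tdidt()` guards the call with `len(att_partition) > 0`.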

## Some Final Tree Topics

### Tree Visualization

* We will specify a tree using the DOT language
* Then we will create a PDF using the `dot` program
* This will be BONUS for PA7 (but still highly recommended)
    * if it weren't bonus, it would go in between tips 3 and 4 above
* EXAMPLE: the following are the contents of `interview_tree.dot`:

```
graph g {
    level [shape=box];
    phd[shape=box];
    level -- phd [label="Junior"]
    
    true1 [label="True"]
    false1 [label="False"]
    phd -- true1 [label="no"]
    phd -- false1 [label="yes"]

    true2 [label="True"]
    level -- true2 [label="Mid"]

    tweets [shape=box]
    level -- tweets [label="Senior"]
    true3 [label="True"]
    false3 [label="False"]
    tweets -- false3 [label="no"]
    tweets -- true3 [label="yes"]
}
```

Then, run the following command in your Docker container:
```
dot -Tpdf interview_tree.dot -o interview_tree.pdf
```
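
Rather than writing the `.dot` file by hand, the nested list tree can be converted to DOT text with a short recursive function. The sketch below is one way to do it: `tree_to_dot` and the generated node ids (`n0`, `n1`, ...) are hypothetical, and it emits only the same basic DOT constructs used in the hand-written file above.

```python
def tree_to_dot(tree):
    """Return DOT source for a nested list decision tree."""
    lines = ["graph g {"]
    next_id = [0]  # mutable counter shared by the nested function

    def add_node(node):
        node_id = f"n{next_id[0]}"  # unique id, since labels like "True" repeat
        next_id[0] += 1
        if node[0] == "Leaf":
            lines.append(f'    {node_id} [label="{node[1]}"];')
        else:  # Attribute node: draw a box, then one edge per Value subtree
            lines.append(f'    {node_id} [label="{node[1]}", shape=box];')
            for value_subtree in node[2:]:
                child_id = add_node(value_subtree[2])
                lines.append(f'    {node_id} -- {child_id} [label="{value_subtree[1]}"];')
        return node_id

    add_node(tree)
    lines.append("}")
    return "\n".join(lines)

phd_tree = ["Attribute", "phd",
               ["Value", "yes", ["Leaf", "False", 2, 5]],
               ["Value", "no", ["Leaf", "True", 3, 5]]]
print(tree_to_dot(phd_tree))
```

The returned string can be written to a `.dot` file and rendered with the same `dot -Tpdf` command.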


### Pruning

* Decision trees are notorious for overfitting to a training set
    * $\therefore$ trees don't generalize well to unseen instances
* We can combat this issue by doing **post-pruning** with a *pruning set*
* There is no pruning programming part on PA7