# Chapter 17 - Decision Trees

In [5]:
import math
import random
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

Decision trees can easily handle a mix of numeric and categorical attributes and can even classify data for which attributes are missing. They are easy to understand and interpret, and the process of reaching a prediction is transparent.

However, finding the optimal decision tree for a training set is a computationally hard problem. It is also easy to build a decision tree that is overfitted to the training data and therefore generalizes poorly to unseen data.

Decision trees can be divided into classification trees and regression trees.

Entropy measures the uncertainty associated with the data. Imagine a set $S$ of data, each member of which is labeled as belonging to one of a finite number of classes $C_1, ..., C_n$. If all the data belongs to a single class, there is no uncertainty and the entropy is low. If the data is spread evenly across the classes, the entropy is high. In math terms, let $p_i$ be the proportion of data labeled as class $C_i$. Then the entropy is: <br>

<center> $H(S) = -p_1 \log_2 p_1 - \ldots - p_n \log_2 p_n$, with the convention $0 \log_2 0 = 0$. </center>
Each term $-p_i \log_2 p_i$ is non-negative and is close to zero precisely when $p_i$ is close to either 0 or 1. So the entropy is small when every $p_i$ is close to 0 or 1 (most of the data is in a single class), and large when many of the $p_i$ are not close to 0 (the data is spread across several classes).
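
To make the behaviour of a single term concrete, here is a minimal sketch that tabulates $-p \log_2 p$ for a few values of $p$. The helper name `entropy_term` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import math

def entropy_term(p):
    """One term -p * log2(p) of the entropy sum, with 0 * log2(0) taken as 0."""
    # entropy_term is a hypothetical helper introduced only for this illustration
    return -p * math.log2(p) if p > 0 else 0.0

# Print the term for probabilities ranging from 0 to 1
for p in [0.0, 0.01, 0.1, 0.5, 0.9, 0.99, 1.0]:
    print(f"p = {p:4.2f}   -p * log2(p) = {entropy_term(p):.4f}")
```

The term vanishes at both ends of the range and is largest in between, with its maximum at $p = 1/e \approx 0.37$.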

In [4]:
def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p,2) for p in class_probabilities if p)
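
As a quick sanity check, the sketch below evaluates the function on a few hand-picked distributions. The function is repeated so the snippet runs on its own. The example distributions are assumptions chosen for illustration.

```python
import math

def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    # the `if p` filter skips zero probabilities, implementing 0 log 0 = 0
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

# Illustrative distributions (not taken from any real data set)
examples = {
    "single class": [1.0],
    "two equal classes": [0.5, 0.5],
    "four equal classes": [0.25, 0.25, 0.25, 0.25],
    "skewed two classes": [0.9, 0.1],
}

for name, probabilities in examples.items():
    print(f"{name:20s} H = {entropy(probabilities):.3f}")
```

A single class gives an entropy of 0, two equally likely classes give 1 bit, four equally likely classes give 2 bits, and the skewed 90/10 split gives roughly 0.47.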

In [6]:
def class_probabilites(labels):
    """given a list of labels, return the proportion of each distinct label"""
    total_count = len(labels)
    return [count / total_count for count in Counter(labels).values()]

In [7]:
def data_entropy(labeled_data):
    """given a list of (input, label) pairs, compute the entropy of the labels"""
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilites(labels)
    return entropy(probabilities)
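
The three helpers can be exercised end to end on a small made-up data set. This is a minimal sketch: the functions are repeated (keeping the notebook's spelling of `class_probabilites`) so the snippet runs on its own, and the `(attributes, label)` pairs are hypothetical examples, not data from the chapter.

```python
import math
from collections import Counter

def entropy(class_probabilities):
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

def class_probabilites(labels):  # spelling follows the notebook
    total_count = len(labels)
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilites(labels)
    return entropy(probabilities)

# Hypothetical labeled data: each item is (attributes, label)
pure = [({"level": "Senior"}, True),
        ({"level": "Mid"}, True),
        ({"level": "Junior"}, True)]
mixed = [({"level": "Senior"}, False),
         ({"level": "Senior"}, True),
         ({"level": "Junior"}, True),
         ({"level": "Junior"}, False)]
skewed = mixed[1:] + [({"level": "Mid"}, True)]  # three True, one False

print("pure  :", data_entropy(pure))
print("mixed :", data_entropy(mixed))
print("skewed:", round(data_entropy(skewed), 3))
```

Only the labels matter here: the attributes are ignored, so a set whose labels all agree has zero entropy, while an even split between two labels has an entropy of 1.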