# Decision Trees

Let's implement a decision tree.

We'll need two kinds of nodes:

1.  **Internal nodes** to represent decisions
2.  **Leaf nodes** to represent classes

In [619]:
class Node(object):
    def __init__(self, children):
        self.children = children

class Internal(Node):
    def __init__(self, attrfn, children):
        self.predicate = attrfn
        super().__init__(children)

class Leaf(Node):
    def __init__(self, label):
        print("Building leaf with class {0}".format(label))
        self.class_label = label
        super().__init__(None)

The basic algorithm is simple:

<code>
 build_tree(samples)
  if (y = 0 for all (x, y) in samples) return new leaf(0)
  else if (y = 1 for all (x,y) in samples) return new leaf(1)
  else
    choose best attribute x<sub>j</sub>
    s0 = all (x, y) in samples with x<sub>j</sub> = 0
    s1 = all (x, y) in samples with x<sub>j</sub> = 1
    return new node(x<sub>j</sub>, build_tree(s0), build_tree(s1))
</code>

In [624]:
from collections import Counter

def build_tree(samples, split, attrfns, label):
    """Build a decision tree
       Parameters:
       samples   -- list of samples, where each sample is a list of attributes
       split     -- function that takes a list of samples, a list of attribute functions, and 
                    a label function. The function splits the data based on the best attribute,
                    and returns a tuple:
                    1) dict of data split, where key is the attribute value, and the value
                       is a list of examples that have that attribute value
                    2) the attribute function used for this split
                    
                    Example:
                    
                    samples = [['Monday', 'Yellow', 'X'], ['Monday', 'Red', 'O']]
                    def day(s): return s[0]
                    def color(s): return s[1]
                    def label(s): return s[2]
                    
                    
                    split(samples, [day, color], label)  (may return) =>
                    
                    {'Yellow': [['Monday', 'Yellow', 'X']],
                     'Red':    [['Monday', 'Red', 'O']]},
                    color
                      
       attrfns   -- list of attribute functions. Each of these functions can be applied to
                    a sample to get a specific attribute.
                    e.g., [day_attr, color_attr],  day_attr(sample) => "Monday"
                    or color_attr(sample) => "Red"
       label     -- function that takes a single sample and returns the label for that sample
                    e.g., label(sample) => 'X'
    """
    if all(label(samples[0]) == label(sample) for sample in samples):
        print("Building leaf with {0} same-class samples".format(len(samples)))
        return Leaf(label(samples[0]))

    if not attrfns:
        print("Building leaf with {0} samples".format(len(samples)))
        return Leaf(Counter(label(sample) for sample in samples).most_common()[0][0])
    
    splits, attrfn = split(samples, attrfns, label)
    
    remaining_attrfns = [fn for fn in attrfns if fn != attrfn]

    child_nodes = []
    for group in splits.values():
        child_nodes.append(build_tree(group, split, remaining_attrfns, label))
    return Internal(attrfn, child_nodes)

We have not defined `split` yet; it depends on the data and on which criterion we use to choose an attribute.

That brings us to entropy and information gain.

Let's take a quick detour and implement both. We'll use information gain to pick the attribute to split on, since it measures how well an attribute separates the classes.

### Entropy and Information Gain

In [454]:
import math

def entropy(examples):
    '''
    Computes entropy of samples
    
    The min entropy is 0.0, the max entropy is log2(n), where n is number of classes.
    
    Parameters
    examples -- list with the number of examples in each class, e.g., [5, 10]
    '''
    total = sum(examples)
    result = 0.0
    
    # Skip classes with 0 examples when computing -p * log2(p)
    # (i.e., we define 0 * log(0) == 0)
    for n in filter(None, examples):
        result -= (n/total) * math.log(n/total, 2)
    return result

We use entropy to measure the "purity" of a node. A node is 100% pure (entropy == 0.0) if it contains only examples of a single class.

When a node is maximally impure (its examples are spread evenly across all n classes), the entropy is log2(n).
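
For reference, the quantity computed by `entropy` can be written as below. The symbols are notation assumed here rather than taken from the code: $S$ is the set of examples in a node, $n$ is the number of classes, and $p_i$ is the fraction of examples in class $i$.

$$
H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad 0 \le H(S) \le \log_2 n
$$

By convention $0 \log_2 0 = 0$. For the counts `[5, 10]`, for instance, $p = (1/3, 2/3)$ and $H = -\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3} \approx 0.918$.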

In [455]:
0 == entropy([10]) # 100% pure, all same class

True

In [534]:
1 == entropy([5, 5]) # Max impurity, two classes, data is split across classes equally

True

In [457]:
entropy([5, 10])  # impure

0.9182958340544896

In [458]:
entropy([9, 1])  # 9 examples of 1 class, 1 example of another

0.4689955935892812

In [459]:
entropy([50, 1]) # purer split

0.1392329990550989

In [460]:
entropy([99, 1])

0.08079313589591118

In [461]:
entropy([10000, 1])

0.0014729006652121114

In [462]:
entropy([10**6, 1]) # almost pure

2.137424295738942e-05

In [535]:
print(math.log2(6))
math.log2(6) == entropy([1,1,1,1,1,1])  # Max impurity, 6 classes


2.584962500721156


True

#### Gain
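
Information gain is the drop in entropy obtained by splitting on an attribute. Written out, with notation assumed here for illustration ($S$ is the set of examples, $A$ the attribute, $S_v$ the subset of $S$ where $A$ has value $v$, and $H$ the entropy), the function in the next cell computes:

$$
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
$$

A split whose subsets are all pure has $H(S_v) = 0$ for every $v$, so its gain equals the full entropy $H(S)$ of the parent node.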

In [463]:
def gain(examples, classfn, classes, attrfn, attrvals):
    '''
    Calculates information gain after splitting on an attribute
    
    Parameters
    examples - list of examples. Each example has attributes.
    classfn  - function that returns the class of an example.
    classes  - list of all classes
    attrfn   - function that returns the value of a specific attribute given an example.
       e.g., attrfn(example) -> attribute value
    attrvals - list of all possible values for the attribute used for this split
       e.g., [1, 2, 3], or ['red', 'yellow', 'green']
    
    '''

    en = entropy(counts_per_class(examples, classfn, classes))    
    total = len(examples)

    for val in attrvals:
        # Get all examples whose value for the attribute is val
        sv = list(filter(lambda example: attrfn(example) == val, examples))
        en -= len(sv)/total * entropy(counts_per_class(sv, classfn, classes))
    return en

In [464]:
moon_day_examples = [
    ['clear', 'cold', 'winter', 'moon'],
    ['rainy', 'warm', 'spring', 'no-moon'],
    ['cloudy', 'cold', 'winter', 'no-moon'],
    ['clear', 'warm', 'summer', 'moon'],
    ['rainy', 'cold', 'fall', 'no-moon'],
    ['rainy', 'cold', 'spring', 'no-moon'],
    ['clear', 'cold', 'spring', 'moon'],
]

In [465]:
def moon(example):
    return example[3]

In [466]:
def counts_per_class(examples, classfn, classes):
    '''
     Returns the number of examples in each class, in the order given by classes.
       e.g., counts_per_class(examples, classfn, [0, 1, 2]) -> [3, 4, 5]
       (3 are of class 0, 4 of class 1, 5 of class 2)
    '''
    dist = dict()
    for cls in classes:
        dist[cls] = 0
    
    for example in examples:
        cls = classfn(example)
        dist[cls] += 1
    
    flat_dist = []
    for cls in classes:
        flat_dist.append(dist[cls])
    return flat_dist

In [467]:
counts_per_class(moon_day_examples, moon, ['moon', 'no-moon'])  # 3 'moon' and 4 'no-moon'

[3, 4]

In [468]:
def sky_condition(example): return example[0]
def temp(example): return example[1]
def season(example): return example[2]

In [469]:
def attrvalues(examples, attrfn):
    return set(attrfn(example) for example in examples)

In [470]:
classes = ['moon', 'no-moon']

In [471]:
sky = attrvalues(moon_day_examples, sky_condition)
sky

{'clear', 'cloudy', 'rainy'}

What's the gain if we split by **sky condition**?

In [472]:
gain(moon_day_examples, moon, classes, sky_condition, sky)

0.9852281360342516

What's the gain if we split by **temperature**?

In [473]:
temp_values = attrvalues(moon_day_examples, temp)
gain(moon_day_examples, moon, classes, temp, temp_values)

0.005977711423774124

What's the gain if we split by **season**?

In [474]:
season_values = attrvalues(moon_day_examples, season)
gain(moon_day_examples, moon, classes, season, season_values)

0.3059584928680419

As expected, sky condition is the best attribute to split on: it has the highest gain, so it tells us the most about which days we'll see the moon.
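
To see where these three numbers come from, the gains can be recomputed from class counts alone. The sketch below is self-contained; `entropy_from_counts` and `gain_from_counts` are hypothetical helper names introduced here, and the `[moon, no-moon]` counts per attribute value were tallied by hand from `moon_day_examples`.

```python
import math

def entropy_from_counts(counts):
    """Entropy (in bits) of a node, given the number of examples per class."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n)

def gain_from_counts(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy_from_counts(c)
                    for c in children_counts)
    return entropy_from_counts(parent_counts) - remainder

parent = [3, 4]  # 3 'moon', 4 'no-moon'

# [moon, no-moon] counts for each attribute value, tallied by hand
sky_children = [[3, 0], [0, 3], [0, 1]]             # clear, rainy, cloudy
temp_children = [[2, 3], [1, 1]]                    # cold, warm
season_children = [[1, 1], [1, 2], [1, 0], [0, 1]]  # winter, spring, summer, fall

sky_gain = gain_from_counts(parent, sky_children)
temp_gain = gain_from_counts(parent, temp_children)
season_gain = gain_from_counts(parent, season_children)
print(sky_gain, temp_gain, season_gain)
```

Every sky-condition group is pure, so that split removes all of the parent's entropy. The temperature groups are almost as mixed as the full dataset, which is why that gain is close to zero.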

Another example: when to play tennis?

In [475]:
tennis_dataset = [line.split() for line in """
sunny    hot  high   weak   no
sunny    hot  high   strong no
overcast hot  high   weak   yes
rain     mild high   weak   yes
rain     cool normal weak   yes
rain     cool normal strong no
overcast cool normal strong yes
sunny    mild high   weak   no
sunny    cool normal weak   yes
rain     mild normal weak   yes
sunny    mild normal strong yes
overcast mild high   strong yes
overcast hot  normal weak   yes
rain     mild high   strong no
""".split('\n')[1:-1]]

In [476]:
tennis_dataset

[['sunny', 'hot', 'high', 'weak', 'no'],
 ['sunny', 'hot', 'high', 'strong', 'no'],
 ['overcast', 'hot', 'high', 'weak', 'yes'],
 ['rain', 'mild', 'high', 'weak', 'yes'],
 ['rain', 'cool', 'normal', 'weak', 'yes'],
 ['rain', 'cool', 'normal', 'strong', 'no'],
 ['overcast', 'cool', 'normal', 'strong', 'yes'],
 ['sunny', 'mild', 'high', 'weak', 'no'],
 ['sunny', 'cool', 'normal', 'weak', 'yes'],
 ['rain', 'mild', 'normal', 'weak', 'yes'],
 ['sunny', 'mild', 'normal', 'strong', 'yes'],
 ['overcast', 'mild', 'high', 'strong', 'yes'],
 ['overcast', 'hot', 'normal', 'weak', 'yes'],
 ['rain', 'mild', 'high', 'strong', 'no']]

In [477]:
def outlook(x): return x[0]
def temperature(x): return x[1]
def humidity(x): return x[2]
def wind(x): return x[3]

def play_tennis(x): return x[4]

In [478]:
outlook_values = attrvalues(tennis_dataset, outlook)
gain(tennis_dataset, play_tennis, ['no', 'yes'], outlook, outlook_values)

0.2467498197744391

In [479]:
humidity_values = attrvalues(tennis_dataset, humidity)
gain(tennis_dataset, play_tennis, ['no', 'yes'], humidity, humidity_values)

0.15183550136234136

In [480]:
wind_values = attrvalues(tennis_dataset, wind)
gain(tennis_dataset, play_tennis, ['no', 'yes'], wind, wind_values)

0.04812703040826927

In [481]:
temperature_values = attrvalues(tennis_dataset, temperature)
gain(tennis_dataset, play_tennis, ['no', 'yes'], temperature, temperature_values)

0.029222565658954647

Outlook is the best way to split our data first!

Let's use a more interesting dataset.

### Credit Card Application Dataset

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

In [482]:
import pandas as pd

In [483]:
cc_data = pd.read_csv('dataset/crx.data', header=None)

In [484]:
cc_data.head()

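|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |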


Column 15 is the class. 

"+" means the credit card application was approved

"-" means it was denied


In [9]:
cc_data.shape

(690, 16)

In [23]:
import numpy as np
shuffled_data = cc_data.sample(frac=1).reset_index(drop=True)
shuffled_data.head()

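|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 28.58 | 3.75 | u | g | c | v | 0.25 | f | t | 1 | t | g | 40 | 154 | - |
| 1 | b | 55.92 | 11.5 | u | g | ff | ff | 5.0 | t | t | 5 | f | g | 0 | 8851 | + |
| 2 | a | 19.75 | 0.75 | u | g | c | v | 0.795 | t | t | 5 | t | g | 140 | 5 | - |
| 3 | b | ? | 5.0 | y | p | aa | v | 8.5 | t | f | 0 | f | g | 0 | 0 | - |
| 4 | a | 27.25 | 0.29 | u | g | m | h | 0.125 | f | t | 1 | t | g | 272 | 108 | - |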


In [571]:
def cc_split(samples, attrfns, label):
    # Compute the information gain for every attribute function
    gains = []
    for attrfn in attrfns:
        g = gain(samples, label, ['+', '-'], attrfn, attrvalues(samples, attrfn))
        gains.append((g, attrfn))
    
    g, fn = max(gains, key=lambda x: x[0])
    print("max gain is {0}: {1}".format(fn.__name__, g))  
    return group_by_fn(samples, fn), fn

In [567]:
def group_by_fn(samples, fn):
    vals = attrvalues(samples, fn)
    groups = dict()

    for val in vals:
        groups[val] = []

    for x in samples:
        val = fn(x)
        groups[val].append(x)

    return groups    

In [568]:
def cc_class(x): return x[15]

In [599]:

# Let's build some attribute functions for categorical data

def zeroth(x): return x[0]
def third(x): return x[3]
def fourth(x): return x[4]
def fifth(x): return x[5]
def sixth(x): return x[6]
def eight(x): return x[8]
def ninth(x): return x[9]
def eleventh(x): return x[11]
def twelveth(x): return x[12]

cc_att_fns = [zeroth, third, fourth, fifth, sixth, eight, ninth, eleventh, twelveth]

In [572]:
groups, fn = cc_split(shuffled_data[:100].values.tolist(), cc_att_fns, cc_class)

max gain is eight: 0.4880889369130191


The attribute at index 8 (the `eight` function) gave us the best gain.


Here's how this data was split:

In [576]:
for key, examples in groups.items():
    print("{0}: {1}".format(key, len(examples)))


t: 53
f: 47


In [625]:
x = shuffled_data[:100].values.tolist()

build_tree(x, cc_split, cc_att_fns, cc_class)

max gain is eight: 0.4880889369130191
max gain is fifth: 0.21109607566762426
Building leaf with 5 same-class samples
Building leaf with class +
max gain is third: 0.10519553207004634
max gain is eleventh: 0.19811742113040343
max gain is zeroth: 0.9182958340544896
Building leaf with 1 same-class samples
Building leaf with class -
Building leaf with 2 same-class samples
Building leaf with class +
Building leaf with 4 same-class samples
Building leaf with class +
max gain is zeroth: 0.31127812445913283
Building leaf with 1 same-class samples
Building leaf with class +
max gain is sixth: 0.2516291673878229
max gain is fourth: 0.0
max gain is ninth: 0.0
max gain is eleventh: 0.0
max gain is twelveth: 0.0
Building leaf with 2 samples
Building leaf with class +
Building leaf with 1 same-class samples
Building leaf with class -
Building leaf with 3 same-class samples
Building leaf with class +
Building leaf with 1 same-class samples
Building leaf with class +
Building leaf with 1 same-class sa

<__main__.Internal at 0x10b48b4e0>
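
The `Internal` node returned here keeps its children in a plain list (built from `splits.values()`), so the attribute value that led to each child is not stored, and the tree cannot yet route a new sample to a leaf. One possible way to address this, sketched under assumptions, is to key the children by attribute value. Everything below is hypothetical and separate from the notebook's classes: `build_dict_tree` and `classify` are names introduced here, and for brevity the sketch splits on the first remaining attribute rather than the one with the highest gain.

```python
from collections import Counter

def build_dict_tree(samples, attrfns, label):
    """Build a tree where an internal node is (attrfn, {value: subtree})
    and a leaf is just a class label.

    Assumption: splits on the first remaining attribute, not the best one.
    """
    labels = [label(s) for s in samples]
    if len(set(labels)) == 1 or not attrfns:
        # Leaf: the (majority) class label of the samples that reached it
        return Counter(labels).most_common(1)[0][0]

    attrfn, rest = attrfns[0], attrfns[1:]
    groups = {}
    for s in samples:
        groups.setdefault(attrfn(s), []).append(s)
    return (attrfn, {val: build_dict_tree(group, rest, label)
                     for val, group in groups.items()})

def classify(tree, sample, default=None):
    """Follow the sample's attribute values down to a leaf label.

    Returns `default` if the sample has a value never seen during training.
    """
    while isinstance(tree, tuple):
        attrfn, children = tree
        value = attrfn(sample)
        if value not in children:
            return default
        tree = children[value]
    return tree

# Toy data from the build_tree docstring
def day(s): return s[0]
def color(s): return s[1]
def label(s): return s[2]

samples = [['Monday', 'Yellow', 'X'], ['Monday', 'Red', 'O']]
tree = build_dict_tree(samples, [color, day], label)
prediction = classify(tree, ['Tuesday', 'Red', None])
unseen = classify(tree, ['Tuesday', 'Blue', None])
print(prediction, unseen)
```

The same idea could be applied to the notebook's own classes by passing the `splits` dict keys along with the child nodes to `Internal`; that change is left as a design choice.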