# Decision Trees

Let's implement a decision tree.

We'll need two kinds of nodes:

1.  **Internal nodes** to represent decisions
2.  **Leaf nodes** to represent classes

In [2]:
class Node(object):
    def __init__(self, left, right):
        # Child subtrees; both are None for a leaf
        self.l = left
        self.r = right

class Internal(Node):
    def __init__(self, predicate, left, right):
        # predicate(sample) decides which branch a sample follows
        self.predicate = predicate
        super().__init__(left, right)

class Leaf(Node):
    def __init__(self, label):
        self.class_label = label
        super().__init__(None, None)

The basic algorithm is simple:

<code>
 build_tree(samples)
  if (y = 0 for all (x, y) in samples) return new leaf(0)
  else if (y = 1 for all (x,y) in samples) return new leaf(1)
  else
    choose best attribute x<sub>j</sub>
    s0 = all (x, y) in samples with x<sub>j</sub> = 0
    s1 = all (x, y) in samples with x<sub>j</sub> = 1
    return new node(x<sub>j</sub>, build_tree(s0), build_tree(s1))
</code>

In [3]:
def build_tree(samples, split, label):
    """Build a decision tree
       Parameters:
       samples   -- non-empty list of samples, where each sample is a list of attributes
       split     -- function that takes a list of samples and returns a tuple of three things:
          two groups of samples, and
          a function that tests the attribute used for this split.
          i.e., split(samples) => (a, b, attrfn)
                where attrfn(sample) returns True if the sample should be selected,
                False otherwise
       label     -- function that takes a single sample and returns the label for that sample
    """
    if all(label(samples[0]) == label(sample) for sample in samples):
        return Leaf(label(samples[0]))

    a, b, attr = split(samples)
    return Internal(attr, build_tree(a, split, label), build_tree(b, split, label))
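
Once a tree exists, classifying a sample amounts to walking from the root to a leaf. The sketch below is one way to do that. `classify` is a hypothetical helper that is not part of the notebook, and it assumes that samples for which the predicate returns True live in the left subtree (group `a` above). The two classes are minimal stand-ins so the block runs on its own.

```python
class Leaf:
    """Stand-in leaf node holding a class label."""
    def __init__(self, label):
        self.class_label = label

class Internal:
    """Stand-in decision node: a predicate plus two subtrees."""
    def __init__(self, predicate, left, right):
        self.predicate = predicate
        self.l = left
        self.r = right

def classify(node, sample):
    """Walk down the tree until a leaf is reached (hypothetical helper)."""
    while isinstance(node, Internal):
        # Assumption: True follows the left branch, False the right one.
        node = node.l if node.predicate(sample) else node.r
    return node.class_label

# Hand-built tree: test attribute 0 first, then attribute 1.
tree = Internal(lambda s: s[0] == 1,
                Leaf('yes'),
                Internal(lambda s: s[1] == 1, Leaf('maybe'), Leaf('no')))

print(classify(tree, [1, 0]))  # yes
print(classify(tree, [0, 1]))  # maybe
print(classify(tree, [0, 0]))  # no
```

The loop avoids recursion, so even a very deep tree will not hit Python's recursion limit during classification.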

We'll need a function to split the samples, but how to split depends on the data.

Let's use an interesting dataset.

### Credit Card Application Dataset

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

In [6]:
import pandas as pd

In [7]:
cc_data = pd.read_csv('dataset/crx.data', header=None)

In [57]:
cc_data.head()

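|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |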


Column 15 is the class: "+" means the credit card application was approved, and "-" means it was denied.


In [9]:
cc_data.shape

(690, 16)

In [23]:
import numpy as np
shuffled_data = cc_data.sample(frac=1).reset_index(drop=True)
shuffled_data.head()

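|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 28.58 | 3.75 | u | g | c | v | 0.25 | f | t | 1 | t | g | 40 | 154 | - |
| 1 | b | 55.92 | 11.5 | u | g | ff | ff | 5.0 | t | t | 5 | f | g | 0 | 8851 | + |
| 2 | a | 19.75 | 0.75 | u | g | c | v | 0.795 | t | t | 5 | t | g | 140 | 5 | - |
| 3 | b | ? | 5.0 | y | p | aa | v | 8.5 | t | f | 0 | f | g | 0 | 0 | - |
| 4 | a | 27.25 | 0.29 | u | g | m | h | 0.125 | f | t | 1 | t | g | 272 | 108 | - |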


Let's take a quick detour and implement entropy and information gain. We'll use information gain to measure how well an attribute splits the data, and pick the attribute with the highest gain.

### Entropy and Information Gain
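
For reference, these are the standard definitions that the functions in this section compute. The symbols are notation chosen here, not taken from the notebook: $S$ is a set of examples, $n_c$ is the number of examples of class $c$, $A$ is an attribute, and $S_v$ is the subset of $S$ whose value for $A$ is $v$.

$$H(S) = -\sum_{c} p_c \log_2 p_c, \qquad p_c = \frac{n_c}{\sum_{k} n_k}$$

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

Terms with $p_c = 0$ are taken to contribute $0$ to the sum.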

In [34]:
import math

def entropy(examples):
    '''
    Compute the entropy, in bits, of a class distribution

    Parameters
    examples -- list of number of examples per class
    '''
    total = sum(examples)
    result = 0.0

    # filter out classes with 0 examples to compute - p * log(p)
    # (i.e., we define 0 * log(0) == 0)
    for n in filter(None, examples):
        result -= (n/total) * math.log(n/total, 2)
    return result

In [32]:
0 == entropy([10]) # 100% pure

True

In [37]:
1 == entropy([5, 5]) # Max impurity

True

In [39]:
entropy([5, 10])  # impure

0.9182958340544896
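
As a check on the value above, the same number can be worked out by hand. With 5 examples of one class and 10 of the other, the class proportions are $\tfrac{1}{3}$ and $\tfrac{2}{3}$:

$$H = -\left(\tfrac{1}{3}\log_2\tfrac{1}{3} + \tfrac{2}{3}\log_2\tfrac{2}{3}\right) \approx 0.3333 \times 1.5850 + 0.6667 \times 0.5850 \approx 0.5283 + 0.3900 = 0.9183$$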

In [42]:
entropy([9, 1])

0.4689955935892812

In [45]:
entropy([50, 1]) # purer split

0.1392329990550989

In [55]:
entropy([99, 1])

0.08079313589591118

In [51]:
entropy([10000, 1])

0.0014729006652121114

In [58]:
entropy([10**6, 1]) # almost pure

2.137424295738942e-05

#### Gain

In [151]:
def gain(examples, count_per_class_fn, classfn, classes, attrfn, attrvals):
    '''
    Calculates information gain after splitting on an attribute
    
    Parameters
    examples - list of examples. Each example has attributes.
    count_per_class_fn  - function that returns the distribution of classes for a subset of examples.
       e.g., count_per_class_fn(examples, classfn, classes) -> [3, 4, 5]  (3 are of class 0, 4 of class 1, 5 of class 2)
    classfn  - function that returns the class of an example.
    classes  - list of all classes
    attrfn   - function that returns the value of a specific attribute given an example.
       e.g., attrfn(example) -> attribute value
    attrvals - collection of all possible values for the attribute used for this split
       e.g., [1, 2, 3], or ['red', 'yellow', 'green']
    
    '''
    en = entropy(count_per_class_fn(examples, classfn, classes))
    
    total = len(examples)

    for val in attrvals:
        # Get all examples whose value for the attribute is val
        sv = list(filter(lambda example: attrfn(example) == val, examples))
        en -= len(sv)/total * entropy(count_per_class_fn(sv, classfn, classes))
    return en

In [67]:
moon_day_examples = [
    ['clear', 'cold', 'winter', 'moon'],
    ['rainy', 'warm', 'spring', 'no-moon'],
    ['cloudy', 'cold', 'winter', 'no-moon'],
    ['clear', 'warm', 'summer', 'moon'],
    ['rainy', 'cold', 'fall', 'no-moon'],
    ['rainy', 'cold', 'spring', 'no-moon'],
    ['clear', 'cold', 'spring', 'moon'],
]

In [87]:
def moon(example):
    return example[3]

In [133]:
def counts_per_class(examples, classfn, classes):
    dist = dict()
    for cls in classes:
        dist[cls] = 0
    
    for example in examples:
        cls = classfn(example)
        dist[cls] += 1
    
    flat_dist = []
    for cls in classes:
        flat_dist.append(dist[cls])
    return flat_dist

In [96]:
counts_per_class(moon_day_examples, moon, ['moon', 'no-moon'])  # 3 'moon' and 4 'no-moon'

[3, 4]

In [97]:
def sky_condition(example): return example[0]
def temp(example): return example[1]
def season(example): return example[2]

In [142]:
def attrvalues(examples, attrfn):
    return set(attrfn(example) for example in examples)

In [143]:
classes = ['moon', 'no-moon']

In [144]:
sky = attrvalues(moon_day_examples, sky_condition)
sky

{'clear', 'cloudy', 'rainy'}

What's the gain if we split by **sky condition**?

In [145]:
gain(moon_day_examples, counts_per_class, moon, classes, sky_condition, sky)

0.9852281360342516

What's the gain if we split by **temperature**?

In [137]:
temp_values = attrvalues(moon_day_examples, temp)
gain(moon_day_examples, counts_per_class, moon, classes, temp, temp_values)

0.005977711423774124

What's the gain if we split by **season**?

In [150]:
season_values = attrvalues(moon_day_examples, season)
gain(moon_day_examples, counts_per_class, moon, classes, season, season_values)

0.3059584928680419

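As expected, sky condition is the best split for predicting on which days we'll see the moon: its gain equals the entropy of the full dataset (about 0.985), so every subset it produces is pure.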
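
To connect this back to `build_tree`, here is a rough sketch of a `split` function driven by information gain. It is only one possible design, not the notebook's own: `make_split` and `entropy_of` are hypothetical helpers, the split is assumed to be binary (a test of the form `sample[i] == value`), and the entropy is recomputed from labels so that the block runs on its own.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def make_split(label, n_attrs):
    """Build a split(samples) function in the shape build_tree expects.

    Hypothetical helper: tries every test `sample[i] == value` and keeps
    the one with the highest information gain.
    """
    def split(samples):
        base = entropy_of([label(s) for s in samples])
        best = None  # (gain, attribute index, value)
        for i in range(n_attrs):
            for value in sorted({s[i] for s in samples}):
                a = [s for s in samples if s[i] == value]
                b = [s for s in samples if s[i] != value]
                if not b:
                    continue  # every sample has this value: nothing to separate
                remainder = sum(len(g) / len(samples) *
                                entropy_of([label(s) for s in g])
                                for g in (a, b))
                if best is None or base - remainder > best[0]:
                    best = (base - remainder, i, value)
        if best is None:
            raise ValueError('samples cannot be separated by any attribute')
        _, i, value = best
        attrfn = lambda s: s[i] == value
        a = [s for s in samples if attrfn(s)]
        b = [s for s in samples if not attrfn(s)]
        return a, b, attrfn
    return split

moon_day_examples = [
    ['clear', 'cold', 'winter', 'moon'],
    ['rainy', 'warm', 'spring', 'no-moon'],
    ['cloudy', 'cold', 'winter', 'no-moon'],
    ['clear', 'warm', 'summer', 'moon'],
    ['rainy', 'cold', 'fall', 'no-moon'],
    ['rainy', 'cold', 'spring', 'no-moon'],
    ['clear', 'cold', 'spring', 'moon'],
]

split = make_split(label=lambda s: s[3], n_attrs=3)
a, b, attrfn = split(moon_day_examples)
print(len(a), len(b))  # 3 4
```

On the moon data the chosen test is `sky == 'clear'`, which separates the three 'moon' days from the four 'no-moon' days in a single step. A binary test keeps the tree compatible with the two-child `Internal` node; a multi-way split on all attribute values would need a different node type.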