# Decision Trees

Let's implement a decision tree.

We'll need two kinds of nodes:

1.  **internal nodes** to represent decisions
2.  **leaf nodes** to represent classes

In [5]:
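class Node(object):
    """Base class for tree nodes; leaves simply have no children."""
    def __init__(self, right, left):
        self.r = right
        self.l = left

class Internal(Node):
    """A decision node: predicate(sample) picks which child to follow."""
    def __init__(self, predicate, right, left):
        super().__init__(right, left)
        self.predicate = predicate

class Leaf(Node):
    """A terminal node holding a class label."""
    def __init__(self, label):
        super().__init__(None, None)
        self.class_label = label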

The basic algorithm is simple:

<code>
 build_tree(samples)
  if (y = 0 for all <x, y> in samples) return new leaf(0)
  else if (y = 1 for all <x,y> in samples) return new leaf(1)
  else
    choose best attribute x<sub>j</sub>
    s0 = all <x, y> in samples with x<sub>j</sub> = 0
    s1 = all <x, y> in samples with x<sub>j</sub> = 1
    return new node(x<sub>j</sub>, build_tree(s0), build_tree(s1))
</code>
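
The pseudocode doesn't say what makes an attribute "best". One common choice is information gain, the drop in label entropy after the split. Here's a minimal sketch under that assumption, using the pseudocode's binary `<x, y>` samples; the names `entropy`, `information_gain`, and `choose_best_attribute` are made up for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(samples, j):
    """Entropy reduction from splitting (x, y) samples on binary attribute x[j]."""
    s0 = [y for x, y in samples if x[j] == 0]
    s1 = [y for x, y in samples if x[j] == 1]
    all_labels = [y for _, y in samples]
    # Weighted average entropy of the two groups after the split.
    remainder = sum(len(s) / len(samples) * entropy(s) for s in (s0, s1) if s)
    return entropy(all_labels) - remainder

def choose_best_attribute(samples):
    """Index j of the attribute with the highest information gain."""
    n_attributes = len(samples[0][0])
    return max(range(n_attributes), key=lambda j: information_gain(samples, j))

# Toy data: y always equals x[0], while x[1] says nothing about y.
toy = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
print(choose_best_attribute(toy))
```

On the toy data, splitting on `x[0]` separates the labels perfectly, so it wins over `x[1]`.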

In [6]:
def build_tree(samples, split, label):
    """Build a decision tree.

       Arguments:
       samples   -- non-empty list of samples, where each sample is a list of attributes
       split     -- function that takes a list of samples, and returns a tuple of three things:
          two groups of samples
          a function that tests the attribute used for this split.
          i.e., split(samples) => (a, b, attrfn)
                where attrfn(sample) returns True if the sample should be selected
                (taken here to mean it goes into group a), False otherwise
       label     -- function that takes a single sample and returns the label for that sample
    """
    # Base case: every sample has the same label, so this subtree is a leaf.
    first_label = label(samples[0])
    if all(label(sample) == first_label for sample in samples):
        return Leaf(first_label)

    # Recursive case: split the samples and build one subtree per group.
    a, b, attr = split(samples)
    return Internal(attr, build_tree(a, split, label), build_tree(b, split, label))
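
Once a tree exists, using it means walking from the root to a leaf. Here's a rough sketch of that walk. It assumes a `True` predicate leads to the `r` child and `False` to the `l` child, and it uses `SimpleNamespace` stand-ins with the same attribute names as the node classes so that it runs on its own; `classify` is a hypothetical helper name.

```python
from types import SimpleNamespace

def classify(node, sample):
    """Follow predicates from the root until a leaf is reached; return its label."""
    # Leaf nodes are the ones that carry a class_label attribute.
    while not hasattr(node, "class_label"):
        # Assumption: a True predicate leads to the `r` child, False to `l`.
        node = node.r if node.predicate(sample) else node.l
    return node.class_label

# Hand-built stand-in tree: "is attribute 0 equal to 't'?"
# gives '+' if so and '-' otherwise.
tree = SimpleNamespace(
    predicate=lambda s: s[0] == 't',
    r=SimpleNamespace(class_label='+', r=None, l=None),
    l=SimpleNamespace(class_label='-', r=None, l=None),
)
print(classify(tree, ['t', 1.25]))
print(classify(tree, ['f', 0.0]))
```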

We'll need a function to split the samples on an attribute, but how best to do that depends on the data.
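
To make the `split(samples) => (a, b, attrfn)` contract concrete, here's one possible sketch for categorical attributes. It tries every `sample[col] == value` test and keeps the one with the lowest weighted Gini impurity. Both the criterion and the names `gini` and `make_equality_split` are assumptions for illustration, as is putting the samples that pass the test into `a`.

```python
from collections import Counter

def gini(samples, label):
    """Gini impurity of the labels in samples (0.0 when all labels agree)."""
    counts = Counter(label(s) for s in samples)
    return 1.0 - sum((n / len(samples)) ** 2 for n in counts.values())

def make_equality_split(columns, label):
    """Return a split(samples) function that tests `sample[col] == value`.

    columns -- indices of the attribute columns the split may test
    label   -- function mapping a sample to its class label
    """
    def split(samples):
        best = None  # (weighted impurity, a, b, attrfn)
        for col in columns:
            # dict.fromkeys keeps distinct values in first-seen order.
            for value in dict.fromkeys(s[col] for s in samples):
                # Bind col and value now so each predicate keeps its own test.
                attrfn = lambda s, col=col, value=value: s[col] == value
                a = [s for s in samples if attrfn(s)]
                b = [s for s in samples if not attrfn(s)]
                if not a or not b:
                    continue  # this test does not separate anything
                score = (len(a) * gini(a, label) + len(b) * gini(b, label)) / len(samples)
                if best is None or score < best[0]:
                    best = (score, a, b, attrfn)
        if best is None:
            raise ValueError("no attribute separates these samples")
        return best[1:]
    return split

# Toy samples: two t/f attributes followed by the class label.
toy = [['t', 't', '+'], ['t', 'f', '+'], ['f', 't', '-'], ['f', 'f', '-']]
split = make_equality_split(columns=[0, 1], label=lambda s: s[-1])
a, b, attrfn = split(toy)
print(a)
print(b)
```

Raising an error when nothing separates the samples matters here: without it, a recursive tree builder would keep splitting the same group forever when identical samples carry different labels.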

Let's use an interesting dataset:

### Credit Card Application Dataset

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

In [7]:
import pandas as pd

In [24]:
cc_data = pd.read_csv('dataset/crx.data', header=None)

In [25]:
cc_data.head()

index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [26]:
cc_data.shape

(690, 16)

In [42]:
import numpy as np
shuffled_data = cc_data.sample(frac=1).reset_index(drop=True)
shuffled_data.head()

index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
539,b,80.25,5.5,u,g,?,?,0.54,t,f,0,f,g,0,340,-
122,a,24.75,12.5,u,g,aa,v,1.5,t,t,12,t,g,120,567,+
512,a,44.33,0.0,u,g,c,v,2.5,t,f,0,f,g,0,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
233,b,27.67,13.75,u,g,w,v,5.75,t,f,0,t,g,487,500,+
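
Row 539 above has `?` in columns 5 and 6, which looks like this file's marker for missing values. Here's a small sketch of asking pandas to treat `?` as missing through the `na_values` argument of `read_csv`. The three rows are copied from the output above (index dropped) and stand in for `dataset/crx.data`; dropping incomplete rows is just one simple option, not necessarily the right one for this dataset.

```python
import io
import pandas as pd

# Three rows copied from the DataFrame output above (index column dropped),
# standing in for 'dataset/crx.data' so the snippet runs on its own.
rows = """b,80.25,5.5,u,g,?,?,0.54,t,f,0,f,g,0,340,-
a,24.75,12.5,u,g,aa,v,1.5,t,t,12,t,g,120,567,+
a,44.33,0.0,u,g,c,v,2.5,t,f,0,f,g,0,0,+
"""

# na_values='?' asks pandas to parse '?' as NaN instead of a literal string.
sample_df = pd.read_csv(io.StringIO(rows), header=None, na_values='?')

# Count missing entries per column and show only the affected columns.
missing_per_column = sample_df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# One simple (lossy) option: drop every row that has a missing value.
complete_rows = sample_df.dropna()
print(complete_rows.shape)
```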
