#### Priority Queues

# Binary Trees

Binary trees are important in many applications, some of which we will explore.

A binary tree consists of a unique root node and optional additional nodes with the property that

- every node has at most two child nodes L and R.
- every non-root node has a unique parent node
- every node has the root node as an ancestor (parent of a parent of a parent ...)
- every node has some optional additional data associated with it contained in a dictionary. E.g. one item in the dictionary could be that node's depth (number of parent nodes in the path to the root node).

A  **complete binary tree** is one in which, if $D$ is the largest depth of a node, then
- for all depths $d=0,\ldots,D-1$ the number of nodes is $2^{d}$,
- at depth $D$ the nodes are as far to the left as possible.

A complete binary tree can be implemented as a list with the following properties:

- the root node appears in index 0
- the children of the node appearing in index i are at indexes
    - 2i+1
    - 2i+2

As a result, the parent of a node appearing in index i appears in index (i-1)//2 (here the // means the integer part obtained when i-1 is divided by 2).

Index Tree EX: 

0

1 2

3 4 5 6
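
The index arithmetic above can be sketched in a few lines of Python. The helper names `children` and `parent` are illustrative choices, not part of any library, and `n` is assumed to be the number of nodes stored in the list.

```python
def children(i, n):
    """Indexes of the children of node i in a complete binary tree of n nodes."""
    return [c for c in (2 * i + 1, 2 * i + 2) if c < n]


def parent(i):
    """Index of the parent of node i, or None for the root (i == 0)."""
    return (i - 1) // 2 if i > 0 else None


n = 7  # the seven-node index tree drawn above
for i in range(n):
    print(i, "parent:", parent(i), "children:", children(i, n))
```

Nodes 3 to 6 have no children in the seven-node tree because 2i+1 is already past the end of the list.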

# Priority Queues and Heaps

A priority queue is a data structure where we store items that can be ordered by priority. The priority of an item is determined by a numerical value, with the smaller value receiving higher priority (sorry, but that is the standard way of defining priority; it may be best to think of the position of objects in a line numbered 0,1,... so that the 0th position gets the highest priority). The data structure supports the following actions:

- insert - adding a new pair (object, priority) to the queue
- pop - returning the element from the queue with highest priority (lowest priority value) and removing it from the queue
- peek - inspecting the highest priority item without removing it
- determining the number of elements left in the queue without making any changes

We want these operations to be efficient and it turns out that a priority queue can be efficiently implemented using a heap, which is a *complete binary tree structure* as described above with items stored at the nodes. Importantly, 

- the binary tree has the *heap property*: a parent is always at least as high priority as its children, and
- the operations described above can be carried out efficiently.
    - insertion - to insert an element, we add a node at the deepest depth, to the right of any existing nodes at that depth (or, if that depth is full, start a new depth and place the node at its leftmost position); then we repeatedly swap the node with its parent while the parent has lower priority, stopping once the parent has at least as high priority (this is called bubbling up)
    - pop - return the first element of the list and replace it by the last element; then repeatedly swap this element with its higher-priority child while that child has higher priority than it, stopping once it has at least as high priority as its children (this is called sifting down).
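
The pop step described above can be sketched as follows. This is a minimal illustration for a list-based min-heap of directly comparable values, not the implementation used by `heapq`; the function name `heap_pop` is an assumption made for this example.

```python
def heap_pop(heap):
    """Remove and return the smallest value of a list-based min-heap."""
    if not heap:
        raise IndexError("pop from an empty heap")
    top = heap[0]
    last = heap.pop()          # remove the last element of the list
    if heap:                   # if anything is left, move it to the root
        heap[0] = last
        i, n = 0, len(heap)
        while True:
            left, right = 2 * i + 1, 2 * i + 2
            smallest = i
            if left < n and heap[left] < heap[smallest]:
                smallest = left
            if right < n and heap[right] < heap[smallest]:
                smallest = right
            if smallest == i:  # heap property holds again
                break
            # swap with the higher-priority (smaller) child and continue below
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
    return top


heap = [20, 45, 30, 90, 50, 65]
print(heap_pop(heap), heap)
```

Swapping with the smaller of the two children matters: swapping with the larger one could leave that child above a sibling of higher priority.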

A heap will be *balanced*, meaning that the distance between a *leaf* (a node with no children) and the root node takes one of at most two possible values ($D-1$ or $D$, where $D$ is the largest depth).


**Efficiency of the heap** Insertion and pop each take at most about $C\log_2(n)$ operations, where $n$ is the number of objects stored and $C$ is a constant representing the cost of a pair of comparisons and a swap of two list elements. Peeking and counting the elements take constant time.
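
To make the logarithmic bound concrete: the last node of a complete binary tree with $n$ nodes sits at depth $\lfloor \log_2 n \rfloor$, so bubbling up or sifting down performs at most that many swaps. The small check below counts the parent steps from the last index to the root; the name `max_depth` is illustrative.

```python
import math


def max_depth(n):
    """Depth of the last node (index n-1) in a complete binary tree with n >= 1 nodes."""
    depth, i = 0, n - 1
    while i > 0:          # walk from the last node up to the root
        i = (i - 1) // 2
        depth += 1
    return depth


for n in (1, 2, 7, 8, 1000, 10**6):
    assert max_depth(n) == math.floor(math.log2(n))
    print(n, max_depth(n))
```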

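Bubbling up after an insertion, step by step:

- Let i = index of the newly inserted element.
- Let p = floor((i - 1) / 2), the index of its parent.
- While i > 0 and value[i] < value[p]: swap the two elements, set i = p, recompute p, and repeat.

High priority = smaller value.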

In [73]:
def heapify(L1, L2):
    """
    Simulate min-heap insertion with step-by-step bubble-up prints.
    L1: starting heap (list)
    L2: values to insert
    """
    heap = L1[:]  # copy

    for v in L2:

        print(f"Initial list: {heap}")

        # Append new value to the list, find i and p
        heap.append(v)
        i = len(heap) - 1
        p = (i - 1) // 2 if i > 0 else None

        print(f"New point {v}, List {heap}")

        # Root Case
        if i == 0:
            print(f"Updated list: {heap}\n")
            print("COMPLETE\n")
            continue

        # Bubble-up
        while i > 0:
            p = (i - 1) // 2

            if heap[i] < heap[p]:
                print(f"Swap since heap[i={i}]={heap[i]} < heap[p={p}]={heap[p]}")
                heap[i], heap[p] = heap[p], heap[i]
                print(f"Updated list: {heap}\n")
                i = p
            else:
                print(f"No swap since heap[i={i}]={heap[i]} >= heap[p={p}]={heap[p]}")
                print(f"Updated list: {heap}\n")
                break

        print("COMPLETE\n")

    return heap

In [75]:
# Example usage matching your exercise:
start = [30, 50]
to_add = [65, 90, 45, 20]

final_heap = heapify(start, to_add)
print("Final heap:", final_heap)

Initial list: [30, 50]
New point 65, List [30, 50, 65]
No swap since heap[i=2]=65 >= heap[p=0]=30
Updated list: [30, 50, 65]

COMPLETE

Initial list: [30, 50, 65]
New point 90, List [30, 50, 65, 90]
No swap since heap[i=3]=90 >= heap[p=1]=50
Updated list: [30, 50, 65, 90]

COMPLETE

Initial list: [30, 50, 65, 90]
New point 45, List [30, 50, 65, 90, 45]
Swap since heap[i=4]=45 < heap[p=1]=50
Updated list: [30, 45, 65, 90, 50]

No swap since heap[i=1]=45 >= heap[p=0]=30
Updated list: [30, 45, 65, 90, 50]

COMPLETE

Initial list: [30, 45, 65, 90, 50]
New point 20, List [30, 45, 65, 90, 50, 20]
Swap since heap[i=5]=20 < heap[p=2]=65
Updated list: [30, 45, 20, 90, 50, 65]

Swap since heap[i=2]=20 < heap[p=0]=30
Updated list: [20, 45, 30, 90, 50, 65]

COMPLETE

Final heap: [20, 45, 30, 90, 50, 65]


In [78]:
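# print a node and its children
# (assumes `heap` is a list already arranged as a heap; the cell that built it is not shown)
import numpy as np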
def print_node_and_children_of_node(L,i):
    depth=3*int(np.log2(i+1))
    indent_string="".join([" " for j in range(depth+1)])
    print(indent_string+str(L[i]))
    child1=2*i+1
    child2=2*i+2
    if child1<len(L):
        print_node_and_children_of_node(L,child1)
    if child2<len(L):
        print_node_and_children_of_node(L,child2)

print_node_and_children_of_node(heap,0)

 0.2176579634064998
    0.546924252949452
       0.6690101981140354
          0.780267234719173
          0.8768859478469114
       0.8248801070984133
    0.4336795828421094
       0.4394591084063991
       0.5317478696078605


In [80]:
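# popping
# (cell In [79] is not shown; the output below is consistent with the root
# 0.2176... having already been popped there)
import heapq
x=heapq.heappop(heap)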
print(x)
print("\n")
print_node_and_children_of_node(heap,0)

0.4336795828421094


 0.4394591084063991
    0.546924252949452
       0.6690101981140354
       0.8248801070984133
    0.5317478696078605
       0.8768859478469114
       0.780267234719173


In [5]:
# making a class to hide details from the user
import heapq
class PriorityQueue:
    def __init__(self):
        self._heap = []
    
    def push(self, item):
        heapq.heappush(self._heap, item)
    
    def pop(self):
        return heapq.heappop(self._heap)
    
    def peek(self):
        return self._heap[0] if self._heap else None
    
    def print_node_and_children_of_node(self,i):
        depth=3*int(np.log2(i+1))
        indent_string="".join([" " for j in range(depth+1)])
        print(indent_string+str(self._heap[i]))
        child1=2*i+1
        child2=2*i+2
        if child1<len(self._heap):
            self.print_node_and_children_of_node(child1)
        if child2<len(self._heap):
            self.print_node_and_children_of_node(child2)

import numpy as np
PQ=PriorityQueue()
for i in range(15):
    u=np.random.uniform(0,1)
    PQ.push(u)

PQ.print_node_and_children_of_node(0)

 0.004687511326559424
    0.19695598382606339
       0.6573269655123397
          0.8552142152616831
          0.8194034024015693
       0.43903171435909416
          0.6569969372589074
          0.4718516602872821
    0.07846521725964206
       0.165570536114101
          0.9041898630952048
          0.24188050631121905
       0.08480032800808501
          0.7549254836987843
          0.668683692750721


**Some small modifications**

- Make pop and peek return None if the queue is empty
- Add an is_empty() method
- Add a __len__() method

In [6]:
import heapq
class PriorityQueue:
    def __init__(self):
        self._heap = []
    
    def push(self, item):
        heapq.heappush(self._heap, item)
    
    def pop(self):
        if len(self._heap)==0:
            return(None)
        item=heapq.heappop(self._heap)
        return(item)
  
    def peek(self):
        if len(self._heap)==0:
            return(None)
        # the highest-priority item sits at index 0 of the heap list
        item=self._heap[0]
        return item
        
    def __len__(self):
        return len(self._heap)

    def is_empty(self):
        return len(self._heap)==0

    def print_node_and_children_of_node(self,i):
        depth=3*int(np.log2(i+1))
        indent_string="".join([" " for j in range(depth+1)])
        print(indent_string+str(self._heap[i]))
        child1=2*i+1
        child2=2*i+2
        if child1<len(self._heap):
            self.print_node_and_children_of_node(child1)
        if child2<len(self._heap):
            self.print_node_and_children_of_node(child2)           

In [8]:
import numpy as np
PQ=PriorityQueue()
for i in range(25):
    p=np.random.uniform(0,1)
    PQ.push(p) 
while not PQ.is_empty():
    x=PQ.pop()
    print(x)

0.004792444074259605
0.04360711584097343
0.0576247539197795
0.06350533882847453
0.08903014763959849
0.10152283156790409
0.13064838501306253
0.14101361852313077
0.1607888251196281
0.21996653307709224
0.2519961070287391
0.2981839779846721
0.30463845084109065
0.36337691501980096
0.5225950745413018
0.6883893346966522
0.699642387763238
0.7025701559366495
0.7075429373408348
0.7972446125880351
0.8114348376368729
0.8436178392504993
0.8781582355593823
0.9616941728925474
0.9987899607436168


**Customized ordering**


What if we want to put objects in a priority queue that are not immediately comparable?

**Example.**

In the following example, we consider a class of objects, each with a list as an attribute, and we use the sum of the entries in the list to determine priority, with a greater sum giving higher priority.

**Notes**

- We provide a \_\_lt\_\_() method for comparing two class instances
- We pass only the item to the push method; `heapq` uses the item's \_\_lt\_\_() method to order the heap, so no separate priority argument is needed.
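
An alternative to defining \_\_lt\_\_(), sketched here as an assumption rather than as the approach used in this section, is to store tuples `(priority, count, item)` in the heap. Tuples compare element by element, so the priority decides the order and an insertion counter breaks ties, which means the items themselves are never compared. Negating the sum makes a larger sum pop first. The helper names `push_with_priority` and `pop_item` are illustrative.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker: items are never compared directly


def push_with_priority(heap, item, priority):
    """Push item onto heap; a smaller priority value is popped first."""
    heapq.heappush(heap, (priority, next(_counter), item))


def pop_item(heap):
    """Pop and return the item with the smallest priority value."""
    priority, _, item = heapq.heappop(heap)
    return item


heap = []
for L in ([0.5, 0.25], [1.0, 2.0, 3.0], [0.75]):
    push_with_priority(heap, L, -sum(L))   # negate: larger sum pops first

while heap:
    print(pop_item(heap))
```

With equal priorities, the counter makes the item that was pushed earlier come out first.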

In [9]:
import numpy as np

# Create a class of things to put in the queue.
class thing:
    def __init__(self,L):
        self.L=L
    
    def __lt__(self,other):
        # "less than" means higher priority here: the thing with the
        # greater sum should come out of the min-heap first
        return sum(self.L)>sum(other.L)
    def __str__(self):
        st="["
        for x in self.L:
            st+=str(x)+","
        st+="]"
        st+=" "+str(sum(self.L))
        return(st)

# Create an instance of a priority queue
PQ=PriorityQueue()

# Add some things to the queue 
print("Adding to the queue:")
for i in range(10):
    n=np.random.poisson(5)
    L=[round(np.random.uniform(0,1),3) for j in range(n)]
    t=thing(L)
    PQ.push(t)
    print("   sum = {0:6.3f} L = {1:15s}".format(sum(L),str(L)))
print("Number of things in the queue = " + str(len(PQ)))

Adding to the queue:
   sum =  1.415 L = [0.726, 0.677, 0.012]
   sum =  4.684 L = [0.779, 0.465, 0.685, 0.956, 0.083, 0.808, 0.79, 0.118]
   sum =  2.812 L = [0.48, 0.03, 0.352, 0.322, 0.006, 0.671, 0.951]
   sum =  1.628 L = [0.471, 0.787, 0.37]
   sum =  2.457 L = [0.499, 0.405, 0.594, 0.009, 0.464, 0.486]
   sum =  2.808 L = [0.418, 0.394, 0.61, 0.424, 0.962]
   sum =  1.073 L = [0.985, 0.088] 
   sum =  4.596 L = [0.539, 0.406, 0.657, 0.654, 0.712, 0.582, 0.309, 0.737]
   sum =  3.218 L = [0.869, 0.965, 0.454, 0.298, 0.632]
   sum =  2.082 L = [0.725, 0.703, 0.564, 0.09]
Number of things in the queue = 10


In [10]:
PQ.print_node_and_children_of_node(0)

 [0.779,0.465,0.685,0.956,0.083,0.808,0.79,0.118,] 4.684
    [0.539,0.406,0.657,0.654,0.712,0.582,0.309,0.737,] 4.596
       [0.869,0.965,0.454,0.298,0.632,] 3.218
          [0.726,0.677,0.012,] 1.415
          [0.499,0.405,0.594,0.009,0.464,0.486,] 2.457
       [0.725,0.703,0.564,0.09,] 2.082
          [0.471,0.787,0.37,] 1.6280000000000001
    [0.48,0.03,0.352,0.322,0.006,0.671,0.951,] 2.812
       [0.418,0.394,0.61,0.424,0.962,] 2.808
       [0.985,0.088,] 1.073


**Now pop them all until the queue is empty**

In [11]:
while not PQ.is_empty():
    x=PQ.pop()
    print("   sum = {0:6.3f} L = {1:15s}".format(sum(x.L),str(x.L)))

   sum =  4.684 L = [0.779, 0.465, 0.685, 0.956, 0.083, 0.808, 0.79, 0.118]
   sum =  4.596 L = [0.539, 0.406, 0.657, 0.654, 0.712, 0.582, 0.309, 0.737]
   sum =  3.218 L = [0.869, 0.965, 0.454, 0.298, 0.632]
   sum =  2.812 L = [0.48, 0.03, 0.352, 0.322, 0.006, 0.671, 0.951]
   sum =  2.808 L = [0.418, 0.394, 0.61, 0.424, 0.962]
   sum =  2.457 L = [0.499, 0.405, 0.594, 0.009, 0.464, 0.486]
   sum =  2.082 L = [0.725, 0.703, 0.564, 0.09]
   sum =  1.628 L = [0.471, 0.787, 0.37]
   sum =  1.415 L = [0.726, 0.677, 0.012]
   sum =  1.073 L = [0.985, 0.088] 


#### Binary Decision Trees

**Binary Trees**

To illustrate an application involving classes, we introduce binary trees. These will be important for building _decision trees_, which are important for classification. 

To build a binary tree, we start by creating a node class. An instance of a node has the following:

- parent - a node if this node is not a root node and None if this is a root node
- right child node
- left child node
- data associated with the node 

We want node methods to provide the following capabilities.

- Create a root (parentless) node and add optional data to it.
- Create a node with some parent and add optional data to it.
- Retrieve the data associated with a node
- Assign data to a node
- Get the left child associated with a node if there is one
- Get the right child associated with a node if there is one
- Spawn a left child of a given node
- Spawn a right child of a given node

The data we associate with a node can be quite general. We'll use a dictionary at each node and in the code below, we'll store the depth of each node (depth is 0 for the root node, 1 for its children, 2 for its grandchildren etc.) and a label for each node.

In [12]:
class node:
    __slots__=('parent','left_child','right_child','data')

    def __init__(self,parent,data=None):
        # use a fresh dictionary per node; a mutable default ({}) would be
        # shared by every node created without explicit data
        self.data=data if data is not None else {}
        if parent is None:
            # making this a root node
            self.data["depth"]=0
        else:
            # making this a non-root node
            self.data["depth"]=parent.data["depth"]+1
        self.parent=parent
        self.left_child=None
        self.right_child=None
    def spawn_left_child(self,data=None):
        # create a new node n with self as parent w/ given data
        # (the node constructor sets n.data["depth"])
        n=node(parent=self,data=data)
        self.left_child=n
        return(n)
    def spawn_right_child(self,data=None):
        # same as spawn_left_child, but attach n as the right child
        n=node(parent=self,data=data)
        self.right_child=n
        return(n)
    
    # string consisting of information about node
    def __str__(self):
        s="node label = "+self.data["label"]+"\n"
        if self.parent==None:
            s+="   no parent i.e. root node\n"
        else:
            s+="   parent label = " + self.parent.data["label"]+"\n"
        if self.left_child==None:
            s+="   no left child\n"
        else:
            s+="   left child label " + self.left_child.data["label"]+"\n"
        if self.right_child==None:
            s+="   no right child\n"
        else:
            s+="   right child label " + self.right_child.data["label"]+"\n"
        
        return(s)
    
    
    
rootnode=node(parent=None,data={"label":"0:mother of all nodes"})
#print("parent of root node = "+str(rootnode.parent))
print(rootnode)

node1=rootnode.spawn_left_child(data={"label":"1:daughter of mom of all nodes"})
node2=rootnode.spawn_right_child(data={"label":"1:son of mom of all nodes"})
print(node1)
print(node2)

node label = 0:mother of all nodes
   no parent i.e. root node
   no left child
   no right child

node label = 1:daughter of mom of all nodes
   parent label = 0:mother of all nodes
   no left child
   no right child

node label = 1:son of mom of all nodes
   parent label = 0:mother of all nodes
   no left child
   no right child



In [13]:
print(rootnode)

node label = 0:mother of all nodes
   no parent i.e. root node
   left child label 1:daughter of mom of all nodes
   right child label 1:son of mom of all nodes



**Build a tree**

In [14]:
rootnode=node(parent=None,data={"label":"TD = Top dog"})
node1=rootnode.spawn_left_child(data={"label":"DTD = daughter of Top Dog"})
node2=rootnode.spawn_right_child(data={"label":"STD = son of Top Dog"})
node11=node1.spawn_left_child(data={"label":"DDTD"})
node12=node1.spawn_right_child(data={"label":"SDTD"})
node21=node2.spawn_left_child(data={"label":"DSTD"})
node22=node2.spawn_right_child(data={"label":"SSTD"})
node211=node21.spawn_left_child(data={"label":"DDSTD"})
node2111=node211.spawn_left_child(data={"label":"DDDSTD"})
node2112=node211.spawn_right_child(data={"label":"SDDSTD"})
node212=node21.spawn_right_child(data={"label":"SDSTD"})

In [15]:
print(node2111)
print(node2111.parent)

node label = DDDSTD
   parent label = DDSTD
   no left child
   no right child

node label = DDSTD
   parent label = DSTD
   left child label DDDSTD
   right child label SDDSTD



In [16]:
node2111.parent.data["label"]

'DDSTD'

In [17]:
node2111.data

{'label': 'DDDSTD', 'depth': 4}

**Traverse the tree - depth first**

Once we have created a binary tree, we can recursively traverse it. 

The following code creates a string consisting of the label + new line character of a node and adjoins the same for its children.

A key capability utilized here is that a function can call itself (recursion).

In [18]:
def node_string(node):
    s=node.data["label"]+"\n"
    if node.left_child!=None:
        s+=node_string(node.left_child)
    if node.right_child!=None:
        s+=node_string(node.right_child)
    return(s)

When we call node_string on a node, that node's label is stored in s. Then, if there is a left child, the string for the left child is appended; this string contains the left child's label followed by the labels of all of its descendants. Only after that is the string for the right child appended in the same way, and so on.

Let's try this for the rootnode of our tree.

In [19]:
print(node_string(rootnode))

TD = Top dog
DTD = daughter of Top Dog
DDTD
SDTD
STD = son of Top Dog
DSTD
DDSTD
DDDSTD
SDDSTD
SDSTD
SSTD
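
The same visiting order can also be produced without recursion by keeping an explicit stack of nodes still to be visited. The sketch below is self-contained, so it uses a minimal stand-in class `SimpleNode` (a hypothetical name, not the node class above) and an illustrative function name `node_string_iterative`.

```python
class SimpleNode:
    """Minimal stand-in for the node class above (illustrative only)."""
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left_child = left
        self.right_child = right


def node_string_iterative(root):
    """Visit nodes in the same order as the recursive version, using a stack."""
    s = ""
    stack = [root]
    while stack:
        current = stack.pop()
        s += current.label + "\n"
        # push right first so that the left child is popped (visited) first
        if current.right_child is not None:
            stack.append(current.right_child)
        if current.left_child is not None:
            stack.append(current.left_child)
    return s


tree = SimpleNode("TD",
                  SimpleNode("DTD", SimpleNode("DDTD"), SimpleNode("SDTD")),
                  SimpleNode("STD"))
print(node_string_iterative(tree))
```

The recursive version is shorter; the stack version makes explicit the bookkeeping that recursion does for us.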



**Using Indents to Represent the "Child Of" Relationship**

We want to draw a tree using some amount of indentation of children relative to their parent - we indent by a certain amount depending on the depth of a node.

We can use the join function to create strings with some amount of indentation.

In [20]:
":::".join(["cat","bird","dog","turtle"])

'cat:::bird:::dog:::turtle'

In [21]:
for n in range(10):
    # create string with n spaces 
    nspaces="".join([" " for i in range(n)])
    # make spaces prefix for a string
    s=nspaces+"mystring"
    print(s)

mystring
 mystring
  mystring
   mystring
    mystring
     mystring
      mystring
       mystring
        mystring
         mystring


In [22]:
def node_string(node):
    # create string of spaces: three spaces for each level of depth
    spaces="".join(["   " for i in range(node.data["depth"])])
    s=spaces+node.data["label"]+"\n"
    if node.left_child!=None:
        s+=node_string(node.left_child)
    if node.right_child!=None:
        s+=node_string(node.right_child)
    return(s)

In [23]:
print(node_string(rootnode))

TD = Top dog
   DTD = daughter of Top Dog
      DDTD
      SDTD
   STD = son of Top Dog
      DSTD
         DDSTD
            DDDSTD
            SDDSTD
         SDSTD
      SSTD



Our function works for any node 

In [24]:
print(node_string(rootnode.left_child))

   DTD = daughter of Top Dog
      DDTD
      SDTD



In [25]:
print(node_string(rootnode.right_child))

   STD = son of Top Dog
      DSTD
         DDSTD
            DDDSTD
            SDDSTD
         SDSTD
      SSTD



**Add class method**

As usual, we can make this function a method of our class. When we do that, we need to rewrite the function calls so that they look like "node.node_string()" instead of "node_string(node)".

In [26]:
class node:
    __slots__=('parent','left_child','right_child','data')
    #
    # We instantiate a node by passing a parent (which can be None) 
    # and a dictionary
    #
    def __init__(self,parent,data=None):
        # use a fresh dictionary per node; a mutable default ({}) would be
        # shared by every node created without explicit data
        self.data=data if data is not None else {}
        if parent is None:
            # making this a root node
            self.data["depth"]=0
        else:
            # making this a non-root node
            self.data["depth"]=parent.data["depth"]+1
        self.parent=parent
        self.left_child=None
        self.right_child=None
    def spawn_left_child(self,data=None):
        # the node constructor stores the data and sets the depth
        n=node(parent=self,data=data)
        self.left_child=n
        return(n)
    def spawn_right_child(self,data=None):
        n=node(parent=self,data=data)
        self.right_child=n
        return(n)
    #
    # string consisting of information about node
    #
    def __str__(self):
        s="node label = "+self.data["label"]+"\n"
        if self.parent==None:
            s+="   no parent i.e. root node\n"
        else:
            s+="   parent label = " + self.parent.data["label"]+"\n"
        if self.left_child==None:
            s+="   no left child\n"
        else:
            s+="   left child label " + self.left_child.data["label"]+"\n"
        if self.right_child==None:
            s+="   no right child\n"
        else:
            s+="   right child label " + self.right_child.data["label"]+"\n"
        
        return(s)
    def node_string(self):
        spaces="".join([" " for i in range(self.data["depth"])])
        s=spaces+self.data["label"]+"\n"
        if self.left_child!=None:
            s+=self.left_child.node_string()
        if self.right_child!=None:
            s+=self.right_child.node_string()
        return(s)
rootnode=node(parent=None,data={"label":"TD = Top dog"})
node1=rootnode.spawn_left_child(data={"label":"DTD = daughter of Top Dog"})
node2=rootnode.spawn_right_child(data={"label":"STD = son of Top Dog"})
node11=node1.spawn_left_child(data={"label":"DDTD"})
node12=node1.spawn_right_child(data={"label":"SDTD"})
node21=node2.spawn_left_child(data={"label":"DSTD"})
node211=node21.spawn_left_child(data={"label":"DDSTD"})
node2111=node211.spawn_left_child(data={"label":"DDDSTD"})
node2112=node211.spawn_right_child(data={"label":"SDDSTD"})
node212=node21.spawn_right_child(data={"label":"SDSTD"})

s=rootnode.node_string()
print(s)

TD = Top dog
 DTD = daughter of Top Dog
  DDTD
  SDTD
 STD = son of Top Dog
  DSTD
   DDSTD
    DDDSTD
    SDDSTD
   SDSTD



In [27]:
s211=node211.node_string()
print(s211)

   DDSTD
    DDDSTD
    SDDSTD



**Binary Decision Trees**

A binary decision tree is a binary tree that enables us to predict which category an item falls into based on known characteristics of the item. Here is a simple example from finance. Mortgage loans have the following attributes:

- location type (suburban, rural, urban)
- borrower's credit score (numerical)
- loan principal i.e. size of loan (numerical; stored under the key "principle" in the code below)
- interest rate (numerical)
 
A loan can either be approved or not. We have lots of loan performance data, and based on that, here might be an example of a (by no means realistic) classifier:

* location = rural or suburban
    * credit score>700
        * interest rate>5% => reject
        * interest rate<=5% => approve
    * credit score<=700 => reject
* location = urban
    * credit score > 650
        * principal > 100K => approve
        * principal <= 100K => reject
    * credit score <= 650 => reject


A leaf is a node of a tree that has no children. 

Note the tree structure. We can think of a binary decision tree as a binary tree in which, to classify an individual with given variable values, we start at the root node and move along a path, picking a child node of the current node at each step. Every non-leaf node has two children and a function which, upon evaluation on the individual's data, tells us which child to move to. Every leaf node has a category, and we classify an individual according to the category of the leaf node they ultimately reach.

**Key point:** We can include any type of Python object as a node dictionary value, including a function.

Below, we add a key "f" to every node dictionary and make its value one of the functions defined below.

Each function takes as input a dictionary with keys being "location", "credit score", "interest rate", and
"principle" representing properties of a loan to approve or disapprove.

We place a label at each node so that we can see what is going on in the code.


In [28]:
# Define some functions
def f0(x):
    if x["location"]=="rural" or x["location"]=="suburban":
        return("left")
    else:
        return("right")

def f1(x):
    if x["credit score"]>700:
        return("left")
    else:
        return("right")

def f2(x):
    if x["credit score"]>650:
        return("left")
    else:
        return("right")
    
def f11(x):
    if x["interest rate"]>5:
        return("left")
    else:
        return("right")
def f111(x):
    return("reject")
def f112(x):
    return("approve")
def f12(x):
    return("reject")

def f21(x):
    if x["principle"]>100:
        return("left")
    else:
        return("right")
def f211(x):
    return("approve")
def f212(x):
    return("reject")
def f22(x):
    return("reject")
#
# Create the tree.
#
rootnode=node(parent=None,data={"f":f0,"label":"0"})
node1=rootnode.spawn_left_child(data={"f":f1,"label":"1"})
node11=node1.spawn_left_child(data={"f":f11,"label":"11"})
node111=node11.spawn_left_child(data={"f":f111,"label":"111"})
node112=node11.spawn_right_child(data={"f":f112,"label":"112"})
node12=node1.spawn_right_child(data={"f":f12,"label":"12"})
node2=rootnode.spawn_right_child(data={"f":f2,"label":"2"})
node21=node2.spawn_left_child(data={"f":f21,"label":"21"})
node211=node21.spawn_left_child(data={"f":f211,"label":"211"})
node212=node21.spawn_right_child(data={"f":f212,"label":"212"})
node22=node2.spawn_right_child(data={"f":f22,"label":"22"})



**Classification**

The classifier is used to classify a new observation i.e. data for a person seeking a loan.

We assume that this data is stored as a dictionary with keys "location", "credit score", "interest rate", "principle".

Now that we have our tree, we can create a function that walks down the tree from the root to a leaf to calculate the action to be taken.


In [29]:
def classify(idata):
    # initialize current node at root node
    cnode=rootnode
    #
    # while the current node has child nodes, compute the function
    # to determine which child node to go to
    #
    while cnode.left_child:
        print("current node label = ", cnode.data["label"])
        #
        # compute function value at this node (the result is "left" or "right")
        #
        value=cnode.data["f"](idata)
        print("function value = ",value)
        if value=="left":
            cnode=cnode.left_child
        else:
            cnode=cnode.right_child
    #
    # current node has no children - we are at a leaf
    #
    value=cnode.data["f"](idata)
    print("current node label = ", cnode.data["label"])
    print("function value = "+value)
    return(value)

In [30]:
x={"location":"urban","credit score":680,"interest rate":6.5,"principle":300}
result=classify(x)
print("\n"+result)

current node label =  0
function value =  right
current node label =  2
function value =  left
current node label =  21
function value =  left
current node label =  211
function value = approve

approve


**Prediction with probabilities**

When predicting a binary outcome (rain/no rain tomorrow, loan defaults/loan doesn't default, patient survives/patient dies) based on data, it is more informative to report a probability rather than the outcome itself. This has two benefits:

- the probability reflects uncertainty
- the decision-maker can compute an expected loss associated with either decision and act accordingly

To illustrate, suppose you know that the chance of a hurricane hitting Miami tomorrow is 10%. Suppose the loss associated with not preparing for the possibility of a hurricane when it actually hits is \\$ 100,000 and the loss associated with preparing and having it not hit is \\$ 500. Then 

- Expected loss if you don't prepare $E(P=0)=100000(0.1)+0(0.9) = 10,000$  
- Expected loss if you do prepare $E(P=1)=0(0.1)+500(0.9) = 450$  

So in terms of minimizing expected loss it is better to prepare. On the other hand, if the probability of the hurricane hitting is 1 in 50,000, then the expected loss is about \\$ 2 without preparing versus about \\$ 500 with preparing, so by this criterion you ought not prepare.
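
The expected-loss comparison above can be reproduced with a few lines of Python. The function name `expected_losses` and its parameter names are illustrative; the dollar amounts and probabilities are the ones from the hurricane example.

```python
def expected_losses(p_hit, loss_unprepared_hit=100_000, cost_prepared_no_hit=500):
    """Return (expected loss if not preparing, expected loss if preparing)."""
    no_prepare = loss_unprepared_hit * p_hit + 0 * (1 - p_hit)
    prepare = 0 * p_hit + cost_prepared_no_hit * (1 - p_hit)
    return no_prepare, prepare


for p in (0.1, 1 / 50_000):
    no_prepare, prepare = expected_losses(p)
    decision = "prepare" if prepare < no_prepare else "do not prepare"
    print(f"p = {p:.5f}: no-prepare = {no_prepare:.2f}, prepare = {prepare:.2f} -> {decision}")
```

Reporting the probability rather than a yes/no prediction is what lets the decision-maker plug in their own loss figures.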

The classifier above is easily modified to return an estimated probability of default.

In [31]:
def f0(x):
    if x["location"]=="rural" or x["location"]=="suburban":
        return("left")
    else:
        return("right")

def f1(x):
    if x["credit score"]>700:
        return("left")
    else:
        return("right")

def f2(x):
    if x["credit score"]>650:
        return("left")
    else:
        return("right")
    
def f11(x):
    if x["interest rate"]>5:
        return("left")
    else:
        return("right")
def f111(x):
    return(.23)
def f112(x):
    return(.05)
def f12(x):
    return(.17)

def f21(x):
    if x["principle"]>100:
        return("left")
    else:
        return("right")
def f211(x):
    return(.04)
def f212(x):
    return(.09)
def f22(x):
    return(.08)

rootnode=node(parent=None,data={"f":f0,"label":"0"})
node1=rootnode.spawn_left_child(data={"f":f1,"label":"1"})
node11=node1.spawn_left_child(data={"f":f11,"label":"11"})
node111=node11.spawn_left_child(data={"f":f111,"label":"111"})
node112=node11.spawn_right_child(data={"f":f112,"label":"112"})
node12=node1.spawn_right_child(data={"f":f12,"label":"12"})
node2=rootnode.spawn_right_child(data={"f":f2,"label":"2"})
node21=node2.spawn_left_child(data={"f":f21,"label":"21"})
node211=node21.spawn_left_child(data={"f":f211,"label":"211"})
node212=node21.spawn_right_child(data={"f":f212,"label":"212"})
node22=node2.spawn_right_child(data={"f":f22,"label":"22"})

def classify(idata):
    # initialize current node at root node
    cnode=rootnode
    #
    # while the current node has child nodes, compute the function
    # to determine which child node to go to
    #
    while cnode.left_child:
        print("current node label = ", cnode.data["label"])
        #
        # compute function value at this node (the result is "left" or "right")
        #
        value=cnode.data["f"](idata)
        print("function value = ",value)
        if value=="left":
            cnode=cnode.left_child
        else:
            cnode=cnode.right_child
    #
    # current node has no children - we are at a leaf
    #
    value=cnode.data["f"](idata)
    print("current node label = ", cnode.data["label"])
    print("function value = "+str(value))
    return(value)
x={"location":"urban","credit score":500,"interest rate":7,"principle":90}
result=classify(x)
print("\n"+str(result))

current node label =  0
function value =  right
current node label =  2
function value =  right
current node label =  22
function value = 0.08

0.08


In [32]:
import pandas as pd
import numpy as np

#### Building Binary Trees

**Building a Binary Decision Tree from Data**

Suppose we have a dataset with some predictor variables $x_1,x_2,\ldots,x_k$ and binary response variable $Y.$ For example, for the mortgage dataset we have predictors (location, principal, interest rate, credit score) and we want to predict the result (default, non-default). Our datset is _flat_/_rectangular_, with N rows and $k+1$ columns, with one column for each variable and one row for each _observation_ (mortgage loan).

We wish to use these data to build a decision tree in which the functions at the nodes are functions of the predictor variables.

Assume $Y$ takes the value 0 or 1. 

The predictor variables can be categorical or continuous.

**Recursive Description of the Algorithm**

The algorithm for building the tree has a recursive definition.

We begin by creating a root node with the entire dataset attached to that node.

We start at the root node.

Whenever we visit a node, we compute and store at the node the following information about the dataset attached to the node:

a) the number of observations in the dataset attached to that node, and 

b) the proportion of observations in each class (Y=0 or 1)
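As a small illustration of the bookkeeping in (a) and (b), the sketch below summarizes a list of 0/1 responses. The helper name `node_summary` is hypothetical and is not part of the code used later in these notes.

```python
def node_summary(y_values):
    """Return the number of observations and the proportion in each class."""
    n = len(y_values)
    n1 = sum(y_values)      # Y is 0 or 1, so the sum counts the observations with Y=1
    n0 = n - n1
    return {"n": n, "p0": n0 / n, "p1": n1 / n}

summary = node_summary([1, 1, 0, 0, 1])
print(summary)
```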

Next, we take one of the following actions:

1) Find a function (a splitting function) that splits/partitions the data into two pieces, called the left piece and the right piece. The two pieces should look different, in the sense that they have different proportions of observations with $Y=1$, and each piece should be sufficiently large. If such a _split_ can be found, we attach the splitting function to the node, spawn two children of the current node, attach the left piece to the left child and the right piece to the right child, and visit each of those children. 

or

2) Determine that a splitting function cannot be found so the current node becomes a leaf node (no children). 

The splitting function should be a function of $x_1,x_2,\ldots,x_k$ that returns a value of "left" or "right". Typically, this function is taken to be a function of only one of the $x_i$'s, and

a) for a continuous variable this is a function of the form: if $x_i < c$ return("left") else return("right")

b) for a categorical variable, this function takes the form: if $x_i \in I$ return("left") else return("right"), where $I$ is a subset of the values that the variable can take.
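The two forms (a) and (b) can be sketched as small factory functions. This is only an illustration: the names `make_continuous_split` and `make_categorical_split` are hypothetical, and a row is assumed to be anything that can be indexed by variable name (a dictionary here, a pandas row later).

```python
def make_continuous_split(vname, c):
    """Form (a): send observations with row[vname] < c to the left."""
    def split(row):
        return "left" if row[vname] < c else "right"
    return split

def make_categorical_split(vname, subset):
    """Form (b): send observations whose value lies in subset to the left."""
    left_values = frozenset(subset)
    def split(row):
        return "left" if row[vname] in left_values else "right"
    return split

row = {"location": "urban", "cscore": 500}
by_score = make_continuous_split("cscore", 550)
by_location = make_categorical_split("location", ["rural", "urban"])
print(by_score(row), by_location(row))
```

Because each call to a factory creates its own `c` or `left_values`, every returned function keeps the threshold or subset it was built with.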

**How to Classify/Predict $Y$ for a New Observation**

Given a new observation with predictor variables $x_1,x_2,\ldots,x_k$ we start at the root node and for each node we visit, we do one of the following:

a) if the node has children, apply the splitting function at the current node to determine which child node to visit next, or

b) if the current node is a leaf node, return the proportion $p_1$ of observations with $Y=1$ at that node

Finally, we predict $Y=1$ if $p_1$ exceeds some pre-determined threshold.
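The prediction rule can be sketched on a tiny hand-built tree. The sketch assumes a simplified representation that is not the node class used in these notes: each node is a dictionary with hypothetical keys `p1`, `threshold`, `left` and `right`, and there is a single continuous predictor.

```python
def predict_p1(node, x):
    """Walk from the given node to a leaf and return the leaf's proportion of Y=1."""
    while node["left"] is not None:
        # internal node: the splitting function is "x < threshold goes left"
        node = node["left"] if x < node["threshold"] else node["right"]
    return node["p1"]

def predict_class(node, x, cutoff=0.5):
    """Predict Y=1 when the leaf proportion exceeds the chosen cutoff."""
    return 1 if predict_p1(node, x) > cutoff else 0

leaf_low = {"p1": 0.1, "threshold": None, "left": None, "right": None}
leaf_high = {"p1": 0.8, "threshold": None, "left": None, "right": None}
root = {"p1": 0.45, "threshold": 4, "left": leaf_low, "right": leaf_high}

print(predict_p1(root, 2), predict_class(root, 5))
```

The cutoff of 0.5 is only a default for the sketch; the threshold is a modelling choice.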

**Splitting Criterion - how to find a good splitting function**

We need a criterion for deciding on a good splitting function. There are several possibilities. We focus here on the Gini index.

Given a categorical variable taking $K$ possible values and a set of data for that variable with proportions $p_1,p_2,\ldots,p_K$ of values in each category, we define the Gini index by

$$ G = \sum_{i=1}^K p_i(1-p_i)$$

This number has the following interpretation. If we pick a data point at random and classify it as class 1 with probability $p_1$, class 2 with probability $p_2$, etc., then $G$ is the probability of incorrectly classifying that observation.

$G$ is a measure of _impurity_ of the dataset with regard to the class variable. If one of the $p_i$ is one and the others are zero (perfect _purity_), we get $G=0$. In the case of a binary class variable with $p_1=p_2=1/2$, we get $G=1/2$.
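The definition can be checked numerically with a few lines of Python. The function name `gini_index` is chosen here for illustration only.

```python
def gini_index(proportions):
    """Gini index: the sum of p_i * (1 - p_i) over the class proportions."""
    return sum(p * (1 - p) for p in proportions)

print(gini_index([1.0, 0.0]))                 # perfectly pure dataset
print(gini_index([0.5, 0.5]))                 # balanced binary classes
print(gini_index([0.25, 0.25, 0.25, 0.25]))   # four equally likely classes
```

For a binary variable with $p_1=p$ the sum reduces to $2p(1-p)$, which is largest at $p=1/2$.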

When we split our dataset into two pieces, we would like the two child datasets to be as pure as possible, so we try to minimize the quantity

$$ N_{left} G_{left} + N_{right} G_{right}$$

that is, the weighted sum of the impurities of the child datasets weighted by the number of observations in the datasets.

It is typical to require for splitting that the size of each child dataset be above some pre-determined threshold.

A lower value of the criterion is better.
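To see the criterion at work, the sketch below compares two ways of splitting the same eight binary responses. The helper names are hypothetical, and the binary Gini index is written as $2p(1-p)$, following the definition above.

```python
def gini_index_binary(y_values):
    """Gini index of a list of 0/1 responses."""
    p1 = sum(y_values) / len(y_values)
    return 2 * p1 * (1 - p1)

def split_criterion(y_left, y_right):
    """Impurities of the two children, weighted by their sizes."""
    return (len(y_left) * gini_index_binary(y_left)
            + len(y_right) * gini_index_binary(y_right))

# a split that separates the classes perfectly
separating = split_criterion([1, 1, 1, 1], [0, 0, 0, 0])
# a split whose children look just like the parent
uninformative = split_criterion([1, 0, 1, 0], [1, 0, 1, 0])
print(separating, uninformative)
```

The separating split has the smaller value and would be preferred.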

In [33]:
import pandas as pd
mdata=pd.read_csv("mortgage_data.csv")
print(type(mdata))
mdata.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,location,princ,irate,cscore,result
0,suburban,358,7.0,728,default
1,suburban,637,7.25,675,default
2,suburban,303,7.25,645,non-default
3,suburban,397,7.25,609,non-default
4,suburban,420,7.75,669,default


In [34]:
mdata.tail()

Unnamed: 0,location,princ,irate,cscore,result
9859,suburban,769,7.75,586,non-default
9860,suburban,451,7.25,684,non-default
9861,suburban,410,7.0,702,non-default
9862,suburban,851,7.0,774,non-default
9863,suburban,260,7.5,657,default


In [35]:
mdata.shape

(9864, 5)

In [36]:
mdata.loc[4]

location    suburban
princ            420
irate           7.75
cscore           669
result       default
Name: 4, dtype: object

In [37]:
#
# Create a Y variable - Y=1 for default Y=0 for non-default
#
def f(row):
    if row["result"]=="default":
        return(1)
    else:
        return(0)
mdata["Y"]=mdata.apply(f,axis=1)
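The same 0/1 column can also be built without a row-wise `apply`, by comparing the whole column at once. The sketch uses a small made-up data frame `toy`, not the mortgage data.

```python
import pandas as pd

# made-up example with the same "result" labels as the mortgage data
toy = pd.DataFrame({"result": ["default", "non-default", "default"]})

# the comparison gives a boolean column; astype(int) turns True/False into 1/0
toy["Y"] = (toy["result"] == "default").astype(int)
print(toy["Y"].tolist())
```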

In [38]:
mdata

Unnamed: 0,location,princ,irate,cscore,result,Y
0,suburban,358,7.00,728,default,1
1,suburban,637,7.25,675,default,1
2,suburban,303,7.25,645,non-default,0
3,suburban,397,7.25,609,non-default,0
4,suburban,420,7.75,669,default,1
...,...,...,...,...,...,...
9859,suburban,769,7.75,586,non-default,0
9860,suburban,451,7.25,684,non-default,0
9861,suburban,410,7.00,702,non-default,0
9862,suburban,851,7.00,774,non-default,0


In [39]:
mdata["Y"][3]

np.int64(0)

In [40]:
mdata.head()

Unnamed: 0,location,princ,irate,cscore,result,Y
0,suburban,358,7.0,728,default,1
1,suburban,637,7.25,675,default,1
2,suburban,303,7.25,645,non-default,0
3,suburban,397,7.25,609,non-default,0
4,suburban,420,7.75,669,default,1


In [41]:
mdata["location"].value_counts()

location
suburban    5347
urban       2423
rural       2094
Name: count, dtype: int64

**Evaluate quality of a split**

Let's write code to evaluate the quality of a splitting function.

That code should take a pandas data frame and a function as arguments.

If a split would produce a node with fewer observations than some threshold, we return a value so large that the split can never be chosen.

**Example of a splitting function**

Here, we classify a row (an observation) according to whether the **cscore** for that row exceeds some threshold.

In [42]:
def f(row):
    if row["cscore"]>550:
        return("left")
    else:
        return("right")
    

**Gini criterion**

The following function takes a data frame, a function, and a minimum node size as arguments, and calculates the Gini criterion.

If splitting produces a node that has too few observations (fewer than min_node_size), we return a large value so that we will not choose this splitting function. 

In [43]:
def Gini_criterion(df,f,min_node_size):
    #
    # calculate f(row) for every row in the data frame
    # this produces a Pandas series
    #
    fvalue=df.apply(f,axis=1)
    #
    # get the series of Y's for which fvalue is "left" 
    # and the series of Y's for which fvalue is "right"
    #
    Yleft=df["Y"].loc[fvalue=="left"]
    Yright=df["Y"].loc[fvalue=="right"]
    #
    # compute number of obs in each side
    #
    nleft=Yleft.size
    nright=Yright.size
    #
    # if split puts too few values in a node
    # we return a value that makes it so we'd never choose this f
    #
    if nleft<min_node_size or nright<min_node_size:
        return(nleft+nright)
    
    p1left=Yleft.loc[Yleft==1].size/nleft
    p1right=Yright.loc[Yright==1].size/nright
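    #
    # compute the weighted Gini criterion for this split
    # (for binary Y the Gini index is 2*p1*(1-p1); the constant factor 2
    # is dropped here since it does not change which split is best)
    #
    Gini=nleft*p1left*(1-p1left)+nright*p1right*(1-p1right)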
    return(Gini)

In [44]:
ginivalue=Gini_criterion(mdata,f,100)
print(ginivalue)

2282.2216076097498
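Calling a Python function on every row can be slow for large data frames. One possible alternative, sketched here under the assumption that the split can be expressed as a boolean column, computes the same quantity from a mask. The name `gini_criterion_from_mask` and the data frame `toy` are made up for this illustration, and the formula keeps the notebook's convention of $N\,p(1-p)$ per child.

```python
import pandas as pd

def gini_criterion_from_mask(df, goes_left, min_node_size):
    """goes_left is a boolean Series: True for rows sent to the left child."""
    y_left = df.loc[goes_left, "Y"]
    y_right = df.loc[~goes_left, "Y"]
    n_left, n_right = len(y_left), len(y_right)
    # same sentinel as Gini_criterion: a value too large to ever be chosen
    if n_left < min_node_size or n_right < min_node_size:
        return n_left + n_right
    p_left = y_left.mean()      # proportion of Y=1 on the left
    p_right = y_right.mean()    # proportion of Y=1 on the right
    return n_left * p_left * (1 - p_left) + n_right * p_right * (1 - p_right)

toy = pd.DataFrame({"cscore": [500, 520, 600, 640, 700, 710],
                    "Y":      [1,   1,   1,   0,   0,   0]})
value = gini_criterion_from_mask(toy, toy["cscore"] > 550, min_node_size=2)
print(value)
```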


**Another example**

In [45]:
def f(row):
    if row["irate"]>7:
        return("left")
    else:
        return("right")

In [46]:
ginivalue=Gini_criterion(mdata,f,100)
print(ginivalue)

2121.446715049349


**Goal**

We want the child nodes to be as pure as possible. The Gini criterion is lower for this splitting function than for the one above, so this one would be preferred. We can ask for the best possible split based on a continuous variable or a categorical variable.

For a continuous variable $v$ we could try every possible split of the form $v<c$ vs. $v\ge c$, but that might take too long to compute. Instead we try only some quantiles of that variable as thresholds. Below, the 5%, 10%, ..., 95% quantiles are used, but there are other options, e.g. quartiles, deciles, percentiles.

If a split would produce a node with too few values, we return a huge Gini value (one that can't be smaller than the current one).
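When a variable takes only a few distinct values, several quantiles can coincide, so it may be worth removing repeated thresholds before evaluating them. The helper below is a hypothetical sketch of that idea (the name `candidate_thresholds` and the sample values are made up).

```python
import pandas as pd

def candidate_thresholds(values, n_bins):
    """Distinct quantiles at probabilities 1/n_bins, 2/n_bins, ..., (n_bins-1)/n_bins."""
    probs = [i / n_bins for i in range(1, n_bins)]
    # Series.quantile accepts a list of probabilities and returns a Series
    return sorted(set(values.quantile(probs)))

rates = pd.Series([7.0] * 6 + [7.5] * 2 + [8.0])
print(candidate_thresholds(rates, 4))
```

Here the first and second quartiles are both 7.0, so only two thresholds need to be tried.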

In [47]:
def find_best_splitting_function_continuous_variable(data,vname,min_node_size):
    # candidate thresholds: the 5%, 10%, ..., 95% quantiles of the variable
    qvalues=[data[vname].quantile(i/20) for i in range(1,20)]
    minginivalue=data.shape[0] # Gini can't be this big
    bestf=None       # these stay None if no admissible split is found
    bestvalue=None
    for qvalue in qvalues:
        def f(row):
            if row[vname]<qvalue:
                return("left")
            else:
                return("right")
        ginivalue=Gini_criterion(data,f,min_node_size)
        if ginivalue<minginivalue:
            bestf=f
            bestvalue=qvalue
            minginivalue=ginivalue
    #
    # return the best function, the value and its gini value
    #
    return(bestf,bestvalue,minginivalue)  

In [48]:
f,v,g=find_best_splitting_function_continuous_variable(mdata,"cscore",100)
#f,g=find_best_splitting_function_continuous_variable(mdata,"cscore",100)
print(g)

2283.100845967919


**Note:** In this function, we save both the best function and the best threshold.
But isn't the threshold already included in that best function? Let's look at a simplified example of the phenomenon under consideration. Let's find a function that roughly splits a data array at the median.

In [49]:
import numpy as np
def find_best_function(x):
    minvalue=10000000
    #
    # Try 100 evenly spaced thresholds from 0 to 10
    # (step size about 0.1)
    #
    for tau in np.linspace(0,10,100):
        def f(u):
            if u<tau:
                return("left")
            else:
                return("right")
        #
        # determine how well this f performs
        # we want the number left as close as
        # possible to the number right
        #
        y=list(map(f,x))
        nleft=y.count("left")
        nright=y.count("right")
        value=np.abs(nleft-nright)
        if value<minvalue:
            bestf=f
            minvalue=value
    return bestf
#
# Determine best f for an example of a list of numbers
#
x=list(np.random.normal(5,1,25))
f=find_best_function(x)
y=list(map(f,x))
nleft=y.count("left")
nright=y.count("right")
print(nleft)
print(nright)

25
0


This does not do the right thing. Do you see why?

In [50]:
import numpy as np
def find_best_function(x):
    minvalue=10000000
    #
    # Try 100 evenly spaced thresholds from 0 to 10
    # (step size about 0.1)
    #
    for tau in np.linspace(0,10,100):
        def f(u):
            if u<tau:
                return("left")
            else:
                return("right")
        #
        # determine how well this f performs
        # we want the number left as close as
        # possible to the number right
        #
        y=list(map(f,x))
        nleft=y.count("left")
        nright=y.count("right")
        value=np.abs(nleft-nright)
        if value<minvalue:
            besttau=tau
            minvalue=value
    def f(u):
        if u<besttau:
            return("left")
        else:
            return("right")
    return f
#
# Determine best f for an example of a list of numbers
#
x=list(np.random.normal(5,1,25))
f=find_best_function(x)
y=list(map(f,x))
nleft=y.count("left")
nright=y.count("right")
print(nleft)
print(nright)

12
13
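The two versions differ because a function defined inside a loop looks up the loop variable when it is *called*, not when it is defined. In the first version every `f` refers to the same variable `tau`, which holds its last value (10) once the loop has finished, so every observation goes left. The sketch below isolates this behaviour and shows one common alternative, binding the current value through a default argument. The function names are made up for the illustration.

```python
def make_splits_late_binding(thresholds):
    functions = []
    for tau in thresholds:
        def f(u):
            # tau is looked up when f is called, i.e. after the loop has ended
            return "left" if u < tau else "right"
        functions.append(f)
    return functions

def make_splits_bound(thresholds):
    functions = []
    for tau in thresholds:
        def f(u, tau=tau):
            # the default value is evaluated when f is defined,
            # so each f keeps its own threshold
            return "left" if u < tau else "right"
        functions.append(f)
    return functions

late = make_splits_late_binding([1, 5, 10])
bound = make_splits_bound([1, 5, 10])
print([f(3) for f in late])    # every function uses tau=10
print([f(3) for f in bound])   # thresholds 1, 5 and 10 are used in turn
```

Rebuilding the function from the saved threshold after the loop, as in the second version above, is another way to get the intended behaviour. The same consideration applies to any splitting function created inside a loop over thresholds or subsets.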


In [51]:
f,v,g=find_best_splitting_function_continuous_variable(mdata,"irate",100)
print(v)
print(g)

7.0
2108.0970094988857


**Splits for a categorical variable**

We need a function to try all splits of a categorical variable $v$ taking values in a set, say $S=\{1,2,3,\ldots,K\}$.

Here we try splitting on a given subset $T$ of $S$, sending those observations with values of $v$ in $T$ to the left and the others to the right.

We can then iterate over subsets $T$ to find the minimizer of the Gini criterion. Note that iterating over all nonempty proper subsets is less than optimal because every split is tested twice: once when we send observations with values in $T$ to the left and again when we send all observations in the complement of $T$ to the left.

The itertools package is handy for getting all combinations of elements in a list of some size.


In [52]:
import itertools as it
L=list(it.combinations([1,2,3,4],2))
print(L)

[(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]


**Getting all subsets**

We need a list of all ways we can split a list of values into two nonempty pieces. This is straightforward if $n$, the size of the list, is odd: we just need to make a list of all subsets of size $1,2,\ldots,(n-1)/2$. But if $n$ is even, we don't want to check each split into two subsets of size $n/2$ twice (once for the subset and once for its complement).
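A different way to list every split exactly once, shown here only as a cross-check, is to describe each split by the side that does *not* contain the first value. Every split has exactly one such side, so the nonempty subsets of the remaining $n-1$ values give all $2^{n-1}-1$ splits. The function name `splits_fixing_first` is hypothetical.

```python
import itertools as it

def splits_fixing_first(value_list):
    """List each two-way split once, as the piece that does not contain value_list[0]."""
    rest = value_list[1:]
    splits = []
    for size in range(1, len(rest) + 1):
        for comb in it.combinations(rest, size):
            splits.append(list(comb))
    return splits

animals = ["dog", "cat", "bird", "turtle"]
splits = splits_fixing_first(animals)
print(len(splits))   # 2**(4-1) - 1 splits
print(splits)
```

This version works the same way for odd and even $n$, at the cost of listing some larger subsets instead of their smaller complements.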

In [53]:
def find_all_set_splits(value_list):
    splits=[]
    n=len(value_list)
    if n<2:
        return(splits) # fewer than two values: nothing to split
    m=n//2
    #
    # subsets whose complement is strictly larger: sizes 1,...,(n-1)/2
    # if n is odd and sizes 1,...,n/2-1 if n is even
    #
    if 2*m<n:
        maxsize=m
    else:
        maxsize=m-1
    for sz in range(1,maxsize+1):
        combs=it.combinations(value_list,sz)
        for comb in combs:
            splits.append(list(comb))
    if 2*m<n:
        return(splits)
    #
    # even case - add subsets of size n/2, keeping only those that
    # contain the first value so that a subset and its complement
    # are not both listed
    #
    first=value_list[0]
    combs=it.combinations(value_list,m)
    for comb in combs:
        if first in comb:
            splits.append(list(comb))
    return(splits)
    

In [54]:
find_all_set_splits(['dog',"cat","bird"])

[['dog'], ['cat'], ['bird']]

In [55]:
find_all_set_splits(["dog","cat","bird","turtle","fish","gerble"])

[['dog'],
 ['cat'],
 ['bird'],
 ['turtle'],
 ['fish'],
 ['gerble'],
 ['dog', 'cat'],
 ['dog', 'bird'],
 ['dog', 'turtle'],
 ['dog', 'fish'],
 ['dog', 'gerble'],
 ['cat', 'bird'],
 ['cat', 'turtle'],
 ['cat', 'fish'],
 ['cat', 'gerble'],
 ['bird', 'turtle'],
 ['bird', 'fish'],
 ['bird', 'gerble'],
 ['turtle', 'fish'],
 ['turtle', 'gerble'],
 ['fish', 'gerble'],
 ['dog', 'cat', 'bird'],
 ['dog', 'cat', 'turtle'],
 ['dog', 'cat', 'fish'],
 ['dog', 'cat', 'gerble'],
 ['dog', 'bird', 'turtle'],
 ['dog', 'bird', 'fish'],
 ['dog', 'bird', 'gerble'],
 ['dog', 'turtle', 'fish'],
 ['dog', 'turtle', 'gerble'],
 ['dog', 'fish', 'gerble']]

In [56]:
def find_best_splitting_function_categorical_variable(data,vname,min_node_size):
    values=list(data[vname].unique())
    minginivalue=data.shape[0] # Gini can't be this big
    bestf=None       # these stay None if no admissible split is found
    bestsubset=None
    subset_list=find_all_set_splits(values)
    for subset in subset_list:
        def f(row):
            if row[vname] in subset:
                return("left")
            else:
                return("right")
        ginivalue=Gini_criterion(data,f,min_node_size)
        if ginivalue<minginivalue:
            bestf=f
            bestsubset=subset
            minginivalue=ginivalue
    return(bestf,bestsubset,minginivalue)  

In [57]:
bestf,bestsubset,minginivalue=find_best_splitting_function_categorical_variable(mdata,"location",100)
print(bestsubset)
print(minginivalue)

['rural']
2242.2295523521884


**Finding best split using all variables (continuous & categorical)**

Now we can try all continuous *and* categorical variables looking for the best split.

The following function takes a data set, a list of continuous variables, and a list of categorical variables as input and finds the best function to split the data on.

In [58]:
def find_best_split(data,cont_vars,cat_vars,min_node_size):
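    minginivalue=data.shape[0]
    # these stay None if no admissible split is found
    bestf=None
    bestvar=None
    bestvartype=None
    bestvalue=None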
    for catvar in cat_vars:
        f,b,g=find_best_splitting_function_categorical_variable(data,catvar,min_node_size)
        if g<minginivalue:
            minginivalue=g
            bestvar=catvar
            bestvartype="categorical"
            bestvalue=b
            bestf=f
    for contvar in cont_vars:
        f,b,g=find_best_splitting_function_continuous_variable(data,contvar,min_node_size)
        if g<minginivalue:
            minginivalue=g
            bestvar=contvar
            bestvartype="continuous"
            bestvalue=b
            bestf=f
    return bestf,bestvar,bestvartype,bestvalue,minginivalue
find_best_split(mdata,["irate","cscore","princ"],["location"],100)

(<function __main__.find_best_splitting_function_continuous_variable.<locals>.f(row)>,
 'irate',
 'continuous',
 np.float64(7.0),
 2108.0970094988857)

**Build tree recursively**

We need a function that builds a tree by starting at the root and recursively splitting each node until a stopping rule kicks in.

To keep things simple, we'll stop splitting if a node has fewer than min_node_size observations (500 in the run below) or if no admissible split reduces the Gini criterion.

Each time we split, we attach a data frame data["df"] to each new node.

As we go along, we'll attach the counts of Y=0 and Y=1 to each node.
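Before looking at the full implementation, the shape of the recursion can be seen in a toy version with a single continuous predictor. This is a simplified sketch and not the code used below: nodes are plain dictionaries, every observed value is tried as a threshold, and the name `build_toy_tree` is hypothetical.

```python
def build_toy_tree(rows, min_node_size, label=""):
    """rows is a list of (x, y) pairs with y in {0, 1}; returns a nested dictionary."""
    n = len(rows)
    n1 = sum(y for _, y in rows)
    node = {"label": label, "n": n, "p1": n1 / n,
            "threshold": None, "left": None, "right": None}

    def impurity(part):
        # N * p * (1 - p) for one piece of the data
        k = len(part)
        p = sum(y for _, y in part) / k
        return k * p * (1 - p)

    # a split is accepted only if it improves on the impurity of the current node
    best_value, best_c = impurity(rows), None
    for c in sorted({x for x, _ in rows}):
        left = [r for r in rows if r[0] < c]
        right = [r for r in rows if r[0] >= c]
        if len(left) < min_node_size or len(right) < min_node_size:
            continue
        value = impurity(left) + impurity(right)
        if value < best_value:
            best_value, best_c = value, c

    if best_c is None:
        return node                       # no admissible split: this node is a leaf

    node["threshold"] = best_c
    node["left"] = build_toy_tree([r for r in rows if r[0] < best_c],
                                  min_node_size, label + "L")
    node["right"] = build_toy_tree([r for r in rows if r[0] >= best_c],
                                   min_node_size, label + "R")
    return node

rows = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]
tree = build_toy_tree(rows, min_node_size=2)
print(tree["threshold"], tree["left"]["p1"], tree["right"]["p1"])
```

On this data the root is split at 4, and both children are pure, so they become leaves.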

In [59]:
import numpy as np
import pandas as pd
import itertools as it
def Gini_criterion(df,f,min_node_size):
    #
    # calculate f(row) for every row in the data frame
    # this produces a Pandas series
    #
    fvalue=df.apply(f,axis=1)
    #
    # get the series of Y's for which fvalue is "left" 
    # and the series of Y's for which fvalue is "right"
    #
    Yleft=df["Y"].loc[fvalue=="left"]
    Yright=df["Y"].loc[fvalue=="right"]
    #
    # compute number of obs in each side
    #
    nleft=Yleft.size
    nright=Yright.size
    #
    # if split puts too few values in a node
    # we return a value that makes it so we'd never choose this f
    #
    if nleft<min_node_size or nright<min_node_size:
        return(nleft+nright)
    
    p1left=Yleft.loc[Yleft==1].size/nleft
    p1right=Yright.loc[Yright==1].size/nright
    #
    # compute the Gini coefficient
    #
    Gini=Yleft.size*p1left*(1-p1left)+Yright.size*p1right*(1-p1right)
    return(Gini)
def find_best_splitting_function_continuous_variable(data,vname,min_node_size):
    qvalues=[data[vname].quantile(i/4) for i in range(1,4)]
    minginivalue=data.shape[0] # Gini can't be this big
    bestf=None
    bestvalue=None
    for qvalue in qvalues:
        def f(row):
            if row[vname]<qvalue:
                return("left")
            else:
                return("right")
        ginivalue=Gini_criterion(data,f,min_node_size)
        if ginivalue<minginivalue:
            bestf=f
            bestvalue=qvalue
            minginivalue=ginivalue
    #
    # return the best function, the value and its gini value
    #
    return(bestf,bestvalue,minginivalue)  


def find_all_set_splits(value_list):
    splits=[]
    n=len(value_list)
    if n<2:
        return(splits) # fewer than two values: nothing to split
    m=n//2
    #
    # subsets whose complement is strictly larger: sizes 1,...,(n-1)/2
    # if n is odd and sizes 1,...,n/2-1 if n is even
    #
    if 2*m<n:
        maxsize=m
    else:
        maxsize=m-1
    for sz in range(1,maxsize+1):
        combs=it.combinations(value_list,sz)
        for comb in combs:
            splits.append(list(comb))
    if 2*m<n:
        return(splits)
    #
    # even case - add subsets of size n/2, keeping only those that
    # contain the first value so that a subset and its complement
    # are not both listed
    #
    first=value_list[0]
    combs=it.combinations(value_list,m)
    for comb in combs:
        if first in comb:
            splits.append(list(comb))
    return(splits)
    
def find_best_splitting_function_categorical_variable(data,vname,min_node_size):
    values=list(data[vname].unique())
    nvalues=len(values)
    minginivalue=data.shape[0] # Gini can't be this big
    subset_list=find_all_set_splits(values)
    bestf=None
    bestsubset=None
    for subset in subset_list:
        def f(row):
            if row[vname] in subset:
                return("left")
            else:
                return("right")
        ginivalue=Gini_criterion(data,f,min_node_size)
        if ginivalue<minginivalue:
            bestf=f
            bestsubset=subset
            minginivalue=ginivalue
    return(bestf,bestsubset,minginivalue)  

def find_best_split(data,cont_vars,cat_vars,min_node_size):
    minginivalue=data.shape[0]
    bestf=None
    bestvar=None
    bestvartype=None
    bestvalue=None
    for catvar in cat_vars:
        f,b,g=find_best_splitting_function_categorical_variable(data,catvar,min_node_size)
        if g<minginivalue:
            minginivalue=g
            bestvar=catvar
            bestvartype="categorical"
            bestvalue=b
            bestf=f
    for contvar in cont_vars:
        f,b,g=find_best_splitting_function_continuous_variable(data,contvar,min_node_size)
        if g<minginivalue:
            minginivalue=g
            bestvar=contvar
            bestvartype="continuous"
            bestvalue=b
            bestf=f
    return bestf,bestvar,bestvartype,bestvalue,minginivalue

class node:
    __slots__=('parent','left_child','right_child','data')
    #
    # We instantiate a node by passing a parent (which can be None) 
    # and a dictionary
    #
    def __init__(self,parent,data):
        if parent==None:
            # making this a root node
            self.data=data
            self.data["depth"]=0
            self.parent=None
        else:
            # making this a non-root node
            self.data=data
            self.data["depth"]=parent.data["depth"]+1
            self.parent=parent
        self.left_child=None
        self.right_child=None
    def get_parent(self): # return the node's parent
        return(self.parent)
    def get_data(self):   # return the node's data
        return(self.data)
    def get_depth(self):  # return the node's depth
        return(self.data["depth"])
    def get_label(self):
        return(self.data["label"])
    def set_label(self,label):
        self.data["label"]=label
    def get_left_child(self):
        return(self.left_child)
    def get_right_child(self):
        return(self.right_child)
    def spawn_left_child(self,data):
        # create a new node n with self as parent w/ given data
        # (the constructor sets n's depth from the parent's depth)
        n=node(parent=self,data=data)
        self.left_child=n
        return(n)
    def spawn_right_child(self,data):
        # same as spawn_left_child, but attach n as the right child
        n=node(parent=self,data=data)
        self.right_child=n
        return(n)
    #
    # string consisting of information about node
    #
    def __str__(self):
        s="node label = "+self.data["label"]+"\n"
        if self.parent==None:
            s+="   no parent i.e. root node\n"
        else:
            s+="   parent label = " + self.parent.data["label"]+"\n"
        if self.left_child==None:
            s+="   no left child\n"
        else:
            s+="   left child label " + self.left_child.data["label"]+"\n"
        if self.right_child==None:
            s+="   no right child\n"
        else:
            s+="   right child label " + self.right_child.data["label"]+"\n"
        return(s)
    def treestr(self):
        d=self.data
        depth=d["depth"]
        G=d["gini"]
        Gstring="G: {:8.2f} ".format(G)
        Y0=d["Ycts"][0]
        Y1=d["Ycts"][1]
        Ycts_string="N: "+str(Y0+Y1)+" N0: "+str(Y0)+" "+" N1:"+str(Y1)+"\n"
        p0=Y0/(Y0+Y1)
        p1=Y1/(Y0+Y1)
        pstring="p0: {:5.4f} p1: {:5.4f}\n".format(p0,p1)
        spaces="".join(["  " for i in range(depth)])
        s=spaces+d["label"]+"\n"
        s+=spaces+Gstring+Ycts_string
        s+=spaces+pstring
        #
        # if this node has a split, include info about it
        #
        if "splitinfo" in self.data:
            splitinfo=self.data["splitinfo"]
        
        
        
        if self.left_child!=None:
            s+=self.left_child.treestr()
            s+=self.right_child.treestr()
        return(s)
    def treeprint(self):
        s=self.treestr()
        print(s)

**Recursive split node function**

In [60]:
def split_node(cnode,contvars,catvars,min_node_size):  
    cdf=cnode.data["df"]
    
    # compute Y counts in this node and store them
    N0=np.sum(1-cdf["Y"])
    N1=np.sum(cdf["Y"])
    cnode.data["Ycts"]=[N0,N1]
    
    #
    # Gini for a node is N*p*(1-p) where p is the proportion of 1's,
    # so this equals (N0+N1)*(N0/(N0+N1))*(N1/(N0+N1)) = N0*N1/(N0+N1)
    #
    cnode.data["gini"]=N0*N1/(N0+N1)
    
    if cnode.data["df"].shape[0]>=min_node_size:
        print("new node to try splitting: "+cnode.data["label"]+" size= "+str(cnode.data["df"].shape[0]))
        
        # find best split
        f,v,vtype,value,g=find_best_split(cnode.data["df"],contvars,catvars,min_node_size)
        
        #
        # if the split leads to a bigger gini, we don't split the node
        # so compare to gini at current node
        #
        if g>=cnode.data["gini"]:
            print("node is not split since gini not reduced")
        else:
            
            # determine which rows of current data frame go left and which go right
            child_assignment=cnode.data["df"].apply(f,axis=1)
        
            # compute counts of child nodes if we split
            nleft=np.sum(child_assignment=="left")
            nright=np.sum(child_assignment=="right")
            
            if nleft<min_node_size or nright<min_node_size:
                print("node is not split because of minimum node size constraint")
                
            else:
                
                # attach splitting function to data at this node
                splitinfo={"f":f, "vname": v, "vtype":vtype, "value":value}
                cnode.data["splitinfo"]=splitinfo
                     
                print("splitting node into sizes "+str(nleft)+" "+str(nright))
                # compute data frames to put at child nodes
                dfleft=cnode.data["df"].loc[child_assignment=="left"].copy()
                dfright=cnode.data["df"].loc[child_assignment=="right"].copy()
       
                # replace data frame indices by range
                dfleft.index=range(dfleft.shape[0])
                dfright.index=range(dfright.shape[0])
    
                # create a label 
                dataleft={"df":dfleft,"label":cnode.data["label"]+"L"}
                dataright={"df":dfright,"label":cnode.data["label"]+"R"}
                        
                # create child nodes 
                left_child=cnode.spawn_left_child(dataleft)
                right_child=cnode.spawn_right_child(dataright)
               
            
                # split child nodes
                split_node(left_child,contvars,catvars,min_node_size)
                split_node(right_child,contvars,catvars,min_node_size)

mdata=pd.read_csv("mortgage_data.csv")
#
# Create a Y variable - Y=1 for default Y=0 for non-default
#
def f(row):
    if row["result"]=="default":
        return(1)
    else:
        return(0)
mdata["Y"]=mdata.apply(f,axis=1)
rootnode=node(None,{"df":mdata,"label":""})
split_node(rootnode,["irate","cscore","princ"],["location"],500)

new node to try splitting:  size= 9864
splitting node into sizes 6781 3083
new node to try splitting: L size= 6781
splitting node into sizes 2981 3800
new node to try splitting: LL size= 2981
splitting node into sizes 627 2354
new node to try splitting: LLL size= 627
node is not split since gini not reduced
new node to try splitting: LLR size= 2354
splitting node into sizes 637 1717
new node to try splitting: LLRL size= 637
node is not split since gini not reduced
new node to try splitting: LLRR size= 1717
node is not split because of minimum node size constraint
new node to try splitting: LR size= 3800
splitting node into sizes 2850 950
new node to try splitting: LRL size= 2850
splitting node into sizes 2133 717
new node to try splitting: LRLL size= 2133
splitting node into sizes 661 1472
new node to try splitting: LRLLL size= 661
node is not split since gini not reduced
new node to try splitting: LRLLR size= 1472
node is not split because of minimum node size constraint
new node to t

In [61]:
rootnode.treeprint()


G:  2307.60 N: 9864 N0: 3682  N1:6182
p0: 0.3733 p1: 0.6267
  L
  G:  1644.35 N: 6781 N0: 2803  N1:3978
  p0: 0.4134 p1: 0.5866
    LL
    G:   725.20 N: 2981 N0: 1735  N1:1246
    p0: 0.5820 p1: 0.4180
      LLL
      G:    38.32 N: 627 N0: 586  N1:41
      p0: 0.9346 p1: 0.0654
      LLR
      G:   588.17 N: 2354 N0: 1149  N1:1205
      p0: 0.4881 p1: 0.5119
        LLRL
        G:   126.02 N: 637 N0: 464  N1:173
        p0: 0.7284 p1: 0.2716
        LLRR
        G:   411.72 N: 1717 N0: 685  N1:1032
        p0: 0.3990 p1: 0.6010
    LR
    G:   767.84 N: 3800 N0: 1068  N1:2732
    p0: 0.2811 p1: 0.7189
      LRL
      G:   648.22 N: 2850 N0: 997  N1:1853
      p0: 0.3498 p1: 0.6502
        LRLL
        G:   514.59 N: 2133 N0: 867  N1:1266
        p0: 0.4065 p1: 0.5935
          LRLLL
          G:   124.81 N: 661 N0: 167  N1:494
          p0: 0.2526 p1: 0.7474
          LRLLR
          G:   367.12 N: 1472 N0: 700  N1:772
          p0: 0.4755 p1: 0.5245
        LRLR
        G:   106.4

**Make the split node function a class method**

That function has been renamed to build_tree.

In [62]:
import numpy as np
import pandas as pd
import itertools as it
def Gini_criterion(df,f,min_node_size):
    #
    # calculate f(row) for every row in the data frame
    # this produces a Pandas series
    #
    fvalue=df.apply(f,axis=1)
    #
    # get the series of Y's for which fvalue is "left" 
    # and the series of Y's for which fvalue is "right"
    #
    Yleft=df["Y"].loc[fvalue=="left"]
    Yright=df["Y"].loc[fvalue=="right"]
    #
    # compute number of obs in each side
    #
    nleft=Yleft.size
    nright=Yright.size
    #
    # if split puts too few values in a node
    # we return a value that makes it so we'd never choose this f
    #
    if nleft<min_node_size or nright<min_node_size:
        return(nleft+nright)
    
    p1left=Yleft.loc[Yleft==1].size/nleft
    p1right=Yright.loc[Yright==1].size/nright
    #
    # compute the Gini coefficient
    #
    Gini=Yleft.size*p1left*(1-p1left)+Yright.size*p1right*(1-p1right)
    return(Gini)
def find_best_splitting_function_continuous_variable(data,vname,min_node_size):
    qvalues=[data[vname].quantile(i/4) for i in range(1,4)]
    minginivalue=data.shape[0] # Gini can't be this big
    bestf=None
    bestvalue=None
    for qvalue in qvalues:
        def f(row):
            if row[vname]<qvalue:
                return("left")
            else:
                return("right")
        ginivalue=Gini_criterion(data,f,min_node_size)
        if ginivalue<minginivalue:
            bestf=f
            bestvalue=qvalue
            minginivalue=ginivalue
    #
    # return the best function, the value and its gini value
    #
    return(bestf,bestvalue,minginivalue)  


def find_all_set_splits(value_list):
    splits=[]
    n=len(value_list)
    if n<2:
        return(splits) # fewer than two values: nothing to split
    m=n//2
    #
    # subsets whose complement is strictly larger: sizes 1,...,(n-1)/2
    # if n is odd and sizes 1,...,n/2-1 if n is even
    #
    if 2*m<n:
        maxsize=m
    else:
        maxsize=m-1
    for sz in range(1,maxsize+1):
        combs=it.combinations(value_list,sz)
        for comb in combs:
            splits.append(list(comb))
    if 2*m<n:
        return(splits)
    #
    # even case - add subsets of size n/2, keeping only those that
    # contain the first value so that a subset and its complement
    # are not both listed
    #
    first=value_list[0]
    combs=it.combinations(value_list,m)
    for comb in combs:
        if first in comb:
            splits.append(list(comb))
    return(splits)
    
def find_best_splitting_function_categorical_variable(data,vname,min_node_size):
    values=list(data[vname].unique())
    nvalues=len(values)
    minginivalue=data.shape[0] # Gini can't be this big
    subset_list=find_all_set_splits(values)
    bestf=None
    bestsubset=None
    for subset in subset_list:
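        # bind the current subset as a default argument; otherwise every f
        # would see the last subset of the loop when it is called later
        def f(row,subset=subset):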
            if row[vname] in subset:
                return("left")
            else:
                return("right")
        ginivalue=Gini_criterion(data,f,min_node_size)
        if ginivalue<minginivalue:
            bestf=f
            bestsubset=subset
            minginivalue=ginivalue
    return(bestf,bestsubset,minginivalue)  

def find_best_split(data,cont_vars,cat_vars,min_node_size):
    minginivalue=data.shape[0]
    bestf=None
    bestvar=None
    bestvartype=None
    bestvalue=None
    for catvar in cat_vars:
        f,b,g=find_best_splitting_function_categorical_variable(data,catvar,min_node_size)
        if g<minginivalue:
            minginivalue=g
            bestvar=catvar
            bestvartype="categorical"
            bestvalue=b
            bestf=f
    for contvar in cont_vars:
        f,b,g=find_best_splitting_function_continuous_variable(data,contvar,min_node_size)
        if g<minginivalue:
            minginivalue=g
            bestvar=contvar
            bestvartype="continuous"
            bestvalue=b
            bestf=f
    return bestf,bestvar,bestvartype,bestvalue,minginivalue

class node:
    __slots__=('parent','left_child','right_child','data')
    #
    # We instantiate a node by passing a parent (which can be None) 
    # and a dictionary
    #
    def __init__(self,parent,data):
        if parent==None:
            # making this a root node
            self.data=data
            self.data["depth"]=0
            self.parent=None
        else:
            # making this a non-root node
            self.data=data
            self.data["depth"]=parent.data["depth"]+1
            self.parent=parent
        self.left_child=None
        self.right_child=None
    def get_parent(self): # return the node's parent
        return(self.parent)
    def get_data(self):   # return the node's data
        return(self.data)
    def get_depth(self):  # return the node's depth
        return(self.data["depth"])
    def get_label(self):
        return(self.data["label"])
    def set_label(self,label):
        self.data["label"]=label
    def get_left_child(self):
        return(self.left_child)
    def get_right_child(self):
        return(self.right_child)
    def spawn_left_child(self,data):
        # create a new node n with self as parent w/ given data;
        # the constructor stores data and sets the depth to self's depth + 1
        n=node(parent=self,data=data)
        self.left_child=n
        return(n)
    def spawn_right_child(self,data):
        # same as spawn_left_child, but attach n as the right child
        n=node(parent=self,data=data)
        self.right_child=n
        return(n)
    #
    # string consisting of information about node
    #
    def __str__(self):
        s="node label = "+self.data["label"]+"\n"
        if self.parent==None:
            s+="   no parent i.e. root node\n"
        else:
            s+="   parent label = " + self.parent.data["label"]+"\n"
        if self.left_child==None:
            s+="   no left child\n"
        else:
            s+="   left child label " + self.left_child.data["label"]+"\n"
        if self.right_child==None:
            s+="   no right child\n"
        else:
            s+="   right child label " + self.right_child.data["label"]+"\n"
        return(s)
    def treestr(self):
        d=self.data
        depth=d["depth"]
        G=d["gini"]
        Gstring="G: {:8.2f} ".format(G)
        Y0=d["Ycts"][0]
        Y1=d["Ycts"][1]
        Ycts_string="N: "+str(Y0+Y1)+" N0: "+str(Y0)+" "+" N1:"+str(Y1)+"\n"
        p0=Y0/(Y0+Y1)
        p1=Y1/(Y0+Y1)
        pstring="p0: {:5.4f} p1: {:5.4f}\n".format(p0,p1)
        spaces="".join(["  " for i in range(depth)])
        s=spaces+d["label"]+"\n"
        s+=spaces+Gstring+Ycts_string
        s+=spaces+pstring
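        #
        # if this node has a split, retrieve info about it
        # (the split details are not added to the string here)
        #
        if "splitinfo" in self.data:
            splitinfo=self.data["splitinfo"]
        #
        # recurse into the children; a split always creates both of them
        #
        if self.left_child!=None: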
            s+=self.left_child.treestr()
            s+=self.right_child.treestr()
        return(s)
    def treeprint(self):
        s=self.treestr()
        print(s)
    def build_tree(self,contvars,catvars,min_node_size):  
        cdf=self.data["df"]
    
        # compute Y counts in this node and store them
        N0=np.sum(1-cdf["Y"])
        N1=np.sum(cdf["Y"])
        self.data["Ycts"]=[N0,N1]
    
        #
        # Gini for a node is N*p*(1-p) where p is the proportion of 1's,
        # so this equals (N0+N1)*(N0/(N0+N1))*(N1/(N0+N1)) = N0*N1/(N0+N1)
        #
        self.data["gini"]=N0*N1/(N0+N1)
    
        if self.data["df"].shape[0]>=min_node_size:
            print("new node to try splitting: "+self.data["label"]+" size= "+str(self.data["df"].shape[0]))
        
            # find best split
            f,v,vtype,value,g=find_best_split(self.data["df"],contvars,catvars,min_node_size)
        
            #
            # if the best split does not reduce the gini, we don't split the node,
            # so compare to the gini at the current node
            #
            if g>=self.data["gini"]:
                print("node is not split since gini not reduced")
            else:
            
                # determine which rows of current data frame go left and which go right
                child_assignment=self.data["df"].apply(f,axis=1)
        
                # compute counts of child nodes if we split
                nleft=np.sum(child_assignment=="left")
                nright=np.sum(child_assignment=="right")
            
                if nleft<min_node_size or nright<min_node_size:
                    print("node is not split because of minimum node size constraint")
                
                else:
                
                    # attach splitting function to data at this node
                    splitinfo={"f":f, "vname": v, "vtype":vtype, "value":value}
                    self.data["splitinfo"]=splitinfo
                     
                    print("splitting node into sizes "+str(nleft)+" "+str(nright))
                    # compute data frames to put at child nodes
                    dfleft=self.data["df"].loc[child_assignment=="left"].copy()
                    dfright=self.data["df"].loc[child_assignment=="right"].copy()
       
                    # replace data frame indices by range
                    dfleft.index=range(dfleft.shape[0])
                    dfright.index=range(dfright.shape[0])
    
                    # create a label 
                    dataleft={"df":dfleft,"label":self.data["label"]+"L"}
                    dataright={"df":dfright,"label":self.data["label"]+"R"}
                        
                    # create child nodes 
                    left_child=self.spawn_left_child(dataleft)
                    right_child=self.spawn_right_child(dataright)
               
            
                    # split child nodes
                    left_child.build_tree(contvars,catvars,min_node_size)
                    right_child.build_tree(contvars,catvars,min_node_size)
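
The Gini bookkeeping used by `Gini_criterion` and `build_tree` can be checked by hand on a toy node. The following is a minimal sketch, separate from the code above: `node_gini` and `split_gini` are hypothetical helper names, and the class counts are made up for illustration. It shows that a split is only worthwhile when the sum of the child values is smaller than the parent's value, which is the comparison `build_tree` makes before splitting.

```python
def node_gini(n0, n1):
    """Gini value N*p*(1-p) of a node with n0 zeros and n1 ones.

    With p = n1/N and N = n0+n1 this simplifies to n0*n1/N.
    """
    return n0 * n1 / (n0 + n1)


def split_gini(left, right):
    """Gini value of a split: the sum of the two child-node values.

    Each argument is an (n0, n1) pair of class counts.
    """
    return node_gini(*left) + node_gini(*right)


# made-up parent node: 6 zeros and 4 ones
parent = node_gini(6, 4)

# a split that separates the classes fairly well
informative = split_gini((5, 1), (1, 3))

# a split whose children have the same class proportions as the parent
uninformative = split_gini((3, 2), (3, 2))

print(parent, informative, uninformative)
print(informative < parent)      # True: this split reduces the Gini value
print(uninformative >= parent)   # True: "gini not reduced", so no split
```

When both children have the same proportion of 1's as the parent, the child values add up to exactly the parent's value, so such a split is rejected.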

**Test the method**

We create a root node, attach a data frame to it, and call the build_tree method at this node.

In [63]:
mdata=pd.read_csv("mortgage_data.csv")
#
# Create a Y variable - Y=1 for default Y=0 for non-default
#
def f(row):
    if row["result"]=="default":
        return(1)
    else:
        return(0)
mdata["Y"]=mdata.apply(f,axis=1)
rootnode=node(None,{"df":mdata,"label":""})
rootnode.build_tree(["irate","cscore","princ"],["location"],500)

new node to try splitting:  size= 9864
splitting node into sizes 6781 3083
new node to try splitting: L size= 6781
splitting node into sizes 2981 3800
new node to try splitting: LL size= 2981
splitting node into sizes 627 2354
new node to try splitting: LLL size= 627
node is not split since gini not reduced
new node to try splitting: LLR size= 2354
splitting node into sizes 637 1717
new node to try splitting: LLRL size= 637
node is not split since gini not reduced
new node to try splitting: LLRR size= 1717
node is not split because of minimum node size constraint
new node to try splitting: LR size= 3800
splitting node into sizes 2850 950
new node to try splitting: LRL size= 2850
splitting node into sizes 2133 717
new node to try splitting: LRLL size= 2133
splitting node into sizes 661 1472
new node to try splitting: LRLLL size= 661
node is not split since gini not reduced
new node to try splitting: LRLLR size= 1472
node is not split because of minimum node size constraint
new node to t

In [64]:
rootnode.treeprint()


G:  2307.60 N: 9864 N0: 3682  N1:6182
p0: 0.3733 p1: 0.6267
  L
  G:  1644.35 N: 6781 N0: 2803  N1:3978
  p0: 0.4134 p1: 0.5866
    LL
    G:   725.20 N: 2981 N0: 1735  N1:1246
    p0: 0.5820 p1: 0.4180
      LLL
      G:    38.32 N: 627 N0: 586  N1:41
      p0: 0.9346 p1: 0.0654
      LLR
      G:   588.17 N: 2354 N0: 1149  N1:1205
      p0: 0.4881 p1: 0.5119
        LLRL
        G:   126.02 N: 637 N0: 464  N1:173
        p0: 0.7284 p1: 0.2716
        LLRR
        G:   411.72 N: 1717 N0: 685  N1:1032
        p0: 0.3990 p1: 0.6010
    LR
    G:   767.84 N: 3800 N0: 1068  N1:2732
    p0: 0.2811 p1: 0.7189
      LRL
      G:   648.22 N: 2850 N0: 997  N1:1853
      p0: 0.3498 p1: 0.6502
        LRLL
        G:   514.59 N: 2133 N0: 867  N1:1266
        p0: 0.4065 p1: 0.5935
          LRLLL
          G:   124.81 N: 661 N0: 167  N1:494
          p0: 0.2526 p1: 0.7474
          LRLLR
          G:   367.12 N: 1472 N0: 700  N1:772
          p0: 0.4755 p1: 0.5245
        LRLR
        G:   106.4