**Binary Trees**

To illustrate an application involving classes, we introduce binary trees. These are the building blocks of _decision trees_, which are widely used for classification.

A binary tree consists of a unique root node and optional additional nodes with the property that

- every node has at most two child nodes, which are referred to as its left child and its right child
- every non-root node has a unique parent node
- every non-root node has the root node as an ancestor (parent of a parent of a parent ...)
- every node has optional additional data associated with it, stored in a dictionary. E.g. one item in the dictionary could be that node's depth (the number of ancestors on the path up to the root node).
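
The ancestor and depth properties can be made concrete with a minimal sketch. The `TinyNode` class and the `depth` and `root_of` helpers below are hypothetical names used only for illustration (they are not the node class built later in this section), and they assume each node stores nothing but a reference to its parent.

```python
class TinyNode:
    """Hypothetical minimal node: stores only a reference to its parent."""
    def __init__(self, parent=None):
        self.parent = parent

def depth(n):
    """Count the ancestors between n and the root (the root has depth 0)."""
    d = 0
    while n.parent is not None:
        n = n.parent
        d += 1
    return d

def root_of(n):
    """Follow parent links upward until reaching the node with no parent."""
    while n.parent is not None:
        n = n.parent
    return n

root = TinyNode()
child = TinyNode(parent=root)
grandchild = TinyNode(parent=child)
print(depth(root), depth(child), depth(grandchild))  # 0 1 2
print(root_of(grandchild) is root)                   # True
```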

To build a binary tree, we start by creating a node class. An instance of a node has the following:

- parent - a node if this node is not a root node and None if this is a root node
- right child node
- left child node
- data associated with the node 

We want node methods to provide the following capabilities.

- Create a root (parentless) node and add optional data to it.
- Create a node with some parent and add optional data to it.
- Retrieve the data associated with a node
- Assign data to a node
- Get the left child associated with a node if there is one
- Get the right child associated with a node if there is one
- Spawn a left child of a given node
- Spawn a right child of a given node

The data we associate with a node can be quite general. We'll use a dictionary at each node and in the code below, we'll store the depth of each node (depth is 0 for the root node, 1 for its children, 2 for its grandchildren etc.) and a label for each node.

In [2]:
class node:
    __slots__=('parent','left_child','right_child','data')
    #
    # We instantiate a node by passing a parent (which can be None) 
    # and an optional dictionary called data to store data at that node.
    #
    def __init__(self,parent,data=None):
        # copy the dictionary: a default of data={} would be shared by every
        # node created without data, and we should not modify the caller's dict
        self.data=dict(data) if data is not None else {}
        if parent is None:
            # making this a root node
            self.data["depth"]=0
        else:
            # making this a non-root node
            self.data["depth"]=parent.data["depth"]+1
        self.parent=parent
        self.left_child=None
        self.right_child=None
    def spawn_left_child(self,data=None):
        # create a new node n with self as parent w/ given data;
        # the constructor sets n's depth
        n=node(parent=self,data=data)
        self.left_child=n
        return(n)
    def spawn_right_child(self,data=None):
        # same as spawn_left_child, but n becomes the right child
        n=node(parent=self,data=data)
        self.right_child=n
        return(n)
    #
    # string consisting of information about node
    # (assumes every node has a "label" key in its data)
    #
    def __str__(self):
        s="node label = "+self.data["label"]+"\n"
        if self.parent is None:
            s+="   no parent i.e. root node\n"
        else:
            s+="   parent label = " + self.parent.data["label"]+"\n"
        if self.left_child is None:
            s+="   no left child\n"
        else:
            s+="   left child label " + self.left_child.data["label"]+"\n"
        if self.right_child is None:
            s+="   no right child\n"
        else:
            s+="   right child label " + self.right_child.data["label"]+"\n"
        
        return(s)
    
    
    
rootnode=node(parent=None,data={"label":"0:mother of all nodes"})
print("parent of root node = "+str(rootnode.parent))
print(rootnode)

node1=rootnode.spawn_left_child(data={"label":"1:daughter of mom of all nodes"})
node2=rootnode.spawn_right_child(data={"label":"1:son of mom of all nodes"})
print(node1)
print(rootnode)

parent of root node = None
node label = 0:mother of all nodes
   no parent i.e. root node
   no left child
   no right child

node label = 1:daughter of mom of all nodes
   parent label = 0:mother of all nodes
   no left child
   no right child

node label = 0:mother of all nodes
   no parent i.e. root node
   left child label 1:daughter of mom of all nodes
   right child label 1:son of mom of all nodes
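
One Python detail matters whenever a method takes a dictionary as an optional argument: a default value written as `data={}` is created once, when the function is defined, and is then shared by every call that omits the argument. The sketch below uses two hypothetical helper functions, `tag_shared` and `tag_fresh` (neither is part of the node class), to show the difference between a shared default and a fresh dictionary per call.

```python
def tag_shared(key, data={}):
    # the same default dict object is reused on every call
    data[key] = True
    return data

def tag_fresh(key, data=None):
    # build a new dict per call; copy the caller's dict if one is given
    data = dict(data) if data is not None else {}
    data[key] = True
    return data

a = tag_shared("first")
b = tag_shared("second")
print(a is b)          # True: both names refer to the one shared dict
print(sorted(a))       # ['first', 'second']

c = tag_fresh("first")
d = tag_fresh("second")
print(c is d)          # False
print(sorted(c))       # ['first']
```

For a tree, the shared default would mean that two nodes created without data end up holding the same dictionary, so setting the depth of one overwrites the depth of the other.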



**Build a tree**

In [3]:
rootnode=node(parent=None,data={"label":"TD = Top dog"})
node1=rootnode.spawn_left_child(data={"label":"DTD = daughter of Top Dog"})
node2=rootnode.spawn_right_child(data={"label":"STD = son of Top Dog"})
node11=node1.spawn_left_child(data={"label":"DDTD"})
node12=node1.spawn_right_child(data={"label":"SDTD"})
node21=node2.spawn_left_child(data={"label":"DSTD"})
node22=node2.spawn_right_child(data={"label":"SSTD"})
node211=node21.spawn_left_child(data={"label":"DDSTD"})
node2111=node211.spawn_left_child(data={"label":"DDDSTD"})
node2112=node211.spawn_right_child(data={"label":"SDDSTD"})
node212=node21.spawn_right_child(data={"label":"SDSTD"})

In [10]:
print(node2111)
print(node2111.parent)

node label = DDDSTD
   parent label = DDSTD
   no left child
   no right child

node label = DDSTD
   parent label = DSTD
   left child label DDDSTD
   right child label SDDSTD



In [11]:
node2111.parent.data["label"]

'DDSTD'

In [13]:
node2111.data

{'label': 'DDDSTD', 'depth': 4}

**Traverse the tree - depth first**

Once we have created a binary tree, we can recursively traverse it. The following code prints out the label of each node (it assumes we have a label key for every node in our tree).

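A key capability utilized here is that a function can call itself.

To build the indentation, we'll use the string method join, which concatenates a list of strings, placing a separator between consecutive items.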

In [5]:
":::".join(["cat","bird","dog","turtle"])

'cat:::bird:::dog:::turtle'

In [1]:
def node_string(node):
    # create string of spaces with size = depth of node
    spaces="".join([" " for i in range(node.data["depth"])])
    # this node's label on its own line, indented by its depth
    s=spaces+node.data["label"]+"\n"
    # recursively append the strings for the left and right subtrees
    if node.left_child is not None:
        s+=node_string(node.left_child)
    if node.right_child is not None:
        s+=node_string(node.right_child)
    return(s)

In [82]:
"   ".join(["cat","bird","dog","turtle"])

'cat   bird   dog   turtle'

When we print the node_string of the root node, we get a label for every node in the entire tree, and the indentation shows the depth of each node.

In [7]:
print(node_string(rootnode))

TD = Top dog
 DTD = daughter of Top Dog
  DDTD
  SDTD
 STD = son of Top Dog
  DSTD
   DDSTD
    DDDSTD
    SDDSTD
   SDSTD
  SSTD



Our function works for any node: applied to a non-root node, it returns the string for the subtree rooted at that node.

In [8]:
print(node_string(rootnode.left_child))
print(node_string(rootnode.right_child))

 DTD = daughter of Top Dog
  DDTD
  SDTD

 STD = son of Top Dog
  DSTD
   DDSTD
    DDDSTD
    SDDSTD
   SDSTD
  SSTD
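
The same recursive pattern (handle the current node, then recurse into each child that exists) computes other quantities of a tree as well. The sketch below is self-contained: `Tiny`, `count_nodes`, and `height` are hypothetical names for illustration, and `Tiny` is a stripped-down stand-in for the node class above.

```python
class Tiny:
    """Hypothetical stand-in for a tree node: just two child references."""
    def __init__(self, left=None, right=None):
        self.left_child = left
        self.right_child = right

def count_nodes(n):
    """Number of nodes in the subtree rooted at n."""
    total = 1  # count n itself
    if n.left_child is not None:
        total += count_nodes(n.left_child)
    if n.right_child is not None:
        total += count_nodes(n.right_child)
    return total

def height(n):
    """Largest number of steps from n down to a leaf (0 for a leaf)."""
    child_heights = [height(c) for c in (n.left_child, n.right_child)
                     if c is not None]
    return 1 + max(child_heights) if child_heights else 0

# a root with a left leaf and a right child that itself has one left leaf
tree = Tiny(left=Tiny(), right=Tiny(left=Tiny()))
print(count_nodes(tree))  # 4
print(height(tree))       # 2
```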



**Add class method**

As usual, we can make this function a method of our class. When we do that, we need to rewrite the function calls so that they look like "node.node_string()" instead of "node_string(node)".

In [2]:
class node:
    __slots__=('parent','left_child','right_child','data')
    #
    # We instantiate a node by passing a parent (which can be None) 
    # and an optional dictionary called data to store data at that node.
    #
    def __init__(self,parent,data=None):
        # copy the dictionary: a default of data={} would be shared by every
        # node created without data, and we should not modify the caller's dict
        self.data=dict(data) if data is not None else {}
        if parent is None:
            # making this a root node
            self.data["depth"]=0
        else:
            # making this a non-root node
            self.data["depth"]=parent.data["depth"]+1
        self.parent=parent
        self.left_child=None
        self.right_child=None
    def spawn_left_child(self,data=None):
        # the constructor copies data and sets the depth of the new node
        n=node(parent=self,data=data)
        self.left_child=n
        return(n)
    def spawn_right_child(self,data=None):
        n=node(parent=self,data=data)
        self.right_child=n
        return(n)
    #
    # string consisting of information about node
    # (assumes every node has a "label" key in its data)
    #
    def __str__(self):
        s="node label = "+self.data["label"]+"\n"
        if self.parent is None:
            s+="   no parent i.e. root node\n"
        else:
            s+="   parent label = " + self.parent.data["label"]+"\n"
        if self.left_child is None:
            s+="   no left child\n"
        else:
            s+="   left child label " + self.left_child.data["label"]+"\n"
        if self.right_child is None:
            s+="   no right child\n"
        else:
            s+="   right child label " + self.right_child.data["label"]+"\n"
        
        return(s)
    #
    # string with the labels of this node and all of its descendants,
    # one per line, indented by depth
    #
    def node_string(self):
        spaces="".join([" " for i in range(self.data["depth"])])
        s=spaces+self.data["label"]+"\n"
        if self.left_child is not None:
            s+=self.left_child.node_string()
        if self.right_child is not None:
            s+=self.right_child.node_string()
        return(s)
rootnode=node(parent=None,data={"label":"TD = Top dog"})
node1=rootnode.spawn_left_child(data={"label":"DTD = daughter of Top Dog"})
node2=rootnode.spawn_right_child(data={"label":"STD = son of Top Dog"})
node11=node1.spawn_left_child(data={"label":"DDTD"})
node12=node1.spawn_right_child(data={"label":"SDTD"})
node21=node2.spawn_left_child(data={"label":"DSTD"})
node211=node21.spawn_left_child(data={"label":"DDSTD"})
node2111=node211.spawn_left_child(data={"label":"DDDSTD"})
node2112=node211.spawn_right_child(data={"label":"SDDSTD"})
node212=node21.spawn_right_child(data={"label":"SDSTD"})

s=rootnode.node_string()
print(s)

TD = Top dog
 DTD = daughter of Top Dog
  DDTD
  SDTD
 STD = son of Top Dog
  DSTD
   DDSTD
    DDDSTD
    SDDSTD
   SDSTD



In [10]:
s211=node211.node_string()
print(s211)

   DDSTD
    DDDSTD
    SDDSTD



**Binary Decision Trees**

A binary decision tree is a binary tree that enables us to predict which category an item falls into based on known characteristics of the item. Here is a simple example from finance. Mortgage loans have the following attributes:

- location type (suburban, rural, urban)
- borrower's credit score (numerical)
- loan principal, i.e. size of loan (numerical)
- interest rate (numerical)
 
A loan can either be approved or rejected. Suppose we have lots of loan performance data; based on that data, a (by no means realistic) classifier might look like this:

* location = rural or suburban
    * credit score>700
        * interest rate>5% => reject
        * interest rate<=5% => approve
    * credit score<=700 => reject
* location = urban
    * credit score > 650
        * principal > 100K => approve
        * principal <= 100K => reject
    * credit score <= 650 => reject
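
The rules above can also be written directly as nested conditionals, which is a useful reference point before encoding them in a tree. This is a sketch under stated assumptions: the function name `approve_loan` and its parameter names are hypothetical, the interest rate is given in percent, and the principal in thousands of dollars.

```python
def approve_loan(location, credit_score, interest_rate, principal):
    """Apply the toy (not realistic) rules; returns "approve" or "reject"."""
    if location in ("rural", "suburban"):
        if credit_score > 700:
            return "reject" if interest_rate > 5 else "approve"
        return "reject"
    # any other location is treated as urban
    if credit_score > 650:
        return "approve" if principal > 100 else "reject"
    return "reject"

print(approve_loan("suburban", 720, 4.5, 300))  # approve
print(approve_loan("suburban", 700, 7.5, 300))  # reject
print(approve_loan("urban", 680, 6.0, 90))      # reject
```

Storing the same rules in a tree turns them into data that a program can build or change, rather than logic fixed in the source code.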


A leaf is a node of a tree that has no children. 

Note the tree structure. We can think of a binary decision tree as a binary tree in which, to classify an individual with given variable values, we start at the root node and move along a path, picking a child of the current node at each step. Every non-leaf node has two children and a function which, upon evaluation on the individual's data, tells us which child to move to. Every leaf node has a category, and we classify an individual according to the category of the leaf node they ultimately reach.

We place a label at each node so that we can see what is going on in the code.


In [24]:
def f0(x):
    # rural and suburban loans go left; urban loans go right
    if x["location"]=="rural" or x["location"]=="suburban":
        return("left")
    else:
        return("right")

def f1(x):
    if x["credit score"]>700:
        return("left")
    else:
        return("right")

def f2(x):
    if x["credit score"]>650:
        return("left")
    else:
        return("right")
    
def f11(x):
    # interest rate is measured in percent
    if x["interest rate"]>5:
        return("left")
    else:
        return("right")
def f111(x):
    return("reject")
def f112(x):
    return("approve")
def f12(x):
    return("reject")

def f21(x):
    # principal is measured in thousands of dollars
    if x["principle"]>100:
        return("left")
    else:
        return("right")
def f211(x):
    return("approve")
def f212(x):
    return("reject")
def f22(x):
    return("reject")

rootnode=node(parent=None,data={"f":f0,"label":"0"})
node1=rootnode.spawn_left_child(data={"f":f1,"label":"1"})
node11=node1.spawn_left_child(data={"f":f11,"label":"11"})
node111=node11.spawn_left_child(data={"f":f111,"label":"111"})
node112=node11.spawn_right_child(data={"f":f112,"label":"112"})
node12=node1.spawn_right_child(data={"f":f12,"label":"12"})
node2=rootnode.spawn_right_child(data={"f":f2,"label":"2"})
node21=node2.spawn_left_child(data={"f":f21,"label":"21"})
node211=node21.spawn_left_child(data={"f":f211,"label":"211"})
node212=node21.spawn_right_child(data={"f":f212,"label":"212"})
node22=node2.spawn_right_child(data={"f":f22,"label":"22"})



**Classification**

Now that we have our tree, we can write a function that classifies an individual by walking down the tree from the root node to a leaf. The individual's data is assumed to be a dictionary with keys "location", "credit score", "interest rate", "principle".

In [3]:
def classify(idata):
    # initialize current node at root node
    cnode=rootnode
    #
    # while the current node has child nodes, compute its function 
    # to determine which child node to go to
    #
    while cnode.left_child is not None:
        print("current node label = ", cnode.data["label"])
        #
        # compute function value at this node (the result is "left" or "right")
        #
        value=cnode.data["f"](idata)
        print("function value = ",value)
        if value=="left":
            cnode=cnode.left_child
        else:
            cnode=cnode.right_child
    #
    # current node has no children - we are at a leaf,
    # whose function returns the category
    #
    value=cnode.data["f"](idata)
    print("current node label = ", cnode.data["label"])
    print("function value = "+value)
    return(value)

In [4]:
x={"location":"suburban","credit score":700,"interest rate":7.5,"principle":300}
result=classify(x)
print("\n"+result)

current node label =  0
function value =  left
current node label =  1
function value =  right
current node label =  12
function value = reject

reject
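
The walk down the tree can also be written recursively: a leaf returns its category, and any other node hands the individual to the child that its function selects. The sketch below is self-contained and hypothetical: it uses a tiny stand-in class `DNode` and a two-level toy tree rather than the node class and the full loan tree above.

```python
class DNode:
    """Hypothetical minimal decision node: a function plus two children."""
    def __init__(self, f, left=None, right=None):
        self.f = f
        self.left_child = left
        self.right_child = right

def classify_recursive(n, idata):
    """Return the category of the leaf that idata reaches, starting at n."""
    if n.left_child is None and n.right_child is None:
        # leaf: its function returns the category itself
        return n.f(idata)
    # internal node: its function returns "left" or "right"
    if n.f(idata) == "left":
        return classify_recursive(n.left_child, idata)
    return classify_recursive(n.right_child, idata)

# toy tree: approve only when the credit score exceeds 700
leaf_yes = DNode(lambda x: "approve")
leaf_no = DNode(lambda x: "reject")
root = DNode(lambda x: "left" if x["credit score"] > 700 else "right",
             left=leaf_yes, right=leaf_no)

print(classify_recursive(root, {"credit score": 720}))  # approve
print(classify_recursive(root, {"credit score": 700}))  # reject
```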

**Prediction with probabilities**

When predicting a binary outcome (rain/no rain tomorrow, loan defaults/loan doesn't default, patient survives/patient dies) based on data, it is more informative to report a probability rather than the outcome itself. This has two benefits:

- the probability reflects uncertainty
- the decision-maker can compute an expected loss associated with either decision and act accordingly

To illustrate, suppose you know that the chance of a hurricane hitting Miami tomorrow is 10%. Suppose the loss associated with not preparing for the possibility of a hurricane when it actually hits is \$100,000 and the loss associated with preparing and having it not hit is \$500. Then 

- Expected loss if you don't prepare = .1 x \$100,000 + .9 x \$0 = \$10,000
- Expected loss if you do prepare = .1 x \$0 + .9 x \$500 = \$450

So in terms of minimizing expected loss it is better to prepare. On the other hand, if the probability of the hurricane hitting is 1 in 50,000, then the expected loss if you don't prepare is only \$2, while preparing still costs almost \$500 in expectation, so by this criterion you ought not prepare.
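
The expected-loss comparison is simple arithmetic and can be checked in a few lines. The function name `expected_losses` and its parameter names are hypothetical; the numbers are the ones from the hurricane example, and the sketch keeps that example's assumption that preparing removes the loss entirely when the hurricane does hit.

```python
def expected_losses(p_hit, loss_unprepared, cost_of_preparing):
    """Return (expected loss if we don't prepare, expected loss if we do).

    Not preparing costs loss_unprepared when the hurricane hits and 0 otherwise;
    preparing costs 0 when it hits and cost_of_preparing when it does not.
    """
    no_prep = p_hit * loss_unprepared + (1 - p_hit) * 0
    prep = p_hit * 0 + (1 - p_hit) * cost_of_preparing
    return no_prep, prep

for p in (0.1, 1 / 50000):
    no_prep, prep = expected_losses(p, 100_000, 500)
    decision = "prepare" if prep < no_prep else "don't prepare"
    print(f"p = {p:.5f}: no prep = {no_prep:,.2f}, prep = {prep:,.2f} -> {decision}")
```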

The classifier above is easily modified to return an estimated probability of default instead of a decision: each leaf function returns a probability rather than a category.

In [5]:
def f0(x):
    # rural and suburban loans go left; urban loans go right
    if x["location"]=="rural" or x["location"]=="suburban":
        return("left")
    else:
        return("right")

def f1(x):
    if x["credit score"]>700:
        return("left")
    else:
        return("right")

def f2(x):
    if x["credit score"]>650:
        return("left")
    else:
        return("right")
    
def f11(x):
    if x["interest rate"]>5:
        return("left")
    else:
        return("right")
def f111(x):
    return(.23)
def f112(x):
    return(.05)
def f12(x):
    return(.17)

def f21(x):
    if x["principle"]>100:
        return("left")
    else:
        return("right")
def f211(x):
    return(.04)
def f212(x):
    return(.09)
def f22(x):
    return(.08)

rootnode=node(parent=None,data={"f":f0,"label":"0"})
node1=rootnode.spawn_left_child(data={"f":f1,"label":"1"})
node11=node1.spawn_left_child(data={"f":f11,"label":"11"})
node111=node11.spawn_left_child(data={"f":f111,"label":"111"})
node112=node11.spawn_right_child(data={"f":f112,"label":"112"})
node12=node1.spawn_right_child(data={"f":f12,"label":"12"})
node2=rootnode.spawn_right_child(data={"f":f2,"label":"2"})
node21=node2.spawn_left_child(data={"f":f21,"label":"21"})
node211=node21.spawn_left_child(data={"f":f211,"label":"211"})
node212=node21.spawn_right_child(data={"f":f212,"label":"212"})
node22=node2.spawn_right_child(data={"f":f22,"label":"22"})

def classify(idata):
    # initialize current node at root node
    cnode=rootnode
    #
    # while the current node has child nodes, compute its function 
    # to determine which child node to go to
    #
    while cnode.left_child is not None:
        print("current node label = ", cnode.data["label"])
        #
        # compute function value at this node (the result is "left" or "right")
        #
        value=cnode.data["f"](idata)
        print("function value = ",value)
        if value=="left":
            cnode=cnode.left_child
        else:
            cnode=cnode.right_child
    #
    # current node has no children - we are at a leaf,
    # whose function returns the estimated probability of default
    #
    value=cnode.data["f"](idata)
    print("current node label = ", cnode.data["label"])
    print("function value = "+str(value))
    return(value)
x={"location":"urban","credit score":500,"interest rate":7,"principle":90}
result=classify(x)
print("\n"+str(result))

current node label =  0
function value =  right
current node label =  2
function value =  right
current node label =  22
function value = 0.08

0.08
