# Decision Tree Basics
* Decision trees are simple in concept but complicated to implement
* This lecture covers the basic concepts; later lectures will discuss the details
* At its core, you can think of a decision tree as just a bunch of nested if statements
* For example, let's look at our spam classifier again
* A decision tree may look like:

![d-tree-pseudo.png](attachment:d-tree-pseudo.png)

![d-tree-diagram.png](attachment:d-tree-diagram.png)
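To make the nested-if picture concrete in code, here is a minimal hand-written sketch. The feature names (`contains_free`, `known_sender`, `num_links`) and the threshold are hypothetical, chosen only for illustration; they are not taken from the spam dataset or from the figures above.

```python
def classify_email(email):
    """Hand-written 'decision tree' for spam, expressed as nested ifs.

    `email` is a dict of hypothetical features. The keys and the
    threshold below are illustrative assumptions, not learned values.
    """
    if email["contains_free"]:        # root condition
        if email["known_sender"]:     # second-level condition
            return "not spam"
        return "spam"
    else:
        if email["num_links"] > 10:   # second-level condition
            return "spam"
        return "not spam"


print(classify_email({"contains_free": True, "known_sender": False, "num_links": 0}))
```

Each `if` checks a single feature, and each `return` plays the role of a leaf in the tree diagram.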

# What makes this machine learning?
* A bunch of if statements doesn't exactly sound like machine learning
* What makes this ML is how we choose the conditions we check in the if statements
* That choice is based on information theory

# One key feature
* Decision trees only look at one attribute at a time
* In other words, each condition checks only one column of the X matrix
* Usually we call these columns **input features**, but they are often called **attributes** when talking about decision trees
* For example, if we are using a person's height to help us make a decision, we may have a condition like:

![d-tree-pseudo2.png](attachment:d-tree-pseudo2.png)

# Geometry 
* What does this tell us about the geometry of the problem?
* Let's put height on the X1 axis and weight on the X2 axis

![d-tree-geometry.png](attachment:d-tree-geometry.png) 

* We see that if we split on X1 = 5, then everything to the left of the split gets one decision, and everything to the right gets another
* So while linear classifiers (a discriminating line) can create boundaries at arbitrary angles to the axes, a single decision tree split can only separate the data with a line that is orthogonal to one axis
* However, that does not mean we are limited to one side being one class and the other side being another
* Since we are working with trees, we can split again at each level, and thus the final decision boundary learned by a decision tree can be highly nonlinear!
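As a rough illustration of how stacked axis-aligned splits produce a nonlinear boundary, the sketch below hard-codes two levels of splits on height (X1) and weight (X2). The function name, the thresholds (5 and 150), and the class labels are made up for the example.

```python
def predict_region(x1, x2):
    """Two levels of axis-aligned splits on (x1, x2).

    Each condition tests one attribute against a threshold, so every
    individual boundary is a line perpendicular to one axis.
    The thresholds are arbitrary illustrative values.
    """
    if x1 < 5:           # vertical line at x1 = 5
        if x2 < 150:     # horizontal line, applied only to the left of x1 = 5
            return 0
        return 1
    return 1             # everything to the right of x1 = 5


points = [(4, 120), (4, 180), (6, 120), (6, 180)]
labels = [predict_region(x1, x2) for x1, x2 in points]
print(labels)  # [0, 1, 1, 1]
```

Class 0 occupies only the lower-left rectangle, so the overall boundary is an L shape that no single straight line could reproduce.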

# Recursiveness
* Another aspect of decision trees is that, because we are using trees, the problem is inherently recursive
* A TreeNode will have child TreeNodes, and we will choose, based on some criterion, which child to go to
* That child will then do the same thing and choose one of its children to go to
* When we arrive at a leaf node, we make a prediction
* That prediction is then returned back up through each recursive call to the root node

# Pseudocode
* First let's impose some limitations on our implementation:
    1. we are only going to do binary classification
    2. each node will have either 0 or 2 children (only 1 split per node)
    3. if a node has children, it does not hold predictions; if a node does not have children, it does
    
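* Let's assume we have an object called **`TreeNode`**, which contains the following:

```python
class TreeNode(object):
    """A single node of a binary decision tree."""

    def __init__(self):
        self.condition = None         # test applied to a sample x; returns True or False
        self.left_node = None         # child TreeNode followed when the condition is True
        self.right_node = None        # child TreeNode followed when the condition is False
        self.left_prediction = None   # value returned on the True side of a leaf
        self.right_prediction = None  # value returned on the False side of a leaf
```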

* If the node is not a leaf node, then **`left_node`** and **`right_node`** will point to other tree nodes
* If the node is a leaf node, then **`left_node`** and **`right_node`** will be null (`None`), and **`left_prediction`** and **`right_prediction`** will be set to the most likely class on each side of the split
* So the basic algorithm for predicting one sample is:

![d-tree-algo.png](attachment:d-tree-algo.png)

* First we check the condition on x
* If it is true, we check whether we have a left node
* If we do have a left node, we call predict-one-sample on the left node
* Otherwise this is a leaf node, and we return the left prediction
* If the condition is false, we check whether we have a right node
* If we do have a right node, we get the prediction from the right node
* Otherwise this is a leaf node, so we return the right prediction
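The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the final implementation: the method name `predict_one` and the choice to store `condition` as a callable that takes a sample and returns a boolean are hypothetical. The `TreeNode` fields are repeated so the block runs on its own, and the tree is built by hand because fitting has not been covered yet.

```python
class TreeNode:
    def __init__(self, condition=None):
        self.condition = condition    # assumed: callable x -> bool
        self.left_node = None
        self.right_node = None
        self.left_prediction = None
        self.right_prediction = None

    def predict_one(self, x):
        """Recursively route one sample x down the tree to a prediction."""
        if self.condition(x):
            if self.left_node is not None:
                return self.left_node.predict_one(x)   # recurse into the left child
            return self.left_prediction                # leaf: condition was True
        else:
            if self.right_node is not None:
                return self.right_node.predict_one(x)  # recurse into the right child
            return self.right_prediction               # leaf: condition was False


# Hand-built tree with made-up thresholds: the root splits on x[0],
# and both children are leaves that split on x[1].
left = TreeNode(condition=lambda x: x[1] < 150)
left.left_prediction, left.right_prediction = 0, 1
right = TreeNode(condition=lambda x: x[1] < 200)
right.left_prediction, right.right_prediction = 1, 0

root = TreeNode(condition=lambda x: x[0] < 5)
root.left_node, root.right_node = left, right

samples = [(4, 120), (4, 180), (6, 120), (6, 250)]
preds = [root.predict_one(x) for x in samples]
print(preds)  # [0, 1, 1, 0]
```

Each call either hands the sample to a child or returns a stored prediction, so the recursion depth equals the depth of the path the sample follows.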

# Notice...
* Notice that the function described above predicts one sample
* We need to make predictions one sample at a time, because the condition may be true or false depending on which sample we are looking at
* We will soon discuss how the condition is found, and how we can use it to build the fit function, which will also be recursive

---
# Information Entropy
* Let's now look at the theory behind choosing the best splits in our decision tree
* At a high level, we would like to make the split that maximizes the reduction in uncertainty
* For example, a split that takes us from being 50% certain to 100% certain is better than one that takes us from 50% certain to 75% certain

# Information Entropy
* Using information theory, we can quantify this concept
* We use what is called information entropy, which is related to variance
* Recall that a wide variance means we don't know much about the data we are going to get
* A very small variance means we have more confidence about the specific values of the data

![info%20entropy%201.png](attachment:info%20entropy%201.png)

* The equation for information entropy is:

### $$H(X) = E\big[-\log_2 p(X)\big] = -\sum_x p(x)\log_2 p(x)$$

* We can see that it uses the probability distribution over x, just like variance
* We can also see that it must be positive or 0: p(x) is always between 0 and 1, and the negative log of a number between 0 and 1 is greater than or equal to zero
* Note that when dealing with entropy, log usually means log base 2 implicitly

![info%20entropy%202.png](attachment:info%20entropy%202.png)
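As a small sketch of the entropy formula in code, the function below estimates entropy from a list of observed labels by using the observed frequencies as p(x). The function name `entropy` and the use of `collections.Counter` are choices made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    counts = Counter(labels)
    # Sum of -p * log2(p) over the values that actually occur;
    # values with p = 0 never appear in `counts`, so log2(0) is never evaluated.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


print(entropy([0, 0, 1, 1]))  # 1.0
print(entropy([1, 1, 1, 1]))  # 0.0
```

An evenly mixed set of labels gives one full bit of entropy, while a set with a single label gives zero.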

# Binary Random Variable
* Let's consider a binary variable specifically (we will call it X)
* Let's say that the probability that X = 1 is
### $$P(X=1) = p$$
* and that
### $$P(X=0) = 1 - p$$
* The equation for entropy thus becomes:
### $$H(p) = -p\log_2(p) - (1-p)\log_2(1-p)$$
* The question we want to ask here is: what value of p maximizes the entropy?
* To find this value, we take the derivative of H with respect to p, set it equal to 0, and solve for p
### $$\frac{dH}{dp} = \log_2(1-p) - \log_2(p) = 0$$
* Doing this yields the answer
### $$p = 0.5$$
* If we were to plot H(p) against p, we would see that H = 0 when p = 0 or p = 1
* and that H = 1 when p = 0.5, which is the peak of H

![binary%20random%20variable.png](attachment:binary%20random%20variable.png)
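The shape of the curve can also be checked numerically. The sketch below evaluates the binary entropy on a grid of p values and finds where it peaks. The helper name `binary_entropy` and the grid resolution are arbitrary choices for this example.

```python
import math


def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with the convention 0*log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Evaluate H at p = 0.00, 0.01, ..., 1.00 and locate the maximum.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=binary_entropy)
print(best_p, binary_entropy(best_p))  # 0.5 1.0
```

The numerical peak agrees with the result of setting the derivative to zero.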

* Now we can start thinking about the meaning of entropy
* If the probability of a binary variable is 0.5, then there is **no possible way for you to make a good prediction about it**
* No matter what you predict, you will have a 50% chance of being wrong
* Let's consider a value other than 0.5, say p = 0.8
* If we wanted to predict the value of this random variable, we should always predict 1, because that gives us the best chance of being correct
* Entropy is a measure of how much information we get, on average, from finding out the value of a random variable

# Example
* If we flip a coin with p = 0.8 probability of heads and we get heads (1), we don't gain much information: we were already 80% certain that we would flip heads
* However, if we flip a coin with p = 0.5 probability of heads and we get heads, we gain a full bit of information, and on average no coin flip can give more
* This is because prior to knowing, we were maximally clueless about the value we would get!
* In general, for a discrete variable with a fixed set of outcomes, the probability distribution that yields the maximum entropy is the uniform distribution
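The coin example can be put into numbers. The information gained from one observed outcome of probability p is -log2(p), often called surprisal or self-information, and entropy is its average over outcomes. The sketch below uses hypothetical helper names (`surprisal`, `entropy_from_probs`) and a made-up skewed distribution for comparison.

```python
import math


def surprisal(p):
    """Information, in bits, gained by observing an outcome of probability p."""
    return -math.log2(p)


def entropy_from_probs(probs):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Observing heads on the biased coin versus the fair coin.
print(round(surprisal(0.8), 3))  # 0.322
print(surprisal(0.5))            # 1.0

# Four outcomes: uniform versus an arbitrary skewed distribution.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy_from_probs(uniform))                               # 2.0
print(entropy_from_probs(uniform) > entropy_from_probs(skewed))  # True
```

Heads on the biased coin carries about a third of a bit, while heads on the fair coin carries a full bit. Among the two four-outcome distributions, the uniform one has the larger entropy.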

# Maximizing Information Gain 