# Decision Trees
In this exercise I will explore how a decision tree is split using [Information Gain](https://en.wikipedia.org/wiki/Information_gain_(decision_tree)).

In a decision tree, we decide whether to split a node by looking at the **Information Gain** that the split would give us.

$$ Information \; Gain = H(p_1^{node}) - \left( w^{left} H(p_1^{left}) + w^{right} H(p_1^{right}) \right)$$

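Here $p_1$ is the fraction of positive examples (cats) in a node, $w^{left}$ and $w^{right}$ are the fractions of the node's examples that go to the left and right branches, and $H$ is the entropy, defined as: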

$$ H(p_1) = -p_1 \log_2 (p_1) - (1-p_1) \log_2 (1-p_1) $$
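
As a quick sanity check of the entropy formula, here are a few values worked by hand. They are rounded to three decimals and assume the usual convention $0 \log_2 0 = 0$, which the formula itself does not state:

$$ H(0.5) = -0.5 \log_2 (0.5) - 0.5 \log_2 (0.5) = 1 $$

$$ H(0.8) = -0.8 \log_2 (0.8) - 0.2 \log_2 (0.2) \approx 0.722 $$

$$ H(1) = -1 \log_2 (1) - 0 \log_2 (0) = 0 $$

Entropy is therefore highest for an evenly mixed node and drops to zero for a pure node. It is also symmetric, so $H(0.2) = H(0.8)$.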

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The data I will use is the following:

|               | Ear Shape | Face Shape | Whiskers|   Cat |
|:-------------:|:---------:|:----------:|:-------:|:-----:|
|<img src="images/0.png" alt="drawing" width="50"> | Pointy | Round     | Present | 1 |
|<img src="images/1.png" alt="drawing" width="50"> | Floppy | Not Round | Present | 1 |
|<img src="images/2.png" alt="drawing" width="50"> | Floppy | Round     | Absent  | 0 |
|<img src="images/3.png" alt="drawing" width="50"> | Pointy | Not Round | Present | 0 |
|<img src="images/4.png" alt="drawing" width="50"> | Pointy | Round     | Present | 1 |
|<img src="images/5.png" alt="drawing" width="50"> | Pointy | Round     | Absent  | 1 |
|<img src="images/6.png" alt="drawing" width="50"> | Floppy | Not Round | Absent  | 0 |
|<img src="images/7.png" alt="drawing" width="50"> | Pointy | Round     | Absent  | 1 |
|<img src="images/8.png" alt="drawing" width="50"> | Floppy | Round     | Absent  | 0 |
|<img src="images/9.png" alt="drawing" width="50"> | Floppy | Round     | Absent  | 0 |

I will use **one-hot encoding** to encode the categorical features. Since each feature takes only two values, a single 0/1 column per feature is enough:
- Ear Shape: Pointy = 1, Floppy = 0
- Face Shape: Round = 1, Not Round = 0
- Whiskers: Present = 1, Absent = 0
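
The mapping above can also be applied programmatically. The following is only a sketch: the names `raw`, `positive_value`, `encoded` and `X_encoded` are hypothetical helpers introduced for illustration, and the category values are transcribed by hand from the table.

```python
import numpy as np
import pandas as pd

# Raw categorical data, transcribed from the table above (illustrative helper)
raw = pd.DataFrame({
    "Ear Shape":  ["Pointy", "Floppy", "Floppy", "Pointy", "Pointy",
                   "Pointy", "Floppy", "Pointy", "Floppy", "Floppy"],
    "Face Shape": ["Round", "Not Round", "Round", "Not Round", "Round",
                   "Round", "Not Round", "Round", "Round", "Round"],
    "Whiskers":   ["Present", "Present", "Absent", "Present", "Present",
                   "Absent", "Absent", "Absent", "Absent", "Absent"],
})

# For each feature, the value that is encoded as 1 (the other value becomes 0)
positive_value = {"Ear Shape": "Pointy", "Face Shape": "Round", "Whiskers": "Present"}

# Compare each column against its "1" value and cast the booleans to integers
encoded = pd.DataFrame({
    col: (raw[col] == value).astype(int) for col, value in positive_value.items()
})

# NumPy array with one row per example and one column per feature
X_encoded = encoded.to_numpy()
print(X_encoded)
```

Writing the array by hand works just as well for ten examples; a mapping like this mostly helps when the table is larger or is loaded from a file.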

With this encoding, the dataset is:

In [6]:
# Data set
X_train = np.array([[1,1,1],
                   [0,0,1],
                   [0,1,0],
                   [1,0,1],
                   [1,1,1],
                   [1,1,0],
                   [0,0,0],
                   [1,1,0],
                   [0,1,0],
                   [0,1,0]])

y_train = np.array([1,1,0,0,1,1,0,1,0,0])

At each node, we compute the information gain for each feature and then split the node on the feature with the highest information gain. The gain is obtained by comparing the entropy of the node with the weighted entropy of the two child nodes.
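
As a worked example of this comparison, consider splitting the root node on Ear Shape. The numbers are hand-computed from the table and rounded to three decimals. The root holds 10 examples, 5 of which are cats, so $p_1^{node} = 0.5$. The Pointy branch gets 5 examples with 4 cats, and the Floppy branch gets 5 examples with 1 cat:

$$ w^{left} = w^{right} = \frac{5}{10} = 0.5, \qquad p_1^{left} = \frac{4}{5} = 0.8, \qquad p_1^{right} = \frac{1}{5} = 0.2 $$

$$ Information \; Gain = H(0.5) - \left( 0.5 \, H(0.8) + 0.5 \, H(0.2) \right) \approx 1 - \left( 0.5 \cdot 0.722 + 0.5 \cdot 0.722 \right) = 0.278 $$

Repeating the same calculation by hand gives about 0.035 for Face Shape and about 0.125 for Whiskers, so under these numbers Ear Shape would be the feature chosen at the root.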

Let's write a function to compute the entropy:

In [None]:
def entropy(p):
    