-----------
# Outline of Notebook
- ### Decision Trees
- ### Decision Tree Learning Intuition
- ### Decision Tree Learning
- ### Modifications to Decision Trees
- ### Tree Ensembles
- ### When to use Decision Trees and Tree Ensembles?
-----------

# Decision Trees

![](2022-07-23-15-20-27.png)

<u>The job of the decision tree learning algorithm is to pick, out of all possible decision trees, one that hopefully does well on the training set and cross-validation set, and eventually on the test set too.</u>

# Decision Tree Learning Intuition

![](2022-07-23-16-25-23.png)

<u>Decision 1:</u> How do you choose which feature to split on at each node?
- <u>Maximizing Purity</u> = when splitting, you want to get as close as possible to only one class on one side and the other class on the other side

<u>Decision 2:</u> When do you stop splitting?
- When a node is 100% one class
- When splitting a node would result in the tree exceeding a maximum depth
    - You might want this limit because if your tree gets too big, it may be prone to overfitting
- When the improvement in purity score is below a threshold (purity score not increasing much)
- When the number of examples in a node is below a threshold

# Decision Tree Learning

<u>Entropy</u> = measure of the impurity of a set of data
- $p_1$ = fraction of examples that belong to one class (binary classification example)
- $p_0$ = $1 - p_1$ = fraction of examples that belong to the other class (binary classification example)
- Entropy function: $H(p_1) = -p_1\log_2(p_1) - p_0\log_2(p_0)$ (Note: by convention, we take $0 \log_2(0) = 0$, so a node with only one class has entropy 0)

    ![](2022-07-23-16-36-33.png)
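
As a minimal sketch, the entropy function above can be written in Python as follows. The function name `entropy` is just an illustrative choice, and the $0 \log_2 0 = 0$ convention is handled as an explicit special case.

```python
import math

def entropy(p1: float) -> float:
    """Binary entropy H(p1) in bits, treating 0 * log2(0) as 0."""
    if p1 in (0.0, 1.0):
        # A node containing only one class is perfectly pure.
        return 0.0
    p0 = 1.0 - p1
    return -p1 * math.log2(p1) - p0 * math.log2(p0)

print(entropy(0.5))  # 1.0 (maximum impurity: a 50/50 mix)
print(entropy(1.0))  # 0.0 (completely pure)
```

Entropy is highest at $p_1 = 0.5$ and falls symmetrically to 0 as $p_1$ approaches 0 or 1.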

<u>Choosing a Splitting Feature</u> = pick the feature whose split minimizes entropy (i.e. maximizes purity)
- Let's assume that we are trying to distinguish cats from dogs using the features Ear Shape, Face Shape, and Whiskers
- Let's also assume that we originally have 5 cats and 5 dogs in our dataset
- $p_1$ = the fraction of examples that are cats in a dataset

    ![](2022-07-23-16-54-45.png)
    - In the picture above, we found the entropy of each branch for each candidate feature
    - Then, we computed the weighted average of the entropies of the left and right branches for each split
        - Weighting by branch size is important because high impurity in a branch with many examples matters more than high impurity in a branch with few examples
    - Then, we subtracted the weighted average from the entropy before the split (since we are deciding the split at the root node, the entropy before the split is just 1 because $p_1 = 0.5$, as there were 5 cats out of 10 data points)
    - What this finally measures is the reduction in entropy, and we choose the split that reduces entropy the most
        - We go to the trouble of calculating the reduction in entropy, rather than just the weighted entropy, because one way to decide when to stop splitting is to check whether the reduction is too small; splitting further would then only grow the tree and make it more prone to overfitting

<u>Information Gain</u> = the calculated reduction in entropy
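
Written out, information gain is $H(p_1^{\text{root}}) - \left(w^{\text{left}} H(p_1^{\text{left}}) + w^{\text{right}} H(p_1^{\text{right}})\right)$, where $w^{\text{left}}$ and $w^{\text{right}}$ are the fractions of examples sent to each branch (this superscript notation is an assumption, chosen for illustration). The sketch below computes it for one hypothetical split of the 5 cats and 5 dogs: 4 cats out of 5 examples on the left and 1 cat out of 5 on the right. Those branch counts are made up for the example, not read from the picture.

```python
import math

def entropy(p1):
    """Binary entropy in bits, treating 0 * log2(0) as 0."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(root, left, right):
    """Each argument is a (number_of_cats, number_of_examples) pair."""
    def node_entropy(node):
        cats, total = node
        return entropy(cats / total) if total else 0.0

    n = root[1]
    w_left, w_right = left[1] / n, right[1] / n
    weighted = w_left * node_entropy(left) + w_right * node_entropy(right)
    return node_entropy(root) - weighted

# Hypothetical split: 4/5 cats go left, 1/5 cats go right.
gain = information_gain(root=(5, 10), left=(4, 5), right=(1, 5))
print(round(gain, 2))  # 0.28
```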

<u>The Final Algorithm:</u>
- Start with all examples at the root node
- Calculate information gain for all possible features, and pick the one with the highest information gain
- Split the dataset according to the selected feature, creating left and right branches of the tree and sending each example to the matching branch
- Keep repeating the splitting process on each branch until a stopping criterion is met:
    - When a node is 100% one class
    - When splitting a node would result in the tree exceeding a maximum depth
        - You might want this limit because if your tree gets too big, it may be prone to overfitting
    - When the improvement in purity score is below a threshold (purity score not increasing much)
    - When the number of examples in a node is below a threshold

NOTE: To decide your maximum depth, you can use the same approach we used to choose the degree of the polynomial for regression: try several values and keep the one that does best on the cross-validation set
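
The recursive procedure above can be sketched in plain Python as follows. This is a toy illustration under several assumptions, not a production implementation: features are binary (1 goes left, 0 goes right), labels are 0/1, the tree is stored as nested dictionaries, and the names `build_tree`, `min_gain`, and `min_examples` as well as the small dataset are all hypothetical.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def split_indices(X, feature):
    """Left branch gets rows where the feature is 1, right branch where it is 0."""
    left = [i for i, row in enumerate(X) if row[feature] == 1]
    right = [i for i, row in enumerate(X) if row[feature] == 0]
    return left, right

def information_gain(X, y, feature):
    left, right = split_indices(X, feature)
    y_left = [y[i] for i in left]
    y_right = [y[i] for i in right]
    weighted = (len(left) * entropy(y_left) + len(right) * entropy(y_right)) / len(y)
    return entropy(y) - weighted

def build_tree(X, y, depth=0, max_depth=2, min_gain=1e-6, min_examples=2):
    majority = int(2 * sum(y) >= len(y))  # ties go to class 1
    # Stopping criteria: pure node, maximum depth reached, or too few examples.
    if entropy(y) == 0.0 or depth >= max_depth or len(y) < min_examples:
        return {"leaf": majority}
    gains = [information_gain(X, y, f) for f in range(len(X[0]))]
    best = max(range(len(gains)), key=lambda f: gains[f])
    # Stopping criterion: the best split barely improves purity.
    if gains[best] < min_gain:
        return {"leaf": majority}
    left, right = split_indices(X, best)

    def subtree(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          depth + 1, max_depth, min_gain, min_examples)

    return {"feature": best, "left": subtree(left), "right": subtree(right)}

def predict(tree, row):
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] == 1 else tree["right"]
    return tree["leaf"]

# Hypothetical data. Columns: pointy ears, round face, whiskers. Label 1 = cat.
X = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 0],
     [0, 1, 1], [0, 0, 0], [0, 1, 0], [1, 1, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

tree = build_tree(X, y)
print(tree)
print([predict(tree, row) for row in X])
```

On this toy data the root splits on the whiskers column, because that split has the highest information gain.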

# Modifications to Decision Trees

Let's say you have a categorical feature that can take 3 possible values instead of 2
- In this case, you can use One Hot Encoding to replace the one feature with 3 new features, each having binary values (0 or 1); for any given example, exactly one of the 3 is equal to 1
- One Hot Encoding also converts categorical features to numbers, which allows you to feed the same data into Logistic Regression or Neural Networks
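
A minimal sketch of One Hot Encoding in plain Python is shown below. The helper name `one_hot` and the ear shape values are hypothetical, chosen only to illustrate a feature with 3 possible values.

```python
def one_hot(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows

ear_shapes = ["pointy", "floppy", "oval", "pointy"]  # hypothetical feature values
categories, encoded = one_hot(ear_shapes)
print(categories)  # ['floppy', 'oval', 'pointy']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each row contains exactly one 1, which marks the category that example belongs to.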

Let's say you have a feature that is continuous, like the weight of an animal
- In this case, the Decision Tree algorithm will split the weight feature on the threshold value that results in the greatest information gain
    - Ex. the weight feature can split on 8 kg: anything below 8 kg goes to the left branch and anything equal to or above 8 kg goes to the right branch
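
One way to find such a threshold is sketched below. It assumes the candidate thresholds are the midpoints between consecutive sorted values, which is a common choice but not the only one. The function name `best_threshold` and the weights are hypothetical.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def best_threshold(values, labels):
    """Try midpoints between sorted unique values; keep the one with the highest gain."""
    unique = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for t in candidates:
        # Below the threshold goes left; equal to or above goes right.
        left = [y for v, y in zip(values, labels) if v < t]
        right = [y for v, y in zip(values, labels) if v >= t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical animal weights in kg; label 1 = cat, 0 = dog.
weights_kg = [3.2, 4.1, 5.0, 6.4, 9.6, 11.0, 12.5, 14.0]
is_cat = [1, 1, 1, 1, 0, 0, 0, 0]

threshold, gain = best_threshold(weights_kg, is_cat)
print(threshold, gain)
```

Here the chosen threshold falls between 6.4 kg and 9.6 kg, since that split separates the two classes perfectly.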

# Tree Ensembles (Multiple Decision Trees)

Disadvantage of a Single Decision Tree = the tree can be highly sensitive to small changes in the training data

![](2022-07-23-17-44-55.png)

<u>How to build Tree Ensembles:</u>

Given a training set of size $m$

Given $B$, the number of trees you want in your Tree Ensemble

For $b = 1$ to $B$:
- Use sampling <u>with</u> replacement to create a new training set of size $m$ (note that training examples may be repeated)
- Train a decision tree on the new dataset

For the value of $B$, most recommendations fall anywhere from 64 to 128

This type of Tree Ensemble is called a Bagged Tree Ensemble because of the sampling with replacement
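
The bagging loop above can be sketched as follows. To keep the example short, `train_stand_in` is a hypothetical placeholder for a real decision tree learner: it just predicts the majority label of its sample. Combining predictions by majority vote is also an assumption here. All names are illustrative.

```python
import random

def bootstrap_sample(examples, rng):
    """Draw len(examples) items uniformly at random, with replacement."""
    m = len(examples)
    return [examples[rng.randrange(m)] for _ in range(m)]

def build_bagged_ensemble(examples, train_tree, B=100, seed=0):
    """Train B models, each on its own bootstrap sample of the training set."""
    rng = random.Random(seed)
    return [train_tree(bootstrap_sample(examples, rng)) for _ in range(B)]

def majority_vote(trees, x):
    """Every tree predicts; the most common prediction wins."""
    votes = [tree(x) for tree in trees]
    return max(set(votes), key=votes.count)

def train_stand_in(sample):
    """Hypothetical stand-in for a tree learner: predicts its sample's majority label."""
    labels = [label for _, label in sample]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

# Hypothetical (input, label) pairs: 7 examples of class 1 and 3 of class 0.
examples = [(i, int(i < 7)) for i in range(10)]
trees = build_bagged_ensemble(examples, train_stand_in, B=25)
print(len(trees), majority_vote(trees, x=None))
```

Using an odd $B$ for binary classification avoids tied votes.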

However, there are other types, such as the <u>Random Forest Tree Ensemble</u>:
- The Random Forest Algorithm has one extra addition to the Bagged Tree Algorithm
- At each node, when choosing a feature to use to split, if $n$ features are available, pick a random subset of $k < n$ features and allow the algorithm to only choose from that subset of features
- If $n$ is large, then you usually choose $k = \sqrt{n}$
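
The per-node feature sampling can be sketched as below. Rounding $\sqrt{n}$ to the nearest integer is an assumption, since the notes do not say how to handle non-square $n$, and the function name `candidate_features` is hypothetical.

```python
import math
import random

def candidate_features(n_features, rng, k=None):
    """Pick a random subset of k distinct feature indices (default k is about sqrt(n))."""
    if k is None:
        k = max(1, round(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
subset = candidate_features(16, rng)
print(subset)  # 4 distinct indices between 0 and 15
```

At each node, the split search would then compute information gain only for the features in `subset`, drawing a fresh subset for every node.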

<u>Usually the Random Forest Algorithm works better than the Bagged Tree Algorithm as well as a Single Decision Tree</u>

There's another widely used algorithm called <u>XGBoost</u> (eXtreme Gradient Boosting). The intuition behind boosting is:
- Given a training set of size $m$
- For $b = 1$ to $B$:
    - Use sampling with replacement to create a new training set of size $m$
        - But instead of picking from all examples with equal $\frac{1}{m}$ probability, make it more likely to pick examples that the previously trained trees misclassify
    - Train a decision tree on the new dataset

- XGBoost is fast and also has built-in regularization to prevent overfitting
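
The non-uniform sampling step can be sketched as follows. This only illustrates the resampling intuition described above, not how the XGBoost library works internally. The rule of giving misclassified examples a fixed `boost` factor is an assumption made up for the example, as is the function name `boosted_sample`.

```python
import random

def boosted_sample(examples, misclassified, rng, boost=2.0):
    """Sample with replacement, making misclassified examples `boost` times more likely."""
    weights = [boost if i in misclassified else 1.0 for i in range(len(examples))]
    return rng.choices(examples, weights=weights, k=len(examples))

examples = ["a", "b", "c", "d", "e"]  # hypothetical training examples
# Suppose the trees trained so far misclassify the examples at indices 1 and 3.
sample = boosted_sample(examples, misclassified={1, 3}, rng=random.Random(0))
print(sample)
```

With an empty `misclassified` set, every example has the same $\frac{1}{m}$ probability, which recovers the plain bagging sample.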

# When to use Decision Trees and Tree Ensembles?

Decision Trees and Tree Ensembles
- Work well on tabular (structured) data
- Not recommended for unstructured data (images, audio, text)
- Small decision trees may be human interpretable
- Main Advantage: Fast to train

If you have decided to use Decision Trees:
- Mostly, use XGBoost

Decision Trees vs. Neural Networks
- Neural Networks work well on all types of data, including tabular (structured) and unstructured data
- But Neural Networks may be slower to train than a Decision Tree
- However, Neural Networks allow you to work with transfer learning