# Decision Trees

## Summary

The goal of using a Decision Tree is to predict the class or value of the target variable by learning simple **decision rules** that split data points. The decision tree algorithm works like a set of nested if-else statements: successive conditions are checked until the model reaches a conclusion.
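
To make the nested if-else analogy concrete, here is a minimal hand-written sketch of what a learned tree amounts to at prediction time. The function name, features, and thresholds are hypothetical and chosen only for illustration; a real tree learns them from data.

```python
def predict_play_tennis(outlook: str, humidity: float, windy: bool) -> str:
    """Hand-written stand-in for a learned tree (hypothetical rules)."""
    if outlook == "sunny":        # root node: split on outlook
        if humidity > 70:         # decision node: split on humidity
            return "no"           # leaf node
        return "yes"              # leaf node
    elif outlook == "rainy":
        if windy:                 # decision node: split on wind
            return "no"
        return "yes"
    return "yes"                  # any other outlook goes straight to a leaf


print(predict_play_tennis("sunny", 85.0, False))  # no
print(predict_play_tennis("rainy", 60.0, False))  # yes
```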


**Keywords**:
- supervised learning
- regression and classification
    - more often used in classification
- **Root Node**: Represents the entire population or sample, which is then divided into two or more homogeneous sets.
- **Splitting**: It is a process of dividing a node into two or more sub-nodes.
- **Decision Node**: When a sub-node splits into further sub-nodes, then it is called the decision node.
- **Leaf / Terminal Node**: Nodes that do not split are called Leaf or Terminal nodes.
- **Pruning**: Removing the sub-nodes of a decision node. It is the opposite of splitting.
- **Branch / Sub-Tree**: A subsection of the entire tree is called a branch or sub-tree.
- **Parent and Child Node**: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called the children of the parent node.

### Assumptions

- In the beginning, the whole training set is considered as the **root**.
- Feature values are preferred to be categorical (this holds for ID3; algorithms such as CART split continuous features on thresholds instead). If the values are continuous, they are discretized prior to building the model.
- Records are **distributed recursively** (recursive partitioning, divide and conquer) on the basis of attribute values.
- The order in which attributes are placed as the root or internal nodes of the tree is determined by a statistical measure (for example, information gain or the Gini index).
- Decision trees apply a **top-down approach** to training.

### Pros

- Extremely fast prediction
- Disregards features that have little to no importance in prediction
- Extremely efficient training (provided parameters are reasonable)
- Easy to interpret logic (result resembles a flowchart)
- Constructs feature importance

### Cons

- Tends to overfit
- Small changes in the data can lead to a very different tree (high variance)
- Large trees can sometimes be difficult to interpret
- Biased towards more splits on features with more levels
- Can struggle when the target variable has many labels (in classification)


## How It Works

Each tree has one root node. Generally, the feature that gives the best split (for example, the highest information gain) is chosen for the root node.

The steps below follow ID3 (Iterative Dichotomiser 3), the algorithm most commonly taught. It takes a **greedy search** approach, meaning that at each step it makes the choice that seems best at that moment.

1. The root node includes the entire dataset.
2. Calculate the entropy of the parent (giving us $E(parent)$).
3. For each feature in the dataset, split the data by that feature and:
    1. Calculate the entropy of each resulting node.
    2. Take the weighted average of the entropies of the resulting nodes (giving us $E(parent|new feature)$).
    3. Calculate information gain as $E(parent) - E(parent|new feature)$.
4. Split on the feature with the highest information gain, then repeat steps 2 and 3 for each resulting child node until a node is pure or no splittable features remain.
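
The steps above can be sketched in plain Python. This is a minimal illustration of ID3 on categorical features, not a production implementation: the nested-dict tree representation, the function names, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """E(parent) minus the weighted average entropy of the child nodes."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def id3(rows, labels, features):
    """Return a nested dict {feature: {value: subtree}}; leaves are class labels."""
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = id3([r for r, _ in subset], [l for _, l in subset], remaining)
    return tree


# Toy data, made up for illustration.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["yes", "yes", "yes", "no", "yes"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

On this toy data `windy` has the higher information gain, so it becomes the root, and `outlook` is only used in the branch that is still mixed.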

### Important Parameters

- Parameters that control splitting of a decision tree:
    - `max_depth`: Maximum depth of the tree. As this increases, time to build the tree also increases.
    - `min_samples_split`: Minimum number of samples a node must contain before it can be split.
    - `min_samples_leaf`: Minimum number of samples required to be in a leaf node.
    - `max_features`: Maximum number of features to consider when looking for the best split.
    - The above parameters are considered to be pre-pruning methods.
- Too shallow, and the model can't predict well. Too deep with too many splits, and it may be prone to overfitting.
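
As a rough sketch of how the stopping parameters above act as pre-pruning checks, the hypothetical helper below decides whether a candidate split is allowed. The function name and the default values are assumptions for illustration, not a library's API. `max_features` is left out because it limits which features are considered rather than whether a node may split.

```python
def can_split(depth, n_samples, n_left, n_right,
              max_depth=3, min_samples_split=4, min_samples_leaf=2):
    """Return True if a candidate split passes all pre-pruning checks."""
    if depth >= max_depth:                       # tree is already as deep as allowed
        return False
    if n_samples < min_samples_split:            # node is too small to be split
        return False
    if min(n_left, n_right) < min_samples_leaf:  # a child would be too small
        return False
    return True


print(can_split(depth=1, n_samples=10, n_left=6, n_right=4))  # True
print(can_split(depth=3, n_samples=10, n_left=6, n_right=4))  # False
print(can_split(depth=1, n_samples=10, n_left=9, n_right=1))  # False
```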

### Entropy

- **Entropy**: A measure of the uncertainty in a given dataset, describing the degree of randomness (impurity) of a particular node. When the classes in a node are close to evenly mixed, the model has less confidence in its prediction and entropy is high.
- We aim for lower entropy because it represents lower randomness, that is, purer nodes.
- For Binary Classification: $Entropy=-p_{+}\log_2 p_{+}-p_{-}\log_2 p_{-}$
    - Where $p_+$ is the probability of the positive class
    - [Article showing the math](https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/)
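- In general, for any number of classes: $Entropy = -\sum_i p_i \log_2(p_i)$, where $p_i$ is the proportion of class $i$ in the node. Entropy is defined over class proportions, so regression trees typically use a measure such as reduction in variance instead.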
- The higher the entropy, the higher the impurity of the resulting node.
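
A small sketch of the entropy formula, assuming class proportions are passed in directly. The function name is illustrative; terms with $p_i = 0$ are skipped because they contribute nothing to the sum.

```python
from math import log2


def entropy(probabilities):
    """Entropy in bits of a class distribution given as proportions."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


# An evenly mixed node is maximally impure; a pure node has zero entropy.
for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, round(entropy(probs), 3))
# [0.5, 0.5] 1.0
# [0.9, 0.1] 0.469
# [1.0, 0.0] 0.0
```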

### Information Gain

- **Information Gain**: Used to measure the reduction of uncertainty in the dataset, given some feature. It's a deciding factor of which attribute should be selected for the next data split.
- Interpretation of IG = 0.37: the entropy of the dataset with this split will decrease by 0.37.
- IG is a statistical property that measures how well a given attribute separates the training examples according to their target values.
- For Classification: $IG = E(Y) - E(Y|X)$, or $IG = E(parent) - E(parent|new feature)$
    - Where $E(Y)$ is the entropy of the full dataset
    - Where $E(Y|X)$ is the entropy of the dataset given some feature
    - [Article showing the math](https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/)
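- For a binary split with a general impurity measure $I$ (for example, variance in regression): $IG(D_p,f)=I(D_p)-\frac{N_{left}}{N}I(D_{left})-\frac{N_{right}}{N}I(D_{right})$
    - Where $D_p$ is the parent dataset, $f$ is the feature used for the split, and $N$, $N_{left}$, $N_{right}$ are the sample counts of the parent and its two children
- Tends to favor smaller partitions with many distinct values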
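
A worked example may make the classification formula concrete. The counts are made up for illustration: a parent node holds 5 positive and 5 negative samples, and a hypothetical feature $X$ splits it into two children of 5 samples each, one with 4 positive and 1 negative, the other with 1 positive and 4 negative. By symmetry both children have the same entropy.

$$
\begin{aligned}
E(Y) &= -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1 \\
E(\text{child}) &= -0.8\log_2 0.8 - 0.2\log_2 0.2 \approx 0.722 \\
E(Y|X) &= \tfrac{5}{10}(0.722) + \tfrac{5}{10}(0.722) \approx 0.722 \\
IG &= E(Y) - E(Y|X) \approx 1 - 0.722 = 0.278
\end{aligned}
$$

So splitting on $X$ would reduce the entropy of this dataset by roughly 0.278 bits.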

### Gini Index

- **Gini Index**: Cost function used to evaluate splits in the dataset. In CART, it is used with binary splits only.
- Higher GI implies higher inequality and higher heterogeneity
- Steps:
    - Calculate Gini for sub-nodes: $Gini=1-\sum_i p_i^2$
        - Where $p_i$ are the probabilities of each class
    - Calculate GI for the split using the weighted Gini score of each node of that split
- Favors larger partitions and is easy to implement
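
The two steps above can be sketched as follows. The function names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 1 - sum(p_i^2) over the classes present."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_of_split(left, right):
    """Weighted Gini of a binary split; lower means purer children."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


# Toy labels, made up for illustration.
left = ["yes", "yes", "yes", "no"]
right = ["no", "no", "no", "yes"]
print(round(gini(left + right), 3))          # 0.5   (parent: evenly mixed)
print(round(gini_of_split(left, right), 3))  # 0.375 (children are purer)
```

When comparing candidate splits, the one with the lowest weighted Gini would be preferred.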

### Other Attribute Selection Criteria

- Gain Ratio
- Reduction in Variance
- Chi-Square
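
Of these, reduction in variance is the usual choice for regression targets. It has the same shape as the impurity-based gain in the Information Gain section, with variance as the impurity measure. A minimal sketch, assuming population variance and made-up target values:

```python
from statistics import pvariance


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    return (pvariance(parent)
            - len(left) / n * pvariance(left)
            - len(right) / n * pvariance(right))


# Toy target values, made up for illustration: the split separates
# the low values from the high values almost perfectly.
left = [10.0, 12.0, 11.0]
right = [30.0, 32.0, 31.0]
print(round(variance_reduction(left + right, left, right), 3))  # 100.0
```

A larger reduction means the children are more homogeneous in the target than the parent was.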

### Pruning

- **Pruning**: Removing branches that have little or no significance in the decision-making process. This can help with overfitting.
    - **Pre-pruning** stops the tree from growing further while it is being built, for example by limiting its depth or the minimum number of samples per node.
    - **Post-pruning** is done after the tree is built.
- Pruning is best done with a validation dataset
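
One simple form of post-pruning is reduced-error pruning, sketched below under stated assumptions: the tree is a nested dict with hypothetical keys `feature`, `majority` (the majority training class at that node), and `children`, and leaves are plain class labels. Working bottom-up, a subtree is replaced by its majority-class leaf whenever that does not lower accuracy on the validation rows that reach it. Other post-pruning methods exist; this is only one illustration.

```python
def predict(tree, row):
    """Walk the tree until a leaf (any non-dict node) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["majority"])
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == l for r, l in zip(rows, labels)) / len(labels)


def prune(tree, rows, labels):
    """Reduced-error pruning against validation data (rows, labels)."""
    if not isinstance(tree, dict) or not rows:
        return tree  # leaf, or no validation evidence: leave unchanged
    feature = tree["feature"]
    children = {}
    for value, child in tree["children"].items():
        reached = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
        children[value] = prune(child, [r for r, _ in reached], [l for _, l in reached])
    pruned = {"feature": feature, "majority": tree["majority"], "children": children}
    leaf = tree["majority"]
    # Collapse to a leaf if that is at least as accurate on validation data.
    if accuracy(leaf, rows, labels) >= accuracy(pruned, rows, labels):
        return leaf
    return pruned


# Toy overfit tree and validation set, made up for illustration.
tree = {"feature": "windy", "majority": "yes", "children": {
    "no": "yes",
    "yes": {"feature": "outlook", "majority": "no",
            "children": {"sunny": "yes", "rainy": "no"}},
}}
val_rows = [
    {"windy": "yes", "outlook": "sunny"},
    {"windy": "yes", "outlook": "rainy"},
    {"windy": "no", "outlook": "sunny"},
    {"windy": "no", "outlook": "rainy"},
]
val_labels = ["no", "no", "yes", "yes"]
pruned_tree = prune(tree, val_rows, val_labels)
print(pruned_tree)
```

Here the `outlook` subtree is collapsed to the leaf `"no"` because it does not help on the validation rows, while the root split on `windy` is kept.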


## Improving the Model

- Because a decision tree tends to overfit, especially when there are a large number of features, you can improve it with:
    - hyperparameter tuning (parameters that limit the size / depth of a tree)
    - pruning
    - PCA
    - random forests
