# Module 10 - Decision Trees Part Two

---

## Learning outcomes

- LO 1: Select the most appropriate tree depth for making a prediction.
- LO 2: Identify important elements of tree ensembles.
- LO 3: Implement important elements of tree ensembles.
- LO 4: Recognise the importance of features in tree ensembles.
- LO 5: Identify real-life applications of decision trees.
- LO 6: Discuss the concepts of interpretability and fairness.
- LO 7: Compare the functionality of k-nearest neighbours and decision trees.

## Misc and Keywords
- **Perturbing** is the process of adding noise to data
- **Interpretability** in machine learning is often described as whether a human can understand why the machine learning model has made the decision it has. Interpretability can be an important attribute as it helps us understand if the model is working correctly and fairly. Not all models are interpretable, and some non-interpretable models are very useful. However, people may be more likely to adopt an interpretable model in certain contexts. For example, doctors are much more likely to use a machine learning model if they understand how it’s coming to decisions compared to a black box model where the decision-making process is more opaque.
- **Fairness** in AI is the concept that any AI model should treat all people equally and fairly. It is important that machine learning practitioners don’t allow prejudices and biases from society to creep into their models and reinforce inequalities that marginalised groups already face. Models can become biased when the people building them deliberately or inadvertently include their own prejudices in how they build the model. This is one reason why it is important to have fair representation among machine learning practitioners and for practitioners to consult affected populations (e.g. patients or customers) when building a model.

## Module Summary Description
- Looks at an approach to avoiding overfitting that is more powerful than tree pruning: **tree ensembles**.
- How tree ensembles work:
    - Each individual tree is unique, which is achieved by perturbing (adding noise to) the training data, e.g. by bootstrapping
    - The ensemble as a whole is more robust against noise than any single tree
    - For any new data point, the output is predicted either through majority vote (classification) or averaging (regression)
- The disadvantage of tree ensembles is the loss of interpretability: with a single tree we can follow how each choice is made, but with many trees combined we cannot.
- When deciding whether to use ensembles or individual trees, you need to decide which is more important: the predictive performance (ensembles) or the interpretability (individual trees)
- Three ways to construct tree ensembles: Bagging, Random Forests, AdaBoost
- Tree ensembles reduce variance. Random forests do so at the cost of a slightly higher bias, since the optimal split is not always taken at each tree node

---

## Selecting the Tree Depth
- If the tree depth is too shallow, your model won’t be as accurate, and you’ll get a high misclassification rate.
- If the tree depth is too deep, your model will overfit to the training data and won’t perform well when presented with new data.
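
One common way to pick a depth between these two extremes is to score a range of candidate depths on held-out data. The sketch below is a minimal illustration, not the module's prescribed procedure: the synthetic dataset, the 70/30 split and the candidate range 1 to 15 are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidate_depths = range(1, 16)  # assumed search range
val_scores = {}
for depth in candidate_depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Score on held-out data, not on the data the tree was fitted to.
    val_scores[depth] = tree.score(X_val, y_val)

# Depth with the highest validation accuracy (first one wins on ties).
best_depth = max(val_scores, key=val_scores.get)
print(f"best depth: {best_depth}, validation accuracy: {val_scores[best_depth]:.3f}")
```

Very shallow trees tend to score poorly on both training and validation data, while very deep trees tend to score well on training data only, so the validation score is the one to compare.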

## Bagging
- Decision trees are very powerful but can suffer from high variance. One solution is to go from a single decision tree to a collection of decision trees. This involves combining bootstrapping with averaging, which is known as the **bagging algorithm**
- Bagging works as follows:
    - **Tree construction**: For each decision tree $b = 1, ..., B$ (e.g., $B = 100$)
        - Sample $n$ training data points with replacement (bootstrapping)
        - Grow a complete tree from these samples
    - **Prediction**
        - Predict outcome of a new datapoint in each tree $b = 1, ..., B$
        - Take a majority vote (classification) or average (regression)
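- Bagging is expected to decrease the overall variance of our prediction; however, the reduction can be small.
    - This occurs when the trees are highly correlated. If each tree has variance $\sigma^2$ and any two trees have correlation coefficient $\rho$, the variance of their average is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$, so the first term does not shrink however large $B$ is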
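
The two bagging steps above can be sketched by hand. This is a minimal illustration under stated assumptions (synthetic binary data with labels 0/1, $B = 100$, scikit-learn trees as the base learner), not a reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (labels 0/1) stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
B = 100
n = len(X_train)

# Tree construction: one fully grown tree per bootstrap sample.
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)  # n row indices drawn with replacement
    tree = DecisionTreeClassifier(random_state=b)  # no depth limit
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Prediction: each row of all_preds holds one tree's predictions.
all_preds = np.array([t.predict(X_val) for t in trees])
# Majority vote for 0/1 labels: predict 1 when more than half the trees vote 1
# (an exact tie falls to class 0).
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

bagged_acc = (bagged_pred == y_val).mean()
print(f"bagged validation accuracy: {bagged_acc:.3f}")
```

For regression, the vote would be replaced by `all_preds.mean(axis=0)` with regression trees as the base learner.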
 
### Decorrelating Trees
- Two algorithms to decorrelate trees, **random forests** and **boosting**

##### Random Forests
- The idea is to force the decision trees to look at different subsets of predictors at each split
- **Tree construction**: For each decision tree $b = 1, ..., B$ (e.g., $B = 100$)
    - Sample $n$ training data points with replacement (bootstrapping)
    - Grow a complete tree from these samples
    - **However**, at each splitting step only a random sample of $m$ predictors is considered (e.g., $m = \sqrt{p}$, where $p$ is the number of predictors). That is, instead of using all $p$ predictors, we use a sample of them.
- **Prediction**
    - Predict outcome of a new datapoint in each tree $b = 1, ..., B$
    - Take a majority vote (classification) or average (regression)
- Consequently, random forests introduce a bias, in the hope that this bias is far outweighed by the reduction in variance.
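
In scikit-learn, the number of predictors considered at each split is set with the `max_features` parameter of `RandomForestClassifier`. The sketch below assumes synthetic data with $p = 16$ predictors, so $m = \sqrt{p} = 4$. Setting `max_features=None` makes every split consider all predictors, which corresponds to plain bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with p = 16 predictors (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=16, n_informative=5, random_state=3)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=3)

# Random forest: each split considers a random sample of m = sqrt(p) predictors.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=3)
forest.fit(X_train, y_train)

# Same ensemble, but every split considers all p predictors (plain bagging).
bagged = RandomForestClassifier(n_estimators=100, max_features=None, random_state=3)
bagged.fit(X_train, y_train)

m = forest.estimators_[0].max_features_  # value of m inferred by the first tree
print(f"m = {m} of p = {X.shape[1]} predictors per split")
print(f"random forest validation accuracy: {forest.score(X_val, y_val):.3f}")
print(f"bagging validation accuracy:       {bagged.score(X_val, y_val):.3f}")
```

Which of the two scores higher depends on the data; the random forest is not guaranteed to win on any particular split.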

##### Boosting
- The key idea is to attach a weight to each training sample: the higher the weight, the more important the training sample
- Initialise by giving all training data records $i = 1, ..., N$ the same weight $w_i$
- **Tree construction**: For each decision tree $b = 1, ..., B$ (e.g., $B = 1,000$)
    - Grow a shallow tree (i.e, depth 1 or 2) from weighted records
    - Increase the weights of the misclassified training records
    - Decrease the weights of the correctly classified records, and repeat for each tree.
- **Prediction**
    - Predict outcome of a new datapoint in each tree $b = 1, ..., B$
    - Take a majority vote (classification) or average (regression)
    - For better performance, weight each tree's contribution by how well it performed, so that trees with a lower misclassification rate count for more
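
The weight updates above can be sketched with the standard discrete AdaBoost formulas for two classes labelled $-1$ and $+1$. The specific update rule (`alpha`, the exponential reweighting) is the textbook AdaBoost rule and is an assumption here, since the notes only describe it qualitatively. $B = 50$ is used instead of 1,000 to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=2)
y = 2 * y01 - 1  # relabel classes 0/1 as -1/+1
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=2)

B = 50
N = len(X_train)
w = np.full(N, 1.0 / N)  # every record starts with the same weight
stumps, alphas = [], []
for b in range(B):
    # Shallow tree (depth 1) fitted to the weighted records.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_train, sample_weight=w)
    pred = stump.predict(X_train)
    # Weighted misclassification rate (weights sum to 1); clipped to avoid log(0).
    err = np.clip(w[pred != y_train].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # lower error gives a larger say
    # y * pred is -1 for mistakes and +1 for hits, so for alpha > 0
    # mistakes are up-weighted and hits are down-weighted.
    w = w * np.exp(-alpha * y_train * pred)
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Prediction: weighted vote of all stumps.
scores = sum(a * s.predict(X_val) for a, s in zip(alphas, stumps))
boosted_pred = np.where(scores >= 0, 1, -1)
boosted_acc = (boosted_pred == y_val).mean()
print(f"first stump alone: {stumps[0].score(X_val, y_val):.3f}")
print(f"boosted ensemble:  {boosted_acc:.3f}")
```

scikit-learn's `AdaBoostClassifier` packages the same idea; the hand-written loop is only meant to make the reweighting visible.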

#### Tree Ensembles
- In assignment 10.2 on tree ensembles, we:
    - Evaluate three models:
        - A single decision tree
        - A random forest classifier
        - The AdaBoost classifier
    - We select the best model, that is the one that performs best on validation data
    - We can then use feature selection to help understand the model and remove any irrelevant features. Two approaches in particular were:
        - Impurity metric approach
            - One of the downsides of the inbuilt impurity metric is that it can only be applied to training data. This doesn't give us any indication of which features will be the most important on unseen data.
            - The Impurity Metric approach evaluates feature importance based on how much a feature reduces the impurity in decision trees. Impurity measures how mixed the target variable is within the subsets created by a decision tree’s splits. Common impurity measures include Gini Impurity or Entropy. When a feature is used in a decision tree split, it divides the data into purer subsets—subsets with more homogenous target values. The more a feature reduces impurity, the more important it is for the model's decision-making process. Features that result in larger reductions in impurity are deemed more important, while features with minimal impact on impurity reduction are considered less important. This method inherently works well with decision trees, as it directly measures the contribution of each feature to the tree's ability to make accurate splits and predictions.
        - Permutation importances
            - We can use the permutation importance to measure the feature importances on both the training and validation sets.
            - The permutation importance function measures the change in model performance when the values of one feature are randomly shuffled. If shuffling a feature makes the model's performance noticeably worse, the model relies on that feature and its permutation importance score will be high. If a feature is shuffled and has minimal impact on the model's performance, it means the feature has little importance, and its permutation importance score will be close to zero. On the other hand, if shuffling a feature actually improves the model's performance, this suggests that the feature might be harmful or irrelevant to the model. In this case, the feature is assigned a negative permutation importance, indicating that it could introduce noise or cause overfitting.
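
The model comparison step described above might look like the following. This is a hedged sketch, not the assignment's actual code: the synthetic data, the default hyperparameters and the names `models`, `val_accuracy` and `best_name` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the assignment dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4, random_state=4)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=4)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the validation split.
val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = model.score(X_val, y_val)
    print(f"{name}: {val_accuracy[name]:.3f}")

# Select the model that performs best on validation data.
best_name = max(val_accuracy, key=val_accuracy.get)
print(f"selected model: {best_name}")
```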

### Determining the importance of features in tree ensembles
- Often, you want to find out which features in your decision tree are the most important, meaning which features have a larger effect on the predicted outcome than others. This helps you to interpret the results and allows you to drop features that turn out to be unimportant. You can determine the relative importance of factors using a feature importance metric.
- One way to think about a feature’s importance is through the expected fraction of samples the feature will contribute to and the reduction in impurity you get from splitting on that variable. Many packages have inbuilt feature importance metrics which weight the reduction in impurity (e.g. Gini index or entropy) by the number of samples that the split affects. It’s also easy to average this metric over many trees to get the mean decrease in impurity.
- However, these impurity-based metrics have some drawbacks:
    - Impurity-based metrics tend to favour features with many unique values. This means they will often put a lot more importance on numerical features than on categorical ones with few categories.
    - Training data is used to calculate impurity-based metrics, meaning they don’t give any indication of performance on held-out test data.
- An alternative measure of feature importance is the permutation importance. This works by calculating the increase in the model’s predictive error after randomly permuting a variable. In practice, this means randomly shuffling one column of the input data and seeing how that affects the predictive error of the model. The permutation importance is a useful metric, as it is quick to calculate, easy to interpret and usable on training and test data. However, the random nature of the permutation means the metric may give slightly different answers depending on the random seed used.
- One other thing to be aware of when using the permutation importance is that if we have two highly correlated features, the feature importance of both will lessen. For example, say you are predicting the number of visitors to a park and, out of the features you are looking at, temperature is the most important one. The permutation importance would rank the temperature highest. If you were then to add a time-of-year feature, which is highly correlated with temperature, the importance of both the time of year and temperature would drop, meaning they may rank somewhere in the middle of the feature importance list.
- Feature selection can help us understand our model and the outputs it gives us, as well as remove any irrelevant predictors. In this section, we will be looking at how to identify the most important features in a decision tree using two different methods:
    - Impurity metric approaches
    - Permutation importances
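
Both methods are available in scikit-learn: a fitted forest exposes impurity-based importances as `feature_importances_`, and `sklearn.inspection.permutation_importance` computes the permutation importance on any dataset you pass in. The sketch below assumes synthetic data in which only the first three columns carry signal (this is what `shuffle=False` and `n_redundant=0` arrange), so the two rankings can be checked against a known answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the 3 informative features are columns 0-2; columns 3-7 are noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importance: computed from the training data, sums to 1.
impurity_importance = rf.feature_importances_

# Permutation importance on the validation set: mean drop in accuracy
# over n_repeats random shuffles of each column.
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
perm_importance = perm.importances_mean

for i in range(X.shape[1]):
    print(f"feature {i}: impurity={impurity_importance[i]:.3f}  "
          f"permutation={perm_importance[i]:+.3f}")
```

Fixing `random_state` in `permutation_importance` makes the scores repeatable; changing it illustrates the seed dependence mentioned above.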

---

## Random Forests vs AdaBoost

### **Random Forests**
- **Type**: Bagging (Bootstrap Aggregating)  
- **Base Learner**: Decision Trees (typically deep trees)  
- **Training**: Trains multiple decision trees independently on bootstrap samples of the data (drawn with replacement), considering only a random subset of features at each split.  
- **Combination**: Averages (for regression) or majority vote (for classification) from all trees.  
- **Strengths**:
    - Reduces overfitting compared to individual decision trees.  
    - Works well with high-dimensional data; some implementations can also handle missing values.  
    - Robust to noisy data.  
- **Weaknesses**:
    - Can be computationally expensive.  
    - Predictions are not as interpretable as a single decision tree.  

### **AdaBoost (Adaptive Boosting)**
- **Type**: Boosting  
- **Base Learner**: Weak learners (often shallow decision trees called "stumps")  
- **Training**: Sequentially trains weak models, where each subsequent model focuses more on misclassified instances from the previous ones by assigning higher weights to difficult samples.  
- **Combination**: Weighted sum of weak learners' predictions.  
- **Strengths**:
    - Often achieves higher accuracy than Random Forests on clean and structured data.  
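    - Works well with simple base models and is often resistant to overfitting on clean data.  
    - Individual stumps are easy to inspect, although the weighted ensemble as a whole is still less interpretable than a single tree.  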
- **Weaknesses**:
    - Sensitive to noisy data and outliers since it gives higher weight to misclassified points.  
    - Can be slower for large datasets due to sequential training.  

### **When to Use Which?**
- **Use Random Forests** when you need a robust model that handles large datasets well and is resistant to noise.  
- **Use AdaBoost** when you want a model that prioritises difficult cases and when your dataset is relatively clean and structured.


### Bootstrapping
- Bootstrapping typically refers to sampling with replacement from the dataset to create different subsets of the original data. Whether instances, features, or both are sampled depends on the specific method used:
1. Bootstrapping in Bagging (Bootstrap Aggregating)
    - In plain bagging (e.g., bagged decision trees):
        - Bootstrapping is applied to instances (rows), meaning some instances may be selected multiple times while others may not be selected at all.
        - However, all features (columns) are usually used for training each model unless additional feature selection is applied.
2. Bootstrapping in Random Forests (Instance & Feature Selection)
    - Random Forests extend bagging by adding random feature selection:
        - Instances are sampled with replacement (bootstrapped)
        - A random subset of features is selected at each split of each decision tree
        - This means an instance can appear several times within the same tree's training sample, whereas a feature appears at most once in the candidate set for a given split but can be considered again at other splits and in other trees.
3. Bootstrapping in General Statistics (Resampling)
    - In traditional bootstrapping (used in statistics to estimate confidence intervals):
        - Only instances (observations) are resampled.
        - Features (variables) remain the same.
4. Bootstrapping in Gradient Boosting
    - Gradient Boosting (GBM, XGBoost, LightGBM) typically does not use bootstrapping by default.
    - Instead, it builds trees sequentially and can use subsampling (without replacement) of instances or features for regularisation.
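
The difference between bootstrapping rows and subsampling features can be seen directly with NumPy. This is a small illustrative sketch; the sizes `n`, `p` and `m` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bootstrapping instances: draw n row indices WITH replacement.
n = 10_000
idx = rng.integers(0, n, size=n)
unique_fraction = np.unique(idx).size / n  # share of rows drawn at least once
oob_fraction = 1 - unique_fraction         # share of rows never drawn ("out of bag")
print(f"rows drawn at least once: {unique_fraction:.3f}")
print(f"rows never drawn:         {oob_fraction:.3f}")

# Feature subsampling for one split: draw m of p columns WITHOUT replacement.
p, m = 16, 4
feature_subset = rng.choice(p, size=m, replace=False)
print(f"candidate features for this split: {sorted(feature_subset.tolist())}")
```

For large `n`, the share of rows drawn at least once approaches $1 - 1/e \approx 0.632$, which is why roughly a third of the data is left out of each bootstrap sample.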