# Fundamental Structure of Decision Trees

Decision trees are **hierarchical models** that represent a sequence of decisions, leading to predictions or classifications. They have a tree-like structure, where each node represents a decision or a prediction, and each branch represents an outcome of the decision. The tree starts at the top with the **root node**, which contains the entire dataset, and branches down, ending with **leaf nodes**, which contain the final predictions or classifications. This structure reflects a step-by-step process of answering questions or evaluating conditions to arrive at an outcome.

## Root Nodes, Internal Nodes, and Leaf Nodes

A decision tree is composed of three types of nodes: the **root node**, **internal nodes**, and **leaf nodes**. The root node and the internal nodes each test a condition on an input feature, while the leaf nodes hold the predictions. The nodes are connected by branches, each of which represents one outcome of a test.

Let's go through the definitions and examples of these nodes:

1. **Root Node:**
The root node is the starting point of the decision tree. It contains the entire dataset and poses the first question or condition to split the data into subsets. The root node sets the tone for the subsequent decisions and branches in the tree.

<font color='Blue'><b>Example:</b></font>
In a medical diagnosis context, the root node could ask, "Is the patient's temperature above 38°C?" Based on the answer, the data splits into two subsets: one with high temperature and another with normal temperature. This split forms the basis for the subsequent tree structure, where further questions are asked to narrow down the diagnosis.

2. **Internal Nodes:**
Internal nodes are the decision points in the tree. They present questions or conditions about specific features to further split the data. They guide the data down different branches based on the answers or outcomes of the questions or conditions. Each internal node represents a feature and its associated condition, which determines the path that the data points will follow as they traverse the tree. Internal nodes play a crucial role in navigating through the decision-making process.

<font color='Blue'><b>Example:</b></font>
Continuing with the medical diagnosis example, an internal node might ask, "Does the patient have a cough?" If the answer is "Yes," the data proceeds down one branch, and if the answer is "No," it goes down another. This split allows the model to consider various symptoms and their interactions in making an accurate diagnosis.

```{figure} Tree_Based_Example.png
---
width: 500px
align: center
---
An Example of Root Node and Internal Nodes
```

3. **Leaf Nodes:**
Leaf nodes are the end points of the decision tree. They represent the final predictions or classifications that the tree makes. Each leaf node corresponds to a specific outcome, category, or numerical value. Once a data point reaches a leaf node, the prediction or classification associated with that leaf node becomes the final decision for that data point.

<font color='Blue'><b>Example:</b></font>
As we progress in the medical diagnosis scenario, we arrive at the leaf nodes, which represent the final conclusions in the decision tree. For instance, a leaf node could correspond to the diagnosis 'Common Cold' if a patient's symptoms match that specific outcome. Alternatively, another leaf node might correspond to the diagnosis 'Influenza' for a different set of symptoms. Each leaf node signifies a definitive decision or classification, and once a patient's data reaches a leaf node, the corresponding diagnosis becomes the final determination for that case.
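
The three node types can also be seen in a fitted model. The following is a minimal sketch, assuming scikit-learn is installed and using its bundled iris dataset rather than the medical example above; `max_depth=2` and the `node_types` dictionary are illustrative choices, not part of the original text. It relies on the fitted estimator's `tree_` attribute, in which a leaf is stored with the same sentinel value for both of its children.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a deliberately small tree so the structure is easy to read.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

tree = clf.tree_
node_types = {}  # hypothetical helper mapping node id -> node type
for node_id in range(tree.node_count):
    # A leaf has no children: both child entries hold the same sentinel value.
    is_leaf = tree.children_left[node_id] == tree.children_right[node_id]
    if is_leaf:
        node_types[node_id] = "leaf"
    elif node_id == 0:
        node_types[node_id] = "root"
    else:
        node_types[node_id] = "internal"

for node_id, kind in node_types.items():
    n = tree.n_node_samples[node_id]
    if kind == "leaf":
        print(f"node {node_id}: leaf, samples={n}")
    else:
        feature = tree.feature[node_id]
        threshold = tree.threshold[node_id]
        print(f"node {node_id}: {kind}, samples={n}, "
              f"test: feature[{feature}] <= {threshold:.2f}")
```

Node 0 is the root and receives every training sample; each test sends a sample to the left child when the condition holds and to the right child otherwise.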

## Popular Tree-Based Methods

Tree-based methods are powerful and versatile techniques for supervised learning, both for classification and regression problems {cite:p}`james2023introduction`. They use a hierarchical structure of splits to partition the feature space into regions, and then assign a class label or a numerical value to each region. Splits are most often binary and axis-aligned, meaning that each split compares a single feature with a threshold value and sends the data down one of two branches. Some algorithms also allow multiway splits, with more than two branches per node, or oblique splits based on a linear combination of features. Some of the most widely used tree-based methods are:

- **Decision trees**: The simplest form of tree-based method. Each split is chosen by a criterion that measures split quality, such as Gini impurity, entropy (equivalently, information gain), or misclassification rate; some algorithms use statistical tests such as the chi-square test instead. The criterion determines the best feature and threshold for each split. Decision trees are easy to interpret and visualize, but they tend to overfit the data and have high variance.
- **Random forests**: An ensemble method that combines many decision trees, each trained on a bootstrap sample of the data, with only a random subset of the features considered at each split. A bootstrap sample is a random sample drawn with replacement from the original data. The final prediction is obtained by averaging the predictions of the individual trees for regression problems, or by majority voting for classification problems. Random forests reduce the variance and overfitting of single decision trees, and improve their accuracy and robustness.
- **Boosting**: Another family of ensemble methods that builds trees sequentially, with each new tree concentrating on the errors of the current ensemble, and combines them into a strong learner. Gradient boosting fits each tree to the pseudo-residuals, which are the negative gradient of the loss function with respect to the current predictions, whereas AdaBoost increases the weights of poorly predicted samples. Boosting algorithms can achieve high accuracy, but they can overfit and require careful tuning of parameters such as the number of trees and the learning rate.
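
The idea of fitting trees to pseudo-residuals can be sketched by hand. The example below is a simplified illustration and not how any particular library implements boosting: it assumes a squared-error loss, for which the negative gradient is simply the residual, and it uses synthetic data with arbitrary values for `learning_rate`, `n_rounds`, and the tree depth.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # illustrative value, not a recommendation
n_rounds = 50

# Start from a constant prediction; the mean minimizes squared error.
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For the loss 0.5 * (y - f)^2, the negative gradient is the residual y - f.
    pseudo_residuals = y - prediction
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    small_tree.fit(X, pseudo_residuals)
    # Add a damped version of the new tree to the running prediction.
    prediction += learning_rate * small_tree.predict(X)
    trees.append(small_tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

Each round shrinks the training error a little; the learning rate controls how much of each new tree is added, which is one reason boosting needs careful tuning.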

Scikit-Learn (sklearn) is a popular Python library that offers an array of robust and efficient methods for building classification and regression trees, each tailored to specific tasks and data complexities {cite:p}`sklearnUserGuide`. Sklearn provides a consistent and user-friendly interface for creating, fitting, and evaluating tree-based models, as well as tools for feature selection, preprocessing, and visualization.

```{admonition} Examples of Classification Tree Methods
:class: tip

The following are some examples of classification tree methods available in sklearn {cite:p}`sklearnUserGuide`:

1. **[DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html):** This class implements an optimized version of the CART algorithm to build classification trees. The `criterion` parameter selects the measure of split quality: Gini impurity (`"gini"`, the default) or entropy (`"entropy"`; recent versions also accept the equivalent `"log_loss"`). The criterion determines the best feature and threshold for each split. You can control the complexity and size of the tree by tuning parameters such as `max_depth` and `min_samples_split` {cite:p}`sklearnUserGuide, gallatin2023machine`.

2. **[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn-ensemble-randomforestclassifier):** This class combines multiple decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split, and averages their predicted class probabilities. A bootstrap sample is a random sample drawn with replacement from the original data. This reduces the variance and overfitting of single trees, and improves the accuracy and robustness of the model {cite:p}`sklearnUserGuide, géron2022hands`.

3. **[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn-ensemble-gradientboostingclassifier):** This class applies the gradient boosting technique to classification problems. It sequentially fits regression trees to the pseudo-residuals of the current ensemble, and scales each tree's contribution by a learning rate to form a strong learner. The pseudo-residuals are the negative gradient of the loss function with respect to the predictions. This allows the model to capture complex nonlinear relationships in the data, but it also requires careful tuning of the parameters and may overfit the data {cite:p}`sklearnUserGuide`.

4. **[AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn-ensemble-adaboostclassifier):** This class uses the adaptive boosting algorithm to combine several weak classifiers, usually decision trees, into a powerful classification model. It increases the sample weights of the misclassified instances, and focuses the learning on the difficult cases. The sample weights are used to modify the contribution of each instance to the loss function. This enhances the accuracy and performance of the model, but it may also be sensitive to noise and outliers {cite:p}`sklearnUserGuide`.

```
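
Because these classifiers share the same `fit` / `score` interface, they can be swapped in and out of the same code. The following is a minimal sketch, assuming scikit-learn's bundled breast cancer dataset and mostly default hyperparameters; the split sizes, `max_depth=3`, and the `accuracy` dictionary are illustrative choices, and the resulting scores are not a benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Every estimator exposes the same fit / predict / score methods.
classifiers = {
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=3, random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

accuracy = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    # For classifiers, score() returns the mean accuracy on the given data.
    accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {accuracy[name]:.3f}")
```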

```{admonition} Examples of Regression Tree Methods
:class: tip

The following are some examples of regression tree methods available in sklearn {cite:p}`sklearnUserGuide`:

1. **[DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn-tree-decisiontreeregressor):** This class uses an optimized version of the CART algorithm to construct regression trees. It splits the feature space based on a criterion, which can be squared error (the default), Friedman MSE, absolute error, or Poisson deviance, depending on the parameter `criterion`. The criterion determines the optimal feature and threshold for each split. DecisionTreeRegressor can produce accurate and interpretable predictions, but it may also overfit the data and have high variance {cite:p}`sklearnUserGuide`.

2. **[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor):** This class combines many regression trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split, and averages their predictions. This reduces the variance and overfitting of single trees, and improves the accuracy and robustness of the model. RandomForestRegressor can handle complex and nonlinear relationships in the data, but, depending on the scikit-learn version, it may require complete data because missing values are not handled internally {cite:p}`sklearnUserGuide`.

3. **[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn-ensemble-gradientboostingregressor):** This class applies the gradient boosting technique to regression problems. It sequentially fits regression trees to the pseudo-residuals of the current ensemble, and scales each tree's contribution by a learning rate to form a strong learner. The pseudo-residuals are the negative gradient of the loss function with respect to the predictions. GradientBoostingRegressor can achieve high accuracy and performance, but it also requires careful tuning of the parameters and may overfit the data {cite:p}`sklearnUserGuide`.

4. **[AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn-ensemble-adaboostregressor):** This class uses the adaptive boosting algorithm to combine several weak regression models, usually decision trees, into a powerful predictor. It increases the sample weights of the data points with high prediction errors, and focuses the learning on the difficult cases. The sample weights are used to modify the contribution of each data point to the loss function. AdaBoostRegressor can enhance the accuracy and performance of the model, but it may also be sensitive to noise and outliers {cite:p}`sklearnUserGuide`.

```
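
The regression estimators follow the same pattern. The sketch below assumes synthetic data from scikit-learn's `make_friedman1` generator and mostly default hyperparameters; the sample size, noise level, `max_depth=4`, and the `r2` dictionary are illustrative choices, so the scores only show how the estimators are called, not how they rank in general.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression problem (illustrative only).
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

regressors = {
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=4, random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
}

r2 = {}
for name, model in regressors.items():
    model.fit(X_train, y_train)
    # For regressors, score() returns the coefficient of determination R^2.
    r2[name] = model.score(X_test, y_test)
    print(f"{name}: test R^2 = {r2[name]:.3f}")
```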

<!-- ## Process of Building Decision Trees

The process of building decision trees for regression and classification tasks involves similar steps with some variations in criteria and considerations. Here's an overview of the process for both scenarios {cite:p}`bishop2006pattern,alpaydin2020introduction,james2023introduction`:

**Building Decision Trees for Classification:**

1. **Data Preparation:** Begin with a labeled dataset containing input features and corresponding class labels.

2. **Choosing a Criterion:** Select a criterion to measure the quality of splits. Common criteria include Gini impurity and entropy. These quantify the impurity or randomness in class distribution within subsets.

3. **Root Node Selection:** Choose the best feature and split point for the root node based on the selected criterion. This minimizes impurity or maximizes information gain.

4. **Recursive Splitting:** For each internal node, recursively select the best feature and split point based on the chosen criterion. Continue until a stopping criterion is met, such as reaching a maximum depth or a minimum number of samples in a leaf node.

5. **Leaf Node Creation:** When a stopping criterion is met or the data is sufficiently pure, create a leaf node. Assign the majority class in the subset to the leaf node.

6. **Prediction:** To classify a new data point, traverse the tree from the root to a leaf node based on feature conditions. The leaf node's majority class becomes the predicted class.

**Building Decision Trees for Regression:**

1. **Data Preparation:** Begin with a dataset containing input features and corresponding numeric target values.

2. **Choosing a Criterion:** In regression, the criterion typically used is mean squared error (MSE). It measures the average squared difference between predicted and actual target values.

3. **Root Node Selection:** Choose the best feature and split point for the root node to minimize the MSE.

4. **Recursive Splitting:** Similar to classification, recursively select the best feature and split point based on the MSE. Stop when a stopping criterion is met.

5. **Leaf Node Creation:** When the stopping criterion is met, create a leaf node. Assign the mean of target values in the subset to the leaf node.

6. **Prediction:** To predict a new target value, traverse the tree from the root to a leaf node based on feature conditions. The leaf node's mean target value becomes the predicted value.

Both regression and classification decision trees aim to create a tree structure that captures patterns and relationships in the data. The main difference lies in the choice of criterion and the type of output being predicted. Keep in mind that while decision trees can be highly interpretable, they are susceptible to overfitting. Techniques like setting maximum depth, pruning, and using ensemble methods can help address this challenge and improve the model's generalization capabilities. -->

## Advantages and Disadvantages of Decision Trees

**Decision Trees (DTs)** are a type of machine learning technique that can be used for both classification and regression tasks. They work by building a tree-like model where each internal node represents a test on an input feature, and each leaf node represents a predicted value or class. Decision trees aim to create simple and interpretable decision rules from the data, and they can approximate complex and nonlinear relationships using piecewise constant functions {cite:p}`james2023introduction,sklearnUserGuide`.

### Advantages of Decision Trees:

- **Interpretability**: Decision trees are easy to understand and explain, and their structure can be visualized using diagrams or graphs.
- **Minimal Data Preparation**: Decision trees do not require much data preprocessing, such as normalization, dummy encoding, or missing value imputation. However, some implementations, such as scikit-learn's, may have some limitations for categorical features or incomplete data.
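- **Efficiency**: Decision trees are fast to use for prediction: for a reasonably balanced tree, the cost of predicting a data point grows logarithmically with the number of training samples, because only one root-to-leaf path is followed. Training is more expensive, since candidate splits must be evaluated across features and samples at every node.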
- **Handling of Mixed Data**: Decision trees can handle both numerical and categorical features, although some implementations, such as scikit-learn's, may have some limitations for categorical features.
- **Multi-output Support**: Decision trees can deal with problems that have multiple target variables, such as multi-label classification or multi-variate regression.
- **White Box Model**: Decision trees are transparent and provide clear explanations for their predictions, unlike "black box" models such as neural networks.
- **Robustness**: Decision trees can perform well even when the model assumptions are not exactly met by the data-generating process.
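
As a small illustration of the interpretability and white-box points, a fitted tree can be printed as a set of nested if/else rules. This is a minimal sketch, assuming scikit-learn's `export_text` helper and the bundled iris dataset; the shallow `max_depth=2` is chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the learned decision rules as indented plain text.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Classifying a sample follows a single root-to-leaf path,
# so at most get_depth() tests are evaluated per prediction.
print("tree depth:", clf.get_depth())
```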

### Disadvantages of Decision Trees:

- **Overfitting**: Decision trees can fit the training data too closely, leading to poor generalization and high variance. Pruning and limiting the tree depth are some methods to prevent overfitting.
- **Instability**: Decision trees can be sensitive to small changes in the data, resulting in different and inconsistent models. Ensemble methods, such as random forests, can reduce this instability by averaging over many trees.
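- **Discontinuity**: Decision trees make predictions using step functions, which are piecewise constant and have jumps in their graph. Their predictions are therefore neither smooth nor continuous, and they are poor at extrapolation: outside the range of the training data, the prediction stays at the value of the nearest outer leaf.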
- **Computational Complexity**: Finding the optimal decision tree is an NP-hard problem, meaning that no efficient exact algorithm is known. Practical algorithms therefore use heuristics, such as greedy search, which make a locally optimal choice at each node and do not guarantee a globally optimal tree. Ensemble methods can mitigate this limitation by combining many trees trained on randomly sampled data and features.
- **Difficulty with Complex Concepts**: Decision trees may struggle to learn complex logical concepts, such as XOR, parity, or multiplexer, that require a large number of splits or nodes.
- **Bias in Imbalanced Data**: Decision trees may produce biased models when the data is imbalanced, meaning that some classes are overrepresented or underrepresented. Balancing the data before training, such as using resampling or weighting, can mitigate this bias.
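
The extrapolation limitation noted under discontinuity can be seen directly. The following sketch assumes a noise-free linear target, $y = 2x$, sampled on the interval $[0, 10]$, and a fully grown `DecisionTreeRegressor`; the data and the query points in `X_new` are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noise-free linear relationship y = 2x on the interval [0, 10].
X_train = np.linspace(0, 10, 101).reshape(-1, 1)
y_train = 2.0 * X_train[:, 0]

reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Query points at the edge of, and far beyond, the training range.
X_new = np.array([[10.0], [15.0], [100.0]])
pred = reg.predict(X_new)

for x, p in zip(X_new[:, 0], pred):
    print(f"x = {x:6.1f}  tree prediction = {p:6.1f}  true value = {2.0 * x:6.1f}")
```

Every query point at or beyond the right edge of the training range falls into the same outermost leaf, so the tree keeps predicting the largest target it saw during training instead of following the linear trend.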