
# Chapter 5. Decision Trees

Decision trees are versatile machine learning algorithms that can perform
classification, regression, and even multioutput tasks.

They are powerful models capable of fitting very complex datasets.


Decision trees are also the fundamental building blocks of **random forests**
and **gradient boosting**, which are among the most powerful machine learning
methods available.


In this chapter, we will learn how to:
- Train decision trees
- Visualize them
- Make predictions
- Understand how they split data
- Regularize them to reduce overfitting


## Training a Decision Tree Classifier

We will use the Iris dataset and train a decision tree to classify flowers
based on petal length and petal width.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


In [None]:
iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)


In [None]:
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_train, y_train)


The `max_depth` hyperparameter limits how deep the tree can grow,
which helps prevent overfitting.


## Visualizing a Decision Tree

One of the greatest strengths of decision trees is their interpretability.
They can be visualized as flowcharts.


In [None]:
from sklearn.tree import export_graphviz
import graphviz


In [None]:
dot_data = export_graphviz(
    tree_clf,
    out_file=None,
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

graphviz.Source(dot_data)


Each node displays:
- The feature used for splitting
- The threshold value
- The impurity (Gini by default)
- The number of samples
- The predicted class


## Making Predictions

To classify a new instance, the decision tree follows a path from the root
to a leaf node based on feature thresholds.


In [None]:
tree_clf.predict([[5.0, 1.5]])


The prediction corresponds to the class stored in the reached leaf node.


## CART Algorithm

Scikit-Learn uses the CART (Classification and Regression Trees) algorithm.

At each node, CART searches for the feature and threshold that produce the
purest possible child nodes.


For classification, purity is typically measured using **Gini impurity**.


Gini impurity is defined as:

Gini = 1 − Σ pₖ²

where pₖ is the proportion of class k in the node.


- Gini = 0 → perfectly pure node
- Higher Gini → more mixed classes


CART greedily selects the split that minimizes the **weighted average**
of impurity across child nodes.


## Regularizing Decision Trees

Decision trees tend to overfit if left unconstrained.
Scikit-Learn provides several hyperparameters for regularization.


Common regularization parameters include:

- `max_depth`
- `min_samples_split`
- `min_samples_leaf`
- `max_features`


In [None]:
tree_clf_reg = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=5,
    random_state=42
)

tree_clf_reg.fit(X_train, y_train)


Smaller, constrained trees usually generalize better than deep,
fully grown trees.


## Decision Trees for Regression

Decision trees can also perform regression tasks by predicting
continuous values.


In [None]:
from sklearn.tree import DecisionTreeRegressor


In [None]:
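# Illustration only: the integer class labels (0, 1, 2) are treated as
# continuous targets so that the same Iris split can be reused here.
tree_reg = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_reg.fit(X_train, y_train.astype(float))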


For regression, the tree splits data to minimize **mean squared error (MSE)**.

Each leaf predicts the average target value of the samples it contains.


## Limitations of Decision Trees

Despite their strengths, decision trees have important limitations.


1. **High variance**  
   Small changes in the data can produce very different trees.


2. **Greedy optimization**  
   CART finds locally optimal splits, not globally optimal trees.


3. **Axis-aligned splits**  
   Decision boundaries are always perpendicular to feature axes.


4. **Overfitting**  
   Without regularization, trees can memorize the training data.


## Chapter Summary

In this chapter, you learned:

- How decision trees perform classification and regression
- How trees are trained using the CART algorithm
- How to visualize and interpret trees
- How to regularize trees to reduce overfitting
- Why trees are the foundation of ensemble methods

In the next chapter, we will combine many trees together to build
**Random Forests**.


# Training and Visualizing a Decision Tree

To understand decision trees, we will train one and examine how it makes
predictions using the Iris dataset.


We will train a `DecisionTreeClassifier` using only:
- Petal length
- Petal width


In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


In [None]:
iris = load_iris(as_frame=True)

X_iris = iris.data[["petal length (cm)", "petal width (cm)"]].values
y_iris = iris.target


We limit the tree depth to 2 levels to keep it simple and interpretable.


In [None]:
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, y_iris)


The tree is now trained and ready to be visualized.


## Visualizing the Decision Tree

Scikit-Learn provides the `export_graphviz()` function to convert a trained
decision tree into a Graphviz `.dot` file.


In [None]:
from sklearn.tree import export_graphviz


In [None]:
export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=["petal length (cm)", "petal width (cm)"],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)


The `.dot` file describes the structure of the tree and can be rendered
using Graphviz.


Graphviz is an open-source graph visualization tool that can convert `.dot`
files into formats such as PNG or PDF.


In [None]:
from graphviz import Source


In [None]:
Source.from_file("iris_tree.dot")


Each node in the tree shows:

- The feature used for splitting
- The threshold value
- The impurity (Gini)
- The number of samples
- The predicted class


This visualization corresponds to **Figure 5-1: Iris decision tree**.

You can follow the path from the root to a leaf to see how predictions
are made step by step.


# Making Predictions

Let’s examine how the trained decision tree makes predictions.

We start at the root node (depth 0) and move down the tree based on
feature-based questions until we reach a leaf node.


At the root node, the tree asks:

**Is petal length ≤ 2.45 cm?**

- If yes → move to the left child
- If no → move to the right child


If the petal length is at most 2.45 cm, the left child node is a
leaf node.

This node predicts **Iris setosa**.


If the petal length is greater than 2.45 cm, the tree moves to the
right child node (depth 1).

This node asks a second question:

**Is petal width ≤ 1.75 cm?**


- If yes → predict **Iris versicolor**
- If no → predict **Iris virginica**

The prediction process ends once a leaf node is reached.


This top-down traversal makes decision trees easy to interpret and
even apply manually.
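
The traversal above can be written out by hand. The following is a minimal sketch, assuming the two thresholds quoted in this section (2.45 cm and 1.75 cm); `classify_iris` is a hypothetical helper name, not part of Scikit-Learn.

```python
def classify_iris(petal_length, petal_width):
    """Hand-coded version of the depth-2 tree described above."""
    # Root node (depth 0): test the petal length.
    if petal_length <= 2.45:
        return "setosa"       # the left child is a leaf
    # Right child (depth 1): test the petal width.
    if petal_width <= 1.75:
        return "versicolor"   # depth-2 left leaf
    return "virginica"        # depth-2 right leaf


print(classify_iris(5.0, 1.5))  # versicolor
```

Each `if` corresponds to one split node, which is why a shallow tree can be applied without any library at all.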


## Node Attributes

Each node in the decision tree contains useful information:


- **samples**: number of training instances that reach the node
- **value**: number of instances of each class at the node
- **gini**: Gini impurity of the node


Example:

At depth 1 (right node), 100 training instances reach the node.
Of these:
- 50 are Iris versicolor
- 50 are Iris virginica

Of these 100 instances, 54 have a petal width of at most 1.75 cm
and reach the depth-2 left node.


A node is **pure** when all instances belong to the same class.

Pure nodes have:
**Gini impurity = 0**


## Gini Impurity

Gini impurity measures how mixed the classes are at a node.

The more mixed the classes, the higher the impurity.


The Gini impurity of node *i* is defined as:

Gᵢ = 1 − Σₖ (pᵢ,ₖ)²

where pᵢ,ₖ is the proportion of class k at node i.


For example, if a node contains:

- 0 setosa
- 49 versicolor
- 5 virginica

out of 54 samples, the impurity is:

1 − (0/54)² − (49/54)² − (5/54)² ≈ 0.168


A perfectly pure node has Gini impurity 0.

Impurity is highest when the classes are evenly mixed: with *K* classes,
the maximum is 1 − 1/*K* (about 0.667 for three classes).
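
The arithmetic above can be checked with a few lines of NumPy. This is a small sketch; `gini_impurity` is a hypothetical helper that takes the per-class instance counts of a node.

```python
import numpy as np


def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of instances per class."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


print(round(gini_impurity([0, 49, 5]), 3))    # 0.168
print(round(gini_impurity([50, 50, 50]), 3))  # 0.667
print(gini_impurity([50, 0, 0]))              # 0.0
```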


## CART Algorithm

Scikit-Learn uses the CART (Classification and Regression Trees) algorithm.


CART always produces **binary trees**:

Each split node has exactly two children (yes / no questions).


Other algorithms (such as ID3) can produce nodes with more than two children,
but Scikit-Learn does not use them.


## Decision Boundaries

Each split in the tree corresponds to a decision boundary in feature space.


- The root node creates a vertical boundary at petal length = 2.45 cm
- The right child creates a horizontal boundary at petal width = 1.75 cm


Since `max_depth=2`, the tree stops splitting after depth 2.

Increasing `max_depth` would add additional decision boundaries.


The tree’s full structure is accessible via:

tree_clf.tree_

You can inspect split thresholds, class counts, and impurities directly.
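
As a rough sketch of what that inspection can look like, the loop below walks over the arrays stored in `tree_`. The names `iris_demo`, `X_demo`, and `demo_clf` are introduced here only for illustration, and the tree is refit so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
X_demo = iris_demo.data[:, 2:]  # petal length and petal width
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_clf.fit(X_demo, iris_demo.target)

tree = demo_clf.tree_
for node in range(tree.node_count):
    # Leaves are the nodes whose left and right child ids are identical (-1).
    is_leaf = tree.children_left[node] == tree.children_right[node]
    summary = (f"samples={tree.n_node_samples[node]}, "
               f"gini={tree.impurity[node]:.3f}")
    if is_leaf:
        print(f"node {node}: leaf, {summary}")
    else:
        print(f"node {node}: feature {tree.feature[node]} "
              f"<= {tree.threshold[node]:.2f}, {summary}")
```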


## White Box vs Black Box Models

Decision trees are considered **white box models**.

Their predictions are easy to understand and explain.


In contrast, models like random forests and neural networks are often
considered **black box models**.

They can make excellent predictions but are harder to interpret.


Interpretability is especially important in fields such as:

- Healthcare
- Finance
- Law
- Human resources


Decision trees provide clear, human-readable rules that support
transparent and accountable decision-making.


# Estimating Class Probabilities

A decision tree can also estimate the probability that an instance belongs
to a particular class k.


First, it traverses the tree to find the leaf node for this instance.


Then it returns the proportion of instances of class k among the training
instances that would also reach this leaf node.


For example, suppose you have found a flower whose petals are 5 cm long
and 1.5 cm wide.


The corresponding leaf node is the depth-2 left node, so the decision tree
outputs the following probabilities:


- 0% for Iris setosa (0 / 54)
- 90.7% for Iris versicolor (49 / 54)
- 9.3% for Iris virginica (5 / 54)


If you ask the tree to predict the class, it outputs **Iris versicolor**
(class 1) because it has the highest probability.


Let’s check this:


In [None]:
tree_clf.predict_proba([[5, 1.5]]).round(3)


In [None]:
tree_clf.predict([[5, 1.5]])


Notice that the estimated probabilities would be identical anywhere else
in the bottom-right rectangle of Figure 5-2.


For example, the probabilities would be the same if the petals were
6 cm long and 1.5 cm wide.


This is true even though it seems obvious that such a flower would most
likely be an Iris virginica in this case.


## The CART Training Algorithm


Scikit-Learn uses the **Classification and Regression Tree (CART)** algorithm
to train decision trees (also called “growing” trees).


The algorithm works by first splitting the training set into two subsets
using a single feature *k* and a threshold *tₖ*
(e.g., “petal length ≤ 2.45 cm”).


How does it choose *k* and *tₖ*?


It searches for the pair (*k*, *tₖ*) that produces the **purest subsets**,
weighted by their size.
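
The search described above can be sketched in plain NumPy. This is a simplified illustration, not Scikit-Learn's implementation: `gini` and `best_split` are hypothetical helpers, candidate thresholds are taken as midpoints between consecutive feature values, and every candidate is tried exhaustively.

```python
import numpy as np


def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


def best_split(X, y):
    """Return (feature index, threshold, cost) of the purest split."""
    m, n = X.shape
    best = (None, None, np.inf)
    for k in range(n):
        values = np.unique(X[:, k])
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints
        for t in thresholds:
            left = X[:, k] <= t
            m_left, m_right = left.sum(), m - left.sum()
            # Impurity of both subsets, weighted by their size.
            cost = (m_left * gini(y[left]) + m_right * gini(y[~left])) / m
            if cost < best[2]:
                best = (k, t, cost)
    return best


X_toy = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])
print(best_split(X_toy, y_toy))
```

On this toy data, the first feature separates the two classes perfectly at a threshold of 2.5, so the search returns a cost of 0.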


Equation 5-2 gives the cost function that the algorithm tries to minimize.


**Equation 5-2. CART cost function for classification**
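
Written out in the notation used above, and assuming the standard CART formulation, the cost function reads:

$$
J(k, t_k) = \frac{m_\text{left}}{m} \, G_\text{left} + \frac{m_\text{right}}{m} \, G_\text{right}
$$

where $G_\text{left}$ and $G_\text{right}$ are the impurities of the left and right subsets, $m_\text{left}$ and $m_\text{right}$ are the numbers of instances in each subset, and $m = m_\text{left} + m_\text{right}$.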


Once the CART algorithm has successfully split the training set in two,
it splits the subsets using the same logic, then the sub-subsets, and so on,
recursively.


It stops recursing once it reaches the maximum depth
(defined by the `max_depth` hyperparameter),
or if it cannot find a split that will reduce impurity.


A few other hyperparameters control additional stopping conditions:


- `min_samples_split`
- `min_samples_leaf`
- `max_leaf_nodes`
- and others


### WARNING


The CART algorithm is a **greedy algorithm**:
it greedily searches for an optimal split at the top level,
then repeats the process at each subsequent level.


It does **not** check whether a split will lead to the lowest possible
impurity several levels down.


A greedy algorithm often produces a solution that is reasonably good,
but it is not guaranteed to be optimal.


Unfortunately, finding the optimal decision tree is known to be an
**NP-complete problem**.


Known exact methods take time that grows exponentially with the number
of training instances *m*, making the problem intractable
even for small training sets.


This is why we must settle for a “reasonably good” solution
when training decision trees.


# Computational Complexity


Making predictions requires traversing the decision tree
from the root to a leaf.


Decision trees are generally approximately balanced,
so traversing the tree requires going through roughly
O(log₂(m)) nodes,
where *m* is the number of training instances.


Here, log₂(m) is the binary logarithm of *m*,
equal to log(m) / log(2).


Since each node only requires checking the value of one feature,
the overall prediction complexity is **O(log₂(m))**,
independent of the number of features.


As a result, predictions are very fast,
even when dealing with large training sets.


By default, the training algorithm compares **all features**
on **all samples** at each node.


This results in a training complexity of:


**O(n × m log₂(m))**


where *n* is the number of features
and *m* is the number of training instances.


It is possible to speed up training by limiting
the growth of the tree.


For example, you can:


- Set a maximum tree depth using `max_depth`
- Set a maximum number of features to consider at each node
  (features are then chosen randomly)


These techniques help speed up training considerably
and can also reduce the risk of overfitting.


However, as always, going too far may result in underfitting.


# Gini Impurity or Entropy?


By default, the `DecisionTreeClassifier` class uses the **Gini impurity**
measure.


You can instead use **entropy** by setting the `criterion`
hyperparameter to `"entropy"`.


The concept of **entropy** originated in thermodynamics
as a measure of molecular disorder.


Entropy approaches zero when molecules are still
and well ordered.


Entropy later spread to many other fields,
including **Shannon’s information theory**,
where it measures the average information content of a message.


In information theory, entropy is zero
when all messages are identical.


In machine learning, entropy is frequently used
as an **impurity measure**.


A set’s entropy is zero when it contains
instances of only **one class**.


Equation 5-3 defines the entropy of the *i*th node.
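
In the same notation as the Gini impurity above, and assuming the usual Shannon definition with base-2 logarithms, the entropy of node *i* can be written as:

$$
H_i = - \sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k} \log_2 \left( p_{i,k} \right)
$$

where $p_{i,k}$ is the proportion of class $k$ at node $i$ and $n$ is the number of classes. Classes with $p_{i,k} = 0$ are skipped, since they contribute nothing to the sum.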


For example, the depth-2 left node in Figure 5-1
has an entropy of approximately **0.445**.


This value is computed as:

−(49/54) log₂(49/54) − (5/54) log₂(5/54)
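
The same computation in NumPy, as a short sketch. `entropy` is a hypothetical helper that takes per-class counts and ignores empty classes.

```python
import numpy as np


def entropy(class_counts):
    """Entropy (in bits) of a node, given the number of instances per class."""
    counts = np.asarray(class_counts, dtype=float)
    # Empty classes are dropped: their term is taken to be zero.
    proportions = counts[counts > 0] / counts.sum()
    return -np.sum(proportions * np.log2(proportions))


print(round(entropy([0, 49, 5]), 3))  # 0.445
```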


So, should you use **Gini impurity** or **entropy**?


Most of the time, it does **not** make a big difference:
both criteria tend to produce very similar trees.


Gini impurity is slightly **faster to compute**,
which makes it a good default choice.


When the two criteria differ, they tend to behave differently:


- **Gini impurity** tends to isolate the most frequent class
  in its own branch


- **Entropy** tends to produce slightly more balanced trees


# Regularization Hyperparameters


Decision trees make very few assumptions about the training data.

Unlike linear models, they do not assume linear relationships.


If left unconstrained, a decision tree will adapt itself very closely
to the training data, often resulting in **overfitting**.


Such models are often called **nonparametric models**.


This does *not* mean they have no parameters.

It means the number of parameters is **not fixed before training**.


The model structure is free to grow and fit the data as closely
as possible.


In contrast, **parametric models** (such as linear regression)
have a fixed number of parameters.


This limits their flexibility, reducing overfitting
but increasing the risk of underfitting.


To avoid overfitting decision trees, we must restrict their freedom.

This process is called **regularization**.


The most common form of regularization is limiting the **maximum depth**
of the tree.


In Scikit-Learn, this is controlled by the `max_depth` hyperparameter.

- Default: `None` (unlimited depth)
- Smaller values → stronger regularization


Reducing `max_depth` limits how deep the tree can grow,
helping prevent overfitting.


The `DecisionTreeClassifier` includes several additional
regularization hyperparameters.


### Common Decision Tree Regularization Hyperparameters


- **max_features**  
  Maximum number of features evaluated at each split


- **max_leaf_nodes**  
  Maximum number of leaf nodes in the tree


- **min_samples_split**  
  Minimum number of samples required to split a node


- **min_samples_leaf**  
  Minimum number of samples required in a leaf node


- **min_weight_fraction_leaf**  
  Same as `min_samples_leaf`, but expressed as a fraction
  of the total number of weighted samples


- **min_impurity_decrease**  
  A node will only be split if it reduces impurity
  by at least this amount


- **ccp_alpha**  
  Controls **minimal cost-complexity pruning (MCCP)**

  Larger values → more pruning → smaller trees


To reduce model complexity:

- Increase `min_*` hyperparameters or `ccp_alpha`
- Decrease `max_*` hyperparameters


In practice:

- Tuning `max_depth` is a great default
- `min_samples_leaf` is especially useful for small datasets
- `max_features` helps with high-dimensional data
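
A cross-validated grid search is one common way to tune these hyperparameters together. The sketch below is only an illustration: the dataset, the grid values, and the names `X_demo`, `y_demo`, and `grid_search` are assumptions chosen for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used only for this illustration.
X_demo, y_demo = make_moons(n_samples=300, noise=0.2, random_state=0)

param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation for each combination
)
grid_search.fit(X_demo, y_demo)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```

The best combination depends on the data, so the printed values should be read as an outcome for this toy dataset rather than as general recommendations.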


### Note on Pruning


Some algorithms first grow the tree fully,
then **prune unnecessary nodes** afterward.


A node may be pruned if its purity improvement
is not statistically significant.


Statistical tests such as the χ² (chi-squared) test
are used to estimate whether an improvement
could be due to chance.


If the p-value exceeds a threshold (typically 5%),
the node is removed.


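Scikit-Learn does not implement this statistical test.
It offers **pre-pruning** via hyperparameters like `max_depth` and
`min_samples_leaf`, and post-pruning via `ccp_alpha`
(minimal cost-complexity pruning).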
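
To make the effect of `ccp_alpha` concrete, here is a small sketch that compares a fully grown tree with a cost-complexity-pruned one. The dataset and the value `ccp_alpha=0.01` are arbitrary choices for illustration; `cost_complexity_pruning_path()` lists the alpha values at which the full tree would lose nodes.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, used only for this illustration.
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Candidate alphas and the total leaf impurity of each pruned subtree.
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)

pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
pruned_tree.fit(X_demo, y_demo)

print(len(path.ccp_alphas))
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```

In practice the alpha value would be chosen by cross-validation over `path.ccp_alphas` rather than fixed by hand.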


Let’s test regularization on the **moons dataset**.

This is a toy binary classification dataset shaped like
two interleaving crescent moons.


In [None]:
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


In [None]:
X_moons, y_moons = make_moons(
    n_samples=150,
    noise=0.2,
    random_state=42
)


We train two decision trees:

- One **without regularization**
- One with `min_samples_leaf = 5`


In [None]:
tree_clf1 = DecisionTreeClassifier(random_state=42)
tree_clf2 = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)

tree_clf1.fit(X_moons, y_moons)
tree_clf2.fit(X_moons, y_moons)


The unregularized tree overfits the data,
while the regularized tree produces smoother decision boundaries.


To verify this, we evaluate both models on a **new test set**
generated with a different random seed.


In [None]:
X_moons_test, y_moons_test = make_moons(
    n_samples=1000,
    noise=0.2,
    random_state=43
)


In [None]:
tree_clf1.score(X_moons_test, y_moons_test)


In [None]:
tree_clf2.score(X_moons_test, y_moons_test)


The regularized model achieves higher test accuracy,
confirming that it generalizes better.


# Regression


Decision trees are also capable of performing **regression tasks**.

Unlike linear regression, decision trees can model
highly **nonlinear relationships**.


They can fit complex datasets by recursively splitting
the feature space into regions and predicting a constant
value in each region.


Let’s build a **regression tree** using Scikit-Learn’s
`DecisionTreeRegressor`.


We will train it on a **noisy quadratic dataset**
and restrict the tree depth to avoid overfitting.


In [None]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor


In [None]:
rng = np.random.default_rng(seed=42)

X_quad = rng.random((200, 1)) - 0.5  # one input feature
y_quad = X_quad ** 2 + 0.025 * rng.standard_normal((200, 1))


We now train a regression tree with `max_depth = 2`.


In [None]:
tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)


The resulting tree structure is shown in Figure 5-4.


This regression tree is very similar to a classification tree.

The key difference is that **each node predicts a numerical value**
instead of a class label.


Each prediction is the **average target value**
of the training instances that reach the leaf node.


### Example Prediction


Suppose we want to predict the value for a new instance with:

x₁ = 0.2


The tree traversal works as follows:

1. Root node checks whether x₁ ≤ 0.343 → True
2. Move to left child
3. Check whether x₁ ≤ −0.302 → False
4. Move to right child (leaf)


The leaf node predicts:

value = 0.038


This value is the **mean of the target values**
of the 133 training instances in that leaf.


The mean squared error (MSE) for these instances
is approximately 0.002.
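
The relationship between a leaf's prediction, its mean target, and its MSE can be checked directly. This sketch generates its own noisy quadratic data with a different seed, so the numbers will not match the figures quoted above; the names ending in `_demo` are introduced only for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng_demo = np.random.default_rng(seed=0)
X_demo = rng_demo.random((200, 1)) - 0.5
y_demo = X_demo[:, 0] ** 2 + 0.025 * rng_demo.standard_normal(200)

demo_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
demo_reg.fit(X_demo, y_demo)

x_new = np.array([[0.2]])
leaf_id = demo_reg.apply(x_new)[0]             # leaf reached by x_new
in_same_leaf = demo_reg.apply(X_demo) == leaf_id

prediction = demo_reg.predict(x_new)[0]
leaf_mean = y_demo[in_same_leaf].mean()        # mean target in that leaf
leaf_mse = np.mean((y_demo[in_same_leaf] - leaf_mean) ** 2)

print(prediction, leaf_mean, leaf_mse)
```

The prediction coincides with the leaf mean, and the leaf MSE matches the impurity that the fitted tree stores for that node.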


Figure 5-5 shows the predictions produced by two regression trees:

- Left: `max_depth = 2`
- Right: `max_depth = 3`


In each region, the predicted value is always the **average**
of the target values in that region.


The CART algorithm splits regions so that most training instances
are as close as possible to the predicted value.


### CART Cost Function for Regression


For regression tasks, CART does **not** minimize impurity.

Instead, it minimizes the **mean squared error (MSE)**.


At each split, the algorithm searches for the feature and threshold
that produce child nodes with the **lowest weighted MSE**.


This is known as the CART cost function for regression
(Equation 5-4).
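
Assuming the same weighted form as the classification cost function, with MSE in place of impurity, it can be written as:

$$
J(k, t_k) = \frac{m_\text{left}}{m} \, \text{MSE}_\text{left} + \frac{m_\text{right}}{m} \, \text{MSE}_\text{right}
\quad \text{where} \quad
\begin{cases}
\text{MSE}_\text{node} = \dfrac{1}{m_\text{node}} \displaystyle\sum_{i \in \text{node}} \left( \hat{y}_\text{node} - y^{(i)} \right)^2 \\[2ex]
\hat{y}_\text{node} = \dfrac{1}{m_\text{node}} \displaystyle\sum_{i \in \text{node}} y^{(i)}
\end{cases}
$$

Here $\hat{y}_\text{node}$ is the mean target value of the instances in a node, which is also the value that node predicts.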


### Overfitting in Regression Trees


Just like classification trees, regression trees are
**prone to overfitting**.


If no regularization is applied, the tree will often fit
the training data extremely closely.


Figure 5-6 (left) shows predictions from an **unregularized**
regression tree.


These predictions clearly overfit the training data.


By setting `min_samples_leaf = 10`,
we force each leaf to contain at least 10 samples.


This produces a **much smoother and more reasonable model**,
shown on the right in Figure 5-6.


In [None]:
tree_reg_regularized = DecisionTreeRegressor(
    min_samples_leaf=10,
    random_state=42
)

tree_reg_regularized.fit(X_quad, y_quad)


Regularization significantly improves generalization
by preventing overly fine splits.


# Sensitivity to Axis Orientation


Decision trees have many advantages: they are easy to interpret,
simple to use, versatile, and powerful.


However, they also have important **limitations**.


One key limitation is that decision trees prefer **orthogonal decision boundaries**.

All splits are perpendicular to a feature axis.


Because of this, decision trees are **sensitive to the orientation of the data**.


Figure 5-7 illustrates this problem using a simple linearly separable dataset.


- On the left, the dataset is aligned with the axes.
  A decision tree splits it cleanly.
- On the right, the dataset is rotated by 45°.
  The resulting decision boundary becomes unnecessarily complex.


Although both trees fit the training data perfectly,
the rotated version is **much more likely to overfit**
and generalize poorly.
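
The effect can be reproduced with a small experiment. The sketch below is not the code behind Figure 5-7: it builds its own linearly separable dataset, rotates it by 45°, and compares the size of the two unconstrained trees. All names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng_demo = np.random.default_rng(seed=42)
X_square = rng_demo.random((100, 2)) - 0.5
y_square = (X_square[:, 0] > 0).astype(int)  # separable by one vertical line

angle = np.pi / 4  # 45 degrees
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
X_rotated = X_square @ rotation              # same points, rotated

tree_aligned = DecisionTreeClassifier(random_state=42).fit(X_square, y_square)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y_square)

print(tree_aligned.tree_.node_count, tree_rotated.tree_.node_count)
```

The axis-aligned data needs a single split (three nodes), while the rotated copy forces the tree to approximate a diagonal boundary with a staircase of splits.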


### Reducing Sensitivity with PCA


One way to reduce this problem is to:
1. Scale the data
2. Apply **Principal Component Analysis (PCA)**


PCA rotates the dataset to reduce feature correlation.

This often (though not always) makes it easier for decision trees
to find simple splits.


We will study PCA in detail in Chapter 7.

For now, it is enough to know that PCA creates new features that
are linear combinations of the original ones.


Let’s build a pipeline that:
- Scales the data
- Applies PCA

The decision tree is then trained separately on the transformed features.


In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


We reuse the Iris dataset features from earlier:
petal length and petal width.


In [None]:
pca_pipeline = make_pipeline(
    StandardScaler(),
    PCA()
)


We now transform (rotate) the original feature space.


In [None]:
X_iris_rotated = pca_pipeline.fit_transform(X_iris)


Next, we train a decision tree on the PCA-transformed data.


In [None]:
tree_clf_pca = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf_pca.fit(X_iris_rotated, y_iris)


Figure 5-8 shows the resulting decision boundaries.


Thanks to PCA, the dataset can now be separated effectively
using just **one principal component**, z₁.


The feature z₁ is a linear combination of:
- petal length
- petal width


This rotation makes it possible for the tree to use
simple, clean splits again.


### Practical Tip


In Scikit-Learn 1.3 and later, both `DecisionTreeClassifier` and
`DecisionTreeRegressor` support **missing values natively**.

With those versions, you do **not** need an imputer when using decision trees.


## Decision Trees Have a High Variance


The main drawback of decision trees is that they suffer from **high variance**.


High variance means that **small changes** in:
- the training data, or
- the hyperparameters

can result in **very different models**.


Even retraining the *same* decision tree on the *same* dataset
can produce a noticeably different tree.


This happens because Scikit-Learn’s training algorithm is **stochastic**.


At each node, the algorithm evaluates the features in a random order,
so when several splits are equally good, the one that gets chosen
can vary from run to run.


Unless you explicitly fix the `random_state` hyperparameter,
the training process is not fully deterministic.
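
A quick way to confirm that fixing `random_state` makes training repeatable is to fit the same tree twice and compare the learned splits. In this sketch, `fitted_splits` is a hypothetical helper and the Iris petal features are reloaded so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
X_demo = iris_demo.data[:, 2:]  # petal length and petal width
y_demo = iris_demo.target


def fitted_splits(seed):
    """Train a depth-2 tree and return its split features and thresholds."""
    clf = DecisionTreeClassifier(max_depth=2, random_state=seed)
    clf.fit(X_demo, y_demo)
    return clf.tree_.feature.copy(), clf.tree_.threshold.copy()


features_a, thresholds_a = fitted_splits(seed=42)
features_b, thresholds_b = fitted_splits(seed=42)

same = (np.array_equal(features_a, features_b)
        and np.array_equal(thresholds_a, thresholds_b))
print(same)  # True
```

Calling `fitted_splits` with different seeds is a simple way to look for the run-to-run differences described above.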


Figure 5-9 shows an example of this effect.


Although the new tree is trained on the same data
and uses the same hyperparameters,
its structure looks very different from the earlier one.


Both trees may fit the training data well,
but they can generalize differently to unseen data.


This sensitivity is what makes decision trees unstable on their own.


### Reducing Variance with Ensembles


Fortunately, there is a powerful way to reduce variance:
**averaging many decision trees together**.


Instead of relying on a single high-variance tree,
we train many trees and average their predictions.


This approach dramatically reduces variance
while preserving the low bias of decision trees.


An ensemble of decision trees trained this way
is called a **random forest**.
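
The averaging idea can be sketched by hand. The code below trains each tree on a bootstrap sample and takes a majority vote; this is plain bagging rather than a full random forest, which also samples the features considered at each split. The dataset, the number of trees, and all names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_train_demo, y_train_demo = make_moons(n_samples=500, noise=0.3, random_state=0)
X_test_demo, y_test_demo = make_moons(n_samples=1000, noise=0.3, random_state=1)

rng_demo = np.random.default_rng(seed=42)
n_trees = 50
votes = np.zeros((n_trees, len(X_test_demo)))
for i in range(n_trees):
    # Bootstrap sample: draw training instances with replacement.
    idx = rng_demo.integers(0, len(X_train_demo), size=len(X_train_demo))
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_train_demo[idx], y_train_demo[idx])
    votes[i] = tree.predict(X_test_demo)

# Majority vote over the trees (the labels are 0 and 1).
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
ensemble_accuracy = np.mean(ensemble_pred == y_test_demo)

single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train_demo, y_train_demo)
single_accuracy = single_tree.score(X_test_demo, y_test_demo)

print(single_accuracy, ensemble_accuracy)
```

Comparing the two printed accuracies gives a rough sense of how much the vote stabilizes the predictions on this toy dataset.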


Random forests are among the most powerful
and widely used machine learning models today.


In the next chapter, we will study **random forests**
and other ensemble methods in detail.
