# Chapter 6: Decision Trees

## 1. Chapter Overview
**Goal:** Understand Decision Trees, versatile algorithms that can perform classification, regression, and even multioutput tasks. They are powerful enough to fit complex datasets, and they form the basis of Random Forests.

**Key Concepts:**
* **Training and Visualization:** How to build and interpret a tree.
* **Making Predictions:** Traversing the tree from root to leaf.
* **Gini Impurity vs. Entropy:** The math behind how the tree decides to split nodes.
* **CART Algorithm:** The greedy algorithm used by Scikit-Learn to grow trees.
* **Regularization:** Preventing overfitting using hyperparameters like `max_depth`.
* **Regression:** Using trees to predict continuous values.

**Practical Skills:**
* Training a `DecisionTreeClassifier` on the Iris dataset.
* Visualizing the tree structure using `export_graphviz`.
* Training a `DecisionTreeRegressor` on quadratic data.
* Understanding the instability of trees (sensitivity to rotation).

In [None]:
# Setup
import sys
import sklearn
import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## 2. Theoretical Explanation

### 1. Structure of a Tree
* **Root Node:** The top node where the tree starts.
* **Split Node:** A node that checks a condition (e.g., "Petal length <= 2.45 cm") and splits the data.
* **Leaf Node:** A terminal node that does not have children. It holds the final prediction.

### 2. Gini Impurity
A node's "impurity" measures how mixed the training instances are in that node. A node is "pure" (Gini = 0) if all instances it applies to belong to the same class.
$$ G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2 $$
where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i$-th node, and $n$ is the number of classes.
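
As a quick check of the formula, the following minimal sketch computes the Gini impurity from per-class instance counts. The `gini_impurity` helper and the counts are illustrative assumptions, not values read from a trained tree.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of training instances per class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()  # p_{i,k}: ratio of each class in the node
    return 1.0 - np.sum(p ** 2)

# Hypothetical nodes, chosen only to illustrate the formula
pure_node = gini_impurity([50, 0, 0])    # every instance belongs to one class
mixed_node = gini_impurity([0, 49, 5])   # 54 instances, mostly one class
even_node = gini_impurity([10, 10, 10])  # three equally represented classes

print(f"pure:  {pure_node:.3f}")
print(f"mixed: {mixed_node:.3f}")
print(f"even:  {even_node:.3f}")
```

A pure node gives 0, while three equally represented classes give $1 - 3 \cdot (1/3)^2 = 2/3$, the largest value possible with three classes.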

### 3. The CART Algorithm
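Scikit-Learn uses the **Classification And Regression Tree (CART)** algorithm. It splits the training set into two subsets using a single feature $k$ and a threshold $t_k$, choosing the pair that produces the purest subsets (weighted by their size). It then splits each subset recursively with the same logic, stopping when it reaches `max_depth` or cannot find a split that reduces impurity. CART is a "greedy" algorithm: it picks the best split at each step without checking whether that choice leads to the lowest impurity several levels down, so the resulting tree is reasonably good but not guaranteed to be optimal.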
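
The greedy search for one split can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Scikit-Learn's actual implementation: the `gini` and `best_split` helpers are hypothetical names, candidate thresholds are taken as midpoints between consecutive feature values, and the toy data is made up.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every (feature, threshold) pair and keep the one with the
    lowest size-weighted impurity of the two resulting subsets."""
    m, n_features = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, cost)
    for k in range(n_features):
        values = np.unique(X[:, k])                  # sorted unique values
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints as candidates
        for t in thresholds:
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / m
            if cost < best[2]:
                best = (k, t, cost)
    return best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not
X_toy = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 6.0]])
y_toy = np.array([0, 0, 1, 1])

feature, threshold, cost = best_split(X_toy, y_toy)
print(f"best split: feature {feature} <= {threshold}, weighted Gini = {cost}")
```

A full tree builder would call `best_split` again on each of the two subsets until a stopping condition such as `max_depth` is met.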

### 4. White Box vs. Black Box
* **White Box (Decision Trees):** Easy to interpret. You can visualize exactly why a decision was made.
* **Black Box (Neural Networks, SVMs):** Often more accurate, but it is hard to explain in simple terms why a specific prediction was made.

## 3. Code Reproduction

### 3.1 Training and Visualizing a Decision Tree
We will train a tree on the Iris dataset to classify iris species.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

# max_depth=2 restricts the tree depth to prevent overfitting
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

### Visualization
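The book uses `export_graphviz`, which writes a `.dot` file describing the tree. You can convert that file to an image with the Graphviz command-line tool, for example `dot -Tpng iris_tree.dot -o iris_tree.png`.
*Note: Rendering the PNG requires the Graphviz software to be installed on your system.*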

In [None]:
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file="iris_tree.dot",
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

# Verify that the file was actually written to disk:
print("iris_tree.dot exists:", os.path.exists("iris_tree.dot"))

# In a Jupyter environment with graphviz python package installed, you could render it like this:
# import graphviz
# with open("iris_tree.dot") as f:
#     dot_graph = f.read()
# graphviz.Source(dot_graph)
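
If Graphviz is not available, a text rendering of the same rules is one possible alternative. The sketch below uses `sklearn.tree.export_text`; the variable names `X_petal`, `clf`, and `rules` are assumptions made for this example, and the tree is retrained so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_petal = iris.data[:, 2:]  # petal length and width
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_petal, iris.target)

# export_text returns the split rules as an indented plain-text string
rules = export_text(clf, feature_names=list(iris.feature_names[2:]))
print(rules)
```

Each indentation level in the printed output corresponds to one level of depth in the tree, and each `class:` line is a leaf.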

### 3.2 Estimating Class Probabilities
A decision tree can estimate the probability that an instance belongs to a particular class by returning the ratio of training instances of that class in the leaf node.

In [None]:
# Estimating for a flower with petal length 5cm and width 1.5cm
proba = tree_clf.predict_proba([[5, 1.5]])
prediction = tree_clf.predict([[5, 1.5]])

print("Probabilities (Setosa, Versicolor, Virginica):", proba)
print("Predicted Class Index:", prediction)
print("Predicted Class Name:", iris.target_names[prediction][0])
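
To see where these probabilities come from, the following sketch recomputes them by hand: it finds the leaf that the flower lands in, then counts the classes of the training instances in that same leaf. The variable names are assumptions for this example, and the tree is retrained so the block is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_petal, y_species = iris.data[:, 2:], iris.target
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_petal, y_species)

sample = np.array([[5.0, 1.5]])  # petal length 5 cm, petal width 1.5 cm

# apply() returns the index of the leaf each instance ends up in
leaf_id = clf.apply(sample)[0]
in_same_leaf = clf.apply(X_petal) == leaf_id

# Class ratios among the training instances that share this leaf
counts = np.bincount(y_species[in_same_leaf], minlength=3)
manual_proba = counts / counts.sum()

print("Training instances in leaf, per class:", counts)
print("Manual ratios:  ", manual_proba)
print("predict_proba():", clf.predict_proba(sample)[0])
```

The two printed probability vectors should match, which also shows that every instance falling into the same leaf receives the same estimate.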

### 3.3 Regularization Hyperparameters
Decision Trees make very few assumptions about the training data (unlike Linear models). If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely and usually overfitting it.

To avoid this, we restrict the tree's freedom during training using:
* `max_depth`: Maximum depth of the tree.
* `min_samples_leaf`: Minimum samples a leaf node must have.

Let's see the effect of regularization on the `moons` dataset.

In [None]:
from sklearn.datasets import make_moons
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

# 1. No Regularization (Default)
deep_tree_clf = DecisionTreeClassifier(random_state=42)
deep_tree_clf.fit(Xm, ym)

# 2. Regularized (min_samples_leaf=4)
# This prevents the tree from creating leaf nodes with very few samples (likely noise)
regularized_tree_clf = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
regularized_tree_clf.fit(Xm, ym)

# Compare tree depths as a rough indicator of model complexity
print("Depth of unregularized tree:", deep_tree_clf.get_depth())
print("Depth of regularized tree:", regularized_tree_clf.get_depth())
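
Depth alone does not show whether regularization helps generalization. One way to check is to score both trees on held-out data, as sketched below. The test set here is an assumption of this example: it is drawn from the same `make_moons` generator with a different, arbitrarily chosen seed.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_moons(n_samples=100, noise=0.25, random_state=53)
# Held-out data: same generator, different seed (assumption for this sketch)
X_test, y_test = make_moons(n_samples=1000, noise=0.25, random_state=7)

deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
regularized = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
regularized.fit(X_train, y_train)

for name, model in [("unregularized", deep), ("regularized", regularized)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")

# Confirm the constraint: every leaf holds at least 4 training instances
leaf_sizes = np.bincount(regularized.apply(X_train))
smallest_leaf = leaf_sizes[leaf_sizes > 0].min()
print("Smallest leaf in regularized tree:", smallest_leaf)
```

A large gap between training and test accuracy is the usual sign of overfitting; the unconstrained tree fits the training set perfectly, so its gap is the one to watch.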

### 3.4 Regression Trees
Decision Trees can also perform regression. Instead of predicting a class in each leaf, a regression tree predicts a value: the average target value of the training instances in that leaf.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Generate noisy quadratic data
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2
y = y + np.random.randn(m, 1) / 10

# Train a regression tree
tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

# Visualization of the prediction line
X_test = np.linspace(0, 1, 500).reshape(-1, 1)
y_pred = tree_reg.predict(X_test)

plt.plot(X, y, "b.")
plt.plot(X_test, y_pred, "r-", linewidth=2, label="Prediction")
plt.title("Decision Tree Regression (max_depth=2)")
plt.legend()
plt.show()

## 4. Step-by-Step Explanation

### 1. Classification (Iris)
**Process:** The `DecisionTreeClassifier` looks for the best feature to split the data. For example, it might find that `Petal Length <= 2.45` perfectly separates the *Setosa* flowers from the rest. It creates a node for this rule. Then it looks at the remaining data (Versicolor and Virginica) and finds the next best split (e.g., `Petal Width <= 1.75`).

### 2. Prediction
For a new flower with petal length 5 cm and petal width 1.5 cm:
1. Start at root: Is length <= 2.45? No (5 > 2.45).
2. Go to right child: Is width <= 1.75? Yes (1.5 <= 1.75).
3. Reach leaf node: Most training instances in this leaf are Versicolor.
4. Output: Class Versicolor, with an estimated probability equal to the fraction of Versicolor training instances in that leaf.
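
The traversal above can be reproduced by walking the fitted tree's internal arrays. This is a sketch that relies on the `tree_` attribute of a fitted Scikit-Learn tree (`children_left`, `children_right`, `feature`, `threshold`); the variable names are assumptions for this example, and the exact splits printed depend on the tree that was learned.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_petal, y_species = iris.data[:, 2:], iris.target
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_petal, y_species)

tree = clf.tree_
names = iris.feature_names[2:]
flower = np.array([5.0, 1.5])  # petal length, petal width (cm)

node = 0  # the root is always node 0
while tree.children_left[node] != -1:  # -1 marks a leaf (no children)
    feature, threshold = tree.feature[node], tree.threshold[node]
    go_left = flower[feature] <= threshold
    print(f"node {node}: {names[feature]} <= {threshold:.2f}? {go_left}")
    node = tree.children_left[node] if go_left else tree.children_right[node]

predicted = clf.predict(flower.reshape(1, -1))[0]
print(f"reached leaf {node}, predicted class: {iris.target_names[predicted]}")
```

A "yes" answer always moves to the left child and a "no" to the right child, so a prediction costs one comparison per level of the tree.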

### 3. Regression
**Input:** A single feature $X$.
**Process:** The tree splits the $X$ axis into regions (e.g., $X < 0.2$, $0.2 \le X < 0.8$, etc.).
**Output:** For any new instance falling into a specific region, the tree predicts the **average** $y$-value of the training instances in that region. This results in the "staircase" (step function) appearance of the red line in the plot.
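
The claim that each region predicts its average can be verified directly. The sketch below regenerates noisy quadratic data (using `np.random.default_rng`, an assumption of this example, so the values differ from the earlier cell) and compares each leaf's prediction with the mean target of the training instances in that leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_quad = rng.random((200, 1))
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + rng.standard_normal(200) / 10

reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_quad, y_quad)

leaf_ids = reg.apply(X_quad)       # leaf index for every training instance
predictions = reg.predict(X_quad)  # one constant value per leaf

for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    x_lo, x_hi = X_quad[in_leaf, 0].min(), X_quad[in_leaf, 0].max()
    print(f"leaf {leaf}: n={in_leaf.sum():3d}  x in [{x_lo:.2f}, {x_hi:.2f}]  "
          f"mean y={y_quad[in_leaf].mean():.3f}  "
          f"predicted={predictions[in_leaf][0]:.3f}")
```

With `max_depth=2` there are at most four leaves, hence at most four distinct predicted values, which is exactly the staircase seen in the plot.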

## 5. Chapter Summary

* **Interpretation:** Decision Trees are easy to understand and visualize.
* **Data Prep:** They require very little data preparation (no scaling needed).
* **Orthogonal Boundaries:** Trees make splits perpendicular to the feature axes. This makes them sensitive to data rotation (e.g., if you rotate a dataset by 45 degrees, a boundary that needed a single split may turn into a convoluted staircase).
* **Instability:** Trees are sensitive to small variations in the training data. A small change can result in a completely different tree structure. Random Forests limit this instability by averaging predictions over many trees.
* **Overfitting:** As nonparametric models, trees are free to fit the training data very closely and tend to overfit. Constrain them with regularization hyperparameters such as `max_depth` or `min_samples_leaf`.
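
The rotation sensitivity noted in the summary can be checked with a small experiment. In this sketch (synthetic data and variable names are assumptions for illustration), the classes are separated by a vertical line, so one split is enough; after rotating the same points by 45 degrees, the tree can only approximate the diagonal boundary with axis-aligned steps.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X_square = rng.random((100, 2)) - 0.5        # points in a unit square around 0
y_square = (X_square[:, 0] > 0).astype(int)  # class depends only on feature 0

# Rotate every point by 45 degrees
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X_rotated = X_square @ rotation

tree_original = DecisionTreeClassifier(random_state=42).fit(X_square, y_square)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y_square)

print("original: depth", tree_original.get_depth(),
      "leaves", tree_original.get_n_leaves())
print("rotated:  depth", tree_rotated.get_depth(),
      "leaves", tree_rotated.get_n_leaves())
```

Both trees fit their training set, but the rotated one typically needs more splits to do so. One common remedy is to apply PCA before training, which often results in a better orientation of the data.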