# Module 05: Decision Trees

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 90 minutes  
**Prerequisites**: 
- [Module 03: Linear Regression](03_linear_regression.ipynb)
- [Module 04: Logistic Regression](04_logistic_regression.ipynb)
- Understanding of information theory basics

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand how decision trees make splits (Gini impurity, entropy)
2. Build and visualize decision trees for classification and regression
3. Interpret tree structure and decision rules
4. Identify and prevent overfitting through pruning parameters
5. Analyze feature importance from decision trees
6. Compare decision trees to linear models
7. Choose appropriate hyperparameters for tree-based models

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree, export_text
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    mean_squared_error, r2_score
)

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("All libraries imported successfully!")

## 2. What are Decision Trees?

### Intuition

A **decision tree** is a flowchart-like structure where:
- **Internal nodes**: Questions about features (e.g., "Is age > 30?")
- **Branches**: Answers to questions (yes/no)
- **Leaf nodes**: Final predictions

### Example: Should I play tennis?

```
Outlook sunny?
├─ Yes → Humidity high?
│         ├─ Yes → Don't play
│         └─ No → Play
└─ No → Wind strong?
          ├─ Yes → Don't play
          └─ No → Play
```
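
The flowchart above can also be read as nested `if` statements, which is essentially what a fitted tree does at prediction time. The sketch below is only an illustration: the function name `play_tennis` and the string values (`"sunny"`, `"high"`, `"strong"`) are hypothetical choices, not part of any library.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    """Walk the example tree from the root to a leaf and return its prediction."""
    if outlook == "sunny":           # root node: "Outlook sunny?"
        if humidity == "high":       # internal node on the "Yes" branch
            return "Don't play"      # leaf
        return "Play"                # leaf
    if wind == "strong":             # internal node on the "No" branch
        return "Don't play"          # leaf
    return "Play"                    # leaf


# Each call follows exactly one root-to-leaf path
print(play_tennis("sunny", "high", "weak"))
print(play_tennis("overcast", "normal", "strong"))
print(play_tennis("rain", "normal", "weak"))
```

Every prediction answers at most two questions here, which is why shallow trees are easy to explain.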

### Key Advantages

✅ **Interpretable**: Easy to understand and visualize  
✅ **Non-linear**: Can capture complex patterns  
✅ **No scaling needed**: Works with raw features  
✅ **Handles mixed data**: Both numerical and categorical (in scikit-learn, categorical features must be encoded as numbers first)  
✅ **Feature importance**: Shows which features matter most  

### Key Disadvantages

❌ **Overfitting**: Can memorize training data  
❌ **Instability**: Small data changes can change entire tree  
❌ **Greedy search**: Each split is chosen locally, so the tree may not be globally optimal  

## 3. How Trees Split: Gini vs Entropy

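At each node, a decision tree chooses the split that makes the resulting child nodes as **pure** as possible (ideally, every sample in a node belongs to the same class).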

### Gini Impurity

**Formula**: Gini = 1 - Σ(p_i)²

Where p_i is the proportion of class i

- **Gini = 0**: Perfectly pure (all same class)
- **Gini = 0.5**: Maximum impurity (50-50 split for binary)
- **Default in scikit-learn**

### Entropy (Information Gain)

**Formula**: Entropy = -Σ(p_i × log₂(p_i))

- **Entropy = 0**: Perfectly pure
- **Entropy = 1**: Maximum impurity (for binary)
- Based on information theory; **information gain** is the drop in entropy from a parent node to its children

### Which to Use?

- **Gini**: Faster to compute, works well in practice
- **Entropy**: More theoretically grounded, slightly different trees
- **In practice**: Results are usually similar!
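
To make the criteria concrete, here is a minimal sketch of how a candidate split could be scored: the impurity of the parent minus the size-weighted impurity of the two children. The helper names (`gini_of_labels`, `entropy_of_labels`, `impurity_decrease`) and the toy labels are assumptions made for illustration; scikit-learn performs an equivalent computation internally in compiled code.

```python
import numpy as np


def gini_of_labels(labels):
    """Gini impurity 1 - sum(p_i^2) computed from a list of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy_of_labels(labels):
    """Entropy -sum(p_i * log2(p_i)) computed from a list of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()          # every p_i > 0 because it comes from counts
    return -np.sum(p * np.log2(p))


def impurity_decrease(parent, left, right, impurity_fn):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * impurity_fn(left) \
                      + (len(right) / n) * impurity_fn(right)
    return impurity_fn(parent) - weighted_children


# Toy parent node: 5 samples of class 0 and 5 of class 1
parent = [0] * 5 + [1] * 5
mixed_split = ([0, 0, 0, 0, 1], [0, 1, 1, 1, 1])    # children still mixed
perfect_split = ([0, 0, 0, 0, 0], [1, 1, 1, 1, 1])  # children are pure

for name, (left, right) in [("mixed", mixed_split), ("perfect", perfect_split)]:
    g = impurity_decrease(parent, left, right, gini_of_labels)
    h = impurity_decrease(parent, left, right, entropy_of_labels)
    print(f"{name:>7} split: Gini decrease={g:.3f}, information gain={h:.3f}")
```

Both criteria rank the perfect split above the mixed one, which is the usual situation: they tend to agree on which split is best even though their numeric scales differ.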

In [None]:
# Demonstrate Gini vs Entropy
def gini_impurity(p):
    """Calculate Gini impurity for probability p of class 1"""
    return 1 - p**2 - (1-p)**2

def entropy(p):
    """Calculate entropy for probability p of class 1"""
    if p == 0 or p == 1:
        return 0
    return -(p * np.log2(p) + (1-p) * np.log2(1-p))

# Calculate for different class distributions
p_values = np.linspace(0.01, 0.99, 100)
gini_values = [gini_impurity(p) for p in p_values]
entropy_values = [entropy(p) for p in p_values]

# Plot
plt.figure(figsize=(10, 6))
plt.plot(p_values, gini_values, label='Gini Impurity', linewidth=2)
plt.plot(p_values, entropy_values, label='Entropy', linewidth=2, linestyle='--')
plt.xlabel('Proportion of Class 1 (p)', fontsize=12)
plt.ylabel('Impurity', fontsize=12)
plt.title('Gini Impurity vs Entropy', fontsize=14)
plt.axvline(x=0.5, color='red', linestyle=':', alpha=0.5, label='Maximum impurity')
plt.legend(fontsize=11)  # called after axvline so its label appears in the legend
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Key Observations:")
print("  • Both reach maximum at p=0.5 (50-50 split)")
print("  • Both reach minimum at p=0 or p=1 (pure nodes)")
print("  • Entropy is on a larger scale (max 1.0 vs 0.5 for Gini), but the shapes are similar")
print("\nExamples:")
print(f"  Pure node (100% class 1): Gini={gini_impurity(1):.3f}, Entropy={entropy(1):.3f}")
print(f"  50-50 split: Gini={gini_impurity(0.5):.3f}, Entropy={entropy(0.5):.3f}")
print(f"  75-25 split: Gini={gini_impurity(0.75):.3f}, Entropy={entropy(0.75):.3f}")

## 4. Classification Tree Example

Let's build a decision tree classifier on the Iris dataset.

In [None]:
# Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Iris Dataset:")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")

In [None]:
# Train a simple decision tree (max_depth=3 for visualization)
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

# Make predictions
y_train_pred = tree_clf.predict(X_train)
y_test_pred = tree_clf.predict(X_test)

# Evaluate
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

print("Decision Tree Results:")
print(f"Training Accuracy: {train_acc:.2%}")
print(f"Test Accuracy: {test_acc:.2%}")
print(f"\nTree Depth: {tree_clf.get_depth()}")
print(f"Number of Leaves: {tree_clf.get_n_leaves()}")

In [None]:
# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(
    tree_clf,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title('Decision Tree Visualization (Iris Dataset)', fontsize=16)
plt.tight_layout()
plt.show()

print("\nHow to Read the Tree:")
print("  • Top box (root): First split decision")
print("  • Each box shows:")
print("    - Split condition (e.g., 'petal width <= 0.8')")
print("    - Gini impurity")
print("    - Number of samples")
print("    - Class distribution")
print("  • Color indicates majority class")
print("  • Leaf nodes: Final predictions")

In [None]:
# Text representation of the tree
tree_rules = export_text(tree_clf, feature_names=list(iris.feature_names))
print("Decision Tree Rules (Text Format):")
print(tree_rules)

## 5. Feature Importance

Decision trees automatically calculate **feature importance** based on how much each feature reduces impurity.
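
As a rough illustration of what "reduces impurity" means here, the sketch below recomputes the importances by hand from the fitted tree's internal arrays (`tree_.impurity`, `tree_.weighted_n_node_samples`, `tree_.children_left`, `tree_.children_right`, `tree_.feature`). The variable names `iris_demo`, `demo_tree`, and `manual` are introduced only for this example, and the tree is refit on the full Iris data so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
demo_tree.fit(iris_demo.data, iris_demo.target)
t = demo_tree.tree_

manual = np.zeros(iris_demo.data.shape[1])
n = t.weighted_n_node_samples
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                       # -1 marks a leaf: no split, no contribution
        continue
    # Sample-weighted impurity of the node minus that of its two children
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right])
    manual[t.feature[node]] += decrease  # credit the feature used for this split

manual /= manual.sum()                   # normalize so the importances sum to 1

for name, value in zip(iris_demo.feature_names, manual):
    print(f"{name:<20} {value:.3f}")
print("Matches feature_importances_:",
      np.allclose(manual, demo_tree.feature_importances_))
```

A feature that is never used for a split gets an importance of exactly zero, which does not necessarily mean it is uninformative; a correlated feature may simply have been chosen instead.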

In [None]:
# Get feature importances
importances = tree_clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Create DataFrame
importance_df = pd.DataFrame({
    'Feature': [iris.feature_names[i] for i in indices],
    'Importance': importances[indices]
})

print("Feature Importances:")
print(importance_df.to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], edgecolor='k')
plt.xlabel('Importance', fontsize=12)
plt.title('Feature Importance in Decision Tree', fontsize=14)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print(f"\nMost important feature: {importance_df.iloc[0]['Feature']}")
print(f"Importance sum: {importances.sum():.3f} (should be 1.0)")

## 6. Overfitting and Pruning

**Problem**: Deep trees can overfit by memorizing training data.

**Solution**: Limit tree complexity with hyperparameters that stop growth early (often called pre-pruning).

### Key Hyperparameters

1. **max_depth**: Maximum tree depth (usually the first parameter to tune)
2. **min_samples_split**: Minimum samples required to split a node
3. **min_samples_leaf**: Minimum samples required in a leaf
4. **max_leaf_nodes**: Maximum number of leaf nodes
5. **min_impurity_decrease**: Minimum impurity decrease to split
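
One common way to choose values for these parameters is a cross-validated grid search. The sketch below is a minimal example, not a recommended recipe: the grid values are arbitrary choices for illustration, and the names `X_demo`, `X_tr`, `param_grid`, and `search` are introduced only for this block.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Candidate values are illustrative assumptions; adapt them to your data size
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                  # 5-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.3f}")
```

Selecting hyperparameters with cross-validation on the training split keeps the test set untouched, so the final test score remains an honest estimate.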

In [None]:
# Compare trees with different max_depth values
depths = [1, 2, 3, 5, 10, None]  # None = no limit
results = []

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    
    results.append({
        'max_depth': depth if depth is not None else 'unlimited',
        'actual_depth': tree.get_depth(),
        'n_leaves': tree.get_n_leaves(),
        'train_acc': train_acc,
        'test_acc': test_acc
    })

results_df = pd.DataFrame(results)
print("Effect of max_depth on Model Performance:")
print(results_df.to_string(index=False))

print("\nKey Observations:")
print("  • Deeper trees → Higher training accuracy")
print("  • Too deep → Overfitting (gap between train and test)")
print("  • Optimal depth balances complexity and generalization")

In [None]:
# Visualize overfitting
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy vs Depth
depth_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_scores = []
test_scores = []

for depth in depth_values:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, tree.predict(X_train)))
    test_scores.append(accuracy_score(y_test, tree.predict(X_test)))

axes[0].plot(depth_values, train_scores, 'o-', label='Training', linewidth=2)
axes[0].plot(depth_values, test_scores, 's-', label='Test', linewidth=2)
axes[0].set_xlabel('Max Depth', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Accuracy vs Tree Depth', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Complexity (number of nodes)
n_nodes = [DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_train, y_train).tree_.node_count 
          for d in depth_values]
axes[1].plot(depth_values, n_nodes, 'o-', color='green', linewidth=2)
axes[1].set_xlabel('Max Depth', fontsize=12)
axes[1].set_ylabel('Number of Nodes', fontsize=12)
axes[1].set_title('Tree Complexity vs Depth', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice: Training accuracy rises toward 100% as depth grows, while test accuracy plateaus (and may drop)!")

## 7. Regression Trees

Decision trees can also predict continuous values!

**Difference**: Instead of class labels, leaf nodes contain the **mean** of target values.
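
The claim about leaf means can be checked directly. The sketch below uses a small synthetic dataset (a noisy sine curve, an assumption chosen so the block runs without downloading anything) and the default squared-error criterion; `X_toy`, `y_toy`, and `toy_reg` are names introduced only for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data: noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

toy_reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)

leaf_ids = toy_reg.apply(X_toy)   # index of the leaf each training sample lands in
preds = toy_reg.predict(X_toy)

print("leaf  n_samples  mean(y in leaf)  prediction")
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    print(f"{leaf:>4}  {in_leaf.sum():>9}  "
          f"{y_toy[in_leaf].mean():>15.4f}  {preds[in_leaf][0]:>10.4f}")
```

Because every sample in a leaf receives the same value, a depth-3 tree can output at most 8 distinct predictions, which is what produces the "stepped" look in prediction plots.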

In [None]:
# Load California housing dataset
housing = datasets.fetch_california_housing()
X_housing = housing.data[:1000]  # First 1,000 rows to keep training fast (not a random sample)
y_housing = housing.target[:1000]

# Split
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

print("Housing Dataset (Regression):")
print(f"Features: {housing.feature_names}")
print(f"Target: Median house value (in $100,000s)")
print(f"Training samples: {X_train_h.shape[0]}")

In [None]:
# Train regression tree
tree_reg = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_reg.fit(X_train_h, y_train_h)

# Predictions
y_train_pred_h = tree_reg.predict(X_train_h)
y_test_pred_h = tree_reg.predict(X_test_h)

# Evaluate
train_r2 = r2_score(y_train_h, y_train_pred_h)
test_r2 = r2_score(y_test_h, y_test_pred_h)
test_rmse = np.sqrt(mean_squared_error(y_test_h, y_test_pred_h))

print("Regression Tree Results:")
print(f"Training R²: {train_r2:.3f}")
print(f"Test R²: {test_r2:.3f}")
print(f"Test RMSE: ${test_rmse*100000:.2f}")
print(f"\nTree depth: {tree_reg.get_depth()}")
print(f"Number of leaves: {tree_reg.get_n_leaves()}")

In [None]:
# Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test_h, y_test_pred_h, alpha=0.6, edgecolors='k')
plt.plot([y_test_h.min(), y_test_h.max()], 
        [y_test_h.min(), y_test_h.max()], 
        'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Price ($100,000s)', fontsize=12)
plt.ylabel('Predicted Price ($100,000s)', fontsize=12)
plt.title('Regression Tree: Actual vs Predicted', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice: Predictions are 'stepped' - trees make constant predictions in regions")

In [None]:
# Feature importance for regression tree
importances_reg = tree_reg.feature_importances_
indices_reg = np.argsort(importances_reg)[::-1]

plt.figure(figsize=(10, 6))
plt.barh(
    [housing.feature_names[i] for i in indices_reg],
    importances_reg[indices_reg],
    edgecolor='k'
)
plt.xlabel('Importance', fontsize=12)
plt.title('Feature Importance (Regression Tree)', fontsize=14)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print(f"Most important feature: {housing.feature_names[indices_reg[0]]}")

## 8. Decision Trees vs Linear Models

Let's compare decision trees to logistic regression.

In [None]:
# Create a dataset with non-linear decision boundary
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=42)

# Split
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42
)

# Train both models
from sklearn.linear_model import LogisticRegression

# Logistic Regression (linear boundary)
log_reg = LogisticRegression()
log_reg.fit(X_train_m, y_train_m)
log_acc = accuracy_score(y_test_m, log_reg.predict(X_test_m))

# Decision Tree (non-linear boundary)
tree_moon = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_moon.fit(X_train_m, y_train_m)
tree_acc = accuracy_score(y_test_m, tree_moon.predict(X_test_m))

print("Non-Linear Dataset Comparison:")
print(f"Logistic Regression Accuracy: {log_acc:.2%}")
print(f"Decision Tree Accuracy: {tree_acc:.2%}")
print(f"\nWinner: {'Decision Tree' if tree_acc > log_acc else 'Logistic Regression'}")

In [None]:
# Visualize decision boundaries
def plot_decision_boundary(model, X, y, title):
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                        np.linspace(y_min, y_max, 200))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', 
               edgecolors='k', s=50, alpha=0.7)
    plt.title(title, fontsize=14)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

plt.sca(axes[0])  # make the left axes current so the helper draws into it
plot_decision_boundary(log_reg, X_test_m, y_test_m, 
                      f'Logistic Regression (Acc: {log_acc:.2%})')

plt.sca(axes[1])  # switch to the right axes
plot_decision_boundary(tree_moon, X_test_m, y_test_m, 
                      f'Decision Tree (Acc: {tree_acc:.2%})')

plt.tight_layout()
plt.show()

print("\nKey Insight:")
print("  • Logistic Regression: Linear boundary (straight line)")
print("  • Decision Tree: Non-linear boundary (rectangular regions)")
print("  • Trees can fit non-linear patterns that a single linear boundary cannot")

## 9. Practice Exercises

### Exercise 1: Optimal Tree Depth

Using the wine dataset (`datasets.load_wine()`):
1. Train decision trees with max_depth from 1 to 15
2. Plot training and test accuracy
3. Find the optimal depth that maximizes test accuracy

In [None]:
# Your code here


### Exercise 2: Pruning Parameters

Train trees on breast cancer dataset with different pruning parameters:
1. min_samples_split = [2, 10, 50]
2. min_samples_leaf = [1, 5, 20]
3. Which combination gives best test performance?

In [None]:
# Your code here


### Exercise 3: Feature Importance Analysis

Using the diabetes dataset for regression:
1. Train a regression tree
2. Identify the top 3 most important features
3. Train a new tree using only those 3 features
4. How does performance compare?

In [None]:
# Your code here


### Exercise 4: Interpretability Challenge

Train a shallow tree (max_depth=3) on Iris dataset:
1. Export the tree rules as text
2. Manually trace a prediction for a sample
3. Verify your manual prediction matches model.predict()

In [None]:
# Your code here


## 10. Summary

### Key Concepts Learned

1. **Decision Tree Structure**:
   - Internal nodes: Feature-based questions
   - Branches: Decision paths
   - Leaf nodes: Final predictions
   - Highly interpretable flowchart structure

2. **Splitting Criteria**:
   - **Gini Impurity**: 1 - Σ(p_i)² (faster, default)
   - **Entropy**: -Σ(p_i × log₂(p_i)) (information theory)
   - Goal: Create pure nodes (low impurity)

3. **Classification vs Regression Trees**:
   - Classification: Predict class labels (majority vote in leaf)
   - Regression: Predict continuous values (mean in leaf)
   - Same algorithm, different output types

4. **Feature Importance**:
   - Automatically calculated during training
   - Based on impurity reduction
   - Sum equals 1.0
   - Helps identify key features

5. **Overfitting Prevention (Pruning)**:
   - **max_depth**: Limit tree depth (usually the first parameter to tune)
   - **min_samples_split**: Min samples to split node
   - **min_samples_leaf**: Min samples in leaf
   - **max_leaf_nodes**: Limit number of leaves

6. **Advantages vs Disadvantages**:
   - ✅ Interpretable, handles non-linear patterns, no scaling needed
   - ❌ Prone to overfitting, unstable, greedy learning

### When to Use Decision Trees

✅ **Good for**:
- Interpretability is important
- Non-linear relationships
- Mixed data types (numerical + categorical)
- Feature importance analysis
- Quick baseline model

❌ **Not ideal for**:
- High-dimensional data (many features)
- Linear relationships (use linear models)
- Need for stability (small data changes affect tree)
- Production without ensembles (use Random Forest instead)

### Comparison: Trees vs Linear Models

| Aspect | Decision Trees | Linear Models |
|--------|---------------|---------------|
| **Boundary** | Non-linear (rectangular) | Linear (straight) |
| **Interpretability** | Very high (rules) | High (coefficients) |
| **Scaling** | Not needed | Often recommended (e.g., with regularization) |
| **Overfitting** | High risk | Lower risk |
| **Stability** | Low (sensitive to data) | High |
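
The "Scaling" row can be checked empirically. The sketch below fits the same tree on raw and on standardized features and compares test predictions; since tree splits depend only on the ordering of feature values, the two models are expected to agree on essentially every point. The dataset choice (`make_moons`) and the names `Xm_tr`, `raw_tree`, `scaled_tree`, and `agreement` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_s, y_s = make_moons(n_samples=300, noise=0.2, random_state=42)
Xm_tr, Xm_te, ym_tr, ym_te = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42
)

# Fit the scaler on training data only, then apply it to both splits
scaler = StandardScaler().fit(Xm_tr)

raw_tree = DecisionTreeClassifier(max_depth=5, random_state=42)
raw_tree.fit(Xm_tr, ym_tr)

scaled_tree = DecisionTreeClassifier(max_depth=5, random_state=42)
scaled_tree.fit(scaler.transform(Xm_tr), ym_tr)

# Fraction of test points on which the two trees give the same label
agreement = np.mean(
    raw_tree.predict(Xm_te) == scaled_tree.predict(scaler.transform(Xm_te))
)
print(f"Prediction agreement (raw vs. scaled features): {agreement:.2%}")
```

Standardizing shifts and stretches each feature without changing the order of its values, so the same samples fall on each side of every split.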

### Quick Reference: Scikit-learn

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree, export_text

# Classification
tree_clf = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    criterion='gini',  # or 'entropy'
    random_state=42
)

tree_clf.fit(X_train, y_train)
predictions = tree_clf.predict(X_test)

# Feature importance
importances = tree_clf.feature_importances_

# Visualize
plot_tree(tree_clf, feature_names=feature_names, filled=True)

# Text rules
rules = export_text(tree_clf, feature_names=feature_names)
```

### Next Steps

In the next module, we'll explore:
- **Comprehensive evaluation metrics** for classification and regression
- Precision, recall, F1-score deep dive
- ROC curves and AUC
- Choosing the right metric for your problem

### Additional Resources

- [Scikit-learn Decision Trees](https://scikit-learn.org/stable/modules/tree.html)
- [StatQuest: Decision Trees](https://www.youtube.com/watch?v=7VeUPuFGJHk)
- [Visualizing Decision Trees](https://explained.ai/decision-tree-viz/)
- [Information Gain and Entropy](https://towardsdatascience.com/entropy-and-information-gain-in-decision-trees-c7db67a3a293)