# Module 05: Decision Trees

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 60 minutes  
**Prerequisites**: 
- [Module 00: Introduction to ML and scikit-learn](00_introduction_to_ml_and_sklearn.ipynb)
- [Module 01: Supervised vs Unsupervised Learning](01_supervised_vs_unsupervised_learning.ipynb)
- [Module 03: Linear Regression](03_linear_regression.ipynb)
- [Module 04: Logistic Regression](04_logistic_regression.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand how decision trees make predictions using if-then rules
2. Build classification and regression trees
3. Control tree complexity with max_depth and other hyperparameters
4. Visualize decision trees to understand their logic
5. Interpret feature importance from decision trees
6. Recognize and prevent overfitting in trees

## 1. What are Decision Trees?

A **decision tree** is a popular machine learning algorithm that makes predictions by learning simple decision rules from data.

### The Big Idea
Think of a flowchart or a game of "20 Questions":
- Ask a series of yes/no questions
- Each answer leads to another question
- Eventually reach a decision

### Real-World Example: Should I Play Tennis?
```
Is it sunny?
├── Yes: Is humidity high?
│   ├── Yes: Don't play (too humid)
│   └── No: Play!
└── No: Is it raining?
    ├── Yes: Is it windy?
    │   ├── Yes: Don't play
    │   └── No: Play!
    └── No: Play!
```
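
The flowchart above maps directly onto nested if-then rules. The sketch below hand-codes it as a Python function; the function name and its boolean parameters are made up for illustration, and nothing here is learned from data.

```python
def should_play_tennis(sunny: bool, humid: bool, raining: bool, windy: bool) -> bool:
    """Hand-written version of the tennis flowchart (not a learned model)."""
    if sunny:
        # Sunny branch: only humidity matters
        return not humid
    if raining:
        # Rainy branch: only wind matters
        return not windy
    # Neither sunny nor raining: always play
    return True


print(should_play_tennis(sunny=True, humid=True, raining=False, windy=False))   # False
print(should_play_tennis(sunny=False, humid=False, raining=True, windy=False))  # True
```

A trained decision tree has the same shape; the difference is that the algorithm picks the questions and their order from the training data.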

### Key Terminology
- **Root Node**: Top of tree (first question)
- **Internal Nodes**: Decision points (questions)
- **Leaf Nodes**: Final predictions (answers)
- **Branches**: Connections between nodes
- **Depth**: Longest path from root to leaf
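
To make these terms concrete, here is a minimal sketch that stores the tennis tree as a nested dictionary and counts its leaves and depth. The dictionary layout and the helper names are hypothetical and chosen only for illustration; scikit-learn stores trees differently.

```python
# Hypothetical nested-dict encoding of the tennis tree (illustration only).
# A dict is a question node (the outermost one is the root); a string is a leaf.
tennis_tree = {
    "question": "Is it sunny?",
    "yes": {"question": "Is humidity high?", "yes": "Don't play", "no": "Play"},
    "no": {
        "question": "Is it raining?",
        "yes": {"question": "Is it windy?", "yes": "Don't play", "no": "Play"},
        "no": "Play",
    },
}


def count_leaves(node) -> int:
    """Number of leaf nodes (final predictions) under this node."""
    if isinstance(node, str):
        return 1
    return count_leaves(node["yes"]) + count_leaves(node["no"])


def depth(node) -> int:
    """Number of questions on the longest path from this node down to a leaf."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(node["yes"]), depth(node["no"]))


print(count_leaves(tennis_tree))  # 5
print(depth(tennis_tree))         # 3
```

The longest path asks three questions (sunny, raining, windy), which is why the depth is 3.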

### Advantages
- Easy to understand and interpret
- Visual representation
- Handles both numerical and categorical data (scikit-learn's trees expect categorical features to be encoded as numbers first)
- No need for feature scaling
- Captures non-linear relationships

### Disadvantages
- Can easily overfit (memorize training data)
- Sensitive to small changes in data
- Split selection and importance scores tend to favor features with many distinct values
- Not always the most accurate

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

## 2. Classification Trees

Let's build a decision tree to classify Iris species.

In [None]:
# Load Iris dataset
iris_df = pd.read_csv('data/sample/iris.csv')

# Prepare features and target
feature_cols = ['sepal length (cm)', 'sepal width (cm)', 
                'petal length (cm)', 'petal width (cm)']
X = iris_df[feature_cols]
y = iris_df['species']

print("Iris Dataset:")
print(f"Samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Classes: {y.nunique()}")
print(f"\nClass distribution:")
print(y.value_counts().sort_index())

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print("\nNote: Decision trees don't require feature scaling!")
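
One way to check the scaling claim is to train the same tree on raw and on standardized features and compare the results. This is a minimal sketch that assumes scikit-learn's bundled Iris data (`load_iris`) as a stand-in for `data/sample/iris.csv`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for data/sample/iris.csv
X_demo, y_demo = load_iris(return_X_y=True)

# Same tree settings, trained once on raw and once on standardized features
X_scaled = StandardScaler().fit_transform(X_demo)
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_scaled, y_demo)

# tree_.feature holds the feature index used at each node (negative for leaves)
same_features = np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature)
same_train_preds = np.array_equal(tree_raw.predict(X_demo), tree_scaled.predict(X_scaled))

print(f"Same split features at every node: {same_features}")
print(f"Same predictions on the training data: {same_train_preds}")
```

Standardizing only shifts and rescales each feature, so the ordering of samples along a feature is unchanged. A split depends on that ordering, which is why the tree picks the same features and only the threshold values differ.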

In [None]:
# Train a simple decision tree
from sklearn.tree import DecisionTreeClassifier

# Create a tree with max_depth=3 (to prevent overfitting)
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

print("✓ Decision Tree trained!")
print(f"\nTree depth: {tree_clf.get_depth()}")
print(f"Number of leaves: {tree_clf.get_n_leaves()}")

In [None]:
# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report

train_accuracy = tree_clf.score(X_train, y_train)
test_accuracy = tree_clf.score(X_test, y_test)

print("Model Performance:")
print(f"Training Accuracy: {train_accuracy:.1%}")
print(f"Testing Accuracy: {test_accuracy:.1%}")

# Make predictions
y_pred = tree_clf.predict(X_test)

print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

## 3. Visualizing Decision Trees

One of the best features of decision trees is that we can visualize them to understand exactly how they make decisions!

In [None]:
# Visualize the decision tree
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(tree_clf, 
         feature_names=feature_cols,
         class_names=[str(c) for c in tree_clf.classes_],  # actual labels, in the model's order
         filled=True,
         rounded=True,
         fontsize=10)
plt.title('Decision Tree Visualization\nEach box shows: condition, gini, samples, value, class', 
         fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("How to read the tree:")
print("- Top box: Root node (first decision)")
print("- Each box shows:")
print("  * Decision rule (e.g., petal length <= 2.45)")
print("  * gini: Impurity of the node (0 = all samples from one class)")
print("  * samples: Number of samples at this node")
print("  * value: Distribution across classes")
print("  * class: Predicted class (majority)")
print("- Color: Majority class; darker shade = purer node")
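
When a plot is hard to read or you want to copy the rules into a report, `sklearn.tree.export_text` prints the same tree as indented if-then rules. The sketch below assumes scikit-learn's bundled Iris data (`load_iris`) in place of the CSV used in this notebook, so the exact thresholds may differ from the plotted tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled Iris data stands in for data/sample/iris.csv
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
demo_tree.fit(iris.data, iris.target)

# export_text returns the fitted tree as indented if-then rules
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level corresponds to one level of the tree, and every line starting with `class:` is a leaf.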

## 4. Feature Importance

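A fitted tree exposes `feature_importances_`: each feature's share of the total impurity reduction achieved by the splits that use it. The scores sum to 1.0, and a feature that is never used in a split scores 0.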

In [None]:
# Get feature importances
importances = tree_clf.feature_importances_

# Create DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("Feature Importance:")
print(feature_importance_df.to_string(index=False))

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], 
        color='steelblue', alpha=0.7)
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance in Decision Tree\n(Higher = More Important)', 
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print(f"\nInterpretation:")
print(f"- {feature_importance_df.iloc[0]['Feature']} is the most important feature")
print(f"- Accounts for {feature_importance_df.iloc[0]['Importance']:.1%} of the tree's total impurity reduction")
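
The importance scores come from impurity reduction. scikit-learn's classifier uses Gini impurity by default, and a feature's importance is the sample-weighted impurity decrease of its splits, normalized across features. The sketch below works through one split by hand; the helper functions and the class counts are illustrative and not part of scikit-learn.

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def impurity_decrease(parent, left, right):
    """Parent impurity minus the sample-weighted impurity of its two children."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    weighted_children = (n_left * gini(left) + n_right * gini(right)) / n_parent
    return gini(parent) - weighted_children


# Illustrative counts: a balanced 3-class node where one split isolates the first class
parent = [35, 35, 35]
left, right = [35, 0, 0], [0, 35, 35]

print(f"Parent Gini: {gini(parent):.3f}")                                  # 0.667
print(f"Impurity decrease: {impurity_decrease(parent, left, right):.3f}")  # 0.333
```

The left child is pure (Gini 0) and the right child still mixes two classes (Gini 0.5), so this single split removes half of the parent's impurity.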

## 5. Tree Depth and Overfitting

**Key Concept**: Deeper trees can memorize training data (overfit) instead of learning general patterns.

Let's compare trees of different depths.

In [None]:
# Train trees with different depths
depths = [1, 2, 3, 5, 10, None]  # None = no limit
results = []

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    
    results.append({
        'Max Depth': str(depth),
        'Actual Depth': tree.get_depth(),
        'Num Leaves': tree.get_n_leaves(),
        'Train Accuracy': f"{train_acc:.1%}",
        'Test Accuracy': f"{test_acc:.1%}"
    })

results_df = pd.DataFrame(results)
print("Impact of Tree Depth on Performance:")
print(results_df.to_string(index=False))

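print("\nWhat to look for in the table:")
print("- Depth 1: Underfits (one split cannot separate three classes)")
print("- Depth 3-5: Usually a good balance on this dataset")
print("- Unlimited depth: Training accuracy reaches its maximum; test accuracy may not follow")
print("- 'Actual Depth' stops increasing once every leaf is pure, so larger limits change nothing")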

In [None]:
# Visualize the effect of depth
depths_numeric = [1, 2, 3, 5, 10, 20]
train_accuracies = []
test_accuracies = []

for depth in depths_numeric:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_accuracies.append(tree.score(X_train, y_train))
    test_accuracies.append(tree.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(depths_numeric, train_accuracies, 'o-', linewidth=2, label='Training Accuracy', 
        markersize=8)
plt.plot(depths_numeric, test_accuracies, 's-', linewidth=2, label='Testing Accuracy', 
        markersize=8)
plt.xlabel('Maximum Tree Depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Accuracy vs Tree Depth\n(Gap indicates overfitting)', 
         fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("The gap between training and testing accuracy shows overfitting!")
print("Deeper trees memorize training data instead of learning patterns.")
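
Memorization is easiest to see on data with no pattern at all. The sketch below trains an unrestricted tree on synthetic features with random labels (an assumption made purely for illustration; this data is not part of the notebook's datasets).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): labels are pure noise,
# so there is no real pattern for any model to learn
rng = np.random.default_rng(42)
X_noise = rng.normal(size=(2000, 5))
y_noise = rng.integers(0, 2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.5, random_state=42
)

# No depth limit: the tree keeps splitting until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

noise_train_acc = deep_tree.score(X_tr, y_tr)
noise_test_acc = deep_tree.score(X_te, y_te)
print(f"Depth: {deep_tree.get_depth()}")
print(f"Train accuracy: {noise_train_acc:.1%}")
print(f"Test accuracy:  {noise_test_acc:.1%}")
```

The tree classifies every training sample correctly even though the labels are random, while test accuracy stays near coin-flip level. Perfect training accuracy on its own therefore says nothing about how well a tree generalizes.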

## 6. Regression Trees

Decision trees can also be used for regression (predicting continuous values)!
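
Before moving to a real dataset, a toy example shows what a regression tree predicts: each leaf returns the mean target value of the training samples that ended up in it. The data below is made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (illustrative): two clearly separated groups of target values
X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# max_depth=1 allows a single split, so the tree has exactly two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_toy, y_toy)

# Each leaf predicts the mean target of the training samples that fall into it
toy_preds = stump.predict([[0], [2.5], [20]])
print(toy_preds)  # leaf means: 2.0, 2.0, 11.0
```

Inputs 0 and 2.5 land in the same leaf and get the same prediction, so a regression tree's output is a step function rather than a smooth curve.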

In [None]:
# Load California housing dataset
housing_df = pd.read_csv('data/sample/california_housing.csv')

# Prepare data
X_housing = housing_df.drop('median_house_value', axis=1)
y_housing = housing_df['median_house_value']

# Split data
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_housing, y_housing, test_size=0.3, random_state=42
)

print("California Housing Dataset:")
print(f"Training samples: {len(X_train_h)}")
print(f"Testing samples: {len(X_test_h)}")
print(f"Features: {X_housing.shape[1]}")

In [None]:
# Train a regression tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

tree_reg = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_reg.fit(X_train_h, y_train_h)

# Make predictions
y_pred_h = tree_reg.predict(X_test_h)

# Evaluate
rmse = np.sqrt(mean_squared_error(y_test_h, y_pred_h))
r2 = r2_score(y_test_h, y_pred_h)

print("Regression Tree Performance:")
print(f"RMSE: ${rmse:,.2f}")
print(f"R² Score: {r2:.3f}")
print(f"\nTree depth: {tree_reg.get_depth()}")
print(f"Number of leaves: {tree_reg.get_n_leaves()}")

In [None]:
# Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test_h, y_pred_h, alpha=0.3, s=20)
plt.plot([y_test_h.min(), y_test_h.max()], [y_test_h.min(), y_test_h.max()], 
        'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual House Value ($)', fontsize=12)
plt.ylabel('Predicted House Value ($)', fontsize=12)
plt.title(f'Regression Tree Predictions\nR² = {r2:.3f}', 
         fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Feature importance for regression tree
importance_df_reg = pd.DataFrame({
    'Feature': X_housing.columns,
    'Importance': tree_reg.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importance_df_reg['Feature'], importance_df_reg['Importance'], 
        color='green', alpha=0.7)
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance for House Price Prediction', 
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("Top 3 Most Important Features:")
print(importance_df_reg.head(3).to_string(index=False))

## 7. Hyperparameters for Controlling Trees

Decision trees have several hyperparameters to control their complexity and prevent overfitting:

### Important Hyperparameters

1. **max_depth**: Maximum depth of the tree
   - Lower = simpler tree, less overfitting
   - Typical values: 3-10

2. **min_samples_split**: Minimum samples required to split a node
   - Higher = fewer splits, simpler tree
   - Typical values: 2-20

3. **min_samples_leaf**: Minimum samples in a leaf node
   - Higher = smoother predictions
   - Typical values: 1-10

4. **max_features**: Number of features to consider when splitting
   - Lower = more diversity, less overfitting
   - Options: 'sqrt', 'log2', an integer, a float (fraction of features), or None (all features)

5. **max_leaf_nodes**: Maximum number of leaf nodes
   - Limits tree size directly
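
To see what one of these constraints does, you can inspect the fitted tree's internal arrays. The sketch below checks that `min_samples_leaf=10` is respected by every leaf; it assumes scikit-learn's bundled Iris data (`load_iris`) as a stand-in for the notebook's CSV.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for data/sample/iris.csv
X_demo, y_demo = load_iris(return_X_y=True)

constrained = DecisionTreeClassifier(min_samples_leaf=10, random_state=42)
constrained.fit(X_demo, y_demo)

# In the fitted tree_ arrays, a node with children_left == -1 is a leaf
is_leaf = constrained.tree_.children_left == -1
leaf_sizes = constrained.tree_.n_node_samples[is_leaf]

print(f"Leaves: {constrained.get_n_leaves()}")
print(f"Smallest leaf holds {leaf_sizes.min()} training samples")
```

No depth limit is set here, yet the tree stays small because any split that would create a leaf with fewer than 10 samples is rejected.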

In [None]:
# Compare different hyperparameter settings
configs = [
    {'max_depth': 3, 'name': 'Shallow (max_depth=3)'},
    {'max_depth': None, 'min_samples_split': 20, 'name': 'Min Split=20'},
    {'max_depth': None, 'min_samples_leaf': 10, 'name': 'Min Leaf=10'},
    {'max_depth': None, 'max_leaf_nodes': 20, 'name': 'Max Leaves=20'},
    {'max_depth': None, 'name': 'Unlimited (will overfit)'}
]

comparison_results = []

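for config in configs:
    # Copy the settings instead of popping, so re-running the cell does not raise KeyError
    params = {k: v for k, v in config.items() if k != 'name'}
    name = config['name']
    tree = DecisionTreeClassifier(random_state=42, **params)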
    tree.fit(X_train, y_train)
    
    comparison_results.append({
        'Configuration': name,
        'Depth': tree.get_depth(),
        'Leaves': tree.get_n_leaves(),
        'Train Acc': f"{tree.score(X_train, y_train):.1%}",
        'Test Acc': f"{tree.score(X_test, y_test):.1%}"
    })

comparison_df = pd.DataFrame(comparison_results)
print("Comparing Different Tree Configurations:")
print(comparison_df.to_string(index=False))

print("\nBest Practice: Use constraints to prevent overfitting!")
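
Rather than comparing configurations by hand against the test set, one common option is a cross-validated grid search on the training split. The sketch below uses `GridSearchCV`; it assumes scikit-learn's bundled Iris data in place of the CSV, and the grid values are illustrative rather than recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for data/sample/iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Illustrative grid; the candidate values are not tuned recommendations
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation on the training split only; the test split stays untouched
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Mean CV accuracy: {search.best_score_:.1%}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.1%}")
```

Because the hyperparameters are chosen without looking at the test split, the final test accuracy remains an honest estimate of performance on new data.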

## Exercises

Practice building and analyzing decision trees.

### Exercise 1: Build a Decision Tree for Wine Classification

Steps:
1. Load the wine dataset from 'data/sample/wine.csv'
2. Separate features and target
3. Split data (70/30, stratified)
4. Train a DecisionTreeClassifier with max_depth=4
5. Calculate and print training and testing accuracy
6. Visualize the feature importances

In [None]:
# Your code here


### Exercise 2: Finding Optimal Tree Depth

Using the breast cancer dataset:
1. Load data from 'data/sample/breast_cancer.csv'
2. Prepare features (drop 'target' and 'diagnosis') and target
3. Split data (70/30, stratified)
4. Train trees with max_depth from 1 to 15
5. Plot training and testing accuracy vs depth
6. Identify the optimal depth (best test accuracy without overfitting)

In [None]:
# Your code here


### Exercise 3: Regression Tree vs Linear Regression

Compare a DecisionTreeRegressor with LinearRegression on the diabetes dataset:

1. Load 'data/sample/diabetes.csv'
2. Split data (70/30)
3. Train both models:
   - LinearRegression
   - DecisionTreeRegressor (max_depth=5)
4. Calculate RMSE and R² for both
5. Which performs better? Why might this be?

In [None]:
# Your code here


### Exercise 4: Visualizing a Simple Tree

Create a very simple, interpretable decision tree:

1. Use the Iris dataset (first two features only for simplicity)
2. Train a DecisionTreeClassifier with max_depth=2
3. Visualize the tree using plot_tree()
4. Write out the decision rules in plain English
5. Example format: "If sepal length <= [threshold], predict [class]" (read the real thresholds off your own tree)

In [None]:
# Your code here


## Summary

Congratulations! You've mastered decision trees, an intuitive and powerful ML algorithm.

### Key Concepts

1. **Decision Trees**:
   - Make predictions using series of if-then rules
   - Like a flowchart or game of 20 questions
   - Work for both classification and regression
   - No need for feature scaling

2. **Tree Structure**:
   - **Root node**: First decision
   - **Internal nodes**: Decision points (questions)
   - **Leaf nodes**: Final predictions
   - **Depth**: How many questions in longest path

3. **Feature Importance**:
   - Trees automatically calculate feature importance
   - Shows which features are most useful for predictions
   - Sum to 1.0 across all features
   - Higher value = more important feature

4. **Overfitting in Trees**:
   - Deep trees memorize training data
   - Perfect training accuracy but poor test accuracy
   - Gap between train/test accuracy indicates overfitting
   - Solution: Limit tree complexity

5. **Hyperparameters**:
   - **max_depth**: Limit tree depth (most important)
   - **min_samples_split**: Minimum samples to split node
   - **min_samples_leaf**: Minimum samples in leaf
   - **max_leaf_nodes**: Limit total leaves
   - All help prevent overfitting

6. **Advantages**:
   - Easy to understand and visualize
   - Handles non-linear relationships
   - No feature scaling needed
   - Works with mixed data types
   - Provides feature importance

7. **Disadvantages**:
   - Prone to overfitting
   - Sensitive to small data changes
   - Can create biased trees
   - Often less accurate than ensemble methods

### Best Practices

1. **Always limit tree depth** (start with 3-5)
2. **Monitor train vs test accuracy** for overfitting
3. **Visualize trees** to understand decisions
4. **Check feature importance** for insights
5. **Consider ensemble methods** (Random Forests, Gradient Boosting) for better accuracy

### When to Use Decision Trees

**Good for:**
- Need interpretable models
- Non-linear relationships
- Mixed data types (numerical + categorical, with categoricals encoded for scikit-learn)
- Quick baseline models
- Feature importance analysis

**Not good for:**
- Need highest accuracy (use ensembles instead)
- Linear relationships (use linear models)
- Very small datasets (prone to overfitting)

### What's Next?

In **Module 06: Model Evaluation Metrics**, you'll learn:
- Comprehensive classification metrics (precision, recall, F1)
- When to use each metric
- ROC curves and AUC
- Regression metrics in depth
- Confusion matrix analysis

### Additional Resources

- [Decision Trees - StatQuest](https://www.youtube.com/watch?v=7VeUPuFGJHk)
- [scikit-learn Decision Trees](https://scikit-learn.org/stable/modules/tree.html)
- [Visual Introduction to Decision Trees](https://www.r2d3.us/visual-intro-to-machine-learning-part-1/)