# Decision Trees - Complete Guide

## From Theory to Implementation

Decision Trees are **hierarchical models** that make decisions by asking a series of questions about features.

### What You'll Learn
1. Tree structure and terminology
2. Splitting criteria (Gini, Entropy, Information Gain)
3. CART algorithm
4. Implementation from scratch
5. Pruning and regularization
6. Feature importance
7. Regression trees

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_wine, make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from collections import Counter

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. Tree Structure and Terminology

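- **Root Node**: Top of the tree; contains the entire training set before any split
- **Internal Node**: Decision point that splits the data with a test on one feature
- **Leaf Node**: Terminal node that holds the final prediction (no further splits)
- **Branch**: Connection between a node and one of its children (one outcome of the test)
- **Depth**: Number of edges on the longest path from the root to a leaf
- **Splitting**: Dividing a node into child nodes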
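
To make these terms concrete, the sketch below hand-builds a tiny tree as nested dictionaries and then counts its leaves, measures its depth, and routes one sample from root to leaf. The dictionary layout, the helper names (`is_leaf`, `count_leaves`, `tree_depth`, `predict`), and the threshold values are all assumptions made for illustration; this is not how scikit-learn stores a tree internally.

```python
# A hand-built toy tree. Internal nodes hold a question (feature, threshold)
# and two branches; leaf nodes hold only a prediction.
# The structure and the threshold values are illustrative assumptions.
toy_tree = {                                   # root node
    "feature": "petal length", "threshold": 2.5,
    "left": {"prediction": "setosa"},          # leaf node
    "right": {                                 # internal node
        "feature": "petal width", "threshold": 1.7,
        "left": {"prediction": "versicolor"},  # leaf node
        "right": {"prediction": "virginica"},  # leaf node
    },
}

def is_leaf(node):
    """A node is a leaf when it carries a prediction instead of a question."""
    return "prediction" in node

def count_leaves(node):
    """Count the terminal nodes below (and including) this node."""
    if is_leaf(node):
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

def tree_depth(node):
    """Number of edges on the longest path from this node to a leaf."""
    if is_leaf(node):
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))

def predict(node, sample):
    """Follow one branch per question until a leaf is reached."""
    while not is_leaf(node):
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]

print("Leaves:", count_leaves(toy_tree))  # 3
print("Depth:", tree_depth(toy_tree))     # 2
print(predict(toy_tree, {"petal length": 5.1, "petal width": 1.5}))  # versicolor
```

The sample with petal length 5.1 fails the root test (5.1 > 2.5), moves down the right branch, passes the second test (1.5 <= 1.7), and lands in the `versicolor` leaf after two edges, which matches the depth of 2.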

In [None]:
# Simple decision tree visualization
iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target  # Use only 2 features for visualization
feature_names = ['Petal Length', 'Petal Width']
target_names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a simple tree
tree_simple = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_simple.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(16, 8))
plot_tree(tree_simple, feature_names=feature_names, class_names=target_names.tolist(),
          filled=True, rounded=True, fontsize=12)
plt.title('Decision Tree Structure (max_depth=2)', fontsize=16, fontweight='bold')
plt.show()

print(f"Tree depth: {tree_simple.get_depth()}")
print(f"Number of leaves: {tree_simple.get_n_leaves()}")

## 2. Splitting Criteria

### Gini Impurity
$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

### Entropy (Information Gain)
$$Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

$$Information\ Gain = Entropy_{parent} - \sum \frac{n_j}{n} Entropy_{child_j}$$

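Where $p_i$ is the proportion of samples in the node that belong to class $i$, $C$ is the number of classes, $n$ is the number of samples in the parent node, and $n_j$ is the number of samples in child node $j$.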
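
A small worked example may help connect the formulas to numbers. The sketch below takes a balanced parent node of 10 samples and a hypothetical split into two children of 5 samples each, then evaluates both criteria. The helper names (`gini`, `entropy`, `impurity_decrease`) and the label lists are made up for this illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def impurity_decrease(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical split: each child is 80% one class
left = [0, 0, 0, 0, 1]
right = [0, 1, 1, 1, 1]
parent = left + right  # 5 samples of class 0, 5 of class 1

print(f"Gini(parent)     = {gini(parent):.3f}")     # 0.500
print(f"Gini(left)       = {gini(left):.3f}")       # 0.320
print(f"Gini decrease    = {impurity_decrease(parent, [left, right], gini):.3f}")     # 0.180
print(f"Entropy(parent)  = {entropy(parent):.3f}")  # 1.000
print(f"Entropy(left)    = {entropy(left):.3f}")    # 0.722
print(f"Information gain = {impurity_decrease(parent, [left, right], entropy):.3f}")  # 0.278
```

By hand: $Gini_{left} = 1 - (0.8^2 + 0.2^2) = 0.32$, and because both children have the same size and the same impurity, the weighted child impurity is also 0.32, giving a decrease of $0.5 - 0.32 = 0.18$.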

In [None]:
# Visualize Gini vs Entropy
p = np.linspace(0.001, 0.999, 100)

# Binary classification
gini = 1 - (p**2 + (1-p)**2)
entropy = -(p * np.log2(p) + (1-p) * np.log2(1-p))

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Gini Impurity
axes[0].plot(p, gini, 'b-', linewidth=3, label='Gini')
axes[0].fill_between(p, gini, alpha=0.3)
axes[0].set_xlabel('Proportion of Class 1', fontsize=12)
axes[0].set_ylabel('Impurity', fontsize=12)
axes[0].set_title('Gini Impurity', fontsize=14)
axes[0].axvline(0.5, color='r', linestyle='--', label='Max impurity (p=0.5)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Entropy
axes[1].plot(p, entropy, 'g-', linewidth=3, label='Entropy')
axes[1].fill_between(p, entropy, alpha=0.3, color='green')
axes[1].set_xlabel('Proportion of Class 1', fontsize=12)
axes[1].set_ylabel('Entropy', fontsize=12)
axes[1].set_title('Entropy', fontsize=14)
axes[1].axvline(0.5, color='r', linestyle='--', label='Max entropy (p=0.5)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Comparison
axes[2].plot(p, gini, 'b-', linewidth=3, label='Gini')
axes[2].plot(p, entropy, 'g-', linewidth=3, label='Entropy')
axes[2].set_xlabel('Proportion of Class 1', fontsize=12)
axes[2].set_ylabel('Impurity/Entropy', fontsize=12)
axes[2].set_title('Gini vs Entropy Comparison', fontsize=14)
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Both measures are similar but:")
print("- Gini: Faster to compute (no logarithm)")
print("- Entropy: More theoretically grounded (information theory)")

## 3. Decision Tree Implementation from Scratch

In [None]:
class Node:
    """Decision Tree Node"""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # Feature index to split on
        self.threshold = threshold  # Threshold value for split
        self.left = left           # Left child
        self.right = right         # Right child
        self.value = value         # Class label (for leaf nodes)

class DecisionTreeScratch:
    """Decision Tree Classifier from scratch"""
    
    def __init__(self, max_depth=None, min_samples_split=2, criterion='gini'):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.criterion = criterion
        self.root = None
    
    def _gini(self, y):
        """Calculate Gini impurity"""
        m = len(y)
        if m == 0:
            return 0
        counter = Counter(y)
        impurity = 1.0
        for count in counter.values():
            p = count / m
            impurity -= p ** 2
        return impurity
    
    def _entropy(self, y):
        """Calculate entropy"""
        m = len(y)
        if m == 0:
            return 0
        counter = Counter(y)
        entropy = 0.0
        for count in counter.values():
            p = count / m
            if p > 0:
                entropy -= p * np.log2(p)
        return entropy
    
    def _impurity(self, y):
        """Calculate impurity based on criterion"""
        if self.criterion == 'gini':
            return self._gini(y)
        else:
            return self._entropy(y)
    
    def _information_gain(self, X_column, y, threshold):
        """Calculate information gain for a split"""
        # Parent impurity
        parent_impurity = self._impurity(y)
        
        # Split
        left_mask = X_column <= threshold
        right_mask = X_column > threshold
        n_left, n_right = left_mask.sum(), right_mask.sum()
        
        # A split that leaves one side empty separates nothing
        if n_left == 0 or n_right == 0:
            return 0
        
        # Weighted child impurity
        n = len(y)
        impurity_left = self._impurity(y[left_mask])
        impurity_right = self._impurity(y[right_mask])
        child_impurity = (n_left / n) * impurity_left + (n_right / n) * impurity_right
        
        return parent_impurity - child_impurity
    
    def _best_split(self, X, y):
        """Find the best split"""
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        n_features = X.shape[1]
        
        for feature_idx in range(n_features):
            X_column = X[:, feature_idx]
            thresholds = np.unique(X_column)
            
            for threshold in thresholds:
                gain = self._information_gain(X_column, y, threshold)
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold
        
        return best_feature, best_threshold, best_gain
    
    def _build_tree(self, X, y, depth=0):
        """Recursively build the tree"""
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        
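        # Stopping criteria: depth limit reached, node is pure, or too few samples.
        # Compare against None explicitly so that max_depth=0 is not read as "unlimited".
        depth_limit_reached = self.max_depth is not None and depth >= self.max_depth
        if depth_limit_reached or n_classes == 1 or n_samples < self.min_samples_split:
            leaf_value = Counter(y).most_common(1)[0][0]
            return Node(value=leaf_value)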
        
        # Find best split
        best_feature, best_threshold, best_gain = self._best_split(X, y)
        
        # No candidate split reduces impurity, so stop and create a leaf
        if best_gain <= 0:
            leaf_value = Counter(y).most_common(1)[0][0]
            return Node(value=leaf_value)
        
        # Split
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = X[:, best_feature] > best_threshold
        
        left = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return Node(best_feature, best_threshold, left, right)
    
    def fit(self, X, y):
        """Build decision tree"""
        self.root = self._build_tree(X, y)
        return self
    
    def _predict_single(self, x, node):
        """Predict single sample"""
        if node.value is not None:
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self._predict_single(x, node.left)
        else:
            return self._predict_single(x, node.right)
    
    def predict(self, X):
        """Predict multiple samples"""
        return np.array([self._predict_single(x, self.root) for x in X])

# Train and evaluate
tree_scratch = DecisionTreeScratch(max_depth=5, criterion='gini')
tree_scratch.fit(X_train, y_train)
y_pred_scratch = tree_scratch.predict(X_test)

print(f"Accuracy (from scratch): {accuracy_score(y_test, y_pred_scratch):.4f}")

## 4. Scikit-learn Decision Tree

In [None]:
# Load full iris dataset
iris_full = load_iris()
X_full, y_full = iris_full.data, iris_full.target

X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X_full, y_full, test_size=0.3, random_state=42
)

# Train tree
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train_f, y_train_f)

y_pred_f = tree_clf.predict(X_test_f)

print(f"Accuracy: {accuracy_score(y_test_f, y_pred_f):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_f, y_pred_f, target_names=iris_full.target_names))

In [None]:
# Visualize full tree
plt.figure(figsize=(20, 10))
plot_tree(tree_clf, feature_names=iris_full.feature_names, 
          class_names=iris_full.target_names.tolist(),
          filled=True, rounded=True, fontsize=10)
plt.title('Full Decision Tree (max_depth=3)', fontsize=16, fontweight='bold')
plt.show()

## 5. Decision Boundaries

In [None]:
# Visualize decision boundaries for different depths
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

depths = [1, 2, 3, 5, 10, None]

for idx, depth in enumerate(depths):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    
    # Create mesh
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    axes[idx].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    axes[idx].scatter(X_train[:, 0], X_train[:, 1], c=y_train, 
                     cmap='RdYlBu', edgecolors='black', s=50)
    
    accuracy = tree.score(X_test, y_test)
    depth_str = depth if depth else 'Unlimited'
    axes[idx].set_title(f'Depth = {depth_str}\nAcc = {accuracy:.3f}', 
                       fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(feature_names[0])
    axes[idx].set_ylabel(feature_names[1])

plt.tight_layout()
plt.show()

## 6. Overfitting and Regularization

### Regularization Parameters:
- `max_depth`: Maximum depth of tree
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in a leaf
- `max_features`: Maximum features to consider for split
- `max_leaf_nodes`: Maximum number of leaf nodes
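
To see how these parameters constrain a tree, the sketch below fits the same classifier on the iris data under a few settings and reports the resulting depth, leaf count, and accuracy. The specific values (`max_depth=2`, `min_samples_leaf=10`, `max_leaf_nodes=4`) and the `_demo` variable names are arbitrary choices for illustration, not recommended defaults.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Each entry is one set of constraints; the values are illustrative assumptions
configs = {
    "unconstrained": {},
    "max_depth=2": {"max_depth": 2},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "max_leaf_nodes=4": {"max_leaf_nodes": 4},
}

fitted = {}
for name, params in configs.items():
    clf = DecisionTreeClassifier(random_state=42, **params).fit(X_tr, y_tr)
    fitted[name] = clf
    print(f"{name:<20} depth={clf.get_depth()} leaves={clf.get_n_leaves()} "
          f"train={clf.score(X_tr, y_tr):.3f} test={clf.score(X_te, y_te):.3f}")
```

Each constraint limits growth in a different way: `max_depth` caps the number of consecutive questions, `min_samples_leaf` refuses splits that would create small leaves, and `max_leaf_nodes` caps the total number of leaves. Comparing the training and test columns shows how much accuracy each restriction costs on this dataset.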

In [None]:
# Demonstrate overfitting
train_scores = []
test_scores = []
depths = range(1, 21)

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train_f, y_train_f)
    
    train_scores.append(tree.score(X_train_f, y_train_f))
    test_scores.append(tree.score(X_test_f, y_test_f))

plt.figure(figsize=(12, 6))
plt.plot(depths, train_scores, 'b-o', linewidth=2, label='Training Accuracy')
plt.plot(depths, test_scores, 'r-s', linewidth=2, label='Test Accuracy')
plt.xlabel('Tree Depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Overfitting: Training vs Test Accuracy', fontsize=14)
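plt.grid(True, alpha=0.3)
# Mark the depth with the highest test accuracy. This is illustrative only:
# choosing a depth from test scores leaks test data, so use cross-validation
# to select it in practice.
best_depth = depths[int(np.argmax(test_scores))]
plt.axvline(x=best_depth, color='green', linestyle='--',
            label=f'Best test accuracy at depth = {best_depth}')
plt.legend()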
plt.show()

## 7. Feature Importance

In [None]:
# Feature importance
importances = tree_clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Create DataFrame
feature_importance_df = pd.DataFrame({
    'Feature': [iris_full.feature_names[i] for i in indices],
    'Importance': importances[indices]
})

print("Feature Importances:")
print(feature_importance_df.to_string(index=False))

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], 
        color='skyblue', edgecolor='black')
plt.xlabel('Importance', fontsize=12)
plt.title('Feature Importances', fontsize=14)
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 8. Decision Tree Regression

In [None]:
# Generate regression data
np.random.seed(42)
X_reg = np.sort(5 * np.random.rand(100, 1), axis=0)
y_reg = np.sin(X_reg).ravel() + np.random.randn(100) * 0.1

# Train regression trees with different depths
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

X_test_reg = np.linspace(0, 5, 500)[:, np.newaxis]

for idx, depth in enumerate([2, 5, 20]):
    tree_reg = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree_reg.fit(X_reg, y_reg)
    y_pred_reg = tree_reg.predict(X_test_reg)
    
    axes[idx].scatter(X_reg, y_reg, color='blue', s=50, label='Training data')
    axes[idx].plot(X_test_reg, y_pred_reg, 'r-', linewidth=2, label=f'Tree (depth={depth})')
    axes[idx].plot(X_test_reg, np.sin(X_test_reg), 'g--', linewidth=2, label='True function')
    axes[idx].set_xlabel('X', fontsize=12)
    axes[idx].set_ylabel('y', fontsize=12)
    axes[idx].set_title(f'Decision Tree Regression\n(max_depth={depth})', fontsize=14)
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Hyperparameter Tuning

In [None]:
# Grid search for best parameters
param_grid = {
    'max_depth': [3, 5, 7, 9, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), 
                          param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_f, y_train_f)

print("Best Parameters:", grid_search.best_params_)
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test_f, y_test_f):.4f}")

# Visualize grid search results
results_df = pd.DataFrame(grid_search.cv_results_)
results_df = results_df.sort_values('rank_test_score').head(10)

plt.figure(figsize=(12, 6))
plt.barh(range(len(results_df)), results_df['mean_test_score'], 
        xerr=results_df['std_test_score'], color='skyblue', edgecolor='black')
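# Label each bar with all four tuned parameters so that rows are distinguishable
plt.yticks(range(len(results_df)), 
          [f"{i+1}. {row['param_criterion']}, depth={row['param_max_depth']}, "
           f"split={row['param_min_samples_split']}, leaf={row['param_min_samples_leaf']}" 
           for i, (_, row) in enumerate(results_df.iterrows())])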
plt.xlabel('Mean CV Accuracy', fontsize=12)
plt.title('Top 10 Hyperparameter Combinations', fontsize=14)
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## Summary

### Key Takeaways

1. **Tree Structure**: Hierarchical model with internal nodes (decisions) and leaf nodes (predictions)
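2. **Splitting Criteria**: Gini impurity and entropy measure node impurity; a good split reduces it
3. **CART Algorithm**: Greedy, recursive binary splitting that picks the locally best split at each node
4. **Overfitting**: Trees easily overfit without regularization
5. **Interpretability**: Shallow trees are easy to visualize and understand
6. **Feature Importance**: Computed from the impurity reduction each feature contributes across its splits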

### Pros and Cons

**Pros:**
- Easy to understand and interpret
- Visual representation possible
- No feature scaling needed
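- Can handle both numerical and categorical features (scikit-learn's implementation requires categorical features to be encoded as numbers first)
- Non-parametric (no assumptions about the data distribution)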
- Captures non-linear relationships
- Feature importance automatically calculated

**Cons:**
- Prone to overfitting
- Unstable (small changes in data can change tree structure)
- Biased toward features with many distinct values, which offer more candidate splits
- Greedy algorithm (may not find global optimum)
- Can create overly complex trees

### When to Use Decision Trees

**Best for:**
- Need interpretable model
- Mixed feature types
- Non-linear relationships
- Feature importance needed
- Building ensemble methods (Random Forest, Gradient Boosting)

**Avoid when:**
- Need stable predictions
- Small datasets (prone to overfitting)
- Linear relationships (simpler models better)

### Practice Problems

1. Implement cost-complexity pruning
2. Compare Gini vs Entropy on different datasets
3. Build a regression tree from scratch
4. Analyze feature importance on high-dimensional data