# Module 06: LightGBM

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 90 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Module 04: Gradient Boosting Machines
- Module 05: XGBoost

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand LightGBM's unique algorithmic innovations (GOSS, EFB)
2. Explain histogram-based learning and leaf-wise tree growth
3. Leverage LightGBM's native categorical feature handling
4. Optimize critical LightGBM hyperparameters for speed and accuracy
5. Benchmark LightGBM speed against XGBoost
6. Use both sklearn API and native API effectively
7. Demonstrate memory efficiency on large datasets
8. Apply SHAP for model interpretation

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import time
import warnings
warnings.filterwarnings('ignore')

# Machine learning
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve, log_loss
)

# XGBoost for comparison
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
except ImportError:
    XGB_AVAILABLE = False
    print("XGBoost not available for comparison")

# LightGBM
try:
    import lightgbm as lgb
    print(f"LightGBM version: {lgb.__version__}")
except ImportError:
    print("Installing LightGBM...")
    !pip install lightgbm -q
    import lightgbm as lgb
    print(f"LightGBM version: {lgb.__version__}")

# SHAP for interpretability
try:
    import shap
    SHAP_AVAILABLE = True
    print(f"SHAP version: {shap.__version__}")
except ImportError:
    print("SHAP not available. Install with: pip install shap")
    SHAP_AVAILABLE = False

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("\nSetup complete.")

## 1. What Makes LightGBM Special?

### LightGBM = Light Gradient Boosting Machine

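LightGBM was developed at Microsoft and described in a 2017 paper (Ke et al.). It is designed for:
- **Speed**: Often trains faster than XGBoost, particularly on large datasets
- **Memory efficiency**: Stores binned feature values, so it typically uses less RAM
- **Large datasets**: Handles millions of samples
- **Accuracy**: Usually competitive with other boosting libraries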

### Key Innovations

#### 1.1 Histogram-Based Learning

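**Traditional (pre-sorted) gradient boosting**:
- Considers every distinct feature value as a candidate split point
- Slow for continuous features with many distinct values

**LightGBM's histogram approach**:
- Bins continuous values into discrete bins (`max_bin`, default: 255)
- Only considers bin boundaries as split points
- Much faster, at the cost of a slight loss in split precision

**Benefits**:
- Reduced computation: building the histograms still takes one pass over the data, O(data × features), but evaluating candidate splits costs O(bins × features) instead of O(data × features)
- Reduced memory: stores small integer bin indices rather than raw float values
- Better cache performance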
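
The histogram idea can be sketched in plain Python for a single feature. This is a minimal illustration, not LightGBM's implementation: the function name `best_histogram_split`, the equal-width binning, and the gain formula with `reg_lambda` are assumptions chosen to keep the example short (LightGBM chooses its bin boundaries differently).

```python
import random


def best_histogram_split(values, grads, hess, n_bins=8, reg_lambda=1.0):
    """Scan the bin boundaries of one feature; return (threshold, gain)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    # 1) Build the histogram in one pass over the data: O(n_samples).
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h
    g_total, h_total = sum(g_hist), sum(h_hist)

    def score(g, h):
        return g * g / (h + reg_lambda)

    # 2) Evaluate only the n_bins - 1 boundaries: O(n_bins), not O(n_samples).
    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


rng = random.Random(0)
xs = [rng.random() for _ in range(1000)]
gs = [-1.0 if x < 0.5 else 1.0 for x in xs]  # gradient flips sign at x = 0.5
hs = [1.0] * len(xs)
threshold, gain = best_histogram_split(xs, gs, hs)
print(f"best threshold ~ {threshold:.3f}, gain = {gain:.1f}")
```

The chosen threshold can only land on a bin boundary, which is the "slight accuracy trade-off": with more bins the boundaries get closer to the true change point at 0.5.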

#### 1.2 Leaf-wise (Best-first) Tree Growth

**Level-wise** (XGBoost's default, sklearn's `GradientBoostingClassifier`):
```
        Root
       /    \
     A        B      ← Split all nodes at level 1
    / \      / \
   C   D    E   F    ← Split all nodes at level 2
```
- Balanced trees
- May split nodes that don't benefit much

**Leaf-wise** (LightGBM):
```
        Root
       /    \
     A        B
    / \
   C   D           ← Only split node A (best gain)
  / \
 E   F             ← Only split node C (best gain)
```
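- Unbalanced trees
- Always splits the leaf with the largest gain
- **Faster convergence**: for the same number of leaves, usually reaches a lower training loss than level-wise growth
- **Risk**: Can overfit easily, especially on small datasets (constrain with `num_leaves`, `max_depth`, `min_data_in_leaf`)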
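
Best-first growth can be sketched with a priority queue. The example below is a toy model under stated assumptions: `grow_leaf_wise` and the `toy_gains` table are hypothetical, nodes are named by their path from the root (`"L"` = left child, `"LR"` = right child of the left child), and the gain of every possible split is given up front rather than computed from data.

```python
import heapq


def grow_leaf_wise(split_gain, num_leaves):
    """Best-first growth: repeatedly split the current leaf with the largest gain."""
    # heapq is a min-heap, so gains are negated; the root is the empty path "".
    heap = [(-split_gain.get("", 0.0), "")]
    finished = []   # leaves whose best split has no positive gain
    n_leaves = 1
    while heap and n_leaves < num_leaves:
        neg_gain, node = heapq.heappop(heap)
        if -neg_gain <= 0.0:
            finished.append(node)
            continue
        for child in (node + "L", node + "R"):
            heapq.heappush(heap, (-split_gain.get(child, 0.0), child))
        n_leaves += 1  # one leaf replaced by two: net gain of one leaf
    finished.extend(node for _, node in heap)
    return sorted(finished)


# Gains that mirror the diagram above: Root, then A ("L"), then C ("LL").
toy_gains = {"": 10.0, "L": 8.0, "R": 1.0, "LL": 6.0}
leaves = grow_leaf_wise(toy_gains, num_leaves=4)
print(leaves)                              # leaf paths
print(max(len(path) for path in leaves))   # depth of the deepest leaf
```

With a budget of four leaves, a level-wise tree would stop at depth 2, while this best-first tree reaches depth 3 along the high-gain path and never splits the low-gain node `"R"`.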

#### 1.3 GOSS (Gradient-based One-Side Sampling)

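**Observation**: Samples with small gradients are already well fitted, while samples with large gradients contribute most to the information gain.

**GOSS algorithm**:
1. Sort samples by absolute gradient
2. Keep all of the top `a` fraction with the largest gradients (e.g., `a = 0.2`)
3. Randomly sample a fraction `b` of the data from the remaining samples (e.g., `b = 0.1`)
4. Multiply the gradients of the sampled small-gradient rows by `(1 - a) / b` so the overall gradient distribution is approximately preserved

**Result**: Each tree is built from fewer rows, which speeds up training, usually with little loss in accuracy. GOSS is optional and not enabled by default.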
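
The sampling steps can be written out in a few lines of plain Python. This is a sketch of the idea only: `goss_sample` is a hypothetical helper, and it returns per-row weights instead of modifying gradients in place. The argument names `top_rate` and `other_rate` follow LightGBM's parameter names for the two fractions.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by Gradient-based One-Side Sampling."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)
    # Up-weight the sampled small-gradient rows so that together they stand in
    # for the whole small-gradient group.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


rng = random.Random(42)
grads = [rng.gauss(0.0, 1.0) for _ in range(1000)]
idx, w = goss_sample(grads)
print(len(idx), sum(w))  # rows actually used vs. rows they represent
```

Only 30% of the rows are used, yet the weights sum to the original row count: 200 rows at weight 1 plus 100 rows at weight (1 - 0.2) / 0.1 = 8.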

#### 1.4 EFB (Exclusive Feature Bundling)

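**Observation**: In sparse data, many features are (almost) mutually exclusive: they are rarely non-zero on the same row.

**EFB algorithm**:
- Bundles mutually exclusive features into a single feature
- Offsets each feature's value range so the original values remain distinguishable inside the bundle
- Reduces the effective number of features, which matters most for high-dimensional sparse data

**Example**: The columns produced by one-hot encoding a categorical variable are mutually exclusive, so they can be bundled into one feature.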
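
A minimal sketch of the bundling step for two features follows. It assumes strict exclusivity and small positive integer values; the helper name `bundle_exclusive` and the offset rule are illustrative assumptions (the published algorithm works on bin indices and tolerates a small rate of conflicts).

```python
def bundle_exclusive(feature_a, feature_b, offset):
    """Merge two features that are never non-zero on the same row."""
    bundled = []
    for a, b in zip(feature_a, feature_b):
        if a != 0 and b != 0:
            raise ValueError("features conflict: both non-zero on one row")
        # Shift feature_b past feature_a's range so the two stay distinguishable.
        bundled.append(a if a != 0 else (b + offset if b != 0 else 0))
    return bundled


feat_a = [1, 0, 2, 0, 0]
feat_b = [0, 3, 0, 0, 1]
bundle = bundle_exclusive(feat_a, feat_b, offset=max(feat_a))
print(bundle)

# Values 1..2 came from feat_a, values above the offset came from feat_b.
recovered_b = [v - max(feat_a) if v > max(feat_a) else 0 for v in bundle]
print(recovered_b == feat_b)
```

Two columns become one without losing information, so histogram construction has one feature fewer to scan.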

#### 1.5 Native Categorical Support

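LightGBM can split directly on categorical features:
- No need to one-hot encode
- No need for target encoding
- Direct categorical split finding (columns must be integer-coded or use the pandas `category` dtype)
- More efficient, often better accuracy

Recent XGBoost releases also offer categorical support via `enable_categorical`.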

### Comparison: LightGBM vs XGBoost

| Feature | XGBoost | LightGBM |
|---------|---------|----------|
| Tree growth | Level-wise | Leaf-wise |
| Split finding | Exact/Approx/Histogram | Histogram |
| Speed | Fast | Often faster, especially on large data |
| Memory | Moderate | Lower |
| Categorical features | Manual encoding (native support in recent versions) | Native support |
| Large datasets | Good | Excellent |
| Overfitting risk | Moderate | Higher (leaf-wise) |
| Default params | Generally good | Need tuning |

In [None]:
# Load dataset with many features
# Create synthetic dataset to demonstrate speed
print("Creating large synthetic dataset...")
X, y = make_classification(
    n_samples=50000,
    n_features=100,
    n_informative=60,
    n_redundant=20,
    n_clusters_per_class=3,
    class_sep=0.8,
    random_state=RANDOM_STATE
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print(f"\nDataset created:")
print(f"Training samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
print(f"Features: {X.shape[1]}")
print(f"Classes: {np.unique(y)}")
print(f"Class distribution: {np.bincount(y)}")

## 2. Speed Benchmark: LightGBM vs XGBoost

Let's test the claim that LightGBM is faster!

In [None]:
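# Shared parameters. Note: with max_depth=7 XGBoost can grow up to 2**7 = 128
# leaves per tree, while LightGBM is capped at num_leaves=31 below, so the two
# models do not have identical capacity.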
n_estimators = 100
max_depth = 7
learning_rate = 0.1

results = []

# Train LightGBM
print("Training LightGBM...")
start = time.time()
lgb_model = lgb.LGBMClassifier(
    n_estimators=n_estimators,
    max_depth=max_depth,
    learning_rate=learning_rate,
    num_leaves=31,  # LightGBM-specific
    random_state=RANDOM_STATE,
    verbose=-1
)
lgb_model.fit(X_train, y_train)
lgb_train_time = time.time() - start
lgb_acc = lgb_model.score(X_test, y_test)
results.append({'Model': 'LightGBM', 'Train Time': lgb_train_time, 'Accuracy': lgb_acc})

# Train XGBoost (if available)
if XGB_AVAILABLE:
    print("Training XGBoost...")
    start = time.time()
    xgb_model = xgb.XGBClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
        random_state=RANDOM_STATE,
        eval_metric='logloss',
        verbosity=0
    )
    xgb_model.fit(X_train, y_train)
    xgb_train_time = time.time() - start
    xgb_acc = xgb_model.score(X_test, y_test)
    results.append({'Model': 'XGBoost', 'Train Time': xgb_train_time, 'Accuracy': xgb_acc})

# Display results
df_results = pd.DataFrame(results)
print("\n" + "=" * 70)
print("Speed Benchmark Results")
print("=" * 70)
print(df_results.to_string(index=False))

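if XGB_AVAILABLE:
    speedup = xgb_train_time / lgb_train_time
    faster = 'LightGBM' if speedup >= 1 else 'XGBoost'
    print(f"\n{faster} trained {max(speedup, 1 / speedup):.2f}x faster on this run")
    print(f"Accuracy difference (LightGBM - XGBoost): {(lgb_acc - xgb_acc) * 100:+.2f} percentage points")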

In [None]:
# Visualize comparison
if XGB_AVAILABLE:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Training time
    axes[0].bar(df_results['Model'], df_results['Train Time'], 
                color=['#2ecc71', '#e74c3c'], edgecolor='black', linewidth=1.5)
    axes[0].set_ylabel('Training Time (seconds)', fontsize=12)
    axes[0].set_title('Training Speed Comparison', fontsize=13, fontweight='bold')
    axes[0].grid(axis='y', alpha=0.3)
    
    # Accuracy
    axes[1].bar(df_results['Model'], df_results['Accuracy'], 
                color=['#2ecc71', '#e74c3c'], edgecolor='black', linewidth=1.5)
    axes[1].set_ylabel('Accuracy', fontsize=12)
    axes[1].set_title('Accuracy Comparison', fontsize=13, fontweight='bold')
    axes[1].set_ylim([0.8, 1.0])
    axes[1].grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()

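print("\nKey Observation:")
print("Compare the numbers above: on data of this size LightGBM typically trains faster at similar accuracy,")
print("but a single timed run is noisy, so repeat the benchmark before drawing conclusions.")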

## 3. LightGBM Hyperparameters

### 3.1 Core Parameters

**`num_leaves`** (default=31): Maximum number of leaves per tree
- Most important parameter!
- Controls model complexity
- Keep it below 2^max_depth: a tree of depth `max_depth` holds at most 2^max_depth leaves, and values near that limit tend to overfit
- Typical: 20-100

**`max_depth`** (default=-1): Maximum tree depth
- -1 means no limit
- Use to prevent overfitting
- Typical: 5-15

**`learning_rate`** (default=0.1): Shrinkage rate
- Lower = need more trees
- Typical: 0.01-0.3

**`n_estimators`** (default=100): Number of boosting rounds
- Use early stopping
- Typical: 100-1000
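
The early-stopping rule mentioned for `n_estimators` can be sketched independently of any library. The function below is a hypothetical helper that replays a list of validation losses; rounds are 0-indexed here, and `stopping_rounds` plays the role of the patience setting.

```python
def best_round_with_patience(val_losses, stopping_rounds):
    """Return (best_round, rounds_run) for a sequence of validation losses."""
    best_loss, best_round = float("inf"), -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= stopping_rounds:
            # No improvement for `stopping_rounds` rounds in a row: stop here.
            return best_round, round_idx + 1
    return best_round, len(val_losses)


losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.43]
best_round, rounds_run = best_round_with_patience(losses, stopping_rounds=3)
print(best_round, rounds_run)
```

Training stops after seven rounds and keeps round 3, so the later value 0.43 is never seen. That is the trade-off behind the patience setting: a small value saves time but can stop before a late improvement.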

### 3.2 Data Sampling

**`bagging_fraction`** (subsample, default=1.0): Fraction of data for each tree
- < 1.0 enables bagging, but only when `bagging_freq` > 0
- Typical: 0.7-1.0

**`bagging_freq`** (subsample_freq, default=0): Bagging frequency
- 0 = disabled
- k = re-sample the data every k iterations

**`feature_fraction`** (colsample_bytree, default=1.0): Fraction of features per tree
- Typical: 0.6-1.0

### 3.3 Regularization

**`min_data_in_leaf`** (min_child_samples, default=20): Minimum samples per leaf
- Higher = more conservative
- Critical for preventing overfitting
- Typical: 10-100

**`lambda_l1`** (reg_alpha, default=0.0): L1 regularization

**`lambda_l2`** (reg_lambda, default=0.0): L2 regularization

**`min_gain_to_split`** (min_split_gain, default=0.0): Minimum gain to perform split
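
Because the native and scikit-learn APIs use different names for the same settings, a small lookup table helps when moving parameters between them. The helpers `to_sklearn_params` and `bagging_is_active` below are hypothetical conveniences, not part of LightGBM; the name pairs themselves are the aliases listed in this section.

```python
# Native parameter name -> scikit-learn wrapper name.
NATIVE_TO_SKLEARN = {
    "bagging_fraction": "subsample",
    "bagging_freq": "subsample_freq",
    "feature_fraction": "colsample_bytree",
    "min_data_in_leaf": "min_child_samples",
    "min_gain_to_split": "min_split_gain",
    "lambda_l1": "reg_alpha",
    "lambda_l2": "reg_lambda",
}


def to_sklearn_params(native_params):
    """Rename native keys to their sklearn-wrapper equivalents."""
    return {NATIVE_TO_SKLEARN.get(k, k): v for k, v in native_params.items()}


def bagging_is_active(native_params):
    """Bagging only runs when the fraction is below 1 AND the frequency is positive."""
    return (native_params.get("bagging_fraction", 1.0) < 1.0
            and native_params.get("bagging_freq", 0) > 0)


native = {"num_leaves": 31, "bagging_fraction": 0.8, "bagging_freq": 5,
          "lambda_l1": 0.1, "min_data_in_leaf": 50}
sk_params = to_sklearn_params(native)
print(sk_params)
print(bagging_is_active(native), bagging_is_active({"bagging_fraction": 0.8}))
```

The second check returns `False` because a fraction below 1.0 has no effect while `bagging_freq` stays at its default of 0.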

### 3.4 Speed vs Accuracy

**`max_bin`** (default=255): Number of histogram bins
- Lower = faster but less accurate
- Higher = slower but more accurate
- Typical: 63, 127, 255, 511

In [None]:
# Effect of num_leaves
num_leaves_values = [7, 15, 31, 63, 127, 255]
train_accs = []
test_accs = []
train_times = []

for num_leaves in num_leaves_values:
    start = time.time()
    model = lgb.LGBMClassifier(
        n_estimators=100,
        num_leaves=num_leaves,
        learning_rate=0.1,
        random_state=RANDOM_STATE,
        verbose=-1
    )
    model.fit(X_train, y_train)
    train_times.append(time.time() - start)
    train_accs.append(model.score(X_train, y_train))
    test_accs.append(model.score(X_test, y_test))

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy
axes[0].plot(num_leaves_values, train_accs, marker='o', linewidth=2, 
             markersize=8, label='Train Accuracy')
axes[0].plot(num_leaves_values, test_accs, marker='s', linewidth=2, 
             markersize=8, label='Test Accuracy')
axes[0].axvline(31, color='red', linestyle='--', alpha=0.5, label='Default (31)')
axes[0].set_xlabel('num_leaves', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Effect of num_leaves on Accuracy', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
axes[0].set_xscale('log', base=2)

# Training time
axes[1].bar(range(len(num_leaves_values)), train_times, color='steelblue', 
            edgecolor='black', linewidth=1.5)
axes[1].set_xticks(range(len(num_leaves_values)))
axes[1].set_xticklabels(num_leaves_values)
axes[1].set_xlabel('num_leaves', fontsize=12)
axes[1].set_ylabel('Training Time (seconds)', fontsize=12)
axes[1].set_title('Effect of num_leaves on Speed', fontsize=13, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

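# Note: picking the best value on the test set is for illustration only;
# use a validation set or cross-validation when tuning for real.
best_idx = np.argmax(test_accs)
print(f"\nBest num_leaves on this test set: {num_leaves_values[best_idx]}")
print(f"Test accuracy: {test_accs[best_idx]:.4f}")
print(f"Training time: {train_times[best_idx]:.3f} seconds")
print("\nKey insight: Higher num_leaves = more complex model, risk of overfitting")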

In [None]:
# Effect of min_data_in_leaf (regularization)
min_data_values = [5, 10, 20, 50, 100, 200]
train_accs = []
test_accs = []

for min_data in min_data_values:
    model = lgb.LGBMClassifier(
        n_estimators=100,
        num_leaves=63,
        min_child_samples=min_data,  # sklearn-API name for min_data_in_leaf
        learning_rate=0.1,
        random_state=RANDOM_STATE,
        verbose=-1
    )
    model.fit(X_train, y_train)
    train_accs.append(model.score(X_train, y_train))
    test_accs.append(model.score(X_test, y_test))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(min_data_values, train_accs, marker='o', linewidth=2, 
         markersize=8, label='Train Accuracy')
plt.plot(min_data_values, test_accs, marker='s', linewidth=2, 
         markersize=8, label='Test Accuracy')
plt.axvline(20, color='red', linestyle='--', alpha=0.5, label='Default (20)')
plt.xlabel('min_data_in_leaf', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Effect of min_data_in_leaf (Regularization)', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nObservation:")
print("Higher min_data_in_leaf = stronger regularization = less overfitting")
print("Find sweet spot where test accuracy is maximized!")

## 4. Native Categorical Feature Support

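LightGBM's killer feature: it can split on categorical variables directly, with no one-hot or target encoding.

### How it works:
1. Mark columns as categorical: use the pandas `category` dtype, or pass integer-coded columns via the `categorical_feature` parameter (raw string columns are not accepted)
2. LightGBM orders the categories by their gradient statistics and searches for the best partition into two groups
3. One split can separate several categories at once, instead of one category per split as with one-hot encoding
4. Often better than one-hot encoding; very high-cardinality features can still overfit, so validate

### Benefits:
- No manual encoding needed
- Better memory efficiency (no one-hot explosion)
- Often better accuracy (a single split can group several categories)
- Categories not seen during training are generally treated like missing values rather than raising an error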
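
The split search for categories described under "How it works" can be sketched as follows. This is a simplified illustration rather than LightGBM's code: `best_categorical_split`, the `smooth` term, and the gain formula are assumptions made for the example. The idea is to order the categories by gradient sum over hessian sum, then scan that ordering as if it were a binned numeric feature.

```python
from collections import defaultdict


def best_categorical_split(categories, grads, hess, reg_lambda=1.0, smooth=1e-9):
    """Return (left_categories, gain) for one categorical feature."""
    g_sum, h_sum = defaultdict(float), defaultdict(float)
    for c, g, h in zip(categories, grads, hess):
        g_sum[c] += g
        h_sum[c] += h
    # Order categories by their gradient statistics, then scan the k - 1 cut
    # points of that ordering, exactly like bins of a numeric feature.
    ordered = sorted(g_sum, key=lambda c: g_sum[c] / (h_sum[c] + smooth))
    g_total, h_total = sum(g_sum.values()), sum(h_sum.values())

    def score(g, h):
        return g * g / (h + reg_lambda)

    best_gain, best_left = 0.0, set()
    g_left = h_left = 0.0
    for i, c in enumerate(ordered[:-1]):
        g_left += g_sum[c]
        h_left += h_sum[c]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[: i + 1])
    return best_left, best_gain


cats = ["A", "B", "C", "D"] * 25
grads = [-1.0, 1.0, -1.0, 1.0] * 25   # A and C pull one way, B and D the other
hess = [1.0] * 100
left, gain = best_categorical_split(cats, grads, hess)
print(sorted(left), round(gain, 2))
```

The best split sends `{A, C}` left and `{B, D}` right in a single step. With one-hot columns, each split can isolate only one category, so the same partition would need more than one split.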

In [None]:
# Create dataset with categorical features
print("Creating dataset with categorical features...")

# Start with numeric features
n_samples = 10000
X_numeric = np.random.randn(n_samples, 5)

# Add categorical features
categories_a = np.random.choice(['A', 'B', 'C', 'D'], size=n_samples)
categories_b = np.random.choice(['Low', 'Medium', 'High'], size=n_samples)
categories_c = np.random.choice([f'Cat_{i}' for i in range(20)], size=n_samples)

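# Create DataFrame
# LightGBM needs the pandas 'category' dtype for categorical columns;
# plain object/string columns raise an error at fit time.
df = pd.DataFrame(X_numeric, columns=[f'num_{i}' for i in range(5)])
df['cat_a'] = pd.Categorical(categories_a)
df['cat_b'] = pd.Categorical(categories_b)
df['cat_c'] = pd.Categorical(categories_c)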

# Create target based on features (including categorical influence)
target_numeric = X_numeric.sum(axis=1)
target_cat_a = (categories_a == 'A').astype(int) * 2
target_cat_b = (categories_b == 'High').astype(int) * 1.5
y_continuous = target_numeric + target_cat_a + target_cat_b + np.random.randn(n_samples) * 0.5
y_cat = (y_continuous > np.median(y_continuous)).astype(int)  # ndarray has no .median() method

print(f"\nDataset created:")
print(f"Samples: {len(df)}")
print(f"Numeric features: 5")
print(f"Categorical features: 3")
print(f"\nSample data:")
print(df.head())
print(f"\nCategorical column dtypes:")
print(df[['cat_a', 'cat_b', 'cat_c']].dtypes)

In [None]:
# Split data
X_cat = df.copy()
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(
    X_cat, y_cat, test_size=0.2, random_state=RANDOM_STATE
)

# Method 1: LightGBM with categorical features (native)
print("Method 1: Native categorical support...")
categorical_features = ['cat_a', 'cat_b', 'cat_c']

lgb_cat = lgb.LGBMClassifier(
    n_estimators=100,
    num_leaves=31,
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    verbose=-1
)

start = time.time()
lgb_cat.fit(
    X_train_cat, y_train_cat,
    categorical_feature=categorical_features
)
native_time = time.time() - start
native_acc = lgb_cat.score(X_test_cat, y_test_cat)

print(f"  Training time: {native_time:.3f}s")
print(f"  Test accuracy: {native_acc:.4f}")

In [None]:
# Method 2: One-hot encoding (traditional)
print("\nMethod 2: One-hot encoding...")
X_train_ohe = pd.get_dummies(X_train_cat, columns=categorical_features)
X_test_ohe = pd.get_dummies(X_test_cat, columns=categorical_features)

# Align columns (in case test has different categories)
X_train_ohe, X_test_ohe = X_train_ohe.align(X_test_ohe, join='left', axis=1, fill_value=0)

lgb_ohe = lgb.LGBMClassifier(
    n_estimators=100,
    num_leaves=31,
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    verbose=-1
)

start = time.time()
lgb_ohe.fit(X_train_ohe, y_train_cat)
ohe_time = time.time() - start
ohe_acc = lgb_ohe.score(X_test_ohe, y_test_cat)

print(f"  Features after encoding: {X_train_ohe.shape[1]}")
print(f"  Training time: {ohe_time:.3f}s")
print(f"  Test accuracy: {ohe_acc:.4f}")

In [None]:
# Comparison
comparison_df = pd.DataFrame([
    {
        'Method': 'Native Categorical',
        'Features': X_train_cat.shape[1],
        'Training Time': native_time,
        'Accuracy': native_acc
    },
    {
        'Method': 'One-Hot Encoding',
        'Features': X_train_ohe.shape[1],
        'Training Time': ohe_time,
        'Accuracy': ohe_acc
    }
])

print("\n" + "=" * 70)
print("Categorical Feature Handling Comparison")
print("=" * 70)
print(comparison_df.to_string(index=False))

print("\nNative categorical support vs one-hot encoding (this run):")
print("   - Simpler (no encoding code)")
print(f"   - Training time ratio (one-hot / native): {ohe_time / native_time:.2f}x")
print(f"   - Feature count: {X_train_cat.shape[1]} native vs {X_train_ohe.shape[1]} one-hot "
      f"({X_train_ohe.shape[1] / X_train_cat.shape[1]:.1f}x fewer columns)")
print(f"   - Accuracy difference (native - one-hot): {(native_acc - ohe_acc) * 100:+.2f} percentage points")

## 5. LightGBM Native API

For finer control (callbacks, custom objectives, reusable binned `Dataset` objects), use LightGBM's native API.

In [None]:
# Use breast cancer dataset for native API demo
cancer_data = load_breast_cancer()
X_cancer, y_cancer = cancer_data.data, cancer_data.target
feature_names_cancer = cancer_data.feature_names

X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=RANDOM_STATE
)

# Further split for validation
X_train_bc, X_val_bc, y_train_bc, y_val_bc = train_test_split(
    X_train_bc, y_train_bc, test_size=0.2, random_state=RANDOM_STATE
)

print(f"Training samples: {len(X_train_bc)}")
print(f"Validation samples: {len(X_val_bc)}")
print(f"Test samples: {len(X_test_bc)}")

In [None]:
# Create LightGBM Datasets
train_data = lgb.Dataset(
    X_train_bc, 
    label=y_train_bc,
    feature_name=list(feature_names_cancer)
)

val_data = lgb.Dataset(
    X_val_bc,
    label=y_val_bc,
    reference=train_data,  # Use same binning as training
    feature_name=list(feature_names_cancer)
)

print("LightGBM Datasets created!")
print(f"\nDataset is optimized for LightGBM training.")

In [None]:
# Set parameters
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1,
    'seed': RANDOM_STATE
}

# Train with early stopping
print("Training with native API and early stopping...")
evals_result = {}
lgb_native = lgb.train(
    params,
    train_data,
    num_boost_round=500,
    valid_sets=[train_data, val_data],
    valid_names=['train', 'valid'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=20),
        lgb.log_evaluation(period=0),
        lgb.record_evaluation(evals_result)
    ]
)

print(f"\nBest iteration: {lgb_native.best_iteration}")
print(f"Best validation score: {lgb_native.best_score['valid']['binary_logloss']:.4f}")

In [None]:
# Visualize training progress
train_loss = evals_result['train']['binary_logloss']
val_loss = evals_result['valid']['binary_logloss']

plt.figure(figsize=(12, 6))
plt.plot(train_loss, label='Train', linewidth=2)
plt.plot(val_loss, label='Validation', linewidth=2)
plt.axvline(lgb_native.best_iteration, color='red', linestyle='--', 
            linewidth=2, label=f'Best iteration ({lgb_native.best_iteration})')
plt.xlabel('Boosting Round', fontsize=12)
plt.ylabel('Binary Log Loss', fontsize=12)
plt.title('LightGBM Training Progress', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Test set evaluation
y_pred = lgb_native.predict(X_test_bc)
y_pred_binary = (y_pred > 0.5).astype(int)
test_acc = accuracy_score(y_test_bc, y_pred_binary)
test_auc = roc_auc_score(y_test_bc, y_pred)

print(f"\nTest Set Performance:")
print(f"Accuracy: {test_acc:.4f}")
print(f"AUC-ROC: {test_auc:.4f}")

## 6. Memory Efficiency Demonstration

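LightGBM is designed for large datasets. The cells below time training on progressively larger datasets and report the size of the raw input array for reference; they do not measure LightGBM's internal memory use.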
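
A back-of-the-envelope estimate shows where the memory saving from binning comes from. The helper `raw_vs_binned_bytes` is hypothetical and deliberately rough: it assumes a dense `float64` input and one byte per binned value when `max_bin` fits in 8 bits, and it ignores everything else LightGBM allocates (gradients, histograms, feature bundling, sparse storage).

```python
def raw_vs_binned_bytes(n_samples, n_features, max_bin=255):
    """Estimate bytes for a dense float64 matrix vs. its binned form."""
    raw = n_samples * n_features * 8   # float64 = 8 bytes per value
    # Assumption: the smallest unsigned integer type that can index max_bin bins.
    bytes_per_bin = 1 if max_bin <= 256 else 2 if max_bin <= 65536 else 4
    binned = n_samples * n_features * bytes_per_bin
    return raw, binned


raw, binned = raw_vs_binned_bytes(n_samples=200_000, n_features=50)
print(f"raw:    {raw / 1024 ** 2:.1f} MB")
print(f"binned: {binned / 1024 ** 2:.1f} MB")
print(f"ratio:  {raw / binned:.0f}x")
```

Under these assumptions the binned copy is about one eighth of the raw array, which is also why lowering `max_bin` below 256 does not shrink the binned data further in this simple model.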

In [None]:
# Create progressively larger datasets and measure memory usage
import sys

dataset_sizes = [10000, 50000, 100000, 200000]
memory_usage = []
training_times = []

for size in dataset_sizes:
    print(f"\nTesting with {size:,} samples...")
    
    # Create dataset
    X_large, y_large = make_classification(
        n_samples=size,
        n_features=50,
        n_informative=30,
        random_state=RANDOM_STATE
    )
    
    # Create LightGBM Dataset (efficient representation)
    lgb_dataset = lgb.Dataset(X_large, label=y_large)
    
    # Size of the raw float64 input array (not LightGBM's internal binned copy)
    data_memory = X_large.nbytes / (1024 ** 2)  # MB
    memory_usage.append(data_memory)
    
    # Train and time
    start = time.time()
    params = {'objective': 'binary', 'metric': 'binary_logloss', 'verbose': -1}
    lgb.train(params, lgb_dataset, num_boost_round=50)
    training_times.append(time.time() - start)
    
    print(f"  Raw input size: ~{data_memory:.1f} MB")
    print(f"  Training time: {training_times[-1]:.2f}s")

In [None]:
# Visualize scalability
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training time vs dataset size
axes[0].plot(dataset_sizes, training_times, marker='o', linewidth=2, markersize=10, color='#e74c3c')
axes[0].set_xlabel('Number of Samples', fontsize=12)
axes[0].set_ylabel('Training Time (seconds)', fontsize=12)
axes[0].set_title('LightGBM Scalability', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Memory usage
axes[1].plot(dataset_sizes, memory_usage, marker='s', linewidth=2, markersize=10, color='#3498db')
axes[1].set_xlabel('Number of Samples', fontsize=12)
axes[1].set_ylabel('Raw Input Size (MB)', fontsize=12)
axes[1].set_title('Raw Input Data Size', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nCheck the left plot: training time should grow roughly linearly with the number of samples.")
print("The right plot shows the raw input size, not LightGBM's internal memory use.")

## 7. Feature Importance and SHAP

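LightGBM reports two built-in importance types, comparable to XGBoost's: `split` (how many times a feature is used in a split) and `gain` (the total gain of those splits).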
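
The difference between counting splits and summing gain is easy to see on a hand-made list of splits. The function `importance_from_splits` and the example records are hypothetical; each record stands for one internal tree node as a `(feature_name, gain)` pair.

```python
from collections import Counter


def importance_from_splits(splits):
    """Aggregate (feature_name, gain) records into split-count and total-gain maps."""
    split_count, total_gain = Counter(), Counter()
    for feature, gain in splits:
        split_count[feature] += 1
        total_gain[feature] += gain
    return dict(split_count), dict(total_gain)


toy_splits = [
    ("radius", 50.0),   # one early, very informative split
    ("texture", 2.0),
    ("texture", 1.5),
    ("texture", 1.0),   # many late, low-value splits
    ("radius", 30.0),
]
by_split, by_gain = importance_from_splits(toy_splits)
print(by_split)
print(by_gain)
```

Here `texture` ranks first by split count but far behind `radius` by gain, which is why the two rankings can disagree and why the cells below sort by gain.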

In [None]:
# Train model on breast cancer for feature importance
lgb_fi = lgb.LGBMClassifier(
    n_estimators=100,
    num_leaves=31,
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    verbose=-1
)
lgb_fi.fit(X_train_bc, y_train_bc)

# Get feature importance (split, gain)
importance_split = lgb_fi.feature_importances_  # Default is split
importance_gain = lgb_fi.booster_.feature_importance(importance_type='gain')

# Create DataFrame
fi_df = pd.DataFrame({
    'Feature': feature_names_cancer,
    'Split': importance_split,
    'Gain': importance_gain
}).sort_values('Gain', ascending=False)

print("Top 10 Features by Gain:")
print("=" * 70)
print(fi_df.head(10).to_string(index=False))

In [None]:
# Visualize top features
top_features = fi_df.head(15)

plt.figure(figsize=(10, 8))
plt.barh(range(len(top_features)), top_features['Gain'], color='steelblue', edgecolor='black')
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.gca().invert_yaxis()
plt.xlabel('Importance (Gain)', fontsize=12)
plt.title('Top 15 Features by Gain', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# SHAP analysis
if SHAP_AVAILABLE:
    print("Computing SHAP values...")
    explainer = shap.TreeExplainer(lgb_fi)
    X_test_sample = X_test_bc[:100]
    shap_values = explainer.shap_values(X_test_sample)
    
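    # Depending on the SHAP version, binary-classification output is a list
    # with one array per class, a 3-D array (samples, features, classes),
    # or a single 2-D array for the positive class
    if isinstance(shap_values, list):
        shap_values = shap_values[1]  # Use positive class
    elif np.ndim(shap_values) == 3:
        shap_values = shap_values[:, :, 1]  # Use positive class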
    
    print("SHAP values computed!")
    print(f"Shape: {shap_values.shape}")
else:
    print("SHAP not available. Install with: pip install shap")

In [None]:
if SHAP_AVAILABLE:
    # Global feature importance
    plt.figure(figsize=(10, 8))
    shap.summary_plot(shap_values, X_test_sample, 
                     feature_names=feature_names_cancer, 
                     plot_type='bar', show=False)
    plt.title('Global Feature Importance (SHAP)', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

In [None]:
if SHAP_AVAILABLE:
    # Detailed SHAP summary
    plt.figure(figsize=(10, 10))
    shap.summary_plot(shap_values, X_test_sample, 
                     feature_names=feature_names_cancer, show=False)
    plt.title('SHAP Summary Plot', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\nInterpretation:")
    print("- Red dots = high feature value")
    print("- Blue dots = low feature value")
    print("- Horizontal position = SHAP value (impact on the model output)")

## Exercises

### Exercise 1: Hyperparameter Optimization

Use GridSearchCV or RandomizedSearchCV to find optimal LightGBM parameters:

Parameters to tune:
- `num_leaves`: [15, 31, 63, 127]
- `max_depth`: [5, 7, 10, -1]
- `learning_rate`: [0.01, 0.05, 0.1]
- `n_estimators`: [100, 200, 500]
- `min_data_in_leaf`: [10, 20, 50]
- `feature_fraction`: [0.6, 0.8, 1.0]
- `bagging_fraction`: [0.6, 0.8, 1.0]

1. Perform hyperparameter search
2. Identify best parameters
3. Analyze which parameters matter most
4. Compare with default parameters
5. Plot validation curves for key parameters

In [None]:
# Your code here


### Exercise 2: LightGBM vs XGBoost Speed Comparison

Conduct comprehensive speed benchmarks:

1. Create datasets of varying sizes: 10K, 50K, 100K, 500K samples
2. For each dataset:
   - Train LightGBM with optimal parameters
   - Train XGBoost with equivalent parameters
   - Measure training time
   - Measure prediction time
   - Compare accuracy
3. Plot results:
   - Training time vs dataset size
   - Speedup ratio
   - Accuracy comparison
4. Determine at what dataset size LightGBM's advantage is most pronounced

In [None]:
# Your code here


### Exercise 3: Categorical Feature Experiments

Test different categorical encoding strategies:

1. Create dataset with varying cardinality categories:
   - Low cardinality (3-5 values)
   - Medium cardinality (10-20 values)
   - High cardinality (100+ values)
2. Compare encoding methods:
   - LightGBM native categorical
   - One-hot encoding
   - Label encoding
   - Target encoding
3. Measure for each:
   - Training time
   - Memory usage
   - Accuracy
   - Number of features
4. Determine when native categorical support provides the biggest advantage

In [None]:
# Your code here


### Exercise 4: Preventing Overfitting with Leaf-wise Growth

LightGBM's leaf-wise growth can overfit. Test regularization strategies:

1. Create a small dataset prone to overfitting (1000 samples, 50 features)
2. Train baseline model with aggressive settings:
   - `num_leaves=255`
   - `max_depth=-1`
   - `min_data_in_leaf=1`
3. Apply regularization techniques:
   - Reduce `num_leaves`
   - Set `max_depth` limit
   - Increase `min_data_in_leaf`
   - Add L1/L2 regularization
   - Use `feature_fraction` and `bagging_fraction`
4. Compare train vs test performance
5. Find optimal regularization combination

In [None]:
# Your code here


### Exercise 5: Custom Objective and Metric

Implement a custom objective function and evaluation metric:

1. Create an imbalanced classification problem (90:10 ratio)
2. Implement custom focal loss objective:
   - Focal loss focuses on hard examples
   - Formula: FL = -α(1-p)^γ log(p), where p is the predicted probability of the true class
3. Implement custom F1 evaluation metric
4. Train LightGBM with:
   - Custom focal loss objective
   - Custom F1 metric
   - Early stopping based on F1
5. Compare with standard binary logloss
6. Analyze improvements on minority class

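**Hint**: Pass the metric through `feval` in `lgb.train()`. For the objective, LightGBM 4.x expects the callable in `params['objective']`, while older 3.x releases take it through the `fobj` argument.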
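
As a starting point, the sketch below shows the general shape of a custom objective and a custom metric, using plain binary log loss rather than focal loss so the exercise stays open. It assumes the usual callback conventions: the objective receives raw scores and a dataset and returns gradient and hessian arrays, and the metric returns a `(name, value, is_higher_better)` tuple. `FakeDataset` is a hypothetical stand-in that only mimics `get_label()`, so the example runs without LightGBM installed.

```python
import numpy as np


class FakeDataset:
    """Hypothetical stand-in exposing only get_label(), like lgb.Dataset."""

    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label


def logloss_objective(raw_scores, train_data):
    """Custom objective: per-row gradient and hessian of binary log loss."""
    y = train_data.get_label()
    p = 1.0 / (1.0 + np.exp(-raw_scores))   # raw scores are log-odds
    return p - y, p * (1.0 - p)


def f1_metric(raw_scores, eval_data):
    """Custom metric: returns (name, value, is_higher_better)."""
    y = eval_data.get_label()
    pred = (raw_scores > 0.0).astype(int)   # raw score 0 equals probability 0.5
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    denom = 2 * tp + fp + fn
    return "f1", float(2 * tp / denom) if denom else 0.0, True


data = FakeDataset([1, 0, 1, 0])
scores = np.array([2.0, -1.0, -0.5, 0.5])
grad, hess = logloss_objective(scores, data)
name, value, is_higher_better = f1_metric(scores, data)
print(grad.round(3), hess.round(3))
print(name, value, is_higher_better)
```

For focal loss, only the gradient and hessian expressions change. Note the threshold at raw score 0: with a custom objective the metric typically receives raw scores rather than probabilities, which is an assumption worth checking against the installed version.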

In [None]:
# Your code here


## Summary

### Key Concepts

1. **LightGBM Innovations**:
   - **Histogram-based learning**: Bins continuous values (faster, memory efficient)
   - **Leaf-wise growth**: Best-first tree building (faster convergence, risk of overfitting)
   - **GOSS**: Gradient-based One-Side Sampling (fewer samples, maintained accuracy)
   - **EFB**: Exclusive Feature Bundling (reduces dimension in sparse data)
   - **Native categorical support**: No encoding needed!

2. **Critical Hyperparameters**:
   - `num_leaves`: Most important! Controls complexity (20-100)
   - `max_depth`: Limits tree depth, prevents overfitting (5-15 or -1)
   - `min_data_in_leaf`: Minimum samples per leaf, strong regularization (10-100)
   - `learning_rate`: Shrinkage rate (0.01-0.3)
   - `feature_fraction`, `bagging_fraction`: Randomness for regularization (0.6-1.0)
   - `max_bin`: Histogram bins, speed/accuracy trade-off (63-511)

3. **Best Practices**:
   - Start with `num_leaves=31`, tune up/down
   - Use `min_data_in_leaf` to prevent overfitting
   - Leverage categorical features natively (don't one-hot encode)
   - Use early stopping with validation set
   - Lower `learning_rate` + more `n_estimators` = often better generalization, at the cost of longer training
   - Monitor train/validation gap

4. **Advantages Over XGBoost**:
   - **Faster**: Often trains several times faster, especially on large data (measure on your own workload)
   - **Memory efficient**: Lower RAM usage
   - **Categorical features**: Native support, no preprocessing
   - **Large datasets**: Designed for millions of samples
   - **Sparse data**: EFB handles sparsity efficiently

5. **When to Use LightGBM**:
   - ✅ Large datasets (100K+ samples)
   - ✅ Many categorical features
   - ✅ High-dimensional sparse data
   - ✅ Need fast training
   - ✅ Limited memory
   - ⚠️ Small datasets: May overfit (use strong regularization)
   - ⚠️ Need deterministic results: Fix `seed` and consider `deterministic=True`; multi-threaded training can otherwise vary slightly between runs

### Strengths

- Extremely fast training
- Excellent memory efficiency
- Handles categorical features natively
- Great for large-scale production systems
- Competitive or better accuracy than XGBoost
- Active development and community

### Weaknesses

- Leaf-wise growth can overfit on small data
- Requires careful hyperparameter tuning
- Default parameters may not be optimal
- Deep, unbalanced trees make results more sensitive to hyperparameters than level-wise growth
- Can be sensitive to noise

### LightGBM vs XGBoost: Which to Choose?

**Choose LightGBM when**:
- Dataset is large (100K+ samples)
- Many categorical features
- Speed is critical
- Memory is limited
- Features are sparse

**Choose XGBoost when**:
- Dataset is small (<10K samples)
- Need maximum stability
- You want to rely mostly on default parameters
- Already familiar with XGBoost

**Truth**: Both are excellent! Try both and use what works best for your problem.

### Tuning Strategy

**Phase 1: Prevent overfitting**
1. Start with `num_leaves=31`
2. Increase `min_data_in_leaf` if overfitting (20 → 50 → 100)
3. Set `max_depth` if needed (7-10)

**Phase 2: Improve accuracy**
4. Tune `num_leaves` (increase if underfitting, decrease if overfitting)
5. Add randomness: `feature_fraction=0.8`, `bagging_fraction=0.8`

**Phase 3: Optimize learning**
6. Lower `learning_rate` to 0.05 or 0.01
7. Increase `n_estimators` accordingly
8. Use early stopping

**Phase 4: Fine-tune**
9. Adjust `max_bin` for speed/accuracy trade-off
10. Add L1/L2 regularization if still overfitting

### Production Tips

1. **Model persistence**: Use `Booster.save_model()` (for the sklearn wrapper, `model.booster_.save_model()`), or pickle/joblib to keep the full estimator
2. **Categorical features**: Save category mappings with model
3. **Monitoring**: Track feature importance changes
4. **Retraining**: Schedule regular retraining
5. **A/B testing**: Compare model versions
6. **GPU acceleration**: Use `device='gpu'` for potential speedups on large datasets (requires a GPU-enabled LightGBM build)

### What's Next?

In **Module 07: CatBoost**, we'll explore:
- Yandex's gradient boosting implementation
- Ordered boosting to prevent target leakage
- Symmetric trees for faster prediction
- Even better categorical feature handling
- Three-way comparison: XGBoost vs LightGBM vs CatBoost

### Additional Resources

- **Paper**: "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (Ke et al., 2017)
- **Documentation**: [LightGBM Official Docs](https://lightgbm.readthedocs.io/)
- **Parameters Guide**: [LightGBM Parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html)
- **Tutorial**: [LightGBM Parameter Tuning](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)
- **GitHub**: [Microsoft LightGBM](https://github.com/microsoft/LightGBM)