# Module 06: LightGBM - Light Gradient Boosting Machine

**Difficulty**: ⭐⭐
**Estimated Time**: 50 minutes
**Prerequisites**: 
- Module 04: Gradient Boosting Machines
- Module 05: XGBoost

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand what makes LightGBM unique: GOSS, EFB, and leaf-wise growth
2. Install and use LightGBM for classification and regression tasks
3. Compare LightGBM's speed and accuracy with XGBoost
4. Tune LightGBM-specific hyperparameters effectively
5. Handle categorical features natively without encoding
6. Optimize LightGBM for large datasets
7. Understand when to choose LightGBM over XGBoost

## 1. Introduction to LightGBM

### What is LightGBM?

**LightGBM** (Light Gradient Boosting Machine) is an open-source gradient boosting framework from Microsoft, first released in 2016 and described in a NIPS 2017 paper (Ke et al.). It's designed to be:
- **Fast to train**, often faster than XGBoost
- **More memory efficient** (feature values are bucketed into histograms)
- **Scalable to large datasets** (millions of rows)
- **Accurate** with lower computational cost

### Key Innovations:

#### 1. **GOSS (Gradient-based One-Side Sampling)**

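Instead of using all data points to evaluate splits, GOSS:
- Keeps all instances with **large gradients** (poorly predicted samples)
- Randomly samples instances with **small gradients** (well-predicted samples) and up-weights them so the overall data distribution stays roughly unchanged
- **Why?** Instances with large gradients contribute more to information gain
- **Result:** Faster training without significant accuracy loss
- **Note:** GOSS is opt-in (for example `boosting_type='goss'`); the default boosting mode is standard GBDT on histogram-binned features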
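The sampling step described above can be sketched in plain Python. This is a simplified illustration, not LightGBM's internal implementation: the helper `goss_sample` is hypothetical, while the names `top_rate` and `other_rate` mirror LightGBM's parameters of the same name. The up-weighting factor `(1 - top_rate) / other_rate` follows the description in the original paper.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Pick row indices and weights for one GOSS-style sampling pass."""
    n = len(gradients)
    n_top = int(top_rate * n)      # rows kept for their large gradients
    n_other = int(other_rate * n)  # rows drawn at random from the remainder

    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx, rest = order[:n_top], order[n_top:]

    rng = random.Random(seed)
    sampled_idx = rng.sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled small-gradient rows so that they stand in for
    # the whole small-gradient group when split gains are computed.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_idx}
    weights.update({i: amplify for i in sampled_idx})
    return top_idx + sampled_idx, weights


# Toy gradients: rows 0 and 1 are badly predicted, the rest are nearly right.
gradients = [0.9, -0.8, 0.05, 0.01, -0.02, 0.03, 0.04, -0.06, 0.07, 0.02]
kept, weights = goss_sample(gradients)
print("Rows used for this tree:", kept)
print("Row weights:", weights)
```

Only 3 of the 10 rows are used here: the two large-gradient rows with weight 1, plus one randomly chosen small-gradient row whose weight is scaled up.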

#### 2. **EFB (Exclusive Feature Bundling)**

- Bundles mutually exclusive features (features that rarely take non-zero values simultaneously)
- Reduces the number of features without losing information
- **Why?** Many real-world datasets have sparse features (e.g., one-hot encoded)
- **Result:** Lower dimensionality, faster training
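To make the bundling idea concrete, here is a minimal sketch that merges two sparse features which are never non-zero in the same row. The helper `bundle_exclusive` is hypothetical and assumes non-negative integer (already binned) values; LightGBM's real implementation also decides which features to bundle and tolerates a small rate of conflicts, which this sketch does not attempt.

```python
def bundle_exclusive(feature_a, feature_b):
    """Merge two mutually exclusive, non-negative integer features into one.

    Values from feature_b are shifted above feature_a's range so both can
    share a single column without ambiguity.
    """
    offset = max(feature_a)
    bundled = []
    for a, b in zip(feature_a, feature_b):
        if a != 0 and b != 0:
            raise ValueError("features are not mutually exclusive")
        if a != 0:
            bundled.append(a)
        elif b != 0:
            bundled.append(b + offset)
        else:
            bundled.append(0)
    return bundled, offset


feature_a = [1, 0, 2, 0, 0]
feature_b = [0, 3, 0, 0, 1]
bundled, offset = bundle_exclusive(feature_a, feature_b)
print(bundled)  # values above `offset` came from feature_b
```

Because the two value ranges do not overlap, a split on the bundled column can still separate the original features, so nothing is lost for tree building.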

#### 3. **Leaf-wise (Best-first) Tree Growth**

- **XGBoost (default):** Grows trees level-wise (depth-wise) - splits all nodes at each level
- **LightGBM:** Grows trees leaf-wise - splits the leaf with maximum loss reduction
- **Advantage:** For the same number of leaves, usually reaches a lower loss than level-wise growth
- **Caution:** Can overfit on small datasets unless num_leaves, max_depth, or min_child_samples constrain the tree

```
Level-wise (XGBoost):          Leaf-wise (LightGBM):
        Root                           Root
       /    \                         /    \
      A      B                       A      B*
     / \    / \                     / \      \
    C   D  E   F                   C   D      E
                                              \
                                               F
    
* = Best leaf to split next
```
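The best-first strategy in the diagram can be sketched with a priority queue. The toy tree and its gains below are made-up values for illustration, and `grow_leaf_wise` is a hypothetical helper; in a real learner the gain of each candidate split is computed from the data.

```python
import heapq

# Hypothetical toy tree: node -> (gain from splitting it, its two children).
# Nodes that do not appear as keys cannot be split any further.
TOY_SPLITS = {
    "root": (10.0, ("A", "B")),
    "A": (1.0, ("C", "D")),
    "B": (6.0, ("E", "F")),
    "F": (4.0, ("G", "H")),
}


def grow_leaf_wise(splits, num_leaves):
    """Repeatedly split the leaf with the largest gain until num_leaves is hit."""
    leaves = ["root"]
    heap = [(-splits["root"][0], "root")]  # max-heap via negated gains
    split_order = []
    while heap and len(leaves) < num_leaves:
        _, node = heapq.heappop(heap)
        left, right = splits[node][1]
        leaves.remove(node)
        leaves += [left, right]
        split_order.append(node)
        for child in (left, right):
            if child in splits:
                heapq.heappush(heap, (-splits[child][0], child))
    return split_order, sorted(leaves)


split_order, leaves = grow_leaf_wise(TOY_SPLITS, num_leaves=4)
print("Split order:", split_order)
print("Final leaves:", leaves)
```

With a budget of 4 leaves, the shallow node `A` is never split because the deeper node `F` offers more gain. A level-wise learner would have split `A` and `B` together before going deeper.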

### LightGBM vs XGBoost:

| Feature | XGBoost | LightGBM |
|---------|---------|----------|
| Speed | Fast | **Often faster** |
| Memory | Moderate | **Lower** |
| Tree Growth | Level-wise (default) | **Leaf-wise** |
| Large Datasets | Good | **Better** |
| Small Datasets | Better | Can overfit |
| Categorical Features | Usually encoded (native support is available in recent versions) | **Native support** |
| GPU Support | Yes | Yes |

## 2. Setup and Installation

In [None]:
# Install LightGBM if not already installed
# Uncomment the line below if needed
# !pip install lightgbm

# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# LightGBM
import lightgbm as lgb
from lightgbm import LGBMClassifier, LGBMRegressor

# XGBoost for comparison
from xgboost import XGBClassifier, XGBRegressor

# Scikit-learn
from sklearn.datasets import make_classification, make_regression, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, log_loss
)

# Configuration
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Set random seed
np.random.seed(42)

# Check version
print(f"LightGBM version: {lgb.__version__}")

## 3. LightGBM for Classification

In [None]:
# Load breast cancer dataset
cancer_data = load_breast_cancer()
X = cancer_data.data
y = cancer_data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Features: {len(cancer_data.feature_names)}")

In [None]:
# Train a basic LightGBM classifier
lgbm_clf = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,   # Depth 3 allows at most 2**3 = 8 leaves, below the default num_leaves=31
    random_state=42,
    verbose=-1  # Suppress LightGBM log messages
)

# Time the training
start_time = time()
lgbm_clf.fit(X_train, y_train)
lgbm_time = time() - start_time

# Predictions
y_pred_train = lgbm_clf.predict(X_train)
y_pred_test = lgbm_clf.predict(X_test)

# Evaluate
train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("LightGBM Classifier Performance:")
print(f"Training time: {lgbm_time:.4f} seconds")
print(f"Train Accuracy: {train_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_test, target_names=cancer_data.target_names))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
            xticklabels=cancer_data.target_names,
            yticklabels=cancer_data.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('LightGBM Confusion Matrix')
plt.show()

## 4. Speed Comparison: LightGBM vs XGBoost

Let's create a larger dataset to see the speed difference more clearly.

In [None]:
# Create a larger dataset for speed comparison
X_large, y_large = make_classification(
    n_samples=50000,
    n_features=50,
    n_informative=30,
    n_redundant=10,
    random_state=42
)

X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42
)

print(f"Large dataset - Training: {X_train_large.shape}")
print(f"Large dataset - Test: {X_test_large.shape}")

In [None]:
# Train LightGBM
print("Training LightGBM...")
start_time = time()
lgbm_large = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    verbose=-1
)
lgbm_large.fit(X_train_large, y_train_large)
lgbm_large_time = time() - start_time
lgbm_large_acc = lgbm_large.score(X_test_large, y_test_large)

print(f"LightGBM - Time: {lgbm_large_time:.2f}s, Accuracy: {lgbm_large_acc:.4f}")

# Train XGBoost
print("\nTraining XGBoost...")
start_time = time()
xgb_large = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    eval_metric='logloss',
    verbosity=0
)
xgb_large.fit(X_train_large, y_train_large)
xgb_large_time = time() - start_time
xgb_large_acc = xgb_large.score(X_test_large, y_test_large)

print(f"XGBoost - Time: {xgb_large_time:.2f}s, Accuracy: {xgb_large_acc:.4f}")

# Compare (single-run timings are noisy; a ratio below 1 means XGBoost was faster)
print(f"\n{'='*50}")
print("Comparison:")
print(f"Training time ratio (XGBoost / LightGBM): {xgb_large_time/lgbm_large_time:.2f}x")
print(f"Accuracy difference: {abs(lgbm_large_acc - xgb_large_acc):.4f}")

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training time comparison
models = ['LightGBM', 'XGBoost']
times = [lgbm_large_time, xgb_large_time]
accuracies = [lgbm_large_acc, xgb_large_acc]

axes[0].bar(models, times, color=['green', 'orange'], alpha=0.7)
axes[0].set_ylabel('Training Time (seconds)')
axes[0].set_title('Training Speed Comparison')
axes[0].grid(axis='y', alpha=0.3)

for i, (model, t) in enumerate(zip(models, times)):
    axes[0].text(i, t + 0.1, f'{t:.2f}s', ha='center', va='bottom')

# Accuracy comparison
axes[1].bar(models, accuracies, color=['green', 'orange'], alpha=0.7)
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy Comparison')
axes[1].set_ylim([min(accuracies) - 0.01, 1.0])
axes[1].grid(axis='y', alpha=0.3)

for i, (model, acc) in enumerate(zip(models, accuracies)):
    axes[1].text(i, acc + 0.001, f'{acc:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
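A single wall-clock measurement like the one above can vary noticeably between runs. One way to get a steadier number is to repeat the fit and take the median. The helper `time_fit` below is a hypothetical utility written for this notebook, not part of LightGBM or XGBoost, and the stand-in workload only exists so the sketch runs on its own.

```python
import statistics
from time import perf_counter


def time_fit(fit_fn, repeats=3):
    """Run fit_fn several times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        fit_fn()
        durations.append(perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload; in this notebook it could be something like
# lambda: LGBMClassifier(verbose=-1).fit(X_train_large, y_train_large)
median_seconds = time_fit(lambda: sum(i * i for i in range(100_000)), repeats=5)
print(f"Median time over 5 runs: {median_seconds:.4f}s")
```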

## 5. LightGBM Hyperparameters

LightGBM has similar parameters to XGBoost but with some differences:

### Core Parameters:

1. **num_leaves** (default=31):
   - Maximum number of leaves in one tree
   - **Key difference from XGBoost:** Controls tree complexity instead of max_depth
   - Rule of thumb: num_leaves < 2^max_depth
   - Typical range: 20-100

2. **max_depth** (default=-1, unlimited):
   - Still important to prevent overfitting
   - Set to limit tree depth even with leaf-wise growth
   - Typical range: 3-12

3. **learning_rate** (default=0.1):
   - Same as XGBoost
   - Typical range: 0.01-0.3

4. **n_estimators** (default=100):
   - Number of boosting rounds
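The interaction between num_leaves and max_depth comes down to simple arithmetic: a binary tree of depth d has at most 2^d leaves, so whichever limit is tighter wins. The helper `effective_leaf_cap` below is a hypothetical illustration of that arithmetic, not a LightGBM function.

```python
def effective_leaf_cap(num_leaves, max_depth):
    """Largest number of leaves a tree can have under both limits."""
    if max_depth <= 0:  # LightGBM treats max_depth <= 0 as "no depth limit"
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


for num_leaves, max_depth in [(31, -1), (31, 3), (31, 7), (127, 5)]:
    cap = effective_leaf_cap(num_leaves, max_depth)
    print(f"num_leaves={num_leaves}, max_depth={max_depth} -> at most {cap} leaves")
```

For example, `num_leaves=31` combined with `max_depth=3` can never use more than 8 leaves, so raising num_leaves further has no effect until max_depth is raised too.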

### LightGBM-Specific Parameters:

5. **min_child_samples** (default=20):
   - Minimum number of data points in a leaf
   - Prevents overfitting
   - Higher values → more conservative

6. **subsample** (bagging_fraction, default=1.0):
   - Fraction of training rows to sample for each bagging round
   - Only takes effect when **subsample_freq** (bagging_freq) is set to a value > 0

7. **colsample_bytree** (feature_fraction, default=1.0):
   - Fraction of features to use for each tree

8. **reg_alpha** (lambda_l1, default=0):
   - L1 regularization

9. **reg_lambda** (lambda_l2, default=0):
   - L2 regularization

10. **min_split_gain** (min_gain_to_split, default=0):
    - Minimum gain to make a split
    - Similar to gamma in XGBoost
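The scikit-learn wrapper names and the native LightGBM names listed above refer to the same settings. The mapping below collects them in one place, followed by an example parameter set with bagging switched on. The numeric values are illustrative assumptions, not tuned recommendations.

```python
# scikit-learn wrapper name -> native LightGBM parameter name
SKLEARN_TO_NATIVE = {
    "n_estimators": "num_iterations",
    "subsample": "bagging_fraction",
    "subsample_freq": "bagging_freq",
    "colsample_bytree": "feature_fraction",
    "min_child_samples": "min_data_in_leaf",
    "min_split_gain": "min_gain_to_split",
    "reg_alpha": "lambda_l1",
    "reg_lambda": "lambda_l2",
}

# Illustrative values only; they are not tuned for any dataset.
regularized_params = {
    "subsample": 0.8,         # use 80% of the rows per bagging round...
    "subsample_freq": 1,      # ...and re-sample every iteration (0 disables bagging)
    "colsample_bytree": 0.8,  # use 80% of the features per tree
    "min_child_samples": 20,
    "reg_lambda": 1.0,
}

# The same settings expressed with native parameter names.
native_params = {SKLEARN_TO_NATIVE[k]: v for k, v in regularized_params.items()}
print(native_params)
```

In this notebook the first form would be passed as `LGBMClassifier(**regularized_params)`, while the native names are the ones used in the LightGBM parameter documentation.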

### Effect of num_leaves

In [None]:
# Test different num_leaves values
num_leaves_values = [7, 15, 31, 63, 127]
leaves_results = []

for num_leaves in num_leaves_values:
    model = LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        num_leaves=num_leaves,
        random_state=42,
        verbose=-1
    )
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    leaves_results.append({
        'num_leaves': num_leaves,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'gap': train_acc - test_acc
    })
    
    print(f"num_leaves={num_leaves}: Train={train_acc:.4f}, Test={test_acc:.4f}")

leaves_df = pd.DataFrame(leaves_results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Accuracy plot
axes[0].plot(leaves_df['num_leaves'], leaves_df['train_acc'], 'o-', 
             label='Train', linewidth=2, markersize=8)
axes[0].plot(leaves_df['num_leaves'], leaves_df['test_acc'], 's-', 
             label='Test', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Leaves')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs Number of Leaves')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Overfitting gap
axes[1].plot(leaves_df['num_leaves'], leaves_df['gap'], 'ro-', 
             linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Leaves')
axes[1].set_ylabel('Train-Test Gap')
axes[1].set_title('Overfitting Gap vs Number of Leaves')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nBest num_leaves based on test accuracy:")
best_idx = leaves_df['test_acc'].idxmax()
print(f"num_leaves: {leaves_df.loc[best_idx, 'num_leaves']}")
print(f"Test Accuracy: {leaves_df.loc[best_idx, 'test_acc']:.4f}")

### Effect of min_child_samples

In [None]:
# Test different min_child_samples values
min_samples_values = [5, 10, 20, 50, 100]
samples_results = []

for min_samples in min_samples_values:
    model = LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        num_leaves=31,
        min_child_samples=min_samples,
        random_state=42,
        verbose=-1
    )
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    samples_results.append({
        'min_child_samples': min_samples,
        'train_acc': train_acc,
        'test_acc': test_acc
    })
    
    print(f"min_child_samples={min_samples}: Train={train_acc:.4f}, Test={test_acc:.4f}")

samples_df = pd.DataFrame(samples_results)

# Visualize
plt.figure(figsize=(10, 6))
plt.plot(samples_df['min_child_samples'], samples_df['train_acc'], 'o-', 
         label='Train', linewidth=2, markersize=8)
plt.plot(samples_df['min_child_samples'], samples_df['test_acc'], 's-', 
         label='Test', linewidth=2, markersize=8)
plt.xlabel('min_child_samples')
plt.ylabel('Accuracy')
plt.title('Effect of min_child_samples on Performance')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

## 6. Categorical Features Support

One of LightGBM's most powerful features is **native categorical feature support**. You don't need to one-hot encode categorical variables!

In [None]:
# Create a dataset with categorical features
np.random.seed(42)
n_samples = 1000

# Numerical features
X_num = np.random.randn(n_samples, 5)

# Categorical features
cat_feature1 = np.random.choice(['A', 'B', 'C', 'D'], n_samples)
cat_feature2 = np.random.choice(['Low', 'Medium', 'High'], n_samples)
cat_feature3 = np.random.choice(range(10), n_samples)  # Integer categories

# Create DataFrame
df = pd.DataFrame(X_num, columns=[f'num_{i}' for i in range(5)])
df['category1'] = cat_feature1
df['category2'] = cat_feature2
df['category3'] = cat_feature3

# Create target based on features (including categoricals)
y_cat = (X_num[:, 0] + 
         (df['category1'] == 'A').astype(int) * 2 + 
         (df['category2'] == 'High').astype(int) * 1.5 +
         np.random.randn(n_samples) * 0.5) > 1
y_cat = y_cat.astype(int)

print("Dataset with categorical features:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"Target distribution: {pd.Series(y_cat).value_counts().to_dict()}")

In [None]:
# Prepare data for LightGBM
# Convert categorical columns to 'category' dtype
for col in ['category1', 'category2', 'category3']:
    df[col] = df[col].astype('category')

# Split data
X_cat = df
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(
    X_cat, y_cat, test_size=0.2, random_state=42
)

# Names of the categorical columns
categorical_features = ['category1', 'category2', 'category3']

print("Categorical features identified:")
print(categorical_features)

In [None]:
# Train LightGBM with categorical features
lgbm_cat = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,
    random_state=42,
    verbose=-1
)

# Specify categorical features
lgbm_cat.fit(
    X_train_cat, y_train_cat,
    categorical_feature=categorical_features
)

# Evaluate
train_acc_cat = lgbm_cat.score(X_train_cat, y_train_cat)
test_acc_cat = lgbm_cat.score(X_test_cat, y_test_cat)

print("LightGBM with Native Categorical Support:")
print(f"Train Accuracy: {train_acc_cat:.4f}")
print(f"Test Accuracy: {test_acc_cat:.4f}")

In [None]:
# Compare with one-hot encoding approach
# category1 and category2 are one-hot encoded; category3 is kept as a plain integer column
X_encoded = pd.get_dummies(X_cat, columns=['category1', 'category2'], drop_first=True)
# Convert category3 from 'category' dtype back to its integer values
X_encoded['category3'] = X_encoded['category3'].astype(int)

X_train_enc, X_test_enc, y_train_enc, y_test_enc = train_test_split(
    X_encoded, y_cat, test_size=0.2, random_state=42
)

# Train with one-hot encoded features
start_time = time()
lgbm_encoded = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,
    random_state=42,
    verbose=-1
)
lgbm_encoded.fit(X_train_enc, y_train_enc)
encoded_time = time() - start_time

# Train with categorical features
start_time = time()
lgbm_cat_timed = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,
    random_state=42,
    verbose=-1
)
lgbm_cat_timed.fit(X_train_cat, y_train_cat, categorical_feature=categorical_features)
cat_time = time() - start_time

# Compare
print("Comparison: One-Hot Encoding vs Native Categorical:")
print(f"\nOne-Hot Encoded:")
print(f"  Number of features: {X_encoded.shape[1]}")
print(f"  Training time: {encoded_time:.4f}s")
print(f"  Test Accuracy: {lgbm_encoded.score(X_test_enc, y_test_enc):.4f}")
print(f"\nNative Categorical:")
print(f"  Number of features: {X_cat.shape[1]}")
print(f"  Training time: {cat_time:.4f}s")
print(f"  Test Accuracy: {lgbm_cat_timed.score(X_test_cat, y_test_cat):.4f}")
print(f"\nTraining time ratio (one-hot / native): {encoded_time/cat_time:.2f}x")

**Key Insight**: LightGBM's native categorical support is not only more convenient (no need for encoding) but also often faster and can lead to better performance by finding optimal splits directly on categorical values.
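One reason splitting directly on categories is practical is that the categories are first ordered by a gradient statistic, so only the splits along that ordering need to be checked instead of every possible grouping. The sketch below illustrates that idea under simplifying assumptions: `ordered_category_splits` is a hypothetical helper, the gradients and hessians are made-up numbers, and LightGBM's actual implementation adds smoothing and other safeguards that are left out here.

```python
from collections import defaultdict


def ordered_category_splits(categories, gradients, hessians):
    """List candidate left/right category groups along a gradient-based ordering."""
    grad_sum = defaultdict(float)
    hess_sum = defaultdict(float)
    for cat, grad, hess in zip(categories, gradients, hessians):
        grad_sum[cat] += grad
        hess_sum[cat] += hess

    # Order categories by sum of gradients divided by sum of hessians.
    ranked = sorted(grad_sum, key=lambda c: grad_sum[c] / hess_sum[c])

    # Only "prefix vs. rest" splits along this ordering are considered.
    return [(set(ranked[:k]), set(ranked[k:])) for k in range(1, len(ranked))]


categories = ["A", "B", "C", "A", "B", "C"]
gradients = [-1.0, 0.5, 0.1, -0.8, 0.7, 0.0]  # made-up values
hessians = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

candidate_splits = ordered_category_splits(categories, gradients, hessians)
for left, right in candidate_splits:
    print(sorted(left), "|", sorted(right))
```

With k categories this yields k - 1 candidate splits, whereas the number of distinct two-way groupings grows exponentially with k.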

## 7. Early Stopping and Model Monitoring

In [None]:
# Create validation set
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

# Train with early stopping and callbacks
lgbm_early = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    random_state=42,
    verbose=-1
)

# Fit with early stopping
lgbm_early.fit(
    X_train_sub, y_train_sub,
    eval_set=[(X_val, y_val)],
    eval_metric='binary_logloss',
    callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)]
)

print(f"Best iteration: {lgbm_early.best_iteration_}")
print(f"Best score: {lgbm_early.best_score_['valid_0']['binary_logloss']:.4f}")
print(f"Test Accuracy: {lgbm_early.score(X_test, y_test):.4f}")
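The early stopping used in the cell above follows a simple patience rule: keep training while the validation metric improves, and stop once it has failed to improve for `stopping_rounds` consecutive rounds. The function below is a hypothetical, simplified re-implementation of that rule for illustration (the loss values are made up); it is not the code behind `lgb.early_stopping`.

```python
def early_stopping_round(val_losses, stopping_rounds):
    """Return (best_round, best_loss), counting boosting rounds from 1."""
    best_round, best_loss = 0, float("inf")
    for round_no, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_no, loss
        elif round_no - best_round >= stopping_rounds:
            break  # no improvement for `stopping_rounds` rounds in a row
    return best_round, best_loss


# Made-up validation losses with a late improvement at round 6.
val_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48]

print(early_stopping_round(val_losses, stopping_rounds=2))
print(early_stopping_round(val_losses, stopping_rounds=3))
```

With a patience of 2 the loop stops at round 5 and reports round 3 as best, never seeing the better round 6. A patience of 3 keeps going long enough to find it, which is the trade-off behind choosing `stopping_rounds`.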

In [None]:
# Visualize training progress
# Train another model to capture full training history
lgbm_history = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.05,
    num_leaves=31,
    random_state=42,
    verbose=-1
)

# Use callbacks to log evaluation results
evals_result = {}
lgbm_history.fit(
    X_train_sub, y_train_sub,
    eval_set=[(X_train_sub, y_train_sub), (X_val, y_val)],
    eval_metric='binary_logloss',
    callbacks=[lgb.record_evaluation(evals_result)]
)

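# Plot training history
# The first eval_set entry is the training data itself, so LightGBM names it
# 'training'; the second entry gets the positional name 'valid_1'
train_logloss = evals_result['training']['binary_logloss']
val_logloss = evals_result['valid_1']['binary_logloss']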

plt.figure(figsize=(10, 6))
plt.plot(train_logloss, label='Training', linewidth=2)
plt.plot(val_logloss, label='Validation', linewidth=2)
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.title('LightGBM Training Progress')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Find best iteration (argmin is 0-based, boosting rounds are counted from 1)
best_iter = int(np.argmin(val_logloss)) + 1
print(f"Optimal iteration: {best_iter}")
print(f"Best validation log loss: {val_logloss[best_iter - 1]:.4f}")

## 8. Feature Importance

In [None]:
# Get feature importances
# (the scikit-learn wrapper defaults to importance_type='split', i.e. split counts)
feature_importance = lgbm_clf.feature_importances_
feature_names = cancer_data.feature_names

# Create DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("Top 10 Most Important Features:")
print(importance_df.head(10))

# Visualize
plt.figure(figsize=(10, 8))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['Importance'], color='green', alpha=0.7)
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Feature Importance (Split Count)')
plt.title('Top 15 Feature Importances - LightGBM')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# LightGBM supports both gain-based and split-based importance
# plot_importance accepts the fitted scikit-learn estimator directly
lgb.plot_importance(lgbm_clf, max_num_features=15, importance_type='gain', 
                   figsize=(10, 8), title='Feature Importance (Gain)')
plt.tight_layout()
plt.show()

lgb.plot_importance(lgbm_clf, max_num_features=15, importance_type='split',
                   figsize=(10, 8), title='Feature Importance (Split Count)')
plt.tight_layout()
plt.show()
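The two importance types can rank features differently. Split importance counts how often a feature is used, while gain importance sums how much each of those splits reduced the loss. The toy example below uses made-up split records and a hypothetical helper `summarize_importance` to show how the rankings can disagree.

```python
from collections import Counter, defaultdict

# Made-up split records: (feature used, gain of that split).
toy_splits = [
    ("mean radius", 5.0),
    ("mean texture", 0.4),
    ("mean texture", 0.3),
    ("mean texture", 0.2),
]


def summarize_importance(splits):
    """Return (split counts, total gains) per feature."""
    split_count = Counter(feature for feature, _ in splits)
    total_gain = defaultdict(float)
    for feature, gain in splits:
        total_gain[feature] += gain
    return dict(split_count), dict(total_gain)


split_count, total_gain = summarize_importance(toy_splits)
print("Top by split count:", max(split_count, key=split_count.get))
print("Top by total gain:", max(total_gain, key=total_gain.get))
```

Here "mean texture" is used most often, but "mean radius" contributes far more gain with a single split, so it helps to check which importance type a plot is showing.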

## 9. LightGBM for Regression

In [None]:
# Create regression dataset
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    noise=10,
    random_state=42
)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train LightGBM Regressor
lgbm_reg = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,
    random_state=42,
    verbose=-1
)

lgbm_reg.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_train_reg = lgbm_reg.predict(X_train_reg)
y_pred_test_reg = lgbm_reg.predict(X_test_reg)

# Evaluate
train_r2 = r2_score(y_train_reg, y_pred_train_reg)
test_r2 = r2_score(y_test_reg, y_pred_test_reg)
train_mse = mean_squared_error(y_train_reg, y_pred_train_reg)
test_mse = mean_squared_error(y_test_reg, y_pred_test_reg)

print("LightGBM Regressor Performance:")
print(f"Train R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")
print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE: {test_mse:.2f}")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].scatter(y_train_reg, y_pred_train_reg, alpha=0.5, color='green')
axes[0].plot([y_train_reg.min(), y_train_reg.max()],
             [y_train_reg.min(), y_train_reg.max()],
             'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual values')
axes[0].set_ylabel('Predicted values')
axes[0].set_title(f'Training Set (R² = {train_r2:.4f})')
axes[0].legend()

axes[1].scatter(y_test_reg, y_pred_test_reg, alpha=0.5, color='green')
axes[1].plot([y_test_reg.min(), y_test_reg.max()],
             [y_test_reg.min(), y_test_reg.max()],
             'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual values')
axes[1].set_ylabel('Predicted values')
axes[1].set_title(f'Test Set (R² = {test_r2:.4f})')
axes[1].legend()

plt.tight_layout()
plt.show()

## 10. Exercises

### Exercise 1: Basic LightGBM Training

Create a synthetic classification dataset with 1000 samples and 25 features. Train a LightGBM classifier with:
- 150 estimators
- Learning rate of 0.05
- num_leaves of 50

Compare training and test accuracy. Is there overfitting?

In [None]:
# Your code here


### Exercise 2: Speed Comparison on Large Data

Create a large dataset with 100,000 samples and 100 features. Train both LightGBM and XGBoost with identical parameters:
- 100 estimators
- Learning rate 0.1
- Max depth 5

Compare:
1. Training time
2. Test accuracy
3. Which is faster and by how much?

In [None]:
# Your code here


### Exercise 3: Categorical Feature Handling

Create a dataset with:
- 5 numerical features
- 3 categorical features (with 5, 10, and 20 categories respectively)
- 2000 samples

Train two models:
1. LightGBM with native categorical support
2. LightGBM with one-hot encoded features

Compare performance and training time.

In [None]:
# Your code here


### Exercise 4: Hyperparameter Grid Search

Using the breast cancer dataset, perform a grid search over:
- num_leaves: [15, 31, 63]
- learning_rate: [0.01, 0.05, 0.1]
- min_child_samples: [10, 20, 50]

Find the best combination and report the test accuracy.

In [None]:
# Your code here


### Exercise 5: Early Stopping Optimization

Train a LightGBM model on the breast cancer data with:
- n_estimators=1000
- Early stopping enabled with 30 rounds patience
- Learning rate of 0.03

Questions:
1. At what iteration does it stop?
2. Plot the training and validation loss curves
3. What's the final test accuracy?

In [None]:
# Your code here


## 11. Summary

In this notebook, you learned about LightGBM, Microsoft's high-performance gradient boosting framework:

### Key Innovations:

1. **GOSS (Gradient-based One-Side Sampling)**:
   - Keeps samples with large gradients
   - Randomly samples those with small gradients
   - Result: Faster training with minimal accuracy loss

2. **EFB (Exclusive Feature Bundling)**:
   - Bundles mutually exclusive features
   - Reduces dimensionality without information loss
   - Particularly effective for sparse data

3. **Leaf-wise Tree Growth**:
   - Splits the best leaf (highest loss reduction)
   - Often more accurate than level-wise growth for the same number of leaves
   - Needs limits on num_leaves and max_depth to prevent overfitting

### Important Hyperparameters:

- **num_leaves**: Controls tree complexity (key parameter!)
- **max_depth**: Limits tree depth to prevent overfitting
- **min_child_samples**: Minimum data in leaf
- **learning_rate**: Step size shrinkage
- **bagging_fraction/feature_fraction**: Add randomness
- **min_split_gain**: Minimum gain for splits
- **reg_alpha/reg_lambda**: Regularization

### Advantages:

✅ **Faster training** than XGBoost, especially on large datasets
✅ **Lower memory usage**
✅ **Native categorical feature support** (no encoding needed)
✅ **Better accuracy** in many scenarios
✅ **Handles large datasets** efficiently
✅ **GPU support**
✅ **Distributed training** across multiple machines

### Disadvantages:

❌ Can **overfit on small datasets** (< 10,000 samples)
❌ Leaf-wise growth requires **careful tuning** of num_leaves and max_depth
❌ More **sensitive to parameters** than XGBoost
❌ Less **documentation and community** than XGBoost (though growing)

### LightGBM vs XGBoost - When to Use Which:

**Choose LightGBM when:**
- Dataset is **large** (> 100K samples)
- Training **speed is critical**
- Working with **categorical features**
- **Memory is limited**
- Need **GPU acceleration**

**Choose XGBoost when:**
- Dataset is **small** (< 10K samples)
- Need more **stable/conservative** performance
- Want **extensive documentation** and examples
- Prefer **safer defaults**
- **Regularization** is critical

**Both work well for:**
- Medium-sized datasets (10K - 100K samples)
- Most tabular data problems
- Kaggle competitions (winners use both!)

### Best Practices:

1. **Start with conservative parameters**:
   ```python
   LGBMClassifier(
       num_leaves=31,
       max_depth=7,  # Set explicitly to prevent overfitting
       learning_rate=0.05,
       n_estimators=100,
       min_child_samples=20
   )
   ```

2. **Always set max_depth** to prevent overfitting
3. **Use categorical features natively** when possible
4. **Enable early stopping** for optimal tree count
5. **Monitor train/val curves** to detect overfitting
6. **Use bagging/feature fraction** for regularization

### What's Next?

In the next module, we'll explore **CatBoost**, Yandex's gradient boosting library that:
- Handles **categorical features even better** than LightGBM
- Uses **ordered boosting** to reduce overfitting
- Employs **symmetric trees** for faster prediction
- Often achieves **best accuracy** out-of-the-box

We'll compare all three libraries (XGBoost, LightGBM, CatBoost) to help you choose the right one!

## Additional Resources

- [LightGBM Documentation](https://lightgbm.readthedocs.io/)
- [LightGBM Parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html)
- [Original LightGBM Paper (Ke et al., 2017)](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree)
- [LightGBM Python Quick Start](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html)
- [Parameter Tuning Guide](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)