# Lab 1.6.1: Tabular Data Challenge - XGBoost vs Neural Networks

**Module:** 1.6 - Classical ML Foundations  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Train XGBoost models on tabular data with GPU acceleration
- [ ] Build equivalent neural networks for fair comparison
- [ ] Understand why XGBoost often wins on tabular data
- [ ] Extract and interpret feature importance from tree models
- [ ] Make informed decisions about when to use each approach

---

## 📚 Prerequisites

- Completed: Module 1.5 (Neural Network Fundamentals)
- Knowledge of: Basic Python, NumPy, basic neural network concepts

---

## 🌍 Real-World Context

**The $1 Million Question:** When should you use XGBoost vs a neural network?

This isn't academic; it's the question that determines whether your model ships in 2 days or 2 months:

- **Banks** use XGBoost for credit scoring (interpretability required by law!)
- **Kaggle competitions** are dominated by XGBoost on tabular data
- **Tech companies** start every ML project with an XGBoost baseline
- **Healthcare** prefers tree models because doctors need to understand decisions

In 2022, the paper "Why do tree-based models still outperform deep learning on tabular data?" benchmarked both families on medium-sized datasets (around 10K samples) and confirmed what practitioners had long observed: **tree-based models such as XGBoost usually win on tabular data**.

Today, you'll see this firsthand and understand *why*.

---

## 🧒 ELI5: Decision Trees and Gradient Boosting

> **Imagine you're playing 20 Questions...**
>
> In 20 Questions, you ask yes/no questions to guess what someone is thinking:
> - "Is it alive?" → Yes
> - "Is it a mammal?" → Yes  
> - "Does it live in water?" → Yes
> - "Is it a whale?" → Yes! 🐋
>
> A **Decision Tree** works exactly like this! It asks questions about your data:
> - "Is income > $50K?" → Yes
> - "Is age > 30?" → No
> - "Has credit history > 5 years?" → Yes
> - Prediction: **Approve loan** ✅
>
> **But what about XGBoost (Gradient Boosting)?**
>
> Imagine you're terrible at 20 Questions. So you ask 100 friends to play, and each friend:
> 1. Plays the game
> 2. Sees where the previous friends made mistakes
> 3. Focuses on fixing THOSE specific mistakes
>
> XGBoost is like this team of 100 friends, where each new tree specifically learns to fix the errors of all previous trees!
>
> **In AI terms:** Gradient boosting builds trees sequentially, with each new tree trained to predict the *residual errors* of the ensemble so far.
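
To make the "fix the previous friends' mistakes" idea concrete, here is a minimal sketch of boosting on a toy 1-D problem. It is a simplification, not XGBoost's actual algorithm: each "tree" is a one-split stump, the loss is plain squared error, and the names `fit_stump`, `predict_stump`, `x_toy`, and `y_toy` are made up for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold that best splits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Toy data (hypothetical): a noisy sine wave
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 10, size=200)
y_toy = np.sin(x_toy) + 0.1 * rng.normal(size=200)

learning_rate = 0.3
prediction = np.full_like(y_toy, y_toy.mean())   # round 0: predict the mean everywhere
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residual = y_toy - prediction                # what the ensemble still gets wrong
    stump = fit_stump(x_toy, residual)           # new "friend" studies only those mistakes
    prediction = prediction + learning_rate * predict_stump(stump, x_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE after 0 rounds:  {mse_history[0]:.4f}")
print(f"Training MSE after 50 rounds: {mse_history[-1]:.4f}")
```

Each round fits a stump to what the ensemble still gets wrong and adds a scaled-down copy of it, so the training error never goes up from one round to the next.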

---

## Part 1: Environment Setup & Data Loading

Let's set up our environment and load a classic tabular dataset: the California Housing dataset.

### Why California Housing?
- Real estate prediction is a perfect tabular problem
- Mix of numerical features (median income, house age, etc.)
- Reasonable size (~20K samples) for quick experiments
- Well-understood baseline performance

In [None]:
# First, let's check our DGX Spark GPU!
import torch

print("🔍 System Check")
print("=" * 50)
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print("=" * 50)

In [None]:
# Import all required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn for data and preprocessing
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# XGBoost - The Kaggle Champion!
import xgboost as xgb

# PyTorch for neural networks
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ All libraries imported successfully!")

In [None]:
# Load California Housing dataset
print("📦 Loading California Housing Dataset...")
housing = fetch_california_housing()

# Create a DataFrame for easier exploration
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Target'] = housing.target  # Median house value in $100,000s

print(f"\n📊 Dataset Shape: {df.shape}")
print(f"📝 Features: {list(housing.feature_names)}")
print(f"🎯 Target: Median house value (in $100,000s)")
print("\n" + "=" * 60)
print("First 5 rows:")
df.head()

In [None]:
# Let's understand our features better
print("📊 Feature Descriptions:")
print("=" * 60)
feature_descriptions = {
    'MedInc': 'Median income in block group (in $10,000s)',
    'HouseAge': 'Median house age in block group (years)',
    'AveRooms': 'Average number of rooms per household',
    'AveBedrms': 'Average number of bedrooms per household',
    'Population': 'Block group population',
    'AveOccup': 'Average number of household members',
    'Latitude': 'Block group latitude',
    'Longitude': 'Block group longitude'
}

for feature, desc in feature_descriptions.items():
    print(f"  • {feature}: {desc}")

print("\n📈 Statistical Summary:")
df.describe().round(2)

In [None]:
# Visualize the target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Target distribution
axes[0].hist(df['Target'], bins=50, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].axvline(df['Target'].median(), color='red', linestyle='--', linewidth=2, label=f'Median: ${df["Target"].median()*100000:,.0f}')
axes[0].set_xlabel('Median House Value ($100,000s)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of House Prices')
axes[0].legend()

# Correlation with target
correlations = df.corr()['Target'].drop('Target').sort_values()
colors = ['green' if c > 0 else 'red' for c in correlations]
axes[1].barh(correlations.index, correlations.values, color=colors, alpha=0.7)
axes[1].set_xlabel('Correlation with House Price')
axes[1].set_title('Feature Correlations with Target')
axes[1].axvline(0, color='black', linewidth=0.5)

plt.tight_layout()
plt.show()

print("💡 Key Insight: MedInc (median income) has the strongest correlation with house prices!")

### 🔍 What Just Happened?

We loaded the California Housing dataset and discovered:
1. **20,640 samples** with 8 features each
2. **MedInc** (median income) is the strongest predictor of house prices
3. **Location** (Lat/Long) also matters (beachfront property costs more!)
4. The target is capped at roughly $500K (higher values were clipped to that ceiling), which explains the spike at the right edge of the histogram

---

## Part 2: Data Preparation

Now let's prepare our data for both XGBoost and neural network training.

In [None]:
# Prepare features and target
X = df.drop('Target', axis=1).values
y = df['Target'].values

# Split into train, validation, and test sets
# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("📊 Data Split:")
print(f"  • Training:   {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"  • Validation: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.0f}%)")
print(f"  • Test:       {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.0f}%)")

In [None]:
# Scale features for neural network (XGBoost doesn't need this!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print("⚠️ Important Note:")
print("  • XGBoost: Does NOT need feature scaling (tree-based!)")
print("  • Neural Network: NEEDS feature scaling (gradient-based!)")
print("\nThis is one reason XGBoost is easier to use!")

---

## Part 3: XGBoost - The Gradient Boosting Champion

### 🧒 ELI5: What Makes XGBoost Special?

> **Think of XGBoost as a smart study group...**
>
> Imagine a class where students help each other prepare for an exam:
> 1. **Student 1** studies everything, but makes some mistakes on practice problems
> 2. **Student 2** only studies the problems Student 1 got wrong
> 3. **Student 3** focuses on problems both previous students struggled with
> 4. And so on...
>
> By the end, this study group can solve any problem because each student specializes in fixing the previous students' weaknesses!
>
> **XGBoost works the same way:**
> - Each tree focuses on the errors made by previous trees
> - Trees are shallow (usually 3-10 levels) but there are many of them
> - The final prediction is the sum of all trees' predictions
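
For squared-error loss, the study-group picture corresponds to the update below, written in standard gradient boosting notation. The symbols $f_t$ (the tree added in round $t$) and $r_i$ (the residual for sample $i$) are introduced for this sketch; XGBoost itself additionally uses second-order gradient information and a regularization term.

$$
r_i^{(t-1)} = y_i - \hat{y}_i^{(t-1)}, \qquad
f_t = \arg\min_{f} \sum_{i} \left( r_i^{(t-1)} - f(x_i) \right)^2, \qquad
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i)
$$

Here $\eta$ is the learning rate. After $T$ rounds the prediction is the initial guess plus $\eta \sum_{t=1}^{T} f_t(x_i)$, which is the "sum of all trees' predictions" from the list above.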

### Why XGBoost Dominates Tabular Data

1. **Little preprocessing needed**: Handles raw numerical features directly, with no scaling required
2. **Built-in regularization**: L1/L2 penalties help prevent overfitting
3. **Missing values**: Learns optimal direction for missing data
4. **Feature importance**: Free interpretability!
5. **Fast**: Highly optimized C++ with GPU support
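
The first point can be checked on a small scale. The sketch below is an illustration under stated assumptions, not XGBoost's implementation (which uses histogram-based binning): it scans every threshold on one synthetic feature and shows that standardizing the feature leaves the best split's partition of the rows unchanged. The helper `best_split_partition` and the synthetic `population` and `price` arrays are hypothetical.

```python
import numpy as np

def best_split_partition(feature, target):
    """Return the boolean 'goes left' mask of the least-squares best split."""
    best_sse, best_mask = np.inf, None
    for threshold in np.unique(feature)[:-1]:    # skip the max so both sides are non-empty
        mask = feature <= threshold
        left, right = target[mask], target[~mask]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_mask = sse, mask
    return best_mask

rng = np.random.default_rng(42)
population = rng.uniform(3, 35_000, size=300)    # raw feature on a large scale
price = np.where(population > 12_000, 3.0, 1.5) + 0.1 * rng.normal(size=300)

# Standardize the feature the way StandardScaler would (mean 0, std 1)
standardized = (population - population.mean()) / population.std()

same_partition = np.array_equal(
    best_split_partition(population, price),
    best_split_partition(standardized, price),
)
print(f"Same rows go left before and after scaling: {same_partition}")
```

Standardization preserves the ordering of the values, and a tree split only depends on which rows fall on each side of the threshold, so the chosen partition is the same either way.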

In [None]:
# Train XGBoost with GPU acceleration
print("🚀 Training XGBoost on GPU...")
print("=" * 60)

# XGBoost parameters - good defaults for tabular regression
xgb_params = {
    'objective': 'reg:squarederror',  # Regression task
    'eval_metric': 'rmse',             # Root Mean Squared Error
    'max_depth': 6,                    # Tree depth (controls complexity)
    'learning_rate': 0.1,              # Step size (η)
    'n_estimators': 100,               # Number of trees
    'subsample': 0.8,                  # Row sampling
    'colsample_bytree': 0.8,           # Column sampling
    'tree_method': 'hist',             # Fast histogram-based algorithm
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',  # Use GPU if available
    'random_state': 42,
    'verbosity': 0
}

# Create and train model
xgb_model = xgb.XGBRegressor(**xgb_params)

# Time the training
start_time = time()
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
xgb_train_time = time() - start_time

# Predictions
start_time = time()
xgb_pred_train = xgb_model.predict(X_train)
xgb_pred_val = xgb_model.predict(X_val)
xgb_pred_test = xgb_model.predict(X_test)
xgb_inference_time = time() - start_time

# Calculate metrics
xgb_metrics = {
    'train_rmse': np.sqrt(mean_squared_error(y_train, xgb_pred_train)),
    'val_rmse': np.sqrt(mean_squared_error(y_val, xgb_pred_val)),
    'test_rmse': np.sqrt(mean_squared_error(y_test, xgb_pred_test)),
    'test_r2': r2_score(y_test, xgb_pred_test),
    'test_mae': mean_absolute_error(y_test, xgb_pred_test),
    'train_time': xgb_train_time,
    'inference_time': xgb_inference_time
}

print(f"\n✅ XGBoost Training Complete!")
print(f"\n📊 Results:")
print(f"  • Training RMSE:   ${xgb_metrics['train_rmse']*100000:,.0f}")
print(f"  • Validation RMSE: ${xgb_metrics['val_rmse']*100000:,.0f}")
print(f"  • Test RMSE:       ${xgb_metrics['test_rmse']*100000:,.0f}")
print(f"  • Test R² Score:   {xgb_metrics['test_r2']:.4f}")
print(f"  • Test MAE:        ${xgb_metrics['test_mae']*100000:,.0f}")
print(f"\n⏱️ Timing:")
print(f"  • Training Time:   {xgb_metrics['train_time']:.2f} seconds")
print(f"  • Inference Time:  {xgb_metrics['inference_time']*1000:.2f} ms")

In [None]:
# Visualize XGBoost Feature Importance
print("📊 XGBoost Feature Importance")
print("=" * 60)

# Get feature importance
importance_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Importance': xgb_model.feature_importances_
}).sort_values('Importance', ascending=True)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.barh(importance_df['Feature'], importance_df['Importance'], 
               color='steelblue', alpha=0.8)
ax.set_xlabel('Feature Importance (Gain)')
ax.set_title('XGBoost Feature Importance - Which Features Matter Most?')

# Add value labels
for bar, val in zip(bars, importance_df['Importance']):
    ax.text(val + 0.01, bar.get_y() + bar.get_height()/2, 
            f'{val:.3f}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

print("\n💡 Insight: MedInc (median income) is by far the most important feature!")
print("   This matches our correlation analysis and makes intuitive sense.")

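### 🔍 What Just Happened?

XGBoost trained in under a second on GPU. Exact numbers vary with hardware and library versions, but you should see roughly:
- **R² of ~0.83**: The model explains about 83% of the variance in house prices!
- **RMSE of ~$47,000**: The root mean squared error, which penalizes large misses more heavily than a plain average error
- **Automatic feature importance**: We know MedInc is the key driver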
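
The test metrics reported above weight errors differently. A tiny hand-computed example, using four made-up prices (`y_true`, `y_pred` are hypothetical values in $100,000s), shows how a single large miss pulls RMSE above MAE:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical actual prices ($100,000s)
y_pred = np.array([1.0, 2.0, 3.0, 2.0])   # three perfect predictions, one large miss

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                       # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))                # squares errors before averaging
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae:.2f}")    # MAE:  0.50
print(f"RMSE: {rmse:.2f}")   # RMSE: 1.00
print(f"R²:   {r2:.2f}")     # R²:   0.20
```

RMSE is twice the MAE here because the one miss of 2.0 is squared before averaging, which is why RMSE is better read as "typical error with extra weight on big misses" than as a plain average.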

Now let's see if a neural network can do better...

---

## Part 4: Neural Network Approach

### 🧒 ELI5: Neural Networks vs Decision Trees

> **Trees ask questions. Neural networks learn patterns.**
>
> Think of it this way:
> - **Decision Tree**: "Is the income above $50K? Yes? Then check if age > 30..." (Explicit rules)
> - **Neural Network**: "I've seen millions of examples. There's a complex pattern here..." (Learned features)
>
> **Why neural networks sometimes struggle with tabular data:**
> 1. Tabular features are often **heterogeneous** (income, age, location = different scales/meanings)
> 2. Important patterns may be **axis-aligned** (income > threshold) - trees handle this naturally
> 3. Need more data to learn what trees "know" implicitly

Let's build a fair comparison neural network:
- Similar number of parameters to XGBoost's effective capacity
- Modern techniques: BatchNorm, Dropout, Adam optimizer
- Similar training time budget

In [None]:
# Define a modern MLP for tabular data
class TabularMLP(nn.Module):
    """
    Modern MLP architecture for tabular data.
    
    Architecture:
    - Input → Linear → BatchNorm → ReLU → Dropout
    - Repeat for hidden layers
    - Final Linear → Output
    """
    
    def __init__(self, input_dim, hidden_dims=(256, 128, 64), dropout=0.2):
        super().__init__()
        
        layers = []
        prev_dim = input_dim
        
        # Build hidden layers
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, 1))
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x).squeeze(-1)

# Create model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nn_model = TabularMLP(input_dim=8, hidden_dims=[256, 128, 64], dropout=0.2).to(device)

# Count parameters
n_params = sum(p.numel() for p in nn_model.parameters())
print(f"📐 Neural Network Architecture:")
print(f"  • Input: 8 features")
print(f"  • Hidden: 256 → 128 → 64")
print(f"  • Output: 1 (house price)")
print(f"  • Total Parameters: {n_params:,}")
print(f"  • Device: {device}")
print("\n" + str(nn_model))
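
The parameter total printed above can be checked by hand. This short sketch (the helper `count_mlp_parameters` is hypothetical, written only for this check) counts the weights and biases of each `Linear` layer plus the learnable scale and shift of each `BatchNorm1d`; ReLU and Dropout have no parameters, and BatchNorm's running statistics are buffers rather than parameters.

```python
def count_mlp_parameters(input_dim, hidden_dims, use_batchnorm=True):
    """Count learnable parameters of a Linear -> BatchNorm -> ReLU -> Dropout MLP."""
    total = 0
    prev_dim = input_dim
    for hidden_dim in hidden_dims:
        total += prev_dim * hidden_dim + hidden_dim   # Linear: weights + biases
        if use_batchnorm:
            total += 2 * hidden_dim                   # BatchNorm1d: scale + shift
        prev_dim = hidden_dim
    total += prev_dim * 1 + 1                          # output Linear layer
    return total

n_params_by_hand = count_mlp_parameters(8, [256, 128, 64])
print(f"Parameters by hand: {n_params_by_hand:,}")  # Parameters by hand: 44,417
```

Most of the capacity sits in the 256 → 128 layer (32,896 parameters), so widening the first hidden layer is the quickest way to grow this model.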

In [None]:
# Prepare PyTorch datasets
def to_tensor(x):
    return torch.FloatTensor(x)

# Create datasets (using scaled data for neural network!)
train_dataset = TensorDataset(to_tensor(X_train_scaled), to_tensor(y_train))
val_dataset = TensorDataset(to_tensor(X_val_scaled), to_tensor(y_val))
test_dataset = TensorDataset(to_tensor(X_test_scaled), to_tensor(y_test))

# Create dataloaders
batch_size = 256
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

print(f"📦 DataLoaders created with batch_size={batch_size}")
print(f"  • Training batches: {len(train_loader)}")
print(f"  • Validation batches: {len(val_loader)}")
print(f"  • Test batches: {len(test_loader)}")

In [None]:
# Training function
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        
        optimizer.zero_grad()
        predictions = model(X_batch)
        loss = criterion(predictions, y_batch)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item() * len(y_batch)
    
    return total_loss / len(loader.dataset)

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    predictions = []
    targets = []
    
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            pred = model(X_batch)
            loss = criterion(pred, y_batch)
            total_loss += loss.item() * len(y_batch)
            predictions.append(pred.cpu().numpy())
            targets.append(y_batch.cpu().numpy())
    
    predictions = np.concatenate(predictions)
    targets = np.concatenate(targets)
    
    return total_loss / len(loader.dataset), predictions, targets

In [None]:
# Train the neural network
print("🚀 Training Neural Network on GPU...")
print("=" * 60)

# Setup
criterion = nn.MSELoss()
optimizer = optim.Adam(nn_model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)

# Training history
history = {'train_loss': [], 'val_loss': []}
best_val_loss = float('inf')
patience = 20
patience_counter = 0
n_epochs = 200

start_time = time()

for epoch in range(n_epochs):
    # Train
    train_loss = train_epoch(nn_model, train_loader, optimizer, criterion, device)
    
    # Evaluate
    val_loss, _, _ = evaluate(nn_model, val_loader, criterion, device)
    
    # Record history
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    
    # Learning rate scheduling
    scheduler.step(val_loss)
    
    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
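        # Save best model (clone the tensors: state_dict().copy() is a shallow copy
        # whose tensors would keep changing as training continues)
        best_model_state = {k: v.detach().clone() for k, v in nn_model.state_dict().items()}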
    else:
        patience_counter += 1
    
    if patience_counter >= patience:
        print(f"\n⏹️ Early stopping at epoch {epoch+1}")
        break
    
    # Print progress
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1:3d}/{n_epochs}: Train Loss={train_loss:.6f}, Val Loss={val_loss:.6f}")

nn_train_time = time() - start_time

# Load best model
nn_model.load_state_dict(best_model_state)

print(f"\n✅ Training complete in {nn_train_time:.2f} seconds!")
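
The patience logic in the training loop above can be isolated into a small pure-Python helper to see exactly when it fires. This is a sketch for illustration only: `early_stopping_epoch` and the list of validation losses are hypothetical, not part of the lab's code.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch       # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch, best_epoch                  # patience exhausted: stop here
    return len(val_losses), best_epoch                # never triggered: ran all epochs

losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"Stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
# Stopped at epoch 6, best epoch was 3
```

Training stops three epochs after the best one, which is why the weights from the best epoch have to be saved and restored rather than using whatever the model holds when the loop exits.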

In [None]:
# Evaluate neural network
print("📊 Evaluating Neural Network...")

# Get predictions
start_time = time()
_, nn_pred_train, y_train_check = evaluate(nn_model, train_loader, criterion, device)
_, nn_pred_val, y_val_check = evaluate(nn_model, val_loader, criterion, device)
_, nn_pred_test, y_test_check = evaluate(nn_model, test_loader, criterion, device)
nn_inference_time = time() - start_time

# Calculate metrics
nn_metrics = {
    # train_loader shuffles, so compare against the targets returned with the predictions
    'train_rmse': np.sqrt(mean_squared_error(y_train_check, nn_pred_train)),
    'val_rmse': np.sqrt(mean_squared_error(y_val, nn_pred_val)),
    'test_rmse': np.sqrt(mean_squared_error(y_test, nn_pred_test)),
    'test_r2': r2_score(y_test, nn_pred_test),
    'test_mae': mean_absolute_error(y_test, nn_pred_test),
    'train_time': nn_train_time,
    'inference_time': nn_inference_time
}

print(f"\n📊 Results:")
print(f"  • Training RMSE:   ${nn_metrics['train_rmse']*100000:,.0f}")
print(f"  • Validation RMSE: ${nn_metrics['val_rmse']*100000:,.0f}")
print(f"  • Test RMSE:       ${nn_metrics['test_rmse']*100000:,.0f}")
print(f"  • Test R² Score:   {nn_metrics['test_r2']:.4f}")
print(f"  • Test MAE:        ${nn_metrics['test_mae']*100000:,.0f}")
print(f"\n⏱️ Timing:")
print(f"  • Training Time:   {nn_metrics['train_time']:.2f} seconds")
print(f"  • Inference Time:  {nn_metrics['inference_time']*1000:.2f} ms")

In [None]:
# Plot training curves
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(history['train_loss'], label='Train Loss', alpha=0.8)
ax.plot(history['val_loss'], label='Validation Loss', alpha=0.8)
ax.set_xlabel('Epoch')
ax.set_ylabel('MSE Loss')
ax.set_title('Neural Network Training Curves')
ax.legend()
ax.set_yscale('log')

plt.tight_layout()
plt.show()

---

## Part 5: Head-to-Head Comparison

Now let's compare XGBoost and Neural Network side-by-side!

In [None]:
# Create comparison DataFrame
comparison = pd.DataFrame({
    'Metric': [
        'Test RMSE ($)',
        'Test R² Score',
        'Test MAE ($)',
        'Training Time (s)',
        'Inference Time (ms)',
        'Feature Importance',
        'Preprocessing Required'
    ],
    'XGBoost': [
        f"${xgb_metrics['test_rmse']*100000:,.0f}",
        f"{xgb_metrics['test_r2']:.4f}",
        f"${xgb_metrics['test_mae']*100000:,.0f}",
        f"{xgb_metrics['train_time']:.2f}",
        f"{xgb_metrics['inference_time']*1000:.2f}",
        '✅ Built-in',
        '❌ No'
    ],
    'Neural Network': [
        f"${nn_metrics['test_rmse']*100000:,.0f}",
        f"{nn_metrics['test_r2']:.4f}",
        f"${nn_metrics['test_mae']*100000:,.0f}",
        f"{nn_metrics['train_time']:.2f}",
        f"{nn_metrics['inference_time']*1000:.2f}",
        '⚠️ Requires extra work',
        '✅ Yes (scaling)'
    ]
})

# Determine winner for each metric
winners = []
for i, metric in enumerate(comparison['Metric']):
    if metric in ['Test RMSE ($)', 'Test MAE ($)', 'Training Time (s)', 'Inference Time (ms)']:
        # Lower is better
        xgb_val = float(comparison['XGBoost'].iloc[i].replace('$', '').replace(',', '').replace(' s', '').replace(' ms', ''))
        nn_val = float(comparison['Neural Network'].iloc[i].replace('$', '').replace(',', '').replace(' s', '').replace(' ms', ''))
        winners.append('XGBoost' if xgb_val < nn_val else 'Neural Network')
    elif metric == 'Test R² Score':
        # Higher is better
        xgb_val = float(comparison['XGBoost'].iloc[i])
        nn_val = float(comparison['Neural Network'].iloc[i])
        winners.append('XGBoost' if xgb_val > nn_val else 'Neural Network')
    else:
        winners.append('-')

comparison['Winner'] = winners

print("🏆 XGBoost vs Neural Network: Head-to-Head Comparison")
print("=" * 70)
print(comparison.to_string(index=False))

In [None]:
# Visual comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Predictions vs Actual
ax1 = axes[0]
ax1.scatter(y_test, xgb_pred_test, alpha=0.5, label='XGBoost', s=20)
ax1.scatter(y_test, nn_pred_test, alpha=0.5, label='Neural Net', s=20)
ax1.plot([0, 5], [0, 5], 'r--', linewidth=2, label='Perfect')
ax1.set_xlabel('Actual Price ($100K)')
ax1.set_ylabel('Predicted Price ($100K)')
ax1.set_title('Predictions vs Actual')
ax1.legend()
ax1.set_xlim(0, 5.5)
ax1.set_ylim(0, 5.5)

# 2. Residual distributions
ax2 = axes[1]
xgb_residuals = y_test - xgb_pred_test
nn_residuals = y_test - nn_pred_test
ax2.hist(xgb_residuals, bins=50, alpha=0.6, label='XGBoost', density=True)
ax2.hist(nn_residuals, bins=50, alpha=0.6, label='Neural Net', density=True)
ax2.axvline(0, color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('Residual ($100K)')
ax2.set_ylabel('Density')
ax2.set_title('Residual Distribution')
ax2.legend()

# 3. Timing comparison
ax3 = axes[2]
metrics_names = ['Training\nTime (s)', 'Inference\nTime (ms)']
xgb_times = [xgb_metrics['train_time'], xgb_metrics['inference_time']*1000]
nn_times = [nn_metrics['train_time'], nn_metrics['inference_time']*1000]

x = np.arange(len(metrics_names))
width = 0.35
ax3.bar(x - width/2, xgb_times, width, label='XGBoost', color='steelblue')
ax3.bar(x + width/2, nn_times, width, label='Neural Net', color='coral')
ax3.set_ylabel('Time')
ax3.set_title('Training & Inference Time')
ax3.set_xticks(x)
ax3.set_xticklabels(metrics_names)
ax3.legend()
ax3.set_yscale('log')

plt.tight_layout()
plt.show()

---

## Part 6: Analysis - When Does Each Excel?

### Why XGBoost Usually Wins on Tabular Data

Based on our experiment and research, here's why:

In [None]:
# Summary analysis
print("📊 Analysis: XGBoost vs Neural Networks on Tabular Data")
print("=" * 70)

analysis = """
┌──────────────────────────────────────────────────────────────────────┐
│                    WHY XGBOOST WINS ON TABULAR DATA                  │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. AXIS-ALIGNED SPLITS                                              │
│     Trees naturally handle "if income > $50K" decisions              │
│     Neural nets must learn these boundaries from scratch             │
│                                                                      │
│  2. HETEROGENEOUS FEATURES                                           │
│     Age, income, location are fundamentally different                │
│     Trees handle each feature independently                          │
│     Neural nets struggle with mixed feature semantics                │
│                                                                      │
│  3. NO PREPROCESSING NEEDED                                          │
│     XGBoost: Raw data in, predictions out                            │
│     Neural nets: Need scaling, encoding, careful initialization      │
│                                                                      │
│  4. BUILT-IN REGULARIZATION                                          │
│     Tree depth, min samples, L1/L2 penalties                         │
│     Neural nets: Dropout, weight decay, early stopping               │
│                                                                      │
│  5. TRAINING EFFICIENCY                                              │
│     XGBoost: Seconds to minutes                                      │
│     Neural nets: Minutes to hours                                    │
│                                                                      │
│  6. INTERPRETABILITY                                                 │
│     XGBoost: Feature importance is built-in and meaningful           │
│     Neural nets: Require SHAP, LIME, or other tools                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│                    WHEN NEURAL NETWORKS WIN                          │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. VERY LARGE DATASETS (>1M samples)                                │
│     Neural nets can keep improving with more data                    │
│                                                                      │
│  2. COMPLEX FEATURE INTERACTIONS                                     │
│     When relationships aren't axis-aligned                           │
│                                                                      │
│  3. TRANSFER LEARNING                                                │
│     Pre-trained embeddings from related tasks                        │
│     (e.g., text/image features as columns)                           │
│                                                                      │
│  4. MULTI-MODAL DATA                                                 │
│     Tables + Images + Text combined                                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
"""
print(analysis)

---

## ✋ Try It Yourself

Now it's your turn! Complete these exercises to solidify your understanding.

### Exercise 1: Tune XGBoost Hyperparameters

Try different hyperparameters and see if you can beat the default XGBoost model.

<details>
<summary>💡 Hint</summary>
Try adjusting:
- `max_depth`: 3-10 (lower = less overfitting)
- `learning_rate`: 0.01-0.3 (lower = needs more trees)
- `n_estimators`: 100-1000 (more trees = better, but slower)
- `min_child_weight`: 1-10 (higher = more conservative)
</details>

In [None]:
# Exercise 1: Your code here
# Try to beat the default XGBoost score!

# Suggested starting point - uncomment and fill in the blanks:
# tuned_params = {
#     'objective': 'reg:squarederror',
#     'max_depth': 5,            # Try: 4, 5, 6, 7, 8
#     'learning_rate': 0.1,      # Try: 0.05, 0.1, 0.2
#     'n_estimators': 200,       # Try: 200, 300, 500
#     'min_child_weight': 3,     # Try: 1, 3, 5
#     'subsample': 0.8,
#     'colsample_bytree': 0.8,
#     'device': 'cuda' if torch.cuda.is_available() else 'cpu',
#     'random_state': 42
# }
#
# tuned_model = xgb.XGBRegressor(**tuned_params)
# tuned_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
# tuned_pred = tuned_model.predict(X_test)
# tuned_rmse = np.sqrt(mean_squared_error(y_test, tuned_pred))
# print(f"Tuned RMSE: ${tuned_rmse*100000:,.0f} vs Original: ${xgb_metrics['test_rmse']*100000:,.0f}")

### Exercise 2: Experiment with Neural Network Architecture

Try different architectures and see if a neural network can match XGBoost.

<details>
<summary>💡 Hint</summary>
Try:
- Deeper networks: [512, 256, 128, 64]
- Wider networks: [512, 512, 256]
- Different activations: LeakyReLU, GELU
- More/less dropout
</details>

In [None]:
# Exercise 2: Your code here
# Try different neural network architectures

# Example: Create a deeper model
# deep_model = TabularMLP(
#     input_dim=8, 
#     hidden_dims=[____],  # Try different architectures
#     dropout=____         # Try different dropout rates
# ).to(device)

# Train and evaluate...

### Exercise 3: Cross-Validation Comparison

Use 5-fold cross-validation to get a more robust comparison.

<details>
<summary>💡 Hint</summary>
Use `sklearn.model_selection.cross_val_score` with `scoring='neg_root_mean_squared_error'`
</details>

In [None]:
# Exercise 3: Your code here
# Use cross-validation for a fairer comparison

from sklearn.model_selection import cross_val_score

# cv_model = xgb.XGBRegressor(**xgb_params)
# cv_scores = cross_val_score(cv_model, X, y, cv=5, scoring='neg_root_mean_squared_error')
# print(f"Cross-validation RMSE: ${-cv_scores.mean()*100000:,.0f} (+/- ${cv_scores.std()*100000:,.0f})")

---

## ⚠️ Common Mistakes

### Mistake 1: Not Using Early Stopping with XGBoost

In [None]:
# ❌ Wrong: Training until all n_estimators complete (may overfit)
# bad_model = xgb.XGBRegressor(n_estimators=1000)
# bad_model.fit(X_train, y_train)  # No validation set, no early stopping!

# ✅ Right: Use early stopping with validation set
good_model = xgb.XGBRegressor(
    n_estimators=1000,
    early_stopping_rounds=10,  # Stop if no improvement for 10 rounds
    device='cuda' if torch.cuda.is_available() else 'cpu'  # Fall back to CPU without a GPU
)
good_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
# best_iteration is a 0-based index, so the best model uses best_iteration + 1 trees
print(f"✅ Best model uses {good_model.best_iteration + 1} trees (instead of 1000)!")
print(f"   This helps prevent overfitting and saves time.")

### Mistake 2: Scaling Features for XGBoost

In [None]:
# ❌ Wrong: Scaling features for tree-based models (unnecessary!)
# Tree-based models split on feature values - scaling changes nothing!
# model = xgb.XGBRegressor()
# model.fit(StandardScaler().fit_transform(X_train), y_train)  # Wastes time

# ✅ Right: Use raw features
print("💡 XGBoost doesn't need feature scaling!")
print("   Trees split on 'feature > threshold', so scaling doesn't change anything.")
print("   Save StandardScaler for neural networks only.")

### Mistake 3: Not Using GPU Acceleration

In [None]:
# ❌ Wrong: Using CPU on DGX Spark (wastes GPU!)
# model = xgb.XGBRegressor(tree_method='hist')  # Defaults to CPU

# ✅ Right: Use GPU acceleration when it is available
xgb_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = xgb.XGBRegressor(
    tree_method='hist',
    device=xgb_device  # Use GPU!
)
print(f"✅ XGBoost is configured to train on: {xgb_device}")
print("   On DGX Spark, GPU training can be 2-5x faster than CPU for large datasets.")

### Mistake 4: Forgetting to Scale for Neural Networks

In [None]:
# ❌ Wrong: Using raw features for neural networks
# Neural networks are VERY sensitive to feature scale!
# model.fit(X_train, y_train)  # Features have wildly different ranges

# ✅ Right: Always scale features for neural networks
print("💡 Neural networks NEED feature scaling!")
print("   Income ranges from 0.5-15, but Population can be 3-35,000")
print("   Without scaling, gradients will be dominated by large features.")
print("\n   Use: scaler.fit_transform(X_train), then scaler.transform(...) for validation and test")
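
What standardization does, and why the statistics should come from the training rows only, can be sketched in plain NumPy. The data here is synthetic (two columns drawn from the income and population ranges quoted above), and the `_demo` names are hypothetical so they do not clash with the lab's variables.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(0.5, 15, size=1000)          # small-scale feature
population = rng.uniform(3, 35_000, size=1000)    # large-scale feature
X_demo = np.column_stack([income, population])
X_train_demo, X_test_demo = X_demo[:800], X_demo[800:]

# Compute statistics on the training rows only, so nothing about the
# test rows leaks into preprocessing
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

X_train_demo_scaled = (X_train_demo - train_mean) / train_std
X_test_demo_scaled = (X_test_demo - train_mean) / train_std   # reuse training statistics

print("Train column means:", X_train_demo_scaled.mean(axis=0).round(6))
print("Train column stds: ", X_train_demo_scaled.std(axis=0).round(6))
print("Test column means: ", X_test_demo_scaled.mean(axis=0).round(3))
```

After scaling, both training columns have mean 0 and standard deviation 1, so neither dominates the gradients. The test columns land close to, but not exactly at, those values, which is expected when the training statistics are reused.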

---

## 🎉 Checkpoint

Congratulations! You've completed the Tabular Data Challenge. You've learned:

- ✅ **XGBoost basics**: Training, prediction, and feature importance
- ✅ **Neural network comparison**: Fair benchmark with modern architecture
- ✅ **Why XGBoost wins**: Axis-aligned splits, heterogeneous features, no preprocessing
- ✅ **When to use each**: Tabular → XGBoost, Complex/Multi-modal → Neural Networks
- ✅ **GPU acceleration**: Both models run faster on your DGX Spark!

---

## 🚀 Challenge (Optional)

**The Ultimate Challenge:** Can you find a tabular dataset where a neural network beats XGBoost?

Some candidates to try:
1. **Forest Cover Type** (sklearn.datasets.fetch_covtype) - 581K samples
2. **Higgs Boson** (from UCI) - 11M samples
3. **Click-Through Rate Prediction** - Kaggle datasets

Neural networks tend to win when:
- Dataset is very large (>1M samples)
- Features have complex interactions
- You can use pre-trained embeddings

---

## 📖 Further Reading

- [Why do tree-based models still outperform deep learning on tabular data? (2022)](https://arxiv.org/abs/2207.08815)
- [XGBoost: A Scalable Tree Boosting System (Original Paper)](https://arxiv.org/abs/1603.02754)
- [TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442)
- [Deep Neural Networks and Tabular Data: A Survey](https://arxiv.org/abs/2110.01889)

---

## 🧹 Cleanup

In [None]:
# Clear GPU memory
import torch
import gc

# Delete models
del nn_model
del xgb_model

# Clear CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Garbage collection
gc.collect()

print("✅ Memory cleaned up!")
if torch.cuda.is_available():
    print(f"   GPU Memory: {torch.cuda.memory_allocated()/1e6:.1f} MB allocated")

---

## ➡️ Next Steps

Continue to **Lab 1.6.2: Hyperparameter Optimization** to learn how to use Optuna to automatically find the best hyperparameters!