# Feature Engineering Comparison - AutoGluon Presets

## 🎯 Objective
Compare AutoGluon performance across different presets to understand the **speed vs. accuracy tradeoff**.

**Task**: Binary Classification  
**Dataset**: Titanic  
**Target**: `Survived`  
**Comparison**: Different AutoGluon presets  

## 📋 What This Notebook Does
1. Load Titanic dataset
2. Train with multiple presets (fast → slow, low → high quality)
3. Compare performance, training time, and model complexity
4. Compare feature importance across presets (AutoGluon's automatic feature engineering runs in every case)

## 📦 Install Dependencies

In [None]:
!pip install -q autogluon

## 📚 Import Libraries

In [None]:
import time
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

## 📥 Load Dataset

In [None]:
# Load Titanic dataset
train = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
test = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')

LABEL = "Survived"

print("✅ Data loaded!")
print(f"   Train: {train.shape}")
print(f"   Test:  {test.shape}")
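
The cells further down score each predictor on the same `train` frame it was fitted on, which tends to flatter every preset. One way around that, sketched here, is to set aside part of `train` before fitting. `holdout_split`, `fit_part`, and `holdout` are hypothetical names, and the 20% fraction and the seed are arbitrary assumptions. The function relies only on the pandas `DataFrame.sample` and `DataFrame.drop` methods and assumes the frame has a unique index, as a freshly loaded CSV does.

```python
def holdout_split(df, holdout_frac=0.2, seed=0):
    """Split a DataFrame into (fit_part, holdout) with disjoint row indices."""
    # Randomly pick the rows reserved for evaluation (reproducible via seed).
    holdout = df.sample(frac=holdout_frac, random_state=seed)
    # Everything that was not sampled stays available for training.
    fit_part = df.drop(holdout.index)
    return fit_part, holdout

# Hypothetical usage with the `train` frame loaded above:
# fit_part, holdout = holdout_split(train)
# Fit each preset on `fit_part`, then call predictor.evaluate(holdout).
```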

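## 🏃 Preset 1: Optimize for Deployment (Smallest Footprint)

Focus: **Small footprint and simplicity**
- Trains with the default settings, then deletes models the final predictor does not need
- Small model size
- Fast inference
- Feature engineering is unchanged; presets do not alter AutoGluon's default feature generator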

In [None]:
print("🏃 Training with 'optimize_for_deployment' preset...\n")
start = time.time()

predictor_fast = TabularPredictor(
    label=LABEL,
    path="ag-fast"
).fit(
    train,
    presets="optimize_for_deployment",
    time_limit=180  # 3 minutes
)

time_fast = time.time() - start
print(f"\n⏱️ Training time: {time_fast:.1f} seconds")
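
`optimize_for_deployment` is a cleanup preset rather than a quality level, so it is commonly combined with a quality preset by passing a list to `presets`. The following is a minimal sketch under that assumption; `COMPACT_PRESETS`, `fit_compact`, and the `ag-compact` path are hypothetical names, and AutoGluon is imported inside the function so the preset list can be inspected without it installed.

```python
# When a list is given, later presets override earlier ones
# wherever they set the same argument.
COMPACT_PRESETS = ["good_quality", "optimize_for_deployment"]

def fit_compact(train_data, label, path="ag-compact", time_limit=180):
    """Fit with a quality preset, then keep only what inference needs."""
    from autogluon.tabular import TabularPredictor  # deferred import

    return TabularPredictor(label=label, path=path).fit(
        train_data,
        presets=COMPACT_PRESETS,
        time_limit=time_limit,
    )

# Hypothetical usage: predictor_compact = fit_compact(train, LABEL)
```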

## ⚖️ Preset 2: Medium Quality (Balanced)

Focus: **Balance between speed and accuracy**
- Moderate training time
- Good performance
- Reasonable model size
- Default automatic feature engineering (this is also AutoGluon's default preset)

In [None]:
print("⚖️ Training with 'medium_quality' preset...\n")
start = time.time()

predictor_medium = TabularPredictor(
    label=LABEL,
    path="ag-medium"
).fit(
    train,
    presets="medium_quality",
    time_limit=300  # 5 minutes
)

time_medium = time.time() - start
print(f"\n⏱️ Training time: {time_medium:.1f} seconds")

## 🏆 Preset 3: Best Quality (Most Accurate)

Focus: **Maximum accuracy**
- Longer training time
- Best performance
- Bagged and stacked ensembles
- Standard automatic feature engineering; the accuracy gain comes from ensembling, not extra features

In [None]:
print("🏆 Training with 'best_quality' preset...\n")
start = time.time()

predictor_best = TabularPredictor(
    label=LABEL,
    path="ag-best"
).fit(
    train,
    presets="best_quality",
    time_limit=600  # 10 minutes
)

time_best = time.time() - start
print(f"\n⏱️ Training time: {time_best:.1f} seconds")
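
The three training cells differ only in preset name, output path, and time limit, so they can be driven from one table. This is a sketch that copies the values used above; `PRESET_RUNS` and `fit_all` are hypothetical names, and AutoGluon is imported inside the function so the table can be inspected without it installed.

```python
import time

# Preset name -> output directory and time budget, copied from the cells above.
PRESET_RUNS = {
    "optimize_for_deployment": {"path": "ag-fast", "time_limit": 180},
    "medium_quality": {"path": "ag-medium", "time_limit": 300},
    "best_quality": {"path": "ag-best", "time_limit": 600},
}

def fit_all(train_data, label, runs=PRESET_RUNS):
    """Fit one predictor per preset; return {preset: (predictor, seconds)}."""
    from autogluon.tabular import TabularPredictor  # deferred import

    fitted = {}
    for preset, cfg in runs.items():
        start = time.perf_counter()
        predictor = TabularPredictor(label=label, path=cfg["path"]).fit(
            train_data, presets=preset, time_limit=cfg["time_limit"]
        )
        # Wall-clock training time for this preset, in seconds.
        fitted[preset] = (predictor, time.perf_counter() - start)
    return fitted

# Hypothetical usage: fitted = fit_all(train, LABEL)
```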

## 📊 Performance Comparison

Compare all three presets:

In [None]:
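# Evaluate all models.
# NOTE: these scores are computed on the training data, so they are optimistic
# (especially for the bagged ensembles of best_quality). Use held-out labeled
# data for an unbiased comparison.
perf_fast = predictor_fast.evaluate(train)
perf_medium = predictor_medium.evaluate(train)
perf_best = predictor_best.evaluate(train)

# Get leaderboards (score_test is measured on `train`;
# score_val is AutoGluon's internal validation score)
lb_fast = predictor_fast.leaderboard(train, silent=True)
lb_medium = predictor_medium.leaderboard(train, silent=True)
lb_best = predictor_best.leaderboard(train, silent=True)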

# Create comparison table
comparison = pd.DataFrame({
    'Preset': ['optimize_for_deployment', 'medium_quality', 'best_quality'],
    'Training Time (s)': [time_fast, time_medium, time_best],
    'Accuracy': [
        perf_fast.get('accuracy', 'N/A'),
        perf_medium.get('accuracy', 'N/A'),
        perf_best.get('accuracy', 'N/A')
    ],
    'ROC-AUC': [
        perf_fast.get('roc_auc', 'N/A'),
        perf_medium.get('roc_auc', 'N/A'),
        perf_best.get('roc_auc', 'N/A')
    ],
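    # optimize_for_deployment deletes unused models after training,
    # so this counts the models that remain in each leaderboard.
    'Models in Leaderboard': [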
        len(lb_fast),
        len(lb_medium),
        len(lb_best)
    ]
})

print("📊 Preset Comparison:\n")
display(comparison)

comparison.to_csv('preset_comparison.csv', index=False)
print("\n💾 Saved: preset_comparison.csv")
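
`predictor.evaluate(train)` scores each predictor on rows it was trained on, so the table above is an in-sample comparison. Each leaderboard also carries a `score_val` column, AutoGluon's internal validation score, which is usually a fairer yardstick. The sketch below picks the row with the best `score_val`; `best_validation_row` is a hypothetical helper that expects plain dicts, for example from `lb_medium.to_dict("records")`.

```python
def best_validation_row(leaderboard_rows):
    """Return the leaderboard row (a dict) with the highest score_val."""
    scored = [
        row for row in leaderboard_rows
        # Skip missing scores; NaN is the only value not equal to itself.
        if row.get("score_val") is not None and row["score_val"] == row["score_val"]
    ]
    if not scored:
        raise ValueError("no rows with a usable score_val")
    # AutoGluon reports leaderboard scores in higher-is-better form.
    return max(scored, key=lambda row: row["score_val"])

# Hypothetical usage with the leaderboards computed above:
# best = best_validation_row(lb_medium.to_dict("records"))
# print(best["model"], best["score_val"])
```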

## 📈 Detailed Leaderboards

View all models for each preset:

In [None]:
print("🏃 OPTIMIZE_FOR_DEPLOYMENT Leaderboard:")
display(lb_fast.head(5))

print("\n⚖️ MEDIUM_QUALITY Leaderboard:")
display(lb_medium.head(5))

print("\n🏆 BEST_QUALITY Leaderboard:")
display(lb_best.head(5))

## 🔍 Feature Importance Comparison

In [None]:
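# Permutation importance is computed here on the training data;
# held-out labeled data gives a more reliable estimate.
fi_fast = predictor_fast.feature_importance(train)
fi_medium = predictor_medium.feature_importance(train)
fi_best = predictor_best.feature_importance(train)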

print("🔍 Feature Importance - FAST:")
display(fi_fast)

print("\n🔍 Feature Importance - MEDIUM:")
display(fi_medium)

print("\n🔍 Feature Importance - BEST:")
display(fi_best)

# Save all feature importances
fi_fast.to_csv('feature_importance_fast.csv')
fi_medium.to_csv('feature_importance_medium.csv')
fi_best.to_csv('feature_importance_best.csv')
print("\n💾 Saved all feature importance files")
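
Three separate tables are hard to compare by eye. One option, sketched below, is to line the features up by rank. `rank_table` is a hypothetical helper that takes one `{feature: importance}` dict per preset, which can be obtained with `fi_fast["importance"].to_dict()` and its counterparts.

```python
def rank_table(importances_by_preset):
    """Map each feature to its importance rank (1 = most important) per preset."""
    table = {}
    for preset, importances in importances_by_preset.items():
        # Sort feature names from highest to lowest importance.
        ordered = sorted(importances, key=importances.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            table.setdefault(feature, {})[preset] = rank
    return table

# Hypothetical usage with the tables computed above:
# ranks = rank_table({
#     "fast": fi_fast["importance"].to_dict(),
#     "medium": fi_medium["importance"].to_dict(),
#     "best": fi_best["importance"].to_dict(),
# })
# display(pd.DataFrame.from_dict(ranks, orient="index"))
```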

## 🎯 Recommendations

General guidance on when to use each preset:

In [None]:
print("🎯 PRESET RECOMMENDATIONS:\n")

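print("🏃 optimize_for_deployment:")
print("   ✓ Production deployments (fast inference)")
print("   ✓ Resource-constrained environments (small disk footprint)")
print("   ✓ When you do not need to inspect or switch between trained models")
print("   ✓ Combined with a quality preset, e.g. ['good_quality', 'optimize_for_deployment']")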

print("\n⚖️ medium_quality:")
print("   ✓ Default choice for most use cases")
print("   ✓ Good balance of speed and accuracy")
print("   ✓ Exploratory data analysis")
print("   ✓ Time-limited experiments")

print("\n🏆 best_quality:")
print("   ✓ Kaggle competitions")
print("   ✓ Critical applications (medical, finance)")
print("   ✓ When accuracy is paramount")
print("   ✓ Final production models")
print("   ✓ Benchmarking")

## 💾 Save All Models

In [None]:
import shutil

shutil.make_archive('model_fast', 'zip', predictor_fast.path)
shutil.make_archive('model_medium', 'zip', predictor_medium.path)
shutil.make_archive('model_best', 'zip', predictor_best.path)

print("✅ All models saved:")
print("   - model_fast.zip")
print("   - model_medium.zip")
print("   - model_best.zip")
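
To reuse an archived model, unpack it and point `TabularPredictor.load` at the extracted directory. The sketch below covers the unpacking step with the standard library only; `restore_archive` and the target directory are hypothetical names.

```python
import shutil
from pathlib import Path

def restore_archive(zip_path, target_dir):
    """Unpack a model archive created by shutil.make_archive; return the directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # The archives above contain the predictor files at their top level,
    # so the files land directly inside target_dir.
    shutil.unpack_archive(str(zip_path), str(target), "zip")
    return target

# Hypothetical usage:
# model_dir = restore_archive("model_fast.zip", "restored/ag-fast")
# predictor = TabularPredictor.load(str(model_dir))
```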

## 🎓 Summary

This notebook demonstrated:
1. ✅ Three different AutoGluon quality presets
2. ✅ Speed vs accuracy tradeoffs
3. ✅ Feature importance under each preset
4. ✅ Model complexity comparison

**Key Takeaways:**

| Preset | Speed | Accuracy | Use Case |
|--------|-------|----------|----------|
| optimize_for_deployment | ⚡⚡ | ⭐⭐⭐ | Compact models for production |
| medium_quality | ⚡⚡ | ⭐⭐⭐ | Default choice |
| best_quality | ⚡ | ⭐⭐⭐⭐ | Competitions, critical apps |

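**Typical Results on Titanic** (indicative only; actual numbers vary with hardware and AutoGluon version):
- Fast: ~78-80% accuracy in 1-2 minutes
- Medium: ~81-83% accuracy in 3-5 minutes
- Best: ~83-85% accuracy, with training bounded by the 10-minute `time_limit`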

**Next Steps:**
- Restrict the set of models with the `excluded_model_types` or `included_model_types` arguments to `fit`
- Experiment with `num_bag_folds` for better ensembles
- Use `hyperparameter_tune_kwargs` for advanced tuning
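
As a starting point for these experiments, each option can be passed as a keyword argument to `fit`. This is a sketch: the model keys, the fold count, and the `"auto"` tuning strategy are illustrative choices rather than recommendations, and `EXPERIMENTS`, `run_experiment`, and the `ag-<name>` paths are hypothetical names.

```python
# One entry per experiment; each dict is passed straight to fit().
EXPERIMENTS = {
    "no_knn_or_nn": {"excluded_model_types": ["KNN", "NN_TORCH"]},
    "five_fold_bagging": {"num_bag_folds": 5},
    "auto_hpo": {"hyperparameter_tune_kwargs": "auto"},
}

def run_experiment(train_data, label, name, time_limit=300):
    """Fit a predictor with the extra fit() arguments of one experiment."""
    from autogluon.tabular import TabularPredictor  # deferred import

    return TabularPredictor(label=label, path=f"ag-{name}").fit(
        train_data, time_limit=time_limit, **EXPERIMENTS[name]
    )

# Hypothetical usage: predictor = run_experiment(train, LABEL, "five_fold_bagging")
```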