# AutoML with FLAML

In this assignment, you'll learn how to use **FLAML** (Fast Lightweight AutoML) for automated machine learning.

## What is AutoML?
AutoML automates the process of:
- **Model Selection**: Tries different algorithms (XGBoost, LightGBM, Random Forest, etc.)
- **Hyperparameter Tuning**: Finds best parameters for each model
- **Feature Engineering**: Creates and selects useful features (offered by some AutoML tools; FLAML concentrates on model selection and hyperparameter tuning)
- **Model Comparison**: Identifies the best performing model
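
The loop behind these steps can be sketched in a few lines of plain Python. This is a toy illustration only: the two "estimators", their hyperparameter ranges, and the random search are hypothetical stand-ins, and FLAML's actual search strategy is considerably more sophisticated.

```python
import random

def threshold_model(x, cut):
    """Toy estimator A: predict 1 when x exceeds a cut-off."""
    return 1 if x > cut else 0

def band_model(x, width):
    """Toy estimator B: predict 1 when x lies within `width` of 5."""
    return 1 if abs(x - 5.0) < width else 0

# Hypothetical search space: name -> (predict function, hyperparameter range)
SEARCH_SPACE = {
    "threshold": (threshold_model, (0.0, 10.0)),
    "band": (band_model, (0.0, 5.0)),
}

def validation_error(predict, param, data):
    """Fraction of validation samples the model gets wrong."""
    wrong = sum(predict(x, param) != label for x, label in data)
    return wrong / len(data)

def toy_automl(val_data, n_trials=200, seed=42):
    rng = random.Random(seed)
    best = {"estimator": None, "config": None, "loss": float("inf")}
    for _ in range(n_trials):
        name = rng.choice(sorted(SEARCH_SPACE))      # model selection
        predict, (low, high) = SEARCH_SPACE[name]
        param = rng.uniform(low, high)               # hyperparameter tuning
        loss = validation_error(predict, param, val_data)
        if loss < best["loss"]:                      # model comparison
            best = {"estimator": name, "config": param, "loss": loss}
    return best

# Validation data: x runs from 0 to 10 in steps of 0.5, label is 1 when x > 6
val_data = [(x / 2, int(x / 2 > 6)) for x in range(21)]
best = toy_automl(val_data)
print(best["estimator"], round(best["config"], 2), round(best["loss"], 3))
```

Real AutoML tools replace the trial count with a time budget and replace the random draws with a smarter search, but the structure (propose, evaluate on validation data, keep the best) stays the same.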

## Why FLAML?
- **Fast**: Efficient search algorithms
- **Simple**: Easy API, minimal code
- **Smart**: Uses cost-aware hyperparameter search (CFO, BlendSearch) that favors cheap configurations early on
- **Interpretable**: Provides feature importance and model insights

## Learning Objectives
- Understand how AutoML works
- Use FLAML to automatically find the best model
- Interpret feature importance
- Compare AutoML results with manual approaches

## Setup and Installation

In [2]:
%pip install flaml scikit-learn matplotlib seaborn pandas numpy xgboost lightgbm shap


## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from flaml import AutoML

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Load and Prepare Dataset

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Dataset: Breast Cancer Wisconsin (Diagnostic)")
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")

## 2. Create and Configure AutoML

### Task: Set up FLAML AutoML

**Key Parameters:**
- `task`: Type of ML task ('classification' or 'regression')
- `time_budget`: Maximum time in seconds for searching (60-300 for this exercise)
- `metric`: Metric to optimize ('accuracy', 'roc_auc', 'f1', etc.)
- `estimator_list`: Which models to try (default: tries multiple algorithms)
- `log_file_name`: Where to save search logs
- `seed`: Random seed for reproducibility

**Available Estimators in FLAML:**
- `'lgbm'`: LightGBM
- `'xgboost'`: XGBoost
- `'rf'`: Random Forest
- `'extra_tree'`: Extra Trees
- `'lrl1'`: Logistic Regression with L1 regularization
- `'lrl2'`: Logistic Regression with L2 regularization
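
One convenient pattern is to collect these parameters in a dictionary and unpack it into `fit()`. The sketch below only builds and prints such a dictionary; the name `automl_settings` and the specific values are illustrative assumptions, not tuned recommendations.

```python
# Hypothetical settings for a first run; values are illustrative, not tuned
automl_settings = {
    "task": "classification",
    "time_budget": 60,                  # seconds
    "metric": "accuracy",
    "estimator_list": ["lgbm", "xgboost", "rf", "extra_tree", "lrl1", "lrl2"],
    "log_file_name": "automl_log.txt",
    "seed": 42,
}

# Later, once an AutoML object exists:
#     automl.fit(X_train, y_train, **automl_settings)
for name, value in automl_settings.items():
    print(f"{name:>15}: {value}")
```

Keeping the settings in one place makes it easy to rerun the search with a single value changed, such as a longer `time_budget`.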

**Hints:**
1. Create an `AutoML()` object
2. Call its `fit()` method with `X_train` and `y_train`
3. Set `task='classification'`
4. Set a `time_budget` (start with 60 seconds)

**TODO:** Complete the AutoML setup and training.

In [None]:
# TODO: Create AutoML instance
# automl = AutoML()

# TODO: Configure and train
# automl.fit(
#     X_train, y_train,
#     task='classification',
#     time_budget=60,
#     metric='accuracy',
#     log_file_name='automl_log.txt',
#     seed=42
# )

# print("AutoML training complete!")

## 3. Examine AutoML Results

### Task: Explore what AutoML discovered

**Useful Attributes:**
- `automl.best_estimator`: Name of the best model found
- `automl.best_config`: Best hyperparameters
- `automl.best_loss`: Best validation loss
- `automl.model`: The trained model object
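
FLAML minimizes a loss, so when the metric is accuracy the reported `best_loss` is `1 - accuracy`. The plain-Python sketch below shows that conversion on made-up validation labels; the lists and the `accuracy` helper are hypothetical and do not come from a FLAML run.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up validation labels and predictions (6 of 8 are correct)
y_val_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_val_pred = [1, 0, 1, 0, 0, 1, 1, 1]

val_accuracy = accuracy(y_val_true, y_val_pred)
loss = 1 - val_accuracy            # what a loss-minimizing search would report
recovered_accuracy = 1 - loss      # how to turn the loss back into accuracy

print(f"loss={loss:.4f} accuracy={recovered_accuracy:.4f}")
```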

**TODO:** Print information about the best model found.

In [None]:
# TODO: Print best estimator name
# print(f"Best estimator: {automl.best_estimator}")

# TODO: Print best configuration
# print(f"\nBest configuration:")
# for param, value in automl.best_config.items():
#     print(f"  {param}: {value}")

# TODO: Print best validation accuracy
# print(f"\nBest validation accuracy: {1 - automl.best_loss:.4f}")

## 4. Evaluate on Test Set

### Task: Make predictions and evaluate performance

**Hints:**
- Use `automl.predict()` for class predictions
- Use `automl.predict_proba()` for probability predictions
- Calculate accuracy, ROC AUC, and generate classification report
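
Accuracy is computed from hard class labels, while ROC AUC is computed from scores such as predicted probabilities, which is why both kinds of prediction are needed. The sketch below computes both by hand on made-up values; the 0.5 threshold and the helper `roc_auc_by_pairs` are assumptions for illustration, and in the exercise itself `accuracy_score` and `roc_auc_score` from scikit-learn do this work.

```python
def roc_auc_by_pairs(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up true labels and predicted probabilities for class 1
y_true = [0, 0, 1, 1, 1, 0]
proba = [0.10, 0.60, 0.80, 0.55, 0.90, 0.30]

# Hard labels, assuming a 0.5 decision threshold
labels = [int(p >= 0.5) for p in proba]

acc = sum(t == p for t, p in zip(y_true, labels)) / len(y_true)
auc = roc_auc_by_pairs(y_true, proba)
print(f"accuracy={acc:.4f} roc_auc={auc:.4f}")
```

Here one negative sample (probability 0.60) crosses the threshold and costs accuracy, and the same sample outranks one positive (0.55), which costs ROC AUC.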

**TODO:** Evaluate the AutoML model on the test set.

In [None]:
# TODO: Make predictions
# y_pred = automl.predict(X_test)
# y_pred_proba = automl.predict_proba(X_test)[:, 1]

# TODO: Calculate metrics
# accuracy = accuracy_score(y_test, y_pred)
# roc_auc = roc_auc_score(y_test, y_pred_proba)

# TODO: Print results
# print(f"\nTest Set Performance:")
# print(f"Accuracy: {accuracy:.4f}")
# print(f"ROC AUC: {roc_auc:.4f}")
# print(f"\nClassification Report:")
# print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

## 6. SHAP Feature Importance

### Task: Use SHAP to explain model predictions

**What is SHAP?**
SHAP (SHapley Additive exPlanations) shows:
- Which features contribute most to predictions
- Direction of impact (positive/negative)
- Per-sample explanations (not just global)
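
To see what a Shapley value is, it helps to compute one exactly for a tiny model. The sketch below averages each feature's marginal contribution over every feature ordering. It is a brute-force illustration under simplifying assumptions: the scoring function, the two feature names, and the single baseline point are made up, whereas the `shap` library uses background data and much faster algorithms for tree models.

```python
from itertools import permutations

def model(features):
    """Toy scoring function with an interaction term (stand-in for a trained model)."""
    r, t = features["radius"], features["texture"]
    return 2.0 * r + 1.0 * t + 0.5 * r * t

def shapley_values(predict, sample, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(sample)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the baseline point
        previous = predict(current)
        for name in order:
            current[name] = sample[name]  # switch one feature to its real value
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

baseline = {"radius": 1.0, "texture": 2.0}   # hypothetical reference point
sample = {"radius": 3.0, "texture": 4.0}     # the prediction to explain

phi = shapley_values(model, sample, baseline)
base_value = model(baseline)
print(phi, base_value, model(sample))
```

The contributions add up: the baseline prediction plus the sum of the Shapley values equals the prediction for the sample, which is the additive property that waterfall plots visualize.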

**TODO:** Create SHAP visualizations.

**Hints:**
1. Create `shap.TreeExplainer(automl.model.estimator)` for tree models
2. Calculate SHAP values with `explainer.shap_values(X_test)`
3. Use `shap.summary_plot()` for a beeswarm plot, which shows importance, direction, and distribution in a single view
4. Try `shap.waterfall_plot()` for single prediction

In [None]:
# TODO: Create SHAP explainer for tree-based models
# if automl.best_estimator in ['lgbm', 'xgboost', 'rf', 'extra_tree']:
#     explainer = shap.TreeExplainer(automl.model.estimator)
#     
#     # TODO: Calculate SHAP values for test set
#     shap_values = explainer.shap_values(X_test)
#     base_value = explainer.expected_value
#     
#     # Depending on the model and SHAP version, binary classifiers may return
#     # one set of values per class; keep class 1 (benign) in that case
#     if isinstance(shap_values, list):
#         shap_values = shap_values[1]
#     elif np.ndim(shap_values) == 3:
#         shap_values = shap_values[:, :, 1]
#     if np.ndim(base_value) > 0:
#         base_value = np.ravel(base_value)[-1]
#     
#     # TODO: Summary plot (beeswarm)
#     # Shows feature importance + direction + distribution
#     shap.summary_plot(shap_values, X_test, plot_type="dot")
#     
#     # TODO: Bar plot - Simple feature importance
#     shap.summary_plot(shap_values, X_test, plot_type="bar")
#     
#     # TODO: Waterfall plot for single prediction
#     # Shows how each feature contributes to one specific prediction
#     shap.waterfall_plot(shap.Explanation(values=shap_values[0],
#                                           base_values=base_value,
#                                           data=X_test.iloc[0].values,
#                                           feature_names=list(X_test.columns)))
# else:
#     print("SHAP TreeExplainer only works with tree-based models")

## 8. Compare AutoML with Specific Estimators

### Task: Run AutoML with only specific estimators

**TODO:** Try AutoML with only tree-based models vs only linear models.

**Hints:**
- Use `estimator_list` parameter
- Tree-based: `['lgbm', 'xgboost', 'rf']`
- Linear: `['lrl1', 'lrl2']`

In [None]:
# TODO: Run with tree-based models only
# automl_tree = AutoML()
# automl_tree.fit(
#     X_train, y_train,
#     task='classification',
#     time_budget=60,
#     metric='accuracy',
#     estimator_list=['lgbm', 'xgboost', 'rf'],
#     seed=42
# )

# TODO: Run with linear models only
# automl_linear = AutoML()
# automl_linear.fit(
#     X_train, y_train,
#     task='classification',
#     time_budget=60,
#     metric='accuracy',
#     estimator_list=['lrl1', 'lrl2'],
#     seed=42
# )

# TODO: Compare results
# print(f"Tree-based best: {automl_tree.best_estimator} - Accuracy: {accuracy_score(y_test, automl_tree.predict(X_test)):.4f}")
# print(f"Linear best: {automl_linear.best_estimator} - Accuracy: {accuracy_score(y_test, automl_linear.predict(X_test)):.4f}")