# DeepBridge Auto-Distillation Demo Notebook

I'll create a Jupyter notebook that demonstrates how to use the DeepBridge library for auto-distillation of classification models. This notebook will walk through the entire process from data preparation to model evaluation.

## Understanding the DeepBridge Auto-Distillation Notebook

I've created a Jupyter notebook that demonstrates model distillation with DeepBridge's AutoDistiller on a binary classification task. Here is what it covers:

### Key Components:

1. **Data Preparation**: Using the breast cancer dataset, which is a standard binary classification problem
2. **Teacher Model**: Creating a complex Random Forest model with 500 trees
3. **DBDataset Setup**: Preparing the data structure required by DeepBridge
4. **Auto-Distillation**: Using AutoDistiller to test multiple student models and configurations
5. **Results Analysis**: Evaluating and comparing the performance of teacher vs. student models
6. **Visualization**: Comparing probability distributions and model sizes

### Workflow:

The notebook demonstrates the entire knowledge distillation workflow:
1. Start with a complex, high-performing model (the "teacher")
2. Use DeepBridge's AutoDistiller to create simplified models (the "students")
3. Automatically test different model types, temperatures, and alpha values
4. Find the optimal configuration that balances performance and simplicity
5. Compare the final distilled model against the original
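
To make the roles of temperature and alpha in step 3 concrete, here is a minimal, library-independent sketch of a distillation loss. It is a generic illustration, not DeepBridge's internal implementation: the function names (`soften`, `cross_entropy`, `distillation_loss`) are hypothetical, and the convention that `alpha` weights the teacher (soft) term is an assumption that may differ from the library's.

```python
import math


def soften(probs, temperature):
    """Rescale a probability vector: T > 1 flattens it, T < 1 sharpens it."""
    eps = 1e-12  # guard against log(0)
    scaled = [math.exp(math.log(max(p, eps)) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    eps = 1e-12
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def distillation_loss(student_probs, teacher_probs, hard_label, temperature, alpha):
    """Blend a soft (teacher) term and a hard (true label) term.

    Assumption: alpha weights the soft term and (1 - alpha) the hard term.
    """
    soft_target = soften(teacher_probs, temperature)
    soft_student = soften(student_probs, temperature)
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_probs))]
    soft_term = cross_entropy(soft_target, soft_student)
    hard_term = cross_entropy(one_hot, student_probs)
    return alpha * soft_term + (1.0 - alpha) * hard_term


teacher = [0.9, 0.1]
# T = 2 flattens [0.9, 0.1] to approximately [0.75, 0.25]
print(soften(teacher, 2.0))
print(distillation_loss([0.8, 0.2], teacher, hard_label=0, temperature=2.0, alpha=0.7))
```

A higher temperature exposes more of the teacher's relative confidence between classes, while alpha controls how much the student follows the teacher rather than the true labels.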

### Benefits:

- **Model Compression**: See how much smaller the distilled model is compared to the teacher
- **Performance Preservation**: Evaluate how well the student model retains the teacher's accuracy
- **Automated Process**: The AutoDistiller handles the complexity of testing multiple configurations
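
The first two benefits reduce to two simple ratios. The sketch below shows one way to compute them. The helper names are hypothetical, the lists are stand-ins for real models, and the numbers are illustrative inputs rather than results from the notebook.

```python
import pickle


def serialized_size_mb(obj):
    """Size of the pickled object in MB, measured as the length of its byte string."""
    return len(pickle.dumps(obj)) / (1024 * 1024)


def size_reduction_pct(teacher_mb, student_mb):
    """Percentage by which the student is smaller than the teacher."""
    return (1.0 - student_mb / teacher_mb) * 100.0


def accuracy_retention(teacher_acc, student_acc):
    """Fraction of the teacher's accuracy that the student keeps."""
    return student_acc / teacher_acc


# Stand-in objects: a large nested list for the teacher, a short list for the student
teacher_like = [[float(i)] * 100 for i in range(500)]
student_like = [0.0] * 31

teacher_mb = serialized_size_mb(teacher_like)
student_mb = serialized_size_mb(student_like)
print(f"Size reduction: {size_reduction_pct(teacher_mb, student_mb):.2f}%")

# Illustrative accuracies only, not measured values
print(f"Accuracy retention: {accuracy_retention(0.96, 0.95):.3f}")
```

The notebook's conclusion cell applies the same idea when it checks whether the student reaches at least 95% of the teacher's accuracy.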

### How to Run:

Copy the notebook into an environment that has DeepBridge installed, along with the other libraries it imports (scikit-learn, pandas, NumPy, matplotlib, and seaborn). DeepBridge is typically installed with:

```bash
pip install deepbridge
```

The demonstration uses a small number of trials (`n_trials=5`) so that it runs quickly. For production use, consider increasing `n_trials` so the optimization has more attempts to find a good configuration.
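
To get a feel for the cost of the search, the sketch below enumerates the configuration grid the notebook passes to `customize_config` (three model types, three temperatures, two alphas). It assumes `n_trials` is spent on each configuration. That is an assumption about how the trials are allocated, not something the notebook confirms.

```python
from itertools import product

# Grid values taken from the notebook's customize_config call;
# model types are written as plain strings for illustration
model_types = ["LOGISTIC_REGRESSION", "DECISION_TREE", "GBM"]
temperatures = [0.5, 1.0, 2.0]
alphas = [0.3, 0.7]
n_trials = 5

# Every (model type, temperature, alpha) combination
configs = list(product(model_types, temperatures, alphas))

# Assumption: each configuration runs n_trials optimization trials
total_fits = len(configs) * n_trials

print(f"Configurations: {len(configs)}")
print(f"Total fits if n_trials applies per configuration: {total_fits}")
```

Under that assumption, raising `n_trials` scales the total work linearly, so it is worth trimming the grid before increasing the trial count.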


# DeepBridge Auto-Distillation Demo
# ==============================
#
# This notebook demonstrates how to use DeepBridge's AutoDistiller 
# for model distillation in classification tasks.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# DeepBridge imports
from deepbridge.db_data import DBDataset
from deepbridge.auto_distiller import AutoDistiller
from deepbridge.distillation.classification.model_registry import ModelType

# For visualization
import seaborn as sns

# Set plot style (seaborn's whitegrid theme also styles matplotlib figures)
sns.set_theme(style="whitegrid")

# Seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# ---------------------------------------------------------------------
# 1. Data Preparation
# ---------------------------------------------------------------------

In [None]:
# Load breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

print(f"Dataset shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Class distribution: {y.value_counts().to_dict()}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")



# ---------------------------------------------------------------------
# 2. Create a Teacher Model (Complex Model)
# ---------------------------------------------------------------------

In [None]:
# Create a complex Random Forest model as our teacher
teacher_model = RandomForestClassifier(
    n_estimators=500,  # Use many trees
    max_depth=20,      # Allow deep trees
    min_samples_split=2,
    bootstrap=True,
    random_state=RANDOM_SEED
)

# Train the teacher model
teacher_model.fit(X_train_scaled, y_train)

# Evaluate teacher model
y_pred_teacher = teacher_model.predict(X_test_scaled)
y_prob_teacher = teacher_model.predict_proba(X_test_scaled)[:, 1]

teacher_accuracy = accuracy_score(y_test, y_pred_teacher)
teacher_auc = roc_auc_score(y_test, y_prob_teacher)

print("Teacher Model Performance:")
print(f"Accuracy: {teacher_accuracy:.4f}")
print(f"AUC-ROC: {teacher_auc:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_teacher))

# Generate teacher class probabilities for both splits (passed to DBDataset as predictions)
train_probs = teacher_model.predict_proba(X_train_scaled)
test_probs = teacher_model.predict_proba(X_test_scaled)

# Create probability DataFrames
train_probs_df = pd.DataFrame(
    train_probs, 
    columns=[f'prob_class_{i}' for i in range(train_probs.shape[1])],
    index=X_train.index
)

test_probs_df = pd.DataFrame(
    test_probs, 
    columns=[f'prob_class_{i}' for i in range(test_probs.shape[1])],
    index=X_test.index
)

# ---------------------------------------------------------------------
# 3. Create a DBDataset for Distillation
# ---------------------------------------------------------------------

In [None]:
# Combine features and target into a single DataFrame for each split
train_data = pd.concat([X_train_scaled, y_train], axis=1)
test_data = pd.concat([X_test_scaled, y_test], axis=1)

# Create DBDataset
dataset = DBDataset(
    train_data=train_data,
    test_data=test_data,
    target_column='target',
    features=X_train.columns.tolist(),
    train_predictions=train_probs_df,
    test_predictions=test_probs_df,
    prob_cols=[f'prob_class_{i}' for i in range(train_probs.shape[1])],
    dataset_name="breast_cancer"
)

print(dataset)

# ---------------------------------------------------------------------
# 4. Using AutoDistiller for Model Distillation
# ---------------------------------------------------------------------

In [None]:
# Create an AutoDistiller instance
auto_distiller = AutoDistiller(
    dataset=dataset,
    output_dir="./distillation_results",
    test_size=0.2,
    random_state=RANDOM_SEED,
    n_trials=5,  # Reduced for demo purposes
    validation_split=0.2,
    verbose=True
)

# Customize configuration
auto_distiller.customize_config(
    model_types=[
        ModelType.LOGISTIC_REGRESSION,
        ModelType.DECISION_TREE,
        ModelType.GBM
    ],
    temperatures=[0.5, 1.0, 2.0],
    alphas=[0.3, 0.7]
)

# Run the distillation process
results = auto_distiller.run(use_probabilities=True, verbose_output=True)

# Display results
print("Distillation Results Summary:")
print(results.head())

# ---------------------------------------------------------------------
# 5. Analyze the Results
# ---------------------------------------------------------------------

In [None]:
# Find the best model based on test accuracy
best_config = auto_distiller.find_best_model(metric='test_accuracy', minimize=False)

print("\nBest Model Configuration:")
print(f"Model Type: {best_config['model_type']}")
print(f"Temperature: {best_config['temperature']}")
print(f"Alpha: {best_config['alpha']}")
print(f"Test Accuracy: {best_config.get('test_accuracy', 'N/A')}")

# Get the best trained model
best_model = auto_distiller.get_trained_model(
    model_type=best_config['model_type'],
    temperature=best_config['temperature'],
    alpha=best_config['alpha']
)

# Evaluate the distilled model
student_preds = best_model.predict(X_test_scaled)
student_probs = best_model.predict_proba(X_test_scaled)[:, 1]

student_accuracy = accuracy_score(y_test, student_preds)
student_auc = roc_auc_score(y_test, student_probs)

print("\nDistilled Model Performance:")
print(f"Accuracy: {student_accuracy:.4f}")
print(f"AUC-ROC: {student_auc:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, student_preds))

# ---------------------------------------------------------------------
# 6. Performance Comparison
# ---------------------------------------------------------------------

In [None]:
# Compare teacher vs student model performance
metrics = {
    'Model': ['Teacher', 'Student (Distilled)'],
    'Accuracy': [teacher_accuracy, student_accuracy],
    'AUC-ROC': [teacher_auc, student_auc]
}

comparison_df = pd.DataFrame(metrics)
print("\nPerformance Comparison:")
print(comparison_df)

# Plot comparison
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Accuracy comparison
ax[0].bar(metrics['Model'], metrics['Accuracy'], color=['#4A6FA5', '#E57373'])
ax[0].set_ylim([min(0.9, min(metrics['Accuracy']) - 0.02), 1.0])  # Widen the axis if a score falls below 0.9
ax[0].set_title('Accuracy Comparison')
ax[0].set_ylabel('Accuracy')

# AUC-ROC comparison
ax[1].bar(metrics['Model'], metrics['AUC-ROC'], color=['#4A6FA5', '#E57373'])
ax[1].set_ylim([min(0.9, min(metrics['AUC-ROC']) - 0.02), 1.0])  # Widen the axis if a score falls below 0.9
ax[1].set_title('AUC-ROC Comparison')
ax[1].set_ylabel('AUC-ROC')

plt.tight_layout()
plt.show()

# ---------------------------------------------------------------------
# 7. Visualize Probability Distributions
# ---------------------------------------------------------------------

In [None]:
# Compare probability distributions between teacher and student
plt.figure(figsize=(10, 6))

# Get probabilities for positive class
teacher_probs_pos = y_prob_teacher
student_probs_pos = student_probs

# Plot density curves
sns.kdeplot(teacher_probs_pos, label='Teacher Model', color='#4A6FA5')
sns.kdeplot(student_probs_pos, label='Student Model', color='#E57373')

plt.title('Comparison of Probability Distributions')
plt.xlabel('Probability (Positive Class)')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# ---------------------------------------------------------------------
# 8. Save the Best Model
# ---------------------------------------------------------------------

In [None]:
# Save the best model
model_path = auto_distiller.save_best_model(
    metric='test_accuracy',
    minimize=False,
    file_path='./best_distilled_model.pkl'
)

print(f"\nBest model saved to: {model_path}")

# ---------------------------------------------------------------------
# 9. Model Size Comparison
# ---------------------------------------------------------------------

In [None]:
import pickle

# Estimate each model's size as the length of its pickled byte string.
# len() counts only the serialized bytes; sys.getsizeof would also include
# the overhead of the Python bytes object itself.
teacher_size = len(pickle.dumps(teacher_model)) / (1024 * 1024)  # in MB
student_size = len(pickle.dumps(best_model)) / (1024 * 1024)  # in MB

print("\nModel Size Comparison:")
print(f"Teacher Model: {teacher_size:.2f} MB")
print(f"Student Model: {student_size:.2f} MB")
print(f"Size Reduction: {(1 - student_size/teacher_size) * 100:.2f}%")

# Plot model size comparison
plt.figure(figsize=(8, 5))
plt.bar(['Teacher Model', 'Student Model'], [teacher_size, student_size], color=['#4A6FA5', '#E57373'])
plt.title('Model Size Comparison')
plt.ylabel('Size (MB)')
plt.grid(axis='y', alpha=0.3)

# Add size labels just above each bar
for i, size in enumerate([teacher_size, student_size]):
    plt.text(i, size, f"{size:.2f} MB", ha='center', va='bottom')

plt.tight_layout()
plt.show()

# ---------------------------------------------------------------------
# 10. Summary and Conclusions
# ---------------------------------------------------------------------

In [None]:
print("\nDistillation Summary:")
print("=====================")
print("Teacher Model: RandomForest with 500 trees")
print(f"Best Student Model: {best_config['model_type']}")
print(f"Performance Gap (Accuracy): {(teacher_accuracy - student_accuracy) * 100:.2f} percentage points")
print(f"Performance Gap (AUC-ROC): {(teacher_auc - student_auc) * 100:.2f} percentage points")
print(f"Size Reduction: {(1 - student_size/teacher_size) * 100:.2f}%")

print("\nConclusion:")
if student_accuracy >= 0.95 * teacher_accuracy:
    print("✅ The student retains at least 95% of the teacher's accuracy")
    # Only report the size and AUC-ROC claims when the measurements support them
    if student_size < teacher_size:
        print("✅ The distilled model is smaller than the teacher")
    if student_auc >= 0.95 * teacher_auc:
        print("✅ The student retains at least 95% of the teacher's AUC-ROC")
else:
    print("⚠️ Distillation completed, but the student loses more than 5% of the teacher's accuracy")
    print("⚠️ Consider adjusting distillation parameters or using a different student model type")