# DeepBridge Tutorial - Part 2: AutoDistiller and Robustness Testing

This notebook continues our exploration of the DeepBridge library, focusing on more advanced usage:

1. Using the AutoDistiller for automated model distillation
2. Comparing teacher and student probability distributions
3. Robustness testing and validation
4. Feature importance analysis

Let's begin by importing the necessary components.

In [None]:
# Import core DeepBridge components
from deepbridge.core.db_data import DBDataset
from deepbridge.core.experiment import Experiment
from deepbridge.distillation.auto_distiller import AutoDistiller
from deepbridge.utils.model_registry import ModelType
from deepbridge.validation.robustness_test import RobustnessTest
from deepbridge.validation.robustness_metrics import RobustnessScore
from deepbridge.visualization.robustness_viz import RobustnessViz

# Additional imports
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import tempfile
import os
import joblib

# Set random seed for reproducibility
np.random.seed(42)

## 1. Preparing a Dataset with Model Predictions

We start from the breast cancer dataset bundled with scikit-learn and train a deliberately large teacher model (a 200-tree Random Forest) that the later sections distill into smaller students.

In [None]:
# Load breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a complex teacher model (Random Forest with many trees)
teacher_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)

# Train the model
teacher_model.fit(X_train, y_train)

# Generate predictions
train_probs = teacher_model.predict_proba(X_train)
test_probs = teacher_model.predict_proba(X_test)

# Convert to DataFrames
train_probs_df = pd.DataFrame(train_probs, columns=['prob_class_0', 'prob_class_1'], index=X_train.index)
test_probs_df = pd.DataFrame(test_probs, columns=['prob_class_0', 'prob_class_1'], index=X_test.index)

# Save the teacher model
temp_dir = tempfile.mkdtemp()
model_path = os.path.join(temp_dir, 'teacher_model.pkl')
joblib.dump(teacher_model, model_path)

# Combine features and target
train_data = X_train.copy()
train_data['target'] = y_train
test_data = X_test.copy()
test_data['target'] = y_test

# Print model info
print(f"Teacher model: {type(teacher_model).__name__}")
print(f"Number of trees: {teacher_model.n_estimators}")
print(f"Train accuracy: {teacher_model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {teacher_model.score(X_test, y_test):.4f}")
print(f"Model size: {os.path.getsize(model_path) / (1024 * 1024):.2f} MB")

In [None]:
# Create a DBDataset with the teacher model
dataset = DBDataset(
    train_data=train_data,
    test_data=test_data,
    target_column='target',
    model_path=model_path,
    train_predictions=train_probs_df,
    test_predictions=test_probs_df,
    prob_cols=['prob_class_0', 'prob_class_1'],
    dataset_name='breast_cancer_with_teacher_model'
)

print(dataset)

## 2. Using AutoDistiller for Automated Model Distillation

The `AutoDistiller` class automates the process of finding the best distilled model by testing various combinations of:
- Student model types
- Temperatures for knowledge distillation
- Alpha values (weighting between soft and hard targets)

It performs hyperparameter optimization for each configuration and calculates detailed metrics.
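
To make the roles of temperature and alpha concrete, here is a minimal NumPy sketch of a distillation objective. It illustrates the general technique and is not DeepBridge's internal implementation: the helper names `soften` and `distillation_loss` are hypothetical, and the convention that `alpha` weights the soft-target term (with `1 - alpha` on the hard-label term) is an assumption.

```python
import numpy as np

def soften(probs, temperature):
    """Re-scale class probabilities with a temperature (T > 1 flattens them)."""
    logits = np.log(np.clip(probs, 1e-12, 1.0))
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

def distillation_loss(student_probs, teacher_probs, y_true, temperature=2.0, alpha=0.5):
    """Assumed convention: alpha * soft-target loss + (1 - alpha) * hard-label loss."""
    eps = 1e-12
    soft_teacher = soften(teacher_probs, temperature)
    soft_student = soften(student_probs, temperature)
    # Soft term: KL(teacher || student) on the softened distributions
    soft_loss = np.mean(
        np.sum(soft_teacher * (np.log(soft_teacher + eps) - np.log(soft_student + eps)), axis=1)
    )
    # Hard term: cross-entropy of the student against the true labels
    hard_loss = -np.mean(np.log(student_probs[np.arange(len(y_true)), y_true] + eps))
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example: a confident teacher and a less confident student on two samples
teacher = np.array([[0.95, 0.05], [0.10, 0.90]])
student = np.array([[0.80, 0.20], [0.30, 0.70]])
labels = np.array([0, 1])

for T in (1.0, 2.0):
    for a in (0.3, 0.7):
        loss = distillation_loss(student, teacher, labels, temperature=T, alpha=a)
        print(f"T={T}, alpha={a}: loss={loss:.4f}")
```

Higher temperatures flatten the teacher's probabilities, so the student sees more of the relative ranking between classes. Some formulations also multiply the soft term by T², which this sketch omits.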

In [None]:
# Create an AutoDistiller instance
auto_distiller = AutoDistiller(
    dataset=dataset,
    output_dir=os.path.join(temp_dir, 'distillation_results'),
    test_size=0.2,  # This is used for validation during distillation
    random_state=42,
    n_trials=10,    # Number of hyperparameter optimization trials
    verbose=True    # Show detailed output
)

# Customize the configuration
auto_distiller.customize_config(
    model_types=[ModelType.LOGISTIC_REGRESSION, ModelType.GBM, ModelType.DECISION_TREE],
    temperatures=[1.0, 2.0],         # Temperature values to test
    alphas=[0.3, 0.7],               # Alpha values to test
    distillation_method="knowledge_distillation"  # Method to use
)

# Calculate metrics for the original model
original_metrics = auto_distiller.original_metrics()
print("\nOriginal Model Metrics (Test Set):")
for metric, value in original_metrics['test'].items():
    if isinstance(value, (int, float)):
        print(f"  {metric}: {value:.4f}")

In [None]:
# Run the automated distillation process
# With the configuration above this tests 3 model types x 2 temperatures x 2 alphas = 12 combinations,
# each tuned with up to n_trials=10 hyperparameter optimization trials
results = auto_distiller.run(use_probabilities=True, verbose_output=True)

### 2.1 Analyzing AutoDistiller Results

Now, let's explore the results of our automated distillation process.

In [None]:
# Display the results DataFrame
print("Distillation Results:")
results.head()

In [None]:
# Find the best model based on test accuracy
best_config = auto_distiller.find_best_model(metric='test_accuracy')
print("Best Model Configuration (by Test Accuracy):")
for key, value in best_config.items():
    if key in ['model_type', 'temperature', 'alpha']:
        print(f"  {key}: {value}")
        
print("\nTest Metrics:")
for key, value in best_config.items():
    if key.startswith('test_') and isinstance(value, (int, float)):
        print(f"  {key}: {value:.4f}")

In [None]:
# Find the best model for distribution matching (KL divergence)
best_distribution = auto_distiller.find_best_model(metric='test_kl_divergence', minimize=True)
print("Best Model Configuration (by KL Divergence):")
for key, value in best_distribution.items():
    if key in ['model_type', 'temperature', 'alpha']:
        print(f"  {key}: {value}")
        
print("\nDistribution Metrics:")
for key, value in best_distribution.items():
    if key in ['test_kl_divergence', 'test_ks_statistic', 'test_r2_score'] and isinstance(value, (int, float)):
        print(f"  {key}: {value:.4f}")

In [None]:
# Compare the original model with the best distilled model
comparison = auto_distiller.compare_models()
print("Original vs. Best Distilled Model Comparison (Test Set):")
print(comparison[comparison['dataset'] == 'test'])

In [None]:
# Generate a summary report
summary = auto_distiller.generate_summary()
print(summary)

### 2.2 Using the Best Distilled Model

Now that we've found the best distilled model, let's see how to use it for predictions.

In [None]:
# Get the best model configuration
best_model_config = auto_distiller.find_best_model(metric='test_accuracy')
model_type = best_model_config['model_type']
temperature = best_model_config['temperature']
alpha = best_model_config['alpha']

# Get the trained distilled model
best_model = auto_distiller.get_trained_model(
    model_type=model_type,
    temperature=temperature,
    alpha=alpha
)

# Make predictions with the distilled model
distilled_probs = best_model.predict_proba(X_test)
distilled_preds = (distilled_probs[:, 1] > 0.5).astype(int)

# Make predictions with the teacher model for comparison
teacher_probs = teacher_model.predict_proba(X_test)
teacher_preds = (teacher_probs[:, 1] > 0.5).astype(int)

# Compare accuracy
distilled_accuracy = (distilled_preds == y_test).mean()
teacher_accuracy = (teacher_preds == y_test).mean()

print(f"Teacher model accuracy: {teacher_accuracy:.4f}")
print(f"Distilled model accuracy: {distilled_accuracy:.4f}")
print(f"Accuracy retention: {distilled_accuracy/teacher_accuracy:.2%}")

In [None]:
# Save the best model
best_model_path = auto_distiller.save_best_model(
    metric='test_accuracy',
    file_path=os.path.join(temp_dir, 'best_distilled_model.pkl')
)

# Compare model sizes
original_size = os.path.getsize(model_path) / (1024 * 1024)  # in MB
distilled_size = os.path.getsize(best_model_path) / (1024 * 1024)  # in MB

print(f"Original model size: {original_size:.2f} MB")
print(f"Distilled model size: {distilled_size:.2f} MB")
print(f"Size reduction: {(1 - distilled_size/original_size):.2%}")

## 3. Visualizing the Probability Distributions

An important aspect of model distillation is how well the student model matches the probability distribution of the teacher model.
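
One common way to quantify that match is the KL divergence between the teacher's and the student's class probabilities, averaged over samples. The sketch below is a generic definition using a hypothetical helper, `mean_kl_divergence`. How DeepBridge computes its own `test_kl_divergence` metric is not shown in this tutorial and may differ.

```python
import numpy as np

def mean_kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """Mean KL(teacher || student) over the rows of two class-probability arrays."""
    p = np.clip(np.asarray(teacher_probs, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(student_probs, dtype=float), eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

# Example: the divergence is zero for identical predictions and grows as they drift apart
teacher_demo = np.array([[0.9, 0.1], [0.2, 0.8]])
close_student = np.array([[0.85, 0.15], [0.25, 0.75]])
far_student = np.array([[0.5, 0.5], [0.5, 0.5]])

print(f"identical: {mean_kl_divergence(teacher_demo, teacher_demo):.4f}")
print(f"close:     {mean_kl_divergence(teacher_demo, close_student):.4f}")
print(f"far:       {mean_kl_divergence(teacher_demo, far_student):.4f}")
```

KL divergence is asymmetric, so the order of the arguments matters. Here the teacher is treated as the reference distribution.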

In [None]:
# Create a visualization of the probability distributions
plt.figure(figsize=(12, 6))

# Get teacher and student probabilities for the positive class
teacher_probs_pos = teacher_probs[:, 1]
student_probs_pos = distilled_probs[:, 1]

# Plot density curves
sns.kdeplot(teacher_probs_pos, fill=True, color="royalblue", alpha=0.5, 
           label="Teacher Model", linewidth=2)
sns.kdeplot(student_probs_pos, fill=True, color="crimson", alpha=0.5, 
           label="Distilled Model", linewidth=2)

# Add histogram for additional clarity
plt.hist(teacher_probs_pos, bins=20, density=True, alpha=0.3, color="blue")
plt.hist(student_probs_pos, bins=20, density=True, alpha=0.3, color="red")

# Add titles and labels
plt.xlabel("Probability Value", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Teacher vs Distilled Model Probability Distribution", fontsize=14, fontweight='bold')

# Add distribution similarity metrics
from scipy import stats
from sklearn.metrics import r2_score

# Calculate KS statistic
ks_stat, ks_pvalue = stats.ks_2samp(teacher_probs_pos, student_probs_pos)

# Calculate R² between the sorted probability values (a quantile-to-quantile comparison)
teacher_sorted = np.sort(teacher_probs_pos)
student_sorted = np.sort(student_probs_pos)
min_len = min(len(teacher_sorted), len(student_sorted))
r2 = r2_score(teacher_sorted[:min_len], student_sorted[:min_len])

# Add metrics to the plot
metrics_text = f"KS Statistic: {ks_stat:.4f} (p={ks_pvalue:.4f})\nR² Score: {r2:.4f}"
plt.annotate(metrics_text, xy=(0.02, 0.96), xycoords='axes fraction',
            bbox=dict(boxstyle="round,pad=0.5", fc="white", alpha=0.8),
            va='top', fontsize=10)

plt.legend(loc='best')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

## 4. Robustness Testing of Original and Distilled Models

Robustness testing evaluates how well models perform when data is perturbed or corrupted. Let's test both our original teacher model and our distilled model.
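
As a rough illustration of the idea, the sketch below adds Gaussian noise scaled by each feature's standard deviation and records the mean AUC at each perturbation size. It uses only scikit-learn and NumPy. The helper `auc_under_noise` and its noise-scaling rule are assumptions made for illustration, not a description of how `RobustnessTest` perturbs data internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def auc_under_noise(model, X, y, perturb_sizes, n_iterations=5, random_state=42):
    """Mean AUC per perturbation size, with noise std = size * per-feature std."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    feature_std = X.std(axis=0)
    scores = {}
    for size in perturb_sizes:
        aucs = []
        for _ in range(n_iterations):
            noise = rng.normal(0.0, 1.0, size=X.shape) * feature_std * size
            aucs.append(roc_auc_score(y, model.predict_proba(X + noise)[:, 1]))
        scores[size] = float(np.mean(aucs))
    return scores

# Small stand-in model so the sketch runs on its own
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

scores = auc_under_noise(demo_model, X_te, y_te, perturb_sizes=[0.0, 0.5, 1.0])
for size, auc in scores.items():
    print(f"perturbation {size:.1f}: mean AUC = {auc:.4f}")
```

A model whose AUC falls slowly as the perturbation size grows is the more robust one by this measure.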

In [None]:
# Initialize the robustness test
robustness_test = RobustnessTest()

# Create a dictionary of models to test
models = {
    'Teacher (Random Forest)': teacher_model,
    'Distilled': best_model
}

# Run robustness evaluation
# Note: this rebinds `results`, which previously held the AutoDistiller results from Section 2
results = robustness_test.evaluate_robustness(
    models=models,
    X=X_test,
    y=y_test,
    perturb_method='raw',  # Use raw perturbation (add Gaussian noise)
    perturb_sizes=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],  # Different levels of perturbation
    metric='AUC',  # Evaluation metric
    n_iterations=5,  # Number of iterations per perturbation size
    random_state=42
)

In [None]:
# Calculate robustness indices
robustness_indices = RobustnessScore.calculate_robustness_index(
    results=results,
    metric='AUC'
)

print("Robustness Indices:")
for model_name, index in robustness_indices.items():
    print(f"  {model_name}: {index:.4f}")

In [None]:
# Create a visualization of model performance under perturbation
fig = RobustnessViz.plot_models_comparison(
    results=results,
    metric_name='AUC-ROC',
    height=500,
    width=800
)

# Display the figure (note: normally this would be interactive with Plotly)
from IPython.display import Image
import plotly.io as pio

# Convert Plotly figure to static image (static export requires the `kaleido` package)
img_path = os.path.join(temp_dir, 'robustness_comparison.png')
pio.write_image(fig, img_path)

# Display the image
Image(img_path)

## 5. Feature Importance for Robustness

Let's analyze which features impact model robustness the most.
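
One plausible way to measure this is to perturb a single feature at a time and record how far the AUC drops from the unperturbed baseline. The sketch below does that with a hypothetical helper, `feature_robustness_impact`. The noise rule and the definition of "impact" are assumptions for illustration and may not match what `analyze_feature_importance` computes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def feature_robustness_impact(model, X, y, perturb_size=0.5, n_iterations=3, random_state=42):
    """Mean AUC drop when Gaussian noise is added to one feature at a time."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    impacts = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_iterations):
            X_pert = X.copy()
            # Noise std is perturb_size times this feature's own std
            X_pert[:, j] += rng.normal(0.0, perturb_size * X[:, j].std(), size=X.shape[0])
            drops.append(baseline - roc_auc_score(y, model.predict_proba(X_pert)[:, 1]))
        impacts[j] = np.mean(drops)
    return impacts

# Small stand-in model so the sketch runs on its own
demo = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(demo.data, demo.target, test_size=0.2, random_state=42)
demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

impacts = feature_robustness_impact(demo_model, X_te, y_te)
for idx in np.argsort(impacts)[::-1][:5]:
    print(f"{demo.feature_names[idx]}: {impacts[idx]:.4f}")
```

Features with the largest drops are the ones whose corruption hurts the model most. Values near zero (or slightly negative, due to noise) indicate features the model is insensitive to.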

In [None]:
# Analyze feature importance for robustness
feature_importance = robustness_test.analyze_feature_importance(
    model=best_model,
    X=X_test,
    y=y_test,
    perturb_method='raw',
    perturb_size=0.5,
    metric='AUC',
    n_iterations=3,
    random_state=42
)

# Display top 10 important features
top_features = feature_importance['sorted_features'][:10]
top_impacts = feature_importance['sorted_impacts'][:10]

print("Top 10 Features by Impact on Robustness:")
for feature, impact in zip(top_features, top_impacts):
    print(f"  {feature}: {impact:.4f}")

In [None]:
# Create a feature importance visualization
fig = RobustnessViz.plot_feature_importance(
    feature_importance_results=feature_importance,
    title="Feature Impact on Model Robustness",
    top_n=10,
    height=600,
    width=800
)

# Convert Plotly figure to static image
img_path = os.path.join(temp_dir, 'feature_importance.png')
pio.write_image(fig, img_path)

# Display the image
Image(img_path)

## Conclusion

In this second part of our DeepBridge tutorial, we've explored several advanced capabilities:

1. **AutoDistiller**: The automated distillation process searches combinations of model types, temperatures, and alpha values and reports the best-performing student model among those tested.

2. **Robustness Testing**: We evaluated how both the teacher and the distilled model perform under data perturbation, which lets us check whether the distilled model retains robustness comparable to the original.

3. **Feature Importance Analysis**: We identified which features have the most impact on model robustness, which can guide feature engineering and model development.

4. **Visualization**: We created visualizations for probability distributions and performance under perturbation.

Key takeaways:
- Model distillation can reduce model size while retaining most of the teacher's accuracy; the size-reduction and accuracy-retention figures from Section 2.2 quantify the trade-off for this dataset
- Robustness testing helps ensure that distilled models remain stable under perturbation
- Feature importance analysis provides insights into which features are most critical for model stability

In Part 3, we'll explore comparing multiple distillation methods and creating comprehensive dashboards.