# DeepBridge Tutorial - Dataset Handling and Basic Experiment

This notebook demonstrates how to use the core components of the DeepBridge library: the `DBDataset` and `Experiment` classes. These classes provide a foundation for model validation and distillation workflows.

## Overview

1. **DBDataset**: Wraps training and test datasets along with optional model and predictions
2. **Experiment**: Handles different types of modeling tasks and their configurations

Let's start by importing the necessary modules.

In [4]:
import sys
import os

sys.path.append(os.path.expanduser("~/projetos/DeepBridge"))

# Import core DeepBridge components
from deepbridge.core.db_data import DBDataset
from deepbridge.core.experiment import Experiment
from deepbridge.utils.model_registry import ModelType

# Additional imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import tempfile
import joblib

# Set random seed for reproducibility
np.random.seed(42)


## Part 1: Working with DBDataset

The `DBDataset` class is a wrapper around datasets that handles feature management, categorical variables, and model predictions. Let's explore its functionality using a synthetic dataset.

In [5]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
import joblib
import tempfile
import os

# Generate a synthetic dataset with 200,000 records
X, y = make_classification(
    n_samples=200000,
    n_features=30,
    n_informative=25,
    n_redundant=5,
    n_classes=2,
    random_state=42
)

X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(30)])
y = pd.Series(y, name='target')

# Combine features and target into a single DataFrame before splitting
data = X.copy()
data['target'] = y

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the neural network model (MLPClassifier)
nn_model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    max_iter=200,
    random_state=42
)

# Train the model
nn_model.fit(X_train, y_train)

# Generate predictions
train_probs = nn_model.predict_proba(X_train)
test_probs = nn_model.predict_proba(X_test)

# Convert to DataFrames
train_probs_df = pd.DataFrame(train_probs, columns=['prob_class_0', 'prob_class_1'], index=X_train.index)
test_probs_df = pd.DataFrame(test_probs, columns=['prob_class_0', 'prob_class_1'], index=X_test.index)

# Save the trained model
temp_dir = tempfile.mkdtemp()
model_path = os.path.join(temp_dir, 'nn_model_large.pkl')
joblib.dump(nn_model, model_path)

# Display model information
print(f"Neural network model: {type(nn_model).__name__}")
print(f"Hidden layers: {nn_model.hidden_layer_sizes}")
print(f"Train accuracy: {nn_model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {nn_model.score(X_test, y_test):.4f}")
print(f"Model size: {os.path.getsize(model_path) / (1024 * 1024):.2f} MB")

Neural network model: MLPClassifier
Hidden layers: (100, 50)
Train accuracy: 0.9979
Test accuracy: 0.9897
Model size: 0.26 MB


In [6]:
nn_model

### 1.1 Creating a DBDataset instance

There are multiple ways to create a `DBDataset` instance:

1. From a unified dataset that will be split into train/test
2. From already split train/test datasets
3. With pre-loaded model and predictions

Let's demonstrate the first approach.

In [7]:
# Create a DBDataset from unified data
db_dataset = DBDataset(
    data=data,                   # Unified dataset
    target_column='target',      # Name of target column
    model=nn_model
)

# Display information about the dataset
print(db_dataset)

DBDataset(with 200000 samples (not split))
Features: 30 total (0 categorical, 30 numerical)
Target: 'target'
Model: loaded
Predictions: available


In [8]:
experiment = Experiment(
    dataset=db_dataset,
    experiment_type="binary_classification"
)


=== Evaluating distillation model on train dataset ===
Student predictions shape: (160000, 2)
First 3 student probabilities: [[9.9999386e-01 6.1506762e-06]
 [4.0531158e-06 9.9999595e-01]
 [1.8007445e-01 8.1992555e-01]]
Teacher probabilities type: <class 'pandas.core.frame.DataFrame'>
Using 'prob_class_1' column from teacher probabilities
Teacher probabilities shape: (160000, 2)
First 3 teacher probabilities (positive class): [4.06659572e-10 9.99999833e-01 3.13943004e-17]
KS Statistic calculation: 0.42930625, p-value: 0.0
R² Score calculation: 0.9730334873662196
Teacher prob type: <class 'numpy.ndarray'>, shape: (160000,)
Student prob type: <class 'numpy.ndarray'>, shape: (160000,)
Teacher prob first 5 values: [4.06659572e-10 9.99999833e-01 3.13943004e-17 3.16337256e-18
 1.00000000e+00]
Student prob first 5 values: [6.1506762e-06 9.9999595e-01 8.1992555e-01 1.8750736e-02 9.9986696e-01]
KS calculation successful: (0.42930625, 0.0)
Sorted teacher dist - min: 4.368222869789236e-45, max: 1

In [9]:
experiment.compare_teacher_student_metrics()

|   | dataset | metric | teacher_value | student_value | difference |
|---|---------|--------|---------------|---------------|------------|
| 0 | train | accuracy | 0.997912 | 0.955694 | -0.042219 |
| 1 | train | precision | 0.997107 | 0.955676 | -0.041431 |
| 2 | train | recall | 0.998726 | 0.955784 | -0.042942 |
| 3 | train | f1_score | 0.997916 | 0.95573 | -0.042186 |
| 4 | train | auc_roc | 0.999976 | 0.989103 | -0.010873 |
| 5 | train | auc_pr | 0.999976 | 0.98841 | -0.011566 |
| 6 | train | log_loss | 0.006103 | 0.135734 | 0.129631 |
| 7 | test | accuracy | 0.989675 | 0.95155 | -0.038125 |
| 8 | test | precision | 0.989222 | 0.952643 | -0.036579 |
| 9 | test | recall | 0.990161 | 0.950455 | -0.039706 |
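
The `difference` column appears to be `student_value - teacher_value` (for test accuracy, 0.95155 - 0.989675 = -0.038125). The sketch below reproduces that arithmetic on a few rows copied from the output above and shows why the sign must be read per metric. The helper `signed_gap` and the set of lower-is-better metrics are illustrative assumptions, not part of DeepBridge.

```python
# Rows copied from the comparison output: (dataset, metric, teacher, student).
rows = [
    ("train", "accuracy", 0.997912, 0.955694),
    ("train", "log_loss", 0.006103, 0.135734),
    ("test", "accuracy", 0.989675, 0.951550),
]

# Assumption: metrics for which a smaller value is better.
LOWER_IS_BETTER = {"log_loss"}


def signed_gap(metric, teacher, student):
    """Return (difference, student_is_worse) for one metric row."""
    diff = student - teacher
    worse = diff > 0 if metric in LOWER_IS_BETTER else diff < 0
    return diff, worse


for dataset, metric, teacher, student in rows:
    diff, worse = signed_gap(metric, teacher, student)
    print(f"{dataset:5s} {metric:9s} diff={diff:+.6f} student_worse={worse}")
```

A negative difference means the student is worse on score-type metrics such as accuracy, while for `log_loss` a positive difference is the unfavorable direction.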


In [10]:
from deepbridge.visualization.distribution import DistributionVisualizer

# Get student and teacher predictions for the test set
student_predictions = experiment.get_student_predictions(dataset='test')
student_probs = student_predictions['prob_1'].values  # or whichever probability column applies in your case

# The teacher probabilities are usually available on the experiment
teacher_probs = experiment.prob_test

# Create the distribution visualizer
output_dir = "distribution_plots"  # directory where the plots will be saved
visualizer = DistributionVisualizer(output_dir=output_dir)

# Generate the plot comparing the two distributions
metrics = visualizer.compare_distributions(
    teacher_probs=teacher_probs,
    student_probs=student_probs,
    title="Probability Distribution Comparison: Teacher vs Student",
    filename="teacher_student_comparison.png",
    show_metrics=True
)

# Also generate a cumulative distribution plot
visualizer.compare_cumulative_distributions(
    teacher_probs=teacher_probs,
    student_probs=student_probs,
    title="Cumulative Distribution Comparison: Teacher vs Student",
    filename="teacher_student_cdf.png"
)

# Create a quantile-quantile (Q-Q) plot
visualizer.create_quantile_plot(
    teacher_probs=teacher_probs,
    student_probs=student_probs,
    title="Q-Q Plot: Teacher vs Student",
    filename="teacher_student_qq.png"
)

Created distribution comparison: distribution_plots/teacher_student_comparison.png
Created cumulative distribution comparison: distribution_plots/teacher_student_cdf.png
Created quantile plot: distribution_plots/teacher_student_qq.png
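
A Q-Q plot pairs up matching quantiles of two samples; points close to the diagonal indicate similar distributions. The sketch below shows that computation with NumPy on made-up beta-distributed scores. It does not reproduce `DistributionVisualizer.create_quantile_plot`, whose internals are not shown in this notebook, and `qq_points` is a hypothetical helper.

```python
import numpy as np


def qq_points(a, b, n_quantiles=101):
    """Return matching quantiles of two 1-D samples (x from a, y from b)."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    return np.quantile(a, q), np.quantile(b, q)


rng = np.random.default_rng(0)
teacher = rng.beta(0.2, 0.2, size=5000)  # stand-in for sharp scores near 0 or 1
student = rng.beta(0.5, 0.5, size=5000)  # stand-in for softer scores

tq, sq = qq_points(teacher, student)
# Largest vertical distance of the Q-Q curve from the diagonal y = x.
print(f"max |student_q - teacher_q| = {np.max(np.abs(sq - tq)):.3f}")
```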


We can also pre-split our data and provide the train and test sets directly:

In [11]:
# Split data manually
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Create a DBDataset from pre-split data
db_dataset_split = DBDataset(
    train_data=train_data,       # Training dataset
    test_data=test_data,         # Test dataset
    target_column='target',      # Name of target column
    dataset_name='presplit_classification'  # Optional name
)

print(db_dataset_split)

DBDataset('presplit_classification' with 200000 samples (not split))
Features: 30 total (0 categorical, 30 numerical)
Target: 'target'
Model: not loaded
Predictions: not available


### 1.2 Accessing dataset properties

`DBDataset` provides several properties to access its components:

In [12]:
# Access properties
print(f"Total features: {len(db_dataset.features)}")
print(f"Feature names: {db_dataset.features[:5]}...")
print(f"Target name: {db_dataset.target_name}")
print(f"Total samples: {len(db_dataset)}")
print(f"Training samples: {len(db_dataset.train_data)}")
print(f"Test samples: {len(db_dataset.test_data)}")

Total features: 30
Feature names: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4']...
Target name: target
Total samples: 200000
Training samples: 160000
Test samples: 40000


### 1.3 Working with categorical features

Let's create a dataset with both numerical and categorical features to demonstrate categorical feature handling.

In [13]:
# Load breast cancer dataset (for numerical features)
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X_numeric = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

# Add some categorical features
# include_lowest=True keeps age 30 in the first bin instead of producing NaN
X_numeric['age_group'] = pd.cut(np.random.randint(30, 80, size=X_numeric.shape[0]),
                               bins=[30, 45, 60, 80],
                               labels=['young', 'middle', 'senior'],
                               include_lowest=True)

X_numeric['risk_factor'] = np.random.choice(['low', 'medium', 'high'], size=X_numeric.shape[0])
X_numeric['family_history'] = np.random.choice([0, 1], size=X_numeric.shape[0])

# Create final dataset
mixed_data = X_numeric.copy()
mixed_data['target'] = y

# Display data types
print("Data types:")
print(mixed_data.dtypes.value_counts())
print("\nSample data:")
mixed_data[['mean radius', 'mean texture', 'age_group', 'risk_factor', 'family_history', 'target']].head()

In [None]:
# Create DBDataset with categorical features
db_mixed = DBDataset(
    data=mixed_data,
    target_column='target',
    test_size=0.2,
    random_state=42,
    categorical_features=['age_group', 'risk_factor', 'family_history'],  # Specify categorical features
    dataset_name='mixed_features_dataset'
)

# Check categorical features
print(f"Categorical features: {db_mixed.categorical_features}")
print(f"Numerical features: {db_mixed.numerical_features[:5]}...")

### 1.4 Adding a model and predictions to DBDataset

We can train a model and add it to our dataset. This is useful when you want to use DeepBridge's distillation capabilities.

In [None]:
# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(db_dataset.get_feature_data('train'), db_dataset.get_target_data('train'))

# Generate predictions
train_probas = model.predict_proba(db_dataset.get_feature_data('train'))
test_probas = model.predict_proba(db_dataset.get_feature_data('test'))

# Create DataFrames for probabilities
train_probas_df = pd.DataFrame(train_probas, columns=['prob_class_0', 'prob_class_1'])
test_probas_df = pd.DataFrame(test_probas, columns=['prob_class_0', 'prob_class_1'])

# Save model to a temporary file
temp_dir = tempfile.mkdtemp()
model_path = os.path.join(temp_dir, 'rf_model.pkl')
joblib.dump(model, model_path)
print(f"Saved model to: {model_path}")

In [None]:
# Create DBDataset with model path
db_with_model = DBDataset(
    data=data,
    target_column='target',
    test_size=0.2,
    random_state=42,
    model_path=model_path,  # Path to saved model
    dataset_name='dataset_with_model'
)

print(db_with_model)

# Check if model is loaded
print(f"\nModel type: {type(db_with_model.model).__name__}")

In [None]:
# Alternatively, create DBDataset with pre-calculated probabilities
db_with_probs = DBDataset(
    train_data=train_data,
    test_data=test_data,
    target_column='target',
    train_predictions=train_probas_df,  # Pre-calculated train predictions
    test_predictions=test_probas_df,    # Pre-calculated test predictions
    prob_cols=['prob_class_0', 'prob_class_1'],  # Probability column names
    dataset_name='dataset_with_probabilities'
)

print(db_with_probs)

# Access probabilities
print("\nSample probabilities:")
print(db_with_probs.original_prob.head())

### 1.5 Generating synthetic data

The `DBDataset` class can also generate synthetic data based on the original dataset distribution. This is useful for experimentation without using the original data.

In [None]:
# Create a DBDataset with synthetic data generation enabled
db_synthetic = DBDataset(
    data=data,
    target_column='target',
    test_size=0.2,
    random_state=42,
    model_path=model_path,                # Model is required for synthetic data
    synthetic=True,                       # Enable synthetic data generation
    synthetic_sample=200,                 # Number of synthetic samples to generate
    dataset_name='dataset_with_synthetic'
)

print(db_synthetic)

# Access synthetic data
if db_synthetic.synthetic_data is not None:
    print("\nSynthetic data shape:", db_synthetic.synthetic_data.shape)
    print("\nSample synthetic data:")
    print(db_synthetic.synthetic_data.head())
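
How `DBDataset` builds its synthetic sample is not shown in this notebook. As a rough illustration of the general idea (drawing new rows that follow the distribution of the original data), the sketch below fits a multivariate normal to numeric features and samples from it. This is an assumed, simplified technique and not DeepBridge's generator; `sample_gaussian_like` is a hypothetical helper.

```python
import numpy as np


def sample_gaussian_like(X, n_samples, seed=0):
    """Draw rows from a multivariate normal fitted to the columns of X."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # columns are features
    return rng.multivariate_normal(mean, cov, size=n_samples)


rng = np.random.default_rng(42)
# Made-up "original" data: three numeric features with different scales.
X_orig = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 2.0, 0.5], size=(2000, 3))

X_syn = sample_gaussian_like(X_orig, n_samples=200)
print("synthetic shape:", X_syn.shape)
print("original means: ", np.round(X_orig.mean(axis=0), 2))
print("synthetic means:", np.round(X_syn.mean(axis=0), 2))
```

A Gaussian fit preserves only means and linear correlations, so it would not suit categorical columns or strongly non-linear structure.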

## Part 2: Working with Experiment

The `Experiment` class handles different types of modeling tasks and their configurations. It works with `DBDataset` to manage experiments, including model training, evaluation, and comparison.

In [None]:
# Create an experiment using the DBDataset with model
experiment = Experiment(
    dataset=db_with_model,
    experiment_type="binary_classification",
    test_size=0.2,
    random_state=42,
    config={"verbose": True}  # Additional configuration
)

# Show experiment properties
print(f"Experiment type: {experiment.experiment_type}")
print(f"Test size: {experiment.test_size}")
print(f"Random state: {experiment.random_state}")
print(f"Train data shape: {experiment.X_train.shape}")
print(f"Test data shape: {experiment.X_test.shape}")

### 2.1 Training distillation models

One of the main functions of the `Experiment` class is to train distillation models. These are simpler models that mimic the behavior of complex models.

In [None]:
# Create an experiment using the DBDataset with probabilities
experiment_probs = Experiment(
    dataset=db_with_probs,
    experiment_type="binary_classification",
    test_size=0.2,
    random_state=42,
    config={"verbose": True}
)

# Fit a distilled model using logistic regression
experiment_probs.fit(
    student_model_type=ModelType.LOGISTIC_REGRESSION,
    temperature=1.0,
    alpha=0.5,
    use_probabilities=True,  # Use pre-calculated probabilities
    n_trials=10,  # Number of hyperparameter optimization trials
    distillation_method="surrogate"
)
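
Conceptually, a surrogate student is trained on the teacher's outputs rather than on the true labels. The sketch below illustrates this with NumPy only: a linear student is fitted by least squares to the teacher's log-odds. The stand-in teacher, the data, and the helper names are invented for illustration and say nothing about how `Experiment.fit` is implemented.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_linear_surrogate(X, teacher_probs, eps=1e-6):
    """Least-squares fit of a linear model to the teacher's log-odds."""
    p = np.clip(teacher_probs, eps, 1.0 - eps)
    logits = np.log(p / (1.0 - p))
    X1 = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, logits, rcond=None)
    return coef


def predict_surrogate(coef, X):
    X1 = np.column_stack([X, np.ones(len(X))])
    return sigmoid(X1 @ coef)


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# Stand-in "teacher": a mildly non-linear scoring function.
teacher_probs = sigmoid(2.0 * X[:, 0] - X[:, 1] + 0.3 * X[:, 2] ** 2)

coef = fit_linear_surrogate(X, teacher_probs)
student_probs = predict_surrogate(coef, X)
agreement = np.mean((student_probs > 0.5) == (teacher_probs > 0.5))
print(f"label agreement with teacher: {agreement:.3f}")
```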

### 2.2 Evaluating and comparing models

After training, we can evaluate the distilled model and compare it to the original model.

In [None]:
# Get metrics for both train and test sets
metrics = experiment_probs.metrics

# Print metrics
print("Training set metrics:")
for key, value in metrics['train'].items():
    if key not in ['best_params', 'distillation_method'] and value is not None:
        print(f"  {key}: {value:.4f}")

print("\nTest set metrics:")
for key, value in metrics['test'].items():
    if key not in ['best_params', 'distillation_method'] and value is not None:
        print(f"  {key}: {value:.4f}")

In [None]:
# Get student model predictions
student_predictions = experiment_probs.get_student_predictions('test')

# Display sample predictions
print("Student model predictions:")
print(student_predictions.head())

### 2.3 Comparing teacher and student models

We can directly compare the teacher (original) and student (distilled) models.

In [None]:
# Compare teacher and student models
comparison = experiment_probs.compare_teacher_student_metrics()

# Display comparison for test set
# Filter and sort in one step to avoid modifying a slice in place
test_comparison = comparison[comparison['dataset'] == 'test'].sort_values('metric')

print("Teacher vs Student Model Comparison (Test Set):")
print(test_comparison)

In [None]:
# Visualize comparison
plt.figure(figsize=(12, 6))

# Filter metrics for plotting
metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1_score', 'auc_roc', 'auc_pr']
plot_data = test_comparison[test_comparison['metric'].isin(metrics_to_plot)]

# Create bar plot
x = np.arange(len(plot_data))
width = 0.35

plt.bar(x - width/2, plot_data['teacher_value'], width, label='Teacher')
plt.bar(x + width/2, plot_data['student_value'], width, label='Student')

plt.xlabel('Metric')
plt.ylabel('Value')
plt.title('Teacher vs Student Model Performance')
plt.xticks(x, plot_data['metric'])
plt.ylim([0.8, 1.0])  # Adjust as needed
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

## Part 3: Trying Different Distillation Methods

DeepBridge supports different distillation methods. The two main ones are:

1. **Surrogate Model**: Directly fits a model to the outputs of the teacher model
2. **Knowledge Distillation**: Uses a combination of soft targets and hard labels for training

Let's try the knowledge distillation method:

In [None]:
# Fit a distilled model using knowledge distillation
experiment_kd = Experiment(
    dataset=db_with_probs,
    experiment_type="binary_classification",
    test_size=0.2,
    random_state=42
)

experiment_kd.fit(
    student_model_type=ModelType.GBM,  # Try a different model type
    temperature=2.0,  # Higher temperature for softer probabilities
    alpha=0.7,  # More weight on the teacher's soft targets
    use_probabilities=True,
    n_trials=10,
    distillation_method="knowledge_distillation"  # Use knowledge distillation
)
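
In knowledge distillation as commonly described, `temperature` softens the teacher's probabilities and `alpha` weights those soft targets against the hard labels. The sketch below shows one common formulation for the binary case. Whether DeepBridge combines the two in exactly this way is an assumption, and `soften` and `blended_target` are hypothetical helpers.

```python
import numpy as np


def soften(p, temperature, eps=1e-12):
    """Rescale binary probabilities by dividing their log-odds by T."""
    p = np.clip(p, eps, 1.0 - eps)
    logit = np.log(p / (1.0 - p))
    return 1.0 / (1.0 + np.exp(-logit / temperature))


def blended_target(teacher_p, hard_label, temperature=2.0, alpha=0.7):
    """alpha weights the softened teacher output; (1 - alpha) the true label."""
    return alpha * soften(teacher_p, temperature) + (1.0 - alpha) * hard_label


teacher_p = np.array([0.99, 0.60, 0.05])
labels = np.array([1.0, 0.0, 0.0])

print("softened (T=2):", np.round(soften(teacher_p, 2.0), 3))
print("blended target:", np.round(blended_target(teacher_p, labels), 3))
```

With `temperature=1.0` the teacher's probabilities are unchanged; larger values pull them toward 0.5, which exposes more of the teacher's relative confidence between examples.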

In [None]:
# Compare results of different methods
surrogate_metrics = experiment_probs.metrics['test']
kd_metrics = experiment_kd.metrics['test']

# Print comparison table
print("Comparison of Distillation Methods (Test Set Metrics):")
print("\nMetric         | Surrogate    | Knowledge Distillation")
print("--------------|--------------|-----------------------")

for metric in ['accuracy', 'precision', 'recall', 'f1_score', 'auc_roc', 'auc_pr']:
    if metric in surrogate_metrics and metric in kd_metrics:
        print(f"{metric.ljust(14)}| {surrogate_metrics[metric]:.4f}      | {kd_metrics[metric]:.4f}")

### 3.1 Distribution similarity metrics

DeepBridge provides metrics to measure how well the student model mimics the probability distribution of the teacher model:

In [None]:
# Compare distribution similarity metrics
print("Distribution Similarity Metrics:")
print("\nMetric             | Surrogate    | Knowledge Distillation")
print("-------------------|--------------|-----------------------")

for metric in ['kl_divergence', 'ks_statistic', 'r2_score']:
    if metric in surrogate_metrics and metric in kd_metrics:
        print(f"{metric.ljust(19)}| {surrogate_metrics[metric]:.4f}      | {kd_metrics[metric]:.4f}")
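
For reference, the three similarity metrics named above can be computed along the following lines. This NumPy-only sketch uses standard textbook definitions (a histogram-based KL divergence, the two-sample KS statistic, and R² of student against teacher probabilities). The binning and smoothing choices are assumptions and may differ from DeepBridge's own calculations.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def kl_divergence(p_samples, q_samples, bins=20, eps=1e-10):
    """KL(P || Q) between histograms of two samples on [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)
    p = p / p.sum() + eps  # eps avoids log(0) for empty bins
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))


def r2_score(y_true, y_pred):
    """Coefficient of determination of y_pred against y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


rng = np.random.default_rng(0)
teacher = rng.beta(0.3, 0.3, size=2000)  # made-up teacher scores
student = np.clip(teacher + rng.normal(0.0, 0.05, size=2000), 0.0, 1.0)

print(f"KS: {ks_statistic(teacher, student):.4f}")
print(f"KL: {kl_divergence(teacher, student):.4f}")
print(f"R2: {r2_score(teacher, student):.4f}")
```

A high R² and a large KS statistic can coexist, as in the run logged earlier (KS 0.429, R² 0.973): R² compares predictions row by row, whereas KS compares the overall shapes of the two distributions.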

Let's visualize the probability distributions to see how well each method mimics the teacher's outputs:

In [None]:
# Get predictions from all models
surrogate_preds = experiment_probs.get_student_predictions('test')
kd_preds = experiment_kd.get_student_predictions('test')

# Extract teacher probabilities for the positive class
teacher_probs = db_with_probs.original_prob
for col in teacher_probs.columns:
    if col in ['prob_class_1', 'prob_1']:
        teacher_probs = teacher_probs[col].values
        break
if isinstance(teacher_probs, pd.DataFrame):
    teacher_probs = teacher_probs.iloc[:, 1].values  # Fall back to the second column (positive class)

# Plot distributions
plt.figure(figsize=(12, 6))

sns.kdeplot(teacher_probs, label='Teacher', color='blue', linewidth=2)
sns.kdeplot(surrogate_preds['prob_1'], label='Surrogate', color='red', linewidth=2)
sns.kdeplot(kd_preds['prob_1'], label='Knowledge Distillation', color='green', linewidth=2)

plt.xlabel('Probability of Positive Class')
plt.ylabel('Density')
plt.title('Comparison of Probability Distributions')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Conclusion

In this notebook, we explored the core components of the DeepBridge library:

1. **DBDataset**: A versatile class for managing datasets, features, and models
   - Creating datasets from unified or split data
   - Handling categorical features
   - Adding models and predictions
   - Generating synthetic data

2. **Experiment**: A powerful tool for running distillation experiments
   - Training student models using different distillation methods
   - Evaluating and comparing model performance
   - Analyzing distribution similarity

3. **Distillation Methods**: Different approaches to model compression
   - Surrogate models for direct mimicry
   - Knowledge distillation for more nuanced learning

DeepBridge is designed to compress complex models into simpler ones with limited loss of performance, which can enable faster inference, lower memory use, and easier deployment. In the run shown in Part 1, for example, the student reached a test accuracy of 0.9516 against the teacher's 0.9897.