# Understanding the Experiment Class in DeepBridge

This notebook explains the purpose and functionality of the `Experiment` class in the DeepBridge library, which is a key component for managing and executing model validation and distillation tasks.

## 1. Introduction to the Experiment Class

The `Experiment` class in DeepBridge serves as a container for experiments related to model validation and distillation. It encapsulates the entire workflow of preparing data, running experiments, and evaluating results.

Let's first import the necessary modules:

In [1]:
import pandas as pd
import numpy as np
import sys
import os
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

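# Make a local DeepBridge checkout importable (adjust the path, or skip this if the package is installed)
sys.path.append(os.path.expanduser("~/projetos/DeepBridge"))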

# Import DeepBridge components
from deepbridge.db_data import DBDataset
from deepbridge.experiment import Experiment
from deepbridge.distillation.classification.model_registry import ModelType


## 2. Creating a Basic Experiment

To demonstrate the `Experiment` class, let's first generate some synthetic data and train a teacher model:

In [2]:
# Generate a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, 
                           n_classes=2, random_state=42)

# Convert to pandas DataFrame
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.Series(y, name='target')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

# Train a "teacher" model (e.g., a complex RandomForest)
teacher_model = RandomForestClassifier(n_estimators=100, random_state=42)
teacher_model.fit(X_train, y_train)

# Generate probability predictions from the teacher model
train_probs = teacher_model.predict_proba(X_train)
test_probs = teacher_model.predict_proba(X_test)

# Create DataFrame with probabilities
train_probs_df = pd.DataFrame(train_probs, columns=['prob_class_0', 'prob_class_1'], index=X_train.index)
test_probs_df = pd.DataFrame(test_probs, columns=['prob_class_0', 'prob_class_1'], index=X_test.index)

# Create a DBDataset instance
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

dataset = DBDataset(
    train_data=train_data,
    test_data=test_data,
    target_column='target',
    features=X_df.columns.tolist(),
    train_predictions=train_probs_df,
    test_predictions=test_probs_df,
    prob_cols=['prob_class_0', 'prob_class_1']
)

Now that we have our dataset prepared, we can create an Experiment:

In [6]:
# Create an experiment
experiment = Experiment(
    dataset=dataset,
    experiment_type="binary_classification"
)

# Let's verify our experiment has been initialized correctly
print(f"Experiment type: {experiment.experiment_type}")
print(f"Training data shape: {experiment.X_train.shape}")
print(f"Test data shape: {experiment.X_test.shape}")
print(f"Training labels shape: {experiment.y_train.shape}")
print(f"Test labels shape: {experiment.y_test.shape}")
print(f"Training probability predictions shape: {experiment.prob_train.shape}")
print(f"Test probability predictions shape: {experiment.prob_test.shape}")


=== Evaluating distillation model on train dataset ===
Student predictions shape: (800, 2)
First 3 student probabilities: [[3.20679257e-04 9.99679321e-01]
 [9.99951268e-01 4.87318541e-05]
 [9.99710938e-01 2.89061720e-04]]
Teacher probabilities type: <class 'pandas.core.frame.DataFrame'>
Using 'prob_class_1' column from teacher probabilities
Teacher probabilities shape: (800, 2)
First 3 teacher probabilities (positive class): [0.71 0.   0.08]
KS Statistic calculation: 0.47, p-value: 3.1408946187192175e-80
R² Score calculation: 0.8846692436908336
Teacher prob type: <class 'numpy.ndarray'>, shape: (800,)
Student prob type: <class 'numpy.ndarray'>, shape: (800,)
Teacher prob first 5 values: [0.71 0.   0.08 0.88 0.39]
Student prob first 5 values: [9.99679321e-01 4.87318541e-05 2.89061720e-04 9.99927453e-01
 6.18127369e-04]
KS calculation successful: (0.47, 3.1408946187192175e-80)
Sorted teacher dist - min: 0.0, max: 1.0, length: 800
Sorted student dist - min: 2.8611958523003585e-05, max: 0

In [None]:
# 2. Train the distillation model first
experiment.fit(
    student_model_type=ModelType.GBM,  # or whichever model you prefer
    temperature=1.0,
    alpha=0.5,
    use_probabilities=True,
    verbose=True,
)

In [None]:
# Option 1: Import directly
from deepbridge.validation.experiment_extensions import analyze_hyperparameters_workaround_fixed
importance = analyze_hyperparameters_workaround_fixed(experiment)

# Option 2: Use as a method of Experiment
importance = experiment.analyze_hyperparameters_workaround_fixed()

In [None]:
# Try simpler fit call
print("Trying simplified fit...")
experiment.fit(verbose=True)

# Check if model was created
print(f"Distillation model after simplified fit: {experiment.distillation_model is not None}")

In [None]:
# Check if extensions are properly integrated
print("Extensions check:")
print(f"Has analyze_hyperparameter_importance: {hasattr(experiment, 'analyze_hyperparameter_importance')}")
print(f"Has optimize_hyperparameters: {hasattr(experiment, 'optimize_hyperparameters')}")
print(f"Has estimate_uncertainty: {hasattr(experiment, 'estimate_uncertainty')}")

In [None]:
experiment.analyze_hyperparameter_importance()

In [None]:
# Call the hyperparameter importance analysis
importance = experiment.analyze_hyperparameter_importance(
    verbose=True
)

## 3. Training a Distillation Model

The most common use case for the Experiment class is training a distillation model. Let's demonstrate this:

In [None]:
# Train a distillation model using the experiment
# Use fit() with SurrogateModel (the default method)
experiment.fit(
    student_model_type=ModelType.GBM,
    student_params={'n_estimators': 100, 'learning_rate': 0.1},
    use_probabilities=True,
    verbose=True
    # distillation_method="surrogate" does not need to be specified; it is already the default
)

# Check the results of our distillation
print("\nDistillation Results:")
print("Train metrics:", experiment.results['train'])
print("\nTest metrics:", experiment.results['test'])

In [None]:
modelo_treinado = experiment.distillation_model

In [None]:
df_previsoes = experiment.get_student_predictions(dataset='test')  

In [None]:
df_previsoes

## 4. Key Features of the Experiment Class

The Experiment class provides several key features and methods:

### 4.1 Automatic Data Preparation

The class automatically handles data splitting and preparation:

In [None]:
# We can access the prepared data splits directly
print(f"X_train shape: {experiment.X_train.shape}")
print(f"X_test shape: {experiment.X_test.shape}")

# get_dataset_split returns the features, target, and teacher probabilities for a split
X_train, y_train, prob_train = experiment.get_dataset_split('train')
X_test, y_test, prob_test = experiment.get_dataset_split('test')

print(f"Features from get_dataset_split (train): {X_train.shape}")
print(f"Target from get_dataset_split (train): {y_train.shape}")
print(f"Probabilities from get_dataset_split (train): {prob_train.shape}")

### 4.2 Getting Student Model Predictions

We can easily get predictions from the trained student model:

In [None]:
# Get predictions from the student model
student_predictions = experiment.get_student_predictions(dataset='test')

# Let's look at the first few predictions
print("Student model predictions:")
print(student_predictions.head())

# Calculate metrics for the student model
student_metrics = experiment.calculate_student_metrics(dataset='test')
print("\nStudent model metrics:")
for metric, value in student_metrics.items():
    print(f"  {metric}: {value}")

### 4.3 Comparing Teacher and Student Models

One of the most valuable features is the ability to compare teacher and student models:

In [None]:
# Compare teacher and student model metrics
comparison_df = experiment.compare_teacher_student_metrics()

# Let's see the comparison
print("Teacher vs Student Model Comparison:")
print(comparison_df)

### 4.4 Evaluating Custom Predictions

The class also allows us to evaluate custom predictions:

In [None]:
# Create some simple predictions to evaluate
custom_predictions = pd.DataFrame({
    'y_pred': (X_test.iloc[:, 0] > 0).astype(int),  # Simple threshold-based prediction
    'prob_1': X_test.iloc[:, 0].clip(0, 1)  # Simple probability based on first feature
})

# Evaluate these predictions
custom_metrics = experiment.evaluate_predictions(
    predictions=custom_predictions,
    dataset='test',
    prob_column='prob_1'
)

print("\nCustom prediction metrics:")
for metric, value in custom_metrics.items():
    print(f"  {metric}: {value}")

## 5. How It Works: The Experiment Workflow

Let's explore the main workflow of the Experiment class:

### 5.1 Data Preparation

The `_prepare_data` method handles the train-test split. Here's what it does internally:

```python
def _prepare_data(self) -> None:
    """
    Prepare the data by performing train-test split on features and target.
    """
    X = self.dataset.X
    y = self.dataset.target
    
    self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
        X, y,
        test_size=self.test_size,
        random_state=self.random_state
    )
    
    # Split probabilities if available
    if self.dataset.original_prob is not None:
        prob_train_idx = self.X_train.index
        prob_test_idx = self.X_test.index
        
        self.prob_train = self.dataset.original_prob.loc[prob_train_idx]
        self.prob_test = self.dataset.original_prob.loc[prob_test_idx]
    else:
        self.prob_train = None
        self.prob_test = None
```
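The index-based lookup above is what keeps each teacher probability attached to the right row after shuffling. The following is a minimal, dependency-free sketch of that idea, not DeepBridge's code: `split_indices` and the toy `teacher_prob` mapping are hypothetical names, and a plain dictionary stands in for the `original_prob` DataFrame.

```python
import random

# Hypothetical toy data: row index -> teacher probability of the positive class.
teacher_prob = {idx: round(idx / 10, 1) for idx in range(10)}


def split_indices(indices, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train_idx, test_idx = split_indices(teacher_prob.keys())

# Look probabilities up by row index, mirroring original_prob.loc[...] above.
prob_train = {idx: teacher_prob[idx] for idx in train_idx}
prob_test = {idx: teacher_prob[idx] for idx in test_idx}

print(f"train rows: {len(prob_train)}, test rows: {len(prob_test)}")
```

Because the lookup is by index rather than by position, the probabilities stay aligned with the features however the rows were shuffled.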

### 5.2 Distillation Training

The core of the distillation process happens in the `fit` method, which initializes a `KnowledgeDistillation` object. Here's a simplified version of what happens internally:

```python
def fit(self, student_model_type, ...) -> 'Experiment':
    if use_probabilities:
        # Create distillation model using pre-calculated probabilities
        self.distillation_model = KnowledgeDistillation.from_probabilities(
            probabilities=self.prob_train,
            student_model_type=student_model_type,
            temperature=temperature,
            alpha=alpha,
            ...
        )
    else:
        # Create distillation model using the teacher model directly
        self.distillation_model = KnowledgeDistillation(
            teacher_model=self.dataset.model,
            student_model_type=student_model_type,
            ...
        )
    
    # Train the model
    self.distillation_model.fit(self.X_train, self.y_train, verbose=verbose)
    
    # Evaluate the model
    train_metrics = self._evaluate_distillation_model('train')
    test_metrics = self._evaluate_distillation_model('test')
    
    # Store results
    self.results['train'] = train_metrics['metrics']
    self.results['test'] = test_metrics['metrics']
    
    return self
```
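The simplified `fit` above passes `temperature` and `alpha` through without showing how they are used. In the common formulation of knowledge distillation, the temperature softens the probability distributions and `alpha` weights the soft-target loss (against the teacher) relative to the hard-label loss (against the ground truth). The sketch below illustrates that general formulation only; the helper names `soften`, `cross_entropy`, and `distillation_loss` are hypothetical, and DeepBridge's internal loss may differ.

```python
import math


def soften(probs, temperature=1.0, eps=1e-12):
    """Rescale a probability vector with a temperature, then renormalize."""
    logits = [math.log(max(p, eps)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def distillation_loss(teacher_probs, student_probs, true_label,
                      temperature=1.0, alpha=0.5):
    """Blend the soft-target loss (teacher) with the hard-label loss (truth)."""
    soft_loss = cross_entropy(soften(teacher_probs, temperature),
                              soften(student_probs, temperature))
    hard_target = [1.0 if i == true_label else 0.0
                   for i in range(len(student_probs))]
    hard_loss = cross_entropy(hard_target, student_probs)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


teacher = [0.29, 0.71]  # illustrative teacher probabilities
student = [0.10, 0.90]  # illustrative student probabilities

print(soften(teacher, temperature=1.0))  # temperature 1 leaves probabilities unchanged
print(soften(teacher, temperature=4.0))  # higher temperature moves them toward uniform
print(distillation_loss(teacher, student, true_label=1, temperature=2.0, alpha=0.5))
```

With `alpha=1.0` the student is trained purely to match the teacher; with `alpha=0.0` it ignores the teacher and fits the labels alone.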

### 5.3 Evaluation Process

The evaluation process is handled by `_evaluate_distillation_model`. Here's a simplified version of what happens internally:

```python
def _evaluate_distillation_model(self, dataset: str = 'test') -> dict:
    # Get the appropriate data for evaluation
    if dataset == 'train':
        X, y, prob = self.X_train, self.y_train, self.prob_train
    else:
        X, y, prob = self.X_test, self.y_test, self.prob_test
    
    # Get predictions from the student model
    y_pred = self.distillation_model.predict(X)
    y_prob = self.distillation_model.predict_proba(X)
    
    # Extract probability of positive class
    student_prob_pos = y_prob[:, 1] if y_prob.shape[1] > 1 else y_prob
    
    # Process teacher probabilities for comparison
    teacher_prob_pos = self._extract_teacher_probabilities(prob)
    
    # Calculate metrics using the Classification class
    metrics = self.metrics_calculator.calculate_metrics(
        y_true=y,
        y_pred=y_pred,
        y_prob=student_prob_pos,
        teacher_prob=teacher_prob_pos
    )
    
    # predictions_df is built in the full implementation; its construction is omitted here
    return {'metrics': metrics, 'predictions': predictions_df}
```
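The logged output in Section 2 reports a KS statistic and an R² score between teacher and student probabilities. The sketch below shows how these two quantities are conventionally defined, using the first five teacher and student values from that log. It is a plain-Python illustration under the assumption that the teacher probabilities serve as the reference for R²; the function names are hypothetical, and the library's own metric code may be organized differently.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gaps = []
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gaps.append(abs(cdf_a - cdf_b))
    return max(gaps)


def r2_score(reference, predicted):
    """Coefficient of determination of predicted values against a reference.

    Assumes the reference values are not all identical.
    """
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# First five positive-class probabilities from the logged output above.
teacher_prob = [0.71, 0.0, 0.08, 0.88, 0.39]
student_prob = [9.99679321e-01, 4.87318541e-05, 2.89061720e-04,
                9.99927453e-01, 6.18127369e-04]

print(f"KS statistic: {ks_statistic(teacher_prob, student_prob):.2f}")
print(f"R2 score: {r2_score(teacher_prob, student_prob):.3f}")
```

A KS statistic near 0 means the two probability distributions have a similar shape, while an R² near 1 means the student tracks the teacher row by row. The two can disagree, which is why reporting both is useful.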

## 6. Saving and Loading Models

The code shown so far does not include persistence. One common approach, assuming the trained distillation model can be pickled, is to save and load it with `joblib`:

In [None]:
import joblib

# Save the distilled model
joblib.dump(experiment.distillation_model, 'distilled_model.pkl')

# Later, load the model
loaded_model = joblib.load('distilled_model.pkl')

# Use the loaded model for predictions
predictions = loaded_model.predict(X_test)
probabilities = loaded_model.predict_proba(X_test)

## 7. Conclusion

The `Experiment` class in DeepBridge manages the complete knowledge distillation workflow. It provides:

1. **Data Management**: Automatic handling of data preparation and splitting
2. **Model Training**: Simplified interface for knowledge distillation with various configurations
3. **Evaluation**: Comprehensive metrics calculation for both teacher and student models
4. **Comparison**: Tools to compare different models and configurations
5. **Versatility**: Support for different experiment types and model configurations

This makes it a convenient entry point for anyone working with model distillation or looking to create more efficient versions of complex models.