# Understanding the Experiment Class in DeepBridge

This notebook explains the purpose and functionality of the `Experiment` class in the DeepBridge library, which is a key component for managing and executing model validation and distillation tasks.

## 1. Introduction to the Experiment Class

The `Experiment` class in DeepBridge serves as a container for experiments related to model validation and distillation. It encapsulates the entire workflow of preparing data, running experiments, and evaluating results.

Let's first import the necessary modules:

In [1]:
import pandas as pd
import numpy as np
import sys
import os
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Make a local DeepBridge checkout importable (adjust the path to your own clone)
sys.path.append(os.path.expanduser("~/projetos/DeepBridge"))

# Updated imports for the new package structure
from deepbridge.core.db_data import DBDataset
from deepbridge.core.experiment import Experiment
from deepbridge.utils.model_registry import ModelType


## 2. Creating a Basic Experiment

To demonstrate the `Experiment` class, let's first generate synthetic data and train a teacher model:

In [2]:
# Generate a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, 
                           n_classes=2, random_state=42)

# Convert to pandas DataFrame
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.Series(y, name='target')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

# Train a "teacher" model (e.g., a complex RandomForest)
teacher_model = RandomForestClassifier(n_estimators=100, random_state=42)
teacher_model.fit(X_train, y_train)

# Generate probability predictions from the teacher model
train_probs = teacher_model.predict_proba(X_train)
test_probs = teacher_model.predict_proba(X_test)

# Create DataFrame with probabilities
train_probs_df = pd.DataFrame(train_probs, columns=['prob_class_0', 'prob_class_1'], index=X_train.index)
test_probs_df = pd.DataFrame(test_probs, columns=['prob_class_0', 'prob_class_1'], index=X_test.index)

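# Combine features and target into single train/test frames
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

# Create a DBDataset instance that also carries the teacher's probabilities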
dataset = DBDataset(
    train_data=train_data,
    test_data=test_data,
    target_column='target',
    features=X_df.columns.tolist(),
    train_predictions=train_probs_df,
    test_predictions=test_probs_df,
    prob_cols=['prob_class_0', 'prob_class_1']
)
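
Before handing teacher probabilities to a dataset wrapper, it can help to confirm that they are well formed. The following is a minimal sketch using only pandas, NumPy, and scikit-learn; it does not call DeepBridge, and the three checks are illustrative suggestions rather than validations that `DBDataset` is known to perform.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same synthetic setup as the cell above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=2, random_state=42)
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
y_s = pd.Series(y, name="target")
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_s, test_size=0.2, random_state=42
)

teacher_model = RandomForestClassifier(n_estimators=100, random_state=42)
teacher_model.fit(X_train, y_train)

prob_cols = ["prob_class_0", "prob_class_1"]
train_probs_df = pd.DataFrame(
    teacher_model.predict_proba(X_train), columns=prob_cols, index=X_train.index
)

# predict_proba orders its columns by teacher_model.classes_, so the column
# names above are only correct if the classes are [0, 1] in that order
classes_in_order = list(teacher_model.classes_) == [0, 1]

# Each row of class probabilities should sum to 1
rows_sum_to_one = np.allclose(train_probs_df.sum(axis=1), 1.0)

# The probabilities must share the feature frame's index; train_test_split
# shuffles rows, so positional alignment alone would be fragile
index_aligned = train_probs_df.index.equals(X_train.index)

print(f"Classes in expected order: {classes_in_order}")
print(f"Rows sum to one: {rows_sum_to_one}")
print(f"Index aligned with X_train: {index_aligned}")
```

Passing `index=X_train.index` when building the probability frame is what keeps the rows aligned after the shuffle performed by `train_test_split`.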

Now that the dataset is prepared, we can create an `Experiment`:

In [3]:
# Create an experiment
experiment = Experiment(
    dataset=dataset,
    experiment_type="binary_classification"
)

# Let's verify our experiment has been initialized correctly
print(f"Experiment type: {experiment.experiment_type}")
print(f"Training data shape: {experiment.X_train.shape}")
print(f"Test data shape: {experiment.X_test.shape}")
print(f"Training labels shape: {experiment.y_train.shape}")
print(f"Test labels shape: {experiment.y_test.shape}")
print(f"Training probability predictions shape: {experiment.prob_train.shape}")
print(f"Test probability predictions shape: {experiment.prob_test.shape}")


=== Evaluating distillation model on train dataset ===
Student predictions shape: (800, 2)
First 3 student probabilities: [[3.20679257e-04 9.99679321e-01]
 [9.99951268e-01 4.87318541e-05]
 [9.99710938e-01 2.89061720e-04]]
Teacher probabilities type: <class 'pandas.core.frame.DataFrame'>
Using 'prob_class_1' column from teacher probabilities
Teacher probabilities shape: (800, 2)
First 3 teacher probabilities (positive class): [0.71 0.   0.08]
KS Statistic calculation: 0.47, p-value: 3.1408946187192175e-80
R² Score calculation: 0.8846692436908336
Teacher prob type: <class 'numpy.ndarray'>, shape: (800,)
Student prob type: <class 'numpy.ndarray'>, shape: (800,)
Teacher prob first 5 values: [0.71 0.   0.08 0.88 0.39]
Student prob first 5 values: [9.99679321e-01 4.87318541e-05 2.89061720e-04 9.99927453e-01
 6.18127369e-04]
KS calculation successful: (0.47, 3.1408946187192175e-80)
Sorted teacher dist - min: 0.0, max: 1.0, length: 800
Sorted student dist - min: 2.8611958523003585e-05, max: 0
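
The log above reports a KS statistic and an R² score that compare teacher and student probabilities for the positive class. The sketch below shows one way such figures can be computed, assuming `scipy.stats.ks_2samp` and `sklearn.metrics.r2_score`; the notebook does not show which functions DeepBridge calls internally. The student here is a hypothetical logistic regression trained on the teacher's hard labels, standing in for DeepBridge's surrogate model, so the numbers will not match the log.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same synthetic setup as earlier in the notebook
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Teacher: the complex model whose behaviour we want to mimic
teacher = RandomForestClassifier(n_estimators=100, random_state=42)
teacher.fit(X_train, y_train)
teacher_pos = teacher.predict_proba(X_train)[:, 1]

# Hypothetical student (an assumption, not DeepBridge's SurrogateModel):
# a simple model fit on the teacher's predicted labels
student = LogisticRegression(max_iter=1000)
student.fit(X_train, teacher.predict(X_train))
student_pos = student.predict_proba(X_train)[:, 1]

# Two-sample KS test: largest gap between the two empirical CDFs
ks_stat, ks_pvalue = ks_2samp(teacher_pos, student_pos)

# R²: how much of the variance in teacher probabilities the student explains
r2 = r2_score(teacher_pos, student_pos)

print(f"KS statistic: {ks_stat:.3f}, p-value: {ks_pvalue:.3g}")
print(f"R² score: {r2:.3f}")
```

The two metrics answer different questions. The KS statistic compares the overall distributions of the two probability vectors, so a student that pushes its outputs toward 0 and 1 can score a large KS value even when it ranks samples much like the teacher. R² compares the vectors sample by sample.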

In [5]:
experiment.model

<deepbridge.distillation.techniques.surrogate.SurrogateModel at 0x7fced5ddd580>
