# Comparing Gradient Boosting with and without quantum features

This notebook loads the *folds* available in the `features/` folder and trains Gradient Boosting models on the classical features alone (`class_0` to `class_12`) and on the combined classical + quantum set (`qf_0` to `qf_12`). It also evaluates a variant in which the quantum features are replaced by values predicted from the classical ones. We then compare performance across these scenarios and generate the requested predictions.


## Dependencies

The notebook uses `pandas`, `numpy`, `scikit-learn`, `matplotlib` and `seaborn`. If they are not yet installed in your environment, run the command below in a separate cell or directly in the terminal:

```bash
pip install pandas numpy scikit-learn matplotlib seaborn
```


In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.metrics import (
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.multioutput import MultiOutputRegressor

# Default visual style for the plots
sns.set_style("whitegrid")
plt.rcParams.update({"figure.figsize": (10, 5), "axes.titlesize": 14, "axes.labelsize": 12})



## Loading the data

Each CSV in `features/` corresponds to one *fold*. Every file has a `set` column indicating whether a sample belongs to the training or the test split.
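
As a rough illustration of that layout, the sketch below builds a toy fold and splits it on the `set` column. The column names `row_id`, `class_0`, `qf_0` and `y` mirror the ones this notebook uses later; the values are invented, and normalising the case of the label is an assumption about how the files might vary.

```python
import pandas as pd

# Toy stand-in for one fold file; every value here is made up for illustration.
toy_fold = pd.DataFrame(
    {
        "row_id": [0, 1, 2, 3],
        "set": ["train", "Train", "test", "TEST"],
        "class_0": [0.1, 0.4, 0.3, 0.9],
        "qf_0": [1.2, 0.7, 0.5, 0.2],
        "y": [0, 1, 0, 1],
    }
)

# Normalise the split label so that "Train" and "train" are treated alike.
split = toy_fold["set"].str.strip().str.lower()
toy_train = toy_fold[split == "train"]
toy_test = toy_fold[split == "test"]

print(f"train rows: {len(toy_train)}, test rows: {len(toy_test)}")
```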


In [None]:
data_dir = Path("features")
fold_paths = sorted(data_dir.glob("features_y_fold*.csv"))
if not fold_paths:
    raise FileNotFoundError("No `features_y_fold*.csv` file was found in the `features/` folder.")

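fold_frames = []
for path in fold_paths:
    frame = pd.read_csv(path)
    # Normalise the split label once so later "train"/"test" filters match regardless of case
    frame["set"] = frame["set"].str.lower()
    frame["fold_name"] = path.stem
    fold_frames.append(frame)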

fold_summary = (
    pd.concat(
        [df.assign(set=df["set"].str.lower())[["fold", "fold_name", "set"]] for df in fold_frames],
        ignore_index=True,
    )
    .value_counts()
    .unstack(fill_value=0)
)
fold_summary


## Feature groups and evaluation loop

The cell below first fits a regressor that predicts the quantum features from the classical ones. It then trains a `GradientBoostingClassifier` for each feature set on every *fold*, computes the metrics on the test set, and stores the resulting predictions.
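
To make the per-class metrics concrete, here is a small sketch on made-up labels and scores. The 0.5 threshold and the `toy_*` names are assumptions for illustration only; `pos_label` selects the class for which precision and recall are computed.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented ground truth and classifier scores for five samples.
toy_true = np.array([0, 0, 1, 1, 1])
toy_score = np.array([0.2, 0.6, 0.7, 0.4, 0.9])
toy_pred = (toy_score >= 0.5).astype(int)  # assumed decision threshold

toy_metrics = {
    "AUC": roc_auc_score(toy_true, toy_score),
    "Balanced Accuracy": balanced_accuracy_score(toy_true, toy_pred),
    "Precision Class 0": precision_score(toy_true, toy_pred, pos_label=0),
    "Precision Class 1": precision_score(toy_true, toy_pred, pos_label=1),
    "Recall Class 0": recall_score(toy_true, toy_pred, pos_label=0),
    "Recall Class 1": recall_score(toy_true, toy_pred, pos_label=1),
}

for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy is the mean of the two per-class recalls, which is why it is reported alongside them.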


In [None]:
classical_features = [col for col in fold_frames[0].columns if col.startswith("class_")]
quantum_features = [col for col in fold_frames[0].columns if col.startswith("qf_")]

# Fit the quantum-feature regressor on all available data.
# Note: this includes test rows, so the predicted features are not strictly out-of-sample.
all_data = pd.concat(fold_frames, ignore_index=True)

X_reg = all_data[classical_features]
y_reg = all_data[quantum_features]

quantum_regressor = MultiOutputRegressor(
    HistGradientBoostingRegressor(random_state=42)
)
quantum_regressor.fit(X_reg, y_reg)

predicted_quantum_features = [f"pred_{feature}" for feature in quantum_features]

for fold_df in fold_frames:
    predicted_values = quantum_regressor.predict(fold_df[classical_features])
    for pred_column, column_values in zip(predicted_quantum_features, predicted_values.T):
        fold_df[pred_column] = column_values

feature_sets = {
    "Benchmark": classical_features,
    "Quantum": classical_features + quantum_features,
    "Quantum (Regressor)": classical_features + predicted_quantum_features,
}

metric_functions = {
    "AUC": lambda y_true, y_score, y_pred: roc_auc_score(y_true, y_score) if len(np.unique(y_true)) > 1 else np.nan,
    "F1 Score Overall": lambda y_true, y_score, y_pred: f1_score(y_true, y_pred),
    "Balanced Accuracy": lambda y_true, y_score, y_pred: balanced_accuracy_score(y_true, y_pred),
    "Precision Class 0": lambda y_true, y_score, y_pred: precision_score(y_true, y_pred, pos_label=0),
    "Precision Class 1": lambda y_true, y_score, y_pred: precision_score(y_true, y_pred, pos_label=1),
    "Recall Class 0": lambda y_true, y_score, y_pred: recall_score(y_true, y_pred, pos_label=0),
    "Recall Class 1": lambda y_true, y_score, y_pred: recall_score(y_true, y_pred, pos_label=1),
}

results = []
prediction_frames = []

for fold_idx, fold_df in enumerate(fold_frames):
    train_df = fold_df[fold_df["set"] == "train"]
    test_df = fold_df[fold_df["set"] == "test"]

    y_train = train_df["y"]
    y_test = test_df["y"]

    for label, columns in feature_sets.items():
        X_train = train_df[columns]
        X_test = test_df[columns]

        model = GradientBoostingClassifier(random_state=42)
        model.fit(X_train, y_train)

        y_proba = model.predict_proba(X_test)[:, 1]
        y_pred = model.predict(X_test)

        for metric_name, metric_fn in metric_functions.items():
            value = metric_fn(y_test, y_proba, y_pred)
            results.append(
                {
                    "fold": fold_idx,
                    "fold_name": fold_df["fold_name"].iat[0],
                    "model": label,
                    "metric": metric_name,
                    "value": value,
                }
            )

        prediction_frames.append(
            pd.DataFrame(
                {
                    "fold": fold_idx,
                    "fold_name": fold_df["fold_name"].iat[0],
                    "row_id": test_df["row_id"].values,
                    "y_true": y_test.values,
                    "model": label,
                    "y_pred": y_pred,
                    "y_proba": y_proba,
                }
            )
        )


results_df = pd.DataFrame(results)
predictions_df = pd.concat(prediction_frames, ignore_index=True)

results_df.head()


## Metric summary per model

The table below reports the median and the interquartile range (25th to 75th percentile) of each metric across all *folds*.
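
As a minimal sketch of that aggregation, the snippet below computes the median and the quartiles of a handful of invented per-fold scores, one of which is missing. The `toy_values` array is hypothetical; the NaN-aware NumPy functions skip the missing entry, which matters here because AUC is set to NaN when a test split contains a single class.

```python
import numpy as np

# Invented per-fold scores; the NaN stands for a fold where the metric was undefined.
toy_values = np.array([0.70, 0.80, np.nan, 0.90, 0.60])

toy_median = np.nanmedian(toy_values)
toy_q1 = np.nanquantile(toy_values, 0.25)
toy_q3 = np.nanquantile(toy_values, 0.75)
toy_iqr = toy_q3 - toy_q1

print(f"{toy_median:.4f} [{toy_q1:.4f}, {toy_q3:.4f}] (IQR = {toy_iqr:.4f})")
```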


In [None]:
metrics_summary = (
    results_df
    .groupby(['model', 'metric'])['value']
    .agg(
        median=lambda s: np.nanmedian(s),
        q1=lambda s: np.nanquantile(s, 0.25),
        q3=lambda s: np.nanquantile(s, 0.75),
    )
    .reset_index()
)
metrics_summary['iqr'] = metrics_summary['q3'] - metrics_summary['q1']
metrics_summary['median_iqr'] = metrics_summary.apply(
    lambda row: f"{row['median']:.4f} [{row['q1']:.4f}, {row['q3']:.4f}]",
    axis=1,
)
metrics_summary


## Comparison plot

The chart reproduces the requested example: black bars show the benchmark (classical features only), yellow bars show the model with classical + quantum features, and blue bars show the model with classical + regressor-predicted quantum features. The value printed on each bar is the median across *folds*.


In [None]:
metric_order = [
    'AUC',
    'F1 Score Overall',
    'Balanced Accuracy',
    'Precision Class 0',
    'Precision Class 1',
    'Recall Class 0',
    'Recall Class 1',
]

plot_ready = (
    metrics_summary
    .pivot(index='metric', columns='model', values='median')
    .reindex(metric_order)
)

plot_ready = plot_ready.reindex(columns=["Benchmark", "Quantum", "Quantum (Regressor)"])

model_colors = {
    "Benchmark": "#2e2e2e",
    "Quantum": "#f1b82d",
    "Quantum (Regressor)": "#2b8cbe",
}

ax = plot_ready.plot(
    kind='bar',
    color=[model_colors[col] for col in plot_ready.columns],
    edgecolor='black'
)

ax.set_ylim(0, 1.05)
ax.set_ylabel('Score')
ax.set_xlabel('')
ax.set_title('Toxicity classification with Gradient Boosting')
ax.legend(title='Model')

for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', label_type='edge', padding=3)

plt.tight_layout()
plt.show()



## Predictions

The tables below show the first rows of the test predictions for the benchmark model, the model with quantum features, and the model with regressor-predicted quantum features, in that order.


In [None]:
benchmark_predictions = predictions_df[predictions_df['model'] == 'Benchmark']
quantum_predictions = predictions_df[predictions_df['model'] == 'Quantum']
quantum_regressor_predictions = predictions_df[predictions_df['model'] == 'Quantum (Regressor)']

benchmark_predictions.head()



In [None]:
quantum_predictions.head()


In [None]:
quantum_regressor_predictions.head()


## Deep regressor to approximate the quantum features

The following section repeats the experiment with a deep MLP implemented in NumPy. The goal is to predict the quantum features from the classical features and then evaluate how those predictions affect the classification model.
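
The MLP class in the next cell updates its weights with Adam. As a minimal sketch of that update rule in isolation, the snippet below applies it to a single scalar parameter on a simple quadratic. The helper `adam_minimise`, the objective, and the hyperparameter values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def adam_minimise(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run Adam on one scalar parameter (illustrative helper, not part of the notebook)."""
    w, m, v = float(w0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        # Exponential moving averages of the gradient and of its square.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w


# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(f"w after optimisation: {w_final:.4f}")
```

The class below applies the same arithmetic element-wise to every weight matrix and bias vector, keeping one pair of moment estimates per parameter array.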

In [None]:
import math
from typing import List, Optional, Sequence

import numpy as np

class DeepNumpyMLPRegressor:
    """Simple MLP implementation with ReLU activations and Adam optimisation."""

    def __init__(
        self,
        hidden_layers: Sequence[int] = (256, 128, 64),
        learning_rate: float = 1e-3,
        epochs: int = 100,
        batch_size: int = 256,
        random_state: Optional[int] = 42,
        beta1: float = 0.9,
        beta2: float = 0.999,
        eps: float = 1e-8,
    ) -> None:
        self.hidden_layers = tuple(hidden_layers)
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.random_state = random_state
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weights: List[np.ndarray] = []
        self.biases: List[np.ndarray] = []
        self.m_w: List[np.ndarray] = []
        self.v_w: List[np.ndarray] = []
        self.m_b: List[np.ndarray] = []
        self.v_b: List[np.ndarray] = []
        self._step = 0

    def _initialise(self, n_features: int, n_outputs: int) -> None:
        layer_sizes = [n_features, *self.hidden_layers, n_outputs]
        rng = np.random.default_rng(self.random_state)
        self.weights.clear()
        self.biases.clear()
        self.m_w.clear()
        self.v_w.clear()
        self.m_b.clear()
        self.v_b.clear()
        for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
            limit = math.sqrt(6.0 / (in_dim + out_dim))
            weight = rng.uniform(-limit, limit, size=(in_dim, out_dim)).astype(np.float64)
            bias = np.zeros(out_dim, dtype=np.float64)
            self.weights.append(weight)
            self.biases.append(bias)
            self.m_w.append(np.zeros_like(weight))
            self.v_w.append(np.zeros_like(weight))
            self.m_b.append(np.zeros_like(bias))
            self.v_b.append(np.zeros_like(bias))
        self._step = 0

    @staticmethod
    def _relu(values: np.ndarray) -> np.ndarray:
        return np.maximum(0.0, values)

    @staticmethod
    def _relu_grad(values: np.ndarray) -> np.ndarray:
        grad = np.zeros_like(values)
        grad[values > 0.0] = 1.0
        return grad

    def _forward(self, batch: np.ndarray):
        activations = [batch]
        pre_activations: List[np.ndarray] = []
        current = batch
        for idx, (weight, bias) in enumerate(zip(self.weights, self.biases)):
            linear = current @ weight + bias
            pre_activations.append(linear)
            if idx == len(self.weights) - 1:
                current = linear
            else:
                current = self._relu(linear)
            activations.append(current)
        return pre_activations, activations

    def _adam_step(self, grads_w: List[np.ndarray], grads_b: List[np.ndarray]) -> None:
        self._step += 1
        lr = self.learning_rate
        for idx, (grad_w, grad_b) in enumerate(zip(grads_w, grads_b)):
            self.m_w[idx] = self.beta1 * self.m_w[idx] + (1 - self.beta1) * grad_w
            self.v_w[idx] = self.beta2 * self.v_w[idx] + (1 - self.beta2) * (grad_w ** 2)
            self.m_b[idx] = self.beta1 * self.m_b[idx] + (1 - self.beta1) * grad_b
            self.v_b[idx] = self.beta2 * self.v_b[idx] + (1 - self.beta2) * (grad_b ** 2)
            m_hat_w = self.m_w[idx] / (1 - self.beta1 ** self._step)
            v_hat_w = self.v_w[idx] / (1 - self.beta2 ** self._step)
            m_hat_b = self.m_b[idx] / (1 - self.beta1 ** self._step)
            v_hat_b = self.v_b[idx] / (1 - self.beta2 ** self._step)
            self.weights[idx] -= lr * m_hat_w / (np.sqrt(v_hat_w) + self.eps)
            self.biases[idx] -= lr * m_hat_b / (np.sqrt(v_hat_b) + self.eps)

    def fit(self, features: np.ndarray, targets: np.ndarray) -> None:
        features = np.asarray(features, dtype=np.float64)
        targets = np.asarray(targets, dtype=np.float64)
        if features.ndim != 2:
            raise ValueError('Features must be a 2D array.')
        if targets.ndim != 2:
            raise ValueError('Targets must be a 2D array.')
        self._initialise(features.shape[1], targets.shape[1])
        rng = np.random.default_rng(self.random_state)
        n_samples = features.shape[0]
        indices = np.arange(n_samples)
        for epoch in range(1, self.epochs + 1):
            rng.shuffle(indices)
            for start in range(0, n_samples, self.batch_size):
                end = min(start + self.batch_size, n_samples)
                batch_idx = indices[start:end]
                batch_features = features[batch_idx]
                batch_targets = targets[batch_idx]
                pre_acts, activations = self._forward(batch_features)
                delta = (activations[-1] - batch_targets) / batch_features.shape[0]
                grads_w: List[np.ndarray] = []
                grads_b: List[np.ndarray] = []
                for layer_idx in reversed(range(len(self.weights))):
                    grad_w = activations[layer_idx].T @ delta
                    grad_b = delta.sum(axis=0)
                    grads_w.insert(0, grad_w)
                    grads_b.insert(0, grad_b)
                    if layer_idx != 0:
                        delta = (delta @ self.weights[layer_idx].T) * self._relu_grad(pre_acts[layer_idx - 1])
                self._adam_step(grads_w, grads_b)
            if epoch % 10 == 0 or epoch == self.epochs:
                _, activations = self._forward(features)
                mse = np.mean((activations[-1] - targets) ** 2)
                print(f'Epoch {epoch:03d} - MSE: {mse:.6f}')

    def predict(self, features: np.ndarray) -> np.ndarray:
        features = np.asarray(features, dtype=np.float64)
        _, activations = self._forward(features)
        return activations[-1]

In [None]:
deep_regressor = DeepNumpyMLPRegressor(
    hidden_layers=(512, 256, 128, 64),
    learning_rate=5e-4,
    epochs=80,
    batch_size=256,
    random_state=42,
)
deep_regressor.fit(X_reg.values, y_reg.values)

dl_predicted_quantum_features = [f'dl_pred_{feature}' for feature in quantum_features]
for fold_df in fold_frames:
    predictions = deep_regressor.predict(fold_df[classical_features].values)
    for column, values in zip(dl_predicted_quantum_features, predictions.T):
        fold_df[column] = values

In [None]:
dl_feature_set = classical_features + dl_predicted_quantum_features
dl_results = []
dl_prediction_frames = []
for fold_idx, fold_df in enumerate(fold_frames):
    train_df = fold_df[fold_df['set'] == 'train']
    test_df = fold_df[fold_df['set'] == 'test']
    y_train = train_df['y']
    y_test = test_df['y']
    X_train = train_df[dl_feature_set]
    X_test = test_df[dl_feature_set]
    model = GradientBoostingClassifier(random_state=42)
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)
    for metric_name, metric_fn in metric_functions.items():
        value = metric_fn(y_test, y_proba, y_pred)
        dl_results.append({
            'fold': fold_idx,
            'fold_name': fold_df['fold_name'].iat[0],
            'model': 'Quantum (Deep MLP)',
            'metric': metric_name,
            'value': value,
        })
    dl_prediction_frames.append(pd.DataFrame({
        'fold': fold_idx,
        'fold_name': fold_df['fold_name'].iat[0],
        'row_id': test_df['row_id'].values,
        'y_true': y_test.values,
        'model': 'Quantum (Deep MLP)',
        'y_pred': y_pred,
        'y_proba': y_proba,
    }))

results_df = pd.concat([results_df, pd.DataFrame(dl_results)], ignore_index=True)
predictions_df = pd.concat([predictions_df, pd.concat(dl_prediction_frames, ignore_index=True)], ignore_index=True)
results_df[results_df['model'] == 'Quantum (Deep MLP)'].head()

### Updated metrics

The tables and plots are recomputed to include the model that uses the features estimated by the deep MLP.

In [None]:
metrics_summary = (
    results_df
    .groupby(['model', 'metric'])['value']
    .agg(
        median=lambda s: np.nanmedian(s),
        q1=lambda s: np.nanquantile(s, 0.25),
        q3=lambda s: np.nanquantile(s, 0.75),
    )
    .reset_index()
)
metrics_summary['iqr'] = metrics_summary['q3'] - metrics_summary['q1']
metrics_summary['median_iqr'] = metrics_summary.apply(
    lambda row: f"{row['median']:.3f} (IQR: {row['q1']:.3f}-{row['q3']:.3f})",
    axis=1,
)
metrics_summary

In [None]:
metric_order = [
    'AUC',
    'F1 Score Overall',
    'Balanced Accuracy',
    'Precision Class 0',
    'Precision Class 1',
    'Recall Class 0',
    'Recall Class 1',
]
plot_ready = (
    metrics_summary
    .pivot(index='metric', columns='model', values='median')
    .reindex(metric_order)
)
plot_ready = plot_ready.reindex(columns=[
    'Benchmark',
    'Quantum',
    'Quantum (Regressor)',
    'Quantum (Deep MLP)',
])
model_colors = {
    'Benchmark': '#2c2c2c',
    'Quantum': '#f1c40f',
    'Quantum (Regressor)': '#3498db',
    'Quantum (Deep MLP)': '#9b59b6',
}
ax = plot_ready.plot(kind='bar', color=[model_colors[col] for col in plot_ready.columns], figsize=(12, 6))
ax.set_ylabel('Median across folds')
ax.set_xlabel('Metric')
ax.set_title('Performance comparison with predicted quantum features')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.legend(loc='lower right')
plt.tight_layout()
plt.show()

In [None]:
dl_predictions = predictions_df[predictions_df['model'] == 'Quantum (Deep MLP)']
dl_predictions.head()