# Prediction with GradientBoostingClassifier

This notebook loads the *folds* available in the `features/` folder and trains Gradient Boosting models on two feature sets: the classical features alone (`class_0` to `class_12`), and the classical features combined with the quantum features (`qf_0` to `qf_12`).

In addition, a `HistGradientBoostingRegressor` is trained to approximate the quantum features from the classical ones, and we evaluate how these estimates affect the classifier's performance.


## Dependencies

The notebook uses `pandas`, `numpy`, `scikit-learn`, `matplotlib`, and `seaborn`. If they are not yet installed in your environment, run the command below in a separate cell (prefixed with `!`) or directly in the terminal:

```bash
pip install pandas numpy scikit-learn matplotlib seaborn
```


In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import (
    balanced_accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)
from sklearn.multioutput import MultiOutputRegressor

# Default visual style for the plots
sns.set_style("whitegrid")
plt.rcParams.update({"figure.figsize": (10, 5), "axes.titlesize": 14, "axes.labelsize": 12})




## Loading the data

Each CSV in `features/` corresponds to one *fold*. Every file has a `set` column indicating whether a sample belongs to the training or the test split.
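To make the expected layout concrete, here is a miniature fold built by hand. The column names (`fold`, `set`, `y`, `class_*`, `qf_*`) are the ones the code in this notebook reads; the values and the reduced number of features are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 6

# Hypothetical miniature fold: two classical and two quantum features
# instead of the thirteen of each used in the real files.
toy_fold = pd.DataFrame(
    {
        "fold": 0,
        "set": ["train", "train", "train", "train", "test", "test"],
        "y": [0, 1, 0, 1, 1, 0],
        "class_0": rng.normal(size=n_rows),
        "class_1": rng.normal(size=n_rows),
        "qf_0": rng.normal(size=n_rows),
        "qf_1": rng.normal(size=n_rows),
    }
)

# The `set` column is all that separates training rows from test rows.
train_rows = toy_fold[toy_fold["set"] == "train"]
test_rows = toy_fold[toy_fold["set"] == "test"]
print(len(train_rows), len(test_rows))  # 4 2
```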


In [None]:
data_dir = Path("features")
fold_paths = sorted(data_dir.glob("features_y_fold*.csv"))
if not fold_paths:
    raise FileNotFoundError("No `features_y_fold*.csv` file was found in the `features/` folder.")

fold_frames = []
for path in fold_paths:
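    frame = pd.read_csv(path)
    # Normalize the split labels once so later comparisons with "train"/"test" match regardless of case
    frame["set"] = frame["set"].str.lower()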
    frame["fold_name"] = path.stem
    fold_frames.append(frame)

fold_summary = (
    pd.concat(
        [df.assign(set=df["set"].str.lower())[["fold", "fold_name", "set"]] for df in fold_frames],
        ignore_index=True,
    )
    .value_counts()
    .unstack(fill_value=0)
)
fold_summary


## Feature groups and evaluation loop

The cell below first fits, within each *fold*, a regressor on the training rows to predict the quantum features from the classical ones. It then trains a `GradientBoostingClassifier` for every feature set in every *fold*, computes the metrics on the test split, and collects the values in `results_df`.
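The per-class metrics used here can be checked by hand on a small example. The sketch below uses made-up labels and plain NumPy rather than scikit-learn, to show that balanced accuracy is the unweighted mean of the two per-class recalls.

```python
import numpy as np

# Made-up labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])


def recall(label):
    """Fraction of samples of class `label` that were predicted as `label`."""
    relevant = y_true == label
    return np.mean(y_pred[relevant] == label)


def precision(label):
    """Fraction of predictions of `label` that are truly `label`."""
    selected = y_pred == label
    return np.mean(y_true[selected] == label)


balanced_accuracy = (recall(0) + recall(1)) / 2

print(recall(0), recall(1))        # 0.75 0.5
print(precision(0), precision(1))  # 0.75 0.5
print(balanced_accuracy)           # 0.625
```

Reporting precision and recall for both classes separately, as the metric dictionary does via `pos_label`, makes it visible when a model does well on one class at the expense of the other.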


In [None]:
classical_features = [col for col in fold_frames[0].columns if col.startswith("class_")]
quantum_features = [col for col in fold_frames[0].columns if col.startswith("qf_")]

# The quantum-feature regressor is trained separately in each fold, using only the training rows
predicted_quantum_features = [f"pred_{feature}" for feature in quantum_features]

for fold_df in fold_frames:
    train_mask = fold_df["set"] == "train"
    fold_regressor = MultiOutputRegressor(
        HistGradientBoostingRegressor(random_state=42)
    )
    fold_regressor.fit(
        fold_df.loc[train_mask, classical_features],
        fold_df.loc[train_mask, quantum_features],
    )
    # Predict for every row of the fold; note that training rows receive in-sample predictions
    predicted_values = fold_regressor.predict(fold_df[classical_features])
    for pred_column, column_values in zip(predicted_quantum_features, predicted_values.T):
        fold_df[pred_column] = column_values

feature_sets = {
    "Benchmark": classical_features,
    "Quantum": classical_features + quantum_features,
    "Quantum (Regressor)": classical_features + predicted_quantum_features,
}

metric_functions = {
    "AUC": lambda y_true, y_score, y_pred: roc_auc_score(y_true, y_score) if len(np.unique(y_true)) > 1 else np.nan,
    "F1 Score Overall": lambda y_true, y_score, y_pred: f1_score(y_true, y_pred),
    "Balanced Accuracy": lambda y_true, y_score, y_pred: balanced_accuracy_score(y_true, y_pred),
    "Precision Class 0": lambda y_true, y_score, y_pred: precision_score(y_true, y_pred, pos_label=0),
    "Precision Class 1": lambda y_true, y_score, y_pred: precision_score(y_true, y_pred, pos_label=1),
    "Recall Class 0": lambda y_true, y_score, y_pred: recall_score(y_true, y_pred, pos_label=0),
    "Recall Class 1": lambda y_true, y_score, y_pred: recall_score(y_true, y_pred, pos_label=1),
}

results = []

for fold_idx, fold_df in enumerate(fold_frames):
    train_df = fold_df[fold_df["set"] == "train"]
    test_df = fold_df[fold_df["set"] == "test"]

    y_train = train_df["y"]
    y_test = test_df["y"]

    for label, columns in feature_sets.items():
        X_train = train_df[columns]
        X_test = test_df[columns]

        model = GradientBoostingClassifier(random_state=42)
        model.fit(X_train, y_train)

        y_proba = model.predict_proba(X_test)[:, 1]
        y_pred = model.predict(X_test)

        for metric_name, metric_fn in metric_functions.items():
            value = metric_fn(y_test, y_proba, y_pred)
            results.append(
                {
                    "fold": fold_idx,
                    "fold_name": fold_df["fold_name"].iat[0],
                    "model": label,
                    "metric": metric_name,
                    "value": value,
                }
            )

results_df = pd.DataFrame(results)
results_df.head()


## Metric summary per model

The table below shows the median and the interquartile range (25th to 75th percentile) of each metric across all *folds*.
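A small worked example of the summary statistics, using hypothetical per-fold values. The NaN-aware functions skip folds where a metric is undefined, such as the AUC of a test split that contains a single class.

```python
import numpy as np

# Hypothetical AUC values from five folds; one fold produced NaN.
fold_values = np.array([0.80, 0.84, np.nan, 0.90, 0.86])

median = np.nanmedian(fold_values)
q1 = np.nanquantile(fold_values, 0.25)
q3 = np.nanquantile(fold_values, 0.75)

print(f"{median:.4f} [{q1:.4f}, {q3:.4f}]")  # 0.8500 [0.8300, 0.8700]
```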


In [None]:
metrics_summary = (
    results_df
    .groupby(['model', 'metric'])['value']
    .agg(
        median=lambda s: np.nanmedian(s),
        q1=lambda s: np.nanquantile(s, 0.25),
        q3=lambda s: np.nanquantile(s, 0.75),
    )
    .reset_index()
)
metrics_summary['iqr'] = metrics_summary['q3'] - metrics_summary['q1']
metrics_summary['median_iqr'] = metrics_summary.apply(
    lambda row: f"{row['median']:.4f} [{row['q1']:.4f}, {row['q3']:.4f}]",
    axis=1,
)
metrics_summary


## Comparison chart

The chart follows the requested example: black bars for the benchmark (classical features only), yellow bars for the model with classical + quantum features, and blue bars for the model with classical features plus the regressor-predicted quantum features. The values shown on the bars are the medians across *folds*.


In [None]:
metric_order = [
    'AUC',
    'F1 Score Overall',
    'Balanced Accuracy',
    'Precision Class 0',
    'Precision Class 1',
    'Recall Class 0',
    'Recall Class 1',
]

plot_ready = (
    metrics_summary
    .pivot(index='metric', columns='model', values='median')
    .reindex(metric_order)
)

plot_ready = plot_ready.reindex(columns=["Benchmark", "Quantum", "Quantum (Regressor)"])

model_colors = {
    "Benchmark": "#2e2e2e",
    "Quantum": "#f1b82d",
    "Quantum (Regressor)": "#2b8cbe",
}

ax = plot_ready.plot(
    kind='bar',
    color=[model_colors[col] for col in plot_ready.columns],
    edgecolor='black'
)

ax.set_ylim(0, 1.05)
ax.set_ylabel('Score')
ax.set_xlabel('')
ax.set_title('Toxicity classification with Gradient Boosting')
ax.legend(title='Model')

for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', label_type='edge', padding=3)

plt.tight_layout()
plt.show()



## Hyperparameter search

We run a simple grid search for the `HistGradientBoostingRegressor` that approximates the quantum features from the classical ones. Each combination is scored by MAE, RMSE, and R² on the test split of every available *fold*, and the results are summarized by their mean and standard deviation across *folds*.
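To get a sense of the cost of the search, the sketch below enumerates the grid with `ParameterGrid` and counts the individual model fits. The fold count `n_folds` is a hypothetical value; the real one is the number of CSV files found.

```python
from sklearn.model_selection import ParameterGrid

regressor_param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_iter": [200, 400],
    "max_depth": [3, 5],
    "min_samples_leaf": [20, 40],
}

# Four parameters with two values each give 2**4 combinations.
combinations = list(ParameterGrid(regressor_param_grid))
print(len(combinations))  # 16

n_folds = 5     # hypothetical; depends on how many fold files exist
n_targets = 13  # qf_0 to qf_12, one regressor per target
total_fits = len(combinations) * n_folds * n_targets
print(total_fits)  # 1040
```

Because the multi-output wrapper fits one regressor per quantum feature, the number of fits grows with the number of targets as well as with the grid size.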


In [None]:
search_feature_set = "Quantum (Regressor)"
regressor_param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_iter": [200, 400],
    "max_depth": [3, 5],
    "min_samples_leaf": [20, 40],
}

search_results = []

for params in ParameterGrid(regressor_param_grid):
    fold_metrics = []
    for fold_df in fold_frames:
        train_mask = fold_df["set"] == "train"
        test_mask = fold_df["set"] == "test"

        base_regressor = HistGradientBoostingRegressor(random_state=42, **params)
        regressor = MultiOutputRegressor(base_regressor)
        regressor.fit(
            fold_df.loc[train_mask, classical_features],
            fold_df.loc[train_mask, quantum_features],
        )

        y_true_quantum = fold_df.loc[test_mask, quantum_features]
        y_pred_quantum = regressor.predict(fold_df.loc[test_mask, classical_features])

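        fold_metrics.append(
            {
                "mae": mean_absolute_error(y_true_quantum, y_pred_quantum),
                # `squared=False` is deprecated in recent scikit-learn releases;
                # take the root per target, then average across targets.
                "rmse": np.sqrt(
                    mean_squared_error(y_true_quantum, y_pred_quantum, multioutput="raw_values")
                ).mean(),
                "r2": r2_score(y_true_quantum, y_pred_quantum),
            }
        )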

    search_results.append(
        {
            **params,
            "mean_mae": np.nanmean([metrics["mae"] for metrics in fold_metrics]),
            "std_mae": np.nanstd([metrics["mae"] for metrics in fold_metrics]),
            "mean_rmse": np.nanmean([metrics["rmse"] for metrics in fold_metrics]),
            "std_rmse": np.nanstd([metrics["rmse"] for metrics in fold_metrics]),
            "mean_r2": np.nanmean([metrics["r2"] for metrics in fold_metrics]),
            "std_r2": np.nanstd([metrics["r2"] for metrics in fold_metrics]),
        }
    )

search_results_df = (
    pd.DataFrame(search_results)
    .sort_values("mean_mae", ascending=True)
    .reset_index(drop=True)
)
search_results_df.head()
