# 03: Modeling and model comparison



Objective:
- Train and compare multiple models (SVM, KNN, MLP)
- Use stratified cross-validation
- Tune hyperparameters with GridSearch
- Compare metrics (accuracy, precision, recall, F1)
- Select a candidate model for production

Note:
This notebook is a laboratory for experiments.
No model is considered final here.


Imports

In [1]:

import sys
from pathlib import Path

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

# Add the parent directory to the import path so that `src` can be imported.
sys.path.append(str(Path("..").resolve()))

from src import train
from src.config import MODELS_CONFIG

# from src.viz import plot_confusion_matrix, plot_roc_curve

# Fixed seed for reproducibility.
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
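`MODELS_CONFIG` is defined in `src/config.py` and is not shown in this notebook. Judging from how the cells below read it, each entry carries `display_name`, `model`, `use_scaler`, `use_smote` and `param_grid`. The following is a minimal sketch of such a config. The estimator arguments, the flag values and the grid values are assumptions made for illustration (the grids are sized to match the candidate counts in the GridSearch log), not the project's actual settings.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

RANDOM_STATE = 42

# Hypothetical stand-in for src.config.MODELS_CONFIG.
# Keys and field names mirror how the notebook reads the config;
# every value below is an illustrative assumption.
MODELS_CONFIG_SKETCH = {
    "knn": {
        "display_name": "K-Nearest Neighbors",
        "model": KNeighborsClassifier(),
        "use_scaler": True,   # distance-based model, so feature scale matters
        "use_smote": False,
        "param_grid": {
            "model__n_neighbors": [5, 11, 15, 19],
            "model__weights": ["uniform", "distance"],
        },
    },
    "svm": {
        "display_name": "Support Vector Machine",
        "model": SVC(probability=True, random_state=RANDOM_STATE),
        "use_scaler": True,
        "use_smote": False,
        "param_grid": {
            "model__C": [0.1, 1, 10],
            "model__kernel": ["linear", "rbf"],
        },
    },
    "mlp": {
        "display_name": "MLP (Neural Net)",
        "model": MLPClassifier(max_iter=500, random_state=RANDOM_STATE),
        "use_scaler": True,
        "use_smote": False,
        "param_grid": {
            "model__alpha": [0.0001, 0.001, 0.01],
            "model__hidden_layer_sizes": [(64,), (64, 32)],
        },
    },
}

# Count how many parameter combinations each grid expands to.
candidate_counts = {}
for key, cfg in MODELS_CONFIG_SKETCH.items():
    n_candidates = 1
    for values in cfg["param_grid"].values():
        n_candidates *= len(values)
    candidate_counts[key] = n_candidates
    print(f"{key}: {cfg['display_name']} ({n_candidates} grid candidates)")
```

Keeping the estimator, its preprocessing flags and its search grid in one entry lets the training loop and the grid search iterate over the same dictionary.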


Training dataset and preprocessing

In [3]:

data_path = "../data/processed/train_dataset.csv"
TARGET_COL = "Depression"

df, X, y = train.load_data(data_path, target_col=TARGET_COL)
df.head()  # no visible output: only a cell's last expression is displayed

print("Shape of X:", X.shape)
print("Target distribution:")
y.value_counts(normalize=True)



Shape of X: (22257, 11)
Target distribution:


Depression
1    0.585703
0    0.414297
Name: proportion, dtype: float64
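
The target is moderately imbalanced (about 59% positive), which is why the objectives call for stratified cross-validation: each fold keeps roughly the same class ratio as the full dataset. The sketch below illustrates this with scikit-learn's `StratifiedKFold` on synthetic labels that mimic the proportions above. The synthetic data and the variable names are assumptions, not the project's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
n_samples = 2000

# Synthetic binary target with roughly 58.6% positives (assumption for the demo).
y_demo = (rng.random(n_samples) < 0.586).astype(int)
X_demo = rng.normal(size=(n_samples, 11))

overall_rate = y_demo.mean()
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

fold_rates = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    # Positive rate inside the validation part of this fold.
    rate = y_demo[val_idx].mean()
    fold_rates.append(rate)
    print(f"fold {fold:>2}: positive rate = {rate:.3f}")

print(f"overall positive rate = {overall_rate:.3f}")
```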

Training loop and cross-validation

In [4]:


results = []

for model_key, cfg in MODELS_CONFIG.items():
    print("=" * 60)
    print(f"Model: {cfg['display_name']}")

    # SMOTE is instantiated only for models whose config asks for it.
    smote = SMOTE(random_state=RANDOM_STATE) if cfg["use_smote"] else None

    pipeline = train.build_pipeline(
        model=cfg["model"],
        use_scaler=cfg["use_scaler"],
        use_smote=cfg["use_smote"],
        smote=smote
    )

    # 10-fold cross-validation with a 0.5 decision threshold.
    metrics = train.cross_validate(
        pipeline=pipeline,
        X=X,
        y=y,
        threshold=0.5,
        n_splits=10,
        random_state=RANDOM_STATE,
        verbose=True
    )

    train.summarize_cv_results(metrics)

    # Keep the mean of each metric for the comparison table.
    results.append({
        "model_key": model_key,
        "model": cfg["display_name"],
        "accuracy_mean": np.mean(metrics["accuracy"]),
        "precision_mean": np.mean(metrics["precision"]),
        "recall_mean": np.mean(metrics["recall"]),
        "f1_mean": np.mean(metrics["f1"])
    })



Model: K-Nearest Neighbors

FOLD 1
Accuracy:  0.7938
Precision: 0.8398
Recall:    0.8005
F1-score:  0.8196

FOLD 2
Accuracy:  0.8235
Precision: 0.8647
Recall:    0.8282
F1-score:  0.8461

FOLD 3
Accuracy:  0.8136
Precision: 0.8542
Recall:    0.8221
F1-score:  0.8378

FOLD 4
Accuracy:  0.8194
Precision: 0.8614
Recall:    0.8244
F1-score:  0.8425

FOLD 5
Accuracy:  0.8154
Precision: 0.8507
Recall:    0.8305
F1-score:  0.8405

FOLD 6
Accuracy:  0.8122
Precision: 0.8527
Recall:    0.8213
F1-score:  0.8367

FOLD 7
Accuracy:  0.8145
Precision: 0.8483
Recall:    0.8321
F1-score:  0.8401

FOLD 8
Accuracy:  0.8067
Precision: 0.8506
Recall:    0.8127
F1-score:  0.8312

FOLD 9
Accuracy:  0.8157
Precision: 0.8535
Recall:    0.8273
F1-score:  0.8402

FOLD 10
Accuracy:  0.8112
Precision: 0.8587
Recall:    0.8112
F1-score:  0.8343

MEANS AND STANDARD DEVIATIONS
Accuracy  : 0.8126 | SD: 0.0076
Precision : 0.8535 | SD: 0.0067
Recall    : 0.8210 | SD: 0.0095
F1        : 0.8369 | SD: 0.0070
Model: Support Vector Machine

FOLD 1
Accuracy:  0.8230
Precision: 0.8582
Recall:    0.8358
F1-s
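
The implementation of `train.cross_validate` is not shown in this notebook. A minimal sketch of what a helper with that call signature could do: split with stratified K-fold, fit a fresh copy of the pipeline on each training fold, apply the decision threshold to the predicted probabilities, and collect per-fold metrics. The function body, the use of `predict_proba`, the logistic-regression stand-in and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def cross_validate_sketch(pipeline, X, y, threshold=0.5, n_splits=10, random_state=42):
    """Hypothetical stand-in for train.cross_validate; returns per-fold metric lists."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    metrics = {"accuracy": [], "precision": [], "recall": [], "f1": []}

    for train_idx, val_idx in skf.split(X, y):
        model = clone(pipeline)  # fresh, unfitted copy for every fold
        model.fit(X[train_idx], y[train_idx])

        # Turn positive-class probabilities into labels with the given threshold.
        proba = model.predict_proba(X[val_idx])[:, 1]
        y_pred = (proba >= threshold).astype(int)
        y_true = y[val_idx]

        metrics["accuracy"].append(accuracy_score(y_true, y_pred))
        metrics["precision"].append(precision_score(y_true, y_pred, zero_division=0))
        metrics["recall"].append(recall_score(y_true, y_pred, zero_division=0))
        metrics["f1"].append(f1_score(y_true, y_pred, zero_division=0))

    return metrics


# Synthetic NumPy data for the demo (the notebook itself loads a CSV).
X_demo, y_demo = make_classification(n_samples=600, n_features=11, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv_metrics = cross_validate_sketch(pipe, X_demo, y_demo, n_splits=5)
for name, values in cv_metrics.items():
    print(f"{name:<9}: {np.mean(values):.4f} | SD: {np.std(values):.4f}")
```

Keeping the scaler inside the pipeline means it is refit on each training fold only, so no statistics from the validation fold leak into preprocessing.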

In [5]:
results_df = pd.DataFrame(results).sort_values(
    by="f1_mean",
    ascending=False
)

results_df


  model_key                   model  accuracy_mean  precision_mean  recall_mean   f1_mean
0       svm  Support Vector Machine       0.840949        0.871587     0.854325  0.862842
2       mlp        MLP (Neural Net)       0.841083        0.875247     0.849875  0.862334
1       knn     K-Nearest Neighbors       0.812598        0.853458     0.821032  0.836907
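
Each `*_mean` column is the plain average of the per-fold scores. As a sanity check, the KNN fold F1 values printed in the cross-validation log reproduce the reported mean and, using the population standard deviation (NumPy's default `ddof=0`), the reported spread. That the summary uses `ddof=0` is an inference from the numbers, not something the notebook states.

```python
import numpy as np

# F1 per fold for K-Nearest Neighbors, copied from the cross-validation log
# (values are rounded to 4 decimals there).
knn_f1 = np.array([
    0.8196, 0.8461, 0.8378, 0.8425, 0.8405,
    0.8367, 0.8401, 0.8312, 0.8402, 0.8343,
])

f1_mean = knn_f1.mean()
f1_std_population = knn_f1.std()       # ddof=0, NumPy's default
f1_std_sample = knn_f1.std(ddof=1)     # ddof=1, shown for comparison

print(f"mean         : {f1_mean:.4f}")
print(f"std (ddof=0) : {f1_std_population:.4f}")
print(f"std (ddof=1) : {f1_std_sample:.4f}")
```

KNN's fold-to-fold spread is about 0.007, so a gap of that size or smaller between two models, such as the roughly 0.0005 difference in `f1_mean` between SVM and MLP, is weak evidence for preferring one over the other.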


In [6]:
TOP_MODELS = results_df.head(3)["model_key"].tolist()
TOP_MODELS


['svm', 'mlp', 'knn']

In [7]:
grid_results = {}

for model_key in TOP_MODELS:
    cfg = MODELS_CONFIG[model_key]
    print("=" * 60)
    print(f"GridSearch for: {cfg['display_name']}")

    smote = SMOTE(random_state=RANDOM_STATE) if cfg["use_smote"] else None

    pipeline = train.build_pipeline(
        model=cfg["model"],
        use_scaler=cfg["use_scaler"],
        use_smote=cfg["use_smote"],
        smote=smote
    )

    # Search each model's own grid, optimizing F1 over 5 folds.
    grid = train.run_gridsearch(
        pipeline=pipeline,
        param_grid=cfg["param_grid"],
        X=X,
        y=y,
        scoring="f1",
        n_splits=5
    )

    grid_results[model_key] = grid

    print("Best params:", grid["best_params"])
    print("Best F1:", grid["best_score"])


GridSearch for: Support Vector Machine
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best params: {'model__C': 1, 'model__kernel': 'linear'}
Best F1: 0.864134118961192
GridSearch for: MLP (Neural Net)
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best params: {'model__alpha': 0.001, 'model__hidden_layer_sizes': (64, 32)}
Best F1: 0.8625602594527042
GridSearch for: K-Nearest Neighbors
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best params: {'model__n_neighbors': 19, 'model__weights': 'uniform'}
Best F1: 0.8531875635532501
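
The `model__` prefix in the best parameters follows scikit-learn's `<step name>__<parameter>` convention, which suggests that the estimator step of the pipeline is named `model`. Below is a minimal sketch of how a helper like `train.run_gridsearch` could wrap `GridSearchCV` and return the dictionary keys used above. The function body, the small grid and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def run_gridsearch_sketch(pipeline, param_grid, X, y, scoring="f1", n_splits=5, random_state=42):
    """Hypothetical stand-in for train.run_gridsearch."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    search = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        scoring=scoring,
        cv=cv,
        refit=True,  # refit the best configuration on all of X, y
    )
    search.fit(X, y)
    return {
        "best_params": search.best_params_,
        "best_score": search.best_score_,
        "best_estimator": search.best_estimator_,
    }


X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=42)

# The step name "model" is what makes the "model__..." grid keys resolve.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier()),
])
param_grid = {
    "model__n_neighbors": [5, 19],
    "model__weights": ["uniform", "distance"],
}

grid = run_gridsearch_sketch(pipe, param_grid, X_demo, y_demo)
print("Best params:", grid["best_params"])
print("Best F1:", round(grid["best_score"], 4))
```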


In [8]:

best_model_key = TOP_MODELS[0]
best_pipeline = grid_results[best_model_key]["best_estimator"]

# Caveat: X and y are the same data passed to the grid search, so the F1
# reported here is likely optimistic; a held-out set would give a fairer estimate.
best_threshold, best_f1 = train.find_best_threshold(
    model=best_pipeline,
    X_val=X,
    y_val=y
)

print("Best threshold:", best_threshold)
print("F1 at this threshold:", best_f1)


Best threshold: 0.33
F1 at this threshold: 0.8740481655554341
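
The cell above tunes the threshold on the same `X` and `y` that were passed to the grid search. A common alternative is to choose the threshold on data the model has not seen during fitting. The sketch below does this with a stratified hold-out split. `find_best_threshold_sketch`, the threshold grid, the logistic-regression stand-in and the synthetic data are assumptions, not the project's `train.find_best_threshold`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def find_best_threshold_sketch(model, X_val, y_val, thresholds=None):
    """Scan thresholds and return the (threshold, F1) pair with the highest F1."""
    if thresholds is None:
        thresholds = np.round(np.linspace(0.05, 0.95, 91), 2)  # 0.05, 0.06, ..., 0.95
    proba = model.predict_proba(X_val)[:, 1]
    scores = [
        f1_score(y_val, (proba >= t).astype(int), zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])


# Synthetic data with a class balance similar to the notebook's target.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, weights=[0.41, 0.59], random_state=42
)

# Fit on one part, tune the threshold on the held-out part.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

best_threshold, best_f1 = find_best_threshold_sketch(model, X_val, y_val)
f1_default = f1_score(y_val, (model.predict_proba(X_val)[:, 1] >= 0.5).astype(int))

print("Best threshold:", best_threshold)
print("F1 at best threshold:", round(best_f1, 4))
print("F1 at 0.5:", round(f1_default, 4))
```

Even a threshold picked on a hold-out set is tuned to that set, so a final F1 estimate would need yet another untouched split.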
