# **Modeling process template**
## This is a starting template to guide the search for better-performing algorithms. To test a different algorithm, swap the import, the estimator in the pipeline, and the grid values. Different datasets will also be tested.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
df = pd.read_csv('../data/processed/proc_birds.csv')
df.head(3)

Unnamed: 0,normalize,feature preprocessing,mlfs.igmf_adapted.PyIT_IGMF,mlfs.igmf_adapted.PyIT_IGMF-n_features,mlfs.ppt_mi_adapted.PyIT_PPT_MI,mlfs.ppt_mi_adapted.PyIT_PPT_MI-n_features,mlfs.scls_adapted.PyIT_SCLS,mlfs.scls_adapted.PyIT_SCLS-n_features,mlfs.lrfs_adapted.PyIT_LRFS,mlfs.lrfs_adapted.PyIT_LRFS-n_features,...,weka.classifiers.functions.supportVector.NormalizedPolyKernel-E,weka.classifiers.functions.supportVector.NormalizedPolyKernel-L,weka.classifiers.functions.supportVector.Puk,weka.classifiers.functions.supportVector.Puk-O,weka.classifiers.functions.supportVector.Puk-S,weka.classifiers.functions.supportVector.RBFKernel,weka.classifiers.functions.supportVector.RBFKernel-G,F1 (macro averaged by label),Model Size,Model Size Log
0,0,1,0,-1.0,0,-1.0,0,-1.0,0,-1.0,...,-1.0,-1.0,0,-1.0,-1.0,0,-1.0,0.256,37076.0,10.520752
1,0,1,0,-1.0,0,-1.0,0,-1.0,0,-1.0,...,-1.0,-1.0,0,-1.0,-1.0,0,-1.0,0.222,29686.0,10.298465
2,0,1,0,-1.0,0,-1.0,0,-1.0,0,-1.0,...,-1.0,-1.0,0,-1.0,-1.0,0,-1.0,0.026,18387.0,9.819454
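
The `Unnamed: 0` column in the preview above is what pandas typically produces when a CSV was written with its row index and then read back without `index_col`. Whether that is what happened to `proc_birds.csv` is an assumption. The sketch below reproduces the effect on a hypothetical toy frame (`toy`, `naive`, and `clean` are illustrative names, not part of the notebook) and shows how `index_col=0` avoids it.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the processed dataset.
toy = pd.DataFrame({"normalize": [0, 1, 0], "F1": [0.256, 0.222, 0.026]})

# to_csv writes the row index by default, as a first column with an empty header.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back without index_col turns that column into "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 uses the first column as the index instead of as a feature.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))  # ['Unnamed: 0', 'normalize', 'F1']
print(list(clean.columns))  # ['normalize', 'F1']
```

If the column really is a saved index, reading the file this way (or dropping the column before building `X`) would keep it out of the feature matrix. The reported scores would then need to be recomputed.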


In [3]:
# Separate features and targets
# Note: X still contains the 'Unnamed: 0' column shown in the preview above,
# which looks like a saved row index rather than a real feature.
X = df.drop(columns=['F1 (macro averaged by label)', 'Model Size', 'Model Size Log'])
y = df[['F1 (macro averaged by label)', 'Model Size Log']]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the pipeline without tuning the hyperparameter yet:
# scaler, regressor, and wrapper (problem transformation)
pipeline = make_pipeline(
    StandardScaler(),
    MultiOutputRegressor(Ridge())  # default alpha; the grid search looks for a better one
)

# Define the values to test
grid = {'multioutputregressor__estimator__alpha': [0.1, 1.0, 10.0, 50.0, 100.0]}

# Instantiate the grid search
grid_search = GridSearchCV(
    estimator=pipeline,  # pipeline object created above
    param_grid=grid,  # set of values to test
    cv=5,  # number of folds
    scoring='r2',  # evaluation metric (R², averaged uniformly over the two targets)
    n_jobs=-1,  # use all available CPU cores
    verbose=1  # show progress
)

# Evaluate each alpha value with 5-fold cross-validation
print("Starting GridSearchCV")
grid_search.fit(X_train, y_train)

# Show the search results
print("\nGridSearchCV results:")
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best R² (cross-validation mean): {grid_search.best_score_:.4f}")
print("-" * 40)

# .best_estimator_ is the pipeline already refit with the best alpha
# on ALL the training data
best_model = grid_search.best_estimator_

# Predictions
print("\nMaking predictions")
predictions = best_model.predict(X_test)

# Evaluation, one metric per target
f1_score_r2 = r2_score(y_test['F1 (macro averaged by label)'], predictions[:, 0])
model_size_log_r2 = r2_score(y_test['Model Size Log'], predictions[:, 1])

f1_score_mse = mean_squared_error(y_test['F1 (macro averaged by label)'], predictions[:, 0])
model_size_log_mse = mean_squared_error(y_test['Model Size Log'], predictions[:, 1])

print("\n--- Final Results on the Test Set ---")
print(f"Best alpha: {best_model.named_steps['multioutputregressor'].estimator.alpha}")
print(f"R² for F1-score: {f1_score_r2:.4f}")
print(f"R² for model size (log): {model_size_log_r2:.4f}")
print(f"MSE for F1-score: {f1_score_mse:.4f}")
print(f"MSE for model size (log): {model_size_log_mse:.4f}")

Starting GridSearchCV
Fitting 5 folds for each of 5 candidates, totalling 25 fits

GridSearchCV results:
Best parameters found: {'multioutputregressor__estimator__alpha': 10.0}
Best R² (cross-validation mean): 0.8478
----------------------------------------

Making predictions

--- Final Results on the Test Set ---
Best alpha: 10.0
R² for F1-score: 0.7511
R² for model size (log): 0.9431
MSE for F1-score: 0.0055
MSE for model size (log): 0.2616
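
As the introduction notes, trying another algorithm only requires changing three things: the import, the estimator inside the pipeline, and the grid. The sketch below illustrates that swap with `RandomForestRegressor`, chosen here only as an example and not an algorithm the notebook has evaluated. It runs on synthetic two-target data from `make_regression` instead of `proc_birds.csv`, and the grid values are arbitrary placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor  # 1) swapped-in import
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-target data (an assumption; stands in for the real dataset).
X, y = make_regression(
    n_samples=200, n_features=10, n_targets=2, noise=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2) Same pipeline shape as the template, with a different estimator inside.
pipeline = make_pipeline(
    StandardScaler(),
    MultiOutputRegressor(RandomForestRegressor(random_state=42)),
)

# 3) Grid keys keep the same prefix; only the hyperparameter names change.
grid = {
    "multioutputregressor__estimator__n_estimators": [25, 50],
    "multioutputregressor__estimator__max_depth": [None, 5],
}

grid_search = GridSearchCV(pipeline, param_grid=grid, cv=3, scoring="r2")
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(f"Test R² (mean over targets): {grid_search.score(X_test, y_test):.4f}")
```

The `multioutputregressor__estimator__` prefix comes from the step name that `make_pipeline` generates plus the wrapper's `estimator` attribute, so it stays the same for any wrapped regressor. `RandomForestRegressor` also supports multiple targets natively; the wrapper is kept here only to mirror the template.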
