# Modeling

## Objective
Train classification models to predict whether a customer is non-defaulting (`DEFAULT=0`) or defaulting (`DEFAULT=1`), using Random Forest and Logistic Regression.

In [44]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import joblib

## Data preparation
- Load the processed data (`x_train`, `x_test`, `y_train`, `y_test`)
- Apply SMOTE to balance the classes in the training set only; the test set keeps its original class distribution

In [45]:
x_train = pd.read_csv("data/processed/x_train.csv")
x_test = pd.read_csv("data/processed/x_test.csv")

y_train = pd.read_csv("data/processed/y_train.csv").values.ravel()
y_test = pd.read_csv("data/processed/y_test.csv").values.ravel()


In [46]:
smote = SMOTE(random_state=42)
x_res, y_res = smote.fit_resample(x_train, y_train)
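As a rough illustration of what `SMOTE` does when it balances the classes: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified pure-Python version of that idea. The helper name `smote_like_sample` and the toy points are hypothetical, and imbalanced-learn's actual implementation differs in its details.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbors."""
    if rng is None:
        rng = random.Random(42)
    base = rng.choice(minority)
    # Brute-force nearest neighbors by squared Euclidean distance
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]


# Hypothetical toy minority class with two features
minority_points = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Because the new point is an interpolation, it always stays inside the region already covered by the minority class, which is why SMOTE adds no information beyond what the original minority samples carry.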

## Cross-validation, grid search, training, and saving the models
- Use `StratifiedKFold` for stratified cross-validation
- Tune hyperparameters with `GridSearchCV`
- Evaluation metric: macro F1-score, the unweighted mean of the per-class F1-scores, so both classes count equally
- Models: Random Forest and Logistic Regression
- Save the best models with `joblib` for reuse

In [47]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_macro = make_scorer(f1_score, average='macro')
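To make the effect of stratification concrete, the following sketch counts the classes in each validation fold produced by `StratifiedKFold`. The toy labels are hypothetical; the splitter settings mirror the `cv` object above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 samples of class 0 and 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Count how many samples of each class land in the validation fold
    fold_counts.append(np.bincount(y_toy[val_idx], minlength=2).tolist())
print(fold_counts)
```

Each validation fold keeps the 80/20 ratio of the full label vector, so no fold ends up without minority samples.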

In [48]:
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'random_state': [42]
}

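# Exhaustive search over the grid, scored by macro F1 on each stratified fold
rf_grid = GridSearchCV(RandomForestClassifier(n_jobs=-1), rf_param_grid, scoring=f1_macro, cv=cv, n_jobs=-1, verbose=2)
rf_grid.fit(x_res, y_res)
rf_best = rf_grid.best_estimator_  # refit on all of x_res with the best parameters
print("\nBest RF parameters:", rf_grid.best_params_)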

Fitting 5 folds for each of 243 candidates, totalling 1215 fits


[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, random_state=42; total time=   4.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, random_state=42; total time=   5.9s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200, random_state=42; total time=   6.0s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, random_state=42; total time=   6.8s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, random_state=42; total time=   6.8s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, random_state=42; total time=   6.8s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100, random_state=42; total time=   7.0s
...
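The counts in the first log line follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fit once per fold. A small sketch of that arithmetic, with the grid values copied from `rf_param_grid` (the name `grid_demo` is only for this illustration):

```python
from itertools import product

# Same values as rf_param_grid above
grid_demo = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
    "random_state": [42],
}

# One candidate per combination of values, one fit per candidate and fold
n_candidates = sum(1 for _ in product(*grid_demo.values()))
n_splits = 5
n_fits = n_candidates * n_splits
print(n_candidates, n_fits)  # 243 1215
```

The final refit of the best candidate on the full resampled training set is one additional fit that this log line does not count.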

In [49]:
joblib.dump(rf_best, "models/rf.pkl")

['models/rf.pkl']

In [50]:
lr_param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['lbfgs','liblinear'],
    'max_iter': [1000],
    'random_state': [42]
}

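# Same search procedure as for the Random Forest, over the smaller LR grid
lr_grid = GridSearchCV(LogisticRegression(), lr_param_grid, scoring=f1_macro, cv=cv, n_jobs=-1, verbose=2)
lr_grid.fit(x_res, y_res)
lr_best = lr_grid.best_estimator_  # refit on all of x_res with the best parameters
print("\nBest LR parameters:", lr_grid.best_params_)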

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END C=0.01, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.01, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.01, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.01, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.01, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.01, max_iter=1000, random_state=42, solver=liblinear; total time=   0.2s
[CV] END C=0.1, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.01, max_iter=1000, random_state=42, solver=liblinear; total time=   0.2s
[CV] END C=0.1, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.1, max_iter=1000, random_state=42, solver=lbfgs; total time=   0.1s
[CV] END C=0.01, max_iter=1000, random_state=42, solver=liblinear; total time=   0.2s
...

In [51]:
joblib.dump(lr_best, "models/lr.pkl")

['models/lr.pkl']
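Reusing a saved model later amounts to calling `joblib.load` on the same path. The sketch below shows the round trip with a hypothetical tiny model and a temporary file, so it does not touch `models/rf.pkl` or `models/lr.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tiny model standing in for lr_best
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lr_demo.pkl")  # hypothetical path
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled models are best loaded with the same scikit-learn version that saved them, and only from trusted sources, since unpickling can execute arbitrary code.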

## Evaluation on the training set
- Classification reports and confusion matrices to check performance on the data used for training (the SMOTE-resampled set `x_res`, `y_res`)

In [52]:
# Random Forest
rf_pred_train = rf_best.predict(x_res)
print("\nRandom Forest - Classification Report (Train)")
print(classification_report(y_res, rf_pred_train))
print("Confusion Matrix (Train):")
print(confusion_matrix(y_res, rf_pred_train))

# Logistic Regression
lr_pred_train = lr_best.predict(x_res)
print("\nLogistic Regression - Classification Report (Train)")
print(classification_report(y_res, lr_pred_train))
print("Confusion Matrix (Train):")
print(confusion_matrix(y_res, lr_pred_train))



Random Forest - Classification Report (Train)
              precision    recall  f1-score   support

           0       0.93      0.92      0.93     17402
           1       0.92      0.94      0.93     17402

    accuracy                           0.93     34804
   macro avg       0.93      0.93      0.93     34804
weighted avg       0.93      0.93      0.93     34804

Confusion Matrix (Train):
[[16034  1368]
 [ 1127 16275]]

Logistic Regression - Classification Report (Train)
              precision    recall  f1-score   support

           0       0.67      0.83      0.74     17402
           1       0.77      0.59      0.67     17402

    accuracy                           0.71     34804
   macro avg       0.72      0.71      0.70     34804
weighted avg       0.72      0.71      0.70     34804

Confusion Matrix (Train):
[[14372  3030]
 [ 7177 10225]]


## Evaluation on the test set
- Assess how well the models generalize to unseen data

In [53]:
# Random Forest - Test
rf_pred_test = rf_best.predict(x_test)
print("\nRandom Forest - Classification Report (Test)")
print(classification_report(y_test, rf_pred_test))
print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, rf_pred_test))

# Logistic Regression - Test
lr_pred_test = lr_best.predict(x_test)
print("\nLogistic Regression - Classification Report (Test)")
print(classification_report(y_test, lr_pred_test))
print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, lr_pred_test))


Random Forest - Classification Report (Test)
              precision    recall  f1-score   support

           0       0.87      0.83      0.85      4350
           1       0.49      0.56      0.52      1264

    accuracy                           0.77      5614
   macro avg       0.68      0.70      0.69      5614
weighted avg       0.78      0.77      0.78      5614

Confusion Matrix (Test):
[[3623  727]
 [ 559  705]]

Logistic Regression - Classification Report (Test)
              precision    recall  f1-score   support

           0       0.87      0.82      0.85      4350
           1       0.49      0.57      0.53      1264

    accuracy                           0.77      5614
   macro avg       0.68      0.70      0.69      5614
weighted avg       0.78      0.77      0.77      5614

Confusion Matrix (Test):
[[3588  762]
 [ 542  722]]
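The figures in the classification report can be recomputed from the confusion matrix alone, which helps when reading the two side by side. The sketch below does this for the Random Forest test-set matrix printed above. The helper name `per_class_metrics` is hypothetical, and it assumes scikit-learn's layout of true classes on rows and predicted classes on columns.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class for a square confusion
    matrix with true classes on rows and predicted classes on columns."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Random Forest test-set confusion matrix reported above
rf_cm_test = [[3623, 727], [559, 705]]
rf_metrics = per_class_metrics(rf_cm_test)
rf_macro_f1 = sum(f1 for _, _, f1 in rf_metrics) / len(rf_metrics)

for label, (p, r, f1) in enumerate(rf_metrics):
    print(label, round(p, 2), round(r, 2), round(f1, 2))
print("macro F1:", round(rf_macro_f1, 2))
```

Read this way, a precision of 0.49 for class 1 means that roughly half of the customers flagged as defaulters in the test set actually defaulted, and a recall of 0.56 means that a little over half of the actual defaulters were caught.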
