# Modeling
## | Finding the Best Hyperparameters for Modeling

### > **Notebook Objective**:

This notebook is the third part of a project to build a predictive model that classifies students with a high probability of dropping out of higher education. The goal of this stage is to find the best hyperparameters for a **Logistic Regression** model.

In the previous notebook, two development datasets were created: *Base T* and *Base S*.

- **Base T**: Contains all variables transformed by the previous steps.

- **Base S**: A subset of Base T, composed only of the variables with the best **Information Value (IV)** and **Variance Inflation Factor (VIF)** values.
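
As a reminder of what IV measures, here is a minimal sketch of the usual Information Value formula for a categorical feature against a binary target. The function name and the toy data are hypothetical, and the previous notebook may have computed IV differently (for example, with binning or smoothing).

```python
import math

def information_value(feature, target):
    """IV of a categorical feature against a binary target (1 = event)."""
    total_events = sum(target)
    total_non_events = len(target) - total_events
    iv = 0.0
    for category in set(feature):
        events = sum(1 for f, t in zip(feature, target) if f == category and t == 1)
        non_events = sum(1 for f, t in zip(feature, target) if f == category and t == 0)
        # Simplification: skip categories with an empty class to avoid log(0)
        if events == 0 or non_events == 0:
            continue
        pct_events = events / total_events
        pct_non_events = non_events / total_non_events
        iv += (pct_non_events - pct_events) * math.log(pct_non_events / pct_events)
    return iv

# Hypothetical toy data: a binary flag and a binary target
toy_feature = [1, 1, 1, 0, 0, 0, 1, 0]
toy_target = [1, 1, 0, 0, 0, 1, 1, 0]
iv = information_value(toy_feature, toy_target)
print(round(iv, 4))  # 1.0986
```

A feature whose categories have the same event rate as the overall population contributes an IV of zero, which is why low-IV variables are natural candidates for removal.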

In this notebook, the focus is on tuning the hyperparameters to improve the predictive model's performance.

## | Importing Libraries

In [1]:
# Data manipulation and transformation
import pandas as pd
import numpy as np

# Saving fitted models
from joblib import dump

# Modeling
from sklearn.linear_model import LogisticRegression

# Train/test split and grid search
from sklearn.model_selection import train_test_split, GridSearchCV

## | Loading and Preparing the Datasets

In [2]:
# Loading the data used for modeling
df_desen_t = pd.read_feather('../data/processed/data_t.ftr')
df_desen_s = pd.read_feather('../data/processed/data_s.ftr')

# Inspecting the shape of the datasets
display(df_desen_t.shape)
display(df_desen_s.shape)

(72692, 39)

(72692, 12)

In [3]:
# Splitting into features (X) and target (y)

# Development dataset with all transformed variables (Base T)
X_t = df_desen_t.drop(columns=['Target'])
y_t = df_desen_t.Target

# Development dataset with the subset of selected variables (Base S)
X_s = df_desen_s.drop(columns=['Target'])
y_s = df_desen_s.Target

# Inspecting the shape of the feature matrices
display(X_s.shape)
display(X_t.shape)

(72692, 11)

(72692, 38)

## | Base T

Below, we split development Base T into training and test data to build the model.

In [4]:
# Splitting development Base T into training and test sets
X_t_train, X_t_test, y_t_train, y_t_test = train_test_split(X_t, y_t, test_size=.2, random_state=412)

# Previewing the data
display(X_t_train.head(5))
display(X_t_test.head(5))

Unnamed: 0,Curricular_units_1st_sem_approved,Curricular_units_1st_sem_grade,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Tuition_fees_up_to_date,Scholarship_holder,Course_Animação_e_Design_Multimédia,Course_Design_de_Comunicação,Course_Enfermagem,Course_Enfermagem_Veterinária,...,Application_mode_Mudança_de_curso,Application_mode_Mudança_de_instituição/curso,"Application_mode_Portaria_n.º_533-A/99,_alínea_b2_Plano_Diferente","Application_mode_Portaria_nº_533-A/99,_item_b3_Outra_Instituição",Application_mode_Portaria_nº_612/93,Application_mode_Portaria_nº_854-B/99,Application_mode_Titulares_de_diploma_de_especialização_tecnológica,Application_mode_Titulares_de_diplomas_de_ciclo_curto,Application_mode_Titulares_de_outros_cursos_superiores,Application_mode_Transferência
22991,-0.066414,0.28573,-1.084482,0.067421,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
69351,0.305613,0.380712,-0.363225,0.548248,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
62940,0.305613,0.380712,0.358031,0.536227,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4531,0.67764,0.91894,0.71866,0.848765,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1214,-1.554522,-1.898842,-1.44511,-1.735681,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0


Unnamed: 0,Curricular_units_1st_sem_approved,Curricular_units_1st_sem_grade,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Tuition_fees_up_to_date,Scholarship_holder,Course_Animação_e_Design_Multimédia,Course_Design_de_Comunicação,Course_Enfermagem,Course_Enfermagem_Veterinária,...,Application_mode_Mudança_de_curso,Application_mode_Mudança_de_instituição/curso,"Application_mode_Portaria_n.º_533-A/99,_alínea_b2_Plano_Diferente","Application_mode_Portaria_nº_533-A/99,_item_b3_Outra_Instituição",Application_mode_Portaria_nº_612/93,Application_mode_Portaria_nº_854-B/99,Application_mode_Titulares_de_diploma_de_especialização_tecnológica,Application_mode_Titulares_de_diplomas_de_ciclo_curto,Application_mode_Titulares_de_outros_cursos_superiores,Application_mode_Transferência
56902,0.67764,0.602335,0.71866,0.458093,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
65311,0.67764,0.608667,1.079288,0.464103,1,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
47695,0.305613,0.228741,-1.44511,-1.735681,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
64607,0.305613,0.570674,0.71866,0.608351,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29978,0.67764,0.633995,0.71866,0.728558,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### | Applying GridSearch to Find the Best Hyperparameters

To find the best parameters, we use **GridSearch** to test different combinations with cross-validation. The code below uses the **Base T** data for this purpose.
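
To make the size of the search concrete, the following sketch enumerates the same parameter grid with `itertools.product`. It assumes that scikit-learn expands a grid by iterating over the parameter names in sorted order with the last name varying fastest, which is how `ParameterGrid` behaves to my understanding. The variable names `candidates` and `n_folds` are illustrative only.

```python
from itertools import product

# Same grid as in the notebook
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
}

# Assumed expansion order: sorted parameter names, last one varies fastest
keys = sorted(param_grid)
candidates = [dict(zip(keys, values))
              for values in product(*(param_grid[k] for k in keys))]

n_folds = 10
print(len(candidates))            # 20 candidates
print(len(candidates) * n_folds)  # 200 fits in total
print(candidates[17])             # {'C': 100, 'penalty': 'l1', 'solver': 'saga'}
```

Under that ordering, a row position in `cv_results_` maps directly to one combination, so position 17 corresponds to `C=100`, `penalty='l1'`, `solver='saga'`.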

In [5]:
# Creating the base model for the GridSearch
log_reg = LogisticRegression(max_iter=5000, random_state=412)

# Defining the parameters to test
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100], 
    'penalty': ['l1', 'l2'], 
    'solver': ['liblinear', 'saga'] 
}

# Running the GridSearch on the Base T training data
grid_search_T = GridSearchCV(log_reg, param_grid, cv=10, scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'], refit=False, verbose=3, n_jobs=-1)
grid_search_T.fit(X_t_train, y_t_train)

Fitting 10 folds for each of 20 candidates, totalling 200 fits


In [6]:
# Collecting the GridSearch results
results_T = pd.DataFrame(grid_search_T.cv_results_)

# Computing the mean rank across the five metrics
results_T['average_rank'] = results_T[['rank_test_accuracy', 'rank_test_precision', 'rank_test_recall', 'rank_test_f1', 'rank_test_roc_auc']].mean(axis=1)

# Previewing the results
results_T.head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_penalty,param_solver,params,split0_test_accuracy,split1_test_accuracy,...,split4_test_roc_auc,split5_test_roc_auc,split6_test_roc_auc,split7_test_roc_auc,split8_test_roc_auc,split9_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,average_rank
0,0.450204,0.025148,0.028705,0.006988,0.01,l1,liblinear,"{'C': 0.01, 'penalty': 'l1', 'solver': 'liblin...",0.898728,0.905949,...,0.944617,0.945809,0.942685,0.941169,0.944236,0.939087,0.942984,0.002332,20,16.4
1,1.013028,0.094291,0.025405,0.003041,0.01,l1,saga,"{'C': 0.01, 'penalty': 'l1', 'solver': 'saga'}",0.898384,0.906293,...,0.944953,0.945908,0.94335,0.9415,0.944467,0.93932,0.943358,0.002263,19,15.8
2,0.234153,0.047397,0.028507,0.00946,0.01,l2,liblinear,"{'C': 0.01, 'penalty': 'l2', 'solver': 'liblin...",0.898728,0.905433,...,0.946318,0.947187,0.943585,0.94267,0.946002,0.940486,0.944301,0.002531,18,14.6
3,0.802582,0.054092,0.029407,0.006669,0.01,l2,saga,"{'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}",0.899072,0.905605,...,0.946608,0.947396,0.944071,0.942941,0.946198,0.940685,0.944585,0.002488,17,14.4
4,0.671352,0.190354,0.028007,0.008356,0.1,l1,liblinear,"{'C': 0.1, 'penalty': 'l1', 'solver': 'libline...",0.899072,0.907325,...,0.947107,0.947983,0.94491,0.94376,0.94688,0.941068,0.945196,0.002578,16,12.4


To make the best candidates easier to spot, the cells below highlight the best values found by the **GridSearch**.

In [7]:
# Function to highlight the best value in each column
def highlight_values(s):
    if s.name == 'params':
        return [''] * len(s)
    # Fit and score times: lower is better
    if s.name in ['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time']:
        is_min = s == s.min()
        return ['background-color: blue' if v else '' for v in is_min]
    # Mean test scores: higher is better
    if s.name.startswith('mean_test'):
        is_max = s == s.max()
        return ['background-color: blue' if v else '' for v in is_max]
    # Score standard deviations, ranks, and the average rank: lower is better
    if s.name.startswith(('std_test', 'rank_test')) or s.name == 'average_rank':
        is_min = s == s.min()
        return ['background-color: blue' if v else '' for v in is_min]
    return [''] * len(s)

In [8]:
# Building the highlighted view of the parameters
selected_columns = ['params', 'mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time']
for col in results_T:
    if col.startswith('mean_test') or col.startswith('std_test') or col.startswith('rank_test') or col == 'average_rank':
        selected_columns.append(col)

# Creating the Styler object
results_highlighted_T = results_T[selected_columns].style.apply(highlight_values, subset=selected_columns[1:], axis=0)

# Displaying the result
results_highlighted_T

Unnamed: 0,params,mean_fit_time,std_fit_time,mean_score_time,std_score_time,mean_test_accuracy,std_test_accuracy,rank_test_accuracy,mean_test_precision,std_test_precision,rank_test_precision,mean_test_recall,std_test_recall,rank_test_recall,mean_test_f1,std_test_f1,rank_test_f1,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,average_rank
0,"{'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'}",0.450204,0.025148,0.028705,0.006988,0.903892,0.002412,20,0.91116,0.003479,2,0.786782,0.007385,20,0.844393,0.004431,20,0.942984,0.002332,20,16.4
1,"{'C': 0.01, 'penalty': 'l1', 'solver': 'saga'}",1.013028,0.094291,0.025405,0.003041,0.904046,0.002778,19,0.909678,0.004027,4,0.788857,0.007421,18,0.844951,0.004921,19,0.943358,0.002263,19,15.8
2,"{'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}",0.234153,0.047397,0.028507,0.00946,0.904321,0.003036,17,0.911627,0.004239,1,0.787716,0.007364,19,0.845139,0.005287,18,0.944301,0.002531,18,14.6
3,"{'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}",0.802582,0.054092,0.029407,0.006669,0.904304,0.00304,18,0.910142,0.004244,3,0.78922,0.007172,17,0.845366,0.005254,17,0.944585,0.002488,17,14.4
4,"{'C': 0.1, 'penalty': 'l1', 'solver': 'liblinear'}",0.671352,0.190354,0.028007,0.008356,0.905233,0.002957,12,0.904979,0.004646,5,0.797884,0.006521,15,0.848052,0.004972,14,0.945196,0.002578,16,12.4
5,"{'C': 0.1, 'penalty': 'l1', 'solver': 'saga'}",1.289702,0.324728,0.022405,0.005221,0.905078,0.002916,15,0.904596,0.004501,7,0.79778,0.006694,16,0.847824,0.004941,16,0.945243,0.002546,15,13.8
6,"{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}",0.227606,0.025102,0.020505,0.001361,0.905353,0.002863,1,0.904783,0.004633,6,0.798507,0.006457,13,0.848317,0.004818,11,0.945314,0.002527,14,9.0
7,"{'C': 0.1, 'penalty': 'l2', 'solver': 'saga'}",1.670678,0.141451,0.028507,0.007147,0.905061,0.003063,16,0.903976,0.005189,8,0.798403,0.0063,14,0.847905,0.005063,15,0.945347,0.002508,13,13.2
8,"{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}",1.817609,0.368561,0.030308,0.010671,0.905267,0.00301,10,0.903523,0.005127,9,0.799596,0.0063,10,0.848377,0.004981,10,0.945375,0.00249,4,8.6
9,"{'C': 1, 'penalty': 'l1', 'solver': 'saga'}",7.873366,2.888208,0.020305,0.002901,0.905267,0.003041,11,0.903476,0.005232,10,0.799648,0.006248,9,0.848386,0.005015,9,0.94538,0.002489,3,8.4


The criterion for selecting the best hyperparameters was the `average_rank` computed in the table above (the mean of the five metric ranks, where lower is better).
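
As an alternative to reading the row position off the table, the candidate with the lowest mean rank can be located programmatically. The sketch below uses a hypothetical three-row frame (`toy_results`, with made-up ranks) rather than the notebook's real results.

```python
import pandas as pd

# Hypothetical ranks for three candidates (not the notebook's real results)
toy_results = pd.DataFrame({
    'params': [{'C': 0.1}, {'C': 1}, {'C': 10}],
    'rank_test_accuracy': [3, 1, 2],
    'rank_test_recall': [3, 2, 1],
    'rank_test_roc_auc': [3, 2, 1],
})

# Mean rank across all metrics, as done for the real results
rank_cols = [c for c in toy_results.columns if c.startswith('rank_test')]
toy_results['average_rank'] = toy_results[rank_cols].mean(axis=1)

# Row label of the lowest mean rank (the first one wins if there is a tie)
best_position = toy_results['average_rank'].idxmin()
best_params = toy_results.loc[best_position, 'params']
print(best_params)  # {'C': 10}
```

Because `idxmin` returns only the first minimum, ties between candidates would still need a manual tie-break, for example by fit time.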

In [18]:
# Best parameters among those tested
results_T.iloc[17]['params']

{'C': 100, 'penalty': 'l1', 'solver': 'saga'}

### | Fitting the Model with the Selected Hyperparameters

In [19]:
# Creating a model with the best parameters found
logistic_t = LogisticRegression(max_iter=5000, random_state=412, C=100, penalty='l1', solver='saga')

# Fitting the model to the Base T training data
logistic_t.fit(X_t_train, y_t_train)

# Saving the model for evaluation in the next notebook
dump(logistic_t, '../src/models/logistic_regression_model_base_T.joblib')
print('Model saved successfully!')

Model saved successfully!


## | Base S


Below, we split development Base S into training and test data to build the model.

In [11]:
# Splitting development Base S into training and test sets
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(X_s, y_s, test_size=.2, random_state=412)

# Previewing the data
display(X_s_train.head(5))
display(X_s_test.head(5))

Unnamed: 0,Curricular_units_1st_sem_approved,Curricular_units_1st_sem_grade,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Tuition_fees_up_to_date,Scholarship_holder,Course_Animação_e_Design_Multimédia,Course_Enfermagem,Course_Engenharia_Informática,Course_Gestão_atendimento_noturno,Course_Serviço_Social
22991,-0.066414,0.28573,-1.084482,0.067421,1,0,0,0,0,0,0
69351,0.305613,0.380712,-0.363225,0.548248,1,0,0,0,0,0,0
62940,0.305613,0.380712,0.358031,0.536227,1,0,0,0,0,1,0
4531,0.67764,0.91894,0.71866,0.848765,1,0,0,0,0,0,0
1214,-1.554522,-1.898842,-1.44511,-1.735681,0,0,0,0,0,0,0


Unnamed: 0,Curricular_units_1st_sem_approved,Curricular_units_1st_sem_grade,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Tuition_fees_up_to_date,Scholarship_holder,Course_Animação_e_Design_Multimédia,Course_Enfermagem,Course_Engenharia_Informática,Course_Gestão_atendimento_noturno,Course_Serviço_Social
56902,0.67764,0.602335,0.71866,0.458093,1,1,0,0,0,0,0
65311,0.67764,0.608667,1.079288,0.464103,1,1,0,1,0,0,0
47695,0.305613,0.228741,-1.44511,-1.735681,0,0,0,0,0,0,0
64607,0.305613,0.570674,0.71866,0.608351,1,0,0,0,0,0,0
29978,0.67764,0.633995,0.71866,0.728558,1,0,0,0,0,0,0


### | Applying GridSearch to Find the Best Hyperparameters

To find the best parameters, we use **GridSearch** to test different combinations with cross-validation. The code below uses the **Base S** data for this purpose.
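
Because the search scores several metrics with `refit=False`, scikit-learn does not pick a single winner; it only reports one rank column per metric in `cv_results_`. The sketch below illustrates this on synthetic data. The names `X_demo`, `y_demo`, and `demo_search`, as well as the reduced grid, are assumptions made to keep the example small and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the notebook's training set (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=412)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    {'C': [0.1, 1, 10]},          # reduced grid, for speed only
    cv=3,
    scoring=['accuracy', 'roc_auc'],
    refit=False,
)
demo_search.fit(X_demo, y_demo)

# One rank column is produced per metric
rank_columns = sorted(k for k in demo_search.cv_results_ if k.startswith('rank_test'))
print(rank_columns)  # ['rank_test_accuracy', 'rank_test_roc_auc']

# With refit=False and several metrics, no best candidate is selected
print(hasattr(demo_search, 'best_params_'))  # False
```

This is why the notebook combines the per-metric ranks into its own `average_rank` column instead of relying on `best_params_`.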

In [12]:
# Running the GridSearch on the Base S training data
grid_search_S = GridSearchCV(log_reg, param_grid, cv=10, scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'], refit=False, verbose=3, n_jobs=-1)
grid_search_S.fit(X_s_train, y_s_train)

Fitting 10 folds for each of 20 candidates, totalling 200 fits


In [13]:
# Collecting the GridSearch results
results_S = pd.DataFrame(grid_search_S.cv_results_)

# Computing the mean rank across the five metrics
results_S['average_rank'] = results_S[['rank_test_accuracy', 'rank_test_precision', 'rank_test_recall', 'rank_test_f1', 'rank_test_roc_auc']].mean(axis=1)

# Previewing the results
results_S.head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_penalty,param_solver,params,split0_test_accuracy,split1_test_accuracy,...,split4_test_roc_auc,split5_test_roc_auc,split6_test_roc_auc,split7_test_roc_auc,split8_test_roc_auc,split9_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,average_rank
0,0.328977,0.016191,0.025305,0.005424,0.01,l1,liblinear,"{'C': 0.01, 'penalty': 'l1', 'solver': 'liblin...",0.896836,0.90423,...,0.941392,0.942834,0.94027,0.936982,0.942224,0.935974,0.93985,0.002409,20,12.2
1,0.476405,0.071463,0.021306,0.003952,0.01,l1,saga,"{'C': 0.01, 'penalty': 'l1', 'solver': 'saga'}",0.89718,0.90423,...,0.94163,0.942995,0.940799,0.937184,0.942524,0.936148,0.940126,0.00239,19,11.6
2,0.136931,0.033891,0.020404,0.003721,0.01,l2,liblinear,"{'C': 0.01, 'penalty': 'l2', 'solver': 'liblin...",0.894945,0.902682,...,0.941963,0.943167,0.940643,0.93761,0.942643,0.936118,0.940253,0.002395,18,15.8
3,0.442301,0.020218,0.023007,0.004798,0.01,l2,saga,"{'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}",0.896664,0.90337,...,0.942178,0.943371,0.940958,0.937818,0.942824,0.936228,0.940457,0.0024,17,15.2
4,0.330475,0.066439,0.019402,0.003801,0.1,l1,liblinear,"{'C': 0.1, 'penalty': 'l1', 'solver': 'libline...",0.896492,0.904058,...,0.942606,0.943683,0.94178,0.938407,0.943353,0.936588,0.940954,0.002349,14,8.6


To make the best candidates easier to spot, the cells below highlight the best values found by the **GridSearch**.

In [14]:
# Building the highlighted view of the parameters
selected_columns = ['params', 'mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time']
for col in results_S:
    if col.startswith('mean_test') or col.startswith('std_test') or col.startswith('rank_test') or col == 'average_rank':
        selected_columns.append(col)

# Creating the Styler object
results_highlighted_S = results_S[selected_columns].style.apply(highlight_values, subset=selected_columns[1:], axis=0)

# Displaying the result
results_highlighted_S

Unnamed: 0,params,mean_fit_time,std_fit_time,mean_score_time,std_score_time,mean_test_accuracy,std_test_accuracy,rank_test_accuracy,mean_test_precision,std_test_precision,rank_test_precision,mean_test_recall,std_test_recall,rank_test_recall,mean_test_f1,std_test_f1,rank_test_f1,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,average_rank
0,"{'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'}",0.328977,0.016191,0.025305,0.005424,0.902791,0.002932,2,0.91216,0.004205,3,0.782062,0.008322,18,0.842089,0.005306,18,0.93985,0.002409,20,12.2
1,"{'C': 0.01, 'penalty': 'l1', 'solver': 'saga'}",0.476405,0.071463,0.021306,0.003952,0.902825,0.002838,1,0.910726,0.004162,4,0.78367,0.007678,17,0.842413,0.005063,17,0.940126,0.00239,19,11.6
2,"{'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}",0.136931,0.033891,0.020404,0.003721,0.90188,0.003147,20,0.913085,0.004573,1,0.778067,0.009025,20,0.840154,0.005746,20,0.940253,0.002395,18,15.8
3,"{'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}",0.442301,0.020218,0.023007,0.004798,0.902395,0.003261,19,0.912639,0.004604,2,0.780246,0.009204,19,0.841234,0.005904,19,0.940457,0.0024,17,15.2
4,"{'C': 0.1, 'penalty': 'l1', 'solver': 'liblinear'}",0.330475,0.066439,0.019402,0.003801,0.902688,0.003257,5,0.907824,0.004354,7,0.786264,0.008593,14,0.842659,0.005771,3,0.940954,0.002349,14,8.6
5,"{'C': 0.1, 'penalty': 'l1', 'solver': 'saga'}",0.494508,0.044412,0.017504,0.001025,0.902567,0.003223,13,0.907298,0.004473,8,0.786419,0.008369,13,0.842523,0.00568,16,0.940971,0.002352,13,12.6
6,"{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}",0.17554,0.030217,0.018704,0.002493,0.902722,0.003152,3,0.908472,0.004248,5,0.785693,0.008369,16,0.842611,0.005596,9,0.940926,0.002351,16,9.8
7,"{'C': 0.1, 'penalty': 'l2', 'solver': 'saga'}",0.416797,0.035444,0.024507,0.006729,0.902722,0.003225,4,0.908127,0.004439,6,0.786056,0.008227,15,0.842673,0.005669,2,0.940932,0.002354,15,8.4
8,"{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}",0.30737,0.068053,0.023104,0.008082,0.902585,0.003171,12,0.90701,0.004373,10,0.786783,0.008152,11,0.842609,0.005574,10,0.940984,0.002347,1,8.8
9,"{'C': 1, 'penalty': 'l1', 'solver': 'saga'}",0.508611,0.066232,0.020705,0.004221,0.902567,0.003127,17,0.906907,0.004287,20,0.786834,0.00811,9,0.842594,0.005507,11,0.940975,0.002356,12,13.8


The criterion for selecting the best hyperparameters was the `average_rank` computed in the table above (the mean of the five metric ranks, where lower is better).

In [24]:
# Best parameters among those tested
results_S.iloc[13]['params']

{'C': 10, 'penalty': 'l1', 'solver': 'saga'}

### | Fitting the Model with the Selected Hyperparameters

In [25]:
# Creating a model with the best parameters found
logistic_s = LogisticRegression(max_iter=5000, random_state=412, C=10, penalty='l1', solver='saga')

# Fitting the model to the Base S training data
logistic_s.fit(X_s_train, y_s_train)

# Saving the model for evaluation in the next notebook
dump(logistic_s, '../src/models/logistic_regression_model_base_S.joblib')
print('Model saved successfully!')

Model saved successfully!
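
A model persisted with `dump` can be restored with `joblib.load` before evaluation. The sketch below shows the round trip on synthetic data and a temporary directory. The file name `demo_model.joblib`, the data, and the choice of ROC AUC as the metric are assumptions for illustration, not the evaluation performed in the next notebook.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data standing in for the real datasets (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=412)
model = LogisticRegression(max_iter=5000).fit(X_demo[:200], y_demo[:200])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.joblib')  # hypothetical file name
    dump(model, model_path)       # persist the fitted model
    reloaded = load(model_path)   # restore it, as a later notebook would

# Score the restored model on rows it was not fitted on
auc = roc_auc_score(y_demo[200:], reloaded.predict_proba(X_demo[200:])[:, 1])
print(f'Hold-out ROC AUC: {auc:.3f}')
```

The restored estimator carries the same coefficients as the fitted one, so no refitting is needed before scoring.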


## | Conclusion

With this model-building stage complete, we move on to the next and final part of the project: evaluating the fitted models on the validation datasets.