Credits: This notebook is adapted from the examples provided by Professor Eduardo Bezerra.

## Objective of this work:
* Present the implementations and describe them in detail
* Present the results
* Explain them in a video

# 1 Loan repayment prediction

### Summary:
* Classification | 1500 examples | 11 attributes | credtrain.txt
* Preprocessing: one-hot encoding and normalization
* Mining: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier, GradientBoostingClassifier

## Data selection
First, the dataset must be downloaded and loaded into a dataframe.

In [6]:
import pandas as pd
nome_colunas = ['ESCT', 'NDEP', 'RENDA', 'TIPOR', 'VBEM', 'NPARC', 'VPARC', 'TEL', 'IDADE', 'RESMS', 'ENTRADA', 'CLASSE']
data = pd.read_csv('cic1205/data/credtrain.txt', sep='\t', names=nome_colunas)

* The argument **sep='\t'** tells pandas that the fields are separated by tabs
* The argument **names=nome_colunas** names the columns, since the file has no header row. A list with the column names was created and passed to this parameter

## Preprocessing
Next, the categorical and non-categorical attributes are handled separately.
* Categorical: encoded into new binary variables using one-hot encoding
* Non-categorical: normalized
* Both are then joined into a final dataframe

In [23]:
# List of categorical variables
variaveis_categoricas = ['ESCT', 'NDEP', 'TIPOR', 'TEL']

# List of numerical variables
variaveis_numericas = ['RENDA', 'VBEM', 'NPARC', 'VPARC', 'IDADE', 'RESMS', 'ENTRADA']

# Dataframe with the categorical variables
data_cat = data[variaveis_categoricas]

# Dataframe with the numerical variables
data_num = data[variaveis_numericas]

### One-hot encoding

In [111]:
from sklearn.preprocessing import OneHotEncoder # Class that binarizes categorical attributes

codificador = OneHotEncoder(sparse_output=False)   # Create the encoder object (dense output)
codificador.fit(data_cat)                          # Fit the encoder to the data

def codificar(input_df):
    nparray_encoded = codificador.transform(input_df) # transform returns the encoded values as a NumPy array
    column_names = codificador.get_feature_names_out(input_features=input_df.columns)
    data_cat_encoded = pd.DataFrame(nparray_encoded, columns=column_names) # Wrap the array in a dataframe
    return data_cat_encoded

data_cat_encoded = codificar(data_cat)
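As a rough illustration of what `codificador.transform` produces, the sketch below one-hot encodes a single column in plain Python. The helper `one_hot` and the sample values are hypothetical and are not taken from credtrain.txt; only the column naming (`TEL_0`, `TEL_1`) mirrors the names that appear in the encoded dataframe.

```python
def one_hot(column_name, values):
    """Encode one categorical column as binary indicator columns."""
    categories = sorted(set(values))                     # one output column per distinct value
    names = [f"{column_name}_{c}" for c in categories]   # e.g. TEL_0, TEL_1
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return names, rows

# Hypothetical TEL values, used only for illustration
names, rows = one_hot("TEL", [0, 1, 0, 0])
for row in rows:
    print(dict(zip(names, row)))
```

Each row contains exactly one 1.0, in the column that matches its original category.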

### Normalization

In [112]:
from sklearn.preprocessing import MinMaxScaler  # Import the Min-Max normalization class
scaler = MinMaxScaler()                         # Object that performs the normalization
scaler.fit(data_num)                            # Fit the scaler to the data

def normalizar(input_df):
    data_num_normalized = pd.DataFrame(   # Build a dataframe from an ndarray (numpy)
        scaler.transform(input_df),       # The ndarray is returned by the transform method
        columns=input_df.columns
    )
    return data_num_normalized

data_num_normalized = normalizar(data_num)
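Min-max normalization maps each value to (x - min) / (max - min), where min and max come from the data the scaler was fitted on. The following plain-Python sketch shows the fit and transform steps separately; the function names and the income values are hypothetical, and the sketch assumes the fitted maximum is strictly greater than the minimum.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the fitting data."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Rescale values with a previously learned range (assumes hi > lo)."""
    return [(v - lo) / (hi - lo) for v in values]

train_income = [300.0, 500.0, 1300.0]      # hypothetical values
lo, hi = fit_min_max(train_income)

train_scaled = apply_min_max(train_income, lo, hi)
new_scaled = apply_min_max([1500.0], lo, hi)   # value outside the fitted range

print(train_scaled)
print(new_scaled)
```

Because the range is learned once and then reused, data seen later can fall outside the interval [0, 1] when it contains values beyond the fitted minimum or maximum.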

In [113]:
data_num_normalized

| | RENDA | VBEM | NPARC | VPARC | IDADE | RESMS | ENTRADA |
|---|---|---|---|---|---|---|---|
| 0 | 0.007792 | 0.003514 | 0.347826 | 0.003026 | 0.117647 | 0.114286 | 0.0 |
| 1 | 0.006494 | 0.045405 | 0.391304 | 0.022693 | 0.274510 | 0.014286 | 0.0 |
| 2 | 0.103896 | 0.142973 | 0.347826 | 0.113464 | 0.725490 | 0.114286 | 0.0 |
| 3 | 0.350649 | 0.068108 | 0.478261 | 0.039334 | 0.235294 | 0.142857 | 0.0 |
| 4 | 0.090909 | 0.137568 | 0.478261 | 0.092284 | 0.098039 | 0.016667 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | 0.025974 | 0.037027 | 0.478261 | 0.015129 | 0.803922 | 0.114286 | 0.0 |
| 1496 | 0.228182 | 0.000270 | 0.000000 | 0.405446 | 0.313725 | 0.114286 | 0.0 |
| 1497 | 0.035065 | 0.057027 | 0.478261 | 0.031770 | 0.176471 | 0.342857 | 0.0 |
| 1498 | 0.007792 | 0.022703 | 0.478261 | 0.003026 | 0.274510 | 0.085714 | 0.0 |


### Concatenating the preprocessed dataframes

In [114]:
data_preprocessed = pd.concat( [ data_cat_encoded, data_num_normalized, data['CLASSE'] ], axis=1 )
# axis=1 concatenates along the columns
data_preprocessed

| | ESCT_0 | ESCT_1 | ESCT_2 | ESCT_3 | NDEP_0 | NDEP_1 | NDEP_2 | NDEP_3 | NDEP_4 | NDEP_5 | ... | TEL_0 | TEL_1 | RENDA | VBEM | NPARC | VPARC | IDADE | RESMS | ENTRADA | CLASSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.007792 | 0.003514 | 0.347826 | 0.003026 | 0.117647 | 0.114286 | 0.0 | 1 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.006494 | 0.045405 | 0.391304 | 0.022693 | 0.274510 | 0.014286 | 0.0 | 1 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.103896 | 0.142973 | 0.347826 | 0.113464 | 0.725490 | 0.114286 | 0.0 | 1 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.350649 | 0.068108 | 0.478261 | 0.039334 | 0.235294 | 0.142857 | 0.0 | 1 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.090909 | 0.137568 | 0.478261 | 0.092284 | 0.098039 | 0.016667 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.025974 | 0.037027 | 0.478261 | 0.015129 | 0.803922 | 0.114286 | 0.0 | 1 |
| 1496 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.228182 | 0.000270 | 0.000000 | 0.405446 | 0.313725 | 0.114286 | 0.0 | 1 |
| 1497 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.035065 | 0.057027 | 0.478261 | 0.031770 | 0.176471 | 0.342857 | 0.0 | 1 |
| 1498 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.007792 | 0.022703 | 0.478261 | 0.003026 | 0.274510 | 0.085714 | 0.0 | 1 |


## Mining

### Splitting the training, test, and validation sets

In [115]:
from sklearn.model_selection import train_test_split # Import the function that performs the sampling

X_train, X_test, y_train, y_test = train_test_split(data_preprocessed.iloc[:,:-1], # features
                                                    data_preprocessed.iloc[:,-1],  # labels
                                                    test_size=0.10,                # fraction held out as the test set (10%)
                                                    random_state=1)                # seed
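To make the effect of `test_size=0.10` concrete, here is a minimal plain-Python sketch of a seeded hold-out split over row indices. The helper `holdout_split` is hypothetical and is not scikit-learn's implementation, so it will not select the same rows as `train_test_split`; it only illustrates the idea of shuffling once with a fixed seed and reserving 10% of the rows.

```python
import random

def holdout_split(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and reserve a fraction for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same shuffle every run
    n_test = round(n_rows * test_size)     # number of rows held out
    return indices[n_test:], indices[:n_test]

# 1500 examples, as stated in the summary of this notebook
train_idx, test_idx = holdout_split(n_rows=1500, test_size=0.10, seed=1)
print(len(train_idx), len(test_idx))
```

With 1500 examples, a 10% test fraction leaves 1350 rows for training and 150 for testing, which matches the support of 150 in the test report.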

### LogisticRegression

#### Training

In [116]:
from sklearn.linear_model import LogisticRegression # Import the classifier class
model = LogisticRegression(random_state=1)          # Create a predictor that uses logistic regression
model.fit(X_train, y_train)                         # Fit the model

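#### Checking the model fit

First, I always like to check whether the model managed to fit the data. I do this by predicting on the training data itself. The resulting training accuracy is an optimistic estimate, so it serves only as a sanity check and not as a measure of generalization.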

In [117]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [118]:
overfitted_pred = model.predict(X_train)
accuracy = accuracy_score(y_train, overfitted_pred)
print(f"Overfitted Accuracy: {accuracy * 100:.2f}%")

Overfitted Accuracy: 89.85%


#### Prediction

In [119]:
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:\n", confusion)
print("Classification Report:\n", classification_rep)

Accuracy: 86.67%
Confusion Matrix:
 [[68  4]
 [16 62]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.94      0.87        72
           1       0.94      0.79      0.86        78

    accuracy                           0.87       150
   macro avg       0.87      0.87      0.87       150
weighted avg       0.88      0.87      0.87       150
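The figures in the report can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below does this in plain Python for the test-set matrix printed above; the helper `per_class_metrics` is a hypothetical name introduced only for this illustration.

```python
# Test-set confusion matrix from the output above (rows: true class, columns: predicted class)
confusion = [[68, 4],
             [16, 62]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
    actual_k = sum(cm[k])                     # row sum: everything that truly is k (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(2)) / total
print(f"accuracy: {accuracy:.4f}")

for k in range(2):
    p, r, f = per_class_metrics(confusion, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reading the matrix this way also shows where the errors concentrate: 16 examples of class 1 were predicted as class 0, against only 4 errors in the other direction.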



#### Preprocessing for model validation
Because the validation dataset is in a separate file (credtest.txt), the preprocessing steps must be repeated using the objects already fitted during the training phase. The steps are:
* Data selection
* Separation of the numerical and categorical attributes
* Use of the preprocessing objects already created (scaler and codificador) to transform the new data
* Joining the dataset
* Final validation

In [123]:
val_data = pd.read_csv('cic1205/data/credtest.txt', sep='\t', names=nome_colunas) # Data selection
val_data_cat = val_data[variaveis_categoricas] # Separate the categorical attributes
val_data_num = val_data[variaveis_numericas]   # Separate the numerical attributes
val_data_cat_encoded = codificar(val_data_cat)       # One-hot encoding
val_data_num_normalized = normalizar(val_data_num)   # Normalization
val_data_preprocessed = pd.concat( [ val_data_cat_encoded, val_data_num_normalized, val_data['CLASSE'] ], axis=1 ) # Join
X_validation = val_data_preprocessed.iloc[:,:-1] # Features
y_validation = val_data_preprocessed.iloc[:,-1]  # Labels

#### Validation

In [127]:
y_pred_validation = model.predict(X_validation)

# Evaluate the model
accuracy = accuracy_score(y_validation, y_pred_validation)
confusion = confusion_matrix(y_validation, y_pred_validation)
classification_rep = classification_report(y_validation, y_pred_validation)

print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:\n", confusion)
print("Classification Report:\n", classification_rep)

Accuracy: 89.08%
Confusion Matrix:
 [[295  11]
 [ 52 219]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.96      0.90       306
           1       0.95      0.81      0.87       271

    accuracy                           0.89       577
   macro avg       0.90      0.89      0.89       577
weighted avg       0.90      0.89      0.89       577
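The last two rows of the report differ only in how the per-class values are averaged: `macro avg` is a plain mean over the classes, while `weighted avg` weights each class by its support. A small plain-Python sketch, using the validation confusion matrix printed above (the helper names `macro` and `weighted` are hypothetical):

```python
# Validation confusion matrix from the output above (rows: true class, columns: predicted class)
confusion = [[295, 11],
             [52, 219]]

support = [sum(row) for row in confusion]   # examples per true class
precision = [confusion[k][k] / sum(row[k] for row in confusion) for k in range(2)]
recall = [confusion[k][k] / support[k] for k in range(2)]

def macro(values):
    """Unweighted mean over the classes."""
    return sum(values) / len(values)

def weighted(values, weights):
    """Mean over the classes, weighted by the number of true examples in each."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(f"macro precision:    {macro(precision):.2f}")
print(f"weighted precision: {weighted(precision, support):.2f}")
print(f"macro recall:       {macro(recall):.2f}")
print(f"weighted recall:    {weighted(recall, support):.2f}")
```

The support-weighted recall is always equal to the overall accuracy, which is why both appear as 0.89 in the report.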



Now that all the datasets have been preprocessed, the other predictors can be tried quickly, since they all use the same datasets: X_train, X_test, X_validation and their corresponding y.

The classifiers still to be evaluated are DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier and GradientBoostingClassifier.
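Since every predictor shares the same datasets, the fit-and-score steps can be wrapped in one helper and reused. The sketch below is a hypothetical helper, not part of the original notebook: `evaluate`, `accuracy` and the stand-in `MajorityClass` model are all assumptions, written in plain Python so the example runs without scikit-learn. Any object with `fit` and `predict` methods, such as the scikit-learn classifiers used here, could be passed in the same way.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate(model, X_train, y_train, eval_sets):
    """Fit the model once, then score it on every named (X, y) pair."""
    model.fit(X_train, y_train)
    return {name: accuracy(y, model.predict(X)) for name, (X, y) in eval_sets.items()}

class MajorityClass:
    """Stand-in model that always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)

# Tiny hypothetical data, used only to exercise the helper
X_tr, y_tr = [[0.1], [0.4], [0.9]], [1, 1, 0]
scores = evaluate(MajorityClass(), X_tr, y_tr,
                  {"train": (X_tr, y_tr), "test": ([[0.5], [0.7]], [1, 0])})
print(scores)
```

In the notebook, the dictionary would hold the training, test and validation sets, so each classifier is compared on the same three scores.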

### DecisionTreeClassifier

In [129]:
from sklearn.tree import DecisionTreeClassifier
my_tree_model = DecisionTreeClassifier(random_state=1)
