## LightGBM

We open the dataset in which the new variables and features have already been engineered.

In [6]:
#!pip install lightgbm
#!pip install --upgrade lightgbm
#!pip install category_encoders
#!pip install geopy

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
import numpy as np
import logging
import os
import shutil
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix, roc_curve, auc, ConfusionMatrixDisplay

In [8]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

We load the dataset for each month, which has already been balanced.

In [9]:
input_dir = 'final_datasets'

In [10]:
# Dictionary to store the datasets, keyed by month
monthly_datasets = {}

# List all files in the directory; sorted() keeps the months in chronological
# order, since os.listdir() does not guarantee any ordering
for file_name in sorted(os.listdir(input_dir)):
    if file_name.startswith('resampled_data_') and file_name.endswith('.csv'):
        # Build the full path to the file
        file_path = os.path.join(input_dir, file_name)

        # Extract the date from the file name
        date_str = file_name[len('resampled_data_'):-4]
        month = pd.to_datetime(date_str, format='%Y_%m')

        # Load the dataset
        monthly_datasets[month] = pd.read_csv(file_path)
        print(f"Dataset for {month.strftime('%Y-%m')} loaded from: {file_path}")

Dataset for 2019-01 loaded from: final_datasets\resampled_data_2019_01.csv
Dataset for 2019-02 loaded from: final_datasets\resampled_data_2019_02.csv
Dataset for 2019-03 loaded from: final_datasets\resampled_data_2019_03.csv
Dataset for 2019-04 loaded from: final_datasets\resampled_data_2019_04.csv
Dataset for 2019-05 loaded from: final_datasets\resampled_data_2019_05.csv
Dataset for 2019-06 loaded from: final_datasets\resampled_data_2019_06.csv
Dataset for 2019-07 loaded from: final_datasets\resampled_data_2019_07.csv
Dataset for 2019-08 loaded from: final_datasets\resampled_data_2019_08.csv
Dataset for 2019-09 loaded from: final_datasets\resampled_data_2019_09.csv
Dataset for 2019-10 loaded from: final_datasets\resampled_data_2019_10.csv
Dataset for 2019-11 loaded from: final_datasets\resampled_data_2019_11.csv
Dataset for 2019-12 loaded from: final_datasets\resampled_data_2019_12.csv
Dataset for 2020-01 loaded from: final_datasets\resampled_dat
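
As a minimal sketch of the file-name convention used above, the snippet below parses the month out of names of the form `resampled_data_YYYY_MM.csv` and orders them chronologically. The helper name `parse_month` and the sample file names are hypothetical and only serve the illustration.

```python
import pandas as pd

PREFIX = 'resampled_data_'

def parse_month(file_name):
    """Return the month encoded in a name like 'resampled_data_2019_01.csv'."""
    date_str = file_name[len(PREFIX):-len('.csv')]
    return pd.to_datetime(date_str, format='%Y_%m')

# Hypothetical, deliberately unordered file names
names = [
    'resampled_data_2019_10.csv',
    'resampled_data_2019_02.csv',
    'resampled_data_2020_01.csv',
]

# Sorting the parsed timestamps gives chronological order
months = sorted(parse_month(n) for n in names)
print([m.strftime('%Y-%m') for m in months])
```

Keying the dictionary by a parsed timestamp rather than by the raw file name is what later allows calls such as `name.strftime("%B %Y")` when labelling plots.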

We then split the dataset into 70% training and 30% validation (the training cells call `train_test_split` with `test_size=0.3`).
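
A minimal sketch of the split on synthetic data follows, using the `test_size=0.3` value that the training cells pass to `train_test_split`, which yields a 70/30 partition. The toy DataFrame and its `amount` column are invented for illustration, and `stratify=y` is an optional addition that the notebook's own call does not use; it keeps the fraud rate equal in both partitions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a monthly dataset: 1000 rows, 25% fraud
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'amount': rng.uniform(1, 500, size=1000),   # hypothetical feature
    'is_fraud': np.repeat([1, 0], [250, 750]),
})

X = toy.drop(['is_fraud'], axis=1)
y = toy['is_fraud']

# test_size=0.3 gives a 70/30 split; stratify=y (an assumption, not in the
# notebook) preserves the class ratio in both partitions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(X_train), len(X_val))
print(y_train.mean(), y_val.mean())
```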

### Training with LGBM

Training on all months combined (non-incremental)

In [11]:
# We don't want to fill the output with images, so turn off interactive mode
plt.ioff()


In [12]:
# LightGBM parameters
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'verbose': 2
}

folder_path = 'LGBM_combined_performance'
roc_folder = os.path.join(folder_path, 'roc_curves')
conf_matrix_folder = os.path.join(folder_path, 'conf_matrices')
metrics_folder = os.path.join(folder_path, 'metrics_texts')

# Recreate the output folders from scratch (any existing contents are deleted)
for path in [folder_path, roc_folder, conf_matrix_folder, metrics_folder]:
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)

# Merge all the monthly datasets into a single one
full_data = pd.concat(monthly_datasets.values())

# Split the data into training and validation sets
X = full_data.drop(['is_fraud'], axis=1)
y = full_data['is_fraud']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Prepare the datasets for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_val, label=y_val)

# Train the model
lgb_model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[valid_data],
    valid_names=['validation']
)

# Predict on the validation set
y_pred = lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration)
y_pred_binary = (y_pred >= 0.5).astype(int)

# Compute metrics
f1 = f1_score(y_val, y_pred_binary)
accuracy = accuracy_score(y_val, y_pred_binary)
precision = precision_score(y_val, y_pred_binary)
recall = recall_score(y_val, y_pred_binary)
cm = confusion_matrix(y_val, y_pred_binary)

# Save the metrics to a file
metrics_file_path = os.path.join(metrics_folder, 'model_metrics.txt')
with open(metrics_file_path, 'w') as metrics_file:
    metrics_file.write("Combined Dataset Metrics:\n")
    metrics_file.write(f"F1 Score: {f1:.2f}\n")
    metrics_file.write(f"Accuracy: {accuracy:.2f}\n")
    metrics_file.write(f"Precision: {precision:.2f}\n")
    metrics_file.write(f"Recall: {recall:.2f}\n")
    metrics_file.write(f"Confusion Matrix:\n{cm}\n")

# Plot and save the ROC curve
fpr, tpr, _ = roc_curve(y_val, y_pred)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 5))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Combined Dataset')
plt.legend(loc="lower right")
plt.savefig(os.path.join(roc_folder, 'roc_curve_combined.png'))
plt.close()

# Plot and save the confusion matrix; passing ax avoids leaving an empty figure open
fig, ax = plt.subplots(figsize=(6, 5))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues, ax=ax)
ax.set_title('Confusion Matrix for Combined Dataset')
fig.savefig(os.path.join(conf_matrix_folder, 'conf_matrix_combined.png'))
plt.close(fig)

print("Check the folders inside 'LGBM_combined_performance' to see the model's performance.")

[LightGBM] [Info] Number of positive: 425813, number of negative: 1289772
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.843345
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.123343
[LightGBM] [Debug] init for col-wise cost 0.051941 seconds, init for row-wise cost 0.359326 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.499730 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8520
[LightGBM] [Info] Number of data points in the train set: 1715585, number of used features: 36
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.248203 -> initscore=-1.108220
[LightGBM] [Info] Start training from score -1.108220
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 3
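
The `BoostFromScore` line in the log above can be checked by hand. For the `binary` objective, LightGBM starts boosting from the log-odds of the average label, so `pavg` is the share of positives in the training set and `initscore` is `log(pavg / (1 - pavg))`. The sketch below redoes that arithmetic from the positive and negative counts printed in the log; the variable names are illustrative, and the formula assumes the default `sigmoid=1.0` and unweighted rows.

```python
import math

# Counts taken from the "[Info] Number of positive ... negative ..." log line
n_pos = 425813
n_neg = 1289772

pavg = n_pos / (n_pos + n_neg)             # mean of the labels
init_score = math.log(pavg / (1 - pavg))   # log-odds used as the starting score

print(f"pavg={pavg:.6f} -> initscore={init_score:.6f}")
```

The printed values should agree with the `pavg` and `initscore` reported in the log. They also show that the training set is about 25% fraud after resampling, not 50/50.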

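
The metrics saved above all come from thresholding the predicted probabilities at 0.5 and comparing the resulting labels with the ground truth. A small worked example with made-up labels and probabilities (six hypothetical transactions, not taken from the dataset) shows how each number relates to the confusion matrix:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predicted fraud probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])

# Same rule as the notebook: probability >= 0.5 is flagged as fraud
y_hat = (y_prob >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_hat)   # tp / (tp + fp)
recall = recall_score(y_true, y_hat)         # tp / (tp + fn)
f1 = f1_score(y_true, y_hat)                 # harmonic mean of precision and recall
accuracy = accuracy_score(y_true, y_hat)     # (tp + tn) / total

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Because the 0.5 cut-off is a choice rather than a property of the model, precision and recall can be traded against each other by moving it, which is the trade-off the ROC curve summarizes.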

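Incremental training: each month continues from the previous month's model via `init_model`, and the model is reset every three months.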

In [13]:
# LightGBM parameters
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'verbose': -1
}

lgb_model = None
folder_path = 'LGBM_incremental_performance'

# Subfolders
roc_folder = os.path.join(folder_path, 'roc_curves')
conf_matrix_folder = os.path.join(folder_path, 'conf_matrices')
metrics_folder = os.path.join(folder_path, 'metrics_texts')

# Make sure the folders start out empty
for path in [folder_path, roc_folder, conf_matrix_folder, metrics_folder]:
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)

metrics_file_path = os.path.join(metrics_folder, 'model_metrics.txt')
with open(metrics_file_path, 'w') as metrics_file:
    month_count = 0  # Counter used to decide when to reset the model

    # Iterate over each monthly segment
    for name, month_data in monthly_datasets.items():
        print(f"Training on data from: {name}")

        # Split the data into training and validation sets
        X = month_data.drop(['is_fraud'], axis=1)
        y = month_data['is_fraud']
        X_train_month, X_val_month, y_train_month, y_val_month = train_test_split(
            X, y, test_size=0.3, random_state=42)

        train_data_month = lgb.Dataset(X_train_month, label=y_train_month)
        valid_data_month = lgb.Dataset(X_val_month, label=y_val_month)

        # Reset the model every 3 months
        if month_count % 3 == 0:
            lgb_model = None

        # Training with validation
        lgb_model = lgb.train(
            params,
            train_data_month,
            init_model=lgb_model,  # Continue from the previous model (None starts from scratch)
            num_boost_round=100,
            valid_sets=[valid_data_month],
            valid_names=['validation']
        )

        # Increment the month counter
        month_count += 1

        # Compute metrics
        y_pred = lgb_model.predict(X_val_month, num_iteration=lgb_model.best_iteration)
        y_pred_binary = (y_pred >= 0.5).astype(int)
        f1 = f1_score(y_val_month, y_pred_binary)
        accuracy = accuracy_score(y_val_month, y_pred_binary)
        precision = precision_score(y_val_month, y_pred_binary)
        recall = recall_score(y_val_month, y_pred_binary)
        cm = confusion_matrix(y_val_month, y_pred_binary)

        # Plot the ROC curve
        fpr, tpr, _ = roc_curve(y_val_month, y_pred)
        roc_auc = auc(fpr, tpr)
        plt.figure(figsize=(10, 5))
        plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f'ROC Curve for {name.strftime("%B %Y")}')
        plt.legend(loc="lower right")
        plt.savefig(os.path.join(roc_folder, f'roc_curve_{name.strftime("%Y_%m")}.png'))
        plt.close()

        # Plot the confusion matrix; passing ax avoids leaving an empty figure open
        fig, ax = plt.subplots(figsize=(6, 5))
        disp = ConfusionMatrixDisplay(confusion_matrix=cm)
        disp.plot(cmap=plt.cm.Blues, ax=ax)
        ax.set_title(f'Confusion Matrix for {name.strftime("%B %Y")}')
        fig.savefig(os.path.join(conf_matrix_folder, f'conf_matrix_{name.strftime("%Y_%m")}.png'))
        plt.close(fig)

        # Write the metrics to the file
        metrics_file.write(f"Metrics for {name.strftime('%B %Y')}:\n")
        metrics_file.write(f"F1 Score: {f1:.2f}\n")
        metrics_file.write(f"Accuracy: {accuracy:.2f}\n")
        metrics_file.write(f"Precision: {precision:.2f}\n")
        metrics_file.write(f"Recall: {recall:.2f}\n")
        metrics_file.write(f"Confusion Matrix:\n{cm}\n\n")

print("Check the folders inside 'LGBM_incremental_performance' to see the models' performance.")

Training on data from: 2019-01-01 00:00:00
Training on data from: 2019-02-01 00:00:00
Training on data from: 2019-03-01 00:00:00
Training on data from: 2019-04-01 00:00:00
Training on data from: 2019-05-01 00:00:00
Training on data from: 2019-06-01 00:00:00
Training on data from: 2019-07-01 00:00:00
Training on data from: 2019-08-01 00:00:00
Training on data from: 2019-09-01 00:00:00
Training on data from: 2019-10-01 00:00:00
Training on data from: 2019-11-01 00:00:00
Training on data from: 2019-12-01 00:00:00
Training on data from: 2020-01-01 00:00:00
Training on data from: 2020-02-01 00:00:00
Training on data from: 2020-03-01 00:00:00
Training on data from: 2020-04-01 00:00:00
Training on data from: 2020-05-01 00:00:00
Training on data from: 2020-06-01 00:00:00
Training on data from: 2020-07-01 00:00:00
Training on data from: 2020-08-01 00:00:00
Training on data from: 2020-09-01 00:00:00
Training on data from: 2020-10-01 00:00:00
Training on data from: 2020-11-01 00:00:00
Training on data from: 2020-12-01 00:00:00
Check the folders inside 'LGBM_incremental_performance' to see the models' performance.
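
To make the reset schedule above concrete: when `init_model` is a previously trained booster, `lgb.train` adds `num_boost_round` new trees on top of it, so the model grows by 100 trees per month until the next reset. The pure-Python sketch below only simulates that bookkeeping (no model is trained); `trees_per_month` and `reset_every` are illustrative names mirroring `num_boost_round=100` and the `month_count % 3` check.

```python
trees_per_month = 100   # mirrors num_boost_round=100
reset_every = 3         # mirrors the month_count % 3 == 0 check

n_trees = 0
schedule = []
for month_count in range(6):          # six months are enough to see the pattern
    if month_count % reset_every == 0:
        n_trees = 0                   # lgb_model = None: start from scratch
    n_trees += trees_per_month        # continued training appends new trees
    schedule.append(n_trees)

print(schedule)  # [100, 200, 300, 100, 200, 300]
```

Under this schedule the models for January, April, July and October have seen only one month of data, while those for March, June, September and December have been boosted on three consecutive months, which is worth keeping in mind when comparing the per-month metrics.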

