Step 1.
Clean the datasets:
 - Rename the column "default payment next month" to "default".
 - Remove the column "ID".
 - Drop the records with unavailable information.
 - For the EDUCATION column, values > 4 indicate higher levels of
   education; group these values into the "others" category.
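
The cleaning rules above can be tried on a toy frame before touching the real data. This is only a sketch with made-up rows: treating 0 in EDUCATION or MARRIAGE as "unavailable" is an assumption taken from the cleaning cell in this notebook, and `dropna()` is added here only to cover genuinely missing values.

```python
import pandas as pd

# Made-up rows for illustration only; not the notebook's data.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "EDUCATION": [1, 5, 0, 6],
    "MARRIAGE": [1, 2, 1, 0],
    "default payment next month": [0, 1, 0, 1],
})

cleaned = (
    toy.rename(columns={"default payment next month": "default"})
    .drop(columns="ID")
    .dropna()
    # Levels above 4 collapse into category 4 ("others").
    .assign(EDUCATION=lambda d: d["EDUCATION"].clip(upper=4))
    # Assumption: 0 encodes unavailable information in these two columns.
    .query("EDUCATION != 0 and MARRIAGE != 0")
    .reset_index(drop=True)
)
print(cleaned)
```

Only the first two rows survive: the second has its EDUCATION value capped at 4, and the other two are dropped because of a 0 code.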

In [1]:
import pandas as pd

df_train = pd.read_csv('../files/input/train_data.csv.zip')
df_test = pd.read_csv('../files/input/test_data.csv.zip')
df_train.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,10748,310000,1,3,1,32,0,0,0,0,...,84373,57779,14163,8295,6000,4000,3000,1000,2000,0
1,12574,10000,2,3,1,49,-1,-1,-2,-1,...,1690,1138,930,0,0,2828,0,182,0,1
2,29677,50000,1,2,1,28,-1,-1,-1,0,...,45975,1300,43987,0,46257,2200,1300,43987,1386,0
3,8857,80000,2,3,1,52,2,2,3,3,...,40748,39816,40607,3700,1600,1600,0,1600,1600,1
4,21099,270000,1,1,2,34,1,2,0,0,...,22448,15490,17343,0,4000,2000,0,2000,2000,0


In [2]:
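def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 1 cleaning rules and return a new DataFrame."""
    df = df.rename(columns={'default payment next month': 'default'})
    df = df.drop(columns='ID')
    # Group EDUCATION levels above 4 into category 4 ("others").
    df['EDUCATION'] = df['EDUCATION'].clip(upper=4)
    # This notebook treats 0 in MARRIAGE or EDUCATION as unavailable information.
    df = df.query('MARRIAGE != 0 and EDUCATION != 0')
    return df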

In [3]:
df_train = clean_dataset(df_train)
df_test = clean_dataset(df_test)

In [4]:
df_train.isna().sum()

LIMIT_BAL    0
SEX          0
EDUCATION    0
MARRIAGE     0
AGE          0
PAY_0        0
PAY_2        0
PAY_3        0
PAY_4        0
PAY_5        0
PAY_6        0
BILL_AMT1    0
BILL_AMT2    0
BILL_AMT3    0
BILL_AMT4    0
BILL_AMT5    0
BILL_AMT6    0
PAY_AMT1     0
PAY_AMT2     0
PAY_AMT3     0
PAY_AMT4     0
PAY_AMT5     0
PAY_AMT6     0
default      0
dtype: int64

Step 2.
- Split the datasets into x_train, y_train, x_test, y_test.

In [5]:
X_train, y_train = df_train.drop('default', axis=1), df_train['default']
X_test, y_test = df_test.drop('default', axis=1), df_test['default']

Step 3.
 Create a pipeline for the classification model. This pipeline must
 contain the following layers:
 - Transform the categorical variables using the
   one-hot-encoding method.
 - Fit a random forest model.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
import numpy as np

columnas_categoricas = ['SEX', 'EDUCATION', 'MARRIAGE']  # the PAY_* columns are passed through as numeric features

transformer = ColumnTransformer(
    transformers=[
        ('ohe', OneHotEncoder(dtype=np.int64), columnas_categoricas)
    ],
    remainder='passthrough'
)

pipeline = Pipeline(
    steps=[
        ('transformer', transformer),
        ('randomforest', RandomForestClassifier(n_jobs=-1, random_state=2024))
    ]
)



Step 4.
- Optimize the hyperparameters of the pipeline using cross-validation.
- Use 10 splits for the cross-validation. Use balanced accuracy as the scoring function to measure the model's performance.
See the random forest documentation: https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [7]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

param_grid = {
    'randomforest__n_estimators': [180],
    'randomforest__max_features': ['sqrt'],
    #'randomforest__criterion': ['gini', 'entropy', 'log_loss'],
    'randomforest__min_samples_split': [10],
    'randomforest__min_samples_leaf': [2],
    'randomforest__bootstrap': [True],
    'randomforest__max_depth': [None]  # maximum depth of each tree; None means no limit
}

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=10,shuffle=False),
    scoring='balanced_accuracy',
    n_jobs=-1,
    verbose=1
)

In [None]:
# Fit the grid search to obtain the best model

grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits


The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [9]:
grid_search.best_params_

{'randomforest__bootstrap': True,
 'randomforest__max_depth': None,
 'randomforest__max_features': 'sqrt',
 'randomforest__min_samples_leaf': 2,
 'randomforest__min_samples_split': 10,
 'randomforest__n_estimators': 180}
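
The grid above holds a single value per hyperparameter, so the search evaluates one candidate ("Fitting 10 folds for each of 1 candidates") and `best_params_` simply echoes the grid. A minimal sketch of how several candidates could be compared follows; the synthetic data, the small forest, and the grid values are assumptions chosen to keep it fast, not tuned settings for the credit-default data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced toy data; not the credit-default dataset.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"min_samples_leaf": [1, 5], "max_depth": [3, None]},
    cv=StratifiedKFold(n_splits=10),
    scoring="balanced_accuracy",
)
toy_search.fit(X_toy, y_toy)

# One mean cross-validated score per candidate (2 x 2 = 4 here).
for params, score in zip(
    toy_search.cv_results_["params"], toy_search.cv_results_["mean_test_score"]
):
    print(params, round(score, 3))
print("best:", toy_search.best_params_)
```

With the pipeline from this notebook, the grid keys would keep the `randomforest__` prefix used in `param_grid`.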

Step 5.
- Save the model as "files/models/model.pkl".

In [10]:
import pickle
import gzip
import os

models_dir = '../files/models'
os.makedirs(models_dir, exist_ok=True)

with gzip.open(os.path.join(models_dir, 'model.pkl.gz'), 'wb') as file:
    pickle.dump(grid_search, file)
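
A quick way to gain confidence in the saved file is a round trip: dump, reload, and compare predictions. The sketch below assumes a small decision tree on synthetic data as a stand-in for the fitted grid search and writes to a temporary directory rather than `files/models`.

```python
import gzip
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy model standing in for the fitted grid search.
X_toy, y_toy = make_classification(n_samples=100, random_state=0)
toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl.gz")
    # Write the estimator as a gzip-compressed pickle.
    with gzip.open(path, "wb") as file:
        pickle.dump(toy_model, file)
    # Read it back the same way.
    with gzip.open(path, "rb") as file:
        restored = pickle.load(file)

same_predictions = bool((restored.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same_predictions)
```

Pickled estimators should be reloaded with the same scikit-learn version that produced them.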


Step 6.
Compute the precision, balanced accuracy, recall, and F1-score
metrics for the training and test sets.
Save them to the file files/output/metrics.json. Each row
of the file is a dictionary with the metrics of a model.
This dictionary has a field indicating whether it refers to the
training or test set. For example:

{'dataset': 'train', 'precision': 0.8, 'balanced_accuracy': 0.7, 'recall': 0.9, 'f1_score': 0.85}
{'dataset': 'test', 'precision': 0.7, 'balanced_accuracy': 0.6, 'recall': 0.8, 'f1_score': 0.75}

In [11]:
from sklearn.metrics import (
    accuracy_score, precision_score, balanced_accuracy_score,
    recall_score, f1_score, confusion_matrix, classification_report)
np.set_printoptions(legacy='1.25')

In [12]:
def cargar_modelo_predecir(data):
    # Load the gzip-compressed pickled estimator and predict on `data`.
    with gzip.open("../files/models/model.pkl.gz", "rb") as file:
        estimator = pickle.load(file)
    return estimator.predict(data)

In [13]:
y_train_pred = grid_search.predict(X_train)
y_test_pred = grid_search.predict(X_test)

def eval_metrics(dataset, y_true, y_pred):
    # Build one metrics record for the given dataset ('train' or 'test').
    return {
        'type': 'metrics',
        'dataset': dataset,
        'precision': precision_score(y_true, y_pred),
        'balanced_accuracy': balanced_accuracy_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1_score': f1_score(y_true, y_pred),
    }


In [14]:
metrics_train = eval_metrics('train', y_train, y_train_pred)
metrics_test = eval_metrics('test', y_test, y_test_pred)


Step 7.
- Compute the confusion matrices for the training and test sets.
Save them to the file files/output/metrics.json. Each row
of the file is a dictionary with the metrics of a model, with a field
indicating whether it refers to the training or test set. For example:

{'type': 'cm_matrix', 'dataset': 'train', 'true_0': {"predicted_0": 15562, "predicted_1": 666}, 'true_1': {"predicted_0": 3333, "predicted_1": 1444}}
{'type': 'cm_matrix', 'dataset': 'test', 'true_0': {"predicted_0": 15562, "predicted_1": 650}, 'true_1': {"predicted_0": 2490, "predicted_1": 1420}}


In [15]:
# Confusion matrices
print("Confusion matrix - Training:")
cm_train = confusion_matrix(y_train, y_train_pred)
print(cm_train)

print("Confusion matrix - Test:")
cm_test = confusion_matrix(y_test, y_test_pred)
print(cm_test)

Confusion matrix - Training:
[[16097   131]
 [ 1864  2861]]
Confusion matrix - Test:
[[6678  395]
 [1133  773]]
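
The Step 6 metrics can be re-derived by hand from these counts, which is a useful cross-check on the values written to metrics.json. The sketch below takes the test-set matrix printed above and applies the standard definitions; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix as printed above
# (rows: true class, columns: predicted class).
cm = np.array([[6678, 395],
               [1133, 773]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # recall of class 1
specificity = tn / (tn + fp)   # recall of class 0
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall + specificity) / 2
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(balanced_accuracy, 3), round(recall, 3), round(f1, 3))
```

On the test set the model recovers well under half of the actual defaults (recall near 0.41), which is why balanced accuracy sits far below the share of correct predictions.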


In [16]:
cm_train_dict = {
    'type': 'cm_matrix',
    'dataset': 'train',
    'true_0': {
        'predicted_0': cm_train[0,0],
        'predicted_1': cm_train[0,1]
    },
    'true_1': {
        'predicted_0': cm_train[1,0],
        'predicted_1': cm_train[1,1]
    }
}

cm_test_dict = {
    'type': 'cm_matrix',
    'dataset': 'test',
    'true_0': {
        'predicted_0': cm_test[0,0],
        'predicted_1': cm_test[0,1]
    },
    'true_1': {
        'predicted_0': cm_test[1,0],
        'predicted_1': cm_test[1,1]
    }
}
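
The two dictionaries above differ only in the matrix and the dataset label, so they could be produced by one small function. `cm_to_dict` below is a hypothetical helper, not part of the notebook; it also casts the NumPy integers to plain `int` so the result serializes cleanly.

```python
import numpy as np


def cm_to_dict(cm, dataset):
    # Hypothetical helper: rows of `cm` are true classes, columns are predictions.
    return {
        "type": "cm_matrix",
        "dataset": dataset,
        "true_0": {"predicted_0": int(cm[0, 0]), "predicted_1": int(cm[0, 1])},
        "true_1": {"predicted_0": int(cm[1, 0]), "predicted_1": int(cm[1, 1])},
    }


# Training confusion matrix as printed earlier in this notebook.
example = cm_to_dict(np.array([[16097, 131], [1864, 2861]]), "train")
print(example)
```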

In [17]:
import os

output_dir = '../files/output'
os.makedirs(output_dir, exist_ok=True)

In [18]:
import json

# Write one JSON object per line; mode 'w' avoids duplicating rows on reruns.
with open('../files/output/metrics.json', mode='w') as file:
    for record in (metrics_train, metrics_test, cm_train_dict, cm_test_dict):
        # default=int converts NumPy integers, which json cannot serialize.
        file.write(json.dumps(record, default=int) + "\n")
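
Since metrics.json holds one JSON object per line, it is worth confirming that every line parses on its own. A minimal sketch, assuming a temporary file and made-up records in the shape used by this notebook:

```python
import json
import os
import tempfile

# Made-up records in the shape used by this notebook.
records = [
    {"type": "metrics", "dataset": "train", "precision": 0.8,
     "balanced_accuracy": 0.7, "recall": 0.9, "f1_score": 0.85},
    {"type": "cm_matrix", "dataset": "train",
     "true_0": {"predicted_0": 10, "predicted_1": 2},
     "true_1": {"predicted_0": 3, "predicted_1": 5}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metrics.json")
    with open(path, "w") as file:
        for record in records:
            file.write(json.dumps(record) + "\n")
    # Each non-empty line must parse as a standalone JSON object.
    with open(path) as file:
        loaded = [json.loads(line) for line in file if line.strip()]

print(loaded == records)
```

The same read-back loop can be pointed at the real file to catch stray single quotes or unserializable values.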