# Exercise 2


## Introduction

This exercise addresses a binary classification problem using the credit card fraud detection dataset available on Kaggle.

The main goal is to build a model that identifies fraudulent transactions from the provided data. To do so, the dataset is split into training and test sets, a logistic regression model is trained, and its performance is evaluated with cross-validation and compared against a dummy classifier.

Finally, a classification threshold is selected based on metrics such as precision, recall, the precision-recall curve, and the ROC curve.

In [1]:
try:
    import kagglehub
except ImportError:
    !pip install kagglehub
    import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import clear_output
clear_output()

In [6]:
# Download latest version
path = kagglehub.dataset_download("mlg-ulb/creditcardfraud")

In [7]:

data = pd.read_csv(path + "/creditcard.csv")

data.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


About 0.17% of the rows belong to the positive class (fraud), so the dataset is highly imbalanced.

In [8]:
print("Number of rows and columns:", data.shape)
print(data.groupby('Class').size()/len(data))

Number of rows and columns: (284807, 31)
Class
0    0.998273
1    0.001727
dtype: float64
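
With this imbalance, accuracy alone is a weak yardstick. As a small illustration using the class proportions printed above (the variable names exist only for this sketch), a rule that labels every transaction as legitimate already reaches about 99.83% accuracy while detecting no fraud at all:

```python
# Class proportions taken from the output above
p_legit = 0.998273
p_fraud = 0.001727

# A rule that always predicts "legitimate" is right on every legitimate row
always_legit_accuracy = p_legit
# It never flags a transaction, so no fraud is ever detected
always_legit_recall = 0.0

print(f"accuracy = {always_legit_accuracy:.4%}, recall = {always_legit_recall:.0%}")
```

This is why the rest of the exercise looks at precision and recall rather than accuracy.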


## Modelado

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Train/test split 80% - 20%

In [12]:
x_train, x_test, y_train, y_test = train_test_split(data.drop(columns=['Class']), 
                                                    data['Class'], 
                                                    test_size=0.2)
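
With so few positive rows, a purely random split can leave the two parts with slightly different fraud rates. One option, shown here as a sketch on synthetic labels (`X_demo` and `y_demo` are made up for the illustration and are not the Kaggle data), is to pass `stratify` and a fixed `random_state` to `train_test_split`, so both parts keep the same class proportion and the split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 10,000 rows, 0.2% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify keeps the class ratio equal in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```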

### Cross Validation

In [63]:
def cross_validation(model, x,y,folds,umbral=0.5):
    """Run K-fold cross-validation and return per-fold metrics; `umbral` is the decision threshold."""
    from sklearn.model_selection import KFold
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score,
                                 precision_recall_curve, average_precision_score)

    kf = KFold(n_splits=folds, shuffle=True, random_state=42)
    accs = []
    recs = []
    precs = []
    f1s = []
    rocs = []
    stats = []
    iters = []
    t_perc = []
    v_perc = []
    tresholds = []
    for train_index, test_index in kf.split(x):
        x_train, x_val = x.iloc[train_index], x.iloc[test_index]
        y_train, y_val = y.iloc[train_index], y.iloc[test_index]
        model.fit(x_train, y_train)
        y_pred_p = model.predict_proba(x_val)
        # Label a row as fraud when its positive-class probability reaches the threshold
        y_pred = (y_pred_p[:,1] >= umbral).astype(int)

        # accuracy
        accs.append(accuracy_score(y_val, y_pred))
        # precision
        precs.append(precision_score(y_val, y_pred))
        # recall
        recs.append(recall_score(y_val, y_pred))
        # f1-score
        f1s.append(f1_score(y_val, y_pred))
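        # ROC AUC, computed here on the thresholded labels (not the probabilities),
        # so the value changes with `umbral`
        rocs.append(roc_auc_score(y_val, y_pred))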

        # Precision-Recall curve
        precision, recall, thresholds = precision_recall_curve(y_val, y_pred_p[:,1])
        ap = average_precision_score(y_val, y_pred_p[:,1])

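        # Find the threshold where the precision and recall curves cross
        # (note: precision[i] and recall[i] correspond to thresholds[i];
        # this loop stores the neighbouring thresholds[i-1])
        bu = np.inf
        for i in range(len(precision)):
            if abs(precision[i] - recall[i]) < bu:
                bu = abs(precision[i] - recall[i])
                best_threshold = thresholds[i-1]
        tresholds.append(best_threshold)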

        # Convergence status of the model (only read for LogisticRegression)
        if isinstance(model,LogisticRegression):
            status =  model.n_iter_[0] < model.max_iter
            iterations = model.n_iter_[0]
            stats.append(status)
            iters.append(iterations)
        else:
            stats.append("N/A")
            iters.append("N/A")

        # Fraction of positive samples in the training and validation parts
        train_perc = y_train.sum()/len(y_train)
        val_perc = y_val.sum()/len(y_val)
        t_perc.append(train_perc)
        v_perc.append(val_perc)

        # Clear the cell output
        clear_output(wait=False)

    return pd.DataFrame(
        {
            "model": [model.__str__().split("(")[0] for i in range(folds)],
            "accuracy": accs,
            "precision": precs,
            "recall": recs,
            "f1": f1s,
            "roc_auc": rocs,
            "convergence": stats,
            "iterations": iters,
            "train_perc": t_perc,
            "val_perc": v_perc,
            "best_threshold": tresholds
        }
    )
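
One caveat about the `roc_auc` column: the function passes the thresholded labels `y_pred` to `roc_auc_score`, so the value depends on `umbral` and matches balanced accuracy at that threshold rather than the area under the full ROC curve. A minimal sketch on made-up scores (the arrays below are illustrative only) shows the difference between the two calls:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy ground truth and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.60, 0.40, 0.90])
labels = (scores >= 0.5).astype(int)  # hard labels at threshold 0.5

# Threshold-free AUC: uses the ranking given by the probabilities
auc_from_scores = roc_auc_score(y_true, scores)
# AUC on hard labels: a single operating point on the ROC curve
auc_from_labels = roc_auc_score(y_true, labels)
# Balanced accuracy at the same threshold
bal_acc = balanced_accuracy_score(y_true, labels)

print(auc_from_scores, auc_from_labels, bal_acc)
```

To report a threshold-independent AUC, the probabilities `y_pred_p[:,1]` would be passed instead of `y_pred`.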
    


### Logistic regression model

In [64]:
cross_validation(LogisticRegression(max_iter=1000,n_jobs=-1), 
                data.drop(columns=["Class"]),
                data["Class"],
                folds = 5,umbral=0.5)

| | model | accuracy | precision | recall | f1 | roc_auc | convergence | iterations | train_perc | val_perc | best_threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.99907 | 0.826087 | 0.581633 | 0.682635 | 0.790711 | False | 1000 | 0.001729 | 0.00172 | 0.060777 |
| 1 | LogisticRegression | 0.999333 | 0.887324 | 0.677419 | 0.768293 | 0.838639 | False | 1000 | 0.001751 | 0.001633 | 0.090631 |
| 2 | LogisticRegression | 0.99914 | 0.859375 | 0.578947 | 0.691824 | 0.789395 | False | 1000 | 0.001742 | 0.001668 | 0.092735 |
| 3 | LogisticRegression | 0.999122 | 0.880597 | 0.584158 | 0.702381 | 0.792009 | False | 1000 | 0.001716 | 0.001773 | 0.133971 |
| 4 | LogisticRegression | 0.999105 | 0.829268 | 0.647619 | 0.727273 | 0.823686 | False | 1000 | 0.001699 | 0.001843 | 0.177998 |
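
The `convergence` column is `False` in every fold: the solver used all 1000 iterations. A likely contributor, though this is an assumption that the notebook does not test, is that `Time` and `Amount` are on much larger scales than the `V1` to `V28` columns. A possible remedy is to standardize the features inside a pipeline, so the scaler is fitted only on the training part of each fold. The sketch below uses synthetic data from `make_classification` with one column artificially inflated, not the fraud dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 95% negatives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)
X_demo[:, 0] *= 10_000  # mimic a column on a much larger scale

# Scaling and the classifier are fitted together as one estimator
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

log_reg = scaled_model[-1]  # the fitted LogisticRegression step
print("iterations used:", log_reg.n_iter_[0], "of", log_reg.max_iter)
```

Note that `cross_validation` reads `n_iter_` only from a bare `LogisticRegression`, so a pipeline passed to it would show "N/A" in the convergence columns.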


### Dummy classifier

A dummy model that predicts at random, using the fraud rate observed in training as the probability of predicting fraud, is trained as a baseline for the logistic regression. The logistic regression is far superior: the dummy classifier never captures a positive case (precision and recall are 0 in every fold).

In [30]:
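class dummy_classifier:
    """Baseline that predicts fraud at random with the training-set fraud rate."""
    def __str__(self):
        return "dummy_classifier"
    def fit(self, x, y):
        # Estimate the prior probability of the positive class
        self.prob = y.sum()/len(y)
    def predict_proba(self, x):
        # Draw a random 0/1 label per row and return it in the two-column
        # (class 0, class 1) layout that cross_validation indexes with [:,1]
        labels = np.random.choice([0,1], size=len(x), p=[1-self.prob, self.prob])
        return np.column_stack([1 - labels, labels])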

In [31]:
cross_validation(dummy_classifier(), 
                data.drop(columns=["Class"]),
                data["Class"],
                folds = 5)

| | model | accuracy | precision | recall | f1 | roc_auc | convergence | iterations | train_perc | val_perc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dummy_classifier | 0.996331 | 0.0 | 0.0 | 0.0 | 0.499024 | | | 0.001729 | 0.00172 |
| 1 | dummy_classifier | 0.996664 | 0.0 | 0.0 | 0.0 | 0.499147 | | | 0.001751 | 0.001633 |
| 2 | dummy_classifier | 0.996489 | 0.0 | 0.0 | 0.0 | 0.499077 | | | 0.001742 | 0.001668 |
| 3 | dummy_classifier | 0.996594 | 0.0 | 0.0 | 0.0 | 0.499182 | | | 0.001716 | 0.001773 |
| 4 | dummy_classifier | 0.996541 | 0.0 | 0.0 | 0.0 | 0.499191 | | | 0.001699 | 0.001843 |
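
For reference, scikit-learn ships a comparable baseline: `DummyClassifier(strategy="stratified")` also draws random labels according to the class frequencies seen in `fit`. A short sketch on synthetic labels (the data below is made up for the illustration and is not the fraud dataset):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Synthetic data with roughly 1% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 2))
y_demo = (rng.random(5000) < 0.01).astype(int)

# "stratified" ignores the features and samples labels from the class prior
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

print("predicted positive rate:", y_hat.mean())
print("recall:", recall_score(y_demo, y_hat, zero_division=0))
```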


## Finding the best threshold



Based on the results above, the threshold is set to the mean of the per-fold thresholds at which the precision and recall curves intersect, taken from the cross-validation of the logistic regression. This gives a threshold of 0.111222.
<div align="center"> 
<img src="https://i.ibb.co/GvYb38r7/image.png" alt="image" border="0">

Table 1: Sample example
</div>
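
The selection rule described above can be sketched as follows. For one fold, take the threshold where precision and recall are closest; `precision_recall_curve` returns one more precision/recall value than thresholds, so the final point is dropped to keep the three arrays aligned. The helper name `crossing_threshold` and the toy scores are assumptions made for this sketch; the five per-fold values are the ones in the `best_threshold` column reported earlier.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def crossing_threshold(y_true, scores):
    """Return the threshold at which |precision - recall| is smallest."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last precision/recall point has no associated threshold
    gap = np.abs(precision[:-1] - recall[:-1])
    return thresholds[np.argmin(gap)]

# Toy example with made-up scores
y_true = np.array([0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9])
toy_threshold = crossing_threshold(y_true, scores)
print("toy crossing threshold:", toy_threshold)

# Per-fold thresholds reported for the logistic regression
fold_thresholds = [0.060777, 0.090631, 0.092735, 0.133971, 0.177998]
mean_threshold = float(np.mean(fold_thresholds))
print("mean threshold:", round(mean_threshold, 6))
```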




In [65]:
cross_validation(LogisticRegression(max_iter=1000, n_jobs=-1),
                         data.drop(columns=["Class"]),
                         data["Class"],
                         folds=5, umbral=0.111222)

| | model | accuracy | precision | recall | f1 | roc_auc | convergence | iterations | train_perc | val_perc | best_threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.999157 | 0.784091 | 0.704082 | 0.741935 | 0.851874 | False | 1000 | 0.001729 | 0.00172 | 0.060777 |
| 1 | LogisticRegression | 0.99928 | 0.782609 | 0.774194 | 0.778378 | 0.886921 | False | 1000 | 0.001751 | 0.001633 | 0.090631 |
| 2 | LogisticRegression | 0.999122 | 0.764706 | 0.684211 | 0.722222 | 0.841929 | False | 1000 | 0.001742 | 0.001668 | 0.092735 |
| 3 | LogisticRegression | 0.999122 | 0.747573 | 0.762376 | 0.754902 | 0.880959 | False | 1000 | 0.001716 | 0.001773 | 0.133971 |
| 4 | LogisticRegression | 0.998876 | 0.672269 | 0.761905 | 0.714286 | 0.880609 | False | 1000 | 0.001699 | 0.001843 | 0.177998 |


Since this is a fraud detection problem, the priority is to miss as few positive cases as possible. For this reason, the threshold is lowered to 0.05: this sacrifices some precision by producing more false positives, but increases the fraud detection rate, as measured by recall.

In [66]:
cross_validation(LogisticRegression(max_iter=1000, n_jobs=-1),
                         data.drop(columns=["Class"]),
                         data["Class"],
                         folds=5, umbral=0.05)

| | model | accuracy | precision | recall | f1 | roc_auc | convergence | iterations | train_perc | val_perc | best_threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.998947 | 0.679245 | 0.734694 | 0.705882 | 0.867048 | False | 1000 | 0.001729 | 0.00172 | 0.060777 |
| 1 | LogisticRegression | 0.998859 | 0.614754 | 0.806452 | 0.697674 | 0.902813 | False | 1000 | 0.001751 | 0.001633 | 0.090631 |
| 2 | LogisticRegression | 0.998139 | 0.462069 | 0.705263 | 0.558333 | 0.851946 | False | 1000 | 0.001742 | 0.001668 | 0.092735 |
| 3 | LogisticRegression | 0.998332 | 0.519231 | 0.80198 | 0.63035 | 0.900331 | False | 1000 | 0.001716 | 0.001773 | 0.133971 |
| 4 | LogisticRegression | 0.998227 | 0.512346 | 0.790476 | 0.621723 | 0.894543 | False | 1000 | 0.001699 | 0.001843 | 0.177998 |
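
To compare the three thresholds at a glance, the per-fold precision and recall reported in the tables can be averaged. The snippet below only re-enters those reported numbers; the dictionary layout and variable names are assumptions of this sketch:

```python
import numpy as np

# Per-fold precision and recall copied from the three result tables
reported = {
    0.5: {
        "precision": [0.826087, 0.887324, 0.859375, 0.880597, 0.829268],
        "recall": [0.581633, 0.677419, 0.578947, 0.584158, 0.647619],
    },
    0.111222: {
        "precision": [0.784091, 0.782609, 0.764706, 0.747573, 0.672269],
        "recall": [0.704082, 0.774194, 0.684211, 0.762376, 0.761905],
    },
    0.05: {
        "precision": [0.679245, 0.614754, 0.462069, 0.519231, 0.512346],
        "recall": [0.734694, 0.806452, 0.705263, 0.80198, 0.790476],
    },
}

# Mean of each metric over the five folds, per threshold
summary = {
    thr: {name: float(np.mean(values)) for name, values in metrics.items()}
    for thr, metrics in reported.items()
}

for thr, means in summary.items():
    print(f"umbral={thr}: precision={means['precision']:.3f}, recall={means['recall']:.3f}")
```

Averaged over the five folds, recall rises from about 0.61 at 0.5 to about 0.74 at 0.111222 and 0.77 at 0.05, while precision falls from about 0.86 to about 0.75 and 0.56.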
