<a href="https://colab.research.google.com/github/MaxiPerrone/fraud_detection_ml/blob/main/Deteccion_fraude_preparacion_conjunto_datos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [108]:
import pandas as pd
import os
import kagglehub

from sklearn.model_selection import train_test_split

In [109]:
dataset_path = kagglehub.dataset_download("dhanushnarayananr/credit-card-fraud")
print("dataset path:", dataset_path)

Using Colab cache for faster access to the 'credit-card-fraud' dataset.
dataset path: /kaggle/input/credit-card-fraud


In [110]:
print(os.listdir(dataset_path))

csv_file = os.path.join(dataset_path, 'card_transdata.csv')

['card_transdata.csv']


In [111]:
df_orig = pd.read_csv(csv_file)

df = df_orig.copy()
df.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


In [112]:
print('Dataset length: ', len(df))

Dataset length:  1000000


In [113]:
df.shape

(1000000, 8)

In [114]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB


In [115]:
print('Missing values: ', df.isna().sum())

Missing values:  distance_from_home                0
distance_from_last_transaction    0
ratio_to_median_purchase_price    0
repeat_retailer                   0
used_chip                         0
used_pin_number                   0
online_order                      0
fraud                             0
dtype: int64


In [116]:
print('Duplicate rows: ', df.duplicated().sum())

Duplicate rows:  0


In [117]:
df.dropna(axis=0, inplace=True)
df.drop_duplicates(inplace=True)

df.shape

(1000000, 8)

In [118]:
df.columns

Index(['distance_from_home', 'distance_from_last_transaction',
       'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip',
       'used_pin_number', 'online_order', 'fraud'],
      dtype='object')

used_chip: indicates whether the transaction was made with the card's chip (for example, a chip transaction can be more secure than a magnetic-stripe swipe).
used_pin_number: indicates whether the purchase was made using the PIN.

*These two fields can give us an idea of whether fraud is more common in purchases made with a chip, with a PIN, or with neither.*

In [119]:
chip_pin_df = df[['used_chip','used_pin_number', 'fraud']]
chip_pin_total_fraud = chip_pin_df['fraud'].sum()

In [120]:
total_chip = (chip_pin_df['used_chip'] == 1).sum()
total_pin = (chip_pin_df['used_pin_number'] == 1).sum()

In [121]:
print('Total transactions made with chip: ', total_chip, '/', len(chip_pin_df))
print('Total transactions made with PIN: ', total_pin, '/', len(chip_pin_df))
print('Total fraudulent transactions: ', chip_pin_total_fraud, '/', len(chip_pin_df))

Total transactions made with chip:  350399 / 1000000
Total transactions made with PIN:  100608 / 1000000
Total fraudulent transactions:  87403.0 / 1000000


In [122]:
total_fraud_chip = chip_pin_df[chip_pin_df['used_chip'] ==1]['fraud'].sum()
total_fraud_pin = chip_pin_df[chip_pin_df['used_pin_number'] ==1]['fraud'].sum()

In [123]:
print('Total fraud with chip: ', total_fraud_chip, '/', len(chip_pin_df))
print('Total fraud with PIN: ', total_fraud_pin, '/', len(chip_pin_df))

Total fraud with chip:  22410.0 / 1000000
Total fraud with PIN:  273.0 / 1000000


In [124]:
fraud_rate_chip = total_fraud_chip / total_chip if total_chip > 0 else 0
fraud_rate_pin = total_fraud_pin / total_pin if total_pin > 0 else 0

print(f"Fraud rate with chip: {fraud_rate_chip:.2%}")
print(f"Fraud rate with PIN: {fraud_rate_pin:.2%}")

Fraud rate with chip: 6.40%
Fraud rate with PIN: 0.27%
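
The per-group fraud rates computed above can also be obtained in one step, because the mean of a 0/1 column within a group is the share of ones in that group. A minimal sketch on a small made-up frame (the `toy` data is hypothetical and only illustrates the pattern; it is not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "used_chip":       [1, 1, 0, 0, 0, 1, 0, 0],
    "used_pin_number": [0, 1, 0, 0, 1, 0, 0, 1],
    "fraud":           [0, 0, 1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 target inside each group = fraud rate of that group
rate_by_chip = toy.groupby("used_chip")["fraud"].mean()
rate_by_pin = toy.groupby("used_pin_number")["fraud"].mean()

print(rate_by_chip)
print(rate_by_pin)
```

On the real data the same pattern would be `df.groupby('used_chip')['fraud'].mean()`, which also reports the rate for transactions made without a chip.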


**How to read the Pearson correlation coefficient between `ratio_to_median_purchase_price` and `fraud`:**


1. Positive and high: the higher the ratio, the more likely the transaction is fraudulent
2. Close to 0: the ratio carries little direct (linear) information for predicting fraud
3. Negative: the higher the ratio, the less likely the transaction is fraudulent

ratio_to_median_purchase_price: ratio of the transaction's purchase price to the median purchase price
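
As a rough illustration of the sign cases above, the coefficient can be computed by hand from its definition, the covariance divided by the product of the standard deviations. The `pearson` helper and the toy lists below are hypothetical, made up for this sketch, and are not part of the dataset:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: fraud happens only at the highest ratios
ratio = [0.4, 0.9, 1.2, 2.0, 5.5, 7.0]
fraud = [0, 0, 0, 0, 1, 1]

r_pos = pearson(ratio, fraud)                    # positive and high
r_neg = pearson(ratio, [1 - f for f in fraud])   # same size, opposite sign

print(r_pos, r_neg)
```

With a 0/1 target such as `fraud`, a positive value means fraudulent rows tend to have higher ratios, and flipping the labels flips the sign without changing the magnitude.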

In [125]:
correlation_df = df[['ratio_to_median_purchase_price', 'fraud']]
corr = correlation_df['ratio_to_median_purchase_price'].corr(correlation_df['fraud'])

In [126]:
print('Correlation of ratio_to_median_purchase_price with fraud: ', corr)

Correlation of ratio_to_median_purchase_price with fraud:  0.4623047222882617


In [127]:
X = df.drop('fraud', axis=1)
y = df['fraud']

In [132]:
online_order_df = df[['online_order', 'fraud']]

total_online = online_order_df['online_order'].sum()
total_offline = len(online_order_df) - total_online

print("Total online order: ", total_online)
print("Total offline order: ", total_offline)

Total online order:  650552.0
Total offline order:  349448.0


In [133]:
total_online_fraud = online_order_df[online_order_df['online_order'] ==1]['fraud'].sum()
total_offline_fraud = online_order_df[online_order_df['online_order'] ==0]['fraud'].sum()

print("Total fraud in online orders: ", total_online_fraud)
print("Total fraud in offline orders: ", total_offline_fraud)

Total fraud in online orders:  82711.0
Total fraud in offline orders:  4692.0


In [134]:
average_online_fraud = total_online_fraud / total_online
average_offline_fraud = total_offline_fraud / total_offline

print(f"Fraud rate for online orders: ({average_online_fraud:.2%})")
print(f"Fraud rate for offline orders: ({average_offline_fraud:.2%})")

Fraud rate for online orders: (12.71%)
Fraud rate for offline orders: (1.34%)
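
To compare the two rates directly, one option is to divide one by the other (a relative risk). The sketch below only redoes the arithmetic with the counts printed above; the variable names are illustrative:

```python
# Counts copied from the notebook outputs above
online_fraud, online_total = 82711, 650552
offline_fraud, offline_total = 4692, 349448

online_rate = online_fraud / online_total      # share of online orders that are fraud
offline_rate = offline_fraud / offline_total   # share of offline orders that are fraud

# How many times more frequent fraud is among online orders
relative_risk = online_rate / offline_rate

print(f"online: {online_rate:.2%}, offline: {offline_rate:.2%}, ratio: {relative_risk:.1f}x")
```

In this dataset the ratio comes out between nine and ten, which suggests `online_order` is a useful feature, although it says nothing about causality.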


In [135]:
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
  """Split df into 60% train, 20% validation and 20% test sets."""
  # First split: 60% train, 40% held out
  strat = df[stratify] if stratify else None
  train_set, test_set = train_test_split(
      df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
  # Second split: divide the held-out 40% evenly into validation and test
  strat = test_set[stratify] if stratify else None
  val_set, test_set = train_test_split(
      test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
  return (train_set, val_set, test_set)

In [136]:
train_set, val_set, test_set = train_val_test_split(df)

print("train set: ", len(train_set))
print("val set: ", len(val_set))
print("test set: ", len(test_set))

train set:  600000
val set:  200000
test set:  200000
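
The call above leaves `stratify` unset, so the fraud share (about 8.7% overall, 87403 of 1000000) is only approximately preserved in each partition. A minimal sketch of a stratified 60/20/20 split follows. It uses a hypothetical toy frame with 10% positives rather than the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 1000 rows, every 10th row labelled as fraud
toy = pd.DataFrame({
    "x": range(1000),
    "fraud": [1 if i % 10 == 0 else 0 for i in range(1000)],
})

# 60% train / 40% held out, keeping the class proportions
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["fraud"])
# Held-out 40% divided evenly into validation and test, again stratified
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["fraud"])

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(part["fraud"].mean(), 3))
```

With the helper defined earlier, the equivalent call would be `train_val_test_split(df, stratify='fraud')`, since the function already accepts a column name for stratification.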
