# 📊 Initial Data Loading and Verification

**Project:** Fraud Detection in Financial Transactions  
**Dataset:** Base_datos.csv (PaySim Dataset)  
**Objective:** Load the dataset and run a basic initial verification

---

## Dataset Description

The dataset contains simulated financial transactions with the following fields:
- **step**: Unit of time (1 step = 1 hour)
- **type**: Transaction type (PAYMENT, TRANSFER, CASH_OUT, DEBIT, CASH_IN)
- **amount**: Transaction amount
- **nameOrig**: Customer who originates the transaction
- **oldbalanceOrg**: Originator's balance before the transaction
- **newbalanceOrig**: Originator's balance after the transaction
- **nameDest**: Recipient customer
- **oldbalanceDest**: Recipient's balance before the transaction
- **newbalanceDest**: Recipient's balance after the transaction
- **isFraud**: Target variable; indicates whether the transaction is fraudulent
- **isFlaggedFraud**: Fraud flag set by the system
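
The field list above can double as a lightweight schema check right after loading. The following is a minimal sketch, assuming the column names are exactly those listed; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the sample frame is made up.

```python
import pandas as pd

# Column names taken from the dataset description above.
EXPECTED_COLUMNS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
    "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud",
]


def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Tiny made-up frame that lacks most columns, only to show the call.
sample = pd.DataFrame({"step": [1], "type": ["PAYMENT"], "amount": [10.0]})
missing = missing_columns(sample)
print(missing)
```

An empty result means every documented field is present; anything else is worth investigating before continuing with the analysis.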

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import warnings
import os
from datetime import datetime

# Suppress all warnings to keep the notebook output clean
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully")
print(f"📅 Execution date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Libraries imported successfully
📅 Execution date: 2025-11-06 09:42:14


In [2]:
# Load the dataset
print("🔄 Loading dataset...")
df = pd.read_csv('../../Base_datos.csv')
print("✅ Dataset loaded successfully!")
print(f"📊 Dimensions: {df.shape[0]:,} rows x {df.shape[1]} columns")

🔄 Loading dataset...
✅ Dataset loaded successfully!
📊 Dimensions: 200,001 rows x 11 columns


In [3]:
# Initial look at the data
print("=" * 80)
print("GENERAL DATASET INFORMATION")
print("=" * 80)
print("\n📋 First 5 rows:")
display(df.head())

print("\n📋 Last 5 rows:")
display(df.tail())

GENERAL DATASET INFORMATION

📋 First 5 rows:


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,133,PAYMENT,5828.35,C952771304,942012.56,936184.25,M141260766,0.0,0.0,0,0
1,397,CASH_IN,159257.06,C38956238,16102.0,175359.06,C1129173370,40954.81,0.0,0,0
2,15,PAYMENT,9701.77,C86130509,20525.0,10823.23,M267881330,0.0,0.0,0,0
3,282,CASH_IN,54068.33,C2058924344,780601.1,834669.44,C527536811,431607.53,377539.2,0,0
4,372,CASH_IN,202734.3,C1418128829,12753207.0,12955941.0,C136842001,1281123.6,1078389.4,0,0



📋 Last 5 rows:


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
199996,188,CASH_IN,54597.4,C838142680,55902.0,110499.4,C1460517462,0.0,0.0,0,0
199997,10,CASH_OUT,248659.34,C1184757616,36678.64,0.0,C2054885284,0.0,2519867.5,0,0
199998,9,PAYMENT,13832.75,C486255800,48754.0,34921.25,M905963955,0.0,0.0,0,0
199999,332,CASH_OUT,87762.4,C2045515251,0.0,0.0,C2028488604,672990.25,760752.6,0,0
200000,310,CASH_OUT,46555.21,C1369119598,0.0,0.0,C816822800,282407.16,328962.38,0,0


In [4]:
# Data types and null values
print("=" * 80)
print("DATA TYPES AND NULL VALUES")
print("=" * 80)
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

print("\n📊 Total null values per column:")
valores_nulos = df.isnull().sum()
if valores_nulos.sum() == 0:
    print("✅ No null values found in the dataset")
else:
    print(valores_nulos[valores_nulos > 0])

DATA TYPES AND NULL VALUES
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200001 entries, 0 to 200000
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            200001 non-null  int64  
 1   type            200001 non-null  object 
 2   amount          200001 non-null  float64
 3   nameOrig        200001 non-null  object 
 4   oldbalanceOrg   200001 non-null  float64
 5   newbalanceOrig  200001 non-null  float64
 6   nameDest        200001 non-null  object 
 7   oldbalanceDest  200001 non-null  float64
 8   newbalanceDest  200001 non-null  float64
 9   isFraud         200001 non-null  int64  
 10  isFlaggedFraud  200001 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 16.8+ MB

📊 Total null values per column:
✅ No null values found in the dataset


In [5]:
# Basic statistical summary
print("=" * 80)
print("BASIC STATISTICAL SUMMARY")
print("=" * 80)
display(df.describe())

BASIC STATISTICAL SUMMARY


Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,200001.0,200001.0,200001.0,200001.0,200001.0,200001.0,200001.0,200001.0
mean,243.880986,181203.0,833292.8,854961.3,1104669.0,1230433.0,0.00129,0.0
std,142.3212,639083.8,2891330.0,2928538.0,3370583.0,3662084.0,0.035893,0.0
min,1.0,0.78,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13415.15,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,75717.52,14322.0,0.0,130287.8,212660.4,0.0,0.0
75%,334.0,209507.7,107512.0,144248.4,949255.1,1115316.0,0.0,0.0
max,742.0,71172480.0,38563400.0,38939420.0,327852100.0,327963000.0,1.0,0.0
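
In the summary above, `isFlaggedFraud` has a mean, standard deviation, and maximum of 0, so it is constant in this sample and carries no information for a model. The following is a minimal sketch for listing such columns automatically. `constant_columns` is a hypothetical helper, and the `isFraud` values in the sample frame are made up for illustration.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return names of columns that hold at most one distinct value (hypothetical helper)."""
    # dropna=False counts NaN as a value, so an all-NaN column is also reported.
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Small illustrative frame; only isFlaggedFraud is constant here.
sample = pd.DataFrame({
    "amount": [5828.35, 159257.06, 9701.77],
    "isFraud": [0, 1, 0],
    "isFlaggedFraud": [0, 0, 0],
})
constant = constant_columns(sample)
print(constant)
```

Whether to drop a constant column is a modeling decision; the check only surfaces candidates.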


In [6]:
# Distribution of the target variable
print("=" * 80)
print("DISTRIBUTION OF THE TARGET VARIABLE (isFraud)")
print("=" * 80)

fraud_counts = df['isFraud'].value_counts()
fraud_percentage = df['isFraud'].value_counts(normalize=True) * 100

print("\n📊 Absolute counts:")
print(fraud_counts)
print("\n📊 Percentage:")
for idx, val in fraud_percentage.items():
    label = "❌ Not Fraud" if idx == 0 else "🚨 Fraud"
    print(f"{label}: {val:.4f}%")

# Number of non-fraud rows for every fraud row
print(f"\n⚠️ Imbalance ratio: 1:{fraud_counts[0]/fraud_counts[1]:.2f}")

DISTRIBUTION OF THE TARGET VARIABLE (isFraud)

📊 Absolute counts:
isFraud
0    199743
1       258
Name: count, dtype: int64

📊 Percentage:
❌ Not Fraud: 99.8710%
🚨 Fraud: 0.1290%

⚠️ Imbalance ratio: 1:774.20
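
The ratio above is simply the non-fraud count divided by the fraud count (199743 / 258). Indexing `value_counts()` by label raises a `KeyError` if a class is missing, which can happen on a small sample or subset. The following is a minimal sketch of a more defensive variant; `imbalance_ratio` is a hypothetical helper, and the series is rebuilt from the counts reported above rather than read from the CSV.

```python
import pandas as pd


def imbalance_ratio(target: pd.Series) -> float:
    """Non-fraud rows per fraud row; returns inf when there are no fraud rows."""
    # reindex guarantees both labels exist, filling a missing class with 0.
    counts = target.value_counts().reindex([0, 1], fill_value=0)
    if counts.loc[1] == 0:
        return float("inf")
    return counts.loc[0] / counts.loc[1]


# Series rebuilt from the counts reported above: 199743 non-fraud, 258 fraud.
target = pd.Series([0] * 199743 + [1] * 258)
ratio = imbalance_ratio(target)
print(f"1:{ratio:.2f}")
```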


In [7]:
# Save the loaded dataset for later use
# This lets other notebooks access the data without re-reading the CSV
import pickle

# Create the processed-data directory if it does not exist
os.makedirs('../../data/processed', exist_ok=True)

# Save the original dataset
df.to_pickle('../../data/processed/df_original.pkl')
print("✅ Dataset saved to: data/processed/df_original.pkl")

# Also save basic information about the dataset
dataset_info = {
    'fecha_carga': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'num_filas': df.shape[0],
    'num_columnas': df.shape[1],
    'columnas': list(df.columns),
    'tipos_datos': df.dtypes.to_dict(),
    'fraudes': int(fraud_counts[1]),
    'no_fraudes': int(fraud_counts[0])
}

with open('../../data/processed/dataset_info.pkl', 'wb') as f:
    pickle.dump(dataset_info, f)

print("✅ Dataset information saved")
print("\n" + "=" * 80)
print("DATA LOADING COMPLETED SUCCESSFULLY ✅")
print("=" * 80)

✅ Dataset saved to: data/processed/df_original.pkl
✅ Dataset information saved

DATA LOADING COMPLETED SUCCESSFULLY ✅
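
A later notebook could read the two pickles back roughly as follows. This is a sketch, not the project's actual loading code: `load_processed` is a hypothetical helper, the file names and the `num_filas` / `num_columnas` keys are the ones used in the cell above, and the round trip runs on a made-up frame in a temporary directory so the example is self-contained.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def load_processed(processed_dir):
    """Load the pickled DataFrame and metadata dict from processed_dir (hypothetical helper)."""
    processed_dir = Path(processed_dir)
    df = pd.read_pickle(processed_dir / "df_original.pkl")
    with open(processed_dir / "dataset_info.pkl", "rb") as f:
        info = pickle.load(f)
    # Cheap consistency check between the data and its stored metadata.
    if df.shape != (info["num_filas"], info["num_columnas"]):
        raise ValueError("DataFrame shape does not match dataset_info")
    return df, info


# Round trip on a made-up frame in a temporary directory, to show the call.
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"amount": [1.0, 2.0], "isFraud": [0, 1]})
    toy.to_pickle(Path(tmp) / "df_original.pkl")
    with open(Path(tmp) / "dataset_info.pkl", "wb") as f:
        pickle.dump({"num_filas": 2, "num_columnas": 2}, f)
    reloaded, info = load_processed(tmp)

print(reloaded.shape)
```

Pickle files can execute arbitrary code when loaded, so this approach is only appropriate for files produced by the project itself.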
