# Notebook 2: Data preprocessing
This notebook performs the following tasks:

1. Loading the dataset
2. Simulating 5% missing values
3. Imputing missing values
4. Balancing the classes
5. Scaling numeric variables
6. Saving the processed dataset

## 1. Loading the dataset


In [None]:
# Upload the original file.
from google.colab import files
uploaded = files.upload()

Saving creditcard.csv to creditcard.csv


In [8]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# 1. Load the dataset
df = pd.read_csv('creditcard.csv')
print('Original shape:', df.shape)
df.head()

Original shape: (284807, 31)


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


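## 2. Simulating 5% missing values
We randomly select 5% of the DataFrame's cells and set them to NaN. The selection covers every column, including the target `Class`, so about 5% of the labels are blanked too (14,367 in the output of this cell).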

In [3]:
# Total number of cells
total_cells = df.size
n_missing = int(total_cells * 0.05)

# Draw random flat indices without replacement
np.random.seed(42)
flat_indices = np.random.choice(total_cells, n_missing, replace=False)

# Convert flat indices to (row, column) positions
rows = flat_indices // df.shape[1]
cols = flat_indices % df.shape[1]

# Set the selected cells to NaN
for r, c in zip(rows, cols):
    df.iat[r, c] = np.nan

print('Missing values per column:')
print(df.isnull().sum())

Missing values per column:
Time      14122
V1        14366
V2        14252
V3        14365
V4        14089
V5        14050
V6        14274
V7        14323
V8        14228
V9        14184
V10       14056
V11       14119
V12       14133
V13       14239
V14       14232
V15       14341
V16       14316
V17       14208
V18       14197
V19       14333
V20       14376
V21       14259
V22       14109
V23       14266
V24       14325
V25       14351
V26       14311
V27       14200
V28       14170
Amount    14289
Class     14367
dtype: int64
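
The cell-by-cell loop above works but is slow on a DataFrame of this size, and it also blanks labels. As a minimal sketch of an alternative, the hypothetical helper `mask_random_cells` below (not part of the original notebook) builds a boolean mask in one step and, by assumption, leaves the `Class` column untouched. It is shown on toy data and uses NumPy's `default_rng`, so it would not reproduce the exact counts above.

```python
import numpy as np
import pandas as pd


def mask_random_cells(df, frac=0.05, exclude=("Class",), seed=42):
    """Return a copy of df with `frac` of the cells in non-excluded columns set to NaN."""
    feature_cols = [c for c in df.columns if c not in exclude]
    n_rows, n_cols = len(df), len(feature_cols)
    n_missing = int(n_rows * n_cols * frac)

    # Draw flat cell positions without replacement
    rng = np.random.default_rng(seed)
    flat = rng.choice(n_rows * n_cols, size=n_missing, replace=False)

    # Boolean mask with True at the cells to blank out
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    mask[flat // n_cols, flat % n_cols] = True

    out = df.copy()
    out[feature_cols] = out[feature_cols].mask(mask)
    return out


# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "V1": np.arange(100, dtype=float),
    "Amount": np.linspace(1.0, 50.0, 100),
    "Class": [0] * 95 + [1] * 5,
})
masked = mask_random_cells(toy)
print(masked.isnull().sum())
```

Whether the target should be excluded depends on what the exercise is meant to simulate; excluding it keeps the labels intact for the later balancing step.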


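## 3. Imputing missing values
Missing values are filled with each column's median. This applies to `Class` as well: its median is 0 in this heavily imbalanced dataset, so any fraud row whose label was blanked in step 2 is treated as non-fraud from here on.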

In [4]:
# Define a median imputer
imputer = SimpleImputer(strategy='median')

# Fit and impute the whole (all-numeric) DataFrame
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print('Missing values after imputation:')
print(df_imputed.isnull().sum())

Missing values after imputation:
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
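
Imputing the target with the median silently relabels rows. One possible alternative, sketched here on toy data, is to drop rows with a missing label and impute only the feature columns. The column names mirror the dataset, but the values and the decision to drop unlabelled rows are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with gaps in both features and label (hypothetical values)
toy = pd.DataFrame({
    "V1": [1.0, np.nan, 3.0, 10.0],
    "Amount": [5.0, 7.0, np.nan, 100.0],
    "Class": [0.0, 1.0, 0.0, np.nan],
})

# Rows without a label cannot be used for supervised training, so drop them
labelled = toy.dropna(subset=["Class"]).reset_index(drop=True)

# Impute the feature columns only, using each column's median
feature_cols = [c for c in labelled.columns if c != "Class"]
imputer = SimpleImputer(strategy="median")
labelled[feature_cols] = imputer.fit_transform(labelled[feature_cols])

print(labelled)
```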


## 4. Balancing the classes
The number of class 0 and class 1 examples is equalized by undersampling the majority class. Only rows still labelled 1 after imputation count as fraud here.

In [5]:
# Split into fraud (Class=1) and non-fraud (Class=0)
fraudes = df_imputed[df_imputed['Class'] == 1]
no_fraudes = df_imputed[df_imputed['Class'] == 0]

# Undersample the non-fraud rows to the number of fraud rows
no_fraudes_down = no_fraudes.sample(n=len(fraudes), random_state=42)

# Concatenate and shuffle
df_balanceado = pd.concat([fraudes, no_fraudes_down]).sample(frac=1, random_state=42).reset_index(drop=True)

print('Class distribution in the balanced dataset:')
print(df_balanceado['Class'].value_counts())
print('Shape of the balanced dataset:', df_balanceado.shape)

Class distribution in the balanced dataset:
Class
1.0    462
0.0    462
Name: count, dtype: int64
Shape of the balanced dataset: (924, 31)
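
The same undersampling idea can be wrapped in a small reusable function. The sketch below is an illustration on toy data: the helper name `undersample` and its parameters are hypothetical, and it downsamples every class to the size of the smallest one rather than hard-coding which class is the minority.

```python
import pandas as pd


def undersample(df, target="Class", seed=42):
    """Downsample every class to the size of the smallest one, then shuffle."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(target)
    ]
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=seed)
        .reset_index(drop=True)
    )


# Toy data: 17 majority rows and 3 minority rows (hypothetical values)
toy = pd.DataFrame({"x": range(20), "Class": [0] * 17 + [1] * 3})
balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Undersampling discards most of the majority class, which keeps the example simple but throws away information; that trade-off is worth keeping in mind when the balanced set is as small as the 924 rows above.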


## 5. Scaling numeric variables
We standardize the `Amount` column (mean 0, variance 1). The other columns are left as they are.

In [6]:
scaler = StandardScaler()
df_balanceado['Amount_scaled'] = scaler.fit_transform(df_balanceado[['Amount']])

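# Optionally, drop the original 'Amount' column
# df_balanceado.drop(columns=['Amount'], inplace=True)

# Rename for consistency. Note: with the drop above commented out, this leaves
# two columns named 'Amount' (raw and scaled); the output below shows the
# scaled one as 'Amount.1'.
df_balanceado.rename(columns={'Amount_scaled': 'Amount'}, inplace=True)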

df_balanceado.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V22,V23,V24,V25,V26,V27,V28,Amount,Class,Amount.1
0,102572.0,-28.709229,22.057729,0.179011,11.845013,-18.983813,6.474115,-43.557242,-41.044261,-13.320155,...,8.316275,5.46623,0.023854,-1.527145,-0.052266,-5.682338,-0.439134,0.01,1.0,-0.449677
1,10998.0,-0.211134,0.542917,1.526624,-0.44593,-0.163348,-0.274603,0.227532,-0.027924,1.3575,...,0.006634,-0.164857,-0.474453,-0.130723,1.046205,-0.080208,0.001368,39.0,0.0,-0.271375
2,13323.0,-5.454362,8.287421,-12.752811,8.594342,-3.106002,-0.274603,-9.252794,4.245062,-6.329801,...,-0.267172,-0.310804,-1.201685,1.352176,-0.052266,1.574715,0.808725,1.0,1.0,-0.44515
3,55028.0,-0.735658,1.234143,0.731932,1.010075,0.320246,0.859934,0.194436,0.723279,-0.694886,...,0.006634,-0.196655,-0.819615,0.028766,-0.185786,0.297339,0.124236,19.95,0.0,-0.358491
4,84708.0,-10.30082,6.483095,-15.076363,6.554191,-8.880252,-4.471672,-14.900689,0.022167,-4.358441,...,1.041642,-0.68279,0.573544,-1.602389,-0.393521,-0.468893,0.10592,1.0,1.0,-0.44515
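
The output above ends up with two `Amount` columns, the raw values and the scaled ones. If only the scaled values are wanted, one option is to overwrite the column in place so no rename is needed. This is a minimal sketch on toy data, not the code that produced the output above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for df_balanceado (hypothetical values)
toy = pd.DataFrame({"Amount": [1.0, 2.0, 3.0, 4.0], "Class": [0, 1, 0, 1]})

scaler = StandardScaler()
# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it to 1-D
toy["Amount"] = scaler.fit_transform(toy[["Amount"]]).ravel()

# The column list is unchanged: a single 'Amount' column, now standardized
print(toy.columns.tolist())
print(toy["Amount"].mean(), toy["Amount"].std(ddof=0))
```

`StandardScaler` uses the population standard deviation, which is why the check passes `ddof=0` to `std`.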


## 6. Saving the processed dataset
We save the final DataFrame to a CSV file, `dataset_preprocesado.csv`.

In [7]:
df_balanceado.to_csv('dataset_preprocesado.csv', index=False)
print('File `dataset_preprocesado.csv` saved successfully.')

File `dataset_preprocesado.csv` saved successfully.
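
Because the DataFrame from section 5 carries two columns named `Amount`, it is worth knowing what happens when the saved file is read back. The sketch below uses an in-memory buffer and a one-row toy frame (hypothetical values) to show that `pd.read_csv` renames the second duplicate by appending `.1`.

```python
import io

import pandas as pd

# Two columns that share the name 'Amount', as after the rename in section 5
dup = pd.DataFrame([[149.62, 0, 0.5]], columns=["Amount", "Class", "Amount"])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
dup.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())  # ['Amount', 'Class', 'Amount.1']
```

Any later notebook that loads `dataset_preprocesado.csv` would therefore find the raw values under `Amount` and the scaled ones under `Amount.1`.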
