# PREPROCESSING

- [Importing libraries](#Importing-libraries)
- [Reading the datasets](#Reading-the-datasets)
- [Removing irrelevant features](#Removing-irrelevant-features)
- [Handling missing data](#Handling-missing-data)
- [Handling outliers](#Handling-outliers)
- [Type management](#Type-management)
- [Encoding categorical variables](#Encoding-categorical-variables)
- [Normalization and standardization](#Normalization-and-standardization)
- [Data transformations](#Data-transformations)

## Importing libraries

!pip install -r requirements.txt

In [1]:
import os
import shutil
import zipfile
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

from utils import *

## Reading the datasets

In [2]:
INPUT_ZIP = "./00_Data/Raw/archive.zip"  # Path to the ZIP archive
OUTPUT_FOLDER = "./00_Data/Clean"  # Destination directory
TRAIN_FILENAME = "train.csv"  # Name of the training file
TEST_FILENAME = "test.csv"  # Name of the test file

def fetch_data(input_path=INPUT_ZIP, output_dir=OUTPUT_FOLDER):
    """
    Extract the contents of a ZIP archive into a destination directory.

    Parameters:
    -----------
    input_path : str, optional
        Path to the ZIP archive to extract. Defaults to 'INPUT_ZIP'.

    output_dir : str, optional
        Directory into which the archive is extracted. It is created
        automatically if it does not exist. Defaults to 'OUTPUT_FOLDER'.

    Behavior:
    ---------
    - Creates the destination directory if it does not exist.
    - Extracts the ZIP archive into the destination directory, but only
      when that directory does not already contain a CSV file.

    Exceptions:
    -----------
    May raise an exception if the ZIP archive does not exist or cannot be extracted.

    Usage example:
    --------------
    fetch_data('data.zip', 'output/')
    """
    # Make sure the destination directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Extract the ZIP archive only if the folder has no CSV file yet
    has_csv = any(name.endswith(".csv") for name in os.listdir(output_dir))
    if not has_csv:
        with zipfile.ZipFile(input_path, "r") as zip_ref:
            zip_ref.extractall(output_dir)


def load_data(directory=OUTPUT_FOLDER, filename=TRAIN_FILENAME):
    """
    Read a CSV file from the given directory.

    Parameters:
    -----------
    directory : str
        Directory that contains the CSV file.

    filename : str
        Name of the CSV file to read (including the .csv extension).

    Returns:
    --------
    pd.DataFrame
        A pandas DataFrame with the contents of the CSV file.

    Exceptions:
    -----------
    FileNotFoundError:
        Raised if the file does not exist in the given directory.

    Usage example:
    --------------
    df = load_data('data', 'file.csv')
    """
    # Build the full path to the CSV file
    file_path = os.path.join(directory, filename)

    # Check that the file exists
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File {filename} not found in directory {directory}")

    # Read the CSV file into a DataFrame
    return pd.read_csv(file_path)

fetch_data()
df_train = load_data(OUTPUT_FOLDER, TRAIN_FILENAME)
df_test = load_data(OUTPUT_FOLDER, TEST_FILENAME)

print("Train dataset:", df_train.shape)
print("Test dataset:", df_test.shape)

Train dataset: (227845, 31)
Test dataset: (56962, 31)


In [3]:
df_train

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,143352.0,1.955041,-0.380783,-0.315013,0.330155,-0.509374,-0.086197,-0.627978,0.035994,1.054560,...,0.238197,0.968305,0.053208,-0.278602,-0.044999,-0.216780,0.045168,-0.047145,9.99,0
1,117173.0,-0.400975,-0.626943,1.555339,-2.017772,-0.107769,0.168310,0.017959,-0.401619,0.040378,...,-0.153485,0.421703,0.113442,-1.004095,-1.176695,0.361924,-0.370469,-0.144792,45.90,0
2,149565.0,0.072509,0.820566,-0.561351,-0.709897,1.080399,-0.359429,0.787858,0.117276,-0.131275,...,-0.314638,-0.872959,0.083391,0.148178,-0.431459,0.119690,0.206395,0.070288,11.99,0
3,93670.0,-0.535045,1.014587,1.750679,2.769390,0.500089,1.002270,0.847902,-0.081323,0.371579,...,0.063525,0.443431,-0.072754,0.448192,-0.655203,-0.181038,-0.093013,-0.064931,117.44,0
4,82655.0,-4.026938,1.897371,-0.429786,-0.029571,-0.855751,-0.480406,-0.435632,1.313760,0.536044,...,-0.480691,-0.230369,0.250717,0.066399,0.470787,0.245335,0.286904,-0.322672,25.76,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227840,75618.0,1.173488,0.100792,0.490512,0.461596,-0.296377,-0.213165,-0.165254,0.119221,-0.114199,...,-0.186027,-0.574283,0.161405,-0.006140,0.091444,0.109235,-0.020922,0.003967,1.98,0
227841,159000.0,-0.775981,0.144023,-1.142399,-1.241113,1.940358,3.912076,-0.466107,1.360620,0.400697,...,0.037078,-0.019575,0.241830,0.682820,-1.635109,-0.770941,0.066006,0.137056,89.23,0
227842,79795.0,-0.146609,0.992946,1.524591,0.485774,0.349308,-0.815198,1.076640,-0.395316,-0.491303,...,0.052649,0.354089,-0.291198,0.402849,0.237383,-0.398467,-0.121139,-0.196195,3.94,0
227843,87931.0,-2.948638,2.354849,-2.521201,-3.798905,1.866302,2.727695,-0.471769,2.217537,0.580199,...,-0.332759,-1.047514,0.143326,0.678869,0.319710,0.426309,0.496912,0.335822,1.00,0


## Removing irrelevant features

The initial plan was to reduce the feature set based on each variable's correlation with the class. That approach was discarded because correlation does not adequately reflect a variable's predictive relevance, especially in a problem with nonlinear relationships and severe class imbalance. The full set of numeric variables is kept instead, and feature selection is deferred to a later supervised stage.
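To illustrate why a correlation filter can mislead here, the following sketch uses synthetic data (not the notebook's dataset). The feature `x` fully determines a rare label `y`, yet its Pearson correlation with `y` is close to zero because the relationship is symmetric rather than linear. All names and the 2.5 threshold are made up for illustration.

```python
import numpy as np

# Synthetic data only: a standard-normal feature and a rare positive class
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# The label depends on x nonlinearly: positives sit in both tails (about 1% of rows)
y = (np.abs(x) > 2.5).astype(int)

# Pearson correlation between the raw feature and the label
pearson_raw = np.corrcoef(x, y)[0, 1]

# The same feature after a nonlinear transform correlates much more strongly
pearson_abs = np.corrcoef(np.abs(x), y)[0, 1]

print(f"positive rate: {y.mean():.4f}")
print(f"corr(x, y):    {pearson_raw:.4f}")
print(f"corr(|x|, y):  {pearson_abs:.4f}")
```

A filter that kept only features with high absolute correlation would drop `x` here, even though a model that can learn nonlinear boundaries could separate the classes with it.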

In [4]:
# Disabled: correlation-based column selection, kept for reference only.
# cols_keep = ["V2", "V4", "V8", "V11", "V19", "V20", "V21", "V27", "V28"]
# predict_col = "Class"
# print("Original columns:", df_train.columns.tolist())
#
# df_train = df_train[cols_keep + [predict_col]]
# df_test = df_test[cols_keep]
#
# print("Columns after removing irrelevant ones:", df_train.columns.tolist())

## Handling missing data

In [5]:
df_train.duplicated().sum()

np.int64(732)

In [6]:
df_test.duplicated().sum()

np.int64(61)
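The two cells above only count fully duplicated rows (732 in train, 61 in test); nothing in this section removes them. Whether they should be removed is a judgment call, since identical rows in transaction data can be legitimate. If dropping them were desired, a minimal sketch on a made-up toy frame could look like this:

```python
import pandas as pd

# Toy frame (illustrative values only): the first two rows are exact duplicates
toy = pd.DataFrame(
    {
        "Time": [0.0, 0.0, 1.0, 2.0],
        "Amount": [9.99, 9.99, 45.90, 9.99],
        "Class": [0, 0, 0, 1],
    }
)

# Count rows that repeat an earlier row in every column
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a contiguous index
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```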

In [7]:
df_train.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [8]:
df_test.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
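Since every column reports zero missing values, the per-column listings can be condensed into one number per frame. The helper below is a hypothetical convenience, not part of the notebook or of `utils`, shown on toy data:

```python
import numpy as np
import pandas as pd


def total_missing(df: pd.DataFrame) -> int:
    """Return the total number of missing cells in the DataFrame."""
    return int(df.isnull().sum().sum())


# Toy frames (illustrative values only)
toy_complete = pd.DataFrame({"V1": [0.1, -0.2], "Amount": [9.99, 45.90]})
toy_with_gap = pd.DataFrame({"V1": [0.1, np.nan], "Amount": [9.99, 45.90]})

print(total_missing(toy_complete), total_missing(toy_with_gap))
```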

## Handling outliers

In [9]:
# Columns to analyze (every column except the target, Class)
cols_excluir = ["Class"]
cols_outlier = [col for col in df_train.columns if col not in cols_excluir]

# Compute the IQR-based outlier limits on the training set
outlier_limits = {}
for col in cols_outlier:
    Q1 = df_train[col].quantile(0.25)
    Q3 = df_train[col].quantile(0.75)
    IQR = Q3 - Q1
    outlier_limits[col] = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

# Count the outliers in each column
outliers_train = {
    col: ((df_train[col] < outlier_limits[col][0]) | (df_train[col] > outlier_limits[col][1])).sum()
    for col in cols_outlier
}

print("Number of outliers in df_train per column:")
print(outliers_train)

Number of outliers in df_train per column:
{'Time': np.int64(0), 'V1': np.int64(5658), 'V2': np.int64(10824), 'V3': np.int64(2683), 'V4': np.int64(8910), 'V5': np.int64(9848), 'V6': np.int64(18410), 'V7': np.int64(7139), 'V8': np.int64(19249), 'V9': np.int64(6606), 'V10': np.int64(7662), 'V11': np.int64(622), 'V12': np.int64(12309), 'V13': np.int64(2749), 'V14': np.int64(11316), 'V15': np.int64(2306), 'V16': np.int64(6586), 'V17': np.int64(5888), 'V18': np.int64(5963), 'V19': np.int64(8206), 'V20': np.int64(22135), 'V21': np.int64(11548), 'V22': np.int64(1059), 'V23': np.int64(14858), 'V24': np.int64(3803), 'V25': np.int64(4314), 'V26': np.int64(4464), 'V27': np.int64(31220), 'V28': np.int64(24182), 'Amount': np.int64(25553)}
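The cell above applies the usual 1.5 × IQR rule column by column. The same rule can be expressed without an explicit loop by letting pandas compute the quantiles for all columns at once. This is a sketch on toy data; `iqr_limits` and `count_outliers` are hypothetical helper names, not functions from the notebook.

```python
import pandas as pd


def iqr_limits(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return one row per column with the lower and upper IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({"lower": q1 - k * iqr, "upper": q3 + k * iqr})


def count_outliers(df: pd.DataFrame, limits: pd.DataFrame) -> pd.Series:
    """Count, per column, the values that fall outside the fences."""
    below = df.lt(limits["lower"], axis="columns")
    above = df.gt(limits["upper"], axis="columns")
    return (below | above).sum()


# Toy data (illustrative values only): column "a" has one extreme value
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 100.0],
        "b": [10.0, 11.0, 12.0, 13.0, 14.0],
    }
)

limits = iqr_limits(toy)
counts = count_outliers(toy, limits)
print(limits)
print(counts)
```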


In [10]:
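# Log-transform Amount with log1p to compress its heavy right tail.
# This overwrites the column in place, so re-running the cell applies the transform again.
df_train["Amount"] = np.log1p(df_train["Amount"])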
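As a quick illustration of what `np.log1p` does to this column, the sketch below applies it to the first four `Amount` values shown in the training table and inverts it with `np.expm1`. `log1p(x)` computes `log(1 + x)`, so a zero amount maps to zero instead of negative infinity.

```python
import numpy as np

# First Amount values from the training table displayed earlier, plus a zero
amounts = np.array([0.0, 9.99, 45.90, 117.44])

# log1p(x) = log(1 + x): well defined at zero, compresses large values
log_amounts = np.log1p(amounts)

# expm1 is the exact inverse, useful to report results on the original scale
recovered = np.expm1(log_amounts)

print(log_amounts)
print(recovered)
```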

In [11]:
# Winsorize: clip each column to the IQR limits computed on the training set
df_train_winsorized = df_train.copy()
for col in cols_outlier:
    lower, upper = outlier_limits[col]
    df_train_winsorized[col] = np.clip(df_train_winsorized[col], lower, upper)

print(df_train_winsorized)


            Time        V1        V2        V3        V4        V5        V6  \
0       143352.0  1.955041 -0.380783 -0.315013  0.330155 -0.509374 -0.086197   
1       117173.0 -0.400975 -0.626943  1.555339 -2.017772 -0.107769  0.168310   
2       149565.0  0.072509  0.820566 -0.561351 -0.709897  1.080399 -0.359429   
3        93670.0 -0.535045  1.014587  1.750679  2.769390  0.500089  1.002270   
4        82655.0 -4.026938  1.897371 -0.429786 -0.029571 -0.855751 -0.480406   
...          ...       ...       ...       ...       ...       ...       ...   
227840   75618.0  1.173488  0.100792  0.490512  0.461596 -0.296377 -0.213165   
227841  159000.0 -0.775981  0.144023 -1.142399 -1.241113  1.940358  2.142999   
227842   79795.0 -0.146609  0.992946  1.524591  0.485774  0.349308 -0.815198   
227843   87931.0 -2.948638  2.354849 -2.521201 -3.235620  1.866302  2.142999   
227844   76381.0  1.233174 -0.784851  0.386784 -0.698559 -1.034018 -0.637028   

              V7        V8        V9  .

In [12]:
# Columns to analyze (every column except the target, Class)
cols_excluir = ["Class"]
cols_outlier = [col for col in df_train_winsorized.columns if col not in cols_excluir]

# Recompute the IQR-based outlier limits on the winsorized training set
outlier_limits = {}
for col in cols_outlier:
    Q1 = df_train_winsorized[col].quantile(0.25)
    Q3 = df_train_winsorized[col].quantile(0.75)
    IQR = Q3 - Q1
    outlier_limits[col] = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

# Count the outliers in each column
outliers_train = {
    col: (
        (df_train_winsorized[col] < outlier_limits[col][0])
        | (df_train_winsorized[col] > outlier_limits[col][1])
    ).sum()
    for col in cols_outlier
}

print("Number of outliers in df_train_winsorized per column:")
print(outliers_train)

Number of outliers in df_train_winsorized per column:
{'Time': np.int64(0), 'V1': np.int64(0), 'V2': np.int64(0), 'V3': np.int64(0), 'V4': np.int64(0), 'V5': np.int64(0), 'V6': np.int64(0), 'V7': np.int64(0), 'V8': np.int64(0), 'V9': np.int64(0), 'V10': np.int64(0), 'V11': np.int64(0), 'V12': np.int64(0), 'V13': np.int64(0), 'V14': np.int64(0), 'V15': np.int64(0), 'V16': np.int64(0), 'V17': np.int64(0), 'V18': np.int64(0), 'V19': np.int64(0), 'V20': np.int64(0), 'V21': np.int64(0), 'V22': np.int64(0), 'V23': np.int64(0), 'V24': np.int64(0), 'V25': np.int64(0), 'V26': np.int64(0), 'V27': np.int64(0), 'V28': np.int64(0), 'Amount': np.int64(203)}
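`Amount` is the only column that still reports outliers. The limits used for clipping were computed on the raw `Amount` before the log transform, so they are far wider than the log-scale values and the clip leaves that column unchanged. The limits recomputed here are on the log scale and flag 203 rows. In addition, `df_test` goes through neither the log transform nor the clipping in this section. One possible way to keep both sets consistent is to transform first, fit the limits on train only, and apply them to both frames. The sketch below assumes that ordering and uses hypothetical helper names and toy data:

```python
import numpy as np
import pandas as pd


def log_amount(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Amount replaced by log1p(Amount)."""
    out = df.copy()
    out["Amount"] = np.log1p(out["Amount"])
    return out


def fit_iqr_limits(df: pd.DataFrame, cols: list, k: float = 1.5):
    """Compute lower and upper IQR fences for the given columns."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(df: pd.DataFrame, cols: list, lower: pd.Series, upper: pd.Series) -> pd.DataFrame:
    """Return a copy with the given columns clipped to per-column fences."""
    out = df.copy()
    out[cols] = out[cols].clip(lower=lower, upper=upper, axis=1)
    return out


# Toy data (illustrative values only)
toy_train = pd.DataFrame(
    {
        "V1": [-1.0, 0.0, 0.5, 1.0, 30.0],
        "Amount": [1.0, 5.0, 20.0, 80.0, 25000.0],
        "Class": [0, 0, 0, 0, 1],
    }
)
toy_test = pd.DataFrame({"V1": [-40.0, 0.2], "Amount": [10.0, 10000.0], "Class": [0, 0]})
feature_cols = ["V1", "Amount"]

# 1) Same transform on both sets, 2) limits fitted on train only, 3) same clip on both
train_log = log_amount(toy_train)
test_log = log_amount(toy_test)
lower, upper = fit_iqr_limits(train_log, feature_cols)
train_clean = winsorize(train_log, feature_cols, lower, upper)
test_clean = winsorize(test_log, feature_cols, lower, upper)

print(train_clean)
print(test_clean)
```

Fitting the limits on the training set alone avoids leaking test-set statistics into preprocessing, while applying them to both sets keeps the feature scales comparable at prediction time.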


In [13]:
df_train.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,143352.0,1.955041,-0.380783,-0.315013,0.330155,-0.509374,-0.086197,-0.627978,0.035994,1.05456,...,0.238197,0.968305,0.053208,-0.278602,-0.044999,-0.21678,0.045168,-0.047145,2.396986,0
1,117173.0,-0.400975,-0.626943,1.555339,-2.017772,-0.107769,0.16831,0.017959,-0.401619,0.040378,...,-0.153485,0.421703,0.113442,-1.004095,-1.176695,0.361924,-0.370469,-0.144792,3.848018,0
2,149565.0,0.072509,0.820566,-0.561351,-0.709897,1.080399,-0.359429,0.787858,0.117276,-0.131275,...,-0.314638,-0.872959,0.083391,0.148178,-0.431459,0.11969,0.206395,0.070288,2.56418,0
3,93670.0,-0.535045,1.014587,1.750679,2.76939,0.500089,1.00227,0.847902,-0.081323,0.371579,...,0.063525,0.443431,-0.072754,0.448192,-0.655203,-0.181038,-0.093013,-0.064931,4.774407,0
4,82655.0,-4.026938,1.897371,-0.429786,-0.029571,-0.855751,-0.480406,-0.435632,1.31376,0.536044,...,-0.480691,-0.230369,0.250717,0.066399,0.470787,0.245335,0.286904,-0.322672,3.286908,0


In [14]:
df_train_winsorized.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,...,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0,227845.0
mean,94792.551673,0.076176,0.050431,0.026148,-0.02936,-0.002033,-0.092039,0.009206,0.068464,-0.01418,...,-0.015331,0.000723,-0.004137,0.001902,0.00281,-0.004257,0.012624,0.011246,3.151578,0.001729
std,47488.471663,1.593853,1.121805,1.369936,1.310303,1.075404,1.001552,0.863776,0.485234,1.018687,...,0.323893,0.707295,0.266998,0.594195,0.493268,0.467842,0.165498,0.124361,1.656858,0.041548
min,0.0,-4.273117,-2.700774,-3.765064,-3.23562,-2.638763,-2.515516,-2.236348,-1.012333,-2.504526,...,-0.850945,-2.148134,-0.625544,-1.546901,-1.320267,-1.179908,-0.3145,-0.250147,0.0,0.0
25%,54161.0,-0.919918,-0.597971,-0.890786,-0.84927,-0.688802,-0.768573,-0.552156,-0.208431,-0.642386,...,-0.22873,-0.542809,-0.161296,-0.354887,-0.317835,-0.327476,-0.07096,-0.05298,1.88707,0.0
50%,84707.0,0.017978,0.06605,0.179041,-0.020959,-0.054711,-0.274846,0.041272,0.022233,-0.050414,...,-0.029639,0.005491,-0.010595,0.040766,0.015101,-0.052011,0.001359,0.011366,3.135494,0.0
75%,139305.0,1.315548,0.803898,1.025399,0.74163,0.611173,0.396056,0.570639,0.327504,0.59904,...,0.18608,0.527408,0.148202,0.43979,0.350453,0.240813,0.0914,0.078464,4.35799,0.0
max,172792.0,2.45493,2.906702,3.899677,3.127981,2.561135,2.142999,2.254832,1.131406,2.46118,...,0.808295,2.132732,0.612449,1.631804,1.352885,1.093246,0.33494,0.275631,10.153941,1.0


In [15]:
df_test.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,...,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0,56962.0
mean,94899.09006,-0.003663,0.001537,0.00368,0.005833,-0.003986,0.002452,-0.004358,0.001533,-0.003846,...,-0.000862,0.004862,-0.000699,0.000868,0.003917,0.000766,0.000128,-0.001029,87.828131,0.00172
std,47487.164345,1.960141,1.629975,1.523083,1.415928,1.350162,1.313994,1.211161,1.216935,1.102938,...,0.74211,0.727247,0.613779,0.606456,0.519315,0.481364,0.410803,0.331308,229.876748,0.041443
min,0.0,-34.148234,-48.060856,-33.680984,-5.560118,-23.669726,-20.869626,-41.506796,-50.42009,-13.434066,...,-22.889347,-8.887017,-32.828995,-2.824849,-8.696627,-2.068561,-22.565679,-11.710896,0.0,0.0
25%,54323.25,-0.921986,-0.600834,-0.888704,-0.845879,-0.702349,-0.767227,-0.560154,-0.209106,-0.646208,...,-0.227119,-0.540769,-0.164144,-0.353275,-0.314229,-0.324758,-0.070267,-0.05286,5.59,0.0
50%,84670.5,0.018747,0.06437,0.182747,-0.015815,-0.052826,-0.271924,0.035441,0.022724,-0.05443,...,-0.02878,0.011908,-0.01333,0.042024,0.022654,-0.052565,0.001293,0.010779,22.0,0.0
75%,139390.0,1.316075,0.802709,1.033586,0.749608,0.614582,0.409334,0.569454,0.326524,0.590325,...,0.187496,0.532741,0.145646,0.438769,0.351856,0.241507,0.089403,0.077587,77.49,0.0
max,172787.0,2.439207,21.467203,9.382558,12.699542,29.016124,16.493227,21.437514,19.168327,15.594995,...,27.202839,8.361985,22.083545,3.990646,6.07085,3.463246,9.200883,15.942151,10000.0,1.0


## Writing the resulting DataFrames

In [16]:
OUTPUT_FOLDER = "./00_Data/Cleaned/"  # Overrides the earlier value ("./00_Data/Clean")

def save_dataframes_to_csv(output_folder, df_train, df_test, train_filename="train_winsorized_clean.csv", test_filename="test_clean.csv"):
    """
    Save the training and test DataFrames as CSV files in the given folder.
    If the folder already exists, it is deleted with all its contents before the new files are written.

    Args:
        output_folder (str): Path of the folder where the CSV files are saved.
        df_train (pd.DataFrame): Training DataFrame to save.
        df_test (pd.DataFrame): Test DataFrame to save.
        train_filename (str, optional): Name of the CSV file for the training DataFrame.
        test_filename (str, optional): Name of the CSV file for the test DataFrame.

    """
    # If the folder already exists, delete it together with its contents
    if os.path.exists(output_folder):
        shutil.rmtree(output_folder)
        print(f"Folder {output_folder} deleted.")

    # Create the folder
    os.makedirs(output_folder, exist_ok=True)

    # Build the full paths of the output files
    train_path = os.path.join(output_folder, train_filename)
    test_path = os.path.join(output_folder, test_filename)

    # Save the DataFrames as CSV without the index column
    df_train.to_csv(train_path, index=False)
    df_test.to_csv(test_path, index=False)

    print(f"DataFrames saved to {output_folder}:")
    print(f" - {train_filename}")
    print(f" - {test_filename}")

save_dataframes_to_csv(OUTPUT_FOLDER, df_train_winsorized, df_test)

Folder ./00_Data/Cleaned/ deleted.
DataFrames saved to ./00_Data/Cleaned/:
 - train_winsorized_clean.csv
 - test_clean.csv
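A quick round-trip check can confirm that a saved file reloads with the same columns and values. The sketch below writes a toy frame to a temporary directory rather than to the project folders, so the paths are illustrative. Writing with `index=False` matters because otherwise the index is stored as an extra unnamed column that `pd.read_csv` loads back as `Unnamed: 0`.

```python
import os
import tempfile

import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame(
    {
        "Time": [0.0, 1.0],
        "Amount": [2.396986, 3.848018],
        "Class": [0, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_winsorized_clean.csv")

    # Save without the index, then read the file back
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same shape and same column names means no stray index column was written
round_trip_ok = reloaded.shape == toy.shape and list(reloaded.columns) == list(toy.columns)
print(round_trip_ok)
```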
