# Master's in Data Science - Machine Learning

# Exploratory Data Analysis (EDA)

Authors: Frida Ibarra and Gema Romero

#### **Notebook 3 (Data_Encoding_Scaling)** 

This notebook focuses on encoding and scaling the data. Proper data preparation is a fundamental step in building predictive models, because it makes every variable compatible with the algorithms used. The process includes:

1. Encoding categorical variables. Categorical variables are transformed into numeric formats so that the algorithms can interpret the information correctly. The encoding method is chosen according to the characteristics of each variable (nominal or ordinal) and its number of categories.

2. Scaling the data. Numeric variables are normalized or standardized so that they all operate on the same scale, which reduces possible biases caused by differing magnitudes and improves the efficiency of the model.

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
import warnings
import sys

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [10]:
# Helper functions
sys.path.append('../src')  # ../src must be the folder that contains functions_src.py
import functions_src as fa
sys.path.remove('../src')

# Random seed
seed = 25

In [11]:
df_loans_train = pd.read_csv('../data/Processing_data/df_loans_train.csv')
df_loans_test = pd.read_csv('../data/Processing_data/df_loans_test.csv')
df_loans_train.head()

Unnamed: 0.1,Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,TARGET
0,84056,197497,Cash loans,F,N,N,1,85500.0,862560.0,25348.5,720000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.002134,-11809,-3745,-1335.0,-2426,,1,1,0,1,0,0,Medicine staff,3.0,3,3,WEDNESDAY,8,0,0,0,1,1,1,Medicine,0.110762,0.076364,0.3808,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,1.0,4.0,1.0,-707.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,2.0,5.0,0
1,195841,327085,Cash loans,M,Y,Y,0,180000.0,1006920.0,42790.5,900000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.003122,-13873,-3460,-2734.0,-5159,14.0,1,1,0,1,0,0,,2.0,3,3,WEDNESDAY,15,0,0,0,0,0,0,Business Entity Type 3,0.270259,0.474984,0.700184,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1425.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0
2,79945,192672,Cash loans,M,N,Y,0,180000.0,238896.0,13842.0,189000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.031329,-16489,-8675,-7422.0,-37,,1,1,0,1,0,0,,2.0,2,2,SUNDAY,12,0,0,0,0,0,0,Industry: type 5,,0.737165,0.621226,0.0124,0.0169,0.9682,0.5648,0.0074,0.0,0.069,0.0417,0.0417,0.0201,0.0101,0.0097,0.0,0.0,0.0126,0.0175,0.9682,0.5818,0.0074,0.0,0.069,0.0417,0.0417,0.0206,0.011,0.0101,0.0,0.0,0.0125,0.0169,0.9682,0.5706,0.0074,0.0,0.069,0.0417,0.0417,0.0205,0.0103,0.0099,0.0,0.0,reg oper spec account,block of flats,0.0117,"Stone, brick",No,0.0,0.0,0.0,0.0,-1574.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,4.0,0.0,4.0,0
3,22961,126717,Revolving loans,F,N,Y,1,81000.0,180000.0,9000.0,180000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.035792,-8413,-1468,-3275.0,-13,,1,1,0,1,0,0,Laborers,3.0,2,2,THURSDAY,14,0,0,0,0,0,0,Transport: type 3,0.216299,0.571497,,0.0144,0.0,0.9344,,,0.0,0.0345,0.0417,,,,0.0083,,0.0,0.0147,0.0,0.9345,,,0.0,0.0345,0.0417,,,,0.0087,,0.0,0.0146,0.0,0.9344,,,0.0,0.0345,0.0417,,,,0.0085,,0.0,,block of flats,0.0066,"Stone, brick",No,0.0,0.0,0.0,0.0,-1012.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,0
4,168843,295703,Cash loans,F,N,Y,0,126000.0,675000.0,21906.0,675000.0,Unaccompanied,Pensioner,Lower secondary,Married,Municipal apartment,0.009657,-22877,365243,-11276.0,-3951,,1,0,0,1,1,0,,2.0,2,2,MONDAY,13,0,0,0,0,0,0,XNA,,0.643364,0.270707,0.0165,0.0,0.9762,0.6736,0.0018,0.0,0.069,0.0417,0.0833,0.022,0.0134,0.0147,0.0,0.0,0.0168,0.0,0.9762,0.6864,0.0018,0.0,0.069,0.0417,0.0833,0.0225,0.0147,0.0153,0.0,0.0,0.0167,0.0,0.9762,0.678,0.0018,0.0,0.069,0.0417,0.0833,0.0223,0.0137,0.0149,0.0,0.0,reg oper account,block of flats,0.0115,Panel,No,0.0,0.0,0.0,0.0,-1092.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,2.0,6.0,0


In [12]:
# Drop the first column, which duplicates the index
df_loans_train = df_loans_train.drop('Unnamed: 0',axis=1)
df_loans_test = df_loans_test.drop('Unnamed: 0',axis=1)

In [5]:
# Data dictionary:
var_description = pd.read_excel('../data/columns_description.xlsx')

### Encoding Categorical Variables

Categorical variables represent discrete attributes that cannot be read directly as numeric values. With a suitable encoding, however, they can be transformed into numeric formats that modeling algorithms can process.

These variables fall into two types: ordinal and nominal. Ordinal variables have an order or hierarchy among their categories, so one category can be ranked above or below another. Nominal variables group elements into distinct categories with no order or hierarchy among them.

The goal, therefore, is to convert the categorical variables in our dataset into numeric representations.

Several techniques are available for this encoding:

1. **One-Hot Encoding**: Turns each unique value of a categorical variable into a binary column (0 or 1), with a 1 for the category present and 0 for the rest. It suits categories with no inherent order, because it keeps the model from assuming a hierarchy among them.

2. **Ordinal Encoding**: Assigns a number to each category while respecting the natural order of the categories. It is used when the categories have a hierarchical order.

3. **Mean Encoding**: Replaces each value of the categorical variable with the mean of the target variable for that category. It is useful when there is a statistical relationship between the categorical variable and the target, but it can cause overfitting if not handled carefully, especially for categories with few observations.

4. **Target Encoding**: Similar to mean encoding, but instead of using the raw per-category mean, it typically blends the mean of the target for each category with the global mean of the target (smoothing), so categories with few observations are pulled toward the overall rate. It still uses the target, so it must be applied with care to avoid overfitting.

5. **CatBoostEncoder**: A variant of target encoding that supports **binomial** and **continuous** targets and offers **time-aware encoding**, **regularization**, and **online learning**. It is sensitive to the order of the rows, which makes it well suited to time-series problems. If the data have no temporal dependency, it still works well as long as the row order does not leak information; randomizing the order (for example by **shuffling** or **resampling**) guards against this.
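
To make the first three techniques concrete, here is a minimal sketch in plain Python, so that the arithmetic stays visible. The function names and the toy data are hypothetical and are not taken from the loans dataset or from any library.

```python
from collections import defaultdict

def one_hot_encode(values):
    """Return one dict of 0/1 indicator columns per input value."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def ordinal_encode(values, order):
    """Map each value to its position in an explicit, user-supplied order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

def mean_encode(values, target):
    """Replace each category with the mean of the target observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[v] for v in values]

# Toy data, invented for illustration
education = ["Lower", "Higher", "Secondary", "Higher"]
default = [1, 0, 1, 1]

one_hot = one_hot_encode(education)
ordinal = ordinal_encode(education, order=["Lower", "Secondary", "Higher"])
mean_enc = mean_encode(education, default)
```

Note that the ordinal encoder needs the order to be supplied explicitly: the order of an ordinal variable is domain knowledge and cannot be inferred from the strings themselves.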

### Encoding Categorical Variables in This Project

At the start of this project we saw that several categorical variables are stored as strings. This complicates the analysis, because many machine learning algorithms require categorical variables in numeric form. Handling string variables directly can also hurt performance and accuracy, especially when a variable has many distinct values. To address this, we apply some of the encoding techniques described in the previous section.

The right technique depends on the nature of the data and on the number of categories. Applying these techniques lets the predictive models work with categorical variables more efficiently and makes their results easier to interpret.


In [14]:
# Call the helper function and store the results in variables
col_bool, col_cat, col_num = fa.categorizar_columnas(df_loans_train)

In [7]:
cat_vars = df_loans_train.select_dtypes(include=['object']).columns

# Count the unique values of each categorical variable
unique_counts = df_loans_train[cat_vars].nunique().sort_values(ascending=False)

print(unique_counts)

ORGANIZATION_TYPE             58
OCCUPATION_TYPE               18
NAME_INCOME_TYPE               8
NAME_TYPE_SUITE                7
WEEKDAY_APPR_PROCESS_START     7
WALLSMATERIAL_MODE             7
NAME_HOUSING_TYPE              6
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             5
FONDKAPREMONT_MODE             4
CODE_GENDER                    3
HOUSETYPE_MODE                 3
NAME_CONTRACT_TYPE             2
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
EMERGENCYSTATE_MODE            2
dtype: int64


Thanks to the WoE and IV analysis, we know how the categorical variables relate to the target. The ones with a moderate (but not strong) impact are *CODE_GENDER, NAME_INCOME_TYPE, NAME_FAMILY_STATUS, OCCUPATION_TYPE*. Based on this, the following encoding criteria are set:

- Low cardinality (≤ 8 categories): One-Hot Encoding, which is efficient for variables with few categories and adds no unnecessary complexity to the model.

- Medium cardinality (9-50 categories): Target Encoding is not applied, because the variables in this range do not show a relationship with the target strong enough to justify it. Instead:

  - OCCUPATION_TYPE (18 categories) is encoded with Mean Encoding, which captures the average relationship between each category and the target.

- High cardinality (> 50 categories): CatBoost Encoding is applied to ORGANIZATION_TYPE (58 categories); the technique is suited to high-cardinality variables and reduces the risk of overfitting.

This approach aims to balance model complexity against the relevance of the categorical variables, in order to maximize overall model performance.
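
The cardinality thresholds in these criteria can be written down as a small rule. This is only a sketch: `choose_encoder` and its parameter names are hypothetical, and the counts are copied from the `nunique()` output shown earlier.

```python
def choose_encoder(n_categories, low_max=8, mid_max=50):
    """Pick an encoding by cardinality, following the thresholds stated in the text."""
    if n_categories <= low_max:
        return "one-hot"
    if n_categories <= mid_max:
        return "mean"
    return "catboost"

# A few of the unique-value counts printed above
unique_counts = {
    "ORGANIZATION_TYPE": 58,
    "OCCUPATION_TYPE": 18,
    "NAME_INCOME_TYPE": 8,
    "CODE_GENDER": 3,
}

plan = {column: choose_encoder(n) for column, n in unique_counts.items()}
```

In this notebook only one variable falls into each of the medium and high buckets, so the rule mainly documents the decision rather than automating it.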



#### Separating X and y

In [8]:
y_train = df_loans_train['TARGET']
X_train = df_loans_train.drop('TARGET', axis=1)
y_test = df_loans_test['TARGET']
X_test = df_loans_test.drop('TARGET', axis=1)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((246008, 121), (246008,), (61503, 121), (61503,))

#### ONE-HOT ENCODING

In [None]:
list_columns_cat = list(df_loans_train.select_dtypes(include=["object", "category"]).columns)
exclude_vars = ['OCCUPATION_TYPE', 'ORGANIZATION_TYPE']  # Exclude these columns (encoded separately)
list_columns_ohe = [col for col in list_columns_cat if col not in exclude_vars]

# Create the one-hot encoder and fit it on the training set only
ohe = ce.OneHotEncoder(cols=list_columns_ohe, use_cat_names=True)
ohe.fit(X_train, y_train)

# Transform X_train and X_test
X_train_t = ohe.transform(X_train)
X_test_t = ohe.transform(X_test)

# Check the resulting shapes
print(X_train_t.shape, X_test_t.shape)

(246008, 175) (61503, 175)


#### MEAN ENCODING

As noted above, **OCCUPATION_TYPE** shows a moderate relationship with the target variable (**TARGET**). That relationship is not strong enough to warrant **Target Encoding**, so the variable is encoded with **Mean Encoding** instead.

In [None]:
# Mean Encoding for OCCUPATION_TYPE
target_column = 'OCCUPATION_TYPE'  # Column to encode (a feature, not the model target)

# Compute the mean of 'TARGET' for each category of 'OCCUPATION_TYPE'
mean_encoding_map = X_train_t.copy()
mean_encoding_map['TARGET'] = y_train  # Attach TARGET to the DataFrame (aligned on the index)
mean_encoding_map = mean_encoding_map.groupby(target_column)['TARGET'].mean().to_dict()

# Apply Mean Encoding to X_train and X_test
X_train_me = X_train_t.copy()
X_test_me = X_test_t.copy()

# Missing occupations stay NaN in X_train; in X_test, unmapped (missing or unseen)
# values are filled with the global mean of TARGET from the training set
X_train_me[target_column] = X_train_t[target_column].map(mean_encoding_map)
X_test_me[target_column] = X_test_t[target_column].map(mean_encoding_map).fillna(y_train.mean())

# Check the resulting shapes
print(X_train_me.shape, X_test_me.shape)

(246008, 175) (61503, 175)
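
Raw per-category means can overfit when a category has few rows. One common mitigation is to blend each category mean with the global mean. The sketch below illustrates that idea in plain Python; `smoothed_mean_encoding`, its `weight` parameter, and the toy data are assumptions made for illustration, and the formula is simple additive smoothing rather than the exact formula of any particular library.

```python
from collections import defaultdict

def smoothed_mean_encoding(categories, target, weight=10.0):
    """Blend each category mean with the global mean.

    encoded = (sum_of_target_in_category + weight * global_mean) / (n + weight)
    A larger weight pulls small categories harder toward the global mean.
    """
    global_mean = sum(target) / len(target)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, target):
        sums[c] += t
        counts[c] += 1
    mapping = {
        c: (sums[c] + weight * global_mean) / (counts[c] + weight)
        for c in sums
    }
    return mapping, global_mean

# Toy data, invented for illustration
occupation = ["Laborers", "Laborers", "Laborers", "Drivers"]
target = [1, 0, 0, 1]

mapping, global_mean = smoothed_mean_encoding(occupation, target, weight=2.0)

# Categories never seen in training fall back to the global mean
encoded_unseen = mapping.get("Cooks", global_mean)
```

With a single observation, "Drivers" would get a raw mean of 1.0; the smoothed value is pulled toward the global mean of 0.5 instead.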


#### CATBOOST ENCODING

In [None]:
# CatBoost Encoding for ORGANIZATION_TYPE
target_column = 'ORGANIZATION_TYPE'  # Column to encode (a feature, not the model target)

# Create the CatBoost encoder and fit it on the training set
catboost_enc = ce.CatBoostEncoder(cols=[target_column])
catboost_enc.fit(X_train_me[target_column], y_train)

# Transform X_train and X_test
X_train_mec = X_train_me.copy()
X_test_mec = X_test_me.copy()

X_train_mec[target_column] = catboost_enc.transform(X_train_me[target_column])
X_test_mec[target_column] = catboost_enc.transform(X_test_me[target_column])

# Check the resulting shapes
print(X_train_mec.shape, X_test_mec.shape)

(246008, 175) (61503, 175)
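
The order sensitivity of CatBoost-style encoding comes from how training rows are encoded: each row is encoded using only the target values of the rows that precede it, plus a prior. The following is a simplified sketch of that "ordered target statistic" in plain Python. The function name, the `prior_weight` parameter, and the toy data are hypothetical, and the `category_encoders` implementation may differ in detail.

```python
def ordered_target_statistic(categories, target, prior_weight=1.0):
    """Encode each row from the rows that precede it (same category) plus a prior."""
    prior = sum(target) / len(target)  # global mean of the target
    running_sum, running_count = {}, {}
    encoded = []
    for c, t in zip(categories, target):
        s = running_sum.get(c, 0.0)
        n = running_count.get(c, 0)
        # The row's own target value is not used for its own encoding
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        running_sum[c] = s + t
        running_count[c] = n + 1
    return encoded

# Toy data, invented for illustration
organization = ["Bank", "School", "Bank", "Bank"]
y = [1, 0, 0, 1]

encoded = ordered_target_statistic(organization, y)
```

Because each value depends on the preceding rows, reordering the rows changes the encoding, which is why the row order should be random when the data have no temporal structure.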


After encoding the categorical variables, the training set has 175 columns and 246008 rows, and the test set has 175 columns and 61503 rows.

#### SCALING THE ENCODED VARIABLES

One-hot encoding turns each categorical variable into a set of binary variables whose values are exclusively 0 and 1. This removes any implicit notion of order or hierarchy that an arbitrary numeric code would introduce, so the categories do not distort the model. It also makes the data compatible with machine learning algorithms, which can then use the information in the categories to improve predictions.

In [None]:
# Check that every variable is now encoded as a numeric type
X_train_mec.dtypes.to_dict()


{'SK_ID_CURR': dtype('int64'),
 'NAME_CONTRACT_TYPE_Cash loans': dtype('int64'),
 'NAME_CONTRACT_TYPE_Revolving loans': dtype('int64'),
 'CODE_GENDER_F': dtype('int64'),
 'CODE_GENDER_M': dtype('int64'),
 'CODE_GENDER_XNA': dtype('int64'),
 'FLAG_OWN_CAR_N': dtype('int64'),
 'FLAG_OWN_CAR_Y': dtype('int64'),
 'FLAG_OWN_REALTY_N': dtype('int64'),
 'FLAG_OWN_REALTY_Y': dtype('int64'),
 'CNT_CHILDREN': dtype('int64'),
 'AMT_INCOME_TOTAL': dtype('float64'),
 'AMT_CREDIT': dtype('float64'),
 'AMT_ANNUITY': dtype('float64'),
 'AMT_GOODS_PRICE': dtype('float64'),
 'NAME_TYPE_SUITE_Unaccompanied': dtype('int64'),
 'NAME_TYPE_SUITE_Family': dtype('int64'),
 'NAME_TYPE_SUITE_Spouse, partner': dtype('int64'),
 'NAME_TYPE_SUITE_Children': dtype('int64'),
 'NAME_TYPE_SUITE_Other_A': dtype('int64'),
 'NAME_TYPE_SUITE_Other_B': dtype('int64'),
 'NAME_TYPE_SUITE_Group of people': dtype('int64'),
 'NAME_TYPE_SUITE_nan': dtype('int64'),
 'NAME_INCOME_TYPE_Working': dtype('int64'),
 'NAME_INCOME_TYPE_Pen

In [None]:
# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train_mec)
X_train_scaled = pd.DataFrame(scaler.transform(X_train_mec), columns=X_train_mec.columns, index=X_train_mec.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_mec), columns=X_test_mec.columns, index=X_test_mec.index)
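
For reference, standardization subtracts the training mean and divides by the training standard deviation, and the same two numbers are then reused for the test set. The sketch below shows this for a single column in plain Python. The helper names are hypothetical, the training values are the first five `AMT_INCOME_TOTAL` values shown earlier, and the test value is invented. It uses the population standard deviation, as `StandardScaler` does, and does not handle missing values.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant columns, which would otherwise divide by zero
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]

train_income = [85500.0, 180000.0, 180000.0, 81000.0, 126000.0]
test_income = [130500.0]  # invented value for illustration

mean, std = fit_standardizer(train_income)
train_scaled = standardize(train_income, mean, std)
test_scaled = standardize(test_income, mean, std)
```

Learning the statistics on the training set alone keeps information from the test set out of the preprocessing step.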

In [None]:
X_train_scaled.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE_Cash loans,NAME_CONTRACT_TYPE_Revolving loans,CODE_GENDER_F,CODE_GENDER_M,CODE_GENDER_XNA,FLAG_OWN_CAR_N,FLAG_OWN_CAR_Y,FLAG_OWN_REALTY_N,FLAG_OWN_REALTY_Y,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE_Unaccompanied,NAME_TYPE_SUITE_Family,"NAME_TYPE_SUITE_Spouse, partner",NAME_TYPE_SUITE_Children,NAME_TYPE_SUITE_Other_A,NAME_TYPE_SUITE_Other_B,NAME_TYPE_SUITE_Group of people,NAME_TYPE_SUITE_nan,NAME_INCOME_TYPE_Working,NAME_INCOME_TYPE_Pensioner,NAME_INCOME_TYPE_Commercial associate,NAME_INCOME_TYPE_State servant,NAME_INCOME_TYPE_Maternity leave,NAME_INCOME_TYPE_Businessman,NAME_INCOME_TYPE_Unemployed,NAME_INCOME_TYPE_Student,NAME_EDUCATION_TYPE_Secondary / secondary special,NAME_EDUCATION_TYPE_Lower secondary,NAME_EDUCATION_TYPE_Higher education,NAME_EDUCATION_TYPE_Incomplete higher,NAME_EDUCATION_TYPE_Academic degree,NAME_FAMILY_STATUS_Married,NAME_FAMILY_STATUS_Civil marriage,NAME_FAMILY_STATUS_Widow,NAME_FAMILY_STATUS_Separated,NAME_FAMILY_STATUS_Single / not married,NAME_HOUSING_TYPE_House / apartment,NAME_HOUSING_TYPE_Municipal apartment,NAME_HOUSING_TYPE_With parents,NAME_HOUSING_TYPE_Rented apartment,NAME_HOUSING_TYPE_Office apartment,NAME_HOUSING_TYPE_Co-op apartment,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START_WEDNESDAY,WEEKDAY_APPR_PROCESS_START_SUNDAY,WEEKDAY_APPR_PROCESS_START_THURSDAY,WEEKDAY_APPR_PROCESS_START_MONDAY,WEEKDAY_APPR_PROCESS_START_TUESDAY,WEEKDAY_APPR_PROCESS_START_FRIDAY,WEEKDAY_APPR_PROCESS_START_SATURDAY,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE_reg oper spec account,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_not specified,FONDKAPREMONT_MODE_org spec account,FONDKAPREMONT_MODE_nan,HOUSETYPE_MODE_block of flats,HOUSETYPE_MODE_terraced house,HOUSETYPE_MODE_specific housing,HOUSETYPE_MODE_nan,TOTALAREA_MODE,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Panel,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Wooden,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_nan,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes,EMERGENCYSTATE_MODE_nan,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,-0.786989,0.324103,-0.324103,0.722057,-0.722031,-0.004032,0.716969,-0.716969,1.503079,-1.503079,0.807356,-0.320471,0.654397,-0.120996,0.490983,0.487182,-0.387527,-0.195664,-0.103933,-0.053112,-0.076114,-0.029781,-0.065189,0.967741,-0.46837,-0.550989,-0.275698,-0.004508,-0.006049,-0.008313,-0.00727,0.638683,-0.111726,-0.567242,-0.18595,-0.023345,0.752689,-0.327059,-0.235098,-0.262467,-0.417105,0.356817,-0.19444,-0.225169,-0.127727,-0.093034,-0.060291,-1.356089,0.968781,-0.477982,1.036982,0.375505,,0.002016,0.468486,-0.499666,0.044123,-0.625627,-0.245406,-0.938306,0.930833,1.860366,1.925206,2.219374,-0.236251,-0.444453,-0.44383,-0.460998,-0.442681,-0.351073,-1.246885,-0.123757,-0.231515,-0.206529,3.435634,1.827543,2.13686,-0.674873,-1.852376,-2.287542,-0.666517,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.201525,-0.561708,-0.136934,-0.136317,0.67894,-0.97823,-0.063015,-0.069778,0.995603,,-0.516353,-0.522479,-0.075816,-0.175602,-0.133742,-0.072802,-0.086975,0.982008,-1.036392,-0.087849,1.052419,1.062079,1.908644,1.078517,2.470763,0.308446,-0.005703,0.638715,-0.009017,-0.124483,-0.310334,-0.013823,-0.297606,-0.063112,-0.004508,-0.061999,-0.002851,-0.059021,-0.054704,-0.034591,-0.100151,-0.015356,-0.090785,-0.024201,-0.022816,-0.018809,-0.075788,-0.063294,-0.167352,-0.292797,2.081297,1.66103
1,0.474027,0.324103,-0.324103,-1.384932,1.384982,-0.004032,-1.394761,1.394761,-0.665301,0.665301,-0.577365,0.041994,1.012907,1.08049,0.977882,0.487182,-0.387527,-0.195664,-0.103933,-0.053112,-0.076114,-0.029781,-0.065189,0.967741,-0.46837,-0.550989,-0.275698,-0.004508,-0.006049,-0.008313,-0.00727,0.638683,-0.111726,-0.567242,-0.18595,-0.023345,0.752689,-0.327059,-0.235098,-0.262467,-0.417105,0.356817,-0.19444,-0.225169,-0.127727,-0.093034,-0.060291,-1.284497,0.495804,-0.475964,0.639578,-1.434343,0.162861,0.002016,0.468486,-0.499666,0.044123,-0.625627,-0.245406,,-0.166762,1.860366,1.925206,2.219374,-0.236251,-0.444453,-0.44383,-0.460998,-0.442681,-0.351073,0.898051,-0.123757,-0.231515,-0.206529,-0.291067,-0.547183,-0.467976,0.643345,-1.096724,-0.204769,0.971847,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.201525,-0.561708,-0.136934,-0.136317,0.67894,-0.97823,-0.063015,-0.069778,0.995603,,-0.516353,-0.522479,-0.075816,-0.175602,-0.133742,-0.072802,-0.086975,0.982008,-1.036392,-0.087849,1.052419,-0.588298,-0.320576,-0.586487,-0.275869,-0.559982,-0.005703,0.638715,-0.009017,-0.124483,-0.310334,-0.013823,-0.297606,-0.063112,-0.004508,-0.061999,-0.002851,-0.059021,-0.054704,-0.034591,-0.100151,-0.015356,-0.090785,-0.024201,-0.022816,-0.018809,-0.075788,-0.063294,-0.167352,-0.292797,-0.318824,-0.48101
2,-0.833941,0.324103,-0.324103,-1.384932,1.384982,-0.004032,0.716969,-0.716969,-0.665301,0.665301,-0.577365,0.041994,-0.894437,-0.913616,-0.94537,0.487182,-0.387527,-0.195664,-0.103933,-0.053112,-0.076114,-0.029781,-0.065189,0.967741,-0.46837,-0.550989,-0.275698,-0.004508,-0.006049,-0.008313,-0.00727,0.638683,-0.111726,-0.567242,-0.18595,-0.023345,0.752689,-0.327059,-0.235098,-0.262467,-0.417105,0.356817,-0.19444,-0.225169,-0.127727,-0.093034,-0.060291,0.759424,-0.103667,-0.512889,-0.69211,1.957549,,0.002016,0.468486,-0.499666,0.044123,-0.625627,-0.245406,,-0.166762,-0.103735,-0.063215,-0.450578,4.232779,-0.444453,-0.44383,-0.460998,-0.442681,-0.351073,-0.021207,-0.123757,-0.231515,-0.206529,-0.291067,-0.547183,-0.467976,-0.411544,,1.165109,0.566814,-0.968547,-0.867741,-0.157619,-1.654945,-0.491591,-0.58672,-0.804502,-1.275505,-1.182014,-0.569863,-0.978716,-0.882864,-0.182891,-0.407306,-0.940159,-0.831589,-0.134671,-1.613032,-0.474012,-0.563585,-0.753006,-1.255572,-1.159099,-0.542238,-0.96554,-0.856733,-0.172943,-0.3844,-0.964023,-0.865236,-0.156136,-1.650583,-0.490253,-0.580792,-0.797052,-1.269096,-1.176061,-0.567781,-0.978504,-0.878571,-0.180952,-0.40194,4.962173,-0.561708,-0.136934,-0.136317,-1.472885,1.022255,-0.063015,-0.069778,-1.004416,-0.84228,1.936661,-0.522479,-0.075816,-0.175602,-0.133742,-0.072802,-0.086975,-1.018322,0.964886,-0.087849,-0.950192,-0.588298,-0.320576,-0.586487,-0.275869,-0.7402,-0.005703,0.638715,-0.009017,-0.124483,-0.310334,-0.013823,-0.297606,-0.063112,-0.004508,-0.061999,-0.002851,-0.059021,-0.054704,-0.034591,-0.100151,-0.015356,-0.090785,-0.024201,-0.022816,-0.018809,-0.075788,-0.063294,-0.167352,4.096706,-0.318824,1.12552
3,-1.475747,-3.085434,3.085434,0.722057,-0.722031,-0.004032,0.716969,-0.716969,-0.665301,0.665301,0.807356,-0.337731,-1.040702,-1.247155,-0.969714,0.487182,-0.387527,-0.195664,-0.103933,-0.053112,-0.076114,-0.029781,-0.065189,0.967741,-0.46837,-0.550989,-0.275698,-0.004508,-0.006049,-0.008313,-0.00727,0.638683,-0.111726,-0.567242,-0.18595,-0.023345,0.752689,-0.327059,-0.235098,-0.262467,-0.417105,0.356817,-0.19444,-0.225169,-0.127727,-0.093034,-0.060291,1.08282,1.746994,-0.46186,0.4859,1.973442,,0.002016,0.468486,-0.499666,0.044123,-0.625627,-0.245406,0.786954,0.930833,-0.103735,-0.063215,-0.450578,-0.236251,2.249955,-0.44383,-0.460998,-0.442681,-0.351073,0.591632,-0.123757,-0.231515,-0.206529,-0.291067,-0.547183,-0.467976,4.344386,-1.35237,0.299502,,-0.950123,-1.072813,-0.721485,,,-0.58672,-1.148415,-1.275505,,,,-0.895505,,-0.407306,-0.92075,-1.039464,-0.651449,,,-0.563585,-1.093862,-1.255572,,,,-0.869234,,-0.3844,-0.944825,-1.071146,-0.713661,,,-0.580792,-1.139927,-1.269096,,,,-0.891024,,-0.40194,-0.201525,-0.561708,-0.136934,-0.136317,0.67894,1.022255,-0.063015,-0.069778,-1.004416,-0.889494,1.936661,-0.522479,-0.075816,-0.175602,-0.133742,-0.072802,-0.086975,-1.018322,0.964886,-0.087849,-0.950192,-0.588298,-0.320576,-0.586487,-0.275869,-0.060454,-0.005703,-1.565644,-0.009017,-0.124483,-0.310334,-0.013823,-0.297606,-0.063112,-0.004508,-0.061999,-0.002851,-0.059021,-0.054704,-0.034591,-0.100151,-0.015356,-0.090785,-0.024201,-0.022816,-0.018809,,,,,,
4,0.16865,0.324103,-0.324103,0.722057,-0.722031,-0.004032,0.716969,-0.716969,-0.665301,0.665301,-0.577365,-0.165129,0.188603,-0.358131,0.369258,0.487182,-0.387527,-0.195664,-0.103933,-0.053112,-0.076114,-0.029781,-0.065189,-1.033335,2.135064,-0.550989,-0.275698,-0.004508,-0.006049,-0.008313,-0.00727,-1.565722,8.950444,-0.567242,-0.18595,-0.023345,0.752689,-0.327059,-0.235098,-0.262467,-0.417105,-2.802553,5.142968,-0.225169,-0.127727,-0.093034,-0.060291,-0.810962,-1.567512,2.134588,-1.78689,-0.634381,,0.002016,-2.134535,-0.499666,0.044123,1.598398,-0.245406,,-0.166762,-0.103735,-0.063215,-0.450578,-0.236251,-0.444453,2.253114,-0.460998,-0.442681,-0.351073,0.285212,-0.123757,-0.231515,-0.206529,-0.291067,-0.547183,-0.467976,-1.363261,,0.675006,-1.231263,-0.930776,-1.072813,-0.024159,-0.696431,-0.565482,-0.58672,-0.804502,-1.275505,-0.923905,-0.54648,-0.943133,-0.837715,-0.182891,-0.407306,-0.90134,-1.039464,-0.011994,-0.665257,-0.549436,-0.563585,-0.753006,-1.255572,-0.900821,-0.519063,-0.927833,-0.810301,-0.172943,-0.3844,-0.925627,-1.071146,-0.024178,-0.694039,-0.563985,-0.580792,-0.797052,-1.269096,-0.91894,-0.545925,-0.942239,-0.834097,-0.180952,-0.40194,-0.201525,1.780285,-0.136934,-0.136317,-1.472885,1.022255,-0.063015,-0.069778,-1.004416,-0.844132,-0.516353,1.913953,-0.075816,-0.175602,-0.133742,-0.072802,-0.086975,-1.018322,0.964886,-0.087849,-0.950192,-0.588298,-0.320576,-0.586487,-0.275869,-0.157215,-0.005703,0.638715,-0.009017,-0.124483,-0.310334,-0.013823,-0.297606,-0.063112,-0.004508,-0.061999,-0.002851,-0.059021,-0.054704,-0.034591,-0.100151,-0.015356,-0.090785,-0.024201,-0.022816,-0.018809,-0.075788,-0.063294,-0.167352,-0.292797,2.081297,2.196541


### FINAL CONCLUSIONS OF THE EDA

The exploratory data analysis (EDA) identified key patterns, significant relationships, and possible inconsistencies in the data, providing a solid basis for data preparation and modeling. The process was essential for understanding the structure of the variables and their impact on the target variable, and it will guide the next stages of the analysis.

The following conclusions were drawn:

- Initial exploration and analysis (EDA): We began with an exploratory analysis to understand the size and characteristics of our dataset, paying particular attention to the target variable, TARGET, and its relationships with the other variables.

- Analysis of variable types: We analyzed the variables by type (categorical, numeric, etc.) and evaluated how they relate to the target variable. This helped us determine which encoding techniques suited each type of variable.

- Correlation analysis: We studied the correlations among the variables to identify possible redundancies and important linear relationships that could influence feature selection.

- Train/test split: We performed a stratified split of the data into training and test sets, so that the distribution of the target variable was preserved in both partitions.

- Outlier treatment: We detected and treated outliers in the numeric variables so that they would not harm the model.

- Missing-value imputation: We handled null values in the dataset with suitable imputation techniques, so that the variables would be complete for model training.

- WoE and IV analysis: We applied Weight of Evidence (WoE) and Information Value (IV) analysis to the categorical variables to measure their predictive power and determine which variables would be most useful for modeling.

- Encoding of categorical variables: We used One-Hot Encoding, Mean Encoding, and CatBoost Encoding to transform the categorical variables into numeric values suitable for the models.

- Variable scaling: We scaled the numeric variables so that all features have a similar range and are processed properly by the machine learning models.

With these steps complete, our data are ready for the next phases of the project: feature selection (feature processing), model building and evaluation, deployment of the final model, and the explanation and conclusions of the work carried out.