## Data preprocessing

Apply operations to the data in order to improve the models trained on it.

Preprocessors in `sklearn.preprocessing` (the imputers live in `sklearn.impute`):

* Data scaling
    * StandardScaler
    * MinMaxScaler
    * RobustScaler
* Transforming the data distribution (to reduce skewness, similar to applying a square root or a logarithm)
    * QuantileTransformer
    * PowerTransformer
* Encoders that turn categorical values into numbers:
    * OneHotEncoder (equivalent to `pd.get_dummies`). Typically used on the input X.
    * LabelEncoder (equivalent to a `.map()` with a dictionary in pandas). Typically used on the output y.
* Discretization:
    * KBinsDiscretizer
    * Binarizer
* Imputers:
    * SimpleImputer: fills gaps with the mean, median, most frequent value, or a constant (the classic strategies already seen with pandas).
    * KNNImputer: fills each gap from the values of the most similar rows (nearest neighbors) that do have that value.
    * IterativeImputer: predicts each missing value with an estimator of your choice, fitted on the other columns (more sophisticated).
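
As a rough illustration of the first two imputers, the sketch below fills a single gap in a tiny made-up array. The array `X_gaps` and its values are assumptions chosen so the result is easy to check by hand; they are not part of the diamonds data used later.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data: the second row is missing its second value
X_gaps = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [8.0, 80.0],
])

# SimpleImputer: replace the gap with the column mean, (10 + 30 + 80) / 3
mean_filled = SimpleImputer(strategy='mean').fit_transform(X_gaps)

# KNNImputer: average the value over the 2 rows closest in the other column
# (rows with 1.0 and 3.0), i.e. (10 + 30) / 2
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_gaps)

print('mean strategy:', mean_filled[1, 1])
print('2 nearest neighbors:', knn_filled[1, 1])
```

The two strategies disagree here because the row with 8.0 pulls the column mean up, while the neighbor-based estimate only looks at rows similar to the incomplete one.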

All these classes share the same interface: they have `fit` and `transform` methods, so they are used in the same way. What changes is the operation applied to the data: scaling, transforming, encoding, imputing, discretizing, binarizing, normalizing, standardizing...

scikit-learn pipelines simplify the use of preprocessors when several of them need to be applied and combined.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

In [2]:
df = sns.load_dataset('diamonds').dropna().sample(5000, random_state=42).reset_index(drop=True)
#df = sns.load_dataset('diamonds').dropna().reset_index(drop=True)
df.head(3)


Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.24,Ideal,G,VVS1,62.1,56.0,559,3.97,4.0,2.47
1,0.58,Very Good,F,VVS2,60.0,57.0,2201,5.44,5.42,3.26
2,0.4,Ideal,E,VVS2,62.1,55.0,1238,4.76,4.74,2.95


In [3]:
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error, mean_absolute_percentage_error

# Split the data and define calculate_metrics: fit the models before any preprocessing, then check whether each preprocessing step improves them
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

df_resultados = pd.DataFrame(columns=['Modelo', 'Preprocesado', 'R2', 'MAE', 'RMSE', 'MAPE'])

def calculate_metrics(preprocessor_name, X_train, X_test, y_train, y_test):
    """Fit several regressors, append their test metrics to df_resultados and return it sorted by R2."""
    models = {
        'LinearRegression': LinearRegression(),
        'KNN': KNeighborsRegressor(),
        'SVR': SVR(),
        'DecisionTree': DecisionTreeRegressor(random_state=42),
        'RandomForest': RandomForestRegressor(random_state=42),
    }
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Append one row per (model, preprocessing) pair to the shared results table
        df_resultados.loc[len(df_resultados)] = [
            model_name,
            preprocessor_name,
            r2_score(y_test, y_pred),
            mean_absolute_error(y_test, y_pred),
            root_mean_squared_error(y_test, y_pred),
            mean_absolute_percentage_error(y_test, y_pred),
        ]
    return df_resultados.sort_values('R2', ascending=False)
    

In [4]:
calculate_metrics('Sin preprocesado', X_train, X_test, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
1,KNN,Sin preprocesado,0.849006,899.1862,1604.254748,0.22642
3,DecisionTree,Sin preprocesado,0.766569,1122.743,1994.676075,0.2744
2,SVR,Sin preprocesado,-0.179999,2967.146616,4484.706331,1.10095


## StandardScaler

Transforms the data so that each feature has mean 0 and standard deviation 1.  

$$
X_{\text{scaled}} = \frac{X - \mu}{\sigma}
$$

where $\mu$ is the mean and $\sigma$ the standard deviation (both computed **only** on the training set).
  
**When to use it**:  
- When the data has no extreme outliers (or is reasonably close to a normal distribution).  
- It is the most common scaler, especially for algorithms that assume normality or are sensitive to the scale of the features (regularized linear regression, neural networks, SVM, etc.). Plain least-squares `LinearRegression` gives the same predictions with or without scaling.
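
The formula can be checked by hand. The sketch below uses a hypothetical one-column array (`train`, `test` and the `_demo` names are assumptions for illustration) and compares the manual computation with `StandardScaler`, taking $\mu$ and $\sigma$ from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[6.0]])

# Statistics come from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

manual_train = (train - mu) / sigma
manual_test = (test - mu) / sigma  # the test row reuses the training mu and sigma

scaler_demo = StandardScaler().fit(train)
sk_train = scaler_demo.transform(train)
sk_test = scaler_demo.transform(test)

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
```

Note that the scaled test value falls outside the range of the scaled training values; that is expected, since the test set is never used to compute the statistics.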

In [5]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train) # returns a NumPy array
X_test_scaled = scaler.transform(X_test) # returns a NumPy array

# Optional:
# keep the column names:
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
X_train_scaled.head()


Unnamed: 0,carat,depth,table,x,y,z
0,-1.028331,0.973355,-0.630227,-1.277261,-1.25903,-1.173678
1,0.439665,2.353854,0.692667,0.507249,0.500558,0.810866
2,0.104123,-0.338119,0.251703,0.347442,0.295124,0.278778
3,-1.028331,-1.580567,2.456526,-1.259505,-1.241166,-1.38939
4,0.439665,1.525555,-1.512156,0.516128,0.473762,0.69582


In [6]:
calculate_metrics('StandardScaler', X_train_scaled, X_test_scaled, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851
1,KNN,Sin preprocesado,0.849006,899.1862,1604.254748,0.22642
8,DecisionTree,StandardScaler,0.766985,1119.155,1992.897756,0.273665
3,DecisionTree,Sin preprocesado,0.766569,1122.743,1994.676075,0.2744
7,SVR,StandardScaler,0.033553,2363.397416,4058.655409,0.675242
2,SVR,Sin preprocesado,-0.179999,2967.146616,4484.706331,1.10095


## MinMaxScaler

Scales and shifts each feature individually into a given range, $[0,1]$ by default.  

$$
X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$


**When to use it**:

- When the data must be **bounded between 0 and 1** or within another range (for example $[-1, 1]$), since the range can be customized with `feature_range=(min, max)`. It is a common choice for distance-based algorithms such as KNN.
- However, **it is very sensitive to outliers**: a single very large value compresses the rest of the data into a narrow band.
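
Both points can be seen on a small made-up column (the arrays below are assumptions for illustration): a custom `feature_range`, and how one outlier squeezes the remaining values towards 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column, without and with a single outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
with_outlier = np.vstack([values, [[100.0]]])

scaled_clean = MinMaxScaler().fit_transform(values)
scaled_outlier = MinMaxScaler().fit_transform(with_outlier)
scaled_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(values)

print('default range:', scaled_clean.ravel())
print('range (-1, 1):', scaled_sym.ravel())
# With the outlier, the five original points all land below 0.05
print('with outlier :', scaled_outlier.ravel().round(3))
```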

In [7]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit and transform in a single call (more compact than the previous example)
X_test_scaled = scaler.transform(X_test)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
X_train_scaled.head()

Unnamed: 0,carat,depth,table,x,y,z
0,0.026247,0.5,0.291667,0.066773,0.077901,0.265306
1,0.209974,0.625,0.416667,0.386328,0.391097,0.546939
2,0.167979,0.38125,0.375,0.357711,0.354531,0.471429
3,0.026247,0.26875,0.583333,0.069952,0.081081,0.234694
4,0.209974,0.55,0.208333,0.387917,0.386328,0.530612


In [8]:
calculate_metrics('MinMaxScaler', X_train_scaled, X_test_scaled, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
14,RandomForest,MinMaxScaler,0.868646,835.654962,1496.286474,0.210257
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851
11,KNN,MinMaxScaler,0.857112,868.6464,1560.600343,0.220208
1,KNN,Sin preprocesado,0.849006,899.1862,1604.254748,0.22642
8,DecisionTree,StandardScaler,0.766985,1119.155,1992.897756,0.273665


## RobustScaler

Scales the data using the **median** and the **IQR** (interquartile range).  

$$
X_{\text{scaled}} = \frac{X - \text{median}(X)}{\text{IQR}}
$$
where $\text{IQR} = Q_3 - Q_1$.

**When to use it**:

When the data contains **outliers** that would distort the scaling.
Because it uses the median and the IQR instead of the mean and standard deviation, it is much **less sensitive to outliers**.
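
A small sketch of that behaviour, on a hypothetical column with one outlier (the array `col` is an assumption): the manual median/IQR formula matches `RobustScaler`, and the non-outlier points keep a much wider spread than under `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical column with one extreme value
col = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

# Manual version of the formula: (X - median) / (Q3 - Q1)
median = np.median(col, axis=0)
q1, q3 = np.percentile(col, [25, 75], axis=0)
manual = (col - median) / (q3 - q1)

robust = RobustScaler().fit_transform(col)
standard = StandardScaler().fit_transform(col)

# Spread of the five regular points under each scaler
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()
print('robust spread:', robust_spread.round(3), 'standard spread:', standard_spread.round(3))
```

The outlier itself is not removed or clipped: it is still far from the rest after scaling. What changes is that it no longer dictates the scale of the other points.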

In [9]:
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit and transform in a single call
X_test_scaled = scaler.transform(X_test)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
X_train_scaled.head()

Unnamed: 0,carat,depth,table,x,y,z
0,-0.625,0.866667,-0.333333,-0.768176,-0.767956,-0.707965
1,0.46875,2.2,0.666667,0.334705,0.320442,0.513274
2,0.21875,-0.4,0.333333,0.23594,0.19337,0.185841
3,-0.625,-1.6,2.0,-0.757202,-0.756906,-0.840708
4,0.46875,1.4,-1.0,0.340192,0.303867,0.442478


In [10]:
calculate_metrics('RobustScaler', X_train_scaled, X_test_scaled, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
14,RandomForest,MinMaxScaler,0.868646,835.654962,1496.286474,0.210257
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013
19,RandomForest,RobustScaler,0.868256,835.22396,1498.505051,0.210161
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851
16,KNN,RobustScaler,0.8583,876.919,1554.09643,0.221305


## QuantileTransformer

It is a **quantile**-based transformation:  

1. Sorts the values of each column and assigns each one its quantile position (e.g. percentiles).  

2. Maps those quantiles to either a **uniform** distribution on $[0,1]$ or a **normal** (Gaussian) distribution if `output_distribution='normal'` is specified.

- By default `output_distribution='uniform'`, so each feature ends up approximately **uniformly distributed** on $[0, 1]$.  

- With `output_distribution='normal'`, the data is mapped so that it resembles a **normal (Gaussian) distribution** with mean 0 and standard deviation 1.

When to use it?

- When you want to flatten the distribution of a variable that is heavily skewed or has long tails. The quantile method "stretches" and "compresses" the distribution so that each quantile maps to the matching quantile of the target distribution (uniform or normal).

- It is useful when you want data that is:
  - Evenly spread over $[0, 1]$ (uniform case).
  - Or approximately Gaussian without resorting to a parametric transformation (e.g. a logarithm).

It can be stronger than a square root or a logarithm because it does not just reduce skew, it redistributes the values entirely. That can be too aggressive if the data is already symmetric.
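
To compare the two target distributions, the sketch below uses synthetic right-skewed data (a log-normal sample, which is an assumption made for illustration and unrelated to the diamonds columns) and measures the skew before and after each mapping.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic, strongly right-skewed feature
rng = np.random.default_rng(0)
skewed = pd.DataFrame({'v': rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

qt_uniform = QuantileTransformer(n_quantiles=100, output_distribution='uniform', random_state=0)
qt_normal = QuantileTransformer(n_quantiles=100, output_distribution='normal', random_state=0)

as_uniform = qt_uniform.fit_transform(skewed)
as_normal = qt_normal.fit_transform(skewed)

skew_before = skewed['v'].skew()
skew_uniform = pd.Series(as_uniform.ravel()).skew()
skew_normal = pd.Series(as_normal.ravel()).skew()

print('skew before:', round(skew_before, 3))
print('skew after (uniform):', round(skew_uniform, 3))
print('skew after (normal):', round(skew_normal, 3))
```

`n_quantiles=100` is set explicitly here only because the sample is small; it must not exceed the number of rows.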

In [11]:
X_train.skew() # the closer to 0, the more symmetric

carat    1.193090
depth    0.159501
table    0.681555
x        0.434562
y        0.434972
z        0.407144
dtype: float64

In [12]:
from sklearn.preprocessing import QuantileTransformer

transformer = QuantileTransformer()
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_train_transformed = pd.DataFrame(X_train_transformed, columns=X.columns)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=X.columns)

X_train_transformed.head(2)

Unnamed: 0,carat,depth,table,x,y,z
0,0.05956,0.883383,0.285285,0.041041,0.057558,0.131632
1,0.670671,0.983984,0.773273,0.657157,0.657157,0.787788


In [13]:
X_train_transformed.skew() # the skewness has been reduced

carat    0.001497
depth   -0.000490
table    0.007868
x       -0.000119
y        0.000009
z        0.000117
dtype: float64

In [14]:
calculate_metrics('QuantileTransformer', X_train_transformed, X_test_transformed, y_train, y_test)


Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
14,RandomForest,MinMaxScaler,0.868646,835.654962,1496.286474,0.210257
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013
24,RandomForest,QuantileTransformer,0.868345,835.537068,1498.000096,0.210055
19,RandomForest,RobustScaler,0.868256,835.22396,1498.505051,0.210161
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851


## PowerTransformer

PowerTransformer applies **power** transformations to bring the data closer to a normal distribution.

- It supports two methods:

  1. **Box-Cox**: requires all values to be **strictly positive**.  
  2. **Yeo-Johnson**: also handles zero and negative values. It is an extension of Box-Cox that works with both negative and positive values, and it is the default (`method='yeo-johnson'`).

Internally, `PowerTransformer` estimates, for each feature, the power parameter that best stabilizes the variance and reduces the skew of the data, so it is a more flexible and automated option than manually applying `np.sqrt` or `np.log` to a column.

When to use it?

- When the data is heavily skewed and you need to **improve normality**. QuantileTransformer can have a stronger effect. 

- It is often used before **linear models** or algorithms that assume approximately Gaussian distributions, helping to meet the homoscedasticity (constant variance) assumption and improving linearity.  

- If the data contains zeros or negative values, Box-Cox cannot be used, but Yeo-Johnson can.
- By default the output is already standardized to zero mean and unit variance (`standardize=True`). With `standardize=False` you can apply another scaler (for example, `StandardScaler`) after the power transformation if you wish.
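
The difference between the two methods can be sketched on synthetic data (log-normal samples, an assumption made for illustration): Box-Cox fits the strictly positive column, is rejected once the column contains negative values, and Yeo-Johnson accepts both.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
positive = rng.lognormal(size=(500, 1))  # strictly positive, right-skewed
shifted = positive - 1.0                 # same shape, but now with negative values

box_cox = PowerTransformer(method='box-cox').fit(positive)
yeo_johnson = PowerTransformer(method='yeo-johnson').fit(shifted)

# Box-Cox raises a ValueError on data that is not strictly positive
try:
    PowerTransformer(method='box-cox').fit(shifted)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True

# One fitted power parameter per feature
print('Box-Cox lambda:', box_cox.lambdas_)
print('Yeo-Johnson lambda:', yeo_johnson.lambdas_)
print('Box-Cox rejected negative data:', box_cox_failed)

# With the default standardize=True the output has mean 0 and std 1
bc_out = box_cox.transform(positive)
```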

In [15]:
from sklearn.preprocessing import PowerTransformer

transformer = PowerTransformer()
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_train_transformed = pd.DataFrame(X_train_transformed, columns=X.columns)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=X.columns)

X_train_transformed.head(2)

Unnamed: 0,carat,depth,table,x,y,z
0,-1.332196,0.97448,-0.592305,-1.430225,-1.403233,-1.264652
1,0.715544,2.309622,0.764271,0.613714,0.607868,0.866351


In [16]:
print('skew before:\n', X_train.skew())
print('\nskew after:\n', X_train_transformed.skew())

skew before:
 carat    1.193090
depth    0.159501
table    0.681555
x        0.434562
y        0.434972
z        0.407144
dtype: float64

skew after:
 carat    0.127325
depth    0.002243
table   -0.005639
x        0.037489
y        0.037876
z        0.030497
dtype: float64


The transformation is milder than with QuantileTransformer: the remaining skew values are larger than in the previous case.

In [17]:
calculate_metrics('PowerTransformer', X_train_transformed, X_test_transformed, y_train, y_test)


Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
14,RandomForest,MinMaxScaler,0.868646,835.654962,1496.286474,0.210257
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013
24,RandomForest,QuantileTransformer,0.868345,835.537068,1498.000096,0.210055
19,RandomForest,RobustScaler,0.868256,835.22396,1498.505051,0.210161
29,RandomForest,PowerTransformer,0.86802,834.399099,1499.846193,0.209508
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899


In [18]:
transformer = PowerTransformer(standardize=False)
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
calculate_metrics('PowerTransformer standard False', X_train_transformed, X_test_transformed, y_train, y_test)


Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
14,RandomForest,MinMaxScaler,0.868646,835.654962,1496.286474,0.210257
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013
24,RandomForest,QuantileTransformer,0.868345,835.537068,1498.000096,0.210055
19,RandomForest,RobustScaler,0.868256,835.22396,1498.505051,0.210161
29,RandomForest,PowerTransformer,0.86802,834.399099,1499.846193,0.209508
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188
34,RandomForest,PowerTransformer standard False,0.863769,842.218551,1523.808319,0.21188
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899


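In this case disabling standardization did not help: RandomForest reaches an R2 of 0.863769, compared with 0.86802 for the default PowerTransformer.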

## OneHotEncoder

Equivalent to pandas `pd.get_dummies`, but provided by scikit-learn.

Creates **binary (dummy) columns** for each category of a nominal **feature**.

**When to use it**:  

1. For **nominal categorical features** (no order), such as color, city, type of pet, etc.  
2. It is normally applied to the **input variables** (X).  
3. Useful for most models, which need numeric inputs and cannot handle categories directly.
4. It can be used inside scikit-learn pipelines.

The `sparse_output` parameter:

* `sparse_output=True` (the default): returns the result as a SciPy sparse matrix (CSR format) instead of a `numpy.ndarray`. 
    * Advantage: uses less memory when there are many categories, because most entries are zeros.
    * Disadvantage: some pandas and scikit-learn functions expect a dense array and may not accept it.
* `sparse_output=False`: returns the result as a dense array (`numpy.ndarray`).
    * Advantage: easy to turn into a pandas DataFrame without extra conversions.
    * Disadvantage: can use more memory when there are many categories and therefore many zeros.

* Dense vs. sparse matrices:
    * Dense matrix: every value, zeros included, is stored in memory.
    * Sparse matrix: only the non-zero values are stored, together with their coordinates (row and column indices). It is more memory-efficient but harder to manipulate directly, and some operations require converting it to dense format.

In [19]:
X = df[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut', 'color', 'clarity']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# get_dummies encodes the categorical columns; numeric columns can be included and are left untouched
pd.get_dummies(X)

Unnamed: 0,carat,depth,table,x,y,z,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,...,color_I,color_J,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.24,62.1,56.0,3.97,4.00,2.47,True,False,False,False,...,False,False,False,True,False,False,False,False,False,False
1,0.58,60.0,57.0,5.44,5.42,3.26,False,False,True,False,...,False,False,False,False,True,False,False,False,False,False
2,0.40,62.1,55.0,4.76,4.74,2.95,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
3,0.43,60.8,57.0,4.92,4.89,2.98,False,True,False,False,...,False,False,False,False,True,False,False,False,False,False
4,1.55,62.3,55.0,7.44,7.37,4.61,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0.31,60.8,57.0,4.40,4.38,2.67,True,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4996,1.06,61.2,55.0,6.57,6.61,4.03,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4997,0.71,61.0,56.0,5.77,5.80,3.53,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4998,0.90,63.3,56.0,6.13,6.10,3.87,False,False,True,False,...,False,True,False,False,False,False,True,False,False,False


In [20]:
X_train.select_dtypes(exclude=['object', 'category']).columns.to_list()


['carat', 'depth', 'table', 'x', 'y', 'z']

In [21]:
X_train.select_dtypes(include=['object', 'category']).columns.to_list()

['cut', 'color', 'clarity']

In [22]:
from sklearn.preprocessing import OneHotEncoder

# get the names of the numeric and categorical columns
# the two lists are used below as column filters
numerical_columns = X_train.select_dtypes(exclude=['object', 'category']).columns.to_list()
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

encoder = OneHotEncoder(sparse_output=False) # sparse_output=False returns a dense array of 0s and 1s
X_train_encoded = encoder.fit_transform(X_train[categorical_columns]) # NumPy array with the encoded columns
X_test_encoded = encoder.transform(X_test[categorical_columns])

#X_train_encoded
# convert to pandas DataFrames and join with the numeric columns to get a result like pd.get_dummies
X_train_final = pd.concat(
    [
        pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True),  # categorical
        X_train[numerical_columns].reset_index(drop=True) # numeric
    ], 
    axis=1
)
X_test_final = pd.concat(
    [
        pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True),  # categorical
        X_test[numerical_columns].reset_index(drop=True) # numeric
    ], 
    axis=1
)
X_test_final
#pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out())


Unnamed: 0,cut_Fair,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_D,color_E,color_F,color_G,color_H,...,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2,carat,depth,table,x,y,z
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.70,60.6,58.0,5.80,5.72,3.49
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.03,61.0,60.0,6.46,6.53,3.96
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.31,62.6,57.0,4.33,4.29,2.70
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.00,62.7,58.0,6.41,6.32,3.99
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.50,61.6,55.0,5.11,5.14,3.16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.43,61.8,56.0,4.87,4.84,3.00
996,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.32,63.1,56.0,4.34,4.38,2.75
997,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.31,60.7,61.0,4.36,4.40,2.66
998,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.37,61.0,56.0,4.70,4.65,2.85


In [23]:
calculate_metrics('OneHotEncoder', X_train_final, X_test_final, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.963541,397.252065,788.302258,0.09711
38,DecisionTree,OneHotEncoder,0.930697,528.805,1086.85135,0.126546
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158
14,RandomForest,MinMaxScaler,0.868646,835.654962,1496.286474,0.210257
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013
24,RandomForest,QuantileTransformer,0.868345,835.537068,1498.000096,0.210055
19,RandomForest,RobustScaler,0.868256,835.22396,1498.505051,0.210161
29,RandomForest,PowerTransformer,0.86802,834.399099,1499.846193,0.209508
4,RandomForest,Sin preprocesado,0.867955,835.518392,1500.214999,0.210188


In [24]:
# using drop='first' in the encoder (drops the first category of each feature)

numerical_columns = X_train.select_dtypes(exclude=['object', 'category']).columns.to_list()
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

encoder = OneHotEncoder(sparse_output=False, drop='first') # sparse_output=False returns a dense array of 0s and 1s
X_train_encoded = encoder.fit_transform(X_train[categorical_columns]) # NumPy array with the encoded columns
X_test_encoded = encoder.transform(X_test[categorical_columns])

#X_train_encoded
# convert to pandas DataFrames and join with the numeric columns to get a result like pd.get_dummies
X_train_final = pd.concat(
    [
        pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True),  # categorical
        X_train[numerical_columns].reset_index(drop=True) # numeric
    ], 
    axis=1
)
X_test_final = pd.concat(
    [
        pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True),  # categorical
        X_test[numerical_columns].reset_index(drop=True) # numeric
    ], 
    axis=1
)


In [25]:
calculate_metrics('OneHotEncoder_drop_first', X_train_final, X_test_final, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.963541,397.252065,788.302258,0.09711
44,RandomForest,OneHotEncoder_drop_first,0.945799,448.282515,961.15868,0.107496
38,DecisionTree,OneHotEncoder,0.930697,528.805,1086.85135,0.126546
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
40,LinearRegression,OneHotEncoder_drop_first,0.914045,791.194142,1210.402235,0.438605
43,DecisionTree,OneHotEncoder_drop_first,0.898461,609.138,1315.555589,0.144406
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158
41,KNN,OneHotEncoder_drop_first,0.879003,784.138,1436.085445,0.204685
14,RandomForest,MinMaxScaler,0.868646,835.654962,1496.286474,0.210257
9,RandomForest,StandardScaler,0.868495,834.839074,1497.149208,0.210013


In [26]:
# Combine OneHotEncoder with MinMaxScaler
# get the names of the numeric and categorical columns
numerical_columns = X_train.select_dtypes(exclude=['object', 'category']).columns.to_list() # include=np.number is an alternative
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

encoder = OneHotEncoder(sparse_output=False) # sparse_output=False returns a dense array of 0s and 1s; drop='first' is worth trying too
X_train_encoded = encoder.fit_transform(X_train[categorical_columns]) # NumPy array with the encoded columns
X_test_encoded = encoder.transform(X_test[categorical_columns])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[numerical_columns])
X_test_scaled = scaler.transform(X_test[numerical_columns])

X_train_final = pd.concat(
    [
        pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True), # categorical
        pd.DataFrame(X_train_scaled, columns=numerical_columns).reset_index(drop=True) # numeric
    ],
    axis=1
)
X_test_final = pd.concat(
    [
        pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True), # categorical
        pd.DataFrame(X_test_scaled, columns=numerical_columns).reset_index(drop=True) # numeric
    ],
    axis=1
)

In [27]:
calculate_metrics('OneHotEncoder+MinMaxScaler', X_train_final, X_test_final, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.963541,397.252065,788.302258,0.09711
49,RandomForest,OneHotEncoder+MinMaxScaler,0.963456,397.241947,789.227699,0.097156
44,RandomForest,OneHotEncoder_drop_first,0.945799,448.282515,961.15868,0.107496
48,DecisionTree,OneHotEncoder+MinMaxScaler,0.935301,519.908,1050.125344,0.126376
38,DecisionTree,OneHotEncoder,0.930697,528.805,1086.85135,0.126546
45,LinearRegression,OneHotEncoder+MinMaxScaler,0.914045,791.194142,1210.402235,0.438605
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
40,LinearRegression,OneHotEncoder_drop_first,0.914045,791.194142,1210.402235,0.438605
43,DecisionTree,OneHotEncoder_drop_first,0.898461,609.138,1315.555589,0.144406
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158


## LabelEncoder

Converts labels (categories) into integers from 0 to n-1.  

The classes are numbered in sorted order, so the categories `["red", "green", "blue"]` become  
- *blue* $\to 0$,  
- *green* $\to 1$,  
- *red* $\to 2$.

* It is normally used for the output variable (y) in a multiclass classification problem.
* Each class is converted to a different integer.
* It is designed for the target, not for input columns. Integer codes in a feature only make sense if the categories have a real order (ordinal case) or if the model does not assume that 2 > 1 > 0, which is uncommon. For ordinal features, `OrdinalEncoder` lets you specify the order, whereas LabelEncoder would impose alphabetical order. For nominal categorical features the usual choice is OneHotEncoder.

Equivalent to doing `df['class'].map({'setosa':0, 'versicolor':1, 'virginica':2})`.
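
The sketch below checks that equivalence on a hypothetical series of iris species names (the `species` series is an assumption for illustration). The dictionary is built from `classes_`, which holds the labels in sorted order, so the codes match the encoder's output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column
species = pd.Series(['setosa', 'virginica', 'versicolor', 'setosa'])

le = LabelEncoder()
encoded = le.fit_transform(species)

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
mapped = species.map(mapping)

print('classes:', list(le.classes_))
print('LabelEncoder:', list(encoded))
print('pandas map  :', list(mapped))

# inverse_transform recovers the original labels
decoded = le.inverse_transform(encoded)
```

The practical advantage over a hand-written dictionary is that the fitted encoder keeps the mapping, so predictions can be decoded later with `inverse_transform`.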

In [28]:
X = df[['carat', 'depth', 'table', 'x', 'y', 'z', 'price']]
y = df['cut'] # categorical --> multiclass classification

In [29]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_encoded

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.20, random_state=42)

print('encoder classes:', encoder.classes_)
print('y_encoded sample:', y_encoded[:10])
print('y_train sample:', y_train[:10])
print('y_test sample:', y_test[:10])


encoder classes: ['Fair' 'Good' 'Ideal' 'Premium' 'Very Good']
y_encoded sample: [2 4 2 3 2 0 2 2 3 2]
y_train sample: [1 0 3 4 1 2 2 2 2 4]
y_test sample: [3 3 2 3 2 2 2 2 2 1]


In [30]:
# inverse_transform recovers the original categories from the encoded values
# it can also be applied to y_pred to get the predicted categories
encoder.inverse_transform(y_encoded)[:10]

array(['Ideal', 'Very Good', 'Ideal', 'Premium', 'Ideal', 'Fair', 'Ideal',
       'Ideal', 'Premium', 'Ideal'], dtype=object)

In [31]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier # often gives the best results in these examples

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.749

In [32]:
print('predictions', y_pred[:10])
print('decoded predictions', encoder.inverse_transform(y_pred)[:10])

predictions [3 3 2 3 2 2 2 2 2 1]
decoded predictions ['Premium' 'Premium' 'Ideal' 'Premium' 'Ideal' 'Ideal' 'Ideal' 'Ideal'
 'Ideal' 'Good']


## KBinsDiscretizer

Similar to pandas `pd.cut` for discretizing numeric columns.

Converts continuous numeric variables into discrete ones by splitting the values into **intervals or "bins"**. Each interval gets a numeric label.

When to use it?

* To turn continuous numeric variables into discrete categories (for example, splitting `carat` into "small", "medium" and "large").  
* When a model may benefit from binned information instead of continuous values.  
* To improve the interpretability of a model.  

Discretization strategies (`strategy` parameter):

- `uniform`: splits the range into intervals of **equal width**.
- `quantile`: creates intervals with (approximately) the **same number of samples**.
- `kmeans`: uses one-dimensional **K-Means** clustering to define the bins.
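
To see how the strategies differ, the sketch below fits each one with two bins on a hypothetical price column that contains one very large value (the `prices` array is an assumption for illustration) and collects the resulting bin edges.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical prices: seven small values and one very large one
prices = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [100.0]])

edges = {}
for strategy in ('uniform', 'quantile', 'kmeans'):
    kb = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy=strategy)
    kb.fit(prices)
    edges[strategy] = kb.bin_edges_[0]  # edges of the first (only) feature
    print(strategy, edges[strategy])
```

`uniform` places the cut at the midpoint of the range, so the large value ends up alone in its bin. `quantile` places it near the median, so each bin holds about half of the rows. `kmeans` places it halfway between the two cluster centres.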

In [33]:
# multiclass classification on price
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = df[['price']] # [[]] keeps it 2D, as the discretizer requires; this numeric column becomes categorical --> multiclass classification

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [34]:
from sklearn.preprocessing import KBinsDiscretizer

# discretize the price into 4 groups (n_bins defaults to 5)
discretizer = KBinsDiscretizer(encode='ordinal', n_bins=4, strategy='kmeans') # encode='onehot-dense' would return a dense one-hot matrix
# strategy='kmeans' often works well, but it is worth comparing it with 'uniform' and 'quantile'
# fit on the training target only, then transform both splits
discretizer.fit(y_train)

y_train_discretized = discretizer.transform(y_train).ravel() # flatten 2D to 1D for scikit-learn fit and predict
y_test_discretized = discretizer.transform(y_test).ravel() # flatten 2D to 1D for scoring

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train_discretized)
y_pred = model.predict(X_test)
accuracy_score(y_test_discretized, y_pred)

#y_train_discretizer

0.853

In [35]:
print('predictions: ', y_pred[:10])
y_train_discretized

predictions:  [0. 1. 0. 1. 0. 0. 0. 1. 1. 1.]


array([0., 1., 0., ..., 1., 2., 0.])

In [36]:
print('discretize.n_bins_', discretizer.n_bins_)
print('discretize.n_features_in_', discretizer.n_features_in_)
print('discretize.features_names_in_:', discretizer.feature_names_in_)
print('discretize.bin_edges', discretizer.bin_edges_)
print('bin min', discretizer.bin_edges_[0][0])
print('bin 1', discretizer.bin_edges_[0][1])
print('bin 2', discretizer.bin_edges_[0][2])
print('bin 3', discretizer.bin_edges_[0][3])
print('bin max', discretizer.bin_edges_[0][4])

discretize.n_bins_ [4]
discretize.n_features_in_ 1
discretize.features_names_in_: ['price']
discretize.bin_edges [array([  336.        ,  2964.95252869,  6925.92917442, 12188.07840494,
        18823.        ])                                               ]
bin min 336.0
bin 1 2964.9525286855264
bin 2 6925.929174424701
bin 3 12188.078404935175
bin max 18823.0


## Binarizer

`Binarizer` converts numeric values into binary values (0 or 1) according to a threshold: values strictly greater than the threshold become 1 and the rest become 0.

It is used to turn a numeric variable into a two-class categorical one, which can help some models.

For example, we can convert the price into a binary variable, cheap (0) or expensive (1), to perform binary classification.

Another example is binarizing a person's age into an adult flag (0 or 1), depending on whether they are 18 or older.
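
The age example could be sketched as follows, assuming integer ages (the `ages` array is made up). Since `Binarizer` maps values strictly greater than the threshold to 1, a threshold of 17 is used here so that 18 already counts as adult.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical integer ages, one column
ages = np.array([[12], [17], [18], [35]])

# threshold=17: ages 18 and over map to 1, ages up to 17 map to 0
binarizer = Binarizer(threshold=17)
is_adult = binarizer.fit_transform(ages).ravel()
print(is_adult)
```

`Binarizer` learns nothing in `fit`; the threshold is fixed by the user, so choosing it (median, business rule, etc.) is the actual modelling decision.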

In [37]:
from sklearn.preprocessing import Binarizer

X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = df[['price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy() # explicit copies so the column assignment below does not warn

# binarize carat: 1 if above the median, else 0; the threshold comes from the training data only
binarizer = Binarizer(threshold=X_train['carat'].median())
X_train['carat'] = binarizer.fit_transform(X_train[['carat']]).ravel()
X_test['carat'] = binarizer.transform(X_test[['carat']]).ravel()
X_train.head(3)

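|      | carat | depth | table | x    | y    | z    |
|------|-------|-------|-------|------|------|------|
| 4227 | 0.0   | 63.2  | 56.0  | 4.27 | 4.3  | 2.71 |
| 4676 | 1.0   | 65.2  | 59.0  | 6.28 | 6.27 | 4.09 |
| 800  | 1.0   | 61.3  | 58.0  | 6.1  | 6.04 | 3.72 |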


## SimpleImputer

Class for imputing null values (NaN) in DataFrames or arrays.

Strategies:

* mean: fills with the column mean (numeric columns).
* median: fills with the column median (numeric columns).
* most_frequent: fills nulls with the most frequent value in the column (works for both numeric and categorical columns).
* constant: fills with a constant value that we define (for example, "missing" for categorical columns or 0 for numeric ones).
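
A minimal sketch of the four strategies on a toy frame follows. The `toy` DataFrame and its values are made up for illustration; the column names merely mirror the diamonds dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
toy = pd.DataFrame({
    "carat": [0.3, 0.5, np.nan, 1.0, 5.0],
    "cut": pd.Series(["Ideal", "Ideal", "Good", np.nan, "Premium"], dtype=object),
})

results = {}
# Numeric strategies: the gap in 'carat' (row 2) gets the column mean or median
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    results[strategy] = imputer.fit_transform(toy[["carat"]])[2, 0]

# Strategies that also work on categorical columns: the gap in 'cut' is row 3
mode_imputer = SimpleImputer(strategy="most_frequent")
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
results["most_frequent"] = mode_imputer.fit_transform(toy[["cut"]])[3, 0]
results["constant"] = const_imputer.fit_transform(toy[["cut"]])[3, 0]

print(results)
```

Note how the single large value (5.0) pulls the mean well above the median, which is one reason the median is often preferred for skewed columns.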

In [38]:
df.isna().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

In [39]:
# Generate missing values using the legacy NumPy random API
np.random.seed(42)
df2 = df.copy()
# randomly introduce nulls in a numeric column
random_indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'carat'] = np.nan

# and in a categorical column
random_indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'cut'] = np.nan

df2.isna().sum() # check that the nulls are now present

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [40]:
# same idea with the newer, recommended NumPy random API
#random_state = np.random.RandomState(seed=42)
random_state = np.random.default_rng(seed=42) # Generator; draw from it instead of the global np.random functions

df2 = df.copy()
# randomly introduce nulls in a numeric column
random_indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'carat'] = np.nan

# and in a categorical column
random_indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'cut'] = np.nan

df2.isna().sum() # check that the nulls are now present

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [41]:
# The nulls are in place; build X and y
X = df2[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut']] # includes a categorical column
y = df2[['price']] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [42]:
from sklearn.impute import SimpleImputer # sklearn.impute covers the basic strategies: mean, median, most frequent, constant

# select the names of the numeric and categorical columns
numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

# numeric columns: fit on train, reuse the fitted medians on test
imputer_num = SimpleImputer(strategy='median')
X_train_numerical = imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical = imputer_num.transform(X_test[numerical_cols])

# categorical columns: fill with the most frequent category
#imputer_cat = SimpleImputer(strategy='constant', fill_value='Other')
imputer_cat = SimpleImputer(strategy='most_frequent')
X_train_categorical = imputer_cat.fit_transform(X_train[categorical_cols])
X_test_categorical = imputer_cat.transform(X_test[categorical_cols]) # transform only: refitting on test would leak information

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# Optional: back to pandas DataFrames to keep column names (numeric columns first, then categorical)
all_columns = list(numerical_cols) + list(categorical_cols)
X_train_imputed = pd.DataFrame(X_train_array, columns=all_columns, index=X_train.index)
X_test_imputed = pd.DataFrame(X_test_array, columns=all_columns, index=X_test.index)

print(X_train_imputed.isna().sum()) # check that no nulls remain
print(X_test_imputed.isna().sum()) # no nulls here either

carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64
carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64


## KNNImputer

Imputer that uses the K-Nearest Neighbors algorithm to fill in missing values.

Instead of filling null cells (NaN) with a global statistic of the column (such as the mean or the median), `KNNImputer` looks for the k most "similar" rows (the closest ones in feature space) and computes the missing value from them.
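
The idea can be seen on a tiny matrix. This is a minimal sketch with made-up numbers and `n_neighbors=2`; by default the neighbours are weighted uniformly, so each gap becomes the plain average of the two nearest rows' values in that column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical matrix with two missing cells
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)

# Row 0: its two nearest rows are rows 1 and 2, whose third-column values are 3 and 5
# Row 2: its two nearest rows are rows 1 and 3, whose first-column values are 3 and 8
print(X_filled)
```

Because the neighbours are found by distance, columns on very different scales can dominate the search; scaling the features beforehand may be worth considering.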

In [43]:
random_state = np.random.default_rng(seed=42) # Generator from the newer NumPy random API

df2 = df.copy()
# randomly introduce nulls in a numeric column
random_indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'carat'] = np.nan

# and in a categorical column
random_indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'cut'] = np.nan

df2.isna().sum() # check that the nulls have been created

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [44]:
from sklearn.impute import KNNImputer, SimpleImputer

X = df2[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut']]
y = df2[['price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# select the names of the numeric and categorical columns
numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

imputer_num = KNNImputer(n_neighbors=7) # SimpleImputer replaced by KNNImputer
X_train_numerical = imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical = imputer_num.transform(X_test[numerical_cols])

imputer_cat = SimpleImputer(strategy='most_frequent') # SimpleImputer is kept because KNNImputer does not handle categorical columns
X_train_categorical = imputer_cat.fit_transform(X_train[categorical_cols])
X_test_categorical = imputer_cat.transform(X_test[categorical_cols]) # transform only: refitting on test would leak information

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# Optional: back to pandas DataFrames to keep column names (numeric columns first, then categorical)
all_columns = list(numerical_cols) + list(categorical_cols)
X_train_imputed = pd.DataFrame(X_train_array, columns=all_columns, index=X_train.index)
X_test_imputed = pd.DataFrame(X_test_array, columns=all_columns, index=X_test.index)

print(X_train_imputed.isna().sum()) # check that no nulls remain
print(X_test_imputed.isna().sum()) # no nulls here either

carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64
carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64


## IterativeImputer

`IterativeImputer` implements a multivariate imputation strategy.

Unlike the simple approaches (mean, median, mode) or `KNNImputer`, `IterativeImputer` trains a model to predict the missing values of each feature, using the other features as predictors.
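
A minimal sketch of the idea, using made-up data in which the second column is roughly twice the first. It assumes the default estimator (`BayesianRidge`) and still needs the experimental import to enable the class.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical training data: second column = 2 * first column, with two gaps
X_fit = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
# New rows with gaps to be filled using the learned relationship
X_new = np.array([[np.nan, 2], [6, np.nan], [np.nan, 6]])

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_fit)
imputed = imputer.transform(X_new)

# Each gap is predicted from the other column rather than from a global statistic
print(np.round(imputed, 1))
```

The filled values follow the linear relationship between the two columns, which a mean or median fill would ignore.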

In [45]:
random_state = np.random.default_rng(seed=42)

df2 = df.copy()
random_indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'carat'] = np.nan

random_indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[random_indices, 'cut'] = np.nan

df2.isna().sum()

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [46]:
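# Disabled first attempt (kept as a string): IterativeImputer expects numeric input, so it cannot be fitted on the raw 'cut' strings
'''from sklearn.experimental import enable_iterative_imputer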
from sklearn.impute import IterativeImputer

X = df2[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut']] 
y = df2[['price']] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

model = RandomForestRegressor(random_state=42)
imputer_num = IterativeImputer(model, random_state=42, initial_strategy='median') # switched to IterativeImputer
X_train_numerical = imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical = imputer_num.transform(X_test[numerical_cols])

model = RandomForestClassifier(random_state=42)
imputer_cat = IterativeImputer(model, random_state=42, initial_strategy='most_frequent') # attempt to impute the raw categorical column directly
X_train_categorical = imputer_cat.fit_transform(X_train[categorical_cols])
X_test_categorical = imputer_cat.transform(X_test[categorical_cols])

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# Optional: back to pandas DataFrames to keep column names
X_train_imputed = pd.DataFrame(X_train_array, columns=X_train.columns, index=X_train.index) 
X_test_imputed = pd.DataFrame(X_test_array, columns=X_test.columns, index=X_test.index) 

print(X_train_imputed.isna().sum()) 
print(X_test_imputed.isna().sum()) '''

In [47]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder

# Example using IterativeImputer with a categorical column (it must be encoded as numbers first)
df2 = df.copy()
indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[indices, 'carat'] = np.nan
indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[indices, 'cut'] = np.nan

X = df2[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut']] 
y = df2[['price']] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

model = RandomForestRegressor(random_state=42)
imputer_num = IterativeImputer(model, random_state=42, initial_strategy='median') # switched to IterativeImputer
X_train_numerical = imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical = imputer_num.transform(X_test[numerical_cols]) 

# IterativeImputer is designed for numeric columns, so categorical columns
# must be converted to numbers before imputing them:
encoder = OrdinalEncoder() # CAUTION: it can introduce a fictitious order: 0, 1, 2, 3... may suggest that more is better
#encoder = OneHotEncoder(sparse_output=False) # dense 0/1 matrix (requires importing OneHotEncoder)
X_train_categorical = encoder.fit_transform(X_train[categorical_cols])
X_test_categorical = encoder.transform(X_test[categorical_cols])

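model = RandomForestClassifier(random_state=42)
# NOTE: only the encoded 'cut' column is passed here, so the imputer has no other features to predict from
# and, in current scikit-learn, it returns the initial 'most_frequent' fill
imputer_cat = IterativeImputer(model, random_state=42, initial_strategy='most_frequent')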
X_train_categorical = imputer_cat.fit_transform(X_train_categorical)
X_test_categorical = imputer_cat.transform(X_test_categorical)

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# Optional: back to pandas DataFrames to keep column names
encoded_columns = encoder.get_feature_names_out(categorical_cols)
all_columns = list(numerical_cols) + list(encoded_columns)
X_train_imputed = pd.DataFrame(X_train_array, columns=all_columns, index=X_train.index) 
X_test_imputed = pd.DataFrame(X_test_array, columns=all_columns, index=X_test.index) 

print(X_train_imputed.isna().sum()) 
print(X_test_imputed.isna().sum()) 


carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64
carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64


## Outliers



In [66]:
# NOTE: we go back to df, not df2 with the nulls
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']] 
y = df['price'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## Filter outliers manually with pandas using Tukey's IQR rule
# the bounds are computed on the training data only
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

filtro = ~((X_train < lower_bound) | (X_train > upper_bound)).any(axis=1)
X_train_filtered = X_train[filtro]
y_train_filtered = y_train[filtro]
#X_train_filtered
print('X_train len', X_train.shape[0])
print('X_train_filtered len', X_train_filtered.shape[0])

filtro = ~((X_test < lower_bound) | (X_test > upper_bound)).any(axis=1)
X_test_filtered = X_test[filtro]
y_test_filtered = y_test[filtro]

calculate_metrics('- Outliers IQR', X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered)

X_train len 4000
X_train_filtered len 3615


|    | Modelo | Preprocesado | R2 | MAE | RMSE | MAPE |
|----|--------|--------------|----|-----|------|------|
| 39 | RandomForest | OneHotEncoder | 0.963541 | 397.252065 | 788.302258 | 0.09711 |
| 49 | RandomForest | OneHotEncoder+MinMaxScaler | 0.963456 | 397.241947 | 789.227699 | 0.097156 |
| 44 | RandomForest | OneHotEncoder_drop_first | 0.945799 | 448.282515 | 961.15868 | 0.107496 |
| 48 | DecisionTree | OneHotEncoder+MinMaxScaler | 0.935301 | 519.908 | 1050.125344 | 0.126376 |
| 38 | DecisionTree | OneHotEncoder | 0.930697 | 528.805 | 1086.85135 | 0.126546 |
| 45 | LinearRegression | OneHotEncoder+MinMaxScaler | 0.914045 | 791.194142 | 1210.402235 | 0.438605 |
| 35 | LinearRegression | OneHotEncoder | 0.914045 | 791.194142 | 1210.402235 | 0.438605 |
| 40 | LinearRegression | OneHotEncoder_drop_first | 0.914045 | 791.194142 | 1210.402235 | 0.438605 |
| 43 | DecisionTree | OneHotEncoder_drop_first | 0.898461 | 609.138 | 1315.555589 | 0.144406 |
| 36 | KNN | OneHotEncoder | 0.887188 | 764.8498 | 1386.662439 | 0.20158 |


In [68]:
# IsolationForest: an ensemble algorithm from sklearn.ensemble for detecting anomalies or outliers
from sklearn.ensemble import IsolationForest

X = df[['carat', 'depth', 'table', 'x', 'y', 'z']] 
y = df['price'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# note: this is an outlier detector, not a predictive model
outlier_detector = IsolationForest(contamination='auto', random_state=42) # 'auto' is the default; a float such as 0.05 would mean about 5% of outliers are expected
outlier_detector.fit(X_train)

# Remove outliers from train:
y_train_pred = outlier_detector.predict(X_train) # returns an array like array([1, 1, -1]), where -1 marks an anomaly
filtro = (y_train_pred != -1) # keep the non-outliers
X_train_filtered = X_train[filtro]
y_train_filtered = y_train[filtro]

print('X_train len', X_train.shape[0])
print('X_train_filtered len', X_train_filtered.shape[0]) # with 'auto' instead of 0.05 or 0.01, more rows are removed than with the IQR filter

y_test_pred = outlier_detector.predict(X_test) # predict only, no fit, as with preprocessors, to avoid data leakage
filtro = (y_test_pred != -1)
X_test_filtered = X_test[filtro]
y_test_filtered = y_test[filtro]

print('X_test len', X_test.shape[0])
print('X_test_filtered len', X_test_filtered.shape[0])


X_train len 4000
X_train_filtered len 3279
X_test len 1000
X_test_filtered len 806


In [69]:
calculate_metrics('IsolationForest', X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered)

|    | Modelo | Preprocesado | R2 | MAE | RMSE | MAPE |
|----|--------|--------------|----|-----|------|------|
| 39 | RandomForest | OneHotEncoder | 0.963541 | 397.252065 | 788.302258 | 0.097110 |
| 49 | RandomForest | OneHotEncoder+MinMaxScaler | 0.963456 | 397.241947 | 789.227699 | 0.097156 |
| 44 | RandomForest | OneHotEncoder_drop_first | 0.945799 | 448.282515 | 961.158680 | 0.107496 |
| 48 | DecisionTree | OneHotEncoder+MinMaxScaler | 0.935301 | 519.908000 | 1050.125344 | 0.126376 |
| 38 | DecisionTree | OneHotEncoder | 0.930697 | 528.805000 | 1086.851350 | 0.126546 |
| ... | ... | ... | ... | ... | ... | ... |
| 57 | SVR | - Outliers IQR | -0.161501 | 2449.256493 | 3610.189419 | 1.034480 |
| 32 | SVR | PowerTransformer standard False | -0.178926 | 2963.875294 | 4482.666046 | 1.097420 |
| 2 | SVR | Sin preprocesado | -0.179999 | 2967.146616 | 4484.706331 | 1.100950 |
| 42 | SVR | OneHotEncoder_drop_first | -0.180797 | 2968.788587 | 4486.221809 | 1.102247 |


In [75]:
# keep only the rows for the preprocessing options we want to compare
preprocesados = ['IsolationForest', 'MinMaxScaler', 'StandardScaler', 'OneHotEncoder']
df_resultados[df_resultados['Preprocesado'].isin(preprocesados)].sort_values('R2', ascending=False)

|    | Modelo | Preprocesado | R2 | MAE | RMSE | MAPE |
|----|--------|--------------|----|-----|------|------|
| 39 | RandomForest | OneHotEncoder | 0.963541 | 397.252065 | 788.302258 | 0.09711 |
| 38 | DecisionTree | OneHotEncoder | 0.930697 | 528.805 | 1086.85135 | 0.126546 |
| 35 | LinearRegression | OneHotEncoder | 0.914045 | 791.194142 | 1210.402235 | 0.438605 |
| 36 | KNN | OneHotEncoder | 0.887188 | 764.8498 | 1386.662439 | 0.20158 |
| 14 | RandomForest | MinMaxScaler | 0.868646 | 835.654962 | 1496.286474 | 0.210257 |
| 9 | RandomForest | StandardScaler | 0.868495 | 834.839074 | 1497.149208 | 0.210013 |
| 5 | LinearRegression | StandardScaler | 0.861482 | 930.832368 | 1536.548541 | 0.288899 |
| 10 | LinearRegression | MinMaxScaler | 0.861482 | 930.832368 | 1536.548541 | 0.288899 |
| 6 | KNN | StandardScaler | 0.859353 | 874.932 | 1548.314323 | 0.221851 |
| 11 | KNN | MinMaxScaler | 0.857112 | 868.6464 | 1560.600343 | 0.220208 |
