### Preprocesamiento de datos

Aplicar operaciones sobre los datos con el fin de mejorar los modelados.

Prepocesadores de `sklearn.preprocessing`:

* Escalado de datos
    * StandardScaler
    * MinMaxScaler
    * RobustScaler
* Transformación de distribución de datos (intentar reducir la asimetría de los datos) (similar a aplicar la función raíz o logaritmo a los datos)
    * QuantileTransformer
    * PowerTransformer
* Encoders para codificaicón de categóricos a numéricos:
    * OneHotEncoder (Equivalente a pd.get_dummies). Habitual usarlo en la entrada X.
    * LabelEncoder (Equivalente a hacer un .map()en pandas con un diccionario). Habitual usarlo en la salida y.

* Discretización:
    * KBinsDiscretizer
    * Binarizer
* Imputers:
    * SimpleImputer: mean, median, most_frequent
    * KNNImputer
    * IterativeImputer

Todas estas clases tienen algo en común, tienen métodos fit y transform para que puedan usarse de forma similar, lo que cambia es las operaciones que realizan sobre los datos, por ejemplo: escalar, transformar, codificar, imputar, discretizar, binarizar, normalizar, estandarizar...

Los **pipelines de scikit learn** simplifican el uso de preprocesadores cuando queremos aplicar varios de ellos y combinarlos.



In [70]:
import seaborn as sns 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

In [71]:
# Carga de datos
df = sns.load_dataset("diamonds").dropna().sample(5000, random_state=42).reset_index(drop=True)
df.head(3)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.24,Ideal,G,VVS1,62.1,56.0,559,3.97,4.0,2.47
1,0.58,Very Good,F,VVS2,60.0,57.0,2201,5.44,5.42,3.26
2,0.4,Ideal,E,VVS2,62.1,55.0,1238,4.76,4.74,2.95


In [72]:
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error, mean_absolute_percentage_error
# Particionar y crear método calculate_metrics para hacer un modelado antes de hacer nada y ver si aplicando preprocesadores mejora
X = df[["carat", "depth", "table", "x", "y", "z"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

df_resultado = pd.DataFrame(columns=["Modelo", "Preprocesado", "R2", "MAE", "RMSE", "MAPE"])

def calucate_metrics(preprocessor_name, X_train, X_test, y_train, y_test):
    modelos = {
        "LinearRegression": LinearRegression(),
        "KNN": KNeighborsRegressor(),
        "SVR": SVR(),
        "DecisionTree": DecisionTreeRegressor(),
        "RandomForest": RandomForestRegressor()
    }
    for model_name, model in modelos.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        df_resultado.loc[len(df_resultado)] = [model_name, preprocessor_name, 
                                               r2_score(y_test, y_pred),
                                               mean_absolute_error(y_test, y_pred),
                                               root_mean_squared_error(y_test, y_pred),
                                               mean_absolute_percentage_error(y_test, y_pred)]
        
    return df_resultado.sort_values("R2", ascending=False)

In [73]:
calucate_metrics("Sin preprocesado", X_train, X_test, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
1,KNN,Sin preprocesado,0.849006,899.1862,1604.254748,0.22642
3,DecisionTree,Sin preprocesado,0.774659,1100.567,1959.808612,0.269483
2,SVR,Sin preprocesado,-0.179999,2967.146616,4484.706331,1.10095


### StandarScaler

Transforma los datos para que cada característica tenga media 0 y desviación estándar 1.

Cuándo usarlo:
* Cuando los datos no tienen outliers extremadamente grandes (o son relativamente cercanos a una distribución normal).
* Es el escalado más común, especialmente para algoritmos que asumen normalidad o que son sensibles a la escala (regresiones lineales, redes neuronales, SVM, etc).

In [74]:
scaler = StandardScaler()
scaler.fit(X_train) # fit solo sobre train y no en test para evitar data leakage

X_train_scaled = scaler.transform(X_train) # Por defecto devuelve un array de numpy
X_test_scaled = scaler.transform(X_test) # Por defecto devuelve un array de numpy

# Opcional, pasarlo a dataframes de pandas con los nombres de las columnas
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
X_train_scaled.head(3)

Unnamed: 0,carat,depth,table,x,y,z
0,-1.028331,0.973355,-0.630227,-1.277261,-1.25903,-1.173678
1,0.439665,2.353854,0.692667,0.507249,0.500558,0.810866
2,0.104123,-0.338119,0.251703,0.347442,0.295124,0.278778


In [75]:
calucate_metrics("StandardScaler", X_train_scaled, X_test_scaled, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851
1,KNN,Sin preprocesado,0.849006,899.1862,1604.254748,0.22642
8,DecisionTree,StandardScaler,0.785102,1082.5095,1913.854195,0.268234
3,DecisionTree,Sin preprocesado,0.774659,1100.567,1959.808612,0.269483
7,SVR,StandardScaler,0.033553,2363.397416,4058.655409,0.675242
2,SVR,Sin preprocesado,-0.179999,2967.146616,4484.706331,1.10095


### MInMaxScaler

Escala y traslada cada caraterística indificual a un rango definido, por defecto [0, 1].

Cuándo usarlo:
* Cuando quieres que los datos estén acotados entre 0 y 1 o entre otro rango definido, porque se puede personalizar el rango (min, max).
* Sin embargo, es muy sensible a los outliers. Un valor muy grande puede comprimir el resto de los datos.

In [76]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train) # Equivalente al ejemplo anterior pero todo en una línea
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
X_train_scaled.head(3)

Unnamed: 0,carat,depth,table,x,y,z
0,0.026247,0.5,0.291667,0.066773,0.077901,0.265306
1,0.209974,0.625,0.416667,0.386328,0.391097,0.546939
2,0.167979,0.38125,0.375,0.357711,0.354531,0.471429


In [77]:
calucate_metrics("MinMaxScaler", X_train_scaled, X_test_scaled, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
14,RandomForest,MinMaxScaler,0.868044,834.96183,1499.712227,0.210139
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851
11,KNN,MinMaxScaler,0.857112,868.6464,1560.600343,0.220208
1,KNN,Sin preprocesado,0.849006,899.1862,1604.254748,0.22642
8,DecisionTree,StandardScaler,0.785102,1082.5095,1913.854195,0.268234


### RobustScaler

Escala los datos usando Mediana e IQR (rango intercuartílico).

Cuando existen outlier en los datos que pueden afectar mucho al escalado. Al usar la mediana y el IQR en lugar de la media y desviación estándar, resulta mucho menos sensible a valores atípicos.

In [78]:
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
X_train_scaled.head(3)

Unnamed: 0,carat,depth,table,x,y,z
0,-0.625,0.866667,-0.333333,-0.768176,-0.767956,-0.707965
1,0.46875,2.2,0.666667,0.334705,0.320442,0.513274
2,0.21875,-0.4,0.333333,0.23594,0.19337,0.185841


In [79]:
calucate_metrics("RobustScaler", X_train_scaled, X_test_scaled, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
14,RandomForest,MinMaxScaler,0.868044,834.96183,1499.712227,0.210139
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851
16,KNN,RobustScaler,0.8583,876.919,1554.09643,0.221305


## QuantileTransformer

Es una transformación basada en **cuantiles**

In [80]:
X_train.skew()

carat    1.193090
depth    0.159501
table    0.681555
x        0.434562
y        0.434972
z        0.407144
dtype: float64

In [81]:
from sklearn.preprocessing import QuantileTransformer

transformed = QuantileTransformer()
X_train_transformed = scaler.fit_transform(X_train)  
X_test_transformed = scaler.transform(X_test)

X_train_transformed = pd.DataFrame(X_train_transformed, columns=X.columns)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=X.columns)
X_train_transformed.head(3)

Unnamed: 0,carat,depth,table,x,y,z
0,-0.625,0.866667,-0.333333,-0.768176,-0.767956,-0.707965
1,0.46875,2.2,0.666667,0.334705,0.320442,0.513274
2,0.21875,-0.4,0.333333,0.23594,0.19337,0.185841


In [82]:
X_train_transformed.skew() # Vemos que ha reducido la asimetría

carat    1.193090
depth    0.159501
table    0.681555
x        0.434562
y        0.434972
z        0.407144
dtype: float64

In [83]:
calucate_metrics("QuantileTransformer", X_train_transformed, X_test_transformed, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
24,RandomForest,QuantileTransformer,0.869424,836.960855,1491.846494,0.210598
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
14,RandomForest,MinMaxScaler,0.868044,834.96183,1499.712227,0.210139
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
20,LinearRegression,QuantileTransformer,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899


### PowerTransformer

Cuando eel sesgo es más moderado, intenta que los datos sean parecidos a una distribución normal.

PowerTransformer aplica transformaciones de **potencia** para hacer que los datos se acerquen más a una distribución normal.

- Admite dos métodos principales:
  1. **Box-Cox**: requiere que todos los datos sean **estrictamente positivos**.  
  2. **Yeo-Johnson**: puede manejar datos con valores 0 o negativos.  Yeo-Johnson es una versión mejorada de Box-Cox que funciona con valores negativos y positivos.

Internamente, `PowerTransformer` encuentra el mejor parámetro de potencia que estabiliza la varianza y reduce la asimetría (skew) de los datos, por tanto es una opción más flexible y automatizada que aplicar manualmente un np.sqrt o np.log a una columna.

¿Cuándo usarlo?

- Cuando tus datos están fuertemente sesgados (tienen heavy skew) y necesitas **mejorar la normalidad**. El QuantileTransfomer podría ser más fuerte. 

- Se suele usar antes de **modelos lineales** o algoritmos que asumen distribuciones aproximadamente gaussianas, ayudando a cumplir hipótesis de homocedasticidad (misma varianza) y mejorando la linealidad.  

- Si tus datos tienen valores cero o negativos, no puedes usar Box-Cox, pero sí Yeo-Johnson.
- Puedes luego aplicar un escalado adicional (por ejemplo, `StandardScaler`) tras la transformación de potencia si lo deseas.

In [84]:
from sklearn.preprocessing import PowerTransformer

transformer = PowerTransformer()
X_train_transformed = transformer.fit_transform(X_train)  
X_test_transformed = transformer.transform(X_test)

X_train_transformed = pd.DataFrame(X_train_transformed, columns=X.columns)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=X.columns)
X_train_transformed.head(3)

Unnamed: 0,carat,depth,table,x,y,z
0,-1.332196,0.97448,-0.592305,-1.430225,-1.403233,-1.264652
1,0.715544,2.309622,0.764271,0.613714,0.607868,0.866351
2,0.404235,-0.329079,0.350714,0.469433,0.420965,0.393914


In [85]:
print("skew antes:\n", X_train.skew())
print("skew después:\n", X_train_transformed.skew())

skew antes:
 carat    1.193090
depth    0.159501
table    0.681555
x        0.434562
y        0.434972
z        0.407144
dtype: float64
skew después:
 carat    0.127325
depth    0.002243
table   -0.005639
x        0.037489
y        0.037876
z        0.030497
dtype: float64


In [86]:
calucate_metrics("PowerTransformer", X_train_transformed, X_test_transformed, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
24,RandomForest,QuantileTransformer,0.869424,836.960855,1491.846494,0.210598
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
29,RandomForest,PowerTransformer,0.868158,835.479976,1499.062634,0.210861
14,RandomForest,MinMaxScaler,0.868044,834.96183,1499.712227,0.210139
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899
20,LinearRegression,QuantileTransformer,0.861482,930.832368,1536.548541,0.288899


In [87]:
transformer = PowerTransformer()
X_train_transformed = transformer.fit_transform(X_train)  
X_test_transformed = transformer.transform(X_test)
calucate_metrics("PowerTransformer standar False", X_train_transformed, X_test_transformed, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
24,RandomForest,QuantileTransformer,0.869424,836.960855,1491.846494,0.210598
34,RandomForest,PowerTransformer standar False,0.868888,831.324652,1494.908547,0.209092
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
29,RandomForest,PowerTransformer,0.868158,835.479976,1499.062634,0.210861
14,RandomForest,MinMaxScaler,0.868044,834.96183,1499.712227,0.210139
0,LinearRegression,Sin preprocesado,0.861482,930.832368,1536.548541,0.288899
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
15,LinearRegression,RobustScaler,0.861482,930.832368,1536.548541,0.288899


### OneHotEncoder

Equivalente a pd.get_dummies de pandas pero es de Scikit Learn.

Crea **columnas binarias (dummies)** para cada categoría de una **feature** nominal.

**Cuándo usarlo**:  

1. Para **features categóricas nominales** (sin orden), como color, ciudad, tipo de mascota, etc.  
2. Normalmente se aplica a **variables de entrada** (X).  
3. Útil en la mayoría de los modelos que necesitan variables numéricas y no tienen forma de manejar directamente categorías.
4. Se puede usar en pipelines de scikit learn

Parámetro sparse_output:

* sparse_output=True: Devuelve la transformación como una matriz dispersa (scipy.sparse.csr_matrix) en lugar de un numpy.ndarray. 
    * Ventaja: Usa menos memoria si hay muchas categorías con muchos ceros (matriz dispersa)
    * Desventaja: Puede ser incompatible con algunas funciones de Pandas y Scikit-learn que esperan una matriz densa.
* sparse_output=False: Devuelve la transformación como un array denso (numpy.ndarray), en lugar de una matriz dispersa.
    * Ventaja: Se puede convertir fácilmente en un DataFrame de Pandas sin errores ni conversiones adicionales.
    * Desventaja: Puede consumir más memoria si hay muchas categorías y muchos ceros.

* Diferencia entre matriz densa y dispersa:
    * Matriz densa: Es una matriz donde todos los valores, incluyendo los ceros, son almacenados en memoria.
    * Matriz dispersa: Es una matriz en la que se almacenan solo los valores distintos de cero, junto con sus coordenadas (índices de fila y columna). Más óptima pero más difícil de manipular directamente, requiere conversión a formato denso para ciertas operaciones.

In [88]:
X = df[["carat", "depth", "table", "x", "y", "z", "cut", "color", "clarity"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Hace el encoding de categóricas, no da problemas si incluimos las numéricas
# pd.ger_dummies(X)
pd.get_dummies(X)

Unnamed: 0,carat,depth,table,x,y,z,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,...,color_I,color_J,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.24,62.1,56.0,3.97,4.00,2.47,True,False,False,False,...,False,False,False,True,False,False,False,False,False,False
1,0.58,60.0,57.0,5.44,5.42,3.26,False,False,True,False,...,False,False,False,False,True,False,False,False,False,False
2,0.40,62.1,55.0,4.76,4.74,2.95,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
3,0.43,60.8,57.0,4.92,4.89,2.98,False,True,False,False,...,False,False,False,False,True,False,False,False,False,False
4,1.55,62.3,55.0,7.44,7.37,4.61,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0.31,60.8,57.0,4.40,4.38,2.67,True,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4996,1.06,61.2,55.0,6.57,6.61,4.03,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4997,0.71,61.0,56.0,5.77,5.80,3.53,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4998,0.90,63.3,56.0,6.13,6.10,3.87,False,False,True,False,...,False,True,False,False,False,False,True,False,False,False


In [89]:
from sklearn.preprocessing import OneHotEncoder

# obtener nombres de columnas numéricas y categóricas
numerical_columns = X_train.select_dtypes(exclude=['object', 'category']).columns.to_list() # np.number alternativa
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

encoder = OneHotEncoder(sparse_output=False) # sparse_output=False para obtenerlo como matriz de 0s y 1s , probar drop='first'
X_train_encoded = encoder.fit_transform(X_train[categorical_columns]) # array de numpy con las codificaciones
X_test_encoded = encoder.transform(X_test[categorical_columns])


# pasarlo a dataframes de pandas y juntarlo con las numéricas para obtener resultado como pd.get_dummies
X_train_final = pd.concat(
    [
        pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True), # categoricas
        X_train[numerical_columns].reset_index(drop=True) # numéricas
    ],
    axis=1
)
X_test_final = pd.concat(
    [
        pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True), # categoricas
        X_test[numerical_columns].reset_index(drop=True) # numéricas
    ],
    axis=1
)

X_test_final

# pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out() )

Unnamed: 0,cut_Fair,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_D,color_E,color_F,color_G,color_H,...,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2,carat,depth,table,x,y,z
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.70,60.6,58.0,5.80,5.72,3.49
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.03,61.0,60.0,6.46,6.53,3.96
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.31,62.6,57.0,4.33,4.29,2.70
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.00,62.7,58.0,6.41,6.32,3.99
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.50,61.6,55.0,5.11,5.14,3.16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.43,61.8,56.0,4.87,4.84,3.00
996,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.32,63.1,56.0,4.34,4.38,2.75
997,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.31,60.7,61.0,4.36,4.40,2.66
998,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.37,61.0,56.0,4.70,4.65,2.85


In [90]:
calucate_metrics("OneHotEncoder", X_train_final, X_test_final, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.962238,405.085617,802.271909,0.09709
38,DecisionTree,OneHotEncoder,0.92911,529.884,1099.221206,0.125833
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158
24,RandomForest,QuantileTransformer,0.869424,836.960855,1491.846494,0.210598
34,RandomForest,PowerTransformer standar False,0.868888,831.324652,1494.908547,0.209092
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633
4,RandomForest,Sin preprocesado,0.8686,831.599182,1496.549411,0.209466
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
29,RandomForest,PowerTransformer,0.868158,835.479976,1499.062634,0.210861


In [91]:
# OPCIONAL, NO ES OBLIGATORIO
# Combinar OneHotEncoder con MinMaxScaler
# obtener nombres de columnas numéricas y categóricas
numerical_columns = X_train.select_dtypes(exclude=['object', 'category']).columns.to_list() # np.number alternativa
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

encoder = OneHotEncoder(sparse_output=False) # sparse_output=False para obtenerlo como matriz de 0s y 1s , probar drop='first'
X_train_encoded = encoder.fit_transform(X_train[categorical_columns]) # array de numpy con las codificaciones
X_test_encoded = encoder.transform(X_test[categorical_columns])

sclaer = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[numerical_columns])
X_test_scaled = scaler.fit_transform(X_test[numerical_columns])


X_train_final = pd.concat(
    [
        pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True), # categoricas
        pd.DataFrame(X_train_scaled, columns=numerical_columns).reset_index(drop=True) #numéricas
    ],
    axis=1
)
X_test_final = pd.concat(
    [
        pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out()).reset_index(drop=True), # categoricas
        pd.DataFrame(X_test_scaled, columns=numerical_columns).reset_index(drop=True)
    ],
    axis=1
)

X_test_final

Unnamed: 0,cut_Fair,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_D,color_E,color_F,color_G,color_H,...,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2,carat,depth,table,x,y,z
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,-0.014493,-0.866667,0.333333,0.030928,-0.010204,-0.037736
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.463768,-0.600000,1.000000,0.371134,0.403061,0.356394
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,-0.579710,0.466667,0.000000,-0.726804,-0.739796,-0.700210
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.420290,0.533333,0.333333,0.345361,0.295918,0.381551
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,-0.304348,-0.200000,-0.666667,-0.324742,-0.306122,-0.314465
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.405797,-0.066667,-0.333333,-0.448454,-0.459184,-0.448637
996,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.565217,0.800000,-0.333333,-0.721649,-0.693878,-0.658281
997,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,-0.579710,-0.800000,1.333333,-0.711340,-0.683673,-0.733753
998,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,-0.492754,-0.600000,-0.333333,-0.536082,-0.556122,-0.574423


In [92]:
calucate_metrics("OneHotEncoder+MinMaxScaler", X_train_final, X_test_final, y_train, y_test)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.962238,405.085617,802.271909,0.09709
44,RandomForest,OneHotEncoder+MinMaxScaler,0.947313,485.062421,947.640511,0.10617
38,DecisionTree,OneHotEncoder,0.92911,529.884,1099.221206,0.125833
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
43,DecisionTree,OneHotEncoder+MinMaxScaler,0.913513,642.6835,1214.138572,0.142175
40,LinearRegression,OneHotEncoder+MinMaxScaler,0.912996,787.160456,1217.760529,0.427536
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158
24,RandomForest,QuantileTransformer,0.869424,836.960855,1491.846494,0.210598
34,RandomForest,PowerTransformer standar False,0.868888,831.324652,1494.908547,0.209092
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633


### LabelEncoder

Convierte etiquetas (categorías) a valores numéricos enteros de 0 a n-1.  

Por ejemplo, si tienes las categorías `["rojo", "verde", "azul"]`, podría asignar  
- *rojo* $\to 0$,  
- *verde* $\to 1$,  
- *azul* $\to 2$.

* Normalmente se usa para la variable de salida (y) si se trata de un problema de clasificación multiclase.
* Convierte cada clase categórica a un entero distinto.
* También puede usarse en columnas de entrada si (y solo si) tienen un orden real (caso ordinal) o si el modelo puede manejarlo sin suponer que 2 > 1 > 0 (pero esto no es común; en features categóricas nominales, lo típico es OneHotEncoder).

Equivalente a cuando hacemos el `df['class'].map({'setosa':0, 'virginica':1, 'versicolor':2})`

In [93]:
X = df[["carat", "depth", "table", "x", "y", "z","price" ]]
y = df["cut"] # categórica --- clasificación multiclase


In [94]:
from sklearn.calibration import LabelEncoder


encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.20, random_state=42)

print("classes del encoder:", encoder.classes_)
print("ejemplo y_encoded:", y_encoded[:10])
print("ejemplo y_train:", y_train[:10])
print("ejemplo y_test:", y_test[:10])

classes del encoder: ['Fair' 'Good' 'Ideal' 'Premium' 'Very Good']
ejemplo y_encoded: [2 4 2 3 2 0 2 2 3 2]
ejemplo y_train: [1 0 3 4 1 2 2 2 2 4]
ejemplo y_test: [3 3 2 3 2 2 2 2 2 1]


In [95]:
# con inverse_transform podemos obtener las categorías originales a partir de los datos codificados
# podemos palicar inverse_transform sobre y_pred para obtener las categorías de las predicciones sis queremos
encoder.inverse_transform(y_encoded)[:10]

array(['Ideal', 'Very Good', 'Ideal', 'Premium', 'Ideal', 'Fair', 'Ideal',
       'Ideal', 'Premium', 'Ideal'], dtype=object)

In [96]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.749

In [97]:
# Predicciones devueltas
print("predicciones", y_pred[:10])
print("predicciones descodificadas", encoder.inverse_transform(y_pred)[:10])

predicciones [3 3 2 3 2 2 2 2 2 1]
predicciones descodificadas ['Premium' 'Premium' 'Ideal' 'Premium' 'Ideal' 'Ideal' 'Ideal' 'Ideal'
 'Ideal' 'Good']


### KBinsDiscretizer

Similar a pd.cut de pandas para discretizar columnas numéricas.

Convierte variables numéricas continuas en variables discretas, dividiendo los valores en **intervalos o "bins"**. Cada intervalo recibe una etiqueta numérica.

¿Cuándo usarlo?

* Cuando queremos convertir variables numéricas continuas en categorías discretas (por ejemplo, dividir `carat` en "pequeño", "mediano" y "grande").  
* Cuando un modelo puede beneficiarse de información categorizada en lugar de valores continuos.  
* Para mejorar la interpretabilidad de un modelo.  

Formas de discretización:

- `uniform`: Divide el rango en intervalos de **igual tamaño**.
- `quantile`: Crea intervalos con **igual número de muestras**.
- `kmeans`: Usa **K-Means** para definir los bins.

In [98]:
# Realizar clasificación multiclase sobre "price"
X = df[["carat", "depth", "table", "x", "y", "z"]]
y = df[["price"]] #[[]] para que sea 2 dimensiones para discretizar. 
        # Numérica que trasformaremos a categórica --- clasificación multiclase

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [99]:
from sklearn.preprocessing import KBinsDiscretizer

# Discretizar el precio en 4 grupos
discretizer = KBinsDiscretizer(encode="ordinal", n_bins=4, strategy="kmeans") # encode= "onehot-dense" genera matriz densa estilo one hot para la X
discretizer.fit(y_train)

y_train_discretized = discretizer.transform(y_train).ravel() # ravel pasa de 2 dimensiones a 1 dimensión para usar en scikit fit y predict
y_test_discretized = discretizer.transform(y_test).ravel()

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train_discretized)
y_pred = model.predict(X_test)
accuracy_score(y_test_discretized, y_pred)

0.853

In [100]:
print("predicciones: ", y_pred[:10])

predicciones:  [0. 1. 0. 1. 0. 0. 0. 1. 1. 1.]


In [101]:
print("discretizer.n_bins_", discretizer.n_bins_)
print("discretizer.n_features_in_:", discretizer.n_features_in_)
print("discretizer.n_features_in_:", discretizer.feature_names_in_)
print("discretizer.n_edges_:", discretizer.bin_edges_)
print("minimo", discretizer.bin_edges_[0][0])
print("bin 1", discretizer.bin_edges_[0][1])
print("bin 2", discretizer.bin_edges_[0][2])
print("bin 3", discretizer.bin_edges_[0][3])
print("bin max", discretizer.bin_edges_[0][4])

discretizer.n_bins_ [4]
discretizer.n_features_in_: 1
discretizer.n_features_in_: ['price']
discretizer.n_edges_: [array([  336.        ,  2964.95252869,  6925.92917442, 12188.07840494,
        18823.        ])                                               ]
minimo 336.0
bin 1 2964.952528685527
bin 2 6925.929174424699
bin 3 12188.078404935168
bin max 18823.0


### Binarizer

Binarizer convierte valores numéricos en valores binarios (0 o 1) en función de un umbral. 

Se usa cuando quieres transformar una variable numérica en categórica, lo que puede ser útil para mejorar el rendimiento de algunos modelos.

Por ejemplo podemos convertir la variable precio a una variable binaria barato (0) y caro (1) para realizar clasificación binaria.

Otro ejemplo es binarizar la edad de una persona en adulto (0 o 1) en función de si tiene igual o más de 18 años o no.

In [102]:
from sklearn.preprocessing import Binarizer


X = df[["carat", "depth", "table", "x", "y", "z"]]
y = df[["price"]] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

binarizer =Binarizer(threshold=df["carat"].median())
X_train["carat"] = binarizer.fit_transform(X_train[["carat"]])
X_test["carat"] = binarizer.fit_transform(X_test[["carat"]])
X_train.head(3)

Unnamed: 0,carat,depth,table,x,y,z
4227,0.0,63.2,56.0,4.27,4.3,2.71
4676,1.0,65.2,59.0,6.28,6.27,4.09
800,1.0,61.3,58.0,6.1,6.04,3.72


### SimpleImputer

Clase para imputar valores nulos (NaN) en los dataframes o arrays.

Estrategrias:
* mean: rellena con la media de la columna (numérica)
* median: rellena con la mediana de la columna (numérica)
* most_frequent: rellena los valores nulos con el valor más frecuente de la columna (puede usarse tanto en numéricas como en categóricas)
* constant: rellena con un valor constante que definimos (por ejemplo, "mising" en categóicas o 0 en numéricas).


In [103]:
# Generación aleatoria legacy
np.random.seed(42)
df2 = df.copy()

# Introducir números nulos aleatoriamente en una columna numérica
indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "carat"] = np.nan

indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "cut"] = np.nan

df2.isna().sum()

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [104]:
# Nueva API random
# random_state = np.random.RandomState(seed=42)
random_state =np.random.default_rng(seed=42)

df2 = df.copy()

# Introducir números nulos aleatoriamente en una columna numérica
indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "carat"] = np.nan

indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "cut"] = np.nan

df2.isna().sum()

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [105]:
X = df2[["carat", "depth", "table", "x", "y", "z", "cut"]]
y = df2[["price"]] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


In [106]:
from sklearn.impute import SimpleImputer


numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

imputer_num = SimpleImputer(strategy="median")
X_train_numerical= imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical= imputer_num.fit_transform(X_test[numerical_cols])

# imputer_cat = SimpleImputer(strategy="constant", fill_value="Other")
imputer_cat = SimpleImputer(strategy="most_frequent")
X_train_categorical= imputer_cat.fit_transform(X_train[categorical_cols])
X_test_categorical= imputer_cat.fit_transform(X_test[categorical_cols])

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# Opcionel: pasar a dataframes de pandas para tener nombres de columnas
X_train_imputed =pd.DataFrame(X_train_array, columns=  X_train.columns, index=X_train.index)
X_test_imputed =pd.DataFrame(X_test_array, columns=  X_test.columns, index=X_test.index)

print(X_train_imputed.isna().sum()) # Comprobar que ya no hay nulos
print(X_test_imputed.isna().sum()) # Comprobar que ya no hay nulos

carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64
carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64


## KNNImputer

Imputer que usa el algoritno de K Vecinos más cercanos (K-Nearest Neighbors) para imputar valores faltantes.

En lugar de rellenar las celdas nulas (NaN) con una estadística global (como la media o la mediana) de la columna, el KNNImputer busca k distancias (filas) "similares" (más cercanas en el espacio de características)

In [107]:
random_state =np.random.default_rng(seed=42)

df2 = df.copy()

# Introducir números nulos aleatoriamente en una columna numérica
indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "carat"] = np.nan

indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "cut"] = np.nan

df2.isna().sum()

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [108]:
from sklearn.impute import KNNImputer


X = df2[["carat", "depth", "table", "x", "y", "z", "cut"]]
y = df2[["price"]] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

imputer_num = KNNImputer(n_neighbors=7) # Hemos cambiado SimpleImputer por KNNImputer
X_train_numerical= imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical= imputer_num.fit_transform(X_test[numerical_cols])

# imputer_cat = SimpleImputer(strategy="constant", fill_value="Other")
imputer_cat = SimpleImputer(strategy="most_frequent") # Se mantiene SimpleImputer  porque KNNImputer no trabaja con categóricas
X_train_categorical= imputer_cat.fit_transform(X_train[categorical_cols])
X_test_categorical= imputer_cat.fit_transform(X_test[categorical_cols])

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# Opcionel: pasar a dataframes de pandas para tener nombres de columnas
X_train_imputed =pd.DataFrame(X_train_array, columns=  X_train.columns, index=X_train.index)
X_test_imputed =pd.DataFrame(X_test_array, columns=  X_test.columns, index=X_test.index)

print(X_train_imputed.isna().sum()) # Comprobar que ya no hay nulos
print(X_test_imputed.isna().sum()) # Comprobar que ya no hay nulos


carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64
carat    0
depth    0
table    0
x        0
y        0
z        0
cut      0
dtype: int64


### IterativeImputer

IterativeImputer implementa una estrategia de imputación multivariable

A diferencia de los enfoques simples (media, mediana, moda) o KNNImputer, el IterativeImputes entrena un modelo para predecir el valor faltante ene cada caraterística usando las otras características como predictores

In [109]:
random_state =np.random.default_rng(seed=42)

df2 = df.copy()

# Introducir números nulos aleatoriamente en una columna numérica
indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "carat"] = np.nan

indices = np.random.choice(df2.index, size=50, replace=False)
df2.loc[indices, "cut"] = np.nan

df2.isna().sum()

carat      50
cut        50
color       0
clarity     0
depth       0
table       0
price       0
x           0
y           0
z           0
dtype: int64

In [110]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

X = df2[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut']]
y = df2[['price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

model = RandomForestRegressor(random_state=42)
imputer_num = IterativeImputer(model, random_state=42, initial_strategy='median') # Hemos cambiado a IterativeImputer
X_train_numerical = imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical = imputer_num.transform(X_test[numerical_cols])

imputer_cat = SimpleImputer(initial_strategy='most_frequent') # Se mantiene SimpleImputer porque KNNImputer no trabaja con categóricas
X_train_categorical = imputer_cat.fit_transform(X_train[categorical_cols])
X_test_categorical = imputer_cat.transform(X_test[categorical_cols])

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# opcional: pasar a dataframes de pandas para tener nombres de columnas
X_train_imputed = pd.DataFrame(X_train_array, columns = X_train.columns, index=X_train.index)
X_test_imputed = pd.DataFrame(X_test_array, columns = X_test.columns, index=X_test.index)

print(X_train_imputed.isna().sum()) # Ya no hay nulos
print(X_test_imputed.isna().sum()) # Ya no hay nulos

TypeError: SimpleImputer.__init__() got an unexpected keyword argument 'initial_strategy'

In [111]:
# Ejemplo usando IterativeImputer para categóricas (hay que hacer one hot encoder primero)
df2 = df.copy()
indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[indices, 'carat'] = np.nan
indices = random_state.choice(df2.index, size=50, replace=False)
df2.loc[indices, 'cut'] = np.nan

X = df2[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut']]
y = df2[['price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

numerical_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

model = RandomForestRegressor(random_state=42)
imputer_num = IterativeImputer(model, random_state=42, initial_strategy='median') # Hemos cambiado a IterativeImputer
X_train_numerical = imputer_num.fit_transform(X_train[numerical_cols])
X_test_numerical = imputer_num.transform(X_test[numerical_cols])

encoder = OneHotEncoder(sparse_output=False)
X_train_categorical = encoder.fit_transform(X_train[categorical_cols])
X_test_categorical = encoder.transform(X_test[categorical_cols])

model = RandomForestClassifier(random_state=42)
imputer_cat = IterativeImputer(model, random_state=42, initial_strategy='most_frequent') # Se mantiene SimpleImputer porque KNNImputer no trabaja con categóricas
X_train_categorical = imputer_cat.fit_transform(X_train_categorical)
X_test_categorical = imputer_cat.transform(X_test_categorical)

X_train_array = np.concatenate([X_train_numerical, X_train_categorical], axis=1)
X_test_array = np.concatenate([X_test_numerical, X_test_categorical], axis=1)

# opcional: pasar a dataframes de pandas para tener nombres de columnas
encoded_columns = encoder.get_feature_names_out(categorical_cols)
all_columns = list(numerical_cols) + list(encoded_columns)

X_train_imputed = pd.DataFrame(X_train_array, columns = all_columns, index=X_train.index)
X_test_imputed = pd.DataFrame(X_test_array, columns = all_columns, index=X_test.index)

print(X_train_imputed.isna().sum()) # Ya no hay nulos
print(X_test_imputed.isna().sum()) # Ya no hay nulos

carat            0
depth            0
table            0
x                0
y                0
z                0
cut_Fair         0
cut_Good         0
cut_Ideal        0
cut_Premium      0
cut_Very Good    0
cut_nan          0
dtype: int64
carat            0
depth            0
table            0
x                0
y                0
z                0
cut_Fair         0
cut_Good         0
cut_Ideal        0
cut_Premium      0
cut_Very Good    0
cut_nan          0
dtype: int64


## Outlier
Los outliers son valores que se encuentran significativamente alejados de la mayoría de los datos en un conjunto de datos. 

Estos valores pueden deberse a errores en la recopilación, errores de entrada, condiciones extremas o eventos raros. En términos estadísticos, los outliers pueden ser definidos utilizando diferentes métodos, como:

* Desviación estándar: Valores que se encuentran más allá de 2 o 3 desviaciones estándar de la media.
* Rango intercuartílico (IQR): Datos que están fuera del rango [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], donde IQR es la diferencia entre el tercer y primer cuartil.

Los outliers afectan al modelado:

* Distorsionan las métricas estadísticas
* Dificultan la generalización
* Pueden provocar sobreajuste

Algoritmos sensibles a outliers (requieren preprocesamiento):

* Regresión lineal y polinómica
* KNN
* SVM con kernel lineal
* Redes neuronales
* KMeans y PCA

Algoritmos menos sensibles a outliers (no les afectan tanto):

* Regresión logística
* Árboles de decisión
* Gradient Boosting
* Algoritmos basados en medianas: IsolationForest

In [113]:
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## Filtrar outliers manualmente con Pandas y el método tukey IQR
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR # Límite inferior
upper_bound = Q3 + 1.5 * IQR # Límite superior

filtro = ~((X_train < lower_bound) | (X_train > upper_bound)).any(axis=1)
X_train_filtered = X_train[filtro]
y_train_filtered = y_train[filtro]

filtro = ~((X_test < lower_bound) | (X_test > upper_bound)).any(axis=1)
X_test_filtered = X_test[filtro]
y_test_filtered = y_test[filtro]

calucate_metrics('- Outliers IQR', X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.962238,405.085617,802.271909,0.09709
44,RandomForest,OneHotEncoder+MinMaxScaler,0.947313,485.062421,947.640511,0.10617
38,DecisionTree,OneHotEncoder,0.92911,529.884,1099.221206,0.125833
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
43,DecisionTree,OneHotEncoder+MinMaxScaler,0.913513,642.6835,1214.138572,0.142175
40,LinearRegression,OneHotEncoder+MinMaxScaler,0.912996,787.160456,1217.760529,0.427536
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158
24,RandomForest,QuantileTransformer,0.869424,836.960855,1491.846494,0.210598
34,RandomForest,PowerTransformer standar False,0.868888,831.324652,1494.908547,0.209092
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633


In [120]:
# IsolationForest algoritmo de conjunto para detectar anomalías o outliers
from sklearn.ensemble import IsolationForest

X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

outlier_detector = IsolationForest(contamination="auto", random_state=42) # Indicamos que se  espera 5% de outliers, por defecto, es automático
outlier_detector.fit(X_train) # Lo entrenamos

# Eliminar datos outliers de train:
y_train_pred = outlier_detector.predict(X_train) # Devuelve un array así: array([1, 1, -1]) donde -1 es que es anómalo

X_train_filtered = X_train[filtro]
y_train_filtered = y_train[filtro]
print("X_train len", X_train.shape[0])
print("X_train_filtered len", X_train_filtered.shape[0])

y_test_pred = outlier_detector.predict(X_test)
filtro = (y_test_pred != -1)
X_test_filtered = X_test[filtro]
y_test_filtered = y_test[filtro]
print("X_test len", X_test.shape[0])
print("X_test_filtered len", X_test_filtered.shape[0])

X_train len 4000
X_train_filtered len 3279
X_test len 1000
X_test_filtered len 806


In [121]:
calucate_metrics('IsolationForest', X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered)

Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.962238,405.085617,802.271909,0.09709
44,RandomForest,OneHotEncoder+MinMaxScaler,0.947313,485.062421,947.640511,0.10617
38,DecisionTree,OneHotEncoder,0.92911,529.884,1099.221206,0.125833
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
43,DecisionTree,OneHotEncoder+MinMaxScaler,0.913513,642.6835,1214.138572,0.142175
40,LinearRegression,OneHotEncoder+MinMaxScaler,0.912996,787.160456,1217.760529,0.427536
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158
24,RandomForest,QuantileTransformer,0.869424,836.960855,1491.846494,0.210598
34,RandomForest,PowerTransformer standar False,0.868888,831.324652,1494.908547,0.209092
19,RandomForest,RobustScaler,0.868617,840.47698,1496.450348,0.211633


In [125]:
df_resultado[
    (df_resultado["Preprocesado"] == "IsolationForest")
    |
    (df_resultado["Preprocesado"] == "MinMaxScaler")
    |
    (df_resultado["Preprocesado"] == "StandardScaler")
    |
    (df_resultado["Preprocesado"] == "OneHotEncoder")
].sort_values("R2", ascending=False)



Unnamed: 0,Modelo,Preprocesado,R2,MAE,RMSE,MAPE
39,RandomForest,OneHotEncoder,0.962238,405.085617,802.271909,0.09709
38,DecisionTree,OneHotEncoder,0.92911,529.884,1099.221206,0.125833
35,LinearRegression,OneHotEncoder,0.914045,791.194142,1210.402235,0.438605
36,KNN,OneHotEncoder,0.887188,764.8498,1386.662439,0.20158
9,RandomForest,StandardScaler,0.86819,838.420478,1498.880314,0.211034
14,RandomForest,MinMaxScaler,0.868044,834.96183,1499.712227,0.210139
5,LinearRegression,StandardScaler,0.861482,930.832368,1536.548541,0.288899
10,LinearRegression,MinMaxScaler,0.861482,930.832368,1536.548541,0.288899
6,KNN,StandardScaler,0.859353,874.932,1548.314323,0.221851
11,KNN,MinMaxScaler,0.857112,868.6464,1560.600343,0.220208
