# Imputing missing values with KNN

After studying the strategies for handling missing numeric values explained in the housing dataset preprocessing notebook, propose a new solution using scikit-learn's KNNImputer class.

## Imports and preliminary steps

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline


In [2]:
housing = pd.read_csv("./data/housing.csv")

X_train, X_test, y_train, y_test = train_test_split(
    housing.drop(columns="median_house_value"),  # features
    housing["median_house_value"],  # target
    stratify=pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5]),
    test_size=0.2, random_state=42
    )


null_rows_idx = X_train.isnull().any(axis=1)  # boolean mask of the rows that contain missing values
X_train.loc[null_rows_idx].head()  # show the first rows with missing values

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,,1145.0,480.0,6.358,NEAR OCEAN


## Solución paso a paso

Although there are methods for finding the optimal value of k by comparing several candidates, here we follow the criterion that k equals the square root of the number of samples, which is a [rule of thumb often followed in practice](https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb#:~:text=The%20optimal%20K%20value%20usually,be%20aware%20of%20the%20outliers.). Since this is only an imputation step and not a model, there is no need to tune k any further.

In [3]:
k_value = np.sqrt(X_train.shape[0]).astype(int)
k_value

128
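
The comparison-based alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (not the housing dataset), the candidate list is arbitrary, and the idea is simply to hide values whose truth is known, impute them with several values of k, and compare the error.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): three correlated columns.
n_samples = 400
base = rng.normal(size=n_samples)
X_full = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=n_samples),
    -base + rng.normal(scale=0.1, size=n_samples),
])

# Hide roughly 10% of the second column so the true values are known.
mask = rng.random(n_samples) < 0.10
X_missing = X_full.copy()
X_missing[mask, 1] = np.nan

# Candidate values of k, including the square-root rule of thumb.
k_sqrt = int(np.sqrt(n_samples))
candidates = [1, 5, k_sqrt, 100]

rmse_by_k = {}
for k in candidates:
    imputed = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    errors = imputed[mask, 1] - X_full[mask, 1]
    rmse_by_k[k] = float(np.sqrt(np.mean(errors ** 2)))

# Pick the candidate with the lowest reconstruction error.
best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, best_k)
```

On real data the ranking of candidates depends on the dataset, so a comparison like this would have to be rerun on the training set itself.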

The [`set_output` method](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html) makes scikit-learn transformers return DataFrames instead of ndarrays.

We also apply OneHotEncoder to the categorical variables.

In [4]:
cat_encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
housing_cat_ohe = cat_encoder.fit_transform(X_train[["ocean_proximity"]])
housing_cat_ohe

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
12655,0.0,1.0,0.0,0.0,0.0
15502,0.0,0.0,0.0,0.0,1.0
2908,0.0,1.0,0.0,0.0,0.0
14053,0.0,0.0,0.0,0.0,1.0
20496,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...
15174,1.0,0.0,0.0,0.0,0.0
12661,0.0,1.0,0.0,0.0,0.0
19263,1.0,0.0,0.0,0.0,0.0
19140,1.0,0.0,0.0,0.0,0.0


We concatenate the DataFrames:

In [5]:
X_train_tr1 = pd.concat([housing_cat_ohe, X_train.drop(columns="ocean_proximity")], axis=1)
X_train_tr1

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
12655,0.0,1.0,0.0,0.0,0.0,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736
15502,0.0,0.0,0.0,0.0,1.0,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373
2908,0.0,1.0,0.0,0.0,0.0,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.8750
14053,0.0,0.0,0.0,0.0,1.0,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264
20496,1.0,0.0,0.0,0.0,0.0,-118.70,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15174,1.0,0.0,0.0,0.0,0.0,-117.07,33.03,14.0,6665.0,1231.0,2026.0,1001.0,5.0900
12661,0.0,1.0,0.0,0.0,0.0,-121.42,38.51,15.0,7901.0,1422.0,4769.0,1418.0,2.8139
19263,1.0,0.0,0.0,0.0,0.0,-122.72,38.44,48.0,707.0,166.0,458.0,172.0,3.1797
19140,1.0,0.0,0.0,0.0,0.0,-122.70,38.31,14.0,3155.0,580.0,1208.0,501.0,4.1964


We decide to use the KNN algorithm to impute the missing values in the 'total_bedrooms' column. To do so, we use scikit-learn's KNNImputer class.

Because KNN relies on distance measures, it is important to scale the data before applying the algorithm; otherwise, the features with larger orders of magnitude will dominate the distances.

For that, we use scikit-learn's StandardScaler class, which ignores NaN values when fitting and leaves them in place when transforming.

In [6]:
scaler = StandardScaler().set_output(transform="pandas")  # so that the result is a DataFrame
X_train_num_scaled = scaler.fit_transform(X_train_tr1)
X_train_num_scaled

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
12655,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,-0.941350,1.347438,0.027564,0.584777,0.635123,0.732602,0.556286,-0.893647
15502,-0.887683,-0.68391,-0.011006,-0.354889,2.602693,1.171782,-1.192440,-1.722018,1.261467,0.775677,0.533612,0.721318,1.292168
2908,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,0.267581,-0.125972,1.220460,-0.469773,-0.545045,-0.674675,-0.524407,-0.525434
14053,-0.887683,-0.68391,-0.011006,-0.354889,2.602693,1.221738,-1.351474,-0.370069,-0.348652,-0.038567,-0.467617,-0.037297,-0.865929
20496,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,0.437431,-0.635818,-0.131489,0.427179,0.269198,0.374060,0.220898,0.325752
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15174,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,1.251711,-1.220505,-1.165333,1.890456,1.686854,0.543471,1.341519,0.637374
12661,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,-0.921368,1.342761,-1.085806,2.468471,2.149712,3.002174,2.451492,-0.557509
19263,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,-1.570794,1.310018,1.538566,-0.895802,-0.894007,-0.862013,-0.865118,-0.365475
19140,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,-1.560803,1.249211,-1.165333,0.249005,0.109257,-0.189747,0.010616,0.168261
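
To illustrate why scaling matters for a distance-based imputer, the sketch below compares nearest neighbours before and after standardization. The rows are made-up values (not taken from the housing dataset), and `nan_euclidean_distances` is used because it is the NaN-aware Euclidean metric that KNNImputer uses by default.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances
from sklearn.preprocessing import StandardScaler

# Toy rows (made-up values): [total_rooms, median_income].
X = np.array([
    [1000.0, 2.0],  # row 0: reference row
    [1050.0, 9.0],  # row 1: similar rooms, very different income
    [1400.0, 2.1],  # row 2: somewhat different rooms, similar income
    [3000.0, 5.0],  # row 3
    [5000.0, 7.0],  # row 4
])

raw = nan_euclidean_distances(X)
scaled = nan_euclidean_distances(StandardScaler().fit_transform(X))

# Position 0 of argsort is the row itself (distance 0), so take position 1.
nearest_raw = int(np.argsort(raw[0])[1])
nearest_scaled = int(np.argsort(scaled[0])[1])

print("nearest to row 0 without scaling:", nearest_raw)
print("nearest to row 0 with scaling:", nearest_scaled)
```

Without scaling, `total_rooms` dominates the distance and row 1 looks closest to row 0 despite the large income gap; after scaling, both columns contribute comparably and row 2 becomes the nearest neighbour.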


With all the variables now scaled, we apply KNNImputer to impute the missing values in the 'total_bedrooms' column.

In [7]:
X_train_imputed_a = KNNImputer(n_neighbors=k_value).set_output(transform="pandas").fit_transform(X_train_num_scaled)

print(X_train_imputed_a.isna().any().any())  # check that no missing values remain
X_train_imputed_a

False


Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
12655,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,-0.941350,1.347438,0.027564,0.584777,0.635123,0.732602,0.556286,-0.893647
15502,-0.887683,-0.68391,-0.011006,-0.354889,2.602693,1.171782,-1.192440,-1.722018,1.261467,0.775677,0.533612,0.721318,1.292168
2908,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,0.267581,-0.125972,1.220460,-0.469773,-0.545045,-0.674675,-0.524407,-0.525434
14053,-0.887683,-0.68391,-0.011006,-0.354889,2.602693,1.221738,-1.351474,-0.370069,-0.348652,-0.038567,-0.467617,-0.037297,-0.865929
20496,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,0.437431,-0.635818,-0.131489,0.427179,0.269198,0.374060,0.220898,0.325752
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15174,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,1.251711,-1.220505,-1.165333,1.890456,1.686854,0.543471,1.341519,0.637374
12661,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,-0.921368,1.342761,-1.085806,2.468471,2.149712,3.002174,2.451492,-0.557509
19263,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,-1.570794,1.310018,1.538566,-0.895802,-0.894007,-0.862013,-0.865118,-0.365475
19140,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,-1.560803,1.249211,-1.165333,0.249005,0.109257,-0.189747,0.010616,0.168261


In [8]:
X_train_imputed_a.loc[null_rows_idx].head()  # show the rows that previously had missing values

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
1606,-0.887683,-0.68391,-0.011006,2.817783,-0.384217,-1.251077,1.048079,-0.211016,0.151734,0.080575,-0.533051,0.343342,-0.494985
10915,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,0.852065,-0.89308,1.299986,-0.167671,-0.046215,0.493276,0.005292,-0.239693
19150,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,-1.560803,1.267921,-1.165333,-0.144756,-0.238985,-0.417421,-0.266212,-0.049654
4186,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,0.672224,-0.705981,1.538566,-0.614744,-0.577098,-0.524088,-0.540378,0.216926
16885,-0.887683,-0.68391,-0.011006,-0.354889,2.602693,-1.410935,0.907754,-0.211016,0.307929,-0.165944,-0.246217,-0.045282,1.303035


## Building a pipeline

In [9]:
transformer1 = make_column_transformer(
    (OneHotEncoder(), make_column_selector(dtype_include=object)),
    remainder='passthrough'
)
# Build a pipeline with the previous transformations followed by the imputation
pipeline = make_pipeline(transformer1, StandardScaler(), KNNImputer(n_neighbors=k_value))

X_train_imputed_b = pipeline.fit_transform(X_train)  # apply the pipeline to the training data

## Checking the solution

We can also convert the result back to a DataFrame and compare the output of the two approaches.

In [10]:
X_train_imputed_b = pd.DataFrame(  # convert the result to a DataFrame
    X_train_imputed_b,
    columns=transformer1.get_feature_names_out(), index=X_train.index)

X_train_imputed_b


Unnamed: 0,onehotencoder__ocean_proximity_<1H OCEAN,onehotencoder__ocean_proximity_INLAND,onehotencoder__ocean_proximity_ISLAND,onehotencoder__ocean_proximity_NEAR BAY,onehotencoder__ocean_proximity_NEAR OCEAN,remainder__longitude,remainder__latitude,remainder__housing_median_age,remainder__total_rooms,remainder__total_bedrooms,remainder__population,remainder__households,remainder__median_income
12655,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,-0.941350,1.347438,0.027564,0.584777,0.635123,0.732602,0.556286,-0.893647
15502,-0.887683,-0.68391,-0.011006,-0.354889,2.602693,1.171782,-1.192440,-1.722018,1.261467,0.775677,0.533612,0.721318,1.292168
2908,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,0.267581,-0.125972,1.220460,-0.469773,-0.545045,-0.674675,-0.524407,-0.525434
14053,-0.887683,-0.68391,-0.011006,-0.354889,2.602693,1.221738,-1.351474,-0.370069,-0.348652,-0.038567,-0.467617,-0.037297,-0.865929
20496,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,0.437431,-0.635818,-0.131489,0.427179,0.269198,0.374060,0.220898,0.325752
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15174,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,1.251711,-1.220505,-1.165333,1.890456,1.686854,0.543471,1.341519,0.637374
12661,-0.887683,1.46218,-0.011006,-0.354889,-0.384217,-0.921368,1.342761,-1.085806,2.468471,2.149712,3.002174,2.451492,-0.557509
19263,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,-1.570794,1.310018,1.538566,-0.895802,-0.894007,-0.862013,-0.865118,-0.365475
19140,1.126529,-0.68391,-0.011006,-0.354889,-0.384217,-1.560803,1.249211,-1.165333,0.249005,0.109257,-0.189747,0.010616,0.168261


In [11]:
print("Any missing values?", X_train_imputed_b.isna().any().any())  # check that no missing values remain

# Check that X_train_imputed_a and X_train_imputed_b are equal
print("Do the step-by-step and pipeline results match?", (X_train_imputed_a.values.round(8) == X_train_imputed_b.values.round(8)).all())

Any missing values? False
Do the step-by-step and pipeline results match? True
