*The exercises are based on the housing dataset "housing.csv".*

### Exercise 4

Try to create a custom transformer that trains a nearest-neighbors regressor (`sklearn.neighbors.KNeighborsRegressor`) in its `fit()` method and outputs the model's predictions in its `transform()` method.

Next, **add this feature to the preprocessing pipeline**, using latitude and longitude as the inputs to this transformer.

This adds a feature to the model that corresponds to the median house price of the nearest districts.
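
As a starting point, the literal version of the exercise can be sketched as follows. This is only an illustration: the class name `KNNPriceFeature`, its `n_neighbors` parameter and the synthetic coordinates and prices are assumptions made for the example, not part of the dataset or of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.validation import check_is_fitted


class KNNPriceFeature(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: KNN regressor predictions as a single feature."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # train the regressor on the coordinates and the target prices
        self.knn_ = KNeighborsRegressor(n_neighbors=self.n_neighbors)
        self.knn_.fit(X, y)
        return self

    def transform(self, X):
        check_is_fitted(self)
        # one output column: the average target of the nearest training districts
        return self.knn_.predict(X).reshape(-1, 1)


# synthetic (latitude, longitude) pairs and prices, for illustration only
rng = np.random.default_rng(0)
coords = rng.uniform(low=[32.0, -124.0], high=[42.0, -114.0], size=(100, 2))
prices = rng.uniform(50_000, 500_000, size=100)

feature = KNNPriceFeature(n_neighbors=3).fit(coords, prices).transform(coords)
print(feature.shape)  # one row per district, one column
```

Returning a 2D array with a single column lets the result be stacked next to the other features by a `ColumnTransformer`.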

#### Step 1: Preparing the data

In [3]:
import requests
import tarfile

URL = "https://mymldatasets.s3.eu-de.cloud-object-storage.appdomain.cloud/housing.tgz"
PATH = "housing.tgz"

def getData(url=URL, path=PATH):
  r = requests.get(url)
  with open(path, 'wb') as f:
    f.write(r.content)
  housing_tgz = tarfile.open(path)
  housing_tgz.extractall()
  housing_tgz.close()

getData()

In [4]:
import pandas as pd

data = pd.read_csv('housing.csv')
data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [6]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2) # 20% for testing and 80% for training

len(data), len(train), len(test)

(20640, 16512, 4128)

#### Step 2: Handling missing values

In [7]:
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)

In [8]:
train_data, y_train = train.drop(['median_house_value'], axis=1), train['median_house_value'].copy() # drop the target column from the training set
test_data, y_test = test.drop(['median_house_value'], axis=1), test['median_house_value'].copy() # drop the target column from the test set

# split the variables into numerical and categorical
train_num = train_data.drop(['ocean_proximity'], axis=1)
train_cat = train_data[['ocean_proximity']]

In [9]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median") # define the imputer
imputer.fit(train_num) # compute the medians

# median of each feature or column (8)
imputer.statistics_ # computed values

array([-118.48  ,   34.24  ,   29.    , 2125.    ,  434.    , 1165.    ,
        409.    ,    3.5275])

In [10]:
# call transform so that the missing values of each feature are replaced

X_train_num = imputer.transform(train_num) # replace missing values with the median

X_train_num

array([[-1.2109e+02,  3.9480e+01,  2.5000e+01, ...,  8.4500e+02,
         3.3000e+02,  1.5603e+00],
       [-1.2252e+02,  3.7880e+01,  5.2000e+01, ...,  8.8000e+01,
         4.6000e+01,  3.8068e+00],
       [-1.2196e+02,  3.8360e+01,  1.1000e+01, ...,  1.7720e+03,
         6.9400e+02,  2.7434e+00],
       ...,
       [-1.1859e+02,  3.4210e+01,  2.6000e+01, ...,  1.9860e+03,
         6.4500e+02,  2.9974e+00],
       [-1.2272e+02,  3.8420e+01,  2.6000e+01, ...,  9.3700e+02,
         2.4800e+02,  1.9458e+00],
       [-1.1836e+02,  3.3920e+01,  2.6000e+01, ...,  2.3080e+03,
         1.0090e+03,  2.6667e+00]])

#### Step 3: Preprocessing

In [11]:
## PREPROCESSING OF THE NUMERICAL VARIABLES (scaling)

from sklearn.preprocessing import StandardScaler # min-max scaling is also available

scaler = StandardScaler() # mean and std
scaler.fit(X_train_num)
X_train_num_scaled = scaler.transform(X_train_num)
X_train_num_scaled

array([[-0.76558832,  1.81138192, -0.28940083, ..., -0.52972113,
        -0.44530818, -1.21220839],
       [-1.47930356,  1.06129734,  1.85122327, ..., -1.22414766,
        -1.19282265, -0.02730433],
       [-1.19980668,  1.28632271, -1.39935407, ...,  0.32065323,
         0.51277374, -0.5881888 ],
       ...,
       [ 0.48216561, -0.65920917, -0.21011846, ...,  0.51696403,
         0.38380118, -0.4542179 ],
       [-1.57912388,  1.31445089, -0.21011846, ..., -0.44532583,
        -0.66113982, -1.00887853],
       [ 0.59695897, -0.795162  , -0.21011846, ...,  0.81234758,
         1.3418831 , -0.6286438 ]])

In [12]:
## PREPROCESSING OF THE CATEGORICAL VARIABLES (OneHotEncoding)

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
X_train_cat = cat_encoder.fit_transform(train_cat)
X_train_cat

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [13]:
X_train_cat.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])
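
To see how the columns of the one-hot matrix map back to categories, here is a small sketch on a made-up column (only three categories are used, as an assumption for brevity). The encoder sorts the categories alphabetically and stores them in `categories_`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy categorical column (illustrative values only)
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy)  # sparse matrix, one column per category

print(encoder.categories_[0])  # column order: INLAND, ISLAND, NEAR BAY
print(encoded.toarray())
```

Each row contains a single 1, in the column of its category, which is why the sparse representation stores only one element per row.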

#### Step 4: Pipelines

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## >> CREATION
# initialize the pipeline with a list of the steps to run

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), # 1- imputer (fill in missing values)
        ('std_scaler', StandardScaler()), # 2- standardization (scaling)
    ])

##  >> SPLIT
# split the attributes into numerical and categorical

num_attribs = list(train_num) # numerical attributes
cat_attribs = ["ocean_proximity"] # categorical attributes

##  >> APPLICATION
# different pipelines can be applied to different columns

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs), # apply "num_pipeline" to the numerical columns
    ("cat", OneHotEncoder(), cat_attribs), # apply OneHotEncoder to the categorical columns
])
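
The following sketch applies the same kind of `ColumnTransformer` to a tiny made-up DataFrame, to show how the outputs of the numerical and categorical branches are concatenated. The column values and the name `preprocess` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame with one missing value and one categorical column (illustrative)
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", num_pipeline, ["total_bedrooms", "median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

X = preprocess.fit_transform(toy)
print(X.shape)  # 2 scaled numerical columns + 2 one-hot columns
```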

### WHAT'S NEW IN THIS EXERCISE

Rather than restricting ourselves to KNN regressors, we will create a **transformer** that accepts any regressor.

To do this, we can extend `MetaEstimatorMixin` and take a required `estimator` argument in the constructor.

The `fit()` method must work on a clone of this estimator, and it must also save `feature_names_in_`.

`MetaEstimatorMixin` flags the estimator as a required parameter, and the `get_params()` and `set_params()` methods inherited from `BaseEstimator` expose the estimator's hyperparameters so that they are available for tuning.

Finally, we create a `get_feature_names_out()` method: the name of the output column is ...
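
To illustrate how the wrapped estimator's hyperparameters become reachable for tuning, here is a minimal sketch. The class `Wrapper` is a hypothetical stand-in that only stores the estimator; the nested `estimator__<name>` keys come from `BaseEstimator.get_params(deep=True)`.

```python
from sklearn.base import BaseEstimator
from sklearn.neighbors import KNeighborsRegressor


class Wrapper(BaseEstimator):
    """Hypothetical meta-estimator that only stores another estimator."""

    def __init__(self, estimator):
        self.estimator = estimator


w = Wrapper(KNeighborsRegressor(n_neighbors=3))

# deep=True also lists the parameters of the wrapped estimator
params = w.get_params(deep=True)
print("estimator__n_neighbors" in params)  # True

# the same nested key can be used to change the inner hyperparameter
w.set_params(estimator__n_neighbors=7)
print(w.estimator.n_neighbors)  # 7
```

These nested keys are what a grid or randomized search would use to tune the regressor inside the transformer.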

In [15]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.base import BaseEstimator, MetaEstimatorMixin, TransformerMixin, clone
from sklearn.utils.validation import check_is_fitted

class FeatureFromRegressor(MetaEstimatorMixin, BaseEstimator, TransformerMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # fit a clone so the estimator passed by the user is left untouched
        estimator_ = clone(self.estimator)
        estimator_.fit(X, y)
        self.estimator_ = estimator_
        self.n_features_in_ = self.estimator_.n_features_in_
        # read the feature names from the fitted clone, not the unfitted original
        if hasattr(self.estimator_, "feature_names_in_"):
            self.feature_names_in_ = self.estimator_.feature_names_in_
        return self  # always return self!
    
    def transform(self, X):
        check_is_fitted(self)
        predictions = self.estimator_.predict(X)
        # make sure the output is 2D: one column per regression target
        if predictions.ndim == 1:
            predictions = predictions.reshape(-1, 1)
        return predictions

    def get_feature_names_out(self, names=None):
        check_is_fitted(self)
        n_outputs = getattr(self.estimator_, "n_outputs_", 1)
        estimator_class_name = self.estimator_.__class__.__name__
        estimator_short_name = estimator_class_name.lower().replace("_", "")
        return [f"{estimator_short_name}_prediction_{i}"
                for i in range(n_outputs)]

In [16]:
from sklearn.utils.estimator_checks import check_estimator

check_estimator(FeatureFromRegressor(KNeighborsRegressor()))

In [17]:
knn_reg = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn_transformer = FeatureFromRegressor(knn_reg)
geo_features = train_data[["latitude", "longitude"]] # train_data and y_train were defined in Step 2
knn_transformer.fit_transform(geo_features, y_train)

In [18]:
knn_transformer.get_feature_names_out()

Now we include it in the pipeline:

In [19]:
from sklearn.base import clone

# clone the transformers of the preprocessing ColumnTransformer defined in Step 4
transformers = [(name, clone(transformer), columns)
                for name, transformer, columns in full_pipeline.transformers]
# full_pipeline has no "geo" entry to replace, so the new feature is appended
transformers.append(("geo", knn_transformer, ["latitude", "longitude"]))

new_geo_preprocessing = ColumnTransformer(transformers)

In [20]:
from sklearn.svm import SVR

# rnd_search is not defined in this notebook: it is assumed to be the fitted
# hyperparameter search over an SVR pipeline from an earlier exercise
new_geo_pipeline = Pipeline([
    ('preprocessing', new_geo_preprocessing),
    ('svr', SVR(C=rnd_search.best_params_["svr__C"],
                gamma=rnd_search.best_params_["svr__gamma"],
                kernel=rnd_search.best_params_["svr__kernel"])),
])

In [21]:
from sklearn.model_selection import cross_val_score

# the scorer returns negative RMSE values, hence the leading minus sign
new_pipe_rmses = -cross_val_score(new_geo_pipeline,
                                  train_data.iloc[:5000],
                                  y_train.iloc[:5000],
                                  scoring="neg_root_mean_squared_error",
                                  cv=3)
pd.Series(new_pipe_rmses).describe()

Yes, that's terrible! Apparently, the cluster similarity features were much better. But perhaps we should tune the hyperparameters of `KNeighborsRegressor`? That is what the next exercise is about.