### Pipelines

Un Pipeline (o tubería) es un objeto que encadena (o "pasa por una tubería") diversas fases de preprocesamiento y un estimador final. Por ejemplo:

* Transformaciones iniciales (ej. imputación de valores faltantes, escalado de datos, selección de características, etc.).
* Modelo final (ej. una regresión lineal, un clasificador random forest, un SVM, etc.).

Al usar un Pipeline, estas fases se integran en un solo objeto que se entrena y se evalúa de forma conjunta. 

* Ventajas:
    * Se asegura que todas las transformaciones se realicen siempre de la misma forma en entrenamiento y en predicción.
    * Se reduce el riesgo de fugas de información (data leakage).
    * Se simplifica el código y se puede integrar fácilmente con las rutinas de búsqueda de hiperparámetros (e.g. GridSearchCV o RandomizedSearchCV) y validación cruzada.
    
- Función python vs pipeline:
    * Gestionar manualmente el particionado evitando la fuga de datos
    * Aprovecha el polimorfismo ya que todos los preprocesadores de scikit heredan de una clase en común: TransformerMixin por tanto tienen unos métodos comunes: fit, transform y fit_transform
    * Facilita la exportación para usar en producción porque exporta un objeto con todos los preprocesados y modelado incluido
    * Facilita la composición de pasos de forma muy simplificada


* Objetivo:
    * crear un pipeline que tenga preprocesados y modelo y exportarlo. De esta forma si lo cargamos en otro entorno podemos pedirle predicciones sin tener que limpiar / preprocesar los datos, ya hace el propio pipeline.

* Ámbito: los pipelines están diseñados para transformar la X, es decir, los datos de entrada a través de pasos.
    * Cuando se ejecutan los métoso fit, predict de papelines no se aplican transformaciones a la "y", solo a la "X"
    * Si se quiere modificar la "y" se puede hacer antesde entrenar el pipeline

Clases y métodos de scikit learn:

* Pipeline: 
    * Permite encadenar una secuencia de transformadores y un estimador final.
* make_pipeline: 
    * función para crear un objeto Pipeline sin necesidad de asignar manualmente un nombre a cada paso.

* ColumnTransformer: 
    * Permite aplicar diferentes transformaciones a subconjuntos específicos (por ejemplo, columnas) de un conjunto de datos. Útil para trabajar con datos tabulares que contienen variables de distintos tipos
    * El conjunto de datos completo, pero se especifican columnas específicas para cada transformador.
    * Por ejemplo combinar MinMaxScaler con OneHotEncoding

* FeatureUnion: 
    * Entrada única para todos: Aplica cada transformador de la unión a la misma matriz de entrada completa. Por ejemplo combinar PCA y SelectKBest
* make_union: 
    * Función de ayuda para crear una FeatureUnion de forma automática, similar a make_pipeline


In [1]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
import seaborn as sns 
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline

df = sns.load_dataset('penguins')
df = df.dropna(subset=['body_mass_g']) # Quitar nulos de la salida "y" porque es la variable a predecir

X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [2]:
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LinearRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(pipeline.named_steps)
print(pipeline.named_steps['imputer'])
print(pipeline.named_steps['model'])

{'imputer': SimpleImputer(strategy='median'), 'model': LinearRegression()}
SimpleImputer(strategy='median')
LinearRegression()


In [3]:
X

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm
0,39.1,18.7,181.0
1,39.5,17.4,186.0
2,40.3,18.0,195.0
4,36.7,19.3,193.0
5,39.3,20.6,190.0
...,...,...,...
338,47.2,13.7,214.0
340,46.8,14.3,215.0
341,50.4,15.7,222.0
342,45.2,14.8,212.0


In [4]:
X_new = pd.DataFrame([[39.1, np.nan, 181.0]], columns=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'])
pipeline.predict(X_new)

array([3209.64419227])

In [5]:
# Alternativa con make_pipeline



pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    LinearRegression()
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(pipeline.named_steps)
print(pipeline.named_steps['simpleimputer'])
print(pipeline.named_steps['linearregression'])

{'simpleimputer': SimpleImputer(strategy='median'), 'linearregression': LinearRegression()}
SimpleImputer(strategy='median')
LinearRegression()


### Pipeline con GridSearchCV

Notación especial "__" para separar los pasos de los parámetros (doble barra baja).

In [6]:
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ("transformer", PowerTransformer()),
    ('scaler', MinMaxScaler()),
    ('model', KNeighborsRegressor())
])

params = {
    'imputer__strategy': ['mean', 'median'],
    "transformer__method": ["yeo-jhonson", "box-cox"],
    'scaler__feature_range': [(0, 1), (0, 2)],
    'model__n_neighbors': np.arange(3,20)
}

grid = GridSearchCV(pipeline, params, scoring='r2')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print('r2_score', r2_score(y_test, y_pred))
print('grid best params', grid.best_params_)

r2_score 0.815285482180468
grid best params {'imputer__strategy': 'mean', 'model__n_neighbors': np.int64(11), 'scaler__feature_range': (0, 1), 'transformer__method': 'box-cox'}


340 fits failed out of a total of 680.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
340 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\carme\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\carme\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\carme\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\pipeline.py", line 654, in fit
    Xt = self._fit(X, y, routed_params, ra

In [7]:
X_new = pd.DataFrame([[39.1, np.nan, 181.0]], columns=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'])
grid.predict(X_new)

array([3461.36363636])

In [8]:
# Desactivando pasos
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('transformer', PowerTransformer()), # En este ejemplo lo hacemos opcional
    ('scaler', MinMaxScaler()), # En este ejemplo lo hacemos opcional   
    ('model', KNeighborsRegressor())
])
params = {
    'imputer__strategy': ['mean', 'median'],
    'transformer': [None, PowerTransformer(method='yeo-johnson'), PowerTransformer(method='box-cox')],
    'scaler': [None, MinMaxScaler(feature_range=(0, 1)), MinMaxScaler(feature_range=(0, 2))],
    'model__n_neighbors': np.arange(3, 20)
}
grid = GridSearchCV(pipeline, params, scoring='r2')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print('r2_score:', r2_score(y_test, y_pred))
print('grid best params:', grid.best_params_)

r2_score: 0.8253040480659294
grid best params: {'imputer__strategy': 'mean', 'model__n_neighbors': np.int64(18), 'scaler': MinMaxScaler(), 'transformer': None}


In [9]:
# Probando varios modelos
from sklearn.tree import DecisionTreeRegressor


pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('transformer', PowerTransformer()),
    ('scaler', MinMaxScaler()),   
    ('model', "placeholder") # Se reemplaza con cada modelo en la búsqueda
])
params = [
    # KNN
    {
        'imputer__strategy': ['mean', 'median'],
        "transformer__method": ["yeo-jhonson", "box-cox"],
        'scaler__feature_range': [(0, 1), (0, 2)],
        "model": [KNeighborsRegressor()],
        'model__n_neighbors': np.arange(3, 20)
    },
    # Decision Tree
    {
        'imputer__strategy': ['mean', 'median'],
        "transformer__method": ["yeo-jhonson", "box-cox"],
        'scaler__feature_range': [(0, 1), (0, 2)],
        "model": [DecisionTreeRegressor()],
        'model__max_depth': [None, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }
]

grid = GridSearchCV(pipeline, params, scoring='r2', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print('r2_score:', r2_score(y_test, y_pred))
print('grid best params:', grid.best_params_)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
r2_score: 0.815285482180468
grid best params: {'imputer__strategy': 'mean', 'model': KNeighborsRegressor(), 'model__n_neighbors': np.int64(11), 'scaler__feature_range': (0, 1), 'transformer__method': 'box-cox'}


540 fits failed out of a total of 1080.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
347 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\carme\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\carme\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\carme\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\pipeline.py", line 654, in fit
    Xt = self._fit(X, y, routed_params, r

### FunctionTransformer

Usos de FunctionTransforme para crear funciones personalizdas que usar en el pipeline.

In [10]:


def log_transform(X):
    return np.log(X)

pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('log', FunctionTransformer(log_transform)), 
    ("scaler", MinMaxScaler()),  
    ('model', KNeighborsRegressor())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(pipeline.named_steps)
print(r2_score(y_test, y_pred))


{'imputer': SimpleImputer(), 'log': FunctionTransformer(func=<function log_transform at 0x0000028730174CC0>), 'scaler': MinMaxScaler(), 'model': KNeighborsRegressor()}
0.8134498329740103


### ColumnTransformer

Separar y combinar pipelines para hacer distintos tratamientos diferentes columnas

In [11]:
df = sns.load_dataset('penguins')
df = df.dropna(subset=['body_mass_g']) # Quitar nulos de la salida "y" porque es la variable a predecir

X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder

# pipeline numéricas
numerical_cols = X_train.select_dtypes(include=[np.number]).columns
pipeline_numerical = Pipeline([
    ('imputer', KNNImputer(n_neighbors=7)),
    ("scaler", MinMaxScaler()),
])

# pipeline categóricas
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns
pipeline_categorical = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(sparse_output=False))
])

# unir papelines con ColumnTransformer
pipeline_all = ColumnTransformer([
    ("numeric", pipeline_numerical, numerical_cols),
    ('categorical', pipeline_categorical, categorical_cols)
])

# papeline final con el modelo
pipeline = make_pipeline(
    pipeline_all,
    KNeighborsRegressor(n_neighbors=7)
)
pipeline

In [13]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(r2_score(y_test, y_pred))

0.8212291991117089


In [14]:
# remainder 'drop' (por defecto)
from sklearn.preprocessing import StandardScaler

pipeline = ColumnTransformer([
        ('numeric', StandardScaler(), ['bill_length_mm', 'bill_depth_mm']),
        ('categorical', OneHotEncoder(), ['species', 'island']),
    ], remainder='drop'
)

# 'flipper_length_mm' y 'sex' han sido eliminadas y no se han procesado
pd.DataFrame(pipeline.fit_transform(X_train, y_train)).head()

ValueError: A given column is not a column of the dataframe

In [None]:
# remainder "passthrough"
from sklearn.preprocessing import StandardScaler

pipeline = ColumnTransformer([
        ('numeric', StandardScaler(), ['bill_length_mm', 'bill_depth_mm']),
        ('categorical', OneHotEncoder(), ['species', 'island']),
    ], remainder='passthrough'
)

# 'flipper_length_mm' y 'sex' se mantienen, pero no se han procesado, simplemente se agregan al resultado final
pd.DataFrame(pipeline.fit_transform(X_train, y_train)).head()

ValueError: A given column is not a column of the dataframe

In [26]:
# remainder con un preprocesador:
from sklearn.preprocessing import StandardScaler
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


pipeline_numerical1 = Pipeline([
    ('imputer', KNNImputer(n_neighbors=7)),
    ('scaler', StandardScaler())
])

pipeline_numerical2 = Pipeline([
    ('imputer', KNNImputer(n_neighbors=7)),
    ('scaler', MinMaxScaler())
])

pipeline = ColumnTransformer([
        ('numeric', pipeline_numerical1, ['bill_length_mm', 'bill_depth_mm']),
    ], remainder=pipeline_numerical2
)
pipeline

In [27]:
# 'bill_length_mm', 'bill_depth_mm' se les aplica StandardScaler, y a 'flipper_length_mm'  se aplica MinMaxScaler
pd.DataFrame(pipeline.fit_transform(X_train, y_train)).head()

Unnamed: 0,0,1,2
0,-0.240411,0.603547,0.40678
1,-1.795331,0.502493,0.355932
2,-1.270998,-0.305938,0.220339
3,1.350671,-0.406992,0.983051
4,1.224108,0.098278,0.949153


### Transformador personalizado

Para crear transformadores preprocesadores personalizados podemos crear una clase Python que herede de BaseEstimator, TransformerMixin

In [36]:
from sklearn.base import BaseEstimator, TransformerMixin

# Transformador personalizado para imprimir los datos e inspeccionarlos despuçes de cada paso de un pipeline
class Debugger(BaseEstimator, TransformerMixin):
    
    def __init__(self, title, show_shape=True):
        self.title = title
        self.show_shape = show_shape
        
    def fit(self, X, y=None):
        # Normalmente aquí se aprende o se calculan parámetros a partir de datos de entrada
        print(f"Ejecutanto Debugger {self.title}")
        if self.show_shape:
            print(f"Shape de X: {X.shape}")
            print(f"X sample: {X[:1]}\n")
            
        return self # Devuelve la instania Debugger para encadenar el Pipeline
            
        
    def transform(self, X):
        X_copy = X.copy()
        # Aquí  haríamos transformaciones sobre X_copy
        if self.show_shape:
            print(f"Shape de X: {X.shape}")
            print(f"X sample: {X_copy[:1]}\n")
        return X_copy
    
    def fit_transform(self, X, y = None):
        self.fit(X,y)
        return X

In [None]:
from sklearn.preprocessing import PowerTransformer
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


pipeline = Pipeline([
    ("debug1", Debugger(title= "Datos X sin procesar")),
    
    ('imputer', SimpleImputer()),
    ("debug2", Debugger(title= "Datos X tras SimpleImputer")),
    
    ('log', PowerTransformer()), 
    ("debug3", Debugger(title= "Datos X tras PowerTransformer")),
    
    ("scaler", MinMaxScaler()),
    ("debug4", Debugger(title= "Datos X tras MinMaxScaler")),
      
    ('model', KNeighborsRegressor()) # modelo predictor
])

pipeline.fit(X_train, y_train) # Resultado: un modelo entrenado condatos preprocesados
#y_pred = pipeline.predict(X_test)

#print(r2_score(y_test, y_pred))

Ejecutanto Debugger Datos X sin procesar
Shape de X: (273, 3)
X sample:      bill_length_mm  bill_depth_mm  flipper_length_mm
115            42.7           18.3              196.0

Ejecutanto Debugger Datos X tras SimpleImputer
Shape de X: (273, 3)
X sample: [[ 42.7  18.3 196. ]]

Ejecutanto Debugger Datos X tras PowerTransformer
Shape de X: (273, 3)
X sample: [[-0.22916448  0.58947392 -0.30358024]]

Ejecutanto Debugger Datos X tras MinMaxScaler
Shape de X: (273, 3)
X sample: [[0.39914212 0.59376449 0.50336414]]



In [38]:
pipeline.predict(X_train)

Shape de X: (273, 3)
X sample:      bill_length_mm  bill_depth_mm  flipper_length_mm
115            42.7           18.3              196.0

Shape de X: (273, 3)
X sample: [[ 42.7  18.3 196. ]]

Shape de X: (273, 3)
X sample: [[-0.22916448  0.58947392 -0.30358024]]

Shape de X: (273, 3)
X sample: [[0.39914212 0.59376449 0.50336414]]



array([4205., 3455., 3340., 5620., 5500., 5550., 3225., 5540., 3780.,
       4655., 4030., 3215., 5520., 4190., 5410., 3370., 3320., 3600.,
       4750., 3070., 3830., 5450., 4715., 4670., 4475., 3720., 4220.,
       3560., 4065., 3345., 3715., 4060., 3650., 4420., 3555., 5025.,
       4995., 3680., 5180., 4405., 3730., 3310., 3460., 3580., 3720.,
       3865., 4740., 4880., 3300., 3670., 3800., 5400., 4845., 4070.,
       4620., 3565., 4640., 4850., 3600., 3615., 5190., 3365., 3640.,
       5460., 4210., 3680., 3695., 5730., 3870., 4320., 3115., 3330.,
       4050., 3425., 3850., 4090., 3985., 3330., 5490., 5240., 3240.,
       3605., 3820., 3520., 5050., 4600., 5120., 3940., 5640., 5185.,
       4805., 3710., 5090., 3400., 4050., 5620., 4690., 3775., 4190.,
       4320., 4350., 3320., 5315., 5185., 5110., 5150., 4005., 4070.,
       4800., 3745., 4285., 3790., 3355., 3770., 5025., 3870., 5225.,
       5560., 3190., 4005., 4695., 3475., 3430., 5550., 3505., 3855.,
       4090., 3470.,

### Transformador personalizado para outliers

In [None]:
class OutlierRemover(BaseEstimator, TransformerMixin):
    
    def __init__(self, factor=1.5):
        self.factor = factor # factor para calcular umbrales inferior y superior (método tukey)
        
    def fit(self, X, y=None):
        # if not isinstance(X, pd.DataFrame):
        #     X = pd.DataFrame(X)
            
        # self.numerical_cols_ = X.select_dtypes(include=[np.number]).columns
        
        # Q1 = X.quantile(0.25)
        # Q3 = X.quantile(0.75)
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        IQR = Q3 - Q1
        
        # cálculo de límites
        self.lower_bound_ = Q1 - self.factor * IQR
        self.upper_bound_ = Q3 + self.factor * IQR
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        #filtro = ~((X_copy < self.lower_bound_) | (X_copy > self.upper_bound_)).any(axis=1)
        filtro = np.all((X_copy >= self.lower_bound_) & (X <= self.upper_bound_), axis=1)
        return X_copy[filtro]
  

In [53]:
# Comprobar que funciona usando un factor pequeño para verificar que elimina outliers:
remover = OutlierRemover(factor=0.4)
remover.fit_transform(X_train).shape

(187, 3)

In [None]:
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


pipeline = make_pipeline(
    SimpleImputer(),
    Debugger(title= "X tras SimpleImputer"),
    
    OutlierRemover(factor=1.4), # demostramos que el facor es editable desde fuera
    Debugger(title= "X tras OutlierRemover"),
    
    PowerTransformer(), 
    Debugger(title= "X tras PowerTransformer"),
    
    MinMaxScaler(),
    Debugger(title= "Datos X tras MinMaxScaler"),
      
    KNeighborsRegressor() # modelo predictor
)

pipeline.fit(X_train, y_train)

Ejecutanto Debugger X tras SimpleImputer
Shape de X: (273, 3)
X sample: [[ 42.7  18.3 196. ]]

Ejecutanto Debugger X tras OutlierRemover
Shape de X: (273, 3)
X sample: [[ 42.7  18.3 196. ]]

Ejecutanto Debugger X tras PowerTransformer
Shape de X: (273, 3)
X sample: [[-0.22916448  0.58947392 -0.30358024]]

Ejecutanto Debugger Datos X tras MinMaxScaler
Shape de X: (273, 3)
X sample: [[0.39914212 0.59376449 0.50336414]]



### Transformador personalizado para crear nuevas features

Ejemplo: para crear una nueva columna en el dataset titanic

sibsp + parch + 1 = family_size

In [17]:
df = sns.load_dataset("titanic").head(1)
X = df.drop("alive", axis=1)
y = df["alive"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


ValueError: With n_samples=1, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

In [63]:
class familySizeFeature(BaseEstimator, TransformerMixin):
    def transform(self, X, y=None):
        X_copy = X.copy()
        X_copy["family_size"] = X_copy["sibsp"] + X_copy["parch"] + 1
        return X_copy
    def fit(self, X, y= None, **fit_params):
        return self

In [None]:
pipeline = make_pipeline(
    familySizeFeature()
)
pipeline.fit_transform(X_train, y_train)

KeyError: 'sibsp'

### Feature Selection en pipelines

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns= data.feature_names)
X = df
y = 

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
pipeline = make_pipeline(
    OutlierToNaN(factor=0.9),
    Debugger(title='X after OutlierRemover'),
    
    SimpleImputer(strategy='median'),
    Debugger(title='X after SimpleImputer'),
    
    SelectKBest(f_classif, k=10),
    Debugger(title='X after SelectKBest'),
        
    PowerTransformer(),
    Debugger(title='X after PowerTransformer'),
    
    MinMaxScaler(), 
    Debugger(title='X after MinMaxScaler'),
    KNeighborsClassifier(), 
)
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)