# Pipelines in sklearn
#### Description
In scikit-learn, a Pipeline lets us store an entire data preprocessing workflow in a single object, up to and including model training.
Pipelines are made up of steps that run sequentially: each intermediate step calls **fit()** and then **transform()**, while the final step calls only **fit()** if it is an estimator.
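
To make the sequential fit/transform behavior concrete, here is a minimal sketch that runs the same steps by hand and through a Pipeline. The toy arrays `X_demo` and `y_demo` and the choice of `LogisticRegression` as the final estimator are assumptions made only for illustration; they do not come from this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 6 rows, 2 features, one missing value.
X_demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0],
                   [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# What the pipeline does, written out by hand.
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
clf = LogisticRegression()

X_step = imputer.fit_transform(X_demo)  # intermediate step: fit, then transform
X_step = scaler.fit_transform(X_step)   # intermediate step: fit, then transform
clf.fit(X_step, y_demo)                 # final estimator: fit only

# At prediction time only transform() is called on the intermediate steps.
manual_pred = clf.predict(scaler.transform(imputer.transform(X_demo)))

# The same sequence wrapped in a Pipeline.
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())])
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

# Both routes should give the same predictions.
print(np.array_equal(manual_pred, pipe_pred))
```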


#### Code
We are going to build a pipeline that wraps all the preprocessing of a DataFrame in a single object. To do this, the columns are split into numeric and categorical, a separate pipeline is built for each type, and the two are joined with the **ColumnTransformer** class.
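
As a small sketch of how **ColumnTransformer** routes columns, the example below applies one transformer to numeric columns and another to a categorical column. The `toy` DataFrame and its values are invented for illustration. Note that columns not listed in any transformer are dropped, since `remainder='drop'` is the default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (values made up): two numeric columns, one categorical, one unused.
toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.28, 7.92, 53.10],
    'sex': ['male', 'female', 'female', 'male'],
    'name': ['a', 'b', 'c', 'd'],  # not listed below, so it is dropped
})

ct = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex'])])

out = ct.fit_transform(toy)

# Two scaled numeric columns plus two one-hot columns for 'sex'.
print(out.shape)  # (4, 4)
```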


In [66]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# lib for testing
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

- We load the Titanic dataset to test the pipeline

In [42]:
df = pd.read_csv('../Data/titanic.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  89 non-null     int64  
 1   Survived     89 non-null     int64  
 2   Pclass       89 non-null     int64  
 3   Name         89 non-null     object 
 4   Sex          89 non-null     object 
 5   Age          73 non-null     float64
 6   SibSp        89 non-null     int64  
 7   Parch        89 non-null     int64  
 8   Ticket       89 non-null     object 
 9   Fare         89 non-null     float64
 10  Cabin        19 non-null     object 
 11  Embarked     89 non-null     object 
dtypes: float64(2), int64(5), object(5)
memory usage: 8.5+ KB


# Pipeline 
I recommend reading the Transformadores_custom notebook to learn how to use non-standard preprocessing (e.g. removing outliers).
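
The Transformadores_custom notebook is not reproduced here, so the following is only a hedged sketch of what a custom transformer can look like. `QuantileClipper` is a hypothetical class written for this example, not something taken from that notebook or from scikit-learn. It clips outliers instead of deleting rows, because a transformer that drops rows would leave X and y with different lengths inside a Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips each column to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn per-column bounds from the training data, ignoring NaNs.
        self.low_ = np.nanquantile(X, self.lower, axis=0)
        self.high_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Apply the learned bounds; the number of rows is unchanged.
        return np.clip(X, self.low_, self.high_)


# Toy column with one extreme value (made up for illustration).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
clipped = QuantileClipper(lower=0.0, upper=0.75).fit_transform(X_demo)
print(clipped.ravel())
```

Because the class inherits from `BaseEstimator` and `TransformerMixin`, it can be used as a step such as `('clipper', QuantileClipper())` in a numeric pipeline.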

In [65]:
X = df.drop('Survived', axis=1)
y = df.Survived


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

- We create 2 lists with the columns of each type (numeric and categorical)

In [67]:
numeric_features = X_train.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns
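
As an alternative to building the column lists by hand, scikit-learn provides `make_column_selector`, which picks columns by dtype when the transformer is fitted. The sketch below uses a made-up `toy` DataFrame; selecting the categorical columns as "everything that is not numeric" is an assumption that suits this example but may not suit every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame (values made up for illustration).
toy = pd.DataFrame({'Age': [22.0, 38.0],
                    'Pclass': [3, 1],
                    'Sex': ['male', 'female']})

select_numeric = make_column_selector(dtype_include=np.number)
select_categorical = make_column_selector(dtype_exclude=np.number)

# Each selector is a callable that returns the matching column names.
print(select_numeric(toy))
print(select_categorical(toy))
```

The selectors can be passed to ColumnTransformer in place of the column lists, for example `('num', numeric_pipe, select_numeric)`.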

- We create 2 pipelines

In [68]:
numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

- We join the pipelines, indicating which columns each one works on

In [69]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipe, numeric_features),
        ('cat', categorical_pipe, categorical_features)])

- We add an estimator that is trained once the preprocessing has finished

In [70]:
rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

- We run the whole pipeline

In [71]:
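# Note: this fits on the full dataset (X, y), so X_test is not unseen data;
# use rf.fit(X_train, y_train) for a held-out evaluation.
rf.fit(X, y)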

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  Index(['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),


- We can predict directly, with the certainty that X_test receives the same treatment as X_train

In [74]:
rf.predict(X_test)

array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1])
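
Since the pipeline above is fitted with `rf.fit(X, y)`, the rows of X_test were also seen during training. A minimal sketch of a held-out evaluation follows; it uses synthetic data from `make_classification` and fixed `random_state` values, both of which are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the Titanic file) so the example is self-contained.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', RandomForestClassifier(random_state=0))])

# Fit only on the training split; the scaler statistics come from X_tr alone.
pipe.fit(X_tr, y_tr)

# score() transforms X_te with the fitted scaler, then reports accuracy.
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```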

- If we are interested, we can inspect the steps by drilling down into them

In [98]:
rf.steps

[('preprocessor',
  ColumnTransformer(transformers=[('num',
                                   Pipeline(steps=[('imputer',
                                                    SimpleImputer(strategy='median')),
                                                   ('scaler', StandardScaler())]),
                                   Index(['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')),
                                  ('cat',
                                   Pipeline(steps=[('imputer',
                                                    SimpleImputer(strategy='most_frequent')),
                                                   ('onehot',
                                                    OneHotEncoder(handle_unknown='ignore'))]),
                                   Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object'))])),
 ('classifier', RandomForestClassifier())]

In [97]:
# mean of each numeric column
rf.steps[0][1].transformers_[0][1].steps[1][1].mean_


array([4.43191011e+02, 2.30337079e+00, 2.85205618e+01, 5.50561798e-01,
       4.15730337e-01, 3.19889517e+01])
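
The chained integer indexes above can be hard to read. As a sketch, the same fitted objects can also be reached by name through `named_steps` and `named_transformers_`. The `toy` DataFrame, `y_toy`, and the variable name `model` are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): one numeric column with a missing value, one categorical.
toy = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 50.0],
                    'Sex': ['male', 'female', 'female', 'male']})
y_toy = [0, 1, 0, 1]

numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipe, ['Age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex'])])
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier(random_state=0))])
model.fit(toy, y_toy)

# Access by name ...
scaler = (model.named_steps['preprocessor']
               .named_transformers_['num']
               .named_steps['scaler'])
# ... and the equivalent access by position.
by_index = model.steps[0][1].transformers_[0][1].steps[1][1]

print(scaler is by_index)
print(scaler.mean_)  # mean of Age after median imputation
```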

In [87]:
# first five estimators (trees) in the RandomForest
rf.steps[1][1].estimators_[:5]

[DecisionTreeClassifier(max_features='auto', random_state=171587506),
 DecisionTreeClassifier(max_features='auto', random_state=1874114238),
 DecisionTreeClassifier(max_features='auto', random_state=178100134),
 DecisionTreeClassifier(max_features='auto', random_state=1725464692),
 DecisionTreeClassifier(max_features='auto', random_state=2070386455)]