# Pipelines in sklearn
#### Description
In scikit-learn, a pipeline lets us store an entire data preprocessing workflow in a single object, up to and including model training.
A pipeline is made up of steps that run sequentially: each intermediate step is called with **fit()** and **transform()**, while the final step, if it is an estimator, is called with **fit()** only.
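
To make the call sequence concrete, here is a minimal sketch on a small synthetic array. The data, the step names (`imputer`, `scaler`, `clf`) and the choice of `LogisticRegression` are assumptions for illustration and are not part of this notebook. It compares a pipeline with the same steps chained by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 6 rows, 2 features, one missing value
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0],
              [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
y = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())])
pipe.fit(X, y)

# The same sequence written by hand: fit + transform for each
# intermediate step, then fit only for the final estimator
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
clf = LogisticRegression()
X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)
clf.fit(X_scaled, y)

# At predict time the intermediate steps only call transform()
manual_pred = clf.predict(scaler.transform(imputer.transform(X)))
pipe_pred = pipe.predict(X)
print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline version keeps the fitted imputer, scaler and classifier together, so there is no risk of forgetting a step when predicting.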


#### Codigo
We will build a pipeline that wraps all the preprocessing of a DataFrame in a single object. To do this, the columns are split into numeric and categorical, a separate pipeline is built for each type, and the two are joined with the **ColumnTransformer** class.


In [12]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# lib for testing
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

- We load the Titanic dataset to test the pipeline

In [13]:
df = pd.read_csv('../Data/titanic.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  89 non-null     int64  
 1   Survived     89 non-null     int64  
 2   Pclass       89 non-null     int64  
 3   Name         89 non-null     object 
 4   Sex          89 non-null     object 
 5   Age          73 non-null     float64
 6   SibSp        89 non-null     int64  
 7   Parch        89 non-null     int64  
 8   Ticket       89 non-null     object 
 9   Fare         89 non-null     float64
 10  Cabin        19 non-null     object 
 11  Embarked     89 non-null     object 
dtypes: float64(2), int64(5), object(5)
memory usage: 8.5+ KB


# Pipeline 
I recommend looking at the Transformadores_custom notebook to learn how to use non-standard preprocessing (e.g. removing outliers).

In [43]:
X = df.drop('Survived', axis=1)
y = df.Survived


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

- We create 2 lists with the columns of each type (numeric and categorical)

In [44]:
numeric_features = X_train.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns
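
As an alternative to building the column lists by hand, scikit-learn provides `make_column_selector`, which picks columns by dtype. The sketch below uses a small hypothetical frame (`toy`) instead of the Titanic data, and treats every non-numeric column as categorical, which is an assumption that may not fit every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame (hypothetical) with the same kinds of dtypes as the Titanic data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'female']})

# A selector is a callable that returns the matching column names
select_numeric = make_column_selector(dtype_include=np.number)
# Assumption: everything that is not numeric is treated as categorical
select_categorical = make_column_selector(dtype_exclude=np.number)

numeric_cols = select_numeric(toy)
categorical_cols = select_categorical(toy)
print(numeric_cols, categorical_cols)
```

A selector can also be passed to `ColumnTransformer` in place of a column list, in which case the columns are resolved when `fit()` is called.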

- We create 2 pipelines

In [45]:
numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
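
The `handle_unknown='ignore'` option matters when new data contains a category that was not present during `fit()`. A minimal sketch with made-up port codes (the arrays `train` and `new` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories seen at fit time
train = np.array([['S'], ['C'], ['Q']], dtype=object)
# 'X' never appeared during fit
new = np.array([['C'], ['X']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The unseen category is encoded as a row of zeros instead of raising
encoded = encoder.transform(new).toarray()
print(encoder.categories_)
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform()` call would raise an error for the unseen value.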

- We join the pipelines, indicating which columns each one works on

In [46]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipe, numeric_features),
        ('cat', categorical_pipe, categorical_features)])

- We add an estimator that is trained once the preprocessing has finished

In [47]:
rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(max_depth=2))])

- We run the whole pipeline

In [48]:
rf.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  Index(['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),


- We can predict directly, knowing that X_test receives the same treatment as the X_train set

In [49]:
rf.predict(X_test)

array([0, 1, 1, 1, 0, 1, 0, 1, 1])
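
"The same treatment" means that the statistics learned from the training data are reused on new data, not recomputed. A small sketch with hypothetical one-column arrays (`X_tr`, `X_te`) shows this for the imputer and the scaler:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test columns, each with a missing value
X_tr = np.array([[1.0], [2.0], [3.0], [np.nan]])
X_te = np.array([[np.nan], [100.0]])

prep = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
prep.fit(X_tr)

# Statistics learned from the training data only
train_median = prep.named_steps['imputer'].statistics_[0]
train_mean = prep.named_steps['scaler'].mean_[0]

# The missing test value is filled with the training median and
# scaled with the training mean, regardless of the test values
out = prep.transform(X_te)
print(train_median, train_mean, out.ravel())
```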

In [50]:
print(classification_report(y_test, rf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.67      0.50      0.57         4
           1       0.67      0.80      0.73         5

    accuracy                           0.67         9
   macro avg       0.67      0.65      0.65         9
weighted avg       0.67      0.67      0.66         9



- If we are interested, we can inspect the steps by drilling down into them

In [51]:
rf.steps

[('preprocessor',
  ColumnTransformer(transformers=[('num',
                                   Pipeline(steps=[('imputer',
                                                    SimpleImputer(strategy='median')),
                                                   ('scaler', StandardScaler())]),
                                   Index(['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')),
                                  ('cat',
                                   Pipeline(steps=[('imputer',
                                                    SimpleImputer(strategy='most_frequent')),
                                                   ('onehot',
                                                    OneHotEncoder(handle_unknown='ignore'))]),
                                   Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object'))])),
 ('classifier', RandomForestClassifier(max_depth=2))]

In [52]:
# mean of each numeric column
rf.steps[0][1].transformers_[0][1].steps[1][1].mean_

array([4.5410000e+02, 2.3375000e+00, 2.7566625e+01, 5.2500000e-01,
       3.3750000e-01, 3.0026980e+01])
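
Positional indexing works, but it is hard to read. The same object can be reached by name through `named_steps` and `named_transformers_`. The sketch below rebuilds a reduced version of the pipeline on a hypothetical four-row frame (`toy`, `target` and `model` are illustrative names) so that it runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with Titanic-like column names
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Sex': ['male', 'female', 'female', 'male']})
target = [0, 1, 1, 0]

model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())]), ['Age', 'Fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex'])])),
    ('classifier', RandomForestClassifier(n_estimators=5, random_state=0))])
model.fit(toy, target)

# Access by position, as in the cell above
by_position = model.steps[0][1].transformers_[0][1].steps[1][1]
# Access by name: same fitted scaler, easier to follow
by_name = (model.named_steps['preprocessor']
           .named_transformers_['num']
           .named_steps['scaler'])
print(by_name is by_position)
print(by_name.mean_)
```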

In [53]:
# estimators in the RandomForest
rf.steps[1][1].estimators_[:5]

[DecisionTreeClassifier(max_depth=2, max_features='auto',
                        random_state=1180566417),
 DecisionTreeClassifier(max_depth=2, max_features='auto', random_state=272057628),
 DecisionTreeClassifier(max_depth=2, max_features='auto',
                        random_state=1145996662),
 DecisionTreeClassifier(max_depth=2, max_features='auto', random_state=272055954),
 DecisionTreeClassifier(max_depth=2, max_features='auto',
                        random_state=1389285843)]

#### Hyperparameter optimization
- We can optimize any parameter/hyperparameter in the pipeline by addressing it through the **step** names, joined with a double underscore (`__`)

In [54]:
params = {
    'preprocessor__num__scaler__with_mean': [True, False],
    'classifier__n_estimators': np.arange(50, 500, 50),
    'classifier__max_depth': np.arange(3, 10),
}
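
If it is unclear which names are valid keys for the grid, `get_params()` lists them. A minimal sketch on a simplified two-step pipeline (this `pipe` is an assumption, not the notebook's `rf`):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())])

# get_params() exposes every addressable name as <step>__<parameter>
param_names = sorted(pipe.get_params().keys())
print([name for name in param_names if name.startswith('scaler__')])

# set_params() accepts the same names that GridSearchCV receives
pipe.set_params(scaler__with_mean=False, classifier__max_depth=3)
```

For nested objects such as the `ColumnTransformer` above, each level adds one more `__` segment, which is how `preprocessor__num__scaler__with_mean` is formed.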

In [55]:
gs = GridSearchCV(rf, params, scoring='f1', verbose=3, n_jobs=-1)
gs.fit(X_train, y_train)

Fitting 5 folds for each of 126 candidates, totalling 630 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done 272 tasks      | elapsed:   41.1s
[Parallel(n_jobs=-1)]: Done 496 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 630 out of 630 | elapsed:  1.7min finished


GridSearchCV(estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median')),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         Index(['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')),
                                                                        ('cat',
                                                                         Pipeline(steps=[('imputer',
                                                                                         

In [56]:
gs.best_params_

{'classifier__max_depth': 9,
 'classifier__n_estimators': 50,
 'preprocessor__num__scaler__with_mean': True}
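
Besides `best_params_`, the fitted search object also holds the winning pipeline itself. The sketch below uses synthetic data from `make_classification` and a deliberately tiny grid so that it runs quickly; the names `X_toy`, `y_toy`, `pipe` and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (hypothetical)
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0))])

search = GridSearchCV(pipe, {'classifier__max_depth': [2, 4]},
                      scoring='f1', cv=3)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is a complete pipeline
# refit on all the data passed to fit(), using best_params_
best_pipe = search.best_estimator_
print(search.best_params_, search.best_score_)
print(best_pipe.named_steps['classifier'].max_depth)
```

Calling `predict()` on the search object, as done in the next cell, delegates to this refit pipeline.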

In [57]:
print(classification_report(y_test, gs.predict(X_test)))

              precision    recall  f1-score   support

           0       0.75      0.75      0.75         4
           1       0.80      0.80      0.80         5

    accuracy                           0.78         9
   macro avg       0.78      0.78      0.78         9
weighted avg       0.78      0.78      0.78         9



## Note
We can use these pipelines to save our model's entire workflow and deploy it to production in a simple way.

In [99]:
import pickle

In [100]:
with open('pipeline.pk', 'wb') as fp:
    pickle.dump(rf, fp)
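
Loading the file back is the mirror operation. The sketch below is self-contained: it trains a small hypothetical pipeline, writes it to a temporary directory, and reads it back, so the data and the temporary path are assumptions and not the notebook's own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data and pipeline
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', LogisticRegression())])
pipe.fit(X_toy, y_toy)

# Save to a temporary location
path = os.path.join(tempfile.mkdtemp(), 'pipeline.pk')
with open(path, 'wb') as fp:
    pickle.dump(pipe, fp)

# Load it back; the restored object predicts without being refit
with open(path, 'rb') as fp:
    restored = pickle.load(fp)

same = np.array_equal(restored.predict(X_toy), pipe.predict(X_toy))
print(same)
```

A pickled pipeline should be loaded with the same scikit-learn version that created it, and only from a trusted source, since unpickling can execute arbitrary code. If the tuned model is the one to ship, `gs.best_estimator_` could be saved in the same way.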