# Pipeline

A training process comprises many steps, such as imputing missing values, standardizing features, and splitting samples. It is worth knowing that these operations can be chained together to obtain a single artifact to work with.

This is possible largely because scikit-learn implements the same structure for all its models and preprocessing tasks.

Models implement the methods:

* `fit` for training
* `predict` for prediction
* `score` for scoring

Preprocessing classes, in turn, implement:

* `fit` for fitting
* `transform` for applying the transformation

This standardization lets us swap pieces within a planned process and treat the process as a whole.
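To make the shared interface concrete, here is a minimal sketch in which a hand-written transformer replaces `StandardScaler` inside the same pipeline. `MeanCenterer`, the toy arrays, and the step names `prep` and `clf` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        # Learn one mean per column from the training data.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the means learned during fit.
        return np.asarray(X, dtype=float) - self.mean_


# Toy data, made up for this example.
X_toy = np.array([[0.0, 10.0], [1.0, 12.0], [4.0, 20.0], [5.0, 22.0]])
y_toy = np.array([0, 0, 1, 1])

predictions = {}
for preprocessing in (StandardScaler(), MeanCenterer()):
    # Only the preprocessing piece changes; the rest of the pipeline is the same.
    model = Pipeline([("prep", preprocessing), ("clf", LogisticRegression())])
    model.fit(X_toy, y_toy)
    predictions[type(preprocessing).__name__] = model.predict(X_toy)

print(predictions)
```

Because both preprocessing objects expose `fit` and `transform`, the pipeline does not need to know which one it holds.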

<img src="https://skrub-data.org/stable/_images/sklearn_pipeline.svg" width=40%/>

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression())
])

pipeline

Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=2)),
                ('classifier', LogisticRegression())])


Internally, we can adjust the parameters of each step individually by addressing them as `<step>__<parameter>`.

In [2]:
pipeline.get_params()

{'memory': None,
 'steps': [('scaler', StandardScaler()),
  ('pca', PCA(n_components=2)),
  ('classifier', LogisticRegression())],
 'transform_input': None,
 'verbose': False,
 'scaler': StandardScaler(),
 'pca': PCA(n_components=2),
 'classifier': LogisticRegression(),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'pca__copy': True,
 'pca__iterated_power': 'auto',
 'pca__n_components': 2,
 'pca__n_oversamples': 10,
 'pca__power_iteration_normalizer': 'auto',
 'pca__random_state': None,
 'pca__svd_solver': 'auto',
 'pca__tol': 0.0,
 'pca__whiten': False,
 'classifier__C': 1.0,
 'classifier__class_weight': None,
 'classifier__dual': False,
 'classifier__fit_intercept': True,
 'classifier__intercept_scaling': 1,
 'classifier__l1_ratio': None,
 'classifier__max_iter': 100,
 'classifier__multi_class': 'deprecated',
 'classifier__n_jobs': None,
 'classifier__penalty': 'l2',
 'classifier__random_state': None,
 'classifier__solver': 'lbfgs',
 'classifier__tol'
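The `<step>__<parameter>` names listed by `get_params()` can be passed back to `set_params()`, and the same names serve as keys in a hyperparameter grid. The sketch below rebuilds the pipeline under the name `demo_pipeline` so that it runs on its own. The grid values and `cv=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", LogisticRegression()),
])

# Change parameters of individual steps through their <step>__<parameter> names.
demo_pipeline.set_params(pca__n_components=3, classifier__C=0.5)
print(demo_pipeline.get_params()["pca__n_components"])

# The same names work as keys of a hyperparameter grid.
X_demo, y_demo = load_wine(return_X_y=True)
search = GridSearchCV(
    demo_pipeline,
    param_grid={"pca__n_components": [2, 3], "classifier__C": [0.1, 1.0]},
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```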

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=2)),
                ('classifier', LogisticRegression())])


In [4]:
pipeline.score(X_test, y_test)

0.9555555555555556
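To see what the fitted pipeline does when it predicts, the sketch below applies each step by hand and compares the result with `predict` and `score`. It rebuilds and refits the same pipeline under the name `pipe` so that it is self-contained. For a classifier, `score` reports accuracy.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# Chain the fitted steps manually: transform, transform, then predict.
scaled = pipe.named_steps["scaler"].transform(X_test)
reduced = pipe.named_steps["pca"].transform(scaled)
manual_predictions = pipe.named_steps["classifier"].predict(reduced)

print(np.array_equal(manual_predictions, pipe.predict(X_test)))
print(accuracy_score(y_test, manual_predictions) == pipe.score(X_test, y_test))
```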

In [5]:
from sklearn.metrics import classification_report

# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, pipeline.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      0.87      0.93        15
           1       0.90      1.00      0.95        18
           2       1.00      1.00      1.00        12

    accuracy                           0.96        45
   macro avg       0.97      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45



One of the big challenges we will face is the preliminary analysis and building the process that converts all our data into numeric values.

In [6]:
from sklearn.datasets import fetch_openml

# https://www.openml.org/search?type=data&sort=runs&id=1590&status=active
adult = fetch_openml('adult', as_frame=True, version=2)
X, y = adult.data, adult.target

In [7]:
X

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States
48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States
48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States
48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States


[skrub](https://skrub-data.org/) was built on top of scikit-learn's functionality precisely to ease this task. It offers functions for carrying out preliminary analyses easily.

In [9]:
# Requires the skrub package to be installed (for example: pip install skrub)
from skrub import TableReport

TableReport(X)


It also transforms the data into representative numeric values. This task requires determining the most suitable treatment for:

* Numeric values
* Categorical values
* Low-cardinality values
* High-cardinality values
* Dates and times

In [None]:
from skrub import TableVectorizer

vectorizer = TableVectorizer()
vectorizer.fit_transform(X)

,age,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,workclass_nan,...,native-country_20,native-country_21,native-country_22,native-country_23,native-country_24,native-country_25,native-country_26,native-country_27,native-country_28,native-country_29
0,25.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
1,38.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
2,28.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
3,44.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
4,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
48838,40.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
48839,58.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08
48840,22.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,4.994264e-09,2.417478e-08,-1.365088e-09,9.864670e-09,-1.725437e-09,9.885835e-07,1.919139e-10,-1.009524e-09,-3.339897e-10,1.530799e-08


In [None]:
vectorizer

TableVectorizer
    cardinality_threshold: 40
    low_cardinality: OneHotEncoder..._output=False)
    high_cardinality: StringEncoder()
    numeric: PassThrough()
    datetime: DatetimeEncoder()
    specific_transformers: ()
    drop_null_fraction: 1.0
    drop_if_constant: False
    drop_if_unique: False
    datetime_format: None

DatetimeEncoder
    resolution: 'hour'
    add_weekday: False
    add_total_seconds: True
    add_day_of_year: False
    periodic_encoding: None

OneHotEncoder
    categories: 'auto'
    drop: 'if_binary'
    sparse_output: False
    dtype: 'float32'
    handle_unknown: 'ignore'
    min_frequency: None
    max_categories: None
    feature_name_combiner: 'concat'

StringEncoder
    n_components: 30
    vectorizer: 'tfidf'
    ngram_range: (3, ...)
    analyzer: 'char_wb'
    stop_words: None
    random_state: None
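As a rough illustration of the low/high cardinality split, the sketch below counts distinct values per column with pandas and compares them with a threshold of 40, the `cardinality_threshold` value displayed for `TableVectorizer`. The toy DataFrame and its column names are made up, and the strict `<` comparison is an assumption about how skrub applies the threshold, not a copy of its implementation.

```python
import pandas as pd

# Assumed to match the cardinality_threshold displayed for TableVectorizer.
CARDINALITY_THRESHOLD = 40

# Toy table (made up): one column with 2 distinct values, one with 100.
toy = pd.DataFrame({
    "sex": ["Male", "Female"] * 50,
    "customer_id": [f"id-{i}" for i in range(100)],
})

# Number of distinct values per column.
unique_counts = toy.nunique()

low_cardinality = unique_counts[unique_counts < CARDINALITY_THRESHOLD].index.tolist()
high_cardinality = unique_counts[unique_counts >= CARDINALITY_THRESHOLD].index.tolist()

print("low cardinality:", low_cardinality)
print("high cardinality:", high_cardinality)
```

Under this reading, a column such as `native-country` would fall on the high-cardinality side only if it has at least 40 distinct values, which is consistent with the `native-country_20` to `native-country_29` columns in the vectorized output.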


We can see which operations are applied to each particular column.

In [None]:
vectorizer.all_processing_steps_['age']

[DropUninformative(), ToFloat32(), PassThrough(), {'age': ToFloat32()}]

In [None]:
vectorizer.all_processing_steps_['occupation']

[DropUninformative(),
 CleanCategories(),
 OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
               sparse_output=False),
 {'occupation_Adm-clerical': ToFloat32(), 'occupation_Armed-Forces': ToFloat32(), ...}]

The ultimate goal of skrub is to reduce most of these tasks, once the object of our study is well defined, to essentially two lines of code.

In [None]:
import numpy as np
from skrub import tabular_pipeline
from sklearn.model_selection import cross_val_score

clf = tabular_pipeline('classifier')
res = cross_val_score(clf, X, y, cv=5)

print("Mean cross-validation score: ", np.mean(res))

Mean cross-validation score:  0.8729987464735135
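For readers who want to see what `cross_val_score` with `cv=5` does for a classifier, the sketch below reproduces it with an explicit `StratifiedKFold` loop. It uses only scikit-learn and the wine dataset so that it runs without skrub. The pipeline here (`StandardScaler` plus `LogisticRegression`) is a stand-in chosen for speed, not the model used above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Fit a fresh copy on each training fold so no test data leaks into
    # the preprocessing or the classifier.
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(np.allclose(manual_scores, library_scores))
```

Cross-validating the whole pipeline, rather than only the final estimator, means the preprocessing is refit on each training fold.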


In [None]:
clf

Pipeline
    steps: [('tablevectorizer', ...), ('histgradientboostingclassifier', ...)]
    transform_input: None
    memory: None
    verbose: False

TableVectorizer
    cardinality_threshold: 40
    low_cardinality: ToCategorical()
    high_cardinality: StringEncoder()
    numeric: PassThrough()
    datetime: DatetimeEncoder()
    specific_transformers: ()
    drop_null_fraction: 1.0
    drop_if_constant: False
    drop_if_unique: False
    datetime_format: None

DatetimeEncoder
    resolution: 'hour'
    add_weekday: False
    add_total_seconds: True
    add_day_of_year: False
    periodic_encoding: None

StringEncoder
    n_components: 30
    vectorizer: 'tfidf'
    ngram_range: (3, ...)
    analyzer: 'char_wb'
    stop_words: None
    random_state: None

HistGradientBoostingClassifier
    loss: 'log_loss'
    learning_rate: 0.1
    max_iter: 100
    max_leaf_nodes: 31
    max_depth: None
    min_samples_leaf: 20
    l2_regularization: 0.0
    max_features: 1.0
    max_bins: 255
    categorical_features: 'from_dtype'


If we are happy with the model, we can save it for later use. [Joblib](https://joblib.readthedocs.io/en/stable/) is one of the contributions from the same team, and we can use it for this task.

In [None]:
import joblib

joblib.dump(clf, "clasificador.joblib")

['clasificador.joblib']
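Loading the model back is the mirror operation, `joblib.load`. The sketch below is self-contained: it trains a small stand-in pipeline on the wine dataset, writes it to a temporary directory, reloads it, and checks that both objects predict the same labels. The file name `demo_model.joblib` is arbitrary.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    # Persist the fitted pipeline, preprocessing included.
    joblib.dump(model, model_path)
    # Restore it as a new object.
    restored = joblib.load(model_path)

same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Joblib files are pickle-based, so they should only be loaded from trusted sources, ideally with the same library versions that were used to save them.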