## **Pipelines - Machine Learning**

scikit-learn pipelines are a kind of "container" that can hold objects of the following types:

- Transformer (not the NLP kind; a data-processing step).
- Estimator (the name sklearn gives to regression, classification, and clustering algorithms).
- Pipeline (yes, a pipeline can be nested inside another pipeline).
- FeatureUnion (combines the outputs of different transformers or pipelines).
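
The four object types listed above can be combined in a single pipeline. The sketch below is only an illustration on synthetic data: the names `X_demo`, `y_demo`, `pca_branch`, and `demo_pipeline`, and the choice of `PCA` and `SelectKBest`, are assumptions made for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 60 rows, 4 numeric features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# A Pipeline of two transformers, used as one branch of the FeatureUnion.
pca_branch = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])

# FeatureUnion concatenates the outputs of its branches column-wise.
features = FeatureUnion([
    ('pca_branch', pca_branch),
    ('k_best', SelectKBest(k=1)),
])

# Outer pipeline: transformers first, an estimator as the last step.
demo_pipeline = Pipeline([
    ('features', features),
    ('classifier', LogisticRegression()),
])

demo_pipeline.fit(X_demo, y_demo)

# 2 PCA components + 1 selected feature = 3 columns reach the classifier.
n_features_out = demo_pipeline.named_steps['features'].transform(X_demo).shape[1]
print(n_features_out)
```

Only the last step needs to be an estimator; every earlier step must implement `fit` and `transform`.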

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV

In [120]:
treino = pd.read_csv('train_data.csv')
teste  = pd.read_csv('test_data.csv')

display(treino.head(3))
display(teste.head(3))

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |


|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 |  | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 |  | Q |


In [121]:
# `columns=` already identifies the axis, so `axis=1` is redundant.
treino.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
teste.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

In [122]:
treino.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [123]:
imputer = SimpleImputer(strategy='median')
treino['Age'] = imputer.fit_transform(treino[['Age']]).ravel()

treino.dropna(inplace=True)
treino.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Sex       889 non-null    object 
 3   Age       889 non-null    float64
 4   SibSp     889 non-null    int64  
 5   Parch     889 non-null    int64  
 6   Fare      889 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 62.5+ KB


## **Building the Pipeline BY HAND**

In [124]:
pipeline_modelo = Pipeline([
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
    ('standard_scaler', StandardScaler(with_mean=False)),
    ('classificador', RandomForestClassifier())
])

pipeline_modelo

## **Using make_pipeline to BUILD it**

In [125]:
pipeline_automatico = make_pipeline(OneHotEncoder(handle_unknown='ignore'), 
                                    StandardScaler(with_mean=False), 
                                    RandomForestClassifier())
pipeline_automatico
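
Unlike `Pipeline`, `make_pipeline` does not take step names; it derives them from the lowercased class names. A small sketch of how to inspect them (the variable names `auto` and `step_names` are chosen for this example only):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

auto = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    StandardScaler(with_mean=False),
    RandomForestClassifier(),
)

# Each step is a (name, estimator) tuple; the names are generated automatically.
step_names = [name for name, _ in auto.steps]
print(step_names)
# ['onehotencoder', 'standardscaler', 'randomforestclassifier']
```

These generated names are what you would use later to reach a step, for example through `auto.named_steps['standardscaler']`.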

In [126]:
X = treino.drop(columns=['Survived'])
y = treino['Survived']

Xtrain, Xvalid, ytrain, yvalid = train_test_split(X, y, test_size=0.30)

Xtrain.shape, Xvalid.shape, ytrain.shape, yvalid.shape

((622, 7), (267, 7), (622,), (267,))

In [127]:
pipeline_modelo.fit(Xtrain, ytrain)

ypred = pipeline_modelo.predict(Xvalid)

pipeline_modelo.score(Xvalid, yvalid)

0.850187265917603
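
The accuracy above comes from a single random split (no `random_state` is passed to `train_test_split`), so it will change from run to run. One possible way to get a steadier estimate is to cross-validate the whole pipeline, using the `KFold` and `cross_val_score` helpers imported at the top. The sketch below uses a small synthetic frame, not the Titanic data; `toy`, `toy_target`, and `toy_pipeline` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the training data (assumption).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Sex': rng.choice(['male', 'female'], size=100),
    'Embarked': rng.choice(['S', 'C', 'Q'], size=100),
})
toy_target = (toy['Sex'] == 'female').astype(int)

toy_pipeline = Pipeline([
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
    ('classifier', RandomForestClassifier(n_estimators=20, random_state=42)),
])

# The pipeline is re-fit on each training fold, so the encoder
# never sees the rows of the fold it is evaluated on.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(toy_pipeline, toy, toy_target, cv=folds)
print(scores.mean(), scores.std())
```

Passing the pipeline itself to `cross_val_score`, rather than pre-transformed data, keeps preprocessing inside each fold.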

## **With Data Preprocessing**

In [128]:
Xtrain.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622 entries, 485 to 460
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    622 non-null    int64  
 1   Sex       622 non-null    object 
 2   Age       622 non-null    float64
 3   SibSp     622 non-null    int64  
 4   Parch     622 non-null    int64  
 5   Fare      622 non-null    float64
 6   Embarked  622 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 38.9+ KB


In [129]:
Xtrain.head()

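|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 485 | 3 | female | 28.0 | 3 | 1 | 25.4667 | S |
| 645 | 1 | male | 48.0 | 1 | 0 | 76.7292 | C |
| 103 | 3 | male | 33.0 | 0 | 0 | 8.6542 | S |
| 269 | 1 | female | 35.0 | 0 | 0 | 135.6333 | S |
| 140 | 3 | female | 28.0 | 0 | 2 | 15.2458 | C |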


In [130]:
categoricas = Xtrain.select_dtypes(include='object')
categoricas

|   | Sex | Embarked |
|---|---|---|
| 485 | female | S |
| 645 | male | C |
| 103 | male | S |
| 269 | female | S |
| 140 | female | C |
| ... | ... | ... |
| 839 | male | C |
| 406 | male | S |
| 355 | male | S |
| 124 | male | S |


In [131]:
numericas = Xtrain.select_dtypes(include='number')
numericas

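|   | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 485 | 3 | 28.0 | 3 | 1 | 25.4667 |
| 645 | 1 | 48.0 | 1 | 0 | 76.7292 |
| 103 | 3 | 33.0 | 0 | 0 | 8.6542 |
| 269 | 1 | 35.0 | 0 | 0 | 135.6333 |
| 140 | 3 | 28.0 | 0 | 2 | 15.2458 |
| ... | ... | ... | ... | ... | ... |
| 839 | 1 | 28.0 | 0 | 0 | 29.7000 |
| 406 | 3 | 51.0 | 0 | 0 | 7.7500 |
| 355 | 3 | 28.0 | 0 | 0 | 9.5000 |
| 124 | 1 | 54.0 | 0 | 1 | 77.2875 |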


### **Pipeline - Categorical Columns**

In [132]:
pipeline_categoricas = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

### **Pipeline - Numerical Columns**

In [133]:
pipeline_numericas = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

In [134]:
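# ColumnTransformer expects column names (or indices/masks), not DataFrames,
# so pass the column labels of each subset.
pre_processamento = ColumnTransformer([
    ('cat', pipeline_categoricas, list(categoricas.columns)),
    ('num', pipeline_numericas, list(numericas.columns)),
])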
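
A `ColumnTransformer` sends each group of columns through its own transformer and concatenates the results. As an alternative to listing column names, scikit-learn also offers `make_column_selector`, which picks columns by dtype at fit time. The sketch below is a self-contained illustration on a four-row toy frame; `toy`, `cat_pipeline`, `num_pipeline`, and `preprocess` are hypothetical names, not objects from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame (assumption) with one missing value per column type.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler()),
])

# Select columns by dtype instead of by explicit name lists.
preprocess = ColumnTransformer([
    ('cat', cat_pipeline, make_column_selector(dtype_exclude='number')),
    ('num', num_pipeline, make_column_selector(dtype_include='number')),
])

transformed = preprocess.fit_transform(toy)

# Sex -> 2 columns, Embarked -> 3 (S, C, 'missing'), plus Age and Fare.
print(transformed.shape)
```

Selecting by dtype avoids keeping separate column lists in sync with the data, at the cost of being less explicit about which columns are used.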

In [135]:
pipeline_randomforest = make_pipeline(pre_processamento, RandomForestClassifier(random_state=42))
pipeline_logistica    = make_pipeline(pre_processamento, LogisticRegression(random_state=42))

In [136]:
pipeline_randomforest

In [137]:
pipeline_logistica

In [138]:
type(ytrain)

pandas.core.series.Series

In [None]:
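# A pipeline whose last step is an estimator (LogisticRegression) has no
# fit_transform; use fit, then predict or score.
pipeline_logistica.fit(Xtrain, ytrain)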
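
`GridSearchCV` is imported at the top but not used so far. A pipeline's hyperparameters can be tuned by addressing them as `<step name>__<parameter>`, where the step names of a `make_pipeline` object are the lowercased class names. The following is a minimal sketch on synthetic data; `toy`, `toy_target`, `preprocess`, `model`, and the grid values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic data (assumption): one categorical and one numeric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Sex': rng.choice(['male', 'female'], size=120),
    'Fare': rng.uniform(5, 100, size=120),
})
toy_target = ((toy['Sex'] == 'female') | (toy['Fare'] > 80)).astype(int)

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
    ('num', MinMaxScaler(), ['Fare']),
])
model = make_pipeline(preprocess, LogisticRegression(random_state=42))

# 'logisticregression' is the step name generated by make_pipeline.
grid = {'logisticregression__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(model, grid, cv=3)
search.fit(toy, toy_target)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the full pipeline refit with the best parameters, so it can be used directly for `predict` on new rows.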