<a href="https://colab.research.google.com/github/MathMachado/DSWP/blob/master/Notebooks/NB22_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PIPELINES
* In this section we study Pipelines.
* A Pipeline chains one or more transformers (Data Preparation) with a final estimator.
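
As a minimal sketch of that idea, the example below chains one transformer and one estimator. The synthetic data and the names `X_demo`, `y_demo` and `demo_pipe` are assumptions made for illustration; they are not part of the Titanic project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# A transformer (StandardScaler) followed by an estimator (LogisticRegression)
demo_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# fit() fits the scaler, transforms X_demo, then fits the classifier on the result
demo_pipe.fit(X_demo, y_demo)

# predict() reuses the fitted scaler before calling the classifier
demo_predictions = demo_pipe.predict(X_demo)
print(demo_predictions[:10])
```

The whole chain behaves like a single estimator, so it can be passed to `cross_val_score` or `GridSearchCV` as one object.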

## Recommended Reading:
* [Why, How and When to Scale your Features](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e)
* [Demonstrating the different strategies of KBinsDiscretizer](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html#sphx-glr-auto-examples-preprocessing-plot-discretization-strategies-py);
* [Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?](https://medium.com/@contactsunny/why-do-we-need-feature-scaling-in-machine-learning-and-how-to-do-it-using-scikit-learn-d8314206fe73)

## Machine Learning with Python (Scikit-Learn)

![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)

## Data

The variables/attributes of the dataframe are listed below:

* Data dictionary for the Titanic dataframe:
    * **PassengerId**: passenger ID;
    * **Survived**: indicator, where 1 = passenger survived and 0 = passenger died;
    * **Pclass**: class in which the passenger travelled (1st, 2nd or 3rd class);
    * **Age**: passenger's age;
    * **SibSp**: number of siblings/spouses aboard;
    * **Parch**: number of parents/children aboard;
    * **Fare**: fare paid for the trip;
    * **Cabin**: passenger's cabin;
    * **Embarked**: port where the passenger embarked;
    * **Name**: passenger's name;
    * **Sex**: passenger's sex.

## Load the (generic) Python libraries

In [None]:
import pandas as pd
from pandas import Series, DataFrame

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
matplotlib.style.use('ggplot')

# Uncomment the filter below to suppress warnings and keep the notebook clean
import warnings
#warnings.filterwarnings('ignore')

## Load Data

In [None]:
url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/df_Tratado.csv'
url2 = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_Original.csv'

# Load the treated and the original dataframes, using 'PassengerId' as the index
df = pd.read_csv(url, index_col = 'PassengerId')
df_original = pd.read_csv(url2, index_col = 'PassengerId')

Our DataFrames are shown below:

In [None]:
df_original.head()

In [None]:
df.head()

In [None]:
df.columns = [cols.lower() for cols in df.columns]
df.head()

In [None]:
# drop() returns a new DataFrame, so no explicit copy is needed
df2 = df.drop(columns=['survived2'])

In [None]:
df2.shape

In [None]:
def mostra_missing_value(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = 100*round((df.isnull().sum()/df.isnull().count()).sort_values(ascending=False),2)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percentual'])
    print(missing_data.head(10))

In [None]:
mostra_missing_value(df2)

There are no NaN values because all the treatment was done in the Titanic project. Let's move on...

**NOTE**: No treatment is needed in this 3DP chapter because it was already done earlier.

# Pipelines
* A Pipeline covers the 3DP + 4M + 5MSE phases in a single object!

![CRISP-DM](https://github.com/MathMachado/Materials/blob/master/CRISP-DM.png?raw=true)
[Source](https://www.sv-europe.com/crisp-dm-methodology/)

In [None]:
df3 = df2.copy()
df3.head()

Defining the numeric and categorical variables that will be transformed:

In [None]:
# Lists of the numeric and categorical (object) variables
features_numericas = ['fare','seat','age2','age3', 'age_inf','age2_outlier_zs',
                   'age3_outlier_zs','age_inf_outlier_zs',
                   'fare_outlier_zs','age2_outlier_iqr',
                   'age3_outlier_iqr','age_inf_outlier_iqr',
                   'fare_outlier_iqr']

features_categoricas = ['sex','deck','embarked','age_category']

## Transformers
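* OneHotEncoder plays the same role as pd.get_dummies: it turns each category into a binary column. Unlike pd.get_dummies, it learns the categories in `fit` and reuses them in `transform`.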
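
The practical difference shows up when new data contains a category that was not seen during fitting. The sketch below uses a hypothetical toy column (`train_demo` and `new_demo` are illustrative names, not part of the Titanic dataframe).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'Q' never appears in the training frame
train_demo = pd.DataFrame({'embarked': ['S', 'C', 'S']})
new_demo = pd.DataFrame({'embarked': ['C', 'Q']})

# get_dummies builds its columns from whatever values each frame contains,
# so the two results do not share the same columns
dummies_train = pd.get_dummies(train_demo['embarked'])
dummies_new = pd.get_dummies(new_demo['embarked'])
print(list(dummies_train.columns))
print(list(dummies_new.columns))

# OneHotEncoder learns the categories in fit() and keeps the same output
# columns in transform(); with handle_unknown='ignore' an unseen category
# is encoded as a row of zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_demo)
encoded_new = encoder.transform(new_demo).toarray()
print(encoder.categories_)
print(encoded_new)
```

This stable column layout is what allows the encoder to sit inside a Pipeline that is fitted on the training sample and later applied to the test sample.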

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn import preprocessing
from sklearn.model_selection import cross_val_score

In [None]:
numeric_transformer_ss = Pipeline(steps = [('StandardScaler', StandardScaler())])
#numeric_transformer_mms = Pipeline(steps = [('MinMaxScaler', MinMaxScaler())])
categorical_transformer = Pipeline(steps = [('onehot', OneHotEncoder(handle_unknown = 'ignore'))])

In [None]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer_ss, features_numericas),
        ('cat', categorical_transformer, features_categoricas)])

preprocessor

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

# Add the classifier (RandomForestClassifier) to the Pipeline:
rf = Pipeline([
               ('preprocessor', preprocessor), 
               ('reduce_dim', PCA()),
               ('classifier', RandomForestClassifier())
               ]
              )

Defining the training and test samples.

In [None]:
# Keep only the rows where the target is known, then separate features and target
X = df3[df3['survived'].notna()]
X = X.drop(columns=['survived'])

y = df3.loc[df3['survived'].notna(), 'survived']

print(X.shape, y.shape)

In [None]:
X.head()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X, y, test_size = 0.2)
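
The call above draws a different random split on every run. As a hedged sketch, assuming reproducibility and equal class proportions in both samples are wanted, `random_state` and `stratify` can be passed to `train_test_split`. The arrays `X_demo` and `y_demo` and the value 42 are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% of them in the positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the class ratio in both samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of the positive class in each sample
```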

In [None]:
X_treinamento.head()

Note that the columns/variables of X_treinamento have not been transformed. The transformations are applied inside the Pipeline each time `fit` or `predict` is called.
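
A small sketch of that behaviour, using a hypothetical two-column frame in place of X_treinamento (the names `X_demo` and `demo_preprocessor` are assumptions for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for X_treinamento
X_demo = pd.DataFrame({
    'fare': [7.25, 71.28, 8.05, 53.10],
    'sex': ['male', 'female', 'female', 'male'],
})
X_before = X_demo.copy()

demo_preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])

# fit_transform returns a new array; the input DataFrame is left as it was
X_transformed = demo_preprocessor.fit_transform(X_demo)

print(X_demo.equals(X_before))  # checks that the original columns are unchanged
print(X_transformed.shape)      # one scaled column plus one column per category
```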

## 4M - Modeling

In [None]:
rf.fit(X_treinamento, y_treinamento)
print("Training: model score: %.3f" % rf.score(X_treinamento, y_treinamento))
print("Test....: model score: %.3f" % rf.score(X_teste, y_teste))

In [None]:
y_pred = rf.predict(X_teste)
y_pred

## 5MSE - Model Selection and Evaluation
* Here we evaluate several classifiers (many of them ensemble methods) with the same Pipeline.

In [None]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import ExtraTreesClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.linear_model import RidgeClassifier, PassiveAggressiveClassifier, SGDClassifier, LogisticRegressionCV, LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

from sklearn.neural_network import MLPClassifier # Multi-Layer Perceptron Classifier

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier, VotingClassifier, GradientBoostingClassifier

from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.feature_selection import SelectKBest, chi2

from lightgbm import LGBMClassifier

from xgboost import XGBClassifier

from sklearn.linear_model import Perceptron

Define the list of all classifiers to be applied:

In [None]:
classifiers = [KNeighborsClassifier(5),
               SVC(kernel="rbf", C=0.025, probability=True),
               NuSVC(probability=True),
               DecisionTreeClassifier(),
               RandomForestClassifier(),
               GradientBoostingClassifier(),
               RidgeClassifier(),
               AdaBoostClassifier(),
               GaussianNB(),
               BernoulliNB(),
               PassiveAggressiveClassifier(),
               LinearSVC(),
               SGDClassifier(loss='log_loss', penalty='elasticnet'),  # 'log' was renamed to 'log_loss' in scikit-learn 1.1
               LogisticRegression(),
               NearestCentroid(),
               Perceptron(),
               MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000),
               LGBMClassifier(),
               BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5),
               GaussianProcessClassifier(),
               XGBClassifier(n_estimators= 2000,max_depth= 4,min_child_weight= 2,gamma=0.9,subsample=0.8,colsample_bytree=0.8,objective='binary:logistic',scale_pos_weight=1),
               ExtraTreesClassifier(n_estimators = 750, max_features = 'sqrt', max_depth = 35,  criterion = 'entropy', random_state = 42)]

In [None]:
for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor), 
                           ('reduce_dim', PCA()), 
                           ('classifier', classifier)])
    
    scores = cross_val_score(pipe, X_treinamento, y_treinamento, cv = 10)
    pipe.fit(X_treinamento, y_treinamento)   
    print(classifier)
    print("Training Sample: model score: %.3f" % pipe.score(X_treinamento, y_treinamento))
    print("Test Sample....: model score: %.3f" % pipe.score(X_teste, y_teste))
    print("****************************************************************************************\n")
    print(classifier, ":", round(scores.mean(),2))
    print("\n")

## Fine Tuning
* Next, we fine-tune the XGBClassifier:

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Adding the classifier (XGBClassifier) to the Pipeline:
xg = Pipeline([('preprocessor', preprocessor), 
               ('reduce_dim', PCA()), # PCA is applied to all preprocessed features!
               ('classifier', XGBClassifier())])

The main parameters for fine tuning:

In [None]:
import timeit
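
Inside a Pipeline, each parameter is addressed as `<step name>__<parameter name>`, which is why the grid keys in this section start with `classifier__`. The sketch below illustrates the convention with a RandomForestClassifier (an assumption made to keep the example free of extra dependencies); the same naming applies to any step.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classifier', RandomForestClassifier()),
])

# get_params() lists every tunable parameter under its prefixed name
classifier_params = sorted(
    name for name in demo_pipe.get_params() if name.startswith('classifier__')
)
print(classifier_params[:5])

# set_params() accepts the same prefixed names used as keys in a param_grid
demo_pipe.set_params(classifier__n_estimators=50, reduce_dim__n_components=2)
print(demo_pipe.named_steps['classifier'].n_estimators)
```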

Simplified version:

In [None]:
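start = timeit.default_timer()  # clock reading in seconds; timeit.timeit() would time a no-op statement instead
# Note: 'max_features' and 'criterion' are scikit-learn tree parameters and are not
# documented XGBClassifier parameters; they are kept so the grid matches the recorded output.
param_grid = { 
    'classifier__n_estimators': [100, 300, 500, 700, 900],
    'classifier__max_features': ['auto', 'log2'],
    'classifier__max_depth' : [3,5,7,9],
    'classifier__criterion' :['entropy'],
    'classifier__learning_rate':[0.0001, 0.001, 0.01, 0.1],
    'classifier__gamma':[0,1,5]}

CV = GridSearchCV(xg, param_grid, cv = 10, n_jobs = 50, scoring = 'accuracy', verbose = 10)
                  
CV.fit(X_treinamento, y_treinamento)  
print(CV.best_params_)    
print(CV.best_score_)

end = timeit.default_timer()
print(end - start)  # elapsed wall-clock time in seconds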

Output of the fine-tuning process:

```
Fitting 10 folds for each of 480 candidates, totalling 4800 fits
[Parallel(n_jobs=50)]: Using backend LokyBackend with 50 concurrent workers.
[Parallel(n_jobs=50)]: Done  13 tasks      | elapsed:   59.6s
[Parallel(n_jobs=50)]: Done  28 tasks      | elapsed:  1.2min
[Parallel(n_jobs=50)]: Done  45 tasks      | elapsed:  1.5min
[Parallel(n_jobs=50)]: Done  62 tasks      | elapsed:  1.8min
[Parallel(n_jobs=50)]: Done  81 tasks      | elapsed:  2.0min
[Parallel(n_jobs=50)]: Done 100 tasks      | elapsed:  2.5min
[Parallel(n_jobs=50)]: Done 121 tasks      | elapsed:  2.7min
[Parallel(n_jobs=50)]: Done 142 tasks      | elapsed:  3.3min
[Parallel(n_jobs=50)]: Done 165 tasks      | elapsed:  3.8min
[Parallel(n_jobs=50)]: Done 188 tasks      | elapsed:  4.2min
[Parallel(n_jobs=50)]: Done 213 tasks      | elapsed:  4.8min
[Parallel(n_jobs=50)]: Done 238 tasks      | elapsed:  5.2min
[Parallel(n_jobs=50)]: Done 265 tasks      | elapsed:  6.3min
[Parallel(n_jobs=50)]: Done 292 tasks      | elapsed:  6.9min
[Parallel(n_jobs=50)]: Done 321 tasks      | elapsed:  7.6min
[Parallel(n_jobs=50)]: Done 350 tasks      | elapsed:  8.4min
[Parallel(n_jobs=50)]: Done 381 tasks      | elapsed:  9.2min
[Parallel(n_jobs=50)]: Done 412 tasks      | elapsed:  9.9min
[Parallel(n_jobs=50)]: Done 445 tasks      | elapsed: 10.3min
[Parallel(n_jobs=50)]: Done 478 tasks      | elapsed: 10.8min
[Parallel(n_jobs=50)]: Done 513 tasks      | elapsed: 11.4min
[Parallel(n_jobs=50)]: Done 548 tasks      | elapsed: 12.1min
[Parallel(n_jobs=50)]: Done 585 tasks      | elapsed: 12.9min
[Parallel(n_jobs=50)]: Done 622 tasks      | elapsed: 13.9min
[Parallel(n_jobs=50)]: Done 661 tasks      | elapsed: 14.9min
[Parallel(n_jobs=50)]: Done 700 tasks      | elapsed: 15.8min
[Parallel(n_jobs=50)]: Done 741 tasks      | elapsed: 17.0min
[Parallel(n_jobs=50)]: Done 782 tasks      | elapsed: 18.2min
[Parallel(n_jobs=50)]: Done 825 tasks      | elapsed: 18.9min
[Parallel(n_jobs=50)]: Done 868 tasks      | elapsed: 19.7min
[Parallel(n_jobs=50)]: Done 913 tasks      | elapsed: 20.4min
[Parallel(n_jobs=50)]: Done 958 tasks      | elapsed: 21.4min
[Parallel(n_jobs=50)]: Done 1005 tasks      | elapsed: 22.3min
[Parallel(n_jobs=50)]: Done 1052 tasks      | elapsed: 23.5min
[Parallel(n_jobs=50)]: Done 1101 tasks      | elapsed: 24.8min
[Parallel(n_jobs=50)]: Done 1150 tasks      | elapsed: 26.2min
[Parallel(n_jobs=50)]: Done 1201 tasks      | elapsed: 27.4min
[Parallel(n_jobs=50)]: Done 1252 tasks      | elapsed: 28.1min
[Parallel(n_jobs=50)]: Done 1305 tasks      | elapsed: 28.8min
[Parallel(n_jobs=50)]: Done 1358 tasks      | elapsed: 29.8min
[Parallel(n_jobs=50)]: Done 1413 tasks      | elapsed: 30.8min
[Parallel(n_jobs=50)]: Done 1468 tasks      | elapsed: 31.7min
[Parallel(n_jobs=50)]: Done 1525 tasks      | elapsed: 33.0min
[Parallel(n_jobs=50)]: Done 1582 tasks      | elapsed: 34.1min
[Parallel(n_jobs=50)]: Done 1641 tasks      | elapsed: 35.0min
[Parallel(n_jobs=50)]: Done 1700 tasks      | elapsed: 35.9min
[Parallel(n_jobs=50)]: Done 1761 tasks      | elapsed: 37.0min
[Parallel(n_jobs=50)]: Done 1822 tasks      | elapsed: 38.4min
[Parallel(n_jobs=50)]: Done 1885 tasks      | elapsed: 40.0min
[Parallel(n_jobs=50)]: Done 1948 tasks      | elapsed: 41.8min
[Parallel(n_jobs=50)]: Done 2013 tasks      | elapsed: 43.3min
[Parallel(n_jobs=50)]: Done 2078 tasks      | elapsed: 44.2min
[Parallel(n_jobs=50)]: Done 2145 tasks      | elapsed: 45.5min
[Parallel(n_jobs=50)]: Done 2212 tasks      | elapsed: 47.0min
[Parallel(n_jobs=50)]: Done 2281 tasks      | elapsed: 49.0min
[Parallel(n_jobs=50)]: Done 2350 tasks      | elapsed: 50.8min
[Parallel(n_jobs=50)]: Done 2421 tasks      | elapsed: 52.4min
[Parallel(n_jobs=50)]: Done 2492 tasks      | elapsed: 53.5min
[Parallel(n_jobs=50)]: Done 2565 tasks      | elapsed: 55.0min
[Parallel(n_jobs=50)]: Done 2638 tasks      | elapsed: 56.4min
[Parallel(n_jobs=50)]: Done 2713 tasks      | elapsed: 58.6min
[Parallel(n_jobs=50)]: Done 2788 tasks      | elapsed: 60.6min
[Parallel(n_jobs=50)]: Done 2865 tasks      | elapsed: 61.9min
[Parallel(n_jobs=50)]: Done 2942 tasks      | elapsed: 63.2min
[Parallel(n_jobs=50)]: Done 3021 tasks      | elapsed: 64.9min
[Parallel(n_jobs=50)]: Done 3100 tasks      | elapsed: 66.8min
[Parallel(n_jobs=50)]: Done 3181 tasks      | elapsed: 69.0min
[Parallel(n_jobs=50)]: Done 3262 tasks      | elapsed: 70.3min
[Parallel(n_jobs=50)]: Done 3345 tasks      | elapsed: 71.7min
[Parallel(n_jobs=50)]: Done 3428 tasks      | elapsed: 73.4min
[Parallel(n_jobs=50)]: Done 3513 tasks      | elapsed: 75.5min
[Parallel(n_jobs=50)]: Done 3598 tasks      | elapsed: 77.5min
[Parallel(n_jobs=50)]: Done 3685 tasks      | elapsed: 78.7min
[Parallel(n_jobs=50)]: Done 3772 tasks      | elapsed: 80.3min
[Parallel(n_jobs=50)]: Done 3861 tasks      | elapsed: 82.5min
[Parallel(n_jobs=50)]: Done 3950 tasks      | elapsed: 84.8min
[Parallel(n_jobs=50)]: Done 4041 tasks      | elapsed: 86.5min
[Parallel(n_jobs=50)]: Done 4132 tasks      | elapsed: 88.1min
[Parallel(n_jobs=50)]: Done 4225 tasks      | elapsed: 90.0min
[Parallel(n_jobs=50)]: Done 4318 tasks      | elapsed: 92.5min
[Parallel(n_jobs=50)]: Done 4413 tasks      | elapsed: 94.9min
[Parallel(n_jobs=50)]: Done 4508 tasks      | elapsed: 96.3min
[Parallel(n_jobs=50)]: Done 4605 tasks      | elapsed: 98.3min
[Parallel(n_jobs=50)]: Done 4800 out of 4800 | elapsed: 103.5min finished
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
{'classifier__criterion': 'entropy', 'classifier__gamma': 1, 'classifier__learning_rate': 0.1, 'classifier__max_depth': 9, 'classifier__max_features': 'auto', 'classifier__n_estimators': 100}
0.7991573033707865
0.0019175359993823804
```
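
The last value in this output (about 0.0019) is not the duration of the search. Called without arguments, `timeit.timeit()` times one million executions of a `pass` statement and returns that duration, so subtracting two such calls compares two unrelated measurements. A minimal sketch of wall-clock timing with `timeit.default_timer()`, using a short `time.sleep` as a stand-in for the grid search:

```python
import time
import timeit

start = timeit.default_timer()  # clock reading, in seconds
time.sleep(0.2)                 # stand-in for CV.fit(...)
end = timeit.default_timer()

elapsed = end - start
print("Elapsed: %.2f s" % elapsed)
```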

After 103.5 minutes of processing using 12 GB of RAM, we get the following result:

{'classifier__criterion': 'entropy', 'classifier__gamma': 1, 'classifier__learning_rate': 0.1, 'classifier__max_depth': 9, 'classifier__max_features': 'auto', 'classifier__n_estimators': 100}

Finally, we apply these parameters to the classifier:

In [None]:
xg2= Pipeline([('preprocessor', preprocessor), 
               ('reduce_dim', PCA()),
               ('classifier', XGBClassifier(criterion= 'entropy', gamma= 1, learning_rate= 0.1, max_depth= 9, max_features= 'auto', n_estimators= 100))])

In [None]:
xg2.fit(X_treinamento, y_treinamento)
print("Training Sample: model score: %.3f" % xg2.score(X_treinamento, y_treinamento))
print("Test Sample....: model score: %.3f" % xg2.score(X_teste, y_teste))

Can you explain why 480 candidate models are evaluated?
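
One way to check the count is to enumerate the grid: the number of candidates is the product of the number of values for each parameter, and each candidate is fitted once per fold. The sketch below repeats the simplified grid and counts its combinations with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the simplified version above
param_grid = {
    'classifier__n_estimators': [100, 300, 500, 700, 900],    # 5 values
    'classifier__max_features': ['auto', 'log2'],             # 2 values
    'classifier__max_depth': [3, 5, 7, 9],                    # 4 values
    'classifier__criterion': ['entropy'],                     # 1 value
    'classifier__learning_rate': [0.0001, 0.001, 0.01, 0.1],  # 4 values
    'classifier__gamma': [0, 1, 5],                           # 3 values
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 2 * 4 * 1 * 4 * 3
n_fits = n_candidates * 10                     # cv = 10 folds per candidate
print(n_candidates, n_fits)  # → 480 4800
```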

What is the conclusion? Did fine tuning produce a better model than the default one?

* What is the accuracy of the XGBClassifier before fine tuning?
* What is the accuracy of the XGBClassifier after fine tuning?

More metrics for evaluating accuracy:

In [None]:
y_pred_antes1 = xg.fit(X_treinamento, y_treinamento).predict(X_treinamento)
y_pred_depois1 = xg2.fit(X_treinamento, y_treinamento).predict(X_treinamento)

In [None]:
from sklearn.metrics import accuracy_score
print("Training - before fine tuning:", accuracy_score(y_treinamento, y_pred_antes1))
print("Training - after fine tuning.:", accuracy_score(y_treinamento, y_pred_depois1))

In [None]:
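# Fit on the training sample only, then predict on the held-out test sample
y_pred_antes2 = xg.fit(X_treinamento, y_treinamento).predict(X_teste)
y_pred_depois2 = xg2.fit(X_treinamento, y_treinamento).predict(X_teste)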

In [None]:
print("Test - before fine tuning:", accuracy_score(y_teste, y_pred_antes2))
print("Test - after fine tuning.:", accuracy_score(y_teste, y_pred_depois2))
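
Accuracy alone hides which kind of error the model makes. As a hedged sketch, the confusion matrix, precision and recall can be computed with `sklearn.metrics`; the labels below are hypothetical and stand in for `y_teste` and the predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, for illustration only (1 = survived, 0 = died)
y_true_demo = [1, 0, 0, 1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 1]

# Rows are the true classes, columns the predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision: share of predicted survivors who actually survived
precision = precision_score(y_true_demo, y_pred_demo)
# Recall: share of actual survivors the model identified
recall = recall_score(y_true_demo, y_pred_demo)
print(precision, recall)  # → 0.75 0.75
```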

More complete version:

```
start = timeit.default_timer()  # clock reading in seconds
param_grid = { 
    'classifier__n_estimators': [50, 100, 200, 300, 400, 500, 600, 700, 800, 900],
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth' : [3,4,5,6,7,8,9],
    'classifier__criterion' :['gini', 'entropy'],
    'classifier__learning_rate':[0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
    'classifier__subsample':[0.8,0.85,0.9,0.95,1],
    'classifier__colsample_bytree':[0.3,0.5,0.7,0.9,1],
    'classifier__gamma':[0,1,5]}

CV = GridSearchCV(xg, param_grid, cv= 10, n_jobs= 100, scoring='accuracy', verbose=10)
                  
CV.fit(X_treinamento, y_treinamento)  
print(CV.best_params_)    
print(CV.best_score_)

end = timeit.default_timer()
print(end - start)  # elapsed wall-clock time in seconds
```