<a href="https://colab.research.google.com/github/FGalvao77/Aplicacao-pratico-de-Pipelines-em-Machine-Learning-classification/blob/main/Aplica%C3%A7%C3%A3o_pr%C3%A1tico_de_Pipelines_em_ML_(passo_a_passo)_%7C_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Practical Application of Pipelines in Machine Learning (step by step) | classification**

---

A step-by-step application of scikit-learn's `Pipeline` in a machine learning project 🏄

### **Importing the libraries**

In [1]:
# importing the required libraries and modules
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV

### **Loading the dataset**

In [2]:
# downloading the dataset as a ".csv" file directly from a URL
!wget 'https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv'

--2021-11-21 06:25:37--  https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv
Resolving biostat.app.vumc.org (biostat.app.vumc.org)... 160.129.8.31
Connecting to biostat.app.vumc.org (biostat.app.vumc.org)|160.129.8.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116752 (114K) [text/csv]
Saving to: ‘titanic3.csv.1’


2021-11-21 06:25:37 (961 KB/s) - ‘titanic3.csv.1’ saved [116752/116752]



> The dataset was obtained from the following site:
- [Department of Biostatistics - Vanderbilt University School of Medicine](https://biostat.app.vumc.org/wiki/Main/WebHome)

In [3]:
# reading the dataset
df = pd.read_csv('/content/titanic3.csv')

### **Exploratory data analysis**

In [4]:
# viewing the first 5 rows of the dataset
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [5]:
# shape of the dataset
df.shape  # rows and columns

(1309, 14)

In [6]:
# data types of the attributes (columns)
df.dtypes

pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

In [7]:
# general information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


### **Preparing the data**

In [8]:
# creating a copy of the dataset
df_copied = df.copy()

I will now carry out a very important step in applying _Machine Learning_ techniques: partitioning the dataset into two parts:
- 70% for training and validating the model, and
- 30% for the final test of the model.

_`Note that the 70% portion will be partitioned again into training and validation sets.`_

**Finally, the model that performed best in training and validation will be applied to the 30% reserved for the final test.**
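As a rough illustration of this two-stage partitioning, the sketch below applies `train_test_split` twice to a small synthetic frame. The frame `toy`, its column names, and the use of `stratify` are assumptions made for the example; the notebook itself uses `DataFrame.sample` for the first split and does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for the real dataset (hypothetical columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": np.arange(1000) % 2,
})

# stage 1: reserve 30% of the rows for the final test
modeling, holdout = train_test_split(
    toy, test_size=0.3, stratify=toy["target"], random_state=786
)

# stage 2: split the remaining 70% into training and validation sets
train, valid = train_test_split(
    modeling, test_size=0.3, stratify=modeling["target"], random_state=42
)

print(len(train), len(valid), len(holdout))
```

Stratifying keeps the class proportions similar across the three parts, which can matter when one class is much rarer than the other.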

In [9]:
# partitioning the dataset into "data" and "test_data"
data = df_copied.sample(frac=0.7, random_state=786)
test_data = df_copied.drop(data.index)

data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(test_data.shape))

Data for Modeling: (916, 14)
Unseen Data For Predictions: (393, 14)


In [10]:
# viewing both datasets
display(data.head(10))
print('\n\n')
display(test_data.head(10))

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,3,0,"Mernagh, Mr. Robert",male,,0,0,368703,7.75,,Q,,,
1,2,1,"Reynaldo, Ms. Encarnacion",female,28.0,0,0,230434,13.0,,S,9,,Spain
2,1,1,"Calderhead, Mr. Edward Pennington",male,42.0,0,0,PC 17476,26.2875,E24,S,5,,"New York, NY"
3,3,1,"Moubarek, Master. Gerios",male,,1,1,2661,15.2458,,C,C,,
4,1,0,"Head, Mr. Christopher",male,42.0,0,0,113038,42.5,B11,S,,,London / Middlesex
5,2,1,"Nye, Mrs. (Elizabeth Ramell)",female,29.0,0,0,C.A. 29395,10.5,F33,S,11,,"Folkstone, Kent / New York, NY"
6,2,1,"Trout, Mrs. William H (Jessie L)",female,28.0,0,0,240929,12.65,,S,,,"Columbus, OH"
7,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
8,1,1,"Bishop, Mr. Dickinson H",male,25.0,1,0,11967,91.0792,B49,C,7,,"Dowagiac, MI"
9,1,0,"Thayer, Mr. John Borland",male,49.0,1,1,17421,110.8833,C68,C,,,"Haverford, PA"







Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
2,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3.0,,"New York, NY"
3,1,0,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,,,"Belfast, NI"
4,1,1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18.0,1,0,PC 17757,227.525,C62 C64,C,4.0,,"New York, NY"
5,1,1,"Aubart, Mme. Leontine Pauline",female,24.0,0,0,PC 17477,69.3,B35,C,9.0,,"Paris, France"
6,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.0,0,0,19877,78.85,,S,6.0,,
7,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C,6.0,,"Montreal, PQ"
8,1,1,"Bazzani, Miss. Albina",female,32.0,0,0,11813,76.2917,D15,C,8.0,,
9,1,1,"Bidois, Miss. Rosalie",female,42.0,0,0,PC 17757,227.525,,C,4.0,,


### **Applying a "basic" Pipeline**

**Creating a pipeline `"by hand"`**

In [11]:
# creating a basic pipeline
pipeline_1 = Pipeline([
                       ('one_hot_encoder',  OneHotEncoder(handle_unknown='ignore')),
                       ('standard_scaler', StandardScaler(with_mean=False)), 
                       ('random_forest', RandomForestClassifier())
])

# viewing the pipeline
pipeline_1

Pipeline(steps=[('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
                ('standard_scaler', StandardScaler(with_mean=False)),
                ('random_forest', RandomForestClassifier())])

In [12]:
# viewing the pipeline steps
pipeline_1.steps

[('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
 ('standard_scaler', StandardScaler(with_mean=False)),
 ('random_forest', RandomForestClassifier())]

**Using `make_pipeline` to create a pipeline**

In [13]:
# creating a pipeline with "make_pipeline"
make_pipeline(OneHotEncoder(handle_unknown='ignore'), 
              StandardScaler(with_mean=False), 
              RandomForestClassifier())

Pipeline(steps=[('onehotencoder', OneHotEncoder(handle_unknown='ignore')),
                ('standardscaler', StandardScaler(with_mean=False)),
                ('randomforestclassifier', RandomForestClassifier())])
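The output above shows that `make_pipeline` names each step after the lowercased class name of the estimator, whereas `Pipeline` uses the names supplied by hand. A minimal sketch of reading those generated names and retrieving a step through `named_steps` follows; the two estimators are chosen only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the estimator's class name
pipe = make_pipeline(StandardScaler(), LogisticRegression())

step_names = [name for name, _ in pipe.steps]
print(step_names)

# the generated name can be used to retrieve a step
scaler = pipe.named_steps["standardscaler"]
print(type(scaler).__name__)
```

These generated names are also the prefixes used when addressing a step's hyperparameters, for example `logisticregression__C`.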

**Preparing the data**

In [14]:
# defining the explanatory variables (X) and the response (y)
X = data.drop('survived', axis=1)
y = data['survived']

In [15]:
# splitting the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                      test_size=0.3,
                                                      random_state=42)

In [16]:
# viewing the shapes of the resulting splits
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((641, 13), (275, 13), (641,), (275,))

**Applying the pipelines**

In [17]:
# fitting the pipeline on the training data
pipeline_1.fit(X_train, y_train)

Pipeline(steps=[('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
                ('standard_scaler', StandardScaler(with_mean=False)),
                ('random_forest', RandomForestClassifier())])

In [18]:
# making predictions on the validation data - "X_valid"
pipeline_1.predict(X_valid)

array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0])

In [19]:
# evaluating the model's accuracy on the validation data - "X_valid" and "y_valid"
pipeline_1.score(X_valid, y_valid)

0.9745454545454545

### **Applying a "complete" Pipeline**

#### **Data preprocessing**

**Separating the categorical and numerical variables and applying the appropriate transformations**

In [20]:
# general information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [21]:
# dropping the "name" and "home.dest" attributes
data_droped = df.drop(['name', 'home.dest'], axis=1)

# viewing the columns
data_droped.columns

Index(['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare',
       'cabin', 'embarked', 'boat', 'body'],
      dtype='object')

In [22]:
# viewing the data type of selected attributes
data_droped['sex'].dtype.name, data_droped['age'].dtype.name, data_droped['parch'].dtype.name,

('object', 'float64', 'int64')

In [23]:
# splitting the data into training/validation and final test sets
data = data_droped.sample(frac=0.7, random_state=786)
test_data = data_droped.drop(data.index)

# resetting the indices
data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

# viewing the shapes of the resulting splits
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(test_data.shape))

Data for Modeling: (916, 12)
Unseen Data For Predictions: (393, 12)


In [24]:
# defining the explanatory variables (X) and the response (y)
X = data.drop('survived', axis=1)
y = data['survived']

In [25]:
# splitting the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                      test_size=0.3,
                                                      random_state=42)

In [26]:
# creating a function that separates the DataFrame's numerical and categorical columns
def separate_cols(df):
    cols_cats = []
    cols_nums = []

    for col in df.columns: 
        if df.dtypes[col] == 'object':
            cols_cats.append(col)
        else:
            cols_nums.append(col)
    
    return f'cols categ: {len(cols_cats), cols_cats}', f'cols nums: {len(cols_nums), cols_nums}'

In [27]:
# applying the function to the training DataFrame
separate_cols(X_train)

("cols categ: (5, ['sex', 'ticket', 'cabin', 'embarked', 'boat'])",
 "cols nums: (6, ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body'])")

In [28]:
# collecting the categorical variables
vars_cat = [col for col in X_train if X_train[col].dtype.name == 'object']

# viewing the categorical variables
vars_cat

['sex', 'ticket', 'cabin', 'embarked', 'boat']

In [29]:
# type of the object created
type(vars_cat)

list

In [30]:
# collecting the numerical variables
vars_num = [col for col in X_train.columns if col not in vars_cat]

# viewing the numerical variables
vars_num

['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
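The two column lists built above can also be obtained with pandas' `select_dtypes`. The sketch below is one possible variant, not the notebook's method: the helper `split_columns_by_dtype` and the frame `toy` are hypothetical, and it selects on `"number"` rather than on `'object'`, so any non-numeric column (not only object columns) is treated as categorical.

```python
import pandas as pd

def split_columns_by_dtype(frame):
    """Return (categorical, numerical) column names based on dtype."""
    num_cols = frame.select_dtypes(include="number").columns.tolist()
    cat_cols = frame.select_dtypes(exclude="number").columns.tolist()
    return cat_cols, num_cols

# small hypothetical frame with one text column and two numeric columns
toy = pd.DataFrame({
    "sex": ["male", "female"],
    "age": [22.0, 38.0],
    "pclass": [3, 1],
})

cat_cols, num_cols = split_columns_by_dtype(toy)
print(cat_cols, num_cols)
```

Returning the lists themselves, rather than formatted strings, lets the result be passed straight to a `ColumnTransformer`.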

**Creating a Pipeline to handle the categorical variables**

In [31]:
# creating a pipeline to handle the categorical variables
# note: from scikit-learn 1.2 onward, OneHotEncoder's "sparse" argument is named "sparse_output"
pipeline_cat = Pipeline([
                         ('imputer', SimpleImputer(strategy='constant', 
                                                   fill_value='missing')),
                         ('encoder', OneHotEncoder(handle_unknown='ignore', 
                                                   sparse=False))                       
])

# viewing the pipeline
pipeline_cat

Pipeline(steps=[('imputer',
                 SimpleImputer(fill_value='missing', strategy='constant')),
                ('encoder',
                 OneHotEncoder(handle_unknown='ignore', sparse=False))])

**Creating a Pipeline to handle the numerical variables**

In [32]:
# creating a pipeline to handle the numerical variables
pipeline_num = Pipeline([
                         ('imputer', SimpleImputer(strategy='median')),
                         ('scaler', StandardScaler())                       
])

# viewing the pipeline
pipeline_num 

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])
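To see what an imputing-and-encoding pipeline and an imputing-and-scaling pipeline produce once each is routed to its own columns, here is a small sketch on a toy frame. The frame `toy` and its values are made up for illustration, and `sparse_threshold=0` is my own choice to force a dense result without depending on the encoder's `sparse`/`sparse_output` argument, whose name varies by scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical frame with one missing value in each column
toy = pd.DataFrame({
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
    "age": [29.0, np.nan, 2.0, 30.0],
})

cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# sparse_threshold=0 makes the combined output a dense array
transformer = ColumnTransformer(
    [("cat", cat_steps, ["embarked"]), ("num", num_steps, ["age"])],
    sparse_threshold=0,
)

result = transformer.fit_transform(toy)
print(result.shape)
```

The categorical column expands into one indicator column per category (including the `'missing'` placeholder), while the numerical column stays a single scaled column.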

**Combining the Pipelines**

In [33]:
# combining the pipelines to preprocess the data
pre_process = ColumnTransformer([
                                 ('cat', pipeline_cat, vars_cat), 
                                 ('num', pipeline_num, vars_num)
])

# viewing the object created
pre_process

ColumnTransformer(transformers=[('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('encoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False))]),
                                 ['sex', 'ticket', 'cabin', 'embarked',
                                  'boat']),
                                ('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['pclass', 'age', 'sibsp', 'parch', 'fare',
                                  

### **Creating the Pipelines**

#### **Building them with "make_pipeline"**

**Creating the pipeline with "Random Forest"**

In [34]:
# creating the pipeline with "make_pipeline", using "Random Forest"
pipeline_random_forest = make_pipeline(pre_process, 
                                       RandomForestClassifier(random_state=42))

# viewing the pipeline
pipeline_random_forest

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['sex', 'ticket', 'cabin',
                                                   'embarked', 'boat']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                    

In [35]:
# viewing the pipeline steps
pipeline_random_forest[0], pipeline_random_forest[1]

(ColumnTransformer(transformers=[('cat',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(fill_value='missing',
                                                                 strategy='constant')),
                                                  ('encoder',
                                                   OneHotEncoder(handle_unknown='ignore',
                                                                 sparse=False))]),
                                  ['sex', 'ticket', 'cabin', 'embarked',
                                   'boat']),
                                 ('num',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(strategy='median')),
                                                  ('scaler', StandardScaler())]),
                                  ['pclass', 'age', 'sibsp', 'parch', 'fare',
                    

In [36]:
# fitting the "pipeline_random_forest" pipeline on the training data
pipeline_random_forest.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['sex', 'ticket', 'cabin',
                                                   'embarked', 'boat']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                    

In [37]:
# viewing the accuracy of "pipeline_random_forest" on the validation data
pipeline_random_forest.score(X_valid, y_valid)

0.9745454545454545

**Creating the pipeline with "Logistic Regression"**

In [38]:
# creating the pipeline with "make_pipeline", using "Logistic Regression"
pipeline_log_reg = make_pipeline(pre_process, 
                                 LogisticRegression(random_state=42))

# viewing the pipeline
pipeline_log_reg

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['sex', 'ticket', 'cabin',
                                                   'embarked', 'boat']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                    

In [39]:
# viewing the pipeline steps
pipeline_log_reg[0], pipeline_log_reg[1]

(ColumnTransformer(transformers=[('cat',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(fill_value='missing',
                                                                 strategy='constant')),
                                                  ('encoder',
                                                   OneHotEncoder(handle_unknown='ignore',
                                                                 sparse=False))]),
                                  ['sex', 'ticket', 'cabin', 'embarked',
                                   'boat']),
                                 ('num',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(strategy='median')),
                                                  ('scaler', StandardScaler())]),
                                  ['pclass', 'age', 'sibsp', 'parch', 'fare',
                    

In [40]:
# fitting the "pipeline_log_reg" pipeline on the training data
pipeline_log_reg.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['sex', 'ticket', 'cabin',
                                                   'embarked', 'boat']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                    

In [41]:
# viewing the accuracy of "pipeline_log_reg" on the validation data
pipeline_log_reg.score(X_valid, y_valid)

0.9709090909090909

**Creating the pipeline with "KNN", hyperparameters, and grid search**

In [42]:
# hyperparameter grid for tuning the K-NN model
params = {
    'n_neighbors': list(range(1, 11)),                      # number of neighbors
    'p': [1, 2],                                            # distance metric (Manhattan = 1 | Euclidean = 2)
    'weights': ['uniform', 'distance'],                     # weight function used in prediction
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']  # algorithms used to compute the nearest neighbors
}

In [43]:
# defining the grid search used for the tuning process
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, 
                    cv=10, verbose=2)

# viewing the grid search configuration
grid

GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
                         'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'p': [1, 2], 'weights': ['uniform', 'distance']},
             verbose=2)

In [44]:
# creating the pipeline with "make_pipeline", using "grid"
pipeline_knn = make_pipeline(pre_process, 
                             grid)

# viewing the pipeline
pipeline_knn

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['sex', 'ticket', 'cabin',
                                                   'embarked', 'boat']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                    

In [45]:
# viewing the pipeline steps
pipeline_knn[0], pipeline_knn[1]

(ColumnTransformer(transformers=[('cat',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(fill_value='missing',
                                                                 strategy='constant')),
                                                  ('encoder',
                                                   OneHotEncoder(handle_unknown='ignore',
                                                                 sparse=False))]),
                                  ['sex', 'ticket', 'cabin', 'embarked',
                                   'boat']),
                                 ('num',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(strategy='median')),
                                                  ('scaler', StandardScaler())]),
                                  ['pclass', 'age', 'sibsp', 'parch', 'fare',
                    

In [46]:
# fitting the "pipeline_knn" pipeline on the training data
pipeline_knn.fit(X_train, y_train)

Fitting 10 folds for each of 160 candidates, totalling 1600 fits
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=uniform; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1, weights=distance; total time=   0.0s
[CV] END algorithm=auto, n_neighbors=1, p=1

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['sex', 'ticket', 'cabin',
                                                   'embarked', 'boat']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                    

In [47]:
# viewing the accuracy of "pipeline_knn" on the validation data
pipeline_knn.score(X_valid, y_valid)

0.9163636363636364
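In the cells above, the grid search is the last step of the pipeline, so the preprocessing is fitted once on the whole training set before the search divides it into folds. A different arrangement, shown here only as a sketch on synthetic data, wraps the entire pipeline in `GridSearchCV` so that the preprocessing is refitted inside every fold. The data from `make_classification`, the reduced grid, and the choice of `StandardScaler` as the only preprocessing step are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data standing in for the preprocessed Titanic features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "kneighborsclassifier__n_neighbors": [1, 3, 5],
    "kneighborsclassifier__p": [1, 2],
}

# the whole pipeline is cloned and refitted on each training fold
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

With this layout the scaler never sees the fold used for scoring, at the cost of repeating the preprocessing for every candidate and fold.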

### **Cross-validation**

In [48]:
# creating the "cross validation" splitter
cross_val = KFold(n_splits=10, 
                  shuffle=True, 
                  random_state=42)

# viewing the object
cross_val

KFold(n_splits=10, random_state=42, shuffle=True)
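To make the splitter's behavior concrete, the sketch below counts how many rows land in each validation fold when a `KFold` with the same settings is applied to 916 rows, the size of the modeling set in this notebook. The placeholder array of zeros is an assumption; only the number of rows matters here.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 916  # number of rows in the modeling set
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# the array's contents are irrelevant; KFold only uses its length
placeholder = np.zeros((n_samples, 1))
fold_sizes = [len(valid_idx) for _, valid_idx in kf.split(placeholder)]

print(fold_sizes)
```

Because 916 is not divisible by 10, the folds are not all the same size, so each per-fold accuracy is computed over a slightly different number of rows.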

**"Cross validation" with "pipeline_random_forest"**


In [49]:
# applying "cross validation" with "pipeline_random_forest"
cross_val_score(pipeline_random_forest, 
                X, y, 
                cv=cross_val)

array([0.9673913 , 0.97826087, 0.97826087, 0.9673913 , 0.93478261,
       0.98913043, 0.98901099, 0.98901099, 0.96703297, 0.97802198])

In [50]:
# applying "cross validation" with "pipeline_random_forest" and taking the mean
acc_med_randon_forest = cross_val_score(pipeline_random_forest, 
                             X, y, 
                             cv=cross_val).mean()

# viewing the result
acc_med_randon_forest

0.9738294314381271

**"Cross validation" with "pipeline_log_reg"**

In [55]:
# applying "cross validation" with "pipeline_log_reg"
cross_val_score(pipeline_log_reg, 
                X, y, 
                cv=cross_val)

array([0.9673913 , 0.97826087, 0.97826087, 0.9673913 , 0.94565217,
       0.98913043, 0.98901099, 0.98901099, 0.96703297, 0.97802198])

In [52]:
# applying "cross validation" with "pipeline_log_reg" and taking the mean
acc_med_log_reg = cross_val_score(pipeline_log_reg, 
                                  X, y, 
                                  cv=cross_val).mean()

# viewing the result
acc_med_log_reg

0.9749163879598661

### **Applying the best-performing model to the test data**

In [53]:
# defining the explanatory variables and the response
X_test = test_data.drop('survived', axis=1)
y_test = test_data['survived']

In [54]:
# applying "cross validation" with "pipeline_log_reg" on the test data and taking the mean
acc_med_log_reg = cross_val_score(pipeline_log_reg, 
                                  X_test, y_test, 
                                  cv=cross_val).mean()

# viewing the result
acc_med_log_reg

0.9796153846153846
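Note that `cross_val_score` clones the estimator and refits it on each set of training folds, so the figure above describes fresh logistic regression models trained on parts of the test data rather than the pipeline fitted earlier. One way to measure the already fitted model on unseen rows is to call `score` on it directly. The sketch below shows that pattern on synthetic data; the data from `make_classification` and the reduced pipeline are assumptions, and the notebook's own figures are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data standing in for the modeling and held-out portions
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_model, X_holdout, y_model, y_holdout = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=786
)

# fit once on the modeling portion only
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_model, y_model)

# score the fitted pipeline on rows it has never seen
holdout_accuracy = pipe.score(X_holdout, y_holdout)
print(holdout_accuracy)
```

In the notebook's terms, this would correspond to fitting the chosen pipeline on `X` and `y` and then scoring it on `X_test` and `y_test`.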