<a href="https://colab.research.google.com/github/Diogo364/StepsIntoML/blob/master/Pipeline_Titanic_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sklearn Pipelines Tutorial

### Understanding Pipelines
---
Pipelines are widely used to make processes such as data cleaning, and even Machine Learning predictions, more reproducible.

To better understand the concept of Machine Learning pipelines, see this [article](https://medium.com/analytics-vidhya/what-is-a-pipeline-in-machine-learning-how-to-create-one-bda91d0ceaca). A Portuguese-language overview is also available [here](https://docs.microsoft.com/pt-br/azure/machine-learning/concept-ml-pipelines).


## Basic Libraries
---
First, we will import the classic Python data manipulation libraries.

In [0]:
import pandas as pd
import numpy as np

## Import DataSet
---

1.   Download the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic/data);
2.   Upload the files to Colab;
3.   Continue with the tutorial.

In [0]:
train_set = pd.read_csv('train.csv')
test_set = pd.read_csv('test.csv')

## Creating a Data Transformation Pipeline
---

In [0]:
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

### Custom feature engineering
---
First we'll define a class based on `BaseEstimator` to apply our own custom transformation to the data.

We'll use the `transform` method to:

 1.   Create a new feature: `family`
 2.   Drop the columns `['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'PassengerId']`


In [0]:
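class TitanicFeatureEngeneering(BaseEstimator):
    """Adds the `family` feature and drops the columns the model does not use."""

    def fit(self, x_dataset, y=None):
        # Nothing is learned from the data: this transformer is stateless.
        return self

    def transform(self, x_dataset):
        drop_columns = ['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'PassengerId']
        # assign() returns a new DataFrame, so the caller's data is left unmodified.
        transformed_dataset = x_dataset.assign(
            family=x_dataset['SibSp'] + x_dataset['Parch'])
        return transformed_dataset.drop(columns=drop_columns)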
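
To check what this kind of transformer does, the sketch below applies the same logic to a two-row toy DataFrame. The sample values are made up for illustration, and `FamilyFeatureSketch` is a hypothetical stand-in defined only so the snippet runs on its own. Inheriting from `TransformerMixin` is an addition here (not part of the class above) that supplies `fit_transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FamilyFeatureSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the tutorial's transformer, for illustration."""

    drop_columns = ['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'PassengerId']

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        # assign() builds a new DataFrame, leaving the input untouched
        X = X.assign(family=X['SibSp'] + X['Parch'])
        return X.drop(columns=self.drop_columns)


# Made-up rows with the same columns as the Kaggle training features
toy = pd.DataFrame({
    'PassengerId': [1, 2],
    'Pclass': [3, 1],
    'Name': ['Passenger A', 'Passenger B'],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
    'SibSp': [1, 0],
    'Parch': [0, 0],
    'Ticket': ['T1', 'T2'],
    'Fare': [7.25, 71.28],
    'Cabin': [None, 'C85'],
    'Embarked': ['S', 'C'],
})

result = FamilyFeatureSketch().fit_transform(toy)
print(list(result.columns))
print(result['family'].tolist())
```

The six listed columns are gone, `family` is appended at the end, and `toy` itself still has its original columns.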

### Dealing with missing values
---
We'll use two different approaches, depending on the data type.
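
Before choosing an imputation strategy, it can help to see which columns actually contain missing values and what type they are. Here is a minimal sketch on a made-up frame (the values are illustrative, not taken from the Titanic files):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.28, 7.92],
    'Embarked': ['S', np.nan, 'S'],
})

# Number of missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Split the column names by dtype: numbers vs. everything else
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols, other_cols)
```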

#### 1. Numerical


Our numerical features are:


In [0]:
numeric_features = ['Age', 'Fare', 'family']

To deal with their missing values, we'll simply replace them with the median value.

In [0]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])
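
As a quick illustration of median imputation, the sketch below runs `SimpleImputer` on a single made-up column. The numbers are assumptions chosen so the median is easy to verify by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with one missing entry (illustrative values)
ages = np.array([[22.0], [np.nan], [26.0], [35.0]])

imputer = SimpleImputer(strategy='median')
filled = imputer.fit_transform(ages)

# The median of the observed values (22, 26, 35) is 26, so the gap becomes 26.
print(imputer.statistics_)  # fill value learned for each column
print(filled.ravel())
```

The fill value is learned during `fit` and reused by `transform`, so new data is imputed with the median of the data the imputer was fitted on.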

#### 2. Categorical


Our categorical features are:

In [0]:
categorical_features = ['Embarked', 'Sex', 'Pclass']

To deal with categorical features we'll do the following:

1.   Replace missing values with the most frequent value
2.   Convert the categorical features into numerical features using `OneHotEncoder`

In [0]:
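categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`;
    # on newer versions use OneHotEncoder(..., sparse_output=False).
    ('onehot', OneHotEncoder(handle_unknown="ignore", sparse=False)),
])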
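
The two steps can be tried on a single made-up column, as in the sketch below. The sample values are assumptions, and the encoder's default sparse output is densified with `.toarray()` so the snippet does not depend on the version-specific `sparse` / `sparse_output` argument.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative column with one missing value
ports = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']}, dtype=object)

sketch = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# The missing entry is filled with 'S' (the most frequent value) before encoding.
encoded = sketch.fit_transform(ports).toarray()
print(sketch.named_steps['onehot'].categories_)
print(encoded)

# A category never seen during fit is encoded as all zeros instead of raising.
unseen = sketch.transform(pd.DataFrame({'Embarked': ['X']}, dtype=object)).toarray()
print(unseen)
```

Each row ends up with one column per category (`C`, `Q`, `S`), which is why the three categorical features of the tutorial expand into several numeric columns.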

#### ColumnTransformer
---

To apply different pipelines to different features we'll use the `ColumnTransformer` class.

In [0]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
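
To see how `ColumnTransformer` routes columns, here is a minimal sketch on a made-up frame. The column values and the name `toy_preprocessor` are assumptions, and `sparse_threshold=0` is set only so the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Sex': pd.Series(['male', 'female', 'female'], dtype=object),
    # Not listed in any transformer below, so it is dropped from the output
    'Ticket': pd.Series(['A1', 'B2', 'C3'], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), ['Age']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = toy_preprocessor.fit_transform(toy)
print(out.shape)
print(out)
```

The outputs are concatenated in the order the transformers are listed: first the imputed `Age`, then one column per `Sex` category. Columns that no transformer claims are discarded because `remainder` defaults to `'drop'`.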


Finally, we'll finish our <b><u>Data Transformation Pipeline</u></b> by combining the `TitanicFeatureEngeneering` class with the column transformations from `preprocessor`.

In [0]:
transformation_pipeline = Pipeline(steps=[
    ('family', TitanicFeatureEngeneering()),
    ('preprocessor', preprocessor),
])

## Creating a Machine Learning pipeline


For this part we'll combine our Data Transformation Pipeline with scikit-learn's `RandomForestClassifier` algorithm, using the following parameters:


```
{
  'max_depth': 6,
  'min_samples_leaf': 1, 
  'min_samples_split': 4, 
  'random_state': 42
}
```



In [0]:
from sklearn.ensemble import RandomForestClassifier

In [0]:
machine_learning_pipeline = Pipeline(steps=[
    ('datatransformation', transformation_pipeline),
    ('machinelearning', RandomForestClassifier(max_depth=6,
                                               min_samples_leaf=1,
                                               min_samples_split=4,
                                               random_state=42)),  # fixed seed for reproducible results
])
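
The step names given in `steps` are also how individual parts of a pipeline are reached later. The sketch below uses a simplified two-step pipeline (an assumption, with a plain `SimpleImputer` standing in for the full transformation pipeline) to show step lookup and the `<step>__<parameter>` naming used by `set_params`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline(steps=[
    ('datatransformation', SimpleImputer(strategy='median')),
    ('machinelearning', RandomForestClassifier(max_depth=6, random_state=42)),
])

# Look up a step by the name given in `steps`
forest = toy_pipeline.named_steps['machinelearning']
print(forest.max_depth)

# Nested parameters are addressed as <step name>__<parameter name>
toy_pipeline.set_params(machinelearning__max_depth=4)
print(toy_pipeline.get_params()['machinelearning__max_depth'])
```

The same naming scheme is what tools such as `GridSearchCV` expect when tuning parameters inside a pipeline.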

## Using our pipelines

### Data Transformation
---


First we'll separate our features from our target, then we'll see our Data Transformation Pipeline in action.

In [0]:
x_training = train_set.drop(columns='Survived')
y_training = train_set['Survived']

#### Original Data

In [0]:
x_training.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


#### Transformed Data

In [0]:
transformation_pipeline.fit(x_training, y_training)

Pipeline(memory=None,
         steps=[('family', TitanicFeatureEngeneering()),
                ('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
    

In [0]:
transformed_data = transformation_pipeline.transform(x_training)
transformed_data

array([[22.    ,  7.25  ,  1.    , ...,  0.    ,  0.    ,  1.    ],
       [38.    , 71.2833,  1.    , ...,  1.    ,  0.    ,  0.    ],
       [26.    ,  7.925 ,  0.    , ...,  0.    ,  0.    ,  1.    ],
       ...,
       [28.    , 23.45  ,  3.    , ...,  0.    ,  0.    ,  1.    ],
       [26.    , 30.    ,  0.    , ...,  1.    ,  0.    ,  0.    ],
       [32.    ,  7.75  ,  0.    , ...,  0.    ,  0.    ,  1.    ]])


Even though the column names have disappeared, we can compare the data in the first 3 columns, which correspond to our numerical features.

In [0]:
pd.DataFrame(transformed_data).head()

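|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 7.25 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 38.0 | 71.28 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 26.0 | 7.92 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 35.0 | 53.1 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 35.0 | 8.05 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |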


### Machine Learning
---

Now we'll see how easy it is to use our entire pipeline.

#### Fitting

In [0]:
machine_learning_pipeline.fit(x_training, y_training)

Pipeline(memory=None,
         steps=[('datatransformation',
                 Pipeline(memory=None,
                          steps=[('family', TitanicFeatureEngeneering()),
                                 ('preprocessor',
                                  ColumnTransformer(n_jobs=None,
                                                    remainder='drop',
                                                    sparse_threshold=0.3,
                                                    transformer_weights=None,
                                                    transformers=[('num',
                                                                   Pipeline(memory=None,
                                                                            steps=[('imputer',
                                                                                    SimpleImputer(add_indicator=False,
                                                                                                  copy=True,
   

#### Scoring

In [0]:
machine_learning_pipeline.score(x_training, y_training)

0.867564534231201
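
Note that this score is the accuracy on the same data the pipeline was fitted on, so it tends to be optimistic. A common complement is cross-validation. The sketch below uses synthetic data from `make_classification` (an assumption, so it runs without the Kaggle files) and the bare classifier instead of the full pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data so the snippet runs on its own
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

model = RandomForestClassifier(max_depth=6, min_samples_leaf=1,
                               min_samples_split=4, random_state=42)

# Accuracy on the data the model was fitted on
train_score = model.fit(X, y).score(X, y)

# Accuracy on held-out folds: each fold is scored by a model that never saw it
cv_scores = cross_val_score(model, X, y, cv=5)

print(train_score, cv_scores.mean())
```

Because a pipeline is itself an estimator, the same call should work here as `cross_val_score(machine_learning_pipeline, x_training, y_training, cv=5)`, which refits the imputers and encoder inside each fold.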

#### Predicting

In [0]:
predictions = machine_learning_pipeline.predict(test_set)
predictions

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,


That's it!

Now, if you want to submit to Kaggle's leaderboard, you just have to join the predictions with the `PassengerId` column and download the CSV file.

In [0]:
test_set['Survived'] = predictions
submit_data = test_set[['PassengerId', 'Survived']]
submit_data

|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [0]:
submit_data.to_csv('survival_predictions.csv', index=False)