# Machine Learning Tutorial #4 - How to Do Cross-Validation with Scikit-learn

## Titanic: Machine Learning from Disaster

**Predict survival on the Titanic and get familiar with ML basics: [Start here!](https://www.kaggle.com/c/titanic/data)**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt

%matplotlib inline

In [2]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')  # load the held-out test file, not the training file again

In [3]:
def transformar_sexo(valor):
    if valor == 'female':
        return 1
    else:
        return 0

train['Sex_binario'] = train['Sex'].map(transformar_sexo)
test['Sex_binario'] = test['Sex'].map(transformar_sexo)

In [4]:
variaveis = ['Sex_binario', 'Age']

X = train[variaveis]
y = train['Survived']

In [6]:
from sklearn.model_selection import train_test_split

In [8]:
X = X.fillna(-1)

In [9]:
X.head()

|   | Sex_binario | Age  |
|---|-------------|------|
| 0 | 0           | 22.0 |
| 1 | 1           | 38.0 |
| 2 | 1           | 26.0 |
| 3 | 1           | 35.0 |
| 4 | 0           | 35.0 |


In [13]:
X.shape

(891, 2)

### Train/test split

In [None]:
np.random.seed(0)
X_treino, X_test, y_treino, y_test = train_test_split(X, y, test_size=0.5)
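
As a minimal sketch of what a 50/50 split produces, the snippet below runs `train_test_split` on synthetic arrays with the same number of rows as `X` (891). The names `X_demo` and `y_demo` are placeholders for illustration, not part of the tutorial's data. It passes `random_state` directly to the function, which is an alternative to calling `np.random.seed` beforehand and makes the split reproducible on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumption): 891 rows and 2 columns, like X above
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.arange(891) % 2

# Same split parameters as the cell above, but seeded through random_state
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Repeating the call with the same seed yields the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

With an odd number of rows the two halves cannot be equal, so one side receives the extra row.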


## Cross-validation

In [23]:
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

In [13]:
?KFold

Init signature: KFold(n_splits='warn', shuffle=False, random_state=None)
Docstring:
K-Folds cross-validator

Provides train/test indices to split data in train/test sets. Split
dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation while the k - 1 remaining
folds form the training set.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
n_splits : int, default=3
    Number of folds. Must be at least 2.

    .. versionchanged:: 0.20
        ``n_splits`` default value will change from 3 to 5 in v0.22.

shuffle : boolean, optional
    Whether to shuffle the data before splitting into batches.

random_state : int, RandomState instance or None, optional, default=None
    If int, random_state is the seed used by the ran

In [16]:
X_falso = np.arange(10)
X_falso

# 3 splits, shuffled with a fixed seed so the folds are reproducible
kf = KFold(3, shuffle=True, random_state=0)

for linhas_treino, linhas_valid in kf.split(X_falso):
    print('Train: ', linhas_treino)
    print('Valid: ', linhas_valid)
    print()

Train:  [0 1 3 5 6 7]
Valid:  [2 4 8 9]

Train:  [0 2 3 4 5 8 9]
Valid:  [1 6 7]

Train:  [1 2 4 6 7 8 9]
Valid:  [0 3 5]
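
The property described in the `KFold` docstring, that each fold is used exactly once for validation, can be checked directly. The sketch below is an illustration on the same kind of toy array; the names `X_demo`, `kf_demo`, `valid_counts`, and `fold_sizes` are introduced here and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

# Count how many times each row lands in a validation fold
valid_counts = np.zeros(len(X_demo), dtype=int)
fold_sizes = []

for train_rows, valid_rows in kf_demo.split(X_demo):
    # Within one split, training and validation rows never overlap
    assert len(np.intersect1d(train_rows, valid_rows)) == 0
    valid_counts[valid_rows] += 1
    fold_sizes.append(len(valid_rows))

print(valid_counts)        # one count per row
print(sorted(fold_sizes))  # sizes of the three validation folds
```

Since 10 is not divisible by 3, the folds cannot all have the same size, which is why one validation fold in the output above has four rows.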



## Create 10 repetitions of 2-fold random validation

In [37]:
resultados = []

for rep in range(10):
    print('Repetition: ', rep)
    # A different seed per repetition gives a different 2-fold partition each time
    kf = KFold(2, shuffle=True, random_state=rep)

    for linhas_treino, linhas_valid in kf.split(X):
        print('Train: ', linhas_treino.shape[0])
        print('Valid: ', linhas_valid.shape[0])

        # Select the rows of each fold by position
        X_treino, X_valid = X.iloc[linhas_treino], X.iloc[linhas_valid]
        y_treino, y_valid = y.iloc[linhas_treino], y.iloc[linhas_valid]

        modelo = RandomForestClassifier(n_estimators=100,
                                        n_jobs=-1,
                                        random_state=0)
        modelo.fit(X_treino, y_treino)

        p = modelo.predict(X_valid)

        # Accuracy: fraction of validation rows predicted correctly
        acc = np.mean(y_valid == p)
        resultados.append(acc)
        print('Accuracy: ', acc)
        print()

Repetition:  0
Train:  445
Valid:  446
Accuracy:  0.7713004484304933

Train:  446
Valid:  445
Accuracy:  0.7797752808988764

Repetition:  1
Train:  445
Valid:  446
Accuracy:  0.7443946188340808

Train:  446
Valid:  445
Accuracy:  0.7955056179775281

Repetition:  2
Train:  445
Valid:  446
Accuracy:  0.7757847533632287

Train:  446
Valid:  445
Accuracy:  0.7887640449438202

Repetition:  3
Train:  445
Valid:  446
Accuracy:  0.7533632286995515

Train:  446
Valid:  445
Accuracy:  0.7573033707865169

Repetition:  4
Train:  445
Valid:  446
Accuracy:  0.7354260089686099

Train:  446
Valid:  445
Accuracy:  0.7415730337078652

Repetition:  5
Train:  445
Valid:  446
Accuracy:  0.7219730941704036

Train:  446
Valid:  445
Accuracy:  0.7056179775280899

Repetition:  6
Train:  445
Valid:  446
Accuracy:  0.7757847533632287

Train:  446
Valid:  445
Accuracy:  0.7303370786516854

Repetition:  7
Train:  445
Valid:  446
Accuracy:  0.7040358744394619

Train:  446
Valid:  445
Accuracy:  0.7348314606

In [40]:
len(resultados)

20

In [38]:
np.mean(resultados)

0.7550007557817302
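
The manual loop above can also be expressed with scikit-learn's built-in helpers. The following is a sketch, not the notebook's own method: it uses `RepeatedKFold` and `cross_val_score` on synthetic data from `make_classification` (an assumption, so that the snippet runs without the Titanic files), with hypothetical names `X_demo`, `y_demo`, `cv`, and `scores`. Because `RepeatedKFold` derives its per-repetition seeds from a single `random_state`, the folds differ from those produced by `random_state=rep` in the loop, so the scores would not match the numbers above exactly even on the same data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic binary classification data (assumption, stands in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=0)

# 10 repetitions of 2-fold cross-validation: 20 train/validation splits
cv = RepeatedKFold(n_splits=2, n_repeats=10, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# One accuracy score per split, as collected in `resultados` above
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')

print(len(scores))
print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the estimate moves from split to split.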