# Classifiers Mini-Course

## Lesson 3: Tuning the parameters...

### Recognizing handwritten digits from 0 to 9

#### Importing dependencies and setup

In [1]:
# Import datasets, and ML algorithms
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Import Numeric library
import numpy as np

# Import plotting library
import matplotlib.pyplot as plt

# Render plots inside the notebook
%matplotlib inline

In [2]:
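# Set the random seeds for reproducibility.
# scikit-learn draws from NumPy's global RNG when no random_state is given,
# so seeding only the standard library's `random` module is not enough.
import random
random.seed(0)
np.random.seed(0)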

#### Preparing the data

In [3]:
digits = datasets.load_digits()

n_samples = len(digits.images)

X = digits.images.reshape((n_samples, -1))
X

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

In [4]:
y = digits.target
y

array([0, 1, 2, ..., 8, 9, 8])
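The reshape above flattens each 8×8 image into a row of 64 pixel intensities, which is the 2-D `(n_samples, n_features)` layout that scikit-learn estimators expect. A minimal sketch to check the shapes (the names `images` and `X_flat` are introduced here only for illustration):

```python
from sklearn import datasets

digits = datasets.load_digits()

images = digits.images            # shape: (n_samples, 8, 8)
n_samples = len(images)

# -1 lets NumPy infer the second dimension: 8 * 8 = 64 features per sample.
X_flat = images.reshape((n_samples, -1))

print(images.shape)   # (1797, 8, 8)
print(X_flat.shape)   # (1797, 64)

# Each flattened row holds the same pixels as its image, read row by row.
assert (X_flat[0] == images[0].ravel()).all()
```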

#### Tuning the parameters

##### Testing every combination

In [5]:
# Read about the parameters at: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
svc = SVC(gamma='scale')

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = GridSearchCV(svc, parameters, cv=3)

clf.fit(X, y)
clf.best_estimator_

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [6]:
clf.predict(X)

array([0, 1, 2, ..., 8, 9, 8])

In [7]:
clf.score(X,y)

1.0
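A grid search fits one model per parameter combination and per fold, so its cost grows multiplicatively with the size of the grid. The sketch below uses scikit-learn's `ParameterGrid` to list the candidates for the grid above. It also shows that `best_score_` (the mean cross-validated score) is a different number from `clf.score(X, y)`, which re-scores data the refit model has already seen.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Every combination that GridSearchCV will try: 2 kernels x 2 values of C.
candidates = list(ParameterGrid(parameters))
for params in candidates:
    print(params)

cv = 3
print(len(candidates) * cv, "fits, plus one final refit on all the data")

clf = GridSearchCV(SVC(gamma='scale'), parameters, cv=cv)
clf.fit(X, y)

print(clf.best_params_)   # the winning combination
print(clf.best_score_)    # mean accuracy over the validation folds
print(clf.score(X, y))    # accuracy on data the refit model was trained on
```

The gap between the last two numbers is the reason the lesson later holds out a separate test set.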

##### Testing at random

In [8]:
parameters = {
    'kernel':('linear', 'rbf', 'poly', 'sigmoid'),
    'C': [10**x for x in range(0,5)],
    'gamma': ('auto', 'scale', 0.0001),
    'tol': (1e-3, 1e-6, 1e-9), # Tolerance for stopping criterion.
    }

clf = RandomizedSearchCV(SVC(), parameters, cv=3)

clf.fit(X, y)
clf.best_estimator_

SVC(C=1000, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='poly',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=1e-09, verbose=False)

In [9]:
clf.predict(X)

array([0, 1, 2, ..., 8, 9, 8])

In [10]:
clf.score(X,y)

1.0
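Unlike the grid search, `RandomizedSearchCV` does not try everything: it samples `n_iter` combinations (10 by default) from the search space. The space above has 4 × 5 × 3 × 3 = 180 combinations, so each run sees only a small part of it, and the winner can change between runs unless `random_state` is fixed. A sketch using `ParameterSampler` to make the sampling visible (the seed value 0 is an arbitrary choice for this example):

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {
    'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
    'C': [10**x for x in range(0, 5)],
    'gamma': ('auto', 'scale', 0.0001),
    'tol': (1e-3, 1e-6, 1e-9),
}

# Size of the full grid that an exhaustive search would have to cover.
total = len(ParameterGrid(parameters))
print(total)  # 180

# Draw 10 combinations, as RandomizedSearchCV does with its default n_iter.
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=0))
for params in sampled:
    print(params)

# The same seed always yields the same sample.
same_seed = list(ParameterSampler(parameters, n_iter=10, random_state=0))
print(sampled == same_seed)
```

Passing `random_state=0` to `RandomizedSearchCV` itself has the same effect on the search.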

#### Tuning and Testing

If we use the same data to tune the parameters and to evaluate the model, we run into the same problem we had when we used the same data to train and test: the score is optimistic, because the model has already seen the data it is being evaluated on.

![train_validation_test.png](https://cdn-images-1.medium.com/max/720/1*4G__SV580CxFj78o9yUXuQ.png)

*(Image from: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)*
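The three-way split in the image can be sketched with two calls to `train_test_split`. This is only an illustration of the idea: the rest of this lesson keeps a single held-out test set and lets the cross-validation inside `GridSearchCV` play the role of the validation set. The split sizes and `random_state=0` are assumptions chosen for the example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# First, set aside the test set; it is used only once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=0)

# Then split what is left into training and validation sets.
# 0.25 of the remaining 80% is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, stratify=y_rest, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```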

##### Training and Validating

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    test_size=0.20) # Common values: 0.2, 0.25, and 0.3.

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(1437, 64) (1437,)
(360, 64) (360,)


In [12]:
svc = SVC(gamma='scale', probability=True)

parameters = {'C':[1, 10]}
clf = GridSearchCV(svc, parameters, cv=3, scoring='f1_macro')  # optionally add n_jobs=-1 to run the fits on all CPU cores

clf.fit(X_train, y_train)

clf.best_estimator_

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

In [13]:
clf.score(X_train,y_train)

1.0

##### Testing

In [13]:
clf.predict(X_test)

array([4, 0, 6, 0, 4, 4, 4, 7, 2, 2, 1, 1, 3, 3, 4, 3, 1, 5, 1, 1, 0, 2,
       4, 1, 4, 5, 1, 4, 8, 3, 8, 5, 3, 7, 5, 3, 9, 2, 5, 7, 4, 2, 0, 8,
       0, 9, 0, 9, 7, 6, 6, 7, 2, 3, 7, 7, 5, 8, 6, 6, 8, 5, 3, 1, 5, 9,
       1, 5, 3, 7, 0, 8, 7, 2, 8, 3, 9, 2, 8, 8, 9, 3, 1, 7, 2, 7, 4, 0,
       8, 7, 1, 9, 1, 6, 5, 1, 0, 0, 4, 2, 1, 6, 4, 8, 3, 6, 4, 6, 5, 9,
       9, 2, 5, 0, 4, 2, 0, 7, 4, 0, 3, 9, 0, 8, 4, 9, 7, 6, 2, 1, 0, 5,
       7, 9, 1, 0, 3, 1, 6, 9, 8, 4, 1, 6, 6, 4, 0, 1, 3, 2, 5, 2, 7, 3,
       6, 2, 1, 7, 1, 0, 6, 7, 1, 5, 5, 4, 1, 6, 0, 8, 8, 0, 8, 7, 7, 8,
       9, 8, 9, 2, 0, 2, 3, 6, 0, 9, 7, 6, 1, 7, 1, 4, 2, 1, 6, 1, 2, 5,
       5, 1, 5, 5, 2, 8, 6, 5, 9, 0, 5, 3, 2, 9, 9, 0, 4, 7, 3, 7, 1, 7,
       6, 9, 9, 5, 8, 5, 4, 3, 8, 2, 3, 8, 0, 7, 3, 3, 5, 4, 4, 4, 6, 8,
       9, 3, 6, 0, 4, 6, 4, 6, 6, 9, 9, 1, 4, 5, 2, 2, 8, 4, 4, 5, 6, 5,
       7, 4, 3, 3, 0, 9, 2, 4, 9, 5, 9, 5, 6, 0, 3, 8, 9, 0, 1, 5, 2, 6,
       6, 3, 6, 7, 1, 3, 6, 9, 8, 8, 7, 5, 7, 0, 9,

In [14]:
# For each sample the probability for each class
probas = clf.predict_proba(X_test)
probas

array([[3.56982209e-03, 8.92419553e-04, 1.43885613e-03, ...,
        2.70690243e-03, 1.13086194e-03, 4.94026696e-04],
       [9.80233455e-01, 4.60541219e-04, 1.36088136e-03, ...,
        9.24481364e-04, 1.78907844e-03, 7.85523384e-03],
       [9.66663509e-04, 1.10965193e-02, 3.46355848e-03, ...,
        2.32462371e-03, 3.25292571e-03, 1.93700936e-03],
       ...,
       [7.45776936e-04, 9.77829499e-01, 7.23602444e-04, ...,
        1.45572379e-03, 9.17451312e-03, 2.77010430e-03],
       [2.43875007e-03, 9.82635780e-04, 3.24468197e-03, ...,
        3.00679941e-03, 3.76600640e-03, 9.69491843e-01],
       [7.19736984e-04, 8.66881382e-04, 1.12586769e-03, ...,
        3.21442627e-03, 1.04267538e-03, 3.35778971e-04]])

In [15]:
# Let's take a look at the probabilities for the first predicted sample:
probas[0]

array([3.56982209e-03, 8.92419553e-04, 1.43885613e-03, 4.50727314e-04,
       9.86201606e-01, 9.03912231e-04, 2.21086550e-03, 2.70690243e-03,
       1.13086194e-03, 4.94026696e-04])
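Each row of `probas` has one entry per class, in the order of `clf.classes_` (the digits 0 to 9 here), and the entries sum to 1. A small sketch using the values printed above: the largest probability sits at index 4, which matches the first prediction returned by `clf.predict(X_test)`. For `SVC`, the probabilities come from a separate calibration step, so on borderline samples the arg-max of `predict_proba` may disagree with `predict`.

```python
import numpy as np

# Probabilities for the first test sample, copied from the output above.
first = np.array([3.56982209e-03, 8.92419553e-04, 1.43885613e-03, 4.50727314e-04,
                  9.86201606e-01, 9.03912231e-04, 2.21086550e-03, 2.70690243e-03,
                  1.13086194e-03, 4.94026696e-04])

# The class probabilities of one sample add up to (approximately) 1.
print(first.sum())

# Index of the largest probability: the most likely digit.
most_likely = int(np.argmax(first))
print(most_likely)          # 4
print(first[most_likely])   # 0.986201606
```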

In [16]:
clf.score(X_test,y_test)

0.9943645434249332
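Because the search was created with `scoring='f1_macro'`, `clf.score` reports the macro-averaged F1 score, not accuracy. The sketch below checks this with the `metrics` module imported at the top of the notebook and prints a per-class report. It refits the search so that it is self-contained, without `probability=True` to keep it fast; `random_state=0` is an assumption added for repeatability, so the numbers will not match the output above exactly.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=0)

clf = GridSearchCV(SVC(gamma='scale'), {'C': [1, 10]}, cv=3, scoring='f1_macro')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# clf.score uses the metric given in `scoring`, so these two values agree.
search_score = clf.score(X_test, y_test)
f1_macro = metrics.f1_score(y_test, y_pred, average='macro')
print(search_score, f1_macro)

# Accuracy is a different metric and has to be computed explicitly.
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)

# Per-class precision, recall and F1, plus the confusion matrix.
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
```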

---

### Further reading

Dataset:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits

Other datasets to practice with:
https://scikit-learn.org/stable/datasets/index.html

Tutorial I used as the basis for this lesson:
- https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py

### Tip for running the notebook

A very simple way to run this notebook is to use Google Colab: https://colab.research.google.com/

If you run it on your own machine, remember to install Python 3 (I used 3.7) and the dependencies:
- NumPy
- Scikit-learn
- Jupyter Notebook
- Matplotlib

I suggest installing both Python and the dependencies through [Anaconda](https://www.anaconda.com/distribution/#download-section) (or [Miniconda](https://conda.io/en/latest/miniconda.html)), creating a dedicated environment.
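Once the environment is set up, a quick check like the one below can confirm that the dependencies are importable. This is a minimal sketch: the helper name `report_versions` is hypothetical, and the package list simply mirrors the dependencies named above.

```python
import importlib
import sys


def report_versions(packages=("numpy", "sklearn", "matplotlib")):
    """Return a dict mapping each package to its version, or None if missing."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in report_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```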