In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# The Wisdom of the Crowds

Which is better?

1) One sage who is right 95% of the time.

2) 10,000 guavas that are each right 51% of the time.

$x_i$: the answer given by person $i$

$x_i \sim \mathrm{Bernoulli}(p=0.51)$

Democracy of the guavas:

- $y$: number of successes in $n=10000$ trials
- $y \sim \mathrm{Binomial}(n=10000, p=0.51)$

The guavas' decision: the crowd is right if 5001 or more guavas are right.

- $P(y>5000)?$
- $P(y>5000) = 1 - P(y \leq 5000) \approx 97.7\%$

In [2]:
from scipy.stats import binom

1 - binom.cdf(5000, n=10000, p=0.51) 

0.9767182874807615
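To see how much the crowd size matters, the same calculation can be repeated for smaller crowds. The sketch below is an illustration, not part of the original exercise: the crowd sizes are arbitrary odd numbers (chosen to avoid ties), and `binom.sf(k, n, p)` is used as the equivalent of `1 - binom.cdf(k, n, p)`.

```python
from scipy.stats import binom

p = 0.51  # individual accuracy, as in the example above

# Odd crowd sizes avoid ties, so "majority" means more than n // 2 correct votes.
crowd_sizes = [1, 11, 101, 1001, 10001]

# binom.sf(k, n, p) is P(y > k), i.e. 1 - binom.cdf(k, n, p).
majority_accuracy = {n: binom.sf(n // 2, n, p) for n in crowd_sizes}

for n, acc in majority_accuracy.items():
    print(f"n = {n:>5}: P(majority is right) = {acc:.4f}")
```

A single voter is right 51% of the time, and the probability that the majority is right grows steadily with $n$.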

### Principles of democracy:

- $p \gt 50\%$, at least (better than purely random guessing)
    - Weak learners → weak models
- Large $n$, so that individual errors cancel out
    - a large number of models
        - different algorithms, same data
        - same algorithm, different data (for example, subsamples of the same dataset)
- Independence
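The independence requirement can be illustrated with a small simulation. This is a sketch under a made-up correlation model, not something taken from the textbook: in each trial a hypothetical "leader" gives one answer, and every voter copies it with probability `copy_prob` instead of answering independently.

```python
import numpy as np

rng = np.random.default_rng(42)

def majority_accuracy(n_voters, p, copy_prob, n_trials=2000):
    """Estimate how often the majority is right when voters may copy a leader."""
    # One shared "leader" answer per trial; True means the answer is correct.
    leader = rng.random(n_trials) < p
    # Each voter's own independent answer.
    own = rng.random((n_trials, n_voters)) < p
    # Mask of voters who copy the leader instead of using their own answer.
    copies = rng.random((n_trials, n_voters)) < copy_prob
    votes = np.where(copies, leader[:, None], own)
    return (votes.sum(axis=1) > n_voters / 2).mean()

independent = majority_accuracy(n_voters=1001, p=0.51, copy_prob=0.0)
correlated = majority_accuracy(n_voters=1001, p=0.51, copy_prob=0.5)
print(f"independent voters: {independent:.3f}")
print(f"half copy a leader: {correlated:.3f}")
```

When half of the voters copy the same source, the majority is barely more accurate than that single source, no matter how large the crowd is.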

https://www.youtube.com/watch?v=iOucwX7Z1HU

Francis Galton's article: Galton, F. (1907). "Vox populi". Nature. 75: 450–451. doi:10.1038/075450a0. (https://www.nature.com/articles/075450a0.pdf)



**Activities:**

Study chapter 7 of the textbook and do exercises 1, 2, 8, and 9.

1. If you have trained five different models on the exact same training data, and
they all achieve 95% precision, is there any chance that you can combine these
models to get better results? If so, how? If not, why?
2. What is the difference between hard and soft voting classifiers?
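As a hint for exercise 2, the sketch below contrasts the two voting rules on a single sample. The probabilities are made up for illustration: two classifiers lean slightly toward class 1, while a third is very confident in class 0.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample
# (rows: classifiers, columns: classes 0 and 1). The numbers are made up.
probas = np.array([
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_choice = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_choice = probas.mean(axis=0).argmax()

print("hard voting picks class", hard_choice)
print("soft voting picks class", soft_choice)
```

The two rules disagree here because soft voting gives more weight to confident votes.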

8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a
validation set, and a test set (e.g., use the first 40,000 instances for training, the
next 10,000 for validation, and the last 10,000 for testing). Then train various
classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an
SVM. Next, try to combine them into an ensemble that outperforms them all on
the validation set, using a soft or hard voting classifier. Once you have found one,
try it on the test set. How much better does it perform compared to the individual
classifiers?

In [3]:
try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1)
    mnist.target = mnist.target.astype(np.int64)
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')

In [4]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=10000, random_state=42)
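Note that `train_test_split` shuffles by default, so this split is random rather than the "first N instances" suggested in the exercise. The sketch below checks the resulting subset sizes on an index array that stands in for the 70,000 MNIST rows (no pixel data is needed for this check).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 70,000 MNIST rows: indices only, no pixel data.
indices = np.arange(70_000)

# Same two-step split as above: hold out the test set first, then the validation set.
train_val_idx, test_idx = train_test_split(indices, test_size=10_000, random_state=42)
train_idx, val_idx = train_test_split(train_val_idx, test_size=10_000, random_state=42)

print(len(train_idx), len(val_idx), len(test_idx))
```

With these settings the training set has 50,000 instances, and the validation and test sets have 10,000 each.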

In [5]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

random_forest_clf = RandomForestClassifier(n_estimators=10, random_state=RANDOM_SEED)
extra_trees_clf = ExtraTreesClassifier(n_estimators=10, random_state=RANDOM_SEED)
svm_clf = LinearSVC(random_state=RANDOM_SEED)
mlp_clf = MLPClassifier(random_state=RANDOM_SEED)

In [6]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(n_estimators=10, random_state=42)
Training the ExtraTreesClassifier(n_estimators=10, random_state=42)
Training the LinearSVC(random_state=42)
Training the MLPClassifier(random_state=42)


In [7]:
[estimator.score(X_val, y_val) for estimator in estimators]

[0.9469, 0.9492, 0.8695, 0.9655]
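The linear SVM (0.8695) trails the other classifiers by a wide margin. One plausible cause, not verified in this notebook, is that the raw pixel values are not scaled. The sketch below shows one way to put a scaler in front of `LinearSVC`; it uses the small digits dataset bundled with scikit-learn as a quick stand-in for MNIST, and `max_iter=5000` is an arbitrary choice.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Small bundled 8x8 digits dataset, used here only as a quick stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale the features before the linear SVM; the scaler is fit on training data only.
scaled_svm_clf = make_pipeline(StandardScaler(), LinearSVC(random_state=42, max_iter=5000))
scaled_svm_clf.fit(X_tr, y_tr)

scaled_score = scaled_svm_clf.score(X_va, y_va)
print(f"validation accuracy: {scaled_score:.4f}")
```

The same pipeline could replace `svm_clf` in the list of estimators; whether it helps on MNIST would have to be checked on the validation set.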

In [8]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier([
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
])

voting_clf.fit(X_train, y_train)

voting_clf.score(X_val, y_val)



0.9626

Removing the SVM to see whether the score improves

In [9]:
voting_clf_noSVM = VotingClassifier([
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("mlp_clf", mlp_clf),
])

voting_clf_noSVM.fit(X_train, y_train)

voting_clf_noSVM.score(X_val, y_val)

0.9649

In [10]:
voting_clf_noSVM.voting = "soft"

voting_clf_noSVM.score(X_val, y_val)

0.9707
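Soft voting is only tried on the ensemble without the SVM. That matters because soft voting averages class probabilities, so every member needs a `predict_proba` method. The sketch below checks which of the four classifier types provide it (the instances are unfitted and use default parameters).

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

candidates = {
    "random_forest_clf": RandomForestClassifier(),
    "extra_trees_clf": ExtraTreesClassifier(),
    "svm_clf": LinearSVC(),
    "mlp_clf": MLPClassifier(),
}

# Soft voting averages predict_proba outputs, so every member must provide that method.
supports_soft_voting = {name: hasattr(clf, "predict_proba") for name, clf in candidates.items()}
print(supports_soft_voting)
```

`LinearSVC` is the only one of the four without `predict_proba`, so it could not take part in a soft voting ensemble as is.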

In [11]:
voting_clf_noSVM.score(X_test, y_test)

0.9686

In [12]:
[estimator.score(X_test, y_test) for estimator in voting_clf_noSVM.estimators_]

[0.9437, 0.9474, 0.9624]
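To answer "how much better", it helps to compare error rates instead of accuracies. The short calculation below uses only the test scores reported above; the variable names are chosen for this illustration.

```python
# Test-set accuracies reported above.
ensemble_acc = 0.9686
individual_accs = {"random_forest_clf": 0.9437, "extra_trees_clf": 0.9474, "mlp_clf": 0.9624}

best_name, best_acc = max(individual_accs.items(), key=lambda item: item[1])

# Compare error rates rather than accuracies: small accuracy gaps can hide large error cuts.
ensemble_err = 1 - ensemble_acc
best_err = 1 - best_acc
relative_reduction = (best_err - ensemble_err) / best_err

print(f"best individual: {best_name} ({best_err:.2%} error)")
print(f"soft voting ensemble: {ensemble_err:.2%} error")
print(f"relative error reduction: {relative_reduction:.1%}")
```

The soft voting ensemble makes roughly 16% fewer errors on the test set than the best individual classifier.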

9. Run the individual classifiers from the previous exercise to make predictions on
the validation set, and create a new training set with the resulting predictions:
each training instance is a vector containing the set of predictions from all your
classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you
have just trained a blender, and together with the classifiers they form a stacking
ensemble! Now let’s evaluate the ensemble on the test set. For each image in the
test set, make predictions with all your classifiers, then feed the predictions to the
blender to get the ensemble’s predictions. How does it compare to the voting classifier
you trained earlier?

In [13]:
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

In [14]:
X_val_predictions

array([[5., 5., 5., 5.],
       [8., 8., 8., 8.],
       [2., 2., 2., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

In [15]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

In [19]:
rnd_forest_blender.oob_score_

0.9623

In [20]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [21]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [22]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9601
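The blender reaches 0.9601 on the test set, slightly below the soft voting ensemble (0.9686). For comparison, scikit-learn also ships a `StackingClassifier` that automates this procedure. The sketch below is not the notebook's method: it trains the blender on cross-validated predictions instead of a separate validation set, feeds it class probabilities instead of hard labels by default, and uses the small bundled digits dataset as a quick stand-in for MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split

# Small bundled digits dataset as a quick stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("random_forest_clf", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("extra_trees_clf", ExtraTreesClassifier(n_estimators=10, random_state=42)),
    ],
    # The blender; it is trained on out-of-fold predictions of the base models.
    final_estimator=RandomForestClassifier(n_estimators=200, random_state=42),
    cv=5,
)
stacking_clf.fit(X_tr, y_tr)

stacking_score = stacking_clf.score(X_te, y_te)
print(f"stacking test accuracy: {stacking_score:.4f}")
```

Because the base models are evaluated out of fold, no instances have to be set aside just for training the blender.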