# Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy.

<img src='img_1.png'>

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. 

This majority-vote classifier is called a *hard voting classifier*.

<img src='img_2.png'>
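As a minimal sketch of the idea, hard voting just counts the labels predicted by the individual classifiers and returns the most frequent one. The helper name `hard_vote` and the example labels are hypothetical, chosen only for illustration. In this sketch, ties go to whichever label was seen first.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label that appears most often among the predictions."""
    # most_common(1) gives a list with one (label, count) pair
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three classifiers for a single sample
votes = ["versicolor", "virginica", "versicolor"]
print(hard_vote(votes))
```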

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble.

In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner, provided there are a sufficient number of weak learners and they are sufficiently diverse.

This is due to the law of large numbers: suppose you build an ensemble containing 1,000 classifiers for a two-class problem, each individually correct only 51% of the time. If you predict the majority-voted class, you can hope for up to 75% accuracy.
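The figure above can be checked with a short calculation. The sketch below assumes a two-class problem and perfectly independent classifiers, and sums the binomial probabilities of every outcome in which a strict majority is correct. The function name `majority_correct_probability` is made up for this example. Working in log space avoids overflow in the binomial coefficients.

```python
from math import exp, lgamma, log

def majority_correct_probability(n_classifiers, p_correct):
    """P(more than half of n independent classifiers are correct)."""
    n, p = n_classifiers, p_correct
    total = 0.0
    # Add up P(exactly k correct) for every k that is a strict majority
    for k in range(n // 2 + 1, n + 1):
        # log of the binomial coefficient C(n, k)
        log_binom = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        total += exp(log_binom + k * log(p) + (n - k) * log(1 - p))
    return total

for n in (1, 100, 1_000, 10_000):
    print(n, round(majority_correct_probability(n, 0.51), 3))
```

Under these assumptions the value for 1,000 classifiers comes out a little under 75%, and it keeps rising as more classifiers are added.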


However, this is only true if all classifiers are:
1. Perfectly independent
2. Making uncorrelated errors


This is clearly not the case when the classifiers are trained on the same data: they are likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble’s accuracy.
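To see why correlated errors hurt, here is a small simulation under simplified assumptions. Each of 101 classifiers is correct 60% of the time on its own. With probability `p_copy` a classifier copies a shared outcome instead of deciding independently. This keeps its individual accuracy at 60% but correlates its errors with the others. The function name, the copying mechanism, and all the numbers are illustrative assumptions, not part of the original notebook.

```python
import random

def simulate_majority_accuracy(n_classifiers, p_correct, p_copy, n_trials, seed=0):
    """Estimate how often the majority vote is right when errors are correlated."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # One shared outcome per sample that correlated classifiers copy
        shared_correct = rng.random() < p_correct
        n_right = 0
        for _ in range(n_classifiers):
            if rng.random() < p_copy:
                n_right += shared_correct              # copies the shared outcome
            else:
                n_right += rng.random() < p_correct    # decides independently
        wins += n_right > n_classifiers / 2
    return wins / n_trials

independent_acc = simulate_majority_accuracy(101, 0.6, p_copy=0.0, n_trials=2000)
correlated_acc = simulate_majority_accuracy(101, 0.6, p_copy=0.8, n_trials=2000)
print("independent errors:", independent_acc)
print("correlated errors: ", correlated_acc)
```

With independent errors the majority vote is right far more often than any single classifier. With strongly correlated errors it falls back toward the accuracy of one classifier.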

##### Note:

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get *diverse classifiers* is to train them using very *different algorithms*.

This increases the chance that they will make very different types of errors, improving
the ensemble’s accuracy.




In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [21]:
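data = load_iris()
# Keep only the last two features (petal length and petal width)
X = data.data[:, 2:]
y = data.target
# No random_state is set, so the split (and the scores below) vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)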

In [28]:
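log_clf = LogisticRegression()
# probability=True lets SVC expose predict_proba(), which soft voting needs
svc_clf = SVC(probability=True)
rand_clf = RandomForestClassifier()
# Hard voting: predict the class that receives the most votes
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svc_clf), ('rnd_fst', rand_clf)],
    voting='hard'
)
# Soft voting: predict the class with the highest average probability
soft_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svc_clf), ('rnd_fst', rand_clf)],
    voting='soft'
)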

In [11]:
voting_clf.fit(X_train, y_train)

In [19]:
from sklearn.metrics import accuracy_score

In [22]:
for clf in (log_clf, svc_clf, rand_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # accuracy_score expects (y_true, y_pred)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 1.0
SVC 1.0
RandomForestClassifier 1.0
VotingClassifier 1.0


In [30]:
for clf in (log_clf, svc_clf, rand_clf, soft_voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('*'*30)
    print(clf.__class__.__name__)
    # Class probabilities for the first two test samples
    print('proba: ', clf.predict_proba(X_test[:2]))
    # accuracy_score expects (y_true, y_pred)
    print("accuracy: ", accuracy_score(y_test, y_pred))

******************************
LogisticRegression
proba:  [[0.04740948 0.89372093 0.05886959]
 [0.0383541  0.91679756 0.04484835]]
accuracy:  1.0
******************************
SVC
proba:  [[0.01213904 0.96834888 0.01951208]
 [0.01102813 0.97378426 0.01518761]]
accuracy:  1.0
******************************
RandomForestClassifier
proba:  [[0. 1. 0.]
 [0. 1. 0.]]
accuracy:  0.9777777777777777
******************************
VotingClassifier
proba:  [[0.02025744 0.9533237  0.02641887]
 [0.01684015 0.96291029 0.02024955]]
accuracy:  1.0


##### Soft Voting
1. Soft voting requires that all classifiers can estimate class probabilities (i.e., they have a `predict_proba()` method).
2. It predicts the class with the highest class probability, averaged over all the individual classifiers.
3. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
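The difference between the two schemes shows up when one classifier is very confident and the others only mildly disagree. The sketch below uses made-up probabilities for a two-class problem. The helper names `soft_vote` and `hard_vote_from_probabilities` are hypothetical and are not scikit-learn functions.

```python
def soft_vote(probabilities):
    """Average the class probabilities and return (winning class, averages)."""
    n_classes = len(probabilities[0])
    averaged = [
        sum(p[c] for p in probabilities) / len(probabilities)
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged

def hard_vote_from_probabilities(probabilities):
    """Let each classifier vote for its most likely class, then take the majority."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return max(set(votes), key=votes.count)

# Hypothetical outputs of three classifiers for one sample (classes 0 and 1)
probs = [
    [0.90, 0.10],  # very confident in class 0
    [0.45, 0.55],  # leans slightly towards class 1
    [0.45, 0.55],  # leans slightly towards class 1
]

soft_class, averaged = soft_vote(probs)
hard_class = hard_vote_from_probabilities(probs)
print("soft vote:", soft_class, averaged)
print("hard vote:", hard_class)
```

Here hard voting picks class 1 (two votes to one), while soft voting picks class 0 because the single confident vote outweighs the two hesitant ones.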