
# Hard voting classifier

Suppose you have trained a few classifiers, each achieving about x% accuracy. A very simple way to create an even better classifier is to aggregate the predictions of all the classifiers and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.
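
To make the voting rule concrete, here is a minimal sketch of majority voting in plain Python. The function name `hard_vote` and the toy predictions are made up for illustration, and the tie-breaking (the label seen first wins) is an assumption of this sketch, not necessarily what a library implementation does.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label for each sample.

    `predictions` holds one sequence of predicted labels per classifier;
    all sequences have the same length (one entry per sample).
    """
    voted = []
    # zip(*predictions) groups the votes that all classifiers cast for one sample
    for sample_votes in zip(*predictions):
        # most_common(1) gives [(label, count)] for the most frequent label;
        # on a tie, the label encountered first is returned
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# toy predictions from three hypothetical classifiers on four samples
clf_a = [0, 1, 1, 0]
clf_b = [0, 1, 0, 0]
clf_c = [1, 1, 0, 1]
print(hard_vote([clf_a, clf_b, clf_c]))
```

Each sample gets the label that at least two of the three classifiers agree on, so a single classifier's mistake is outvoted whenever the other two are right.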

Even if each classifier is a weak learner (meaning
it does only slightly better than random guessing), the ensemble can still be a
strong learner (achieving high accuracy), provided there are enough
weak learners and they are sufficiently diverse, that is, they make largely uncorrelated errors.
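
The role of diversity can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a binary task, every learner correct 51% of the time on average, and a made-up `copy_prob` parameter that controls how often a learner simply repeats one shared outcome instead of making its own independent guess.

```python
import random


def ensemble_accuracy(n_learners, n_samples, p_correct, copy_prob, seed=0):
    """Estimate how often a majority vote is correct on a binary task.

    Each learner is correct with probability `p_correct` on average.
    With probability `copy_prob` a learner repeats a shared outcome,
    which makes the learners' errors correlated (hypothetical model).
    """
    rng = random.Random(seed)
    n_right = 0
    for _ in range(n_samples):
        # one shared outcome per sample, itself correct with probability p_correct
        shared_correct = rng.random() < p_correct
        votes_correct = 0
        for _ in range(n_learners):
            if rng.random() < copy_prob:
                votes_correct += shared_correct  # correlated vote
            else:
                votes_correct += rng.random() < p_correct  # independent vote
        # in a binary task the ensemble is right when most votes are right
        if votes_correct > n_learners / 2:
            n_right += 1
    return n_right / n_samples


independent = ensemble_accuracy(501, 2000, 0.51, copy_prob=0.0)
correlated = ensemble_accuracy(501, 2000, 0.51, copy_prob=0.9)
print(f"independent learners: {independent:.3f}")
print(f"correlated learners:  {correlated:.3f}")
```

With independent learners the majority vote should be clearly better than any single learner, while the strongly correlated ensemble should stay close to the 51% of its members, since the learners mostly repeat the same mistakes.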



## Analogy from probability: a biased coin
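Suppose you have a slightly biased coin that has a 51% chance of coming up heads
and a 49% chance of coming up tails. If you toss it 1,000 times, you will generally get
more or less 510 heads and 490 tails, and hence a majority of heads. If you do the
math, you will find that the probability of obtaining a majority of heads after 1,000
tosses is about 73% (close to 75% if an exact 500/500 split is also counted). The more
you toss the coin, the higher the probability (e.g.,
with 10,000 tosses, the probability climbs over 97%). This is due to the law of large
numbers: as you keep tossing the coin, the ratio of heads gets closer and closer to the
probability of heads (51%). An ensemble behaves the same way: if 1,000 classifiers are
each correct only 51% of the time, the majority vote can be right far more often, but
only if the classifiers are independent and make uncorrelated errors, which is rarely
fully true when they are all trained on the same data.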
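
The probabilities quoted for the coin can be checked by summing the binomial distribution directly. The sketch below uses only the standard library and works in log space so that the large binomial coefficients do not overflow; the helper name `prob_majority_heads` is made up for this example, and "majority" is taken to mean strictly more than half of the tosses.

```python
import math


def prob_majority_heads(n_tosses, p_heads=0.51):
    """Probability of strictly more than n_tosses / 2 heads in independent tosses."""
    log_p = math.log(p_heads)
    log_q = math.log(1.0 - p_heads)
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), using lgamma for the factorials
        log_pmf = (
            math.lgamma(n_tosses + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_tosses - k + 1)
            + k * log_p
            + (n_tosses - k) * log_q
        )
        total += math.exp(log_pmf)
    return total


for n in (1_000, 10_000):
    print(f"{n} tosses: P(majority heads) = {prob_majority_heads(n):.3f}")
```

For 1,000 tosses the result should come out at roughly 73%, and for 10,000 tosses above 97%.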

## Example of voting classifier on moons dataset

We will use an ensemble of an SVC, a random forest, and a logistic regression classifier.

In [30]:
# data prep
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# two interleaving half-moons with fairly heavy noise
X, y = make_moons(n_samples=10_000, noise=0.4, random_state=42)

# hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test

(array([[-0.56413534,  0.29283681],
        [-1.16033479,  0.96512577],
        [-0.06598769, -0.15191052],
        ...,
        [ 0.38876425, -0.78662881],
        [ 2.50492832,  0.21133631],
        [ 0.35428745,  0.74582457]]), array([[ 0.69945888, -0.8734481 ],
        [ 1.7764418 ,  0.13222334],
        [-1.14450821,  0.24446319],
        ...,
        [ 0.66336269,  0.79833307],
        [-0.6493245 ,  1.19920859],
        [-0.09883144,  0.40961263]]), array([0, 0, 1, ..., 1, 1, 0]), array([1, 1, 0, ..., 0, 0, 0]))

In [31]:
# training each classifier with default hyperparameters (no tuning)
from sklearn.metrics import accuracy_score

lr_clf = LogisticRegression()
rf_clf = RandomForestClassifier()  # no random_state set, so scores vary slightly between runs
svc_clf = SVC()

# individual test-set accuracy of each classifier
for clf in (lr_clf, rf_clf, svc_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc_score = accuracy_score(y_true=y_test, y_pred=y_pred)
    print(f"{clf.__class__.__name__}: {acc_score}")



LogisticRegression: 0.8415
RandomForestClassifier: 0.854
SVC: 0.874


In [32]:
# checking the performance of a hard voting classifier built from the same three models
# (VotingClassifier fits its own clones of the estimators, so the earlier fits are not reused)
voting_clf = VotingClassifier(
    estimators=[('lr', lr_clf), ('rf', rf_clf), ('svc', svc_clf)],
    voting='hard',
)
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
acc_score = accuracy_score(y_true=y_test, y_pred=y_pred)
print("voting clf: {}".format(acc_score))
# A voting classifier often scores higher than each of its constituents, though that did not happen here:
# the SVC alone reaches 0.874 versus 0.87 for the ensemble, so it seems simply too well suited to this data pattern.


voting clf: 0.87
