# Soft and Hard Voting

## Voting Classifier

A voting classifier is an ensemble model: it trains several base models and predicts the class that they collectively favor, 
either by majority vote or by the highest averaged class probability. Suppose you have trained a few classifiers, each achieving 
about 80% accuracy: a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors 
classifier, and perhaps a few more. A very simple way to create an even better classifier is to aggregate the predictions of 
each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.
Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. 
In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble 
can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are 
sufficiently diverse. A voting classifier supports two types of voting:
    
Hard Voting: In hard voting, the predicted output class is the one that receives the most votes, i.e. the class predicted by 
    the largest number of classifiers. Suppose three classifiers predict the classes (A, A, B). The majority predicted A, 
    so A is the final prediction.

Soft Voting: In soft voting, each classifier outputs a probability for every class, the probabilities are averaged per class, 
    and the class with the highest average wins. Suppose that, for some input, three models give class A the probabilities 
    (0.30, 0.47, 0.53) and class B the probabilities (0.20, 0.32, 0.40). The average for class A is 0.4333 and for class B 
    is 0.3067, so class A wins because it has the highest probability averaged over the classifiers.
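
As a minimal sketch, the two rules can be reproduced with the example numbers from the text. The variable names are illustrative only, and the votes and probabilities are the made-up values above, not the output of any trained model.

```python
from collections import Counter

import numpy as np

# Hard voting: each classifier casts one vote, the most common class wins.
hard_votes = ["A", "A", "B"]
hard_winner = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: one row per classifier, one column per class (A, B).
classes = ["A", "B"]
probas = np.array([[0.30, 0.20],
                   [0.47, 0.32],
                   [0.53, 0.40]])
mean_probas = probas.mean(axis=0)             # average probability per class
soft_winner = classes[int(np.argmax(mean_probas))]

print("hard voting winner:", hard_winner)
print("mean probabilities:", mean_probas.round(4))
print("soft voting winner:", soft_winner)
```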

Note: Feed the voting classifier a variety of models, so that an error made by one model can be corrected by the others.
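
The claim that many weak learners can form a strong learner can be illustrated with a short calculation. The sketch below assumes an idealized setting that real ensembles do not meet: every classifier is correct with the same probability and their errors are fully independent. The helper `majority_vote_accuracy` is a hypothetical name introduced here for illustration.

```python
from math import exp, lgamma, log


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that more than half of n independent voters are correct.

    Assumes independent voters with identical accuracy 0 < p_correct < 1.
    Use an odd n_voters so that ties cannot occur.
    """
    total = 0.0
    for k in range(n_voters // 2 + 1, n_voters + 1):
        # log of the binomial coefficient C(n, k), computed in log space
        log_binom = lgamma(n_voters + 1) - lgamma(k + 1) - lgamma(n_voters - k + 1)
        total += exp(log_binom + k * log(p_correct)
                     + (n_voters - k) * log(1 - p_correct))
    return total


# Weak learners that are right only 51% of the time
for n in (1, 3, 101, 1001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

Classifiers trained on the same data make correlated errors, so the gain in practice is smaller than this calculation suggests. That is why diversity among the models matters.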


In [16]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [3]:
df = pd.read_csv('advertising.csv')

In [5]:
df.head()

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,27-03-2016 00:53,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,04-04-2016 1:39,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,13-03-2016 20:35,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,10-01-2016 2:31,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,03-06-2016 3:36,0


In [9]:
df.dtypes  # this cell was run after the rename cell (In [7]), so the output shows the renamed columns

daily_time_spent        float64
Age                       int64
area_income             float64
daily_internet_usage    float64
ad_topic_line            object
City                     object
Male                      int64
Country                  object
Timestamp                object
clicked_ad                int64
dtype: object

In [6]:
df.describe()

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Clicked on Ad
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,65.0002,36.009,55000.00008,180.0001,0.481,0.5
std,15.853615,8.785562,13414.634022,43.902339,0.499889,0.50025
min,32.6,19.0,13996.5,104.78,0.0,0.0
25%,51.36,29.0,47031.8025,138.83,0.0,0.0
50%,68.215,35.0,57012.3,183.13,0.0,0.5
75%,78.5475,42.0,65470.635,218.7925,1.0,1.0
max,91.43,61.0,79484.8,269.96,1.0,1.0


In [7]:
# Rename columns to snake_case
df = df.rename(columns={'Daily Time Spent on Site': 'daily_time_spent',
                        'Area Income': 'area_income',
                        'Daily Internet Usage': 'daily_internet_usage',
                        'Ad Topic Line': 'ad_topic_line',
                        'Clicked on Ad': 'clicked_ad'})

In [8]:
# Independent variables (features) and dependent variable (target)
x = df[['daily_time_spent', 'Age', 'area_income', 'daily_internet_usage', 'Male']]

y = df['clicked_ad']

In [11]:
# Split into training (73%) and test (27%) sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.27, random_state=1000)
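
As a quick check on the split arithmetic, the sketch below applies the same `test_size` to placeholder data with 1000 rows, the row count reported by `df.describe()`. The arrays `X_demo` and `y_demo` are invented stand-ins, not the advertising data, and the `stratify` argument is an optional extra that the notebook does not use. It keeps the class balance the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, perfectly balanced binary labels
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.tile([0, 1], 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.27, random_state=1000, stratify=y_demo)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("test labels per class:", np.bincount(y_te))
```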

## Hard Voting

In [12]:
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

voting_clf.fit(x_train, y_train)

In [13]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))


LogisticRegression 0.8851851851851852
RandomForestClassifier 0.9555555555555556
SVC 0.7222222222222222
VotingClassifier 0.9185185185185185
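
To see how a hard vote is formed, it can help to look at the individual votes. With `voting='hard'`, `VotingClassifier.transform` returns the label predicted by each fitted estimator, one column per model. The sketch below is an illustration on synthetic data from `make_classification` (an assumption made so that the example is self-contained, not the advertising data), and `demo_clf` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem standing in for the real data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.27, random_state=0)

demo_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
                ('svc', SVC(gamma='scale', random_state=0))],
    voting='hard')
demo_clf.fit(X_tr, y_tr)

# One column of predicted labels per estimator: shape (n_samples, 3)
votes = demo_clf.transform(X_te)

# With labels 0/1 and three voters, class 1 wins when at least two vote for it
manual = (votes.sum(axis=1) >= 2).astype(int)

print(votes[:5])
print("manual majority matches predict():",
      np.array_equal(manual, demo_clf.predict(X_te)))
```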


## Soft Voting 

In [14]:
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')

voting_clf.fit(x_train, y_train)
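
Soft voting requires every estimator to expose `predict_proba`, which is why the SVC above is created with `probability=True`. The sketch below checks on synthetic data that the ensemble's probabilities are the plain average of the individual models' probabilities when no `weights` are given. The data and the name `soft_demo` are assumptions for illustration. A K-Nearest Neighbors model stands in for the SVC only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for the real data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

soft_demo = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
                ('knn', KNeighborsClassifier(n_neighbors=5))],
    voting='soft')
soft_demo.fit(X, y)

# Stack the fitted estimators' probabilities: (n_models, n_samples, n_classes)
per_model = np.stack([est.predict_proba(X) for est in soft_demo.estimators_])
manual_avg = per_model.mean(axis=0)

# The class with the highest averaged probability is the soft-voting prediction
manual_pred = soft_demo.classes_[manual_avg.argmax(axis=1)]

print("averages match predict_proba():",
      np.allclose(manual_avg, soft_demo.predict_proba(X)))
print("argmax matches predict():",
      np.array_equal(manual_pred, soft_demo.predict(X)))
```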

In [25]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
    print(clf.__class__.__name__, classification_report(y_test, y_pred))

LogisticRegression 0.8851851851851852
LogisticRegression               precision    recall  f1-score   support

           0       0.85      0.94      0.89       139
           1       0.93      0.82      0.87       131

    accuracy                           0.89       270
   macro avg       0.89      0.88      0.88       270
weighted avg       0.89      0.89      0.88       270

RandomForestClassifier 0.9555555555555556
RandomForestClassifier               precision    recall  f1-score   support

           0       0.97      0.94      0.96       139
           1       0.94      0.97      0.95       131

    accuracy                           0.96       270
   macro avg       0.96      0.96      0.96       270
weighted avg       0.96      0.96      0.96       270

SVC 0.7222222222222222
SVC               precision    recall  f1-score   support

           0       0.68      0.86      0.76       139
           1       0.79      0.58      0.67       131

    accuracy                     

In [23]:
# Confusion matrix for the last model fitted in the loop above (the soft voting classifier)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('\n Confusion Matrix\n', clf.__class__.__name__, cm)


 Confusion Matrix
 VotingClassifier [[134   5]
 [  9 122]]
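
The confusion matrix above can be turned into summary metrics by hand. The sketch below copies the four counts from the printed output and follows the scikit-learn convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
import numpy as np

# Counts copied from the printed confusion matrix
cm = np.array([[134, 5],
               [9, 122]])

# Binary case: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all test rows predicted correctly
precision = tp / (tp + fp)           # of predicted clicks, how many were real
recall = tp / (tp + fn)              # of real clicks, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```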

