# Improving Classification with the Voting Classifier Ensemble 🍇

**Ensemble Learning** is a machine learning concept where multiple models are combined to produce a more powerful and accurate final prediction. The idea is that by aggregating the "votes" of several models, we can leverage their collective strengths and average out their individual weaknesses.

The **`VotingClassifier`** in `scikit-learn` is a simple yet effective ensemble method. It works by training several different types of models (e.g., Logistic Regression, SVM, and a Decision Tree) on the same data. It then makes a final prediction based on their combined output.

This notebook will demonstrate two types of voting:
1.  **Hard Voting:** A simple majority rule vote.
2.  **Soft Voting:** A vote weighted by the predicted probabilities of each model.

We will use the Raisin dataset to see how a Voting Classifier can improve upon the performance of a single baseline model.


## 1. The Dataset and a Baseline SVM Model

First, we'll load the Raisin dataset and train a single Support Vector Machine (SVM) model to establish a baseline performance that we can aim to beat.


In [1]:
import pandas as pd

df = pd.read_excel('Raisin_Dataset.xlsx')
df.head()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter | Class |
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |
| 0 | 87524 | 442.246011 | 253.291155 | 0.819738 | 90546 | 0.758651 | 1184.04 | Kecimen |
| 1 | 75166 | 406.690687 | 243.032436 | 0.801805 | 78789 | 0.68413 | 1121.786 | Kecimen |
| 2 | 90856 | 442.267048 | 266.328318 | 0.798354 | 93717 | 0.637613 | 1208.575 | Kecimen |
| 3 | 45928 | 286.540559 | 208.760042 | 0.684989 | 47336 | 0.699599 | 844.162 | Kecimen |
| 4 | 79408 | 352.19077 | 290.827533 | 0.564011 | 81463 | 0.792772 | 1073.251 | Kecimen |


Next, we split the data into training and testing sets, holding out 20% of the rows for testing.

In [6]:
# Prepare the data
X = df[['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity', 'ConvexArea', 'Extent', 'Perimeter']]
y = df['Class']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Now, let's train our baseline SVM model.

In [7]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

model = SVC(kernel='rbf')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("--- SVM Baseline Report ---")
print(classification_report(y_test, y_pred))

--- SVM Baseline Report ---
              precision    recall  f1-score   support

       Besni       0.86      0.75      0.80        83
     Kecimen       0.81      0.90      0.85        97

    accuracy                           0.83       180
   macro avg       0.83      0.82      0.82       180
weighted avg       0.83      0.83      0.83       180



The single SVM model achieves an accuracy of **83%**.

## 2. Creating a Voting Ensemble

We will now create an ensemble that combines three different types of models:
1.  `LogisticRegression`
2.  `SVC` (Support Vector Classifier)
3.  `DecisionTreeClassifier`


### a) Hard Voting (Majority Rule)

**Hard Voting** is a simple democratic vote. The final prediction is the class label that receives the majority of votes from the individual models. For example, if the Logistic Regression and Decision Tree models vote 'Besni' and the SVM votes 'Kecimen', the final prediction will be 'Besni'.
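The majority rule described above can be sketched in a few lines of plain Python. This is an illustration only, not scikit-learn's internal code: `hard_vote` is a hypothetical helper, and the tie-break (pick the label that sorts first) is an assumption modelled on how scikit-learn documents its own tie handling.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label chosen by most models for a single sample.

    Ties are broken by ascending label order (an assumption made for
    this sketch so the result is deterministic).
    """
    counts = Counter(predictions)
    # Highest count wins; among equal counts, the smallest label wins.
    return min(counts, key=lambda label: (-counts[label], label))


# Hypothetical votes for one raisin: Logistic Regression, SVM, Decision Tree
votes = ["Besni", "Kecimen", "Besni"]
print(hard_vote(votes))
```

With three voters and two classes a tie cannot occur, so the tie-break only matters for an even number of models or for more than two classes.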


In [8]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

log_model = LogisticRegression(max_iter=1000)
svc_model = SVC(kernel='rbf', probability=True) # probability=True is needed for soft voting
dt_model = DecisionTreeClassifier()

# Hard Voting is the default, so we don't need to specify voting='hard'
vc_hard = VotingClassifier(estimators=[('lr', log_model), ('svc', svc_model), ('dt', dt_model)])
vc_hard.fit(X_train, y_train)
y_pred_hard = vc_hard.predict(X_test)

print("--- Hard Voting Report ---")
print(classification_report(y_test, y_pred_hard))

--- Hard Voting Report ---
              precision    recall  f1-score   support

       Besni       0.93      0.83      0.88        83
     Kecimen       0.87      0.95      0.91        97

    accuracy                           0.89       180
   macro avg       0.90      0.89      0.89       180
weighted avg       0.90      0.89      0.89       180



By combining the models, our accuracy on this test split improves from 83% to **89%**.

### b) Soft Voting (Weighted by Probability)

**Soft Voting** can be more effective when the models produce well-calibrated probabilities. It averages the predicted probabilities for each class across all the models and picks the class with the highest average. This gives more weight to models that are highly confident in their predictions. Note that every base classifier must expose a `predict_proba` method, which is why the `SVC` above was created with `probability=True`.
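To make the averaging concrete, here is a minimal sketch in plain Python. The probabilities are made up for illustration, and `soft_vote` is a hypothetical helper rather than part of scikit-learn. The example is chosen so that two of the three models lean towards 'Besni', yet one very confident model tips the average the other way.

```python
def soft_vote(probas, classes):
    """Average per-class probabilities across models and return the top class.

    probas  -- one list of class probabilities per model, ordered like `classes`
    classes -- class labels, in the same order as the probability columns
    """
    n_models = len(probas)
    avg = [sum(p[i] for p in probas) / n_models for i in range(len(classes))]
    best = max(range(len(classes)), key=lambda i: avg[i])
    return classes[best], avg


classes = ["Besni", "Kecimen"]
probas = [
    [0.55, 0.45],  # hypothetical Logistic Regression output
    [0.10, 0.90],  # hypothetical SVM output (very confident in 'Kecimen')
    [0.60, 0.40],  # hypothetical Decision Tree output
]

label, avg = soft_vote(probas, classes)
print(label, [round(a, 3) for a in avg])
```

A hard vote over the same three models would pick 'Besni' (two votes to one), whereas the averaged probabilities favour 'Kecimen'.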


In [9]:
# Soft Voting
vc_soft = VotingClassifier(estimators=[('lr', log_model), ('svc', svc_model), ('dt', dt_model)], voting='soft')
vc_soft.fit(X_train, y_train)
y_pred_soft = vc_soft.predict(X_test)

print("--- Soft Voting Report ---")
print(classification_report(y_test, y_pred_soft))

--- Soft Voting Report ---
              precision    recall  f1-score   support

       Besni       0.93      0.83      0.88        83
     Kecimen       0.87      0.95      0.91        97

    accuracy                           0.89       180
   macro avg       0.90      0.89      0.89       180
weighted avg       0.90      0.89      0.89       180



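In this case, soft voting achieved a strong accuracy of roughly **88-89%**, which is still a clear improvement over the single SVM baseline. The exact figure can shift slightly between runs because the decision tree is created without a fixed `random_state`.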


## 3. Conclusion

| Model | Accuracy |
|:--- |:--- |
| Single SVM | 83% |
| **Voting Classifier (Soft)** | **88%** |
| **Voting Classifier (Hard)** | **89%** |

This experiment illustrates the power of ensemble learning. The `VotingClassifier` provides a simple way to combine the strengths of fundamentally different models, and on this test split both voting schemes produced a more accurate final prediction than the single SVM baseline.
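To see how each base model fares on its own next to the ensemble, one option is to score the fitted sub-estimators individually. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the Raisin dataset), and fixed `random_state` values are added only to make the run repeatable. Because `VotingClassifier` trains its sub-estimators on label-encoded targets, the test labels are passed through the fitted encoder `le_` before scoring them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the Raisin dataset).
X, y = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", SVC(kernel="rbf", probability=True, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# Sub-estimators were fitted on encoded labels, so encode y_test the same way.
y_test_encoded = ensemble.le_.transform(y_test)

scores = {
    name: est.score(X_test, y_test_encoded)
    for name, est in ensemble.named_estimators_.items()
}
scores["ensemble"] = ensemble.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name:>8}: {acc:.3f}")
```

On real data, the ensemble is not guaranteed to beat every individual model, so a comparison like this is worth running before settling on it.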