# Improving Performance with Ensemble Learning: Voting and Random Forests 🌳

**Ensemble Learning** is a powerful machine learning concept based on a simple idea: by combining the predictions of multiple individual models (called "base estimators"), we can often create a final model that is more accurate and robust than any of the individual models on their own.

The intuition behind this is the "wisdom of the crowd." The collective opinion of a diverse group is often better than the opinion of a single expert. Similarly, ensemble methods combine the "votes" of several models to average out their individual weaknesses and produce a better overall prediction.
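The crowd intuition can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every base model is correct with the same probability and that the models make their mistakes independently. Real models trained on the same data are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` is our own, not part of any library.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is right with probability p, independently of the others.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# One model at 80% stays at 0.8; three such models voting reach 0.896.
for n in (1, 3):
    print(n, round(majority_vote_accuracy(n, 0.8), 3))
```

Under these idealised assumptions, three 80% models already reach about 90% as a committee, which is the effect the ensembles below try to capture.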

This notebook will demonstrate two popular ensemble techniques:
1.  **Voting Classifier:** Combines predictions from different types of models.
2.  **Random Forest:** An ensemble of many Decision Tree models.

We will use the Raisin dataset to see how these ensemble methods improve upon the performance of individual baseline models.


## 1. The Dataset and Baseline Performance

First, we'll load the Raisin dataset and train two individual models, a Support Vector Machine (SVM) and a Decision Tree, to establish a baseline performance that we can try to beat with our ensembles.


In [1]:
import pandas as pd

df = pd.read_excel('Raisin_Dataset.xlsx')
df.head()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter | Class |
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |
| 0 | 87524 | 442.246011 | 253.291155 | 0.819738 | 90546 | 0.758651 | 1184.04 | Kecimen |
| 1 | 75166 | 406.690687 | 243.032436 | 0.801805 | 78789 | 0.68413 | 1121.786 | Kecimen |
| 2 | 90856 | 442.267048 | 266.328318 | 0.798354 | 93717 | 0.637613 | 1208.575 | Kecimen |
| 3 | 45928 | 286.540559 | 208.760042 | 0.684989 | 47336 | 0.699599 | 844.162 | Kecimen |
| 4 | 79408 | 352.19077 | 290.827533 | 0.564011 | 81463 | 0.792772 | 1073.251 | Kecimen |


In [2]:
X = df[['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity', 'ConvexArea', 'Extent', 'Perimeter']]
y = df['Class']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

### Baseline Model 1: Support Vector Machine (SVM)

In [8]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

model_svm = SVC(kernel='rbf')
model_svm.fit(X_train, y_train)
y_pred_svm = model_svm.predict(X_test)
print("--- SVM Baseline Report ---")
print(classification_report(y_test, y_pred_svm))

--- SVM Baseline Report ---
              precision    recall  f1-score   support

       Besni       0.86      0.75      0.80        83
     Kecimen       0.81      0.90      0.85        97

    accuracy                           0.83       180
   macro avg       0.83      0.82      0.82       180
weighted avg       0.83      0.83      0.83       180



### Baseline Model 2: Decision Tree

In [9]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
y_pred_dt = model_dt.predict(X_test)
print("--- Decision Tree Baseline Report ---")
print(classification_report(y_test, y_pred_dt))

--- Decision Tree Baseline Report ---
              precision    recall  f1-score   support

       Besni       0.83      0.77      0.80        83
     Kecimen       0.82      0.87      0.84        97

    accuracy                           0.82       180
   macro avg       0.82      0.82      0.82       180
weighted avg       0.82      0.82      0.82       180



The two baselines reach accuracies of **83%** (SVM) and **82%** (Decision Tree). Let's see if we can improve on this with ensembles.


## 2. Combining Diverse Models: The Voting Classifier

A Voting Classifier combines different types of models (e.g., Logistic Regression, SVM, and a Decision Tree) and makes a prediction based on their combined output.

### a) Hard Voting
Hard voting is a simple majority rule. The final prediction is the class label that was predicted most frequently by the individual models.
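The majority rule itself fits in a few lines. This is a minimal sketch of the idea, not scikit-learn's implementation; the helper `hard_vote` and the three sample predictions are made up for illustration.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most base models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from three base models for a single raisin.
sample_votes = {"lr": "Kecimen", "svc": "Besni", "dt": "Kecimen"}
final_label = hard_vote(list(sample_votes.values()))
print(final_label)  # Kecimen (two votes against one)
```

With three models and two classes a tie cannot occur, which is one reason an odd number of base estimators is a convenient choice for binary problems.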


In [10]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression(max_iter=1000)
svc_model = SVC(kernel='rbf', probability=True) # probability=True is needed for soft voting later
dt_model = DecisionTreeClassifier()

# Hard Voting (default)
vc_hard = VotingClassifier(estimators=[('lr', log_model), ('svc', svc_model), ('dt', dt_model)])
vc_hard.fit(X_train, y_train)
y_pred_hard = vc_hard.predict(X_test)

print("--- Hard Voting Report ---")
print(classification_report(y_test, y_pred_hard))

--- Hard Voting Report ---
              precision    recall  f1-score   support

       Besni       0.92      0.83      0.87        83
     Kecimen       0.87      0.94      0.90        97

    accuracy                           0.89       180
   macro avg       0.89      0.88      0.89       180
weighted avg       0.89      0.89      0.89       180



By combining the three models, accuracy on this test split rises from about 83% to **89%**.


### b) Soft Voting

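Soft voting uses more information than hard voting and often performs better when the base models produce reasonable probability estimates. It averages the predicted probabilities from each model and then chooses the class with the highest average probability, so a highly confident vote counts for more than a hesitant one.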
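The difference from majority voting is easiest to see on a case where the two rules disagree. The sketch below uses invented probabilities for a single sample; `soft_vote` is an illustrative helper, not a library function.

```python
from collections import Counter

classes = ["Besni", "Kecimen"]

# Hypothetical per-model probabilities for one raisin, ordered as in `classes`.
probas = {
    "lr": [0.45, 0.55],
    "svc": [0.40, 0.60],
    "dt": [0.95, 0.05],
}


def soft_vote(probas, classes):
    """Average the class probabilities across models and pick the largest."""
    n_models = len(probas)
    averaged = [
        sum(model_p[i] for model_p in probas.values()) / n_models
        for i in range(len(classes))
    ]
    winner = classes[averaged.index(max(averaged))]
    return winner, averaged


# Each model's own top class, which is all that a majority vote would see.
hard_labels = [classes[p.index(max(p))] for p in probas.values()]
hard_winner = Counter(hard_labels).most_common(1)[0][0]

soft_winner, averaged = soft_vote(probas, classes)
print("hard:", hard_winner)  # Kecimen: two of three models lean that way
print("soft:", soft_winner)  # Besni: one very confident model outweighs two lukewarm ones
```

This cuts both ways. An unpruned decision tree tends to report probabilities of exactly 0 or 1, so it can dominate the average in the manner shown here, which may be one reason soft voting does not automatically beat hard voting.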


In [11]:
# Soft Voting
vc_soft = VotingClassifier(estimators=[('lr', log_model), ('svc', svc_model), ('dt', dt_model)], voting='soft')
vc_soft.fit(X_train, y_train)
y_pred_soft = vc_soft.predict(X_test)

print("--- Soft Voting Report ---")
print(classification_report(y_test, y_pred_soft))

--- Soft Voting Report ---
              precision    recall  f1-score   support

       Besni       0.92      0.82      0.87        83
     Kecimen       0.86      0.94      0.90        97

    accuracy                           0.88       180
   macro avg       0.89      0.88      0.88       180
weighted avg       0.89      0.88      0.88       180



In this case, soft voting achieved an accuracy of **88%**, slightly below hard voting's 89% but still a strong improvement over the baseline models.


## 3. An Ensemble of Similar Models: The Random Forest

A **Random Forest** is an ensemble that consists of a large number of individual **Decision Trees**. It uses two key techniques to create a diverse set of trees:
1.  **Bagging (Bootstrap Aggregating):** Each tree is trained on a bootstrap sample, that is, rows drawn at random with replacement from the training data.
2.  **Feature Randomness:** At each node, only a random subset of features is considered for making the split.

These two sources of randomness produce a "forest" of decorrelated trees. Their collective prediction is typically more accurate and less prone to overfitting than a single decision tree.
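Both sources of randomness can be sketched with the standard library alone. The code below is an illustration of the idea rather than scikit-learn's internals. It assumes a training split of 720 rows (80% of the 900 rows implied by the 180-row test set) and the common square-root rule for the number of candidate features per split; the helper names are our own.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

features = ["Area", "MajorAxisLength", "MinorAxisLength", "Eccentricity",
            "ConvexArea", "Extent", "Perimeter"]
n_train = 720  # assumed size of the training split


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one tree's training sample)."""
    return [rng.randrange(n) for _ in range(n)]


def random_feature_subset(features, rng):
    """Pick about sqrt(n_features) candidate features for a single split."""
    k = max(1, int(math.sqrt(len(features))))
    return rng.sample(features, k)


rows = bootstrap_indices(n_train, rng)
unique_fraction = len(set(rows)) / n_train
subset = random_feature_subset(features, rng)

print(f"unique rows in one bootstrap sample: {unique_fraction:.1%}")
print("candidate features at one node:", subset)
```

Because rows are drawn with replacement, each tree sees only about 63% of the distinct training rows (the fraction approaches 1 - 1/e for large samples), and with seven features each split considers just two of them. Every tree therefore learns from a slightly different view of the data.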


In [12]:
from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees in the forest
model_rf = RandomForestClassifier(n_estimators=20)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)

print("--- Random Forest Report ---")
print(classification_report(y_test, y_pred_rf))

--- Random Forest Report ---
              precision    recall  f1-score   support

       Besni       0.83      0.84      0.84        83
     Kecimen       0.86      0.86      0.86        97

    accuracy                           0.85       180
   macro avg       0.85      0.85      0.85       180
weighted avg       0.85      0.85      0.85       180



The Random Forest achieves an accuracy of **85%**, a modest improvement over the single Decision Tree's 82%.


## 4. Conclusion

| Model | Accuracy |
|:--- |:--- |
| Single SVM | 83% |
| Single Decision Tree | 82% |
| **Random Forest Ensemble** | **85%** |
| **Voting Classifier (Soft)** | **88%** |
| **Voting Classifier (Hard)** | **89%** |

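This experiment illustrates the power of ensemble learning. On this test split, every ensemble outperformed both single models. Note that the Decision Tree and Random Forest were trained without a fixed `random_state`, so the exact figures will vary slightly from run to run.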
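To gauge how much weight these differences can bear, it helps to estimate the sampling uncertainty of an accuracy measured on only 180 test rows. The sketch below uses the usual binomial approximation, which treats the test rows as independent draws. It is a rough guide only, since all models were scored on the same rows. The accuracies are the ones printed in the classification reports above.

```python
import math

n_test = 180  # test-set size shown in the classification reports


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


reported = {
    "Single SVM": 0.83,
    "Single Decision Tree": 0.82,
    "Random Forest": 0.85,
    "Voting (Soft)": 0.88,
    "Voting (Hard)": 0.89,
}

for name, acc in reported.items():
    se = accuracy_standard_error(acc, n_test)
    print(f"{name:22s} {acc:.2f} +/- {se:.3f}")
```

Each estimate carries a standard error of roughly two to three percentage points, so the ranking among the ensembles should be read as suggestive. Repeating the comparison with cross-validation would give a firmer picture.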