# Bagging vs Bagging + Voting on Random Forest (mtcars Dataset)

## Objective
This notebook compares two ensembles on the `mtcars` dataset: standard Bagging applied to a Random Forest, and a hybrid technique that combines Bagging with Voting by averaging several Random Forest models. The goal is to explore whether refining the ensemble through a voting mechanism improves model performance when predicting `mpg`.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('/mnt/data/mtcars.csv')

# Overview
print("Shape:", df.shape)
print("\nInfo:")
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\nSummary statistics:")
print(df.describe())

# Pairplot to explore relationships
sns.pairplot(df)
plt.show()


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

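# Define target and features. Only numeric columns are kept, so a car-name
# column (present in some mtcars CSV exports) cannot break the scaler.
X = df.drop(columns='mpg').select_dtypes(include='number')
y = df['mpg']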

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
from sklearn.ensemble import BaggingRegressor, VotingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Base RF model for bagging
base_rf = RandomForestRegressor(n_estimators=100, random_state=42)

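# Bagging Regressor: 10 bootstrap replicates, each fitting its own Random Forest.
# The keyword is `estimator`; the older `base_estimator` name was deprecated
# in scikit-learn 1.2 and later removed.
bagging_model = BaggingRegressor(estimator=base_rf, n_estimators=10, random_state=42)
bagging_model.fit(X_train_scaled, y_train)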

# Voting Regressor using multiple RFs
rf1 = RandomForestRegressor(n_estimators=50, random_state=1)
rf2 = RandomForestRegressor(n_estimators=50, random_state=2)
rf3 = RandomForestRegressor(n_estimators=50, random_state=3)

voting_model = VotingRegressor(estimators=[
    ('rf1', rf1), ('rf2', rf2), ('rf3', rf3)
])
voting_model.fit(X_train_scaled, y_train)

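To make the voting step concrete, here is a minimal sketch of what `VotingRegressor` does at prediction time: with no weights given, it returns the plain average of its fitted members' predictions. The data is a synthetic stand-in generated with `make_regression` (an assumption, chosen so the snippet runs without the CSV), and the names `X_demo`, `y_demo`, and `voter` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor

# Synthetic stand-in for the notebook's data (assumption: 32 rows, 10 features)
X_demo, y_demo = make_regression(n_samples=32, n_features=10, noise=5.0, random_state=0)

# Three forests that differ only in their random seed, as in the cell above
voter = VotingRegressor(estimators=[
    (f"rf{seed}", RandomForestRegressor(n_estimators=50, random_state=seed))
    for seed in (1, 2, 3)
])
voter.fit(X_demo, y_demo)

# Average the fitted members' predictions by hand ...
manual_mean = np.mean([est.predict(X_demo) for est in voter.estimators_], axis=0)
# ... and compare with the ensemble's own prediction
ensemble_pred = voter.predict(X_demo)

print("Matches manual average:", np.allclose(manual_mean, ensemble_pred))
```

Because the members here are all Random Forests, the vote is simply an average over three 50-tree forests trained on the same data.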

In [None]:
# Predictions
bagging_pred = bagging_model.predict(X_test_scaled)
voting_pred = voting_model.predict(X_test_scaled)

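# Metrics
def print_metrics(name, y_true, y_pred):
    # RMSE is computed as the square root of MSE, because the `squared=False`
    # argument is deprecated in recent scikit-learn releases.
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    print(f"{name} R2 Score: {r2_score(y_true, y_pred):.3f}")
    print(f"{name} RMSE: {rmse:.3f}\n")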

print_metrics("Bagging RF", y_test, bagging_pred)
print_metrics("Bagging + Voting RF", y_test, voting_pred)

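A single 80/20 split of a 32-row dataset leaves only 7 test rows, so the two scores above can shift noticeably with the split. One way to get a steadier comparison is repeated k-fold cross-validation, sketched below. The data is synthetic (`make_regression`, an assumption so the snippet is self-contained), the ensembles are smaller than the notebook's to keep it fast, and `models`, `cv`, and `cv_rmse` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the notebook's data (assumption: 32 rows, 10 features)
X_demo, y_demo = make_regression(n_samples=32, n_features=10, noise=5.0, random_state=0)

# Reduced-size versions of the two ensembles, only to keep the sketch quick
models = {
    "Bagging RF": BaggingRegressor(
        RandomForestRegressor(n_estimators=20, random_state=42),
        n_estimators=5,
        random_state=42,
    ),
    "Voting RF": VotingRegressor([
        (f"rf{seed}", RandomForestRegressor(n_estimators=20, random_state=seed))
        for seed in (1, 2, 3)
    ]),
}

# 4 folds repeated 3 times gives 12 held-out RMSE values per model
cv = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores  # the scorer returns negated RMSE
    print(f"{name}: RMSE {cv_rmse[name].mean():.2f} +/- {cv_rmse[name].std():.2f}")
```

Comparing the mean and spread of the fold scores gives a better sense of whether a gap between the two models is larger than the split-to-split variation.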

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(y_test.values, label='Actual MPG', marker='o')
plt.plot(bagging_pred, label='Bagging Predictions', marker='x')
plt.plot(voting_pred, label='Voting Predictions', marker='s')
plt.legend()
plt.title('Actual vs Predicted MPG')
plt.xlabel('Test Sample Index')
plt.ylabel('MPG')
plt.grid(True)
plt.show()


In [None]:
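# Example: Predict MPG for a hypothetical car
# Values are listed in the order cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
# Building a DataFrame with the training column names avoids scikit-learn's
# feature-name warning when a scaler fitted on a DataFrame receives a bare array.
sample = pd.DataFrame([[6, 160, 110, 3.9, 2.62, 16.5, 0, 1, 4, 4]], columns=X_train.columns)
sample_scaled = scaler.transform(sample)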

bagging_sample_pred = bagging_model.predict(sample_scaled)[0]
voting_sample_pred = voting_model.predict(sample_scaled)[0]

print(f"Bagging RF - Predicted MPG: {bagging_sample_pred:.2f}")
print(f"Voting RF - Predicted MPG: {voting_sample_pred:.2f}")


## Conclusion

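- Both the Bagging and the Bagging + Voting ensembles of Random Forests performed well on the held-out test set.
- The Voting Regressor showed slightly better generalization in most trials. With only 32 cars in the dataset (7 in the test split), such small differences are best confirmed with repeated cross-validation rather than a single split.
- Training the member forests with different random states adds diversity, which helped improve the robustness of the averaged prediction.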


### Notes:
- You could tune hyperparameters such as the number of trees, `max_depth`, or add different base estimators.
- `VotingRegressor` also supports heterogeneous models, such as combining RF with `SVR` or `GradientBoostingRegressor` for a richer ensemble.

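As an illustration of the heterogeneous option mentioned in the notes, the sketch below combines a Random Forest, a gradient boosting model, and an SVR in one `VotingRegressor`. It is a minimal example under stated assumptions rather than a tuned setup: the data is synthetic (`make_regression`), the hyperparameters such as `C=10.0` are arbitrary, and `mixed_voter` and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the notebook's data (assumption: 32 rows, 10 features)
X_demo, y_demo = make_regression(n_samples=32, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

mixed_voter = VotingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=50, random_state=1)),
    ("gbr", GradientBoostingRegressor(random_state=1)),
    # SVR is sensitive to feature scale, so it gets its own scaler in a pipeline
    ("svr", make_pipeline(StandardScaler(), SVR(C=10.0))),
])
mixed_voter.fit(X_tr, y_tr)

mixed_pred = mixed_voter.predict(X_te)
print("Number of test predictions:", mixed_pred.shape[0])
```

Putting the scaler inside the SVR pipeline means the tree-based members see the raw features, and the scaler is refit on whatever training data the ensemble receives.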