# Bagging Exercise

In this exercise, you will explore Bagging (Bootstrap Aggregating) and implement several variants of it: a random forest, a generic bagging meta-estimator, pasting, and roughly balanced bagging. Bagging is an ensemble technique used mainly to reduce the variance of a predictive model and so limit overfitting. It trains many copies of a base learner, each on a bootstrap sample of the training set (rows drawn with replacement), and combines their predictions by voting or averaging, so that the ensemble tends to generalize better than any single model.
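To make the resampling step concrete, here is a minimal sketch of a single bootstrap sample drawn with NumPy. The size `n` and the seed are arbitrary values chosen for illustration, not part of the exercise. Because rows are drawn with replacement, a bootstrap sample of size `n` is expected to contain roughly 63% of the distinct original rows, with the remainder being duplicates.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n = 1000                        # illustrative training-set size

# A bootstrap sample: n row indices drawn uniformly with replacement
bootstrap_indices = rng.integers(0, n, size=n)

# Fraction of distinct original rows that made it into the sample
unique_fraction = np.unique(bootstrap_indices).size / n
print(f"distinct rows in bootstrap sample: {unique_fraction:.1%}")
```

The rows left out of a given sample are what makes each learner in the ensemble slightly different from the others.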

## Dataset
We will use the Iris dataset for this exercise. Iris is a classic machine learning dataset containing four measurements (sepal and petal length and width) for 150 iris flowers from three species. **Feel free to use another dataset!**

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement Bagging models.
4. Evaluate the models' performance.

Please fill in the following code blocks to complete the exercise.


In [23]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier



# Load the dataset


In [17]:
data = load_iris()
X = data.data
y = data.target

# Preprocess the data (if necessary)

In [9]:
# The Iris dataset is already clean (numeric features, no missing values), so no preprocessing is needed.

# Split the Dataset

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Initialize and Train the Classifiers

## Random Forest
Initialize and train a Random Forest classifier.

In [19]:
random_forest_classifier = RandomForestClassifier(n_estimators=70, random_state=42)
random_forest_classifier.fit(X_train, y_train)


### Evaluate the model performance

In [20]:
predictions = random_forest_classifier.predict(X_test)

In [21]:
accuracy = accuracy_score(y_test, predictions)
print(f"random forest model accuracy: {accuracy * 100:.2f}%")


random forest model accuracy: 100.00%
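A perfect score on a 45-row test set says little on its own. Because each tree in a bagged ensemble is fit on a bootstrap sample, the training rows a tree never saw can act as a built-in validation set. The sketch below uses scikit-learn's `oob_score=True` option to obtain this out-of-bag estimate. It reloads the data so that it runs on its own, and the name `oob_forest` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row
oob_forest = RandomForestClassifier(n_estimators=70, oob_score=True, random_state=42)
oob_forest.fit(X_train, y_train)

print(f"out-of-bag accuracy: {oob_forest.oob_score_ * 100:.2f}%")
```

The out-of-bag estimate needs no extra held-out data, which is convenient on a dataset as small as Iris.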


## Bagging Meta-estimator
Initialize a K-Nearest Neighbors classifier and use it as the base estimator for the Bagging classifier.

In [24]:
# KNN is the base learner; BaggingClassifier fits 70 copies, each on its own bootstrap sample
base_estimator = KNeighborsClassifier()
bagging_classifier = BaggingClassifier(base_estimator, n_estimators=70, random_state=42)

In [25]:
bagging_classifier.fit(X_train, y_train)

### Evaluate the model performance

In [26]:
predictions = bagging_classifier.predict(X_test)

In [27]:
accuracy = accuracy_score(y_test, predictions)
print(f'bagging classifier model accuracy: {accuracy * 100:.2f}%')

bagging classifier model accuracy: 100.00%
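To see what bagging adds on top of the base learner, one option is to compare a single KNN with the bagged KNN under cross-validation instead of a single split. The following is a sketch under that assumption. The 5-fold setting and the variable names are illustrative choices. KNN is a fairly stable learner, so the two scores may well turn out close.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

single_knn = KNeighborsClassifier()
bagged_knn = BaggingClassifier(KNeighborsClassifier(), n_estimators=70, random_state=42)

# 5-fold cross-validation uses every row for testing exactly once
single_scores = cross_val_score(single_knn, X, y, cv=5)
bagged_scores = cross_val_score(bagged_knn, X, y, cv=5)

print(f"single KNN: {single_scores.mean():.3f} +/- {single_scores.std():.3f}")
print(f"bagged KNN: {bagged_scores.mean():.3f} +/- {bagged_scores.std():.3f}")
```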


## Pasting
Initialize a Decision Tree classifier and use it as the base estimator for a Bagging classifier with Pasting (without replacement).

In [28]:
from sklearn.tree import DecisionTreeClassifier

In [29]:
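base_estimator = DecisionTreeClassifier()
# Pasting: bootstrap=False draws rows without replacement; max_samples=0.8 (an assumed
# value) gives each tree a different 80% subset instead of the full training set
pasting_classifier = BaggingClassifier(base_estimator, n_estimators=70, max_samples=0.8, bootstrap=False, random_state=42)
pasting_classifier.fit(X_train, y_train)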
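The difference between bagging and pasting comes down to whether a row can be drawn more than once for the same estimator. The sketch below inspects the fitted ensemble's `estimators_samples_` attribute to show this. The helper `first_sample_indices`, the five estimators, and the 80% subset size are illustrative assumptions rather than part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def first_sample_indices(bootstrap):
    """Fit a small ensemble and return the row indices drawn for its first tree."""
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=5,
        max_samples=0.8,      # each tree sees 80% of the rows (illustrative value)
        bootstrap=bootstrap,  # True: with replacement, False: without (pasting)
        random_state=42,
    )
    clf.fit(X, y)
    return clf.estimators_samples_[0]

bagging_idx = first_sample_indices(bootstrap=True)
pasting_idx = first_sample_indices(bootstrap=False)

print("bagging:", len(bagging_idx), "drawn,", len(np.unique(bagging_idx)), "distinct")
print("pasting:", len(pasting_idx), "drawn,", len(np.unique(pasting_idx)), "distinct")
```

With pasting every drawn index is distinct, while the bagging sample repeats some rows and omits others.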

### Evaluate the model performance

In [30]:
predictions = pasting_classifier.predict(X_test)

In [31]:
accuracy = accuracy_score(y_test, predictions)
print(f'pasting classifier model accuracy: {accuracy * 100:.2f}%')

pasting classifier model accuracy: 100.00%


## Roughly Balanced Bagging (RBB)
Implement Roughly Balanced Bagging by manually creating balanced bootstrap samples and aggregating predictions from multiple Decision Tree classifiers.
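One detail of the aggregation step matters on a three-class problem such as Iris: averaging the predicted labels and rounding only works for binary 0/1 labels. The sketch below contrasts that shortcut with a per-sample majority vote on a tiny hand-made prediction matrix. The helper `majority_vote` and the toy values are hypothetical and used only for illustration.

```python
import numpy as np

def majority_vote(preds):
    """Return the most frequent label per column (ties go to the smallest label)."""
    preds = np.asarray(preds, dtype=int)
    n_classes = preds.max() + 1
    # counts[k, j] = number of estimators that predicted class k for sample j
    counts = np.stack([(preds == k).sum(axis=0) for k in range(n_classes)])
    return counts.argmax(axis=0)

# Rows are estimators, columns are test samples (toy values)
toy_preds = np.array([
    [0, 1, 2],
    [0, 2, 2],
    [2, 2, 0],
])

voted = majority_vote(toy_preds)
rounded_mean = np.round(toy_preds.mean(axis=0)).astype(int)

print("majority vote:", voted)         # [0 2 2]
print("rounded mean :", rounded_mean)  # [1 2 1]
```

For the first sample, two of three estimators predict class 0, yet the rounded mean yields class 1, a label that no estimator predicted.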

In [34]:
import numpy as np

In [36]:
n_estimators = 100
ensemble_preds = np.zeros((n_estimators, len(X_test)))
ensemble_models = []

rng = np.random.default_rng(42)  # seeded so the bootstrap samples are reproducible
classes = np.unique(y_train)     # Iris has three classes, not just 0 and 1
# Draw the same number of rows from every class: the size of the smallest class
n_per_class = min(np.sum(y_train == c) for c in classes)

for i in range(n_estimators):
    # Bootstrap (sample with replacement) n_per_class indices from each class
    balanced_sample_indices = np.concatenate([
        rng.choice(np.where(y_train == c)[0], size=n_per_class, replace=True)
        for c in classes
    ])
    rng.shuffle(balanced_sample_indices)

    X_train_balanced = X_train[balanced_sample_indices]
    y_train_balanced = y_train[balanced_sample_indices]

    tree_clf = DecisionTreeClassifier(random_state=i)
    tree_clf.fit(X_train_balanced, y_train_balanced)
    ensemble_models.append(tree_clf)

### Evaluate the model performance

In [37]:
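# Collect every tree's predictions, not only those of the last tree fitted
for i, model in enumerate(ensemble_models):
    ensemble_preds[i] = model.predict(X_test)
# Majority vote per test row; rounding the mean label is only valid for binary 0/1 labels
final_preds = np.array([np.bincount(col).argmax() for col in ensemble_preds.astype(int).T])
accuracy = accuracy_score(y_test, final_preds)
print(f'Roughly Balanced Bagging Model Accuracy: {accuracy:.2f}')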
