# Bagging Exercise

In this exercise, you will explore Bagging (Bootstrap Aggregating) and implement it with several base models, including a random forest. Bagging is an ensemble technique used mainly to reduce the variance of a predictive model and so limit overfitting. The main idea is to train several learners on different bootstrap samples of the training data (random samples drawn with replacement) and aggregate their predictions, by majority vote for classification or averaging for regression, so that the ensemble generalizes better than any single learner.
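
As a rough illustration of the "bootstrap" half of the name, the sketch below draws one bootstrap sample with NumPy and counts how many distinct rows it contains. The sample size `n` and the seed are arbitrary choices made for this illustration; they are not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 10_000                      # hypothetical training-set size

# A bootstrap sample: n row indices drawn uniformly with replacement
bootstrap_indices = rng.integers(0, n, size=n)

# Fraction of distinct rows that made it into the sample
unique_fraction = np.unique(bootstrap_indices).size / n

# Probability that a given row appears at least once; tends to 1 - 1/e for large n
expected_fraction = 1 - (1 - 1 / n) ** n

print(f"observed: {unique_fraction:.3f}, expected: {expected_fraction:.3f}")
```

Each base learner therefore sees a somewhat different subset of the rows, and the rows it never sees (its out-of-bag rows) are part of what makes the learners differ from one another.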

## Dataset
We will use the Iris dataset for this exercise. The Iris dataset is a classic dataset from the field of machine learning, containing measurements for iris flowers of three different species. **Feel free to use another dataset!!**

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement Bagging models.
4. Evaluate the models' performance.

Please fill in the following code blocks to complete the exercise.


In [101]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the dataset


In [104]:
# Load the dataset once and reuse the returned object
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Preprocess the data (if necessary)

In [120]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB


In [122]:
df['target'].value_counts()

target
0    50
1    50
2    50
Name: count, dtype: int64

# Split the Dataset

In [109]:
X = df.drop(columns = 'target')
# Use a 1-D Series; a one-column DataFrame triggers a DataConversionWarning in fit()
y = df['target']

In [112]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = .2)

# Initialize and Train the Classifiers

## Random Forest
Initialize and train a Random Forest classifier.

In [128]:
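# A random forest is itself a bagged ensemble of decision trees.
rf = RandomForestClassifier()
# Wrapping it in a BaggingClassifier with n_estimators = 1 trains a single
# forest on one bootstrap sample of the training set.
bagging = BaggingClassifier(rf, n_estimators = 1, random_state = 7)
bagging.fit(X_train, y_train)

y_pred_rfbag = bagging.predict(X_test)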



### Evaluate the model performance

In [None]:
accuracy_score(y_test, y_pred_rfbag)

1.0

## Bagging Meta-estimator
Initialize a K-Nearest Neighbors classifier and use it as the base estimator for the Bagging classifier.

In [130]:
kn = KNeighborsClassifier()
bagging_kn = BaggingClassifier(kn, n_estimators = 20, random_state = 42)

bagging_kn.fit(X_train, y_train)
y_pred_kn = bagging_kn.predict(X_test)



### Evaluate the model performance

In [132]:
accuracy_score(y_test, y_pred_kn)

1.0
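
A single test set of 30 rows cannot separate models that all score 1.0, so a cross-validated comparison may be more informative. The sketch below is one way to do that, comparing a single K-Nearest Neighbors classifier with its bagged version. The 5-fold stratified split, the seeds, and the names `X_iris`, `y_iris`, `single_knn`, and `bagged_knn` are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Stratified folds keep the three species equally represented in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

single_knn = KNeighborsClassifier()
bagged_knn = BaggingClassifier(KNeighborsClassifier(), n_estimators=20, random_state=42)

# One accuracy value per fold for each model
single_scores = cross_val_score(single_knn, X_iris, y_iris, cv=cv, scoring="accuracy")
bagged_scores = cross_val_score(bagged_knn, X_iris, y_iris, cv=cv, scoring="accuracy")

print(f"single KNN: {single_scores.mean():.3f} +/- {single_scores.std():.3f}")
print(f"bagged KNN: {bagged_scores.mean():.3f} +/- {bagged_scores.std():.3f}")
```

K-Nearest Neighbors is a fairly stable learner, so bagging it may change the scores only slightly; the spread across folds gives a sense of how much of any difference is noise.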

## Pasting
Initialize a Decision Tree classifier and use it as the base estimator for a Bagging classifier with Pasting (without replacement).

In [134]:
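dt = DecisionTreeClassifier()
# Pasting: bootstrap = False draws each estimator's sample without replacement.
# max_samples must be below 1.0 so the estimators see different subsets;
# 0.8 is an illustrative choice.
bagging_dt = BaggingClassifier(dt, n_estimators = 20, max_samples = 0.8, bootstrap = False, random_state = 42)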

bagging_dt.fit(X_train, y_train)
y_pred_dt = bagging_dt.predict(X_test)


### Evaluate the model performance

In [136]:
accuracy_score(y_test, y_pred_dt)

1.0
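
One way to check whether a `BaggingClassifier` is really pasting rather than bagging is to inspect the row indices each base estimator was trained on, which scikit-learn exposes as `estimators_samples_`. The sketch below is a minimal check under assumed settings: the helper `fit_and_count_duplicates`, the value `max_samples=0.8`, and the use of the full Iris data are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

def fit_and_count_duplicates(bootstrap):
    """Fit a small ensemble and count repeated rows in each estimator's sample."""
    model = BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=5,
        max_samples=0.8,      # each estimator gets 80% of the rows
        bootstrap=bootstrap,  # True: with replacement, False: without (pasting)
        random_state=42,
    )
    model.fit(X_iris, y_iris)
    # estimators_samples_ holds, per base estimator, the row indices it was trained on
    return [len(idx) - len(np.unique(idx)) for idx in model.estimators_samples_]

bagging_duplicates = fit_and_count_duplicates(bootstrap=True)
pasting_duplicates = fit_and_count_duplicates(bootstrap=False)

print("duplicates per estimator (bagging):", bagging_duplicates)
print("duplicates per estimator (pasting):", pasting_duplicates)
```

With pasting, every count should be zero because rows are drawn without replacement; with bagging, repeated rows are expected.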

## Roughly Balanced Bagging (RBB)
Implement Roughly Balanced Bagging by manually creating balanced bootstrap samples and aggregating predictions from multiple Decision Tree classifiers.

In [146]:
import numpy as np

# Number of base estimators
n_estimators = 100

rng = np.random.default_rng(42)  # seeded so the samples are reproducible
y_train_arr = np.ravel(y_train)  # 1-D labels, whether y_train is a Series or a one-column DataFrame
classes = np.unique(y_train_arr)
# Every class contributes as many rows as the smallest class has
n_per_class = int(min(np.sum(y_train_arr == c) for c in classes))

# Initialize arrays to store the ensemble predictions and models
ensemble_preds = np.zeros((n_estimators, len(X_test)), dtype=int)
ensemble_models = []

for i in range(n_estimators):
    # Create a bootstrap sample with the same number of rows from every class
    balanced_sample_indices = np.concatenate([
        rng.choice(np.where(y_train_arr == c)[0], size=n_per_class, replace=True)
        for c in classes
    ])
    rng.shuffle(balanced_sample_indices)

    X_train_balanced = X_train.iloc[balanced_sample_indices]
    y_train_balanced = y_train_arr[balanced_sample_indices]

    # Train a decision tree classifier on the balanced bootstrap sample
    tree_clf = DecisionTreeClassifier(random_state=i)
    tree_clf.fit(X_train_balanced, y_train_balanced)
    ensemble_models.append(tree_clf)

    # Make predictions on the test set
    ensemble_preds[i] = tree_clf.predict(X_test)

# Majority voting: for each test row, pick the class predicted most often
# (assumes the labels are non-negative integers, as in Iris)
y_pred_rbb = np.array([np.bincount(col).argmax() for col in ensemble_preds.T])



### Evaluate the model performance

In [149]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_rbb)
print(f'Roughly Balanced Bagging Model Accuracy: {accuracy:.2f}')
