# Bagging Exercise

In this exercise, you will explore bagging (bootstrap aggregating) and implement it with several scikit-learn models, starting with a random forest. Bagging is an ensemble technique used mainly to reduce the variance of a predictive model and so limit overfitting. The idea is to train multiple learners on different random samples of the training data and combine their predictions, so that the ensemble performs better than an individual model.
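
The two ingredients named above, bootstrap sampling and aggregation, can be sketched with NumPy alone. This is a minimal illustration, not part of the exercise solution: the helper names `bootstrap_indices` and `majority_vote` are hypothetical, and the vote matrix is made-up toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return rng.integers(0, n_samples, size=n_samples)

def majority_vote(predictions):
    """Aggregate an (n_models, n_points) array of 0/1 predictions by majority vote."""
    predictions = np.asarray(predictions)
    return (predictions.mean(axis=0) > 0.5).astype(int)

# A bootstrap sample repeats some rows and leaves others out,
# so each learner sees a slightly different training set.
idx = bootstrap_indices(1000, rng)
unique_fraction = len(np.unique(idx)) / 1000
print(f"Unique rows in one bootstrap sample: {unique_fraction:.1%}")

# Toy predictions from three models on three test points (made-up values).
votes = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 1]])
print(majority_vote(votes))  # [1 0 1]
```

On average about 63% of the rows appear at least once in a bootstrap sample of the same size as the data, which is where the diversity between learners comes from.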

## Dataset
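The exercise suggests the Iris dataset, a classic machine learning dataset containing measurements for iris flowers of three different species. The code below loads scikit-learn's breast cancer dataset instead, a binary classification dataset, which suits the two-class sampling used in the Roughly Balanced Bagging section. **Feel free to use another dataset!**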

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement Bagging models.
4. Evaluate the models' performance.

Please fill in the following code blocks to complete the exercise.


# Load the dataset


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_iris


In [3]:
# Load the breast cancer dataset (binary target: 0 = malignant, 1 = benign)
data = load_breast_cancer()
X = data.data
y = data.target

# Preprocess the data (if necessary)

# Split the Dataset

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and Train the Classifiers

## Random Forest
Initialize and train a Random Forest classifier.

In [11]:
random_forestC = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
random_forestC.fit(X_train, y_train)


### Evaluate the model performance

In [19]:
RF_pred = random_forestC.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
RF_accuracy = accuracy_score(y_test, RF_pred)
print(f'Random Forest accuracy: {RF_accuracy*100:.2f}%')

Random Forest accuracy: 96.49%


## Bagging Meta-estimator
Initialize a K-Nearest Neighbors classifier and use it as the base estimator for the Bagging classifier.

In [21]:
base_estimator = KNeighborsClassifier()
# Pass the base estimator positionally so the call does not depend on the keyword name
bagging_ME = BaggingClassifier(base_estimator, n_estimators=50, random_state=0)
bagging_ME.fit(X_train, y_train)

### Evaluate the model performance

In [20]:
bagging_ME_pred = bagging_ME.predict(X_test)
bagging_ME_accuracy = accuracy_score(y_test, bagging_ME_pred)
print(f'Bagging Meta-Estimator accuracy: {bagging_ME_accuracy*100:.2f}%')

Bagging Meta-Estimator accuracy: 95.91%


## Pasting
Initialize a Decision Tree classifier and use it as the base estimator for a Bagging classifier with Pasting (without replacement).

In [23]:
base_estimator_pasting = DecisionTreeClassifier(max_depth=3)
Pasting_classifier = BaggingClassifier(base_estimator_pasting, 
                                       bootstrap=False,
                                       random_state=0,
                                       max_samples=0.5)
Pasting_classifier.fit(X_train,y_train)
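
The difference between bagging and pasting comes down to whether rows are drawn with or without replacement. The sketch below shows this with NumPy only; the sizes are made-up toy values chosen to mirror `max_samples=0.5`, and it is not how `BaggingClassifier` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10    # toy training-set size (assumption for illustration)
subset_size = 5   # half of the rows, mirroring max_samples=0.5

# Bagging: draw with replacement, so the same row may appear more than once.
bagging_idx = rng.choice(n_samples, size=subset_size, replace=True)

# Pasting: draw without replacement, so every drawn row is distinct.
pasting_idx = rng.choice(n_samples, size=subset_size, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

With pasting, each base estimator therefore needs `max_samples` below 1.0 to see a different subset; drawing all rows without replacement would give every estimator the same training set.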

### Evaluate the model performance

In [None]:
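pasting_pred = Pasting_classifier.predict(X_test)
pasting_accuracy = accuracy_score(y_test, pasting_pred)
print(f'Pasting Classifier Model Accuracy: {pasting_accuracy*100:.2f}%')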

Pasting Classifier Model Accuracy: 97.08%


## Roughly Balanced Bagging (RBB)
Implement Roughly Balanced Bagging by manually creating balanced bootstrap samples and aggregating predictions from multiple Decision Tree classifiers.

In [6]:
n_estimators = 100

# Initialize the array that stores each model's test predictions, and the list of models
ensemble_preds = np.zeros((n_estimators, len(X_test)))
ensemble_models = []

# Indices of the positive (1) and negative (0) training examples
pos_indices = np.where(y_train == 1)[0]
neg_indices = np.where(y_train == 0)[0]
mini_class = min(len(pos_indices), len(neg_indices))

for i in range(n_estimators):
    # Draw the same number of examples from each class, with replacement
    chosen_pos_indices = np.random.choice(pos_indices, size=mini_class, replace=True)
    chosen_neg_indices = np.random.choice(neg_indices, size=mini_class, replace=True)

    # Combine the chosen indices and shuffle them
    balanced_indices = np.concatenate([chosen_neg_indices, chosen_pos_indices])
    np.random.shuffle(balanced_indices)

    # X_train and y_train are NumPy arrays, so index them directly (no .iloc)
    x_train_balanced = X_train[balanced_indices]
    y_train_balanced = y_train[balanced_indices]

    decision_tree = DecisionTreeClassifier(max_depth=3, random_state=i)
    decision_tree.fit(x_train_balanced, y_train_balanced)
    ensemble_models.append(decision_tree)

    ensemble_preds[i] = decision_tree.predict(X_test)

### Evaluate the model performance

In [None]:
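# Majority vote: average the 0/1 predictions across models and round
final_preds = np.round(np.mean(ensemble_preds, axis=0)).astype(int)
rbb_accuracy = accuracy_score(y_test, final_preds)
print(f'Roughly Balanced Bagging Model Accuracy: {rbb_accuracy*100:.2f}%')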
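
The sampling loop in this section draws exactly the same number of rows from each class. In the commonly cited formulation of Roughly Balanced Bagging, the number of majority-class rows is instead drawn from a negative binomial distribution, so the classes are equal only on average. The sketch below assumes that formulation; the helper `rbb_sample_indices`, its `minority_label` parameter, and the toy labels are hypothetical and not part of the exercise.

```python
import numpy as np

def rbb_sample_indices(y, rng, minority_label=0):
    """Return shuffled row indices for one roughly balanced bootstrap sample."""
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_min = len(minority_idx)

    # Number of majority rows: failures before n_min successes at p = 0.5,
    # which equals n_min on average (assumed RBB formulation).
    n_maj = rng.negative_binomial(n_min, 0.5)

    # Both classes are drawn with replacement.
    chosen_min = rng.choice(minority_idx, size=n_min, replace=True)
    chosen_maj = rng.choice(majority_idx, size=n_maj, replace=True)
    return rng.permutation(np.concatenate([chosen_min, chosen_maj]))

# Toy imbalanced labels: 30 minority (0) and 70 majority (1) rows.
rng = np.random.default_rng(0)
y_toy = np.array([0] * 30 + [1] * 70)
idx = rbb_sample_indices(y_toy, rng)
counts = np.bincount(y_toy[idx], minlength=2)
print("rows per class in this sample:", counts)
```

In the breast cancer data the minority class is label 0 (malignant), so the default `minority_label=0` would apply there as well; for other datasets the label would need to be set explicitly.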