
# Bagging Exercise

In this exercise, you will explore Bagging (Bootstrap Aggregating) and implement it, starting with a random forest model. Bagging is an ensemble technique that trains multiple learners on bootstrap samples of the training data (samples drawn with replacement) and aggregates their predictions. Its main purpose is to reduce the variance of a predictive model and so limit overfitting: the combined ensemble is usually more stable than any single learner.
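
As a minimal sketch of the "bootstrap" half of the name, assuming only NumPy: each learner sees a sample drawn with replacement from the training set, so some rows repeat and others are left out. The seed and the row count below are illustrative values, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n_rows = 10_000                  # illustrative training-set size

# A bootstrap sample draws n_rows row indices with replacement
bootstrap_indices = rng.integers(0, n_rows, size=n_rows)

# Fraction of distinct original rows that ended up in the sample
unique_fraction = np.unique(bootstrap_indices).size / n_rows
print(f"unique rows in one bootstrap sample: {unique_fraction:.3f}")
```

For large samples this fraction approaches 1 - 1/e, roughly 0.63. The rows left out of a given sample are the "out-of-bag" rows for that learner.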

## Dataset
We will use the Iris dataset for this exercise. Iris is a classic machine learning dataset containing four measurements (sepal and petal length and width) for 150 iris flowers from three species. **Feel free to use another dataset!**

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement Bagging models.
4. Evaluate the models' performance.

Please fill in the following code blocks to complete the exercise.


In [1]:
! pip install pymongo scikit-learn python-dotenv

Collecting pymongo
  Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: python-dotenv, dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.8.0 python-dotenv-1.0.1


In [2]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [3]:
from sklearn.ensemble import BaggingClassifier

In [22]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset


In [4]:
iris = load_iris()

In [5]:
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Preprocess the data (if necessary)

In [6]:
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [7]:
df.duplicated().sum()

1

In [8]:
df

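| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |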


In [9]:
df.drop_duplicates(inplace=True)

In [10]:
# df_pre = pd.get_dummies(df)

In [11]:
scaler = StandardScaler()

In [12]:
X = df.drop(columns='target')
y = df.target

In [13]:
X = scaler.fit_transform(X)

In [14]:
X

array([[-0.8980334 ,  1.01240113, -1.33325507, -1.30862368],
       [-1.13956224, -0.1373532 , -1.33325507, -1.30862368],
       [-1.38109108,  0.32254853, -1.39001364, -1.30862368],
       [-1.5018555 ,  0.09259766, -1.2764965 , -1.30862368],
       [-1.01879782,  1.242352  , -1.33325507, -1.30862368],
       [-0.53574014,  1.9322046 , -1.16297935, -1.04548613],
       [-1.5018555 ,  0.78245027, -1.33325507, -1.17705491],
       [-1.01879782,  0.78245027, -1.2764965 , -1.30862368],
       [-1.74338434, -0.36730407, -1.33325507, -1.30862368],
       [-1.13956224,  0.09259766, -1.2764965 , -1.44019246],
       [-0.53574014,  1.47230287, -1.2764965 , -1.30862368],
       [-1.26032666,  0.78245027, -1.21973792, -1.30862368],
       [-1.26032666, -0.1373532 , -1.33325507, -1.44019246],
       [-1.86414876, -0.1373532 , -1.50353079, -1.44019246],
       [-0.05268246,  2.16215547, -1.44677222, -1.30862368],
       [-0.17344688,  3.08195894, -1.2764965 , -1.04548613],
       [-0.53574014,  1.

In [15]:
y.value_counts()

target
0    50
1    50
2    49
Name: count, dtype: int64

# Split the Dataset

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
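
The cells above scale the full dataset before splitting, so the test rows influence the scaler's mean and variance. A hedged alternative, assuming you are free to reorder the steps, is to split first and let a `Pipeline` fit the scaler on the training rows only. The names `X_raw`, `Xr_train`, and `model` are illustrative, and `stratify=y_raw` is an optional addition that keeps the class proportions similar in both splits.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_iris(return_X_y=True)

# Split before any preprocessing so the test rows stay unseen
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when transforming the test rows
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
model.fit(Xr_train, yr_train)

test_accuracy = model.score(Xr_test, yr_test)
print(f"accuracy: {test_accuracy * 100:.2f}%")
```

Tree-based models are insensitive to feature scaling, so the scaler matters mostly for distance-based estimators such as K-Nearest Neighbors.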

# Initialize and Train the Classifiers

## Random Forest
Initialize and train a Random Forest classifier.

In [17]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

### Evaluate the model performance

In [18]:
y_pred = rf.predict(X_test)

In [19]:
accuracy = accuracy_score(y_test, y_pred)

In [20]:
print(f'accuracy: {accuracy*100:.2f}%')

accuracy: 100.00%
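
A perfect score on a test set of about 30 rows says little on its own. As a sketch of two complementary checks, assuming the unscaled Iris features (trees do not need scaling): the out-of-bag score evaluates each tree on the rows left out of its bootstrap sample, and 5-fold cross-validation averages over several splits. The variable names and `n_estimators=200` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

# Out-of-bag estimate: no separate test set is needed
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X_iris, y_iris)
print(f"out-of-bag accuracy: {rf_oob.oob_score_:.3f}")

# 5-fold cross-validation on a fresh forest
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=42), X_iris, y_iris, cv=5
)
print(f"5-fold accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```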


## Bagging Meta-estimator
Initialize a K-Nearest Neighbors classifier and use it as the base estimator for the Bagging classifier.

In [23]:
# Initialize base classifier and Bagging Meta-estimator
base_estimator = KNeighborsClassifier()
bagging_classifier = BaggingClassifier(base_estimator, n_estimators=50, random_state=42)

# Train the classifier on the training data
bagging_classifier.fit(X_train, y_train)

# Make predictions on the test data
predictions = bagging_classifier.predict(X_test)




### Evaluate the model performance

In [24]:
accuracy = accuracy_score(y_test, predictions)
print(f'Bagging Classifier Model Accuracy: {accuracy * 100:.2f}%')


Bagging Classifier Model Accuracy: 100.00%
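
To see whether bagging actually helps here, one option is to compare a single K-Nearest Neighbors model with the bagged version under cross-validation. This is a sketch under assumptions: it reloads Iris, wraps scaling in a pipeline, and uses 5 folds; the names `single_knn` and `bagged_knn` are illustrative. K-Nearest Neighbors is a fairly stable learner, so the gain from bagging it is often small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Baseline: one KNN model on scaled features
single_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Bagged: 50 KNN models, each fit on its own bootstrap sample
bagged_knn = make_pipeline(
    StandardScaler(),
    BaggingClassifier(KNeighborsClassifier(), n_estimators=50, random_state=42),
)

single_scores = cross_val_score(single_knn, X_iris, y_iris, cv=5)
bagged_scores = cross_val_score(bagged_knn, X_iris, y_iris, cv=5)

print(f"single KNN: {single_scores.mean():.3f}")
print(f"bagged KNN: {bagged_scores.mean():.3f}")
```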


## Pasting
Initialize a Decision Tree classifier and use it as the base estimator for a Bagging classifier with Pasting (without replacement).

In [25]:
base_estimator = DecisionTreeClassifier()
pasting_classifier = BaggingClassifier(base_estimator, n_estimators=50, max_samples=0.7, bootstrap=False, random_state=42)

# Train the classifier on the training data
pasting_classifier.fit(X_train, y_train)

### Evaluate the model performance

In [26]:
predictions = pasting_classifier.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Pasting Classifier Model Accuracy: {accuracy * 100:.2f}%')


Pasting Classifier Model Accuracy: 100.00%
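
The difference between bagging and pasting is only in how rows are drawn. The following sketch, assuming the unscaled Iris data and an illustrative helper named `max_repeats`, inspects the fitted ensemble's `estimators_samples_` attribute to confirm that pasting never repeats a row within one estimator's sample, while bagging does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

def max_repeats(bootstrap):
    """Largest number of times any row appears in one estimator's sample."""
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=10,
        max_samples=0.7,
        bootstrap=bootstrap,
        random_state=42,
    ).fit(X_iris, y_iris)
    # estimators_samples_ holds the row indices drawn for each base estimator
    return max(np.bincount(idx).max() for idx in clf.estimators_samples_)

pasting_repeats = max_repeats(bootstrap=False)  # sampling without replacement
bagging_repeats = max_repeats(bootstrap=True)   # sampling with replacement

print(f"pasting, max repeats of a row: {pasting_repeats}")
print(f"bagging, max repeats of a row: {bagging_repeats}")
```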


## Roughly Balanced Bagging (RBB)
Implement Roughly Balanced Bagging by manually creating balanced bootstrap samples and aggregating predictions from multiple Decision Tree classifiers.

In [29]:
import numpy as np

# Number of base estimators
n_estimators = 100

# Initialize arrays to store the ensemble predictions and models
ensemble_preds = np.zeros((n_estimators, len(X_test)))
ensemble_models = []

# Positional indices of the training rows in each class.
# Iris has three classes, so sample per class rather than positive/negative.
y_train_arr = np.asarray(y_train)
class_indices = [np.where(y_train_arr == c)[0] for c in np.unique(y_train_arr)]
n_per_class = min(len(idx) for idx in class_indices)

for i in range(n_estimators):
    # Create a bootstrap sample, drawing the same number of rows from every class
    balanced_sample_indices = np.concatenate([
        np.random.choice(idx, size=n_per_class, replace=True) for idx in class_indices
    ])
    np.random.shuffle(balanced_sample_indices)

    # X_train is a NumPy array after scaling, so index it directly instead of using .iloc
    X_train_balanced = X_train[balanced_sample_indices]
    y_train_balanced = y_train_arr[balanced_sample_indices]

    # Train a decision tree classifier on the balanced bootstrap sample
    tree_clf = DecisionTreeClassifier(random_state=i)
    tree_clf.fit(X_train_balanced, y_train_balanced)
    ensemble_models.append(tree_clf)


### Evaluate the model performance

In [30]:
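# Make predictions on the test set with every trained tree
for i, tree_clf in enumerate(ensemble_models):
    ensemble_preds[i] = tree_clf.predict(X_test)

# Majority voting across all estimators for the final prediction:
# count the votes per class for each test sample and keep the most frequent label
votes = ensemble_preds.astype(int)
final_preds = np.apply_along_axis(lambda col: np.bincount(col).argmax(), axis=0, arr=votes)

# Evaluate the ensemble's accuracy
accuracy = accuracy_score(y_test, final_preds)
print(f'Roughly Balanced Bagging Model Accuracy: {accuracy * 100:.2f}%')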
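
A note on the voting step: rounding the mean of the predicted labels matches a majority vote only when the labels are binary 0/1. With three classes, averaging can produce a label that few or no estimators voted for. The sketch below uses a small, made-up vote matrix (five hypothetical estimators, three test samples) to contrast the two approaches.

```python
import numpy as np

# Hypothetical votes: rows are estimators, columns are test samples
votes = np.array([
    [0, 2, 1],
    [0, 0, 1],
    [2, 2, 1],
    [2, 2, 0],
    [2, 2, 1],
])

# Rounding the mean treats class labels as if they were numeric quantities
mean_round = np.round(votes.mean(axis=0)).astype(int)

# A true majority vote counts labels per column and keeps the most frequent one
majority = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), 0, votes
)

print("rounded mean: ", mean_round)
print("majority vote:", majority)
```

For the first sample the votes are two for class 0 and three for class 2, yet the rounded mean returns class 1, which no estimator predicted.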