## Bagging Ensemble (Classifier)

A Bagging Ensemble Classifier is a machine learning technique that trains multiple models on different random subsets of the training data (bootstrapping) and combines their predictions (majority voting for classification) to improve accuracy and reduce variance.
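To make the two steps concrete, here is a minimal hand-rolled sketch of bootstrapping followed by majority voting. The dataset size, the number of models, and the names `X_demo`, `y_demo`, `models`, and `votes` are assumptions chosen for illustration. Note that scikit-learn's `BaggingClassifier` may average predicted probabilities instead of counting hard votes when the base estimator provides them, so this is not a reimplementation of that class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic binary dataset (sizes are illustrative assumptions)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
models = []
for _ in range(n_models):
    # Bootstrap: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    models.append(tree)

# Majority vote: collect every model's prediction, keep the most frequent label
votes = np.stack([m.predict(X_demo) for m in models])  # shape (n_models, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)      # valid for labels 0/1 only
print("Ensemble accuracy on the training data:", (majority == y_demo).mean())
```

The accuracy printed here is measured on the training rows, so it only shows that the voting logic works; it is not an estimate of test performance.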

### Importing Libraries

In [None]:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

### Creating a Dataset

In [None]:
X,y = make_classification(n_samples=10000, n_features=10,n_informative=3)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

### Single Decision Tree

In [None]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)

print("Decision Tree accuracy",accuracy_score(y_test,y_pred))

Decision Tree accuracy 0.888


### Bagging using Decision Tree

In [None]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500, # number of base models (decision trees)
    max_samples=0.25, # each model is trained on 25% of the training rows
    bootstrap=True,
    random_state=42
)

In [None]:
bag.fit(X_train,y_train)

In [None]:
y_pred = bag.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

0.92

In [None]:
bag.estimators_samples_[0].shape

(2000,)

In [None]:
bag.estimators_features_[0].shape

(10,)

### Bagging using SVM

In [None]:
bag = BaggingClassifier(
    estimator=SVC(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    random_state=42
)

In [None]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Bagging using SVM",accuracy_score(y_test,y_pred))

Bagging using SVM 0.9035


### Pasting

Pasting is similar to bagging; the key difference is that it draws each subset of the training data without replacement. In other words, it is row sampling without replacement, so no row appears twice within a single subset.
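The difference between the two sampling schemes can be seen directly on row indices. This is a small NumPy sketch, not what `BaggingClassifier` runs internally; the sizes mirror the 8000-row training split and `max_samples=0.25` used in this notebook, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, subset_size = 8000, 2000  # 25% of an 8000-row training set

# Bagging: rows drawn with replacement, so duplicates are possible
bagging_idx = rng.choice(n_rows, size=subset_size, replace=True)
# Pasting: rows drawn without replacement, so every index is unique
pasting_idx = rng.choice(n_rows, size=subset_size, replace=False)

print("unique rows (bagging):", np.unique(bagging_idx).size)
print("unique rows (pasting):", np.unique(pasting_idx).size)
```

With pasting, the number of unique rows always equals the subset size; with bagging it is usually somewhat smaller because some rows are drawn more than once.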

In [None]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=False,
    random_state=42,
    verbose = 1,
    n_jobs=-1
)

In [None]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Pasting classifier",accuracy_score(y_test,y_pred))

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:   14.5s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


Pasting classifier 0.9215


[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.4s finished


### Random Subspaces

Instead of sampling instances (as in bagging and pasting), random subspaces sample features (columns) of the dataset, so each model is trained on a different feature subset. Features can be drawn with or without replacement; sampling without replacement is typical.

In [None]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=False,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42
)

In [None]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Subspaces classifier",accuracy_score(y_test,y_pred))

Random Subspaces classifier 0.9175


### Random Patches

Random Patches is a combination of Bagging (row sampling) and Random Subspaces (feature sampling). In this method, both data instances (rows) and features (columns) are randomly sampled to train different models.
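A "patch" is simply a sub-matrix picked out by a set of sampled rows and a set of sampled columns. The NumPy sketch below illustrates the idea on a toy matrix; the sizes and names are assumptions, and this is not the code path `BaggingClassifier` uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 6))  # toy matrix: 8 rows, 6 features

rows = rng.choice(8, size=4, replace=True)   # row sampling (bagging-style)
cols = rng.choice(6, size=3, replace=False)  # feature sampling (subspace-style)

# np.ix_ builds the row-by-column selection one base model would be trained on
patch = X_demo[np.ix_(rows, cols)]
print(patch.shape)  # (4, 3)
```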

In [None]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42
)

In [None]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Patches classifier",accuracy_score(y_test,y_pred))

Random Patches classifier 0.913


### OOB Score - Out of Bag Score

In bagging ensembles (Random Forest is a well-known example), each decision tree is trained on a bootstrap sample: a random subset of the training data drawn with replacement.

*   When each bootstrap sample is as large as the training set, about 63% of the distinct training samples are used in training each tree. With a smaller `max_samples`, this share is lower.

*   The remaining samples (about 37% in that case), called Out-of-Bag samples, are not used in training that specific tree.

*   These OOB samples act as a validation set, allowing the model to evaluate performance without needing a separate validation dataset.

Each training sample is predicted using only the trees that did not see it during training. The accuracy of these aggregated predictions is the OOB score, which provides an estimate of the model's performance, similar to cross-validation.
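The 63% / 37% split follows from sampling with replacement: the chance that a given row is never picked in n draws from n rows is (1 - 1/n)^n, which approaches 1/e ≈ 0.37. The simulation below checks this empirically and also shows the share for a bootstrap sample of a quarter of the size. The helper `unique_fraction` and the trial count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8000  # size of the training split used in this notebook

def unique_fraction(sample_size, trials=50):
    """Average share of distinct rows in a bootstrap sample of the given size."""
    fractions = [
        np.unique(rng.integers(0, n, size=sample_size)).size / n
        for _ in range(trials)
    ]
    return float(np.mean(fractions))

# A full-size bootstrap sample covers roughly 1 - 1/e of the rows
print("full-size bootstrap:", round(unique_fraction(n), 3))
# A 25% bootstrap sample covers roughly 1 - e**-0.25 of the rows
print("25% bootstrap:      ", round(unique_fraction(n // 4), 3))
```

With `max_samples=0.25`, each tree therefore leaves well over 37% of the rows out of bag, which gives the OOB estimate more predictions per row to aggregate.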

In [None]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    oob_score=True,
    random_state=42
)

In [None]:
bag.fit(X_train, y_train)  # Fit the model

In [None]:
bag.oob_score_

0.924625

In [None]:
y_pred = bag.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))

Accuracy 0.92


### Bagging Tips


*   Bagging generally gives better results than pasting
*   Row sampling fractions (`max_samples`) of around 25% to 50% tend to give good results
*   Random patches and random subspaces are worth using when dealing with high-dimensional data
*   To find good hyperparameter values we can use GridSearchCV or RandomizedSearchCV

### Hyperparameter Tuning Algorithms

1. GridSearchCV (Exhaustive Search)


*   It performs an exhaustive search over all possible combinations of hyperparameter values.
*   It systematically goes through each combination and evaluates the model using cross-validation.
*   It is computationally expensive because it tries all combinations.
*   Best for small search spaces where exhaustive evaluation is feasible.


2. RandomizedSearchCV (Random Search)


*   It randomly selects a subset of hyperparameter combinations from the specified ranges.
*   It does not test all possible combinations, making it faster for large search spaces.
*   Works well when you have many hyperparameters or limited computational resources.
*   It does not guarantee finding the absolute best combination, but often finds a good one.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = {
    'n_estimators': [50,100,500],
    'max_samples': [0.1,0.4,0.7,1.0],
    'bootstrap' : [True,False],
    'max_features' : [0.1,0.4,0.7,1.0]
    }

In [None]:
search = GridSearchCV(BaggingClassifier(), parameters, cv=5)

In [None]:
search.fit(X_train,y_train)

In [None]:
# A notebook cell only displays its last expression, so print both values
print(search.best_params_)
print(search.best_score_)
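To get a sense of how expensive this grid is before running it, the number of combinations can be counted with `ParameterGrid`. The dictionary is repeated here under the hypothetical name `grid` so the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, repeated so the snippet is self-contained
grid = {
    'n_estimators': [50, 100, 500],
    'max_samples': [0.1, 0.4, 0.7, 1.0],
    'bootstrap': [True, False],
    'max_features': [0.1, 0.4, 0.7, 1.0],
}

n_combinations = len(ParameterGrid(grid))  # 3 * 4 * 2 * 4
n_fits = n_combinations * 5                # cv=5 folds per combination
print(n_combinations, n_fits)              # 96 480
```

That is 480 ensemble fits, plus one final refit on the full training set with the default `refit=True`, so passing `n_jobs=-1` to `GridSearchCV` or shrinking the grid may be worthwhile.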