# Using Bagging for an imbalanced classification task

Here we use the breast cancer dataset, a naturally imbalanced problem. We fit a Logistic Regression inside a Bagging Classifier and look at how defining sample weights when fitting the Bagging can improve the classification. The metric is the balanced accuracy, added to scikit-learn in version 0.20 as `balanced_accuracy_score`. Balanced accuracy is the average of the recall obtained on each class.
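
To make the "averaged recall" idea concrete, here is a minimal hand-rolled sketch in plain Python. The function name `balanced_accuracy` and the toy labels are made up for illustration; this is not scikit-learn's implementation, only the same idea written out for a small example.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls (hand-rolled, for illustration only)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Positions of the samples whose true label is c
        idx = [i for i, t in enumerate(y_true) if t == c]
        # How many of those samples were predicted as c
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels (made up): 8 samples of class 1, 2 samples of class 0
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Plain accuracy looks good because the majority class dominates, while the balanced accuracy shows that only half of the minority class is recovered.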

In [0]:
import numpy as np

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

In [0]:
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

In [21]:
print("Size of class 0: %i" % len(y_train[y_train == 0]))
print("Size of class 1: %i" % len(y_train[y_train == 1]))

Size of class 0: 164
Size of class 1: 262


### 1. Without weights

In [0]:
model = BaggingClassifier(LogisticRegression(solver='liblinear'),
                          n_estimators=10,
                          bootstrap=True, random_state=2019)

In [23]:
model.fit(X_train, y_train)

BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False,
         random_state=2019, verbose=0, warm_start=False)

In [24]:
balanced_accuracy_score(y_test, model.predict(X_test))

0.9214912280701755

### 2. With weights

In [0]:
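# Class sizes in the training set
y0 = len(y_train[y_train == 0])
y1 = len(y_train[y_train == 1])

# Up-weight the minority class (0) so that both classes carry the same total weight
w0 = y1 / y0
w1 = 1

# One weight per training sample, chosen by its class
sample_weights = np.zeros(len(y_train))
sample_weights[y_train == 0] = w0
sample_weights[y_train == 1] = w1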
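
As a quick sanity check of this weighting rule, the sketch below rebuilds it on synthetic labels that only reproduce the class sizes printed earlier (164 zeros, 262 ones). The names `y_demo` and `w_demo` are hypothetical and used only here. With this rule, the total weight of each class should come out the same (about 262).

```python
import numpy as np

# Synthetic labels with the same class sizes as the training set above
y_demo = np.array([0] * 164 + [1] * 262)

n0 = np.sum(y_demo == 0)
n1 = np.sum(y_demo == 1)

# Same rule as above: class 0 samples get weight n1 / n0, class 1 samples get 1
w_demo = np.where(y_demo == 0, n1 / n0, 1.0)

# Total weight carried by each class
total_0 = w_demo[y_demo == 0].sum()
total_1 = w_demo[y_demo == 1].sum()
print(total_0, total_1)
```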

In [0]:
model = BaggingClassifier(LogisticRegression(solver='liblinear'),
                          n_estimators=10,
                          bootstrap=True, random_state=2019)

In [27]:
model.fit(X_train, y_train, sample_weight=sample_weights)

BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False,
         random_state=2019, verbose=0, warm_start=False)

In [28]:
balanced_accuracy_score(y_test, model.predict(X_test))

0.9423245614035087

Conclusion: On this train/test split, weighting the samples improved the balanced accuracy on the test set from **0.92** to **0.94**!
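
The comparison above rests on a single split (`random_state = 2019`). The sketch below repeats it over a few other seeds to see how much the two scores move. The helper `compare` and the choice of seeds 0 to 4 are assumptions made for illustration, and the printed numbers will depend on the scikit-learn version installed.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)


def compare(seed):
    """Return (unweighted, weighted) balanced accuracy for one split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    # Same weighting rule as in section 2: class 0 gets n1 / n0, class 1 gets 1
    weights = np.where(y_tr == 0, np.sum(y_tr == 1) / np.sum(y_tr == 0), 1.0)
    scores = []
    for w in (None, weights):
        model = BaggingClassifier(LogisticRegression(solver="liblinear"),
                                  n_estimators=10, bootstrap=True,
                                  random_state=seed)
        model.fit(X_tr, y_tr, sample_weight=w)
        scores.append(balanced_accuracy_score(y_te, model.predict(X_te)))
    return tuple(scores)


results = [compare(seed) for seed in range(5)]  # seeds 0..4 are arbitrary
for seed, (plain, weighted) in enumerate(results):
    print("seed %i: without weights %.3f, with weights %.3f"
          % (seed, plain, weighted))
```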