In this notebook, we apply ensemble learning techniques (bagging and boosting) to the Reddit data.

Both bagging and boosting combine multiple classifiers of the same type. Bagging trains each classifier independently on a random bootstrap sample of the data, so the classifiers can be trained in parallel (which scales well). Boosting trains them sequentially: each round assigns a higher weight to the samples that the previous learner misclassified, so later learners focus on the hard cases and the ensemble improves iteratively.
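
To make the contrast concrete, here is a minimal NumPy sketch of the two ideas: a bootstrap draw as used in bagging, and a single AdaBoost-style reweighting step. It is not the scikit-learn implementation; the helper names `bootstrap_indices` and `reweight`, the toy labels, and the hypothetical predictions are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Bagging: draw n_samples row indices with replacement."""
    return rng.integers(0, n_samples, size=n_samples)

def reweight(weights, y_true, y_pred):
    """One discrete-AdaBoost-style update: upweight misclassified samples.

    Assumes the weighted error is strictly between 0 and 1.
    """
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)   # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)           # vote strength of this learner
    new_w = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_w / new_w.sum(), alpha

rng = np.random.default_rng(42)
idx = bootstrap_indices(8, rng)  # some rows repeat, some are left out

# Toy labels and a hypothetical learner that misses only the last sample.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0])
w0 = np.full(8, 1 / 8)           # boosting starts from uniform weights
w1, alpha = reweight(w0, y_true, y_pred)
print(w1.round(3))
```

After the update, the single misclassified sample carries half of the total weight, so the next learner in the sequence is pushed to get it right.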

In [None]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.ensemble import RUSBoostClassifier
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier

In [None]:
train_data = pd.read_csv("../data/processed/train_data_baseline.csv")
test_data = pd.read_csv("../data/processed/test_data_baseline.csv")

In [None]:
class filter_add_attributes(BaseEstimator, TransformerMixin):
    '''Custom transformer based on Sklearn's classes.
    Takes in dataframe (train or test) and adds new features and returns
    a filtered version of the original train/test datasets.'''
    def fit(self, X, y=None):
        # Nothing is learned from the data, so return self as the sklearn API expects.
        # fit_transform is inherited from TransformerMixin and calls fit, then transform.
        return self
    def transform(self, X, y=None):
        '''Calculates and adds comment body length and account activity (based on frequency of comment author)
        as features. Returns a new dataframe with the added columns.'''
        data = X.copy()
        data["body_len"] = data.comment_body.apply(len)
        # Author frequency is counted within whichever dataframe is passed in (train or test).
        data["acc_activity"] = data.author_ids.map(data.author_ids.value_counts())
        data["is_premium"] = data.is_premium.astype(int)
        return data.filter(items=["ups", "comment_karma", "link_karma", "is_premium", "comment_age_days", "acc_age_days", "body_len", "acc_activity"], axis=1)

In [None]:
pipeline = Pipeline([
        ('filter_add', filter_add_attributes()),
        ('scaler', StandardScaler()),
    ])

X_train = pipeline.fit_transform(train_data)
X_test = pipeline.transform(test_data)
y_train = train_data["gildings"].to_list()
y_test = test_data["gildings"].to_list()

In [None]:
assert len(X_train) == len(y_train)
assert len(X_test) == len(y_test)

In [None]:
def train_eval(clf, X, y):
    '''Takes in the training features and labels along with a model to train and evaluate.
    Returns the per-fold F1 and ROC-AUC scores from repeated stratified cross-validation.'''
    rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=42)
    scores = cross_validate(clf, X, y, cv=rskf, scoring=['f1', 'roc_auc'])
    return scores['test_f1'], scores['test_roc_auc']
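
Because `train_eval` returns one score per fold (20 values with 10 splits and 2 repeats), it can be useful to look at the spread as well as the mean. The helper below is a small sketch of that idea; `summarize_scores` is a hypothetical name and the fold values are made up for illustration, not results from this dataset.

```python
import numpy as np

def summarize_scores(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

fold_f1 = [0.10, 0.20, 0.30]  # made-up values for illustration only
mean, std = summarize_scores(fold_f1)
print(f"F1: {mean:.3f} +/- {std:.3f}")
```

A large standard deviation relative to the mean would suggest that differences between models should be read with caution.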

Imbalanced-learn also provides balanced variants of the regular ensemble models (BalancedBaggingClassifier for BaggingClassifier, BalancedRandomForestClassifier for RandomForestClassifier, etc.). First, we will run both versions with default hyperparameters to serve as baseline models.

BalancedBaggingClassifier uses RandomUnderSampler to undersample each bootstrap sample before fitting the base estimator on it. We use the sampling strategy 'auto', which undersamples only the majority class (class 0 in our case) until it matches the minority class. BalancedRandomForestClassifier does the same for a random forest, giving each tree a balanced bootstrap sample.
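
As a rough illustration of what random undersampling does to the class distribution, the sketch below drops majority-class rows at random until every class matches the smallest one. It is a simplified stand-in written in plain NumPy, not imbalanced-learn's RandomUnderSampler code; `undersample_majority` is a hypothetical helper and the 95:5 toy data is an assumption, not the Reddit data.

```python
import numpy as np

def undersample_majority(X, y, rng):
    """Randomly keep n_min rows per class, where n_min is the smallest class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

rng = np.random.default_rng(42)
y_toy = np.array([0] * 95 + [1] * 5)   # toy 95:5 class imbalance
X_toy = rng.normal(size=(100, 3))
X_bal, y_bal = undersample_majority(X_toy, y_toy, rng)
print(np.bincount(y_bal))  # [5 5]
```

In the balanced ensembles, a step like this is applied separately to the sample seen by each base learner, so different learners see different subsets of the majority class.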

Read more: https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html

In [None]:
# Note: scikit-learn 1.2+ and recent imbalanced-learn releases rename `base_estimator` to `estimator`.
bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)
balanced_bagging = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10,
                                             sampling_strategy='auto',
                                             replacement=False, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)

models = [bagging, balanced_bagging, rf, brf]
model_names = ["Bagging", "Balanced Bagging", "Random Forests", "Balanced Random Forests"]

for name, model in zip(model_names, models):
    f1, roc = train_eval(model, X_train, y_train)
    print(f"{name}:")
    print(f"F1 score is: {np.mean(f1)}")
    print(f"ROC-AUC is: {np.mean(roc)}")
    print("\n")

Imbalanced-learn also offers two variations of AdaBoost. We will also run Sklearn's AdaBoost for comparison.

RUSBoostClassifier integrates random undersampling into each AdaBoost round, while EasyEnsembleClassifier is a bag of AdaBoost learners, each trained on a balanced (undersampled) bootstrap sample.

In [None]:
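# Note: scikit-learn 1.4 deprecated algorithm='SAMME.R'; on recent versions, omit the `algorithm` argument.
adaboost = AdaBoostClassifier(n_estimators=200, algorithm='SAMME.R', random_state=42)
rusboost = RUSBoostClassifier(n_estimators=200, algorithm='SAMME.R', random_state=42)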
eec = EasyEnsembleClassifier(n_estimators=200, random_state=42)

models = [adaboost, rusboost, eec]
model_names = ["AdaBoost", "RusBoost", "Easy Ensemble Boost"]

for name, model in zip(model_names, models):
    f1, roc = train_eval(model, X_train, y_train)
    print(f"{name}:")
    print(f"F1 score is: {np.mean(f1)}")
    print(f"ROC-AUC is: {np.mean(roc)}")
    print("\n")