<br> </br>
<font size = 8> <center> AI Strategy and Digital transformation </center> </font>
<font size = 6> <center>  <b> 6. Data rebalancing methods </b> </center>
<br>
<font size = 5> <center> Piotr Wójcik </center> </font>
<font size = 5> <center> University of Warsaw, Poland
<font size = 5> <center> pwojcik@wne.uw.edu.pl
<br> </br>
<font size = 5> <center>  January 2025 </center> </font>
</center> </font>

In [None]:
# change working directory
from google.colab import drive
drive.mount('/content/drive')

%cd '/content/drive/My Drive/szkolenia/2025-01_Bucharest'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/szkolenia/2025-01_Bucharest


In [None]:
# let's import all the needed packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score, recall_score, precision_score, f1_score, get_scorer, make_scorer # to define own metrics
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RepeatedKFold, GridSearchCV, cross_validate
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# importing freqtable() function defined by the lecturer in PW_functions.py
from PW_functions import freqtable

In [None]:
# let's load the imputed churn data stored before
with open('data/churn_prepared.pkl', 'rb') as f:  # 'rb' stands for reading in binary mode
    churn_train_encoded = pickle.load(f)
    churn_test_encoded = pickle.load(f)

churn_test_encoded.head()

Unnamed: 0,customer_id,customer_age,customer_number_of_dependents,customer_relationship_length,customer_available_credit_limit,total_products,period_inactive,contacts_in_last_year,credit_card_debt_balance,remaining_credit_limit,...,customer_salary_range_60-80K,customer_salary_range_80-120K,customer_salary_range_Unknown,customer_salary_range_below 40K,credit_card_classification_Blue,credit_card_classification_Gold,credit_card_classification_Platinum,credit_card_classification_Silver,account_status_closed,account_status_open
2089,122823,-1.179263,0.490392,-1.875667,-0.460621,-0.047547,-0.341229,-0.402278,-0.719363,-0.691007,...,0,0,0,1,1,0,0,0,1,0
8911,674482,0.824459,-0.276069,1.382108,-0.146121,-0.363695,-1.324011,-1.30632,0.159612,-0.143218,...,0,0,1,0,1,0,0,0,0,1
8411,529000,0.824459,0.490392,0.880912,-0.094396,0.268601,0.641554,0.501764,0.409696,-0.062615,...,1,0,0,0,1,0,0,0,0,1
7311,344732,-0.427867,0.490392,-0.372079,-0.429807,-0.363695,0.641554,-1.30632,0.801986,-0.766427,...,0,0,1,0,1,0,0,0,0,1
9211,957784,0.699226,-1.04253,1.13151,0.098896,-0.995992,2.607119,-1.30632,1.658894,0.210287,...,1,0,0,0,1,0,0,0,0,1


# Application of sample rebalancing methods

Let's check the effect of balancing the sample.

**CAUTION!** Rebalancing techniques should be applied **ONLY** to the **TRAINING DATA SET**!

As with data transformations, it makes no sense to rebalance (or transform) the full data first and only then split it into training and testing sets: this is **information leakage**.

It makes no sense to create instances based on the current minority class and then set one of them aside for validation, pretending it was not generated from data that is still in the training set.
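
The leakage described above can be made concrete with a small sketch. It uses pure Python and toy data only; the row ids, the `split()` helper and the 20% validation share are illustrative assumptions and are not part of this notebook. If the minority class is oversampled by duplication before the split, copies of the same original row land on both sides of the split.

```python
import random

rng = random.Random(123)

# Toy data set: row ids 0-99; rows 0-9 form the minority class (label 1)
rows = [(i, 1 if i < 10 else 0) for i in range(100)]
minority = [r for r in rows if r[1] == 1]
majority = [r for r in rows if r[1] == 0]

def split(data, rng, valid_share=0.2):
    """Shuffle a copy of the data and split it into train / validation parts."""
    data = data[:]
    rng.shuffle(data)
    n_valid = int(len(data) * valid_share)
    return data[n_valid:], data[:n_valid]

# WRONG order: oversample first (duplicate minority rows), then split
copies = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
train_bad, valid_bad = split(rows + copies, rng)
train_bad_ids = {i for i, _ in train_bad}
leaked = [i for i, label in valid_bad if label == 1 and i in train_bad_ids]

# RIGHT order: split first, then oversample the training part only
train_ok, valid_ok = split(rows, rng)
train_min = [r for r in train_ok if r[1] == 1]
train_maj = [r for r in train_ok if r[1] == 0]
train_ok_bal = train_ok + [rng.choice(train_min)
                           for _ in range(len(train_maj) - len(train_min))]
overlap_ok = {i for i, _ in train_ok_bal} & {i for i, _ in valid_ok}

print("validation minority rows with a copy in training (wrong order):", len(leaked))
print("rows shared by training and validation (right order):", len(overlap_ok))
```

With the wrong order the model is validated on rows it has effectively already seen, so the validation scores for the minority class look better than they would on genuinely new data.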

In [None]:
# let's recall the frequencies of the outcome variable in the training sample

freqtable(churn_train_encoded['account_status_closed'])

# 16% of clients churned

Unnamed: 0_level_0,count,percent
account_status_closed,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5949,83.930587
1,1139,16.069413


In [None]:
# let's check how random down-sampling works

# Split the data into features (X) and target (y)
# IMPORTANT! remember that customer ID is not a sensible predictor!

churn_train_X = churn_train_encoded.drop(['account_status_closed', 'account_status_open', 'customer_id'], axis = 1)
churn_train_y = churn_train_encoded['account_status_closed']

# Create the undersampler
sampling_ = RandomUnderSampler(random_state = 123)

# Fit and resample
X_down, y_down = sampling_.fit_resample(churn_train_X, churn_train_y)

# Combine X_down and y_down back into a single DataFrame
churn_train_down = pd.concat([X_down, y_down], axis = 1)

# Check distribution of classes after downsampling
freqtable(churn_train_down['account_status_closed'])

Unnamed: 0_level_0,count,percent
account_status_closed,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1139,50.0
1,1139,50.0


In [None]:
# let's check how random up-sampling works - same syntax as above

# Create the oversampler
sampling_2 = RandomOverSampler(random_state = 123)

# Fit and resample
X_up, y_up = sampling_2.fit_resample(churn_train_X, churn_train_y)

# Combine X_up and y_up back into a single DataFrame
churn_train_up = pd.concat([X_up, y_up], axis = 1)

# Check distribution of classes after up-sampling
freqtable(churn_train_up['account_status_closed'])


Unnamed: 0_level_0,count,percent
account_status_closed,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5949,50.0
1,5949,50.0
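
What the two random samplers do to the row indices can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions (a toy label vector of 84 zeros and 16 ones, hypothetical variable names); it is not the imbalanced-learn implementation. Under-sampling keeps a random subset of the majority class drawn without replacement, while over-sampling adds minority rows drawn with replacement.

```python
import random
from collections import Counter

rng = random.Random(123)

# Toy labels with roughly the same imbalance as the churn data (84% vs 16%)
labels = [0] * 84 + [1] * 16
idx_major = [i for i, y in enumerate(labels) if y == 0]
idx_minor = [i for i, y in enumerate(labels) if y == 1]

# Random under-sampling: keep as many majority rows as there are minority rows
down = rng.sample(idx_major, k=len(idx_minor)) + idx_minor

# Random over-sampling: add duplicated minority rows until the classes match
extra = rng.choices(idx_minor, k=len(idx_major) - len(idx_minor))
up = idx_major + idx_minor + extra

counts_down = Counter(labels[i] for i in down)
counts_up = Counter(labels[i] for i in up)
print("after under-sampling:", dict(counts_down))
print("after over-sampling: ", dict(counts_up))
```

Under-sampling throws information away (here 68 of 84 majority rows), whereas over-sampling keeps all rows but repeats minority observations, which is why the two balanced training sets above differ so much in size (2278 vs 11898 rows).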


In [None]:
# one of the most common non-standard up-sampling methods is SMOTE

# Create the oversampler
sampling_3 = SMOTE(random_state = 123) # optional argument: k_neighbors = 5 by default

# Fit and resample
X_sm, y_sm = sampling_3.fit_resample(churn_train_X, churn_train_y)

# Combine X_sm and y_sm back into a single DataFrame
churn_train_SMOTE = pd.concat([X_sm, y_sm], axis = 1)

# Check distribution of classes after SMOTE
freqtable(churn_train_SMOTE['account_status_closed'])


Unnamed: 0_level_0,count,percent
account_status_closed,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5949,50.0
1,5949,50.0
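
Unlike random up-sampling, SMOTE does not copy rows: each synthetic observation lies on the line segment between a minority observation and one of its nearest minority neighbours. The sketch below shows only this interpolation step in pure Python; the toy points, the helper names and `k = 2` are assumptions made for illustration, not the imbalanced-learn code.

```python
import random

rng = random.Random(123)

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_neighbours(x, others, k):
    """The k points closest to x, excluding x itself."""
    candidates = [o for o in others if o is not x]
    return sorted(candidates, key=lambda o: sq_dist(x, o))[:k]

def smote_point(x, neighbour, rng):
    """Synthetic point on the segment between x and one of its neighbours."""
    u = rng.random()  # interpolation weight in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbour)]

# Hypothetical minority-class observations with two features
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]

synthetic = []
for _ in range(4):
    x = rng.choice(minority)
    neighbour = rng.choice(nearest_neighbours(x, minority, k=2))
    synthetic.append(smote_point(x, neighbour, rng))

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because the interpolation relies on distances, the result depends on the scale of the features; the churn predictors used here were already standardized in the data preparation step.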


In [None]:
# Tomek Links remove examples that are “tomek links”—pairs of samples from different classes that are very close together in feature space.
# They’re generally considered noise or borderline points.

# Down-sampling is achieved by removing the majority-class samples in each tomek link pair, thus cleaning up the decision boundary.

# Create the undersampler
sampling_4 = TomekLinks()

# Fit and resample
X_tl, y_tl = sampling_4.fit_resample(churn_train_X, churn_train_y)

# Combine X_tl and y_tl back into a single DataFrame
churn_train_TL = pd.concat([X_tl, y_tl], axis = 1)

# Check distribution of classes after downsampling
freqtable(churn_train_TL['account_status_closed'])

Unnamed: 0_level_0,count,percent
account_status_closed,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5780,83.538084
1,1139,16.461916
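
In the output above only 5949 - 5780 = 169 majority rows were removed, so Tomek Links clean the class boundary rather than balance the sample. A Tomek link is a pair of observations from different classes that are each other's nearest neighbour. A minimal pure-Python sketch of this definition follows; the one-feature toy data and function names are illustrative assumptions.

```python
def nearest(i, points):
    """Index of the point closest to points[i] (squared Euclidean distance)."""
    def dist(j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    return min((j for j in range(len(points)) if j != i), key=dist)

def tomek_links(points, labels):
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    nn = [nearest(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Hypothetical data with a single feature; label 1 is the minority class
points = [[0.0], [1.0], [2.0], [2.2], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]

links = tomek_links(points, labels)
print("Tomek links:", links)

# Under-sampling step: drop the majority-class member of every link
to_drop = {i if labels[i] == 0 else j for i, j in links}
kept = [k for k in range(len(points)) if k not in to_drop]
print("rows kept:", kept)
```

Only the pair at 2.0 and 2.2 forms a link, so just one majority observation is dropped and the minority class is left untouched.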


In [None]:
# In some cases, you want to oversample the minority class and clean up the majority class simultaneously

# Combination of SMOTE and TomekLinks (SMOTETomek) - applies SMOTE first, then uses Tomek Links to remove borderline majority samples

# Create the combined over- and under-sampler
sampling_5 = SMOTETomek(random_state = 123)

# Fit and resample
X_stl, y_stl = sampling_5.fit_resample(churn_train_X, churn_train_y)

# Combine X_stl and y_stl back into a single DataFrame
churn_train_STL = pd.concat([X_stl, y_stl], axis = 1)

# Check distribution of classes after SMOTETomek
freqtable(churn_train_STL['account_status_closed'])


Unnamed: 0_level_0,count,percent
account_status_closed,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5948,50.0
1,5948,50.0


ADASYN (adaptive synthetic sampling) is an approach for learning from imbalanced data sets. Its essential idea is to use a weighted distribution over the minority class examples according to how difficult they are to learn: more synthetic data is generated for minority examples that are harder to learn than for those that are easier to learn. As a result, ADASYN improves learning in two ways: (1) it reduces the bias introduced by the class imbalance, and (2) it adaptively shifts the classification decision boundary toward the difficult examples.

https://ieeexplore.ieee.org/document/4633969
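
The weighting idea can be sketched as follows: "difficulty" of a minority observation is measured by the share of majority-class observations among its k nearest neighbours, and the synthetic points are allocated in proportion to that share. The code is a simplified pure-Python illustration; the function name `adasyn_counts`, the toy data, `k = 2` and `total_new = 6` are assumptions, and the actual generation of points (interpolation, as in SMOTE) is left out.

```python
def adasyn_counts(points, labels, k, total_new, minority_label=1):
    """Number of synthetic points to generate around each minority observation."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    ratios = {}
    for i, (x, y) in enumerate(zip(points, labels)):
        if y != minority_label:
            continue
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: sq_dist(x, points[j]))[:k]
        # share of majority-class observations among the k nearest neighbours
        ratios[i] = sum(labels[j] != minority_label for j in neighbours) / k

    total = sum(ratios.values())
    if total == 0:  # no minority observation has majority neighbours
        return {i: 0 for i in ratios}
    # normalise the ratios into weights and allocate the synthetic points
    return {i: round(total_new * r / total) for i, r in ratios.items()}

# Hypothetical data with a single feature; label 1 is the minority class
points = [[0.0], [0.5], [1.0], [1.5], [3.0], [3.2], [3.4], [3.5]]
labels = [0, 0, 1, 0, 1, 1, 1, 0]

counts = adasyn_counts(points, labels, k=2, total_new=6)
print(counts)
```

The isolated minority point at 1.0, surrounded by majority neighbours, receives most of the new points, while minority points deep inside their own cluster receive none. Because the per-observation counts are rounded, the classes do not have to end up exactly equal in size.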

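ROSE has no dedicated class in Python's imbalanced-learn (it is available in R); the closest option is the `shrinkage` argument of `RandomOverSampler`, which generates a smoothed bootstrap instead of exact copies.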

In [None]:
# let's also consider ADASYN

# Create the oversampler
sampling_6 = ADASYN(random_state = 123)

# Fit and resample
X_ada, y_ada = sampling_6.fit_resample(churn_train_X, churn_train_y)

# Combine X_ada and y_ada back into a single DataFrame
churn_train_ADA = pd.concat([X_ada, y_ada], axis = 1)

# Check distribution of classes after ADASYN (classes need not be exactly equal)
freqtable(churn_train_ADA['account_status_closed'])

Unnamed: 0_level_0,count,percent
account_status_closed,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5949,50.334208
1,5870,49.665792


## Applications of various rebalancing techniques within cross-validation of logistic regression

In [None]:
# let's check how various rebalancing methods influence model performance on a NEW DATASET

# to avoid looking into the test data we need to apply cross-validation

# let's start with a benchmark - logistic regression with no rebalancing
# and assess its performance on new data with the use of 5-fold cross-validation
# using a wide range of metrics (also level-specific)

# Defining custom metrics for individual levels
def recall_for_class_0(y_true, y_pred):
    return recall_score(y_true, y_pred, pos_label = 0)

def recall_for_class_1(y_true, y_pred):
    return recall_score(y_true, y_pred, pos_label = 1)

def precision_for_class_0(y_true, y_pred):
    return precision_score(y_true, y_pred, pos_label = 0)

def precision_for_class_1(y_true, y_pred):
    return precision_score(y_true, y_pred, pos_label = 1)

def f1_for_class_0(y_true, y_pred):
    return f1_score(y_true, y_pred, pos_label = 0)

def f1_for_class_1(y_true, y_pred):
    return f1_score(y_true, y_pred, pos_label = 1)

# we define a list of scoring metrics
scoring_full = {
    'accuracy': 'accuracy',
     # Add custom scorers for each class
    'recall_class_0': make_scorer(recall_for_class_0),
    'recall_class_1': make_scorer(recall_for_class_1),
    'balanced_accuracy': 'balanced_accuracy', # or alternatively 'recall_macro'
    'precision_class_0': make_scorer(precision_for_class_0),
    'precision_class_1': make_scorer(precision_for_class_1),
    'f1_class_0': make_scorer(f1_for_class_0),
    'f1_class_1': make_scorer(f1_for_class_1),
    'roc_auc': 'roc_auc'}

# then define 5-fold CV
cv5 = KFold(n_splits = 5, shuffle = True, random_state = 123)

# And apply it to our model - we need a different function than before, cross_validate(),
# as cross_val_score() only allows for a single metric

performance_benchmark = cross_validate(
    estimator = LogisticRegression(),
    X = churn_train_X,
    y = churn_train_y,
    cv = cv5,
    scoring = scoring_full,
    n_jobs = -1,
    return_train_score = False)  # set True if you also want train scores

# performance_benchmark is a dictionary; let's convert it to a DataFrame
performance_benchmark_df = pd.DataFrame(performance_benchmark)
print(performance_benchmark_df)

# Mean of each column
print("Average over all folds")
print(performance_benchmark_df.mean())

# not very good at predicting churning clients (recall for class 1 is about 0.5)


   fit_time  score_time  test_accuracy  test_recall_class_0  \
0  0.056413    0.053269       0.898449             0.975256   
1  0.054233    0.045462       0.889281             0.966611   
2  0.044996    0.067604       0.890691             0.969038   
3  0.041449    0.062643       0.892025             0.960996   
4  0.056542    0.034445       0.890614             0.972010   

   test_recall_class_1  test_balanced_accuracy  test_precision_class_0  \
0             0.532520                0.753888                0.908585   
1             0.468182                0.717396                0.908235   
2             0.470852                0.719945                0.907524   
3             0.500000                0.730498                0.916139   
4             0.487395                0.729703                0.903785   

   test_precision_class_1  test_f1_class_0  test_f1_class_1  test_roc_auc  
0                0.818750         0.940741         0.645320      0.903188  
1                0.72028

In [None]:
# let's check how various rebalancing methods influence model performance on a NEW DATASET compared to our benchmark

# to avoid looking into the test data we need to apply cross-validation

# IMPORTANT! As rebalancing is applied to TRAIN data only, it has to be done INDEPENDENTLY in each step of cross-validation!

# Therefore we need to put it into the pipeline

# let's try all the methods presented above applied to a logistic regression:
# - rebalancing methods: none, down, up, SMOTE, ADASYN, TomekLinks, SMOTETomek

# -------------------------------------------------------------------------
# Define the pipeline:
#    step1: sampler (to handle imbalance)
#    step2: logistic regression classifier

pipe = Pipeline([
    ('sampler', RandomUnderSampler()),   # placeholder step (will be replaced in param grid)
    ('model', LogisticRegression())
])

# -------------------------------------------------------------------------
# Define the parameter grid for GridSearchCV.
# We'll vary the sampler: undersampling, oversampling, SMOTE, ADASYN, TomekLinks, SMOTETomek

param_grid = {
    # 'passthrough' turns the 'sampler' step into a no-op, i.e. skips sampling
    'sampler': [
        'passthrough',  # no rebalancing
        RandomUnderSampler(random_state = 123),
        RandomOverSampler(random_state = 123),
        SMOTE(random_state = 123),
        ADASYN(random_state = 123),
        TomekLinks(),
        SMOTETomek(random_state = 123)
    ]}

# -------------------------------------------------------------------------
# Set up GridSearchCV, which also allows for multiple metrics,
# but one has to be indicated in the refit= argument
# (based on which the best model is selected)

performance_rebalancing = GridSearchCV(
    estimator = pipe,
    param_grid = param_grid,
    scoring = scoring_full,
    refit = 'balanced_accuracy',
    cv = cv5, # CV5 - defined before
    n_jobs = -1,        # parallel
    verbose = 1         # show progress
   )

# -------------------------------------------------------------------------
# Fit on the data
performance_rebalancing.fit(churn_train_X, churn_train_y)


Fitting 5 folds for each of 7 candidates, totalling 35 fits


In [None]:
# best parameters found for the refit metric (balanced_accuracy)
print("Best Params:", performance_rebalancing.best_params_)
print("Best Score:", performance_rebalancing.best_score_)

# Collect and inspect results
performance_rebalancing_df = pd.DataFrame(performance_rebalancing.cv_results_)

cols_to_keep = [
    col for col in performance_rebalancing_df.columns
    if col.startswith("mean_test_")
]

# Concatenate 'params' with the columns to keep
subset_cols = ['params'] + cols_to_keep

# Sort by mean balanced accuracy (best first)
result_df = performance_rebalancing_df[subset_cols].sort_values(by = 'mean_test_balanced_accuracy',
                                                                ascending = False)

result_df.head(7)

# each of the rebalancing methods improves balanced accuracy
# the two best approaches, over- and under-sampling, give almost equal recall for both classes

Best Params: {'sampler': RandomOverSampler(random_state=123)}
Best Score: 0.8032849381479481


Unnamed: 0,params,mean_test_accuracy,mean_test_recall_class_0,mean_test_recall_class_1,mean_test_balanced_accuracy,mean_test_precision_class_0,mean_test_precision_class_1,mean_test_f1_class_0,mean_test_f1_class_1,mean_test_roc_auc
2,{'sampler': RandomOverSampler(random_state=123)},0.798536,0.796167,0.810402,0.803285,0.956576,0.432641,0.868972,0.563695,0.882185
1,{'sampler': RandomUnderSampler(random_state=123)},0.795997,0.794072,0.805151,0.799612,0.95534,0.4283,0.867211,0.558886,0.880187
3,{'sampler': SMOTE(random_state=123)},0.883747,0.937877,0.601001,0.769439,0.924756,0.649689,0.931237,0.623748,0.881763
6,{'sampler': SMOTETomek(random_state=123)},0.883465,0.937708,0.600188,0.768948,0.92459,0.648769,0.931069,0.622884,0.881824
4,{'sampler': ADASYN(random_state=123)},0.881773,0.937517,0.590338,0.763928,0.922884,0.643776,0.930124,0.615543,0.880327
5,{'sampler': TomekLinks()},0.892071,0.964744,0.51211,0.738427,0.911823,0.734434,0.937521,0.603193,0.883855
0,{'sampler': 'passthrough'},0.892212,0.968782,0.49179,0.730286,0.908854,0.74996,0.937843,0.593692,0.88409
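
As a quick sanity check of the table above: balanced accuracy is the average of the recalls of the two classes, so it can be recomputed from the two recall columns. The sketch below copies a few rows of the table by hand (the dictionary names are just illustrative).

```python
# Mean CV recalls (class 0, class 1) copied from the results table above
cv_recalls = {
    "RandomOverSampler": (0.796167, 0.810402),
    "RandomUnderSampler": (0.794072, 0.805151),
    "SMOTE": (0.937877, 0.601001),
    "passthrough": (0.968782, 0.491790),
}

# Mean CV balanced accuracy reported in the same table
reported_ba = {
    "RandomOverSampler": 0.803285,
    "RandomUnderSampler": 0.799612,
    "SMOTE": 0.769439,
    "passthrough": 0.730286,
}

recomputed_ba = {}
for name, (recall_0, recall_1) in cv_recalls.items():
    recomputed_ba[name] = (recall_0 + recall_1) / 2
    print(f"{name:20s} recomputed: {recomputed_ba[name]:.6f}  "
          f"reported: {reported_ba[name]:.6f}")
```

The gain in balanced accuracy from random over- and under-sampling comes entirely from higher recall for churners (about 0.81 instead of 0.49), paid for with lower recall for class 0 and a much lower precision for class 1 (about 0.43 instead of 0.75).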


In [None]:
# let's check their performance on the test sample to see if the assessment based on CV is reliable

# at the beginning we stored the differently resampled training data as separate objects

# now let's put them into a list of tuples.
resampled_datasets = [
    ("benchmark", churn_train_X, churn_train_y), # no rebalancing
    ("Undersampled", X_down, y_down),
    ("Oversampled",   X_up,   y_up),
    ("SMOTE", X_sm, y_sm),
    ("ADASYN", X_ada, y_ada),
    ("Tomek Links", X_tl, y_tl),
    ("SMOTETomek", X_stl, y_stl)
]

# Split the test data into features (X) and target (y)
churn_test_X = churn_test_encoded.drop(['account_status_closed', 'account_status_open', 'customer_id'], axis = 1)
churn_test_y = churn_test_encoded['account_status_closed']

def evaluate_on_test(estimator, X_test, y_test, scoring_dict):
    """Score a fitted estimator on the test set with every metric in scoring_dict."""
    results = {}

    for metric_name, scorer in scoring_dict.items():
        # get_scorer() turns a metric name (string) into a scorer object
        # and returns scorers created with make_scorer() unchanged
        sc = get_scorer(scorer)
        # the scorer itself calls predict(), predict_proba() or decision_function(),
        # whichever the metric requires
        results[metric_name] = sc(estimator, X_test, y_test)
    return results

# We'll store predictions and evaluation metrics
results_list = []

for set_name, X_res, y_res in resampled_datasets:
    # Train a logistic regression on (X_res, y_res)
    model = LogisticRegression()
    model.fit(X_res, y_res)

    # Evaluate using the function above
    metrics_dict = evaluate_on_test(model, churn_test_X, churn_test_y, scoring_full)

    # Store in a list (or DataFrame)
    row = {'Dataset': set_name}
    row.update(metrics_dict)
    results_list.append(row)

results_df = pd.DataFrame(results_list).sort_values(by = 'balanced_accuracy',
                                                    ascending = False)

results_df.head(7)

# the ranking is almost the same (only SMOTE and SMOTETomek swap places)
# the values of BA were well predicted within CV: test values are within about 0.02 of the CV means

Unnamed: 0,Dataset,accuracy,recall_class_0,recall_class_1,balanced_accuracy,precision_class_0,precision_class_1,f1_class_0,f1_class_1,roc_auc
2,Oversampled,0.800921,0.793806,0.838115,0.815961,0.962452,0.437433,0.870032,0.574842,0.887342
1,Undersampled,0.792037,0.783614,0.836066,0.80984,0.96152,0.425,0.863499,0.563536,0.884123
6,SMOTETomek,0.886147,0.934927,0.631148,0.783038,0.929825,0.649789,0.932369,0.640333,0.886269
3,SMOTE,0.885818,0.934927,0.629098,0.782013,0.929462,0.649049,0.932187,0.638918,0.885975
4,ADASYN,0.884831,0.936103,0.616803,0.776453,0.927379,0.648707,0.931721,0.632353,0.885809
5,Tomek Links,0.894373,0.960016,0.55123,0.755623,0.917916,0.725067,0.938494,0.62631,0.889322
0,benchmark,0.894044,0.965112,0.522541,0.743826,0.913544,0.741279,0.93862,0.612981,0.889447
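
How closely the two rankings agree can be summarised with Spearman's rank correlation. The sketch below copies the balanced accuracy values by hand from the CV table and from the test table above; the CV row for `'passthrough'` is matched with the `benchmark` row, and the helper `ranks()` is an illustrative assumption rather than part of the notebook.

```python
# Mean balanced accuracy from cross-validation (copied from the CV table)
cv_ba = {
    "Oversampled": 0.803285, "Undersampled": 0.799612, "SMOTE": 0.769439,
    "SMOTETomek": 0.768948, "ADASYN": 0.763928, "Tomek Links": 0.738427,
    "benchmark": 0.730286,
}

# Balanced accuracy on the test sample (copied from the table above)
test_ba = {
    "Oversampled": 0.815961, "Undersampled": 0.809840, "SMOTE": 0.782013,
    "SMOTETomek": 0.783038, "ADASYN": 0.776453, "Tomek Links": 0.755623,
    "benchmark": 0.743826,
}

def ranks(scores):
    """Rank the methods from best (1) to worst by their score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

rank_cv, rank_test = ranks(cv_ba), ranks(test_ba)

# Spearman's rho for rankings without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
n = len(cv_ba)
d2 = sum((rank_cv[name] - rank_test[name]) ** 2 for name in cv_ba)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))

print("sum of squared rank differences:", d2)
print("Spearman rho:", round(rho, 4))
```

A value close to 1 confirms that the ordering of methods obtained in cross-validation carries over to the test sample.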


# Exercises 6

## Exercise 6.1

Extend the pipeline for the churn data to include regularization (ridge, lasso or elastic net). Remember to scale the data! Compare z-score (StandardScaler) and range (MinMaxScaler) scaling.
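
As a reminder of what the two scalers compute, here is a pure-Python sketch on a single hypothetical feature. The values and the function names `z_score()` and `min_max()` are assumptions made for illustration; in the exercise itself the scikit-learn classes should be used inside the pipeline.

```python
def z_score(values):
    """(x - mean) / std, with the population standard deviation (as in StandardScaler)."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def min_max(values):
    """(x - min) / (max - min), which maps the values onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature with one very large value
credit_limit = [1000.0, 2000.0, 3000.0, 4000.0, 50000.0]

z_values = z_score(credit_limit)
range_values = min_max(credit_limit)

print("z-score:", [round(v, 3) for v in z_values])
print("min-max:", [round(v, 3) for v in range_values])
```

With range scaling a single extreme value squeezes all other observations into the bottom of the [0, 1] interval, which is one reason why the two scalers can lead to different regularized models.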


In [None]:
# place for solution

# -------------------------------------------------------------------------
# Define the pipeline:
#    step1: sampler (to handle imbalance)
#    step2: scaler
#    step3: regularized logistic regression classifier

pipe = Pipeline([
    ('sampler', RandomUnderSampler()),   # placeholder step (will be replaced in param grid)
    ('scaler', StandardScaler()),        # placeholder step (will be replaced in param grid)
    ('model', LogisticRegression(solver = 'saga',  # saga supports l1, l2, elasticnet
                                 tol = 0.01, # looser tolerance than the default (1e-4) to make optimization faster
                                 max_iter = 10000,  # increase if data is large/complex
                                 random_state = 123))
])

# -------------------------------------------------------------------------
# Define the parameter grid for GridSearchCV.
#    We'll vary:
#       - Sampler: none, undersampling, oversampling, SMOTE, ADASYN, TomekLinks, SMOTETomek
#       - Scaler: none, StandardScaler, MinMaxScaler
#       - Logistic penalty: ridge, elastic net, lasso - all expressed through l1_ratio

param_grid = {
    # 'passthrough' skips the given pipeline step
    'sampler': [
        'passthrough',                            # no rebalancing
        RandomUnderSampler(random_state = 123),
        RandomOverSampler(random_state = 123),
        SMOTE(random_state = 123),
        ADASYN(random_state = 123),
        TomekLinks(sampling_strategy = 'majority'),
        SMOTETomek(random_state = 123)
    ],
    'scaler': [
        'passthrough',                            # no scaling
        StandardScaler(),
        MinMaxScaler()
    ],
    # l1_ratio is ignored unless penalty = 'elasticnet', so crossing it with
    # 'l1' and 'l2' would only refit identical models; with elastic net
    # l1_ratio = 0 is equivalent to ridge (l2) and l1_ratio = 1 to lasso (l1)
    'model__penalty': ['elasticnet'],
    'model__l1_ratio': [0, 0.5, 1],
    # We already set solver='saga' in the pipeline (supports l1 / elasticnet)
}

# -------------------------------------------------------------------------
# Define the cross-validation approach

cv = KFold(n_splits = 5, shuffle = True, random_state = 123)

# -------------------------------------------------------------------------
# Define multiple metrics
#    We'll collect: accuracy, precision, recall, f1, roc_auc
#    GridSearchCV will 'refit' using 'roc_auc' (see the refit argument below)

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}

# -------------------------------------------------------------------------
# Set up GridSearchCV

grid = GridSearchCV(
    estimator = pipe,
    param_grid = param_grid,
    scoring = scoring,
    refit = 'roc_auc',       # metric used to select the best pipeline, which is then refitted on all training data
    cv = cv,
    n_jobs = -1,        # parallel
    verbose = 1         # show progress
)

# -------------------------------------------------------------------------
# Fit on the data

grid.fit(churn_train_X, churn_train_y)

# -------------------------------------------------------------------------
# Collect and inspect results

results_df = pd.DataFrame(grid.cv_results_)

# best parameters found for the refit metric (ROC AUC)
print("Best Params (ROC AUC):", grid.best_params_)
print("Best Score (ROC AUC):", grid.best_score_)

# You can also check the rank and mean test scores for all metrics
# (parameter columns are named param_<step>__<parameter>; the classifier step is 'model'):
desired_cols = [
    'param_sampler', 'param_scaler', 'param_model__penalty',
    'param_model__l1_ratio', 'mean_test_accuracy', 'mean_test_precision',
    'mean_test_recall', 'mean_test_f1', 'mean_test_roc_auc', 'rank_test_roc_auc'
]
print(results_df[desired_cols].sort_values(by='rank_test_roc_auc'))

# Now 'grid.best_estimator_' is the pipeline with the best combination of steps
best_pipeline = grid.best_estimator_

## Exercise 6.2

Taking one of the previously stored "best" models, check if and how various data rebalancing methods influence its performance on a new dataset: use cross-validation first and then compare the results on the test dataset.

In [None]:
# place for solution

In [None]:
# https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/