# Handling imbalanced data

In this notebook we look at several methods for handling class imbalance and apply Conformal Prediction to calibrate class probabilities.

We will use the Credit Card Fraud Detection dataset from Kaggle: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

The dataset contains credit card transactions made in September 2013 by cardholders in Europe. The transactions occurred over two days, with 492 fraudulent transactions out of 284,807. The dataset is highly imbalanced: the positive class (fraudulent transactions) accounts for about 0.17% of all transactions.
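
The imbalance figure can be checked directly from the two counts quoted above. The short sketch below also shows why plain accuracy is a poor yardstick here: a model that never flags fraud is already right almost all of the time. The variable names are illustrative only.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284_807

# Share of transactions that are fraudulent (the positive class).
fraud_rate = n_fraud / n_total
print(f"Fraud rate: {fraud_rate:.4%}")

# A classifier that always predicts "non-fraudulent" is correct on every legitimate transaction.
majority_accuracy = 1 - fraud_rate
print(f"Accuracy of always predicting class 0: {majority_accuracy:.4%}")
```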

The dataset contains numerical features that result from a PCA transformation; the original features have been withheld due to confidentiality and privacy concerns.

* Features V1, V2, ..., V28 are the principal components obtained using PCA.
* The only original features are 'Time' and 'Amount'.
* Feature 'Time' contains the time (in seconds) elapsed between each transaction and the first transaction in the dataset.
* Feature 'Amount' is the transaction amount.
* Label 'Class' is the dependent variable to be predicted (fraudulent transactions are labeled 1).

In [None]:
!pip install dtype_diet
!pip install catboost
!pip install plotly

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import time

import plotly.io as pio
pio.renderers.default = 'colab'

# Set the style for visualization
sns.set_style("whitegrid")

from tqdm import tqdm

from dtype_diet import report_on_dataframe, optimize_dtypes

import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.subplots import make_subplots
init_notebook_mode(connected=True)

import gc
from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, average_precision_score, cohen_kappa_score, recall_score, f1_score, roc_auc_score, log_loss, brier_score_loss, matthews_corrcoef
from sklearn.preprocessing import StandardScaler
from sklearn.calibration import CalibrationDisplay, calibration_curve
from matplotlib.gridspec import GridSpec


from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks, EditedNearestNeighbours

from catboost import CatBoostClassifier
from sklearn import svm
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
import xgboost as xgb

pd.set_option('display.max_columns', 100)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/drive/MyDrive/creditcard/creditcard.csv')

In [None]:
print(f'Original df memory: {data.memory_usage(deep=True).sum()/1024/1024} MB')
proposed_df = report_on_dataframe(data, unit="MB")

data = optimize_dtypes(data, proposed_df)
print(f'Proposed df memory: {data.memory_usage(deep=True).sum()/1024/1024} MB')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
# there is no missing data in the dataset
data.isnull().sum()

## EDA

Basic Examination of the Data:

The dataset contains 284,807 records and 31 columns.
The columns consist of:
* Time: Number of seconds elapsed between this transaction and the first transaction in the dataset.
* V1 to V28: These are the principal components obtained through PCA.
* Amount: Transaction amount.
* Class: This is our target variable where 1 indicates a fraudulent transaction and 0 indicates a non-fraudulent transaction.

Summary Statistics Insights:

* Time: Ranges from 0 to 172,792 seconds, which indicates that the data spans roughly two days of transactions.
* Amount: The average transaction amount is about 88.35, with a standard deviation of 250.12. Transaction amounts range from 0 to 25,691.16.
* Class: The mean value is close to 0 (0.001727 to be exact), which indicates a highly imbalanced dataset, as expected.

In [None]:
data.describe()

The plots below show four views of the data:

* Distribution of fraudulent vs non-fraudulent transactions: as expected, the dataset is highly imbalanced, with the vast majority of transactions being non-fraudulent.
* Distribution of transaction times: the distribution appears bimodal, with two high-activity periods within the two-day span. These peaks possibly correspond to daytime activity, with the dip between them falling at night.
* Distribution of transaction amount: most transaction amounts are concentrated at low values, with very few high-value transactions.
* Distribution of V1 for fraudulent vs non-fraudulent transactions: the distributions of this PCA component differ somewhat between the two classes, which could be useful for classification.

In [None]:
# Set up the figure and axes
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(18, 12))

# Distribution of the target variable (Class)
sns.countplot(data=data, x='Class', ax=ax[0, 0])
ax[0, 0].set_title('Distribution of Fraudulent vs Non-Fraudulent Transactions')
ax[0, 0].set_xticklabels(['Non-Fraudulent (0)', 'Fraudulent (1)'])

# Distribution of the Time column
sns.histplot(data['Time'], ax=ax[0, 1], bins=50)
ax[0, 1].set_title('Distribution of Transaction Times')

# Distribution of the Amount column
sns.histplot(data['Amount'], ax=ax[1, 0], bins=100)
ax[1, 0].set_title('Distribution of Transaction Amount')
ax[1, 0].set_xlim([0, 2000])  # Limiting for better visualization as there are few high value transactions

# Distribution of one of the PCA components (V1 as an example)
sns.kdeplot(data[data['Class'] == 0]['V1'], label='Non-Fraudulent', ax=ax[1, 1])
sns.kdeplot(data[data['Class'] == 1]['V1'], label='Fraudulent', ax=ax[1, 1])
ax[1, 1].set_title('Distribution of V1 for Fraudulent vs Non-Fraudulent Transactions')
ax[1, 1].legend()

plt.tight_layout()
plt.show()

Fraudulent transactions exhibit a more uniform distribution over time compared to valid transactions. They appear consistently distributed throughout the timeline, even during periods with low genuine transaction activity, which corresponds to nighttime in the European timezone.

In [None]:
# Set up the figure with the specified colors
plt.figure(figsize=(9, 6))

# Plot the density distribution of 'Time' for non-fraudulent transactions using the same blue color
sns.kdeplot(data[data['Class'] == 0]['Time'], color='blue', label='Non-Fraudulent', fill=True, alpha=0.5)

# Overlay the density distribution of 'Time' for fraudulent transactions using the same red color
sns.kdeplot(data[data['Class'] == 1]['Time'], color='red', label='Fraudulent', fill=True, alpha=0.5)

plt.title('Density Distribution of Transaction Time for Both Classes')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Convert 'Time' from seconds to hours, without limiting to a single day's range
data['Hour'] = (data['Time'] / 3600).astype(int)

# Grouping data by 'Hour' and 'Class' to get count and sum of transactions for each class
grouped_data = data.groupby(['Hour', 'Class']).agg(Number_of_Transactions=('Time', 'count'), Total_Amount=('Amount', 'sum')).reset_index()

grouped_data.head()

In [None]:
# Plot normalized hourly distributions: for each class, every hourly value is a percentage of that class's total.
# Calculate per-class totals for number of transactions and total amount
# (the 'Daily_*' column names are kept below, although the totals cover the whole two-day period)
daily_totals_by_class = grouped_data.groupby(['Class']).agg(Daily_Total_Transactions=('Number_of_Transactions', 'sum'),
                                                            Daily_Total_Amount=('Total_Amount', 'sum')).reset_index()

# Merge these daily totals with the original grouped data
normalized_data = pd.merge(grouped_data, daily_totals_by_class, on='Class', how='left')

# Calculate the percentage of transactions and amounts for each hour based on the daily totals for each class
normalized_data['Percentage_Transactions'] = (normalized_data['Number_of_Transactions'] / normalized_data['Daily_Total_Transactions']) * 100
normalized_data['Percentage_Amount'] = (normalized_data['Total_Amount'] / normalized_data['Daily_Total_Amount']) * 100

normalized_data.head()

In [None]:
# Visualize the corrected hourly percentages
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 10))

# Hourly distribution of the percentage of transactions for both classes
sns.lineplot(data=normalized_data, x='Hour', y='Percentage_Transactions', hue='Class', ax=ax[0], palette='tab10')
ax[0].set_title('Hourly Distribution of Percentage of Transactions')
ax[0].set_ylabel('Percentage of Transactions (%)')
ax[0].legend(title='Class', labels=['Non-Fraudulent', 'Fraudulent'])

# Hourly distribution of the percentage of transaction amounts for both classes
sns.lineplot(data=normalized_data, x='Hour', y='Percentage_Amount', hue='Class', ax=ax[1], palette='tab10')
ax[1].set_title('Hourly Distribution of Percentage of Transaction Amounts')
ax[1].set_ylabel('Percentage of Transaction Amounts (%)')
ax[1].legend(title='Class', labels=['Non-Fraudulent', 'Fraudulent'])

plt.tight_layout()
plt.show()


In [None]:
# import plotly.io as pio
pio.renderers.default = 'colab'

In [None]:
# Create a Plotly line chart for hourly distribution of the percentage of transactions for both classes
fig1 = go.Figure()

# Add traces for non-fraudulent and fraudulent transactions
fig1.add_trace(go.Scatter(x=normalized_data[normalized_data['Class'] == 0]['Hour'],
                          y=normalized_data[normalized_data['Class'] == 0]['Percentage_Transactions'],
                          mode='lines',
                          name='Non-Fraudulent'))
fig1.add_trace(go.Scatter(x=normalized_data[normalized_data['Class'] == 1]['Hour'],
                          y=normalized_data[normalized_data['Class'] == 1]['Percentage_Transactions'],
                          mode='lines',
                          name='Fraudulent'))

# Update layout
fig1.update_layout(title='Hourly Distribution of Percentage of Transactions',
                   xaxis_title='Hour',
                   yaxis_title='Percentage of Transactions (%)')

# Create a Plotly line chart for hourly distribution of the percentage of transaction amounts for both classes
fig2 = go.Figure()

# Add traces for non-fraudulent and fraudulent transactions
fig2.add_trace(go.Scatter(x=normalized_data[normalized_data['Class'] == 0]['Hour'],
                          y=normalized_data[normalized_data['Class'] == 0]['Percentage_Amount'],
                          mode='lines',
                          name='Non-Fraudulent'))
fig2.add_trace(go.Scatter(x=normalized_data[normalized_data['Class'] == 1]['Hour'],
                          y=normalized_data[normalized_data['Class'] == 1]['Percentage_Amount'],
                          mode='lines',
                          name='Fraudulent'))

# Update layout
fig2.update_layout(title='Hourly Distribution of Percentage of Transaction Amounts',
                   xaxis_title='Hour',
                   yaxis_title='Percentage of Transaction Amounts (%)')

fig1.show()
fig2.show()

In [None]:
# Create a Plotly boxplot for transaction amounts for both classes
fig = go.Figure()

# Add boxplots for non-fraudulent and fraudulent transaction amounts
fig.add_trace(go.Box(y=data[data['Class'] == 0]['Amount'], name='Non-Fraudulent', marker_color='blue'))
fig.add_trace(go.Box(y=data[data['Class'] == 1]['Amount'], name='Fraudulent', marker_color='red'))

# Update layout
fig.update_layout(title='Boxplot of Transaction Amounts',
                  yaxis=dict(type='log', title='Transaction Amount ($)'),
                  xaxis_title='Class')

fig.show()

In [None]:
# Filter out fraudulent transactions
fraudulent_data = data[data['Class'] == 1]

# Create a histogram with binned time intervals
hist_data, bin_edges = np.histogram(fraudulent_data['Time'], bins=48)  # 48 bins for 48 hours

In [None]:

# Create a heatmap using Plotly's graph_objects
fig = go.Figure(data=go.Heatmap(z=[hist_data], x=bin_edges[:-1], colorscale='Viridis', showscale=True))

# Update layout and axis titles
fig.update_layout(title='Heatmap of Fraudulent Transactions Over Time',
                  xaxis_title='Time (in seconds)',
                  yaxis_title='Fraudulent Transactions',
                  yaxis_nticks=1)  # Only one y-tick as we have one row of data

# Rotate x-axis labels for better readability
fig.update_xaxes(tickangle=45)

fig.show()


In [None]:
# Calculate the correlation matrix
correlation_matrix = data.corr()

# Create a heatmap using Plotly
heatmap = go.Figure(data=go.Heatmap(z=correlation_matrix.values,
                                    x=correlation_matrix.columns,
                                    y=correlation_matrix.columns,
                                    colorscale='Viridis',
                                    zmin=-1, zmax=1))

# Update layout for better readability
heatmap.update_layout(title='Correlation Heatmap of Features and Target',
                      xaxis_tickangle=-45)

heatmap.show()


In [None]:
# Compute the correlation matrix again
correlation_matrix = data.corr()

# Display the correlation matrix using a red to green heatmap formatting on the dataframe display
cm_red_green = sns.diverging_palette(150, 10, as_cmap=True)
styled_correlation_red_green = correlation_matrix.style.background_gradient(cmap=cm_red_green)
styled_correlation_red_green

In [None]:
# Recompute the correlations of features with 'Class'
class_correlations = correlation_matrix["Class"].drop("Class")

# Create the figure
fig = go.Figure(data=[go.Bar(x=class_correlations.index,
                             y=class_correlations.values,
                             marker=dict(color=class_correlations.values,
                                         colorscale="RdYlGn",
                                         colorbar=dict(title="Correlation Coefficient")))])

# Update layout
fig.update_layout(title="Correlation of Features with 'Class'",
                  xaxis_title="Features",
                  yaxis_title="Correlation Coefficient",
                  xaxis_tickangle=-45)

fig.show()

## Modeling

In [None]:
#TRAIN/VALIDATION/TEST SPLIT
VALID_SIZE = 0.20 # simple validation using train_test_split
TEST_SIZE = 0.20 # test size using train_test_split

#CROSS-VALIDATION
NUMBER_KFOLDS = 5 #number of KFolds for cross-validation

RANDOM_STATE = 42

In [None]:
target = 'Class'
features = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',\
       'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28','Amount']

In [None]:
# Split the data chronologically (no shuffling) into train, calibration, and test sets.
# Drop the EDA-only 'Hour' column first so that it is not passed to the models as a feature.
model_data = data.drop(columns='Hour')
train_calib_df, test_df = train_test_split(model_data, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=False)
train_df, calib_df = train_test_split(train_calib_df, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=False)
len(train_df), len(calib_df), len(test_df)  # Display the number of samples in each dataset
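
With `shuffle=False`, each call to `train_test_split` cuts off the last 20% of the rows it receives, so the three sets are contiguous blocks in time order (roughly 64% / 16% / 20% of the data). The sketch below illustrates this on a small hypothetical frame (`toy` is made up for the example, not the credit card data) and prints the positive rate per split, which is not guaranteed to match across splits because nothing is stratified.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 1,000 time-ordered rows with a rare positive class.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Time": np.arange(1000),
    "Class": (rng.random(1000) < 0.02).astype(int),
})

# Same two-stage split as in the notebook: hold out the last 20% for testing,
# then the last 20% of what remains for calibration.
toy_train_calib, toy_test = train_test_split(toy, test_size=0.20, shuffle=False)
toy_train, toy_calib = train_test_split(toy_train_calib, test_size=0.20, shuffle=False)

for name, part in [("train", toy_train), ("calib", toy_calib), ("test", toy_test)]:
    print(f"{name:5s} rows={len(part):4d} "
          f"time=[{part['Time'].min()}, {part['Time'].max()}] "
          f"positive rate={part['Class'].mean():.3f}")
```

Because the blocks do not overlap in time, the test set only contains transactions that occur after everything the model was trained and calibrated on.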


In [None]:
data.head()

In [None]:
data.drop('Hour',axis = 1, inplace = True)

In [None]:
train_df.head()

In [None]:
# Train the StandardScaler on the training set
scaler_time = StandardScaler().fit(train_df['Time'].values.reshape(-1, 1))
scaler_amount = StandardScaler().fit(train_df['Amount'].values.reshape(-1, 1))

# Transform the 'Time' and 'Amount' columns in the train, calibration, and test sets
train_df['Time'] = scaler_time.transform(train_df['Time'].values.reshape(-1, 1))
train_df['Amount'] = scaler_amount.transform(train_df['Amount'].values.reshape(-1, 1))

calib_df['Time'] = scaler_time.transform(calib_df['Time'].values.reshape(-1, 1))
calib_df['Amount'] = scaler_amount.transform(calib_df['Amount'].values.reshape(-1, 1))

test_df['Time'] = scaler_time.transform(test_df['Time'].values.reshape(-1, 1))
test_df['Amount'] = scaler_amount.transform(test_df['Amount'].values.reshape(-1, 1))

train_df[['Time', 'Amount']].head()  # Display the transformed 'Time' and 'Amount' columns for the training set as an example


In [None]:
# Check for overlapping rows between the training and test sets
overlapping_rows = train_df.merge(test_df, how='inner')
overlap_count = overlapping_rows.shape[0]

overlap = overlap_count > 0
overlap, overlap_count

In [None]:
train_df.head()

In [None]:
# Create dataframes to record the performance of the various models
models = ['Naive Classifier', 'Logistic Regression', 'Random Forest Classifier', 'AdaBoost Classifier', 'CatBoost Classifier', 'SVC', 'LGBM Classifier', 'XGBoost Classifier']

metrics = ['Accuracy', 'Precision', 'Recall', 'F1', 'ROC AUC', 'ECE', 'Log Loss', 'Brier Loss']

performance_base_models_df = pd.DataFrame(index=models, columns=metrics)

performance_calibrated_models_df = pd.DataFrame(index=models, columns=metrics)
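
Of the metrics listed above, Log Loss and Brier Loss are the two that score the predicted probabilities themselves rather than the hard class labels. As a reminder of what they measure, the sketch below computes both by hand on a tiny made-up example and compares the result with scikit-learn; the arrays are illustrative values, not model outputs.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.05, 0.7])

# Brier score: mean squared difference between predicted probability and outcome.
brier_manual = np.mean((y_prob - y_true) ** 2)

# Log loss: negative mean log-likelihood of the observed labels.
logloss_manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(f"Brier: manual={brier_manual:.6f}, sklearn={brier_score_loss(y_true, y_prob):.6f}")
print(f"Log loss: manual={logloss_manual:.6f}, sklearn={log_loss(y_true, y_prob):.6f}")
```

Lower is better for both, and both penalize confident wrong probabilities, which is why they are useful alongside accuracy-style metrics on imbalanced data.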


In [None]:
# Create a DataFrame to record execution times
time_df = pd.DataFrame(index=models, columns=['Execution Time (s)'])

## Naive classifier

In [None]:
# Naive classifier: predict the majority class (Class 0) for all samples
naive_predictions = np.zeros(len(test_df))

# True labels for the test set
true_labels = test_df['Class'].values

# Compute metrics
accuracy = accuracy_score(true_labels, naive_predictions)
precision = precision_score(true_labels, naive_predictions, zero_division=0)
recall = recall_score(true_labels, naive_predictions, zero_division=0)
f1 = f1_score(true_labels, naive_predictions, zero_division=0)
roc_auc = roc_auc_score(true_labels, naive_predictions)
logloss = log_loss(true_labels, naive_predictions)
brier_loss = brier_score_loss(true_labels, naive_predictions)

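# Approximate Expected Calibration Error (ECE) as the unweighted mean of the per-bin calibration gaps
# (the usual ECE definition weights each bin by its share of samples)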
fraction_of_positives, mean_predicted_value = calibration_curve(true_labels, naive_predictions, n_bins=10)
ece = np.sum(np.abs(fraction_of_positives - mean_predicted_value)) / len(mean_predicted_value)

# Populate the performance dataframe
performance_base_models_df.loc['Naive Classifier', :] = [accuracy, precision, recall, f1, roc_auc, ece, logloss, brier_loss]
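
The ECE value above averages the per-bin gaps with equal weight. The more common definition weights each bin by the fraction of samples that fall into it, which matters a great deal here because almost all predictions sit in the lowest bin. The helper below is a minimal sketch of that weighted variant; `expected_calibration_error` is a hypothetical name, not a scikit-learn function, and it assumes equal-width bins on [0, 1].

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-count-weighted ECE over equal-width probability bins (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the inner edges maps every probability to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
        ece += in_bin.mean() * gap  # weight the gap by the share of samples in the bin
    return ece

# Made-up example: eight confident negatives and two positives.
example_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
example_probs = [0.05] * 8 + [0.95, 0.55]
weighted_ece = expected_calibration_error(example_labels, example_probs)
print(f"Weighted ECE: {weighted_ece:.4f}")
```

On this example the three occupied bins have gaps of 0.05, 0.45, and 0.05; the unweighted mean would be about 0.18, while the weighted value is lower because the well-calibrated bin holds 80% of the samples.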


In [None]:
performance_base_models_df

In [None]:
def evaluate_model_performance(model, model_name, true_labels=true_labels, performance_df=performance_base_models_df, verbose=False):
    """
    Fits the model on the training set, evaluates it on the test set, and records metrics and execution time.

    Args:
    - model (estimator): Unfitted scikit-learn compatible classifier that exposes predict_proba.
    - model_name (str): Name of the model for which performance is being evaluated.
    - true_labels (array-like): Actual labels for comparison. Default is the true_labels of the test set.
    - performance_df (DataFrame): DataFrame to update with model performance metrics.
    - verbose (bool): If True, plot the calibration curve and a histogram of predicted probabilities.

    Returns:
    - Tuple of (DataFrame with updated performance metrics for the given model, fitted model).
    """

    start_time = time.time()  # Record start time

    # Fit the model on the training set
    model.fit(train_df.drop(columns='Class'), train_df['Class'])

    # Predict class score on the test set
    prob_pos = model.predict_proba(test_df.drop(columns='Class'))[:, 1]
    # Predict on the test set
    predictions = model.predict(test_df.drop(columns='Class'))

    # Compute metrics
    accuracy = accuracy_score(true_labels, predictions)
    precision = precision_score(true_labels, predictions, zero_division=0)
    recall = recall_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions)
    roc_auc = roc_auc_score(true_labels, prob_pos)
    logloss = log_loss(true_labels, prob_pos)
    brier_loss = brier_score_loss(true_labels, prob_pos)

    # Compute Expected Calibration Error (ECE)
    fraction_of_positives, mean_predicted_value = calibration_curve(true_labels, prob_pos, n_bins=10)
    ece = np.sum(np.abs(fraction_of_positives - mean_predicted_value)) / len(mean_predicted_value)

    # Populate the performance dataframe
    performance_df.loc[model_name, :] = [accuracy, precision, recall, f1, roc_auc, ece, logloss, brier_loss]

    # Plot calibration curve and histogram if verbose is True
    if verbose:
        fig = plt.figure(figsize=(10, 10))
        gs = GridSpec(2, 1)
        ax_calibration_curve = fig.add_subplot(gs[0, :])
        ax_histogram = fig.add_subplot(gs[1, :])

        # Plot calibration curve
        CalibrationDisplay.from_estimator(
            model,
            test_df.drop(columns='Class'),
            true_labels,
            n_bins=10,
            name=model_name,
            ax=ax_calibration_curve
        )
        ax_calibration_curve.set_title(f"Calibration plot ({model_name})")

        # Plot histogram
        ax_histogram.hist(prob_pos, range=(0, 1), bins=10, label=model_name)
        ax_histogram.set(title=model_name, xlabel="Mean predicted probability", ylabel="Count")

        plt.tight_layout()
        plt.show()

    end_time = time.time()  # Record end time
    execution_time = end_time - start_time  # Calculate execution time in seconds

    # Record execution time in the time DataFrame
    time_df.loc[model_name, 'Execution Time (s)'] = execution_time

    return performance_df,model

In [None]:
time_df

## Dummy Classifier

In [None]:
dummy_classifier_model = DummyClassifier(strategy='most_frequent', random_state=RANDOM_STATE , constant=None)

performance_base_models_df, _ = evaluate_model_performance(dummy_classifier_model,'Naive Classifier', verbose=True)


In [None]:
performance_base_models_df

In [None]:
time_df

## Logistic Regression

In [None]:
# Train Logistic Regression
logistic_regression_model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
performance_base_models_df,_ = evaluate_model_performance(logistic_regression_model,'Logistic Regression', verbose=True)

In [None]:
performance_base_models_df

In [None]:
time_df

### Random Forest

In [None]:
# Train Random Forest Classifier
rf_model = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1)
performance_base_models_df, rf_trained_model = evaluate_model_performance(rf_model,'Random Forest Classifier', verbose=True)

In [None]:
performance_base_models_df

In [None]:
time_df

In [None]:
pio.renderers.default = 'colab'
# Extract feature importances from the Random Forest model
feature_importances = rf_trained_model.feature_importances_

# Create a DataFrame for the feature importances
features_df = pd.DataFrame({
    'Feature': train_df.drop(columns='Class').columns,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
features_df = features_df.sort_values(by='Importance', ascending=False)

# Plotting using Plotly
fig = go.Figure(data=[
    go.Bar(x=features_df['Feature'], y=features_df['Importance'], marker_color='rgba(55, 128, 191, 0.7)')
])

fig.update_layout(title='Feature Importances from Random Forest',
                  xaxis_title='Features',
                  yaxis_title='Importance',
                  xaxis_tickangle=-45)

fig.show()

### AdaBoost Classifier

In [None]:
# Train AdaBoost classifier
ada_model = AdaBoostClassifier(random_state=RANDOM_STATE)
performance_base_models_df, ada_trained_model = evaluate_model_performance(ada_model ,'AdaBoost Classifier', verbose=True)

In [None]:
performance_base_models_df

In [None]:
time_df

In [None]:
pio.renderers.default = 'colab'
# Extract feature importances from the AdaBoost model
feature_importances = ada_trained_model.feature_importances_

# Create a DataFrame for the feature importances
features_df = pd.DataFrame({
    'Feature': train_df.drop(columns='Class').columns,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
features_df = features_df.sort_values(by='Importance', ascending=False)

# Plotting using Plotly
fig = go.Figure(data=[
    go.Bar(x=features_df['Feature'], y=features_df['Importance'], marker_color='rgba(55, 128, 191, 0.7)')
])

fig.update_layout(title='Feature Importances from Ada Boost',
                  xaxis_title='Features',
                  yaxis_title='Importance',
                  xaxis_tickangle=-45)

fig.show()

### CatBoost Classifier

In [None]:
# Train CatBoost classifier
catboost_model  = CatBoostClassifier(
    task_type="CPU",       # You can change this to "GPU" if you have a GPU.
    thread_count=-1,       # Use all available CPU cores
    verbose=0,
    random_state=RANDOM_STATE
)

performance_base_models_df, catboost_trained_model = evaluate_model_performance(catboost_model,'CatBoost Classifier', verbose=True)

In [None]:
performance_base_models_df

In [None]:
time_df

In [None]:
pio.renderers.default = 'colab'

# Extract feature importances
feature_importances = catboost_trained_model.get_feature_importance()

# Sort feature importances
sorted_indices = feature_importances.argsort()[::-1]  # Sort in descending order

# Sort feature names based on importance order
sorted_features = train_df.drop(columns='Class').columns[sorted_indices]
sorted_importances = feature_importances[sorted_indices]

# Plot sorted feature importances using Plotly
fig = go.Figure(data=[
    go.Bar(x=sorted_features,
           y=sorted_importances,
           marker_color='indianred')
])

fig.update_layout(title='Feature Importances from CatBoost Classifier (Sorted)',
                 xaxis=dict(title='Features'),
                 yaxis=dict(title='Importance'),
                 xaxis_tickangle=-45)

fig.show()

### Support Vector Machines Classifier

In [None]:
# Train SVC model
svc_model = SVC(probability=True, random_state=RANDOM_STATE)
performance_base_models_df,_ = evaluate_model_performance(svc_model ,'SVC', verbose=True)

In [None]:
performance_base_models_df

In [None]:
time_df

### LGBM Classifier

In [None]:
lgb_model = lgb.LGBMClassifier(random_state=RANDOM_STATE, n_jobs=-1)
performance_base_models_df,lgb_trained_model = evaluate_model_performance(lgb_model,'LGBM Classifier', verbose=True)

In [None]:
performance_base_models_df

In [None]:
time_df

In [None]:
pio.renderers.default = 'colab'
# Extract feature importances from the LGBM model
feature_importances = lgb_trained_model.feature_importances_
# Create a DataFrame for the feature importances
features_df = pd.DataFrame({
    'Feature': train_df.drop(columns='Class').columns,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
features_df = features_df.sort_values(by='Importance', ascending=False)

# Plotting using Plotly
fig = go.Figure(data=[
    go.Bar(x=features_df['Feature'], y=features_df['Importance'], marker_color='rgba(55, 128, 191, 0.7)')
])

fig.update_layout(title='Feature Importances from LGBM',
                  xaxis_title='Features',
                  yaxis_title='Importance',
                  xaxis_tickangle=-45)

fig.show()

### XGBoost classifier

In [None]:
# Train the XGBoost Classifier
xgb_model = xgb.XGBClassifier(n_jobs=-1, random_state=RANDOM_STATE)
performance_base_models_df,xgb_trained_model = evaluate_model_performance(xgb_model,'XGBoost Classifier', verbose=True)

In [None]:
pio.renderers.default = 'colab'
# Extract feature importances from the XGBoost model
feature_importances = xgb_trained_model.feature_importances_
# Create a DataFrame for the feature importances
features_df = pd.DataFrame({
    'Feature': train_df.drop(columns='Class').columns,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
features_df = features_df.sort_values(by='Importance', ascending=False)

# Plotting using Plotly
fig = go.Figure(data=[
    go.Bar(x=features_df['Feature'], y=features_df['Importance'], marker_color='rgba(55, 128, 191, 0.7)')
])

fig.update_layout(title='Feature Importances from XGBoost',
                  xaxis_title='Features',
                  yaxis_title='Importance',
                  xaxis_tickangle=-45)

fig.show()

In [None]:
performance_base_models_df

In [None]:
time_df

In [None]:
performance_base_models_df.to_csv('performance_base_models.csv')

In [None]:
performance_base_models_df.sort_values(['Log Loss', 'Brier Loss'], ascending=[True, True])

In [None]:
performance_base_models_df.sort_values(['ECE'], ascending = True)

In [None]:
performance_base_models_df.sort_values(['Log Loss', "Brier Loss"])

In [None]:
time_df.columns

In [None]:
time_df.sort_values('Execution Time (s)', ascending = True)

### Investigate resampling techniques

In [None]:
resampled_time_df = pd.DataFrame(index = time_df.index, columns = time_df.columns)
resampled_time_df

In [None]:
performance_resampling_methods_df = pd.DataFrame(index = metrics, columns = ['None', 'Weights', 'Threshold', 'Threshold + W', 'RandomOverSampler',\
                                    'SMOTE', 'ADASYN', 'RandomUnderSampler', 'NearMiss', 'TomekLinks', 'EditedNearestNeighbours'])
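
The 'Weights' and 'Threshold' columns are not filled in by the cells in this part of the notebook. As an assumption about what they are meant to hold, the sketch below shows the two usual non-resampling remedies on hypothetical synthetic data: reweighting the loss with `class_weight="balanced"`, and keeping the unweighted model but moving the decision threshold away from 0.5. Choosing the threshold by maximizing F1 is one option among several and is used here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (a few percent positives), not the credit card dataset.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# "Weights": reweight the loss so that both classes contribute equally.
weighted_lr = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# "Threshold": keep the unweighted model and pick the cut-off that maximizes F1 on held-out data.
plain_lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_scores = plain_lr.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
# precision and recall have one more entry than thresholds; drop the last to align them.
f1_per_threshold = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best_threshold = thresholds[np.argmax(f1_per_threshold)]
threshold_predictions = (val_scores >= best_threshold).astype(int)

f1_default = f1_score(y_val, plain_lr.predict(X_val), zero_division=0)
f1_tuned = f1_score(y_val, threshold_predictions, zero_division=0)
f1_weighted = f1_score(y_val, weighted_lr.predict(X_val), zero_division=0)
print(f"Chosen threshold: {best_threshold:.3f}")
print(f"F1 at 0.5: {f1_default:.3f} | F1 at tuned threshold: {f1_tuned:.3f} | F1 with class weights: {f1_weighted:.3f}")
```

The tuned F1 here is optimistic because the threshold is selected and scored on the same held-out data; in practice the threshold would be chosen on a separate split and then evaluated on the test set.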

In [None]:
def evaluate_resampled_model_performance(model, model_name, sampler, true_labels=true_labels, performance_df=performance_resampling_methods_df, verbose=False):
    """
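    Resamples the training set with the given sampler, fits the model on the resampled data,
    evaluates it on the untouched test set, and records metrics and execution time.

    Args:
    - model (estimator): Unfitted scikit-learn compatible classifier that exposes predict_proba.
    - model_name (str): Name of the model for which performance is being evaluated.
    - sampler (imblearn sampler): Resampling technique applied to the training set only.
    - true_labels (array-like): Actual labels for comparison. Default is the true_labels of the test set.
    - performance_df (DataFrame): DataFrame to update, with one column per resampling technique.
    - verbose (bool): If True, plot the calibration curve and a histogram of predicted probabilities.

    Returns:
    - Tuple of (DataFrame with updated performance metrics for the given technique, fitted model).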
    """

    start_time = time.time()  # Record start time

    technique = sampler.__class__.__name__

    X_resampled, y_resampled = sampler.fit_resample(train_df.drop(columns='Class'), train_df['Class'])

    # Fit the model on the resampled training data
    model.fit(X_resampled, y_resampled)

    # Predict class score on the test set
    prob_pos = model.predict_proba(test_df.drop(columns='Class'))[:, 1]
    # Predict on the test set
    predictions = model.predict(test_df.drop(columns='Class'))

    # Compute metrics
    accuracy = accuracy_score(true_labels, predictions)
    precision = precision_score(true_labels, predictions, zero_division=0)
    recall = recall_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions)
    roc_auc = roc_auc_score(true_labels, prob_pos)
    logloss = log_loss(true_labels, prob_pos)
    brier_loss = brier_score_loss(true_labels, prob_pos)

    # Compute Expected Calibration Error (ECE)
    fraction_of_positives, mean_predicted_value = calibration_curve(true_labels, prob_pos, n_bins=10)
    ece = np.sum(np.abs(fraction_of_positives - mean_predicted_value)) / len(mean_predicted_value)

    # Populate the performance dataframe
    performance_df[technique] = [accuracy, precision, recall, f1, roc_auc, ece, logloss, brier_loss]

    # Plot calibration curve and histogram if verbose is True
    if verbose:
        fig = plt.figure(figsize=(10, 10))
        gs = GridSpec(2, 1)
        ax_calibration_curve = fig.add_subplot(gs[0, :])
        ax_histogram = fig.add_subplot(gs[1, :])

        # Plot calibration curve
        CalibrationDisplay.from_estimator(
            model,
            test_df.drop(columns='Class'),
            true_labels,
            n_bins=10,
            name=model_name,
            ax=ax_calibration_curve
        )
        ax_calibration_curve.set_title(f"Calibration plot ({model_name})")

        # Plot histogram
        ax_histogram.hist(prob_pos, range=(0, 1), bins=10, label=model_name)
        ax_histogram.set(title=model_name, xlabel="Mean predicted probability", ylabel="Count")

        plt.tight_layout()
        plt.show()

    end_time = time.time()  # Record end time
    execution_time = end_time - start_time  # Calculate execution time in seconds

    # Record execution time per model and technique, so successive samplers do not overwrite each other
    resampled_time_df.loc[f'{model_name} ({technique})', 'Execution Time (s)'] = execution_time

    # Return the dataframe that was actually populated above
    return performance_resampling_methods_df, model
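
The `ece` value computed in the function above averages the per-bin gaps with equal weight across the non-empty bins. The more common definition of Expected Calibration Error weights each bin by the share of samples it contains, which matters here because, with so few positives, most predictions are likely to fall into the lowest bin. Below is a minimal sketch of the bin-weighted variant, assuming equal-width bins on [0, 1]; the function name `expected_calibration_error` is our own and is not part of scikit-learn.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-weighted ECE with equal-width bins on [0, 1] (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so a probability of exactly 1.0 lands in the last bin
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece


# Toy data (not from the dataset): eight confident negatives, two uncertain positives
toy_prob = np.array([0.05] * 8 + [0.95] * 2)
toy_true = np.array([0] * 8 + [1, 0])
toy_ece = expected_calibration_error(toy_true, toy_prob)
print(f"Bin-weighted ECE on toy data: {toy_ece:.3f}")
```

On this toy input the weighted value is dominated by the well-populated low bin, whereas an unweighted average would give the sparse high bin the same influence.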

In [None]:
# Train Logistic Regression
logistic_regression_model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
performance_resampling_methods_df,_ = evaluate_resampled_model_performance(logistic_regression_model,'Logistic Regression', sampler = SMOTE(), verbose=True)

In [None]:
performance_resampling_methods_df

## Logistic regression with resampling methods

In [None]:
techniques = [RandomOverSampler(), SMOTE(), ADASYN(), RandomUnderSampler(), NearMiss(version=1), TomekLinks(), EditedNearestNeighbours()]

for sampler in tqdm(techniques):
    technique = sampler.__class__.__name__
    print(f'Technique: {technique}')
    # Use a fresh model for each technique and pass it to the evaluation function
    model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
    performance_resampling_methods_df, _ = evaluate_resampled_model_performance(model, 'Logistic Regression', sampler=sampler, verbose=True)

In [None]:
resampled_time_df

In [None]:
performance_resampling_methods_df

In [None]:
lr_performance_resampling_methods_df = performance_resampling_methods_df.copy()
performance_resampling_methods_df[:] = np.nan

In [None]:
lr_performance_resampling_methods_df.to_csv('lr_performance_resampling_methods.csv')

## CatBoost with resampling methods

In [None]:
for sampler in tqdm(techniques):
    technique = sampler.__class__.__name__

    print(f'Technique: {technique}')

    catboost_model = CatBoostClassifier(
        task_type="CPU",       # You can change this to "GPU" if you have a GPU.
        thread_count=-1,       # Use all available CPU cores
        verbose=0,
        random_state=RANDOM_STATE
    )
    performance_resampling_methods_df, _ = evaluate_resampled_model_performance(catboost_model, 'CatBoost Classifier', sampler=sampler, verbose=True)

In [None]:
catboost_performance_resampling_methods_df = performance_resampling_methods_df.copy()
performance_resampling_methods_df[:] = np.nan

In [None]:
catboost_performance_resampling_methods_df

In [None]:
resampled_time_df

In [None]:
catboost_performance_resampling_methods_df.to_csv('catboost_performance_resampling_methods.csv')

## Calibration with Venn-ABERS

In [None]:
# Remove any previous clone; -f avoids an error when the directory does not exist yet
!rm -rf '/content/VennABERS'

In [None]:
# Clone the Venn-ABERS reference implementation once
CLONE_URL = "https://github.com/ptocca/VennABERS"
!git clone {CLONE_URL}

import sys
sys.path.append("VennABERS")

In [None]:
%cd VennABERS

In [None]:
import VennABERS
??VennABERS.ScoresToMultiProbs

In [None]:
%pwd

In [None]:
%cd '../'
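
To make the two outputs of `VennABERS.ScoresToMultiProbs` less of a black box, here is a naive sketch of inductive Venn-ABERS. For each test score it fits isotonic regression twice on the calibration set: once with the test point added with label 0 (giving `p0`) and once with label 1 (giving `p1`). The helper names `isotonic_value_at` and `venn_abers_naive` are our own, the toy scores are made up, and refitting for every test point is slow; the sketch is meant to illustrate the quantities involved, not to reproduce the cloned library's implementation.

```python
import numpy as np


def isotonic_value_at(scores, labels, query):
    """Fit a non-decreasing isotonic regression with pool-adjacent-violators
    and return the fitted value at `query` (which must occur in `scores`)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Collapse tied scores into one weighted point each
    xs, inverse = np.unique(scores, return_inverse=True)
    sums = np.bincount(inverse, weights=labels, minlength=len(xs))
    counts = np.bincount(inverse, minlength=len(xs))
    # Each block holds [label sum, point count, number of unique scores covered]
    blocks = []
    for s, c in zip(sums, counts):
        blocks.append([s, c, 1])
        # Merge backwards while the previous block's mean exceeds the current one
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s_last, c_last, n_last = blocks.pop()
            blocks[-1][0] += s_last
            blocks[-1][1] += c_last
            blocks[-1][2] += n_last
    fitted = np.concatenate([np.full(n, s / c) for s, c, n in blocks])
    return fitted[np.searchsorted(xs, query)]


def venn_abers_naive(cal_scores, cal_labels, test_scores):
    """Return (p0, p1) for every test score by refitting isotonic regression twice."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=float)
    p0 = np.empty(len(test_scores))
    p1 = np.empty(len(test_scores))
    for i, s in enumerate(test_scores):
        aug_scores = np.append(cal_scores, s)
        # Hypothesise label 0, then label 1, for the test point
        p0[i] = isotonic_value_at(aug_scores, np.append(cal_labels, 0.0), s)
        p1[i] = isotonic_value_at(aug_scores, np.append(cal_labels, 1.0), s)
    return p0, p1


# Toy calibration set (made-up scores, perfectly separable)
cal_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 0, 1, 1, 1]
p0, p1 = venn_abers_naive(cal_scores, cal_labels, [0.15, 0.85])
print("p0:", p0)
print("p1:", p1)
```

The gap between `p0` and `p1` shrinks as the calibration set grows, because a single hypothetical label then moves the isotonic fit less.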

## Calibrate Logistic Regression

In [None]:
calibrated_performance_df = performance_resampling_methods_df.copy()
calibrated_performance_df[:] = np.nan

In [None]:
# ground truth calibration labels
y_cal = calib_df['Class']

In [None]:
def evaluate_resampled_calibrated_model_performance(model, model_name, sampler, true_labels=true_labels, performance_df=calibrated_performance_df, verbose=False):
    """
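    Resamples the training set, fits the model, calibrates its test-set scores with
    Venn-ABERS on the calibration set, and updates the performance dataframe with metrics.

    Args:
    - model: Unfitted classifier exposing fit, predict and predict_proba.
    - model_name (str): Name of the model for which performance is being evaluated.
    - sampler: Resampler exposing fit_resample (e.g. SMOTE()).
    - true_labels (array-like): Actual labels for comparison. Default is the true_labels of the test set.
    - performance_df (DataFrame): DataFrame to update with model performance metrics.
    - verbose (bool): If True, plot the calibration curve and a histogram of the probabilities.

    Returns:
    - Tuple of (DataFrame with updated performance metrics, fitted model).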
    """

    start_time = time.time()  # Record start time

    technique = sampler.__class__.__name__

    X_resampled, y_resampled = sampler.fit_resample(train_df.drop(columns='Class'), train_df['Class'])

    # Train the model on the resampled training data
    model.fit(X_resampled, y_resampled)

    # Use the trained model to score the calibration set
    y_hat_cal_scores = model.predict_proba(calib_df.drop(columns='Class'))[:, 1]

    # Predict class scores on the test set
    prob_pos = model.predict_proba(test_df.drop(columns='Class'))[:, 1]
    testScores = prob_pos

    # Predict on the test set
    predictions = model.predict(test_df.drop(columns='Class'))

    # Calibrate using Venn-ABERS.
    # calibrPts: a list of (score, label) pairs for the calibration examples;
    # the score is a float and the label is an integer, 0 or 1.
    calibrPts = list(zip(y_hat_cal_scores, y_cal))

    # The Venn-ABERS calibrator learns the mapping from the underlying model's
    # scores to class labels on the calibration set and returns, for each test
    # score, a lower (p0) and an upper (p1) probability of the positive class
    p0, p1 = VennABERS.ScoresToMultiProbs(calibrPts, testScores)

    # Merge the two probabilities into a single calibrated probability
    prob_pos_calibrated = p1 / (1 - p0 + p1)
    calibrated_predictions = prob_pos_calibrated > 0.5

    # Compute metrics
    accuracy = accuracy_score(true_labels, calibrated_predictions)
    precision = precision_score(true_labels, calibrated_predictions, zero_division=0)
    recall = recall_score(true_labels, calibrated_predictions)
    f1 = f1_score(true_labels, calibrated_predictions)
    roc_auc = roc_auc_score(true_labels, prob_pos_calibrated)
    logloss = log_loss(true_labels, prob_pos_calibrated)
    brier_loss = brier_score_loss(true_labels, prob_pos_calibrated)

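    # Compute a calibration error: the unweighted mean absolute gap over the non-empty bins
    # (the standard ECE additionally weights each bin by its share of samples)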
    fraction_of_positives, mean_predicted_value = calibration_curve(true_labels, prob_pos_calibrated, n_bins=10)
    ece = np.sum(np.abs(fraction_of_positives - mean_predicted_value)) / len(mean_predicted_value)

    # Populate the performance dataframe
    calibrated_performance_df[technique] = [accuracy, precision, recall, f1, roc_auc, ece, logloss, brier_loss]

    # Plot calibration curve and histogram if verbose is True
    if verbose:
        fig = plt.figure(figsize=(10, 10))
        gs = GridSpec(2, 1)
        ax_calibration_curve = fig.add_subplot(gs[0, :])
        ax_histogram = fig.add_subplot(gs[1, :])

        # Plot calibration curve of the Venn-ABERS calibrated probabilities
        # (from_estimator would plot the uncalibrated model instead)
        CalibrationDisplay.from_predictions(
            true_labels,
            prob_pos_calibrated,
            n_bins=10,
            name=f"{model_name} (Venn-ABERS)",
            ax=ax_calibration_curve
        )
        ax_calibration_curve.set_title(f"Calibration plot ({model_name}, calibrated)")

        # Plot histogram of the calibrated probabilities
        ax_histogram.hist(prob_pos_calibrated, range=(0, 1), bins=10, label=model_name)
        ax_histogram.set(title=model_name, xlabel="Mean predicted probability", ylabel="Count")

        plt.tight_layout()
        plt.show()

    end_time = time.time()  # Record end time
    execution_time = end_time - start_time  # Calculate execution time in seconds

    # Record execution time per model and technique, so successive samplers do not overwrite each other
    resampled_time_df.loc[f'{model_name} ({technique})', 'Execution Time (s)'] = execution_time

    return calibrated_performance_df,model
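
The function above collapses the Venn-ABERS pair into one probability with `p1 / (1 - p0 + p1)`. The short sketch below, on made-up values of `p0` and `p1`, shows that the merged probability stays between the two bounds whenever `p0 <= p1`, and that the denominator is then at least 1, so the division is safe. The helper name `merge_venn_abers` is our own, and treating the width `p1 - p0` as a rough indicator of how uncertain the calibrator is about a prediction is an interpretation, not something the notebook computes.

```python
import numpy as np


def merge_venn_abers(p0, p1):
    """Merge lower/upper Venn-ABERS probabilities into a single probability."""
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    # With p0 <= p1 the denominator is >= 1, so no division by zero can occur
    return p1 / (1.0 - p0 + p1)


# Made-up bounds for three hypothetical test transactions
p0 = np.array([0.00, 0.30, 0.90])
p1 = np.array([0.02, 0.60, 0.99])

p = merge_venn_abers(p0, p1)
interval_width = p1 - p0          # wider interval = less certain calibration
labels = (p > 0.5).astype(int)    # same 0.5 threshold as in the function above

for lo, hi, merged, width, label in zip(p0, p1, p, interval_width, labels):
    print(f"p0={lo:.2f} p1={hi:.2f} merged={merged:.3f} width={width:.2f} label={label}")
```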

In [None]:
# Train Logistic Regression
logistic_regression_model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
calibrated_performance_df,_ = evaluate_resampled_calibrated_model_performance(logistic_regression_model,'Logistic Regression', sampler = SMOTE(), verbose=True)

In [None]:
calibrated_performance_df

In [None]:
techniques = [RandomOverSampler(), SMOTE(), ADASYN(), RandomUnderSampler(), NearMiss(version=1), TomekLinks(), EditedNearestNeighbours()]

for sampler in tqdm(techniques):
    technique = sampler.__class__.__name__
    print(f'Technique: {technique}')
    # Use a fresh model for each technique and pass it to the evaluation function
    model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
    calibrated_performance_df, _ = evaluate_resampled_calibrated_model_performance(model, 'Logistic Regression', sampler=sampler, verbose=True)

In [None]:
lr_calibrated_resampling_performance_df = calibrated_performance_df.copy()
lr_calibrated_resampling_performance_df.to_csv('lr_calibrated_resampling_performance.csv')

## Calibrate CatBoost

In [None]:
calibrated_performance_df[:] = np.nan

In [None]:
for sampler in tqdm(techniques):
    technique = sampler.__class__.__name__

    print(f'Technique: {technique}')

    catboost_model = CatBoostClassifier(
        task_type="CPU",       # You can change this to "GPU" if you have a GPU.
        thread_count=-1,       # Use all available CPU cores
        verbose=0,
        random_state=RANDOM_STATE
    )
    calibrated_performance_df, _ = evaluate_resampled_calibrated_model_performance(catboost_model, 'CatBoost Classifier', sampler=sampler, verbose=True)

In [None]:
catboost_calibrated_resampling_performance_df = calibrated_performance_df.copy()
catboost_calibrated_resampling_performance_df.to_csv('catboost_calibrated_resampling_performance.csv')