<a href="https://colab.research.google.com/github/Blueblurr/mbalanced-Classification-Analysis-and-Oversampling-Techniques/blob/main/CCDF_MODEL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 1: Problem Nature

### Problem Context
The increasing adoption of credit cards has led to a parallel rise in credit card fraud. Detecting fraudulent transactions is a high-stakes classification problem where failure can result in financial loss, regulatory risk, and reputational damage.

### Data Constraints
Due to privacy regulations, banks and financial institutions typically operate on siloed datasets. This fragmentation reduces the data available per model, limiting the ability to learn generalisable fraud patterns.

In our dataset (sourced from Kaggle), anonymised features have been transformed via Principal Component Analysis (PCA), and raw transaction metadata is unavailable. This restricts interpretability and may complicate strategies like synthetic oversampling.

### Class Imbalance Challenge
Fraudulent transactions make up a tiny fraction of total data — often <1%. Standard classifiers trained on such data tend to predict the majority class (legitimate transactions) to maximise accuracy, failing to detect fraud.
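
To make the accuracy trap concrete, here is a minimal sketch using the class counts of the Kaggle dataset loaded later in this notebook (284,315 legitimate and 492 fraudulent transactions). The "always legitimate" predictor is a hypothetical stand-in for a degenerate classifier, not one of the models evaluated below.

```python
# Class counts taken from the dataset summary printed later in this notebook.
n_legit, n_fraud = 284_315, 492

# Ground truth: 0 = legitimate, 1 = fraud.
y_true = [0] * n_legit + [1] * n_fraud

# Hypothetical degenerate classifier that labels every transaction as legitimate.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = frauds_caught / n_fraud

print(f"Accuracy:     {accuracy:.4%}")
print(f"Fraud recall: {recall:.0%}")
```

The accuracy exceeds 99.8% even though not a single fraud is detected, which is why accuracy is not used as a headline metric in this notebook.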

#### Three Strategic Response Options:
1. **Data-Level Approach:** Use oversampling (e.g. SMOTE) or undersampling to balance classes.
2. **Model-Level Approach:** Adjust model’s cost function or class weights to penalise false negatives more heavily.
3. **Framework-Level Approach:** Reframe the task as an anomaly detection problem rather than standard classification.

### Project Objective
We will evaluate the effectiveness of these strategies by implementing and comparing classical ML models on a real-world fraud detection dataset. Key focus areas:

- Navigating limitations of PCA-transformed data
- Evaluating model performance using metrics suited for imbalanced binary classification (see Section 2)
- Optimising hyperparameters for each model via stratified cross-validation
- Comparing models using consistent performance benchmarks based on the metrics explained in Section 2
- Applying a selection plan for choosing which model is "best" (see Section 4)


# Section 2: Metrics of Interest

| **Metric**               | **What It Measures**                                     | **When It's Useful**                                      |
| ------------------------ | -------------------------------------------------------- | --------------------------------------------------------- |
| **Precision**            | Out of predicted frauds, how many were correct?          | When false positives are costly                           |
| **Recall (Sensitivity)** | Out of actual frauds, how many were detected?            | When missing frauds is unacceptable                       |
| **F1-Score**             | Balance between Precision and Recall                     | When you need a single number on imbalance                |
| **ROC-AUC**              | Model’s ability to distinguish classes at all thresholds | General classifier quality (good baseline)                |
| **PR-AUC**               | Tradeoff of Precision vs Recall                          | More informative than ROC-AUC when positive class is rare |




From this we can formalise our metrics mathematically. Consider the following values:

- Predicted Positive given Actual Positive = True Positive (TP)
- Predicted Negative given Actual Positive = False Negative (FN)
- Predicted Positive given Actual Negative = False Positive (FP)
- Predicted Negative given Actual Negative = True Negative (TN)
                     

Precision, the fraction of predicted positives that were correct, can be written as:

$Precision  = \frac{TP}{FP+TP}$

Recall, the fraction of actual frauds that were detected, is:

$Recall = \frac{TP}{TP+FN}$

The F1 score is the Harmonic Mean of the Precision and Recall. The harmonic mean for two values is $H(x_1,x_2) = \frac{2x_1x_2}{x_1+x_2}$

$F1 = \frac{2 * Recall * Precision}{Recall + Precision }$
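
The three formulas above can be checked with a small worked example. This is a minimal sketch: the function name `precision_recall_f1` and the counts are hypothetical, chosen only for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 frauds caught, 20 false alarms, 40 frauds missed.
tp, fp, fn = 80, 20, 40
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values (here, recall) than an arithmetic mean would.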

ROC-AUC consists of two parts: the ROC curve, which plots the True Positive Rate (TPR = Recall) against the False Positive Rate (FPR), and the area under that curve.

To recap, each data point has two labels: its true value and its predicted value. From the entire dataset we can count the TPs, FNs, FPs and TNs, use these counts to determine precision and recall, and combine those into the F1 score.

#### Ranking Classifiers with ROC-AUC

We can also use the values in our confusion matrix (TP, FP, TN, FN) to plot the TPR (our recall), $\frac{TP}{TP+FN}$, against the FPR (the fraction of negatives wrongly predicted as positives), $\frac{FP}{FP+TN}$.

Typically, a classification model outputs its prediction as a probability between 0 and 1, and we attach a threshold so that values larger than, say, $x$ are treated as predicted positives. Subdividing the 0-1 interval into 200 parts gives 200 threshold values and 200 corresponding confusion matrices, each of which maps to a single point on the ROC curve. We can plot these 200 points, join the dots, and calculate the area under the curve. This area estimates the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance, so a larger area indicates that frauds are ranked above legitimate transactions more reliably.
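
The threshold sweep described above can be sketched in plain Python. The scores below are hypothetical (drawn from two overlapping Gaussians), and the helper `roc_point` is an illustrative name rather than a library function. The sketch compares the trapezoidal area with the ranking probability it is meant to estimate.

```python
import random

random.seed(0)

# Hypothetical model scores: positives tend to score higher than negatives.
pos_scores = [min(1.0, max(0.0, random.gauss(0.7, 0.15))) for _ in range(200)]
neg_scores = [min(1.0, max(0.0, random.gauss(0.4, 0.15))) for _ in range(2000)]


def roc_point(threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return fp / len(neg_scores), tp / len(pos_scores)


# Sweep thresholds from 1 down to 0 in steps of 1/200 so the curve runs
# from the bottom-left corner towards (1, 1).
thresholds = [1 - i / 200 for i in range(201)]
points = [(0.0, 0.0)] + [roc_point(t) for t in thresholds]

# Trapezoidal rule: area under the piecewise-linear curve through the points.
auc_trapezoid = sum(
    (x1 - x0) * (y0 + y1) / 2
    for (x0, y0), (x1, y1) in zip(points, points[1:])
)

# Ranking interpretation: P(random positive scores above random negative),
# counting ties as one half.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos_scores
    for n in neg_scores
)
auc_ranking = wins / (len(pos_scores) * len(neg_scores))

print(f"Area under ROC curve (200 thresholds): {auc_trapezoid:.4f}")
print(f"Pairwise ranking probability:          {auc_ranking:.4f}")
```

The two numbers should agree closely; the small gap comes from scores that fall between the same pair of thresholds and are therefore treated as ties by the sweep.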




#### Limitations of ROC-AUC
While the AUC approach has few disadvantages, it's important to note that under extreme class imbalance the score can look deceptively high, because the large number of true negatives keeps the FPR low. Furthermore, AUC treats all misclassifications equally. In many real-world scenarios, the costs and benefits associated with different types of errors vary. AUC doesn't take this into account and might not fully represent performance in cases where one type of error is more critical than another.

#### PR-AUC

PR-AUC plots Precision vs. Recall at varying thresholds. Unlike ROC-AUC, it focuses solely on the performance over the positive class, making it more informative when the positive class (fraud) is rare. This makes PR-AUC especially valuable in fraud detection tasks where class imbalance skews ROC-based metrics.
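
A short calculation shows why precision-based metrics react to class imbalance while a ROC operating point does not. This is a sketch under stated assumptions: the operating point (TPR 0.90, FPR 0.01) is hypothetical, and `precision_at` is an illustrative helper; the fraud prevalence is taken from this notebook's dataset (492 of 284,807 transactions).

```python
def precision_at(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a (TPR, FPR) operating point at a given prevalence."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: catches 90% of frauds and
# wrongly flags 1% of legitimate transactions.
tpr, fpr = 0.90, 0.01

balanced = precision_at(tpr, fpr, prevalence=0.5)
# Prevalence from this notebook's dataset: 492 frauds out of 284,807 rows.
rare = precision_at(tpr, fpr, prevalence=492 / 284_807)

print(f"Precision at 50% prevalence:   {balanced:.3f}")
print(f"Precision at fraud prevalence: {rare:.3f}")
```

The same point on the ROC curve corresponds to very different precision values, so a PR curve exposes a weakness that the ROC curve hides when the positive class is rare.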

Let's now express these metrics in code.


First, we load and display the dataset we wish to classify.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

path = "/content/creditcard.csv"  # Google Colab-specific path
df = pd.read_csv(path)
df.head()

After some data cleaning, we define our baseline model as a Random Forest with default hyperparameters. We assign the features and labels, split them into training and test sets, import the model, fit it on the training data, and obtain predictions on the test data.

In [None]:
# Standard data cleaning.
# Class labels must be 0 or 1; rows with any other value
# (including NaN) are identified and removed.

initial_row_count = df.shape[0]
df = df[df['Class'].isin([0, 1])]
rows_removed = initial_row_count - df.shape[0]

if rows_removed > 0:
    print(f"Removed {rows_removed} rows where 'Class' was not 0 or 1.")
else:
    print("No rows found where 'Class' was not 0 or 1.")

In [None]:
# Assign features and class labels.
X = df.drop('Class', axis=1)
y = df['Class']
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Define and fit model
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train,y_train)
# Obtain predicted labels for the test set
y_pred = rf.predict(X_test)

Next, we will form our Confusion Matrix

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay,confusion_matrix
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test,y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=rf.classes_)
disp.plot()
plt.show()

Let's glance through our model's metrics to assess its performance.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, digits=4))

It's clear from this output that the model struggles more with classifying positive cases than negative ones, which we expected given the severe class imbalance. We'd therefore expect the PR-AUC to be lower than the ROC-AUC, as it is more sensitive to predictive performance on the positive class. We will visualise this below and, in the Resampling Techniques section, explore methods of addressing this problem to improve our base model. From then on we will explore other models and ways to improve them, and conduct a comparative analysis using our performance metrics. For now, let's visualise the ROC and PR curves:

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt

# Get predicted probabilities for the test set
y_prob = rf.predict_proba(X_test)[:, 1]

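# Calculate the ROC curve from the true labels and predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob)
roc_auc = metrics.auc(fpr, tpr)
# Calculate the PR curve from the true labels and predicted probabilities
precision, recall, _ = metrics.precision_recall_curve(y_test, y_prob)
pr_auc = metrics.auc(recall, precision)
# Display the PR curve
# Note: the value passed as average_precision here is the trapezoidal PR-AUC,
# so the "AP" entry in the plot legend shows that area rather than average precision.
display1 = metrics.PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=pr_auc)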
display1.plot()
# Display the ROC curve
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
display.plot()

plt.show()

While our metrics suggest strong fraud detection performance, these results are based on a single, highly imbalanced split with only 12 out of 400+ fraud cases. Metrics like ROC-AUC are overly optimistic due to class imbalance. To validate model robustness, we must evaluate performance using techniques like cross-validation with stratified sampling, focusing on the PR-AUC and F1-Score as our key metrics.
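
The idea behind stratified splitting can be sketched without any library: deal the indices of each class out to the folds in turn, so that every fold keeps roughly the original class ratio. The function `stratified_folds` and the toy labels are hypothetical, and this is a simplified illustration rather than scikit-learn's implementation (which can also shuffle).

```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Assign each sample index to a fold, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Hypothetical toy labels: 1,000 legitimate transactions and 10 frauds.
labels = [0] * 1000 + [1] * 10
folds = stratified_folds(labels, n_splits=5)

frauds_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print("Fold sizes:     ", [len(fold) for fold in folds])
print("Frauds per fold:", frauds_per_fold)
```

With a purely random split, some folds could easily end up with no frauds at all, which would make recall and PR-AUC undefined or highly unstable on those folds.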



In [None]:
# Let's repeat the full baseline workflow end to end to reinforce it

In [None]:
#Import Pandas and Load our data, displaying it also.
import pandas as pd
path = "/content/creditcard.csv"
df = pd.read_csv(path)
df.head()

In [None]:
#Pre-processing stage

# Getting rid of rows lacking binary values in the Class column.
# Also removing rows with NaN values in the 'Class' column.

# Initial Row shape
row_shape = df.shape[0]
df = df[df['Class'].isin([0, 1])]
rows_removed = row_shape - df.shape[0]

if rows_removed > 0:
  print(f'Removed {rows_removed} rows with non-binary Class values.')
else:
  print('No rows removed.')

# Assign features and class labels
X = df.drop('Class', axis=1)
y = df['Class']

from sklearn.model_selection import train_test_split
# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
from sklearn.ensemble import RandomForestClassifier
#Define and fit the model
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train,y_train)

In [None]:
#Get a predicted value
y_pred = rf.predict(X_test)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay,confusion_matrix
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test,y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=rf.classes_)
disp.plot()
plt.show()

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, digits=4))

In [None]:
from sklearn import metrics

y_prob = rf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob)
roc_auc = metrics.auc(fpr, tpr)
precision, recall, _ = metrics.precision_recall_curve(y_test, y_prob)
pr_auc = metrics.auc(recall, precision)
display1 = metrics.PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=pr_auc)
display1.plot()
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
display.plot()
plt.show()



In [None]:
# Let's build on this with cross-validation

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

#Import Pandas and Load our data, displaying it also.
import pandas as pd
path = "/content/creditcard.csv"
df = pd.read_csv(path)
#Pre-processing stage

# Getting rid of rows lacking binary values in the Class column.
# Also removing rows with NaN values in the 'Class' column.

# Initial Row shape
row_shape = df.shape[0]
df = df[df['Class'].isin([0, 1])]
rows_removed = row_shape - df.shape[0]

if rows_removed > 0:
  print(f'Removed {rows_removed} rows with non-binary Class values.')
else:
  print('No rows removed.')

#Assign Regressors and Class Labels
X = df.drop('Class', axis =1)
y = df['Class']

rf = RandomForestClassifier(n_jobs=-1, random_state=0, verbose=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pr_auc_scores = cross_val_score(rf, X, y, cv=cv, scoring='average_precision')
f1_scores = cross_val_score(rf, X, y, cv=cv, scoring='f1')

print("Mean PR-AUC:", pr_auc_scores.mean())
print("Mean F1 Score:", f1_scores.mean())

#OUTPUT WAS:
#Mean PR-AUC: 0.8503585364185218
#Mean F1 Score: 0.8656535196070081

# Section 3: Resampling Techniques

A valid approach in addressing class imbalance is to oversample positive cases and/or undersample negative cases from our dataset.

## Synthetic Minority Oversampling Technique (SMOTE)

SMOTE is an oversampling technique that generates synthetic samples for the minority class. It helps to overcome the overfitting caused by random oversampling, which merely duplicates existing points. SMOTE works in feature space, creating new instances by interpolating between minority-class instances that lie close together.
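
The interpolation step can be sketched in a few lines. This is an illustrative simplification, not the `imbalanced-learn` implementation: the function `smote_sample` and the 2-D points are hypothetical, and a synthetic point is placed at a random position on the segment between a minority point and one of its nearest minority neighbours.

```python
import math
import random


def smote_sample(minority, k=3, rng=None):
    """Create one synthetic point by interpolating between a minority point
    and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the base point (excluding itself).
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))


# Hypothetical 2-D minority-class points (e.g. two PCA components).
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1), (3.0, 3.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(5)]

for point in synthetic:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a straight segment between two existing minority points, which is the geometric property behind both SMOTE's usefulness and its limitations.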


Though this algorithm is quite useful, it has a few drawbacks:

- Synthetic instances are generated along the line segments connecting neighbouring minority instances, which can complicate the decision surface learned by some classifiers.
- SMOTE tends to create a large number of noisy data points in feature space.

We'll illustrate this via code.




In [None]:
%pip install imbalanced-learn



### First Preprocessing: Load and Split Data

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("/content/creditcard.csv")
print("Original shape:", df.shape)
print("Original class distribution:\n", df['Class'].value_counts())

# Clean Class column: coerce to numeric, keep only 0/1, then cast to int
df['Class'] = pd.to_numeric(df['Class'], errors='coerce')  # non-numeric strings become NaN
df = df[df['Class'].isin([0, 1])].copy()  # drops non-binary values and NaNs; copy avoids chained-assignment warnings
df['Class'] = df['Class'].astype(int)

# Safety check: no NaNs should remain after the filter above
df = df.dropna(subset=['Class'])

print("After filtering shape:", df.shape)
print("Filtered class distribution:\n", df['Class'].value_counts())

# Split features and labels
X = df.drop('Class', axis=1).reset_index(drop=True)
y = df['Class'].reset_index(drop=True)

Original shape: (284807, 31)
Original class distribution:
 Class
0    284315
1       492
Name: count, dtype: int64
After filtering shape: (284807, 31)
Filtered class distribution:
 Class
0    284315
1       492
Name: count, dtype: int64


### Second Preprocessing: SMOTE to Oversample Positive-Class Training Data

In [None]:
# SMOTE applied directly to the training split (kept commented out for reference)

# from imblearn.over_sampling import SMOTE
# from collections import Counter
#
# print('Before', Counter(y_train))
# smt = SMOTE(random_state=1)  # X and y are passed to fit_resample, not to the constructor
# X_train, y_train = smt.fit_resample(X_train, y_train)
# print('After', Counter(y_train))
#
# ...go on to run the model on this resampled training set

In [None]:
# SMOTE as part of the pipeline (resampling is applied only when fitting on the training folds)

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from collections import Counter
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score

smt = SMOTE(random_state=1)
# Import the Pipeline class directly so the variable below does not shadow the module
pipeline = Pipeline([
    ('SMOTE', smt),
    ('clf', RandomForestClassifier(n_jobs=-1, random_state=0, verbose=1))
])


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pr_auc_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='average_precision')
f1_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')

print("Mean PR-AUC:", pr_auc_scores.mean())
print("Mean F1 Score:", f1_scores.mean())

#This ended up giving the following:
#Mean PR-AUC: 0.8533183273188725
#Mean F1 Score: 0.8612774208614977

## ADASYN: Adaptive Synthetic Sampling Approach
ADASYN generalises the SMOTE algorithm. It also oversamples the minority class by generating synthetic instances, but it additionally uses a density distribution, $r_i$, to decide how many synthetic instances to generate around each minority sample: samples that are harder to learn receive more. As a result, it adaptively shifts the decision boundary towards the difficult samples, which is the major difference from SMOTE.
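
The role of $r_i$ can be sketched as follows. This is a simplified illustration, not the `imbalanced-learn` implementation: `adasyn_allocation` and the toy points are hypothetical. Here $r_i$ is taken as the fraction of majority-class points among the $k$ nearest neighbours of minority point $i$, and normalising the $r_i$ values gives each point's share of the synthetic samples.

```python
import math


def adasyn_allocation(minority, majority, k=2, n_synthetic=10):
    """Return how many synthetic samples to generate around each minority point."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for point in minority:
        # k nearest neighbours of this minority point across both classes.
        neighbours = sorted(
            (item for item in labelled if item[0] is not point),
            key=lambda item: math.dist(point, item[0]),
        )[:k]
        # r_i: share of majority-class points among those neighbours.
        ratios.append(sum(1 for _, label in neighbours if label == 0) / k)

    total = sum(ratios)
    if total == 0:
        # Sketch-only choice: no minority point borders the majority class.
        return [0] * len(minority)
    # Rounded shares; they may not sum exactly to n_synthetic in general.
    return [round(r / total * n_synthetic) for r in ratios]


# Hypothetical data: three "easy" minority points in their own cluster and
# one "hard" minority point surrounded by majority points.
minority = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (2.0, 2.0)]
majority = [(2.1, 2.0), (2.0, 2.1), (1.9, 2.0), (3.0, 3.0), (3.1, 2.9)]

allocation = adasyn_allocation(minority, majority, k=2, n_synthetic=10)
print(allocation)
```

All synthetic samples are allocated to the minority point that sits among majority points, whereas SMOTE would spread them without regard to difficulty.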

In [None]:
from imblearn.over_sampling import ADASYN
from collections import Counter

print('Before', Counter(y_train))
# X and y are passed to fit_resample, not to the constructor
adasyn = ADASYN(random_state=1)
X_train, y_train = adasyn.fit_resample(X_train, y_train)
# Recount after resampling so the new class balance is shown
print('After', Counter(y_train))

#...go on to run model on this data set

In [None]:
# ADASYN as part of the pipeline (resampling is applied only when fitting on the training folds)

from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from collections import Counter
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score

ada = ADASYN(random_state=1)
# Import the Pipeline class directly so the variable below does not shadow the module
pipeline = Pipeline([
    ('ADASYN', ada),
    ('clf', RandomForestClassifier(n_jobs=-1, random_state=0, verbose=1))])


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pr_auc_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='average_precision')
f1_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')

print("Mean PR-AUC:", pr_auc_scores.mean())
print("Mean F1 Score:", f1_scores.mean())

# Mean PR-AUC: 0.8508376667850875
# Mean F1 Score: 0.8573482376796676


## Hybridization: SMOTE + ENN
SMOTE + ENN is a hybrid technique that combines oversampling with a cleaning step that removes observations from the sample space. ENN (Edited Nearest Neighbours) is an undersampling technique that finds the nearest neighbours of each majority-class instance. If those neighbours misclassify the instance, it is deleted.

Applying ENN to data already oversampled by SMOTE performs more extensive cleaning: samples from both classes are removed when their nearest neighbours misclassify them. This results in a clearer class separation.
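
The cleaning rule can be sketched as below. This is a simplified illustration, not the `imbalanced-learn` implementation: `edited_nearest_neighbours` and the toy points are hypothetical, and the sketch uses a majority vote among the $k$ nearest neighbours applied to every class.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3):
    """Return the indices of samples whose label matches the majority vote
    of their k nearest neighbours; all other samples are dropped."""
    keep = []
    for i, point in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        vote = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if vote == labels[i]:
            keep.append(i)
    return keep


# Hypothetical data: a class-0 cluster, a class-1 cluster, and one noisy
# class-1 point (index 8) sitting inside the class-0 cluster.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),   # class 0
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),   # class 1
    (0.05, 0.05),                                     # noisy class 1
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = edited_nearest_neighbours(points, labels, k=3)
print("Kept indices:", kept)
```

Only the point whose neighbours disagree with its label is removed, which is how the ENN step cleans up noisy or overlapping samples left behind by oversampling.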

In [None]:
from imblearn.combine import SMOTEENN
from collections import Counter

print('Before', Counter(y_train))
# X and y are passed to fit_resample, not to the constructor
smote_enn = SMOTEENN(random_state=1)
X_train, y_train = smote_enn.fit_resample(X_train, y_train)
# Recount after resampling so the new class balance is shown
print('After', Counter(y_train))

#...go on to run model on this data set

In [None]:
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
from collections import Counter
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score

smt_en = SMOTEENN(random_state=1)
# Import the Pipeline class directly so the variable below does not shadow the module
pipeline = Pipeline([
    ('smt_en', smt_en),
    ('clf', RandomForestClassifier(n_jobs=-1, random_state=0, verbose=1))])


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pr_auc_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='average_precision')
f1_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')

print("Mean PR-AUC:", pr_auc_scores.mean())
print("Mean F1 Score:", f1_scores.mean())

#Mean PR-AUC: 0.8514091222163511
#Mean F1 Score: 0.8490642656735388


Mean PR-AUC: 0.8514091222163511
Mean F1 Score: 0.8490642656735388




Oversampling-based methods, including SMOTE, ADASYN and SMOTE-ENN, did not materially improve model performance over the baseline Random Forest model.

This suggests the model generalises well without synthetic data generation.

We will now consider some directions forward:

- Add `class_weight='balanced'` to `RandomForestClassifier` (weighting the cost function to prioritise reducing false negatives over false positives)

- Try XGBoost / LightGBM with the same CV structure (alternative models)

- Afterwards, conduct threshold tuning (optional if performance is already strong)






In [None]:
# Class-weighted Random Forest, evaluated with the same cross-validation setup

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, random_state=0, verbose=1, class_weight='balanced')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pr_auc_scores = cross_val_score(rf, X, y, cv=cv, scoring='average_precision')
f1_scores = cross_val_score(rf, X, y, cv=cv, scoring='f1')

print("Mean PR-AUC:", pr_auc_scores.mean())
print("Mean F1 Score:", f1_scores.mean())

#OUTPUT WAS:
#Mean PR-AUC: 0.8507911190464504
#Mean F1 Score: 0.8507171820397957


Mean PR-AUC: 0.8507911190464504
Mean F1 Score: 0.8507171820397957




### Evaluation of Class-Weighted Random Forest

To address class imbalance without altering the data distribution, we trained a Random Forest classifier using the `class_weight='balanced'` parameter. This forces the model to penalize misclassifications of the minority class more heavily during training.
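
For intuition, the weights implied by the "balanced" heuristic can be worked out by hand. The sketch below applies the formula documented by scikit-learn, `n_samples / (n_classes * count_per_class)`, to the class counts of this notebook's dataset; the variable names are illustrative.

```python
# Class counts from this notebook's dataset summary.
class_counts = {0: 284_315, 1: 492}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"Class {label}: weight {weight:.4f}")

# Each class ends up contributing the same total weight to the training loss.
total_weight = {label: class_weights[label] * class_counts[label] for label in class_counts}
print(total_weight)
```

A single fraud therefore counts roughly as much as 578 legitimate transactions during training, which equalises the total weight of the two classes.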

**Cross-validated Results:**
- PR-AUC: 0.8508
- F1 Score: 0.8507

**Conclusion:**  
Class weighting did not improve model performance relative to the baseline model; in fact, it slightly reduced the F1 score (0.8507 vs. 0.8657). This is consistent with our earlier conclusion that the baseline model already handles the imbalance effectively, and that additional reweighting or resampling adds no measurable benefit.


---
## Alternative Models

XGBoost builds an ensemble of decision trees, where each new tree attempts to correct the errors made by the previous ones. It uses a technique called gradient boosting, which involves optimising a loss function (like mean squared error or log loss) using gradient descent.
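
The "each tree corrects the previous ones" idea can be sketched with plain gradient boosting on log loss, using one-split decision stumps as the trees. This is an illustrative toy, not XGBoost itself (which adds second-order information, regularisation and many optimisations); the data, `fit_stump` and the learning rate are all hypothetical choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, scores):
    eps = 1e-12
    return -sum(
        t * math.log(sigmoid(s) + eps) + (1 - t) * math.log(1 - sigmoid(s) + eps)
        for t, s in zip(y, scores)
    ) / len(y)


def fit_stump(x, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not right:
            continue
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


# Hypothetical 1-D data: larger x values are more likely to be fraud (label 1).
x = [0.1, 0.4, 0.5, 0.9, 1.3, 1.8, 2.2, 2.7, 3.1, 3.5]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

learning_rate = 0.5
scores = [0.0] * len(x)  # raw log-odds predictions, starting from 0
losses = [log_loss(y, scores)]

for _ in range(20):
    # Negative gradient of log loss w.r.t. the scores: label minus predicted probability.
    residuals = [t - sigmoid(s) for t, s in zip(y, scores)]
    stump = fit_stump(x, residuals)  # the new "tree" models the current errors
    scores = [s + learning_rate * stump(xi) for s, xi in zip(scores, x)]
    losses.append(log_loss(y, scores))

print(f"Log loss before boosting: {losses[0]:.4f}")
print(f"Log loss after 20 rounds: {losses[-1]:.4f}")
```

Each round fits a stump to what the current ensemble still gets wrong and adds a damped version of it, so the training loss falls step by step.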

The model has moderately high performance and extremely fast runtimes.






In [None]:
!pip install xgboost optuna



We'll illustrate this approach via code:

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

xgb_classifier = XGBClassifier(objective='binary:logistic', random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pr_auc_scores = cross_val_score(xgb_classifier, X, y, cv=cv, scoring='average_precision')
f1_scores = cross_val_score(xgb_classifier, X, y, cv=cv, scoring='f1')

print("Mean PR-AUC:", pr_auc_scores.mean())
print("Mean F1 Score:", f1_scores.mean())

#Mean PR-AUC: 0.7904463093843405
#Mean F1 Score: 0.8212182283730669


Mean PR-AUC: 0.7904463093843405
Mean F1 Score: 0.8212182283730669


XGBoost's speed is a major advantage: training completes in under two minutes. To optimize its performance, we apply Bayesian hyperparameter optimization using the Tree-structured Parzen Estimator (TPE) method via the Optuna library. Unlike traditional grid search, which evaluates every combination on a fixed grid, TPE uses the results of earlier trials to propose promising hyperparameters, so it explores the search space in far fewer model fits. This keeps CPU usage manageable and reduces the risk of runtime disconnection.

In [None]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier
import optuna

# Load dataset
path = "/content/creditcard.csv"  # adjust if local
df = pd.read_csv(path)

# Clean the target column
df = df[df['Class'].isin([0, 1])]
X = df.drop('Class', axis=1)
y = df['Class']

# Objective function for Optuna: defines hyperparameter search space
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'scale_pos_weight': Counter(y)[0] / Counter(y)[1],
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'tree_method': 'hist',  # Fast training
        'use_label_encoder': False
    }

    model = XGBClassifier(**params, objective='binary:logistic', random_state=0, n_jobs=-1)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring='average_precision', n_jobs=-1)

    # Return the cross-validated PR-AUC (average precision) for this trial
    return scores.mean()

In [None]:
# Searches for model that optimises PR- AUC via TPE approach (maximising expected improvement)
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)



[I 2025-08-05 14:47:09,272] A new study created in memory with name: no-name-30a7ab83-fcff-4c4a-9e51-626d04df0e64
[I 2025-08-05 14:47:56,213] Trial 0 finished with value: 0.7408389591792222 and parameters: {'max_depth': 4, 'learning_rate': 0.01230072864513482, 'subsample': 0.919571847160622, 'colsample_bytree': 0.5612943109554946, 'n_estimators': 320, 'min_child_weight': 4, 'gamma': 1.7468914763585826}. Best is trial 0 with value: 0.7408389591792222.
[I 2025-08-05 14:48:28,973] Trial 1 finished with value: 0.744453278890908 and parameters: {'max_depth': 6, 'learning_rate': 0.0034480565241761115, 'subsample': 0.9519230227400952, 'colsample_bytree': 0.5761271774033399, 'n_estimators': 184, 'min_child_weight': 9, 'gamma': 1.6821284584839185}. Best is trial 1 with value: 0.744453278890908.
[I 2025-08-05 14:49:06,702] Trial 2 finished with value: 0.8499250689923574 and parameters: {'max_depth': 10, 'learning_rate': 0.07480970531729769, 'subsample': 0.9526982139792209, 'colsample_bytree': 0.

In [None]:
print("Best PR-AUC Score:", study.best_value)
print("Best Hyperparameters:", study.best_params)

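# Evaluate the best model on PR-AUC and F1.
# Note: study.best_params holds only the tuned values, so the fixed
# scale_pos_weight and tree_method used during the search are not re-applied here;
# this may explain why the scores below differ from study.best_value.
best_model = XGBClassifier(**study.best_params, objective='binary:logistic',
                           use_label_encoder=False, random_state=0, n_jobs=-1)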
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pr_auc = cross_val_score(best_model, X, y, cv=cv, scoring='average_precision', n_jobs=-1)
f1 = cross_val_score(best_model, X, y, cv=cv, scoring='f1', n_jobs=-1)

print("Cross-Validated PR-AUC:", pr_auc.mean())
print("Cross-Validated F1 Score:", f1.mean())


Best PR-AUC Score: 0.8651803985314931
Best Hyperparameters: {'max_depth': 8, 'learning_rate': 0.08011927645640976, 'subsample': 0.6444313604893382, 'colsample_bytree': 0.812649449268795, 'n_estimators': 372, 'min_child_weight': 4, 'gamma': 1.6326169748600887}
Cross-Validated PR-AUC: 0.8618736239214029
Cross-Validated F1 Score: 0.8648313654581627


## 🧾 Final Summary and Conclusion
In this notebook, we tackled the challenge of credit card fraud detection, a high-stakes and highly imbalanced classification problem. We:

- **Explored baseline performance using a Random Forest classifier**, identifying the limitations of standard accuracy in imbalanced settings.
- **Applied resampling techniques (SMOTE, ADASYN, SMOTE-ENN) to balance the dataset**, but observed no material performance gains over the baseline.
- **Evaluated class weighting**, which similarly failed to outperform the default model.
- **Introduced XGBoost**, a powerful gradient boosting algorithm, and used **Bayesian optimization (Optuna)** to efficiently tune hyperparameters.
- **Compared all models under stratified cross-validation, using PR-AUC and F1-score** as our key metrics.

🔍 Key Takeaway:
Despite experimenting with several imbalance mitigation techniques, **the baseline Random Forest and tuned XGBoost** models handled the dataset robustly without requiring synthetic sampling. This reinforces the value of well-calibrated ensemble methods, especially when combined with rigorous cross-validation and appropriate evaluation metrics.

