# Predicting Credit Card Fraud with a Multi-layer Perceptron

Dataset: Kaggle Credit Card Fraud

**What is MLP and how is it similar to taught course content such as logistic regression?**

MLP stands for multi-layer perceptron. It is a feedforward neural network made up of an input layer, one or more hidden layers, and an output layer. Each neuron takes a weighted sum of its inputs, adds a bias, and passes the result through a non-linear activation function. The weights between neurons are trained using backpropagation.
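
The forward pass described above can be sketched by hand. This is a minimal illustration, assuming a small toy dataset from `make_classification` and a tiny network (`mlp_demo`, `X_demo` and `y_demo` are hypothetical names, not part of this notebook's pipeline). It reuses the weights that scikit-learn learned and checks that "weighted sum, add bias, apply activation" reproduces the library's probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data, used only to illustrate the forward pass
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

mlp_demo = MLPClassifier(
    hidden_layer_sizes=(8, 4), activation="relu", max_iter=300, random_state=0
).fit(X_demo, y_demo)

# Hidden layers: weighted sum of inputs, plus bias, then ReLU
activations = X_demo
for weights, bias in zip(mlp_demo.coefs_[:-1], mlp_demo.intercepts_[:-1]):
    activations = np.maximum(activations @ weights + bias, 0.0)

# Output layer: weighted sum plus bias, then sigmoid to get P(class 1)
logits = activations @ mlp_demo.coefs_[-1] + mlp_demo.intercepts_[-1]
manual_proba = (1.0 / (1.0 + np.exp(-logits))).ravel()

print(np.allclose(manual_proba, mlp_demo.predict_proba(X_demo)[:, 1]))
```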

Both an MLP and logistic regression compute a linear combination of the inputs and pass it through a function that maps the result to a probability for the positive class. An MLP with no hidden layers and a sigmoid output is exactly logistic regression. Adding hidden layers stacks many such units (a weighted sum followed by a non-linearity), which makes the model better at estimating highly non-linear relationships. Both logistic regression and an MLP can be trained with gradient descent on a loss function; with an MLP, the gradient has to be propagated back through the layers to update the weights. Adam, the scikit-learn solver used here, is a stochastic gradient-based optimiser.
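
To make the link with logistic regression concrete, the sketch below fits scikit-learn's `LogisticRegression` on hypothetical toy data and then recomputes its probabilities as a single "neuron": a weighted sum plus bias passed through a sigmoid. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data for the comparison
X_lr, y_lr = make_classification(n_samples=200, n_features=5, random_state=0)
logreg = LogisticRegression().fit(X_lr, y_lr)

# One unit: linear combination of the inputs, plus bias, then sigmoid
z = X_lr @ logreg.coef_.ravel() + logreg.intercept_[0]
single_unit_proba = sigmoid(z)

print(np.allclose(single_unit_proba, logreg.predict_proba(X_lr)[:, 1]))
```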

In [2]:
pip install -U matplotlib



In [5]:
# Download the latest version of the dataset
import kagglehub

path = kagglehub.dataset_download("mlg-ulb/creditcardfraud")

print("Path to dataset files:", path)


Path to dataset files: C:\Users\Josep\.cache\kagglehub\datasets\mlg-ulb\creditcardfraud\versions\3


In [2]:
#IMPORT NECESSARY LIBRARIES FOR CLASSIFICATION TASK, MODEL ANALYSIS, FEATURE SCALING AND DECISIONS
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    f1_score
)
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt

In [9]:
#Load the data
df = pd.read_csv('C:/Users/Josep/.cache/kagglehub/datasets/mlg-ulb/creditcardfraud/versions/3/creditcard.csv')

#Explore the data
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

print("\nClass Distribution (BEFORE balancing):")
print(df['Class'].value_counts())
print("\nClass Imbalance Ratio:")
print(f"Fraud: {df['Class'].sum()} ({100 * df['Class'].sum() / len(df):.2f}%)")
print(f"Legitimate: {(df['Class'] == 0).sum()} ({100 * (df['Class'] == 0).sum() / len(df):.2f}%)")


Dataset Shape: (284807, 31)

First few rows:
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -

After inspecting the data, it is clear that the dataset consists primarily of legitimate transactions, with very few fraudulent ones.


This is an issue because, with so few frauds, the model trains almost entirely on legitimate transactions. This is known as class imbalance.


To address this, a technique called SMOTE creates new synthetic fraud examples at points along the line between a fraud transaction and one of its nearest fraud neighbours. This helps the model learn the boundary between fraud and legitimate transactions more clearly.

Additionally, the data includes a Time column, which is dropped here because this is a cross-sectional prediction only.

In [11]:
#Drop Time due to the time series nature of the data;
#perform a cross-sectional analysis instead
X = df.drop(['Time', 'Class'], axis=1)
# predict Class
y = df['Class']
#print features
print("\nFeatures shape:", X.shape)
print("distribution:", y.value_counts().to_dict())


Features shape: (284807, 29)
distribution: {0: 284315, 1: 492}


In [12]:
#Split the data into train and test sets, stratifying on y so that both sets
#keep the same fraud-to-legitimate ratio, and fixing random_state for
#reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  
)

In [13]:
#Print set sizes and fraud ratios to confirm the stratified split
print(f"\nTrain set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Training fraud ratio: {y_train.sum() / len(y_train):.4f}")
print(f"Test fraud ratio: {y_test.sum() / len(y_test):.4f}")


Train set size: 227845
Test set size: 56962
Training fraud ratio: 0.0017
Test fraud ratio: 0.0017


**SMOTE**


In highly imbalanced classification settings, models tend to be biased toward the majority class, leading to poor detection of the minority class (Chawla et al., 2002; Fernández et al., 2018). SMOTE works by generating synthetic minority samples along the line that joins each minority instance to one of its k nearest minority neighbours. This increases minority density without duplicating existing minority samples, which helps to prevent overfitting (Chawla et al., 2002).
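
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not imbalanced-learn's implementation; `smote_like_samples` and `toy_fraud` are hypothetical names, and the toy points are random.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new points on segments between minority points and their neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)
    # Pick a random base point and one of its k neighbours (column 0 is the point itself)
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Move a random fraction of the way from the base point towards the neighbour
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Hypothetical 2-D "fraud" points
toy_fraud = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_like_samples(toy_fraud, n_new=10)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, it always lies inside the region already spanned by the minority class.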

In [14]:
#Apply SMOTE using the 5 nearest fraudulent neighbours
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
print("\nClass Distribution AFTER SMOTE:")
print(y_train_balanced.value_counts())
#Report how much SMOTE expanded the training set
print(f"Training set expanded from {len(y_train)} to {len(y_train_balanced)} samples")
print(f"New training fraud ratio: {y_train_balanced.sum() / len(y_train_balanced):.4f}")


Class Distribution AFTER SMOTE:
Class
0    227451
1    227451
Name: count, dtype: int64
Training set expanded from 227845 to 454902 samples
New training fraud ratio: 0.5000


**Scaling**

MLP models are sensitive to feature scaling. When all input features are on a similar scale, the model trains faster and more reliably because large-magnitude features do not dominate the learning dynamics. Many scikit-learn estimators assume scaled input features. StandardScaler standardises each feature to zero mean and unit variance (scikit‑learn developers, 2006; GeeksforGeeks, 2018).
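
As a small worked example of what standardisation does, the sketch below computes it by hand on hypothetical toy values and compares the result with `StandardScaler`. The key point is that the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
toy_test = np.array([[4.0, 700.0]])

# Statistics are estimated on the training data only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

manual_train = (toy_train - mu) / sigma
manual_test = (toy_test - mu) / sigma  # the test data reuses the training statistics

toy_scaler = StandardScaler().fit(toy_train)
print(np.allclose(manual_train, toy_scaler.transform(toy_train)))
print(np.allclose(manual_test, toy_scaler.transform(toy_test)))
```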

In [15]:
scaler = StandardScaler()
#Fit and scale the training data.
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

#Confirm using mean and standard deviation of the training data
print(f"Training feature mean: {X_train_scaled.mean(axis=0)[:3]}")
print(f"Training feature std: {X_train_scaled.std(axis=0)[:3]}")

Training feature mean: [-6.39782425e-17  0.00000000e+00  0.00000000e+00]
Training feature std: [1. 1. 1.]


**Model Selection and training**

An MLP works well for credit card fraud prediction because it can learn complex non-linear patterns that a simple model like logistic regression won't be able to capture. When compared to other classical models, MLPs have been reported to outperform on precision, recall, accuracy, and F1-score (Edelweiss Applied Science and Technology, 2025).

This specific setup uses three hidden layers (256, 128 and 64 units), the Adam solver, and the ReLU activation function. The learning_rate='adaptive' setting is also passed, although in scikit-learn it only takes effect with the SGD solver. The model also uses L2 regularisation to help prevent overfitting: an extra term in the loss function penalises model complexity and pushes the model towards smaller, simpler parameter values.

At present it isn't known whether these hyperparameters are optimal. This can be determined using a search such as grid search or a genetic algorithm, while being careful to prevent overfitting.



In [20]:
mlp = MLPClassifier(
    hidden_layer_sizes=(256, 128, 64),
    activation='relu',
    solver='adam', #Adam optimiser
    alpha=1e-4, #L2 regularisation strength
    batch_size=256, #mini-batch size; smaller batches can help generalisation
    learning_rate='adaptive', #only used by solver='sgd'; has no effect with adam
    learning_rate_init=1e-3, #initial step size for adam
    early_stopping=True, #stop early if the validation score stops improving
    validation_fraction=0.1,  #hold out 10% of the training data for early stopping
    n_iter_no_change=10,  #stop after 10 epochs without improvement
    max_iter=200,     #maximum number of epochs
    random_state=42, #reproducibility
    verbose=False   #do not print training progress
)


print("\nTraining MLP...")
mlp.fit(X_train_scaled, y_train_balanced) # Fit the model
#training output metrics including model fit
print(f"Training complete.")
print(f"Converged: {mlp.n_iter_} iterations")
print(f"Loss: {mlp.loss_:.4f}")


Training MLP...
Training complete.
Converged: 19 iterations
Loss: 0.0012


In [12]:
#Perform prediction on the test data.
y_pred = mlp.predict(X_test_scaled)
#probability for ROC-AUC calculation
y_pred_proba = mlp.predict_proba(X_test_scaled)[:, 1]

**Training results**

Using a confusion matrix and its counts of true and false predictions, industry-standard assessment metrics can evaluate how well the model performs on unseen data.

A confusion matrix is a table that shows how many fraudulent and legitimate transactions the model identified correctly and incorrectly.

Defining the evaluation metrics:

*Accuracy* is (true positives + true negatives) / all samples

*Precision* is true positives / (true positives + false positives)

*Recall* is true positives / (true positives + false negatives)

*F1-Score* is the harmonic mean of precision and recall. It is equal to 2 * precision * recall / (precision + recall) and summarises the trade-off between precision and recall in a single metric.

*ROC-AUC score* is the area under the curve of the classifier's true positive rate against its false positive rate across all possible probability thresholds. It can be read as the probability that a randomly chosen fraud is scored higher than a randomly chosen legitimate transaction.
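
As a worked example of these definitions, the sketch below plugs in the confusion-matrix counts that this notebook reports for the test set. The "always legitimate" baseline is an added illustration of why accuracy alone says little on data this imbalanced.

```python
# Counts taken from the confusion matrix reported in this notebook
tn, fp, fn, tp = 56823, 41, 17, 81
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# A model that labels every transaction as legitimate gets all negatives right
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f}")            # 0.9990
print(f"Precision: {precision:.4f}")           # 0.6639
print(f"Recall:    {recall:.4f}")              # 0.8265
print(f"F1-Score:  {f1:.4f}")                  # 0.7364
print(f"Baseline accuracy: {baseline_accuracy:.4f}")  # 0.9983
```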





In [13]:
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

print("\nConfusion Matrix:")
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")

print("\nKey Metrics:")
print(f"Accuracy: {(tp + tn) / (tp + tn + fp + fn):.4f}")
print(f"Precision: {tp / (tp + fp):.4f}  (avoid false alarms)")
print(f"Recall: {tp / (tp + fn):.4f}    (catch frauds)")
print(f"F1-Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")

              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00     56864
       Fraud       0.66      0.83      0.74        98

    accuracy                           1.00     56962
   macro avg       0.83      0.91      0.87     56962
weighted avg       1.00      1.00      1.00     56962


Confusion Matrix:
True Negatives: 56823
False Positives: 41
False Negatives: 17
True Positives: 81

Key Metrics:
Accuracy: 0.9990
Precision: 0.6639  (avoid false alarms)
Recall: 0.8265    (catch frauds)
F1-Score: 0.7364
ROC-AUC: 0.9607


**Results interpretation**

From the confusion matrix, it is clear that when the model flags fraud, it is right 66% of the time (81 of 122 alerts). It catches 83% of all actual frauds, missing 17 out of 98.

The F1 score of 0.74 reflects this trade-off: most frauds are caught, but roughly one in three alerts is a false alarm.

A ROC-AUC score of 0.96 shows good ranking ability, in that the model is good at separating fraud from legitimate transactions across thresholds.

These metrics don't suggest severe overfitting, since the model still performs well on the unseen test data.
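
Because `predict` applies a fixed 0.5 cut-off, the precision and recall above describe only one point on the curve. The sketch below shows how a different threshold could be chosen with `precision_recall_curve`. The labels and scores here are randomly generated stand-ins (`y_true_demo`, `scores_demo`), not this notebook's predictions, and in practice the threshold should be picked on a validation set rather than the test set.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

rng = np.random.default_rng(0)

# Hypothetical labels and predicted probabilities (5% positives)
y_true_demo = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]
scores_demo = np.r_[rng.beta(1, 8, 950), rng.beta(6, 2, 50)]

precision, recall, thresholds = precision_recall_curve(y_true_demo, scores_demo)

# precision and recall have one more entry than thresholds, so drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = int(np.argmax(f1))
best_threshold = thresholds[best]

f1_at_half = f1_score(y_true_demo, (scores_demo >= 0.5).astype(int))
print(f"F1 at threshold 0.5: {f1_at_half:.3f}")
print(f"Best threshold {best_threshold:.3f} gives F1 {f1[best]:.3f}")
```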





**Furthering the model**

Adding a dashboard using something like Streamlit, as Sarunas did, alongside data collection would give users a real way to interact with the model, taking it from a basic concept towards production. The data collection would also allow the model to be further refined in future iterations.

To improve model performance further, an optimal set of hyperparameters needs to be chosen through a search algorithm. One option is the aforementioned genetic algorithm, which uses a fitness function to identify the best "traits" of the parents and pass them on. Grid search is used here as a simpler, exhaustive alternative.

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Define parameter grid based on current MLP setup
param_grid = {
    'hidden_layer_sizes': [(128, 64), (256, 128), (256, 128, 64)],
    'alpha': [1e-5, 1e-4, 1e-3],
    'learning_rate_init': [1e-4, 5e-4, 1e-3],
    'batch_size': [128, 256, 512]
}

# MLP base estimator (fix common params)
mlp_base = MLPClassifier(
    activation='relu',
    solver='adam',
    learning_rate='adaptive',
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=200,
    random_state=42
)

# Scorer for imbalanced data (F1 for fraud class)
f1_scorer = make_scorer(f1_score, pos_label=1)

# GridSearchCV with 3-fold CV (StratifiedKFold already imported)
grid_search = GridSearchCV(
    mlp_base, param_grid, cv=3, scoring=f1_scorer, n_jobs=-1, verbose=1
)

# Fit on balanced scaled train data
print("Running GridSearchCV...")
grid_search.fit(X_train_scaled, y_train_balanced)
print("Best params:", grid_search.best_params_)
print("Best CV F1:", grid_search.best_score_)

# Best model for test evaluation
best_mlp = grid_search.best_estimator_
ypred_opt = best_mlp.predict(X_test_scaled)
print("\nOptimal model classification report:")
print(classification_report(y_test, ypred_opt, target_names=['Legitimate', 'Fraud']))


Running GridSearchCV...
Fitting 3 folds for each of 81 candidates, totalling 243 fits
Best params: {'alpha': 1e-05, 'batch_size': 512, 'hidden_layer_sizes': (256, 128, 64), 'learning_rate_init': 0.0005}
Best CV F1: 0.9997604480271484

Optimal model classification report:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00     56864
       Fraud       0.72      0.83      0.77        98

    accuracy                           1.00     56962
   macro avg       0.86      0.91      0.88     56962
weighted avg       1.00      1.00      1.00     56962



In [16]:
import joblib
joblib.dump(grid_search.best_estimator_, 'fraud_mlp_model.joblib')
joblib.dump(scaler, 'fraud_scaler.joblib')

['fraud_scaler.joblib']

**Bibliography**

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.

Fernández, A., García, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61, 863–905.

GeeksforGeeks. (2018, July 1). Feature engineering: Scaling, normalization and standardization.

scikit-learn developers. (2006). StandardScaler — scikit-learn 1.8.0 documentation.

Edelweiss Applied Science and Technology. (2025). Fraud credit card transaction detection using hybrid multilayer perceptron models. Edelweiss Applied Science and Technology.