# Beginning by generating synthetic data using Python and employing techniques like rotation, scaling, and flipping for data augmentation to create a dataset with increased variability.

# Step 1: Data Acquisition and Initial Analysis

The dataset used is the Credit Card Fraud Detection dataset. I shortened it from 284,807 transactions to 45,198.

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv("/content/creditcard_shortened.csv")

# Display the first few rows of the dataset and its summary statistics
data_head = data.head()
data_description = data.describe()

data_head, data_description


(   Time        V1        V2        V3        V4        V5        V6        V7  \
 0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
 1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
 2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
 3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
 4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   
 
          V8        V9  ...       V21       V22       V23       V24       V25  \
 0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
 1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
 2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
 3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
 4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   
 
         V26      

From the summary statistics, we see that there are 45,198 transactions in this shortened dataset.
The mean value of the Class column is approximately 0.0031, indicating a small fraction of fraudulent transactions in the dataset (as expected, since fraud is typically rare).
The Amount feature has a wide range, with values from 0 to 7879.42.
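The observations above can be checked directly: for a 0/1 label, the column mean is the share of the positive class. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not rows from the dataset); the same calls would apply to `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `data`
toy = pd.DataFrame({
    "Amount": [0.0, 12.5, 99.9, 7879.42, 3.2, 45.0, 1.0, 250.0],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 1],
})

# Absolute and relative class frequencies
class_counts = toy["Class"].value_counts()
class_share = toy["Class"].value_counts(normalize=True)

# For a 0/1 label, the column mean equals the share of class 1
fraud_fraction = toy["Class"].mean()

print(class_counts)
print(class_share)
print(f"Fraud fraction: {fraud_fraction:.4f}")
print(f"Amount range: {toy['Amount'].min()} to {toy['Amount'].max()}")
```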

Given that the dataset is tabular, image-oriented techniques like rotation, scaling, and flipping are not applicable. Instead, we will use a Generative Adversarial Network (GAN) to generate synthetic data for the minority class (fraudulent transactions).

The following steps are performed:

1. Separate the dataset into the majority class (non-fraudulent) and minority class (fraudulent).
2. Implement a GAN to generate synthetic samples for the minority class.

In [2]:
# Separate the data into majority and minority classes
majority_data = data[data['Class'] == 0]
minority_data = data[data['Class'] == 1]

# Display the number of samples in each class
num_majority = len(majority_data)
num_minority = len(minority_data)

num_majority, num_minority


(45057, 141)

The dataset contains:

45,057 non-fraudulent transactions (majority class)

141 fraudulent transactions (minority class)

Next, synthetic data will be generated using a GAN.

A Generative Adversarial Network (GAN) comprises two neural networks:

1. Generator: Tries to generate synthetic data samples.

2. Discriminator: Tries to distinguish between real and synthetic samples.

I will train a GAN on the minority class to generate synthetic fraudulent transactions.

In [9]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam

# Define GAN components
def build_generator(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(128, input_dim=input_dim))
    model.add(LeakyReLU(alpha=0.2))
    model.add(BatchNormalization(momentum=0.8))
    model.add(Dense(256))
    model.add(LeakyReLU(alpha=0.2))
    # tanh bounds every generated feature to the range [-1, 1]
    model.add(Dense(output_dim, activation='tanh'))
    return model

def build_discriminator(input_dim):
    model = Sequential()
    model.add(Dense(256, input_dim=input_dim))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(128))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5), metrics=['accuracy'])
    return model

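def build_gan(generator, discriminator):
    # Stack generator and discriminator; the discriminator is frozen in this
    # combined model so that only the generator is updated through it
    z = tf.keras.Input(shape=(generator.input_shape[1],))
    generated = generator(z)
    discriminator.trainable = False
    validity = discriminator(generated)
    model = Model(z, validity)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))
    return model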

# Create GAN
generator = build_generator(input_dim=30, output_dim=30)
discriminator = build_discriminator(input_dim=30)
gan = build_gan(generator, discriminator)

# Training parameters
epochs = 3000
batch_size = 128
half_batch = batch_size // 2

real_labels = np.ones((half_batch, 1))
real_labels_full = np.ones((batch_size, 1))
fake_labels = np.zeros((half_batch, 1))

# Lists to store loss values
d_losses = []
g_losses = []

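# Real minority-class features, extracted once instead of on every iteration
real_features = minority_data.drop(columns=['Class']).values

# Training GAN (each "epoch" here is a single batch update)
for epoch in range(epochs):

    # Train Discriminator on half a batch of real and half a batch of fake samples
    idx = np.random.randint(0, num_minority, half_batch)
    real_samples = real_features[idx]

    noise = np.random.normal(0, 1, (half_batch, 30))
    fake_samples = generator.predict(noise, verbose=0)

    d_loss_real = discriminator.train_on_batch(real_samples, real_labels)
    d_loss_fake = discriminator.train_on_batch(fake_samples, fake_labels)
    # Each result is [loss, accuracy]; average them element-wise
    d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)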

    # Train Generator
    noise = np.random.normal(0, 1, (batch_size, 30))
    g_loss = gan.train_on_batch(noise, real_labels_full)

    d_losses.append(d_loss[0])
    g_losses.append(g_loss)

    if epoch % 500 == 0:
        print(f"Epoch {epoch}/{epochs} | Discriminator Loss: {d_loss[0]} | Generator Loss: {g_loss}")

# Generate synthetic samples
num_synthetic_samples = num_majority - num_minority
noise = np.random.normal(0, 1, (num_synthetic_samples, 30))
synthetic_samples = generator.predict(noise)

synthetic_samples.shape


Epoch 0/3000 | Discriminator Loss: 0.330177366733551 | Generator Loss: 0.7728040218353271
Epoch 500/3000 | Discriminator Loss: 0.06007399696682114 | Generator Loss: 2.1892385482788086
Epoch 1000/3000 | Discriminator Loss: 0.013565849513883165 | Generator Loss: 3.625011444091797
Epoch 1500/3000 | Discriminator Loss: 0.0043829187325172825 | Generator Loss: 4.730654716491699
Epoch 2000/3000 | Discriminator Loss: 0.0018794965567394684 | Generator Loss: 5.598355770111084
Epoch 2500/3000 | Discriminator Loss: 0.0008221004488859762 | Generator Loss: 6.416855812072754


(44916, 30)
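The generator's final layer uses `tanh`, so its outputs are confined to [-1, 1], while the summary statistics show `Amount` reaching 7879.42. In the log above, the discriminator loss falls toward zero while the generator loss rises, which is consistent with the discriminator separating real from generated rows easily. One common way to reconcile the ranges is to scale the real samples into [-1, 1] before training and invert the scaling on the generated samples. The following is a minimal sketch of that idea, not the method used in this notebook; `real_features` and `fake_scaled` are hypothetical stand-ins filled with random numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real minority-class feature matrix
real_features = rng.normal(loc=50.0, scale=20.0, size=(141, 30))

# Map every feature into [-1, 1], the range a tanh output layer can produce
scaler = MinMaxScaler(feature_range=(-1, 1))
real_scaled = scaler.fit_transform(real_features)

# ... the GAN would be trained on `real_scaled` instead of the raw values ...

# Hypothetical stand-in for generator.predict(noise): any values in [-1, 1]
fake_scaled = np.tanh(rng.normal(size=(5, 30)))

# Map generated samples back to the original feature units
fake_original_units = scaler.inverse_transform(fake_scaled)

print(real_scaled.min(), real_scaled.max())
print(fake_original_units.shape)
```

With this setup, the synthetic rows are expressed in the same units as the original columns before they are concatenated with the real data.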

In [11]:
# Generate synthetic samples
num_synthetic_samples = num_majority - num_minority
noise = np.random.normal(0, 1, (num_synthetic_samples, 30))
synthetic_samples = generator.predict(noise)

# Convert synthetic samples to a DataFrame
synthetic_df = pd.DataFrame(synthetic_samples, columns=minority_data.columns[:-1])
synthetic_df['Class'] = 1

print(synthetic_df.shape)
print(synthetic_df.isna().sum())

(44916, 31)
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
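A missing-value check confirms the synthetic frame is complete, but it does not say whether the generated rows resemble real fraud. A possible sanity check is to compare per-feature statistics and a two-sample Kolmogorov-Smirnov statistic between real and synthetic rows. This is a minimal sketch under assumptions: `real_df`, `good_synth`, `bad_synth`, and the helper `compare_distributions` are hypothetical and built from random numbers; in the notebook, the inputs would be the real minority features and `synthetic_df`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
cols = ["V1", "V2", "Amount"]

# Hypothetical stand-ins for real fraud rows and two sets of generated rows
real_df = pd.DataFrame(rng.normal(0, 1, (141, 3)), columns=cols)
good_synth = pd.DataFrame(rng.normal(0, 1, (1000, 3)), columns=cols)
bad_synth = pd.DataFrame(rng.uniform(-1, 1, (1000, 3)) + 3.0, columns=cols)

def compare_distributions(real, synth):
    """Per-feature mean/std and KS statistic (0 = same distribution, 1 = disjoint)."""
    rows = []
    for col in real.columns:
        result = ks_2samp(real[col], synth[col])
        rows.append({
            "feature": col,
            "real_mean": real[col].mean(),
            "synth_mean": synth[col].mean(),
            "real_std": real[col].std(),
            "synth_std": synth[col].std(),
            "ks_stat": result.statistic,
        })
    return pd.DataFrame(rows).set_index("feature")

good_report = compare_distributions(real_df, good_synth)
bad_report = compare_distributions(real_df, bad_synth)

print(good_report.round(3))
print(bad_report.round(3))
```

A KS statistic close to 1 for most features would suggest that the generated rows occupy a different region of feature space than the real ones.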


# Next, integrate this newly generated synthetic data with the existing dataset, combining them through concatenation to create a hybrid dataset and train the models.

In [19]:
# Combine synthetic data with original data (ignore_index avoids duplicate row labels)
combined_data = pd.concat([data, synthetic_df], axis=0, ignore_index=True)
print(combined_data['Class'].isna().sum())

# Drop rows with a missing Class label
combined_data.dropna(subset=['Class'], inplace=True)
print(combined_data['Class'].isna().sum())

combined_data.shape

23719
0


(90114, 31)

In [20]:
# Split the combined data into training and validation sets
from sklearn.model_selection import train_test_split
X = combined_data.drop(columns=['Class'])
y = combined_data['Class']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a classifier (e.g., a Random Forest classifier)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Validate the classifier
y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      9012
         1.0       1.00      1.00      1.00      9011

    accuracy                           1.00     18023
   macro avg       1.00      1.00      1.00     18023
weighted avg       1.00      1.00      1.00     18023



Now I will implement k-fold cross-validation to get a more robust understanding of the model's performance. Cross-validation involves partitioning the dataset into multiple subsets (folds). The model is trained on k-1 of these folds and validated on the remaining fold. This process is repeated k times, each time using a different fold for validation. The average performance across these k iterations gives a more robust measure of the model's performance. I will use the StratifiedKFold class from scikit-learn to ensure each fold has an approximately equal distribution of both classes.

After cross-validation, I will evaluate a classifier trained on the combined dataset (original + synthetic data) on a portion of the original dataset, which I will use as a pseudo-test set.

In [24]:
# Cross-Validation

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Define the cross-validation procedure
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Lists to store performance metrics
precision_list = []
recall_list = []
f1_list = []
accuracy_list = []

# Split combined data into features and target variable
X_combined = combined_data.drop(columns=['Class'])
y_combined = combined_data['Class']

# Perform cross-validation
for train_idx, val_idx in cv.split(X_combined, y_combined):
    X_train_fold, X_val_fold = X_combined.iloc[train_idx], X_combined.iloc[val_idx]
    y_train_fold, y_val_fold = y_combined.iloc[train_idx], y_combined.iloc[val_idx]

    # Train the classifier
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train_fold, y_train_fold)

    # Predict on the validation fold
    y_pred_fold = clf.predict(X_val_fold)

    # Calculate performance metrics
    precision_list.append(precision_score(y_val_fold, y_pred_fold))
    recall_list.append(recall_score(y_val_fold, y_pred_fold))
    f1_list.append(f1_score(y_val_fold, y_pred_fold))
    accuracy_list.append(accuracy_score(y_val_fold, y_pred_fold))

# Calculate average performance metrics
avg_precision = np.mean(precision_list)
avg_recall = np.mean(recall_list)
avg_f1 = np.mean(f1_list)
avg_accuracy = np.mean(accuracy_list)

print("Average Precision:", avg_precision)
print("Average Recall:", avg_recall)
print("Average F1 Score:", avg_f1)
print("Average Accuracy:", avg_accuracy)


Average Precision: 0.9998224071999651
Average Recall: 0.9995117226895787
Average F1 Score: 0.9996670181328102
Average Accuracy: 0.9996670889703303


# Assess the models' performance using trustworthiness metrics like accuracy, precision, recall, and F1 score.

In [25]:
from sklearn.model_selection import train_test_split

# Drop rows with NaN values from the original dataset
cleaned_data = data.dropna()

# Split the cleaned data to create the pseudo-test set
X_cleaned = cleaned_data.drop(columns=['Class'])
y_cleaned = cleaned_data['Class']

X_train_cleaned, X_test_pseudo, y_train_cleaned, y_test_pseudo = train_test_split(
    X_cleaned, y_cleaned, test_size=0.2, random_state=42, stratify=y_cleaned)

# Predict on the pseudo-test set
# (`clf` is the classifier fitted on the combined data in the last cross-validation fold)
y_pred_pseudo = clf.predict(X_test_pseudo)

# Calculate performance metrics for the pseudo-test set
pseudo_precision = precision_score(y_test_pseudo, y_pred_pseudo)
pseudo_recall = recall_score(y_test_pseudo, y_pred_pseudo)
pseudo_f1 = f1_score(y_test_pseudo, y_pred_pseudo)
pseudo_accuracy = accuracy_score(y_test_pseudo, y_pred_pseudo)

pseudo_precision, pseudo_recall, pseudo_f1, pseudo_accuracy

(0.9642857142857143,
 0.9642857142857143,
 0.9642857142857143,
 0.9997787610619469)

# An important step is enabling a comparative analysis to gain insights into how synthetic data impacts the trustworthiness of AI models.

In [26]:
# Train a classifier on the original (non-augmented) training data
clf_original = RandomForestClassifier(n_estimators=100)
clf_original.fit(X_train_cleaned, y_train_cleaned)


# Predict on the pseudo-test set
y_pred_original = clf_original.predict(X_test_pseudo)


# Calculate performance metrics for the model trained only on original data
original_precision = precision_score(y_test_pseudo, y_pred_original)
original_recall = recall_score(y_test_pseudo, y_pred_original)
original_f1 = f1_score(y_test_pseudo, y_pred_original)
original_accuracy = accuracy_score(y_test_pseudo, y_pred_original)

original_precision, original_recall, original_f1, original_accuracy

(0.92, 0.8214285714285714, 0.8679245283018867, 0.9992256637168142)

# 1.   Precision (Class 1):
92% - Out of all the transactions the model predicted as fraudulent, 92.00% were actually fraudulent.
# 2.   Recall (Class 1):
82.1% - Out of all the actual fraudulent transactions in the test set, the model correctly identified 82.1% of them.
# 3.   F1 Score (Class 1):
86.8% - This is the harmonic mean of precision and recall, providing a balance between the two metrics.
# 4.   Accuracy:
99.92% - Out of all transactions, the model correctly predicted 99.92% of them.
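To make the four definitions concrete, they can be computed from confusion-matrix counts. The counts below are an assumption: they are not printed anywhere in the notebook, but were inferred from the reported ratios and a 20% test split of 45,198 rows (9,040 rows, 28 of them fraudulent). The helper `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)          # share of predicted frauds that are real
    recall = tp / (tp + fn)             # share of real frauds that were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Assumed counts, inferred from the reported metrics (not shown in the output)
precision, recall, f1, accuracy = metrics_from_counts(tp=23, fp=2, fn=5, tn=9010)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

Under these assumed counts, accuracy stays above 99.9% even though 5 of 28 frauds are missed, which is why precision and recall are the more informative metrics on imbalanced data.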

# **Comparative Analysis:**
**Model trained on the original dataset (without synthetic data):**

Precision: 92.00%

Recall: 82.14%

F1 Score: 86.79%

Accuracy: 99.92%


**Model trained on the combined dataset (original + synthetic data):**

Precision: 96.43%

Recall: 96.43%

F1 Score: 96.43%

Accuracy: 99.98%


From the comparison, we can see that the model trained with synthetic data outperforms the one trained without it on every metric, especially recall (96.43% vs. 82.14%) and precision (96.43% vs. 92.00%).

On this pseudo-test set, the use of synthetic data led to better identification of fraudulent transactions, which suggests its utility in enhancing the trustworthiness and performance of the model in imbalanced scenarios like credit card fraud detection.
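One point to keep in mind when reading this comparison: in the cells above, `X_test_pseudo` is split from `data`, and `data` is also part of `combined_data`, so the classifier trained on the combined data may already have seen those rows. A variant that avoids this overlap holds out the test set first and generates synthetic rows from the training portion only. The following is a minimal sketch under assumptions: the data comes from `make_classification`, and Gaussian jitter around real minority rows is a hypothetical placeholder for the trained GAN generator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced toy data standing in for the original dataset
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.97, 0.03], random_state=0
)

# 1. Hold out the test set BEFORE any synthetic data is generated
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Create synthetic minority rows from the training portion only.
#    Gaussian jitter is a placeholder for sampling from a trained generator.
minority_train = X_train[y_train == 1]
n_needed = int((y_train == 0).sum() - (y_train == 1).sum())
picks = rng.integers(0, len(minority_train), size=n_needed)
X_synth = minority_train[picks] + rng.normal(0, 0.05, size=(n_needed, X.shape[1]))

# 3. Augment the training set; the test set stays purely original
X_aug = np.vstack([X_train, X_synth])
y_aug = np.concatenate([y_train, np.ones(n_needed, dtype=y_train.dtype)])

clf_aug = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_aug, y_aug)
print(classification_report(y_test, clf_aug.predict(X_test)))
```

Training a second classifier on `X_train` alone and scoring both on the same `X_test` would then give a like-for-like comparison.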