# Fraud Detection with GAN and Random Forest

---

## Objective
This notebook addresses class imbalance in fraud detection using:
1. **GANs** to generate synthetic minority class samples.
2. **Random Forest Classifier** to evaluate performance with GAN-augmented data.
3. **Evaluation Metrics** (precision, recall, F1-score) to assess model improvements.

---

## Dataset
- **Source**: 'creditcard.csv'
- **Features**: 'Time', 'Amount', 'V1' to 'V28' (PCA-transformed features).
- **Target ('Class')**:
  - '0': Non-fraudulent
  - '1': Fraudulent

---

## Workflow

1. **Generate Synthetic Data**:
   - Train a GAN to create realistic synthetic samples for the minority class.
   - Combine real and synthetic data to form a balanced dataset.

2. **Baseline Evaluation**:
   - Use Random Oversampling and train a Random Forest classifier.
   - Evaluate performance on the original test set.

3. **Reload Balanced Dataset**:
   - Verify class balance and prepare the data for modeling.


4. **GAN-Augmented Evaluation**:
   - Retrain the Random Forest with the GAN-balanced dataset.
   - Compare metrics against the baseline.

5. **Performance Insights**:
   - Assess the impact of GANs in improving fraud detection.
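
The "combine real and synthetic data" part of step 1 can be sketched on toy data as follows. This is a minimal illustration, not the notebook's `GANDataBalancer`: the helper `combine_with_synthetic` and the toy frame are hypothetical, and the "synthetic" rows are plain Gaussian noise around the fraud mean standing in for generator output.

```python
import numpy as np
import pandas as pd


def combine_with_synthetic(X, y, synthetic_minority, minority_label=1):
    """Append synthetic minority rows to (X, y) and shuffle the result."""
    synthetic_X = pd.DataFrame(synthetic_minority, columns=X.columns)
    synthetic_y = pd.Series(minority_label, index=synthetic_X.index, name=y.name)
    X_all = pd.concat([X, synthetic_X], ignore_index=True)
    y_all = pd.concat([y, synthetic_y], ignore_index=True)
    # Shuffle so the synthetic rows are not clustered at the end.
    order = np.random.default_rng(0).permutation(len(X_all))
    return (
        X_all.iloc[order].reset_index(drop=True),
        y_all.iloc[order].reset_index(drop=True),
    )


# Toy data: 8 legitimate rows, 2 fraud rows, two features.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(10, 2)), columns=["V1", "V2"])
y_toy = pd.Series([0] * 8 + [1] * 2, name="Class")

# Stand-in for generator output: NOT a GAN, just noise around the fraud mean.
n_needed = int((y_toy == 0).sum() - (y_toy == 1).sum())  # rows needed for 1:1
fraud_mean = X_toy[y_toy == 1].mean().to_numpy()
fake_fraud = fraud_mean + 0.1 * rng.normal(size=(n_needed, 2))

X_bal, y_bal = combine_with_synthetic(X_toy, y_toy, fake_fraud)
print(y_bal.value_counts().to_dict())
```

Only the training split should be augmented this way, so that the test set keeps the original class distribution.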


In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model, Sequential
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score
)
from imblearn.over_sampling import RandomOverSampler
from utils import GANDataBalancer

In [10]:
Df = pd.read_csv('creditcard.csv')
Df

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [3]:
Df.describe()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [4]:
Df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [6]:
# Count missing values per column
Df.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [7]:
# Detect outliers using IQR
outlier_info = {}

for column in Df.columns:
    if column != 'Class':  # Skip the target column (Class)
        Q1 = Df[column].quantile(0.25)
        Q3 = Df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Identify outliers
        outliers = Df[(Df[column] < lower_bound) | (Df[column] > upper_bound)]
        outlier_info[column] = {
            "outlier_count": outliers.shape[0],
            "lower_bound": lower_bound,
            "upper_bound": upper_bound,
        }

# Convert outlier summary to DataFrame
outlier_summary = pd.DataFrame(outlier_info).T
outlier_summary.columns = ['Outlier Count', 'Lower Bound', 'Upper Bound']

print(outlier_summary)

        Outlier Count   Lower Bound    Upper Bound
Time              0.0 -73477.000000  266999.000000
V1             7062.0     -4.274396       4.669664
V2            13526.0     -2.701961       2.907135
V3             3363.0     -3.766705       3.903536
V4            11148.0     -3.236612       3.131313
V5            12295.0     -2.646882       2.567212
V6            22965.0     -2.518586       2.148856
V7             8948.0     -2.240844       2.257204
V8            24134.0     -1.012593       1.131309
V9             8283.0     -2.503452       2.457494
V10            9496.0     -2.019449       1.937947
V11             780.0     -3.015626       2.992725
V12           15348.0     -1.941286       2.153952
V13            3368.0     -2.615106       2.629071
V14           14149.0     -1.803660       1.871236
V15            2894.0     -2.430442       2.496378
V16            8184.0     -1.955036       2.010296
V17            7420.0     -1.808883       1.724810
V18            7533.0     -1.99
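
The IQR rule used in the loop above flags a value as an outlier when it falls outside the interval from Q1 − 1.5·IQR to Q3 + 1.5·IQR. As a check by hand, the quartiles that `Df.describe()` reports for `Time` reproduce the bounds in the first row of the summary:

```python
# Quartiles of 'Time' as printed by Df.describe() earlier in the notebook.
q1, q3 = 54201.5, 139320.5

iqr = q3 - q1                   # interquartile range
lower_bound = q1 - 1.5 * iqr    # lower fence
upper_bound = q3 + 1.5 * iqr    # upper fence

print(iqr, lower_bound, upper_bound)

# Observed range of 'Time' from Df.describe(). It lies inside the fences,
# which is why the outlier count for 'Time' is 0.
time_min, time_max = 0.0, 172792.0
print(lower_bound <= time_min and time_max <= upper_bound)
```

The heavy-tailed `V` columns behave differently: their fences sit a few units from zero while their extremes reach far beyond, so thousands of rows are flagged.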

In [11]:
target = "Class"
100*Df[target].value_counts()/Df.shape[0]

0    99.827251
1     0.172749
Name: Class, dtype: float64
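
To make the imbalance concrete, the percentages above can be turned back into row counts. The short calculation below uses only figures printed in this notebook (the row count from `Df.info()` and the fraud percentage) and also shows why accuracy alone is uninformative here:

```python
n_rows = 284807        # total rows, from Df.info()
fraud_pct = 0.172749   # share of Class == 1, from value_counts()

n_fraud = round(n_rows * fraud_pct / 100)
n_legit = n_rows - n_fraud
print(n_fraud, n_legit)
print(f"about 1 fraud per {n_legit / n_fraud:.0f} legitimate transactions")

# A model that always predicts "non-fraud" is right on every legitimate row
# and wrong on every fraud row.
trivial_accuracy = n_legit / n_rows
print(f"accuracy of always predicting 0: {trivial_accuracy:.4f}")
```

This is the reason the notebook reports precision, recall, and F1 for the fraud class rather than relying on accuracy.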

In [12]:
X = Df.drop(columns=[target])
y = Df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
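
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both splits, which matters when the positive class is this rare. A small sketch on hypothetical toy labels (not the real data) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 1,000 rows with 2% positives (hypothetical, for illustration).
y_toy = np.array([0] * 980 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the 2% positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

Without `stratify`, a random 20% split of a dataset with so few positives could end up with noticeably more or fewer fraud cases in the test set, making the metrics less stable.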

# Baseline Evaluation with Random Oversampling

Before using GANs to augment the dataset, we establish a **baseline** with a simple, widely used resampling technique: **Random Oversampling**.

This step involves:
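1. Using **Random Oversampling** to rebalance the training set by duplicating minority-class samples until they reach the ratio set by `sampling_strategy` (0.05 minority-to-majority in the cell that follows, not a full 1:1 balance).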
2. Training a **Random Forest Classifier** on the oversampled dataset to predict fraud cases.
3. Evaluating the model's performance on the original, unmodified test set using:
   - **Confusion Matrix**: To analyze prediction outcomes for fraud and non-fraud cases.
   - **Precision, Recall, and F1-Score**: To measure the classifier's effectiveness in detecting fraudulent transactions.

The results obtained here will serve as a benchmark for assessing the **effectiveness of GAN-generated synthetic samples** in improving fraud detection. By comparing the baseline metrics with those achieved using GAN-augmented data, we can quantify the added value of GANs.
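
The baseline cell uses `sampling_strategy=0.05`, which for a float value means the desired minority-to-majority ratio after resampling. The arithmetic below estimates what that implies for the training split. The class counts are derived from figures printed elsewhere in this notebook, and the floor rounding is an assumption about how the target count is computed.

```python
import math

# 284,807 rows with 0.172749% fraud gives 492 fraud rows overall; the
# test-set report lists 56,864 legitimate and 98 fraud rows.
n_fraud_train = 492 - 98                  # fraud rows left for training
n_legit_train = (284807 - 492) - 56864    # legitimate rows left for training

ratio = 0.05  # desired minority / majority ratio after resampling

# Assumption: the target minority count is floor(ratio * majority count).
target_fraud = math.floor(ratio * n_legit_train)
duplicates_added = target_fraud - n_fraud_train
fraud_share = target_fraud / (target_fraud + n_legit_train)

print(n_fraud_train, n_legit_train)
print(target_fraud, duplicates_added, f"{fraud_share:.2%}")
```

Under these assumptions each real fraud row is repeated many times over, which is the main motivation for trying a generative alternative.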

In [5]:
# Baseline Evaluation with Random Oversampling
oversampler = RandomOverSampler(sampling_strategy=0.05, random_state=42)  
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)

model = RandomForestClassifier(random_state=42)
model.fit(X_train_oversampled, y_train_oversampled)

y_pred = model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
classification_summary = classification_report(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Summary:")
print(classification_summary)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Confusion Matrix:
[[56860     4]
 [   18    80]]

Classification Summary:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.95      0.82      0.88        98

    accuracy                           1.00     56962
   macro avg       0.98      0.91      0.94     56962
weighted avg       1.00      1.00      1.00     56962

Precision: 0.95
Recall: 0.82
F1 Score: 0.88
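
The fraud-class metrics follow directly from the confusion matrix above. As a worked check, using scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
# Confusion matrix from the baseline run:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 56860, 4
fn, tp = 18, 80

precision = tp / (tp + fp)   # of the flagged transactions, how many are fraud
recall = tp / (tp + fn)      # of the fraud transactions, how many are caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

In words: 80 of the 84 flagged transactions were fraud, and 18 of the 98 fraud cases in the test set were missed.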


In [13]:
# GAN-Augmented Evaluation: oversample the minority class with GANDataBalancer
oversampler = GANDataBalancer(sampling_strategy=0.05, random_state=42, latent_dim=100)  
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X=X_train, y=y_train)

model = RandomForestClassifier(random_state=42)
model.fit(X_train_oversampled, y_train_oversampled)

y_pred = model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
classification_summary = classification_report(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Summary:")
print(classification_summary)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Epoch 0/1000 | Discriminator Loss: [array(3522.4604, dtype=float32), array(0.4765625, dtype=float32)] | Generator Loss: [array(3522.4604, dtype=float32), array(3522.4604, dtype=float32), array(0.4765625, dtype=float32)]