<a href="https://colab.research.google.com/github/Maedeabm/Fraud-Detection-Using-Neural-Network/blob/main/Fraud_Detection_LSTM_improved_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To improve the model, with a focus on increasing recall for the fraudulent class, we can take the following steps:

  Handling Class Imbalance: Use the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the minority class (fraudulent transactions).
  
  Model Tuning: Explore other architectures or adjust the LSTM model, for example by adding layers or units.
  
  Adjusting Decision Threshold: Instead of the default 0.5 threshold, choose one that raises recall while keeping precision at an acceptable level.
  
  Feature Engineering: Add features or transform existing ones to give the model more signal.

Let's start with the first step: SMOTE. Note that the following steps are code snippets, and it's assumed that you have the necessary libraries and preprocessed data:

In [None]:
!pip install tensorflow
!pip install imbalanced-learn

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
from google.colab import files

uploaded = files.upload()

# Assuming the dataset is named "paysim.csv"
data = pd.read_csv('paysim.csv')

Saving paysim.csv to paysim.csv



Data Preprocessing:

This is a basic preprocessing pass to get started:


In [None]:
# Dropping columns that may not be required for this basic model
data = data.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)

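# Convert categorical columns to numerical values.
# dtype=float keeps the dummy columns numeric; recent pandas versions return
# booleans by default, which would turn the mixed-type .values array into dtype=object.
data = pd.get_dummies(data, columns=['type'], drop_first=True, dtype=float)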

# Normalize the features
scaler = MinMaxScaler()
data[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']] = scaler.fit_transform(data[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']])

# Splitting data into features and target variable
X = data.drop('isFraud', axis=1).values
y = data['isFraud'].values

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reshape input to be 3D for LSTM [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
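
One caveat with the cell above: the scaler is fitted on the whole dataset before the split, so the test rows influence the minimum and maximum used for scaling. Below is a minimal sketch of the leak-free ordering, written in plain NumPy on made-up toy data (the array names and values are illustrative and not part of the notebook). The scaling statistics come from the training rows only and are then reused on the test rows:

```python
import numpy as np

# Toy feature matrix standing in for the amount/balance columns (hypothetical values).
rng = np.random.default_rng(42)
features = rng.uniform(0, 1000, size=(10, 3))

# Split first: 8 rows for training, 2 held out.
train, test = features[:8], features[8:]

# Fit the min-max statistics on the training rows only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns

train_scaled = (train - col_min) / col_range
# Test values may fall slightly outside [0, 1]; that is expected.
test_scaled = (test - col_min) / col_range
```

With scikit-learn, the same ordering would be `scaler.fit(...)` on the training rows followed by `scaler.transform(...)` on the test rows.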


Now, apply SMOTE to the training set only, so that the test set keeps its original class distribution:

In [None]:
from imblearn.over_sampling import SMOTE

# Before applying SMOTE, reshape the data back to 2D
X_train = X_train.reshape(X_train.shape[0], X_train.shape[2])

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# After SMOTE, reshape the data back to 3D for LSTM
X_resampled = X_resampled.reshape((X_resampled.shape[0], 1, X_resampled.shape[1]))
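
To make the oversampling step less of a black box, here is a simplified sketch of the core idea behind SMOTE: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_like_sample` and the toy points are hypothetical, and this is not how imbalanced-learn implements it internally:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five toy minority-class points in 2D (hypothetical values).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    # Euclidean distance from the chosen point to every minority point
    dists = np.linalg.norm(points - points[i], axis=1)
    # Skip position 0 of the sorted order, which is the point itself (distance 0)
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

# Generate four synthetic minority points.
synthetic = np.array([smote_like_sample(minority, k=3, rng=rng) for _ in range(4)])
```

Because every synthetic point is a convex combination of two real minority points, it always lies inside the region already covered by the minority class.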


Model Tuning:

Here we redefine the model with two stacked LSTM layers (100 and 50 units), each followed by 20% dropout:

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(X_resampled.shape[1], X_resampled.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(50))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


Model Training with Callback:

We'll use early stopping so that training halts once the validation loss has not improved for 10 consecutive epochs:

In [None]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)

history = model.fit(X_resampled, y_resampled, epochs=100, batch_size=64, validation_split=0.2, verbose=2, callbacks=[es])


Epoch 1/100
127088/127088 - 704s - loss: 0.2282 - accuracy: 0.8994 - val_loss: 0.3146 - val_accuracy: 0.8666 - 704s/epoch - 6ms/step
Epoch 2/100
127088/127088 - 696s - loss: 0.1598 - accuracy: 0.9339 - val_loss: 0.1955 - val_accuracy: 0.9051 - 696s/epoch - 5ms/step
Epoch 3/100
127088/127088 - 697s - loss: 0.1286 - accuracy: 0.9474 - val_loss: 0.2617 - val_accuracy: 0.8779 - 697s/epoch - 5ms/step
Epoch 4/100
127088/127088 - 698s - loss: 0.1120 - accuracy: 0.9547 - val_loss: 0.0904 - val_accuracy: 0.9636 - 698s/epoch - 5ms/step
Epoch 5/100
127088/127088 - 713s - loss: 0.1022 - accuracy: 0.9586 - val_loss: 0.4526 - val_accuracy: 0.8258 - 713s/epoch - 6ms/step
Epoch 6/100
127088/127088 - 693s - loss: 0.0956 - accuracy: 0.9616 - val_loss: 0.0870 - val_accuracy: 0.9671 - 693s/epoch - 5ms/step
Epoch 7/100
127088/127088 - 700s - loss: 0.0903 - accuracy: 0.9637 - val_loss: 0.2290 - val_accuracy: 0.8923 - 700s/epoch - 6ms/step
Epoch 8/100
127088/127088 - 691s - loss: 0.0870 - accuracy: 0.9651 - 
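
The validation loss in the log above jumps noticeably between epochs. One possible contributor, offered as an assumption rather than a diagnosis: Keras takes `validation_split` from the last rows of the arrays before shuffling, and the SMOTE output (as far as we know) places the synthetic rows after the original ones, so the held-out 20% may consist largely of synthetic fraud samples. A minimal sketch of shuffling the resampled arrays first, shown on toy stand-ins (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np

# Toy stand-ins for the SMOTE output: original rows first, synthetic fraud rows last.
X_toy = np.arange(10 * 3, dtype=float).reshape(10, 1, 3)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# One shared permutation keeps every feature row aligned with its label.
rng = np.random.default_rng(42)
perm = rng.permutation(len(y_toy))
X_shuffled, y_shuffled = X_toy[perm], y_toy[perm]
```

Passing the shuffled arrays to `model.fit(..., validation_split=0.2)` would then hold out a random 20% rather than the tail of the array.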

Adjusting Decision Threshold:

After training, instead of using the default 0.5 threshold for predictions, find the threshold that maximizes the F1 score, which balances precision and recall:

In [None]:
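from sklearn.metrics import precision_recall_curve

# Predicted fraud probabilities, flattened from shape (n, 1) to (n,)
y_prob = model.predict(X_test).ravel()
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# F1 score at each threshold. precision and recall have one more entry than
# thresholds, so drop the last point; the small epsilon avoids division by zero.
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

# Get threshold for the best F1 score
best_threshold = thresholds[np.argmax(f1_scores)]

# precision_recall_curve counts scores >= threshold as positive
y_pred = (y_prob >= best_threshold).astype(int)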
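
If the goal is recall specifically, an alternative to maximizing F1 is to pick the lowest threshold that still meets a precision floor, since recall can only grow as the threshold drops. The sketch below illustrates this on toy labels and scores. The helper `threshold_for_min_precision` and the 0.75 floor are hypothetical, and the simple loop is meant for illustration rather than for a test set with over a million rows:

```python
import numpy as np

def threshold_for_min_precision(y_true, y_score, min_precision):
    """Return (threshold, precision, recall) for the lowest threshold meeting the floor, or None."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float).ravel()
    n_pos = np.sum(y_true == 1)
    # np.unique returns the candidate thresholds in ascending order,
    # so the first one that satisfies the floor has the highest recall.
    for t in np.unique(y_score):
        pred = y_score >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is one of the scores
        if precision >= min_precision:
            return float(t), float(precision), float(tp / n_pos)
    return None

# Toy labels and predicted probabilities (hypothetical values).
y_true_toy = [0, 0, 1, 0, 1, 1, 0, 1]
y_score_toy = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

result = threshold_for_min_precision(y_true_toy, y_score_toy, min_precision=0.75)
print(result)
```

On the toy data this selects the threshold 0.6, where three of the four positives are caught and one of the four flagged rows is a false alarm.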




Now, you can evaluate the model using this optimized threshold:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.83      0.71      0.77      1620

    accuracy                           1.00   1272524
   macro avg       0.92      0.85      0.88   1272524
weighted avg       1.00      1.00      1.00   1272524
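
A quick arithmetic check helps when reading this report. The sketch below uses only the numbers printed above to show why the 1.00 accuracy says little here, and how the fraud-class F1 follows from its precision and recall:

```python
# Figures copied from the classification report above.
n_legit, n_fraud = 1270904, 1620
precision_fraud, recall_fraud = 0.83, 0.71

# Accuracy of a trivial model that labels every transaction as legitimate.
baseline_accuracy = n_legit / (n_legit + n_fraud)

# F1 is the harmonic mean of precision and recall.
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"fraud F1: {f1_fraud:.2f}")
```

The trivial baseline already scores about 0.9987 accuracy, which rounds to 1.00, so the fraud-class row is the one to watch.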



With the F1-optimal threshold, the model reaches 0.83 precision and 0.71 recall on the fraud class. To push recall higher, you can iterate on these steps: lower the threshold further, fine-tune hyperparameters, or engineer additional features.