# Real-Time Fraud Detection in Financial Transactions

This project uses a Random Forest classifier to detect fraudulent financial transactions. The model is trained on synthetic data that imitates real-world transaction patterns and fraud behavior.

## Key Components:
- **Synthetic Data Generation**: Generated transaction records with attributes such as transaction amount, user ID, merchant ID, location, timestamp, and a fraud label.
- **Feature Engineering**: Derived the hour and day of the week from the transaction time and standardized the transaction amount.
- **Model Training**: Trained a Random Forest classifier to separate fraudulent from legitimate transactions.
- **Real-Time Simulation**: Tested the model in a simulated real-time environment that predicts fraud on incoming transactions.

This project can be expanded by incorporating real-world datasets and improving the model’s generalization for use in live systems.
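
The real-time simulation step listed above can be sketched as follows. This is a minimal illustration rather than the project's actual implementation: the `score_stream` helper, the 0.5 alert threshold, and the stand-in `make_classification` data are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 5 numeric features, about 5% positives.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.95, 0.05], random_state=42)
X_hist, X_live, y_hist, y_live = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train once on the "historical" transactions.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_hist, y_hist)

def score_stream(model, transactions, threshold=0.5):
    """Hypothetical helper: score transactions one at a time, as a live feed would deliver them."""
    for i, tx in enumerate(transactions):
        # predict_proba expects a 2D array, so reshape the single row.
        fraud_prob = model.predict_proba(tx.reshape(1, -1))[0, 1]
        yield i, fraud_prob, fraud_prob >= threshold

# Collect the transactions whose fraud probability reaches the threshold.
alerts = [(i, p) for i, p, flagged in score_stream(model, X_live) if flagged]
print(f"Scored {len(X_live)} transactions, flagged {len(alerts)}")
```

Scoring row by row keeps the example close to how a live feed behaves; a production system would more likely batch requests and tune the threshold against the cost of missed fraud versus false alerts.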


In [3]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from faker import Faker
from sklearn.datasets import make_classification

In [14]:
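# Initialize Faker for generating realistic data
fake = Faker()

# Seed Faker and NumPy so the generated dataset is reproducible across runs
Faker.seed(42)
np.random.seed(42)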

# Function to generate synthetic financial transaction data
def generate_synthetic_data(num_records=1000, fraud_percentage=0.05):
    # Generate random transaction data
    X, y = make_classification(n_samples=num_records, n_features=5, n_informative=3, n_redundant=1, 
                               n_classes=2, weights=[1 - fraud_percentage, fraud_percentage], flip_y=0.01, random_state=42)
    
    # Create DataFrame
    df = pd.DataFrame(X, columns=[f'feature_{i+1}' for i in range(5)])
    
    # Adding some realistic columns for financial transactions
    df['transaction_id'] = [f'TX{100000 + i}' for i in range(num_records)]
    df['user_id'] = [fake.uuid4() for _ in range(num_records)]
    df['transaction_amount'] = np.random.lognormal(mean=3, sigma=1, size=num_records).round(2)  # Lognormal for skewed amounts
    df['transaction_time'] = [fake.date_time_this_year().strftime('%Y-%m-%d %H:%M:%S') for _ in range(num_records)]
    df['location'] = [fake.city() for _ in range(num_records)]
    df['merchant_id'] = [fake.uuid4() for _ in range(num_records)]
    df['label'] = y  # Fraud label (0 or 1)
    
    # Introduce some fraud patterns:
    # - Large transaction amounts: fraud rows below 500 are redrawn from [1000, 10000)
    large_transaction_mask = (df['label'] == 1) & (df['transaction_amount'] < 500)
    df.loc[large_transaction_mask, 'transaction_amount'] = np.random.uniform(1000, 10000, size=large_transaction_mask.sum())
    
    # - Unusual transaction times: move fraud rows that fall outside the
    #   02:00-05:59 window into it (the mask must select rows NOT already there)
    late_hours = [2, 3, 4, 5]
    hours = df['transaction_time'].str.slice(11, 13).astype(int)
    late_night_mask = (df['label'] == 1) & ~hours.isin(late_hours)
    df.loc[late_night_mask, 'transaction_time'] = [
        fake.date_time_this_year().replace(hour=int(np.random.choice(late_hours))).strftime('%Y-%m-%d %H:%M:%S')
        for _ in range(late_night_mask.sum())
    ]
    
    return df

# Generate synthetic data with 1000 records and 5% fraud
synthetic_data = generate_synthetic_data(num_records=1000, fraud_percentage=0.05)

# Save the dataset as a CSV file
synthetic_data.to_csv('synthetic_fraud_data.csv', index=False)

# Display the first few rows of the synthetic data
print(synthetic_data.head())


   feature_1  feature_2  feature_3  feature_4  feature_5 transaction_id  \
0  -0.038769  -0.649239  -0.224746  -1.346275   0.126879       TX100000   
1   1.005284  -1.373239   1.157346   0.126493   1.422799       TX100001   
2  -0.742455  -0.573257   1.688442  -2.588237   0.762562       TX100002   
3   2.440938  -2.556425  -0.930664   0.111514  -1.133170       TX100003   
4  -0.941758   0.367913  -0.549360  -2.029919  -1.503957       TX100004   

                                user_id  transaction_amount  \
0  90e78a02-a361-49fa-a062-68fea3a46031                5.80   
1  0f3c6246-a8f6-462d-931e-028b2db60849               23.78   
2  c905fbc7-f6d0-44d8-90fe-38d3cd9fbe4e               11.81   
3  761aba91-acad-4311-acc9-7b75988d64ae                8.31   
4  84d583ad-f77f-41d1-8926-1fa737f32c25               30.99   

      transaction_time              location  \
0  2025-01-22 04:32:28            East Eddie   
1  2025-01-07 11:05:23  East Kathleenborough   
2  2025-01-26 05:32:41    
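
To see how strongly the injected amount pattern separates the classes, it helps to work out how often an amount drawn from the log-normal distribution would exceed the 500 cutoff on its own. The sketch below uses only the parameters that appear in `generate_synthetic_data` (`mean=3`, `sigma=1`, cutoff 500); the helper name `lognormal_tail` is made up for this example.

```python
import math

def lognormal_tail(x, mean=3.0, sigma=1.0):
    """P(X > x) for X ~ LogNormal(mean, sigma), via the standard normal tail."""
    z = (math.log(x) - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

median_amount = math.exp(3.0)      # median of LogNormal(3, 1)
p_above_500 = lognormal_tail(500)  # share of unmodified amounts above 500

print(f"median amount: {median_amount:.2f}")
print(f"P(amount > 500): {p_above_500:.5f}")
```

Under these parameters well under 1% of unmodified amounts exceed 500, while every fraud row ends up at 500 or above (either originally or after being redrawn), so `transaction_amount` alone comes close to identifying the label.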

In [15]:
# Preprocessing and feature engineering
def preprocess_data(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    
    # Convert transaction_time to datetime and extract hour and day of the week
    df['transaction_time'] = pd.to_datetime(df['transaction_time'])
    df['hour'] = df['transaction_time'].dt.hour
    df['day_of_week'] = df['transaction_time'].dt.dayofweek
    
    # Drop identifier and non-numeric columns that the model cannot use directly
    df = df.drop(columns=['transaction_id', 'merchant_id', 'user_id', 'transaction_time', 'location'])
    
    # Standardize transaction_amount (zero mean, unit variance)
    scaler = StandardScaler()
    df['transaction_amount'] = scaler.fit_transform(df[['transaction_amount']]).ravel()
    
    return df

# Apply preprocessing
synthetic_data = preprocess_data(synthetic_data)

# Split data into features and labels
X = synthetic_data.drop(columns=['label'])
y = synthetic_data['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the data after splitting
print(f'Training data shape: {X_train.shape}')
print(f'Test data shape: {X_test.shape}')


Training data shape: (800, 8)
Test data shape: (200, 8)
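
Two details of the split above are worth illustrating: with roughly 5% fraud, an unstratified split can leave very few positives in the test set, and fitting `StandardScaler` before splitting lets test-set statistics influence the scaling. One possible arrangement, sketched here on stand-in `make_classification` data (an assumption; the columns are not the notebook's), stratifies the split and fits the scaler inside a `Pipeline` on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 5 numeric features, about 5% positives.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.95, 0.05], random_state=42)

# stratify=y keeps the fraud share (nearly) equal in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The scaler is fitted only on the rows passed to pipeline.fit().
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)

print(f"fraud share, train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
print(f"test accuracy: {pipeline.score(X_test, y_test):.3f}")
```

Tree ensembles are insensitive to feature scaling, so the scaler matters mainly if the classifier is later swapped for a scale-sensitive one such as logistic regression.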


In [16]:
# Train a Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_model.predict(X_test)

# Evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       190
           1       1.00      1.00      1.00        10

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

Confusion Matrix:
[[190   0]
 [  0  10]]
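
With only 10 fraud cases in the test set, a single split gives a coarse estimate. A stratified k-fold run that reports average precision is one way to get a steadier picture. The sketch below uses stand-in `make_classification` data and an arbitrary choice of 5 folds, both assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data (assumption): same generator settings as the notebook's base features.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.95, 0.05], flip_y=0.01,
                           random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Average precision summarizes the precision-recall curve, which is more
# informative than accuracy when positives are rare.
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(f"average precision per fold: {scores.round(3)}")
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

A large spread between folds would suggest that the headline numbers depend heavily on which few fraud rows land in the test partition.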
