# Exploratory Data Analysis (EDA) on Synthetic Fraud Detection Data

This notebook performs exploratory data analysis on the synthetic data generated for the fraud detection system. The analysis helps in understanding the data distributions, the relationships between variables, and candidate features for model training.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(style='whitegrid')

In [None]:
# Load the synthetic data
customers = pd.read_csv('../data/synthetic/customers.csv')
devices = pd.read_csv('../data/synthetic/devices.csv')
transactions = pd.read_csv('../data/synthetic/transactions.csv')

# Display the first few rows of each dataset
print('Customers Data:')
display(customers.head())

print('Devices Data:')
display(devices.head())

print('Transactions Data:')
display(transactions.head())

In [None]:
# Summary statistics of the datasets
print('Customers Summary:')
display(customers.describe())

print('Devices Summary:')
display(devices.describe())

print('Transactions Summary:')
display(transactions.describe())
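
By default, `describe()` summarizes only numeric columns and does not report missing values directly. A minimal sketch of a complementary check follows. The helper name `quality_report` and the toy frame are illustrative assumptions, not part of the original notebook, and the real column names may differ.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column dtype, missing-value count and share, and distinct-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'pct_missing': df.isna().mean() * 100,
        'n_unique': df.nunique(dropna=True),
    })


# Hypothetical toy data standing in for one of the loaded tables
toy = pd.DataFrame({
    'amount': [10.0, np.nan, 25.5, 40.0],
    'channel': ['web', 'app', None, 'web'],
})
report = quality_report(toy)
print(report)
```

In the notebook, the same helper could be applied to each loaded table, for example `display(quality_report(transactions))`.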

In [None]:
# Visualize the distribution of transaction amounts
plt.figure(figsize=(10, 6))
sns.histplot(transactions['amount'], bins=50, kde=True)
plt.title('Distribution of Transaction Amounts')
plt.xlabel('Amount')
plt.ylabel('Count')
plt.show()
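
Monetary amounts are often right-skewed, in which case a linear histogram is dominated by its tail. Whether that holds for this synthetic data is an assumption to verify. If it does, a `log1p` transform is one common way to make the bulk of the distribution visible. The sketch below uses randomly generated lognormal values as a hypothetical stand-in for `transactions['amount']`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for transactions['amount']: lognormal draws with a fixed seed
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1_000), name='amount')

# log1p(x) = log(1 + x) is defined at zero, unlike a plain log
log_amounts = np.log1p(amounts)

# Compare skewness before and after the transform
print(f'skew (raw): {amounts.skew():.2f}')
print(f'skew (log1p): {log_amounts.skew():.2f}')
```

In the notebook, passing `np.log1p(transactions['amount'])` to `sns.histplot` would plot the transformed values. The x-axis label should then state that the scale is logarithmic.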

In [None]:
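# Analyze the relationship between customer risk level and fraud labels
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=transactions, x='customer_risk_level', hue='fraud_label_id')
ax.set_title('Customer Risk Level vs Fraud Labels')
ax.set_xlabel('Customer Risk Level')
ax.set_ylabel('Count')
# Rename the legend entries in place so each name stays attached to its color.
# Assumes the sorted fraud_label_id values correspond to Legit, Suspicious, Fraud.
legend = ax.get_legend()
legend.set_title('Fraud Label')
for text, name in zip(legend.get_texts(), ['Legit', 'Suspicious', 'Fraud']):
    text.set_text(name)
sns.move_legend(ax, 'upper right')
plt.show()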
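
Raw counts can hide differences in fraud rate when the risk levels contain very different numbers of transactions. A minimal sketch of a row-normalized cross-tabulation follows. The toy frame is hypothetical, and the sketch assumes, as the plot above does, that `customer_risk_level` and `fraud_label_id` are columns of the same table.

```python
import pandas as pd

# Hypothetical toy data with the two columns used in the count plot
toy = pd.DataFrame({
    'customer_risk_level': ['low', 'low', 'low', 'low', 'high', 'high'],
    'fraud_label_id': [0, 0, 0, 2, 0, 2],
})

# normalize='index' makes each row (one risk level) sum to 1
label_share = pd.crosstab(
    toy['customer_risk_level'],
    toy['fraud_label_id'],
    normalize='index',
)
print(label_share)
```

Each row then reads as the share of labels within one risk level, which makes levels of different sizes directly comparable.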

In [None]:
# Correlation heatmap of transaction features
plt.figure(figsize=(12, 8))
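# Restrict to numeric columns; pandas >= 2.0 raises an error on non-numeric columns otherwise
correlation_matrix = transactions.corr(numeric_only=True)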
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Heatmap of Transaction Features')
plt.show()

## Conclusion

This notebook provides a preliminary analysis of the synthetic data generated for the fraud detection system. Further analysis and feature engineering will be necessary to prepare the data for model training.