### Analyzing a synthetic financial dataset generated by the PaySim mobile money simulator

This dataset simulates transactions based on real financial logs from a mobile money service operating in an African country.

The dataset contains about 6.36 million transactions that took place over roughly one month, with each row representing a single transaction. The main objective is to build a supervised machine learning model that detects fraudulent activity from historical transaction patterns.

The dataset includes the following important features:

**step**: a unit of time in the real world; 1 step is 1 hour, so step 24 marks the end of the first day. There are 744 steps in total (744 hours is 31 days, although the dataset description calls it a 30-day simulation).

**type**: type of transaction: CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER.

**amount**: the amount of the transaction in local currency.

**nameOrig**: ID of the sender (originator).

**oldbalanceOrg & newbalanceOrig**: sender's balance before and after a transaction.

**nameDest**: ID of the recipient.

**oldbalanceDest & newbalanceDest**: recipient's balance before and after a transaction.

**isFraud**: 1 indicates a fraudulent transaction, 0 otherwise.

**isFlaggedFraud**: indicates whether the system flagged the transaction as potentially fraudulent (1 if flagged, 0 otherwise).


i. No balance information (old or new) is available for accounts whose ID starts with M (merchants).

ii. The fraudulent agents aim to profit by taking control of customer accounts and emptying them: they transfer the funds to another account and then cash out of the system.

iii. The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
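The flagging rule in note iii can be sketched on a few made-up rows. The toy DataFrame, the `FLAG_LIMIT` name, and the `rule_flag` column are illustrative assumptions, as is the reading that only TRANSFER rows qualify. The `isFlaggedFraud` column in the real data need not coincide with this naive rule.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250_000.0, 150_000.0, 300_000.0, 500_000.0],
})

FLAG_LIMIT = 200_000  # threshold stated in note iii

# Assumed reading of the rule: a TRANSFER above the limit is an illegal attempt.
is_transfer = toy["type"] == "TRANSFER"
over_limit = toy["amount"] > FLAG_LIMIT
toy["rule_flag"] = (is_transfer & over_limit).astype(int)

print(toy)
```

Only the first row is both a TRANSFER and above the limit, so it is the only one this sketch flags.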

### Import libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                           roc_auc_score, roc_curve, precision_score, recall_score,
                           f1_score, precision_recall_curve, ConfusionMatrixDisplay)
from imblearn.over_sampling import SMOTE
from sklearn.utils import resample
import networkx as nx


In [None]:
# Load Dataset
forensic_df = pd.read_csv('PaySim.csv')
forensic_df.head()

In [None]:
# Getting quick info about the dataset:
forensic_df.info()


In [None]:
forensic_df.describe()

In [None]:
#Checking for missing values
forensic_df.isnull().sum()

In [None]:
# Checking the value in step column
forensic_df['step'].value_counts()

In [None]:
# Count and display occurrences of LEGITIMATE and FRAUD transactions (0=normal, 1=fraud)
forensic_df['isFraud'].value_counts()

In [None]:
# Show percentage distribution of LEGITIMATE and FRAUD transactions
(forensic_df["isFraud"].value_counts(normalize=True)* 100).round(4)

##### There are 6,354,407 (99.8709%) legitimate transactions and 8,213 (0.1291%) fraudulent transactions. The dataset is heavily imbalanced.
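To see why this imbalance matters, consider a trivial "model" that labels every transaction as legitimate. The sketch below uses only the two class counts quoted above; the variable names are illustrative.

```python
# Class counts quoted above.
n_legit = 6_354_407
n_fraud = 8_213
total = n_legit + n_fraud

# A baseline that predicts "legitimate" for every row:
# all legitimate rows are right, all fraud rows are missed.
baseline_accuracy = n_legit / total
baseline_recall = 0 / n_fraud

print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline fraud recall: {baseline_recall:.0%}")
```

Accuracy alone would make this useless baseline look excellent, which is why precision, recall, F1, and AUC are more informative on this dataset.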

In [None]:
# Checking the value in Transaction type column
forensic_df['type'].value_counts()

In [None]:
# Checking the value in Transaction amount column
forensic_df['amount'].value_counts()

In [None]:
# Sample 100,000 non-fraud transactions at random and keep all fraud transactions (8,213 in the dataset)
# replace=False samples without replacement; resample() defaults to bootstrap sampling, which would duplicate rows
df_non_fraud = resample(forensic_df[forensic_df['isFraud'] == 0], replace=False, n_samples=100000, random_state=42)
df_fraud = forensic_df[forensic_df['isFraud'] == 1]
forensic_df = pd.concat([df_non_fraud, df_fraud])
print(f"Sampled dataset shape: {forensic_df.shape}")

In [None]:
# Feature Engineering
# Creates step_week by dividing step (hours) by 168 to group transactions weekly
forensic_df['step_week'] = forensic_df['step'] // 168

#Computes amountZ (z-scored transaction amount) for normalization.
forensic_df['amountZ'] = (forensic_df['amount'] - forensic_df['amount'].mean()) / forensic_df['amount'].std()

# Calculates balance_change_orig and balance_change_dest to capture account balance changes, enhancing fraud detection
forensic_df['balance_change_orig'] = forensic_df['newbalanceOrig'] - forensic_df['oldbalanceOrg']
forensic_df['balance_change_dest'] = forensic_df['newbalanceDest'] - forensic_df['oldbalanceDest']

#### What is the Z-Scored Amount (amountZ)?
The z-scored amount, called amountZ in this notebook, is a standardized version of the amount column in the PaySim dataset. It rescales each transaction amount to the number of standard deviations by which it deviates from the mean.

The z-score indicates how unusual a transaction amount is compared to typical transactions. For example, a high amountZ (e.g., +3) means the amount is 3 standard deviations above the mean, potentially indicating a suspicious transaction (e.g., a large transfer associated with fraud).
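As a small worked example, the same formula can be applied to a handful of made-up amounts. The values below are hypothetical and chosen only so that one of them is clearly unusual.

```python
import pandas as pd

# Hypothetical amounts: four ordinary values and one very large transfer.
amounts = pd.Series([100.0, 120.0, 90.0, 110.0, 5_000.0])

# Same formula as amountZ: subtract the mean, divide by the standard deviation.
# pandas .std() uses the sample standard deviation (ddof=1) by default.
z = (amounts - amounts.mean()) / amounts.std()

print(z.round(2))
```

By construction the z-scores have mean 0 and standard deviation 1, and the large transfer receives the highest score.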


#### Why create weekly groups (168 hours = 1 week)?

Grouping transactions by week and transaction type (e.g., TRANSFER, CASH_OUT) ensures sufficient sample sizes for reliable Benford’s Law calculations.
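The integer division used for `step_week` can be checked on a few step values. This sketch assumes steps run from 1 to 744 as described in the feature list; `HOURS_PER_WEEK` is an illustrative name.

```python
HOURS_PER_WEEK = 168  # 7 days * 24 hours

# Map a few representative steps to their week index.
for step in (1, 167, 168, 335, 336, 744):
    print(step, "->", step // HOURS_PER_WEEK)

# All week indices that occur for steps 1..744.
weeks = sorted({step // HOURS_PER_WEEK for step in range(1, 745)})
print(weeks)
```

The first group (steps 1 to 167) and the last group (steps 672 to 744) cover fewer hours than the full weeks in between, so the group sizes are not equal.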

### Initializing Benford's law for amount

Why use Benford’s law: Benford’s Law is used to detect anomalies in the amount column of the dataset, which is critical for identifying potential fraud.
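Benford's Law predicts that the leading digit d (1 to 9) appears with probability P(d) = log10(1 + 1/d). The short sketch below prints this expected distribution, which is the reference against which observed leading-digit frequencies are compared.

```python
import math

# Expected share of each leading digit under Benford's Law.
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.3f}")
```

Small digits dominate: about 30% of amounts are expected to start with 1, and the nine probabilities sum to 1.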

In [None]:
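# Benford's Law functions
def leading_digit(x):
    # Return the first significant digit of x (0 if x is zero)
    x = abs(x)
    if x == 0:
        return 0
    while x >= 10:
        x /= 10
    while x < 1:  # amounts below 1 (e.g. 0.37) would otherwise yield 0
        x *= 10
    return int(x)

def benford_dev(amounts):
    # Only positive amounts have a leading digit
    positive = amounts[amounts > 0]
    # Skip groups with fewer than 10 usable transactions to avoid unreliable deviations
    if len(positive) < 10:
        return np.nan
    leads = positive.map(leading_digit)
    freq = leads.value_counts(normalize=True)
    expected = {d: np.log10(1 + 1/d) for d in range(1, 10)}
    # Sum of squared differences between observed and expected digit frequencies
    return sum((freq.get(d, 0) - expected[d])**2 for d in expected)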

In [None]:
# Apply Benford's Law by week and type
forensic_df['benford_dev'] = forensic_df.groupby(['step_week', 'type'])['amount'].transform(benford_dev)

In [None]:
# Handle missing values
forensic_df['benford_dev'] = forensic_df['benford_dev'].fillna(forensic_df['benford_dev'].mean())

In [None]:
# Locate and identify assets in cases of fraud, embezzlement, or money laundering
# Creates a transaction network to trace funds between accounts (nameOrig to nameDest) using network analysis

fraud_df = forensic_df[forensic_df['isFraud'] == 1]
high_value_fraud = fraud_df[fraud_df['amount'] > fraud_df['amount'].quantile(0.95)][['nameOrig', 'nameDest', 'amount']]
print(f"Processing {len(high_value_fraud)} high-value fraudulent transactions for fund tracing.")
G = nx.DiGraph()
for row in high_value_fraud.itertuples(index=False):
    G.add_edge(row.nameOrig, row.nameDest, amount=row.amount)
print(f"Graph created with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")
plt.figure(figsize=(10, 8))
pos = nx.spring_layout(G, k=0.5, iterations=20)  # Reduced iterations for speed
nx.draw(G, pos, node_size=30, node_color='skyblue', edge_color='gray', arrows=True, arrowsize=10)
plt.title('High-Value Fraudulent Fund Transfer Network', fontsize=14)
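plt.savefig('fund_tracing_network.png', dpi=300, bbox_inches='tight')  # Save the figure before closing it
plt.close()  # Close plot to free memory
print(f"Generated fund_tracing_network.png with ~{G.number_of_edges()} high-value transfers.")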

### Data Cleaning & Pre-processing

In [None]:
# Drop irrelevant columns to reduce noise
forensic_df = forensic_df.drop(['nameOrig', 'nameDest', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest'], axis=1)

In [None]:
forensic_df.info()

In [None]:
forensic_df.head()

### Visualisation

In [None]:
# Transaction Type Distribution
plt.figure(figsize=(10, 5))
sns.countplot(x="type", data=forensic_df, palette="magma", hue="type", order=forensic_df["type"].value_counts().index)
plt.xticks(rotation=45, fontsize=12)
plt.xlabel("Transaction Type", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Distribution of Transaction Types", fontsize=14)
plt.show()

In [None]:
# Fraud vs Non-Fraud Distribution
plt.figure(figsize=(6, 4))
sns.countplot(x="isFraud", data=forensic_df, hue="isFraud", palette="magma")
plt.xlabel("Fraudulent Transaction", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Fraud vs Non-Fraud Distribution", fontsize=14)
plt.xticks([0, 1], ["Non-Fraud", "Fraud"])
plt.show()

In [None]:
# Fraudulent Transaction Amounts
plt.figure(figsize=(7, 5))
sns.histplot(forensic_df[forensic_df["isFraud"] == 1]["amount"], bins=15, kde=True, color="red", alpha=0.7)
plt.xlabel("Transaction Amount", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.title("Distribution of Fraudulent Transaction Amounts", fontsize=14)
plt.show()

In [None]:
# Benford's Law Visualization
leads = forensic_df[forensic_df['amount'] > 0]['amount'].map(leading_digit)
freq = leads.value_counts(normalize=True).reindex(range(1, 10), fill_value=0)
expected = pd.Series({d: np.log10(1 + 1/d) for d in range(1, 10)})
plt.figure(figsize=(8, 6))
plt.bar(range(1, 10), freq, alpha=0.6, color='skyblue', label='Observed')
plt.bar(range(1, 10), expected, alpha=0.4, color='salmon', label='Expected (Benford)')
plt.xlabel('Leading Digit', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title("Benford's Law: Observed vs. Expected Digit Frequencies", fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 10))
plt.show()

In [None]:
# Weekly Benford Deviation
avg = forensic_df.groupby('step_week')['benford_dev'].mean()
plt.figure(figsize=(8, 4))
avg.plot(kind='line', color='darkblue')
plt.ylabel('Deviation', fontsize=12)
plt.xlabel('Week', fontsize=12)
plt.title("Weekly Benford Deviation", fontsize=14)
plt.grid(True, alpha=0.3)
plt.savefig('weekly_benford.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Histograms and Boxplots
for col in ['step', 'amount', 'step_week', 'amountZ', 'benford_dev']:
    plt.figure(figsize=(7, 5))
    sns.histplot(forensic_df[col], kde=True, bins=30, color='teal')
    plt.title(f'Distribution of {col}', fontsize=14)
    plt.xlabel(col, fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.show()

In [None]:
for col in ['step', 'amount', 'step_week', 'amountZ', 'benford_dev']:
    plt.figure(figsize=(7, 5))
    sns.boxplot(x='isFraud', y=col, data=forensic_df, palette='magma')
    plt.title(f'{col} Distribution by isFraud', fontsize=14)
    plt.xlabel('isFraud', fontsize=12)
    plt.ylabel(col, fontsize=12)
    plt.xticks([0, 1], ['Non-Fraud', 'Fraud'])
    plt.show()

In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(forensic_df[['amount', 'isFraud', 'isFlaggedFraud', 'step', 'step_week', 'amountZ', 'benford_dev']].corr(), annot=True, cmap='coolwarm', fmt='.4f')
plt.title('Correlation Heatmap of Features', fontsize=14)
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

### Handling Outliers

In [None]:
# Function to cap outliers using IQR
def cap_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Vectorized capping: values outside the bounds are set to the nearest bound
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)

for col in ['step', 'amount', 'step_week', 'amountZ']:
    cap_outliers_iqr(forensic_df, col)
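A small worked example of the IQR capping rule, using a made-up series with one extreme value. The numbers are hypothetical and only illustrate how the bounds are derived.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1 = values.quantile(0.25)  # 2.0
q3 = values.quantile(0.75)  # 4.0
iqr = q3 - q1               # 2.0

lower = q1 - 1.5 * iqr      # -1.0
upper = q3 + 1.5 * iqr      # 7.0

# Values outside [lower, upper] are pulled back to the nearest bound.
capped = values.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Only the outlier changes: 100.0 becomes 7.0. Note that capping removes exactly the extreme amounts that may be the most informative for fraud, so it is worth comparing model results with and without it.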

In [None]:
# Verify outliers
for col in ['step', 'amount', 'step_week', 'amountZ']:
    plt.figure(figsize=(5, 3))
    sns.boxplot(y=forensic_df[col], color='teal')
    plt.title(f'Boxplot of {col} (After Capping Outliers)', fontsize=14, pad=20)
    plt.savefig(f'box_{col}_capped.png', dpi=300, bbox_inches='tight')
    plt.show()

In [None]:
# Encode categorical variable 'type'
forensic_df = pd.get_dummies(forensic_df, columns=['type'], drop_first=True)
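The effect of `drop_first=True` is easiest to see on a toy column. The three-row DataFrame below is hypothetical; it only shows that the first category (alphabetically) loses its own column and is represented by all dummies being 0.

```python
import pandas as pd

# Hypothetical toy column with three transaction types.
toy = pd.DataFrame({"type": ["CASH_IN", "CASH_OUT", "TRANSFER"]})

encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)
print(encoded)

# The CASH_IN row has no dummy set: it is the implicit baseline category.
baseline_row_sum = int(encoded.iloc[0].sum())
```

This matters when the original category is reconstructed later: a row with no dummy set belongs to the dropped category, not to any of the remaining columns.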

In [None]:
# Define features and target
features = ['amountZ', 'benford_dev', 'step', 'step_week', 'balance_change_orig', 'balance_change_dest'] + [col for col in forensic_df.columns if col.startswith('type_')]
X = forensic_df[features]
y = forensic_df['isFraud']

# Split data (stratified so the fraud ratio is the same in the train and test sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Apply SMOTE
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("Before SMOTE:", y_train.value_counts())
print("\nAfter SMOTE:", y_train_smote.value_counts())

In [None]:
# Scale features
scaler = StandardScaler()
X_train_smote = scaler.fit_transform(X_train_smote)
X_test = scaler.transform(X_test)

### Model Training and Evaluation

In [None]:
# Train and evaluate models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    'XGBoost': XGBClassifier(max_depth=5, n_estimators=100, scale_pos_weight=10, eval_metric='logloss', random_state=42)
}

models_trained = {}
results = {'Model': [], 'Accuracy': [], 'Precision': [], 'Recall': [], 'F1 Score': [], 'AUC': []}
for name, model in models.items():
    model.fit(X_train_smote, y_train_smote)
    models_trained[name] = model
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    
    # Default threshold
    y_pred = model.predict(X_test)
    results['Model'].append(name)
    results['Accuracy'].append(accuracy_score(y_test, y_pred))
    results['Precision'].append(precision_score(y_test, y_pred, zero_division=0))
    results['Recall'].append(recall_score(y_test, y_pred))
    results['F1 Score'].append(f1_score(y_test, y_pred))
    results['AUC'].append(auc)
    
    print(f"\n{name}")
    print(f"AUC: {auc:.4f}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred, zero_division=0):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
    
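    # Optimized threshold (chosen on the test set here, so the optimized metrics are optimistic)
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
    f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
    # precisions/recalls have one more entry than thresholds; drop the last point so the index is always valid
    optimal_idx = np.argmax(f1_scores[:-1])
    optimal_threshold = thresholds[optimal_idx]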
    y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)
    print(f"Optimal Threshold: {optimal_threshold:.4f}")
    print(f"Optimized Precision: {precision_score(y_test, y_pred_optimal):.4f}")
    print(f"Optimized Recall: {recall_score(y_test, y_pred_optimal):.4f}")
    print(f"Optimized F1 Score: {f1_score(y_test, y_pred_optimal):.4f}")
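The threshold search in the loop above can be illustrated with plain NumPy on a few made-up scores. The labels, scores, and the `f1_at` helper are hypothetical; this sketch sweeps the observed scores as candidate thresholds rather than calling `precision_recall_curve`.

```python
import numpy as np

# Hypothetical labels and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

def f1_at(threshold):
    """F1 score when every score >= threshold is predicted as fraud."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Each distinct score is a candidate threshold.
candidates = np.unique(scores)
f1_values = np.array([f1_at(t) for t in candidates])
best_threshold = candidates[np.argmax(f1_values)]

print(f"Best threshold: {best_threshold:.2f}, F1: {f1_values.max():.3f}")
```

In practice the threshold is better chosen on a separate validation split, because tuning it on the test set makes the reported test metrics optimistic.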

In [None]:
# Model Comparison Bar Chart
comparison_df = pd.DataFrame(results)
print("\nModel Comparison:\n", comparison_df)
plt.figure(figsize=(10, 6))
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
x = np.arange(len(metrics))
width = 0.25
plt.bar(x - width, comparison_df[comparison_df['Model'] == 'Logistic Regression'][metrics].values[0], width, label='Logistic Regression', color='skyblue')
plt.bar(x, comparison_df[comparison_df['Model'] == 'Random Forest'][metrics].values[0], width, label='Random Forest', color='salmon')
plt.bar(x + width, comparison_df[comparison_df['Model'] == 'XGBoost'][metrics].values[0], width, label='XGBoost', color='darkgreen')
plt.xlabel('Metrics', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.title('Model Comparison: Logistic Regression, Random Forest, XGBoost', fontsize=14, pad=20)
plt.xticks(x, metrics)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# ROC Curve for XGBoost
plt.figure(figsize=(8, 6))
fpr, tpr, _ = roc_curve(y_test, models_trained['XGBoost'].predict_proba(X_test)[:, 1])
auc = roc_auc_score(y_test, models_trained['XGBoost'].predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, color='darkblue', lw=2, label=f'XGBoost (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1)
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve for XGBoost Model', fontsize=14, pad=20)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Precision-Recall Curve for XGBoost
y_pred_proba = models_trained['XGBoost'].predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
# precisions/recalls have one more entry than thresholds; drop the last point so the index is always valid
optimal_idx = np.argmax(f1_scores[:-1])
optimal_threshold = thresholds[optimal_idx]
plt.figure(figsize=(8, 6))
plt.plot(recalls, precisions, color='darkgreen', lw=2, label='Precision-Recall Curve')
plt.scatter(recalls[optimal_idx], precisions[optimal_idx], color='red', s=100, label=f'Optimal Threshold ({optimal_threshold:.2f})')
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve for XGBoost', fontsize=14, pad=20)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.savefig('precision_recall.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Confusion Matrix for XGBoost with Optimized Threshold
y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)
cm = confusion_matrix(y_test, y_pred_optimal)
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, annot_kws={'fontsize': 12})
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Confusion Matrix (XGBoost, Optimized Threshold)', fontsize=14, pad=20)
plt.xticks([0.5, 1.5], ['Non-Fraud', 'Fraud'])
plt.yticks([0.5, 1.5], ['Non-Fraud', 'Fraud'])
plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Flag suspicious transactions
threshold = forensic_df['benford_dev'].quantile(0.95)
suspicious = forensic_df[forensic_df['benford_dev'] > threshold].copy()

In [None]:
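# Reconstruct 'type' from dummy columns
type_columns = [col for col in forensic_df.columns if col.startswith('type_')]
if type_columns:
    dummies = suspicious[type_columns].astype(bool)
    suspicious['type'] = dummies.idxmax(axis=1).str.replace('type_', '')
    # drop_first=True removed the first category (CASH_IN, alphabetically), so rows with no dummy set
    # belong to it; idxmax alone would mislabel them as the first remaining type
    suspicious.loc[~dummies.any(axis=1), 'type'] = 'CASH_IN'
else:
    suspicious['type'] = 'Unknown'  # Fallback if no type columns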

In [None]:
# Display flagged transactions
print(f"\nFlagged {len(suspicious)} suspicious transactions for review:")
display_columns = ['step', 'type', 'amount', 'benford_dev', 'isFraud', 'balance_change_orig', 'balance_change_dest']
print(suspicious[display_columns].head(10))  # Display first 10 for brevity

In [None]:
# Save to CSV
suspicious[display_columns].to_csv('suspicious_transactions.csv', index=False)
print("Saved flagged transactions to 'suspicious_transactions.csv'")

In [None]:
# Visualize flagged transactions
plt.figure(figsize=(10, 6))
sns.scatterplot(data=suspicious, x='amount', y='benford_dev', hue='isFraud', style='type', size='isFraud', 
                palette={0: 'skyblue', 1: 'red'}, sizes=(50, 200), alpha=0.7)
plt.xlabel('Transaction Amount', fontsize=12)
plt.ylabel('Benford’s Law Deviation', fontsize=12)
plt.title('Flagged Suspicious Transactions (Benford’s Law)', fontsize=14)
plt.legend(title='Fraud Status & Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.show()