# 01: Corridor Analysis

This notebook explores the fundamental differences in transaction patterns across payment corridors, demonstrating why a one-size-fits-all fraud detection approach fails.

## Objectives
1. Generate synthetic transaction data mimicking real corridor patterns
2. Visualise how "normal" varies dramatically by corridor
3. Quantify the false positive problem with global thresholds

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print('Libraries loaded successfully')

## 1. Generate Synthetic Corridor Data

We'll create synthetic data that mirrors real-world corridor patterns observed in cross-border payments. Each corridor has distinct:
- Transaction amount distributions
- Velocity patterns (transactions per sender per day)
- Temporal patterns (time of day, day of week)
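As a minimal sketch of the day-of-week pattern listed above, the snippet below samples day offsets so that a corridor's peak weekdays are over-represented. The helper name `sample_day_offsets` and the `peak_weight` parameter are hypothetical and are not part of the generator used later in this notebook.

```python
import numpy as np
from datetime import datetime, timedelta

def sample_day_offsets(peak_days, n, n_days=90, peak_weight=3.0,
                       base_date=datetime(2024, 1, 1), seed=0):
    """Sample day offsets, favouring weekdays listed in peak_days (Monday=0).

    Hypothetical helper: peak_weight is an assumed relative weight,
    not a value taken from real corridor data.
    """
    rng = np.random.default_rng(seed)
    # Weekday of every calendar day in the window
    weekdays = np.array([(base_date + timedelta(days=d)).weekday() for d in range(n_days)])
    # Peak weekdays get weight peak_weight, all other days weight 1
    weights = np.where(np.isin(weekdays, peak_days), peak_weight, 1.0)
    return rng.choice(n_days, size=n, p=weights / weights.sum())

# Example: Friday and Saturday peaks, as in the UK → Poland profile
offsets = sample_day_offsets(peak_days=[4, 5], n=10_000)
```

Each offset can then be added to the base date in the same way the generator builds its timestamps.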

In [None]:
# Corridor configuration based on observed patterns
CORRIDOR_PROFILES = {
    'GBP_NGN': {  # UK to Nigeria
        'name': 'UK → Nigeria',
        'amount_mean': 450,
        'amount_std': 380,
        'amount_min': 50,
        'amount_max': 5000,
        'velocity_mean': 1.8,  # transactions per sender per day
        'velocity_std': 1.2,
        'peak_hours': [9, 10, 11, 17, 18, 19, 20],  # UTC
        'peak_days': [0, 4, 5],  # Monday, Friday, Saturday
        'fraud_rate': 0.012,  # 1.2%
        'typical_use': 'Family support, school fees, medical bills'
    },
    'GBP_PLN': {  # UK to Poland
        'name': 'UK → Poland',
        'amount_mean': 220,
        'amount_std': 150,
        'amount_min': 30,
        'amount_max': 1500,
        'velocity_mean': 0.9,
        'velocity_std': 0.4,
        'peak_hours': [12, 13, 14, 18, 19],
        'peak_days': [4, 5],  # Friday, Saturday (end of work week)
        'fraud_rate': 0.003,  # 0.3%
        'typical_use': 'Worker remittances, monthly family support'
    },
    'GBP_INR': {  # UK to India
        'name': 'UK → India',
        'amount_mean': 680,
        'amount_std': 520,
        'amount_min': 100,
        'amount_max': 8000,
        'velocity_mean': 1.2,
        'velocity_std': 0.8,
        'peak_hours': [8, 9, 10, 14, 15, 16],
        'peak_days': [0, 1, 2, 3, 4],  # Weekdays
        'fraud_rate': 0.006,  # 0.6%
        'typical_use': 'Property payments, family events, education'
    },
    'GBP_PHP': {  # UK to Philippines
        'name': 'UK → Philippines',
        'amount_mean': 320,
        'amount_std': 200,
        'amount_min': 40,
        'amount_max': 2000,
        'velocity_mean': 2.1,
        'velocity_std': 1.5,
        'peak_hours': [6, 7, 8, 21, 22, 23],  # Early morning/late night UK = daytime PH
        'peak_days': [0, 4, 5, 6],  # Monday, Fri-Sun
        'fraud_rate': 0.008,  # 0.8%
        'typical_use': 'Domestic worker remittances, family support'
    }
}

print(f'Configured {len(CORRIDOR_PROFILES)} corridors for analysis')

In [None]:
def generate_corridor_transactions(corridor_id, profile, n_transactions=10000, seed=42):
    """
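    Generate synthetic transactions for a specific corridor.

    Simplifications: amount_std, velocity_std and peak_days are defined in the
    profile but not used by this generator; days are sampled uniformly.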
    """
    np.random.seed(seed)
    
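    # Generate amounts (log-normal distribution, common in financial data).
    # Note: passing log(amount_mean) as `mean` makes amount_mean the median of the
    # distribution, not its arithmetic mean; sigma is fixed at 0.8 for every corridor.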
    amounts = np.random.lognormal(
        mean=np.log(profile['amount_mean']),
        sigma=0.8,
        size=n_transactions
    )
    amounts = np.clip(amounts, profile['amount_min'], profile['amount_max'])
    
    # Generate timestamps over 90 days
    base_date = datetime(2024, 1, 1)
    timestamps = []
    for _ in range(n_transactions):
        day_offset = np.random.randint(0, 90)
        
        # Weight hours towards peak hours
        if np.random.random() < 0.7:  # 70% during peak
            hour = np.random.choice(profile['peak_hours'])
        else:
            hour = np.random.randint(0, 24)
        
        minute = np.random.randint(0, 60)
        ts = base_date + timedelta(days=day_offset, hours=hour, minutes=minute)
        timestamps.append(ts)
    
    # Generate sender IDs (some senders make multiple transactions)
    # Rough heuristic for the sender pool size; it does not reproduce velocity_mean exactly
    n_unique_senders = int(n_transactions / profile['velocity_mean'] / 30)
    sender_ids = np.random.choice(
        [f'sender_{i:05d}' for i in range(n_unique_senders)],
        size=n_transactions,
        replace=True
    )
    
    # Generate fraud labels
    is_fraud = np.random.random(n_transactions) < profile['fraud_rate']
    
    # Fraud transactions tend to be higher value
    fraud_indices = np.where(is_fraud)[0]
    amounts[fraud_indices] *= np.random.uniform(1.5, 3.0, size=len(fraud_indices))
    amounts = np.clip(amounts, profile['amount_min'], profile['amount_max'])
    
    df = pd.DataFrame({
        'corridor': corridor_id,
        'corridor_name': profile['name'],
        'sender_id': sender_ids,
        'amount': amounts.round(2),
        'timestamp': timestamps,
        'is_fraud': is_fraud
    })
    
    return df

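# Generate data for all corridors.
# Use a different seed per corridor: with the same seed, every corridor would get
# identical underlying amount draws, differing only by a scale factor.
all_transactions = []
for idx, (corridor_id, profile) in enumerate(CORRIDOR_PROFILES.items()):
    df = generate_corridor_transactions(corridor_id, profile, n_transactions=10000, seed=42 + idx)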
    all_transactions.append(df)
    print(f'{profile["name"]}: {len(df):,} transactions, {df["is_fraud"].sum()} fraud cases ({df["is_fraud"].mean()*100:.2f}%)')

transactions = pd.concat(all_transactions, ignore_index=True)
print(f'\nTotal: {len(transactions):,} transactions')
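The generator above treats `amount_mean` as the median of the log-normal and leaves `amount_std` unused. If the goal were instead to match a target arithmetic mean and standard deviation, one option is moment matching. The sketch below uses a hypothetical helper, `lognormal_params`, and ignores the clipping step, which would shift the moments slightly.

```python
import numpy as np

def lognormal_params(mean, std):
    """Return (mu, sigma) of a log-normal with the given arithmetic mean and std.

    Hypothetical helper, based on the standard identities
    sigma^2 = ln(1 + std^2 / mean^2) and mu = ln(mean) - sigma^2 / 2.
    """
    sigma2 = np.log(1.0 + (std / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2.0
    return mu, np.sqrt(sigma2)

# Example with the UK → Nigeria profile values (mean 450, std 380)
mu, sigma = lognormal_params(mean=450, std=380)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=mu, sigma=sigma, size=200_000)
print(f'mean={sample.mean():.0f}, std={sample.std():.0f}, median={np.median(sample):.0f}')
```

With this parameterisation the median falls below the mean, which is what a right-skewed amount distribution should show.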

## 2. Visualise Corridor Differences

Let's see how dramatically "normal" varies by corridor.

In [None]:
# Amount distributions by corridor
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, (corridor_id, profile) in enumerate(CORRIDOR_PROFILES.items()):
    corridor_data = transactions[transactions['corridor'] == corridor_id]
    
    ax = axes[idx]
    ax.hist(corridor_data['amount'], bins=50, edgecolor='white', alpha=0.7)
    
    # Add percentile lines
    median = corridor_data['amount'].median()
    p95 = corridor_data['amount'].quantile(0.95)
    
    ax.axvline(median, color='green', linestyle='--', linewidth=2, label=f'Median: £{median:.0f}')
    ax.axvline(p95, color='red', linestyle='--', linewidth=2, label=f'95th %ile: £{p95:.0f}')
    
    ax.set_title(f'{profile["name"]}\n{profile["typical_use"]}', fontsize=11)
    ax.set_xlabel('Transaction Amount (£)')
    ax.set_ylabel('Frequency')
    ax.legend(fontsize=9)

plt.suptitle('Transaction Amount Distributions by Corridor', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('corridor_amount_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Summary statistics table
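summary_stats = transactions.groupby('corridor_name').agg({
    'amount': ['median', lambda x: x.quantile(0.95), 'std'],
    'is_fraud': ['sum', 'mean']
})

summary_stats.columns = ['Median Amount (£)', '95th Percentile (£)', 'Std Dev', 'Fraud Count', 'Fraud Rate']

# Round only the amount columns. Rounding the raw fraud rate to 2 decimals
# before converting to a percentage would turn 0.003 into 0.0%.
amount_cols = ['Median Amount (£)', '95th Percentile (£)', 'Std Dev']
summary_stats[amount_cols] = summary_stats[amount_cols].round(2)
summary_stats['Fraud Rate'] = (summary_stats['Fraud Rate'] * 100).round(2).astype(str) + '%'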

print('\n=== Corridor Summary Statistics ===')
print(summary_stats.to_string())
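The summary above covers amounts and fraud only. Velocity (transactions per sender per day) could be measured with a similar groupby. The sketch below runs on a small hand-made frame rather than the notebook's data, and it counts only days on which a sender was active, which is one possible definition among several.

```python
import pandas as pd

# Tiny hand-made example; column names mirror the notebook's DataFrame
toy = pd.DataFrame({
    'corridor': ['GBP_NGN', 'GBP_NGN', 'GBP_NGN', 'GBP_NGN', 'GBP_PLN', 'GBP_PLN'],
    'sender_id': ['a', 'a', 'a', 'b', 'c', 'c'],
    'timestamp': pd.to_datetime([
        '2024-01-01 09:00', '2024-01-01 18:30', '2024-01-02 10:00',
        '2024-01-01 12:00', '2024-01-01 13:00', '2024-01-03 19:00',
    ]),
})

# Count transactions per (corridor, sender, calendar day)
daily_counts = (
    toy.assign(date=toy['timestamp'].dt.date)
       .groupby(['corridor', 'sender_id', 'date'])
       .size()
       .rename('tx_per_day')
       .reset_index()
)

# Average and maximum daily velocity per corridor, over active sender-days
velocity = daily_counts.groupby('corridor')['tx_per_day'].agg(['mean', 'max'])
print(velocity)
```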

## 3. The False Positive Problem with Global Thresholds

Let's demonstrate what happens when we apply a single "flag transactions above £X" rule globally.

In [None]:
def evaluate_global_threshold(transactions, threshold):
    """
    Evaluate a simple global threshold rule.
    Flag all transactions above threshold as suspicious.
    """
    flagged = transactions['amount'] > threshold
    
    results = []
    for corridor_name in transactions['corridor_name'].unique():
        corridor_data = transactions[transactions['corridor_name'] == corridor_name]
        corridor_flagged = flagged[transactions['corridor_name'] == corridor_name]
        
        # True positives: flagged AND fraud
        tp = (corridor_flagged & corridor_data['is_fraud']).sum()
        # False positives: flagged AND NOT fraud
        fp = (corridor_flagged & ~corridor_data['is_fraud']).sum()
        # False negatives: NOT flagged AND fraud
        fn = (~corridor_flagged & corridor_data['is_fraud']).sum()
        # True negatives: NOT flagged AND NOT fraud
        tn = (~corridor_flagged & ~corridor_data['is_fraud']).sum()
        
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        
        results.append({
            'Corridor': corridor_name,
            'Flagged': corridor_flagged.sum(),
            'True Positives': tp,
            'False Positives': fp,
            'Recall': f'{recall:.1%}',
            'Precision': f'{precision:.1%}',
            'False Positive Rate': f'{fpr:.1%}'
        })
    
    return pd.DataFrame(results)

# Test with global threshold of £1000 (seems reasonable, right?)
print('=== Global Threshold: £1,000 ===')
print('Rule: Flag all transactions above £1,000 for review\n')
results_1000 = evaluate_global_threshold(transactions, 1000)
print(results_1000.to_string(index=False))

In [None]:
# Test with lower threshold
print('\n=== Global Threshold: £500 ===')
print('Rule: Flag all transactions above £500 for review\n')
results_500 = evaluate_global_threshold(transactions, 500)
print(results_500.to_string(index=False))

### Key Insight

Notice the problem:

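- **£1,000 threshold**: Keeps false positives low for UK→Poland, where few legitimate transactions exceed £1,000, but for the same reason it tends to miss that corridor's fraud. For UK→India it flags a large share of legitimate transactions, because ordinary payments there are often above £1,000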

- **£500 threshold**: Better recall across corridors, but creates massive false positive rates for UK→Nigeria and UK→India

**There is no single threshold that works well for all corridors.**

This is why we need corridor-specific calibration.
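To make the trade-off easier to inspect than two hand-picked thresholds allow, one could sweep a grid of thresholds and tabulate recall and false positive rate per corridor. The sketch below is illustrative only: `sweep_thresholds` is a hypothetical helper, and the two toy corridors (`low_value`, `high_value`) use made-up parameters rather than the profiles above.

```python
import numpy as np
import pandas as pd

def sweep_thresholds(df, thresholds):
    """Recall and false positive rate per corridor for each global threshold."""
    rows = []
    for corridor, grp in df.groupby('corridor'):
        fraud = grp['is_fraud'].to_numpy()
        amount = grp['amount'].to_numpy()
        n_fraud, n_legit = fraud.sum(), (~fraud).sum()
        for t in thresholds:
            flagged = amount > t
            rows.append({
                'corridor': corridor,
                'threshold': t,
                'recall': (flagged & fraud).sum() / n_fraud if n_fraud else np.nan,
                'fpr': (flagged & ~fraud).sum() / n_legit if n_legit else np.nan,
            })
    return pd.DataFrame(rows)

# Toy data: two corridors with different typical amounts (assumed values)
rng = np.random.default_rng(0)
frames = []
for corridor, median in [('low_value', 200), ('high_value', 700)]:
    amount = rng.lognormal(np.log(median), 0.8, size=5000)
    is_fraud = rng.random(5000) < 0.05
    amount[is_fraud] *= 2.0  # fraud skews higher, as in the generator
    frames.append(pd.DataFrame({'corridor': corridor, 'amount': amount, 'is_fraud': is_fraud}))
toy = pd.concat(frames, ignore_index=True)

sweep = sweep_thresholds(toy, thresholds=[250, 500, 1000, 2000])
print(sweep.pivot(index='threshold', columns='corridor', values='fpr').round(3))
```

Reading across a row of the pivot shows how the same threshold produces very different false positive rates in the two corridors.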

In [None]:
# Corridor-specific thresholds based on percentiles
def evaluate_corridor_specific_thresholds(transactions, percentile=95):
    """
    Use corridor-specific thresholds based on percentile.
    """
    results = []
    
    for corridor_name in transactions['corridor_name'].unique():
        corridor_data = transactions[transactions['corridor_name'] == corridor_name].copy()
        
        # Calculate corridor-specific threshold
        threshold = corridor_data['amount'].quantile(percentile / 100)
        
        corridor_flagged = corridor_data['amount'] > threshold
        
        tp = (corridor_flagged & corridor_data['is_fraud']).sum()
        fp = (corridor_flagged & ~corridor_data['is_fraud']).sum()
        fn = (~corridor_flagged & corridor_data['is_fraud']).sum()
        tn = (~corridor_flagged & ~corridor_data['is_fraud']).sum()
        
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        
        results.append({
            'Corridor': corridor_name,
            'Threshold': f'£{threshold:.0f}',
            'Flagged': corridor_flagged.sum(),
            'True Positives': tp,
            'False Positives': fp,
            'Recall': f'{recall:.1%}',
            'Precision': f'{precision:.1%}',
            'False Positive Rate': f'{fpr:.1%}'
        })
    
    return pd.DataFrame(results)

print('=== Corridor-Specific Thresholds (95th Percentile) ===')
print('Rule: Flag transactions above corridor-specific 95th percentile\n')
results_specific = evaluate_corridor_specific_thresholds(transactions, percentile=95)
print(results_specific.to_string(index=False))

### Improvement

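Corridor-specific thresholds achieve **more balanced performance** across corridors. Because each threshold is that corridor's own 95th percentile, roughly 5% of transactions are flagged in every corridor, so false positive rates are consistent by construction rather than wildly varying. Note that the thresholds here are computed on the same data they are evaluated on.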

This is just the beginning—the full system goes beyond simple thresholds to use dynamic signal weighting across multiple features.
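In practice a threshold would be fitted on historical data and applied to new transactions. A minimal sketch of that split follows, under stated assumptions: `fit_corridor_thresholds` and `apply_corridor_thresholds` are hypothetical helpers, the toy data is made up, and a random split stands in for what would more realistically be a time-based one.

```python
import numpy as np
import pandas as pd

def fit_corridor_thresholds(train, percentile=95):
    """Per-corridor amount threshold, learned from training data only."""
    return train.groupby('corridor')['amount'].quantile(percentile / 100)

def apply_corridor_thresholds(df, thresholds):
    """Flag rows whose amount exceeds their corridor's threshold.

    Corridors missing from `thresholds` map to NaN and are never flagged.
    """
    return df['amount'] > df['corridor'].map(thresholds)

# Toy data: two corridors with different typical amounts (assumed values)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'corridor': np.repeat(['low_value', 'high_value'], 4000),
    'amount': np.concatenate([
        rng.lognormal(np.log(200), 0.8, 4000),
        rng.lognormal(np.log(700), 0.8, 4000),
    ]),
})

# Random 70/30 split for illustration
is_train = rng.random(len(toy)) < 0.7
train, holdout = toy[is_train], toy[~is_train]

thresholds = fit_corridor_thresholds(train)
flags = apply_corridor_thresholds(holdout, thresholds)
flag_rate = flags.groupby(holdout['corridor']).mean()
print(thresholds.round(0))
print(flag_rate.round(3))
```

If the holdout flag rate drifts far from the nominal 5%, that suggests the corridor's amount distribution has shifted since the thresholds were fitted.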

In [None]:
# Save the synthetic data for use in subsequent notebooks
transactions.to_csv('synthetic_transactions.csv', index=False)
print(f'Saved {len(transactions):,} transactions to synthetic_transactions.csv')

## Summary

This analysis demonstrates the fundamental problem:

1. **Corridors have vastly different baselines** — median transaction in UK→India is 3x that of UK→Poland

2. **Global thresholds create impossible tradeoffs** — optimising for one corridor harms another

3. **Corridor-specific calibration improves balance** — but we need more sophisticated approaches for production

Next notebook: **Feature Engineering** — building corridor-normalised features for the full model.