## Data Simulation

- Monthly retraining trigger: GitHub Actions retrains the model every month
- Threshold-based drift retraining: a performance drop of ≥ 5% triggers retraining
- One-year simulation: 12 months of data with realistic patterns
- Controlled drift testing: months 6, 9, and 11 carry injected drift to test drift detection
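
The threshold rule in the list above can be sketched as a small helper. This is only an illustration: the source does not say which metric is tracked or whether the 5% is a relative or an absolute drop, so the function name `should_retrain`, its parameters, and the relative-drop interpretation are all assumptions.

```python
def should_retrain(baseline_score: float, current_score: float,
                   threshold: float = 0.05) -> bool:
    """Return True when the relative drop from the baseline reaches the threshold.

    Assumption: "performance drop" is read as a relative drop of a
    higher-is-better metric (for example F1 or AUC).
    """
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    relative_drop = (baseline_score - current_score) / baseline_score
    return relative_drop >= threshold


# Hypothetical scores, for illustration only
print(should_retrain(0.90, 0.84))  # drop of about 6.7%
print(should_retrain(0.90, 0.88))  # drop of about 2.2%
```

If the intended rule is an absolute drop in percentage points, the division by `baseline_score` would simply be removed.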

**Step 1: Setting up the environment**

In [3]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
import json
from datetime import datetime
import warnings
import kagglehub
warnings.filterwarnings('ignore')

In [4]:
# Set random seed for reproducibility
np.random.seed(42)

# Create directory structure
monthly_dir = Path('data/monthly')
monthly_dir.mkdir(parents=True, exist_ok=True)

print("Libraries imported successfully")
print(f"Monthly data directory: {monthly_dir}")

Libraries imported successfully
Monthly data directory: data\monthly


**Step 2: Load the dataset**

In [5]:
# Download dataset
path = kagglehub.dataset_download("prajwaldongre/loan-application-and-transaction-fraud-detection")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\lvand\.cache\kagglehub\datasets\prajwaldongre\loan-application-and-transaction-fraud-detection\versions\1


In [6]:
# Load datasets
loans_df = pd.read_csv(os.path.join(path, 'loan_applications.csv'))
transactions_df = pd.read_csv(os.path.join(path, 'transactions.csv'))

print(f"\nLoan Applications: {loans_df.shape}")
print(f"Transactions: {transactions_df.shape}")
print(f"Original fraud rate (loans): {loans_df['fraud_flag'].mean():.3f}")
print(f"Original fraud rate (transactions): {transactions_df['fraud_flag'].mean():.3f}")

# Display first few rows
print("\nLoan Applications columns:")
print(loans_df.columns.tolist())
print("\nTransactions columns:")
print(transactions_df.columns.tolist())


Loan Applications: (50000, 21)
Transactions: (50000, 16)
Original fraud rate (loans): 0.021
Original fraud rate (transactions): 0.010

Loan Applications columns:
['application_id', 'customer_id', 'application_date', 'loan_type', 'loan_amount_requested', 'loan_tenure_months', 'interest_rate_offered', 'purpose_of_loan', 'employment_status', 'monthly_income', 'cibil_score', 'existing_emis_monthly', 'debt_to_income_ratio', 'property_ownership_status', 'residential_address', 'applicant_age', 'gender', 'number_of_dependents', 'loan_status', 'fraud_flag', 'fraud_type']

Transactions columns:
['transaction_id', 'customer_id', 'transaction_date', 'transaction_type', 'transaction_amount', 'merchant_category', 'merchant_name', 'transaction_location', 'account_balance_after_transaction', 'is_international_transaction', 'device_used', 'ip_address', 'transaction_status', 'transaction_source_destination', 'transaction_notes', 'fraud_flag']


**Step 3: Create baseline monthly splits**

In [8]:
# Create month assignments for loans. Rows are split in file order, which is
# chronological only if the file is already sorted by application_date.
loans_df_sorted = loans_df.copy().reset_index(drop=True)
n_loans = len(loans_df_sorted)
base_size = n_loans // 12
remainder = n_loans % 12

# Assign month numbers
month_assignments = []
start_idx = 0

for month in range(1, 13):
    # Add one extra record to first 'remainder' months
    month_size = base_size + (1 if month <= remainder else 0)
    month_assignments.extend([month] * month_size)

    end_idx = start_idx + month_size
    print(f"Month {month:02d}: records {start_idx} to {end_idx-1} ({month_size} loans)")
    start_idx = end_idx

loans_df_sorted['month'] = month_assignments

# Assign transactions to the month(s) of their customer's loans; a customer with several loans yields one copy of each transaction per loan
transactions_with_month = transactions_df.merge(
    loans_df_sorted[['customer_id', 'month']],
    on='customer_id',
    how='left'
)

print(f"\nTransaction month assignment success rate: {(~transactions_with_month['month'].isna()).mean():.3f}")

Month 01: records 0 to 4166 (4167 loans)
Month 02: records 4167 to 8333 (4167 loans)
Month 03: records 8334 to 12500 (4167 loans)
Month 04: records 12501 to 16667 (4167 loans)
Month 05: records 16668 to 20834 (4167 loans)
Month 06: records 20835 to 25001 (4167 loans)
Month 07: records 25002 to 29168 (4167 loans)
Month 08: records 29169 to 33335 (4167 loans)
Month 09: records 33336 to 37501 (4166 loans)
Month 10: records 37502 to 41667 (4166 loans)
Month 11: records 41668 to 45833 (4166 loans)
Month 12: records 45834 to 49999 (4166 loans)

Transaction month assignment success rate: 0.967
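
The month-assignment loop above can also be written without an explicit loop. The sketch below is an equivalent formulation, not the notebook's own code; the helper name `assign_months` is hypothetical.

```python
import numpy as np


def assign_months(n_records: int, n_months: int = 12) -> np.ndarray:
    """Return a month label (1..n_months) for each record, in order.

    The first `remainder` months receive one extra record, matching the
    loop-based assignment.
    """
    base_size, remainder = divmod(n_records, n_months)
    sizes = np.full(n_months, base_size)
    sizes[:remainder] += 1
    return np.repeat(np.arange(1, n_months + 1), sizes)


months = assign_months(50_000)
print(np.bincount(months)[1:])  # records per month, months 1..12
```

For 50,000 records this gives 4,167 records in each of the first eight months and 4,166 in the last four, the same sizes as the output above.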


**Step 4: Save baseline monthly data**

In [9]:
monthly_stats = []

for month in range(1, 13):
    month_str = f"{month:02d}"
    month_path = monthly_dir / month_str
    month_path.mkdir(exist_ok=True)

    # Filter data for this month
    month_loans = loans_df_sorted[loans_df_sorted['month'] == month].drop('month', axis=1)
    month_transactions = transactions_with_month[transactions_with_month['month'] == month].drop('month', axis=1)

    # Save to CSV
    month_loans.to_csv(month_path / 'loan_applications.csv', index=False)
    month_transactions.to_csv(month_path / 'transactions.csv', index=False)

    # Calculate stats
    fraud_rate = month_loans['fraud_flag'].mean()
    trans_fraud_rate = month_transactions['fraud_flag'].mean()

    stats = {
        'month': month_str,
        'loan_count': len(month_loans),
        'transaction_count': len(month_transactions),
        'loan_fraud_rate': fraud_rate,
        'transaction_fraud_rate': trans_fraud_rate,
        'drift_applied': 'none'
    }
    monthly_stats.append(stats)

    print(f"Month {month_str}: {len(month_loans)} loans ({fraud_rate:.3f} fraud), {len(month_transactions)} transactions")

print("\nBaseline monthly splits saved successfully")

Month 01: 4167 loans (0.018 fraud), 10640 transactions
Month 02: 4167 loans (0.020 fraud), 10315 transactions
Month 03: 4167 loans (0.022 fraud), 10336 transactions
Month 04: 4167 loans (0.018 fraud), 10490 transactions
Month 05: 4167 loans (0.020 fraud), 10521 transactions
Month 06: 4167 loans (0.022 fraud), 10330 transactions
Month 07: 4167 loans (0.021 fraud), 10460 transactions
Month 08: 4167 loans (0.020 fraud), 10437 transactions
Month 09: 4166 loans (0.024 fraud), 10547 transactions
Month 10: 4166 loans (0.020 fraud), 10361 transactions
Month 11: 4166 loans (0.019 fraud), 10498 transactions
Month 12: 4166 loans (0.022 fraud), 10346 transactions

Baseline monthly splits saved successfully
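
The per-month transaction counts above add up to 125,281 rows, although the source file holds 50,000 transactions. A left merge on `customer_id` repeats a transaction once per matching loan row, so customers with several loans contribute duplicated transactions. The toy example below reproduces the effect and shows one possible way to keep a single month per transaction. The data is made up, and keeping each customer's first loan is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy data: customer C1 has two loans, C3 has none
loans = pd.DataFrame({"customer_id": ["C1", "C1", "C2"], "month": [1, 7, 3]})
transactions = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3"],
    "customer_id": ["C1", "C2", "C3"],
})

# Same pattern as the notebook: T1 appears twice, once per loan of C1
fanned_out = transactions.merge(loans, on="customer_id", how="left")

# One possible alternative: keep one loan row per customer before merging.
# validate="many_to_one" makes pandas raise if the right-hand keys repeat.
first_loan_month = loans.drop_duplicates("customer_id", keep="first")
one_to_one = transactions.merge(
    first_loan_month, on="customer_id", how="left", validate="many_to_one"
)

print(len(transactions), len(fanned_out), len(one_to_one))
```

Whether duplicated transactions are acceptable depends on how the downstream model joins loans and transactions; the sketch only makes the row counts explicit.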


**Step 5: Define drift simulation functions**

In [10]:
def apply_feature_drift(df, feature_name, shift_factor=0.3):
    """Perturb a numerical feature with zero-mean Gaussian noise.

    The noise standard deviation is shift_factor times the feature's own
    standard deviation, so the spread widens while the mean is expected to
    stay close to its original value.
    """
    df_drift = df.copy()
    if feature_name in df_drift.columns:
        std_val = df_drift[feature_name].std()

        # Add zero-mean noise scaled to the feature's spread
        noise = np.random.normal(0, std_val * shift_factor, len(df_drift))
        df_drift[feature_name] = df_drift[feature_name] + noise

        # Clip to the post-noise min/max; as written this leaves the values unchanged
        df_drift[feature_name] = np.clip(df_drift[feature_name],
                                       df_drift[feature_name].min(),
                                       df_drift[feature_name].max())
    return df_drift

def apply_fraud_rate_drift(df, target_fraud_rate=0.04):
    """Increase fraud rate by converting some non-fraud to fraud"""
    df_drift = df.copy()
    current_fraud_rate = df_drift['fraud_flag'].mean()

    if target_fraud_rate > current_fraud_rate:
        # Calculate how many records to flip
        n_to_flip = int((target_fraud_rate - current_fraud_rate) * len(df_drift))

        # Select random non-fraud records to flip
        non_fraud_indices = df_drift[df_drift['fraud_flag'] == 0].index
        if len(non_fraud_indices) >= n_to_flip:
            flip_indices = np.random.choice(non_fraud_indices, n_to_flip, replace=False)
            df_drift.loc[flip_indices, 'fraud_flag'] = 1

    return df_drift

def apply_transaction_drift(df, amount_multiplier=1.5):
    """Modify transaction amounts to simulate spending pattern changes"""
    df_drift = df.copy()
    if 'transaction_amount' in df_drift.columns:
        # Apply multiplier with some randomness
        multipliers = np.random.normal(amount_multiplier, 0.2, len(df_drift))
        multipliers = np.clip(multipliers, 0.5, 3.0)  # Keep reasonable bounds
        df_drift['transaction_amount'] = df_drift['transaction_amount'] * multipliers
    return df_drift

print("Drift simulation functions defined")


Drift simulation functions defined
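
`apply_feature_drift` adds noise drawn from a normal distribution centred on zero, so it mainly widens the feature's spread. If a shift of the mean is wanted instead, a variant along the following lines could be used. This is a sketch, not part of the notebook: the name `apply_mean_shift` and the `noise_factor` and `rng` parameters are assumptions introduced here.

```python
import numpy as np
import pandas as pd


def apply_mean_shift(df, feature_name, shift_factor=0.3, noise_factor=0.0, rng=None):
    """Shift a numerical feature by shift_factor standard deviations.

    Optional zero-mean noise (noise_factor standard deviations) can be added
    on top of the constant offset.
    """
    rng = np.random.default_rng() if rng is None else rng
    df_drift = df.copy()
    if feature_name in df_drift.columns:
        std_val = df_drift[feature_name].std()
        offset = shift_factor * std_val
        noise = rng.normal(0.0, noise_factor * std_val, len(df_drift))
        df_drift[feature_name] = df_drift[feature_name] + offset + noise
    return df_drift


# Made-up values, for illustration only
demo = pd.DataFrame({"debt_to_income_ratio": [0.1, 0.2, 0.3, 0.4]})
shifted = apply_mean_shift(demo, "debt_to_income_ratio", shift_factor=0.4)
print(shifted["debt_to_income_ratio"].mean() - demo["debt_to_income_ratio"].mean())
```

With `noise_factor=0.0` the mean moves by exactly `shift_factor` times the original standard deviation, which makes the effect easy to validate.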


**Step 6: Apply controlled drift to months**

Different types of drift are applied to months 6, 9, and 11. Each scenario should trigger retraining:

- Month 6: feature distribution drift. The debt_to_income_ratio column is perturbed, which should prompt drift-based retraining.
- Month 9: the fraud rate is increased to test performance-based retraining.
- Month 11: transaction amounts are scaled up and applicant_age is perturbed, to test whether behavioral and demographic changes are detected.
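
The notebook does not show how drift is detected downstream. As one illustration of a feature-drift check, the sketch below computes the Population Stability Index (PSI) between a reference sample and a current sample using NumPy only. The function name, the bin count, and the synthetic data are assumptions; a commonly quoted rule of thumb treats a PSI above roughly 0.2 as a notable shift, but any threshold should be tuned on the real data.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples, with bins set by the reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges taken from the reference distribution
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])

    ref_counts = np.bincount(np.searchsorted(edges, reference, side="right"),
                             minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, current, side="right"),
                             minlength=n_bins)

    # Floor the fractions so empty bins do not produce log(0)
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(0)
reference_sample = rng.normal(0.0, 1.0, 5000)   # synthetic stand-in data
psi_same = population_stability_index(reference_sample, reference_sample)
psi_shifted = population_stability_index(reference_sample, reference_sample + 1.0)
print(psi_same, psi_shifted)
```

An identical sample yields a PSI of zero, while the sample shifted by one standard deviation yields a clearly larger value.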

In [11]:
# Month 6: Feature distribution drift (economic conditions change)
month_6_path = monthly_dir / '06'
loans_6 = pd.read_csv(month_6_path / 'loan_applications.csv')
transactions_6 = pd.read_csv(month_6_path / 'transactions.csv')

# Apply debt-to-income ratio shift (economic stress)
loans_6_drift = apply_feature_drift(loans_6, 'debt_to_income_ratio', shift_factor=0.4)

# Save drifted data
loans_6_drift.to_csv(month_6_path / 'loan_applications.csv', index=False)

# Update stats
monthly_stats[5]['drift_applied'] = 'feature_drift_debt_to_income'
monthly_stats[5]['loan_fraud_rate'] = loans_6_drift['fraud_flag'].mean()

print("Month 06 drift applied: debt_to_income_ratio distribution shift")

Month 06 drift applied: debt_to_income_ratio distribution shift


In [12]:
# Month 9: Fraud rate increase
month_9_path = monthly_dir / '09'
loans_9 = pd.read_csv(month_9_path / 'loan_applications.csv')
transactions_9 = pd.read_csv(month_9_path / 'transactions.csv')

# Increase fraud rate significantly
loans_9_drift = apply_fraud_rate_drift(loans_9, target_fraud_rate=0.045)
transactions_9_drift = apply_fraud_rate_drift(transactions_9, target_fraud_rate=0.025)

# Save drifted data
loans_9_drift.to_csv(month_9_path / 'loan_applications.csv', index=False)
transactions_9_drift.to_csv(month_9_path / 'transactions.csv', index=False)

# Update stats
monthly_stats[8]['drift_applied'] = 'fraud_rate_increase'
monthly_stats[8]['loan_fraud_rate'] = loans_9_drift['fraud_flag'].mean()
monthly_stats[8]['transaction_fraud_rate'] = transactions_9_drift['fraud_flag'].mean()

print(f"Month 09 drift applied: fraud rate increased to {loans_9_drift['fraud_flag'].mean():.3f}")

Month 09 drift applied: fraud rate increased to 0.045
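
The resulting rate prints as 0.045 after rounding, but it falls slightly below the target because `apply_fraud_rate_drift` truncates the number of rows to flip with `int()`. The arithmetic can be traced as follows. The starting count of 100 fraud rows is an assumption, chosen to be consistent with the 0.024 baseline rate reported for month 09; the row count of 4166 comes from the output above.

```python
n_rows = 4166                 # month 09 loan count from the notebook output
target_fraud_rate = 0.045
initial_fraud_count = 100     # assumption: consistent with the 0.024 baseline rate

current_rate = initial_fraud_count / n_rows
n_to_flip = int((target_fraud_rate - current_rate) * n_rows)   # 87.47 truncates to 87
final_rate = (initial_fraud_count + n_to_flip) / n_rows

print(n_to_flip, round(final_rate, 6))  # 87 0.044887
```

Using `round()` or `math.ceil()` instead of `int()` would bring the final rate closer to, or just above, the target.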


In [13]:
# Month 11: Transaction behavior change
month_11_path = monthly_dir / '11'
loans_11 = pd.read_csv(month_11_path / 'loan_applications.csv')
transactions_11 = pd.read_csv(month_11_path / 'transactions.csv')

# Apply transaction amount drift
transactions_11_drift = apply_transaction_drift(transactions_11, amount_multiplier=1.8)

# Also apply mild feature drift to applicant age (demographic shift)
loans_11_drift = apply_feature_drift(loans_11, 'applicant_age', shift_factor=0.3)

# Save drifted data
loans_11_drift.to_csv(month_11_path / 'loan_applications.csv', index=False)
transactions_11_drift.to_csv(month_11_path / 'transactions.csv', index=False)

# Update stats
monthly_stats[10]['drift_applied'] = 'transaction_behavior_and_demographics'
monthly_stats[10]['loan_fraud_rate'] = loans_11_drift['fraud_flag'].mean()
monthly_stats[10]['transaction_fraud_rate'] = transactions_11_drift['fraud_flag'].mean()

print("Month 11 drift applied: transaction amounts increased, demographic shift")

Month 11 drift applied: transaction amounts increased, demographic shift


**Step 7: Create drift summary**

In [15]:
monthly_stats_df = pd.DataFrame(monthly_stats)

print("DRIFT SIMULATION SUMMARY")
print("=" * 50)
print(f"Total months created: {len(monthly_stats_df)}")
print(f"Months with drift: {len(monthly_stats_df[monthly_stats_df['drift_applied'] != 'none'])}")
print()

print("MONTHLY STATISTICS:")
print(monthly_stats_df[['month', 'loan_count', 'loan_fraud_rate', 'drift_applied']].to_string(index=False))

print("\nDRIFT DETAILS:")
drift_months = monthly_stats_df[monthly_stats_df['drift_applied'] != 'none']
for _, row in drift_months.iterrows():
    print(f"Month {row['month']}: {row['drift_applied']} (fraud rate: {row['loan_fraud_rate']:.3f})")

# Save summary to JSON
summary_data = {
    'simulation_date': datetime.now().isoformat(),
    'total_months': 12,
    'baseline_fraud_rate': loans_df['fraud_flag'].mean(),
    'drift_months': {
        '06': 'feature_drift_debt_to_income',
        '09': 'fraud_rate_increase',
        '11': 'transaction_behavior_and_demographics'
    },
    'monthly_stats': monthly_stats
}

with open(monthly_dir / 'simulation_summary.json', 'w') as f:
    json.dump(summary_data, f, indent=2)

print(f"\nSimulation summary saved to: {monthly_dir / 'simulation_summary.json'}")

DRIFT SIMULATION SUMMARY
Total months created: 12
Months with drift: 3

MONTHLY STATISTICS:
month  loan_count  loan_fraud_rate                         drift_applied
   01        4167         0.017999                                  none
   02        4167         0.019918                                  none
   03        4167         0.021838                                  none
   04        4167         0.018479                                  none
   05        4167         0.020158                                  none
   06        4167         0.022318          feature_drift_debt_to_income
   07        4167         0.020878                                  none
   08        4167         0.019678                                  none
   09        4166         0.044887                   fraud_rate_increase
   10        4166         0.019923                                  none
   11        4166         0.019443 transaction_behavior_and_demographics
   12        4166         0.0218
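
A downstream job, such as a retraining workflow, could read `simulation_summary.json` to find out which months carry injected drift. The sketch below shows one way to do that. The helper name `load_drift_months` is hypothetical, and the demo writes a small stand-in file with the same `drift_months` keys as the summary saved above rather than reading the real one.

```python
import json
import tempfile
from pathlib import Path


def load_drift_months(summary_path):
    """Return the sorted month labels listed under 'drift_months'."""
    with open(summary_path) as f:
        summary = json.load(f)
    return sorted(summary["drift_months"])


# Stand-in file for illustration; the real file lives in data/monthly
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "simulation_summary.json"
    demo_path.write_text(json.dumps({
        "drift_months": {
            "06": "feature_drift_debt_to_income",
            "09": "fraud_rate_increase",
            "11": "transaction_behavior_and_demographics",
        }
    }))
    drift_month_ids = load_drift_months(demo_path)

print(drift_month_ids)
```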

**Step 8: Validate drift simulation results**

In [16]:
print("DRIFT VALIDATION")
print("=" * 30)

# Check each drift month
validation_results = []

# Month 06: Feature drift validation
loans_06_orig = loans_df_sorted[loans_df_sorted['month'] == 6]
loans_06_drift = pd.read_csv(monthly_dir / '06' / 'loan_applications.csv')

debt_ratio_change = (loans_06_drift['debt_to_income_ratio'].mean() -
                    loans_06_orig['debt_to_income_ratio'].mean())

print(f"Month 06 - Debt-to-income ratio change: {debt_ratio_change:.3f}")
validation_results.append(('Month_06_feature_drift', abs(debt_ratio_change) > 0.01))

# Month 09: Fraud rate validation
loans_09_drift = pd.read_csv(monthly_dir / '09' / 'loan_applications.csv')
fraud_rate_09 = loans_09_drift['fraud_flag'].mean()

print(f"Month 09 - New fraud rate: {fraud_rate_09:.3f} (target: >0.04)")
validation_results.append(('Month_09_fraud_increase', fraud_rate_09 > 0.04))

# Month 11: Transaction drift validation
trans_11_orig = transactions_with_month[transactions_with_month['month'] == 11]
trans_11_drift = pd.read_csv(monthly_dir / '11' / 'transactions.csv')

amount_change = (trans_11_drift['transaction_amount'].mean() /
                trans_11_orig['transaction_amount'].mean())

print(f"Month 11 - Transaction amount multiplier: {amount_change:.2f}")
validation_results.append(('Month_11_transaction_drift', amount_change > 1.3))

# Overall validation
all_passed = all(passed for _, passed in validation_results)
print(f"\nDrift simulation validation: {'PASSED' if all_passed else 'FAILED'}")

if all_passed:
    print("SUCCESS: All drift simulations applied correctly")
    print("MLOps system will have realistic drift to detect and respond to!")
else:
    print("WARNING: Some drift simulations may need adjustment")

DRIFT VALIDATION
Month 06 - Debt-to-income ratio change: 0.065
Month 09 - New fraud rate: 0.045 (target: >0.04)
Month 11 - Transaction amount multiplier: 1.80

Drift simulation validation: PASSED
SUCCESS: All drift simulations applied correctly
MLOps system will have realistic drift to detect and respond to!
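
The Month 06 check compares means, but `apply_feature_drift` adds zero-mean noise, so the mean moves only through sampling variation. The spread changes predictably: adding independent noise with standard deviation k·σ to a feature with standard deviation σ gives a new standard deviation of about σ·√(1 + k²). The sketch below illustrates this on synthetic data. The distribution parameters are made up, and comparing standard deviations is a suggested alternative check, not the notebook's method.

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(0.35, 0.10, 100_000)      # synthetic stand-in feature
shift_factor = 0.4

# Same kind of perturbation as apply_feature_drift: zero-mean Gaussian noise
noise = rng.normal(0.0, original.std() * shift_factor, original.size)
drifted = original + noise

mean_change = drifted.mean() - original.mean()
std_ratio = drifted.std() / original.std()
expected_ratio = np.sqrt(1 + shift_factor ** 2)  # about 1.077 for k = 0.4

print(f"mean change: {mean_change:.4f}")
print(f"std ratio: {std_ratio:.3f} (expected about {expected_ratio:.3f})")
```

A validation rule based on the ratio of standard deviations would therefore be less dependent on chance than a threshold on the mean difference.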


**Step 9: Verification**

In [17]:
# Check directory structure
months_created = []
for month_num in range(1, 13):
    month_str = f"{month_num:02d}"
    month_path = monthly_dir / month_str

    loan_file = month_path / 'loan_applications.csv'
    trans_file = month_path / 'transactions.csv'

    if loan_file.exists() and trans_file.exists():
        loan_count = len(pd.read_csv(loan_file))
        trans_count = len(pd.read_csv(trans_file))
        months_created.append(f"Month {month_str}: {loan_count} loans, {trans_count} transactions")

print("Directory structure verification:")
for month_info in months_created:
    print(f"  {month_info}")

print(f"\nTotal files created: {len(months_created) * 2}")
print(f"Summary file: {(monthly_dir / 'simulation_summary.json').exists()}")

Directory structure verification:
  Month 01: 4167 loans, 10640 transactions
  Month 02: 4167 loans, 10315 transactions
  Month 03: 4167 loans, 10336 transactions
  Month 04: 4167 loans, 10490 transactions
  Month 05: 4167 loans, 10521 transactions
  Month 06: 4167 loans, 10330 transactions
  Month 07: 4167 loans, 10460 transactions
  Month 08: 4167 loans, 10437 transactions
  Month 09: 4166 loans, 10547 transactions
  Month 10: 4166 loans, 10361 transactions
  Month 11: 4166 loans, 10498 transactions
  Month 12: 4166 loans, 10346 transactions

Total files created: 24
Summary file: True
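
The verification loop above silently skips any month whose files are missing. A stricter variant could list the missing files explicitly, as sketched below. The helper name `find_missing_files` is hypothetical, and the demo runs against a temporary directory rather than `data/monthly`.

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = ("loan_applications.csv", "transactions.csv")


def find_missing_files(base_dir, n_months=12):
    """Return 'MM/filename' entries for every expected file that is absent."""
    base_dir = Path(base_dir)
    return [
        f"{month:02d}/{name}"
        for month in range(1, n_months + 1)
        for name in EXPECTED_FILES
        if not (base_dir / f"{month:02d}" / name).exists()
    ]


# Demo layout: month 01 is complete, month 02 lacks its transactions file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for month_str, names in {"01": EXPECTED_FILES,
                             "02": ("loan_applications.csv",)}.items():
        (root / month_str).mkdir()
        for name in names:
            (root / month_str / name).write_text("placeholder\n")
    missing = find_missing_files(root, n_months=2)

print(missing)
```

An empty list from `find_missing_files(monthly_dir)` would confirm that all 24 files are present.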
