# 🇮🇳 Indian Banking Fraud Detection System

## Features of Indian Banking Dataset:
- **Payment Methods**: UPI, RTGS, NEFT, IMPS, Net Banking
- **Indian Merchants**: Kirana stores, Petrol pumps, Mobile recharge, etc.
- **Regional Patterns**: Transactions across Indian states and cities
- **Festival Patterns**: Higher transaction volumes during festivals
- **Banking Hours**: Traditional banking hours (10 AM - 4 PM)
- **RBI Compliance**: Indian banking regulations and patterns

## Fraud Patterns Detected:
- International transactions (higher risk)
- Night-time transactions (2.5x risk)
- Festival season fraud attempts (2x risk)
- High-value transactions outside banking hours
- Suspicious payment method combinations
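
The multipliers in the list above can be read as relative-risk factors applied to a baseline fraud rate. The sketch below shows one way they might combine. The multiplicative combination, the helper name `risk_multiplier`, and the night window (before 6 or after 22, mirroring the `Is_Night_Transaction` feature built later in this notebook) are illustrative assumptions, not a description of how the dataset was generated.

```python
# Hypothetical helper: combine the stated risk multipliers for one transaction.
NIGHT_MULTIPLIER = 2.5     # "Night-time transactions (2.5x risk)"
FESTIVAL_MULTIPLIER = 2.0  # "Festival season fraud attempts (2x risk)"


def risk_multiplier(hour: int, is_festival_season: bool) -> float:
    """Return a relative-risk factor, assuming the multipliers combine multiplicatively."""
    multiplier = 1.0
    if hour < 6 or hour > 22:  # assumed night window
        multiplier *= NIGHT_MULTIPLIER
    if is_festival_season:
        multiplier *= FESTIVAL_MULTIPLIER
    return multiplier


print(risk_multiplier(hour=14, is_festival_season=False))  # 1.0
print(risk_multiplier(hour=2, is_festival_season=True))    # 5.0
```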

In [1]:
# Import libraries for Indian banking analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set plot style and color palette
plt.style.use('seaborn-v0_8')
sns.set_palette('viridis')

In [2]:
# Load Indian banking dataset
print('🇮🇳 Loading Indian Banking Transaction Dataset...')
df = pd.read_csv('../data/indian_banking_transactions.csv')

print(f'Dataset shape: {df.shape}')
print(f'Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB')

# Display basic information
print('\n📊 Dataset Overview:')
df.info()

print('\n💰 Amount Statistics (in ₹):')
print(df['Amount'].describe())

print('\n🚨 Fraud Distribution:')
fraud_counts = df['Class'].value_counts()
print(f'Legitimate: {fraud_counts[0]} ({fraud_counts[0]/len(df)*100:.2f}%)')
print(f'Fraudulent: {fraud_counts[1]} ({fraud_counts[1]/len(df)*100:.2f}%)')

🇮🇳 Loading Indian Banking Transaction Dataset...
Dataset shape: (100000, 41)
Memory usage: 55.35 MB

📊 Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 41 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Transaction_ID      100000 non-null  object 
 1   Timestamp           100000 non-null  object 
 2   Amount              100000 non-null  float64
 3   Payment_Method      100000 non-null  object 
 4   Merchant_Category   100000 non-null  object 
 5   Location            100000 non-null  object 
 6   Hour                100000 non-null  int64  
 7   Day_of_Week         100000 non-null  int64  
 8   Is_Festival_Season  100000 non-null  bool   
 9   Is_Banking_Hours    100000 non-null  bool   
 10  Is_Weekend          100000 non-null  bool   
 11  Month               100000 non-null  int64  
 12  Class               100000 non-null  int64  
 13  V1            

In [3]:
# Indian Banking EDA
print('🏦 Indian Banking Pattern Analysis')

# Payment method analysis
print('\n📱 Fraud Rate by Payment Method:')
payment_fraud = df.groupby('Payment_Method')['Class'].agg(['count', 'sum', 'mean']).round(4)
payment_fraud.columns = ['Total_Transactions', 'Fraud_Cases', 'Fraud_Rate']
payment_fraud = payment_fraud.sort_values('Fraud_Rate', ascending=False)
print(payment_fraud)

# Merchant category analysis
print('\n🏪 Fraud Rate by Merchant Category:')
merchant_fraud = df.groupby('Merchant_Category')['Class'].agg(['count', 'sum', 'mean']).round(4)
merchant_fraud.columns = ['Total_Transactions', 'Fraud_Cases', 'Fraud_Rate']
merchant_fraud = merchant_fraud.sort_values('Fraud_Rate', ascending=False)
print(merchant_fraud.head(10))

# Location analysis
print('\n🌍 Fraud Rate by Location:')
location_fraud = df.groupby('Location')['Class'].agg(['count', 'sum', 'mean']).round(4)
location_fraud.columns = ['Total_Transactions', 'Fraud_Cases', 'Fraud_Rate']
location_fraud = location_fraud.sort_values('Fraud_Rate', ascending=False)
print(location_fraud.head(10))

# Time-based analysis
print('\n⏰ Fraud Rate by Hour:')
hourly_fraud = df.groupby('Hour')['Class'].agg(['count', 'sum', 'mean']).round(4)
hourly_fraud.columns = ['Total_Transactions', 'Fraud_Cases', 'Fraud_Rate']
print(hourly_fraud)

# Festival season analysis
print('\n🎉 Festival Season Impact:')
festival_fraud = df.groupby('Is_Festival_Season')['Class'].agg(['count', 'sum', 'mean']).round(4)
festival_fraud.columns = ['Total_Transactions', 'Fraud_Cases', 'Fraud_Rate']
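# Map the boolean group keys explicitly instead of relying on their sort order
festival_fraud.index = festival_fraud.index.map({False: 'Normal Period', True: 'Festival Season'})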
print(festival_fraud)

🏦 Indian Banking Pattern Analysis

📱 Fraud Rate by Payment Method:
                    Total_Transactions  Fraud_Cases  Fraud_Rate
Payment_Method                                                 
International_Card                 421          421      1.0000
UPI                              11161         1676      0.1502
Net_Banking                      10841         1243      0.1147
Mobile_Banking                   10237          775      0.0757
Credit_Card                       9616            0      0.0000
ATM_Withdrawal                    9625            0      0.0000
Cash_Deposit                      9661            0      0.0000
Debit_Card                        9737            0      0.0000
IMPS                              9502            0      0.0000
NEFT                              9552            0      0.0000
RTGS                              9647            0      0.0000

🏪 Fraud Rate by Merchant Category:
                   Total_Transactions  Fraud_Cases  Fraud_Rate
Me
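
The EDA cell repeats the same group-by, rename, and sort steps for every column. A small helper can express that pattern once. This is a sketch under assumptions: the name `fraud_rate_by` is hypothetical, the toy frame is made up for illustration, and the helper always sorts by fraud rate (the hourly table above is left unsorted).

```python
import pandas as pd


def fraud_rate_by(df: pd.DataFrame, column: str, target: str = "Class") -> pd.DataFrame:
    """Summarise transaction count, fraud count, and fraud rate per group."""
    summary = (
        df.groupby(column)[target]
        .agg(Total_Transactions="count", Fraud_Cases="sum", Fraud_Rate="mean")
        .round(4)
        .sort_values("Fraud_Rate", ascending=False)
    )
    return summary


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Payment_Method": ["UPI", "UPI", "NEFT", "NEFT", "UPI", "NEFT"],
    "Class": [1, 0, 0, 0, 1, 0],
})
print(fraud_rate_by(toy, "Payment_Method"))
```

With the real frame, a call such as `fraud_rate_by(df, 'Merchant_Category').head(10)` would stand in for the corresponding block in the cell.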

In [4]:
# Feature Engineering for Indian Banking
print('🛠️ Feature Engineering for Indian Banking Patterns...')

# Create a copy for modeling
df_model = df.copy()

# Encode categorical variables
le_payment = LabelEncoder()
le_merchant = LabelEncoder()
le_location = LabelEncoder()

df_model['Payment_Method_Encoded'] = le_payment.fit_transform(df_model['Payment_Method'])
df_model['Merchant_Category_Encoded'] = le_merchant.fit_transform(df_model['Merchant_Category'])
df_model['Location_Encoded'] = le_location.fit_transform(df_model['Location'])

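# Create risk indicators
# The "high-risk" payment methods are picked from the fraud rates seen in the EDA above,
# so this flag already encodes label information from the full dataset.
df_model['Is_High_Risk_Payment'] = df_model['Payment_Method'].isin(['UPI', 'Net_Banking', 'International_Card']).astype(int)
df_model['Is_International'] = df_model['Location'].str.contains('International').astype(int)
# The 95th-percentile cutoff is computed over all rows, including those later used for testing.
df_model['Is_High_Value'] = (df_model['Amount'] > df_model['Amount'].quantile(0.95)).astype(int)
# Night is defined as hours 23 and 0-5.
df_model['Is_Night_Transaction'] = ((df_model['Hour'] < 6) | (df_model['Hour'] > 22)).astype(int)
df_model['Amount_Log'] = np.log1p(df_model['Amount'])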

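# Convert timestamp to Unix seconds for modeling (assumes nanosecond-resolution datetimes)
df_model['Timestamp'] = pd.to_datetime(df_model['Timestamp'])
# Use an explicit 64-bit cast; plain `int` can map to 32 bits on some platforms
df_model['Time_Numeric'] = df_model['Timestamp'].astype('int64') // 10**9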

print('✅ Feature engineering completed!')
print(f'Total features available: {len(df_model.columns)}')

🛠️ Feature Engineering for Indian Banking Patterns...
✅ Feature engineering completed!
Total features available: 50
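
The `Is_High_Value` flag in the cell above uses a 95th-percentile cutoff computed on the whole dataset, so the threshold is influenced by rows that later land in the test set. One way to avoid that is to learn the cutoff from the training rows and reuse it unchanged on the test rows. The sketch below uses made-up amounts, and the helper names `fit_high_value_threshold` and `add_high_value_flag` are hypothetical.

```python
import pandas as pd


def fit_high_value_threshold(train_amounts: pd.Series, q: float = 0.95) -> float:
    """Learn the high-value cutoff from training amounts only."""
    return float(train_amounts.quantile(q))


def add_high_value_flag(frame: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return a copy with Is_High_Value set using a previously fitted cutoff."""
    out = frame.copy()
    out["Is_High_Value"] = (out["Amount"] > threshold).astype(int)
    return out


# Toy amounts in rupees, for illustration only.
train = pd.DataFrame({"Amount": [100.0, 250.0, 400.0, 900.0, 50000.0]})
test = pd.DataFrame({"Amount": [300.0, 75000.0]})

threshold = fit_high_value_threshold(train["Amount"])
train_flagged = add_high_value_flag(train, threshold)
test_flagged = add_high_value_flag(test, threshold)
print(threshold, test_flagged["Is_High_Value"].tolist())
```

The same fit-on-train, apply-to-test pattern is what `StandardScaler` already follows in the next cell.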


In [5]:
# Prepare features for modeling
print('🎯 Preparing Features for Indian Banking Model...')

# Select features for modeling
feature_columns = [col for col in df_model.columns if col.startswith('V')] + [
    'Amount', 'Amount_Log', 'Time_Numeric', 'Hour', 'Day_of_Week', 'Month',
    'Payment_Method_Encoded', 'Merchant_Category_Encoded', 'Location_Encoded',
    'Is_Banking_Hours', 'Is_Weekend', 'Is_Festival_Season',
    'Is_High_Risk_Payment', 'Is_International', 'Is_High_Value', 'Is_Night_Transaction'
]

X = df_model[feature_columns]
y = df_model['Class']

print(f'Feature matrix shape: {X.shape}')
print(f'Features used: {len(feature_columns)}')
print(f'Target distribution - Fraud: {y.sum()} ({y.mean():.3%})')

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'Training set: {X_train_scaled.shape}')
print(f'Test set: {X_test_scaled.shape}')

🎯 Preparing Features for Indian Banking Model...
Feature matrix shape: (100000, 44)
Features used: 44
Target distribution - Fraud: 4115 (4.115%)
Training set: (80000, 44)
Test set: (20000, 44)


In [6]:
# Handle class imbalance with SMOTE
print('⚖️ Handling Class Imbalance...')

print(f'Before SMOTE - Fraud cases: {y_train.sum()} ({y_train.mean():.3%})')

smote = SMOTE(random_state=42, k_neighbors=5)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print(f'After SMOTE - Total samples: {len(X_train_balanced)}')
print(f'After SMOTE - Fraud cases: {y_train_balanced.sum()} ({y_train_balanced.mean():.3%})')

print('✅ Dataset is now balanced for training!')

⚖️ Handling Class Imbalance...
Before SMOTE - Fraud cases: 3292 (4.115%)
After SMOTE - Total samples: 153416
After SMOTE - Fraud cases: 76708 (50.000%)
✅ Dataset is now balanced for training!
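
In the output above, SMOTE raises the fraud class from 3,292 to 76,708 training rows, matching the 76,708 legitimate rows. It does this by creating synthetic points between existing fraud rows and their nearest fraud neighbours. The sketch below illustrates that interpolation idea with NumPy on toy data. It is a simplified illustration, not imbalanced-learn's implementation, and the function name `smote_like_samples` is hypothetical.

```python
import numpy as np


def smote_like_samples(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 42) -> np.ndarray:
    """Generate synthetic minority points by interpolating toward nearest neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(minority) - 1)
    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base point, one of its k neighbours, and a random position between them.
    base_idx = rng.integers(0, len(minority), size=n_new)
    nn_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return minority[base_idx] + lam * (minority[nn_idx] - minority[base_idx])


# Toy 2-D minority class, for illustration only.
fraud_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(fraud_points, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic row lies on a line segment between two real fraud rows, oversampling adds no new information about fraud. Only the training split is resampled here, which keeps the test set at its original 4.115% fraud rate.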


In [7]:
# Train Indian Banking Fraud Detection Models
print('🤖 Training Indian Banking Fraud Detection Models...')

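# Define candidate models with fixed hyperparameters (no tuning is performed here)
# Note: class_weight='balanced' has little effect because SMOTE has already balanced the training data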
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced'),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, class_weight='balanced'),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=6, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, max_depth=6, eval_metric='logloss', random_state=42)
}

results = {}

for name, model in models.items():
    print(f'\n🔥 Training {name}...')
    
    # Train model
    model.fit(X_train_balanced, y_train_balanced)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    # Calculate metrics
    auc_score = roc_auc_score(y_test, y_pred_proba)
    
    print(f'🎯 {name} Results:')
    print(f'   AUC Score: {auc_score:.4f}')
    print(f'   Classification Report:')
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Store results
    results[name] = {
        'model': model,
        'auc': auc_score,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }

# Select the best model by AUC on the test set
best_model_name = max(results.keys(), key=lambda k: results[k]['auc'])
best_model = results[best_model_name]['model']
best_auc = results[best_model_name]['auc']

print(f'\n🏆 Best Model: {best_model_name}')
print(f'🎯 Best AUC Score: {best_auc:.4f}')

🤖 Training Indian Banking Fraud Detection Models...

🔥 Training Logistic Regression...
🎯 Logistic Regression Results:
   AUC Score: 0.9923
   Classification Report:
              precision    recall  f1-score   support

  Legitimate       1.00      0.97      0.98     19177
       Fraud       0.58      0.94      0.71       823

    accuracy                           0.97     20000
   macro avg       0.79      0.95      0.85     20000
weighted avg       0.98      0.97      0.97     20000


🔥 Training Random Forest...
🎯 Random Forest Results:
   AUC Score: 0.9973
   Classification Report:
              precision    recall  f1-score   support

  Legitimate       1.00      0.98      0.99     19177
       Fraud       0.73      0.96      0.83       823

    accuracy                           0.98     20000
   macro avg       0.86      0.97      0.91     20000
weighted avg       0.99      0.98      0.98     20000


🔥 Training Gradient Boosting...
🎯 Gradient Boosting Results:
   AUC Score: 0.99
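
AUC alone hides how many legitimate transactions get flagged at the default threshold. The fraud-row precision and recall in the reports above can be turned back into approximate counts: true positives are roughly recall × support, and total flagged is roughly true positives ÷ precision. The sketch below does that arithmetic. The results are approximate because the report rounds to two decimals, and the helper name `counts_from_report` is hypothetical.

```python
def counts_from_report(precision: float, recall: float, support: int) -> dict:
    """Approximate TP / FN / FP counts for the positive class from rounded report values."""
    true_positives = recall * support
    flagged = true_positives / precision
    return {
        "true_positives": round(true_positives),
        "false_negatives": round(support - true_positives),
        "false_positives": round(flagged - true_positives),
    }


# Fraud rows of the Logistic Regression and Random Forest reports above.
print(counts_from_report(precision=0.58, recall=0.94, support=823))
print(counts_from_report(precision=0.73, recall=0.96, support=823))
```

For Logistic Regression this works out to roughly 774 frauds caught at the cost of about 560 false alarms. For Random Forest it is roughly 790 caught with about 292 false alarms.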

In [8]:
# Feature Importance Analysis
print('📊 Feature Importance Analysis for Indian Banking...')

if hasattr(best_model, 'feature_importances_'):
    # Get feature importances
    importances = best_model.feature_importances_
    feature_importance_df = pd.DataFrame({
        'feature': feature_columns,
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    print('🔝 Top 15 Most Important Features:')
    print(feature_importance_df.head(15))
    
    # Indian banking specific insights
    indian_features = feature_importance_df[
        feature_importance_df['feature'].isin([
            'Is_International', 'Is_High_Risk_Payment', 'Payment_Method_Encoded',
            'Is_Night_Transaction', 'Is_Festival_Season', 'Amount_Log'
        ])
    ]
    print('\n🇮🇳 Indian Banking Specific Feature Importance:')
    print(indian_features)
else:
    print('Feature importance not available for this model type.')

📊 Feature Importance Analysis for Indian Banking...
🔝 Top 15 Most Important Features:
                   feature  importance
40    Is_High_Risk_Payment    0.693226
41        Is_International    0.099312
34  Payment_Method_Encoded    0.046287
29              Amount_Log    0.034455
28                  Amount    0.018958
0                       V1    0.010752
3                       V4    0.010268
6                       V7    0.007859
8                       V9    0.007527
2                       V3    0.007349
7                       V8    0.007035
1                       V2    0.006653
9                      V10    0.006445
5                       V6    0.006411
4                       V5    0.005966

🇮🇳 Indian Banking Specific Feature Importance:
                   feature  importance
40    Is_High_Risk_Payment    0.693226
41        Is_International    0.099312
34  Payment_Method_Encoded    0.046287
29              Amount_Log    0.034455
39      Is_Festival_Season    0.005096
43    Is
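
`Is_High_Risk_Payment` accounts for about 0.69 of the importance above, which is consistent with the EDA table where every fraud case falls under four payment methods. Impurity- or gain-based importances can be cross-checked with permutation importance, which measures how much a score drops when one column is shuffled. The sketch below demonstrates the call on synthetic data with a small random forest. It is evaluated in-sample for brevity, whereas a held-out set would normally be used, and all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data for illustration: only the first column carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Shuffle each column in turn and measure the drop in ROC AUC.
perm = permutation_importance(
    demo_model, X_demo, y_demo, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = np.argsort(perm.importances_mean)[::-1]
print(ranking, perm.importances_mean.round(3))
```

Applied to this notebook, the equivalent call would pass `best_model`, `X_test_scaled`, and `y_test`.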

In [9]:
# Save Indian Banking Model and Components
print('💾 Saving Indian Banking Fraud Detection Model...')

# Save best model
joblib.dump(best_model, '../backend/app/model/indian_banking_model.pkl')
joblib.dump(scaler, '../backend/app/model/indian_banking_scaler.pkl')

# Save encoders for categorical variables
joblib.dump(le_payment, '../backend/app/model/payment_encoder.pkl')
joblib.dump(le_merchant, '../backend/app/model/merchant_encoder.pkl')
joblib.dump(le_location, '../backend/app/model/location_encoder.pkl')

# Save feature columns list
joblib.dump(feature_columns, '../backend/app/model/feature_columns.pkl')

print(f'✅ Saved {best_model_name} model with {best_auc:.4f} AUC score')
print('✅ Saved scaler and encoders')
print('✅ Saved feature column names')

print('\n🎉 Indian Banking Fraud Detection System Ready!')
print('\n📋 Model Performance Summary:')
print(f'   • Model Type: {best_model_name}')
print(f'   • AUC Score: {best_auc:.4f}')
print(f'   • Training Samples: {len(X_train_balanced):,}')
print(f'   • Test Samples: {len(X_test):,}')
print(f'   • Features Used: {len(feature_columns)}')
print(f'   • Transactions Flagged as Fraud: {(results[best_model_name]["predictions"] == 1).sum()}/{len(y_test)}')

print('\n🚀 Ready for deployment in Indian banking environment!')

💾 Saving Indian Banking Fraud Detection Model...
✅ Saved XGBoost model with 0.9996 AUC score
✅ Saved scaler and encoders
✅ Saved feature column names

🎉 Indian Banking Fraud Detection System Ready!

📋 Model Performance Summary:
   • Model Type: XGBoost
   • AUC Score: 0.9996
   • Training Samples: 153,416
   • Test Samples: 20,000
   • Features Used: 44
   • Transactions Flagged as Fraud: 817/20000

🚀 Ready for deployment in Indian banking environment!
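
The six files saved above have to be loaded together at inference time. The sketch below shows one way to do that with `joblib.load`. The helper name `load_artifacts` and the dictionary keys are hypothetical, and the demonstration writes placeholder objects to a temporary directory instead of reading the real `../backend/app/model/` files.

```python
from pathlib import Path
import tempfile

import joblib


def load_artifacts(model_dir: str) -> dict:
    """Load the objects saved above from `model_dir`, keyed by a short name."""
    base = Path(model_dir)
    filenames = {
        "model": "indian_banking_model.pkl",
        "scaler": "indian_banking_scaler.pkl",
        "payment_encoder": "payment_encoder.pkl",
        "merchant_encoder": "merchant_encoder.pkl",
        "location_encoder": "location_encoder.pkl",
        "feature_columns": "feature_columns.pkl",
    }
    return {name: joblib.load(base / fname) for name, fname in filenames.items()}


# Demonstration with placeholder objects in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ["indian_banking_model.pkl", "indian_banking_scaler.pkl",
                  "payment_encoder.pkl", "merchant_encoder.pkl",
                  "location_encoder.pkl"]:
        joblib.dump({"placeholder": fname}, Path(tmp) / fname)
    joblib.dump(["Amount", "Hour"], Path(tmp) / "feature_columns.pkl")
    artifacts = load_artifacts(tmp)

print(sorted(artifacts))
```

At inference, incoming rows need the same engineered columns arranged in `feature_columns` order and passed through `scaler.transform` before `predict_proba`. A fitted `LabelEncoder` raises a `ValueError` on categories it did not see during training, so unseen payment methods, merchants, or locations need explicit handling.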
