# 🔍 Raksha Behavioral Anomaly Detection Model

## Autoencoder-Based Fraud Detection for Mobile Banking

This notebook trains a deep learning autoencoder model to detect anomalous behavioral patterns in mobile banking sessions. The model analyzes 30 behavioral biometric features to identify potentially fraudulent activity.

### Features Analyzed:
- **Touch Patterns**: Tap duration, swipe velocity, touch pressure, intervals
- **Motion Sensors**: Accelerometer/gyroscope variance, device orientation  
- **Device Context**: Battery, brightness, screen time, app usage
- **Location/Network**: GPS coordinates, WiFi signatures
- **Temporal Patterns**: Time of day, day of week encoding
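
The feature names `time_of_day_sin`, `time_of_day_cos` and `day_of_week_*` used later in this notebook suggest a cyclical encoding for the hour and a one-hot encoding for the weekday. The notebook does not show how the app computes them, so the sketch below assumes the common convention angle = 2π·hour/24. `encode_time` is a hypothetical helper, not part of the Raksha codebase.

```python
import math

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def encode_time(hour: float, weekday: int) -> dict:
    """Encode hour of day cyclically and weekday (0 = Monday) as one-hot.

    Assumption: angle = 2*pi*hour/24. The app's real encoding may differ.
    """
    angle = 2 * math.pi * (hour % 24) / 24
    features = {
        "time_of_day_sin": math.sin(angle),
        "time_of_day_cos": math.cos(angle),
    }
    for i, day in enumerate(DAYS):
        features[f"day_of_week_{day}"] = 1 if i == weekday else 0
    return features

midnight = encode_time(0, 6)   # Sunday, 00:00
evening = encode_time(18, 2)   # Wednesday, 18:00
print(midnight)
print(evening)
```

Under this convention 23:00 and 01:00 end up close together in feature space, which a raw hour value (23 vs 1) would not give you.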

### Model Architecture:
- **Type**: Autoencoder Neural Network
- **Input**: 30 normalized features (18 continuous + 12 binary)
- **Hidden Layers**: 32 → 16 → 32 neurons (ReLU activation)
- **Output**: Reconstruction of input features
- **Anomaly Detection**: Sessions whose reconstruction error (mean squared error) exceeds a threshold are flagged
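
To make the reconstruction-error idea concrete, here is a minimal NumPy sketch. The inputs, the "reconstructions" and `toy_threshold` are made-up values for illustration; no neural network is involved.

```python
import numpy as np

def reconstruction_error(x: np.ndarray, x_hat: np.ndarray) -> np.ndarray:
    """Per-sample mean squared error between inputs and reconstructions."""
    return np.mean(np.square(x - x_hat), axis=1)

# Toy data: three 4-feature samples and made-up reconstructions.
x = np.array([[0.0, 1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5, 0.5],
              [1.0, 0.0, 1.0, 0.0]])
x_hat = np.array([[0.0, 1.0, 0.0, 1.0],   # perfect reconstruction
                  [0.4, 0.6, 0.5, 0.5],   # small error
                  [0.0, 0.0, 0.0, 0.0]])  # poor reconstruction

errors = reconstruction_error(x, x_hat)
toy_threshold = 0.1  # illustrative value only
flags = errors > toy_threshold
print(errors, flags)
```

An autoencoder trained mostly on normal sessions tends to reconstruct them well, so a poorly reconstructed session is treated as unusual.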

## 1. Install Required Libraries

First, let's install and import all necessary libraries for training the model.

In [None]:
# Install required packages
!pip install tensorflow scikit-learn pandas numpy joblib matplotlib seaborn

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import EarlyStopping
import joblib
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Load and Explore Training Data

Upload your `behavioral_training_data_6000_correlated.csv` file and explore its structure.

In [None]:
# Upload your CSV file
from google.colab import files
print("📁 Please upload your behavioral_training_data_6000_correlated.csv file:")
uploaded = files.upload()

# Load the dataset
file_name = list(uploaded.keys())[0]
df = pd.read_csv(file_name)

print(f"✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Columns: {len(df.columns)}")
print(f"💾 Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")

# Display first few rows
print("\n🔍 First 5 rows:")
df.head()

In [None]:
# Explore dataset structure
print("📊 Dataset Information:")
print(f"Total samples: {len(df)}")
print(f"Total features: {len(df.columns)}")
print(f"Data types:\n{df.dtypes.value_counts()}")

# Check for missing values
print("\n🔍 Missing Values:")
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing_Count': missing_values,
    'Missing_Percentage': missing_percent
}).sort_values('Missing_Count', ascending=False)

print(missing_df[missing_df['Missing_Count'] > 0])

# Basic statistics
print("\n📈 Dataset Statistics:")
df.describe()

## 3. Data Preprocessing and Feature Engineering

Define feature categories and prepare data for training.

In [None]:
# Define feature categories - Updated to match Flutter app columns (30 total features)
continuous_features = [
    "tap_duration", "swipe_velocity", "touch_pressure", "tap_interval_avg",
    "accel_variance", "gyro_variance", "battery_level", "brightness_level",
    "screen_on_time", "time_of_day_sin", "time_of_day_cos",
    "wifi_id_hash", "gps_latitude", "gps_longitude",
    "device_orientation", "touch_area", "touch_event_count", "app_usage_time"
]

binary_features = [
    "accel_variance_missing", "gyro_variance_missing", "charging_state",
    "wifi_info_missing", "gps_location_missing",
    "day_of_week_mon", "day_of_week_tue", "day_of_week_wed",
    "day_of_week_thu", "day_of_week_fri", "day_of_week_sat", "day_of_week_sun"
]

print(f"📊 Feature Categories:")
print(f"Continuous features: {len(continuous_features)}")
print(f"Binary features: {len(binary_features)}")
print(f"Total features: {len(continuous_features) + len(binary_features)}")

# Verify all features exist in dataset
missing_features = []
for feature in continuous_features + binary_features:
    if feature not in df.columns:
        missing_features.append(feature)

if missing_features:
    print(f"⚠️ Missing features in dataset: {missing_features}")
else:
    print("✅ All features found in dataset!")

In [None]:
# Handle missing values
print("🔧 Preprocessing data...")
df_processed = df.copy()

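# Fill missing values with 0; the *_missing flag columns are expected to mark absent sensor readings.
# Filling happens before scaling, so 0 here is a raw value, not the feature mean.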
df_processed[continuous_features] = df_processed[continuous_features].fillna(0)
df_processed[binary_features] = df_processed[binary_features].fillna(0)

# Verify no missing values remain
print(f"Missing values after preprocessing: {df_processed[continuous_features + binary_features].isnull().sum().sum()}")

# Standardize continuous features
print("📏 Scaling continuous features...")
scaler = StandardScaler()
scaled_continuous = scaler.fit_transform(df_processed[continuous_features])

# Combine scaled continuous and binary features
X = np.hstack([scaled_continuous, df_processed[binary_features].values])

print(f"✅ Final feature matrix shape: {X.shape}")
print(f"Feature matrix statistics:")
print(f"  Min: {X.min():.4f}")
print(f"  Max: {X.max():.4f}")
print(f"  Mean: {X.mean():.4f}")
print(f"  Std: {X.std():.4f}")

# Split into training and testing sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
print(f"\n📊 Data split:")
print(f"Training set: {X_train.shape}")
print(f"Testing set: {X_test.shape}")
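
The cell above fits the scaler on every row before splitting, so statistics from the test rows influence the scaling. A common alternative is to split first and compute the scaling statistics from the training rows only. The sketch below shows the idea on synthetic data, with plain NumPy standing in for `StandardScaler` (same population-standard-deviation formula); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # synthetic continuous features

# Split first (80/20), then compute scaling statistics from the training rows only.
idx = rng.permutation(len(raw))
train_idx, test_idx = idx[:800], idx[800:]

mean = raw[train_idx].mean(axis=0)
std = raw[train_idx].std(axis=0)   # ddof=0, the same formula StandardScaler uses
std[std == 0] = 1.0                # guard against constant columns

train_scaled = (raw[train_idx] - mean) / std
test_scaled = (raw[test_idx] - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

With scikit-learn the equivalent is to call `scaler.fit` on the training rows and `scaler.transform` on both splits.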

## 4. Build Autoencoder Model Architecture

Create the Keras autoencoder (30 → 32 → 16 → 32 → 30) used for behavioral anomaly detection.

In [None]:
# Build autoencoder architecture
print("🏗️ Building autoencoder model...")

input_dim = X.shape[1]
print(f"Input dimension: {input_dim}")

# Define model architecture
input_layer = Input(shape=(input_dim,), name='input')
encoded = Dense(32, activation='relu', name='encoder_1')(input_layer)
encoded = Dense(16, activation='relu', name='encoder_2')(encoded)
decoded = Dense(32, activation='relu', name='decoder_1')(encoded)
output_layer = Dense(input_dim, activation='linear', name='output')(decoded)

# Create model
autoencoder = Model(inputs=input_layer, outputs=output_layer, name='behavioral_autoencoder')

# Compile model
autoencoder.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

print("✅ Model compiled successfully!")
print("\n🏗️ Model Architecture:")
autoencoder.summary()

# Visualize model architecture
from tensorflow.keras.utils import plot_model
plot_model(autoencoder, to_file='autoencoder_architecture.png', show_shapes=True, show_layer_names=True)
print("\n📊 Model architecture saved as 'autoencoder_architecture.png'")

## 5. Train the Model

Train the autoencoder on the behavioral data with early stopping.
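
The `EarlyStopping` callback in this section uses `patience=10` and `restore_best_weights=True`. The pure-Python sketch below illustrates what patience means, using a made-up list of validation losses. It is a simplified model of the behavior (it ignores options such as `min_delta`), not Keras's implementation, and `stop_epoch` is a hypothetical helper.

```python
def stop_epoch(val_losses, patience):
    """Return (stopped_at, best_epoch), both 0-based.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far. best_epoch is the epoch whose weights
    would be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.59, 0.60, 0.61, 0.63]
stopped_at, best_epoch = stop_epoch(losses, patience=3)
print(f"stopped at epoch {stopped_at}, best epoch {best_epoch}")
```

A larger patience tolerates longer plateaus at the cost of extra epochs; the loss reported for the last epoch is generally not the loss of the restored weights.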

In [None]:
# Set up training parameters
print("🚀 Starting model training...")

# Define callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)

# Train the model
history = autoencoder.fit(
    X_train, X_train,
    epochs=100,
    batch_size=64,
    validation_split=0.2,
    callbacks=[early_stopping],
    verbose=1
)

print(f"✅ Training completed!")
print(f"Epochs run: {len(history.history['loss'])}")
print(f"Best training loss: {min(history.history['loss']):.6f}")
print(f"Best validation loss: {min(history.history['val_loss']):.6f}")

In [None]:
# Visualize training history
plt.figure(figsize=(15, 5))

# Plot training loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss', color='blue')
plt.plot(history.history['val_loss'], label='Validation Loss', color='red')
plt.title('Model Loss During Training')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True)

# Plot training MAE
plt.subplot(1, 2, 2)
plt.plot(history.history['mae'], label='Training MAE', color='blue')
plt.plot(history.history['val_mae'], label='Validation MAE', color='red')
plt.title('Model MAE During Training')
plt.xlabel('Epoch')
plt.ylabel('Mean Absolute Error')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# Display final metrics
final_train_loss = history.history['loss'][-1]
final_val_loss = history.history['val_loss'][-1]
print(f"\n📊 Final Training Metrics:")
print(f"Training Loss: {final_train_loss:.6f}")
print(f"Validation Loss: {final_val_loss:.6f}")
print(f"Overfitting Check: {final_val_loss/final_train_loss:.2f}x training loss")

## 6. Evaluate Model Performance

Analyze reconstruction errors and determine anomaly detection threshold.
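
Choosing a percentile of the reconstruction errors as the threshold fixes, by construction, the share of those same samples that gets flagged. The sketch below demonstrates this on synthetic, gamma-distributed errors (an arbitrary stand-in, not the model's real error distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for per-session reconstruction errors.
errors = rng.gamma(shape=2.0, scale=0.05, size=10_000)

cutoffs = {p: float(np.percentile(errors, p)) for p in (90, 95, 99)}
flag_rates = {p: float(np.mean(errors > c)) for p, c in cutoffs.items()}

for p in cutoffs:
    print(f"p{p}: threshold={cutoffs[p]:.4f}, flagged={flag_rates[p]:.2%}")
```

A 95th-percentile threshold therefore flags about 5% of sessions drawn from the same distribution whether or not any of them are fraudulent; estimating an actual detection rate requires labeled fraud examples.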

In [None]:
# Evaluate model on test set
print("📊 Evaluating model performance...")

# Get predictions
X_test_pred = autoencoder.predict(X_test, verbose=0)

# Calculate reconstruction errors
reconstruction_errors = np.mean(np.square(X_test - X_test_pred), axis=1)

# Calculate statistics
print(f"📈 Reconstruction Error Statistics:")
print(f"  Mean: {reconstruction_errors.mean():.6f}")
print(f"  Std: {reconstruction_errors.std():.6f}")
print(f"  Min: {reconstruction_errors.min():.6f}")
print(f"  Max: {reconstruction_errors.max():.6f}")

# Calculate percentiles for threshold setting
percentiles = [90, 95, 99]
thresholds = {}
for p in percentiles:
    thresholds[p] = np.percentile(reconstruction_errors, p)
    print(f"  {p}th percentile: {thresholds[p]:.6f}")

# Use 95th percentile as anomaly threshold
threshold = thresholds[95]
print(f"\n🎯 Anomaly Detection Threshold (95th percentile): {threshold:.6f}")

# Create visualization of reconstruction errors
plt.figure(figsize=(15, 5))

# Histogram of reconstruction errors
plt.subplot(1, 2, 1)
plt.hist(reconstruction_errors, bins=50, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(threshold, color='red', linestyle='--', linewidth=2, label=f'Threshold (95th percentile)')
plt.xlabel('Reconstruction Error')
plt.ylabel('Frequency')
plt.title('Distribution of Reconstruction Errors')
plt.legend()
plt.grid(True, alpha=0.3)

# Box plot of reconstruction errors
plt.subplot(1, 2, 2)
plt.boxplot(reconstruction_errors, patch_artist=True, boxprops=dict(facecolor='lightblue'))
plt.ylabel('Reconstruction Error')
plt.title('Box Plot of Reconstruction Errors')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Save Trained Model and Scaler

Save the trained model and scaler for use in production.

In [None]:
# Save the trained model
print("💾 Saving trained model and scaler...")

# Save the autoencoder model
model_filename = 'autoencoder_model.h5'
autoencoder.save(model_filename)
print(f"✅ Model saved as: {model_filename}")

# Save the scaler
scaler_filename = 'scaler.pkl'
joblib.dump(scaler, scaler_filename)
print(f"✅ Scaler saved as: {scaler_filename}")

# Save the threshold
threshold_filename = 'threshold.pkl'
joblib.dump(threshold, threshold_filename)
print(f"✅ Threshold saved as: {threshold_filename}")

# Save feature lists for reference
feature_info = {
    'continuous_features': continuous_features,
    'binary_features': binary_features,
    'total_features': len(continuous_features) + len(binary_features)
}
joblib.dump(feature_info, 'feature_info.pkl')
print(f"✅ Feature information saved as: feature_info.pkl")

# Download files to local machine
print("\n📥 Downloading files to your local machine...")
from google.colab import files

files.download(model_filename)
files.download(scaler_filename)
files.download(threshold_filename)
files.download('feature_info.pkl')

print("\n🎉 All files saved and downloaded successfully!")
print("\nFiles ready for integration:")
print(f"  • {model_filename} - Trained autoencoder model")
print(f"  • {scaler_filename} - Feature scaler")
print(f"  • {threshold_filename} - Anomaly threshold")
print(f"  • feature_info.pkl - Feature definitions")
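
The `.pkl` files above can only be read from Python. If the service that consumes the threshold and feature order is written in another language (an assumption; the notebook does not say), a JSON copy may be easier to load. The sketch below uses placeholder values and a hypothetical file name, `model_metadata.json`.

```python
import json
import os
import tempfile

# Placeholder values; in the notebook these would come from the training run.
example_threshold = 0.0421
example_features = {
    "continuous_features": ["tap_duration", "swipe_velocity"],
    "binary_features": ["charging_state"],
}

# float() makes sure a NumPy scalar is written as a plain JSON number.
payload = {"threshold": float(example_threshold), **example_features}
payload["total_features"] = (
    len(payload["continuous_features"]) + len(payload["binary_features"])
)

path = os.path.join(tempfile.mkdtemp(), "model_metadata.json")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    json.dump(payload, f, indent=2)

# Read the file back to confirm the round trip preserves the content.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)
print(restored)
```

Keeping the feature lists in the same file as the threshold helps the consumer build the input vector in the order the model was trained on.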

## 8. Test Anomaly Detection Function

Test the trained model with sample behavioral patterns.

In [None]:
# Define anomaly detection function
def score_session(session_dict, model, scaler, threshold):
    """
    Score a single behavioral session for anomaly detection.
    
    Args:
        session_dict: Dictionary containing all 30 behavioral features
        model: Trained autoencoder model
        scaler: Fitted StandardScaler
        threshold: Anomaly detection threshold
    
    Returns:
        tuple: (reconstruction_error, is_anomaly, risk_score_percent), where
            risk_score_percent is error / threshold as a percentage, capped at 100
    """
    # Extract features in correct order
    cont_vals = [session_dict[feat] for feat in continuous_features]
    bin_vals = [session_dict[feat] for feat in binary_features]
    
    # Scale continuous features
    scaled_cont = scaler.transform([cont_vals])
    
    # Combine all features
    full_input = np.hstack([scaled_cont, [bin_vals]])
    
    # Get reconstruction
    reconstructed = model.predict(full_input, verbose=0)
    
    # Calculate reconstruction error
    error = np.mean(np.square(full_input - reconstructed), axis=1)[0]
    
    # Determine if anomaly
    is_anomaly = 1 if error > threshold else 0
    
    # Calculate risk score percentage
    risk_score_percent = min(100, (error / threshold) * 100)
    
    return error, is_anomaly, risk_score_percent

print("✅ Anomaly detection function defined!")
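
`score_session` indexes the session dictionary directly, so a missing feature raises a `KeyError` and an out-of-range binary value passes through silently. A small pre-check can surface both problems before scoring. `find_feature_problems` below is a hypothetical helper, shown with short illustrative feature lists rather than the notebook's full 30 features.

```python
def find_feature_problems(session, continuous, binary):
    """Return a list of readable problems; an empty list means the session looks usable."""
    problems = []
    # Every expected feature must be present.
    for name in list(continuous) + list(binary):
        if name not in session:
            problems.append(f"missing feature: {name}")
    # Binary features must be exactly 0 or 1.
    for name in binary:
        if name in session and session[name] not in (0, 1):
            problems.append(f"non-binary value for {name}: {session[name]!r}")
    return problems

# Tiny illustrative feature lists, not the notebook's full set.
cont = ["tap_duration", "swipe_velocity"]
bins = ["charging_state"]

ok = find_feature_problems(
    {"tap_duration": 0.2, "swipe_velocity": 0.7, "charging_state": 1}, cont, bins
)
bad = find_feature_problems({"tap_duration": 0.2, "charging_state": 2}, cont, bins)
print(ok)
print(bad)
```

In the notebook, the same check could be called with `continuous_features` and `binary_features` before `score_session`.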

In [None]:
# Test with sample behavioral patterns
print("🧪 Testing anomaly detection with sample patterns...")

# Anomalous behavioral pattern (suspicious)
anomalous_example = {
    'tap_duration': 0.09,                # Too fast – bot-like
    'swipe_velocity': 0.2,               # Unusually slow swipe
    'touch_pressure': 0.1,               # Very light touch
    'tap_interval_avg': 0.05,            # Tapping too fast
    'accel_variance': 0.95,              # Erratic movement
    'gyro_variance': 0.95,               # High rotation — suspicious
    'battery_level': 0.05,               # Very low battery
    'brightness_level': 0.1,             # Screen barely visible
    'screen_on_time': 0.95,              # Long screen on time – suspicious idle
    'time_of_day_sin': -1.0,             # With angle = 2π·hour/24 (assumed), sin=-1, cos=0 is 18:00; midnight would be sin=0, cos=1
    'time_of_day_cos': 0.0,
    'wifi_id_hash': 0.0,                 # Unrecognized network
    'gps_latitude': 0.0,
    'gps_longitude': 0.0,                # Unknown location
    'device_orientation': 0.95,          # Unusual device orientation
    'touch_area': 0.05,                  # Very small touch area - suspicious
    'touch_event_count': 0.95,           # Too many touch events
    'app_usage_time': 0.95,              # Suspiciously long app usage
    'accel_variance_missing': 1,
    'gyro_variance_missing': 1,
    'charging_state': 0,                 # Not charging on low battery
    'wifi_info_missing': 1,
    'gps_location_missing': 1,
    'day_of_week_mon': 0,
    'day_of_week_tue': 0,
    'day_of_week_wed': 0,
    'day_of_week_thu': 0,
    'day_of_week_fri': 0,
    'day_of_week_sat': 0,
    'day_of_week_sun': 1                 # Late night Sunday login (anomaly)
}

# Normal behavioral pattern
normal_example = {
    'tap_duration': 0.2,                 # Normal tap duration
    'swipe_velocity': 0.75,              # Normal swipe speed
    'touch_pressure': 0.7,               # Normal touch pressure
    'tap_interval_avg': 0.3,             # Normal tap intervals
    'accel_variance': 0.3,               # Normal movement
    'gyro_variance': 0.25,               # Normal rotation
    'battery_level': 0.6,                # Normal battery level
    'brightness_level': 0.7,             # Normal brightness
    'screen_on_time': 0.4,               # Normal screen time
    'time_of_day_sin': 0.5,              # With angle = 2π·hour/24 (assumed), sin=0.5, cos=0.87 is 02:00, not afternoon
    'time_of_day_cos': 0.87,
    'wifi_id_hash': 0.8,                 # Familiar network
    'gps_latitude': 0.4,
    'gps_longitude': 0.6,                # Normal location
    'device_orientation': 0.5,           # Normal device orientation
    'touch_area': 0.6,                   # Normal touch area
    'touch_event_count': 0.4,            # Normal touch events
    'app_usage_time': 0.3,               # Normal app usage time
    'accel_variance_missing': 0,
    'gyro_variance_missing': 0,
    'charging_state': 1,                 # Device is charging
    'wifi_info_missing': 0,
    'gps_location_missing': 0,
    'day_of_week_mon': 0,
    'day_of_week_tue': 0,
    'day_of_week_wed': 1,                # Wednesday (normal business day)
    'day_of_week_thu': 0,
    'day_of_week_fri': 0,
    'day_of_week_sat': 0,
    'day_of_week_sun': 0
}

# Test both examples
print("\n🔍 Testing Anomalous Pattern:")
error1, anomaly1, risk1 = score_session(anomalous_example, autoencoder, scaler, threshold)
print(f"  Reconstruction Error: {error1:.6f}")
print(f"  Risk Score: {risk1:.1f}%")
print(f"  Classification: {'⚠️ ANOMALY DETECTED' if anomaly1 else '✅ Normal'}")

print("\n🔍 Testing Normal Pattern:")
error2, anomaly2, risk2 = score_session(normal_example, autoencoder, scaler, threshold)
print(f"  Reconstruction Error: {error2:.6f}")
print(f"  Risk Score: {risk2:.1f}%")
print(f"  Classification: {'⚠️ ANOMALY DETECTED' if anomaly2 else '✅ Normal'}")

print("\n📊 Model Performance Summary:")
print(f"  Anomaly Threshold: {threshold:.6f}")
print(f"  Anomalous pattern correctly flagged: {anomaly1 == 1}")
print(f"  Normal pattern correctly passed: {anomaly2 == 0}")
print(f"  Model discrimination: {error1/error2:.1f}x higher error for anomaly")

## 🎉 Model Training Complete!

### Summary
Your behavioral anomaly detection model has been trained and checked against two hand-written sample sessions. It flags a session as suspicious when its reconstruction error exceeds the saved threshold.

### Files Generated:
- `autoencoder_model.h5` - Trained TensorFlow model
- `scaler.pkl` - Feature preprocessing scaler
- `threshold.pkl` - Anomaly detection threshold
- `feature_info.pkl` - Feature definitions

### Next Steps:
1. **Download** all generated files to your local machine
2. **Integrate** the model into your Flutter app's cloud ML service
3. **Test** with real behavioral data from your mobile app
4. **Monitor** model performance and retrain as needed

### Model Configuration:
- **Input Features**: 30 behavioral biometric features
- **Architecture**: 32→16→32 autoencoder with ReLU activation
- **Threshold**: 95th percentile of reconstruction errors
- **Detection**: Based on reconstruction error analysis

The exported files are ready to integrate into the Raksha mobile banking fraud detection system. Validate the threshold on real session data before relying on it in production.