# TrustShieldAI: Real-Time Fraud Detection Project

This notebook contains the implementation of a realistic financial transaction simulation and an unsupervised fraud detection system using Isolation Forest. The goal of this project is to generate a synthetic transaction dataset that captures both normal and fraudulent behaviors, and then detect anomalies automatically.

## Project Overview
- Simulate realistic transactions for users, POS agents, and mobile money agents across all Nigerian states.
- Incorporate geolocation, IP address, device fingerprints, and temporal features to enhance realism.
- Model multiple fraud scenarios, including:
  - Burst/velocity attacks
  - Cross-state POS-card fraud
  - Same-state POS-card fraud at distant locations
- Perform exploratory data analysis to understand feature distributions and trends.
- Preprocess and encode features for unsupervised anomaly detection.
- Train and evaluate an Isolation Forest model to flag potentially fraudulent transactions.

## Team Details
**Team Name:** TrustShieldAI  
**Team Members:**  
- Adaeze Ifeanyi – Team Lead  
- Abdusshakur Abdurrahman – Team Member  

This project demonstrates a combination of data simulation, feature engineering, exploratory analysis, and advanced anomaly detection techniques for real-time fraud detection systems.


### Fraud Dataset Generation and Transaction Simulation

Simulate realistic financial transactions with both normal and fraudulent activity:
- Define users, POS and mobile money agents, transaction volume, and geographic locations
- Assign transaction channels and agent mappings for realistic workflow
- Generate IP addresses by state and create unique device fingerprints
- Simulate nearby transaction locations and classify time-of-day
- Create standard transactions capturing distance from home, agent info, IP, and timestamps

Fraud scenarios simulated:
- Velocity/Burst Fraud → multiple rapid transactions in short intervals
- Cross-State POS-Card Fraud → transaction occurs in a different state from user’s home
- Same-State POS-Card Fraud → transaction occurs within the same state but at a distant agent
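
A velocity pattern like the first scenario can be checked with a sliding window over one user's timestamps. The following is a minimal sketch rather than the notebook's detection method; the function name `max_txns_in_window` and the 30-minute window are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def max_txns_in_window(timestamps, window=timedelta(minutes=30)):
    """Return the largest number of timestamps that fall within any window-long span."""
    ordered = sorted(timestamps)
    best, start = 0, 0
    for end, ts in enumerate(ordered):
        # Advance the left edge until the span fits inside the window
        while ts - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best

base = datetime(2024, 1, 1, 10, 0)
normal = [base + timedelta(hours=6 * i) for i in range(5)]    # widely spaced
burst = [base + timedelta(minutes=3 * i) for i in range(8)]   # 8 txns, 3 min apart

print(max_txns_in_window(normal))
print(max_txns_in_window(burst))
```

A high count for a single user inside a short window is the signature the burst generator is meant to produce.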

What we did:
- Configured the simulation environment with reproducible random seeds
- Generated realistic Nigerian state coordinates and mapped transaction channels
- Implemented helper functions for IP, device ID, nearby location, and time-of-day features
- Created functions to simulate standard transactions and various fraud patterns
- Built a dataset suitable for training and evaluating fraud detection models


In [None]:
#importing required libraries
import random
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from geopy.distance import geodesic
import math
import csv
import uuid
import hashlib
from collections import defaultdict
import matplotlib.pyplot as plt

In [None]:
random.seed(42)
np.random.seed(42)

# Config
n_users = 8000
n_pos_agents = 200  # POS agents
n_momo_agents = 100  # Mobile money agents
n_transactions = 50000
print("Building a real time fraud detector dataset....")

Building a real time fraud detector dataset....


In [None]:
states = ["Lagos","Ogun","Oyo","Osun","Ondo","Ekiti","Anambra","Enugu","Imo","Abia",
          "Ebonyi","Rivers","Delta","Edo","Cross River","Akwa Ibom","Bayelsa",
          "Abuja","Niger","Kwara","Kogi","Benue","Plateau","Nasarawa","Kano",
          "Kaduna","Katsina","Sokoto","Zamfara","Kebbi","Jigawa","Borno",
          "Adamawa","Yobe","Bauchi","Gombe","Taraba"]

# Approximate coordinates for the 36 Nigerian states plus Abuja (FCT)
state_coords = {
    # South West
    "Lagos": (6.5244, 3.3792),
    "Oyo": (7.8429, 3.9470),
    "Ogun": (7.1608, 3.3476),
    "Osun": (7.5629, 4.5200),
    "Ondo": (7.2500, 5.2061),
    "Ekiti": (7.7190, 5.3111),

    # South East
    "Anambra": (6.2209, 6.9982),
    "Enugu": (6.5244, 7.5106),
    "Imo": (5.4925, 7.0349),
    "Abia": (5.4527, 7.5248),
    "Ebonyi": (6.2649, 8.0137),

    # South South
    "Rivers": (4.8156, 6.9994),
    "Delta": (5.8921, 5.6805),
    "Edo": (6.3350, 5.6037),
    "Cross River": (5.9631, 8.5450),
    "Akwa Ibom": (5.0077, 7.8536),
    "Bayelsa": (4.7719, 6.0699),

    # North Central
    "Abuja": (9.0579, 7.4951),
    "Niger": (10.4806, 6.5432),
    "Kwara": (8.9670, 4.5993),
    "Kogi": (7.7323, 6.7411),
    "Benue": (7.1906, 8.7501),
    "Plateau": (9.2182, 9.5179),
    "Nasarawa": (8.5378, 8.5167),

    # North West
    "Kano": (12.0022, 8.5920),
    "Kaduna": (10.5105, 7.4165),
    "Katsina": (12.9908, 7.6006),
    "Sokoto": (13.0059, 5.2476),
    "Zamfara": (12.1666, 6.6666),
    "Kebbi": (12.4500, 4.2000),
    "Jigawa": (12.2300, 9.5500),

    # North East
    "Borno": (11.8846, 13.1571),
    "Adamawa": (9.3265, 12.3984),
    "Yobe": (12.2939, 11.9668),
    "Bauchi": (10.3158, 9.8442),
    "Gombe": (10.2897, 11.1689),
    "Taraba": (8.8921, 11.3733)
}

# Define channels and transaction types
channels = ["POS", "MoMo"]

# Transaction types available for each channel
transaction_types = {
    "POS": ["Card", "Transfer"],
    "MoMo": ["Transfer"]
}

# Agent allowed transaction types
agent = {
    "POS_AGENT": ["Card", "Transfer"],
    "MOMO_AGENT": ["Transfer"]
}

print("States, states coordinates, channels, and transaction types initialized....")


States, states coordinates, channels, and transaction types initialized....


In [None]:
# Enhanced IP generation by state/region
def generate_ip_by_state(state):
    """Generate realistic IP ranges for Nigerian states based on major ISPs"""
    # Nigerian ISP IP ranges by region
    state_ip_ranges = {
        # South West
        "Lagos": [(41, 203), (197, 210), (102, 88)],
        "Oyo": [(41, 204), (154, 115), (102, 89)],
        "Ogun": [(41, 205), (197, 211), (105, 112)],
        "Osun": [(41, 206), (154, 116), (102, 90)],
        "Ondo": [(41, 207), (197, 212), (105, 113)],
        "Ekiti": [(41, 208), (154, 117), (102, 91)],

        # South East
        "Anambra": [(41, 209), (197, 213), (102, 92)],
        "Enugu": [(41, 210), (154, 118), (105, 114)],
        "Imo": [(41, 211), (197, 214), (102, 93)],
        "Abia": [(41, 212), (154, 119), (105, 115)],
        "Ebonyi": [(41, 213), (197, 215), (102, 94)],

        # South South
        "Rivers": [(41, 214), (154, 120), (105, 116)],
        "Delta": [(41, 215), (197, 216), (102, 95)],
        "Edo": [(41, 216), (154, 121), (105, 117)],
        "Cross River": [(41, 217), (197, 217), (102, 96)],
        "Akwa Ibom": [(41, 218), (154, 122), (105, 118)],
        "Bayelsa": [(41, 219), (197, 218), (102, 97)],

        # North Central
        "Abuja": [(41, 220), (154, 123), (102, 98)],
        "Niger": [(41, 221), (197, 219), (105, 119)],
        "Kwara": [(41, 222), (154, 124), (102, 99)],
        "Kogi": [(41, 223), (197, 220), (105, 120)],
        "Benue": [(41, 224), (154, 125), (102, 100)],
        "Plateau": [(41, 225), (197, 221), (105, 121)],
        "Nasarawa": [(41, 226), (154, 126), (102, 101)],

        # North West
        "Kano": [(154, 113), (41, 227), (105, 122)],
        "Kaduna": [(154, 114), (41, 228), (197, 222)],
        "Katsina": [(154, 127), (41, 229), (105, 123)],
        "Sokoto": [(154, 128), (197, 223), (102, 102)],
        "Zamfara": [(154, 129), (41, 230), (105, 124)],
        "Kebbi": [(154, 130), (197, 224), (102, 103)],
        "Jigawa": [(154, 131), (41, 231), (105, 125)],

        # North East
        "Borno": [(154, 132), (41, 232), (197, 225)],
        "Adamawa": [(154, 133), (197, 226), (105, 126)],
        "Yobe": [(154, 134), (41, 233), (102, 104)],
        "Bauchi": [(154, 135), (197, 227), (105, 127)],
        "Gombe": [(154, 136), (41, 234), (102, 105)],
        "Taraba": [(154, 137), (197, 228), (105, 128)]
    }

    if state in state_ip_ranges:
        range_choice = random.choice(state_ip_ranges[state])
    else:
        # Fallback for any missing states
        range_choice = (random.choice([41, 154, 197, 102, 105]), random.randint(200, 240))
    return f"{range_choice[0]}.{range_choice[1]}.{random.randint(0,255)}.{random.randint(1,254)}"

print("IP generation by state and region....")

IP generation by state and region....
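
A quick way to sanity-check addresses built this way is Python's standard `ipaddress` module. The sketch below is not part of the notebook: `sample_ip` is a hypothetical stand-in for the final formatting step of `generate_ip_by_state`, and the `(41, 203)` prefix is taken from the Lagos entry above.

```python
import ipaddress
import random

def sample_ip(prefix, rng):
    """Build a dotted-quad address from a (first_octet, second_octet) prefix."""
    return f"{prefix[0]}.{prefix[1]}.{rng.randint(0, 255)}.{rng.randint(1, 254)}"

rng = random.Random(42)
ip = sample_ip((41, 203), rng)

# ip_address() raises ValueError if the string is not a valid address
parsed = ipaddress.ip_address(ip)
print(ip, parsed.version, parsed.is_private)

# For comparison: 192.168.x.x addresses fall in a private range
print(ipaddress.ip_address("192.168.1.10").is_private)
```

The `is_private` flag is a cheap way to tell simulated public addresses apart from private-range ones when both appear in the same dataset.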


In [None]:
def generate_device_id():
    return hashlib.md5(str(uuid.uuid4()).encode()).hexdigest()[:16]
print("Generate_device_id(): creating a random unique device fingerprint...")

def generate_nearby_location(lat, lon, max_distance_km):
    angle = random.uniform(0, 2*math.pi)
    distance = random.uniform(0, max_distance_km)
    lat_offset = (distance * math.cos(angle)) / 111
    lon_offset = (distance * math.sin(angle)) / (111 * math.cos(math.radians(lat)))
    return lat + lat_offset, lon + lon_offset
print("Generate_nearby_location(): generating a realistic nearby latitude/longitude...")

def get_time_of_day_category(hour):
    if 6 <= hour <= 18:
        return "day"
    elif 19 <= hour <= 22:
        return "evening"
    else:
        return "night"
print("Get_time_of_day_category(): classifying hour into day, evening, or night...")

Generate_device_id(): creating a random unique device fingerprint...
Generate_nearby_location(): generating a realistic nearby latitude/longitude...
Get_time_of_day_category(): classifying hour into day, evening, or night...
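
`generate_nearby_location` relies on a flat-earth approximation of roughly 111 km per degree. The sketch below checks that approximation against a great-circle (haversine) distance. It repeats the helper so the block is self-contained; `haversine_km` and the 6371 km Earth radius are assumptions introduced here, not part of the notebook.

```python
import math
import random

def generate_nearby_location(lat, lon, max_distance_km):
    # Same approximation as the notebook helper: ~111 km per degree
    angle = random.uniform(0, 2 * math.pi)
    distance = random.uniform(0, max_distance_km)
    lat_offset = (distance * math.cos(angle)) / 111
    lon_offset = (distance * math.sin(angle)) / (111 * math.cos(math.radians(lat)))
    return lat + lat_offset, lon + lon_offset

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

random.seed(0)
home = (6.5244, 3.3792)  # Lagos coordinates from state_coords
distances = [haversine_km(*home, *generate_nearby_location(*home, 30)) for _ in range(1000)]
print(f"max distance: {max(distances):.2f} km")
```

Because the distance is drawn uniformly, the points cluster toward the center instead of covering the disc evenly, which may or may not be the intended spread of users around a state's reference point.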


In [None]:
def select_agent(channel_main, transaction_type, all_agents, target_state=None):
    """
    Select an appropriate agent based on:
    - channel_main (POS or MoMo)
    - transaction_type (Card or Transfer)
    - target_state (optional): restrict the choice to agents in this state
    Returns an agent ID, or None if no agent matches.
    """
    # Map the channel to the agent type (POS_AGENT or MOMO_AGENT)
    agent_type = {"POS": "POS_AGENT", "MoMo": "MOMO_AGENT"}.get(channel_main)
    if agent_type is None:
        return None

    # Agents store their supported transaction types under the 'channels' key
    available_agents = [aid for aid, info in all_agents.items()
                        if info['type'] == agent_type
                        and transaction_type in info['channels']
                        and (target_state is None or info['state'] == target_state)]
    return random.choice(available_agents) if available_agents else None

print("Selecting appropriate agent....")


Selecting appropriate agent....


In [None]:
def create_transaction(txn_id, user_id, user, amount, channel, lat, lon,
                      transaction_state, ip_address, agent_id, all_agents, timestamp):
    """Create a transaction record with all required fields"""
    home_location = (user['lat'], user['lon'])
    txn_location = (lat, lon)
    distance_from_home = geodesic(home_location, txn_location).kilometers

    return {
        "transaction_id": txn_id,
        "user_id": user_id,
        "amount": round(amount, 2),
        "channel": channel,
        "lat": round(lat, 6),
        "lon": round(lon, 6),
        "user_home_state": user['state'],
        "transaction_state": transaction_state,
        "distance_from_home_km": round(distance_from_home, 2),
        "ip_address": ip_address,
        "user_home_ip": user['home_ip'],
        "agent_id": agent_id,
        "agent_type": all_agents[agent_id]['type'] if agent_id else None,
        "timestamp": timestamp.isoformat(),
        "time_of_day": get_time_of_day_category(timestamp.hour)
    }
print("Creating transaction....")

Creating transaction....


In [None]:
def generate_burst_fraud(txn_id, user_id, user, channel, base_amount, timestamp, all_agents):
    """Generate velocity/burst attack fraud pattern"""
    transactions = []
    burst_size = random.randint(4, 12)
    burst_interval = random.randint(1, 5)

    # Pick a transaction type valid for this channel, then select the initial agent
    transaction_type = random.choice(transaction_types[channel])
    agent_id = select_agent(channel, transaction_type, all_agents)
    if agent_id:
        all_agents[agent_id]['suspicious_activity'] = True

    for b in range(burst_size):
        burst_time = timestamp + timedelta(minutes=b * burst_interval)
        amt = base_amount * random.uniform(0.8, 1.3)

        # 30% chance of cross-state fraud within burst
        if random.random() < 0.3:
            fraud_state = random.choice([s for s in states if s != user['state']])
            fraud_agent_id = select_agent(channel, transaction_type, all_agents, fraud_state)

            if fraud_agent_id:
                lat, lon = all_agents[fraud_agent_id]['lat'], all_agents[fraud_agent_id]['lon']
                current_agent_id = fraud_agent_id

            else:
                lat, lon = generate_nearby_location(*state_coords[fraud_state], 10)
                current_agent_id = agent_id

            transaction_state = fraud_state
            ip_address = generate_ip_by_state(fraud_state)
        else:
            lat, lon = generate_nearby_location(user['lat'], user['lon'], 3)
            transaction_state = user['state']
            current_agent_id = agent_id
            ip_address = user['home_ip'] if random.random() < 0.7 else generate_ip_by_state(user['state'])

        transaction = create_transaction(
            f"TXN{txn_id:08d}_{b}", user_id, user, amt, channel,
            lat, lon, transaction_state, ip_address, current_agent_id, all_agents, burst_time
        )
        transactions.append(transaction)

    return transactions
print("Generating velocity/burst transactions....")

Generating velocity/burst transactions....


In [None]:
def generate_cross_state_fraud(txn_id, user_id, user, base_amount, timestamp, all_agents):
    """
    Generate cross-state POS-Card fraud:
    - User stays in their home state
    - Transaction is recorded through an agent in another state
    """
    user_state = user['state']

    # Pick a fraud state different from the user's
    fraud_state = random.choice([s for s in states if s != user_state])

    # Agent belongs to fraud state
    fraud_agent_id = select_agent("POS", "Card", all_agents, fraud_state)

    if fraud_agent_id:
        fraud_lat, fraud_lon = all_agents[fraud_agent_id]['lat'], all_agents[fraud_agent_id]['lon']
    else:
        fraud_lat, fraud_lon = generate_nearby_location(*state_coords[fraud_state], 15)
        fraud_agent_id = None

    # Amount unusually high for fraud
    amount = base_amount * random.uniform(1.5, 3.0)

    # Fraudulent IP matches fraud state
    ip_address = generate_ip_by_state(fraud_state)

    # Transaction recorded in fraud state, but tied to user’s home state
    return create_transaction(
        f"TXN{txn_id:08d}", user_id, user, amount, "Card",
        fraud_lat, fraud_lon, fraud_state, ip_address, fraud_agent_id, all_agents, timestamp
    )
print("Generating cross state fraud....")

Generating cross state fraud....


In [None]:
def generate_same_state_pos_fraud(txn_id, user_id, user, base_amount, timestamp, all_agents):
    """
    Generate POS-Card fraud in the SAME state:
    - User is at home
    - Transaction happens in the same state but at a distant agent location
    """
    user_state = user['state']

    # Select an agent in the same state, but not the user's home location
    fraud_agent = random.choice([{"agent_id": agent_id, **agent_info}
        for agent_id, agent_info in all_agents.items()
        if agent_info['state'] == user_state
    ])

    fraud_agent_id = fraud_agent['agent_id']
    fraud_lat, fraud_lon = fraud_agent['lat'], fraud_agent['lon']

    # Fraud usually involves higher than normal amounts
    amount = base_amount * random.uniform(1.5, 3.0)

    # IP still resolves to same state
    ip_address = generate_ip_by_state(user_state)

    # Transaction logged as same-state, but far from user’s true home
    return create_transaction(
        f"TXN{txn_id:08d}", user_id, user, amount, "Card",
        fraud_lat, fraud_lon, user_state, ip_address, fraud_agent_id, all_agents, timestamp
    )
print("Generating same state fraud....")

Generating same state fraud....


In [None]:
def generate_normal_cross_state(txn_id, user_id, user, base_amount, timestamp, all_agents):
    """
    Generate a normal cross-state transaction (not fraud, just unusual but valid).
    Only for POS Transfer and Mobile Money.
    """
    # Limit channel to only POS Transfer or Mobile Money
    channel = random.choice(["POS Transfer", "Mobile Money"])

    # Pick an agent in a different state than the user
    user_state = user['state']
    agent = random.choice([{"agent_id": agent_id, **agent_info}
        for agent_id, agent_info in all_agents.items()
        if agent_info['state'] != user_state
    ])


    # Cross-state but valid
    transaction_state = agent['state']
    lat, lon = agent['lat'], agent['lon']
    ip_address = f"192.168.{random.randint(0,255)}.{random.randint(0,255)}"
    agent_id = agent['agent_id']

    # Build the transaction dictionary (no fraud flag)
    transaction = create_transaction(
        f"TXN{txn_id:08d}", user_id, user, base_amount, channel,
        lat, lon, transaction_state, ip_address, agent_id, all_agents, timestamp
    )

    return transaction
print("Generating normal cross state transaction....")

Generating normal cross state transaction....


In [None]:
def generate_round_amount_fraud(txn_id, user_id, user, channel, timestamp, all_agents):
    """Generate round number amount fraud at unusual times"""
    round_amounts = [50000, 100000, 150000, 200000, 250000, 500000, 1000000]
    amount = random.choice(round_amounts)

    fraud_hour = random.choice([2, 3, 4, 23, 1])
    fraud_time = timestamp.replace(hour=fraud_hour)

    txn_lat = user['lat'] + random.uniform(-0.1, 0.1)
    txn_lon = user['lon'] + random.uniform(-0.1, 0.1)

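    # select_agent also needs a transaction type valid for the channel
    agent_id = select_agent(channel, random.choice(transaction_types[channel]), all_agents)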

    return create_transaction(
        f"TXN{txn_id:08d}", user_id, user, amount, channel,
        txn_lat, txn_lon, user['state'], user['home_ip'], agent_id, all_agents, fraud_time
    )
print("Generating round amount at odd hours....")

Generating round amount at odd hours....


In [None]:
def generate_time_fraud(txn_id, user_id, user, channel, base_amount, timestamp, all_agents):
    """Generate unusual time pattern fraud"""
    unusual_hour = random.choice([1, 2, 3, 4, 5])
    unusual_time = timestamp.replace(hour=unusual_hour)
    amount = base_amount * random.uniform(1.2, 2.5)

    txn_lat, txn_lon = generate_nearby_location(user['lat'], user['lon'], 5)
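    # select_agent also needs a transaction type valid for the channel
    agent_id = select_agent(channel, random.choice(transaction_types[channel]), all_agents)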

    return create_transaction(
        f"TXN{txn_id:08d}", user_id, user, amount, channel,
        txn_lat, txn_lon, user['state'], user['home_ip'], agent_id, all_agents, unusual_time
    )

def generate_normal_transaction(txn_id, user_id, user, channel, base_amount, timestamp, all_agents):
    """Generate normal transaction"""
    txn_lat, txn_lon = generate_nearby_location(user['lat'], user['lon'], 5)
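    # select_agent also needs a transaction type valid for the channel
    agent_id = select_agent(channel, random.choice(transaction_types[channel]), all_agents)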

    return create_transaction(
        f"TXN{txn_id:08d}", user_id, user, base_amount, channel,
        txn_lat, txn_lon, user['state'], user['home_ip'], agent_id, all_agents, timestamp
    )
print("Generating unusual time pattern and normal transactions....")

Generating unusual time pattern and normal transactions....


In [None]:
# Generate Users with behavioral patterns
def generate_users():
    users = {}
    for i in range(n_users):
        state = random.choice(states)
        lat, lon = generate_nearby_location(*state_coords[state], 30)

        peak_hours = random.choice(["morning", "afternoon", "evening"])
        risk_profile = random.choice(["low", "medium", "high"])
        base_amount = 2000 if risk_profile == "low" else 50000 if risk_profile == "medium" else 10000

        users[f"USR{i:06d}"] = {
            "state": state,
            "lat": lat,
            "lon": lon,
            "avg_amount": np.random.lognormal(np.log(base_amount), 0.5),
            "device_id": generate_device_id(),
            "preferred_channel": random.choice(channels),
            "peak_hours": peak_hours,
            "risk_profile": risk_profile,
            "transaction_count": 0,
            "home_ip": generate_ip_by_state(state)
        }
    return users
print("Generating users....")

Generating users....


In [None]:
def generate_agents():
    """Generate POS and Mobile Money agents"""
    all_agents = {}

    # Generate POS agents
    for i in range(n_pos_agents):
        agent_state = random.choice(states)
        lat, lon = generate_nearby_location(*state_coords[agent_state], 20)

        all_agents[f"POS{i:05d}"] = {
            "type": "POS_AGENT",
            "state": agent_state,
            "lat": lat,
            "lon": lon,
            "channels": ["Card", "Transfer"],
        }

    # Generate Mobile Money agents
    for i in range(n_momo_agents):
        agent_state = random.choice(states)
        lat, lon = generate_nearby_location(*state_coords[agent_state], 20)

        all_agents[f"MOMO{i:05d}"] = {
            "type": "MOMO_AGENT",
            "state": agent_state,
            "lat": lat,
            "lon": lon,
            # Matches the "Transfer" type defined for MoMo in transaction_types
            "channels": ["Transfer"],
        }

    return all_agents
print("Generating agents....")

Generating agents....


In [None]:
# Main execution
users = generate_users()
all_agents = generate_agents()

# Enhanced transaction generation using helper functions
transactions = []
user_last_transaction = defaultdict(lambda: datetime.now() - timedelta(days=30))

for txn_id in range(1, n_transactions+1):
    user_id = random.choice(list(users.keys()))
    user = users[user_id]

    # Generate timestamp
    base_time = datetime.now() - timedelta(days=random.randint(0,90))
    hour = random.randint(8, 18) if random.random() < 0.7 else (
        random.randint(19, 23) if random.random() < 0.8 else random.randint(0, 7)
    )
    timestamp = base_time.replace(hour=hour, minute=random.randint(0,59))

    # Basic setup
    channel = user['preferred_channel'] if random.random() < 0.7 else random.choice(channels)
    base_amount = max(100, np.random.lognormal(np.log(user['avg_amount']), 0.6))

    # Generate different fraud patterns using helper functions
    if random.random() < 0.06:  # Burst fraud
        burst_transactions = generate_burst_fraud(txn_id, user_id, user, channel, base_amount, timestamp, all_agents)
        transactions.extend(burst_transactions)
        user['transaction_count'] += len(burst_transactions)
        user_last_transaction[user_id] = timestamp
        continue

    elif random.random() < 0.04:  # Cross-state fraud
        fraud_transaction = generate_cross_state_fraud(txn_id, user_id, user, base_amount, timestamp, all_agents)
        transactions.append(fraud_transaction)
        continue

    elif random.random() < 0.02:  # Same-state POS fraud (user at home, agent elsewhere in same state)
        fraud_transaction = generate_same_state_pos_fraud(txn_id, user_id, user, base_amount, timestamp, all_agents)
        transactions.append(fraud_transaction)
        continue

    elif random.random() < 0.03:  # Rare normal cross-state
        normal_cross = generate_normal_cross_state(txn_id, user_id, user, base_amount, timestamp, all_agents)
        transactions.append(normal_cross)
        continue

    elif random.random() < 0.03:  # Round amount fraud
        fraud_transaction = generate_round_amount_fraud(txn_id, user_id, user, channel, timestamp, all_agents)
        transactions.append(fraud_transaction)
        continue

    elif random.random() < 0.02:  # Time fraud
        fraud_transaction = generate_time_fraud(txn_id, user_id, user, channel, base_amount, timestamp, all_agents)
        transactions.append(fraud_transaction)
        continue

    else:  # Normal transaction
        normal_transaction = generate_normal_transaction(txn_id, user_id, user, channel, base_amount, timestamp, all_agents)
        transactions.append(normal_transaction)

    user['transaction_count'] += 1
    user_last_transaction[user_id] = timestamp
print("Combining all functions and executing....")

TypeError: select_agent() missing 1 required positional argument: 'all_agents'
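
Because each `elif` branch in the loop above draws a fresh `random.random()`, the thresholds act as conditional probabilities: a branch is only reached when every earlier branch was skipped. The short sketch below works out the resulting effective shares from the thresholds in the loop; the dictionary keys and variable names are introduced here for illustration.

```python
# Thresholds in the order the branches are tested in the main loop
thresholds = {
    "burst_fraud": 0.06,
    "cross_state_fraud": 0.04,
    "same_state_pos_fraud": 0.02,
    "normal_cross_state": 0.03,
    "round_amount_fraud": 0.03,
    "time_fraud": 0.02,
}

effective = {}
remaining = 1.0  # probability that no earlier branch fired
for name, t in thresholds.items():
    effective[name] = remaining * t
    remaining *= 1 - t
effective["normal"] = remaining

for name, p in effective.items():
    print(f"{name:22s} {p:.4f}")
```

These are shares of loop iterations, not of rows: each burst trigger emits between 4 and 12 transactions, so burst fraud makes up a larger fraction of the final dataset than its per-iteration probability suggests.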

In [None]:
# Save enhanced dataset
generated_transaction_data = pd.DataFrame(transactions)
generated_transaction_data = generated_transaction_data.sort_values('timestamp')

# Add derived features
generated_transaction_data['hour'] = pd.to_datetime(generated_transaction_data['timestamp']).dt.hour
generated_transaction_data['day_of_week'] = pd.to_datetime(generated_transaction_data['timestamp']).dt.dayofweek


# Save to CSV
file_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
generated_transaction_data.to_csv(f"transactions_data{file_timestamp}.csv", index=False)
print(f"Saved as: transactions_data{file_timestamp}.csv")

In [None]:
!wget https://raw.githubusercontent.com/Adaezeh/TrustShieldAI/main/transactions_data.csv -O transactions_data.csv

### Exploratory Data Analysis: Visualizing Transaction Features


Use histograms, count plots, and aggregation to explore numeric and categorical transaction features:
- Analyze numeric columns: amount, lat, lon, distance_from_home_km, hour, day_of_week
- Visualize distributions to identify patterns, ranges, and potential outliers
- Examine categorical columns: channel, user_home_state, transaction_state, agent_type, time_of_day
- Generate count plots to understand category frequencies and dataset composition

What we did:
- Plotted histograms for numeric features to observe distribution and detect anomalies
- Created count plots for categorical features to reveal popular channels, state-level activity, and agent types
- Identified temporal trends in transactions based on time-of-day and day-of-week
- Gained insights into the overall dataset composition, informing feature engineering and fraud detection modeling



In [None]:
transaction_data =pd.read_csv("transactions_data.csv")

In [None]:
transaction_data.head()

In [None]:
print(transaction_data.shape)

In [None]:
print(transaction_data.info())

In [None]:
print(transaction_data.isnull().sum())

In [None]:
transaction_data.columns

In [None]:
numeric_cols = ['amount', 'lat', 'lon', 'distance_from_home_km', 'hour', 'day_of_week']

transaction_data[numeric_cols].hist(figsize=(12,10), bins=30)
plt.suptitle("Distribution of Numeric Features", fontsize=16)
plt.show()


In [None]:
import seaborn as sns

In [None]:
categorical_cols = ['channel', 'user_home_state', 'transaction_state', 'agent_type', 'time_of_day']

for col in categorical_cols:
    plt.figure(figsize=(8,4))
    sns.countplot(x=col, data=transaction_data, order=transaction_data[col].value_counts().index)
    plt.xticks(rotation=45)
    plt.title(f"Distribution of {col}")
    plt.show()


### Fraud Detection with Isolation Forest

Use preprocessing, feature engineering, and unsupervised learning to detect anomalous transactions:
- Transform numeric features (e.g., log of amount, distance from home, percentiles)
- Encode categorical variables (channel, state, agent type, time-of-day)
- Engineer datetime features (hour and day-of-week as sine/cosine)
- Flag night/weekend transactions and very distant transactions

What we did:
- Preprocessed the transaction dataset, creating numeric, categorical, and derived features
- Scaled features using StandardScaler for model stability
- Split the dataset into training and testing sets
- Trained an Isolation Forest model to detect anomalies without a fixed contamination rate
- Predicted fraud for both train and test sets, generating a confidence score
- Merged predictions with original and derived features for flagged transactions
- Examined feature statistics to understand distribution of values
- Saved the trained model for reuse on new transaction data
- Created a helper function to apply the model for real-time detection on new transactions



In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import joblib

In [None]:
def preprocess_features(df, label_encoders=None, fit_encoders=True):
    df_processed = df.copy()

    # Datetime features
    if 'timestamp' in df_processed.columns:
        df_processed['hour_sin'] = np.sin(2 * np.pi * df_processed['hour'] / 24)
        df_processed['hour_cos'] = np.cos(2 * np.pi * df_processed['hour'] / 24)
        df_processed['day_sin'] = np.sin(2 * np.pi * df_processed['day_of_week'] / 7)
        df_processed['day_cos'] = np.cos(2 * np.pi * df_processed['day_of_week'] / 7)

    # Additional features
    df_processed['amount_log'] = np.log1p(df_processed['amount'])
    df_processed['is_same_state'] = (df_processed['user_home_state'] == df_processed['transaction_state']).astype(int)
    df_processed['is_home_ip'] = (df_processed['ip_address'] == df_processed['user_home_ip']).astype(int)

    # Categorical encoding
    categorical_cols = ['channel','user_home_state','transaction_state','agent_type','time_of_day']
    if label_encoders is None:
        label_encoders = {}

    for col in categorical_cols:
        if col in df_processed.columns:
            if fit_encoders:
                label_encoders[col] = LabelEncoder()
                df_processed[f'{col}_encoded'] = label_encoders[col].fit_transform(df_processed[col].astype(str))
            else:
                known_categories = set(label_encoders[col].classes_)
                df_processed[col] = df_processed[col].astype(str)
                df_processed[col] = df_processed[col].apply(lambda x: x if x in known_categories else 'unknown')
                if 'unknown' not in known_categories:
                    label_encoders[col].classes_ = np.append(label_encoders[col].classes_, 'unknown')
                df_processed[f'{col}_encoded'] = label_encoders[col].transform(df_processed[col])

    # Feature columns
    feature_cols = [
        'amount','amount_log','lat','lon','distance_from_home_km',
        'hour','day_of_week','hour_sin','hour_cos','day_sin','day_cos',
        'is_same_state','is_home_ip'
    ]
    for col in categorical_cols:
        if f'{col}_encoded' in df_processed.columns:
            feature_cols.append(f'{col}_encoded')
    return df_processed[feature_cols], label_encoders
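
The `fit_encoders=False` branch above routes categories that were not seen during fitting into a shared `'unknown'` bucket. The following dictionary-based sketch illustrates the same idea without scikit-learn; it is an analogue, not the `LabelEncoder` implementation, and the `"USSD"` channel is a hypothetical unseen value.

```python
def fit_codes(values):
    """Assign an integer code to each distinct category, in sorted order."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}

def transform_codes(values, codes, unknown="unknown"):
    """Encode values, sending categories not seen during fitting to a shared bucket."""
    codes = dict(codes)                      # leave the fitted mapping untouched
    codes.setdefault(unknown, len(codes))    # reserve a code for unseen categories
    return [codes.get(v, codes[unknown]) for v in values]

train = ["POS", "MoMo", "POS"]
codes = fit_codes(train)
encoded = transform_codes(["MoMo", "USSD", "POS"], codes)
print(encoded)
```

Reserving one extra code keeps inference from failing on new categories, at the cost of treating every unseen value as identical.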

In [None]:
def train_isolation_forest(X_train, contamination=0.05, max_samples=30000):
    iso_forest = IsolationForest(
        contamination=contamination,
        random_state=42,
        n_estimators=100,
        max_samples=min(max_samples, len(X_train))
    )
    iso_forest.fit(X_train)
    return iso_forest

In [None]:
def predict_if_fraud(X_scaled, iso_forest):
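    # IsolationForest.predict returns +1 for inliers and -1 for anomalies
    if_pred = iso_forest.predict(X_scaled)
    if_anomalies = (if_pred == -1).astype(int)  # 1 = flagged as potential fraud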
    results = pd.DataFrame({
        'is_fraud': if_anomalies,
        'isolation_forest_flagged': if_anomalies
    }, index=X_scaled.index)
    return results
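
scikit-learn's `IsolationForest.predict` labels inliers as `+1` and anomalies as `-1`, so the raw output has to be mapped before it can serve as a 0/1 fraud flag. The sketch below shows that mapping on synthetic data, together with `decision_function` as a continuous score. The data, the variable names, and the use of `contamination="auto"` are assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])  # synthetic points far from the cluster
X = np.vstack([inliers, outliers])

model = IsolationForest(n_estimators=100, contamination="auto", random_state=42).fit(X)

pred = model.predict(X)                # +1 = inlier, -1 = anomaly
is_fraud = (pred == -1).astype(int)    # 1 = flagged
score = -model.decision_function(X)    # larger = more anomalous

print(is_fraud[-2:], score[-2:])
```

Negating `decision_function` gives a score that increases with anomaly strength, which is one possible basis for ranking flagged transactions.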

In [None]:
# Preprocess
X, label_encoders = preprocess_features(transaction_data)

X.head(10)


In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled


In [None]:
# Train Isolation Forest
iso_forest = train_isolation_forest(X_scaled)

# Predict fraud
results_if = predict_if_fraud(pd.DataFrame(X_scaled, columns=X.columns), iso_forest)
results_if.head()

In [None]:
# Add original (unscaled) features back
results_with_features = pd.concat([X.reset_index(drop=True), results_if.reset_index(drop=True)], axis=1)

results_with_features.head()


In [None]:
# new
print("Loading your transaction data...")
print(f"Dataset loaded: {transaction_data.shape[0]} transactions, {transaction_data.shape[1]} features")

class EnhancedIsolationForestDetector:
    """
    Unsupervised Isolation Forest detector for fraud/outlier detection.
    Automatically determines anomalies without a fixed contamination rate.
    """

    def __init__(self, max_samples=30000):
        self.max_samples = max_samples
        self.iso_forest = None
        self.scaler = StandardScaler()
        self.label_encoders = {}
        self.feature_names = []
        self.feature_stats = {}

    def preprocess_features(self, df, fit_encoders=True):
        """Preprocess dataset: numeric transformations, datetime features, categorical encoding"""
        df_processed = df.copy()

        # Datetime features (your dataset already has hour and day_of_week)
        if 'hour' in df_processed.columns:
            df_processed['hour_sin'] = np.sin(2 * np.pi * df_processed['hour'] / 24)
            df_processed['hour_cos'] = np.cos(2 * np.pi * df_processed['hour'] / 24)
        if 'day_of_week' in df_processed.columns:
            df_processed['day_sin'] = np.sin(2 * np.pi * df_processed['day_of_week'] / 7)
            df_processed['day_cos'] = np.cos(2 * np.pi * df_processed['day_of_week'] / 7)

        # Numeric features
        df_processed['amount_log'] = np.log1p(df_processed['amount'])
        df_processed['is_same_state'] = (df_processed['user_home_state'] == df_processed['transaction_state']).astype(int)
        df_processed['is_home_ip'] = (df_processed['ip_address'] == df_processed['user_home_ip']).astype(int)
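        # Rank is computed within the batch being processed, so it depends on the other rows present
        df_processed['amount_percentile'] = df_processed['amount'].rank(pct=True)
        # Night covers hours 22 and 23 plus 0 through 5
        df_processed['is_night_transaction'] = ((df_processed['hour'] >= 22) | (df_processed['hour'] <= 5)).astype(int)
        # Assumes day_of_week uses Monday=0, so 5 and 6 are Saturday and Sunday
        df_processed['is_weekend'] = (df_processed['day_of_week'] >= 5).astype(int)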

        if 'distance_from_home_km' in df_processed.columns:
            df_processed['very_far_from_home'] = (df_processed['distance_from_home_km'] > 500).astype(int)
            df_processed['distance_log'] = np.log1p(df_processed['distance_from_home_km'])

        # Categorical encoding
        categorical_cols = ['channel', 'user_home_state', 'transaction_state', 'agent_type', 'time_of_day']
        for col in categorical_cols:
            if col in df_processed.columns:
                if fit_encoders:
                    self.label_encoders[col] = LabelEncoder()
                    df_processed[f'{col}_encoded'] = self.label_encoders[col].fit_transform(df_processed[col].astype(str))
                else:
                    if col in self.label_encoders:
                        # Map categories not seen during fitting to a shared 'unknown' label
                        known_categories = set(self.label_encoders[col].classes_)
                        df_processed[col] = df_processed[col].astype(str).apply(lambda x: x if x in known_categories else 'unknown')
                        # Register 'unknown' with the encoder once so transform() accepts it
                        if 'unknown' not in known_categories:
                            self.label_encoders[col].classes_ = np.append(self.label_encoders[col].classes_, 'unknown')
                        df_processed[f'{col}_encoded'] = self.label_encoders[col].transform(df_processed[col])

        # Combine all feature columns
        feature_cols = [
            'amount','amount_log','amount_percentile',
            'lat','lon','distance_from_home_km',
            'hour','day_of_week','hour_sin','hour_cos','day_sin','day_cos',
            'is_same_state','is_home_ip','is_night_transaction','is_weekend'
        ]
        if 'very_far_from_home' in df_processed.columns:
            feature_cols.extend(['very_far_from_home', 'distance_log'])
        for col in categorical_cols:
            if f'{col}_encoded' in df_processed.columns:
                feature_cols.append(f'{col}_encoded')

        available_features = [col for col in feature_cols if col in df_processed.columns]
        self.feature_names = available_features

        return df_processed[available_features]

    def train_isolation_forest(self, X_train):
        """Train Isolation Forest without fixed contamination"""
        print(f"Training Isolation Forest with {X_train.shape[0]} samples and {X_train.shape[1]} features...")
        self.iso_forest = IsolationForest(
            contamination='auto',  # automatically determines anomaly threshold
            random_state=42,
            n_estimators=100,
            max_samples=min(self.max_samples, len(X_train))
        )
        self.iso_forest.fit(X_train)

        feature_df = pd.DataFrame(X_train, columns=self.feature_names)
        self.feature_stats = {
            'means': feature_df.mean().to_dict(),
            'stds': feature_df.std().to_dict(),
            'mins': feature_df.min().to_dict(),
            'maxs': feature_df.max().to_dict()
        }

        print("Training complete!")
        return self.iso_forest

    def predict_fraud(self, X_scaled, return_scores=False):
        """Predict anomalies (fraud)"""
        if self.iso_forest is None:
            raise ValueError("Model not trained yet. Call train_isolation_forest first.")

        if_pred = self.iso_forest.predict(X_scaled)
        if_scores = self.iso_forest.decision_function(X_scaled)
        if_anomalies = (if_pred == -1).astype(int)

        results = pd.DataFrame({
            'is_fraud': if_anomalies,
            'isolation_forest_flagged': if_anomalies,
            'confidence_score': self.normalize_scores(if_scores)
        })
        if return_scores:
            results['raw_anomaly_score'] = if_scores

        return results

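    def normalize_scores(self, scores):
        """Min-max rescale decision_function scores to [0, 1], inverted so 1 = most anomalous in this batch"""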
        min_score = scores.min()
        max_score = scores.max()
        if min_score == max_score:
            return np.zeros_like(scores)
        normalized = (max_score - scores) / (max_score - min_score)
        return np.clip(normalized, 0, 1)

    def save_model(self, filepath_prefix='isolation_forest_model'):
        """Save model and preprocessing objects"""
        model_data = {
            'iso_forest': self.iso_forest,
            'scaler': self.scaler,
            'label_encoders': self.label_encoders,
            'feature_names': self.feature_names,
            'feature_stats': self.feature_stats,
            'max_samples': self.max_samples
        }
        joblib.dump(model_data, f'{filepath_prefix}.pkl')
        print(f"✅ Model saved as {filepath_prefix}.pkl")

    def load_model(self, filepath):
        """Load previously trained model"""
        model_data = joblib.load(filepath)
        self.iso_forest = model_data['iso_forest']
        self.scaler = model_data['scaler']
        self.label_encoders = model_data['label_encoders']
        self.feature_names = model_data['feature_names']
        self.feature_stats = model_data['feature_stats']
        self.max_samples = model_data['max_samples']
        print(f"✅ Model loaded from {filepath}")
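
The `confidence_score` column comes from `normalize_scores`, which min-max rescales the `decision_function` values and inverts them so that the most anomalous row in a batch receives 1.0. Because the minimum and maximum are taken from the batch being scored, the result is a relative ranking rather than a calibrated probability. A small sketch with made-up scores (the function below is a standalone copy of the method's logic, and both score arrays are illustrative):

```python
import numpy as np

def normalize_scores(scores):
    """Standalone copy of the method's logic: invert and min-max rescale to [0, 1]."""
    min_score, max_score = scores.min(), scores.max()
    if min_score == max_score:
        return np.zeros_like(scores, dtype=float)
    return np.clip((max_score - scores) / (max_score - min_score), 0, 1)

# Made-up decision_function values; lower means more anomalous
batch_a = np.array([0.5, 0.25, 0.0, -0.5])
batch_b = np.array([0.5, 0.25, 0.0])  # same rows without the most extreme one

conf_a = normalize_scores(batch_a)
conf_b = normalize_scores(batch_b)

print(conf_a.tolist())  # [0.0, 0.25, 0.5, 1.0]
print(conf_b.tolist())  # [0.0, 0.5, 1.0]
```

The row with raw score 0.0 gets 0.5 in the first batch and 1.0 in the second, so confidence values are only comparable within the same call to `predict_fraud`.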



In [None]:
#  USAGE WORKFLOW

# Step 1: Initialize the detector
detector = EnhancedIsolationForestDetector(max_samples=30000)

# Step 2: Preprocess the transaction_data (add our CSV here)
print("\nPreprocessing features...")
X_features = detector.preprocess_features(transaction_data, fit_encoders=True)
print(f"Features created: {list(X_features.columns)}")

# Step 3: Scale the features
print("\nScaling features...")
X_scaled = detector.scaler.fit_transform(X_features)

# Step 4: Split data for training/testing (optional - you can train on all data)
# Split the rows and their positions in one call so both stay aligned
row_positions = np.arange(len(X_scaled))
X_train, X_test, train_indices, test_indices = train_test_split(
    X_scaled, row_positions, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Step 5: Train the model
detector.train_isolation_forest(X_train)

# Step 6: Make predictions
print("\nMaking fraud predictions...")
train_results = detector.predict_fraud(X_train, return_scores=True)
test_results = detector.predict_fraud(X_test, return_scores=True)

# Step 7: Analyze results
print(f"\n TRAINING SET RESULTS ")
print(f"Total transactions: {len(train_results)}")
print(f"Flagged as fraud: {train_results['is_fraud'].sum()}")
print(f"Fraud rate: {train_results['is_fraud'].mean():.2%}")

print(f"\n TEST SET RESULTS ")
print(f"Total transactions: {len(test_results)}")
print(f"Flagged as fraud: {test_results['is_fraud'].sum()}")
print(f"Fraud rate: {test_results['is_fraud'].mean():.2%}")

# Step 8: Examine flagged transactions (merged original + derived features)

# Get row positions of flagged anomalies in the test set
flagged_indices = [idx for idx, val in zip(test_indices, test_results['is_fraud'].values) if val == 1]

# test_indices are row positions, so select with .iloc rather than .loc
flagged_transactions = transaction_data.iloc[flagged_indices].copy()

# Add derived features from processed DataFrame
derived_features = ['is_same_state', 'is_home_ip', 'is_night_transaction',
                    'is_weekend', 'very_far_from_home', 'distance_log']
# Only include features that exist (some may not exist depending on your data)
available_derived = [f for f in derived_features if f in X_features.columns]
flagged_transactions = pd.concat([flagged_transactions, X_features.iloc[flagged_indices][available_derived]], axis=1)

# Print a sample of flagged transactions
# ('is_same_state' is already part of available_derived, so it is not listed twice)
print("\nSAMPLE FLAGGED TRANSACTIONS")
print(flagged_transactions[['transaction_id', 'user_id', 'amount', 'channel',
                            'distance_from_home_km', 'time_of_day'] + available_derived].head())


# Step 9: Summary statistics of the scaled training features
# (these describe the standardized inputs; they are not feature importances)
print(f"\n FEATURE STATISTICS ")
for feature, mean_val in detector.feature_stats['means'].items():
    print(f"{feature}: mean={mean_val:.3f}, std={detector.feature_stats['stds'][feature]:.3f}")

# Step 10: Save the model for future use
detector.save_model('fraud_detector_model')
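
When a held-out test split is used, fitting the scaler on the training rows only keeps the test rows from influencing the scaling statistics. The sketch below shows one way to do that while carrying row positions through the same `train_test_split` call. It runs on random stand-in data, and the names `X_demo`, `positions`, and `demo_scaler` are illustrative rather than part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix (random data, not the notebook's transactions)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
positions = np.arange(len(X_demo))

# One call splits the features and their row positions consistently
X_tr, X_te, pos_tr, pos_te = train_test_split(
    X_demo, positions, test_size=0.2, random_state=42
)

# Fit scaling statistics on the training rows only, then reuse them on the test rows
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

The positions in `pos_te` can then be used with `.iloc` on the original DataFrame to look up the transactions behind any flagged test rows.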





In [None]:
transaction_data.head()

In [None]:
transaction_data.columns

In [None]:
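def detect_fraud_on_new_data(new_transactions_csv):
    """
    Example of how to use the trained model on new transaction data.
    `new_transactions_csv` is the path to a CSV with the same columns as the training data.
    """
    # Load new data from the path passed in
    new_data = pd.read_csv(new_transactions_csv)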

    # Load the trained model
    new_detector = EnhancedIsolationForestDetector()
    new_detector.load_model('fraud_detector_model.pkl')

    # Preprocess (fit_encoders=False since we use existing encoders)
    X_new = new_detector.preprocess_features(new_data, fit_encoders=False)

    # Scale using existing scaler
    X_new_scaled = new_detector.scaler.transform(X_new)

    # Predict
    fraud_results = new_detector.predict_fraud(X_new_scaled, return_scores=True)

    # Add results back to original data by position (fraud_results has its own 0..n-1 index)
    new_data_with_predictions = new_data.copy()
    new_data_with_predictions['is_fraud'] = fraud_results['is_fraud'].values
    new_data_with_predictions['fraud_confidence'] = fraud_results['confidence_score'].values

    return new_data_with_predictions

print(f"\n MODEL TRAINING COMPLETE ")
print("Your transaction_data.csv has been processed and the model is trained!")
print("You can now use detect_fraud_on_new_data() function for real-time detection.")


=== MODEL TRAINING COMPLETE ===
Your transaction_data.csv has been processed and the model is trained!
You can now use detect_fraud_on_new_data() function for real-time detection.


## Model Deployment on Streamlit  

To make our fraud detection system accessible and interactive, we deployed the trained model on **Streamlit**.  
The application allows users to upload new transaction data and receive real-time fraud detection insights.  

🔗 **Access the Streamlit App here:** [Streamlit Deployment Link](https://trustshieldai-nuozwqtlqvkgl8xpjmhq5k.streamlit.app/)  
