# Fraud Detection using Anomaly Detection Models

## Business Problem
Detect fraudulent credit card transactions in a high-volume payment gateway using anomaly detection techniques.

## Constraints
- Fraud rate ~2%
- Prediction latency <100ms
- False Negative cost = 10x False Positive
- Explainability required
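
The latency constraint can be checked empirically once a scoring function exists. The sketch below is a minimal timing harness, assuming a callable that scores one transaction at a time. `measure_latency_ms` and `dummy_score` are hypothetical names introduced for illustration; in practice the callable would wrap the fitted scaler and model.

```python
import statistics
import time


def measure_latency_ms(score_fn, sample, n_runs=200):
    """Time repeated calls to score_fn(sample); return (median, p95) in milliseconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        score_fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return statistics.median(timings), timings[p95_index]


def dummy_score(row):
    # Hypothetical stand-in for scaler.transform + model.decision_function
    return sum(row) / len(row)


median_ms, p95_ms = measure_latency_ms(dummy_score, [2.86, 333497, 12, 4, 6, 24.56])
print(f"median={median_ms:.4f} ms, p95={p95_ms:.4f} ms")
```

Reporting a tail percentile alongside the median is a design choice: a budget such as <100ms is usually violated by slow outlier calls rather than by the typical one.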

## Dataset
Simulated credit card transactions (customers, merchants, geo, time)

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path


In [2]:
DATA_PATH = Path("C:/Users/prate/Downloads/fraudTest.csv/fraudTest.csv")
df = pd.read_csv(DATA_PATH)

df.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


In [3]:
df.shape, df.columns

((555719, 23),
 Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
        'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
        'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
        'merch_lat', 'merch_long', 'is_fraud'],
       dtype='object'))

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_trans_time  555719 non-null  object 
 2   cc_num                 555719 non-null  int64  
 3   merchant               555719 non-null  object 
 4   category               555719 non-null  object 
 5   amt                    555719 non-null  float64
 6   first                  555719 non-null  object 
 7   last                   555719 non-null  object 
 8   gender                 555719 non-null  object 
 9   street                 555719 non-null  object 
 10  city                   555719 non-null  object 
 11  state                  555719 non-null  object 
 12  zip                    555719 non-null  int64  
 13  lat                    555719 non-null  float64
 14  long                   555719 non-nu

## Initial Observations
- Dataset contains transaction-level records with customer and merchant details
- Fraud label is highly imbalanced
- Includes temporal, geographic, and behavioral features
- Suitable for anomaly detection methods

In [5]:
df = df.drop(columns=["Unnamed: 0"])

In [6]:
ID_COLUMNS = [
    "trans_num",
    "cc_num",
    "merchant"
]

TARGET = "is_fraud"

POTENTIAL_FEATURES = [col for col in df.columns if col not in ID_COLUMNS + [TARGET]]

In [7]:
df[TARGET].value_counts(normalize=True)

0    0.99614
1    0.00386
Name: is_fraud, dtype: float64

About 0.39% of transactions are fraudulent, confirming the heavy class imbalance that motivates an anomaly detection approach.

### The following columns are excluded from modeling:
- Names (first, last): Personally identifiable, no behavioral signal
- Street, city, zip: High-cardinality location identifiers that generalize poorly (`state` is kept and one-hot encoded)
- Job: High cardinality, weak signal for transaction-level fraud
- trans_num, cc_num, merchant: Identifiers, risk of memorization

## Timestamp Handling Decision

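The columns `trans_date_trans_time` and `dob` were treated as masked (####) and dropped. Note that the `df.head()` output above shows readable values in both (for example `2020-06-21 12:14:25` and `1968-03-19`), so the masking likely reflects how the file was displayed rather than its contents.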

The `unix_time` column was retained as it provides valid transaction time information and enables derivation of temporal features such as hour, day of week, and transaction velocity.
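
The decoding of `unix_time` can be sanity-checked on a single value before deriving features from it. The sketch below uses only the standard library and the first `unix_time` value from the `df.head()` output. It decodes to 2013-06-21 12:14:25 UTC, whereas the same row's `trans_date_trans_time` reads 2020-06-21 12:14:25. Hour and month agree, but weekday values derived from `unix_time` should be read with that year offset in mind.

```python
from datetime import datetime, timezone

# First unix_time value shown in the df.head() output
ts = 1371816865
decoded = datetime.fromtimestamp(ts, tz=timezone.utc)

print(decoded.isoformat())  # full UTC timestamp
print(decoded.hour)         # hour of day, 0-23
print(decoded.weekday())    # Monday=0 ... Sunday=6, same convention as pandas dayofweek
```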


In [8]:
df = df.drop(columns=["trans_date_trans_time", "dob"])

In [9]:
df["transaction_datetime"] = pd.to_datetime(df["unix_time"], unit="s", utc=True)

In [10]:
df["hour"] = df["transaction_datetime"].dt.hour
df["day_of_week"] = df["transaction_datetime"].dt.dayofweek
df["month"] = df["transaction_datetime"].dt.month

In [11]:
df = df.drop(columns=["unix_time", "transaction_datetime"])

In [12]:
# Great-circle distance (km) between the cardholder's home and the merchant
from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    return 6371 * c  # Earth radius in km

df["distance_from_home"] = df.apply(
    lambda x: haversine(x["lat"], x["long"], x["merch_lat"], x["merch_long"]),
    axis=1
)

In [13]:
df = df.drop(columns=["lat", "long", "merch_lat", "merch_long"])

### Feature Engineering Summary

Temporal features (hour, day_of_week, month) were derived from unix_time to capture spending patterns.

Geospatial distance was computed between customer home and merchant location using the Haversine formula. Raw coordinates were dropped to reduce noise and improve model stability.
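
Row-wise `df.apply` can be slow on several hundred thousand rows. As a possible alternative, the same Haversine formula can be applied to whole columns at once with NumPy. This is a sketch rather than the notebook's method; `haversine_km` is a hypothetical name, and the sample coordinates are the first two rows of the `df.head()` output.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or equal-length arrays of degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2.0 * np.arcsin(np.sqrt(a))  # 6371 km = mean Earth radius


# First two rows from the df.head() output
home_lat = np.array([33.9659, 40.3207])
home_lon = np.array([-80.9355, -110.436])
merch_lat = np.array([33.986391, 39.450498])
merch_lon = np.array([-81.200714, -109.960431])

distances = haversine_km(home_lat, home_lon, merch_lat, merch_lon)
print(distances)
```

On the full frame, the call would take the four coordinate columns directly, for example `haversine_km(df["lat"], df["long"], df["merch_lat"], df["merch_long"])`, provided it runs before those columns are dropped.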

In [14]:
DROP_COLUMNS = [
    "first", "last", "street", "city", "zip", "job",
    "merchant", "trans_num", "cc_num"
]

df_model = df.drop(columns=DROP_COLUMNS)

In [15]:
y = df_model["is_fraud"]
X = df_model.drop(columns=["is_fraud"])

In [16]:
categorical_cols = ["category", "gender", "state"]
numerical_cols = [col for col in X.columns if col not in categorical_cols]

In [17]:
X_encoded = pd.get_dummies(
    X,
    columns=categorical_cols,
    drop_first=True
)
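
One-hot encoding fixes the column layout at training time, so any transaction scored later has to be mapped onto the same columns. The sketch below shows one way to do that with `reindex`, on a toy frame; the frames and variable names are hypothetical. Note that the new row is encoded without `drop_first`, since a single row has only one category per column and `drop_first=True` would discard it.

```python
import pandas as pd

# Toy training frame standing in for X
train = pd.DataFrame({
    "amt": [1.0, 2.0, 3.0],
    "category": ["travel", "misc_pos", "personal_care"],
})
train_enc = pd.get_dummies(train, columns=["category"], drop_first=True)

# A single incoming transaction (hypothetical)
new = pd.DataFrame({"amt": [9.5], "category": ["travel"]})

# Encode without drop_first, then align to the training columns;
# categories missing from the new row are filled with 0
new_enc = pd.get_dummies(new, columns=["category"]).reindex(
    columns=train_enc.columns, fill_value=0
)

print(list(train_enc.columns))
print(new_enc)
```

Persisting `train_enc.columns` next to the scaler would keep the inference path consistent with training.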

In [18]:
X_encoded.shape
X_encoded.head()

Unnamed: 0,amt,city_pop,hour,day_of_week,month,distance_from_home,category_food_dining,category_gas_transport,category_grocery_net,category_grocery_pos,...,state_SD,state_TN,state_TX,state_UT,state_VA,state_VT,state_WA,state_WI,state_WV,state_WY
0,2.86,333497,12,4,6,24.561462,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,29.84,302,12,4,6,104.925092,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,41.28,34496,12,4,6,59.080078,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,60.05,54767,12,4,6,27.698567,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3.19,1126,12,4,6,104.335106,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
X_encoded.dtypes.value_counts()

uint8      63
int64       4
float64     2
dtype: int64

In [20]:
from sklearn.model_selection import train_test_split

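# First attempt: split normal transactions only.
# Note: this X_val contains no fraud rows, so y_val has a single class and
# ROC-AUC cannot be computed. Cells 21-22 rebuild the split with fraud included.
X_normal = X_encoded[y == 0]

X_train, X_val = train_test_split(
    X_normal,
    test_size=0.2,
    random_state=42
)

# Validation labels (all zeros for this split)
y_val = y.loc[X_val.index]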

In [21]:
# Train ONLY on normal (non-fraud) transactions
X_train = X_encoded[y == 0].sample(frac=0.8, random_state=42)

In [22]:
# Validation set contains remaining normal + all fraud
X_val = X_encoded.drop(X_train.index)
y_val = y.loc[X_val.index]

In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

In [24]:
import joblib

joblib.dump(scaler, "scaler.joblib")

['scaler.joblib']

In [25]:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score
)

In [None]:
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.004,  # ~0.4% expected fraud
    max_samples="auto",
    random_state=42,
    n_jobs=-1
)

iso_forest.fit(X_train_scaled)

In [None]:
# Higher score = more normal, lower = more anomalous
anomaly_scores = iso_forest.decision_function(X_val_scaled)

# Convert to anomaly score (higher = more anomalous)
anomaly_scores = -anomaly_scores
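
The sign convention above is easy to get backwards, so it may help to confirm it on synthetic data. The sketch below fits a small Isolation Forest on a toy Gaussian cluster and scores one typical point and one obvious outlier; all data and names here are illustrative assumptions, not part of the fraud dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_normal_toy = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # synthetic "normal" cluster

toy_forest = IsolationForest(n_estimators=100, random_state=42).fit(X_normal_toy)

probes = np.array([
    [0.0, 0.0],  # typical point near the cluster centre
    [8.0, 8.0],  # obvious outlier
])
raw = toy_forest.decision_function(probes)  # higher = more normal
flipped = -raw                              # higher = more anomalous

print("raw:", raw)
print("flipped:", flipped)
```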

In [None]:
threshold = np.percentile(anomaly_scores, 99.6)
y_pred = (anomaly_scores >= threshold).astype(int)
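
A percentile threshold flags a fixed share of the validation set, which puts a ceiling on recall whenever fraud is more common than that share. The arithmetic below works this out from the figures reported earlier (555,719 rows, a 0.386% fraud rate, 80% of normal rows used for training). The counts are approximations reconstructed from those outputs, not values read from the data.

```python
# Approximate counts reconstructed from the notebook outputs (assumptions, not exact)
n_rows = 555_719
fraud_rate = 0.00386

n_fraud = round(n_rows * fraud_rate)             # fraud rows, all placed in validation
n_normal = n_rows - n_fraud
n_val_normal = n_normal - round(0.8 * n_normal)  # 20% of normal rows held out
n_val = n_val_normal + n_fraud

val_fraud_share = n_fraud / n_val
n_flagged = round(0.004 * n_val)                 # rows at or above the 99.6th percentile
max_recall = min(1.0, n_flagged / n_fraud)       # best case: every flagged row is fraud

print(f"validation fraud share: {val_fraud_share:.2%}")
print(f"flagged rows: {n_flagged}, recall ceiling: {max_recall:.1%}")
```

Under these assumptions, fraud makes up roughly 1.9% of the validation set and recall cannot exceed about 21%, whatever the quality of the ranking.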

In [None]:
print("y_val unique classes:", np.unique(y_val))
print("y_val shape:", y_val.shape)
print("anomaly_scores shape:", anomaly_scores.shape)

In [None]:
y_val.value_counts()

In [None]:
y_val_array = y_val.values

In [None]:
# Threshold-dependent metrics use the binary predictions;
# ROC-AUC uses the continuous anomaly scores.
precision = precision_score(y_val_array, y_pred)
recall = recall_score(y_val_array, y_pred)
f1 = f1_score(y_val_array, y_pred)
roc_auc = roc_auc_score(y_val_array, anomaly_scores)

precision, recall, f1, roc_auc

## Isolation Forest Results

Isolation Forest was trained on non-fraud transactions to learn normal behavior.
Anomaly scores were thresholded at the 99.6th percentile of the validation scores, so only the top 0.4% most anomalous transactions are flagged. This conservative cutoff limits false alarms at the expense of recall.

This model serves as a useful baseline due to:
- Fast inference
- Robust performance on tabular data
- High interpretability of anomaly drivers


## Isolation Forest Performance Analysis

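Isolation Forest achieved a ROC-AUC of approximately 0.73, indicating moderate separation between fraud and normal transactions.
However, recall was low (~6%), suggesting that the model primarily detects only extreme anomalies. Part of this is structural: the 99.6th-percentile threshold flags only 0.4% of a validation set in which roughly 1.9% of rows are fraud, so recall cannot exceed about 21% regardless of ranking quality.

Given the higher cost of false negatives in fraud detection, this model alone is insufficient and motivates the use of additional anomaly detection techniques.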
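
When false negatives cost more than false positives, the threshold can be chosen by minimizing total cost instead of fixing a percentile. The sketch below illustrates the idea on toy data. `cheapest_threshold` is a hypothetical helper, and the 1:10 cost ratio is a placeholder, not a measured business figure.

```python
import numpy as np


def cheapest_threshold(scores, labels, fp_cost, fn_cost):
    """Return (threshold, total_cost) minimising fp_cost*FP + fn_cost*FN.

    A row is flagged when its score is >= threshold (higher = more anomalous).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    best_threshold, best_cost = None, np.inf
    for candidate in np.unique(scores):
        flagged = scores >= candidate
        false_pos = np.sum(flagged & ~labels)
        false_neg = np.sum(~flagged & labels)
        cost = fp_cost * false_pos + fn_cost * false_neg
        if cost < best_cost:
            best_threshold, best_cost = float(candidate), float(cost)
    return best_threshold, best_cost


# Toy scores and labels; the cost values are placeholders
toy_scores = [0.1, 0.2, 0.3, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1]
threshold, cost = cheapest_threshold(toy_scores, toy_labels, fp_cost=1, fn_cost=10)
print(threshold, cost)
```

The loop tries every distinct score, which is fine for a toy example. On a validation set of this size, a coarse grid of percentiles would likely be a more practical set of candidates.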

In [None]:
from sklearn.svm import OneClassSVM

In [None]:
ocsvm = OneClassSVM(
    kernel="rbf",
    nu=0.004,       # upper bound on the fraction of training points treated as outliers
    gamma="scale"   # 1 / (n_features * X.var()), the scikit-learn default
)

ocsvm.fit(X_train_scaled)
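
Kernel SVMs in scikit-learn are built on libsvm, whose fit time grows faster than linearly with the number of samples, so fitting on the full training set of normal transactions may be slow. One common workaround is to fit on a random subsample. The sketch below shows that step on toy data; `subsample_rows` is a hypothetical helper and the cap of 20,000 rows is an arbitrary illustrative value.

```python
import numpy as np


def subsample_rows(X, n_rows, seed=42):
    """Return a random subset of n_rows rows from a 2-D array, without replacement."""
    X = np.asarray(X)
    if n_rows >= X.shape[0]:
        return X
    rng = np.random.default_rng(seed)
    keep = rng.choice(X.shape[0], size=n_rows, replace=False)
    return X[keep]


# Toy stand-in for X_train_scaled
toy_train = np.random.default_rng(0).normal(size=(100_000, 5))
toy_subset = subsample_rows(toy_train, n_rows=20_000)
print(toy_subset.shape)
```

The subsample would then be passed to `ocsvm.fit` in place of the full array, while validation scoring still runs on all of `X_val_scaled`.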

In [None]:
# Raw decision values are negative for anomalies; negate so that higher = more anomalous
ocsvm_scores = -ocsvm.decision_function(X_val_scaled)

In [None]:
threshold = np.percentile(ocsvm_scores, 99.6)
y_pred_ocsvm = (ocsvm_scores >= threshold).astype(int)

In [None]:
precision_oc = precision_score(y_val_array, y_pred_ocsvm)
recall_oc = recall_score(y_val_array, y_pred_ocsvm)
f1_oc = f1_score(y_val_array, y_pred_ocsvm)
roc_auc_oc = roc_auc_score(y_val_array, ocsvm_scores)

precision_oc, recall_oc, f1_oc, roc_auc_oc