# Real-World Anomaly Detection for FinTech Transactions

## 1. Problem Statement
## 2. Dataset Loading and Exploration
## 3. Feature Engineering
## 4. Experiment 1: One-Class SVM (Failed)
## 5. Experiment 2: Autoencoder (Partially Successful)
## 6. Experiment 3: Isolation Forest (Final Model)
## 7. Model Comparison and Selection
## 8. Final Observations and Production Considerations


1. Problem Statement - As digital payments grow, fraudsters are finding smarter ways to exploit transaction systems, making fraud harder to detect. The challenge is to identify suspicious transactions hidden within large volumes of mostly legitimate data, without causing unnecessary disruption to genuine users. A reliable anomaly detection solution is needed that can spot unusual behavior in real time, explain its decisions clearly, and operate efficiently in a real-world financial environment.

2. Dataset Loading

In [5]:

import pandas as pd

df = pd.read_csv("fraudTest.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


In [6]:
df.shape
df.columns
df["is_fraud"].value_counts()


is_fraud
0    553574
1      2145
Name: count, dtype: int64

Observation:
- The dataset is highly imbalanced: 2,145 fraud cases out of 555,719 transactions
- The fraud rate is about 0.39%, which reflects real-world fintech scenarios
- With so few positive labels, anomaly detection models are a reasonable fit


In [6]:
import pandas as pd

df = pd.read_csv("fraudTest.csv")

print(df.columns)

Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')


Data exploration and cleaning

In [18]:
df = df.rename(columns={
    "trans_num": "transaction_id",
    "cc_num": "card_number",
    "trans_date_trans_time": "timestamp",
    "amt": "amount",
    "merchant": "merchant_id",
    "category": "merchant_category",
    "merch_lat": "merchant_lat",
    "merch_long": "merchant_long"
})


In [19]:
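# Synthetic customer IDs assigned by row position, cycling through 100,000 values
# (they are not derived from the card holder columns)
df["customer_id"] = ["CUST_" + str(i % 100000).zfill(5) for i in range(len(df))]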

In [20]:
import hashlib

df["card_number"] = df["card_number"].astype(str).apply(
    lambda x: "CARD_" + hashlib.sha256(x.encode()).hexdigest()[:10]
)
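
The hash above is deterministic and unkeyed, and card numbers occupy a small enough space that a plain hash could in principle be reversed by enumeration. A minimal sketch of a keyed variant follows, assuming a secret key is available at runtime. `SECRET_KEY` and `pseudonymize_card` are hypothetical names for illustration and are not part of the notebook.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager, not source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize_card(card_number: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed, truncated SHA-256 token for a card number."""
    digest = hmac.new(key, card_number.encode(), hashlib.sha256).hexdigest()
    return "CARD_" + digest[:10]


# Same input and key always give the same token, so joins on card_number still work.
token = pseudonymize_card("2291163933867244")
print(token)
```

The token keeps the same `CARD_` plus 10 hex characters shape as the notebook's version, so downstream cells would not need to change.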

In [21]:
df["timestamp"] = pd.to_datetime(df["timestamp"])

df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.weekday
df["month"] = df["timestamp"].dt.month

In [22]:
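import numpy as np
# Synthetic customer home coordinates drawn uniformly at random.
# No seed is set, so the values (and distance_from_home) differ between runs.
df["customer_lat"] = np.random.uniform(8.0, 37.0, len(df))
df["customer_long"] = np.random.uniform(68.0, 97.0, len(df))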

In [23]:
from math import radians, sin, cos, sqrt, atan2

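def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    R = 6371  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * R * atan2(sqrt(a), sqrt(1 - a))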

df["distance_from_home"] = df.apply(
    lambda r: haversine(
        r["customer_lat"], r["customer_long"],
        r["merchant_lat"], r["merchant_long"]
    ),
    axis=1
)
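
Row-wise `df.apply` calls the Python function once per transaction, which can be slow on roughly 556k rows. A possible vectorized sketch of the same haversine formula using NumPy is shown here. `haversine_np` is a hypothetical helper name and the coordinates are toy values, not dataset rows.

```python
import numpy as np


def haversine_np(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; accepts scalars, lists, arrays, or Series."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Toy pairs: a quarter of the equator, and two identical points.
dist = haversine_np([0.0, 10.0], [0.0, 20.0], [0.0, 10.0], [90.0, 20.0])
print(dist)
```

In the notebook this would be called as `haversine_np(df["customer_lat"], df["customer_long"], df["merchant_lat"], df["merchant_long"])`, and the result assigned to `df["distance_from_home"]`.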


In [24]:
# np.select assigns the first matching condition, so the order below sets rule priority
conditions = [
    (df["is_fraud"] == 1) & (df["distance_from_home"] > 100),
    (df["is_fraud"] == 1) & (df["amount"] > 100000),
    (df["is_fraud"] == 1)
]

choices = [
    "account_takeover",
    "card_cloning",
    "merchant_collusion"
]

df["fraud_type"] = np.select(conditions, choices, default="none")
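
Because `np.select` uses the first condition that matches, a fraud row that satisfies both the distance rule and the amount rule is labeled `account_takeover`. The sketch below shows this on a toy frame whose rows are invented to hit each branch. In the rows printed later in this notebook `distance_from_home` is on the order of 10,000 km, so the first rule may end up matching most or all fraud rows. It is worth checking `df["fraud_type"].value_counts()` to see.

```python
import numpy as np
import pandas as pd

# Toy rows (not from the dataset) chosen to exercise each rule.
toy = pd.DataFrame({
    "is_fraud": [1, 1, 1, 0],
    "distance_from_home": [500.0, 50.0, 50.0, 500.0],
    "amount": [200000.0, 200000.0, 10.0, 200000.0],
})

conditions = [
    (toy["is_fraud"] == 1) & (toy["distance_from_home"] > 100),
    (toy["is_fraud"] == 1) & (toy["amount"] > 100000),
    (toy["is_fraud"] == 1),
]
choices = ["account_takeover", "card_cloning", "merchant_collusion"]

# Row 0 matches both of the first two rules; the earlier rule wins.
labels = np.select(conditions, choices, default="none")
print(labels.tolist())
```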

In [25]:
df = df[[
    "transaction_id",
    "customer_id",
    "card_number",
    "timestamp",
    "amount",
    "merchant_id",
    "merchant_category",
    "merchant_lat",
    "merchant_long",
    "is_fraud",
    "fraud_type",
    "hour",
    "day_of_week",
    "month",
    "distance_from_home"
]]


In [26]:
print(df.columns)
print(df.head())

Index(['transaction_id', 'customer_id', 'card_number', 'timestamp', 'amount',
       'merchant_id', 'merchant_category', 'merchant_lat', 'merchant_long',
       'is_fraud', 'fraud_type', 'hour', 'day_of_week', 'month',
       'distance_from_home'],
      dtype='object')
                     transaction_id customer_id      card_number  \
0  2da90c7d74bd46a0caf3777415b3ebd3  CUST_00000  CARD_76d525af51   
1  324cc204407e99f51b0d6ca0055005e7  CUST_00001  CARD_b75ab80b26   
2  c81755dbbbea9d5c77f094348a7579be  CUST_00002  CARD_b6e169e160   
3  2159175b9efe66dc301f149d3d5abf8c  CUST_00003  CARD_d2dbee2d66   
4  57ff021bd3f328f8738bb535c302a31b  CUST_00004  CARD_c761a63af7   

            timestamp  amount                           merchant_id  \
0 2020-06-21 12:14:25    2.86                 fraud_Kirlin and Sons   
1 2020-06-21 12:14:33   29.84                  fraud_Sporer-Keebler   
2 2020-06-21 12:14:53   41.28  fraud_Swaniawski, Nitzsche and Welch   
3 2020-06-21 12:15:15   60.05       

In [30]:
print(df.shape)

(555719, 15)


### Experiment 1: One-Class SVM

Goal:
To detect fraudulent transactions by learning the boundary of normal behavior.

In [32]:
df

Unnamed: 0,transaction_id,customer_id,card_number,timestamp,amount,merchant_id,merchant_category,merchant_lat,merchant_long,is_fraud,fraud_type,hour,day_of_week,month,distance_from_home
0,2da90c7d74bd46a0caf3777415b3ebd3,CUST_00000,CARD_76d525af51,2020-06-21 12:14:25,2.86,fraud_Kirlin and Sons,personal_care,33.986391,-81.200714,0,none,12,6,6,14027.220109
1,324cc204407e99f51b0d6ca0055005e7,CUST_00001,CARD_b75ab80b26,2020-06-21 12:14:33,29.84,fraud_Sporer-Keebler,personal_care,39.450498,-109.960431,0,none,12,6,6,12388.041489
2,c81755dbbbea9d5c77f094348a7579be,CUST_00002,CARD_b6e169e160,2020-06-21 12:14:53,41.28,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,40.495810,-74.196111,0,none,12,6,6,11742.754669
3,2159175b9efe66dc301f149d3d5abf8c,CUST_00003,CARD_d2dbee2d66,2020-06-21 12:15:15,60.05,fraud_Haley Group,misc_pos,28.812398,-80.883061,0,none,12,6,6,14599.988137
4,57ff021bd3f328f8738bb535c302a31b,CUST_00004,CARD_c761a63af7,2020-06-21 12:15:17,3.19,fraud_Johnston-Casper,travel,44.959148,-85.884734,0,none,12,6,6,11758.355634
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555714,9b1f753c79894c9f4b71f04581835ada,CUST_55714,CARD_836c3aa1c2,2020-12-31 23:59:07,43.77,fraud_Reilly and Sons,health_fitness,39.946837,-91.333331,0,none,23,3,12,11596.984912
555715,2090647dac2c89a1d86c514c427f5b91,CUST_55715,CARD_041235f46f,2020-12-31 23:59:09,111.84,fraud_Hoppe-Parisian,kids_pets,29.661049,-96.186633,0,none,23,3,12,12647.640762
555716,6c5b7c8add471975aa0fec023b2e8408,CUST_55716,CARD_f48bd1e6de,2020-12-31 23:59:15,86.88,fraud_Rau-Robel,kids_pets,46.658340,-119.715054,0,none,23,3,12,11582.268185
555717,14392d723bb7737606b2700ac791b7aa,CUST_55717,CARD_b758f78bd6,2020-12-31 23:59:24,7.99,fraud_Breitenberg LLC,travel,44.470525,-117.080888,0,none,23,3,12,10908.016346


In [34]:
df_clean.head()


Unnamed: 0,transaction_id,customer_id,card_number,timestamp,amount,merchant_id,merchant_category,merchant_lat,merchant_long,is_fraud,fraud_type,hour,day_of_week,month,distance_from_home
0,2da90c7d74bd46a0caf3777415b3ebd3,CUST_00000,CARD_76d525af51,2020-06-21 12:14:25,2.86,fraud_Kirlin and Sons,personal_care,33.986391,-81.200714,0,none,12,6,6,14027.220109
1,324cc204407e99f51b0d6ca0055005e7,CUST_00001,CARD_b75ab80b26,2020-06-21 12:14:33,29.84,fraud_Sporer-Keebler,personal_care,39.450498,-109.960431,0,none,12,6,6,12388.041489
2,c81755dbbbea9d5c77f094348a7579be,CUST_00002,CARD_b6e169e160,2020-06-21 12:14:53,41.28,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,40.49581,-74.196111,0,none,12,6,6,11742.754669
3,2159175b9efe66dc301f149d3d5abf8c,CUST_00003,CARD_d2dbee2d66,2020-06-21 12:15:15,60.05,fraud_Haley Group,misc_pos,28.812398,-80.883061,0,none,12,6,6,14599.988137
4,57ff021bd3f328f8738bb535c302a31b,CUST_00004,CARD_c761a63af7,2020-06-21 12:15:17,3.19,fraud_Johnston-Casper,travel,44.959148,-85.884734,0,none,12,6,6,11758.355634


In [2]:
import numpy as np
import pandas as pd

from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score


In [3]:
df_clean = pd.read_csv("Desktop/fraud-anomaly-detection/transactions_cleaned.csv")

In [12]:
features = [
    "amount",
    "hour",
    "day_of_week",
    "month",
    "distance_from_home"
]

X = df_clean[features]
y = df_clean["is_fraud"]  


In [13]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [None]:

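# X_train is not defined in the cells shown. Assumption: it holds the scaled
# legitimate rows only, since the model should learn the boundary of normal behavior.
X_train = X_scaled[(y == 0).to_numpy()]
# Fit on a 10,000-row subset to keep One-Class SVM training time manageable
X_train_small = X_train[:10000]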
svm_model = OneClassSVM(
    kernel="rbf",
    gamma="scale",
    nu=0.05
)

svm_model.fit(X_train_small)


In [None]:
svm_preds = svm_model.predict(X_scaled)

# Convert SVM output to fraud labels
# -1 → anomaly (fraud), 1 → normal
svm_preds = np.where(svm_preds == -1, 1, 0)
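
The imports above include `classification_report` and `roc_auc_score`, but the evaluation cell is not shown. The following is a minimal sketch of how the One-Class SVM predictions could be scored. It runs on synthetic data as a stand-in for the transaction features, so `X_demo`, `y_demo`, and the cluster parameters are assumptions, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic stand-in: 2,000 "normal" points and 40 shifted "fraud" points.
X_normal = rng.normal(0.0, 1.0, size=(2000, 5))
X_fraud = rng.normal(4.0, 1.0, size=(40, 5))
X_demo = np.vstack([X_normal, X_fraud])
y_demo = np.r_[np.zeros(2000, dtype=int), np.ones(40, dtype=int)]

# Fit the scaler and the model on normal data only.
scaler = StandardScaler().fit(X_normal)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(scaler.transform(X_normal))

X_demo_scaled = scaler.transform(X_demo)
# predict() returns -1 for anomalies and 1 for inliers; map to 1 = fraud, 0 = normal.
preds = np.where(model.predict(X_demo_scaled) == -1, 1, 0)

# decision_function is larger for inliers, so negate it to get a fraud score.
fraud_score = -model.decision_function(X_demo_scaled)
auc = roc_auc_score(y_demo, fraud_score)

print(classification_report(y_demo, preds, digits=3))
print("ROC AUC:", auc)
```

Using the continuous score for ROC AUC avoids tying the metric to the single threshold implied by `nu`. On the notebook's data the same calls would take `y` and `svm_preds` in place of the demo arrays.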


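The cleaned dataset has 555,719 rows with a fraud rate of about 0.39%. To raise the fraud rate to roughly 1%, I first drew a stratified 150,000-row sample, then downsampled its legitimate transactions while keeping every fraud case in that sample. This preserves realistic fraud behavior without synthetic duplication.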

In [4]:
from sklearn.model_selection import train_test_split

df_reduced, _ = train_test_split(
    df_clean,
    train_size=150000,
    stratify=df_clean["is_fraud"],
    random_state=42
)

In [5]:
df_reduced["is_fraud"].value_counts(normalize=True) * 100

is_fraud
0    99.614
1     0.386
Name: proportion, dtype: float64

In [6]:
df_reduced["is_fraud"].value_counts()

is_fraud
0    149421
1       579
Name: count, dtype: int64

In [7]:
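fraud_count = df_reduced[df_reduced["is_fraud"] == 1].shape[0]
# Total rows needed for fraud to make up 1% of the sample
required_total = int(fraud_count / 0.01)
# Legitimate rows to keep alongside all fraud rows
required_legit = required_total - fraud_count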

fraud_count, required_legit


(579, 57321)

In [8]:
df_fraud = df_reduced[df_reduced["is_fraud"] == 1]
df_legit = df_reduced[df_reduced["is_fraud"] == 0].sample(
    n=required_legit,
    random_state=42
)

df_1percent = pd.concat([df_fraud, df_legit]).sample(frac=1, random_state=42)

In [9]:
df_1percent["is_fraud"].value_counts(normalize=True) * 100

is_fraud
0    99.0
1     1.0
Name: proportion, dtype: float64
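
The downsampling steps above can be collected into one reusable function. This is a sketch under stated assumptions, not the notebook's own code: `downsample_to_fraud_rate` is a hypothetical helper name and the frame it runs on is a toy example. It uses `round` rather than `int` so that floating-point error in the division cannot drop a row.

```python
import pandas as pd


def downsample_to_fraud_rate(df, target_rate, label_col="is_fraud", seed=42):
    """Keep every fraud row and sample legitimate rows so fraud is ~target_rate."""
    fraud = df[df[label_col] == 1]
    legit = df[df[label_col] == 0]
    # Total rows needed so that len(fraud) / total is approximately target_rate.
    total = round(len(fraud) / target_rate)
    n_legit = total - len(fraud)
    if n_legit > len(legit):
        raise ValueError("Not enough legitimate rows to reach the target rate.")
    sampled = legit.sample(n=n_legit, random_state=seed)
    # Shuffle so fraud rows are not grouped at the top.
    return pd.concat([fraud, sampled]).sample(frac=1, random_state=seed)


# Toy frame (not the real data): 10 fraud rows among 5,000.
toy = pd.DataFrame({"is_fraud": [1] * 10 + [0] * 4990})
balanced = downsample_to_fraud_rate(toy, target_rate=0.01)
print(balanced["is_fraud"].value_counts(normalize=True) * 100)
```

With the notebook's data, `downsample_to_fraud_rate(df_reduced, 0.01)` would be expected to keep the 579 fraud rows and 57,321 legitimate rows, matching the counts computed above.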