# Real-World Anomaly Detection for FinTech Transactions

## 1. Problem Statement
## 2. Dataset Loading and Exploration
## 3. Feature Engineering
## 4. Experiment 1: One-Class SVM (Failed)
## 5. Experiment 2: Autoencoder (Partially Successful)
## 6. Experiment 3: Isolation Forest (Final Model)
## 7. Model Comparison and Selection
## 8. Final Observations and Production Considerations


1. Problem Statement - As digital payments grow, fraudsters are finding smarter ways to exploit transaction systems, making fraud harder to detect. The challenge is to identify suspicious transactions hidden within large volumes of mostly legitimate data, without causing unnecessary disruption to genuine users. A reliable anomaly detection solution is needed that can spot unusual behavior in real time, explain its decisions clearly, and operate efficiently in a real-world financial environment.

2. Dataset Loading

In [8]:

import pandas as pd

df = pd.read_csv("fraudTest.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371817000.0,33.986391,-81.200714,0.0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371817000.0,39.450498,-109.960431,0.0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371817000.0,40.49581,-74.196111,0.0
3,3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371817000.0,28.812398,-80.883061,0.0
4,4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371817000.0,44.959148,-85.884734,0.0


In [9]:
df.shape
df.columns
df["is_fraud"].value_counts()


is_fraud
0.0    548189
1.0      2145
Name: count, dtype: int64

Observation:
- Dataset is highly imbalanced
- Fraud rate is about 0.39% (2,145 of 550,334 labeled rows), which reflects real-world fintech scenarios
- This justifies the use of anomaly detection models
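
As a quick check on the imbalance noted above, the fraud rate can be computed directly from the class counts. This is a minimal sketch that copies the two counts from the `value_counts()` output; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
legit_count = 548189
fraud_count = 2145

total = legit_count + fraud_count
fraud_rate = fraud_count / total

print(f"labeled rows: {total}")
print(f"fraud rate: {fraud_rate:.4%}")
```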


In [10]:
import pandas as pd

df = pd.read_csv("fraudTest.csv")

print(df.columns)

Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')


Data Exploration and Cleaning

In [11]:
df = df.rename(columns={
    "trans_num": "transaction_id",
    "cc_num": "card_number",
    "trans_date_trans_time": "timestamp",
    "amt": "amount",
    "merchant": "merchant_id",
    "category": "merchant_category",
    "merch_lat": "merchant_lat",
    "merch_long": "merchant_long"
})


In [12]:
df["customer_id"] = ["CUST_" + str(i % 100000).zfill(5) for i in range(len(df))]

In [13]:
import hashlib

df["card_number"] = df["card_number"].astype(str).apply(
    lambda x: "CARD_" + hashlib.sha256(x.encode()).hexdigest()[:10]
)
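
Card numbers have a limited, structured value space, so an unkeyed SHA-256 digest can in principle be reversed by enumeration. One common hardening is a keyed hash (HMAC). The sketch below uses only the standard library; `PSEUDONYM_KEY` and `pseudonymize_card` are hypothetical names, and the key shown is a placeholder, not something the notebook defines.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would come from a secrets manager.
PSEUDONYM_KEY = b"demo-key-not-for-production"

def pseudonymize_card(card_number: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Return a stable keyed pseudonym for a card number."""
    digest = hmac.new(key, card_number.encode(), hashlib.sha256).hexdigest()
    # Same "CARD_" + 10 hex characters shape as the notebook's pseudonyms.
    return "CARD_" + digest[:10]

token = pseudonymize_card("2291163933867244")
print(token)
```

In the notebook this could be applied with `df["card_number"].astype(str).apply(pseudonymize_card)`, assuming the key is managed outside the notebook.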

In [14]:
df["timestamp"] = pd.to_datetime(df["timestamp"])

df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.weekday
df["month"] = df["timestamp"].dt.month

In [15]:
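import numpy as np

# Synthetic customer home coordinates drawn uniformly at random
# (not the dataset's own `lat`/`long` columns).
# No seed is set, so these values differ on every run.
df["customer_lat"] = np.random.uniform(8.0, 37.0, len(df))
df["customer_long"] = np.random.uniform(68.0, 97.0, len(df))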

In [16]:
from math import radians, sin, cos, sqrt, atan2

def haversine(lat1, lon1, lat2, lon2):
    R = 6371
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1)*cos(lat2)*sin(dlon/2)**2
    return 2 * R * atan2(sqrt(a), sqrt(1-a))

df["distance_from_home"] = df.apply(
    lambda r: haversine(
        r["customer_lat"], r["customer_long"],
        r["merchant_lat"], r["merchant_long"]
    ),
    axis=1
)
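
Row-wise `df.apply` calls the Python function once per transaction, which is slow on several hundred thousand rows. The same haversine formula can be written with NumPy so it runs over whole columns at once. A minimal sketch, assuming the inputs are scalars or equal-length arrays; `haversine_np` and `EARTH_RADIUS_KM` are illustrative names.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or equal-length arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Toy check: identical points, and two equator points 90 degrees of longitude apart.
lat_a = np.array([0.0, 0.0])
lon_a = np.array([0.0, 0.0])
lat_b = np.array([0.0, 0.0])
lon_b = np.array([0.0, 90.0])
distances = haversine_np(lat_a, lon_a, lat_b, lon_b)
print(distances)
```

With the notebook's columns, the call would look like `haversine_np(df["customer_lat"], df["customer_long"], df["merchant_lat"], df["merchant_long"])`.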


In [17]:
conditions = [
    (df["is_fraud"] == 1) & (df["distance_from_home"] > 100),
    (df["is_fraud"] == 1) & (df["amount"] > 100000),
    (df["is_fraud"] == 1)
]

choices = [
    "account_takeover",
    "card_cloning",
    "merchant_collusion"
]

df["fraud_type"] = np.select(conditions, choices, default="none")
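
`np.select` assigns the first matching choice, so the order of `conditions` matters: a fraud row that is both far from home and above the amount threshold is labeled `account_takeover`, and the final catch-all only applies to fraud rows that matched neither earlier rule. The sketch below shows this on a few made-up rows; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Four toy rows: fraud matching both rules, fraud matching only the amount rule,
# fraud matching neither rule, and a legitimate transaction.
toy = pd.DataFrame({
    "is_fraud": [1, 1, 1, 0],
    "distance_from_home": [250.0, 20.0, 20.0, 500.0],
    "amount": [200000.0, 200000.0, 50.0, 10.0],
})

conditions = [
    (toy["is_fraud"] == 1) & (toy["distance_from_home"] > 100),
    (toy["is_fraud"] == 1) & (toy["amount"] > 100000),
    (toy["is_fraud"] == 1),
]
choices = ["account_takeover", "card_cloning", "merchant_collusion"]

# The first condition that is True wins; rows matching none get the default.
toy["fraud_type"] = np.select(conditions, choices, default="none")
print(toy)
```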

In [18]:
df = df[[
    "transaction_id",
    "customer_id",
    "card_number",
    "timestamp",
    "amount",
    "merchant_id",
    "merchant_category",
    "merchant_lat",
    "merchant_long",
    "is_fraud",
    "fraud_type",
    "hour",
    "day_of_week",
    "month",
    "distance_from_home"
]]


In [19]:
print(df.columns)
print(df.head())

Index(['transaction_id', 'customer_id', 'card_number', 'timestamp', 'amount',
       'merchant_id', 'merchant_category', 'merchant_lat', 'merchant_long',
       'is_fraud', 'fraud_type', 'hour', 'day_of_week', 'month',
       'distance_from_home'],
      dtype='object')
                     transaction_id customer_id      card_number  \
0  2da90c7d74bd46a0caf3777415b3ebd3  CUST_00000  CARD_cc6c29f3a6   
1  324cc204407e99f51b0d6ca0055005e7  CUST_00001  CARD_b47e8dca60   
2  c81755dbbbea9d5c77f094348a7579be  CUST_00002  CARD_9abc4462b4   
3  2159175b9efe66dc301f149d3d5abf8c  CUST_00003  CARD_66e686f7ff   
4  57ff021bd3f328f8738bb535c302a31b  CUST_00004  CARD_551125038a   

            timestamp  amount                           merchant_id  \
0 2020-06-21 12:14:25    2.86                 fraud_Kirlin and Sons   
1 2020-06-21 12:14:33   29.84                  fraud_Sporer-Keebler   
2 2020-06-21 12:14:53   41.28  fraud_Swaniawski, Nitzsche and Welch   
3 2020-06-21 12:15:15   60.05       

In [20]:
print(df.shape)

(555719, 15)


### Experiment 1: One-Class SVM

Goal:
To detect fraudulent transactions by learning the boundary of normal behavior.

In [22]:
df

Unnamed: 0,transaction_id,customer_id,card_number,timestamp,amount,merchant_id,merchant_category,merchant_lat,merchant_long,is_fraud,fraud_type,hour,day_of_week,month,distance_from_home
0,2da90c7d74bd46a0caf3777415b3ebd3,CUST_00000,CARD_cc6c29f3a6,2020-06-21 12:14:25,2.86,fraud_Kirlin and Sons,personal_care,33.986391,-81.200714,0,none,12,6,6,14193.311789
1,324cc204407e99f51b0d6ca0055005e7,CUST_00001,CARD_b47e8dca60,2020-06-21 12:14:33,29.84,fraud_Sporer-Keebler,personal_care,39.450498,-109.960431,0,none,12,6,6,12656.221561
2,c81755dbbbea9d5c77f094348a7579be,CUST_00002,CARD_9abc4462b4,2020-06-21 12:14:53,41.28,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,40.495810,-74.196111,0,none,12,6,6,13020.152251
3,2159175b9efe66dc301f149d3d5abf8c,CUST_00003,CARD_66e686f7ff,2020-06-21 12:15:15,60.05,fraud_Haley Group,misc_pos,28.812398,-80.883061,0,none,12,6,6,14026.730331
4,57ff021bd3f328f8738bb535c302a31b,CUST_00004,CARD_551125038a,2020-06-21 12:15:17,3.19,fraud_Johnston-Casper,travel,44.959148,-85.884734,0,none,12,6,6,12660.105221
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555714,9b1f753c79894c9f4b71f04581835ada,CUST_55714,CARD_8f9bb48f4a,2020-12-31 23:59:07,43.77,fraud_Reilly and Sons,health_fitness,39.946837,-91.333331,0,none,23,3,12,14467.586629
555715,2090647dac2c89a1d86c514c427f5b91,CUST_55715,CARD_b9107fe659,2020-12-31 23:59:09,111.84,fraud_Hoppe-Parisian,kids_pets,29.661049,-96.186633,0,none,23,3,12,13722.867034
555716,6c5b7c8add471975aa0fec023b2e8408,CUST_55716,CARD_581333f442,2020-12-31 23:59:15,86.88,fraud_Rau-Robel,kids_pets,46.658340,-119.715054,0,none,23,3,12,11690.309978
555717,14392d723bb7737606b2700ac791b7aa,CUST_55717,CARD_ebfc7f1ffd,2020-12-31 23:59:24,7.99,fraud_Breitenberg LLC,travel,44.470525,-117.080888,0,none,23,3,12,11282.886522


In [24]:
import numpy as np
import pandas as pd

from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score


In [12]:
features = [
    "amount",
    "hour",
    "day_of_week",
    "month",
    "distance_from_home"
]

X = df[features]
y = df["is_fraud"]  


In [13]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [None]:

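# Train only on legitimate transactions so the SVM learns the boundary of normal behavior
X_train = X_scaled[y == 0]

# One-Class SVM training scales poorly with sample count, so fit on a 10,000-row subset
X_train_small = X_train[:10000]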
svm_model = OneClassSVM(
    kernel="rbf",
    gamma="scale",
    nu=0.05
)

svm_model.fit(X_train_small)


In [None]:
svm_preds = svm_model.predict(X_scaled)

# Convert SVM output to fraud labels
# -1 → anomaly (fraud), 1 → normal
svm_preds = np.where(svm_preds == -1, 1, 0)


The current dataset has 555,719 rows. To raise the fraud rate to about 1%, I first took a stratified 150,000-row subset, then downsampled legitimate transactions within it while keeping every fraud case in that subset. This preserves real fraud behavior without synthetic duplication.

In [31]:
from sklearn.model_selection import train_test_split

df_reduced, _ = train_test_split(df ,
    train_size=150000,
    stratify=df["is_fraud"],
    random_state=42
)

In [32]:
df_reduced["is_fraud"].value_counts(normalize=True) * 100

is_fraud
0    99.614
1     0.386
Name: proportion, dtype: float64

In [33]:
df_reduced["is_fraud"].value_counts()

is_fraud
0    149421
1       579
Name: count, dtype: int64

In [34]:
fraud_count = df_reduced[df_reduced["is_fraud"] == 1].shape[0]
required_total = int(fraud_count / 0.01)
required_legit = required_total - fraud_count

fraud_count, required_legit


(579, 57321)

In [35]:
df_fraud = df_reduced[df_reduced["is_fraud"] == 1]
df_legit = df_reduced[df_reduced["is_fraud"] == 0].sample(
    n=required_legit,
    random_state=42
)

df_1percent = pd.concat([df_fraud, df_legit]).sample(frac=1, random_state=42)

In [9]:
df_1percent["is_fraud"].value_counts(normalize=True) * 100

is_fraud
0    99.0
1     1.0
Name: proportion, dtype: float64
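
The downsampling steps above can be wrapped into one reusable helper. This is a sketch under stated assumptions: `downsample_to_rate` is a hypothetical name, the label column is binary with 1 for fraud, and `round` is used instead of the notebook's `int` to avoid floating-point truncation when computing the target size.

```python
import pandas as pd

def downsample_to_rate(df, label_col, target_rate, random_state=42):
    """Keep all positive rows and sample negatives so positives make up target_rate."""
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0]

    required_total = round(len(positives) / target_rate)
    required_neg = required_total - len(positives)
    if required_neg > len(negatives):
        raise ValueError("not enough negative rows to reach the target rate")

    sampled_neg = negatives.sample(n=required_neg, random_state=random_state)
    # Shuffle so the positive rows are not grouped at the top.
    return pd.concat([positives, sampled_neg]).sample(frac=1, random_state=random_state)

# Toy frame: 5 fraud rows and 2,000 legitimate rows.
toy = pd.DataFrame({"is_fraud": [1] * 5 + [0] * 2000, "amount": range(2005)})
balanced = downsample_to_rate(toy, "is_fraud", target_rate=0.01)
print(balanced["is_fraud"].value_counts())
```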

In [36]:
df = df_1percent

One-Class SVM model after downsampling

In [38]:
import numpy as np
import pandas as pd

from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score


In [39]:
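# Rebuild X and y from the downsampled df so they match the 57,900 rows evaluated below
X = df[features]
y = df["is_fraud"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)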

In [40]:
X_train = X_scaled[y == 0]

In [43]:
# X_train holds 57,321 legitimate rows, so this slice keeps all of them
X_train_small = X_train[:100000]

In [44]:
svm_model = OneClassSVM(
    kernel="rbf",
    gamma="scale",
    nu=0.05
)

svm_model.fit(X_train_small)

Parameter,Value
kernel,'rbf'
degree,3
gamma,'scale'
coef0,0.0
tol,0.001
nu,0.05
shrinking,True
cache_size,200
verbose,False
max_iter,-1


In [45]:
svm_raw_preds = svm_model.predict(X_scaled)

# Convert SVM output to fraud labels
# -1 = anomaly → fraud (1)
#  1 = normal  → legit (0)
svm_preds = np.where(svm_raw_preds == -1, 1, 0)


In [46]:
print(classification_report(y, svm_preds))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97     57321
           1       0.11      0.60      0.18       579

    accuracy                           0.95     57900
   macro avg       0.55      0.78      0.58     57900
weighted avg       0.99      0.95      0.96     57900



In [47]:
svm_scores = -svm_model.decision_function(X_scaled)
print("AUC-ROC:", roc_auc_score(y, svm_scores))


AUC-ROC: 0.8554976535951416


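One-Class SVM result: accuracy is 0.95 and AUC-ROC is 0.86, but precision on the fraud class is only 0.11 (recall 0.60), so the high accuracy mostly reflects the 99% legitimate majority.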
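
With roughly 1% fraud, accuracy and AUC-ROC can look strong while fraud precision stays low. One common complement is average precision (the area under the precision-recall curve), which scikit-learn exposes as `average_precision_score`. The sketch below uses synthetic labels and scores, not the notebook's data, purely to show the call pattern.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: about 1% positives, with scores shifted upward for positives.
n = 20000
y_true = (rng.random(n) < 0.01).astype(int)
scores = rng.normal(loc=0.0, scale=1.0, size=n) + 2.0 * y_true

roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)
baseline = y_true.mean()  # a random ranking scores about the positive rate

print(f"AUC-ROC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f} (chance level about {baseline:.3f})")
```

On the notebook's own outputs the equivalent call would be `average_precision_score(y, svm_scores)`.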

In [50]:
with open(".gitignore", "w") as f:
    f.write("""# Ignore everything
*

# Allow Jupyter notebooks
!*.ipynb

# Allow gitignore itself
!.gitignore
""")


In [51]:
import os
os.listdir(".")


['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'cleaned_dataset.csv',
 'detection.ipynb',
 'fraudTest.csv',
 'untitled.txt']

Isolation Forest model

In [52]:
import numpy as np
import pandas as pd

from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, roc_auc_score


In [53]:
iso_model = IsolationForest(
    n_estimators=200,
    contamination=0.01,   # approx expected fraud ratio
    random_state=42,
    n_jobs=-1
)

iso_model.fit(X)

Parameter,Value
n_estimators,200
max_samples,'auto'
contamination,0.01
max_features,1.0
bootstrap,False
n_jobs,-1
random_state,42
verbose,0
warm_start,False


In [54]:
iso_raw_preds = iso_model.predict(X)

# Convert: -1 → fraud (1), 1 → legit (0)
iso_preds = np.where(iso_raw_preds == -1, 1, 0)

In [55]:
print(classification_report(y, iso_preds))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99     57321
           1       0.45      0.45      0.45       579

    accuracy                           0.99     57900
   macro avg       0.72      0.72      0.72     57900
weighted avg       0.99      0.99      0.99     57900



In [56]:
iso_scores = -iso_model.decision_function(X)
print("AUC-ROC:", roc_auc_score(y, iso_scores))


AUC-ROC: 0.8934913068267879
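
The metrics above are computed on the same rows the model was fitted on. A hold-out variant fits on one split and scores another, which gives a less optimistic estimate. The sketch below runs on synthetic data; the 70/30 split, the toy feature matrix, and the names `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix: 1% outliers from a wider distribution.
n_normal, n_outlier = 9900, 100
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(n_normal, 5)),
    rng.normal(0.0, 4.0, size=(n_outlier, 5)),
])
y_demo = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_outlier, dtype=int)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
model.fit(X_tr)  # labels are not used for fitting

# Higher score = more anomalous, matching the sign flip used in the notebook.
test_scores = -model.decision_function(X_te)
test_preds = np.where(model.predict(X_te) == -1, 1, 0)

holdout_auc = roc_auc_score(y_te, test_scores)
print("held-out AUC-ROC:", holdout_auc)
print("flagged fraction:", test_preds.mean())
```

Because `contamination` sets the score threshold from the training data, the flagged fraction on new data lands near that value, which is why the number of flagged rows tends to track the assumed fraud ratio rather than the true one.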
