# SecurePay — Intelligent Transaction Anomaly Detection System  
## Notebook 02 — Data Preparation for Anomaly Detection

---

## Introduction

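Before applying anomaly detection algorithms, the dataset must be put into a format suitable for modelling. Raw transaction data typically mixes features on very different scales: in this dataset, `txn_amount` runs into the hundreds or thousands while `behavior_score` stays below 1 in the previewed rows. Left unscaled, the large-valued features dominate distance calculations and degrade distance-based anomaly detection methods.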

This notebook focuses on selecting relevant behavioral features and transforming them into a model-ready format. Feature scaling is applied to ensure that no single feature dominates the anomaly detection process. No anomaly detection model is trained in this notebook.
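To make the scale problem concrete, the sketch below computes the Euclidean distance between the first two transactions shown in the Stage 1 preview and reports how much each feature contributes to it. It is an illustration only, not part of the notebook's pipeline, and the variable names are chosen for this example.

```python
import numpy as np

feature_names = ["txn_hour", "txn_amount", "amount_deviation",
                 "txn_velocity", "behavior_score"]

# First two rows of the dataset preview (TXN00001 and TXN00002)
a = np.array([17, 222.39, -0.30, 0.75, 0.33])
b = np.array([21, 991.00, -0.06, 0.46, 0.32])

# Squared per-feature differences are the terms summed inside
# the Euclidean distance
sq_diff = (a - b) ** 2
distance = np.sqrt(sq_diff.sum())

# Fraction of the squared distance attributable to each feature
share = sq_diff / sq_diff.sum()

print(f"Euclidean distance: {distance:.2f}")
for name, s in zip(feature_names, share):
    print(f"{name:>18}: {s:.6f}")
```

On the raw values, almost all of the distance comes from `txn_amount`, which is the effect scaling is meant to remove.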


## Stage 1: Load the Dataset


In [1]:
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Load the raw transaction stream and preview the first five rows
df = pd.read_csv("securepay_txn_stream.csv")
df.head()


|   | txn_id | txn_hour | txn_amount | amount_deviation | txn_velocity | behavior_score | payment_channel | risk_flag |
|---|---|---|---|---|---|---|---|---|
| 0 | TXN00001 | 17 | 222.39 | -0.3 | 0.75 | 0.33 | CreditCard | 0 |
| 1 | TXN00002 | 21 | 991.0 | -0.06 | 0.46 | 0.32 | UPI | 0 |
| 2 | TXN00003 | 10 | 566.06 | -0.27 | 0.59 | 0.17 | CreditCard | 0 |
| 3 | TXN00004 | 8 | 320.76 | -0.34 | 0.56 | 0.08 | UPI | 0 |
| 4 | TXN00005 | 16 | 1047.68 | 0.27 | 0.13 | 0.29 | CreditCard | 0 |
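Before scaling, a few quick quality checks can be worthwhile, because `RobustScaler` ignores NaN values when fitting and passes them through unchanged when transforming, so missing data would not raise an error on its own. The helper below is a hypothetical sketch, not part of the original notebook. It assumes `txn_hour` is an hour of day in the range 0 to 23, and it runs on the five preview rows rather than the full CSV.

```python
import pandas as pd

def basic_checks(df: pd.DataFrame) -> dict:
    """Summarize simple data-quality indicators for the transaction frame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_txn_ids": int(df["txn_id"].duplicated().sum()),
        # Assumption: txn_hour is an hour of day, so 0..23 is the valid range
        "hours_in_range": bool(df["txn_hour"].between(0, 23).all()),
    }

# The five rows shown in the preview above
sample = pd.DataFrame({
    "txn_id": ["TXN00001", "TXN00002", "TXN00003", "TXN00004", "TXN00005"],
    "txn_hour": [17, 21, 10, 8, 16],
    "txn_amount": [222.39, 991.0, 566.06, 320.76, 1047.68],
    "amount_deviation": [-0.3, -0.06, -0.27, -0.34, 0.27],
    "txn_velocity": [0.75, 0.46, 0.59, 0.56, 0.13],
    "behavior_score": [0.33, 0.32, 0.17, 0.08, 0.29],
    "payment_channel": ["CreditCard", "UPI", "CreditCard", "UPI", "CreditCard"],
    "risk_flag": [0, 0, 0, 0, 0],
})

report = basic_checks(sample)
print(report)
```

On the real data, the same call would be `basic_checks(df)`.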


## Stage 2: Select Behavioral Features

In [2]:
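# Numeric behavioral features used as model input.
# txn_id, payment_channel, and risk_flag are intentionally left out.
features = [
    'txn_hour',
    'txn_amount',
    'amount_deviation',
    'txn_velocity',
    'behavior_score'
]

X = df[features]
X.head()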


|   | txn_hour | txn_amount | amount_deviation | txn_velocity | behavior_score |
|---|---|---|---|---|---|
| 0 | 17 | 222.39 | -0.3 | 0.75 | 0.33 |
| 1 | 21 | 991.0 | -0.06 | 0.46 | 0.32 |
| 2 | 10 | 566.06 | -0.27 | 0.59 | 0.17 |
| 3 | 8 | 320.76 | -0.34 | 0.56 | 0.08 |
| 4 | 16 | 1047.68 | 0.27 | 0.13 | 0.29 |
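The selection above can be made more defensive with a small guard that fails early if a feature column is missing or non-numeric, since the scaler expects numeric input. The function `select_features` below is a hypothetical helper introduced for illustration. It is a minimal sketch run on a two-row sample, not the notebook's own code.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def select_features(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Return df[features] after checking the columns exist and are numeric."""
    missing = [c for c in features if c not in df.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")

    non_numeric = [c for c in features if not is_numeric_dtype(df[c])]
    if non_numeric:
        raise TypeError(f"Non-numeric feature columns: {non_numeric}")

    # Copy so later edits to X never write back into df
    return df[features].copy()

# Small sample with the same columns as the preview
sample = pd.DataFrame({
    "txn_id": ["TXN00001", "TXN00002"],
    "txn_hour": [17, 21],
    "txn_amount": [222.39, 991.0],
    "amount_deviation": [-0.3, -0.06],
    "txn_velocity": [0.75, 0.46],
    "behavior_score": [0.33, 0.32],
    "payment_channel": ["CreditCard", "UPI"],
    "risk_flag": [0, 0],
})

features = ["txn_hour", "txn_amount", "amount_deviation",
            "txn_velocity", "behavior_score"]
X = select_features(sample, features)
print(X.dtypes)
```

Passing a categorical column such as `payment_channel` to this helper raises a `TypeError` instead of failing later inside the scaler.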


## Stage 3: Apply Robust Scaling

In [3]:
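# RobustScaler centers each feature on its median and divides by its
# interquartile range (IQR), so extreme values have little effect on the fit
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Preview the first five scaled rows (a NumPy array at this point)
X_scaled[:5]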


array([[ 0.25      , -0.95681902, -0.58823529,  0.85714286,  0.86666667],
       [ 0.75      ,  0.59571675, -0.11764706,  0.02857143,  0.8       ],
       [-0.625     , -0.26263085, -0.52941176,  0.4       , -0.2       ],
       [-0.875     , -0.75811884, -0.66666667,  0.31428571, -0.8       ],
       [ 0.125     ,  0.71020618,  0.52941176, -0.91428571,  0.6       ]])
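With its default settings, `RobustScaler` computes `(x - median) / IQR` per feature, where the IQR is the difference between the 75th and 25th percentiles. The `txn_hour` column in the output above is consistent with a median of 15 and an IQR of 8 (values inferred from the printed rows, not read from the fitted scaler): (17 - 15) / 8 = 0.25 and (21 - 15) / 8 = 0.75. The sketch below checks the formula against scikit-learn on synthetic data, since the CSV itself is not reproduced here. `X_demo` and its distributions are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for two features: an hour of day and a skewed amount
X_demo = np.column_stack([
    rng.integers(0, 24, size=200),
    rng.lognormal(mean=6.0, sigma=1.0, size=200),
]).astype(float)

# Manual robust scaling: subtract the median, divide by the IQR
median = np.median(X_demo, axis=0)
q1, q3 = np.percentile(X_demo, [25, 75], axis=0)
iqr = q3 - q1
manual = (X_demo - median) / iqr

# scikit-learn's result with default parameters
scaler = RobustScaler()
sk_scaled = scaler.fit_transform(X_demo)

print("matches scikit-learn:", np.allclose(manual, sk_scaled))
print("centers:", scaler.center_)
print("scales: ", scaler.scale_)
```

The fitted `center_` and `scale_` attributes hold the per-feature median and IQR, which is useful for reporting or for undoing the transform later.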

## Stage 4: Convert the Scaled Array to a DataFrame

In [4]:
# Wrap the scaled array in a DataFrame, keeping column names and row labels
X_scaled = pd.DataFrame(X_scaled, columns=features, index=X.index)
X_scaled.head()


|   | txn_hour | txn_amount | amount_deviation | txn_velocity | behavior_score |
|---|---|---|---|---|---|
| 0 | 0.25 | -0.956819 | -0.588235 | 0.857143 | 0.866667 |
| 1 | 0.75 | 0.595717 | -0.117647 | 0.028571 | 0.8 |
| 2 | -0.625 | -0.262631 | -0.529412 | 0.4 | -0.2 |
| 3 | -0.875 | -0.758119 | -0.666667 | 0.314286 | -0.8 |
| 4 | 0.125 | 0.710206 | 0.529412 | -0.914286 | 0.6 |
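Wrapping a NumPy array in a DataFrame assigns a fresh 0..n-1 index unless `index=` is passed. That is harmless when the source frame has the default index, as here, but if rows are filtered before scaling, the scaled rows would no longer line up with their `txn_id`. The sketch below shows one way to keep the alignment. The filter on `txn_hour` is an arbitrary assumption used only to produce a non-contiguous index.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

features = ["txn_hour", "txn_amount"]
df_demo = pd.DataFrame({
    "txn_id": ["TXN00001", "TXN00002", "TXN00003", "TXN00004", "TXN00005"],
    "txn_hour": [17, 21, 10, 8, 16],
    "txn_amount": [222.39, 991.0, 566.06, 320.76, 1047.68],
})

# Hypothetical filter: dropping a row leaves index labels 0, 1, 2, 4
X = df_demo.loc[df_demo["txn_hour"] > 9, features]

scaled = RobustScaler().fit_transform(X)

# Reuse X.index so each scaled row keeps its original row label
X_scaled = pd.DataFrame(scaled, columns=features, index=X.index)

# The labels still match, so identifiers can be joined back safely
joined = df_demo[["txn_id"]].join(X_scaled, how="inner")
print(joined)
```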


## Stage 5: Verify the Final Shape

In [5]:
X_scaled.shape


(10000, 5)

## Observations

The dataset was prepared by selecting only the five numeric behavioral features required for anomaly detection: `txn_hour`, `txn_amount`, `amount_deviation`, `txn_velocity`, and `behavior_score`. The transaction identifier (`txn_id`), the categorical `payment_channel`, and the risk label (`risk_flag`) were excluded from the modelling input so that only behavioral indicators contribute to the anomaly detection process.

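Robust scaling was applied to put the features on comparable scales while limiting the influence of extreme values on the scaling parameters. `RobustScaler` centers each feature on its median and divides by its interquartile range, so a handful of very large transactions does not distort the scale of typical ones. Extreme values are not clipped or removed: they remain extreme after scaling, which preserves the signal an anomaly detector relies on. After scaling, no single feature dominates by sheer magnitude, and the data is suitable for distance-based anomaly detection methods.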
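The difference between robust and standard scaling in the presence of an outlier can be illustrated with a small, made-up example. The amounts below are invented for this sketch and do not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Seven typical amounts plus one extreme transaction (illustrative values)
amounts = np.array([200, 250, 300, 350, 400, 450, 500, 50_000],
                   dtype=float).reshape(-1, 1)

robust = RobustScaler().fit_transform(amounts).ravel()
standard = StandardScaler().fit_transform(amounts).ravel()

# Spread of the seven typical values under each scaler
robust_spread = robust[:7].max() - robust[:7].min()
standard_spread = standard[:7].max() - standard[:7].min()

print("robust   typical spread:", round(robust_spread, 3),
      "| outlier:", round(robust[-1], 2))
print("standard typical spread:", round(standard_spread, 3),
      "| outlier:", round(standard[-1], 2))
```

With standard scaling, the outlier inflates the mean and standard deviation, so the typical values are squeezed into a very narrow band. With robust scaling, the typical values keep a usable spread and the outlier stays far from them.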

## Conclusion

The dataset is now in a clean, model-ready format: 10,000 transactions described by five scaled behavioral features. In the next notebook, the Isolation Forest algorithm will be used to identify globally anomalous transactions based on deviations from normal behavioral patterns. Isolation Forest splits on one feature at a time, so it is much less sensitive to feature scale than distance-based methods; the scaled features matter most if distance-based detectors are later applied to the same data.
