<a href="https://colab.research.google.com/github/RR-someOne/Detecting-Anomolies-in-Financial-Transactions-Securities/blob/main/Anomoly_Detection_Finance_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
borismarjanovic_price_volume_data_for_all_us_stocks_etfs_path = kagglehub.dataset_download('borismarjanovic/price-volume-data-for-all-us-stocks-etfs')

print('Data source import complete.')


# Anomaly Detection in Finance

Anomaly detection is a critical analytical approach in finance that focuses on identifying unusual, rare, or suspicious patterns in financial data that deviate from normal behavior. These anomalies often act as early warning signals for risk, fraud, or abnormal market activity, enabling financial institutions to take timely corrective or preventive actions.

Modern financial systems generate massive volumes of data, including transaction records, market prices, order flows, and customer activity logs. Manual monitoring of such data is impractical, making machine learning–based anomaly detection an essential tool for automated surveillance and risk management. By learning patterns of normal behavior, anomaly detection models can efficiently surface irregularities that require further investigation.

## Key Use Cases
### Fraud Detection

Identify abnormal transactions such as:

- Unauthorized payments
- Identity theft
- Unusual spending patterns

### Market Manipulation Detection

Detect suspicious trading behaviors including:

- Spoofing
- Pump-and-dump schemes
- Insider trading signals
- Abnormal price or volume movements

### Risk Monitoring

Monitor and flag:

- Extreme portfolio losses
- Liquidity stress events
- System or trading failures
- Unusual counterparty behavior

## Common Algorithms Used

- Isolation Forest: efficient for large-scale, high-dimensional financial data
- One-Class SVM: learns normal behavior and detects deviations from it
- Autoencoders: neural networks that identify anomalies via reconstruction error
- Local Outlier Factor (LOF): detects anomalies based on local data density
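To make the list above concrete, here is a minimal sketch that runs two of these algorithms, Isolation Forest and LOF, through scikit-learn. The data is synthetic and the `contamination` and `n_neighbors` values are illustrative assumptions, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration):
# 500 "normal" points plus 5 planted, clearly distant outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0], [-8.5, -9.0], [9.5, 0.0]])
X = np.vstack([normal, outliers])

# Both estimators label inliers as 1 and outliers as -1.
iso_labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(X)

print("Isolation Forest flagged:", int((iso_labels == -1).sum()))
print("LOF flagged:", int((lof_labels == -1).sum()))
```

Neither model sees labels: `contamination` only sets the share of points each model is allowed to flag.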

## Why Anomaly Detection Matters in Finance

- Most financial anomalies are rare and unlabeled, which favors unsupervised methods
- Early detection helps reduce financial losses and regulatory risk
- Automated detection is suitable for real-time monitoring and surveillance systems

## Key Challenges

- Highly imbalanced datasets
- Concept drift, where normal behavior evolves over time
- Requirement for low false-positive rates in regulated environments
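The concept drift challenge can be illustrated with a small sketch. It assumes a synthetic series whose normal level shifts halfway through and compares a static baseline with a rolling one. The window of 50 and the threshold of 4 standard deviations are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic series (an assumption): the "normal" level moves from 100 to 130.
values = pd.Series(np.concatenate([rng.normal(100, 2, 250), rng.normal(130, 2, 250)]))

# Static baseline: mean and std estimated once, on the first regime only.
first_regime = values.iloc[:250]
static_z = (values - first_regime.mean()) / first_regime.std()

# Rolling baseline: compare each point with the 50 observations before it.
prior = values.shift(1).rolling(50)
rolling_z = (values - prior.mean()) / prior.std()

static_flags = int((static_z.abs() > 4).sum())
rolling_flags = int((rolling_z.abs() > 4).sum())
print("static baseline flags:", static_flags)
print("rolling baseline flags:", rolling_flags)
```

The static baseline keeps flagging the whole second regime, which inflates false positives. The rolling baseline flags the transition and then adapts to the new level.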

## Import Data

This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the kaggle/python Docker image (https://github.com/kaggle/docker-python). For example, here are several helpful packages to load:


In [None]:

%%capture captured
print("Hidden output")

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


## Identifying Market Manipulation Patterns
### Spoofing

Definition:
Placing large buy or sell orders with no intention of executing them, in order to create a false impression of supply or demand and move prices.

Indicators:

- High cancellation rate: orders are frequently placed and then canceled before execution.
- Order imbalance: a sudden skew in buy vs. sell orders that doesn’t reflect actual trading interest.

Example:
A trader places a large sell order to push the price down, buys at the lower price, and then cancels the large sell order.
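The two indicators above can be sketched on an order log. The log below is hypothetical: its column names (`trader`, `side`, `size`, `status`) and values are assumptions made for illustration and are not part of the dataset used in this notebook. The helper `spoofing_indicators` is likewise an illustrative name.

```python
import pandas as pd

# Hypothetical order log (illustrative assumption, not real data).
orders = pd.DataFrame({
    "trader": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "side":   ["sell", "sell", "sell", "buy", "buy", "sell", "buy", "sell"],
    "size":   [5000, 4000, 6000, 100, 200, 150, 300, 250],
    "status": ["canceled", "canceled", "canceled", "filled",
               "filled", "filled", "canceled", "filled"],
})

def spoofing_indicators(df):
    """Per-trader cancellation rate and signed order imbalance in [-1, 1]."""
    work = pd.DataFrame({
        "trader": df["trader"],
        "canceled": df["status"].eq("canceled"),
        # Buy sizes count as positive, sell sizes as negative.
        "signed_size": df["size"].where(df["side"].eq("buy"), -df["size"]),
        "size": df["size"],
    })
    grouped = work.groupby("trader")
    return pd.DataFrame({
        "cancel_rate": grouped["canceled"].mean(),
        "order_imbalance": grouped["signed_size"].sum() / grouped["size"].sum(),
    })

indicators = spoofing_indicators(orders)
print(indicators)
```

In this toy log, trader A cancels three of four orders and is almost entirely on the sell side, matching the spoofing example described above. Trader B looks balanced.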

## ML Workflow: Market Manipulation Detection Using Isolation Forest

## Problem Definition

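Goal: Detect suspicious trading patterns (e.g., spoofing) in ETFs and stocks using historical price and volume data.

### Indicators to consider:

- High cancellation rate
- Order imbalance (buy vs. sell orders)
- Abnormal volume spikes
- Extreme price movements

Cancellation rate and order imbalance require order-level data. The daily price and volume files used here can only support the volume and price indicators.

Type of ML: Unsupervised anomaly detection

Algorithm: Isolation Forest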
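As a rough sketch of this setup, the code below derives volume and price features from daily bars and scores them with an Isolation Forest. It runs on synthetic bars with one injected anomaly. The `build_features` helper, the 20-day window, and `contamination=0.01` are assumptions for illustration, not the notebook's final pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def build_features(df, window=20):
    """Daily return, intraday range, and volume z-score vs. the preceding window."""
    feats = pd.DataFrame(index=df.index)
    feats["return"] = df["Close"].pct_change()
    feats["range"] = (df["High"] - df["Low"]) / df["Close"]
    log_vol = np.log1p(df["Volume"])
    prior = log_vol.shift(1).rolling(window)  # excludes the current day
    feats["volume_z"] = (log_vol - prior.mean()) / prior.std()
    return feats.dropna()

# Synthetic daily bars (an assumption for illustration).
rng = np.random.default_rng(42)
n = 300
close = 50 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))
prices = pd.DataFrame({
    "Close": close,
    "High": close * (1 + rng.uniform(0, 0.01, n)),
    "Low": close * (1 - rng.uniform(0, 0.01, n)),
    "Volume": rng.integers(8_000, 12_000, n),
})

# Inject one anomaly at row 200: a volume spike and an unusually wide range.
prices.loc[200, "Volume"] *= 30
prices.loc[200, "High"] *= 1.10

features = build_features(prices)
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
features["anomaly"] = model.fit_predict(features[["return", "range", "volume_z"]])

flagged = features.index[features["anomaly"] == -1]
print("Flagged rows:", flagged.tolist())
```

Comparing each day's volume with the preceding window, rather than a window that includes the day itself, keeps a single spike from diluting its own z-score.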

## Data Collection

Source: per-ticker .txt files of daily price and volume data, one file per ETF or stock. The cells below sample the available files and check the data format and features.



In [None]:
import os
import random

PRICE_VOLUME_DATA_FOR_ALL_US_STOCKS_ETFS_PATH = '/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs'

def get_sample_text_files_ETFs(directory_path, sample_size=5):
    """
    Walks through a directory, finds all .txt files, and returns a random sample.

    Args:
        directory_path (str): The path to the directory to search.
        sample_size (int): The number of sample text files to return.

    Returns:
        list: A list of paths to sample text files.
    """
    text_files = []
    for dirpath, _, filenames in os.walk(directory_path):
        for filename in filenames:
            if filename.endswith('.txt'):
                text_files.append(os.path.join(dirpath, filename))
                #print(text_files)

    if len(text_files) <= sample_size:
        return text_files
    else:
        return random.sample(text_files, sample_size)

# Get a sample of text files from the downloaded Kaggle dataset path
sample_files = get_sample_text_files_ETFs(PRICE_VOLUME_DATA_FOR_ALL_US_STOCKS_ETFS_PATH, sample_size=10)

print("Sample of text files from the dataset directory:")
for f in sample_files:
    print(f)

Sample of text files from the dataset directory:
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/psk.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/ryf.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/hap.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/dfj.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/unl.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/vlue.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/lemb.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/inp.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/pgj.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/eqlt.us.txt


In [None]:
import os
import random

PRICE_VOLUME_DATA_FOR_ALL_US_STOCKS_ETFS_PATH_STOCKS = '/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks'

def get_sample_text_files_Stocks(directory_path, sample_size=5):
    """
    Walks through a directory, finds all .txt files, and returns a random sample.

    Args:
        directory_path (str): The path to the directory to search.
        sample_size (int): The number of sample text files to return.

    Returns:
        list: A list of paths to sample text files.
    """
    text_files = []
    for dirpath, _, filenames in os.walk(directory_path):
        for filename in filenames:
            if filename.endswith('.txt'):
                text_files.append(os.path.join(dirpath, filename))
                #print(text_files)

    if len(text_files) <= sample_size:
        return text_files
    else:
        return random.sample(text_files, sample_size)

# Get a sample of text files from the downloaded Kaggle dataset path
sample_files = get_sample_text_files_Stocks(PRICE_VOLUME_DATA_FOR_ALL_US_STOCKS_ETFS_PATH_STOCKS, sample_size=10)

print("Sample of text files from the dataset directory:")
for f in sample_files:
    print(f)

Sample of text files from the dataset directory:
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/rlgt.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/cht.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/klic.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/sni.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/fcfs.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/slgn.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/cowz.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/muj.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/agfsw.us.txt
/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/csii.us.txt


In [None]:
import pandas as pd

print("------------- ETFs Data Format -------------------")

file_path = '/kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/slyv.us.txt'

try:
    # The files are comma-separated despite the .txt extension,
    # so use the default separator rather than sep='\t'.
    slyv_df = pd.read_csv(file_path)
    print(f"Data from {file_path} loaded successfully. Displaying first 5 rows:")
    display(slyv_df.head())
except FileNotFoundError:
    print(f"Error: The file {file_path} was not found.")
except Exception as e:
    print(f"An error occurred while loading the file: {e}")

print("------------- Stocks Data Format -------------------")

file_path = '/kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/usrt.us.txt'

try:
    # Use a separate variable so the ETF DataFrame is not overwritten.
    usrt_df = pd.read_csv(file_path)
    print(f"Data from {file_path} loaded successfully. Displaying first 5 rows:")
    display(usrt_df.head())
except FileNotFoundError:
    print(f"Error: The file {file_path} was not found.")
except Exception as e:
    print(f"An error occurred while loading the file: {e}")



------------- ETFs Data Format -------------------
Data from /kaggle/input/price-volume-data-for-all-us-stocks-etfs/ETFs/slyv.us.txt loaded successfully. Displaying first 5 rows:


|   | Date | Open | High | Low | Close | Volume | OpenInt |
|---|------|------|------|-----|-------|--------|---------|
| 0 | 2005-02-25 | 44.396 | 44.875 | 44.396 | 44.875 | 2993 | 0 |
| 1 | 2005-02-28 | 45.051 | 45.124 | 44.819 | 44.851 | 49043 | 0 |
| 2 | 2005-03-01 | 45.051 | 45.293 | 44.972 | 45.293 | 2247 | 0 |
| 3 | 2005-03-02 | 45.348 | 45.348 | 45.348 | 45.348 | 371 | 0 |
| 4 | 2005-03-03 | 45.285 | 45.314 | 45.043 | 45.314 | 13855 | 0 |


------------- Stocks Data Format -------------------
Data from /kaggle/input/price-volume-data-for-all-us-stocks-etfs/Stocks/usrt.us.txt loaded successfully. Displaying first 5 rows:


|   | Date | Open | High | Low | Close | Volume | OpenInt |
|---|------|------|------|-----|-------|--------|---------|
| 0 | 2007-05-04 | 38.705 | 38.757 | 38.481 | 38.496 | 5289 | 0 |
| 1 | 2007-05-07 | 38.566 | 38.566 | 38.532 | 38.532 | 4257 | 0 |
| 2 | 2007-05-08 | 38.424 | 38.424 | 38.278 | 38.295 | 6454 | 0 |
| 3 | 2007-05-09 | 38.748 | 38.748 | 38.613 | 38.613 | 5161 | 0 |
| 4 | 2007-05-10 | 38.566 | 38.633 | 38.2 | 38.2 | 12897 | 0 |


In [None]:
print("--------- Features List -----------")
print("Features List: Date, Open, High, Low, Close, Volume, OpenInt")

--------- Features List -----------
Features List: Date, Open, High, Low, Close, Volume, OpenInt
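Given this feature list, a file can be parsed into typed columns with a date index. The sketch below uses an in-memory copy of the first rows of `ETFs/slyv.us.txt` shown in the output above, so it runs without the dataset. Using `Date` as the index is an assumption about what later steps will want.

```python
import io
import pandas as pd

# First rows of ETFs/slyv.us.txt, copied from the output above.
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume,OpenInt\n"
    "2005-02-25,44.396,44.875,44.396,44.875,2993,0\n"
    "2005-02-28,45.051,45.124,44.819,44.851,49043,0\n"
    "2005-03-01,45.051,45.293,44.972,45.293,2247,0\n"
    "2005-03-02,45.348,45.348,45.348,45.348,371,0\n"
    "2005-03-03,45.285,45.314,45.043,45.314,13855,0\n"
)

# The files are comma-separated; parse Date as a datetime and use it as the index.
df = pd.read_csv(sample, parse_dates=["Date"], index_col="Date")

print(df.dtypes)
print("OpenInt is zero in every sample row:", bool(df["OpenInt"].eq(0).all()))
```

`OpenInt` is 0 in every row shown here. Whether that holds across the full dataset would need to be checked before deciding to drop the column.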
