- osma.ovaskainen@aalto.fi
- aleksanteri.hamalainen@aalto.fi
- aku.karhinen@aalto.fi
- niko.laukkanen@outlook.com
This project automates the analysis of financial regulation documents by:
- Parsing provided JSON articles
- Filtering relevant articles from unrelated content
- Categorizing articles by risk type (credit, liquidity, market, operational, compliance)
- Clustering similar articles with TF-IDF and an LLM
- Comparing clustered articles with an LLM for contradictions or overlap
- Python 3.10+
- Ollama with the `gemma3:1b` and `gemma3:4b` models installed
- Conda (recommended) or pip
```bash
# Create environment from env.yml
conda env create -f env.yml
conda activate bureaucracy-buster

# Install additional dependencies
pip install scikit-learn

# Pull the required models
ollama pull gemma3:1b
ollama pull gemma3:4b
```

The processing pipeline consists of 5 main stages:
Parse → Filter → Split by Risk → Cluster → Compare
The easiest way to run the entire pipeline is with `run_all.py`:

```bash
cd src
python run_all.py
```

This will execute all 5 stages sequentially for all sources (EBA, FIVA_MOK) and all risk categories.
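Conceptually, `run_all.py` just chains the five stage scripts in order. The sketch below shows that orchestration pattern; the `run_pipeline` helper and its `dry_run` flag are illustrative assumptions, not the project's actual code:

```python
import subprocess
import sys

# The five stage scripts, in pipeline order.
STAGES = ["data_parse.py", "data_filter.py", "data_categorize.py",
          "data_cluster.py", "data_compare.py"]

def run_pipeline(stages=STAGES, dry_run=False):
    """Run each stage in order; with dry_run=True, only return the
    commands that would be executed (handy for inspection)."""
    commands = [[sys.executable, script] for script in stages]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop the pipeline on first failure
    return commands
```

Running the stages with `check=True` makes any failing stage abort the whole pipeline rather than feed broken intermediate files into the next step.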
You can also run each step individually. Edit the source/category variables in each script as needed, then run it:
### Step 1: Parse

Parses regulatory documents from the `data/gold/` folder into structured JSON.

```bash
# Edit the SOURCE variable in data_parse.py as needed (EBA or FIVA_MOK)
python src/data_parse.py
```

Input: documents in `data/gold/{source}/`
Output: `data/intermediate/parsed/all_{source}_parsed.json`
### Step 2: Filter

Uses an LLM to classify articles as relevant or unrelated to credit-giving organizations.

```bash
# Edit the SOURCE variable in data_filter.py as needed
python src/data_filter.py
```

Input: parsed documents from Step 1
Output:
- `data/intermediate/filtered/credit_related_{source}.json`
- `data/intermediate/filtered/unrelated_{source}.json`
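The relevance check presumably reduces to one yes/no prompt per article against the local Ollama server. A hedged sketch, assuming Ollama's default REST endpoint; the prompt wording, function names, and the injectable `llm` parameter are illustrative, not the project's actual code:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def ollama_generate(prompt, model="gemma3:1b"):
    """Call the local Ollama REST API (non-streaming) and return the text."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def is_credit_related(article_text, llm=ollama_generate):
    """Ask for a one-word verdict; `llm` is injectable so the logic
    can be exercised without a running model."""
    prompt = (
        "You classify financial regulation articles.\n"
        "Answer with exactly one word, RELEVANT or UNRELATED, depending on "
        "whether the article applies to credit-giving organizations.\n\n"
        f"Article:\n{article_text}"
    )
    return llm(prompt).strip().upper().startswith("RELEVANT")
```

Forcing a one-word answer and parsing it leniently (`startswith`) keeps the classifier robust to small models adding trailing punctuation.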
### Step 3: Categorize

Uses an LLM to classify articles into 5 risk categories.

```bash
# Edit the SOURCE variable in data_categorize.py as needed
python src/data_categorize.py
```

Input: filtered articles from Step 2
Output: `data/intermediate/categorized/{category}_{source}.json` for each category:
- `credit_risk_{source}.json`
- `liquidity_risk_{source}.json`
- `market_risk_{source}.json`
- `operational_risk_{source}.json`
- `compliance_risk_{source}.json`
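Categorization follows the same pattern as filtering, but with a constrained five-way choice. A sketch under the assumption that unrecognized answers are flagged rather than silently binned (the `None` fallback and names are illustrative):

```python
RISK_CATEGORIES = ["credit_risk", "liquidity_risk", "market_risk",
                   "operational_risk", "compliance_risk"]

def categorize_article(article_text, llm):
    """Ask the model to pick exactly one of the five risk categories;
    answers outside the list come back as None for manual review."""
    prompt = (
        "Classify the following financial regulation article into exactly one "
        "of these risk categories: " + ", ".join(RISK_CATEGORIES) + ".\n"
        "Answer with the category name only.\n\n"
        f"Article:\n{article_text}"
    )
    # Normalize e.g. "Credit Risk" -> "credit_risk" before validating.
    answer = llm(prompt).strip().lower().replace(" ", "_")
    return answer if answer in RISK_CATEGORIES else None
```

Validating against a fixed list matters here: the downstream file names (`{category}_{source}.json`) are derived directly from the category string.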
### Step 4: Cluster

Clusters articles across sources using TF-IDF pre-filtering + LLM similarity scoring.

```bash
# Edit the category variable in data_cluster.py as needed
python src/data_cluster.py
```

Parameters (configurable in the file):
- `tfidf_threshold`: TF-IDF pre-filter threshold (default: 0.2)
- `llm_threshold`: LLM similarity threshold for clustering (default: 0.84)
- `early_exit`: stop comparing once similarity exceeds this value (default: 0.91)

Input: categorized articles from Step 3
Output: `data/intermediate/clustered/{category}.json`
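The TF-IDF pre-filter can be sketched with scikit-learn (installed above): only article pairs whose cheap lexical similarity clears `tfidf_threshold` are forwarded to the expensive LLM comparison. The helper name is illustrative:

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_candidate_pairs(texts, tfidf_threshold=0.2):
    """Return index pairs whose TF-IDF cosine similarity meets the
    threshold; everything below it is dropped without an LLM call."""
    matrix = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(matrix)  # dense pairwise similarity matrix
    return [(i, j) for i, j in combinations(range(len(texts)), 2)
            if sims[i, j] >= tfidf_threshold]
```

With the default 0.2 threshold, articles sharing no significant vocabulary never reach the LLM, which is where most of the cost savings come from.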
### Step 5: Compare

Compares articles within clusters to identify overlaps and contradictions.

```bash
# Edit the category variable in data_compare.py as needed
python src/data_compare.py
```

Input: clustered articles from Step 4
Output: `results/{category}.json` containing:
- `overlap`: articles with the same regulatory requirements
- `contradiction`: articles with conflicting requirements
- `bloat`: generic similarities without concrete overlap
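The three-way verdict can be sketched as one constrained prompt per article pair within a cluster. As with the earlier steps, the prompt text, names, and `None` fallback are assumptions for illustration:

```python
VERDICTS = {"overlap", "contradiction", "bloat"}

def compare_articles(text_a, text_b, llm):
    """Ask whether two clustered articles state the same requirement,
    conflicting requirements, or only share generic wording."""
    prompt = (
        "Compare the two regulation articles below. Answer with exactly one "
        "word: overlap (same regulatory requirement), contradiction "
        "(conflicting requirements), or bloat (generic similarity only).\n\n"
        f"Article A:\n{text_a}\n\nArticle B:\n{text_b}"
    )
    answer = llm(prompt).strip().lower()
    # Unrecognized answers are surfaced as None instead of being guessed at.
    return answer if answer in VERDICTS else None
```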
- TF-IDF Pre-filtering: Computes text similarity using TF-IDF vectorization to reduce LLM calls
- Same-document filtering: Skips articles from the same document to focus on cross-regulation overlap
- LLM Similarity Scoring: Uses `gemma3:1b` via Ollama to assess semantic similarity of requirements
- Cluster Assignment: Groups articles with similarity above threshold or creates new reference clusters
- Progressive Reference Building: Unmatched articles become references for future comparisons
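The last two bullets amount to a greedy single pass over the articles. A minimal sketch with an injectable `similarity` function standing in for the LLM scorer; the exact control flow and names are assumptions:

```python
def assign_clusters(articles, similarity, llm_threshold=0.84, early_exit=0.91):
    """Greedy clustering: each article joins the best-matching reference
    cluster, or becomes a new reference itself (progressive reference
    building). Comparisons stop early on a very confident match."""
    clusters = []  # each cluster: {"reference": article, "members": [...]}
    for art in articles:
        best, best_score = None, 0.0
        for cluster in clusters:
            score = similarity(cluster["reference"], art)
            if score > best_score:
                best, best_score = cluster, score
            if score >= early_exit:
                break  # confident match: skip the remaining references
        if best is not None and best_score >= llm_threshold:
            best["members"].append(art)
        else:
            clusters.append({"reference": art, "members": [art]})
    return clusters
```

Because unmatched articles immediately become references, the number of LLM comparisons per article grows with the number of clusters, not with the total number of articles.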
Poor clustering results: Adjust thresholds based on your needs:
- Lower `llm_threshold` for broader clusters
- Raise `tfidf_threshold` to reduce LLM calls
## License

MIT