- osma.ovaskainen@aalto.fi
- aleksanteri.hamalainen@aalto.fi
- aku.karhinen@aalto.fi
- niko.laukkanen@outlook.com
This project automates the analysis of financial regulation documents by:
- Parsing provided JSON articles
- Filtering relevant articles from unrelated content
- Categorizing articles by risk type (credit, liquidity, market, operational, compliance)
- Clustering similar articles with TF-IDF and an LLM
- Comparing clustered articles with an LLM for contradictions or overlap
- Python 3.10+
- Ollama with the `gemma3:1b` and `gemma3:4b` models installed
- Conda (recommended) or pip
```bash
# Create environment from env.yml
conda env create -f env.yml
conda activate bureaucracy-buster

# Install additional dependencies
pip install scikit-learn

# Pull the required models
ollama pull gemma3:1b
ollama pull gemma3:4b
```

The processing pipeline consists of 5 main stages:
Parse → Filter → Split by Risk → Cluster → Compare
The easiest way to run the entire pipeline is with `run_all.py`:

```bash
cd src
python run_all.py
```

This will execute all 5 stages sequentially for all sources (EBA, FIVA_MOK) and all risk categories.
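Conceptually, `run_all.py` just chains the five stage scripts in order. The sketch below shows that orchestration pattern; the `run_pipeline` helper and its `dry_run` flag are illustrative assumptions, not the project's actual code:

```python
import subprocess
import sys

# The five stage scripts, in pipeline order.
STAGES = ["data_parse.py", "data_filter.py", "data_categorize.py",
          "data_cluster.py", "data_compare.py"]

def run_pipeline(stages=STAGES, dry_run=False):
    """Run each stage in order; with dry_run=True, only return the
    commands that would be executed (handy for inspection)."""
    commands = [[sys.executable, script] for script in stages]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop the pipeline on first failure
    return commands
```

Running the stages with `check=True` makes any failing stage abort the whole pipeline rather than feed broken intermediate files into the next step.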
You can also run each step individually. Edit the source/category variables in each script as needed, then run it:
### Step 1: Parse

Parses regulatory documents from the `data/gold/` folder into structured JSON.

```bash
# Edit the SOURCE variable in data_parse.py as needed (EBA or FIVA_MOK)
python src/data_parse.py
```

Input: documents in `data/gold/{source}/`
Output: `data/intermediate/parsed/all_{source}_parsed.json`
### Step 2: Filter

Uses an LLM to classify articles as relevant or unrelated to credit-giving organizations.

```bash
# Edit the SOURCE variable in data_filter.py as needed
python src/data_filter.py
```

Input: parsed documents from Step 1
Output:
- `data/intermediate/filtered/credit_related_{source}.json`
- `data/intermediate/filtered/unrelated_{source}.json`
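The relevance check presumably reduces to one yes/no prompt per article against the local Ollama server. A hedged sketch, assuming Ollama's default REST endpoint; the prompt wording, function names, and the injectable `llm` parameter are illustrative, not the project's actual code:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def ollama_generate(prompt, model="gemma3:1b"):
    """Call the local Ollama REST API (non-streaming) and return the text."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def is_credit_related(article_text, llm=ollama_generate):
    """Ask for a one-word verdict; `llm` is injectable so the logic
    can be exercised without a running model."""
    prompt = (
        "You classify financial regulation articles.\n"
        "Answer with exactly one word, RELEVANT or UNRELATED, depending on "
        "whether the article applies to credit-giving organizations.\n\n"
        f"Article:\n{article_text}"
    )
    return llm(prompt).strip().upper().startswith("RELEVANT")
```

Forcing a one-word answer and parsing it leniently (`startswith`) keeps the classifier robust to small models adding trailing punctuation.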
### Step 3: Categorize

Uses an LLM to classify articles into 5 risk categories.

```bash
# Edit the SOURCE variable in data_categorize.py as needed
python src/data_categorize.py
```

Input: filtered articles from Step 2
Output: `data/intermediate/categorized/{category}_{source}.json` for each category:
- `credit_risk_{source}.json`
- `liquidity_risk_{source}.json`
- `market_risk_{source}.json`
- `operational_risk_{source}.json`
- `compliance_risk_{source}.json`
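Categorization follows the same pattern as filtering, but with a constrained five-way choice. A sketch under the assumption that unrecognized answers are flagged rather than silently binned (the `None` fallback and names are illustrative):

```python
RISK_CATEGORIES = ["credit_risk", "liquidity_risk", "market_risk",
                   "operational_risk", "compliance_risk"]

def categorize_article(article_text, llm):
    """Ask the model to pick exactly one of the five risk categories;
    answers outside the list come back as None for manual review."""
    prompt = (
        "Classify the following financial regulation article into exactly one "
        "of these risk categories: " + ", ".join(RISK_CATEGORIES) + ".\n"
        "Answer with the category name only.\n\n"
        f"Article:\n{article_text}"
    )
    # Normalize e.g. "Credit Risk" -> "credit_risk" before validating.
    answer = llm(prompt).strip().lower().replace(" ", "_")
    return answer if answer in RISK_CATEGORIES else None
```

Validating against a fixed list matters here: the downstream file names (`{category}_{source}.json`) are derived directly from the category string.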
### Step 4: Cluster

Clusters articles across sources using TF-IDF pre-filtering + LLM similarity scoring.

```bash
# Edit the category variable in data_cluster.py as needed
python src/data_cluster.py
```

Parameters (configurable in the file):
- `tfidf_threshold`: TF-IDF pre-filter threshold (default: 0.2)
- `llm_threshold`: LLM similarity threshold for clustering (default: 0.84)
- `early_exit`: stop comparing once similarity exceeds this value (default: 0.91)

Input: categorized articles from Step 3
Output: `data/intermediate/clustered/{category}.json`
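The TF-IDF pre-filter can be sketched with scikit-learn (installed above): only article pairs whose cheap lexical similarity clears `tfidf_threshold` are forwarded to the expensive LLM comparison. The helper name is illustrative:

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_candidate_pairs(texts, tfidf_threshold=0.2):
    """Return index pairs whose TF-IDF cosine similarity meets the
    threshold; everything below it is dropped without an LLM call."""
    matrix = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(matrix)  # dense pairwise similarity matrix
    return [(i, j) for i, j in combinations(range(len(texts)), 2)
            if sims[i, j] >= tfidf_threshold]
```

With the default 0.2 threshold, articles sharing no significant vocabulary never reach the LLM, which is where most of the cost savings come from.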
### Step 5: Compare

Compares articles within clusters to identify overlaps and contradictions.

```bash
# Edit the category variable in data_compare.py as needed
python src/data_compare.py
```

Input: clustered articles from Step 4
Output: `results/{category}.json` containing:
- `overlap`: articles with the same regulatory requirements
- `contradiction`: articles with conflicting requirements
- `bloat`: generic similarities without concrete overlap
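The three-way verdict can be sketched as one constrained prompt per article pair within a cluster. As with the earlier steps, the prompt text, names, and `None` fallback are assumptions for illustration:

```python
VERDICTS = {"overlap", "contradiction", "bloat"}

def compare_articles(text_a, text_b, llm):
    """Ask whether two clustered articles state the same requirement,
    conflicting requirements, or only share generic wording."""
    prompt = (
        "Compare the two regulation articles below. Answer with exactly one "
        "word: overlap (same regulatory requirement), contradiction "
        "(conflicting requirements), or bloat (generic similarity only).\n\n"
        f"Article A:\n{text_a}\n\nArticle B:\n{text_b}"
    )
    answer = llm(prompt).strip().lower()
    # Unrecognized answers are surfaced as None instead of being guessed at.
    return answer if answer in VERDICTS else None
```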
- TF-IDF Pre-filtering: Computes text similarity using TF-IDF vectorization to reduce LLM calls
- Same-document filtering: Skips articles from the same document to focus on cross-regulation overlap
- LLM Similarity Scoring: Uses `gemma3:1b` via Ollama to assess semantic similarity of requirements
- Cluster Assignment: Groups articles with similarity above threshold or creates new reference clusters
- Progressive Reference Building: Unmatched articles become references for future comparisons
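The last two bullets amount to a greedy single pass over the articles. A minimal sketch with an injectable `similarity` function standing in for the LLM scorer; the exact control flow and names are assumptions:

```python
def assign_clusters(articles, similarity, llm_threshold=0.84, early_exit=0.91):
    """Greedy clustering: each article joins the best-matching reference
    cluster, or becomes a new reference itself (progressive reference
    building). Comparisons stop early on a very confident match."""
    clusters = []  # each cluster: {"reference": article, "members": [...]}
    for art in articles:
        best, best_score = None, 0.0
        for cluster in clusters:
            score = similarity(cluster["reference"], art)
            if score > best_score:
                best, best_score = cluster, score
            if score >= early_exit:
                break  # confident match: skip the remaining references
        if best is not None and best_score >= llm_threshold:
            best["members"].append(art)
        else:
            clusters.append({"reference": art, "members": [art]})
    return clusters
```

Because unmatched articles immediately become references, the number of LLM comparisons per article grows with the number of clusters, not with the total number of articles.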
Poor clustering results: Adjust thresholds based on your needs:
- Lower `llm_threshold` for broader clusters
- Raise `tfidf_threshold` to reduce LLM calls
## License

MIT