# Automatic detection of similar paragraphs covering emerging risks in peer banks' annual reports
## Proof-of-concept / work in progress
Author: Marcin Lipka (m453441), Group Risk Office

_October 2025_

This notebook aims to implement a simplified prototype of the NLP component described in the paper _Hanley Hoberg - Dynamic Interpretation of Emerging Risks in the Financial Sector.pdf_.
In the subsequent sections, we:
1. Set up the working environment to enable the analysis
2. Ingest specific pages (covering risk) of 4 annual reports (for 2024) from Danske Group, Deutsche Bank Group, ING Group and UBS Group
3. Parse the reports into paragraphs
4. Map the paragraphs to a semantic vector space using a BERT model
5. Run a pairwise similarity comparison between the paragraphs coming from different documents
6. Pick the most similar pairs of paragraphs
7. Test the use of KeyBERT for characterizing the paired paragraphs in a simplified form

### Setting up the environment 

In [2]:
# Standard library
import os
import pickle
import re
from itertools import chain, compress, takewhile

# Third-party
import fitz  # this is pymupdf
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

In [2]:
%pip uninstall torch torchvision torchaudio -y


%pip install torch==2.2.2
%pip install torchvision==0.17.2
%pip install torchaudio==2.2.2


Found existing installation: torch 2.8.0
Uninstalling torch-2.8.0:
  Successfully uninstalled torch-2.8.0
Found existing installation: torchvision 0.23.0
Uninstalling torchvision-0.23.0:
  Successfully uninstalled torchvision-0.23.0
Found existing installation: torchaudio 2.2.2
Uninstalling torchaudio-2.2.2:
  Successfully uninstalled torchaudio-2.2.2
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://artifactory.itcm.oneadr.net/api/pypi/pypi-remote/simple
Collecting torchvision==0.17.2
  Downloading https://artifactory.itcm.oneadr.net/api/pypi/pypi-remote/packages/packages/fd/d1/8da7f30169f56764f0ef9ed961a32f300a2d782b6c1bc8b391c3014092f8/torchvision-0.17.2-cp312-cp312-win_amd64.whl (1.2 MB)
Installing collected packages: torchvision
Successfully installed torchvision-0.17.2
Collecting torchaudio==2.2.2
  Downloading https://artifactory.itcm.oneadr.net/api/pypi/pypi-remote/packages/packages/13/47/2a86273d9b04f0e328ddc79269e1a91d2716ce9f6acc0ef4250d0421e858/torchaudio-2.2.2-cp312-cp312-win_amd64.whl (2.4 MB)
Installing collected packages: torchaudio
Successfully installed torchaudio-2.2.2
Collecting torch==2.2.2
  Downloading https://artifactory.itcm.oneadr.net/api/pypi/pypi-remote/packages/packages/72/ce/beca89dcdcf4323880d3b959ef457a4c61a95483af250e6892fec9174162/torch-2.2.2-cp312-cp312-win_amd64.whl (198.5 MB)

[notice] A new release of pip is available: 25.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
docling-ibm-models 3.9.1 requires torchvision<1,>=0, which is not installed.
easyocr 1.7.2 requires torchvision>=0.5, which is not installed.
pretrainedmodels 0.7.4 requires torchvision, which is not installed.



In [None]:
!pip install protobuf==3.20.3



Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://artifactory.itcm.oneadr.net/api/pypi/pypi-remote/simple
Collecting protobuf==3.20.3
  Downloading https://artifactory.itcm.oneadr.net/api/pypi/pypi-remote/packages/packages/8d/14/619e24a4c70df2901e1f4dbc50a6291eb63a759172558df326347dce1f0d/protobuf-3.20.3-py2.py3-none-any.whl (162 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.0
    Uninstalling protobuf-3.20.0:
      Successfully uninstalled protobuf-3.20.0
Successfully installed protobuf-3.20.3





#### Initiating the models

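The bi-encoder below is a text embedding model, while the cross-encoder scores pairs of texts directly. Both were downloaded from Hugging Face via the huggingface_hub library and are loaded here from local folders.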

In [5]:
bi_encoder = SentenceTransformer(r"C:\Users\m453441\OneDrive - Nordea\1 Audit\01 Search tool\models\\")
cross_encoder = CrossEncoder(r"C:\Users\m453441\OneDrive - Nordea\1 Audit\01 Search tool\cross_encoder_bin\\")

#### Setting up the folder structure

In [6]:
reports_folder = './reports/'
report_files = [f for f in os.listdir(reports_folder) if f.endswith('.pdf')]
# First PDF in the folder (despite the name, it is not randomly chosen)
random_report = report_files[0]
doc_path = os.path.join(reports_folder, random_report)


### Ingestion of documents

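PyMuPDF (imported as `fitz`) extracts the text of each page as blocks, which are treated as paragraphs. In v0 the relevant page ranges must be specified manually per document; page numbers are 0-based indices into the PDF.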
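
For orientation, `page.get_text("blocks")` in PyMuPDF returns one tuple per block in the form `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` is 0 for text and 1 for images. The sketch below runs on made-up tuples rather than a real PDF, and the helper name `text_blocks` is hypothetical. It shows how image blocks could be skipped in addition to empty ones; this is an optional refinement and not what v0 of the notebook does.

```python
def text_blocks(blocks):
    """Return the stripped text of non-empty text blocks (block_type == 0)."""
    kept = []
    for block in blocks:
        text, block_type = block[4], block[6]
        if block_type == 0 and text.strip():
            kept.append(text.strip())
    return kept


# Made-up blocks following the tuple layout described above
sample_blocks = [
    (50.0, 40.0, 300.0, 60.0, "Market risk \n", 0, 0),
    (50.0, 70.0, 300.0, 90.0, "   \n", 1, 0),                      # whitespace only
    (50.0, 100.0, 300.0, 200.0, "<image placeholder>", 2, 1),      # image block
]
print(text_blocks(sample_blocks))  # ['Market risk']
```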

In [7]:
# Define page ranges for specific documents (files not listed here fall back to default_pages)
# Page numbers are 0-based indices into the PDF; range() excludes its upper bound
page_ranges = {
    '2024_Danske_group.pdf': range(26, 30),  # Pages 26-29
    '2024_UBS_group.pdf': chain(range(88,90), range(99,107)),   # Pages 88-89 and 99-106
    '2024_DeutscheBank_group.pdf': chain(range(69, 71), range(75,87), range(122, 125)),   # Pages 69-70, 75-86 and 122-124
    '2024_ING_group.pdf': chain(range(158, 162), range(168, 196)),   # Pages 158-161 and 168-195
    # Add more files and their specific page ranges as needed
}

# Default pages to process if no specific range is defined for a file
# Options:
# - range(0, 10) for first 10 pages
# - [0, 1, 2, -1] for first 3 and last page (use negative for counting from end)
# - None to process all pages
default_pages = range(0, 10)  # First 10 pages by default

files_walk = os.walk(reports_folder)
report_paragraphs = []
report_paragraphs_source = []
report_pages_source = []

for path, dirs, files in files_walk:
    pdfs = [file for file in files if file.endswith('.pdf')]
    for _file in pdfs:
        print(f"Processing {_file}...")
        
        # Determine which pages to process for this file
        if _file in page_ranges:
            pages_to_process = page_ranges[_file]
        else:
            pages_to_process = default_pages
        
        with fitz.open(os.path.join(path, _file)) as doc:
            total_pages = len(doc)
            
            # If pages_to_process is None, process all pages
            if pages_to_process is None:
                pages_to_process = range(total_pages)
            
            # Handle negative page numbers (count from end)
            actual_pages = []
            for page_num in pages_to_process:
                if isinstance(page_num, int):
                    if page_num < 0:
                        actual_page = total_pages + page_num  # Convert negative to positive
                    else:
                        actual_page = page_num
                    
                    # Only include valid page numbers
                    if 0 <= actual_page < total_pages:
                        actual_pages.append(actual_page)
            
            
            # Process only the specified pages
            for page_num in actual_pages:
                page = doc[page_num]
                blocks = [x[4] for x in page.get_text("blocks")]
                # get rid of empty blocks
                blocks = [block.strip() for block in blocks if block.strip()]
                
                if blocks:  # Only add if there are non-empty blocks
                    report_paragraphs.extend(blocks)
                    report_pages_source.extend([page_num] * len(blocks))
                    report_paragraphs_source.extend([_file] * len(blocks))



Processing 2024_Danske_group.pdf...
Processing 2024_DeutscheBank_group.pdf...
Processing 2024_ING_group.pdf...
Processing 2024_UBS_group.pdf...
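
The page-selection logic inside the loop above can also be read in isolation. The following is a minimal sketch of the same rules (`None` means all pages, negative numbers count from the end, out-of-range pages are dropped); the function name `resolve_pages` is hypothetical and not part of the notebook.

```python
def resolve_pages(pages, total_pages):
    """Map a page specification to valid 0-based page indices.

    pages may be None (all pages) or any iterable of ints; negative
    values count from the end and out-of-range values are dropped.
    """
    if pages is None:
        return list(range(total_pages))
    resolved = []
    for page_num in pages:
        if not isinstance(page_num, int):
            continue  # ignore anything that is not a page number
        actual = total_pages + page_num if page_num < 0 else page_num
        if 0 <= actual < total_pages:
            resolved.append(actual)
    return resolved


print(resolve_pages([0, 1, 2, -1], total_pages=10))  # [0, 1, 2, 9]
print(resolve_pages(range(8, 12), total_pages=10))   # [8, 9]
print(resolve_pages(None, total_pages=3))            # [0, 1, 2]
```

Pulling the logic into a function like this would make it testable without opening a PDF.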


Ingestion sanity check:

In [8]:
report_paragraphs[10]

'Market risk \nA robust trading portfolio mainly consisting of fixed income products, with stable income and risk \nlevels. The IRRBB EV model was further improved in 2024, and the risk is well hedged.'

In [9]:
change_indices = [i for i in range(1, len(report_paragraphs_source)) if report_paragraphs_source[i] != report_paragraphs_source[i-1]]
for index in change_indices:
    print(report_paragraphs_source[index-1])
    print(report_paragraphs_source[index])
    print(report_pages_source[index-1])
    print(report_pages_source[index])
    print(report_paragraphs[index-15:index])

2024_Danske_group.pdf
2024_DeutscheBank_group.pdf
29
69
['Market risk  \nTaking on market risk is a key part of the Group’s business strategy, \nin relation to both trading and non-trading activities. The market risk \ninherent in trading activities is split into two elements: one covering', 'explicit position-taking arising from customer trades and market-\nmaking, and one focusing on the risk associated with valuation \nadjustments (xVA risk) in the derivatives portfolio. Most of the \nGroup’s market risk relates to fixed income products (interest rate \nrisk and bond spread risk). Interest rate volatility remained relatively \nmuted for the first half of 2024, continuing the trend from late 2023 \ntowards lower levels, as markets were awaiting the beginning of \ncentral bank interest rate cuts. In the second part of 2024, interest \nrate volatility moved up, supported by recession fears, geopolitical \ntensions and the US election. These key events were all managed \nwell by the Gro

Checking that the ingestion worked as expected, i.e. that we have the same number of paragraphs, source pages and source files:

In [10]:
print(len(report_paragraphs_source))
print(len(report_pages_source))
print(len(report_paragraphs))

1480
1480
1480


### Cleaning ingested data

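Remove near-duplicate paragraphs within each document in an attempt to get rid of headers, footers, page numbers etc. Two paragraphs count as near-duplicates when their lengths differ by fewer than 5 characters and fewer than 5 character positions differ; only the first occurrence is kept.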

In [11]:
from collections import defaultdict

# Group indices by document
doc_indices = defaultdict(list)
for idx, doc in enumerate(report_paragraphs_source):
    doc_indices[doc].append(idx)

# Indices to keep
indices_to_keep = set()

for doc, indices in doc_indices.items():
    seen = []
    for idx in indices:
        para = report_paragraphs[idx]
        # Check if this paragraph is very similar to any already seen (diff < 5 chars)
        if not any(abs(len(para) - len(other)) < 5 and sum(a != b for a, b in zip(para, other)) < 5 for other in seen):
            indices_to_keep.add(idx)
            seen.append(para)

# Sort indices to keep
indices_to_keep = sorted(indices_to_keep)

# Filter all lists
report_paragraphs = [report_paragraphs[i] for i in indices_to_keep]
report_paragraphs_source = [report_paragraphs_source[i] for i in indices_to_keep]
report_pages_source = [report_pages_source[i] for i in indices_to_keep]
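
To illustrate the criterion used in the cell above, here is a minimal standalone sketch of the pairwise test. The function name `is_near_duplicate` and the example strings are made up for illustration. The second example points at a limitation of a position-wise comparison: a short insertion at the start shifts all later characters, so the texts no longer count as near-duplicates.

```python
def is_near_duplicate(a, b, max_diff=5):
    """True if the lengths differ by fewer than max_diff characters and fewer
    than max_diff aligned positions differ (compared up to the shorter text)."""
    if abs(len(a) - len(b)) >= max_diff:
        return False
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches < max_diff


# Made-up footers that differ only in the page number
print(is_near_duplicate("Annual Report 2024  |  Page 27",
                        "Annual Report 2024  |  Page 28"))  # True

# A two-character prefix shifts every later position
print(is_near_duplicate("Risk management overview",
                        "1 Risk management overview"))      # False
```

An edit-distance measure would handle such shifts, at a higher computational cost.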

Sanity check: the three lists should still have equal lengths, and the drop from the 1480 paragraphs counted before shows how many were removed:

In [12]:
print(len(report_paragraphs_source))
print(len(report_pages_source))
print(len(report_paragraphs))

1058
1058
1058


Manual review of the "edge" paragraphs, i.e. the last ones before the source document changes:

In [13]:
change_indices = [i for i in range(1, len(report_paragraphs_source)) if report_paragraphs_source[i] != report_paragraphs_source[i-1]]
for index in change_indices:
    print(report_paragraphs_source[index-1])
    print(report_paragraphs_source[index])
    print(report_pages_source[index-1])
    print(report_pages_source[index])
    print(report_paragraphs[index-15:index])

2024_Danske_group.pdf
2024_DeutscheBank_group.pdf
29
69
['The ﬁnancial highlights provide information about the balance sheet.', 'Trading portfolio assets and trading portfolio liabilities increased to \nnet assets of DKK 174.3 billion (end-2023: net assets of DKK 93.7 \nbillion). The increase in net assets was due mainly to changes in the \nfair value of the derivatives portfolio.', 'Market risk  \nTaking on market risk is a key part of the Group’s business strategy, \nin relation to both trading and non-trading activities. The market risk \ninherent in trading activities is split into two elements: one covering', 'explicit position-taking arising from customer trades and market-\nmaking, and one focusing on the risk associated with valuation \nadjustments (xVA risk) in the derivatives portfolio. Most of the \nGroup’s market risk relates to fixed income products (interest rate \nrisk and bond spread risk). Interest rate volatility remained relatively \nmuted for the first half of 2024

### Paragraph embeddings generation

In [None]:
report_embeddings_all = bi_encoder.encode(report_paragraphs, convert_to_tensor=True, show_progress_bar=True)

### Store data in pickle files

In [14]:
# Change the version according to the current numbering
# v1 - first version, no filters
# v2 - filtering out paragraphs that are too similar (5 chars) to others within the same document
version = 'v2'

In [None]:
with open(f'report_embeddings_all_{version}.pickle', 'wb') as pkl:
    pickle.dump(report_embeddings_all, pkl)
with open(f'report_paragraphs_{version}.pickle', 'wb') as pkl:
    pickle.dump(report_paragraphs, pkl)
with open(f'report_pages_source_{version}.pickle', 'wb') as pkl:
    pickle.dump(report_pages_source, pkl)
with open(f'report_paragraphs_source_{version}.pickle', 'wb') as pkl:
    pickle.dump(report_paragraphs_source, pkl)

Use the cell below if you ever want to load the pickle files:

In [15]:
with open(f'report_embeddings_all_{version}.pickle', 'rb') as pkl:
    report_embeddings_all = pickle.load(pkl)
with open(f'report_paragraphs_{version}.pickle', 'rb') as pkl:
    report_paragraphs = pickle.load(pkl)
with open(f'report_pages_source_{version}.pickle', 'rb') as pkl:
    report_pages_source = pickle.load(pkl)
with open(f'report_paragraphs_source_{version}.pickle', 'rb') as pkl:
    report_paragraphs_source = pickle.load(pkl)
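
Because the four objects only make sense together, one option would be to store them in a single file so that they cannot get out of sync. The sketch below illustrates that idea and is not what the notebook currently does; `save_bundle`, `load_bundle` and the file name are hypothetical, and plain lists stand in for the embedding tensor.

```python
import os
import pickle
import tempfile


def save_bundle(path, embeddings, paragraphs, pages, sources):
    """Store the aligned objects together after checking their lengths."""
    if not (len(embeddings) == len(paragraphs) == len(pages) == len(sources)):
        raise ValueError("inputs must all have the same length")
    bundle = {"embeddings": embeddings, "paragraphs": paragraphs,
              "pages": pages, "sources": sources}
    with open(path, "wb") as pkl:
        pickle.dump(bundle, pkl)


def load_bundle(path):
    """Load a bundle written by save_bundle."""
    with open(path, "rb") as pkl:
        return pickle.load(pkl)


# Toy stand-ins; in the notebook the embeddings are a torch tensor
with tempfile.TemporaryDirectory() as tmp:
    bundle_path = os.path.join(tmp, "report_bundle_v2.pickle")
    save_bundle(bundle_path, [[0.1, 0.2], [0.3, 0.4]],
                ["para A", "para B"], [26, 69],
                ["2024_Danske_group.pdf", "2024_DeutscheBank_group.pdf"])
    restored = load_bundle(bundle_path)

print(restored["pages"])  # [26, 69]
```

As with any pickle file, such a bundle should only be loaded from a trusted location.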

### Similarity search

Looking for similar vectors between the different sources. This is to be fixed later, but for now two noise filters are applied only at this stage. They exclude:
1. paragraphs in which more than 10% of the characters are digits (targeting tables and table rows),
2. paragraphs of 100 characters or fewer, which are typically headers and similarly irrelevant fragments.

This phase needs further improvement in future versions, but for now it gets rid of the most obvious noise.

In [16]:
def fun_digit_ratio(text, threshold=0.1):
    """Return True if the share of digit characters in text exceeds threshold."""
    digits = sum(c.isdigit() for c in text)
    ratio = digits / max(1, len(text))
    return ratio > threshold

# Compute cosine similarity for each pair
similarities = []
for i in range(len(report_paragraphs)):
    if len(report_paragraphs[i]) > 100 and fun_digit_ratio(report_paragraphs[i], threshold=0.1) is False:
        for j in range(i + 1, len(report_paragraphs)):
            # Keep only cross-document pairs where paragraph j also passes the length and digit filters
            if (len(report_paragraphs[j]) > 100) and (report_paragraphs_source[i] != report_paragraphs_source[j]) and fun_digit_ratio(report_paragraphs[j], threshold=0.1) is False:
                emb1 = report_embeddings_all[i]
                emb2 = report_embeddings_all[j]
                score = torch.nn.functional.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
                similarities.append({
                    'text1': report_paragraphs[i],
                    'text2': report_paragraphs[j],
                    'source_text1': report_paragraphs_source[i],
                    'source_text2': report_paragraphs_source[j],  
                    'len_t1': len(report_paragraphs[i]),
                    'len_t2': len(report_paragraphs[j]),
                    'similarity': score
                })


# Create dataframe
similarity_df = pd.DataFrame(similarities)
similarity_df.sort_values('similarity', ascending=False, inplace=True)
similarity_df.reset_index(drop=True, inplace=True)
similarity_df

| | text1 | text2 | source_text1 | source_text2 | len_t1 | len_t2 | similarity |
|---|---|---|---|---|---|---|---|
| 0 | Amidst the ongoing war in Ukraine, potential f... | During 2024 a trend emerged whereby Russian pa... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 568 | 691 | 0.807738 |
| 1 | The risks listed below are defined as top (alr... | An overview of our top and emerging risks, fro... | 2024_ING_group.pdf | 2024_UBS_group.pdf | 707 | 430 | 0.773770 |
| 2 | Country risk management \nAvoiding undue conce... | The credit concentration risk framework is com... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 734 | 1009 | 0.766335 |
| 3 | In addition to country thresholds, gap risk th... | The credit concentration risk framework is com... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 1097 | 1009 | 0.756699 |
| 4 | The war in Ukraine continues to lead to a high... | The war in Ukraine\nThe war in Ukraine continu... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 1159 | 669 | 0.751910 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 66730 | Goodwill is reviewed annually for impairment o... | Geopolitical risk is an important concern for ... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 124 | 627 | -0.099426 |
| 66731 | 1\nStage 3 lifetime credit impaired provision ... | Geopolitical uncertainty\nWe remain watchful o... | 2024_ING_group.pdf | 2024_UBS_group.pdf | 702 | 388 | -0.103083 |
| 66732 | US-EU-China relations and regional tensions\nU... | The calculation of affordability takes into ac... | 2024_ING_group.pdf | 2024_UBS_group.pdf | 570 | 321 | -0.107103 |
| 66733 | Goodwill is reviewed annually for impairment o... | International elections\nThere were an unprece... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 124 | 1072 | -0.123577 |
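
The score in the last column is the cosine similarity between two paragraph embeddings, i.e. their dot product divided by the product of their norms. The sketch below reproduces the cross-document pairing and scoring in plain Python on made-up two-dimensional vectors; the function names and toy data are illustrative assumptions, and the notebook itself uses torch tensors produced by the bi-encoder.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_document_pairs(embeddings, sources):
    """Score every pair (i, j) with i < j whose paragraphs come from different documents."""
    scored = []
    for i, j in combinations(range(len(embeddings)), 2):
        if sources[i] != sources[j]:
            scored.append((i, j, cosine_similarity(embeddings[i], embeddings[j])))
    # Most similar pairs first, as in the dataframe above
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Made-up "embeddings" for three paragraphs from two documents
toy_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_sources = ["a.pdf", "a.pdf", "b.pdf"]
for i, j, score in cross_document_pairs(toy_embeddings, toy_sources):
    print(i, j, round(score, 3))
```

The pair (0, 1) is skipped because both paragraphs come from the same document. For the real data, computing all scores in one matrix operation would likely be much faster than a Python loop over pairs.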


The resulting dataframe contains all cross-document paragraph pairs that passed the filters, sorted by similarity score, along with the paragraph lengths. To facilitate review of the model output, we export the pairs with a similarity above 0.6 to an Excel file:

In [18]:
similarity_df.query('similarity > 0.6').to_excel('similar_paragraphs_06.xlsx')

Manual review of the most similar pairs shows that they are indeed semantically similar, which is promising for the target solution. Working with the identified pairs would be more efficient if each pair came with a simple keyword-based characterization. KeyBERT is a library that does exactly that - see below.

### Summarizing similarity search results with KeyBERT

Test run of KeyBERT keyword extraction on the most similar paragraphs: each paragraph in the top 10 pairs is characterized by its top 5 keywords. This effectively maps the similar paragraphs into a common keyword space, which can be used to group similar content together more easily - to be done in future versions.

In [None]:
from keybert import KeyBERT

kw_model = KeyBERT(model = bi_encoder)

# Get top N most similar pairs
top_n = 10
top_pairs = similarity_df.head(top_n)

# Extract keywords for each text in the pairs
for idx, row in top_pairs.iterrows():
    print(f"Pair {idx+1}:")
    print(f"Source 1: {row['source_text1']}")
    print("Keywords 1:", kw_model.extract_keywords(row['text1'], top_n=5))
    print(f"Source 2: {row['source_text2']}")
    print("Keywords 2:", kw_model.extract_keywords(row['text2'], top_n=5))
    print('-'*40)

Pair 1:
Source 1: 2024_DeutscheBank_group.pdf
Keywords 1: [('sanctions', 0.4968), ('banks', 0.4337), ('ukraine', 0.3427), ('disputes', 0.3304), ('russia', 0.3136)]
Source 2: 2024_ING_group.pdf
Keywords 2: [('banking', 0.4111), ('banks', 0.3934), ('sanctions', 0.3923), ('courts', 0.3487), ('liable', 0.34)]
----------------------------------------
Pair 2:
Source 1: 2024_ING_group.pdf
Keywords 1: [('risks', 0.5732), ('risk', 0.5317), ('emerging', 0.3946), ('impact', 0.3268), ('liquidity', 0.2811)]
Source 2: 2024_UBS_group.pdf
Keywords 2: [('risks', 0.6533), ('risk', 0.6103), ('prospects', 0.4051), ('investors', 0.3431), ('strategy', 0.3175)]
----------------------------------------
Pair 3:
Source 1: 2024_DeutscheBank_group.pdf
Keywords 1: [('countries', 0.4215), ('risk', 0.3845), ('country', 0.3527), ('japan', 0.3446), ('asia', 0.3392)]
Source 2: 2024_ING_group.pdf
Keywords 2: [('risk', 0.3926), ('countries', 0.2985), ('concentration', 0.2793), ('credit', 0.2713), ('sector', 0.2332)]
----
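
One simple way to use the common keyword space is to measure how much the two keyword sets of a pair overlap. The sketch below computes a Jaccard overlap for the keywords printed for Pair 1; the function name `keyword_overlap` is hypothetical, and the measure itself is only one possible choice for the grouping planned in future versions.

```python
def keyword_overlap(keywords_a, keywords_b):
    """Jaccard overlap between two KeyBERT-style lists of (keyword, score) tuples."""
    set_a = {word for word, _ in keywords_a}
    set_b = {word for word, _ in keywords_b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Keywords copied from the "Pair 1" output above
keywords_1 = [('sanctions', 0.4968), ('banks', 0.4337), ('ukraine', 0.3427),
              ('disputes', 0.3304), ('russia', 0.3136)]
keywords_2 = [('banking', 0.4111), ('banks', 0.3934), ('sanctions', 0.3923),
              ('courts', 0.3487), ('liable', 0.34)]
print(keyword_overlap(keywords_1, keywords_2))  # 0.25
```

Two of the eight distinct keywords are shared. Note that near-variants such as "banks" and "banking" count as different here, so some normalization (for example stemming) might be worth testing.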

The notebook currently ends here. It has been agreed that it brings business value in its current form and is a decent starting point for developing the final product. Further steps to be implemented in future versions include, among others:
- Automatic detection of risk management sections in annual reports, to avoid manual specification of relevant page ranges
- Clustering similar paragraph pairs based on the extracted keywords
- Building an interactive layer on top that lets the user apply their judgement and select the relevant combinations of documents / paragraphs for further analysis based on stock price covariance or other metrics
