# Automatic detection of similar paragraphs covering emerging risks in peer banks' annual reports
## Proof-of-concept / work in progress
Author: Marcin Lipka (m453441), Group Risk Office

_October 2025_

### Setting up the environment 

In [None]:
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification 
#AutoTokenizer – convert raw text into tokens (numerical input) that a transformer model understands
#AutoModel – load a pretrained transformer (like BERT, RoBERTa, etc.) for generating embeddings or contextual representations of text.
#AutoModelForSequenceClassification – load a pretrained transformer fine-tuned for classification tasks (e.g., sentiment analysis, topic detection)
import torch
#PyTorch – a popular deep learning framework for building and training neural networks.
import pandas as pd
#Pandas – a powerful data manipulation and analysis library for working with structured data (like CSV files, DataFrames, etc.)
import pickle
#Pickle – a Python module for serializing and deserializing Python objects, allowing you to save and load complex data structures.
import torch.nn.functional as F
#Torch.nn.functional – a submodule in PyTorch that provides various functions for building neural networks, such as activation functions, loss functions, etc.
import os
#OS – a standard Python library for interacting with the operating system, handling file paths, directories, and environment variables.
import numpy as np
#NumPy – a fundamental library for numerical computing in Python, providing support for arrays, matrices, and mathematical functions.
from itertools import compress
#Itertools.compress – a function that filters elements from an iterable based on a corresponding boolean selector iterable.
import fitz  
#Pymupdf (fitz) – a library for working with PDF documents, allowing you to read, manipulate, and extract content from PDFs.
from itertools import takewhile
#Itertools.takewhile – a function that returns elements from an iterable as long as a specified condition is true.
import re
#Re – a standard Python library for working with regular expressions, enabling pattern matching and text manipulation.
from sentence_transformers import SentenceTransformer, CrossEncoder, util
#SentenceTransformers – a library built on top of transformers for easy generation of sentence embeddings and semantic search.
#CrossEncoder – a model architecture that encodes pairs of sentences jointly for tasks like sentence similarity and ranking.
#Util – a utility module in SentenceTransformers that provides various helper functions for tasks like similarity computation and embedding operations.
from itertools import chain
#Itertools.chain – a function that combines multiple iterables into a single iterable, allowing you to iterate over all elements sequentially.

In [None]:
%pip uninstall torch torchvision torchaudio -y
#Reinstall specific versions of torch, torchvision, and torchaudio


%pip install torch==2.2.2
#Install torch version 2.2.2
%pip install torchvision==0.17.2
#Install torchvision version 0.17.2
%pip install torchaudio==2.2.2
#Install torchaudio version 2.2.2


Found existing installation: torch 2.9.0
Uninstalling torch-2.9.0:
  Successfully uninstalled torch-2.9.0
Found existing installation: torchvision 0.24.0
Uninstalling torchvision-0.24.0:
  Successfully uninstalled torchvision-0.24.0
Found existing installation: torchaudio 2.9.0
Uninstalling torchaudio-2.9.0:
  Successfully uninstalled torchaudio-2.9.0
Note: you may need to restart the kernel to use updated packages.
Collecting torch==2.2.2
  Downloading torch-2.2.2-cp312-none-macosx_11_0_arm64.whl.metadata (25 kB)
Downloading torch-2.2.2-cp312-none-macosx_11_0_arm64.whl (59.7 MB)
Installing collected packages: torch
Successfully installed torch-2.2.2
Note: you may need to restart the kernel to use updated packages.
Collecting torchvision==0.17.2
  Downloading torchvision-0.17.2-cp312-cp312-macosx_11_0_arm64.whl.met

In [None]:
!pip install protobuf==3.20.3
#Install specific version of protobuf to avoid compatibility issues


Collecting protobuf==3.20.3
  Downloading protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Downloading protobuf-3.20.3-py2.py3-none-any.whl (162 kB)
Installing collected packages: protobuf
Successfully installed protobuf-3.20.3


#### Initiating the models

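The models below are pretrained text models from Hugging Face: a bi-encoder for generating embeddings and a cross-encoder for scoring text pairs. They were downloaded with the huggingface_hub library and are loaded here from local folders.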

In [None]:
bi_encoder = SentenceTransformer(r"C:\Users\m453441\OneDrive - Nordea\1 Audit\01 Search tool\models\\")
#Load the bi-encoder model from the specified directory
cross_encoder = CrossEncoder(r"C:\Users\m453441\OneDrive - Nordea\1 Audit\01 Search tool\cross_encoder_bin\\")
#Load the cross-encoder model from the specified directory
#The bi-encoder generates embeddings for sentences/documents, while the cross-encoder scores pairs of sentences/documents.
#Embeddings are numerical vectors that capture the semantic meaning of a text.

FileNotFoundError: Path C:\Users\m453441\OneDrive - Nordea\1 Audit\01 Search tool\models\\ not found

#### Setting up the folder structure

In [None]:
reports_folder = './reports/'
#Folder that holds the annual reports (PDF files)
report_files = [f for f in os.listdir(reports_folder) if f.endswith('.pdf')]
#List all files in the reports folder and keep only those ending with '.pdf'
random_report = report_files[0]
#Select the first report in the list (despite the variable name, the choice is not random)
doc_path = os.path.join(reports_folder, random_report)
#Build the full path to the selected report by joining the folder path with the filename


### Ingestion of documents

PyMuPDF is used to extract the text. In v0, the page ranges to process must be specified manually for each document.

In [None]:
# Define page ranges for specific documents (or use the default for all others)
# Page numbers are 0-based indices, as used by PyMuPDF; range() excludes its end value
page_ranges = {
    '2024_Danske_group.pdf': range(26, 30),  # Pages 26-29
    '2024_UBS_group.pdf': chain(range(88,90), range(99,107)),   # Pages 88-89 and 99-106
    '2024_DeutscheBank_group.pdf': chain(range(69, 71), range(75,87), range(122, 125)),   # Pages 69-70, 75-86 and 122-124
    '2024_ING_group.pdf': chain(range(158, 162), range(168, 196)),   # Pages 158-161 and 168-195
    # Add more files and their specific page ranges as needed
}

# Default pages to process if no specific range is defined for a file
# Options:
# - range(0, 10) for first 10 pages
# - [0, 1, 2, -1] for first 3 and last page (use negative for counting from end)
# - None to process all pages
default_pages = range(0, 10)  # First 10 pages by default

files_walk = os.walk(reports_folder)  # Walk the reports folder, including any subfolders
# Initialize lists to store extracted data
report_paragraphs = []
# List to store extracted paragraphs
report_paragraphs_source = []
# List to store source file for each paragraph
report_pages_source = []
# List to store source page number for each paragraph

for path, dirs, files in files_walk:
    # Filter for PDF files
    pdfs = [file for file in files if file.endswith('.pdf')]
    # Process each PDF file
    for _file in pdfs:
        # Log the file being processed
        print(f"Processing {_file}...")
        
        
        # Determine which pages to process for this file
        if _file in page_ranges:
            #   Use specific page range for this file
            pages_to_process = page_ranges[_file]
        else:
            #   Use default page range
            pages_to_process = default_pages
            # If default_pages is None, process all pages
        
        with fitz.open(os.path.join(path, _file)) as doc:
            total_pages = len(doc)
            # Log total number of pages in the document
            print(f"  Total pages in {_file}: {total_pages}")
            
            # Determine actual pages to process
            # If pages_to_process is None, process all pages
            if pages_to_process is None:
                pages_to_process = range(total_pages)
            
            # Handle negative page numbers (count from end)
            actual_pages = []
            for page_num in pages_to_process:
                # Check if page_num is an integer (to handle range and chain objects)
                if isinstance(page_num, int):
                    # Handle negative page numbers
                    if page_num < 0:
                        actual_page = total_pages + page_num  # Convert negative to positive
                    else:
                        actual_page = page_num
                    
                    # Safety check: Only include if it is a valid page number (between 0 and total_pages-1)
                    if 0 <= actual_page < total_pages:
                        actual_pages.append(actual_page)
            
            print(f"  Processing pages: {actual_pages}")
            
            # Process each page in the actual_pages list
            for page_num in actual_pages:
                # Access the page by its 0-based index
                page = doc[page_num]
                # Extract text blocks; get_text("blocks") returns tuples (x0, y0, x1, y1, text, block_no, block_type), so x[4] is the text
                blocks = [x[4] for x in page.get_text("blocks")]
            
                # get rid of empty blocks
                blocks = [block.strip() for block in blocks if block.strip()]

                
                if blocks:  # Only add if there are non-empty blocks
                    # Extend/add paragraphs to the main/master lists with extracted data
                    report_paragraphs.extend(blocks)
                    # For each block, add the corresponding page number and file source
                    report_pages_source.extend([page_num] * len(blocks)) 
                    # For each block, add the corresponding file source
                    report_paragraphs_source.extend([_file] * len(blocks))
                    
                    print(f'added {len(report_paragraphs)} paragraphs so far')
                    print(f'added {len(report_pages_source)} pages so far')
                    print(f'added {len(report_paragraphs_source)} pages so far')

print(f"Total paragraphs extracted: {len(report_paragraphs)}")
print(f"Total files processed: {len(set(report_paragraphs_source))}")

Processing 2024_Danske_group.pdf...
  Total pages in 2024_Danske_group.pdf: 280
  Processing pages: [26, 27, 28, 29]
added 14 paragraphs so far
added 14 pages so far
added 14 pages so far
added 73 paragraphs so far
added 73 pages so far
added 73 pages so far
added 109 paragraphs so far
added 109 pages so far
added 109 pages so far
added 119 paragraphs so far
added 119 pages so far
added 119 pages so far
Processing 2024_DeutscheBank_group.pdf...
  Total pages in 2024_DeutscheBank_group.pdf: 737
  Processing pages: [69, 70, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 122, 123, 124]
added 136 paragraphs so far
added 136 pages so far
added 136 pages so far
added 146 paragraphs so far
added 146 pages so far
added 146 pages so far
added 156 paragraphs so far
added 156 pages so far
added 156 pages so far
added 168 paragraphs so far
added 168 pages so far
added 168 pages so far
added 179 paragraphs so far
added 179 pages so far
added 179 pages so far
added 193 paragraphs so far
added 193 p
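The page-selection logic inside the ingestion loop (negative indices counted from the end, out-of-range pages dropped, `None` meaning all pages) can be isolated and tested without opening a PDF. The following is a minimal sketch only; the helper name `resolve_pages` is hypothetical and not part of the notebook.

```python
from itertools import chain


def resolve_pages(pages, total_pages):
    """Return valid 0-based page indices in the order given.

    pages: iterable of ints (negative values count from the end),
           or None to select every page.
    """
    if pages is None:
        return list(range(total_pages))
    resolved = []
    for page_num in pages:
        if not isinstance(page_num, int):
            continue  # ignore anything that is not a page index
        actual = total_pages + page_num if page_num < 0 else page_num
        if 0 <= actual < total_pages:  # drop pages outside the document
            resolved.append(actual)
    return resolved


# Hypothetical 10-page document
print(resolve_pages([0, 1, 2, -1], 10))                        # first 3 pages and the last one
print(resolve_pages(chain(range(8, 12), range(3, 5)), 10))     # pages 10 and 11 do not exist
print(resolve_pages(None, 3))                                  # all pages
```

Because the helper returns a list, the result can be iterated more than once. An `itertools.chain` object, by contrast, is exhausted after a single pass.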

Ingestion sanity check:

In [None]:
report_paragraphs
#check the extracted paragraphs

['27 \nDanske Bank / Annual report 2024',
 'Managements’ report \nFinancial statements Danske Bank A/S \nManagements’ report - Sustainability Statement \nFinancial statements Danske Bank Group \nStatements',
 'Risk management',
 'In 2024, Danske Bank maintained a robust credit portfolio with \nstrong credit quality. Despite signs of macroeconomic normalisation \nin 2024, some effects of a slowdown persisted in specific areas of \nthe corporate portfolios, indicating that the downturn was not yet \ncompletely over. Falling interest rates in 2024 eased some of the \npressure on the commercial and residential real estate segments, \nand the volume of property transactions slowly improved during \n2024, although activity remained historically low. The credit quality \nof the personal customer portfolio remained stable throughout 2024.',
 'Danske Bank maintained a robust trading portfolio with a relatively \nlow level of risk utilisation. At the end of 2024, Danske Bank’s liquidity \npositi

In [None]:
#Find the indices where the source file changes, i.e. the boundaries between documents in the combined lists
change_indices = [i for i in range(1, len(report_paragraphs_source)) if report_paragraphs_source[i] != report_paragraphs_source[i-1]]
#The source file changes because paragraphs from several PDF files are appended to the same lists
for index in change_indices:
    print(report_paragraphs_source[index-1]) #source file before the boundary
    print(report_paragraphs_source[index]) #source file after the boundary
    print(report_pages_source[index-1]) #last page taken from the previous file
    print(report_pages_source[index]) #first page taken from the next file
    print(report_paragraphs[max(0, index-15):index]) #up to 15 paragraphs preceding the boundary (tail of the previous file)

2024_Danske_group.pdf
2024_DeutscheBank_group.pdf
29
69
['Market risk  \nTaking on market risk is a key part of the Group’s business strategy, \nin relation to both trading and non-trading activities. The market risk \ninherent in trading activities is split into two elements: one covering', 'explicit position-taking arising from customer trades and market-\nmaking, and one focusing on the risk associated with valuation \nadjustments (xVA risk) in the derivatives portfolio. Most of the \nGroup’s market risk relates to fixed income products (interest rate \nrisk and bond spread risk). Interest rate volatility remained relatively \nmuted for the first half of 2024, continuing the trend from late 2023 \ntowards lower levels, as markets were awaiting the beginning of \ncentral bank interest rate cuts. In the second part of 2024, interest \nrate volatility moved up, supported by recession fears, geopolitical \ntensions and the US election. These key events were all managed \nwell by the Gro

Check that the ingestion worked as expected, i.e. that the lists of paragraphs, source pages and source files all have the same length.

In [None]:
#check that lengths match
print(len(report_paragraphs_source))
print(len(report_pages_source))
print(len(report_paragraphs))

1480
1480
1480
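Instead of comparing the three printed lengths by eye, the same check can be turned into an assertion-style helper that also reports how many paragraphs each file contributed. This is a sketch under assumptions: the function name `check_alignment` and the toy data are hypothetical.

```python
from collections import Counter


def check_alignment(paragraphs, sources, pages):
    """Raise if the parallel lists differ in length; return paragraph counts per source file."""
    if not (len(paragraphs) == len(sources) == len(pages)):
        raise ValueError(
            f"Parallel lists are misaligned: {len(paragraphs)} paragraphs, "
            f"{len(sources)} sources, {len(pages)} pages"
        )
    return Counter(sources)


# Toy data standing in for report_paragraphs, report_paragraphs_source, report_pages_source
toy_paragraphs = ["p1", "p2", "p3"]
toy_sources = ["a.pdf", "a.pdf", "b.pdf"]
toy_pages = [26, 27, 69]

print(check_alignment(toy_paragraphs, toy_sources, toy_pages))
```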


### Cleaning ingested data

Remove near-duplicate paragraphs within each document (length differing by fewer than 5 characters and fewer than 5 differing character positions). This is an attempt to get rid of headers, footers, page numbers etc.

In [None]:
from collections import defaultdict #import defaultdict to group indices by document

# Group indices by document
doc_indices = defaultdict(list) #defaultdict to hold lists of indices for each document. defaultdict is a subclass of the built-in dict class that returns a default value when a key is not found.
for idx, doc in enumerate(report_paragraphs_source): #enumerate through source files. it would give us both the index (idx) and the document name (doc)
    doc_indices[doc].append(idx) #append index to the list for that document. that is, for each document, we maintain a list of indices where paragraphs from that document are located in the main report_paragraphs list.
    #{"file1.pdf": [0, 1, 2], "file2.pdf": [3, 4, 5, 6], ...}

# Indices to keep
indices_to_keep = set() #set to store unique indices to keep

for doc, indices in doc_indices.items():
    seen = [] #paragraphs already kept for this document
    for idx in indices: #iterate through the indices belonging to this document
        para = report_paragraphs[idx] #get the paragraph at that index
        # Check whether this paragraph is very similar to one already seen: length differs by fewer than 5 characters
        # and fewer than 5 positions differ (zip compares position by position, up to the shorter length)
        if not any(abs(len(para) - len(other)) < 5 and sum(a != b for a, b in zip(para, other)) < 5 for other in seen):
            indices_to_keep.add(idx)
            seen.append(para)
        #In short: keep the paragraph unless an already-seen one is within 5 characters in length and differs in fewer than 5 positions.

# Sort indices to keep
indices_to_keep = sorted(indices_to_keep)

# Filter all lists, i.e. build new lists containing only the elements whose index is in indices_to_keep.
report_paragraphs = [report_paragraphs[i] for i in indices_to_keep]
report_paragraphs_source = [report_paragraphs_source[i] for i in indices_to_keep]
report_pages_source = [report_pages_source[i] for i in indices_to_keep]

Sanity check to see how many paragraphs were removed and if the corresponding elements got successfully removed:

In [None]:
print(len(report_paragraphs_source))
print(len(report_pages_source))
print(len(report_paragraphs))

1058
1058
1058
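The near-duplicate rule from the cleaning cell can be written as a standalone function, which makes its behaviour easier to inspect. The sketch below mirrors the rule as written in the notebook; the function names and the sample strings are hypothetical.

```python
def is_near_duplicate(para, other, max_diff=5):
    """True if lengths differ by fewer than max_diff characters and fewer than
    max_diff positions differ over the shared length (zip stops at the shorter text)."""
    if abs(len(para) - len(other)) >= max_diff:
        return False
    mismatches = sum(a != b for a, b in zip(para, other))
    return mismatches < max_diff


def drop_near_duplicates(paragraphs, max_diff=5):
    """Keep the first occurrence of each group of near-duplicate paragraphs."""
    kept = []
    for para in paragraphs:
        if not any(is_near_duplicate(para, seen, max_diff) for seen in kept):
            kept.append(para)
    return kept


# A running header that only differs in the page number is treated as a duplicate
print(is_near_duplicate('27 \nDanske Bank / Annual report 2024',
                        '28 \nDanske Bank / Annual report 2024'))

# The comparison is positional: one extra digit shifts every later character,
# so these two footers are NOT recognised as near-duplicates
print(is_near_duplicate('Page 9 of 280', 'Page 10 of 280'))
```

The second call illustrates a limitation of the positional comparison: an inserted or deleted character early in the text shifts everything after it. An edit-distance measure would handle this case, at a higher computational cost.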


Manual review of the "edge" paragraphs, i.e. the last ones before the source document changes:

In [None]:
change_indices = [i for i in range(1, len(report_paragraphs_source)) if report_paragraphs_source[i] != report_paragraphs_source[i-1]]
for index in change_indices:
    print(report_paragraphs_source[index-1])
    print(report_paragraphs_source[index])
    print(report_pages_source[index-1])
    print(report_pages_source[index])
    print(report_paragraphs[index-15:index])

2024_Danske_group.pdf
2024_DeutscheBank_group.pdf
29
69
['The ﬁnancial highlights provide information about the balance sheet.', 'Trading portfolio assets and trading portfolio liabilities increased to \nnet assets of DKK 174.3 billion (end-2023: net assets of DKK 93.7 \nbillion). The increase in net assets was due mainly to changes in the \nfair value of the derivatives portfolio.', 'Market risk  \nTaking on market risk is a key part of the Group’s business strategy, \nin relation to both trading and non-trading activities. The market risk \ninherent in trading activities is split into two elements: one covering', 'explicit position-taking arising from customer trades and market-\nmaking, and one focusing on the risk associated with valuation \nadjustments (xVA risk) in the derivatives portfolio. Most of the \nGroup’s market risk relates to fixed income products (interest rate \nrisk and bond spread risk). Interest rate volatility remained relatively \nmuted for the first half of 2024

### Paragraph embeddings generation

In [None]:
report_embeddings_all = bi_encoder.encode(report_paragraphs, convert_to_tensor=True, show_progress_bar=True) 
# Generate embeddings for all report paragraphs using the bi-encoder model.
# meaning, we are converting each paragraph into a numerical vector representation (embedding) using the bi-encoder model. 
# These embeddings capture the semantic meaning of the paragraphs and can be used for various downstream tasks, 
# such as similarity comparison, clustering, or classification.

Batches:   0%|          | 0/34 [00:00<?, ?it/s]

### Store data in pickle files

In [None]:
# Change the version according to the current numbering
# v1 - first version, no filters
# v2 - filtering out paragraphs that are too similar (5 chars) to others within the same document
version = 'v2'

In [None]:
#saves the data to pickle files
with open(f'report_embeddings_all_{version}.pickle', 'wb') as pkl: #open a pickle file in write-binary mode
    pickle.dump(report_embeddings_all, pkl) #dump the report_embeddings_all object into the pickle file
with open(f'report_paragraphs_{version}.pickle', 'wb') as pkl: #open a pickle file in write-binary mode
    pickle.dump(report_paragraphs, pkl) #dump the report_paragraphs object into the pickle file
with open(f'report_pages_source_{version}.pickle', 'wb') as pkl: #open a pickle file in write-binary mode
    pickle.dump(report_pages_source, pkl) #dump the report_pages_source object into the pickle file
with open(f'report_paragraphs_source_{version}.pickle', 'wb') as pkl: #open a pickle file in write-binary mode
    pickle.dump(report_paragraphs_source, pkl) #dump the report_paragraphs_source object into the pickle file

Use the cell below if you ever want to load the pickle files:

In [None]:
#Load the data back from pickle files
with open(f'report_embeddings_all_{version}.pickle', 'rb') as pkl: #open a pickle file in read-binary mode
    report_embeddings_all = pickle.load(pkl) #load the report_embeddings_all object from the pickle file
with open(f'report_paragraphs_{version}.pickle', 'rb') as pkl: #open a pickle file in read-binary mode
    report_paragraphs = pickle.load(pkl) #load the report_paragraphs object from the pickle file
with open(f'report_pages_source_{version}.pickle', 'rb') as pkl: #open a pickle file in read-binary mode
    report_pages_source = pickle.load(pkl) #load the report_pages_source object from the pickle file
with open(f'report_paragraphs_source_{version}.pickle', 'rb') as pkl: #open a pickle file in read-binary mode
    report_paragraphs_source = pickle.load(pkl) #load the report_paragraphs_source object from the pickle file
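As a possible alternative to four separate files, the parallel lists could be bundled into a single dictionary so that they are always saved and loaded together. This is only a sketch: the function names, the dictionary keys and the file name are hypothetical, and the embeddings are left out of the toy example.

```python
import pickle
import tempfile
from pathlib import Path


def save_bundle(path, paragraphs, sources, pages, embeddings=None):
    """Write the parallel lists (and optionally the embeddings) to one pickle file."""
    bundle = {
        'paragraphs': paragraphs,
        'sources': sources,
        'pages': pages,
        'embeddings': embeddings,
    }
    with open(path, 'wb') as pkl:
        pickle.dump(bundle, pkl)


def load_bundle(path):
    """Load the bundle and verify that the parallel lists are still aligned."""
    with open(path, 'rb') as pkl:
        bundle = pickle.load(pkl)
    if not (len(bundle['paragraphs']) == len(bundle['sources']) == len(bundle['pages'])):
        raise ValueError('Bundle lists have different lengths')
    return bundle


# Round trip with toy data in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / 'report_bundle_v2.pickle'
    save_bundle(bundle_path, ['p1', 'p2'], ['a.pdf', 'b.pdf'], [26, 69])
    restored = load_bundle(bundle_path)

print(restored['sources'])
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.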

### Similarity search

This step compares the paragraph embeddings across the different source documents. To be fixed later: for now, two noise filters are applied only at this stage rather than during cleaning:
1. paragraphs that contain too many digits (more than 10% of characters) are dropped, targeting tables and table rows;
2. paragraphs of 100 characters or fewer are dropped, since these are typically headers and similar irrelevant text.

The filtering needs further improvement in future versions; for now it only removes the most obvious noise.

In [None]:
# Function to check digit ratio
def fun_digit_ratio(text, threshold=0.1):
    """Return True if digits make up more than `threshold` (default 10%) of the characters in `text`."""
    digits = sum(c.isdigit() for c in text) #count digit characters in the text
    ratio = digits / max(1, len(text)) #ratio of digits to total characters (max avoids division by zero)
    return ratio > threshold

# Compute cosine similarity for each pair 
similarities = [] #list to store similarity results
for i in range(len(report_paragraphs)): #iterate through each paragraph
    #only consider paragraphs longer than 100 characters with at most 10% digits
    if len(report_paragraphs[i]) > 100 and not fun_digit_ratio(report_paragraphs[i], threshold=0.1):
        for j in range(i + 1, len(report_paragraphs)): #iterate through subsequent paragraphs
            #apply the same filters to the second paragraph and require that it comes from a different document
            if (len(report_paragraphs[j]) > 100) and (report_paragraphs_source[i] != report_paragraphs_source[j]) and not fun_digit_ratio(report_paragraphs[j], threshold=0.1):
                emb1 = report_embeddings_all[i] #get embedding for first paragraph
                emb2 = report_embeddings_all[j] #get embedding for second paragraph
                score = torch.nn.functional.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item() #compute cosine similarity between the two embeddings
                similarities.append({ #append similarity result to list
                    'text1': report_paragraphs[i],
                    'text2': report_paragraphs[j],
                    'source_text1': report_paragraphs_source[i],
                    'source_text2': report_paragraphs_source[j],  
                    'len_t1': len(report_paragraphs[i]),
                    'len_t2': len(report_paragraphs[j]),
                    'similarity': score
                })


# Create dataframe
similarity_df = pd.DataFrame(similarities) #create a DataFrame from the list of similarity results
similarity_df.sort_values('similarity', ascending=False, inplace=True) #sort the DataFrame by similarity score in descending order
similarity_df.reset_index(drop=True, inplace=True) #reset the index of the DataFrame
similarity_df #display the similarity DataFrame

| | text1 | text2 | source_text1 | source_text2 | len_t1 | len_t2 | similarity |
|---|---|---|---|---|---|---|---|
| 0 | Amidst the ongoing war in Ukraine, potential f... | During 2024 a trend emerged whereby Russian pa... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 568 | 691 | 0.807738 |
| 1 | The risks listed below are defined as top (alr... | An overview of our top and emerging risks, fro... | 2024_ING_group.pdf | 2024_UBS_group.pdf | 707 | 430 | 0.773770 |
| 2 | Country risk management \nAvoiding undue conce... | The credit concentration risk framework is com... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 734 | 1009 | 0.766335 |
| 3 | In addition to country thresholds, gap risk th... | The credit concentration risk framework is com... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 1097 | 1009 | 0.756699 |
| 4 | The war in Ukraine continues to lead to a high... | The war in Ukraine\nThe war in Ukraine continu... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 1159 | 669 | 0.751910 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 66730 | Goodwill is reviewed annually for impairment o... | Geopolitical risk is an important concern for ... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 124 | 627 | -0.099426 |
| 66731 | 1\nStage 3 lifetime credit impaired provision ... | Geopolitical uncertainty\nWe remain watchful o... | 2024_ING_group.pdf | 2024_UBS_group.pdf | 702 | 388 | -0.103083 |
| 66732 | US-EU-China relations and regional tensions\nU... | The calculation of affordability takes into ac... | 2024_ING_group.pdf | 2024_UBS_group.pdf | 570 | 321 | -0.107103 |
| 66733 | Goodwill is reviewed annually for impairment o... | International elections\nThere were an unprece... | 2024_DeutscheBank_group.pdf | 2024_ING_group.pdf | 124 | 1072 | -0.123577 |


The resulting dataframe contains every cross-document pair of paragraphs that passed the filters, along with their lengths and cosine similarity score, sorted from most to least similar.
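The pairing logic can also be expressed by filtering the candidate paragraphs once and then enumerating cross-document pairs, instead of re-checking the filters inside the nested loop. The sketch below uses plain Python lists as stand-ins for the embedding tensors; all function names, the toy texts and the toy vectors are hypothetical, and `min_len` is lowered to 20 only so that the short sample texts pass.

```python
import math
from itertools import combinations


def digit_ratio(text):
    """Share of digit characters in the text."""
    return sum(c.isdigit() for c in text) / max(1, len(text))


def is_candidate(text, min_len=100, max_digit_ratio=0.1):
    """Same filters as in the notebook: long enough and not dominated by digits."""
    return len(text) > min_len and digit_ratio(text) <= max_digit_ratio


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_document_pairs(paragraphs, sources, vectors, min_len=100, max_digit_ratio=0.1):
    """Return (i, j, similarity) for all filtered pairs from different sources, best first."""
    keep = [i for i, text in enumerate(paragraphs) if is_candidate(text, min_len, max_digit_ratio)]
    pairs = [
        (i, j, cosine(vectors[i], vectors[j]))
        for i, j in combinations(keep, 2)
        if sources[i] != sources[j]
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


toy_paragraphs = [
    "Geopolitical risk remains elevated.",
    "Sanctions risk stays high for banks.",
    "2024 2023 1,234 5,678 9,012 3,456",              # table row: too many digits
    "Risk management",                                # header: too short
    "Market risk is part of the business strategy.",
]
toy_sources = ["a.pdf", "b.pdf", "b.pdf", "a.pdf", "a.pdf"]
toy_vectors = [[3.0, 4.0], [4.0, 3.0], [1.0, 0.0], [0.0, 1.0], [4.0, 3.0]]

for i, j, score in cross_document_pairs(toy_paragraphs, toy_sources, toy_vectors, min_len=20):
    print(i, j, round(score, 2))
```

With real embeddings, the per-pair loop could be replaced by a single matrix computation, for example with `util.cos_sim` from sentence-transformers, which returns the full similarity matrix for two sets of embeddings.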

### Summarizing similarity search results

As a test, a KeyBERT keyword extraction model is run on the most similar pairs: each paragraph is characterized by its top 5 keywords. This effectively maps similar paragraphs into a common keyword space, which could make it easier to group related content together (to be done in future versions).

In [None]:
from keybert import KeyBERT #import KeyBERT for keyword extraction. It is a simple and efficient tool for extracting keywords from text documents using BERT embeddings.

kw_model = KeyBERT(model = bi_encoder) #initialize KeyBERT with the bi-encoder model for keyword extraction. That is, we are using the same bi-encoder model we used for generating embeddings to also extract keywords from the text.

# Get top N most similar pairs
top_n = 10
top_pairs = similarity_df.head(top_n) #get the top N most similar pairs from the similarity DataFrame

# Extract keywords for each text in the pairs
for idx, row in top_pairs.iterrows(): #iterate through each row in the top_pairs DataFrame
    print(f"Pair {idx+1}:") #print pair number
    print(f"Source 1: {row['source_text1']}") #print source of first text
    print("Keywords 1:", kw_model.extract_keywords(row['text1'], top_n=5)) #print top 5 keywords for first text
    print(f"Source 2: {row['source_text2']}") #print source of second text
    print("Keywords 2:", kw_model.extract_keywords(row['text2'], top_n=5)) #print top 5 keywords for second text
    print('-'*40) #print separator

Pair 1:
Source 1: 2024_DeutscheBank_group.pdf
Keywords 1: [('sanctions', 0.4968), ('banks', 0.4337), ('ukraine', 0.3427), ('disputes', 0.3304), ('russia', 0.3136)]
Source 2: 2024_ING_group.pdf
Keywords 2: [('banking', 0.4111), ('banks', 0.3934), ('sanctions', 0.3923), ('courts', 0.3487), ('liable', 0.34)]
----------------------------------------
Pair 2:
Source 1: 2024_ING_group.pdf
Keywords 1: [('risks', 0.5732), ('risk', 0.5317), ('emerging', 0.3946), ('impact', 0.3268), ('liquidity', 0.2811)]
Source 2: 2024_UBS_group.pdf
Keywords 2: [('risks', 0.6533), ('risk', 0.6103), ('prospects', 0.4051), ('investors', 0.3431), ('strategy', 0.3175)]
----------------------------------------
Pair 3:
Source 1: 2024_DeutscheBank_group.pdf
Keywords 1: [('countries', 0.4215), ('risk', 0.3845), ('country', 0.3527), ('japan', 0.3446), ('asia', 0.3392)]
Source 2: 2024_ING_group.pdf
Keywords 2: [('risk', 0.3926), ('countries', 0.2985), ('concentration', 0.2793), ('credit', 0.2713), ('sector', 0.2332)]
----
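One simple way to compare paragraphs in the keyword space is the Jaccard overlap of their keyword sets. This is an illustrative choice rather than the method planned for the notebook; the function names are hypothetical, and the input format follows the `(keyword, score)` tuples printed above, using the keywords of Pair 1.

```python
def keyword_set(keywords):
    """Turn KeyBERT-style (keyword, score) tuples into a set of lowercase keywords."""
    return {word.lower() for word, _score in keywords}


def keyword_jaccard(keywords_a, keywords_b):
    """Jaccard overlap: shared keywords divided by all distinct keywords."""
    a, b = keyword_set(keywords_a), keyword_set(keywords_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Keywords of Pair 1 as printed in the output above
keywords_1 = [('sanctions', 0.4968), ('banks', 0.4337), ('ukraine', 0.3427), ('disputes', 0.3304), ('russia', 0.3136)]
keywords_2 = [('banking', 0.4111), ('banks', 0.3934), ('sanctions', 0.3923), ('courts', 0.3487), ('liable', 0.34)]

print(keyword_jaccard(keywords_1, keywords_2))  # 2 shared keywords out of 8 distinct ones
```

Exact string matching treats 'banks' and 'banking' as different keywords, so some normalisation (for example stemming) would probably be needed before using this for grouping.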

This is where the notebook currently ends. It has been agreed that it brings business value in its current form and is a decent starting point for developing the final product. Further steps to be implemented in future versions include:
- Automatic detection of risk management sections in annual reports, to avoid specifying the relevant page ranges manually
- Clustering similar paragraph pairs based on the extracted keywords
- Building an interactive layer on top, allowing users to apply their own judgement and select the relevant combinations of documents / paragraphs for further analysis based on stock price covariance or other metrics
