# Testing the RAG implementation

Purpose of the notebook: test which retrieval queries and generation prompts work best, which embedding models / LLM models, PDF text extraction approaches perform best
Evaluate on a manually coded validation set based on three sample reports

Evaluate
1.  RAG retrieval
2.  RAG generation

# Preparation

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import time
import time
from thefuzz import process, fuzz
import re
import urllib3
import pymupdf
import requests
import os
import io
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

import os
import re
import tenacity
import configparser
import markdown

#from langchain.llms import OpenAI
from langchain_openai import ChatOpenAI
#from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)
from langchain.prompts import PromptTemplate
import json
import tiktoken


import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

# Gneration Evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr

# LLM setup
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM, pipeline
import torch
from torch import cuda, bfloat16
import transformers
from concurrent.futures import ThreadPoolExecutor
from langchain_huggingface import HuggingFacePipeline

# import OpenAI API key from environment variable
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

[nltk_data] Downloading package punkt to
[nltk_data]     /home/tu/tu_tu/tu_zxowg46/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/tu/tu_tu/tu_zxowg46/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
  backends.update(_get_backends("networkx.backends"))


In [15]:
import os
os.environ["OPENAI_API_KEY"] = "sk-abc123...your-key"

In [2]:
### for working with huggingface
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Select sample reports
Source: Sustainability Reporting Navigator (crowd-source list of CSRD-compliant reports for fiscal years starting on 01/01/2024)

Download CSV with information on all reports on the 08/04/2025 https://www.sustainabilityreportingnavigator.com/#/csrdreports 

In [3]:
# Open the csv data file
reports_24 = pd.read_csv('esg_reports_2024.csv')
print(len(reports_24))

277


In [None]:
# select only Continental AG as a sample
sample = reports_24[reports_24['company_name'] == 'Continental AG']

In [4]:
# randomly select 3 reports from 2024
sample = reports_24.sample(n=3, random_state=3)
sample.head()

Unnamed: 0.1,Unnamed: 0,company_withAccessInfo,link,country,sector,industry,publication date,pages PDF,auditor
220,19,Continental AG,https://annualreport.continental.com/2024/en/s...,Germany,Transportation,Auto Parts,2025-03-18,125,PwC
58,266,Schneider Electric*,https://www.se.com/ww/en/assets/564/document/5...,France,Resource Transformation,Electrical & Electronic Equipment,2025-03-26,186,PwC & Mazars
89,103,Philips,https://www.results.philips.com/publications/a...,Netherlands,Infrastructure,Electric Utilities & Power Generators,2025-02-21,85,EY


In [5]:
sample_list = ['ContinentalAG_2024', 'SchneiderElectric_2024', 'Philips_2024']

# Manually hand code ground truth = validation set

In [7]:
validation_set = pd.read_excel('validation_dataset.xlsx', sheet_name='S1_Retrieval')
print(validation_set.head())

              report_name query  \
0      ContinentalAG_2024  S1_E   
1  SchneiderElectric_2024  S1_E   
2      ContinentalAG_2024  S1_D   
3      ContinentalAG_2024  S1_A   
4  SchneiderElectric_2024  S1_F   

                                                text page_number  
0  In accordance with Section 76 (4) AktG, the Ex...          27  
1  Our 2025 sustainability commitments\nWith less...          33  
2  The globally applicable Code of\nConduct provi...          41  
3  The globally applicable Code of\nConduct provi...          41  
4  Our Speak Up Mindset\nSchneider Electric emplo...          41  


In [8]:
# print how many text chunks per query, per report name
print(validation_set.groupby(['query', 'report_name' ]).size())

query  report_name           
S1_A   ContinentalAG_2024        22
       Philips_2024              13
       SchneiderElectric_2024    24
S1_B   ContinentalAG_2024        14
       Philips_2024               4
       SchneiderElectric_2024     4
S1_C   ContinentalAG_2024        19
       Philips_2024              13
       SchneiderElectric_2024    18
S1_D   ContinentalAG_2024        23
       Philips_2024              13
       SchneiderElectric_2024    10
S1_E   ContinentalAG_2024        12
       Philips_2024              11
       SchneiderElectric_2024    24
S1_F   ContinentalAG_2024        15
       Philips_2024               6
       SchneiderElectric_2024    23
S1_G   ContinentalAG_2024         5
       Philips_2024               1
       SchneiderElectric_2024     8
dtype: int64


# Evaluate the performance of different approaches


1. Approach: Compare on sentence level whether ground truth sentences are as a whole retrieved

In [8]:
ground_truth_sentences = set(['Equal pay /nFair and equitable pay is a core component of the Group’s compensation philosophy.',
                            'It is in line with the principle of equal pay for equal work.',
                            '100 % of Schneider are paid /nat least a living wage, which /nwas recognized for the /nsecond consecutive year by /nthe Living Wage Employer /nCertification from Fair /nWage Network.'])
retrieved_sentences = set(['Fair and equitable pay is a core component of the Group’s compensation philosophy.',
                           'It is in line with the principle of equal pay for equal work.',
                           '100 % of Schneider are paid /nat least a living wage, which /nwas recognized for the /nsecond consecutive year'])

found_ground_truth_sentences = set()
for sentence in retrieved_sentences:
    if sentence in ground_truth_sentences:
        found_ground_truth_sentences.add(sentence)

print(found_ground_truth_sentences)

{'It is in line with the principle of equal pay for equal work.'}


This approach is very restrictive and cause for lower scores whenever the cunk size abbreviates sentences, eventhough the LLM could still understand the meaning.

2. Approach fuzzy string matching


In [10]:
found_matches = 0
for sentence in retrieved_sentences:
    best_match, score = process.extractOne(sentence, ground_truth_sentences, scorer=fuzz.ratio)
    if score > 80:
        found_matches += 1
        print(f"Score: {score}, Retrieved sentence: '{sentence}', Found match: '{best_match}'")

Score: 93, Retrieved sentence: 'Fair and equitable pay is a core component of the Group’s compensation philosophy.', Found match: 'Equal pay /nFair and equitable pay is a core component of the Group’s compensation philosophy.'
Score: 100, Retrieved sentence: 'It is in line with the principle of equal pay for equal work.', Found match: 'It is in line with the principle of equal pay for equal work.'


In [71]:
def evaluate_retrieval_sentence_level(retrieved_docs, ground_truth_texts):
    """
    Evaluates retrieval performance on a sentence level using fuzzy string matching.

    Args:
        retrieved_docs (list): A list of Document objects retrieved by LangChain.
        ground_truth_texts (list): A list of ground-truth text snippets from the validation set.
        score_threshold (int): The similarity score (0-100) required to consider a sentence a match.

    Returns:
        dict: A dictionary containing precision, recall, and f1-score.
    """
    # 1. Extract all ground-truth and retrieved sentences
    all_ground_truth_sentences = set()
    for text in ground_truth_texts:
        sentences = sent_tokenize(text)
        all_ground_truth_sentences.update([s.strip() for s in sentences if s.strip()])

    if not all_ground_truth_sentences:
        return {"precision": None, "recall": None, "f1": None}

    all_retrieved_sentences = set()
    for doc in retrieved_docs:
        chunk_sentences = sent_tokenize(doc.page_content)
        all_retrieved_sentences.update([s.strip() for s in chunk_sentences if s.strip()])

    if not all_retrieved_sentences:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    
    # 2. For each retrieved sentence, find its best match in the ground truth sentences.
    found_matches = 0
    for retrieved_sentence in all_retrieved_sentences:
        # process.extractOne finds the best matching string from a collection.
        # It returns a tuple: (best_match_string, score)
        best_match, score = process.extractOne(
            retrieved_sentence, 
            all_ground_truth_sentences, 
            scorer=fuzz.ratio
        )
        
        # If the best match has a score above our threshold, we count it as a successful find.
        if score >= 80:
            found_matches += 1

    # 3. Calculate metrics based on the fuzzy matches.
    true_positives = found_matches
    
    # Precision = (Relevant sentences found) / (Total sentences retrieved)
    precision = true_positives / len(all_retrieved_sentences) if all_retrieved_sentences else 0.0
    
    # Recall = (Relevant sentences found) / (Total relevant sentences that exist)
    recall = true_positives / len(all_ground_truth_sentences) if all_ground_truth_sentences else 0.0
    
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}

In [72]:
def evaluate_retrieval(retrieval_results):
    """
    Evaluates the retrieval performance of a LangChain vector store against a validation set.

    Args:
        retrieval_results (list): A list of Document objects retrieved by LangChain.

    Returns:
        dict: A dictionary containing precision, recall, and f1-score.
    """
    evaluation_results_all = []
    
    for report_name, queries_results in retrieval_results.items():
        for query_key, retrieved_documents in queries_results.items():

            # Get the corresponding ground-truth texts from the validation set
            gt_texts = validation_set[
                (validation_set['report_name'] == report_name) & 
                (validation_set['query'] == query_key)
            ]['text'].tolist()

            if not gt_texts:
                continue

            # Evaluate the retrieval performance
            scores = evaluate_retrieval_sentence_level(
                retrieved_docs=retrieved_documents,
                ground_truth_texts=gt_texts
            )

            evaluation_results_all.append({
                "report_name": report_name,
                "query": query_key,
                **scores
            })

    evaluation_results_mean = pd.DataFrame(evaluation_results_all).groupby(['query'])[['precision', 'recall', 'f1']].mean().reset_index()
    overall_mean = evaluation_results_mean[['precision', 'recall', 'f1']].mean()
    evaluation_results_mean.loc['Overall Mean'] = overall_mean

    print("--- Retrieval Performance Summary ---")
    print(evaluation_results_mean.round(3))

## 1. RAG - Retrieval
### 0: Baseline approach based on Ni et al. (2023)
based on: Ni, J., Bingler, J., Colesanti-Senni, C., Kraus, M., Gostlow, G., Schimanski, T., Stammbach, D., Vaghefi, S. A., Wang, Q., Webersinke, N., Wekhof, T., Yu, T., & Leippold, M. (2023). CHATREPORT: Democratizing Sustainability Disclosure Analysis through LLM-based Tools. Swiss Finance Institute Research Paper, No. 23-111. https://doi.org/10.48550/arXiv.2307.15770

- PDF text extraction: MyPuPDF
- Retrieval:
    - Embedding model: OpenAI text-embedding-ada-002
    - top k: 20
    - chunk size: 500
    - chunk overlap: 20

In [6]:
# Code based on Ni et al. (2023)
TOP_K = 20
CHUNK_SIZE = 500
CHUNK_OVERLAP = 20

# 1. Load the PDF
def load_pdf(path=None, url=None):
    assert (path is not None) != (url is not None), "Either path or url must be provided"
    
    if path:
        return pymupdf.open(path)
    else:
        response = requests.get(url)
        pdf_bytes = io.BytesIO(response.content)
        return pymupdf.open(stream=pdf_bytes, filetype='pdf')
    
# 2. Extract text from the PDF
def extract_text(pdf):
    text_list = [page.get_text() for page in pdf]
    all_text = ''.join(text_list)
    return text_list, all_text

# 4. Create or Load Vector Store
def get_retriever(pdf, db_path, top_k=TOP_K, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
    embeddings = OpenAIEmbeddings()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " "],
    )

    chunks = []
    page_idx = []

    for i, page in enumerate(pdf):
        page_chunks = text_splitter.split_text(page.get_text())
        page_idx.extend([i + 1] * len(page_chunks))
        chunks.extend(page_chunks)

    if os.path.exists(db_path):
        doc_search = FAISS.load_local(db_path, embeddings=embeddings, allow_dangerous_deserialization=True)
    else:
        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)

    retriever = doc_search.as_retriever(search_kwargs={"k": top_k})
    return retriever, doc_search

# 5. Retrieve relevant chunks
def retrieve_chunks(retriever, queries):
    section_text_dict = {}

    for key, prompts in queries.items():
        if key == 'general' and isinstance(prompts, list):
            combined_docs = []
            for prompt in prompts:
                combined_docs.extend(retriever.invoke(prompt)[:5])
            section_text_dict[key] = combined_docs
        else:
            section_text_dict[key] = retriever.invoke(prompts)
    
    return section_text_dict

# for preparing filenames
def prepare_filename(name):
    return re.sub(r'[\\/*?:"<>|]', "", name)

In [10]:
# Defined retrieval queries based on the ESRS S1 requirements

QUERIES = {
    #'general': ["What is the company of the report?", "What sector does the company belong to?", "Where is the company located?"],
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities arising from the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What are the company’s human rights practices, risks and incidents related to the own workforce?",
    'S1_D': "What are the company’s processes and policies for engaging with own workers and workers’ representatives about impacts?",
    'S1_E': "What are the company’s policies on non-discrimination, diversity and inclusion in the own workforce?",
    'S1_F': "What are the company’s processes, policies and approaches to remediate negative impacts and channels for own workers to raise concerns?",
    'S1_G': "How is the company’s workfoce social protection coverage?",
}

In [39]:
# Now apply the functions to the 3 sampled reports
all_results = {}

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_OpenAI_0/{filename}"
        retriever, doc_search = get_retriever(pdf, db_path=DB_PATH)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")



Processing: ContinentalAG_2024

Processing: SchneiderElectric_2024

Processing: Philips_2024


Note: **Ressorces needed for 3 reports**
- Time: 1.5 min (--> 13 h for 500 reports)
- Costs: 0.2 Dollar (--> 100$ for 500 reports)

In [19]:
evaluate_retrieval(all_results)

--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.335   0.057  0.095
1             S1_B      0.146   0.194  0.161
2             S1_C      0.263   0.192  0.221
3             S1_D      0.177   0.094  0.121
4             S1_E      0.413   0.247  0.304
5             S1_F      0.116   0.063  0.081
6             S1_G      0.135   0.347  0.188
Overall Mean   NaN      0.226   0.170  0.167


In [20]:
# save results
results_df = pd.DataFrame.from_dict(all_results, orient='index')
results_df.to_csv('retrieval_results_0_OpenAI.csv', index=True)

In [None]:
# read in the results
retrieval_results = pd.read_csv('retrieval_results_0_OpenAI.csv', index_col=0)

In [3]:
import pandas as pd
analysis_results = pd.read_csv('analysis_results_0.csv', index_col=0)
print(analysis_results.head())

                                                                    S1_A1  \
ContinentalAG_2024      {'verdict': 'YES', 'analysis': '[[YES]] The co...   
SchneiderElectric_2024  {'verdict': 'YES', 'analysis': '[[YES]] The co...   
Philips_2024            {'verdict': 'NO', 'analysis': "[[NO]] The comp...   

                                                                    S1_A2  \
ContinentalAG_2024      {'verdict': 'YES', 'analysis': "[[YES]] The co...   
SchneiderElectric_2024  {'verdict': 'YES', 'analysis': "[[YES]] The co...   
Philips_2024            {'verdict': 'YES', 'analysis': '[[YES]] The co...   

                                                                    S1_A3  \
ContinentalAG_2024      {'verdict': 'YES', 'analysis': "[[YES]] The co...   
SchneiderElectric_2024  {'verdict': 'YES', 'analysis': '[[YES]] The co...   
Philips_2024            {'verdict': 'YES', 'analysis': '[[YES]] The co...   

                                                                    S1_A4

In [7]:
import pandas as pd
import ast

# Load the CSV file
analysis_results = pd.read_csv('analysis_results_0.csv', index_col=0)

# List to hold structured records
records = []

# Iterate over rows and columns
for report_name, row in analysis_results.iterrows():
    for query, value in row.items():
        # Safely parse the dictionary from string
        verdict = analysis = sources = ""  # Default values
        try:
            value_dict = ast.literal_eval(value) if isinstance(value, str) else {}
            verdict = value_dict.get("verdict", "")
            analysis = value_dict.get("analysis", "")
            sources = value_dict.get("sources", "")  # Optional key
        except (ValueError, SyntaxError) as e:
            analysis = f"Parsing error: {e}"

        records.append({
            "report_name": report_name,
            "query": query,
            "verdict": verdict,
            "analysis": analysis,
            "sources": sources
        })

# Create new DataFrame
results_df = pd.DataFrame(records)

# Export to Excel
results_df.to_excel("validation_dataset.xlsx", index=False)


In [5]:
# Save to Excel
results_df.to_excel("validation_dataset.xlsx", index=False)

### Evaluate with Accuracy

In [69]:
def evaluate_retrieval_sentence_level_with_accuracy(retrieved_docs, ground_truth_texts, all_document_sentences):
    # 1. Extract ground-truth and retrieved sentences
    all_ground_truth_sentences = set(
        s.strip() for text in ground_truth_texts for s in sent_tokenize(text) if s.strip()
    )
    all_retrieved_sentences = set(
        s.strip() for doc in retrieved_docs for s in sent_tokenize(doc.page_content) if s.strip()
    )

    if not all_ground_truth_sentences:
        return {"precision": None, "recall": None, "f1": None, "accuracy": None}
    if not all_retrieved_sentences:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "accuracy": None}

    # 2. Find matches to calculate True Positives (TP)
    # Using your existing logic for consistency
    found_matches = 0
    for retrieved_sentence in all_retrieved_sentences:
        _, score = process.extractOne(
            retrieved_sentence,
            all_ground_truth_sentences,
            scorer=fuzz.ratio
        )
        if score >= 80:
            found_matches += 1
    
    true_positives = found_matches

    # 3. Calculate Precision, Recall, and F1
    # Precision = TP / (All Retrieved Sentences)
    precision = true_positives / len(all_retrieved_sentences)
    # Recall = TP / (All Ground Truth Sentences)
    recall = true_positives / len(all_ground_truth_sentences)
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    # 4. Calculate Accuracy
    # This requires all four components: TP, FP, FN, TN
    false_positives = len(all_retrieved_sentences) - true_positives
    false_negatives = len(all_ground_truth_sentences) - true_positives

    # True Negatives = (All sentences in doc - All relevant sentences) - (Irrelevant retrieved sentences)
    total_irrelevant_in_doc = len(all_document_sentences) - len(all_ground_truth_sentences)
    true_negatives = total_irrelevant_in_doc - false_positives

    total_population = len(all_document_sentences)
    accuracy = (true_positives + true_negatives) / total_population if total_population > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def evaluate_retrieval_with_accuracy(retrieval_results, all_docs_sentences):
    evaluation_results_all = []
    
    for report_name, queries_results in retrieval_results.items():
        for query_key, retrieved_documents in queries_results.items():
            gt_texts = validation_set[
                (validation_set['report_name'] == report_name) & 
                (validation_set['query'] == query_key)
            ]['text'].tolist()

            if not gt_texts:
                continue

            # Get the full set of sentences for the current report
            current_doc_all_sentences = all_docs_sentences.get(report_name, set())
            if not current_doc_all_sentences:
                print(f"Warning: Could not find all document sentences for {report_name}. Skipping accuracy calculation.")
                continue

            scores = evaluate_retrieval_sentence_level_with_accuracy(
                retrieved_docs=retrieved_documents,
                ground_truth_texts=gt_texts,
                all_document_sentences=current_doc_all_sentences
            )

            evaluation_results_all.append({
                "report_name": report_name,
                "query": query_key,
                **scores
            })

    # Calculate and display the mean scores, now including accuracy
    if not evaluation_results_all:
        print("No results to evaluate.")
        return
        
    df_eval = pd.DataFrame(evaluation_results_all)
    # Add accuracy to the list of metrics to average
    metrics_to_average = ['precision', 'recall', 'f1', 'accuracy']
    evaluation_results_mean = df_eval.groupby(['query'])[metrics_to_average].mean().reset_index()
    overall_mean = evaluation_results_mean[metrics_to_average].mean()
    evaluation_results_mean.loc['Overall Mean'] = overall_mean

    print("--- Retrieval Performance Summary ---")
    print(evaluation_results_mean.round(3))


In [42]:
all_results = {}
all_docs_sentences = {} # Dictionary to hold all sentences for each document

# Main loop now includes extracting all sentences from each PDF
for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH)
        
        # --- NEW: Extract ALL sentences from the PDF for accuracy calculation ---
        all_sents_in_doc = set()
        for page in pdf:
            sentences = sent_tokenize(page.get_text())
            all_sents_in_doc.update([s.strip() for s in sentences if s.strip()])
        all_docs_sentences[filename] = all_sents_in_doc
        print(f"Found {len(all_sents_in_doc)} total unique sentences in {filename}.")
        
        # Your existing retrieval logic
        DB_PATH = f"./faiss_db_OpenAI_0/{filename}"
        retriever, doc_search = get_retriever(pdf, db_path=DB_PATH)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
    
    except Exception as e:
        print(f"Error processing {filename}: {e}")

# Call the new evaluation function that includes accuracy
evaluate_retrieval_with_accuracy(all_results, all_docs_sentences)


Processing: ContinentalAG_2024
Found 6633 total unique sentences in ContinentalAG_2024.

Processing: SchneiderElectric_2024
Found 11266 total unique sentences in SchneiderElectric_2024.

Processing: Philips_2024
Found 6302 total unique sentences in Philips_2024.
--- Retrieval Performance Summary ---
             query  precision  recall     f1  accuracy
0             S1_A      0.335   0.057  0.095     0.957
1             S1_B      0.146   0.194  0.161     0.989
2             S1_C      0.263   0.192  0.221     0.988
3             S1_D      0.177   0.094  0.121     0.981
4             S1_E      0.413   0.247  0.304     0.984
5             S1_F      0.116   0.063  0.081     0.981
6             S1_G      0.135   0.347  0.188     0.992
Overall Mean   NaN      0.226   0.170  0.167     0.982


### A) Embedding Models
#### 1. Qwen3-Embedding-0.6B
- Place 4 in MTEB Leaderboard (26.06.2025), best for 2 GB Memory Usage https://huggingface.co/spaces/mteb/leaderboard
- Number of Paramaters: 0.6B
- Context Length: 32k
- Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B 


In [7]:
# 1. Initialize the embedding model
embeddings_qwen = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={'device': 'cuda'} # specify device='cpu' if GPU not available 
)

In [12]:
# Create or Load Vector Store for Qwen
def get_retriever_qwen(
    pdf,
    db_path,
    embeddings,
    top_k=TOP_K,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
):
    """
    Loads or creates a FAISS vector store and returns a retriever.
    Skips PDF processing if FAISS DB already exists.
    """

    if os.path.exists(db_path):
        print(f"Loading existing FAISS DB from {db_path}")
        doc_search = FAISS.load_local(
            db_path,
            embeddings=embeddings,
            allow_dangerous_deserialization=True
        )
    else:
        print(f"Creating new FAISS DB at {db_path}")
        
        # 1. Split the document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", " "],
        )

        chunks = []
        page_idx = []

        for i, page in enumerate(pdf):
            page_chunks = text_splitter.split_text(page.get_text())
            page_idx.extend([i + 1] * len(page_chunks))
            chunks.extend(page_chunks)

        # 2. Create FAISS vector store
        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)

    # 3. Return retriever with top_k
    retriever = doc_search.as_retriever(search_kwargs={"k": top_k})
    return retriever, doc_search

In [24]:
# Now apply on the sample reports
all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_A1/{filename}"
        retriever, doc_search = get_retriever_qwen(pdf, embeddings=embeddings_qwen, db_path=DB_PATH)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")


Processing: ContinentalAG_2024
Creating new FAISS DB at ./faiss_db_Qwen_A1/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Creating new FAISS DB at ./faiss_db_Qwen_A1/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Creating new FAISS DB at ./faiss_db_Qwen_A1/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 3.30 minutes


**Ressorces needed** for 3 reports
- Time: 
    - on local CPU: 3h (--> 20 days for 500 reports)
    - on 1 remote GPU: 3 min (8h for 500 reports)
- Costs: 0 Dollar (--> opensource)

In [26]:
evaluate_retrieval(all_results)

--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.340   0.045  0.079
1             S1_B      0.142   0.169  0.151
2             S1_C      0.231   0.146  0.179
3             S1_D      0.184   0.098  0.126
4             S1_E      0.382   0.211  0.266
5             S1_F      0.154   0.076  0.101
6             S1_G      0.092   0.197  0.121
Overall Mean   NaN      0.218   0.135  0.146


#### 2. Qwen3_Embedding-4B
- Place 3 in MTEB Leaderboard (16.07.2025), 2nd best open source https://huggingface.co/spaces/mteb/leaderboard
- Number of Paramaters: 4B
- Context Length: 32k
- Embedding Dimension: Up to 2560, supports user-defined output dimensions ranging from 32 to 2560
  https://huggingface.co/Qwen/Qwen3-Embedding-4B

In [74]:
# 1. Initialize the embedding model
embeddings_qwen_4B = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-4B",
    model_kwargs={'device': 'cuda'} # specify device='cpu' if GPU not available 
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
# apply on the sample reports
all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_A2/{filename}"
        retriever, doc_search = get_retriever_qwen(pdf, embeddings=embeddings_qwen_8B, db_path=DB_PATH)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")


Processing: ContinentalAG_2024
Creating new FAISS DB at ./faiss_db_Qwen_A2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Creating new FAISS DB at ./faiss_db_Qwen_A2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Creating new FAISS DB at ./faiss_db_Qwen_A2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 19.78 minutes


**Ressorces needed** for 3 reports
- Time: 
    - on 1 remote GPU: 3 min ( for 100 reports)
- Costs: 0 Dollar (--> opensource)

In [13]:
evaluate_retrieval(all_results)

--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.289   0.038  0.067
1             S1_B      0.131   0.149  0.135
2             S1_C      0.230   0.159  0.187
3             S1_D      0.202   0.100  0.133
4             S1_E      0.393   0.224  0.278
5             S1_F      0.263   0.131  0.171
6             S1_G      0.096   0.280  0.139
Overall Mean   NaN      0.229   0.154  0.159


### B) Retrieval settings

#### 1. Chunksize & overlap

In [15]:
### 1a) Chunk_size 1000 chunk overlap 200 based on A1)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TOP_K = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B1a/{filename}"
        retriever, doc_search = get_retriever_qwen(pdf, embeddings=embeddings_qwen, db_path=DB_PATH, top_k=TOP_K, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B1a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B1a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B1a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.09 minutes


In [16]:
evaluate_retrieval(all_results)

--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.377   0.096  0.151
1             S1_B      0.195   0.476  0.270
2             S1_C      0.251   0.315  0.279
3             S1_D      0.231   0.181  0.201
4             S1_E      0.335   0.334  0.330
5             S1_F      0.270   0.267  0.262
6             S1_G      0.147   0.545  0.226
Overall Mean   NaN      0.258   0.316  0.246


#### 2. Semantic Chunking

In [13]:
### Semantic chunking
# B2. Create or Load Vector Store for Qwen
def get_retriever_qwen_B2(pdf, db_path, embeddings, top_k=TOP_K):

    if os.path.exists(db_path):
        print(f"Loading existing FAISS DB from {db_path}")
        doc_search = FAISS.load_local(
            db_path,
            embeddings=embeddings,
            allow_dangerous_deserialization=True
        )
    else:
        print(f"Creating new FAISS DB at {db_path}")
        
        # 1. Split the document into chunks
        # 2. Split the document into semantic chunks.
        text_splitter = SemanticChunker(
            embeddings=embeddings
        )

        chunks = []
        page_idx = []

        for i, page in enumerate(pdf):
            page_chunks = text_splitter.split_text(page.get_text())
            page_idx.extend([i + 1] * len(page_chunks))
            chunks.extend(page_chunks)

        # 2. Create FAISS vector store
        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)

    # 3. Return retriever with top_k
    retriever = doc_search.as_retriever(search_kwargs={"k": top_k})
    return retriever, doc_search

In [19]:
### 2a) Semantic chunking without add. config (based on A1)
TOP_K = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2a/{filename}"
        retriever, doc_search = get_retriever_qwen_B2(pdf, embeddings=embeddings_qwen, db_path=DB_PATH, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.09 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.442   0.198  0.270
1             S1_B      0.160   0.735  0.258
2             S1_C      0.223   0.548  0.315
3             S1_D      0.257   0.421  0.315
4             S1_E      0.360   0.664  0.460
5             S1_F      0.229   0.423  0.293
6             S1_G      0.091   0.535  0.155
Overall Mean   NaN      0.252   0.504  0.295


In [14]:
def get_retriever_qwen_B2b(pdf, db_path, breaking_point, embeddings, top_k=TOP_K):

    # Check if Database already exists
    if os.path.exists(db_path):
        print(f"Loading existing FAISS DB from {db_path}")
        doc_search = FAISS.load_local(
            db_path, 
            embeddings=embeddings, 
            allow_dangerous_deserialization=True # Be sure you trust the source of the DB file
        )
    
    else:
        print(f"Creating new FAISS DB at {db_path}")

        # Split the document into semantic chunks.
        text_splitter = SemanticChunker(
            embeddings=embeddings, 
            breakpoint_threshold_type="standard_deviation",
            breakpoint_threshold_amount=breaking_point
        )
    
        chunks = []
        page_idx = []
    
        for i, page in enumerate(pdf):
            page_chunks = text_splitter.split_text(page.get_text())
            page_idx.extend([i + 1] * len(page_chunks))
            chunks.extend(page_chunks)

        # Create vector store
        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)
    

    # Return retriever
    retriever = doc_search.as_retriever(search_kwargs={"k": top_k})
    return retriever, doc_search

In [17]:
### 2b) Semantic chunking with add. config: threshold_type"standard_deviation" & breakpoint=1.5 (based on A1)

TOP_K = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b/{filename}"
        retriever, doc_search = get_retriever_qwen_B2b(pdf, db_path=DB_PATH, breaking_point=1.5, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 9.23 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.413   0.159  0.227
1             S1_B      0.175   0.710  0.276
2             S1_C      0.219   0.526  0.307
3             S1_D      0.278   0.380  0.319
4             S1_E      0.380   0.615  0.453
5             S1_F      0.278   0.427  0.326
6             S1_G      0.086   0.403  0.141
Overall Mean   NaN      0.261   0.460  0.293


In [27]:
### 2b-2) Semantic chunking with add. config: threshold_type"standard_deviation" & breakpoint=1 (based on A1)

TOP_K = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_qwen_B2b(pdf, db_path=DB_PATH, breaking_point=1, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 12.20 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.460   0.117  0.184
1             S1_B      0.219   0.555  0.304
2             S1_C      0.289   0.475  0.357
3             S1_D      0.267   0.209  0.233
4             S1_E      0.499   0.420  0.441
5             S1_F      0.374   0.374  0.363
6             S1_G      0.117   0.448  0.183
Overall Mean   NaN      0.318   0.371  0.295


In [32]:
### 2b-3) Semantic chunking with add. config: threshold_type"standard_deviation" & breakpoint=0.7 (based on A1)

TOP_K = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-3/{filename}"
        retriever, doc_search = get_retriever_qwen_B2b(pdf, db_path=DB_PATH, breaking_point=0.7, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-3/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-3/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-3/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 12.22 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.518   0.097  0.161
1             S1_B      0.198   0.348  0.245
2             S1_C      0.303   0.382  0.337
3             S1_D      0.290   0.154  0.195
4             S1_E      0.567   0.379  0.436
5             S1_F      0.226   0.151  0.174
6             S1_G      0.124   0.350  0.177
Overall Mean   NaN      0.318   0.266  0.246


In [36]:
### 2b-4) Semantic chunking with add. config: threshold_type"standard_deviation" & breakpoint=1.2 (based on A1)

TOP_K = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-4/{filename}"
        retriever, doc_search = get_retriever_qwen_B2b(pdf, db_path=DB_PATH, breaking_point=1.2, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-4/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-4/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2b-4/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 12.29 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.453   0.128  0.197
1             S1_B      0.203   0.587  0.295
2             S1_C      0.268   0.508  0.349
3             S1_D      0.282   0.258  0.268
4             S1_E      0.395   0.480  0.424
5             S1_F      0.297   0.354  0.312
6             S1_G      0.097   0.403  0.155
Overall Mean   NaN      0.285   0.388  0.286


In [19]:
### 2c) Semantic chunking with add. config: breakpoint=2 (based on A1)

def get_retriever_qwen_B2c(pdf, db_path, embeddings, top_k=TOP_K):

        # Check if Database already exists
    if os.path.exists(db_path):
        print(f"Loading existing FAISS DB from {db_path}")
        doc_search = FAISS.load_local(
            db_path, 
            embeddings=embeddings, 
            allow_dangerous_deserialization=True # Be sure you trust the source of the DB file
        )
    
    else:
        print(f"Creating new FAISS DB at {db_path}")

        # Split the document into semantic chunks.
        text_splitter = SemanticChunker(
            embeddings=embeddings,
            breakpoint_threshold_amount=2
        )
    
        chunks = []
        page_idx = []
    
        for i, page in enumerate(pdf):
            page_chunks = text_splitter.split_text(page.get_text())
            page_idx.extend([i + 1] * len(page_chunks))
            chunks.extend(page_chunks)

        # Create vector store
        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)
    

    # Return retriever
    retriever = doc_search.as_retriever(search_kwargs={"k": top_k})
    return retriever, doc_search


TOP_K = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2c/{filename}"
        retriever, doc_search = get_retriever_qwen_B2c(pdf, embeddings=embeddings_qwen, db_path=DB_PATH, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2c/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2c/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Creating new FAISS DB at ./faiss_db_Qwen_B2c/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 11.98 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.486   0.030  0.057
1             S1_B      0.195   0.095  0.125
2             S1_C      0.438   0.133  0.204
3             S1_D      0.247   0.049  0.081
4             S1_E      0.660   0.141  0.229
5             S1_F      0.275   0.060  0.097
6             S1_G      0.146   0.119  0.126
Overall Mean   NaN      0.349   0.090  0.131


#### 3. Top_K

In [22]:
### 3a. Top_K = 40 based on A1
TOP_K = 40
CHUNK_SIZE = 500
chunk_overlap = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_A1/{filename}"
        retriever, doc_search = get_retriever_qwen(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_A1/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_A1/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_A1/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.10 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.300   0.082  0.128
1             S1_B      0.135   0.341  0.187
2             S1_C      0.205   0.283  0.237
3             S1_D      0.154   0.155  0.152
4             S1_E      0.310   0.323  0.309
5             S1_F      0.190   0.203  0.193
6             S1_G      0.067   0.308  0.109
Overall Mean   NaN      0.194   0.242  0.188


In [28]:
### 3b. Top_K = 40, Chunk_size 1000 chunk overlap 200 (B1a)
TOP_K = 40

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        #pdf = load_pdf(path=PATH) 
        #text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B1a/{filename}"
        retriever, doc_search = get_retriever_qwen(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B1a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B1a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B1a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.01 minutes


Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '*']


--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.347   0.180  0.232
1             S1_B      0.139   0.658  0.226
2             S1_C      0.210   0.567  0.306
3             S1_D      0.232   0.377  0.284
4             S1_E      0.288   0.545  0.370
5             S1_F      0.213   0.410  0.275
6             S1_G      0.085   0.667  0.148
Overall Mean   NaN      0.216   0.486  0.263


In [31]:
### 3c) Top K = 40 with semantic chunking (B 2a)
TOP_K = 40

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2a/{filename}"
        retriever, doc_search = get_retriever_qwen_B2(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.08 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.347   0.343  0.343
1             S1_B      0.087   0.793  0.154
2             S1_C      0.132   0.751  0.224
3             S1_D      0.186   0.622  0.285
4             S1_E      0.246   0.821  0.373
5             S1_F      0.157   0.611  0.248
6             S1_G      0.046   0.595  0.085
Overall Mean   NaN      0.172   0.648  0.245


In [34]:
### 3c) Top K = 60 with semantic chunking (B 2a)
TOP_K = 60

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2a/{filename}"
        retriever, doc_search = get_retriever_qwen_B2(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.08 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.334   0.503  0.397
1             S1_B      0.062   0.825  0.115
2             S1_C      0.095   0.800  0.170
3             S1_D      0.149   0.717  0.245
4             S1_E      0.188   0.864  0.304
5             S1_F      0.146   0.766  0.244
6             S1_G      0.036   0.747  0.069
Overall Mean   NaN      0.144   0.746  0.221


We observe how the recall get's better (higher coverage of (ground-truth) relevant information within the retrieved results, but the precison get's worse (more noise within the retrieved results).

In [15]:
### 3e) B 2b-2) with Top K = 40

TOP_K = 40

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_qwen_B2b(pdf, db_path=DB_PATH, breaking_point=1, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.14 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.443   0.231  0.302
1             S1_B      0.139   0.684  0.229
2             S1_C      0.190   0.616  0.288
3             S1_D      0.269   0.455  0.335
4             S1_E      0.343   0.629  0.429
5             S1_F      0.266   0.523  0.345
6             S1_G      0.077   0.495  0.131
Overall Mean   NaN      0.247   0.519  0.294


In [16]:
### 3f) Semantic chunking with add. config: threshold_type"standard_deviation" & breakpoint=1 (based on A1)

TOP_K = 30

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_qwen_B2b(pdf, db_path=DB_PATH, breaking_point=1, embeddings=embeddings_qwen, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.08 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.492   0.181  0.262
1             S1_B      0.173   0.663  0.269
2             S1_C      0.221   0.540  0.313
3             S1_D      0.251   0.310  0.275
4             S1_E      0.397   0.538  0.438
5             S1_F      0.333   0.491  0.384
6             S1_G      0.088   0.448  0.145
Overall Mean   NaN      0.279   0.453  0.298


#### 4. Reranker

In [17]:
### Rerank retrieval using ContextualCompressionRetriever from langchain

# B4. Create or Load Vector Store for Qwen
def get_retriever_qwen_B4(pdf, db_path, embeddings, top_k, top_n):

    if os.path.exists(db_path):
        print(f"Loading existing FAISS DB from {db_path}")
        doc_search = FAISS.load_local(
            db_path, 
            embeddings=embeddings, 
            allow_dangerous_deserialization=True # Be sure you trust the source of the DB file
        )
    else:
        print(f"Creating new FAISS DB at {db_path}")

        # Split the document into semantic chunks.
        text_splitter = SemanticChunker(
            embeddings=embeddings
        )

        # Create vector store
        chunks = []
        page_idx = []
    
        for i, page in enumerate(pdf):
            page_chunks = text_splitter.split_text(page.get_text())
            page_idx.extend([i + 1] * len(page_chunks))
            chunks.extend(page_chunks)
        
        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)


    # Create a base retriever with more results
    base_retriever = doc_search.as_retriever(search_kwargs={"k": top_k})

    # Rerank results
    # Initialize cross-encoder model
    model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=model, top_n=top_n)
    # Create the compression retriever
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=base_retriever
    )

    return compression_retriever, doc_search

In [20]:
### 4a) Rerank retrieval using ContextualCompressionRetriever with RecursiveCharacterTextSplitting (based on B A1)
TOP_K = 50
TOP_N = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_A1/{filename}"
        retriever, doc_search = get_retriever_qwen_B4(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K, top_n=TOP_N)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_A1/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_A1/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_A1/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.29 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.326   0.044  0.077
1             S1_B      0.178   0.205  0.185
2             S1_C      0.210   0.145  0.170
3             S1_D      0.163   0.080  0.107
4             S1_E      0.351   0.173  0.229
5             S1_F      0.125   0.067  0.087
6             S1_G      0.100   0.273  0.142
Overall Mean   NaN      0.207   0.141  0.142


In [23]:
### 4b) Rerank retrieval using ContextualCompressionRetriever with semantic chunks (based on B 2a)

TOP_K = 50
TOP_N = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2a/{filename}"
        retriever, doc_search = get_retriever_qwen_B4(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K, top_n=TOP_N)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.29 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.347   0.163  0.218
1             S1_B      0.103   0.413  0.161
2             S1_C      0.169   0.461  0.247
3             S1_D      0.204   0.353  0.258
4             S1_E      0.301   0.526  0.374
5             S1_F      0.194   0.407  0.259
6             S1_G      0.080   0.519  0.138
Overall Mean   NaN      0.200   0.406  0.236


In [24]:
### 4c) Rerank retrieval using ContextualCompressionRetriever with semantic chunks (based on B 2b-2)

TOP_K = 50
TOP_N = 20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_qwen_B4(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K, top_n=TOP_N)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.29 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.436   0.107  0.171
1             S1_B      0.131   0.363  0.190
2             S1_C      0.174   0.339  0.230
3             S1_D      0.283   0.260  0.270
4             S1_E      0.371   0.352  0.346
5             S1_F      0.255   0.307  0.272
6             S1_G      0.100   0.380  0.150
Overall Mean   NaN      0.250   0.301  0.233


In [25]:
### 4d) Rerank retrieval using ContextualCompressionRetriever with semantic chunks (based on B 2b-2)

TOP_K = 50
TOP_N = 30

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_qwen_B4(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K, top_n=TOP_N)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.30 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.426   0.169  0.241
1             S1_B      0.151   0.612  0.242
2             S1_C      0.175   0.508  0.260
3             S1_D      0.234   0.318  0.269
4             S1_E      0.420   0.555  0.462
5             S1_F      0.272   0.454  0.330
6             S1_G      0.109   0.519  0.177
Overall Mean   NaN      0.255   0.448  0.283


In [27]:
### 4e) Rerank retrieval using ContextualCompressionRetriever with semantic chunks (based on B 2b-2)

TOP_K = 40
TOP_N = 30

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_qwen_B4(pdf, db_path=DB_PATH, embeddings=embeddings_qwen, top_k=TOP_K, top_n=TOP_N)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.26 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.475   0.187  0.267
1             S1_B      0.158   0.642  0.252
2             S1_C      0.210   0.573  0.303
3             S1_D      0.270   0.362  0.307
4             S1_E      0.423   0.588  0.475
5             S1_F      0.294   0.465  0.351
6             S1_G      0.082   0.434  0.136
Overall Mean   NaN      0.273   0.465  0.299


#### 5. Similarity threshold

In [30]:
def get_or_create_retriever_B5(
    db_path: str,
    embeddings: HuggingFaceEmbeddings,
    similarity_threshold: float,
    top_k: int
):
    """
    Loads FAISS database that already exists at db_path.

    Args:
        db_path: Path to the FAISS database.
        embeddings: The initialized HuggingFaceEmbeddings object.
        similarity_threshold: The score threshold for retrieval.
        top_k: The number of documents to retrieve.

    Returns:
        A configured LangChain retriever.
    """
    print(f"Loading existing FAISS DB from {db_path}")
    doc_search = FAISS.load_local(
        db_path,
        embeddings=embeddings,
        allow_dangerous_deserialization=True
    )

    # Create the retriever with the specified search parameters
    retriever = doc_search.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={'score_threshold': similarity_threshold, 'k': top_k}
    )
    return retriever

In [31]:
### 5a) Top K = 60, similarity threshold=0.25 with semantic chunking (B 2a)
TOP_K = 60
SIMILARITY_THRESHOLD = 0.25

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        DB_PATH = f"./faiss_db_Qwen_B2a/{filename}"
        retriever = get_or_create_retriever_B5(db_path=DB_PATH, embeddings=embeddings_qwen, similarity_threshold=SIMILARITY_THRESHOLD, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.395   0.359  0.366
1             S1_B      0.129   0.696  0.212
2             S1_C      0.122   0.728  0.206
3             S1_D      0.231   0.622  0.336
4             S1_E      0.298   0.746  0.425
5             S1_F      0.216   0.389  0.275
6             S1_G      0.134   0.450  0.206
Overall Mean   NaN      0.218   0.570  0.289


In [32]:
results_df = pd.DataFrame.from_dict(all_results, orient='index')

# Count the number of items in each cell
counts_df = results_df.applymap(lambda x: len(x) if isinstance(x, list) else 0)

# Display the result
print(counts_df)

                        S1_A  S1_B  S1_C  S1_D  S1_E  S1_F  S1_G
ContinentalAG_2024        50    31    50    40    23    31     5
SchneiderElectric_2024    50    43    60    38    44    20    12
Philips_2024              21    15    29    20    22    15     9


  counts_df = results_df.applymap(lambda x: len(x) if isinstance(x, list) else 0)


In [33]:
### 5b) Top K = 60, similarity threshold=0.25 with semantic chunking (B 2a)
TOP_K = 60
SIMILARITY_THRESHOLD = 0.25

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever = get_or_create_retriever_B5(db_path=DB_PATH, embeddings=embeddings_qwen, similarity_threshold=SIMILARITY_THRESHOLD, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.406   0.325  0.361
1             S1_B      0.138   0.741  0.230
2             S1_C      0.129   0.655  0.214
3             S1_D      0.186   0.447  0.263
4             S1_E      0.313   0.686  0.425
5             S1_F      0.240   0.492  0.323
6             S1_G      0.113   0.388  0.174
Overall Mean   NaN      0.218   0.533  0.284


In [34]:
results_df = pd.DataFrame.from_dict(all_results, orient='index')

# Count the number of items in each cell
counts_df = results_df.applymap(lambda x: len(x) if isinstance(x, list) else 0)

# Display the result
print(counts_df)

                        S1_A  S1_B  S1_C  S1_D  S1_E  S1_F  S1_G
ContinentalAG_2024        60    51    60    60    42    45    11
SchneiderElectric_2024    60    60    60    60    60    55    24
Philips_2024              49    29    55    36    44    25    16


  counts_df = results_df.applymap(lambda x: len(x) if isinstance(x, list) else 0)


In [38]:
### 5b-2)

# Define parameter ranges
TOP_K_VALUES = list(range(20, 81, 10))  # 20, 30, ..., 80
SIMILARITY_THRESHOLDS = np.arange(0.05, 0.36, 0.05)  # 0.05 to 0.35

# Results dictionary
all_grid_results = {}

# Loop over all combinations
for sim_thresh in SIMILARITY_THRESHOLDS:
    for top_k in TOP_K_VALUES:
        print(f"\n--- Running with top_k={top_k}, similarity_threshold={sim_thresh:.2f} ---")
        all_results = {}
        start_time = time.time()

        for idx, row in sample.iterrows():
            filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
            filename = filename.replace(" ", "")
            print(f"\nProcessing: {filename}")

            try:
                DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
                retriever = get_or_create_retriever_B5(
                    db_path=DB_PATH,
                    embeddings=embeddings_qwen,
                    similarity_threshold=sim_thresh,
                    top_k=top_k
                )
                results = retrieve_chunks(retriever, queries=QUERIES)
                all_results[filename] = results
                print(f"Successfully processed {filename}")
            except Exception as e:
                print(f"Error processing {row['company_withAccessInfo']}: {e}")

        runtime = (time.time() - start_time) / 60
        print(f"Computation time for top_k={top_k}, sim_thresh={sim_thresh:.2f}: {runtime:.2f} minutes")

        # Evaluate and save
        eval_metrics = evaluate_retrieval(all_results)
        all_grid_results[(top_k, sim_thresh)] = {
            "results": all_results,
            "eval": eval_metrics,
            "runtime_minutes": runtime
        }



--- Running with top_k=20, similarity_threshold=0.05 ---

Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=20, sim_thresh=0.05: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.460   0.117  0.184
1             S1_B      0.219   0.555  0.304
2             S1_C      0.289   0.475  0.357
3             S1_D      0.267   0.209  0.233
4             S1_E      0.499   0.420  0.441
5             S1_F      0.374   0.374  0.363
6             S1_G      0.117   0.448  0.183
Overall Mean   NaN      0.318   0.371  0.295

--- R

No relevant docs were retrieved using the relevance score threshold 0.35000000000000003


Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=20, sim_thresh=0.35: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.496   0.072  0.124
1             S1_B      0.216   0.259  0.192
2             S1_C      0.297   0.475  0.362
3             S1_D      0.346   0.174  0.227
4             S1_E      0.597   0.370  0.416
5             S1_F      0.274   0.107  0.147
6             S1_G      0.471   0.312  0.334
Overall Mean   NaN      0.385   0.253  0.257

--- Running with top_k=30, similarity_threshold=0.35 ---

Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024


No relevant docs were retrieved using the relevance score threshold 0.35000000000000003


Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=30, sim_thresh=0.35: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.494   0.072  0.123
1             S1_B      0.220   0.265  0.196
2             S1_C      0.281   0.511  0.354
3             S1_D      0.340   0.174  0.225
4             S1_E      0.616   0.387  0.436
5             S1_F      0.274   0.107  0.147
6             S1_G      0.471   0.312  0.334
Overall Mean   NaN      0.385   0.261  0.259

--- Running with top_k=40, similarity_threshold=0.35 ---

Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024


No relevant docs were retrieved using the relevance score threshold 0.35000000000000003


Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=40, sim_thresh=0.35: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.494   0.072  0.123
1             S1_B      0.220   0.265  0.196
2             S1_C      0.280   0.520  0.355
3             S1_D      0.340   0.174  0.225
4             S1_E      0.616   0.387  0.436
5             S1_F      0.274   0.107  0.147
6             S1_G      0.471   0.312  0.334
Overall Mean   NaN      0.385   0.262  0.259

--- Running with top_k=50, similarity_threshold=0.35 ---

Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024


No relevant docs were retrieved using the relevance score threshold 0.35000000000000003


Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=50, sim_thresh=0.35: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.494   0.072  0.123
1             S1_B      0.220   0.265  0.196
2             S1_C      0.280   0.520  0.355
3             S1_D      0.340   0.174  0.225
4             S1_E      0.616   0.387  0.436
5             S1_F      0.274   0.107  0.147
6             S1_G      0.471   0.312  0.334
Overall Mean   NaN      0.385   0.262  0.259

--- Running with top_k=60, similarity_threshold=0.35 ---

Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024


No relevant docs were retrieved using the relevance score threshold 0.35000000000000003


Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=60, sim_thresh=0.35: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.494   0.072  0.123
1             S1_B      0.220   0.265  0.196
2             S1_C      0.280   0.520  0.355
3             S1_D      0.340   0.174  0.225
4             S1_E      0.616   0.387  0.436
5             S1_F      0.274   0.107  0.147
6             S1_G      0.471   0.312  0.334
Overall Mean   NaN      0.385   0.262  0.259

--- Running with top_k=70, similarity_threshold=0.35 ---

Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024


No relevant docs were retrieved using the relevance score threshold 0.35000000000000003


Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=70, sim_thresh=0.35: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.494   0.072  0.123
1             S1_B      0.220   0.265  0.196
2             S1_C      0.280   0.520  0.355
3             S1_D      0.340   0.174  0.225
4             S1_E      0.616   0.387  0.436
5             S1_F      0.274   0.107  0.147
6             S1_G      0.471   0.312  0.334
Overall Mean   NaN      0.385   0.262  0.259

--- Running with top_k=80, similarity_threshold=0.35 ---

Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024


No relevant docs were retrieved using the relevance score threshold 0.35000000000000003


Successfully processed SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed Philips_2024
Computation time for top_k=80, sim_thresh=0.35: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.494   0.072  0.123
1             S1_B      0.220   0.265  0.196
2             S1_C      0.280   0.520  0.355
3             S1_D      0.340   0.174  0.225
4             S1_E      0.616   0.387  0.436
5             S1_F      0.274   0.107  0.147
6             S1_G      0.471   0.312  0.334
Overall Mean   NaN      0.385   0.262  0.259


In [64]:
### similarity threshold & reranker

# This function combines both techniques
def get_retriever_B5c(db_path, embeddings, top_k, top_n, similarity_threshold):
    
    # Load the existing FAISS database
    print(f"Loading existing FAISS DB from {db_path}")
    doc_search = FAISS.load_local(
        db_path,
        embeddings=embeddings,
        allow_dangerous_deserialization=True
    )

    # 1. Create a base retriever that uses the similarity threshold.
    # This acts as a "pre-filter" to remove low-quality results.
    base_retriever = doc_search.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={'k': top_k, 'score_threshold': similarity_threshold}
    )

    # 2. Initialize the re-ranker model (same as before)
    model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=model, top_n=top_n)

    # 3. Create the compression retriever that wraps the pre-filtering retriever
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=base_retriever
    )

    return compression_retriever, doc_search

In [42]:
### 5c) similarity threshold=0.25 & reranker (B 4e)
TOP_K = 50
SIMILARITY_THRESHOLD = 0.3
TOP_N = 30

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_B5c(db_path=DB_PATH, embeddings=embeddings_qwen, similarity_threshold=SIMILARITY_THRESHOLD, top_k=TOP_K, top_n=TOP_N)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.17 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.496   0.182  0.266
1             S1_B      0.236   0.536  0.319
2             S1_C      0.205   0.532  0.294
3             S1_D      0.255   0.302  0.276
4             S1_E      0.494   0.502  0.480
5             S1_F      0.292   0.212  0.243
6             S1_G      0.278   0.380  0.307
Overall Mean   NaN      0.322   0.378  0.312


In [44]:
### Grid search set up
# Define parameter ranges for the grid search
TOP_K_VALUES = [40, 50, 60, 70, 80]  # Number of initial candidates
TOP_N_VALUES = [10, 20, 30, 40, 60]  # Final number of docs after reranking
SIMILARITY_THRESHOLDS = np.arange(0.1, 0.4, 0.1) 


# --- Grid Search Experimental Loop ---
all_grid_results = {}

for sim_thresh in SIMILARITY_THRESHOLDS:
    for top_k in TOP_K_VALUES:
        for top_n in TOP_N_VALUES:
            # Constraint: top_n cannot be greater than top_k
            if top_n > top_k:
                continue

            print(f"\n--- Running with top_k={top_k}, top_n={top_n}, sim_thresh={sim_thresh:.2f} ---")
            
            current_run_results = {}
            start_time = time.time()

            for idx, row in sample.iterrows():
                filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
                filename = filename.replace(" ", "")
                # print(f"Processing: {filename}") # Commented out for cleaner grid search output

                try:
                    DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
                    
                    retriever, _ = get_retriever_B5c(
                        db_path=DB_PATH, 
                        embeddings=embeddings_qwen, 
                        similarity_threshold=sim_thresh, 
                        top_k=top_k, 
                        top_n=top_n
                    )
                    
                    results = retrieve_chunks(retriever, queries=QUERIES)
                    current_run_results[filename] = results
                
                except Exception as e:
                    print(f"Error processing {row['company_withAccessInfo']}: {e}")

            runtime = (time.time() - start_time)
            print(f"Computation time: {runtime:.2f} seconds")

            # Evaluate and save the metrics for this parameter combination
            eval_metrics = evaluate_retrieval(current_run_results)
            all_grid_results[(top_k, top_n, sim_thresh)] = {
                "eval": eval_metrics,
                "runtime_seconds": runtime
            }


--- Running with top_k=40, top_n=10, sim_thresh=0.10 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Computation time: 11.68 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.486   0.070  0.123
1             S1_B      0.165   0.233  0.190
2             S1_C      0.233   0.181  0.203
3             S1_D      0.348   0.157  0.216
4             S1_E      0.375   0.179  0.233
5             S1_F      0.238   0.169  0.191
6             S1_G      0.138   0.364  0.188
Overall Mean   NaN      0.283   0.193  0.192

--- Running with top_k=40, top_n=20, sim_thresh=0.10 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_

No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.78 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.358   0.020  0.037
1             S1_B      0.130   0.137  0.127
2             S1_C      0.262   0.214  0.231
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.298   0.099  0.126

--- Running with top_k=40, top_n=20, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.60 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=40, top_n=30, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.83 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=40, top_n=40, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.41 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=50, top_n=10, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.43 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.358   0.020  0.037
1             S1_B      0.130   0.137  0.127
2             S1_C      0.262   0.214  0.231
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.298   0.099  0.126

--- Running with top_k=50, top_n=20, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.48 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=50, top_n=30, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.39 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=50, top_n=40, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.49 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=60, top_n=10, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.57 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.358   0.020  0.037
1             S1_B      0.130   0.137  0.127
2             S1_C      0.262   0.214  0.231
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.298   0.099  0.126

--- Running with top_k=60, top_n=20, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.43 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=60, top_n=30, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.45 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=60, top_n=40, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.11 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=60, top_n=60, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.45 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=70, top_n=10, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.38 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.358   0.020  0.037
1             S1_B      0.130   0.137  0.127
2             S1_C      0.262   0.214  0.231
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.298   0.099  0.126

--- Running with top_k=70, top_n=20, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.37 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=70, top_n=30, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.85 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=70, top_n=40, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.86 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=70, top_n=60, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.07 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=80, top_n=10, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.81 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.358   0.020  0.037
1             S1_B      0.130   0.137  0.127
2             S1_C      0.262   0.214  0.231
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.298   0.099  0.126

--- Running with top_k=80, top_n=20, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.47 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=80, top_n=30, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.39 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=80, top_n=40, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.42 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129

--- Running with top_k=80, top_n=60, sim_thresh=0.40 ---
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.4


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


No relevant docs were retrieved using the relevance score threshold 0.4
No relevant docs were retrieved using the relevance score threshold 0.4


Computation time: 7.41 seconds
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.349   0.021  0.039
1             S1_B      0.133   0.143  0.132
2             S1_C      0.248   0.268  0.244
3             S1_D      0.253   0.058  0.094
4             S1_E      0.720   0.167  0.246
5             S1_F      0.138   0.068  0.091
6             S1_G      0.222   0.031  0.054
Overall Mean   NaN      0.295   0.108  0.129


### C) Test retrieval queries

In [None]:
### similarity threshold & reranker

# This function combines both techniques
def get_retriever_B5c(db_path, embeddings, top_k, top_n, similarity_threshold):
    
    # Load the existing FAISS database
    print(f"Loading existing FAISS DB from {db_path}")
    doc_search = FAISS.load_local(
        db_path,
        embeddings=embeddings,
        allow_dangerous_deserialization=True
    )

    # 1. Create a base retriever that uses the similarity threshold.
    # This acts as a "pre-filter" to remove low-quality results.
    base_retriever = doc_search.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={'k': top_k, 'score_threshold': similarity_threshold}
    )

    # 2. Initialize the re-ranker model (same as before)
    model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=model, top_n=top_n)

    # 3. Create the compression retriever that wraps the pre-filtering retriever
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=base_retriever
    )

    return compression_retriever, doc_search

In [50]:
def evaluate_query_variants(
    query_key: str,
    query_variants: dict,
    embeddings,
    top_k: int = 70,
    top_n: int = 40,
    similarity_threshold: float = 0.30,
    db_base_path: str = "./faiss_db_Qwen_B2b-2",
    report_base_path: str = "./sample_reports"
) -> pd.DataFrame:

    results_per_variant = {}

    for variant_name, query_text in query_variants.items():
        print(f"\n==== Evaluating query variant: {variant_name} ====")
        all_results = {}

        for idx, row in sample.iterrows():
            filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024".replace(" ", "")
            try:
                db_path = f"{db_base_path}/{filename}"
                retriever, doc_search = get_retriever_B5c(
                    db_path=db_path,
                    embeddings=embeddings,
                    top_k=top_k,
                    similarity_threshold=similarity_threshold,
                    top_n=top_n
                )

                # Only evaluate the current query variant
                results = retrieve_chunks(retriever, queries={query_key: query_text})
                all_results[filename] = results

            except Exception as e:
                print(f"Error processing {row['company_withAccessInfo']}: {e}")

        # --- Evaluation ---
        evaluation_results = []

        for report_name, query_results in all_results.items():
            retrieved_documents = query_results.get(query_key, [])
            gt_texts = validation_set[
                (validation_set['report_name'] == report_name) &
                (validation_set['query'] == query_key)
            ]['text'].tolist()

            if not gt_texts:
                continue

            scores = evaluate_retrieval_sentence_level(
                retrieved_docs=retrieved_documents,
                ground_truth_texts=gt_texts
            )
            evaluation_results.append(scores)

        if evaluation_results:
            df = pd.DataFrame(evaluation_results)
            avg_scores = df.mean().to_dict()
        else:
            avg_scores = {"precision": None, "recall": None, "f1": None}

        results_per_variant[variant_name] = avg_scores

    # Return as DataFrame
    df_results = pd.DataFrame.from_dict(results_per_variant, orient="index")
    return df_results


Baseline querries:
QUERIES = {
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities arising from the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What are the company’s human rights practices, risks and incidents related to the own workforce?",
    'S1_D': "What are the company’s processes and policies for engaging with own workers and workers’ representatives about impacts?",
    'S1_E': "What are the company’s policies on non-discrimination, diversity and inclusion in the own workforce?",
    'S1_F': "What are the company’s processes, policies and approaches to remediate negative impacts and channels for own workers to raise concerns?",
    'S1_G': "How is the company’s workfoce social protection coverage?",
}

In [52]:
#### S1_A management and disclosure of material impacts, risks and opportunities related to the own workforce

QUERY_KEY = "S1_A"

query_variants_S1A = {
    "original": "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    "variant_1": "disclosure of major impacts and risks related to employees",
    "variant_2": "How the company manages workforce-related material risks and opportunities",
    "variant_3": "reporting on workforce material impacts and risks",
    "variant_4": "strategies and disclosures about workforce-related risks and opportunities",
    "variant_5": "management and disclosure of material impacts, risks and opportunities related to the own workforce",
    "variant_6": "What are material impacts on the workers?",
    "variant_7": "material negative and positive impacts on workforce",
    "variant_8": "How does the company manage and disclose material impacts, risks and opportunities related to its own workers?",
}

results_S1A = evaluate_query_variants(
    query_key="S1_A",
    query_variants=query_variants_S1A,
    embeddings=embeddings_qwen
)

print(results_S1A.round(3))


==== Evaluating query variant: original ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_1 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024


Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: ').']



==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_4 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_5 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_6 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS

In [53]:
QUERY_KEY = "S1_A"

query_variants_S1A = {
    "original": "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    "variant_5": "management and disclosure of material impacts, risks and opportunities related to the own workforce",
    "variant_6": "How are material impacts, risks and opportunities related to the own workforce managed and disclosed?",
    "variant_7": "How are material impacts, risks and opportunities related to the own workforce managed?",
}

results_S1A = evaluate_query_variants(
    query_key="S1_A",
    query_variants=query_variants_S1A,
    embeddings=embeddings_qwen
)

print(results_S1A.round(3))


==== Evaluating query variant: original ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_5 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_6 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_7 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS 

In [54]:
#### S1_B material risks and opportunities arising from impacts and dependencies on own workforce

query_variants_S1B = {
    "baseline": "What are the material risks and opportunities arising from the company’s impacts and dependencies on people in its own workforce?",
    "variant_1": "material risks and opportunities arising from impacts and dependencies on own workforce",
    "variant_2": "What are the material risks and opportunities related to the company’s impacts and dependencies on people in its own workforce?",
    "variant_3": "Which material risks and opportunities arise from impacts and dependencies on people in its own workforce?"
}

results_S1B = evaluate_query_variants(
    query_key="S1_B",
    query_variants=query_variants_S1B,
    embeddings=embeddings_qwen
)

print(results_S1B.round(3))


==== Evaluating query variant: baseline ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_1 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS 

In [57]:
#### S1_C human rights practices, risks and incidents related to the own workforce
query_variants_S1C = {
    "baseline": "What are the company’s human rights practices, risks and incidents related to the own workforce?",
    "variant_1": "human rights practices, risks and incidents related to the own workforce",
    "variant_2": "What are the company’s human rights and labor rights commitments for its own workforce? What violations have occured and how were they managed? Which operations are considered at significant risk for human rights violations such as forced labor, child labor, or trafficking?",
    "variant_3": "What are incidents, complaints and severe human rights impacts the company has on it's workforce?",
    "variant_4": "What is the company's general approach in relation to respect for human rights including labour rights, of people in its own workforce?",
}

results_S1C = evaluate_query_variants(
    query_key="S1_C",
    query_variants=query_variants_S1C,
    embeddings=embeddings_qwen
)

print(results_S1C.round(3))


==== Evaluating query variant: baseline ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_1 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS 

In [59]:
query_variants_S1C = {
    "variant_2": "What are the company’s human rights and labor rights commitments for its own workforce? What violations have occured and how were they managed? Which operations are considered at significant risk for human rights violations such as forced labor, child labor, or trafficking?",
    "variant_3": "What are the company’s human rights and labor rights commitments for its own workforce?",
    "variant_4": "What are the company’s human rights and labor rights commitments for its own workforce? What violations have occured and how were they managed? Which operations are considered at significant risk for human rights violations?",
    "variant_5": "What is the company's general approach in relation to respect for human rights including labour rights, of people in its own workforce?",
    "variant_6": "What is the company's approach in relation to respect for human rights including labour rights, of people in its own workforce?",
    "variant_7": "What is the company's approach in relation to respect for human and labour rights including of people in its own workforce including forced labor, child labor, or trafficking?"

}

results_S1C = evaluate_query_variants(
    query_key="S1_C",
    query_variants=query_variants_S1C,
    embeddings=embeddings_qwen
)

print(results_S1C.round(3))


==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_4 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_5 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS

In [61]:
query_variants_S1C = {
    "variant_5": "Which operations or countries are considered at high risk for incidents of forced labour or child labour within the company's own workforce?",
    "variant_6": "What are the company’s policy commitments related to human rights, especially regarding trafficking, forced labour, and child labour?",
    "variant_7": "Does the company reference any international human rights instruments in its workforce-related policies?",
    "variant_8": "Has the company signed a Global Framework Agreement or other agreements with workers' representatives related to human rights?",
    "variant_9": "What human rights violations (e.g. harassment, discrimination) have occurred, and were there any penalties or compensations?",
    "variant_10": "Are there disclosures of severe incidents related to the workforce and how these are financially reconciled?"
}

results_S1C = evaluate_query_variants(
    query_key="S1_C",
    query_variants=query_variants_S1C,
    embeddings=embeddings_qwen
)

print(results_S1C.round(3))


==== Evaluating query variant: variant_5 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_6 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_7 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_8 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS

In [62]:
#### S1_D processes and policies for engaging with own workers and workers’ representatives about impacts
query_variants_S1D = {
    "baseline": "What are the company’s processes and policies for engaging with own workers and workers’ representatives about impacts?",
    "variant_1": "processes and policies for engaging with own workers and workers’ representatives about impacts",
    "variant_2": "Processes for engaging with own workers and workers’ representatives about impacts",
    "variant_3": "How does the company engage with its workers and workers’ representatives?",
    "variant_4": "How does the company engage with its workers?",
    "variant_5": "How does the company engage with its workers and workers’ representatives about impacts?",
}

results_S1D = evaluate_query_variants(
    query_key="S1_D",
    query_variants=query_variants_S1D,
    embeddings=embeddings_qwen
)

print(results_S1D.round(3))


==== Evaluating query variant: baseline ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_1 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS 

No relevant docs were retrieved using the relevance score threshold 0.3



==== Evaluating query variant: variant_4 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_5 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
           precision  recall     f1
baseline       0.254   0.319  0.283
variant_1      0.231   0.359  0.281
variant_2      0.235   0.208  0.219
variant_3      0.437   0.155  0.227
variant_4      0.343   0.070  0.117
variant_5      0.448   0.192  0.256


In [63]:
#### S1_E policies on non-discrimination, diversity and inclusion in the own workforce

query_variants_S1E = {
    "baseline": "What are the company’s policies on non-discrimination, diversity and inclusion in the own workforce?",
    "variant_1": "policies on non-discrimination, diversity and inclusion in the own workforce",
    "variant_2": "How does the company prevent discrimination in its workkforce? How is it mitigated and acted upon once detected?",
    "variant_3": "How is discrimination in the workforce prevented, mitigated and acted upon once detected?",
    "variant_4": "How does the company advance diversity and inclusions in its workforce?",
    "variant_5": "How does the company prevent discrimination, advance diversity and inclusions in its workforce?",
    "variant_6": "How does the company prevent discrimination and advance diversity and inclusions in its workforce?",
    "variant_7": "What are policies to prevent discrimination, advance diversity and inclusion in the own workforce?",
}

results_S1E = evaluate_query_variants(
    query_key="S1_E",
    query_variants=query_variants_S1E,
    embeddings=embeddings_qwen
)

print(results_S1E.round(3))


==== Evaluating query variant: baseline ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_1 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS 

In [64]:
#### S1_F processes, policies and approaches to remediate negative impacts and channels for own workers to raise concerns

query_variants_S1F = {
    "baseline": "What are the company’s processes, policies and approaches to remediate negative impacts and channels for own workers to raise concerns?",
    "variant_1": "processes, policies and approaches to remediate negative impacts and channels for own workers to raise concerns",
    "variant_2": "What are the company’s processes, policies and approaches to remediate negative impacts in the workforce? What are channels for the own workers to raise concerns?",
    "variant_3": "How does the company remediate negative impacts in the workforce? What are channels for the own workers to raise concerns?",
    "variant_4": "How does the company remediate negative impacts in the workforce?",
    "variant_5": "Has the company taken any actions to provide or enable remedy for negative impacts on its own workforce, and how does it ensure that such remedies are appropriate and effective?",
    "variant_6": "What channels are available for people in the company’s own workforce to raise concerns or needs, how are these channels supported, protected, and monitored?",
    "variant_7": "What are the company’s approaches to remediate negative impacts in the workforce? What channels are available for people in the company’s own workforce to raise concerns or needs, how are they supported, protected and monitored?"
}

results_S1F = evaluate_query_variants(
    query_key="S1_F",
    query_variants=query_variants_S1F,
    embeddings=embeddings_qwen
)

print(results_S1F.round(3))


==== Evaluating query variant: baseline ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_1 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS 

In [65]:
#### S1_G social protection coverage in the own workforce 
query_variants_S1G = {
    "baseline": "How is the company’s workfoce social protection coverage?",
    "variant_1": "social protection coverage in the own workforce ",
    "variant_2": "Are the employees covered by social protection against loss of income due to major life events?",
    "variant_3": "Are all employees in own workforce covered by social protection against loss of income due to sickness, employment injury and acquired disability, parental leave, retirement?",
    "variant_4": "To what extent does the company’s own workforce receive social protection, and how does the company address employees who are not yet covered by such protections?",
    "variant_5": "How is the company's workforce protected by social protection?",
    "variant_6": "How is the company's workforce protected against loss of income through social protection?",
}

results_S1G = evaluate_query_variants(
    query_key="S1_G",
    query_variants=query_variants_S1G,
    embeddings=embeddings_qwen
)

print(results_S1G.round(3))


==== Evaluating query variant: baseline ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_1 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_2 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.3


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_3 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.3


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_4 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.3


Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_5 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024

==== Evaluating query variant: variant_6 ====
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
           precision  recall     f1
baseline       0.278   0.380  0.307
variant_1      0.179   0.426  0.251
variant_2      0.383   0.312  0.328
variant_3      0.425   0.397  0.410
variant_4      0.519   0.312  0.343
variant_5      0.320   0.312  0.313
variant_6      0.320   0.312  0.313


In [None]:
### taking all together:

QUERIES = {
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities related to the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What is the company's general approach in relation to respect for human rights including labour rights, of people in its own workforce?",
    'S1_D': "How does the company engage with its workers and workers’ representatives?",
    'S1_E': "How does the company prevent discrimination and advance diversity and inclusions in its workforce?",
    'S1_F': "What are the company’s approaches to remediate negative impacts in the workforce? What channels are available for people in the company’s own workforce to raise concerns or needs, how are they supported, protected and monitored?",
    'S1_G': "How is the company's workforce protected by social protection?",
}

TOP_K=30
SIMILARITY_THRESHOLD = 0.20

all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        DB_PATH = f"./faiss_db_Qwen_semanticchunks_2a/{filename}"
        retriever = get_or_create_retriever_B5(db_path=DB_PATH, embeddings=embeddings_model, similarity_threshold=SIMILARITY_THRESHOLD, top_k=TOP_K)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_semanticchunks_2a/ContinentalAG_2024
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_semanticchunks_2a/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_semanticchunks_2a/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.01 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.367   0.264  0.307
1             S1_B      0.110   0.645  0.184
2             S1_C      0.149   0.631  0.241
3             S1_D      0.480   0.614  0.538
4             S1_E      0.379   0.700  0.492
5             S1_F      0.226   0.566  0.323
6             S1_G      0.072   0.393  0.121
Overall Mean   NaN      0.25

In [68]:
### taking all together:

QUERIES = {
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities related to the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What are the company’s human rights and labor rights commitments for its own workforce?",
    'S1_D': "What are the company’s processes and policies for engaging with own workers and workers’ representatives about impacts?",
    'S1_E': "How does the company prevent discrimination and advance diversity and inclusions in its workforce?",
    'S1_F': "What channels are available for people in the company’s own workforce to raise concerns or needs, how are these channels supported, protected, and monitored?",
    'S1_G': "Are all employees in own workforce covered by social protection against loss of income due to sickness, employment injury and acquired disability, parental leave, retirement?",
}


all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_B5c(
                    db_path=DB_PATH,
                    embeddings=embeddings_qwen,
                    top_k=70,
                    similarity_threshold=0.3,
                    top_n=40
                )
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.3


Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.16 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.463   0.212  0.288
1             S1_B      0.236   0.620  0.326
2             S1_C      0.265   0.576  0.353
3             S1_D      0.254   0.319  0.283
4             S1_E      0.694   0.414  0.514
5             S1_F      0.480   0.397  0.423
6             S1_G      0.425   0.397  0.410
Overall Mean   NaN      0.402   0.419  0.371


In [70]:
# save results
results_df = pd.DataFrame.from_dict(all_results, orient='index')
results_df.to_csv('retrieval_results_C_Qwen.csv', index=True)

In [3]:
# read in the results
retrieval_results = pd.read_csv('retrieval_results_C_Qwen.csv', index_col=0)

In [71]:
### evaluate with accuracy score

all_results = {}
all_docs_sentences = {} # Dictionary to hold all sentences for each document

# Main loop now includes extracting all sentences from each PDF
for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH)
        
        # --- NEW: Extract ALL sentences from the PDF for accuracy calculation ---
        all_sents_in_doc = set()
        for page in pdf:
            sentences = sent_tokenize(page.get_text())
            all_sents_in_doc.update([s.strip() for s in sentences if s.strip()])
        all_docs_sentences[filename] = all_sents_in_doc
        print(f"Found {len(all_sents_in_doc)} total unique sentences in {filename}.")
        
        # Your existing retrieval logic
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_retriever_B5c(
                    db_path=DB_PATH,
                    embeddings=embeddings_qwen,
                    top_k=70,
                    similarity_threshold=0.3,
                    top_n=40
                )
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
    
    except Exception as e:
        print(f"Error processing {filename}: {e}")

# Call the new evaluation function that includes accuracy
evaluate_retrieval_with_accuracy(all_results, all_docs_sentences)


Processing: ContinentalAG_2024
Found 6633 total unique sentences in ContinentalAG_2024.
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.3



Processing: SchneiderElectric_2024
Found 11266 total unique sentences in SchneiderElectric_2024.
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024

Processing: Philips_2024
Found 6302 total unique sentences in Philips_2024.
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
--- Retrieval Performance Summary ---
             query  precision  recall     f1  accuracy
0             S1_A      0.463   0.212  0.288     0.961
1             S1_B      0.236   0.620  0.326     0.985
2             S1_C      0.265   0.576  0.353     0.981
3             S1_D      0.254   0.319  0.283     0.978
4             S1_E      0.694   0.414  0.514     0.988
5             S1_F      0.480   0.397  0.423     0.986
6             S1_G      0.425   0.397  0.410     0.997
Overall Mean   NaN      0.402   0.419  0.371     0.982


### Taking all together - Retrieval

In [8]:
class FallbackCompressionRetriever:
    def __init__(self, primary_retriever, fallback_retriever):
        self.primary = primary_retriever
        self.fallback = fallback_retriever

    def invoke(self, query, **kwargs):
        docs = self.primary.invoke(query, **kwargs)
        if not docs:
            print("[Fallback] No documents found with threshold — falling back to top_k=30.")
            docs = self.fallback.invoke(query, **kwargs)
        return docs

    def __call__(self, query, **kwargs):
        return self.invoke(query, **kwargs)

In [9]:
def get_final_retriever(db_path, embeddings):

    if os.path.exists(db_path):
        print(f"Loading existing FAISS DB from {db_path}")
        doc_search = FAISS.load_local(
            db_path, 
            embeddings=embeddings, 
            allow_dangerous_deserialization=True
        )
    else:
        print(f"Creating new FAISS DB at {db_path}")
        text_splitter = SemanticChunker(
            embeddings=embeddings, 
            breakpoint_threshold_type="standard_deviation",
            breakpoint_threshold_amount=1
        )

        chunks, page_idx = [], []
        for i, page in enumerate(pdf):
            page_chunks = text_splitter.split_text(page.get_text())
            page_idx.extend([i + 1] * len(page_chunks))
            chunks.extend(page_chunks)

        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)

    # Primary retriever with threshold
    base_retriever_thresh = doc_search.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={'k': 70, 'score_threshold': 0.3}
    )

    # Fallback retriever without threshold
    base_retriever_topk = doc_search.as_retriever(
        search_kwargs={'k': 30}
    )

    # Cross-encoder reranker
    model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=model, top_n=40)

    primary = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=base_retriever_thresh
    )
    fallback = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=base_retriever_topk
    )

    final_retriever = FallbackCompressionRetriever(primary_retriever=primary, fallback_retriever=fallback)
    return final_retriever, doc_search

In [77]:
all_results = {}
start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_final_retriever(db_path=DB_PATH, embeddings=embeddings_qwen)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.3


[Fallback] No documents found with threshold — falling back to top_k=30.
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024
Computation time: 0.18 minutes
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.463   0.212  0.288
1             S1_B      0.236   0.620  0.326
2             S1_C      0.265   0.576  0.353
3             S1_D      0.254   0.319  0.283
4             S1_E      0.694   0.414  0.514
5             S1_F      0.480   0.397  0.423
6             S1_G      0.425   0.397  0.410
Overall Mean   NaN      0.402   0.419  0.371


## 2. RAG - Generation
### Evalualte Generation

In [27]:
gen_validation_set = pd.read_excel('validation_dataset.xlsx', sheet_name='S1_Generation')
print(gen_validation_set.head())

          report_name query  discloscure_degree  \
0  ContinentalAG_2024  S1_A                  90   
1  ContinentalAG_2024  S1_B                  85   
2  ContinentalAG_2024  S1_C                  95   
3  ContinentalAG_2024  S1_D                  80   
4  ContinentalAG_2024  S1_E                  95   

                                        dd_reasoning  conformity_score  \
0  (+) detailed and comprehensive disclosure on h...                70   
1  (-) room for improvement in providing more spe...                75   
2  (+) high level of detail in describing both po...                90   
3  (+) direct channels (annual survey) and indire...                90   
4  (+) detailed and specific disclosure on polici...               100   

                                        cs_reasoning  
0  (-) does not provide a clear picture of the fi...  
1  (-) lacks specific details on actions planned ...  
2  (-) absence of any reference to Global Framewo...  
3  (+) provides strong, spec

In [28]:
gen_validation_set = gen_validation_set.rename(columns={"discloscure_degree": "disclosure_degree"})

In [25]:
def evaluate_llm_scores(llm_results: list, validation_df: pd.DataFrame) -> dict:
    score_types = {
        'disclosure_degree': ['disclosure_degree'],  # handle common typo
        'conformity_score': ['conformity_score']
    }
    
    results = {}

    for score_type, possible_columns in score_types.items():
        # Try to locate the correct column in validation_df
        val_col = next((col for col in possible_columns if col in validation_df.columns), None)
        if val_col is None:
            print(f"Column for {score_type} not found in validation set.")
            continue
        
        preds, truths = [], []

        for report_entry in llm_results:
            report_name = report_entry['report']
            scores = report_entry.get(score_type, {})
            for query, content in scores.items():
                if isinstance(content, dict) and 'SCORE' in content:
                    preds.append(float(content['SCORE']))
                    row = validation_df[
                        (validation_df['report_name'] == report_name) &
                        (validation_df['query'] == query)
                    ]
                    if not row.empty:
                        truths.append(float(row[val_col].values[0]))

        if len(preds) != len(truths) or len(preds) == 0:
            print(f"Skipping {score_type}: missing or mismatched entries.")
            continue

        mae = mean_absolute_error(truths, preds)
        rmse = np.sqrt(mean_squared_error(truths, preds))
        correlation, _ = pearsonr(truths, preds)

        results[score_type] = {
            'MAE': round(mae, 2),
            'RMSE': round(rmse, 2),
            'Pearson_Correlation': round(correlation, 2)
        }

    return results

### 0: Basemodel with OpenAI based on Ni et al. (2023)
Adjustments to prompts:
- climate science -> sustainability reporting (non-financial reporting?)
- environmental responsibility -> corporate social responsibility

In [11]:
PROMPTS = {
    'general':
        """You are tasked with the role of a sustainability reporting analyst, assigned to analyze the social sustainability disclosure of a company's annual report. Based on the following extracted parts from the annual report, answer the given QUESTIONS. 
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
Format your answers in JSON format with the following keys: COMPANY_NAME, COMPANY_SECTOR, and COMPANY_LOCATION.

QUESTIONS: 
1. What is the company of the report?
2. What sector does the company belong to? 
3. Where is the company located?

=========
{context}
=========
Your FINAL_ANSWER in JSON (ensure there's no format error):
""",

    # Ni et al. 2023: 'scoring' not used in the paper but in the documentation code (changes: score 0 - 1, to 0 - 100 %; without report summary but with all retrieved chunks; score for each question)
    'disclosure_degree': """Your task is to rate the disclosure quality of an annaul report on the following <CRITICAL_ELEMENT>:

<CRITICAL_ELEMENT>: {question}

Presented below are select excerpts from the annual report, which pertain to the <CRITICAL_ELEMENT>:

<DISCLOSURE>:
====
{disclosure}
====

Please assign a score reflecting the depth and comprehensiveness of the disclosed information.
Your response should be formatted in JSON with two keys:
1. ANALYSIS: A paragraph of analysis (be in a string format). No longer than 150 words.
2. SCORE: An integer score from 0 to 100. A score of 100 denotes a detailed and comprehensive disclosure. A score of 50 suggests that the disclosed information is lacking in detail. A score of 0 indicates that the requested information is either not disclosed or is disclosed without any detail.

Your FINAL_ANSWER in JSON (ensure there's no format error):
""",

    # Ni et al. 2023: 'tcfd_assessment'
    'conformity_score': """Your task is to rate a annual report's disclosure quality on the following <CRITICAL_ELEMENT>:

<CRITICAL_ELEMENT>: {question}

These are the <REQUIREMENTS> that outline the necessary components for high-quality disclosure pertaining to the <CRITICAL_ELEMENT>:

<REQUIREMENTS>:
====
{requirements}
====

Presented below are select excerpts from the annual report, which pertain to the <CRITICAL_ELEMENT>:

<DISCLOSURE>:
====
{disclosure}
====

Please analyze the extent to which the given <DISCLOSURE> satisfies the aforementioned <REQUIREMENTS>. Your ANALYSIS should specify which <REQUIREMENTS> have been met and which ones have not been satisfied.
Your response should be formatted in JSON with two keys:
1. ANALYSIS: A paragraph of analysis (be in a string format). No longer than 150 words.
2. SCORE: An integer score from 0 to 100. A score of 0 indicates that most of the <REQUIREMENTS> have not been met or are insufficiently detailed. In contrast, a score of 100 suggests that the majority of the <REQUIREMENTS> have been met and are accompanied by specific details.

Your FINAL_ANSWER in JSON (ensure there's no format error):
""",
}

In [17]:
QUERIES = {
    'general': ["What is the company of the report?", "What sector does the company belong to?", "Where is the company located?"],
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities related to the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What is the company's general approach in relation to respect for human rights including labour rights, of people in its own workforce?",
    'S1_D': "How does the company engage with its workers and workers’ representatives?",
    'S1_E': "How does the company prevent discrimination and advance diversity and inclusions in its workforce?",
    'S1_F': "What are the company’s approaches to remediate negative impacts in the workforce? What channels are available for people in the company’s own workforce to raise concerns or needs, how are they supported, protected and monitored?",
    'S1_G': "How is the company's workforce protected by social protection?"
}


In [12]:
ESRS_ASSESSMENT = {
    'S1_A': """In describing how the undertaking manages and discloses material impacts, risks and opportunities related to its own workforce, organizations should consider including a discussion of the following:
1. Whether all people in its own workforce who could be materially impacted by the undertaking are included in the scope of the disclosure, including a description of the types of employees and non-employees subject to material impacts.
2. In the case of material negative impacts
    a. Whether such impacts are (i) widespread or systemic in the contexts where the undertaking operates or (ii) related to individual incidents;
    b. How the undertaking has developed an understanding of how people with particular characteristics, those working in particular contexts, or those undertaking particular activities may be at greater risk of harm;
    c. The actions taken, planned, or underway to prevent or mitigate such material negative impacts on its own workforce;
    d. The processes through which the undertaking identifies what action is needed and appropriate in response to particular actual or potential negative impacts;
    e. Whether and how the undertaking ensures that its own practices do not cause or contribute to material negative impacts on its own workforce.
3. In the case of material positive impacts:
    a. A description of the activities that result in positive impacts and the types of employees and non-employees in its own workforce who are or could be positively affected;
    b. Any additional initiatives or actions undertaken with the primary purpose of delivering positive impacts for its own workforce.
4. Any material impacts on the undertaking's own workforce that may arise from transition plans aimed at reducing negative environmental impacts and achieving greener and climate-neutral operations, including measures taken to mitigate those impacts.
5. What resources are allocated to the management of material impacts, including information that enables users to understand how such impacts are managed.
Organizations should describe whether the policies, actions, and initiatives to manage material impacts, risks, and opportunities:
- Apply to specific groups within the workforce or to the workforce as a whole.
- How the effectiveness of these measures is tracked and assessed.
""",

    'S1_B': """In describing the material risks and opportunities arising from the undertaking’s impacts and dependencies on people in its own workforce, organizations should consider including the following information:
1. A description of the material risks and opportunities that arise from the undertaking’s impacts and dependencies on its own workforce, including whether these relate to specific groups within the workforce.
2. The actions planned or underway to mitigate material risks associated with such impacts and dependencies, and how the effectiveness of these actions is tracked and assessed in practice.
3. The actions planned or underway to pursue material opportunities related to the undertaking’s own workforce.
""",

    'S1_C': """In describing the human rights practices, risks and incidents related to its own workforce, organizations should consider including the following information:
1. Information on the types of operations and geographic areas identified as being at significant risk of incidents involving forced, compulsory, or child labour.
2. A description of human rights policy commitments relevant to the undertaking’s own workforce, including its general approach to:
    a. Respect for human rights, including labour rights, of people in its own workforce;
    b. Whether and how these policies are aligned with relevant internationally recognised human rights instruments;
    c. Whether the policies explicitly address trafficking in human beings, forced or compulsory labour, and child labour.
3. Where applicable, disclosure of Global Framework Agreements or other agreements the undertaking has with workers' representatives related to respect for workers’ human rights.
4. The following information regarding identified cases of severe human rights incidents connected to the undertaking’s own workforce:
    a. Disclosure of any such severe human rights issues and incidents;
    b. Information on the reconciliation of any fines, penalties, or compensation related to those incidents.
""",

'S1_D': """In describing the processes and policies for engaging with its own workers and workers’ representatives about impacts, organizations should consider including the following information:
1. A disclosure of the general approach to engagement with people in its own workforce.
2. Whether and how the perspectives of its own workforce inform decisions or activities aimed at managing actual and potential impacts, including:
    a. Whether engagement occurs directly with the undertaking’s own workforce or through workers' representatives;
    b. The stage(s) at which engagement occurs, the type of engagement, and the frequency of engagement;
    c. The function and most senior role within the undertaking responsible for ensuring that engagement takes place and that the results inform the undertaking’s approach;
    d. Where applicable, how the undertaking assesses the effectiveness of its engagement with its own workforce.
3. Where applicable, a disclosure of steps taken to gain insight into the perspectives of people in its own workforce who may be particularly vulnerable to or marginalised by impacts.
4. Whether and how the undertaking engaged directly with its own workforce or workers’ representatives in setting targets to manage its material impacts, risks, and opportunities, track performance, and identify improvements.

If the undertaking cannot disclose the above information because it has not adopted a general process to engage with its own workforce, it shall disclose this to be the case. It may also disclose a timeframe within which it aims to establish such a process.
""",

'S1_E': """In describing the policies on non-discrimination, diversity, and inclusion in its own workforce, organizations should consider including the following information:
1. Whether it has specific policies aimed at eliminating discrimination (including harassment), promoting equal opportunities, and other ways to advance diversity and inclusion.
2. Whether the following grounds for discrimination are specifically covered in the policy: racial and ethnic origin, colour, sex, sexual orientation, gender identity, disability, age, religion, political opinion, national extraction or social origin, or other forms of discrimination covered by Union regulations and national law.
3. Whether the undertaking has specific policy commitments related to inclusion or positive action for people from groups at particular risk of vulnerability in its own workforce, and if so, what these commitments are.
4. Whether and how these policies are implemented through specific procedures to ensure that discrimination is prevented, mitigated, and addressed once detected, as well as to promote diversity and inclusion in general.
5. Disclosure of fines, penalties, and compensation for damages resulting from incidents of discrimination and complaints.
""",

'S1_F': """In describing the processes, policies, and approaches to remediate negative impacts and channels for own workers to raise concerns, organizations should consider including the following information:
1. Whether and how it has taken action to provide or enable remedy in relation to an actual material impact.
2. The general approach to and processes for providing or contributing to remedy where the undertaking has caused or contributed to a material negative impact on people in its own workforce, including remedy for human rights impacts.
3. Disclosure of specific channels in place for its own workforce to raise concerns or needs directly with the undertaking and have them addressed.
4. Whether or not the undertaking has a grievance/complaints handling mechanism related to employee matters.
5. Processes through which the undertaking supports the availability of such channels in the workplace of its own workforce.
6. How it tracks and monitors issues raised and addressed, and how it ensures the effectiveness of the channels, including through the involvement of stakeholders who are intended users.
7. Whether and how it assesses that people in its own workforce are aware of, and trust, these structures or processes as a way to raise their concerns or needs and have them addressed.
8. Whether it has policies in place regarding the protection of individuals who use these channels, including workers’ representatives, against retaliation.

If the undertaking cannot disclose the above required information because it has not adopted a channel for raising concerns and/or does not support the availability of such a channel in the workplace for its own workforce, it shall disclose this to be the case. It may disclose a timeframe in which it aims to have such a channel in place.
""",

'S1_G': """In describing the social protection coverage of its own workforce, organizations should consider including the following information:
1. Whether all employees in the own workforce are covered by social protection mechanisms against loss of income due to any of the following major life events:
    a. Sickness;
    b. Unemployment starting from when the own worker is working for the undertaking;
    c. Employment injury and acquired disability;
    d. Parental leave; and
    e. Retirement.
2. If not all employees are covered by social protection, disclosure of the countries where employees do not have social protection and a description of the types of employees who are not covered by such protection.
""",
}

In [13]:
SYSTEM_PROMPT = "You are an AI assistant in the role of a Senior Equity Analyst with expertise in sustainability reporting that analyzes companys' annual reports."

In [14]:
def remove_brackets(string):
    return re.sub(r'\([^)]*\)', '', string).strip()


def _docs_to_string(docs, with_source=True):
# def _docs_to_string(docs, num_docs=TOP_K, with_source=True):
    output = ""
    # docs = docs[:num_docs]
    for doc in docs:
        output += "Content: {}\n".format(doc.page_content)
        if with_source:
            output += "Source: {}\n".format(doc.metadata['source'])
        output += "\n---\n"
    return output


def _find_answer(string, name="ANSWER"):
    for l in string.split('\n'):
        if name in l:
            start = l.find(":") + 3
            end = len(l) - 1
            return l[start:end]
    return string


def _find_sources(string):
    pattern = r'\d+'
    numbers = [int(n) for n in re.findall(pattern, string)]
    return numbers


def _find_float_numbers(string):
    pattern = r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?"
    float_numbers = [float(n) for n in re.findall(pattern, string)]
    return float_numbers


def _find_score(string):
    for l in string.split('\n'):
        if "SCORE" in l:
            d = re.search(r'[-+]?\d*\.?\d+', l)
            if d:
                return d.group(0)
    return None

In [18]:
# Set up
# TOP_K = 20
## variables with in the __init__ function
llm_name='gpt-3.5-turbo'
answer_key_name='ANSWER'
max_token=512
queries=QUERIES
qa_prompt='summary_with_source'
assessments=ESRS_ASSESSMENT
answer_length='60'
root_path='./'
gitee_key=''
user_name='defualt'
language='en'

# defined with self.
tiktoken_encoder = tiktoken.encoding_for_model(llm_name)
cur_api = 0
file_format = 'md' # or 'txt'
prompts = PROMPTS
basic_info_answers = []
answers = []
assessment_results = []
user_questions = []
user_answers = []

In [26]:
# Ni et al. 2023: 'qa_with_chat'
async def get_basic_info(report_list, section_text_dict):
    all_answers = []
    basic_info_answers = [] # new

    for report_index, report in enumerate(report_list):

        # Extract the basic information from the report
        basic_info_prompt = PromptTemplate(template=prompts['general'], input_variables=['context'])
        if "turbo" in llm_name:
            message =[
                SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content=basic_info_prompt.format(
                    context = _docs_to_string(section_text_dict[report]['general'], with_source=False)))
            ]
            llm = ChatOpenAI(temperature=0, max_tokens=256)
            output_text = llm(message).content
        else:
            message = basic_info_prompt.format(
                context=_docs_to_string(report.section_text_dict['general'], with_source=False))
            llm = ChatOpenAI(temperature=0, max_tokens=256)
            output_text = llm(message).content
        print(output_text)
        try:
            basic_info_dict = json.loads(output_text)
        except ValueError as e:
            basic_info_dict = {'COMPANY_NAME': _find_answer(output_text, name='COMPANY_NAME'),
                                'COMPANY_SECTOR': _find_answer(output_text, name='COMPANY_SECTOR'),
                                'COMPANY_LOCATION': _find_answer(output_text, name='COMPANY_LOCATION')}
        basic_info_answers.append(basic_info_dict)
        
        basic_info_string = """Company name: {name}\nCompany sector: {sector}\nCompany Location: {location}""" \
            .format(name=basic_info_dict['COMPANY_NAME'], sector=basic_info_dict['COMPANY_SECTOR'],
                    location=basic_info_dict['COMPANY_LOCATION'])   

        all_answers.append({
            'report': report,
            'basic_info': basic_info_dict

        })
    
    return all_answers

In [24]:
### retrieve the relevant text chunks:
section_text_dict = {}

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_OpenAI_0/{filename}"
        retriever, doc_search = get_retriever(pdf, db_path=DB_PATH)
        results = retrieve_chunks(retriever, queries=QUERIES)
        section_text_dict[filename] = results
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")


Processing: ContinentalAG_2024

Processing: SchneiderElectric_2024

Processing: Philips_2024


In [27]:
results = await get_basic_info(sample_list, section_text_dict)

{
    "COMPANY_NAME": "Continental AG",
    "COMPANY_SECTOR": "Automotive",
    "COMPANY_LOCATION": "Hanover, Germany"
}
{
    "COMPANY_NAME": "Schneider Electric SE",
    "COMPANY_SECTOR": "Energy Management and Automation",
    "COMPANY_LOCATION": "Rueil-Malmaison, France"
}
{
    "COMPANY_NAME": "Koninklijke Philips N.V.",
    "COMPANY_SECTOR": "Healthcare Technology",
    "COMPANY_LOCATION": "Netherlands"
}


In [29]:
# Ni et al. 2023: 'analyze_with_chat'
async def get_conformity_score(report_list, section_text_dict):
    all_assessment_results = []
    for report_index, report in enumerate(report_list):
        esrs_assessment_prompt = PromptTemplate(template=prompts['conformity_score'],
                                                input_variables=["question", "requirements", "disclosure"])
        esrs_questions = {k: v for k, v in queries.items() if 'S1' in k}
        
        messages = []
        keys = []
        for idx, k in enumerate(assessments.keys()):
            num_docs = 20
            current_prompt = esrs_assessment_prompt.format(question=queries[k],
                                                            requirements=assessments[k],
                                                            disclosure=_docs_to_string(
                                                            section_text_dict[report][k], with_source=False))
            if '16k' not in llm_name:
                while len(tiktoken_encoder.encode(current_prompt)) > 3200 and num_docs > 10:
                    num_docs -= 1
                    current_prompt = esrs_assessment_prompt.format(question=queries[k],
                                                                    requirements=assessments[k],
                                                                    disclosure=_docs_to_string(
                                                                        section_text_dict[report][k],
                                                                        num_docs=num_docs,
                                                                        with_source=False))
            if "turbo" in llm_name:
                message = [
                    SystemMessage(content=SYSTEM_PROMPT),
                    HumanMessage(content=current_prompt)
                ]
            else:
                message = current_prompt
            keys.append(k)
            messages.append(message)
        if "turbo" in llm_name:
            llm = ChatOpenAI(temperature=0, max_tokens=512)
        else:
            llm = ChatOpenAI(temperature=0, max_tokens=512)
        outputs = await llm.agenerate(messages)
        output_texts = {k: g[0].text for k, g in zip(keys, outputs.generations)}

        for k, text in output_texts.items():
            try:
                assessments[k] = json.loads(text)
                if 'SCORE' not in assessments[k].keys() or 'ANALYSIS' not in assessments[k].keys():
                    raise ValueError("Key name(s) not defined!")
            except ValueError as e:
                assessments[k] = {'ANALYSIS': _find_answer(text, name='ANALYSIS'),
                                    'SCORE': _find_score(text)}
            analysis_text = remove_brackets(assessments[k]['ANALYSIS'])
            if "<CRITICAL_ELEMENT>" in analysis_text:
                analysis_text = analysis_text.replace("<CRITICAL_ELEMENT>", "ESRS recommendation point")
            if "<DISCLOSURE>" in analysis_text:
                analysis_text = analysis_text.replace("<DISCLOSURE>", "report's disclosure")
            if "<REQUIREMENTS>" in analysis_text:
                analysis_text = analysis_text.replace("<REQUIREMENTS>", "ESRS guidelines")
            assessments[k]['ANALYSIS'] = analysis_text
            print(assessments[k])
        all_assessment_results.append(assessments)
        all_scores = [float(s['SCORE']) for s in assessments.values()]
        print(all_scores)

    return all_assessment_results

In [30]:
conformity_results = await get_conformity_score(sample_list, section_text_dict)

{'ANALYSIS': 'The disclosure provided in the annual report by Continental AG regarding the management and disclosure of material impacts, risks, and opportunities related to its own workforce is comprehensive. The company addresses various aspects such as negative and positive impacts, actions taken, processes for engagement, and resource allocation. However, there is room for improvement in providing more specific details on certain actions and outcomes to enhance transparency and accountability.', 'SCORE': 85}
{'ANALYSIS': "The disclosure provided in the annual report partially meets the requirements for describing material risks and opportunities related to the company's impacts and dependencies on its own workforce. The report includes information on the material risks and opportunities, actions planned to mitigate risks, and consideration of specific groups within the workforce. However, there is a lack of detail on how the effectiveness of mitigation actions is tracked and assess

In [32]:
async def get_disclosure_degree(report_list, section_text_dict):
    all_assessment_results = []
    for report_index, report in enumerate(report_list):
        esrs_assessment_prompt = PromptTemplate(template=prompts['disclosure_degree'],
                                                input_variables=["question", "disclosure"])
        esrs_questions = {k: v for k, v in queries.items() if 'S1' in k}
        
        messages = []
        keys = []
        for idx, k in enumerate(assessments.keys()):
            num_docs = 20
            current_prompt = esrs_assessment_prompt.format(question=queries[k],
                                                            requirements=assessments[k],
                                                            disclosure=_docs_to_string(
                                                            section_text_dict[report][k], with_source=False))
            if '16k' not in llm_name:
                while len(tiktoken_encoder.encode(current_prompt)) > 3200 and num_docs > 10:
                    num_docs -= 1
                    current_prompt = esrs_assessment_prompt.format(question=queries[k],
                                                                    requirements=assessments[k],
                                                                    disclosure=_docs_to_string(
                                                                        section_text_dict[report][k],
                                                                        num_docs=num_docs,
                                                                        with_source=False))
            if "turbo" in llm_name:
                message = [
                    SystemMessage(content=SYSTEM_PROMPT),
                    HumanMessage(content=current_prompt)
                ]
            else:
                message = current_prompt
            keys.append(k)
            messages.append(message)
        if "turbo" in llm_name:
            llm = ChatOpenAI(temperature=0, max_tokens=512)
        else:
            llm = ChatOpenAI(temperature=0, max_tokens=512)
        outputs = await llm.agenerate(messages)
        output_texts = {k: g[0].text for k, g in zip(keys, outputs.generations)}

        for k, text in output_texts.items():
            try:
                assessments[k] = json.loads(text)
                if 'SCORE' not in assessments[k].keys() or 'ANALYSIS' not in assessments[k].keys():
                    raise ValueError("Key name(s) not defined!")
            except ValueError as e:
                assessments[k] = {'ANALYSIS': _find_answer(text, name='ANALYSIS'),
                                    'SCORE': _find_score(text)}
            analysis_text = remove_brackets(assessments[k]['ANALYSIS'])
            if "<CRITICAL_ELEMENT>" in analysis_text:
                analysis_text = analysis_text.replace("<CRITICAL_ELEMENT>", "ESRS recommendation point")
            if "<DISCLOSURE>" in analysis_text:
                analysis_text = analysis_text.replace("<DISCLOSURE>", "report's disclosure")
            if "<REQUIREMENTS>" in analysis_text:
                analysis_text = analysis_text.replace("<REQUIREMENTS>", "ESRS guidelines")
            assessments[k]['ANALYSIS'] = analysis_text
            print(assessments[k])
        all_assessment_results.append(assessments)
        all_scores = [float(s['SCORE']) for s in assessments.values()]
        print(all_scores)

    return all_assessment_results

In [33]:
disclosure_results = await get_disclosure_degree(sample_list, section_text_dict)

{'ANALYSIS': "The annual report provides a detailed and comprehensive disclosure regarding the management and disclosure of material impacts, risks, and opportunities related to the company's own workforce. The report covers various aspects such as assessment methodologies, key actions, management approaches, metrics, engagement processes, and integration with the company's strategy and business model. It also includes information on policies, targets, and communication channels with employees, demonstrating a strong commitment to workforce-related sustainability issues.", 'SCORE': 95}
{'ANALYSIS': "The annual report provides a detailed and comprehensive disclosure regarding the material risks and opportunities related to the company's impacts and dependencies on its own workforce. The report covers various aspects such as workforce impacts, risks, opportunities, vulnerable groups, joint ventures, remediation processes, and targets. It also highlights the integration of workforce consi

In [34]:
async def _run_assessment(report, section_text_dict, mode="conformity_score"):
    assert mode in ["conformity_score", "disclosure_degree"], "Invalid mode"

    esrs_prompt = PromptTemplate(
        template=prompts[mode],
        input_variables=["question", "requirements", "disclosure"] if mode == "conformity_score" else ["question", "disclosure"]
    )

    esrs_questions = {k: v for k, v in queries.items() if 'S1' in k}
    assessments = {}
    messages = []
    keys = []

    for k in esrs_questions.keys():
        num_docs = 20
        disclosure_text = _docs_to_string(section_text_dict[report][k], with_source=False)
        requirements = assessments[k] if mode == "conformity_score" and k in assessments else ""

        prompt = esrs_prompt.format(
            question=queries[k],
            requirements=requirements,
            disclosure=disclosure_text
        )

        if '16k' not in llm_name:
            while len(tiktoken_encoder.encode(prompt)) > 3200 and num_docs > 10:
                num_docs -= 1
                disclosure_text = _docs_to_string(section_text_dict[report][k], num_docs=num_docs, with_source=False)
                prompt = esrs_prompt.format(
                    question=queries[k],
                    requirements=requirements,
                    disclosure=disclosure_text
                )

        if "turbo" in llm_name:
            message = [
                SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content=prompt)
            ]
        else:
            message = prompt

        keys.append(k)
        messages.append(message)

    llm = ChatOpenAI(temperature=0, max_tokens=512)
    outputs = await llm.agenerate(messages)
    output_texts = {k: g[0].text for k, g in zip(keys, outputs.generations)}

    for k, text in output_texts.items():
        try:
            assessments[k] = json.loads(text)
            if 'SCORE' not in assessments[k] or 'ANALYSIS' not in assessments[k]:
                raise ValueError("Missing keys in JSON.")
        except:
            assessments[k] = {
                'SCORE': _find_score(text),
                'ANALYSIS': _find_answer(text, name="ANALYSIS")
            }

        analysis_text = remove_brackets(assessments[k]['ANALYSIS'])
        analysis_text = analysis_text.replace("<CRITICAL_ELEMENT>", "ESRS recommendation point")
        analysis_text = analysis_text.replace("<DISCLOSURE>", "report's disclosure")
        analysis_text = analysis_text.replace("<REQUIREMENTS>", "ESRS guidelines")
        assessments[k]['ANALYSIS'] = analysis_text

    return assessments


In [38]:
### combining analysis of disclosure degree and conformity score together
async def analyze_reports(report_list, section_text_dict):
    results = []

    for report_index, report in enumerate(report_list):
        print(f"\n--- Processing report: {report} ---")

        result_dict = {"report": report}

        # ----- BASIC INFO -----
        basic_info_prompt = PromptTemplate(template=prompts['general'], input_variables=['context'])
        context_str = _docs_to_string(section_text_dict[report]['general'], with_source=False)
        if "turbo" in llm_name:
            message = [
                    SystemMessage(content=SYSTEM_PROMPT),
                    HumanMessage(content=basic_info_prompt.format(context=context_str))
            ]
            llm = ChatOpenAI(temperature=0, max_tokens=256)
            output_text = llm(message).content
        else:
            message = basic_info_prompt.format(context=context_str)
            llm = ChatOpenAI(temperature=0, max_tokens=256)
            output_text = llm(message).content
        try:
            basic_info_dict = json.loads(output_text)
        except ValueError:
            basic_info_dict = {
                'COMPANY_NAME': _find_answer(output_text, name='COMPANY_NAME'),
                'COMPANY_SECTOR': _find_answer(output_text, name='COMPANY_SECTOR'),
                'COMPANY_LOCATION': _find_answer(output_text, name='COMPANY_LOCATION')
            }
        result_dict["basic_info"] = basic_info_dict

        # ----- DISCLOSURE DEGREE -----
        try:
            disclosure_results = await _run_assessment(report, section_text_dict, mode="disclosure_degree")
            result_dict["disclosure_degree"] = disclosure_results
        except Exception as e:
            print(f"Error in disclosure degree for {report}: {e}")
            result_dict["disclosure_degree"] = {}


        # ----- CONFORMITY SCORE -----
        try:
            conformity_results = await _run_assessment(report, section_text_dict, mode="conformity_score")
            result_dict["conformity_score"] = conformity_results
        except Exception as e:
            print(f"Error in conformity score for {report}: {e}")
            result_dict["conformity_score"] = {}

        results.append(result_dict)

    return results

In [None]:
### GENERATION ###
async def get_assessment(report_list, section_text_dict):
    all_assessment_results = []
    for report_index, report in enumerate(report_list):
        disclosure_prompt = PromptTemplate(template=PROMPT_TEMPLATE,
                                           input_variables=["query", "sources", "guideline", "answer_length"])

        messages = []
        keys = []

        for idx, k in enumerate(QUERIES.keys()):
            current_prompt = disclosure_prompt.format(query=QUERIES[k],
                                                      sources=_docs_to_string(section_text_dict[report][k], with_source=False),
                                                      guideline=GUIDELINES[k],
                                                      answer_length=ANSWER_LENGTH)
            message = [
                SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content=current_prompt)
            ]
            keys.append(k)
            messages.append(message)

        llm = ChatOpenAI(temperature=0, max_tokens=max_token)
        
        outputs = await llm.agenerate(messages)
    return outputs

In [39]:
analysis_results = await analyze_reports(sample_list, section_text_dict)


--- Processing report: ContinentalAG_2024 ---
{
    "COMPANY_NAME": "Continental AG",
    "COMPANY_SECTOR": "Automotive",
    "COMPANY_LOCATION": "Hanover, Germany"
}

--- Processing report: SchneiderElectric_2024 ---
{
    "COMPANY_NAME": "Schneider Electric SE",
    "COMPANY_SECTOR": "Energy Management and Automation",
    "COMPANY_LOCATION": "Rueil-Malmaison, France"
}

--- Processing report: Philips_2024 ---
{
    "COMPANY_NAME": "Koninklijke Philips N.V.",
    "COMPANY_SECTOR": "Healthcare Technology",
    "COMPANY_LOCATION": "Netherlands"
}


In [40]:
analysis_results

[{'report': 'ContinentalAG_2024',
  'basic_info': {'COMPANY_NAME': 'Continental AG',
   'COMPANY_SECTOR': 'Automotive',
   'COMPANY_LOCATION': 'Hanover, Germany'},
  'disclosure_degree': {'S1_A': {'ANALYSIS': "The annual report provides a comprehensive overview of how Continental AG manages and discloses material impacts, risks, and opportunities related to its own workforce. The report covers various aspects such as assessment methodologies, management approaches, engagement processes, metrics, and integration with the business model. It also highlights the link between risks and the company's strategy. However, there is room for improvement in providing more specific examples and outcomes of the actions taken.",
    'SCORE': 85},
   'S1_B': {'ANALYSIS': "The annual report provides a detailed and comprehensive disclosure regarding the material risks and opportunities related to the company's impacts and dependencies on its own workforce. The report covers various aspects such as workf

In [57]:
evaluate_llm_scores(llm_results=analysis_results, validation_df=gen_validation_set)

{'disclosure_degree': {'MAE': 5.95,
  'RMSE': 7.79,
  'Pearson_Correlation': 0.3086},
 'conformity_score': {'MAE': 11.67,
  'RMSE': 16.8,
  'Pearson_Correlation': 0.1292}}

### Ground truth retrieval dataset for generation

In [33]:
print(section_text_dict)



In [24]:
ground_truth_data = {}

# Gruppieren nach Report und dann nach Query
for report_name, report_group in validation_set.groupby('report_name'):
    ground_truth_data[report_name] = {}
    for query_key, query_group in report_group.groupby('query'):
        # Alle Textpassagen für diese Query zu einem String zusammenfügen
        # Wir trennen die einzelnen Passagen durch eine klare Linie
        concatenated_text = "\n\n---\n\n".join(query_group['text'])
        ground_truth_data[report_name][query_key] = concatenated_text

print(ground_truth_data)



In [25]:
# explore length of the ground_truth_data

# Initialisieren Sie den Encoder (wie in Ihrem Code)
tiktoken_encoder = tiktoken.encoding_for_model(llm_name)

# Template für den Prompt
esrs_assessment_prompt = PromptTemplate(template=prompts['conformity_score'],
                                      input_variables=["question", "requirements", "disclosure"])

token_counts = {}

# Iterieren durch die aufbereiteten Ground-Truth-Daten
for report_name, queries_dict in ground_truth_data.items():
    token_counts[report_name] = {}
    print(f"\n--- Analysiere Report: {report_name} ---")
    for query_key, disclosure_text in queries_dict.items():
        if query_key not in queries or query_key not in assessments:
            continue

        # Den vollständigen Prompt erstellen, genau wie er an die LLM gesendet wird
        current_prompt = esrs_assessment_prompt.format(
            question=queries[query_key],
            requirements=assessments[query_key],
            disclosure=disclosure_text
        )

        # Tokens zählen
        num_tokens = len(tiktoken_encoder.encode(current_prompt))
        token_counts[report_name][query_key] = num_tokens

        print(f"Query '{query_key}': {num_tokens} Tokens")



--- Analysiere Report: ContinentalAG_2024 ---
Query 'S1_A': 15065 Tokens
Query 'S1_B': 2345 Tokens
Query 'S1_C': 3099 Tokens
Query 'S1_D': 4602 Tokens
Query 'S1_E': 3463 Tokens
Query 'S1_F': 4954 Tokens
Query 'S1_G': 974 Tokens

--- Analysiere Report: Philips_2024 ---
Query 'S1_A': 10301 Tokens
Query 'S1_B': 1988 Tokens
Query 'S1_C': 3208 Tokens
Query 'S1_D': 3390 Tokens
Query 'S1_E': 3605 Tokens
Query 'S1_F': 2635 Tokens
Query 'S1_G': 899 Tokens

--- Analysiere Report: SchneiderElectric_2024 ---
Query 'S1_A': 8319 Tokens
Query 'S1_B': 1453 Tokens
Query 'S1_C': 3422 Tokens
Query 'S1_D': 4962 Tokens
Query 'S1_E': 7092 Tokens
Query 'S1_F': 4751 Tokens
Query 'S1_G': 1697 Tokens


In [28]:
### disclosure degree analysis with ground truth data

async def analyze_with_ground_truth(report_list, ground_truth_data):
    all_assessment_results = []

    for report in report_list:

        print(f"\n--- Process report: {report} ---")
        
        esrs_assessment_prompt = PromptTemplate(template=prompts['disclosure_degree'],
                                              input_variables=["question", "disclosure"])
        
        messages = []
        keys = []
        
        # Iteriert durch die Queries, die im Ground-Truth-Set für diesen Report vorhanden sind
        for k, disclosure_text in ground_truth_data[report].items():
            current_prompt = esrs_assessment_prompt.format(
                question=queries[k],
                disclosure=disclosure_text
            )

            message = [
                SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content=current_prompt)
            ]
            keys.append(k)
            messages.append(message)

        llm = ChatOpenAI(model_name=llm_name, temperature=0, max_tokens=1024)

        # Asynchrone API-Aufrufe
        outputs = await llm.agenerate(messages)
        output_texts = {k: g[0].text for k, g in zip(keys, outputs.generations)}

        report_assessments = {}
        # Verarbeitung der Ergebnisse (direkt aus Ihrem Code übernommen)
        for k, text in output_texts.items():
            try:
                report_assessments[k] = json.loads(text)
                if 'SCORE' not in report_assessments[k] or 'ANALYSIS' not in report_assessments[k]:
                    raise ValueError("Key name(s) not defined!")
            except (ValueError, json.JSONDecodeError):
                report_assessments[k] = {
                    'ANALYSIS': _find_answer(text, name='ANALYSIS'),
                    'SCORE': _find_score(text)
                }
            
            # Textbereinigung (direkt aus Ihrem Code übernommen)
            analysis_text = remove_brackets(report_assessments[k]['ANALYSIS'])
            analysis_text = analysis_text.replace("<CRITICAL_ELEMENT>", "ESRS recommendation point")
            analysis_text = analysis_text.replace("<DISCLOSURE>", "report's disclosure")
            analysis_text = analysis_text.replace("<REQUIREMENTS>", "ESRS guidelines")
            report_assessments[k]['ANALYSIS'] = analysis_text
            
            print(f"Ergebnis für Query '{k}':")
            print(report_assessments[k])

        all_assessment_results.append(report_assessments)
        all_scores = [float(s['SCORE']) for s in report_assessments.values() if 'SCORE' in s and s['SCORE'] is not None]
        print(f"\nScores für {report}: {all_scores}")

    return all_assessment_results

ground_truth_results = await analyze_with_ground_truth(sample_list, ground_truth_data)


--- Process report: ContinentalAG_2024 ---
Ergebnis für Query 'S1_A':
{'ANALYSIS': 'The annual report provides a detailed and comprehensive disclosure on how the company manages and discloses material impacts, risks, and opportunities related to its own workforce. The report covers various aspects such as labor standards, adequate wages, work-life balance, training and skill development, secure employment, social dialogue, employee privacy, and occupational safety and health. The company has outlined specific processes, methodologies, and actions taken to address these areas, demonstrating a strong commitment to employee well-being and engagement.', 'SCORE': 95}
Ergebnis für Query 'S1_B':
{'ANALYSIS': "The annual report provides a detailed overview of the material risks and opportunities related to the company's impacts and dependencies on its own workforce. It covers various aspects such as training, skill development, secure employment, and potential risks like labor rights violatio

In [29]:
# Save results
with open("ground_truth_results_dd.json", "w", encoding="utf-8") as f:
    json.dump(ground_truth_results, f, ensure_ascii=False, indent=2)

In [None]:
### conformity score analysis with ground truth data

async def analyze_with_ground_truth(report_list, ground_truth_data):
    all_assessment_results = []

    for report in report_list:

        print(f"\n--- Process report: {report} ---")
        
        esrs_assessment_prompt = PromptTemplate(template=prompts['conformity_score'],
                                              input_variables=["question", "requirements", "disclosure"])
        
        messages = []
        keys = []
        
        # Iteriert durch die Queries, die im Ground-Truth-Set für diesen Report vorhanden sind
        for k, disclosure_text in ground_truth_data[report].items():
            current_prompt = esrs_assessment_prompt.format(
                question=queries[k],
                requirements=assessments[k],
                disclosure=disclosure_text
            )

            message = [
                SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content=current_prompt)
            ]
            keys.append(k)
            messages.append(message)

        llm = ChatOpenAI(model_name=llm_name, temperature=0, max_tokens=1024)

        # Asynchrone API-Aufrufe
        outputs = await llm.agenerate(messages)
        output_texts = {k: g[0].text for k, g in zip(keys, outputs.generations)}

        report_assessments = {}
        # Verarbeitung der Ergebnisse (direkt aus Ihrem Code übernommen)
        for k, text in output_texts.items():
            try:
                report_assessments[k] = json.loads(text)
                if 'SCORE' not in report_assessments[k] or 'ANALYSIS' not in report_assessments[k]:
                    raise ValueError("Key name(s) not defined!")
            except (ValueError, json.JSONDecodeError):
                report_assessments[k] = {
                    'ANALYSIS': _find_answer(text, name='ANALYSIS'),
                    'SCORE': _find_score(text)
                }
            
            # Textbereinigung (direkt aus Ihrem Code übernommen)
            analysis_text = remove_brackets(report_assessments[k]['ANALYSIS'])
            analysis_text = analysis_text.replace("<CRITICAL_ELEMENT>", "ESRS recommendation point")
            analysis_text = analysis_text.replace("<DISCLOSURE>", "report's disclosure")
            analysis_text = analysis_text.replace("<REQUIREMENTS>", "ESRS guidelines")
            report_assessments[k]['ANALYSIS'] = analysis_text
            
            print(f"Ergebnis für Query '{k}':")
            print(report_assessments[k])

        all_assessment_results.append(report_assessments)
        all_scores = [float(s['SCORE']) for s in report_assessments.values() if 'SCORE' in s and s['SCORE'] is not None]
        print(f"\nScores für {report}: {all_scores}")

    return all_assessment_results

ground_truth_results = await analyze_with_ground_truth(sample_list, ground_truth_data)


--- Process report: ContinentalAG_2024 ---
Ergebnis für Query 'S1_A':
{'ANALYSIS': 'The disclosure provided by Continental in the annual report partially meets the requirements outlined for managing and disclosing material impacts, risks, and opportunities related to its own workforce. The company has detailed its processes for identifying and assessing impacts, risks, and opportunities, as well as the actions taken to address them. However, there is room for improvement in providing specific details on the types of employees impacted, the effectiveness of measures, and the tracking and assessment of these efforts.', 'SCORE': 75}
Ergebnis für Query 'S1_B':
{'ANALYSIS': "The disclosure provided in the annual report partially addresses the requirements for describing material risks and opportunities related to the company's impacts and dependencies on its own workforce. The report discusses various risks related to labor rights, workforce management, and skill development initiatives. H

In [27]:
# Save results
with open("ground_truth_results_cs.json", "w", encoding="utf-8") as f:
    json.dump(ground_truth_results, f, ensure_ascii=False, indent=2)

In [2]:
# Load results
with open("ground_truth_results_0.json", "r", encoding="utf-8") as f:
    results = json.load(f)

### D) LLM Model for Generation
#### 1. OpenAI
with the best performing retrieval set up so far:
- PDF Extraction: MyPuPDF
- embedding model: Qwen3-Embedding-0.6B
- chunking: SemanticChunker + config: threshhold_type=standard deviation breakpoint=1
- retrieval: top K: 70 and similarity threshold: 0,30 reranker with top N:40
and the best retrieval queries

In [10]:
### the best performing queries
QUERIES = {
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities related to the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What are the company’s human rights and labor rights commitments for its own workforce?",
    'S1_D': "What are the company’s processes and policies for engaging with own workers and workers’ representatives about impacts?",
    'S1_E': "How does the company prevent discrimination and advance diversity and inclusions in its workforce?",
    'S1_F': "What channels are available for people in the company’s own workforce to raise concerns or needs, how are these channels supported, protected, and monitored?",
    'S1_G': "Are all employees in own workforce covered by social protection against loss of income due to sickness, employment injury and acquired disability, parental leave, retirement?",
}

In [15]:
### retrieve the relevant text chunks:     
section_text_dict = {}
#start_time = time.time()

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        DB_PATH = f"./faiss_db_Qwen_B2b-2/{filename}"
        retriever, doc_search = get_final_retriever(db_path=DB_PATH, embeddings=embeddings_qwen)
        results = retrieve_chunks(retriever, queries=QUERIES)
        section_text_dict[filename] = results
        print(f"Successfully processed and retrieved chunks for {filename}")
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

#print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")


Processing: ContinentalAG_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/ContinentalAG_2024


No relevant docs were retrieved using the relevance score threshold 0.3


[Fallback] No documents found with threshold — falling back to top_k=30.
Successfully processed and retrieved chunks for ContinentalAG_2024

Processing: SchneiderElectric_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/SchneiderElectric_2024
Successfully processed and retrieved chunks for SchneiderElectric_2024

Processing: Philips_2024
Loading existing FAISS DB from ./faiss_db_Qwen_B2b-2/Philips_2024
Successfully processed and retrieved chunks for Philips_2024


In [86]:
# explore length of the retrived docs

# Initialisieren Sie den Encoder (wie in Ihrem Code)
tiktoken_encoder = tiktoken.encoding_for_model(llm_name)

# Template für den Prompt
esrs_assessment_prompt = PromptTemplate(template=prompts['conformity_score'],
                                      input_variables=["question", "requirements", "disclosure"])

token_counts = {}

# Iterieren durch die aufbereiteten Ground-Truth-Daten
for report_name, queries_dict in section_text_dict.items():
    token_counts[report_name] = {}
    print(f"\n--- Analysiere Report: {report_name} ---")
    for query_key, disclosure_text in queries_dict.items():
        if query_key not in queries or query_key not in assessments:
            continue

        # Den vollständigen Prompt erstellen, genau wie er an die LLM gesendet wird
        current_prompt = esrs_assessment_prompt.format(
            question=queries[query_key],
            requirements=assessments[query_key],
            disclosure=disclosure_text
        )

        # Tokens zählen
        num_tokens = len(tiktoken_encoder.encode(current_prompt))
        token_counts[report_name][query_key] = num_tokens

        print(f"Query '{query_key}': {num_tokens} Tokens")


--- Analysiere Report: ContinentalAG_2024 ---
Query 'S1_A': 13778 Tokens
Query 'S1_B': 9891 Tokens
Query 'S1_C': 12231 Tokens
Query 'S1_D': 11367 Tokens
Query 'S1_E': 2422 Tokens
Query 'S1_F': 6977 Tokens
Query 'S1_G': 8128 Tokens

--- Analysiere Report: SchneiderElectric_2024 ---
Query 'S1_A': 9861 Tokens
Query 'S1_B': 12255 Tokens
Query 'S1_C': 12106 Tokens
Query 'S1_D': 8907 Tokens
Query 'S1_E': 6498 Tokens
Query 'S1_F': 5921 Tokens
Query 'S1_G': 1956 Tokens

--- Analysiere Report: Philips_2024 ---
Query 'S1_A': 7286 Tokens
Query 'S1_B': 5518 Tokens
Query 'S1_C': 8841 Tokens
Query 'S1_D': 7255 Tokens
Query 'S1_E': 3591 Tokens
Query 'S1_F': 3879 Tokens
Query 'S1_G': 971 Tokens


In [90]:
### adjustments needed: num_docs delete

async def _run_assessment(report, section_text_dict, mode="conformity_score"):
    assert mode in ["conformity_score", "disclosure_degree"], "Invalid mode"

    esrs_prompt = PromptTemplate(
        template=prompts[mode],
        input_variables=["question", "requirements", "disclosure"] if mode == "conformity_score" else ["question", "disclosure"]
    )

    esrs_questions = {k: v for k, v in queries.items() if 'S1' in k}
    assessments = {}
    messages = []
    keys = []

    for k in esrs_questions.keys():
        disclosure_text = _docs_to_string(section_text_dict[report][k], with_source=False)
        requirements = assessments[k] if mode == "conformity_score" and k in assessments else ""

        prompt = esrs_prompt.format(
            question=queries[k],
            requirements=requirements,
            disclosure=disclosure_text
        )

        if "turbo" in llm_name:
            message = [
                SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content=prompt)
            ]
        else:
            message = prompt

        keys.append(k)
        messages.append(message)

    llm = ChatOpenAI(temperature=0, max_tokens=512)
    outputs = await llm.agenerate(messages)
    output_texts = {k: g[0].text for k, g in zip(keys, outputs.generations)}

    for k, text in output_texts.items():
        try:
            assessments[k] = json.loads(text)
            if 'SCORE' not in assessments[k] or 'ANALYSIS' not in assessments[k]:
                raise ValueError("Missing keys in JSON.")
        except:
            assessments[k] = {
                'SCORE': _find_score(text),
                'ANALYSIS': _find_answer(text, name="ANALYSIS")
            }

        analysis_text = remove_brackets(assessments[k]['ANALYSIS'])
        analysis_text = analysis_text.replace("<CRITICAL_ELEMENT>", "ESRS recommendation point")
        analysis_text = analysis_text.replace("<DISCLOSURE>", "report's disclosure")
        analysis_text = analysis_text.replace("<REQUIREMENTS>", "ESRS guidelines")
        assessments[k]['ANALYSIS'] = analysis_text

    return assessments

In [91]:
### combining analysis of disclosure degree and conformity score together
async def analyze_reports(report_list, section_text_dict):
    results = []

    for report_index, report in enumerate(report_list):
        print(f"\n--- Processing report: {report} ---")

        result_dict = {"report": report}

        # ----- DISCLOSURE DEGREE -----
        try:
            disclosure_results = await _run_assessment(report, section_text_dict, mode="disclosure_degree")
            result_dict["disclosure_degree"] = disclosure_results
        except Exception as e:
            print(f"Error in disclosure degree for {report}: {e}")
            result_dict["disclosure_degree"] = {}


        # ----- CONFORMITY SCORE -----
        try:
            conformity_results = await _run_assessment(report, section_text_dict, mode="conformity_score")
            result_dict["conformity_score"] = conformity_results
        except Exception as e:
            print(f"Error in conformity score for {report}: {e}")
            result_dict["conformity_score"] = {}

        results.append(result_dict)

    return results

In [92]:
start_time = time.time()
analysis_results = await analyze_reports(sample_list, section_text_dict)
print(f"Generation time: {(time.time() - start_time) / 60:.2f} minutes")
evaluate_llm_scores(llm_results=analysis_results, validation_df=gen_validation_set)


--- Processing report: ContinentalAG_2024 ---

--- Processing report: SchneiderElectric_2024 ---

--- Processing report: Philips_2024 ---
Generation time: 0.34 minutes


{'disclosure_degree': {'MAE': 6.43, 'RMSE': 8.66, 'Pearson_Correlation': 0.16},
 'conformity_score': {'MAE': 12.62,
  'RMSE': 17.35,
  'Pearson_Correlation': 0.34}}

#### 2. Llama 3.1 8B
- context length: 128k
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

In [16]:
### loading the model
model_id = "meta-llama/Llama-3.1-8B-Instruct"

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left" # important for batching since Llma is a decoder-only architecture

# to reduce memory usage and speed up performance
quantization_config = BitsAndBytesConfig(load_in_4bit=True, # maximizing speed and minimizing memory
                                         bnb_4bit_compute_dtype=torch.bfloat16, # computations in bfloat16
                                         bnb_4bit_use_double_quant=True,
                                         bnb_4bit_quant_type= "nf4"
                                         )

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded on cuda:0


In [17]:
generate_text = transformers.pipeline(
    model=model, 
    tokenizer=tokenizer,
    return_full_text=False,
    task='text-generation',
    # we pass model parameters here too
    temperature=0.1,  # 'randomness' of outputs, 0.0 is the min and 1.0 the 
    max_new_tokens=512,  # max number of tokens to generate in the output
    # repetition_penalty=1.1  # without this output begins repeating
)
llm = HuggingFacePipeline(pipeline=generate_text)

Device set to use cuda:0


In [18]:
def run_assessment(report, section_text_dict, llm, tokenizer=tokenizer, mode="conformity_score"):
    assert mode in ["conformity_score", "disclosure_degree"], "Invalid mode"

    # Define the LangChain prompt template
    input_vars = ["question", "disclosure"]
    if mode == "conformity_score":
        input_vars.append("requirements")
    esrs_prompt = PromptTemplate(template=prompts[mode], input_variables=input_vars)

    esrs_questions = {k: v for k, v in queries.items() if 'S1' in k}
    
    batch_of_message_lists = []
    keys_for_batch = []

    for k in esrs_questions.keys():
        disclosure_text = _docs_to_string(section_text_dict[report][k], with_source=False)
        
        prompt_data = {"question": queries[k], "disclosure": disclosure_text}
        if mode == "conformity_score":
            prompt_data["requirements"] = assessments.get(k, "")
            
        user_prompt = esrs_prompt.format(**prompt_data)
        
        # Create the message structure that apply_chat_template expects
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt}
        ]
        
        batch_of_message_lists.append(messages)
        keys_for_batch.append(k)
    

    prompts_for_batch = tokenizer.apply_chat_template(
        batch_of_message_lists,
        tokenize=False, # We want string outputs to pass to the pipeline
        add_generation_prompt=True # Ensures the final 'assistant' token is added
    )

    # Generate answers
    outputs = llm.generate(prompts_for_batch)
    # Parse the output
    output_texts = {k: g[0].text for k, g in zip(keys_for_batch, outputs.generations)}
    
    final_assessments = {}
    for k, text in output_texts.items():
        try:
            # Llama models are good at JSON, but fallback parsing is still wise
            parsed_json = json.loads(text)
            if 'SCORE' not in parsed_json or 'ANALYSIS' not in parsed_json:
                raise ValueError("Missing 'SCORE' or 'ANALYSIS' in JSON response.")
            final_assessments[k] = parsed_json
        except (json.JSONDecodeError, ValueError):
            # Fallback to regex parsing if JSON fails
            final_assessments[k] = {
                'SCORE': _find_score(text),
                'ANALYSIS': _find_answer(text, name="ANALYSIS")
            }

        # Post-processing text is unchanged
        analysis_text = remove_brackets(final_assessments[k].get('ANALYSIS', ''))
        analysis_text = analysis_text.replace("<CRITICAL_ELEMENT>", "ESRS recommendation point")
        analysis_text = analysis_text.replace("<DISCLOSURE>", "report's disclosure")
        analysis_text = analysis_text.replace("<REQUIREMENTS>", "ESRS guidelines")
        final_assessments[k]['ANALYSIS'] = analysis_text

    return final_assessments

In [19]:
def process_single_report(report, section_text_dict, llm):
    print(f"--- Processing report: {report} ---")
    result_dict = {"report": report}

    # ----- DISCLOSURE DEGREE -----
    try:
        disclosure_results = run_assessment(report, section_text_dict, llm, mode="disclosure_degree")
        result_dict["disclosure_degree"] = disclosure_results
    except Exception as e:
        print(f"Error in disclosure degree for {report}: {e}")
        result_dict["disclosure_degree"] = {"error": str(e)}

    # ----- CONFORMITY SCORE -----
    try:
        conformity_results = run_assessment(report, section_text_dict, llm, mode="conformity_score")
        result_dict["conformity_score"] = conformity_results
    except Exception as e:
        print(f"Error in conformity score for {report}: {e}")
        result_dict["conformity_score"] = {"error": str(e)}
        
    return result_dict


def run_analysis_pipeline(report_list, section_text_dict, llm, max_workers=3):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Create a future for each report processing task
        future_to_report = {
            executor.submit(process_single_report, report, section_text_dict, llm): report 
            for report in report_list
        }
        
        for future in future_to_report:
            try:
                result = future.result()
                results.append(result)
            except Exception as exc:
                report_name = future_to_report[future]
                print(f"Report {report_name} generated an exception: {exc}")
                results.append({"report": report_name, "error": str(exc)})
    
    return results

In [20]:
prompts=PROMPTS
queries=QUERIES
assessments=ESRS_ASSESSMENT

In [22]:
start_time = time.time()
final_results = run_analysis_pipeline(sample_list, section_text_dict, llm, max_workers=3)
print(f"Computation time: {(time.time() - start_time) / 60:.2f} minutes")

--- Processing report: ContinentalAG_2024 ---
--- Processing report: SchneiderElectric_2024 ---
--- Processing report: Philips_2024 ---
Computation time: 8.91 minutes


In [23]:
final_results

[{'report': 'ContinentalAG_2024',
  'disclosure_degree': {'S1_A': {'ANALYSIS': "Continental AG's annual report provides a comprehensive overview of its management approaches and policies related to its own workforce. The company identifies several material impacts, risks, and opportunities, including labor standards, adequate wages, work-life balance, training and skill development, secure employment, and social dialogue. Continental has established various channels for engaging with its workforce, including grievance mechanisms, employee surveys, and training programs. The company also provides details on its targets related to diversity and gender equality, as well as its efforts to promote equal treatment and prevent discrimination. However, some information, such as the effectiveness of its management approaches and the outcomes of its grievance mechanisms, is not fully disclosed.",
    'SCORE': 80},
   'S1_B': {'ANALYSIS': "Continental AG's annual report provides a comprehensive o

In [42]:
evaluate_llm_scores(llm_results=final_results, validation_df=gen_validation_set)

{'disclosure_degree': {'MAE': 10.24,
  'RMSE': 12.39,
  'Pearson_Correlation': 0.27},
 'conformity_score': {'MAE': 19.43,
  'RMSE': 24.61,
  'Pearson_Correlation': -0.15}}

In [29]:
evaluate_llm_scores(llm_results=final_results, validation_df=gen_validation_set)

{'disclosure_degree': {'MAE': 10.71,
  'RMSE': 13.23,
  'Pearson_Correlation': 0.12},
 'conformity_score': {'MAE': 18.0, 'RMSE': 22.92, 'Pearson_Correlation': -0.2}}

### E) Evaluate PDF Text Extraction
- OCR (Optical Character Recognition) with MyPuPDF: This method involves utilizing open source 3rd party technology (Tesseract) to scan the page for images and to convert that imagery into text. Imagine PDFs which contain screenshots of information, these will just be identified as “image” within the PDF, but somehow we want machine-readable text. This method uses PyMuPDF’s Page.get_textpage_ocr()function to take on the heavy lifting. https://medium.com/@pymupdf/text-extraction-strategies-with-pymupdf-dd0ef2461847 
    - Limitations: Slower processing times, Accuracy depends on image quality, Higher computational and memory requirements, May introduce errors in recognition
- Docling OCR
- spacy Layout

### F) Generation prompts
- incl. metadata about the company
  As a senior equity analyst with expertise in climate science evaluating a company's sustainability report, you are presented with the following essential information about the report:

{basic_info}

With the above information and the following extracted components (which may have incomplete sentences at the beginnings and the ends) of the sustainability report at hand, please respond to the posed question. 
Your answer should be precise, comprehensive, and substantiated by direct extractions from the report to establish its credibility.
If you don't know the answer, just say that you don't know. Don't try to make up an answer.