# Evaluate the Performance of different RAG system settings

Purpose of the notebook: evaluate the performance of different RAG settings to find the configuration that works best for this use case.

Evaluate:
- Embedding models
- Retrieval settings
- Retrieval queries
- Generative LLM models
- PDF Extraction
- Generation prompts

# Preparation

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import re
import urllib3
import tenacity
import configparser
import markdown
import json

# for PDF extraction
import pymupdf
import requests
import os
import io


# for retrieval
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_core.output_parsers import PydanticOutputParser

# for generation
from langchain_openai import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)
from langchain.prompts import PromptTemplate
from langchain_core.prompts import ChatPromptTemplate
import tiktoken

# old evaluation
import nltk
from nltk.tokenize import sent_tokenize
from thefuzz import process, fuzz

In [6]:
# import OpenAI API key from environment variable
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Sample reports
Source: Sustainability Reporting Navigator (crowd-sourced list of CSRD-compliant reports for fiscal years starting on 01/01/2024)

The CSV with information on all reports was downloaded on 08/04/2025 from https://www.sustainabilityreportingnavigator.com/#/csrdreports

In [7]:
# Open the csv data file
reports_24 = pd.read_csv('esg_reports_2024.csv')
print(len(reports_24))

277


In [8]:
# randomly select 3 reports from 2024
sample = reports_24.sample(n=3, random_state=3)
sample.head()

| Unnamed: 0.1 | Unnamed: 0 | company_withAccessInfo | link | country | sector | industry | publication date | pages PDF | auditor |
|---|---|---|---|---|---|---|---|---|---|
| 220 | 19 | Continental AG | https://annualreport.continental.com/2024/en/s... | Germany | Transportation | Auto Parts | 2025-03-18 | 125 | PwC |
| 58 | 266 | Schneider Electric* | https://www.se.com/ww/en/assets/564/document/5... | France | Resource Transformation | Electrical & Electronic Equipment | 2025-03-26 | 186 | PwC & Mazars |
| 89 | 103 | Philips | https://www.results.philips.com/publications/a... | Netherlands | Infrastructure | Electric Utilities & Power Generators | 2025-02-21 | 85 | EY |


In [9]:
sample_list = ['ContinentalAG_2024', 'SchneiderElectric_2024', 'Philips_2024']
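
These names follow the naming used in the processing loop further down: `prepare_filename` strips characters that are invalid in file names, `_2024` is appended, and spaces are removed. A minimal sketch that reproduces the three entries from the company names in the sample; the wrapper `report_id` is hypothetical and only bundles the steps used in the loop:

```python
import re

def prepare_filename(name):
    # Same rule as the notebook's helper: drop characters that are invalid in file names.
    return re.sub(r'[\\/*?:"<>|]', "", name)

def report_id(company, year=2024):
    # Hypothetical wrapper: append the fiscal year and remove spaces.
    return f"{prepare_filename(company)}_{year}".replace(" ", "")

companies = ["Continental AG", "Schneider Electric*", "Philips"]
derived = [report_id(c) for c in companies]
print(derived)
```

Deriving the list this way keeps it in sync with the file names used for the PDFs and the FAISS indexes.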

In [10]:
# Read in the hand-coded validation set (based on the sample reports)
validation_set = pd.read_excel('validation_dataset.xlsx')
print(validation_set.head())

              report_name query  \
0      ContinentalAG_2024  S1_E   
1  SchneiderElectric_2024  S1_E   
2      ContinentalAG_2024  S1_D   
3      ContinentalAG_2024  S1_A   
4  SchneiderElectric_2024  S1_F   

                                                text page_number  
0  In accordance with Section 76 (4) AktG, the Ex...          27  
1  Our 2025 sustainability commitments\nWith less...          33  
2  The globally applicable Code of\nConduct provi...          41  
3  The globally applicable Code of\nConduct provi...          41  
4  Our Speak Up Mindset\nSchneider Electric emplo...          41  


In [11]:
# check for consistency in the validation set
print(validation_set.groupby(['query', 'report_name' ]).size())

query  report_name           
S1_A   ContinentalAG_2024        22
       Philips_2024              13
       SchneiderElectric_2024    24
S1_B   ContinentalAG_2024        14
       Philips_2024               4
       SchneiderElectric_2024     4
S1_C   ContinentalAG_2024        19
       Philips_2024              13
       SchneiderElectric_2024    18
S1_D   ContinentalAG_2024        23
       Philips_2024              13
       SchneiderElectric_2024    10
S1_E   ContinentalAG_2024        12
       Philips_2024              11
       SchneiderElectric_2024    24
S1_F   ContinentalAG_2024        15
       Philips_2024               6
       SchneiderElectric_2024    23
S1_G   ContinentalAG_2024         5
       Philips_2024               1
       SchneiderElectric_2024     8
dtype: int64


# 0: Baseline approach
Based on Ni et al. (2023) and Colesanti Senni et al. (2025):
- Ni, J., Bingler, J., Colesanti-Senni, C., Kraus, M., Gostlow, G., Schimanski, T., Stammbach, D., Vaghefi, S. A., Wang, Q., Webersinke, N., Wekhof, T., Yu, T., & Leippold, M. (2023). CHATREPORT: Democratizing Sustainability Disclosure Analysis through LLM-based Tools. Swiss Finance Institute Research Paper, No. 23-111. https://doi.org/10.48550/arXiv.2307.15770
- Colesanti Senni, C., Schimanski, T., Bingler, J., Ni, J., & Leippold, M. (2025). Using AI to assess corporate climate transition disclosures. Environmental Research Communications, 7(2), 021010. https://doi.org/10.1088/2515-7620/ad9e88

With the configuration:
- PDF text extraction: PyMuPDF
- Retrieval:
    - Embedding model: OpenAI text-embedding-ada-002
    - top k: 8
    - chunk size: 350
    - chunk overlap: 50
- Generation:
    - Generative LLM: OpenAI o4-mini-2025-04-16
    - answer_length: 200
    - temperature: 0
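
To make the chunk size and overlap settings concrete, the sketch below splits a string into fixed-size character windows in which consecutive windows share `chunk_overlap` characters. This is a simplified illustration only: `RecursiveCharacterTextSplitter` additionally prefers to break at separators such as paragraph and line breaks, so its chunks are usually shorter than the limit. The function `sliding_chunks` is a hypothetical helper and not part of LangChain.

```python
def sliding_chunks(text, chunk_size=350, chunk_overlap=50):
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows overlap by chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Stand-in for the text of one page: 1000 characters.
page_text = "".join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(page_text)
print([len(c) for c in chunks])
```

With these settings each new chunk starts 300 characters after the previous one, so a sentence that straddles a chunk boundary has a chance of appearing whole in one of the two chunks.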

In [None]:
# retrieval settings
TOP_K = 8
CHUNK_SIZE = 350
CHUNK_OVERLAP = 50

# generation settings
llm_name = 'o4-mini-2025-04-16'
max_token = 200


QUERIES = {
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities arising from the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What are the company’s human rights practices, risks and incidents related to the own workforce?",
    'S1_D': "What are the company’s processes and policies for engaging with own workers and workers’ representatives about impacts?",
    'S1_E': "What are the company’s policies on non-discrimination, diversity and inclusion in the own workforce?",
    'S1_F': "What are the company’s processes, policies and approaches to remediate negative impacts and channels for own workers to raise concerns?",
    'S1_G': "How is the company’s workforce social protection coverage?",
}

In [None]:
### HELPER FUNCTIONS ###

# preparing filenames
def prepare_filename(name):
    return re.sub(r'[\\/*?:"<>|]', "", name)

# for the generation
def remove_brackets(string):
    return re.sub(r'\([^)]*\)', '', string).strip()

def _docs_to_string(docs, with_source=True):
# def _docs_to_string(docs, num_docs=TOP_K, with_source=True):
    output = ""
    # docs = docs[:num_docs]
    for doc in docs:
        output += "Content: {}\n".format(doc.page_content)
        if with_source:
            output += "Source: {}\n".format(doc.metadata['source'])
        output += "\n---\n"
    return output

def _find_answer(string, name="ANSWER"):
    for l in string.split('\n'):
        if name in l:
            start = l.find(":") + 3
            end = len(l) - 1
            return l[start:end]
    return string

def _find_sources(string):
    pattern = r'\d+'
    numbers = [int(n) for n in re.findall(pattern, string)]
    return numbers

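def _find_float_numbers(string):
    # Non-capturing group for the exponent: with a capturing group,
    # re.findall would return only the exponent part instead of the full number.
    pattern = r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?"
    float_numbers = [float(n) for n in re.findall(pattern, string)]
    return float_numbers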

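def _find_score(string):
    # Return the first number on the first line containing "SCORE" (as a string),
    # or None if there is no such line or it contains no number.
    for l in string.split('\n'):
        if "SCORE" in l:
            d = re.search(r'[-+]?\d*\.?\d+', l)
            return d[0] if d else None
    return None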

In [14]:
### RETRIEVAL ###

# 1. Load the PDF
def load_pdf(path=None, url=None):
    assert (path is not None) != (url is not None), "Either path or url must be provided"
    
    if path:
        return pymupdf.open(path)
    else:
        response = requests.get(url)
        response.raise_for_status()  # fail early if the download did not succeed
        pdf_bytes = io.BytesIO(response.content)
        return pymupdf.open(stream=pdf_bytes, filetype='pdf')
    
# 2. Extract text from the PDF
def extract_text(pdf):
    text_list = [page.get_text() for page in pdf]
    all_text = ''.join(text_list)
    return text_list, all_text

# 4. Create or Load Vector Store
def get_retriever(pdf, db_path, top_k=TOP_K, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
    embeddings = OpenAIEmbeddings()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " "],
    )

    chunks = []
    page_idx = []

    for i, page in enumerate(pdf):
        page_chunks = text_splitter.split_text(page.get_text())
        page_idx.extend([i + 1] * len(page_chunks))
        chunks.extend(page_chunks)

    if os.path.exists(db_path):
        doc_search = FAISS.load_local(db_path, embeddings=embeddings, allow_dangerous_deserialization=True)
    else:
        doc_search = FAISS.from_texts(
            chunks,
            embeddings,
            metadatas=[{"source": str(i), "page": str(idx)} for i, idx in enumerate(page_idx)]
        )
        doc_search.save_local(db_path)

    retriever = doc_search.as_retriever(search_kwargs={"k": top_k})
    return retriever, doc_search

# 5. Retrieve relevant chunks
def retrieve_chunks(retriever, queries):
    section_text_dict = {}

    for key, prompts in queries.items():
        if key == 'general' and isinstance(prompts, list):
            combined_docs = []
            for prompt in prompts:
                combined_docs.extend(retriever.invoke(prompt)[:5])
            section_text_dict[key] = combined_docs
        else:
            section_text_dict[key] = retriever.invoke(prompts)
    
    return section_text_dict
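
Note that `get_retriever` loads whatever index already exists at `db_path`, so an index built with one chunk size would be reused silently for a run with another. A minimal sketch of one way to avoid this is to encode the settings in the path. The helper `index_path` and the directory layout are assumptions for illustration and are not part of the notebook.

```python
import os

def index_path(base_dir, report, embedding, chunk_size, chunk_overlap):
    # Hypothetical layout: one directory per embedding model and chunking setup.
    config = f"{embedding}_cs{chunk_size}_co{chunk_overlap}"
    return os.path.join(base_dir, config, report)

baseline = index_path("./faiss_db", "Philips_2024", "OpenAI", 350, 50)
variant = index_path("./faiss_db", "Philips_2024", "OpenAI", 500, 20)
print(baseline)
print(variant)
```

With distinct paths, changing the chunk size or the embedding model triggers a fresh index instead of loading a stale one.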

In [None]:
# Apply the retrieval functions to the 3 sampled reports
all_results = {}

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
        DB_PATH = f"./faiss_db_OpenAI_0/{filename}"
        retriever, doc_search = get_retriever(pdf, db_path=DB_PATH)
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

# save results
results_df = pd.DataFrame.from_dict(all_results, orient='index')
results_df.to_csv('retrieval_results_0.csv', index=True)


Processing: ContinentalAG_2024

Processing: SchneiderElectric_2024

Processing: Philips_2024


*Resources needed to embed the three reports*
- Time: 1.5 min (about 50 min for 100 reports)
- Cost: $0.2 (about $7 for 100 reports)
- No GPU needed
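
The figures for 100 reports follow from a linear extrapolation of the three-report measurement, under the assumption that the other reports are of similar length:

```python
measured_reports = 3
measured_minutes = 1.5
measured_cost_usd = 0.2

def extrapolate(value, n_reports, measured=measured_reports):
    # Linear scaling: per-report average times the target number of reports.
    return value / measured * n_reports

minutes_100 = extrapolate(measured_minutes, 100)
cost_100 = extrapolate(measured_cost_usd, 100)
print(f"{minutes_100:.0f} min, ${cost_100:.2f}")
```

Reports in the sample range from 85 to 186 pages, so the actual time and cost for a larger set depend on its page counts.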

In [None]:
### Generation ###

### 1. Define the Structured Output with Pydantic
from pydantic import BaseModel, Field
from typing import List

class ReportAnalysis(BaseModel):
    """Pydantic model for the structured output of the report analysis."""
    answer: str = Field(description='The full analysis, starting with "[[YES]]" or "[[NO]]", followed by a detailed explanation. Max 150 words.')
    sources: List[int] = Field(description="A list of the integer source numbers referenced in the answer.")


### 2. Create Langchain Prompt Template
# 1. Initialize the Pydantic parser based on our model
pydantic_parser = PydanticOutputParser(pydantic_object=ReportAnalysis)

# 2. Define the prompt template
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an AI assistant in the role of a Senior Equity Analyst with expertise in sustainability reporting that analyzes companies' annual reports."),
        ("human", """
You are a senior sustainability analyst with expertise in sustainability reporting evaluating a company's annual report.

You are presented with the following sources from the company's report. Each source is numbered.
--------------------- [BEGIN OF SOURCES]
{context}
--------------------- [END OF SOURCES]

Given the sources information and no prior knowledge, your main task is to respond to the posed question encapsulated in "||".
Question: ||{question}||

Please consider the following additional explanation to the question:
+++++ [BEGIN OF EXPLANATION]
{explanation}
+++++ [END OF EXPLANATION]

Please adhere to the following guidelines in your answer:
1. Your response must be precise and grounded on specific extracts from the report.
2. If you are unsure, simply acknowledge it.
3. Keep your answer within 150 words.
4. Be skeptical and critical of the information disclosed.
5. Identify and be critical of any "cheap talk" (costless, unverifiable statements).
6. Scrutinize whether the report provides quantifiable data versus vague statements.
7. Start your answer with "[[YES]]" or "[[NO]]", followed by a short, informative explanation.

{format_instructions}
""")
    ]
)

# 3. Inject the auto-generated formatting instructions into the prompt
partial_prompt = prompt_template.partial(
    format_instructions=pydantic_parser.get_format_instructions()
)
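
Guideline 7 asks the model to open its answer with `[[YES]]` or `[[NO]]`. A minimal sketch of how that verdict could be read from the `answer` field afterwards; `parse_verdict` is a hypothetical helper and the example strings are invented for illustration:

```python
import re

def parse_verdict(answer):
    """Return True for [[YES]], False for [[NO]], None if the marker is missing."""
    match = re.match(r'\s*\[\[(YES|NO)\]\]', answer)
    if match is None:
        return None
    return match.group(1) == "YES"

print(parse_verdict("[[YES]] The report describes a grievance channel."))
print(parse_verdict("[[NO]] No quantitative data is disclosed."))
print(parse_verdict("The report is unclear."))
```

Returning `None` for a missing marker makes it possible to count responses that ignore the format instead of treating them as a "no".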


### 3. Core 

# A) Embedding models

# B) Retrieval settings

# C) Retrieval queries

# D) Generative LLM Models

# E) PDF Extraction

# F) Generation prompts
- Include "basic information" in the prompt:
This is basic information about the company:
{basic_info}

# Evaluation: System solely based on Ni et al. 2023
- Goal: the LLM directly outputs a conformity score in [0, 100] for each of the overarching themes S1_A to S1_G
- Hand-coded a ground-truth dataset of the passages to be retrieved for every indicator in the three sample reports

- PDF text extraction: PyMuPDF
- Retrieval:
    - Embedding model: OpenAI text-embedding-ada-002
    - top k: 20
    - chunk size: 500
    - chunk overlap: 20

In [25]:
validation_set = pd.read_excel('old_validation_dataset.xlsx', sheet_name='S1_Retrieval')
print(validation_set.head())

              report_name query  \
0      ContinentalAG_2024  S1_E   
1  SchneiderElectric_2024  S1_E   
2      ContinentalAG_2024  S1_D   
3      ContinentalAG_2024  S1_A   
4  SchneiderElectric_2024  S1_F   

                                                text page_number  
0  In accordance with Section 76 (4) AktG, the Ex...          27  
1  Our 2025 sustainability commitments\nWith less...          33  
2  The globally applicable Code of\nConduct provi...          41  
3  The globally applicable Code of\nConduct provi...          41  
4  Our Speak Up Mindset\nSchneider Electric emplo...          41  


In [26]:
print(validation_set.groupby(['query', 'report_name' ]).size())

query  report_name           
S1_A   ContinentalAG_2024        22
       Philips_2024              13
       SchneiderElectric_2024    24
S1_B   ContinentalAG_2024        14
       Philips_2024               4
       SchneiderElectric_2024     4
S1_C   ContinentalAG_2024        19
       Philips_2024              13
       SchneiderElectric_2024    18
S1_D   ContinentalAG_2024        23
       Philips_2024              13
       SchneiderElectric_2024    10
S1_E   ContinentalAG_2024        12
       Philips_2024              11
       SchneiderElectric_2024    24
S1_F   ContinentalAG_2024        15
       Philips_2024               6
       SchneiderElectric_2024    23
S1_G   ContinentalAG_2024         5
       Philips_2024               1
       SchneiderElectric_2024     8
dtype: int64


## Evaluate retrieval performance

1. Approach: exact sentence matching

In [28]:
ground_truth_sentences = set(['Equal pay /nFair and equitable pay is a core component of the Group’s compensation philosophy.',
                            'It is in line with the principle of equal pay for equal work.',
                            '100 % of Schneider are paid /nat least a living wage, which /nwas recognized for the /nsecond consecutive year by /nthe Living Wage Employer /nCertification from Fair /nWage Network.'])
retrieved_sentences = set(['Fair and equitable pay is a core component of the Group’s compensation philosophy.',
                           'It is in line with the principle of equal pay for equal work.',
                           '100 % of Schneider are paid /nat least a living wage, which /nwas recognized for the /nsecond consecutive year'])

found_ground_truth_sentences = set()
for sentence in retrieved_sentences:
    if sentence in ground_truth_sentences:
        found_ground_truth_sentences.add(sentence)

print(found_ground_truth_sentences)

{'It is in line with the principle of equal pay for equal work.'}


This approach is very restrictive and leads to lower scores whenever chunking cuts a sentence short, even though the LLM could still understand the meaning. In the example above, only one of the three retrieved sentences is counted as found.

2. Approach: fuzzy string matching

In [30]:
found_matches = 0
for sentence in retrieved_sentences:
    best_match, score = process.extractOne(sentence, ground_truth_sentences, scorer=fuzz.ratio)
    if score > 80:
        found_matches += 1
        print(f"Score: {score}, Retrieved sentence: '{sentence}', Found match: '{best_match}'")

Score: 93, Retrieved sentence: 'Fair and equitable pay is a core component of the Group’s compensation philosophy.', Found match: 'Equal pay /nFair and equitable pay is a core component of the Group’s compensation philosophy.'
Score: 100, Retrieved sentence: 'It is in line with the principle of equal pay for equal work.', Found match: 'It is in line with the principle of equal pay for equal work.'


In [34]:
def evaluate_retrieval_sentence_level(retrieved_docs, ground_truth_texts):
    """
    Evaluates retrieval performance on a sentence level using fuzzy string matching.
    A retrieved sentence counts as a match if its best fuzz.ratio score against
    the ground-truth sentences is at least 80.

    Args:
        retrieved_docs (list): A list of Document objects retrieved by LangChain.
        ground_truth_texts (list): A list of ground-truth text snippets from the validation set.

    Returns:
        dict: A dictionary containing precision, recall, and f1-score.
    """
    # 1. Extract all ground-truth and retrieved sentences
    all_ground_truth_sentences = set()
    for text in ground_truth_texts:
        sentences = sent_tokenize(text)
        all_ground_truth_sentences.update([s.strip() for s in sentences if s.strip()])

    if not all_ground_truth_sentences:
        return {"precision": None, "recall": None, "f1": None}

    all_retrieved_sentences = set()
    for doc in retrieved_docs:
        chunk_sentences = sent_tokenize(doc.page_content)
        all_retrieved_sentences.update([s.strip() for s in chunk_sentences if s.strip()])

    if not all_retrieved_sentences:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    
    # 2. For each retrieved sentence, find its best match in the ground truth sentences.
    found_matches = 0
    for retrieved_sentence in all_retrieved_sentences:
        # process.extractOne finds the best matching string from a collection.
        # It returns a tuple: (best_match_string, score)
        best_match, score = process.extractOne(
            retrieved_sentence, 
            all_ground_truth_sentences, 
            scorer=fuzz.ratio
        )
        
        # If the best match has a score above our threshold, we count it as a successful find.
        if score >= 80:
            found_matches += 1

    # 3. Calculate metrics based on the fuzzy matches.
    true_positives = found_matches
    
    # Precision = (Relevant sentences found) / (Total sentences retrieved)
    precision = true_positives / len(all_retrieved_sentences) if all_retrieved_sentences else 0.0
    
    # Recall = (Relevant sentences found) / (Total relevant sentences that exist)
    recall = true_positives / len(all_ground_truth_sentences) if all_ground_truth_sentences else 0.0
    
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


def evaluate_retrieval(retrieval_results):
    """
    Evaluates retrieval results against the global validation_set and prints a summary.

    Args:
        retrieval_results (dict): Maps each report name to a dict that maps each
            query key to the list of Document objects retrieved for it.

    Returns:
        None. The mean precision, recall, and f1-score per query are printed.
    """
    evaluation_results_all = []
    
    for report_name, queries_results in retrieval_results.items():
        for query_key, retrieved_documents in queries_results.items():

            # Get the corresponding ground-truth texts from the validation set
            gt_texts = validation_set[
                (validation_set['report_name'] == report_name) & 
                (validation_set['query'] == query_key)
            ]['text'].tolist()

            if not gt_texts:
                continue

            # Evaluate the retrieval performance
            scores = evaluate_retrieval_sentence_level(
                retrieved_docs=retrieved_documents,
                ground_truth_texts=gt_texts
            )

            evaluation_results_all.append({
                "report_name": report_name,
                "query": query_key,
                **scores
            })

    evaluation_results_mean = pd.DataFrame(evaluation_results_all).groupby(['query'])[['precision', 'recall', 'f1']].mean().reset_index()
    overall_mean = evaluation_results_mean[['precision', 'recall', 'f1']].mean()
    evaluation_results_mean.loc['Overall Mean'] = overall_mean

    print("--- Retrieval Performance Summary ---")
    print(evaluation_results_mean.round(3))
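
In `evaluate_retrieval_sentence_level`, the same count of matched retrieved sentences serves as the numerator for both precision and recall. If several retrieved sentences match the same ground-truth sentence, recall is overstated. The sketch below shows a variant that counts matches from each side separately. To stay self-contained it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so its scores will not be identical to those of thefuzz. The function names and example sentences are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in for fuzz.ratio: similarity on a 0-100 scale.
    return 100 * SequenceMatcher(None, a, b).ratio()

def has_match(sentence, candidates, threshold=80):
    # True if at least one candidate is similar enough to the sentence.
    return any(similarity(sentence, c) >= threshold for c in candidates)

def precision_recall(retrieved, ground_truth, threshold=80):
    if not retrieved or not ground_truth:
        return 0.0, 0.0
    # Precision: share of retrieved sentences that match some ground-truth sentence.
    hits_retrieved = sum(has_match(s, ground_truth, threshold) for s in retrieved)
    # Recall: share of ground-truth sentences covered by some retrieved sentence.
    hits_truth = sum(has_match(s, retrieved, threshold) for s in ground_truth)
    return hits_retrieved / len(retrieved), hits_truth / len(ground_truth)

ground_truth = {"Equal pay is a core principle.", "All employees earn a living wage."}
# Two near-duplicate retrieved sentences that both match the same ground-truth sentence.
retrieved = {"Equal pay is a core principle.", "Equal pay is a core principle"}
precision, recall = precision_recall(retrieved, ground_truth)
print(precision, recall)
```

In this example only one of the two ground-truth sentences is covered, so recall is 0.5, whereas dividing the two retrieved-side matches by the two ground-truth sentences would report 1.0.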

In [None]:
# Based on Ni et al. 2023
TOP_K = 20
CHUNK_SIZE = 500
CHUNK_OVERLAP = 20

QUERIES = {
    'S1_A': "How does the company manage and disclose material impacts, risks and opportunities related to the own workforce?",
    'S1_B': "What are the material risks and opportunities arising from the company’s impacts and dependencies on people in its own workforce?",
    'S1_C': "What are the company’s human rights practices, risks and incidents related to the own workforce?",
    'S1_D': "What are the company’s processes and policies for engaging with own workers and workers’ representatives about impacts?",
    'S1_E': "What are the company’s policies on non-discrimination, diversity and inclusion in the own workforce?",
    'S1_F': "What are the company’s processes, policies and approaches to remediate negative impacts and channels for own workers to raise concerns?",
    'S1_G': "How is the company’s workforce social protection coverage?",
}
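
One caveat when changing these constants: `get_retriever` was defined with `top_k=TOP_K`, `chunk_size=CHUNK_SIZE` and `chunk_overlap=CHUNK_OVERLAP` as default arguments, and Python evaluates default values once, when the function is defined. Reassigning the constants afterwards does not change those defaults unless the defining cell is re-run or the values are passed explicitly. A small self-contained demonstration, in which `get_k` is a hypothetical stand-in for `get_retriever`:

```python
TOP_K = 8

def get_k(top_k=TOP_K):
    # The default was fixed to 8 when this function was defined.
    return top_k

TOP_K = 20  # reassigning the global does not update the default

print(get_k())             # uses the old default
print(get_k(top_k=TOP_K))  # passing the value explicitly uses the new setting
```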

In [35]:
# retrieve relevant chunks from the sample reports
all_results = {}

for idx, row in sample.iterrows():
    filename = f"{prepare_filename(row['company_withAccessInfo'])}_2024"
    filename = filename.replace(" ", "")
    print(f"\nProcessing: {filename}")

    try:
        PATH = f"./sample_reports/{filename}.pdf"
        pdf = load_pdf(path=PATH) 
        text_list, all_text = extract_text(pdf)
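        DB_PATH = f"./faiss_db_OpenAI_0/{filename}"
        # Pass the settings explicitly: the defaults of get_retriever were bound when it was defined.
        # Note: an index that already exists at DB_PATH is loaded as-is, regardless of the chunk settings.
        retriever, doc_search = get_retriever(
            pdf, db_path=DB_PATH, top_k=TOP_K, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
        )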
        results = retrieve_chunks(retriever, queries=QUERIES)
        all_results[filename] = results
    
    except Exception as e:
        print(f"Error processing {row['company_withAccessInfo']}: {e}")

evaluate_retrieval(all_results)


Processing: ContinentalAG_2024

Processing: SchneiderElectric_2024

Processing: Philips_2024
--- Retrieval Performance Summary ---
             query  precision  recall     f1
0             S1_A      0.479   0.028  0.052
1             S1_B      0.156   0.088  0.109
2             S1_C      0.216   0.058  0.091
3             S1_D      0.214   0.043  0.072
4             S1_E      0.549   0.135  0.215
5             S1_F      0.212   0.044  0.072
6             S1_G      0.235   0.250  0.235
Overall Mean   NaN      0.295   0.092  0.121


We observe that retrieval recall is very low: on average, only 9% of the sentences that were manually coded as relevant for an indicator appear among the sentences retrieved by the system.
In addition, only about 30% of the retrieved sentences were coded as relevant (precision).
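
The "Overall Mean" row is the unweighted mean of the per-query scores, which is why the overall F1 is not the harmonic mean of the overall precision and recall. The sketch below recomputes the row from the rounded values in the table above, so small deviations in the third decimal are expected:

```python
# Per-query scores copied from the summary table (rounded to three decimals).
precision = [0.479, 0.156, 0.216, 0.214, 0.549, 0.212, 0.235]
recall = [0.028, 0.088, 0.058, 0.043, 0.135, 0.044, 0.250]
f1 = [0.052, 0.109, 0.091, 0.072, 0.215, 0.072, 0.235]

def mean(values):
    return sum(values) / len(values)

macro_precision, macro_recall, macro_f1 = mean(precision), mean(recall), mean(f1)

# For comparison: F1 computed from the averaged precision and recall.
f1_of_means = 2 * macro_precision * macro_recall / (macro_precision + macro_recall)

print(round(macro_precision, 3), round(macro_recall, 3), round(macro_f1, 3))
print(round(f1_of_means, 3))
```

Averaging per query gives each indicator the same weight, regardless of how many ground-truth sentences it has.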