# Financial RAG Agent 

This is a research notebook for building a financial RAG agent. The dataset comes from Kaggle: [Financial Reports QA Dataset for RAG-based LLM Fin](https://www.kaggle.com/datasets/ahmedsta/data-retreiver). It contains financial reports in PDF format from various companies.

The dataset also contains a set of questions and expected answers that can be used to evaluate the performance of the RAG agent.
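
A minimal sketch of how such question/answer pairs could be scored, assuming each record holds an expected answer and the agent's predicted answer. The `normalize` and `exact_match_rate` helpers and the sample records are hypothetical placeholders; the dataset's actual file layout is not assumed here.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation other than dots, collapse whitespace,
    and strip a trailing full stop."""
    text = text.lower()
    text = re.sub(r"[^\w\s.]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(".")


def exact_match_rate(records: list[dict[str, str]]) -> float:
    """Fraction of records whose normalized prediction equals the normalized expected answer."""
    if not records:
        return 0.0
    hits = sum(
        normalize(r["expected"]) == normalize(r["predicted"]) for r in records
    )
    return hits / len(records)


# Made-up records for illustration only; they are not taken from the dataset.
sample = [
    {"expected": "42", "predicted": " 42 "},
    {"expected": "Yes", "predicted": "yes."},
    {"expected": "1,000", "predicted": "2000"},
]

print(f"Exact-match rate: {exact_match_rate(sample):.2f}")
```

Exact match is a strict metric; free-text answers from an LLM would likely need a softer comparison, such as checking whether the expected value appears inside the response.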

In [43]:
import os
from langchain.chat_models import init_chat_model
from dotenv import load_dotenv

# Read environment variables (such as OPENAI_API_KEY) from a local .env file.
load_dotenv()

True

This RAG agent uses an OpenAI model as the underlying LLM, so the embeddings are also computed with an OpenAI model: text-embedding-3-large.

In [44]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [45]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
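
To make the retrieval step less opaque, here is a small sketch of cosine-similarity ranking over embedding vectors, the kind of comparison a vector store performs at query time. The three-dimensional vectors, the chunk ids, and the `top_k` helper are hypothetical stand-ins; real text-embedding-3-large vectors have thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors for illustration only.
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}

ranked = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in ranked])
```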

In [46]:
import os

from langchain_community.document_loaders import PyPDFLoader

data_dir = "./../dataset/Structured data-20250319T105519Z-001/Structured data"

pdf_files = [f for f in os.listdir(data_dir) if f.endswith('.pdf')]

documents = []

for pdf_file in pdf_files:
    file_path = os.path.join(data_dir, pdf_file)
    print(f"Loading {pdf_file}...")
    try:
        loader = PyPDFLoader(file_path)
        pages = loader.load()
        documents.extend(pages)
        print(f"Loaded {len(pages)} pages from {pdf_file}")
    except Exception as e:
        print(f"Error loading {pdf_file}: {e}")

print(f"\nTotal documents loaded: {len(documents)}")

Loading 2021ESG.pdf...
Loaded 116 pages from 2021ESG.pdf
Loading 2022-Absa-Group-limited-Environmental-Social-and-Governance-Data-sheet.pdf...
Loaded 26 pages from 2022-Absa-Group-limited-Environmental-Social-and-Governance-Data-sheet.pdf
Loading Clicks-Sustainability-Report-2022.pdf...
Loaded 21 pages from Clicks-Sustainability-Report-2022.pdf
Loading DISTELL ESG Appendix 2022.pdf...
Loaded 17 pages from DISTELL ESG Appendix 2022.pdf
Loading ESG-spreads.pdf...
Loaded 76 pages from ESG-spreads.pdf
Loading picknpay-esg-report-spreads-2023.pdf...
Loaded 24 pages from picknpay-esg-report-spreads-2023.pdf
Loading SASOL Sustainability Report 2023 20-09_0.pdf...
Loaded 66 pages from SASOL Sustainability Report 2023 20-09_0.pdf

Total documents loaded: 346


In [47]:
print(documents[1].page_content[:500])

PARTNERING IN GROWTH
CONTENTS
02
04
05
01 INTRODUCTION
02 COMPREHENSIVE DATA TABLE
08 ABOUT THIS REPORT
10 ABOUT TONGAAT HULETT
11 GEOGRAPHIC FOOTPRINT
12 OUR BUSINESS MODEL
14 OUR CONTRIBUTION TO SOCIETY
OUR APPROACH
18 APPROACH TO SUSTAINABILITY
19 CLIMATE CHANGE AND SUSTAINABILITY GOVERNANCE 
22 MEDIUM- TERM ESG STRATEGY
23 OUR MOST MATERIAL ISSUES
26 RESPONDING TO THE SUSTAINABLE DEVELOPMENT GOALS
38 UNITED NATIONS GLOBAL COMPACT COMMUNICATION  
ON PROGRESS FOR 2021
HUMAN CAPITAL
42 HUMAN CA


In [48]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

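text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
    add_start_index=True,  # store each chunk's start offset in metadata as "start_index"
)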
all_splits = text_splitter.split_documents(documents)

print(f"Split the financial docs into {len(all_splits)} sub-documents.")

Split the financial docs into 1683 sub-documents.
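
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a simplification: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks rather than at fixed offsets, so the sketch below (with its hypothetical `sliding_chunks` helper) only shows how consecutive chunks share text.

```python
def sliding_chunks(
    text: str, chunk_size: int, chunk_overlap: int
) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs from a fixed-size window with overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


# Toy input: each chunk repeats the last two characters of the previous one.
chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some text twice.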


In [49]:
document_ids = vector_store.add_documents(documents=all_splits)

print(document_ids[:10])

['032b98bb-080b-4cee-ae4c-264d8d7edbee', 'cd0fccd4-caee-418b-b02f-73fab032032b', 'acf6edd8-2dba-40b5-a11a-7cf2f3899bb2', 'ba54e834-8774-449c-a75f-32436b6cc3b8', 'f510f220-f890-48f2-bf37-84ee6d04dfc2', '3b3f0217-33ac-435b-9f3f-f5693b9f50ae', '07c2d686-049b-4de5-a51c-88502a725845', '70cf62e4-80a2-4a62-81d1-ef6327e4b1bf', '49ca71b4-2640-4e05-a31e-e0aa34146ac9', 'd06d4255-9443-4140-9a7b-d8550e532164']


In [50]:
from langchain.tools import tool

@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information to help answer a query."""
    # Fetch the two chunks most similar to the query.
    retrieved_docs = vector_store.similarity_search(query, k=2)
    serialized = "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}"
        for doc in retrieved_docs
    )
    # The string goes to the model; the raw documents are kept as the artifact.
    return serialized, retrieved_docs

In [51]:
from langgraph.prebuilt import create_react_agent

tools = [retrieve_context]
prompt = """
    You have access to a tool that retrieves context from financial documents. 
    Use the tool to help answer user queries.
"""

agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=tools,
    prompt=prompt,
)


In [52]:
query = "what is the value of Average age 40-49 years in Absa document in 2020?"

for event in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    event["messages"][-1].pretty_print()


what is the value of Average age 40-49 years in Absa document in 2020?
Tool Calls:
  retrieve_context (call_AdNxdWwYp3POiXBNaaK6sDMt)
 Call ID: call_AdNxdWwYp3POiXBNaaK6sDMt
  Args:
    query: Average age 40-49 years in Absa document 2020
Name: retrieve_context

Source: {'producer': 'Adobe PDF Library 16.0.7', 'creator': 'Adobe InDesign 17.4 (Macintosh)', 'creationdate': '2023-03-31T16:04:42+02:00', 'moddate': '2023-04-17T07:26:06+02:00', 'trapped': '/False', 'source': './../dataset/Structured data-20250319T105519Z-001/Structured data\\2022-Absa-Group-limited-Environmental-Social-and-Governance-Data-sheet.pdf', 'total_pages': 26, 'page': 3, 'page_label': '4', 'start_index': 0}
Content: Absa Group Limited 2022 E nvironmental, Social and Governance Indicators
4
Absa social data – Operations
Indicator Trend 2022 2021 2020 2019 Targets/Comments
Labour
Total number of employees  35 451 35 267 36 737 38 472
Per employment category:
Permanent – male  13 413 13 503 14 032 14 325
Permanent –

In [53]:
response = agent.invoke(
    {"messages": [{"role": "user", "content": query}]}
)

In [58]:
response['messages'][-1].content

'After searching the available Absa documents for 2020, I could not find a specific value labeled "Average age 40-49 years" for that year. The document provides employee numbers by age group and other metrics like average training hours, but it does not list the average age of employees in the 40-49 year bracket.\n\nIf you meant the number of employees in the 40-49 years age group, or need statistics related to that segment, please clarify, and I can assist further. Otherwise, as of now, the specific value for "Average age 40-49 years" in Absa\'s 2020 document is not available.'
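
The tool retrieves only `k=2` chunks, so the chunk holding the requested figure may not have been among them. One option to explore is restricting the candidates to the relevant report before ranking. The sketch below illustrates that idea on plain dictionaries; `filter_by_source`, `best_chunks`, the sample chunks, and their scores are all hypothetical, and the code does not use the vector store's own API.

```python
def filter_by_source(chunks: list[dict], source_keyword: str) -> list[dict]:
    """Keep only chunks whose source path contains the keyword (case-insensitive)."""
    keyword = source_keyword.lower()
    return [c for c in chunks if keyword in c["source"].lower()]


def best_chunks(chunks: list[dict], source_keyword: str, k: int) -> list[dict]:
    """Filter by source first, then return the k highest-scoring chunks."""
    candidates = filter_by_source(chunks, source_keyword)
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]


# Made-up chunks and similarity scores for illustration only.
scored_chunks = [
    {"source": "reports/alpha-2022.pdf", "page": 3, "score": 0.41},
    {"source": "reports/beta-2023.pdf", "page": 7, "score": 0.58},
    {"source": "reports/alpha-2022.pdf", "page": 5, "score": 0.47},
]

selected = best_chunks(scored_chunks, "alpha", k=2)
print([(c["source"], c["page"]) for c in selected])
```

Without the filter, the highest-scoring chunk in this toy data comes from a different report; with it, both returned chunks come from the report named in the question.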