# Retrieval-Augmented Generation (RAG) System for Financial Reports

# STEP 1: BUILD A BASIC RAG PIPELINE

### Objective: Build a simple RAG pipeline for factual QA from a single financial report.

### Preprocessing: Extract and clean text from PDF.

In [26]:
pip install PyMuPDF

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [27]:
import fitz

def extract_text_from_pdf(path):
    doc = fitz.open(path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

raw_text = extract_text_from_pdf("/kaggle/input/financial-reporta/Metas Q1 2024 Financial Report.pdf")

### Chunking & Embedding: Split into chunks; generate embeddings with an open-source model.


In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=200)
chunks = splitter.split_text(raw_text)

In [29]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(chunks)

Batches:   0%|          | 0/10 [00:00<?, ?it/s]

### Retrieval: Use vector similarity to retrieve top-3 relevant chunks.
Note:
Here I used L2 distance (Euclidean distance). This measures the straight-line distance between vectors in the embedding space.

In [30]:
pip install faiss-cpu


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [31]:

import faiss
import numpy as np

dimension = embeddings[0].shape[0]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))


query = "What was Meta’s revenue in Q1 2024?"
query_embedding = embedding_model.encode([query]).astype('float32')
top_k = 3
D, I = index.search(np.array(query_embedding), top_k)
retrieved_chunks = [chunks[i] for i in I[0]]
context = "\n".join(retrieved_chunks)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [32]:
def remove_newlines(text):
    if isinstance(text, list):
        text = ' '.join(text)
    return text.replace('\n', ' ')



cleaned = remove_newlines(context)
print(cleaned)

4 META PLATFORMS, INC. CONDENSED CONSOLIDATED STATEMENTS OF INCOME (In millions, except per share amounts) (Unaudited) Three Months Ended March 31, 2024 2023 Revenue $  36,455 $  28,645 steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue 6 META PLATFORMS, INC. CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS (In millions) (Unaudited) Three Months Ended March 31, 2024 2023 Cash flows from operating activities Net income $  12,369 $


### Generation: Answer queries using an open-source LLM

In [33]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ensure padding token exists
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Example context + query
context = cleaned
query = query

# Instruction-style prompt
prompt = f"<|user|>\nBased on the following context:\n{context}\n\nAnswer the question in **100 words**:\n{query}\n<|assistant|>"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=1024)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=130,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Decode only new tokens (exclude the prompt)
input_length = inputs["input_ids"].shape[-1]
generated_ids = outputs[0][input_length:]
answer = tokenizer.decode(generated_ids, skip_special_tokens=True)

print("Answer:", answer)


Answer: 
<|output|>
6 META PLATFORMS, INC. CONDENSED CONSOLIDATED STATEMENTS OF INCOME (In millions, except per share amounts) (Unaudited) Three Months Ended March 31, 2024 2023 Revenue $  36,455 $  28,645 steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 202


### Model works fine

### creating a function

In [34]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


def get_answer(query):
  query = query
  query_embedding = embedding_model.encode([query]).astype('float32')

  top_k = 3
  D, I = index.search(np.array(query_embedding), top_k)
  retrieved_chunks = [chunks[i] for i in I[0]]
  context = "\n".join(retrieved_chunks)
  def remove_newlines(text):
    if isinstance(text, list):
        text = ' '.join(text)
    return text.replace('\n', ' ')

  cleaned = remove_newlines(context)
  print(cleaned)
  context = cleaned

  prompt =prompt = f"<|user|>\nBased on the following context:\n{context}\n\nAnswer the question in **100 words**:\n{query}\n<|assistant|>"
  prompt = prompt.replace("’", "'")

  inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=1024)

# Generate
  outputs = model.generate(
    **inputs,
    max_new_tokens=130,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
   )

# Decode only new tokens (exclude the prompt)
  input_length = inputs["input_ids"].shape[-1]
  generated_ids = outputs[0][input_length:]
  answer = tokenizer.decode(generated_ids, skip_special_tokens=True)

  print("Answer:", answer)




# Testing the model

In [35]:
query='What was Meta’s net income in Q1 2024 compared to Q1 2023?'
get_answer(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

4 META PLATFORMS, INC. CONDENSED CONSOLIDATED STATEMENTS OF INCOME (In millions, except per share amounts) (Unaudited) Three Months Ended March 31, 2024 2023 Revenue $  36,455 $  28,645 6 META PLATFORMS, INC. CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS (In millions) (Unaudited) Three Months Ended March 31, 2024 2023 Cash flows from operating activities Net income $  12,369 $ steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue
Answer: 
<|response|>
$12,369
<|response|>
The net income of Meta in Q1 2024 was $12,369 compared to $12,369 in Q1 2023.
<|response|>


In [36]:
query='What were the key financial highlights for Meta in Q1 2024?'
get_answer(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter  ended March 31, 2024. 4 META PLATFORMS, INC. CONDENSED CONSOLIDATED STATEMENTS OF INCOME (In millions, except per share amounts) (Unaudited) Three Months Ended March 31, 2024 2023 Revenue $  36,455 $  28,645
Answer: 
The key financial highlights for Meta in Q1 2024 were steady progress in building the metaverse as well as significant revenue growth. The company reported revenue of $36,455 million, a 14% increase from the same period in 2023. This growth was driven by the continued expansion of its metaverse platforms and services.
