<a href="https://colab.research.google.com/github/Miftahul-adib/Building-a-RAG-system/blob/main/RAG_application_build_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# PROJECT OVERVIEW:

This project implements a Retrieval-Augmented Generation (RAG) system designed to answer natural language questions from financial reports, specifically based on Meta's Q1 2024 Financial Report


| Task                       | Tool/Library            |
| -------------------------- | ----------------------- |
| PDF Text Extraction        | `PyMuPDF`, `pdfplumber` |
| Text Chunking              | `LangChain`             |
| Embeddings                 | `sentence-transformers` |
| Vector Storage & Retrieval | `FAISS`                 |
| Table Extraction           | `pdfplumber`, `pandas`  |
| LLM (Answer Generation)    | `Qwen/Qwen3-0.6B' |






# STEP 1: BUILD A BASIC RAG PIPELINE
## Objective: Build a simple RAG pipeline for factual QA from a single financial report.

## Preprocessing: Extract and clean text from PDF.


In [110]:
pip install PyMuPDF



In [111]:
import fitz

def extract_text_from_pdf(path):
    doc = fitz.open(path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

raw_text = extract_text_from_pdf("/content/Meta’s Q1 2024 Financial Report.pdf")


In [112]:
raw_text

'Meta Reports First Quarter 2024 Results\nMENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter \nended March 31, 2024.\n"It\'s been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is \nanother step towards building the world\'s leading AI. We\'re seeing healthy growth across our apps and we continue making \nsteady progress building the metaverse as well."\nFirst Quarter 2024 Financial Highlights\nThree Months Ended March 31,\n% Change\nIn millions, except percentages and per share amounts\n2024\n2023\nRevenue\n$ \n36,455 \n$ \n28,645 \n 27 %\nCosts and expenses\n \n22,637 \n \n21,418 \n 6 %\nIncome from operations\n$ \n13,818 \n$ \n7,227 \n 91 %\nOperating margin\n 38 %\n 25 %\nProvision for income taxes\n$ \n1,814 \n$ \n1,598 \n 14 %\nEffective tax rate\n 13 %\n 22 %\nNet income\n$ \n12,369 \n$ \n5,709 \n 117 %\nDiluted earnings per share (EPS)\n$ \n4.

## Chunking & Embedding: Split into chunks; generate embeddings with an open-source model.


In [113]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=200)
chunks = splitter.split_text(raw_text)

In [114]:
len(chunks)

22

In [115]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(chunks)

In [116]:
pip install faiss-cpu




### Retrieval: Use vector similarity to retrieve top-3 relevant chunks.
###### Note:
This is my first time working with vector databases.

Here I used L2 distance (Euclidean distance). This measures the straight-line distance between vectors in the embedding space.


In [117]:

import faiss
import numpy as np

dimension = embeddings[0].shape[0]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))


query = "What was Meta’s revenue in Q1 2024?"
query_embedding = embedding_model.encode([query]).astype('float32')
top_k = 3
D, I = index.search(np.array(query_embedding), top_k)
retrieved_chunks = [chunks[i] for i in I[0]]
context = "\n".join(retrieved_chunks)

In [118]:
def remove_newlines(text):
    """Remove all newline characters from a string."""
    if isinstance(text, list):
        text = ' '.join(text)  # Join list items with spaces first
    return text.replace('\n', ' ')




 # or a single string
cleaned = remove_newlines(context)
print(cleaned)

Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter  ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is  another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making  steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $  36,455  $  28,645   27 % Costs and expenses   22,637    21,418   6 % Income from operations $  13,818  $  7,227   91 % Operating margin  38 %  25 % Provision for income taxes $  1,814  $  1,598   14 % Effective tax rate  13 %  22 % Net income $  12,369  $  5,709   117 % Diluted earnings per share (EPS) $  4.71  $  2.20   114 % intended to represent our residual

# Generation: Answer queries using an open-source LLM

In [119]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assign pad token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

context = cleaned

prompt = f"Based on the following context:\n{context}\n\nAnswer the question in **100 words**:\n{query}"

prompt = prompt.replace("’", "'")  # optional cleanup of smart quotes

inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)

outputs = model.generate(
    **inputs,
    max_new_tokens = 130 ,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)


# Decode the generated output
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

if answer.startswith(prompt):
    answer_only = answer[len(prompt):].strip()
else:
    answer_only = answer.replace(prompt, '', 1).strip()

print("Answer:", answer_only)

Answer: Based on the following context:
Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter  ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is  another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making  steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $  36,455  $  28,645   27 % Costs and expenses   22,637    21,418   6 % Income from operations $  13,818  $  7,227   91 % Operating margin  38 %  25 % Provision for income taxes $  1,814  $  1,598   14 % Effective tax rate  13 %  22 % Net income $  12,369  $  5,709   117 % Diluted earnings per share (EPS) $  4.71  $  2.20   

the model is not providing with suitable results.


In [135]:
def get_answer(query):
  query = query
  query_embedding = embedding_model.encode([query]).astype('float32')

  top_k = 3
  D, I = index.search(np.array(query_embedding), top_k)
  retrieved_chunks = [chunks[i] for i in I[0]]
  context = "\n".join(retrieved_chunks)
  def remove_newlines(text):
    """Remove all newline characters from a string."""
    if isinstance(text, list):
        text = ' '.join(text)  # Join list items with spaces first
    return text.replace('\n', ' ')





 # or a single string
  cleaned = remove_newlines(context)
  print(cleaned)
  context = cleaned

  prompt = f"Based on the following context:\n{context}\n\nAnswer the question in **100 words**:\n{query}"

  prompt = prompt.replace("’", "'")  # optional cleanup of smart quotes

  inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)

  outputs = model.generate(
    **inputs,
    max_new_tokens = 130 ,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id)


# Decode the generated output
  answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
  if answer.startswith(prompt):
     answer_only = answer[len(prompt):].strip()
  else:
     answer_only = answer.replace(prompt, '', 1).strip()

  print("Answer:", answer_only)




# Testing the model

In [136]:
query1='What was Meta’s revenue in Q1 2024?'
get_answer(query1)

Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter  ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is  another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making  steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $  36,455  $  28,645   27 % Costs and expenses   22,637    21,418   6 % Income from operations $  13,818  $  7,227   91 % Operating margin  38 %  25 % Provision for income taxes $  1,814  $  1,598   14 % Effective tax rate  13 %  22 % Net income $  12,369  $  5,709   117 % Diluted earnings per share (EPS) $  4.71  $  2.20   114 % intended to represent our residual

In [137]:
query2='What were the key financial highlights for Meta in Q1 2024?'
get_answer(query2)

Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter  ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is  another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making  steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $  36,455  $  28,645   27 % Costs and expenses   22,637    21,418   6 % Income from operations $  13,818  $  7,227   91 % Operating margin  38 %  25 % Provision for income taxes $  1,814  $  1,598   14 % Effective tax rate  13 %  22 % Net income $  12,369  $  5,709   117 % Diluted earnings per share (EPS) $  4.71  $  2.20   114 % as we invest aggressively to suppo

# STEP 2:Structured Data Integration


In [123]:
pip install tabula-py




In [124]:
pip install pdfplumber



# Table Extraction: Parse tables into structured formats (e.g., DataFrame,JSON).

In [125]:
import pdfplumber
import pandas as pd

tables = []
with pdfplumber.open("/content/Meta’s Q1 2024 Financial Report.pdf") as pdf:
    for page in pdf.pages:
        page_tables = page.extract_tables()
        for table in page_tables:
            if table and len(table) > 1:  # Check if table exists and has data
                df = pd.DataFrame(table[1:], columns=table[0])
                tables.append(df)

In [126]:
tables[1]

Unnamed: 0,Revenue,"$ 36,455 $ 28,645"
0,Costs and expenses:,
1,Cost of revenue,"6,640 6,108"
2,Research and development,"9,978 9,381"
3,Marketing and sales,"2,564 3,044"
4,General and administrative,"3,455 2,885"
5,Total costs and expenses,"22,637 21,418"
6,,"13,818 7,227"
7,"Interest and other income, net",365 80
8,Income before provision for income taxes,"14,183 7,307"
9,Provision for income taxes,"1,814 1,598"


# Hybrid Retrieval: Combine vector search (text) + keyword/SQL-like search(structured).

In [127]:
def search_tables_simple(query, tables):
    """Simple search across all tables"""
    if not tables:
        return None

    # Split query into words
    stop_words = ['what', 'was', 'the', 'in', 'to', 'and', 'or', 'a', 'an', 'is', 'are', 'compared']
    words = [word.lower().strip('?.,!') for word in query.split()
             if word.lower() not in stop_words and len(word) > 2]

    print(f"Searching for: {words}")

    results = []
    for i, table in enumerate(tables):
        matching_rows = []

        # Check each row
        for idx, row in table.iterrows():
            # Convert entire row to lowercase string
            row_text = ' '.join(str(val).lower() for val in row.values)

            # Check if any search word is in this row
            if any(word in row_text for word in words):
                matching_rows.append(idx)

        if matching_rows:
            matches = table.loc[matching_rows]
            print(f"Table {i+1}: {len(matches)} matches")
            results.append((i+1, matches))

    return results if results else None

# Test it
question = "What was Meta's net income in Q1 2024 compared to Q1 2023?"
results = search_tables_simple(question, tables)

if results:
    for table_num, matches in results:
        print(f"\n--- Table {table_num} matches ---")
        print(matches)
else:
    print("No matches found")

Searching for: ["meta's", 'net', 'income', '2024', '2023']
Table 1: 3 matches
Table 2: 4 matches
Table 3: 3 matches
Table 4: 10 matches
Table 5: 2 matches
Table 7: 2 matches
Table 8: 4 matches

--- Table 1 matches ---
                    Revenue $ 36,455 $ 28,645 27 %
1     Income from operations $ 13,818 $ 7,227 91 %
3  Provision for income taxes $ 1,814 $ 1,598 14 %
5                Net income $ 12,369 $ 5,709 117 %

--- Table 2 matches ---
                                     Revenue $ 36,455 $ 28,645
7             Interest and other income, net            365 80
8   Income before provision for income taxes      14,183 7,307
9                 Provision for income taxes       1,814 1,598
10                                Net income  $ 12,369 $ 5,709

--- Table 3 matches ---
                                       Assets
3      Accounts receivable, net 13,430 16,169
7   Property and equipment, net 98,908 96,587
21         Long-term income taxes 7,795 7,514

--- Table 4 matches ---
    

# Retrieval

In [128]:
import faiss
import numpy as np

dimension = embeddings[0].shape[0]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))


query = query
query_embedding = embedding_model.encode([query]).astype('float32')

top_k = 3
D, I = index.search(np.array(query_embedding), top_k)
retrieved_chunks = [chunks[i] for i in I[0]]
context = "\n".join(retrieved_chunks)
import re

def remove_newlines(text):
    """Remove all newline characters from a string."""
    if isinstance(text, list):
        text = ' '.join(text)  # Join list items with spaces first
    return text.replace('\n', ' ')




 # or a single string
cleaned = remove_newlines(context)
print(cleaned)
context=cleaned

In [129]:
results=search_tables_simple(query, tables)

Searching for: ['meta’s', 'revenue', '2024']
Table 2: 1 matches
Table 7: 2 matches
Table 8: 9 matches


# Model Generation

In [132]:
model_name = "Qwen/Qwen3-0.6B"


prompt = f" Answer the query: {query} based on, Text context:{context} and Structured data: {results} "

prompt = prompt.replace("’", "'")  # optional cleanup of smart quotes

inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)

outputs = model.generate(
    **inputs,
    max_new_tokens = 130 ,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)


# Decode the generated output
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)


print("Answer:", answer)

Answer:  Answer the query: What was Meta's revenue in Q1 2024? based on, Text context and Structured data.Text context:Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter  ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is  another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making  steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $  36,455  $  28,645   27 % Costs and expenses   22,637    21,418   6 % Income from operations $  13,818  $  7,227   91 % Operating margin  38 %  25 % Provision for income taxes $  1,814  $  1,598   14 % Effective tax rate  13 %  22 % Net income $

In [146]:
def get_answer(query):
  query=query


  query_embedding = embedding_model.encode([query]).astype('float32')

  top_k = 3
  D, I = index.search(np.array(query_embedding), top_k)
  retrieved_chunks = [chunks[i] for i in I[0]]
  context = "\n".join(retrieved_chunks)
  cleaned = remove_newlines(context)
  context=cleaned
  results=search_tables_simple(query, tables)
  model_name = "Qwen/Qwen3-0.6B"


  prompt = f" Answer the query: {query} based on, Text context:{context} and Structured data: {results} "

  prompt = prompt.replace("’", "'")  # optional cleanup of smart quotes

  inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)

  outputs = model.generate(
    **inputs,
    max_new_tokens = 130 ,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)


# Decode the generated output
  answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
  if answer.startswith(prompt):
    answer_only = answer[len(prompt):].strip()
  else:
    answer_only = answer.replace(prompt, '', 1).strip()

  print("Answer:", answer_only)




# Testing the queries

In [147]:
query1='What was Meta’s net income in Q1 2024 compared to Q1 2023?'
get_answer(query1)

Searching for: ['meta’s', 'net', 'income', '2024', '2023']
Table 1: 3 matches
Table 2: 4 matches
Table 3: 3 matches
Table 4: 10 matches
Table 5: 2 matches
Table 7: 2 matches
Table 8: 4 matches
Answer: Answer the query: What was Meta's net income in Q1 2024 compared to Q1 2023? based on, Text context:Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter  ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is  another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making  steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $  36,455  $  28,645   27 % Costs and expenses   22,637    21,418

In [145]:
query2='Summarize Meta’s operating expenses in Q1 2024.'
get_answer(query2)

Searching for: ['summarize', 'meta’s', 'operating', 'expenses', '2024']
Table 1: 2 matches
Table 2: 2 matches
Table 3: 5 matches
Table 4: 5 matches
Table 5: 2 matches
Table 8: 3 matches
Answer: Answer the query: Summarize Meta's operating expenses in Q1 2024. based on, Text context:intended to represent our residual cash flow available for discretionary expenditures.  For more information on our non-GAAP financial measures and a reconciliation of GAAP to non-GAAP measures, please see  the "Reconciliation of GAAP to Non-GAAP Results" table in this press release. 4 META PLATFORMS, INC. CONDENSED CONSOLIDATED STATEMENTS OF INCOME (In millions, except per share amounts) (Unaudited) Three Months Ended March 31, 2024 2023 Revenue $  36,455 $  28,645  Costs and expenses: Cost of revenue   6,640   6,108  Research and development   9,978   9,381  Marketing and sales   2,564   3,044  General and administrative   3,455   2,885  Total costs and expenses   22,637   21,418  Income from operations   