# AI R&D Intern Assessment: RAG Pipeline for Financial Data QA

## Step 1: Basic RAG Pipeline

**Objective:** Build a simple RAG pipeline for factual QA from Meta's Q1 2024 financial report.

**Deliverables:**
- Source code (this notebook)
- Brief report (see below)
- Sample outputs for test queries

---

## Brief Report

**Approach & Tools:**
- PDF text extraction: PyMuPDF (fitz)
- Chunking: Simple text splitter
- Embedding: Sentence Transformers (all-MiniLM-L6-v2)
- Vector DB: FAISS
- LLM: Open-source model run through llama.cpp or a HuggingFace pipeline (a small model for the demo); Step 1 itself answers with a keyword-based extractive function

**Challenges & Results:**
- Table extraction and hybrid retrieval are not included in Step 1.
- The pipeline retrieves relevant chunks and generates answers for factual queries.

---

## 1. Preprocessing: Extract and Clean Text from PDF

In [70]:
import fitz
import re

def extract_text_from_pdf(pdf_path):
    """Return the text of every page, with runs of newlines collapsed to one."""
    # The context manager closes the document once extraction is done
    with fitz.open(pdf_path) as doc:
        text = ''.join(page.get_text() + '\n' for page in doc)
    # Basic cleaning
    text = re.sub(r'\n+', '\n', text)
    return text

pdf_path = 'Meta’s Q1 2024 Financial Report.pdf'
raw_text = extract_text_from_pdf(pdf_path)
print(raw_text[:5000])

Meta Reports First Quarter 2024 Results
MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter 
ended March 31, 2024.
"It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is 
another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making 
steady progress building the metaverse as well."
First Quarter 2024 Financial Highlights
Three Months Ended March 31,
% Change
In millions, except percentages and per share amounts
2024
2023
Revenue
$ 
36,455 
$ 
28,645 
 27 %
Costs and expenses
 
22,637 
 
21,418 
 6 %
Income from operations
$ 
13,818 
$ 
7,227 
 91 %
Operating margin
 38 %
 25 %
Provision for income taxes
$ 
1,814 
$ 
1,598 
 14 %
Effective tax rate
 13 %
 22 %
Net income
$ 
12,369 
$ 
5,709 
 117 %
Diluted earnings per share (EPS)
$ 
4.71 
$ 
2.20 
 114 %
First Quarter 2024 Operational and

## 2. Chunking & Embedding
- Split text into manageable chunks
- Generate embeddings using Sentence Transformers

In [71]:
# Chunking & Embedding (clean, warning-free version)

# Ensure required packages are installed (suppress pip output and warnings)
import sys
import subprocess

def install(package):
    try:
        __import__(package)
    except ImportError:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

install('sentence_transformers')
install('numpy')

from sentence_transformers import SentenceTransformer
import numpy as np

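def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i+chunk_size])
        if chunk:
            chunks.append(chunk)
        # Stop once a window reaches the end, so no chunk repeats only overlap words
        if i + chunk_size >= len(words):
            break
    return chunks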

chunks = chunk_text(raw_text)
print(f'Number of chunks: {len(chunks)}')
print(len(raw_text.split()))
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(chunks, show_progress_bar=True)

Number of chunks: 6
2530


Batches: 100%|██████████| 1/1 [00:00<00:00,  1.97it/s]
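The chunk count printed above follows directly from the window arithmetic: with 500-word windows and a 50-word overlap, each new chunk starts 450 words after the previous one. The sketch below reproduces that count for the 2,530 words reported in the output; `chunk_starts` is a hypothetical helper introduced only for illustration, and it can be reused to estimate counts for other chunk sizes before re-embedding.

```python
def chunk_starts(n_words, chunk_size=500, overlap=50):
    """Start offsets of a sliding word window (hypothetical helper for illustration)."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return list(range(0, n_words, step))

starts = chunk_starts(2530)
print(starts)               # [0, 450, 900, 1350, 1800, 2250]
print(len(starts))          # 6
print(2530 - starts[-1])    # 280 words in the final, shorter chunk

# Rough chunk counts for other window sizes (same 50-word overlap)
for size in (200, 300, 500):
    print(size, len(chunk_starts(2530, chunk_size=size)))
```

With only six chunks, a top-3 retrieval already returns half of the document, which is worth keeping in mind when reading the retrieval results.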


## 3. Retrieval: Vector Similarity Search
- Use FAISS to retrieve top-3 relevant chunks for a query

In [72]:
import faiss

# Build FAISS index (exact search over squared L2 distance)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))

def retrieve(query, k=3):
    query_emb = embedder.encode([query])
    D, I = index.search(np.array(query_emb).astype('float32'), k)
    # FAISS pads with -1 when k exceeds the number of indexed vectors
    return [chunks[i] for i in I[0] if i != -1]
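To make the retrieval step concrete, here is a dependency-free sketch of what a flat L2 index computes: the squared Euclidean distance from the query to every stored vector, followed by a sort. The toy 2-D vectors are invented for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
def l2_search(query_vec, doc_vecs, k=3):
    """Brute-force nearest neighbours by squared L2 distance (illustrative sketch)."""
    scored = []
    for idx, vec in enumerate(doc_vecs):
        dist = sum((q - d) ** 2 for q, d in zip(query_vec, vec))
        scored.append((dist, idx))
    scored.sort()          # smallest distance first
    return scored[:k]      # never returns more hits than there are vectors

# Toy vectors standing in for chunk embeddings (made-up values)
docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = l2_search([0.9, 0.1], docs, k=2)
print([idx for _, idx in hits])   # [1, 0]
```

Because the search is exhaustive, results are exact; the cost grows linearly with the number of chunks, which is negligible for a single report.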

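## 4. Generation: Answer Queries with LLM
- Target design: an open-source LLM (e.g., HuggingFace pipeline)
- Prompt: Based on the following context: {context} Answer the query: {query}
- The cell below does not call an LLM yet: `generate_fact_answer` is an extractive stand-in that returns the context sentences matching the query keywords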
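The prompt template listed above can be filled in with a small helper. This is a minimal sketch: `build_prompt` and its `max_chars` budget are hypothetical names and values, not part of the notebook, and the character cap is only a crude stand-in for a token limit.

```python
PROMPT_TEMPLATE = "Based on the following context: {context} Answer the query: {query}"

def build_prompt(chunks, query, max_chars=2000):
    """Join retrieved chunks and fill the template (max_chars is an assumed budget)."""
    context = "\n".join(chunks)[:max_chars]
    return PROMPT_TEMPLATE.format(context=context, query=query)

prompt = build_prompt(
    ["Revenue $ 36,455 $ 28,645 27 %"],
    "What was Meta's revenue in Q1 2024?",
)
print(prompt)
```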

In [84]:
def generate_fact_answer(context, query):
    """Extract the most relevant sentence(s) from the context for direct fact-based queries."""
    import re
    # Split context into sentences
    sentences = re.split(r'(?<=[.!?])\s+', context)
    # Candidate keywords: lowercased query words longer than 2 characters
    keywords = [w for w in re.findall(r'\w+', query.lower()) if len(w) > 2]
    # Keep sentences that contain both of the first two keywords (substring match).
    # Note: for questions these are often function words such as 'what' and 'was'.
    best_sentences = [s for s in sentences if all(k in s.lower() for k in keywords[:2])]
    if not best_sentences:
        # Fallback: any sentence with 'revenue' or 'highlight'
        best_sentences = [s for s in sentences
                          if 'revenue' in s.lower() or 'highlight' in s.lower()]
    if not best_sentences:
        # Fallback: first 2 sentences
        best_sentences = sentences[:2]
    return ' '.join(best_sentences)
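One weakness of matching on `keywords[:2]` is that the first two words of a question are usually function words. The sketch below shows this for the first test query and one possible remedy, a stop-word filter. The `STOP_WORDS` set is a small hypothetical list chosen for this example, not a standard resource.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only
STOP_WORDS = {"what", "was", "were", "the", "for", "and", "are", "did", "how"}

def content_keywords(query):
    """Query words longer than 2 characters that are not stop words."""
    words = re.findall(r"\w+", query.lower())
    return [w for w in words if len(w) > 2 and w not in STOP_WORDS]

query = "What was Meta's revenue in Q1 2024?"
naive = [w for w in re.findall(r"\w+", query.lower()) if len(w) > 2]
print(naive[:2])                  # ['what', 'was']
print(content_keywords(query))    # ['meta', 'revenue', '2024']
```

Substring matching has a related pitfall: `'what' in 'whatsapp'` is true, so whole-word matching (for example with `\b` anchors) may be a safer choice.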

## 5. Test Queries & Sample Outputs

In [85]:
import re
import pandas as pd

test_queries = [
    "What was Meta’s revenue in Q1 2024?",
    "What were the key financial highlights for Meta in Q1 2024?",
]

results = []
for query in test_queries:
    top_chunks = retrieve(query)
    # Drop duplicate chunks while preserving retrieval order
    deduped_chunks = list(dict.fromkeys(top_chunks))
    context = '\n'.join(deduped_chunks)
    answer = generate_fact_answer(context, query)
    # Context snippet: first sentence longer than 20 characters, else the first 200 characters
    sentences = re.split(r'(?<=[.!?])\s+', context)
    snippet = next((s.strip() for s in sentences if len(s.strip()) > 20), '')
    if not snippet:
        snippet = context[:200] + ('...' if len(context) > 200 else '')
    results.append({'Query': query, 'Answer': answer.strip(), 'Context Snippet': snippet})

# Display results as a DataFrame for clarity
pd.set_option('display.max_colwidth', 300)
display(pd.DataFrame(results))

| | Query | Answer | Context Snippet |
|---|---|---|---|
| 0 | What was Meta’s revenue in Q1 2024? | We're seeing healthy growth across our apps and we continue making steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $ 36,455 $ 28,645 27 % Costs a... | Meta Reports First Quarter 2024 Results MENLO PARK, Calif. |
| 1 | What were the key financial highlights for Meta in Q1 2024? | Apps like Messenger, Instagram, and WhatsApp further empowered billions around the world. | Meta Reports First Quarter 2024 Results MENLO PARK, Calif. |


## Step 2: Structured Data Integration
- Extract tables from PDF into DataFrames
- Hybrid retrieval: combine vector search (text) and keyword/SQL-like search (structured)
- Updated prompt includes both text and structured data


In [None]:
import pandas as pd
try:
    import camelot
except ImportError:
    camelot = None

def extract_tables_from_pdf(pdf_path):
    if camelot is None:
        print('camelot is not installed. Please install it to extract tables.')
        return []
    try:
        tables = camelot.read_pdf(pdf_path, pages='all')
        return [t.df for t in tables]
    except Exception as e:
        print(f'Error extracting tables: {e}')
        return []

tables = extract_tables_from_pdf(pdf_path)
if tables:
    print(f'Extracted {len(tables)} tables.')
    display(tables[0].head())
else:
    print('No tables extracted.')

No tables extracted.
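Since camelot returned no tables in this run, one possible fallback is to rebuild rows from the flattened text that PyMuPDF already produced in Section 1, where each label is followed by its two values and a percentage on separate lines. The parser below is a rough sketch under that assumption. `parse_highlights` and the column names are hypothetical, the sample is copied from the extraction output above, and rows that do not fit the label / two numbers / percent pattern (such as "Operating margin") are skipped.

```python
import re

# Excerpt of the flattened highlights table from the Section 1 output
SAMPLE = """Revenue
$
36,455
$
28,645
27 %
Costs and expenses
22,637
21,418
6 %
Net income
$
12,369
$
5,709
117 %
"""

NUMBER = re.compile(r"^[\d,]+(?:\.\d+)?$")
PERCENT = re.compile(r"^(\d+)\s*%$")

def parse_highlights(text):
    """Group 'label, value, value, percent' line runs into row dicts."""
    rows, label, values = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "$":
            continue                      # skip blanks and lone currency symbols
        pct = PERCENT.match(line)
        if NUMBER.match(line):
            values.append(float(line.replace(",", "")))
        elif pct and label is not None and len(values) == 2:
            rows.append({"metric": label, "q1_2024": values[0],
                         "q1_2023": values[1], "pct_change": int(pct.group(1))})
            label, values = None, []
        elif not pct:
            label, values = line, []      # any other text starts a new row
    return rows

for row in parse_highlights(SAMPLE):
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame` so that the hybrid retrieval step has something structured to search.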


In [76]:
# Hybrid retrieval: combine text and structured data

def hybrid_retrieve(query, k=3):
    # Vector search (text)
    text_chunks = retrieve(query, k)
    # Keyword/SQL-like search (structured)
    table_results = []
    if tables:
        for table in tables:
            # Simple keyword search in all table cells
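            # regex=False: treat the query as a literal string ('?', '$', '(' are regex metacharacters)
            mask = table.apply(lambda col: col.astype(str).str.contains(query, case=False, na=False, regex=False))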
            if mask.any().any():
                table_results.append(table[mask.any(axis=1)])
    return text_chunks, table_results

# Example hybrid retrieval
test_query2 = "net income"
text_chunks, table_results = hybrid_retrieve(test_query2)
print("Text Chunks:\n", text_chunks)
if table_results:
    print("Table Results:")
    display(table_results[0])
else:
    print("No relevant table rows found.")

Text Chunks:
 ['and other current liabilities $ 4,217 $ 4,466 Acquisition of businesses in accrued expenses and other current liabilities and other liabilities $ 116 $ 263 8 Segment Results We report our financial results for our two reportable segments: Family of Apps (FoA) and Reality Labs (RL). FoA includes Facebook, Instagram, Messenger, WhatsApp, and other services. RL includes our virtual, augmented, and mixed reality related consumer hardware, software, and content. The following table presents our segment information of revenue and income (loss) from operations: Segment Information (In millions) (Unaudited) Three Months Ended March 31, 2024 2023 Revenue: Advertising $ 35,635 $ 28,101 Other revenue 380 205 Family of Apps 36,015 28,306 Reality Labs 440 339 Total revenue $ 36,455 $ 28,645 Income (loss) from operations: Family of Apps $ 17,664 $ 11,219 Reality Labs (3,846) (3,992) Total income from operations $ 13,818 $ 7,227 9 Reconciliation of GAAP to Non-GAAP Results (In million
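The structured branch of `hybrid_retrieve` looks for the entire query string inside a table cell, so a full question will rarely match even when tables are available. A term-level match is one alternative. The sketch below works on plain dict rows to stay dependency-free; `keyword_row_search` is a hypothetical helper, and the row values are taken from the highlights shown in Section 1.

```python
import re

def keyword_row_search(rows, query, min_len=3):
    """Rank rows by how many query terms appear in their cell values (sketch)."""
    terms = [t for t in re.findall(r"\w+", query.lower()) if len(t) >= min_len]
    scored = []
    for row in rows:
        cells = " ".join(str(v) for v in row.values()).lower()
        score = sum(term in cells for term in terms)
        if score:
            scored.append((score, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # stable: ties keep input order
    return [row for _, row in scored]

rows = [
    {"metric": "Revenue", "q1_2024": 36455, "q1_2023": 28645},
    {"metric": "Income from operations", "q1_2024": 13818, "q1_2023": 7227},
    {"metric": "Net income", "q1_2024": 12369, "q1_2023": 5709},
]
matches = keyword_row_search(rows, "What was Meta's net income in Q1 2024 compared to Q1 2023?")
print([m["metric"] for m in matches])   # ['Net income', 'Income from operations']
```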

In [77]:
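# Updated prompt for hybrid context
# NOTE: qa_pipeline is not defined in any cell shown here; it is assumed to be a
# HuggingFace text-generation pipeline created in an earlier session.
def generate_hybrid_answer(text_context, structured_data, query):
    prompt = f"Text context: {text_context}\nStructured data: {structured_data}\nAnswer the query: {query}"
    # do_sample=True makes the output non-deterministic; generated_text includes the prompt itself
    result = qa_pipeline(prompt, max_length=256, do_sample=True)[0]['generated_text']
    return result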

# Test hybrid answer
test_query2 = "What was Meta’s net income in Q1 2024 compared to Q1 2023?"
text_chunks, table_results = hybrid_retrieve(test_query2)
text_context = '\n'.join(text_chunks)
structured_data = table_results[0].to_string() if table_results else "No relevant table rows."
answer = generate_hybrid_answer(text_context, structured_data, test_query2)
print(f'Query: {test_query2}')
print(f'Answer: {answer}')

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Query: What was Meta’s net income in Q1 2024 compared to Q1 2023?
Answer: Text context: Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $ 36,455 $ 28,645 27 % Costs and expenses 22,637 21,418 6 % Income from operations $ 13,818 $ 7,227 91 % Operating margin 38 % 25 % Provision for income taxes $ 1,814 $ 1,598 14 % Effective tax rate 13 % 22 % Net income $ 12,369 $ 5,709 117 % Diluted earnings per share (EPS) $ 4.7
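The printed answer starts with "Text context:", which means the model output echoes the prompt rather than containing only a continuation. HuggingFace text-generation pipelines return prompt plus continuation by default; passing `return_full_text=False` requests the continuation alone. When that option is not available, the echo can be removed afterwards. The helper below is a hypothetical sketch and only works when the prompt was not truncated before generation.

```python
def strip_prompt(generated_text, prompt):
    """Remove an echoed prompt prefix from generated text, if present (sketch)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

prompt = "Text context: Net income $ 12,369 $ 5,709\nAnswer the query: What was net income?"
generated = prompt + " Net income was $12,369 million."
print(strip_prompt(generated, prompt))   # Net income was $12,369 million.
```

In the run above the prompt exceeded the 256-token `max_length`, so shortening the context (fewer or smaller chunks) is a prerequisite for getting any generated answer at all.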

## Step 3: Query Optimization & Advanced RAG
- Query rewriting/optimization
- Advanced retrieval (reranking, chunk size experiments)
- Evaluation framework (Precision@k, Recall@k, MRR, BLEU, ROUGE)
- Performance analysis and ablation study


In [78]:
# Step 3: Query Optimization and Advanced Retrieval
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def optimize_query(query):
    """Rewrite or optimize the query using LLM or rules. For demo, returns original query."""
    return query

# Load the cross-encoder once instead of on every rerank() call
RERANKER_NAME = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
reranker_tokenizer = AutoTokenizer.from_pretrained(RERANKER_NAME)
reranker_model = AutoModelForSequenceClassification.from_pretrained(RERANKER_NAME)
reranker_model.eval()

def rerank(query, candidate_chunks):
    """Rerank candidate chunks using a cross-encoder model (higher score = more relevant)."""
    scores = []
    for chunk in candidate_chunks:
        inputs = reranker_tokenizer(query, chunk, return_tensors='pt', truncation=True)
        with torch.no_grad():
            score = reranker_model(**inputs).logits[0].item()
        scores.append(score)
    # Sort by score only, so ties never fall back to comparing chunk text
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [candidate_chunks[i] for i in order]

# Example usage
query = "Summarize Meta’s operating expenses in Q1 2024."
optimized_query = optimize_query(query)
candidate_chunks = retrieve(optimized_query, k=5)
reranked_chunks = rerank(optimized_query, candidate_chunks)
print("Top reranked chunk:\n", reranked_chunks[0])

Top reranked chunk:
 Meta Reports First Quarter 2024 Results MENLO PARK, Calif. – April 24, 2024 – Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter ended March 31, 2024. "It's been a good start to the year," said Mark Zuckerberg, Meta founder and CEO. "The new version of Meta AI with Llama 3 is another step towards building the world's leading AI. We're seeing healthy growth across our apps and we continue making steady progress building the metaverse as well." First Quarter 2024 Financial Highlights Three Months Ended March 31, % Change In millions, except percentages and per share amounts 2024 2023 Revenue $ 36,455 $ 28,645 27 % Costs and expenses 22,637 21,418 6 % Income from operations $ 13,818 $ 7,227 91 % Operating margin 38 % 25 % Provision for income taxes $ 1,814 $ 1,598 14 % Effective tax rate 13 % 22 % Net income $ 12,369 $ 5,709 117 % Diluted earnings per share (EPS) $ 4.71 $ 2.20 114 % First Quarter 2024 Operational and Other Financial H
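`optimize_query` currently returns the query unchanged. As one example of a rule-based rewrite, the sketch below expands quarter abbreviations, since the report itself says "First Quarter 2024" rather than "Q1 2024". `expand_quarters` is a hypothetical helper, and whether this expansion actually improves retrieval on this document has not been measured here.

```python
import re

QUARTER_NAMES = {"q1": "first quarter", "q2": "second quarter",
                 "q3": "third quarter", "q4": "fourth quarter"}

def expand_quarters(query):
    """Append the spelled-out quarter after abbreviations like 'Q1 2024' (sketch)."""
    def repl(match):
        name = QUARTER_NAMES[match.group(1).lower()]
        return f"{match.group(0)} ({name} {match.group(2)})"
    return re.sub(r"\b(Q[1-4])\s+(\d{4})\b", repl, query, flags=re.IGNORECASE)

print(expand_quarters("Summarize Meta's operating expenses in Q1 2024."))
# Summarize Meta's operating expenses in Q1 2024 (first quarter 2024).
```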

In [79]:
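# Evaluation framework (simplified demo)
def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(relevant) & set(retrieved[:k])) / k

def recall_at_k(relevant, retrieved, k):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def mrr(relevant, retrieved):
    """Reciprocal rank for one query; average over queries to get MRR."""
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return 1 / (i + 1)
    return 0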

# BLEU/ROUGE can be computed with nltk/rouge-score if needed
# Placeholder for ablation study and performance analysis
print("Ablation study: Remove reranking and compare retrieval accuracy.")
# ... implement ablation logic as needed ...

Ablation study: Remove reranking and compare retrieval accuracy.
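The ablation described above could be organized as follows: score the same query's ranking with and without reranking and compare the metrics. This is a self-contained sketch; the chunk ids and both rankings are invented purely to show the mechanics and are not measured results from this notebook.

```python
def precision_at_k(relevant, retrieved, k):
    return len(set(relevant) & set(retrieved[:k])) / k

def recall_at_k(relevant, retrieved, k):
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def reciprocal_rank(relevant, retrieved):
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return 1 / (i + 1)
    return 0.0

# Hypothetical labels and rankings (NOT real outputs of the pipeline)
relevant = {"c0", "c4"}
rankings = {
    "vector only": ["c2", "c0", "c5", "c4", "c1"],
    "vector + rerank": ["c0", "c4", "c2", "c5", "c1"],
}

report = {}
for name, ranked in rankings.items():
    report[name] = {
        "P@3": precision_at_k(relevant, ranked, 3),
        "R@3": recall_at_k(relevant, ranked, 3),
        "RR": reciprocal_rank(relevant, ranked),
    }
    print(name, report[name])
```

A real ablation would need hand-labeled relevant chunks for each test query, with the per-query scores averaged across the query set.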


### Improvement Proposals
1. Integrate a more powerful open-source LLM (e.g., Llama 3) for better answer generation.
2. Use advanced table extraction (e.g., deep-learning-based PDF parsers) for more accurate structured data handling.
