# RAG with LlamaParse + Gemini Embeddings

This notebook uses:
- **LlamaParse** for PDF extraction (better for tables & financial docs)
- **Gemini Embeddings** for vector embeddings
- **ChromaDB** for vector storage
- **Gemini 2.5 Flash** for generation


In [3]:
# Install/upgrade required packages (run once)
# Fix for "cannot import name 'Sentinel' from 'typing_extensions'" error
%pip install --upgrade "typing_extensions>=4.12.0"
%pip install llama-cloud-services chromadb google-genai python-dotenv nest_asyncio
%pip install google-adk

Note: you may need to restart the kernel to use updated packages.
Collecting google-adk
  Downloading google_adk-1.21.0-py3-none-any.whl.metadata (14 kB)
Collecting authlib<2.0.0,>=1.5.1 (from google-adk)
  Downloading authlib-1.6.6-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting google-api-python-client<3.0.0,>=2.157.0 (from google-adk)
  Using cached google_api_python_client-2.187.0-py3-none-any.whl.metadata (7.0 kB)
Collecting google-cloud-aiplatform<2.0.0,>=1.125.0 (from google-cloud-aiplatform[agent-engines]<2.0.0,>=1.125.0->google-adk)
  Downloading google_cloud_aiplatform-1.132.0-py2.py3-none-any.whl.metadata (46 kB)
Collecting google-cloud-bigquery-storage>=2.0.0 (from google-adk)
  Downloading google_cloud_bigquery_storage-2.36.0-py3-none-any.whl.metadata (10 kB)
Collecting google-cloud-bigquery>=2.2.0 (from google-adk)
  Downloading google_cloud_bigquery-3.39.0-py3-none-any.whl.metadata (8.2 kB)
Collecting go

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-ai-generativelanguage 0.4.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 6.33.2 which is incompatible.
langchain 0.1.0 requires langchain-core<0.2,>=0.1.7, but you have langchain-core 1.2.3 which is incompatible.
langchain 0.1.0 requires langsmith<0.1.0,>=0.0.77, but you have langsmith 0.5.0 which is incompatible.
langchain 0.1.0 requires tenacity<9.0.0,>=8.1.0, but you have tenacity 9.1.2 which is incompatible.
langchain-community 0.0.13 requires langchain-core<0.2,>=0.1.9, but you have langchain-core 1.2.3 which is incompatible.
langchain-community 0.0.13 requires langsmith<0.1.0,>=0.0.63, but you have langsmith 0.5.0 which is incompatible.
langchain-community 0.0.13 requires tenacity<9.0.0,>=8.1.0, but you have tenacity

In [1]:
from google.genai import types
retry_config = types.HttpRetryOptions(
    attempts=5,  # Maximum number of attempts
    exp_base=7,  # Delay multiplier between attempts
    initial_delay=1,  # Initial delay before the first retry (in seconds)
    http_status_codes=[429, 500, 503, 504],  # Retry on these HTTP errors
)
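To get a feel for what these retry settings imply, the sketch below computes a plain exponential backoff schedule. It assumes the wait before retry `n` is `initial_delay * exp_base ** n`, with no jitter and no maximum-delay cap; the SDK's actual retry logic may apply both, so the numbers are illustrative only. `backoff_delays` is a hypothetical helper, not part of `google-genai`.

```python
def backoff_delays(attempts, initial_delay, exp_base):
    """Return the waits (in seconds) between attempts under plain exponential backoff.

    Assumption: delay_n = initial_delay * exp_base ** n, with no jitter and no cap.
    """
    # `attempts` total tries leave `attempts - 1` waits between them
    return [initial_delay * exp_base ** n for n in range(attempts - 1)]


delays = backoff_delays(attempts=5, initial_delay=1, exp_base=7)
print(delays)       # individual waits in seconds
print(sum(delays))  # total time spent waiting if every attempt fails
```

Under this assumption a base of 7 grows quickly (the last wait is several minutes), so a smaller `exp_base` may be worth considering for interactive use.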

In [2]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# API Keys
LLAMA_CLOUD_API_KEY = os.getenv("LlamaParse")  # Get from https://cloud.llamaindex.ai/
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

print("✓ API keys loaded" if LLAMA_CLOUD_API_KEY and GEMINI_API_KEY else "✗ Missing API keys")


✓ API keys loaded
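The check above only prints a message when a key is missing, so later cells would still fail with less obvious errors. A minimal sketch of a stricter alternative follows; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demo with a throwaway variable so the sketch runs without real keys
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

Failing early like this keeps a missing key from surfacing later as an authentication error inside the parser or the Gemini client.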


In [3]:
# Define the financial filings
FILINGS = [
    {"id": "apple_2023", "company": "Apple", "year": 2023, "path": r"C:\Users\rushy\Downloads\FINBOT\GenAI_FInBot\NOV_2023.pdf"},
    #{"id": "apple_2024", "company": "Apple", "year": 2024, "path": r"C:\Users\rushy\Downloads\FINBOT\GenAI_FInBot\NOV_2024.pdf"},
    #{"id": "apple_2025", "company": "Apple", "year": 2025, "path": r"C:\Users\rushy\Downloads\FINBOT\GenAI_FInBot\OCT_2025.pdf"},
]


In [None]:
import nest_asyncio
nest_asyncio.apply() 

from llama_cloud_services import LlamaParse

# Initialize LlamaParse with the new llama_cloud_services package
parser = LlamaParse(api_key=LLAMA_CLOUD_API_KEY)

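def extract_text_with_llamaparse(file_path):
    """Extract per-page text from a PDF using LlamaParse."""
    # parse() returns a job result object, not a list of documents.
    # Iterating it directly yields (field_name, value) tuples, so read .pages.
    result = parser.parse(file_path)

    pages = []
    for i, page in enumerate(result.pages):
        pages.append({
            # Use the parser's own page number when available
            "page": getattr(page, "page", i + 1),
            # Prefer the markdown rendering (keeps tables); fall back to plain text
            "text": getattr(page, "md", None) or page.text,
        })
    return pages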

print("✓ LlamaParse initialized (llama_cloud_services)")


✓ LlamaParse initialized (llama_cloud_services)


In [5]:
import re

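def chunk_text(pages, chunk_chars=1500, overlap=300):
    """
    Split each page's text into overlapping fixed-size character chunks.
    Chunks are fairly large because LlamaParse markdown preserves structure better.
    """
    # Guard against an infinite loop: the window must always move forward
    if not 0 <= overlap < chunk_chars:
        raise ValueError("overlap must be non-negative and smaller than chunk_chars")

    chunks = []
    for page in pages:
        # Collapse all whitespace runs (including newlines) into single spaces
        text = re.sub(r"\s+", " ", page["text"]).strip()
        start = 0
        while start < len(text):
            end = min(len(text), start + chunk_chars)
            chunks.append({
                "text": text[start:end],
                "page_start": page["page"],
                "page_end": page["page"],
            })
            if end >= len(text):
                break
            # Step back by `overlap` so consecutive chunks share context
            start = end - overlap
    return chunks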

print("✓ Chunking function ready")


✓ Chunking function ready
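The chunker advances by `chunk_chars - overlap` characters per step, so the number of chunks per page can be predicted in closed form. The sketch below re-implements the same sliding window on a plain string and compares it with that formula; `sliding_window` and `expected_chunk_count` are illustrative names, not part of the notebook.

```python
import math


def sliding_window(text, size, overlap):
    """Yield fixed-size character windows that overlap by `overlap` characters."""
    step = size - overlap
    start = 0
    while start < len(text):
        end = min(len(text), start + size)
        yield text[start:end]
        if end >= len(text):
            break
        start += step


def expected_chunk_count(n_chars, size, overlap):
    """Closed-form chunk count for a text of n_chars characters."""
    if n_chars == 0:
        return 0
    if n_chars <= size:
        return 1
    return math.ceil((n_chars - size) / (size - overlap)) + 1


text = "x" * 4000
windows = list(sliding_window(text, size=1500, overlap=300))
print(len(windows), expected_chunk_count(4000, 1500, 300))
print([len(w) for w in windows])
```

With the default settings, each additional 1200 characters of page text adds one chunk, which gives a quick sanity check on the chunk counts reported during ingestion.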


In [6]:
from google import genai
import chromadb
import uuid

# Initialize Gemini client
gemini_client = genai.Client(api_key=GEMINI_API_KEY)

# Gemini embedding function
def get_gemini_embeddings(texts, model="text-embedding-004"):
    """Get embeddings using Google's Gemini embedding model."""
    embeddings = []
    for text in texts:
        # One request per text; each response holds a single embedding
        result = gemini_client.models.embed_content(
            model=model,
            contents=text
        )
        embeddings.append(result.embeddings[0].values)
    return embeddings

# Initialize ChromaDB with on-disk persistence.
# In recent chromadb versions, Client(Settings(persist_directory=...)) alone stays in memory.
chroma_client = chromadb.PersistentClient(path="./chroma_llamaparse_db")

# Delete the existing collection if present (fresh start)
try:
    chroma_client.delete_collection("filings_llamaparse")
    print("Deleted old collection")
except Exception:
    # The collection did not exist yet; nothing to delete
    pass

collection = chroma_client.get_or_create_collection("filings_llamaparse")
print("✓ ChromaDB & Gemini Embeddings initialized")


✓ ChromaDB & Gemini Embeddings initialized
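Retrieval later ranks stored chunks by how close their embedding vectors are to the query vector. The sketch below illustrates that idea with cosine similarity on toy two-dimensional vectors. This is an assumption for illustration: the distance metric a ChromaDB collection actually uses depends on its configuration and may not be cosine, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a query embedding and two chunk embeddings
query = [1.0, 0.0]
toy_chunks = {"close": [0.9, 0.1], "unrelated": [0.0, 1.0]}

# Rank chunk names from most to least similar to the query
ranked = sorted(toy_chunks, key=lambda name: cosine_similarity(query, toy_chunks[name]), reverse=True)
print(ranked)
```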


In [7]:
def ingest_filings_with_llamaparse(filings, batch_size=32):
    """Ingest PDFs using LlamaParse and store in ChromaDB with Gemini embeddings."""
    
    for filing in filings:
        print(f"\n📄 Processing: {filing['id']}...")
        
        # Extract text using LlamaParse
        pages = extract_text_with_llamaparse(filing["path"])
        print(f"   Extracted {len(pages)} pages")
        
        # Chunk the text
        chunks = chunk_text(pages)
        print(f"   Created {len(chunks)} chunks")
        
        # Prepare data
        documents = [ch["text"] for ch in chunks]
        metadatas = [{
            "filing_id": filing["id"],
            "company": filing["company"],
            "year": filing["year"],
            "page_start": ch["page_start"],
            "page_end": ch["page_end"],
        } for ch in chunks]
        ids = [str(uuid.uuid4()) for _ in chunks]
        
        # Batch insert with embeddings
        for i in range(0, len(documents), batch_size):
            docs_b = documents[i:i+batch_size]
            metas_b = metadatas[i:i+batch_size]
            ids_b = ids[i:i+batch_size]
            
            # Get Gemini embeddings
            embeddings_b = get_gemini_embeddings(docs_b)
            
            collection.add(
                documents=docs_b,
                metadatas=metas_b,
                ids=ids_b,
                embeddings=embeddings_b
            )
            print(f"   Added batch {i//batch_size + 1}/{(len(documents)-1)//batch_size + 1}")
        
        print(f"✓ {filing['id']} → {len(documents)} chunks ingested")

# Run the ingestion
ingest_filings_with_llamaparse(FILINGS)



📄 Processing: apple_2023...
Started parsing the file under job_id 08789195-be26-417a-ac74-082ded40fd74
.   Extracted 8 pages
   Created 1144 chunks
   Added batch 1/36
   Added batch 2/36
   Added batch 3/36
   Added batch 4/36
   Added batch 5/36
   Added batch 6/36
   Added batch 7/36
   Added batch 8/36
   Added batch 9/36
   Added batch 10/36
   Added batch 11/36
   Added batch 12/36
   Added batch 13/36
   Added batch 14/36
   Added batch 15/36
   Added batch 16/36
   Added batch 17/36
   Added batch 18/36
   Added batch 19/36
   Added batch 20/36
   Added batch 21/36
   Added batch 22/36
   Added batch 23/36
   Added batch 24/36
   Added batch 25/36
   Added batch 26/36
   Added batch 27/36
   Added batch 28/36
   Added batch 29/36
   Added batch 30/36
   Added batch 31/36
   Added batch 32/36
   Added batch 33/36
   Added batch 34/36
   Added batch 35/36
   Added batch 36/36
✓ apple_2023 → 1144 chunks ingested
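The `36` in the progress lines comes from the ceiling division `(len(documents) - 1) // batch_size + 1`. A small standalone sketch of the same batching arithmetic, using a list of integers as a stand-in for the 1144 chunks (`batched` and `batch_count` are illustrative helper names):

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def batch_count(n_items, batch_size):
    """Ceiling division, written the same way as in the progress message."""
    return (n_items - 1) // batch_size + 1


chunks = list(range(1144))  # stand-in for the 1144 chunks reported in the log
batches = list(batched(chunks, 32))
print(len(batches), batch_count(1144, 32), len(batches[-1]))
```

The last batch is smaller than the rest because 1144 is not a multiple of 32.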


In [8]:
# Verify ingestion
peek = collection.peek(limit=3)
print(f"Total documents in collection: {collection.count()}\n")

for i in range(len(peek["documents"])):
    print("=" * 80)
    print(f"METADATA: {peek['metadatas'][i]}")
    print(f"TEXT (first 300 chars):\n{peek['documents'][i][:300]}...")


Total documents in collection: 1144

METADATA: {'filing_id': 'apple_2023', 'year': 2023, 'page_start': 1, 'page_end': 1, 'company': 'Apple'}
TEXT (first 300 chars):
('pages', [Page(page=1, text=' UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\n Washington, D.C. 20549\n\n FORM 10-K\n\n (Mark One)\n ☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n For the fiscal year ended September 30, 2023\n or\n☐ TRANSITION REPORT PUR...
METADATA: {'page_start': 1, 'filing_id': 'apple_2023', 'company': 'Apple', 'year': 2023, 'page_end': 1}
TEXT (first 300 chars):
et LLC\n1.625% Notes due 2026 - The Nasdaq Stock Market LLC\n2.000% Notes due 2027 - The Nasdaq Stock Market LLC\n1.375% Notes due 2029 - The Nasdaq Stock Market LLC\n3.050% Notes due 2029 - The Nasdaq Stock Market LLC\n0.500% Notes due 2031 - The Nasdaq Stock Market LLC\n3.600% Notes due 2042 - The...
METADATA: {'company': 'Apple', 'year': 2023, 'filing_id': 'apple_2023', 'page_start
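The first peeked chunk begins with `('pages', [Page(page=1, text=...`, which looks like the printed form of a Python object rather than text from the filing. A lightweight check such as the one sketched below could flag chunks like that before they are embedded. The helper name `looks_like_repr` and its two heuristics are assumptions made for illustration and will not catch every case.

```python
import re

# Heuristic patterns that suggest a chunk is a stringified Python object:
# a leading ('field', ...) tuple, or a ClassName(field=... constructor call.
_REPR_PATTERNS = [
    re.compile(r"^\(\s*'[A-Za-z_]\w*'\s*,"),
    re.compile(r"\b[A-Z]\w*\(\w+="),
]


def looks_like_repr(text, probe_chars=200):
    """Return True if the start of `text` resembles a Python repr."""
    head = text[:probe_chars]
    return any(p.search(head) for p in _REPR_PATTERNS)


print(looks_like_repr("('pages', [Page(page=1, text=' UNITED STATES"))
print(looks_like_repr("UNITED STATES SECURITIES AND EXCHANGE COMMISSION"))
```

Running a check like this over `peek["documents"]` right after ingestion is cheap and points directly at extraction problems.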

In [9]:
SYSTEM_INSTRUCTIONS = """
You are a financial assistant specialized in analyzing SEC filings and financial documents.
Answer ONLY using the provided filing snippets.
Cite sources as (filing_id, year, pages X-Y).
If information is not present, say "I don't know based on the filings."
When discussing numbers, be precise and include units (millions, billions, etc.).
"""

def retrieve(query, k=6, filter_by=None):
    """Retrieve relevant chunks using Gemini embeddings."""
    # Get embedding for query
    query_embedding = get_gemini_embeddings([query])[0]
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        where=filter_by if filter_by else None
    )
    
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    
    snippets = []
    for d, m in zip(docs, metas):
        cite = f"{m['filing_id']} ({m['year']}), pages {m['page_start']}-{m['page_end']}"
        snippets.append(f"[{cite}]\n{d}")
    
    return "\n\n".join(snippets)

def build_prompt(query, context):
    return f"{SYSTEM_INSTRUCTIONS}\n\nContext from filings:\n{context}\n\nQuestion: {query}\nAnswer:"

print("✓ Retrieval functions ready")


✓ Retrieval functions ready
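`retrieve` accepts a `filter_by` dictionary that is passed to ChromaDB as the `where` clause. A single equality condition can be a plain `{"field": value}` dictionary, while several conditions are combined under an explicit `"$and"` operator. The helper below is a hypothetical convenience for building such filters from the metadata fields stored at ingestion (`company`, `year`, `filing_id`); it is a sketch, not part of ChromaDB.

```python
def build_where(**conditions):
    """Build a ChromaDB-style `where` filter from keyword equality conditions.

    Returns None for no conditions, a plain dict for one condition, and an
    explicit "$and" clause when several conditions are combined.
    """
    clauses = [{key: value} for key, value in conditions.items()]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(build_where())
print(build_where(year=2023))
print(build_where(company="Apple", year=2023))
```

A call such as `retrieve("net income", filter_by=build_where(company="Apple", year=2023))` would then restrict the search to a single filing year once more filings are ingested.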


In [14]:
def answer(query, filter_by=None, k=6):
    """Answer a question using RAG with LlamaParse + Gemini."""
    context = retrieve(query, k=k, filter_by=filter_by)
    prompt = build_prompt(query, context)
    
    response = gemini_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[{"role": "user", "parts": [{"text": prompt}]}],
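        # HttpRetryOptions must be wrapped in HttpOptions; passing it directly
        # triggers the "Type mismatch ... expected HttpOptions" warning.
        config=types.GenerateContentConfig(
            http_options=types.HttpOptions(retry_options=retry_config)
        )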
    )
    return response.text

# Test the RAG system
print("=" * 80)
print("Question: Who is the ceo in 2023?")
print("=" * 80)
print(answer("Who is the ceo in 2023?"))


Question: Who is the ceo in 2023?


Type mismatch in GenerateContentConfig.http_options: expected HttpOptions, got HttpRetryOptions


Timothy D. Cook is the Chief Executive Officer (Principal Executive Officer) as of November 2, 2023 (apple_2023, 2023, pages 1-1).


In [15]:
# More test queries
test_queries = [
    "Compare iPhone revenue between 2022 and 2023",
    "What are Apple's main risk factors?",
    "What was the net income in 2023?",
]

for q in test_queries:
    print("\n" + "=" * 80)
    print(f"Q: {q}")
    print("=" * 80)
    print(answer(q))



Q: Compare iPhone revenue between 2022 and 2023


Type mismatch in GenerateContentConfig.http_options: expected HttpOptions, got HttpRetryOptions


iPhone net sales decreased 2% or $4.9 billion during 2023 compared to 2022. The net sales for iPhone were $200,583 million in 2023, down from $205,489 million in 2022 (apple_2023, 2023, pages 1-1). This decrease was due to lower net sales of non-Pro iPhone models, partially offset by higher net sales of Pro iPhone models (apple_2023, 2023, pages 1-1).

Q: What are Apple's main risk factors?


Type mismatch in GenerateContentConfig.http_options: expected HttpOptions, got HttpRetryOptions


Apple's main risk factors, which can materially and adversely affect its business, reputation, results of operations, financial condition, and stock price, include:

*   **Macroeconomic and Industry Risks:**
    *   Political events, trade and other international disputes, war, terrorism, natural disasters, public health issues, industrial accidents, and other business interruptions can harm international commerce and the global economy (apple_2023, 2023, pages 1-1).
    *   Restrictions on international trade, such as tariffs and other controls on imports or exports, can adversely affect operations and supply chain (apple_2023, 2023, pages 1-1).
    *   Intense media, political, and regulatory scrutiny exposes the Company to increasing regulation, government investigations, legal actions, and penalties, such as compliance with the EU Digital Markets Act (apple_2023, 2023, pages 1-1).
    *   Dependence on the continuing and timely introduction of innovative new products, services, and

Type mismatch in GenerateContentConfig.http_options: expected HttpOptions, got HttpRetryOptions


Net income in 2023 was $96,995 million (apple_2023, 2023, pages 1-1).
