# RAG Pipeline for Annual Report Analyzer

This notebook combines the PDF processing and RAG system components into a single interactive pipeline.

## 1. Setup & Dependencies
Run the following cell to ensure all dependencies are installed.

In [4]:
%pip install langchain-google-genai langchain-community langchain-huggingface chromadb pypdf python-dotenv sentence-transformers google-generativeai


## 2. Imports and Environment Setup

In [5]:
import os
import re
from typing import List, Optional, Any
from dotenv import load_dotenv
import google.generativeai as genai  # deprecated package; not used by the code below (Gemini calls go through ChatGoogleGenerativeAI)

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


load_dotenv()

google_api_key = os.getenv("GOOGLE_API_KEY")
if not google_api_key:
    print("WARNING: GOOGLE_API_KEY not found in environment variables.")

All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md


## 3. PDF Processor Class
Handles loading, chunking, and cleaning of PDF documents.

In [7]:
class PDFProcessor:
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        """
        Initialize PDF processor.
        
        Args:
            chunk_size: Size of text chunks
            chunk_overlap: Overlap between chunks to maintain context
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]
        )
    
    def load_pdf(self, pdf_path: str) -> List[Document]:
        """
        Load a PDF and clean the text of each page.

        Args:
            pdf_path: Path to the PDF file
        Returns:
            List of Document objects, one per page, with cleaned text
        """
        print(f"Loading PDF: {pdf_path}")
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()
        print(f"Loaded {len(documents)} pages")

        # Clean each page's text in place before chunking.
        for doc in documents:
            doc.page_content = self._clean_text(doc.page_content)
        
        return documents
    
    def chunk_documents(self, documents: List[Document]) -> List[Document]:
        """
        Split documents into smaller chunks.
        
        Args:
            documents: List of documents to chunk
            
        Returns:
            List of chunked documents
        """
        chunks = self.text_splitter.split_documents(documents)
        print(f"Created {len(chunks)} chunks from documents")
        return chunks
    
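    def _clean_text(self, text: str) -> str:
        """
        Normalize whitespace and strip characters outside an allow-list.

        Args:
            text: Raw text content
            
        Returns:
            Cleaned text
        """
        # Collapse every whitespace run, newlines included, into one space.
        # After this the splitter's "\n\n" and "\n" separators no longer
        # match, so splitting falls back to the space separator.
        text = re.sub(r'\s+', ' ', text)
        
        # Keep word characters, whitespace, and $ % . , - ( ) : ; /
        # Everything else is removed, including apostrophes, quotes, &,
        # and currency signs other than $.
        text = re.sub(r'[^\w\s\$\%\.\,\-\(\)\:\;\/]', '', text)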
        
        return text.strip()
    
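    def extract_financial_data(self, text: str) -> List[str]:
        """
        Extract dollar amounts from text.

        Args:
            text: Text to extract from
        Returns:
            List of extracted financial figures
        """
        # Pattern for currency amounts: $123,456.78 or $123.4 million/billion.
        # Thousands groups, decimals, and the unit (with its leading space) are
        # each optional, so matches carry no trailing space, comma, or period.
        currency_pattern = r'\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s*(?:million|billion|trillion))?'
        return re.findall(currency_pattern, text)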
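
To make the cleaning and extraction steps concrete, here is a minimal standalone sketch using only the standard library. The function names mirror the methods above, but the sample sentence is made up, and the currency pattern is a stricter variant that may differ from the one in the class.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    # Collapse every whitespace run (newlines and tabs included) into one space.
    text = re.sub(r"\s+", " ", text)
    # Drop anything outside the allow-list: word chars, whitespace, $ % . , - ( ) : ; /
    text = re.sub(r"[^\w\s$%.,\-():;/]", "", text)
    return text.strip()


# Assumed pattern for illustration: digits with optional thousands groups,
# optional decimals, and an optional magnitude word.
CURRENCY = re.compile(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s*(?:million|billion|trillion))?"
)


def extract_financial_data(text: str) -> List[str]:
    # All groups are non-capturing, so findall returns the full matches.
    return CURRENCY.findall(text)


# Made-up sample text, not taken from any report.
raw = "Revenue rose to\n$1,234.5   million* (up 12%);\tcash flow was $84.9 billion."
cleaned = clean_text(raw)
figures = extract_financial_data(cleaned)

print(cleaned)
print(figures)  # ['$1,234.5 million', '$84.9 billion']
```

Note that the allow-list also strips apostrophes, so a possessive such as `Amazon's` comes out as `Amazons`; whether that matters depends on the questions asked of the report.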

## 4. RAG System Class
Configures the embedding model (HuggingFace), vector store (Chroma), and LLM (Gemini).

In [8]:
class AnnualReportRAG:
    """RAG system for analyzing annual reports with numerical accuracy."""
    
    def __init__(
        self,
        google_api_key: str,
        persist_directory: str = "./chroma_db",
        model_name: str = "gemini-flash-latest"
    ):
        """
        Initialize RAG system.
        
        Args:
            google_api_key: Google API key
            persist_directory: Directory to persist vector store
            model_name: Gemini model to use
        """
        self.google_api_key = google_api_key
        self.persist_directory = persist_directory
        self.model_name = model_name
        
        # Initialize components
        # Use HuggingFace embeddings (free, local, no API limits)
        print("Loading embedding model... (first time may take a minute to download)")
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2",
            model_kwargs={'device': 'cpu'}
        )
        
        # Initialize LLM
        self.llm = ChatGoogleGenerativeAI(
            model=model_name,
            temperature=0,  # Low temperature for factual accuracy
            google_api_key=google_api_key
        )
        
        self.vectorstore: Optional[Chroma] = None
        self.qa_chain: Optional[Any] = None
        self.retriever: Optional[Any] = None
        self.pdf_processor = PDFProcessor()
        
    def load_and_index_documents(self, pdf_paths: List[str]) -> None:
        """
        Load PDF documents and create vector store.
        
        Args:
            pdf_paths: List of paths to PDF files
        """
        all_chunks = []
        
        for pdf_path in pdf_paths:
            if not os.path.exists(pdf_path):
                print(f"Error: File not found at {pdf_path}")
                continue
                
            # Load and chunk each PDF
            documents = self.pdf_processor.load_pdf(pdf_path)
            chunks = self.pdf_processor.chunk_documents(documents)
            all_chunks.extend(chunks)
        
        if not all_chunks:
            print("No documents to process.")
            return
            
        print(f"Total chunks to index: {len(all_chunks)}")
        
        # Create vector store
        self.vectorstore = Chroma.from_documents(
            documents=all_chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory
        )
        
        print(f"Vector store created and persisted to {self.persist_directory}")
        
        # Create QA chain
        self._create_qa_chain()
    
    def load_existing_vectorstore(self) -> None:
        """Load existing vector store from disk."""
        if not os.path.exists(self.persist_directory):
            raise ValueError(f"Vector store not found at {self.persist_directory}")
        
        self.vectorstore = Chroma(
            persist_directory=self.persist_directory,
            embedding_function=self.embeddings
        )
        
        print(f"Loaded existing vector store from {self.persist_directory}")
        self._create_qa_chain()
    
    def _create_qa_chain(self) -> None:
        """Create the QA chain with custom prompt."""
        
        # Custom prompt for financial accuracy
        template = """You are a financial analyst assistant analyzing company annual reports. 
Your task is to answer questions based ONLY on the provided context from the annual report.

CRITICAL RULES:
1. Use ONLY the exact numbers and figures found in the context
2. NEVER make up, estimate, or calculate numbers that aren't explicitly stated
3. If a specific figure is not in the context, clearly state "This information is not available in the provided context"
4. When citing numbers, quote them exactly as they appear in the source
5. Preserve units (millions, billions, percentages, etc.) exactly as stated
6. If asked about trends or comparisons, only use data explicitly present in the context

Context from annual report:
{context}

Question: {question}

Detailed Answer (with exact figures from the context):"""

        prompt = ChatPromptTemplate.from_template(template)
        
        # Create the retriever
        retriever = self.vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 5}  # Retrieve top 5 most relevant chunks
        )
        
        # Helper function to format documents
        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)
        
        # Create chain using LCEL (LangChain Expression Language)
        self.qa_chain = (
            {
                "context": retriever | format_docs,
                "question": RunnablePassthrough()
            }
            | prompt
            | self.llm
            | StrOutputParser()
        )
        
        # Store retriever for getting source documents
        self.retriever = retriever
    
    def ask_question(self, question: str) -> dict:
        """
        Ask a question about the annual report.
        
        Args:
            question: Question to ask
            
        Returns:
            Dictionary with answer and source documents
        """
        if self.qa_chain is None:
            raise ValueError("QA chain not initialized. Load documents first.")
        
        # Get the answer
        answer = self.qa_chain.invoke(question)
        
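        # Get source documents. Note: the chain above already ran the
        # retriever once internally, so this is a second retrieval call.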
        source_documents = self.retriever.invoke(question)
        
        return {
            "question": question,
            "answer": answer,
            "source_documents": source_documents
        }
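
The chain in this class retrieves the top-k chunks by similarity, joins them with blank lines, and fills them into the prompt. The sketch below illustrates that flow with a toy stand-in: word-count vectors and cosine similarity replace the HuggingFace embeddings and Chroma, and the shortened template, sample chunks, and function names are all hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (a stand-in for a real model).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 5) -> List[str]:
    # Rank all chunks by similarity to the question and keep the best k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def format_docs(docs: List[str]) -> str:
    # Same joining rule as the helper in the class: blank line between chunks.
    return "\n\n".join(docs)


# Shortened, hypothetical template; the real one carries the accuracy rules.
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


def build_prompt(question: str, chunks: List[str], k: int = 5) -> str:
    context = format_docs(retrieve(question, chunks, k))
    return TEMPLATE.format(context=context, question=question)


# Made-up chunks for illustration only.
chunks = [
    "operating cash flow increased during the year",
    "the board declared no dividend",
    "cash flow from operating activities is reported quarterly",
]
question = "what is the operating cash flow"
top_two = retrieve(question, chunks, k=2)
prompt = build_prompt(question, chunks, k=2)
print(prompt)
```

With `k=2` the dividend chunk is left out of the prompt, which is the point of retrieval: the LLM only sees the chunks that score highest against the question.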

## 5. Main Execution
Run the RAG system.

In [12]:
# Initialize the System
rag = AnnualReportRAG(
    google_api_key=google_api_key,
    model_name="gemini-flash-latest"  # Using the latest Flash model for speed/cost
)

pdf_path = "C:\\Users\\samri\\Downloads\\Amazon-2024-Annual-Report.pdf"  
rag.load_and_index_documents([pdf_path])

Loading embedding model... (first time may take a minute to download)
Loading PDF: C:\Users\samri\Downloads\Amazon-2024-Annual-Report.pdf
Loaded 91 pages
Created 421 chunks from documents
Total chunks to index: 421
Vector store created and persisted to ./chroma_db
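
The log reports 421 chunks from 91 pages, between four and five chunks per page on average with the default `chunk_size=1000` and `chunk_overlap=200`. As a rough illustration of how size and overlap interact, here is a fixed-width sliding-window chunker. It is a simplification and an assumption for illustration: `RecursiveCharacterTextSplitter` prefers to break on its separators, so its boundaries and counts will differ.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    # Hypothetical helper: each window starts (chunk_size - chunk_overlap)
    # characters after the previous one, so neighbours share chunk_overlap chars.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more text across neighbouring chunks, which raises the chunk count for the same document but makes it less likely that a figure and its label are split apart.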


In [14]:

try:
    # Optional: uncomment to reuse the persisted vector DB instead of re-indexing
    # rag.load_existing_vectorstore()
    if os.path.exists(pdf_path):
        query = "What is the operating cash flow?"
        result = rag.ask_question(query)
        
        print("\n=== Question ===")
        print(result['question'])
        print("\n=== Answer ===")
        print(result['answer'])
        print("\n=== Sources ===")
        for i, doc in enumerate(result['source_documents']):
            print(f"Source {i+1}: {doc.page_content[:200]}...")
    else:
        print(f"Please place a PDF file at '{pdf_path}' or update the path variable to test the pipeline.")

except Exception as e:
    print(f"An error occurred: {e}")


=== Question ===
What is the operating cash flow?

=== Answer ===
Cash provided by (used in) operating activities was "$84.9 billion" in 2023 and "$115.9 billion" in 2024.

=== Sources ===
Source 1: December 31, 2023 and 2024. Our foreign currency balances include British Pounds, Canadian Dollars, Euros, Indian Rupees, and Japanese Yen. Cash provided by (used in) operating activities was $84.9 bi...
Source 2: December 31, 2023 and 2024. Our foreign currency balances include British Pounds, Canadian Dollars, Euros, Indian Rupees, and Japanese Yen. Cash provided by (used in) operating activities was $84.9 bi...
Source 3: December 31, 2023 and 2024. Our foreign currency balances include British Pounds, Canadian Dollars, Euros, Indian Rupees, and Japanese Yen. Cash provided by (used in) operating activities was $84.9 bi...
Source 4: December 31, 2023 and 2024. Our foreign currency balances include British Pounds, Canadian Dollars, Euros, Indian Rupees, and Japanese Yen. Cash provided by (
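
The sources printed above begin with the same text. One possible cause, stated here as an assumption that the output alone does not confirm, is that the indexing cell was run more than once against the same `./chroma_db` directory, which would store the same chunks again. A minimal sketch of de-duplicating retrieved chunks by content follows; `Doc` is a hypothetical stand-in for `langchain_core.documents.Document`, and `dedupe_by_content` is not part of the notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Doc:
    # Hypothetical stand-in for langchain_core.documents.Document.
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def dedupe_by_content(docs: Iterable[Doc]) -> List[Doc]:
    # Keep the first occurrence of each distinct page_content, preserving order.
    seen = set()
    unique: List[Doc] = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up retrieval result with repeated chunks.
retrieved = [Doc("chunk A"), Doc("chunk A "), Doc("chunk B"), Doc("chunk A")]
unique_docs = dedupe_by_content(retrieved)
print([d.page_content for d in unique_docs])  # ['chunk A', 'chunk B']
```

Filtering after retrieval only hides the symptom: duplicated entries still occupy slots in the top-k results, so deleting the persist directory before re-indexing, or loading the existing store instead, addresses the likely cause more directly.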