<a href="https://www.kaggle.com/code/emmanuelajalae/rag-application-with-financial-reports?scriptVersionId=234948463" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# SEC 10-Q Financial Report Analysis: RAG Implementation with Google Gemini

    Human Author: Emmanuel Ajala (Emmanuelajala22@gmail.com)
    AI Contributor: Claude AI
    Date: 20-April-2025

# Introduction
This notebook demonstrates a Retrieval-Augmented Generation (RAG) system designed to analyze and extract insights from SEC 10-Q quarterly financial reports of Amazon (AMZN).

# Data Source:
The dataset consists of SEC 10-Q quarterly reports for Amazon, originally downloaded from the company's investor relations site. The documents contain standardized financial statements, management discussions, risk disclosures, and other regulatory information that provide insight into the company's financial health and operational performance. The main data source covers more companies; only a subset was selected here as a test use case.

[https://github.com/docugami/KG-RAG-datasets/tree/main?tab=readme-ov-file](https://github.com/docugami/KG-RAG-datasets/tree/main?tab=readme-ov-file)

# Some Real-Life Use Cases for This Submission:

1. Financial Analysis: Extract financial metrics, compare quarterly performance, and identify trends in revenue, profit margins, and cash flow
2. Risk Assessment: Identify potential business risks, litigation concerns, and regulatory issues disclosed in the reports
3. Legal Document Analysis: Extract and compare corporate policies across different companies, patent information, licensing agreements, IP strategies, and more


# GEN AI Capabilities Utilized

✅ Used:

1. Document understanding (PDF processing)
2. Embeddings (Gemini embedding model)
3. RAG (ChromaDB + LLM-generated responses)
4. Vector search (ChromaDB storage & retrieval)
5. Few shot prompting

# Technical Implementation:
This notebook implements a complete RAG pipeline with the following components:

1. Document loading and preprocessing using LangChain
2. Text chunking with RecursiveCharacterTextSplitter
3. Semantic embedding generation with Google's embedding models
4. Vector storage and retrieval with ChromaDB
5. Response generation with gemini-1.5-pro

The system enables people to ask complex questions across financial and legal domains and receive accurate, contextually relevant answers based on official corporate disclosures and legal documents. 
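
The retrieval step at the heart of this pipeline can be illustrated with a toy example. The sketch below is not the notebook's implementation: it swaps the Gemini embedding model for a bag-of-words count vector and ChromaDB for a plain list, and the function names (`embed`, `cosine_similarity`, `retrieve`) and sample chunks are made up for illustration. It only shows the idea of ranking chunks by similarity to the query.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Invented sample chunks, not taken from the filings
toy_chunks = [
    "net sales increased compared to the prior year quarter",
    "the company is subject to legal proceedings and claims",
    "cash flows from operating activities",
]
top = retrieve("how did net sales change", toy_chunks, k=1)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, which is why the notebook uses Gemini embeddings instead of word counts.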


In [1]:
!pip install -q pypdf chromadb langchain langchain-community langchain-google-genai


In [2]:
# Standard library
import os

# PDF loading and chunking
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Vector store
import chromadb

# Gemini SDK and LangChain chat wrapper
import google.generativeai as genai
from langchain_google_genai import ChatGoogleGenerativeAI

# Kaggle secret storage for the API key
from kaggle_secrets import UserSecretsClient



In [3]:
genai.__version__

'0.8.4'

In [4]:
# Get API key
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

# Configure the Google Generative AI
genai.configure(api_key=GOOGLE_API_KEY)
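
The cell above only works inside Kaggle, because `kaggle_secrets` is not available elsewhere. A minimal sketch of a fallback, assuming the key is exported as an environment variable when running locally, is shown below. The helper name `load_google_api_key` is hypothetical and not part of any library.

```python
import os


def load_google_api_key(secret_name="GOOGLE_API_KEY"):
    """Return the API key from Kaggle secrets, else from the environment."""
    try:
        # Only importable inside a Kaggle kernel
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Not on Kaggle, or the secret is not attached to this notebook
        key = os.environ.get(secret_name)
        if not key:
            raise RuntimeError(
                f"{secret_name} not found in Kaggle secrets or environment variables"
            )
        return key
```

The returned value could then be passed to `genai.configure(api_key=...)` exactly as in the cell above.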

In [5]:
pdf_paths = []
for dirname, _, filenames in os.walk('/kaggle/input/law-document'):
    for filename in filenames:
        if filename.endswith('.pdf'):
            pdf_paths.append(os.path.join(dirname, filename))

# Extract text from all PDFs
documents = []
for path in pdf_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200  
)
texts = text_splitter.split_documents(documents)

for i, doc in enumerate(texts):
    print(f"Chunk {i+1}:\n{doc.page_content}\n")

Chunk 1:
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 ____________________________________
FORM 10-Q
____________________________________ 
(Mark One)
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF1934
For the quarterly period ended September 30, 2022
or
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF1934
For the transition period from            to             .
Commission File No. 000-22513
____________________________________
AMAZON.COM, INC.
(Exact name of registrant as specified in its charter)
 ____________________________________
Delaware  91-1646860
(State or other jurisdiction ofincorporation or organization)  (I.R.S. EmployerIdentification No.)
410 Terry Avenue North, Seattle, Washington 98109-5210(206) 266-1000(Address and telephone number, including area code, of registrant’s principal executive offices)
Securities registered pursuant to Section 12(b) of 
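
To see what `chunk_size` and `chunk_overlap` control, the sketch below uses a plain sliding window over characters. This is a simplification and an assumption for illustration: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.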

In [7]:
from chromadb.utils.embedding_functions import EmbeddingFunction

# Define Gemini embedding function for ChromaDB
class GeminiEmbeddingFunction(EmbeddingFunction):
    def __init__(self):
        # Initialize any attributes needed for the embedding function
        self.model = "models/embedding-001"
        self.task_type = "retrieval_document"
        
    def __call__(self, input):
        if not isinstance(input, list):
            input = [input]
        
        embeddings = []
        for text in input:
            try:
                embedding = genai.embed_content(
                    model=self.model,
                    content=text,
                    task_type=self.task_type
                )
                # Extract the actual embedding values
                embeddings.append(embedding["embedding"])
            except Exception as e:
                print(f"Error generating embedding: {e}")
                # Provide a fallback with appropriate dimensions
                embeddings.append([0.0] * 768)
                
        return embeddings
        
# Initialize ChromaDB client
chroma_client = chromadb.PersistentClient(path="/kaggle/working/chroma_db")

# Create embedding function instance
embed_fn = GeminiEmbeddingFunction()
# Create or get the collection
law_db = chroma_client.get_or_create_collection(
    name="law_documents", 
    embedding_function=embed_fn
)

# Add documents to vector store
law_db.add(
    documents=[text.page_content for text in texts],
    ids=[f"chunk_{i}" for i in range(len(texts))]
)
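
One caveat with the embedding function above: when a request fails, a zero vector is stored for that chunk, and the chunk then stays in the collection without a meaningful embedding. A possible alternative, sketched here under the assumption that most failures are transient (rate limits, network errors), is to retry with exponential backoff and raise if all attempts fail. The names `embed_with_retry` and `flaky_embed` are hypothetical; in the notebook, `embed_one` would be a small wrapper around `genai.embed_content`.

```python
import time


def embed_with_retry(embed_one, text, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call embed_one(text), retrying with exponential backoff on any exception."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return embed_one(text)
        except Exception as exc:
            last_error = exc
            # Wait 1x, 2x, 4x ... base_delay between attempts, but not after the last one
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Embedding failed after {max_attempts} attempts") from last_error


# Demonstration with a fake embedder that fails twice and then succeeds
calls = {"count": 0}


def flaky_embed(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return [0.1, 0.2, 0.3]


delays = []  # record the requested delays instead of actually sleeping
vector = embed_with_retry(flaky_embed, "net sales", sleep=delays.append)
print(vector, delays)
```

Raising on persistent failure makes a broken index visible at build time instead of surfacing later as poor retrieval.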

In [8]:
# Create a simple query function
def query_documents(query, k=3):
    # Query the vector store
    results = law_db.query(
        query_texts=[query],
        n_results=k
    )
    
    # Get the matching documents (these are already strings from ChromaDB)
    matching_docs = results['documents'][0]
    
    return matching_docs

test_query = "1. How has Amazon's total net sales changed over time? 2. Does Amazon report any significant new business acquisitions or divestitures in these 10-Qs? 3. Has Amazon engaged in any significant share repurchase activities in the reported quarters and what are the financial implications of these activities?"
matching_documents = query_documents(test_query)

# Print retrieved contexts
print("Retrieved contexts:")
for i, doc in enumerate(matching_documents):
    print(f"\n--- Document {i+1} ---\n{doc[:200]}...\n")

# Create RAG prompt with the string documents
context = "\n\n".join(matching_documents)

Retrieved contexts:

--- Document 1 ---
Table of Contents
Item 2. Management’s Discussion and Analysis of Financial Condition and Results of Operations
Forward-Looking Statements
This Quarterly Report on Form 10-Q includes forward-looking s...


--- Document 2 ---
and $42.9 billion for the nine months ended September 30, 2021 and 2022, which primarily reflect investments in technology infrastructure (the majority of
which is to support AWS business growth) and ...


--- Document 3 ---
where we record revenue gross. Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping
fees, AWS sales, advertising services, Ama...
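
The test query bundles three numbered questions into one string, so a single query embedding and only three retrieved chunks have to cover all of them. One possible refinement, sketched below, is to split the query into its sub-questions, retrieve for each one separately, and merge the results without duplicates. The helpers `split_questions` and `retrieve_per_question` are hypothetical, the regular expression is a simple heuristic for "1. ... 2. ..." numbering, and `fake_retriever` is a stand-in so the example runs without a vector store.

```python
import re


def split_questions(compound_query):
    """Split a '1. ... 2. ... 3. ...' string into individual questions."""
    parts = re.split(r"(?:^|\s)\d{1,2}\.\s+", compound_query)
    return [part.strip() for part in parts if part.strip()]


def retrieve_per_question(compound_query, retriever, k=3):
    """Run the retriever once per sub-question and merge results, keeping first-seen order."""
    merged, seen = [], set()
    for question in split_questions(compound_query):
        for doc in retriever(question, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def fake_retriever(question, k):
    """Stand-in for a vector store lookup, with invented results."""
    corpus = {
        "How did net sales change?": ["sales chunk", "shared chunk"],
        "Any share repurchases?": ["repurchase chunk", "shared chunk"],
    }
    return corpus[question][:k]


merged = retrieve_per_question(
    "1. How did net sales change? 2. Any share repurchases?", fake_retriever, k=2
)
print(merged)
```

Because `query_documents(query, k)` takes the same two arguments, it could be passed as the `retriever` in place of the fake one.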



In [9]:
# List available models
available_models = genai.list_models()
for model in available_models:
    if "gemini" in model.name.lower():
        print(f"Model name: {model.name}")
        print(f"Supported generation methods: {model.supported_generation_methods}")
        print("-" * 50)

Model name: models/gemini-1.0-pro-vision-latest
Supported generation methods: ['generateContent', 'countTokens']
--------------------------------------------------
Model name: models/gemini-pro-vision
Supported generation methods: ['generateContent', 'countTokens']
--------------------------------------------------
Model name: models/gemini-1.5-pro-latest
Supported generation methods: ['generateContent', 'countTokens']
--------------------------------------------------
Model name: models/gemini-1.5-pro-001
Supported generation methods: ['generateContent', 'countTokens', 'createCachedContent']
--------------------------------------------------
Model name: models/gemini-1.5-pro-002
Supported generation methods: ['generateContent', 'countTokens', 'createCachedContent']
--------------------------------------------------
Model name: models/gemini-1.5-pro
Supported generation methods: ['generateContent', 'countTokens']
--------------------------------------------------
Model name: models/gem

In [10]:
# Set up LLM for answering
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    google_api_key=GOOGLE_API_KEY,
    temperature=0.2
)

# Create RAG prompt
context = "\n\n".join(matching_documents)
prompt = f"""
You are a financial analyst reviewing Amazon's SEC filings. Answer questions using exact data from the documents when available.
If you cannot find the answer in the context, say "Answer not found."

Answer Guidelines:
    - Use specific numbers, quarters, and percentages when available
    - Clearly state if information is not found
    - Separate factual reporting from analysis
    - Keep answers under 100 words

Context:
{context}

Question: {test_query}

Answer:
"""

# Get response from LLM
response = llm.invoke(prompt)
print("Final Answer:")
print(response.content)

Final Answer:
1. Amazon's consolidated net sales increased 15% in Q3 2022 compared to Q3 2021 ($110.812 billion to $127.101 billion). For the nine months ended September 30, 2022, net sales increased by 10% compared to the same period in 2021 ($332.410 billion to $364.779 billion).

2. Amazon made cash payments, net of acquired cash, related to acquisition and other investment activity of $654 million and $885 million during Q3 2021 and Q3 2022, respectively, and $1.6 billion and $7.5 billion for the nine months ended September 30, 2021 and 2022, respectively.  While the text mentions investments in "acquisition and other investment activity," it doesn't provide details on specific significant acquisitions or divestitures.

3. Answer not found.  The provided text doesn't contain information regarding share repurchases.
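
The prompt above relies on instructions alone. A few-shot variant would add a handful of worked question and answer pairs before the real context, so the model can imitate the expected format, including the "Answer not found." case. The sketch below shows one way to assemble such a prompt. The function name `build_few_shot_prompt` is hypothetical, and the examples about "ExampleCo" are invented for illustration and do not come from any filing.

```python
def build_few_shot_prompt(context, question, examples):
    """Assemble instructions, worked examples, then the real context and question."""
    parts = [
        "You are a financial analyst reviewing Amazon's SEC filings.",
        "Answer using exact data from the context when available.",
        'If the answer is not in the context, say "Answer not found."',
        "",
    ]
    for example in examples:
        parts.extend([
            f"Example context: {example['context']}",
            f"Example question: {example['question']}",
            f"Example answer: {example['answer']}",
            "",
        ])
    parts.extend([f"Context:\n{context}", "", f"Question: {question}", "", "Answer:"])
    return "\n".join(parts)


# Invented examples for illustration only; ExampleCo and its figures are not real data
demo_examples = [
    {
        "context": "ExampleCo net sales were $10 million in Q1 and $12 million in Q2.",
        "question": "How did ExampleCo net sales change from Q1 to Q2?",
        "answer": "Net sales rose from $10 million in Q1 to $12 million in Q2, a 20% increase.",
    },
    {
        "context": "ExampleCo opened two new warehouses in Q2.",
        "question": "Did ExampleCo repurchase shares in Q2?",
        "answer": "Answer not found.",
    },
]
few_shot_prompt = build_few_shot_prompt("<retrieved chunks>", "<user question>", demo_examples)
print(few_shot_prompt)
```

In the notebook, the resulting string could be passed to `llm.invoke(...)` in place of `prompt`, with `context` and `test_query` as the first two arguments.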
