# **1. Introduction to the Problem**

For retail investors and finance enthusiasts, access to reliable and structured financial data is often behind paywalls or dependent on third-party interpretations. While all public companies release detailed financial documents like 10-K and 10-Q filings, these documents are lengthy, technical, and time-consuming to analyze manually.

This notebook addresses the challenge of automating the extraction of crucial information from such filings—starting with Apple’s 10-K—as a proof-of-concept. It aims to build a foundation for a Retrieval-Augmented Generation (RAG) system that can interpret, structure, and answer queries based on financial documents without relying on subscription-based services or third-party news outlets.

# **2. Motivation**

As someone interested in the stock market, I often find it difficult to gather precise information without relying on paid services or financial analysts. Even though SEC filings are publicly available, extracting meaningful insights from them manually is time-consuming and inefficient.

This project is born out of the desire to:
1. Automatically extract and structure critical information from 10-K filings.

2. Build an internal, searchable knowledge base using LLMs and embeddings.

3. Enable natural language querying over structured financial data.

4. Eventually expand this pipeline to include multiple companies and documents.

While this may seem ambitious, the current goal is to test it on a single 10-K (Apple Inc.) and evaluate its potential.

# **3. What We Have Covered in This Notebook**
This notebook implements a foundational pipeline involving:

- **Structured Output / JSON Mode / Controlled Generation:**
Extracted the risk factors and financial data from Apple’s 10-K and converted them into JSON format using few-shot prompting.

- **Document Understanding:**
Parsed PDF content with PyPDF2 and isolated the "Item 1A. Risk Factors" section.

- **Few-shot Prompting:**
Provided examples to guide the model’s generation style and structure.

- **Embeddings:**
Used sentence-transformers to convert extracted content into embeddings.

- **Vector Search / Vector Database:**
Stored embeddings in ChromaDB to enable efficient similarity-based retrieval.

- **Retrieval Augmented Generation (RAG):**
Retrieved relevant risk factors based on user queries and integrated them into the LLM’s context.

### Installing Dependencies

In [1]:
!pip install pypdf2
!pip install -U langchain-community
!pip install langchain_google_genai
!pip install chromadb sentence-transformers google-generativeai


### Importing Necessary Libraries and Setting Up the Google Generative AI Client

In [2]:
import os
import re

from PyPDF2 import PdfReader
from google import genai
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata

# Read the API key from Colab secrets and create the Gemini client
google_api = userdata.get('Google_Api')
client = genai.Client(api_key=google_api)



# Extracting Risk Factors

- The class "RiskFactorsExtractor" extracts the "Risk Factors" section from a PDF document (specifically, the "Item 1A. Risk Factors" section, which is a standard part of many financial reports). It then saves the extracted text to a separate text file.

In [3]:
class RiskFactorsExtractor:
    def __init__(self, pdf_path: str):
        self.pdf_path = pdf_path
        self.reader = PdfReader(pdf_path)
        self.risk_factors_text = ""

    def extract_text(self) -> str:
        """Extract all text from the PDF."""
        text = ""
        for page in self.reader.pages:
            text += page.extract_text()
        return text

    def find_risk_factors_section(self, text: str) -> str:
        """Find and extract the Item 1A Risk Factors section."""
        # Pattern to find the start of Item 1A
        start_pattern = r"Item\s*1A\.?\s*Risk\s*Factors"
        # Pattern to find the end (usually Item 1B)
        end_pattern = r"Item\s*1B\.?\s*Unresolved\s*Staff\s*Comments"

        # Find the start of the section
        start_match = re.search(start_pattern, text, re.IGNORECASE)
        if not start_match:
            return "Risk Factors section not found"

        # Find the end of the section
        end_match = re.search(end_pattern, text[start_match.end():], re.IGNORECASE)
        if not end_match:
            return "Could not determine end of Risk Factors section"

        # Extract the section
        start_idx = start_match.end()
        end_idx = start_idx + end_match.start()
        risk_factors = text[start_idx:end_idx].strip()

        # Clean up the text: collapse every whitespace run (including newlines) into a single space
        risk_factors = re.sub(r'\s+', ' ', risk_factors)

        return risk_factors

    def extract(self) -> str:
        """Main method to extract risk factors."""
        text = self.extract_text()
        self.risk_factors_text = self.find_risk_factors_section(text)
        return self.risk_factors_text

    def save_to_file(self, output_path: str):
        """Save the extracted risk factors to a file."""
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(self.risk_factors_text)

def main():

    pdf_path = "/content/NASDAQ_AAPL_2023.pdf"
    extractor = RiskFactorsExtractor(pdf_path)
    risk_factors = extractor.extract()

    # Save to file
    extractor.save_to_file("apple_risk_factors_2023.txt")


    print("Extracted Risk Factors (first 100 characters):")
    print(risk_factors[:100] + "...")

if __name__ == "__main__":
    main()


Extracted Risk Factors (first 100 characters):
5 Item 1B. Unresolved Staf f Comments 16 Item 1C. Cybersecurity 16 Item 2. Properties 17 Item 3. Leg...
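
The preview above begins with table-of-contents entries ("Item 1B. Unresolved Staf f Comments 16 ...") rather than risk-factor prose. This suggests that the start pattern matched the contents page first, and that the end pattern skipped the contents entry because the PDF text layer splits "Staff" into "Staf f". The following is a minimal sketch of one possible workaround, not the notebook's method: it tolerates the split word and keeps the last start heading that still has an end heading after it. The function name `find_body_section` and the toy text are hypothetical, and it returns `None` instead of a message string when nothing is found.

```python
import re

START = re.compile(r"Item\s*1A\.?\s*Risk\s*Factors", re.IGNORECASE)
# Allow optional whitespace inside "Staff" because PDF extraction may emit "Staf f".
END = re.compile(r"Item\s*1B\.?\s*Unresolved\s*Staf\s*f\s*Comments", re.IGNORECASE)


def find_body_section(text):
    """Return the text between the last start heading that is followed by an
    end heading, or None if no such pair exists."""
    best = None
    for start in START.finditer(text):
        end = END.search(text, start.end())
        if end:
            # Later matches overwrite earlier ones, so the body heading wins
            # over the table-of-contents entry.
            best = text[start.end():end.start()]
    return best.strip() if best is not None else None


# Hypothetical miniature of a filing: a contents page followed by the body.
toy = (
    "Item 1A. Risk Factors 5 Item 1B. Unresolved Staf f Comments 16 "
    "Item 1. Business ... "
    "Item 1A. Risk Factors The Company's business can be affected by many factors. "
    "Item 1B. Unresolved Staff Comments None."
)
section = find_body_section(toy)
print(section)
```

Whether the last qualifying heading is the right one depends on the filing's layout, so the result is worth checking against the PDF before it is passed to the model.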


# Extracting the Financial Health of the Company

- The class "FinancialDataExtractor" extracts the financial data from a PDF document (specifically, the "Item 8" section, which is a standard part of many financial reports). It then saves the extracted text to a separate text file.

In [4]:
class FinancialDataExtractor:
    def __init__(self, pdf_path: str):
        self.pdf_path = pdf_path
        self.reader = PdfReader(pdf_path)
        self.financial_data_text = ""

    def extract_text(self) -> str:
        """Extract all text from the PDF."""
        text = ""
        for page in self.reader.pages:
            text += page.extract_text()
        return text

    def find_financial_data_section(self, text: str) -> str:
        """Find and extract the financial data section."""
        # Pattern to find the start of the section
        start_pattern = r"Item\s*8\.?\s*Financial\s*Statements\s*and\s*Supplementary\s*Data\s*Index\s*to\s*Consolidated\s*Financial\s*Statements"
        # Pattern to find the end (usually Item 9)
        end_pattern = r"Item\s*9\.?\s*Changes\s*in\s*and\s*Disagreements\s*with\s*Accountants\s*on\s*Accounting\s*and\s*Financial\s*Disclosure\s*None\."

        # Find the start of the section
        start_match = re.search(start_pattern, text, re.IGNORECASE)
        if not start_match:
            return "Financial Data section not found"

        # Find the end of the section
        end_match = re.search(end_pattern, text[start_match.end():], re.IGNORECASE)
        if not end_match:
            return "Could not determine end of Financial Data section"

        # Extract the section
        start_idx = start_match.end()
        end_idx = start_idx + end_match.start()
        financial_data = text[start_idx:end_idx].strip()

        # Clean up the text: collapse every whitespace run (including newlines) into a single space
        financial_data = re.sub(r'\s+', ' ', financial_data)

        return financial_data

    def extract(self) -> str:
        """Main method to extract financial data."""
        text = self.extract_text()
        self.financial_data_text = self.find_financial_data_section(text)
        return self.financial_data_text

    def save_to_file(self, output_path: str):
        """Save the extracted financial data to a file."""
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(self.financial_data_text)

def main():

    pdf_path = "/content/NASDAQ_AAPL_2023.pdf"
    extractor = FinancialDataExtractor(pdf_path)
    financial_data = extractor.extract()

    # Save to file
    extractor.save_to_file("apple_financial_data_2023.txt")

    print("Extracted Financial Data (first 100 characters):")
    print(financial_data[:100] + "...")

if __name__ == "__main__":
    main()


Extracted Financial Data (first 100 characters):
Page Consolidated Statements of Operations for the years ended September 30, 2023, September 24, 202...


- The text files containing the extracted sections of Apple's 2023 10-K are then passed to the Gemini 1.5 Flash model (`gemini-1.5-flash-latest`) along with a few examples, so that it returns the content as JSON.
- The prompts include sample JSON objects demonstrating the expected structure, which helps the model understand the format needed for extraction.
- For the risk factors analysis, the model extracts information into JSON objects containing fields like company name, year, risk title, category, severity, summary and source text.
- For the financial metrics, it extracts data into JSON objects with fields for company, year, financial statement source, metric name, performance summary, numeric value, and unit.
- After processing, the structured JSON responses are cleaned of any code block markers and saved to dedicated JSON files for further analysis or visualization.
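
The cleaning step in the last bullet can be sketched as follows. This is an illustrative variant under stated assumptions, not the notebook's exact code: it also parses the payload with `json.loads`, so malformed model output fails at this point rather than later when the file is loaded. The helper name `extract_json_list` and the sample string are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built here to avoid writing it literally
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*([\s\S]*?)" + FENCE)


def extract_json_list(raw):
    """Strip an optional Markdown code fence and parse the payload as a JSON list."""
    match = FENCED_BLOCK.search(raw)
    payload = match.group(1) if match else raw
    data = json.loads(payload.strip())  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of objects")
    return data


# Hypothetical model response wrapped in a fenced block.
sample = FENCE + 'json\n[{"company": "Apple", "year": 2023}]\n' + FENCE
records = extract_json_list(sample)
print(records)
```

Writing `json.dumps(records, indent=2)` to the output file instead of the raw response text would then guarantee that the saved file is valid JSON.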

In [5]:
with open('/content/apple_risk_factors_2023.txt','r') as f:
  file_content1 = f.read()
  prompt1 = f"""
You are an expert in analyzing financial filings. Your task is to extract information about risk factors from the following text and structure it as a list of JSON objects, where each object represents a specific risk factor.

The JSON object should have the following fields:
- "company": The name of the company.
- "year": The year of the filing.
- "risk_title": A concise title for the risk factor.
- "risk_category": The broad category this risk belongs to.
- "severity": An assessment of the risk's severity (e.g., "High", "Medium", "Low"). If the text doesn't explicitly state the severity, make an informed judgment based on the description.
- "summary": A brief summary of the risk factor.
- "source_text": A relevant excerpt from the text that supports the identified risk factor.

Here is the text from Apple Inc.'s 2023 Form 10-K filing:

{file_content1}

Please provide the output as a list of JSON objects. The following are only examples of the expected format; you should extract every risk factor you find in the text:

{
  {
    "company": "Apple",
    "year": 2023,
    "risk_title": "Disruptions in Global Supply Chain",
    "risk_category": "Macroeconomic and Industry",
    "severity": "High",
    "summary": "**Disruptions in Apple's global and complex supply chain**, where a majority of supplier facilities are outside the U.S., can **materially adversely affect** the business, results of operations, and financial condition [2]. Restrictions on international trade can further exacerbate these issues [3].",
    "source_text": "In addition, the Company’s global supply chain is large and complex and a majority of the Company’s supplier facilities, including manufacturing and assembly sites, are located outside the U.S. As a result, the Company’s operations and performance depend significantly on global and regional economic conditions [2]. Restrictions on international trade, such as tariffs and other controls on imports or exports of goods, technology or data, can materially adversely affect the Company’s operations and supply chain and limit the Company’s ability to offer and distribute its products and services to customers [3]."
  },
  {
    "company": "Apple",
    "year": 2023,
    "risk_title": "Intense Competition in Global Markets",
    "risk_category": "Macroeconomic and Industry",
    "severity": "High",
    "summary": "The global markets for Apple's products and services are **highly competitive**, characterized by price competition, frequent new product introductions, and rapid technological change, which could prevent Apple from competing effectively [4].",
    "source_text": "*Global markets for the Company’s products and services are **highly competitive** and subject to rapid technological change, and the Company may be unable to compete effectively in these markets* [4]."
  },
  {
    "company": "Apple",
    "year": 2023,
    "risk_title": "Dependence on Outsourcing Partners",
    "risk_category": "Business Risks",
    "severity": "High",
    "summary": "Apple relies significantly on **outsourcing partners**, primarily located outside the U.S., for **manufacturing components and products**. This reduces Apple's direct control over production and distribution, potentially affecting product quality, quantity, and responsiveness to changing conditions [5].",
    "source_text": "**Substantially all of the Company’s manufacturing is performed in whole or in part by outsourcing partners located primarily in China mainland, India, Japan, South Korea, Taiwan and Vietnam**, and a significant concentration of this manufacturing is currently performed by a small number of outsourcing partners, often in single locations. ... While these arrangements can lower operating costs, they also reduce the Company’s direct control over production and distribution. Such diminished control has from time to time and may in the future have an adverse effect on the quality or quantity of products manufactured or services provided, or adversely affect the Company’s flexibility to respond to changing conditions [5]."
  },
  {
    "company": "Apple",
    "year": 2023,
    "risk_title": "Single or Limited Sources for Certain Components",
    "risk_category": "Business Risks",
    "severity": "High",
    "summary": "Apple obtains **certain essential components from single or limited sources**, exposing the company to **significant supply and pricing risks**, including industry-wide shortages and commodity price fluctuations, which can **materially adversely affect** its business [6].",
    "source_text": "**Because the Company currently obtains certain components from single or limited sources, the Company is subject to significant supply and pricing risks.** Many components, including those that are available from multiple sources, are at times subject to industry-wide shortages and significant commodity pricing fluctuations that can **materially adversely affect the Company’s business, results of operations and financial condition** [6]."
  },
  {
    "company": "Apple",
    "year": 2023,
    "risk_title": "Design and Manufacturing Defects",
    "risk_category": "Business Risks",
    "severity": "High",
    "summary": "Apple's complex hardware and software products and services are susceptible to **design and manufacturing defects**. Failure to detect and fix these issues can lead to technical problems, performance issues, product liability claims, recalls, and **harm to the Company's reputation** [7].",
    "source_text": "**The Company’s products and services may be affected from time to time by design and manufacturing defects that could materially adversely affect the Company’s business and result in harm to the Company’s reputation.** Sophisticated operating system software and applications, such as those offered by the Company, often have issues that can unexpectedly interfere with the intended operation of hardware or software products and services [7]."
  }
}

Now, go through the text and extract multiple risk factors in the specified JSON format.
"""
response1 = client.models.generate_content(
    model='gemini-1.5-flash-latest',

    contents=prompt1
)

json_text = response1.text.strip()
json_text = re.sub(r"```(?:json)?\n([\s\S]*?)```", r"\1", json_text)
with open("apple_risk_factors_2023.json", "w") as outfile:
    outfile.write(json_text)
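
In an f-string, single braces delimit Python expressions, so the braces of the few-shot JSON block above are parsed as code rather than passed through as literal text. A sketch of one alternative, assuming the examples are kept as Python data and serialized with `json.dumps`: the names `FEW_SHOT_EXAMPLES` and `build_extraction_prompt` are hypothetical, and the single example is trimmed from the prompt above.

```python
import json

# One trimmed example; in practice this list would hold all few-shot examples.
FEW_SHOT_EXAMPLES = [
    {
        "company": "Apple",
        "year": 2023,
        "risk_title": "Intense Competition in Global Markets",
        "risk_category": "Macroeconomic and Industry",
        "severity": "High",
    },
]


def build_extraction_prompt(filing_text, examples):
    """Assemble the prompt from plain strings so JSON braces stay literal."""
    examples_json = json.dumps(examples, indent=2, ensure_ascii=False)
    return "\n\n".join([
        "Extract risk factors from the text below as a JSON list of objects.",
        "Text:\n" + filing_text,
        "Examples of the expected format:\n" + examples_json,
        "Now extract all risk factors in the same format.",
    ])


prompt = build_extraction_prompt("Sample filing text.", FEW_SHOT_EXAMPLES)
print(prompt)
```

Because the examples are serialized by `json.dumps`, the model sees a valid JSON list in square brackets, which matches the output format the prompt asks for.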

In [6]:
with open('/content/apple_financial_data_2023.txt','r') as f:
  file_content2 = f.read()
prompt2 = f"""
You are an expert in analyzing financial filings. Your task is to extract structured financial data from the following text, specifically focusing on **key financial metrics**. Each metric should be extracted **only once**, using the most recent value available (e.g., from 2023 if available).

Provide the output as a list of JSON objects, with the following fields:
- "company": The name of the company.
- "year": The year of the financial data.
- "financial_statement": The name of the financial statement the metric appears in (e.g., "Consolidated Statements of Operations", "Consolidated Balance Sheets", or "Consolidated Statements of Cash Flows").
- "metric": The name of the financial metric (e.g., "Total Net Sales", "Net Income").
- "summary": A brief assessment of the performance indicated by this metric (e.g., "strong growth", "concerning decline", "stable performance").
- "value": The numeric value of the metric (as a number, not a string).
- "unit": The unit of the value (usually "USD").



Here is the text from Apple Inc.'s financial filing:

{file_content2}

Please provide the output as a list of JSON objects. The following are only a few examples of the expected format; you should extract all financial metrics:
{
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Total Net Sales",
    "summary": "Slight decline of about 3% from the previous year",
    "value": 383285000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Research and Development Expenses",
    "summary": "Continued significant investment in innovation and future products",
    "value": 29915000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Selling, General and Administrative Expenses",
    "summary": "Well-managed operational expenses relative to revenue",
    "value": 24932000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Earnings per Share, Basic",
    "summary": "Solid earnings performance indicating profitability",
    "value": 6.16,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Earnings per Share, Diluted",
    "summary": "Strong profitability with minimal dilution effect",
    "value": 6.13,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Balance Sheets",
    "metric": "Marketable Securities",
    "summary": "Substantial liquid assets providing financial flexibility",
    "value": 31590000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Balance Sheets",
    "metric": "Total Assets",
    "summary": "Impressive asset base supporting business operations",
    "value": 352583000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2022,
    "financial_statement": "Consolidated Balance Sheets",
    "metric": "Total Shareholders' Equity",
    "summary": "Declining equity position compared to previous year",
    "value": 50672000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2021,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Total Net Sales",
    "summary": "Strong historical sales performance",
    "value": 365817000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2021,
    "financial_statement": "Consolidated Balance Sheets",
    "metric": "Total Shareholders' Equity",
    "summary": "Healthy equity position providing financial stability",
    "value": 63090000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Cash Flows",
    "metric": "Net Cash Used in Financing Activities",
    "summary": "Significant cash outflow indicating substantial shareholder returns or debt repayment",
    "value": -108488000000,
    "unit": "USD"
  }
}
Now, go through the text and extract multiple Financial Metrics in the specified JSON format.
"""

response2 = client.models.generate_content(
    model='gemini-1.5-flash-latest',
    contents=prompt2
)

print(response2.text)


json_text = response2.text.strip()
json_text = re.sub(r"```(?:json)?\n([\s\S]*?)```", r"\1", json_text)
with open("apple_financial_data_2023.json", "w") as outfile:
    outfile.write(json_text)

```json
[
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Total Net Sales",
    "summary": "Slight decrease compared to 2022, but still strong overall.",
    "value": 383285000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Net Income",
    "summary": "Slight decrease compared to 2022, but still strong overall.",
    "value": 96995000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Cost of Sales, Products",
    "summary": "Increased cost of sales for products compared to 2022.",
    "value": 189282000000,
    "unit": "USD"
  },
  {
    "company": "Apple",
    "year": 2023,
    "financial_statement": "Consolidated Statements of Operations",
    "metric": "Cost of Sales, Services",
    "summary"

Importing libraries for embeddings and storage

In [7]:
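# Note: this rebinds the name `genai` (imported earlier via `from google import genai`);
# the `client` object created above is unaffected.
import google.generativeai as genai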
import uuid
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')



- This cell prepares the data for storage in the ChromaDB vector database.
- It iterates through the risk and financial data, formats each record as text, and stores the results in three lists (documents, metadatas, ids).
- The metadatas list stores additional information about each document, while the ids list assigns a unique identifier to each document.

In [9]:
import json
def load_json_data(filepath):
    with open(filepath, 'r',encoding='utf-8') as f:
        return json.load(f)

risk_data = load_json_data("apple_risk_factors_2023.json")
financial_data = load_json_data("apple_financial_data_2023.json")

documents = []
metadatas = []
ids=[]

for item in risk_data:
    text = f"Risk: {item['risk_title']}. Summary: {item['summary']} Source: {item['source_text']}"
    documents.append(text)
    item["type"] = "risk"
    metadatas.append(item)
    ids.append(str(uuid.uuid4()))

for item in financial_data:
    text = f"Financial Metric: {item['metric']} ({item['financial_statement']}) for {item['year']}. Value: {item['value']} {item['unit']}. Summary: {item['summary']}"
    documents.append(text)
    item["type"] = "financial"
    metadatas.append(item)
    ids.append(str(uuid.uuid4()))
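
Because `uuid.uuid4()` produces fresh IDs on every run, re-running the cell above against the same collection would store the same documents again under new IDs. The following sketch shows one possible alternative, assuming IDs derived from a hash of the document text and metadata restricted to scalar values (Chroma accepts strings, numbers, and booleans as metadata values). The helper names `stable_id` and `scalar_metadata` and the `tags` field are hypothetical.

```python
import hashlib
import json


def stable_id(text):
    """Derive a repeatable ID from the document text (same text, same ID)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def scalar_metadata(item, doc_type):
    """Copy a record, tag its type, and stringify any non-scalar values."""
    meta = {"type": doc_type}
    for key, value in item.items():
        if isinstance(value, (str, int, float, bool)):
            meta[key] = value
        else:
            meta[key] = json.dumps(value)  # e.g. lists or nested objects
    return meta


# "tags" is a made-up field that stands in for any non-scalar value the model might return.
record = {"metric": "Net Income", "year": 2023, "value": 96995000000, "tags": ["income"]}
text = f"Financial Metric: {record['metric']} for {record['year']}."
doc_id = stable_id(text)
meta = scalar_metadata(record, "financial")
print(doc_id, meta)
```

With repeatable IDs, `collection.upsert` could be used in place of `collection.add` so that a second run overwrites existing entries instead of duplicating them. Unlike the loop above, `scalar_metadata` copies the record rather than mutating it.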

# Embedding and Storing Data in ChromaDB


In [10]:
# PersistentClient writes the collection to disk under ./chroma_db
client1 = chromadb.PersistentClient(path="./chroma_db")

collection = client1.get_or_create_collection("apple_data")
embeddings_list = embedding_model.encode(documents, convert_to_numpy=True)

collection.add(
    embeddings=embeddings_list.tolist(),
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

print(f"Successfully embedded and stored {collection.count()} documents in ChromaDB.")

Successfully embedded and stored 71 documents in ChromaDB.
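
To make the retrieval step concrete, here is a toy sketch of similarity ranking over embedding vectors. It is only an illustration under stated assumptions: it uses cosine similarity on hand-written three-dimensional vectors, whereas ChromaDB uses its own index and a distance metric configured on the collection, and real sentence-transformer embeddings have several hundred dimensions. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": documents 0 and 1 point in a similar direction, document 2 does not.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [0.0, 0.9, 0.1]
ranked = top_k(query, docs)
print(ranked)  # → [2, 1]
```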


# Defining the RAG Prompt Builder
- This function constructs the prompt for the Retrieval Augmented Generation (RAG) pipeline.
- It takes the user's query and the retrieved documents as input and formats them into a prompt for the Google Generative AI model.

In [11]:
EXAMPLES = """
  [
  {
    "question": "What were Apple’s total assets as of 2023?",
    "output": {
      "type": "financial",
      "metric": "Total Assets",
      "financial_statement": "Consolidated Balance Sheets",
      "value": 352583000000,
      "unit": "USD",
      "summary": "Slight decrease in total assets compared to 2022."
    }
  },
  {
    "question": "What are the major business risks Apple faces?",
    "output": {
      "type": "risk",
      "risk_title": "Dependence on Outsourcing Partners for Manufacturing",
      "risk_category": "Business Risks",
      "severity": "High",
      "summary": "Apple's reliance on outsourcing partners for manufacturing reduces its direct control over production and distribution, creating risks related to product quality, quantity, and response to changing conditions.",
      "source_text": "Substantially all of the Company’s manufacturing is performed in whole or in part by outsourcing partners located primarily in China mainland, India, Japan, South Korea, Taiwan and Vietnam..."
    }
  },
  {
    "question": "How much did Apple make in net sales from Europe in 2023?",
    "output": {
      "type": "financial",
      "metric": "Net Sales, Europe",
      "financial_statement": "Note 13 - Segment Information and Geographic Data",
      "value": 94294000000,
      "unit": "USD",
      "summary": "Strong sales performance in Europe."
    }
  },
  {
    "question": "What was Apple's net income in 2023?",
    "output": {
      "type": "financial",
      "metric": "Net Income",
      "financial_statement": "Consolidated Statements of Operations",
      "value": 96995000000,
      "unit": "USD",
      "summary": "Slight decrease compared to 2022, but still very high."
    }
  },
  {
    "question": "Does Apple mention anything about cybersecurity threats?",
    "output": {
      "type": "risk",
      "risk_title": "Malicious Attacks and Cybersecurity Risks",
      "risk_category": "Business Risks",
      "severity": "High",
      "summary": "Apple faces regular malicious attacks and cybersecurity threats aiming to compromise its systems and data. These attacks could disrupt operations, harm reputation, and lead to legal and financial repercussions.",
      "source_text": "The Company experiences malicious attacks and other attempts to gain unauthorized access to its systems on a regular basis..."
    }
  },
  {
    "question": "How much did Apple spend on research and development in 2023?",
    "output": {
      "type": "financial",
      "metric": "Research and Development Expenses",
      "financial_statement": "Consolidated Statements of Operations",
      "value": 29915000000,
      "unit": "USD",
      "summary": "Significant increase in R&D spending compared to 2022."
    }
  },
  {
    "question": "What legal or regulatory risks did Apple face in 2023?",
    "output": {
      "type": "risk",
      "risk_title": "Legal and Regulatory Proceedings",
      "risk_category": "Legal and Regulatory Compliance Risks",
      "severity": "High",
      "summary": "Apple faces various legal claims, proceedings, and government investigations, which can be expensive, time-consuming, and disruptive.",
      "source_text": "The Company is subject to various claims, legal proceedings and government investigations that have arisen in the ordinary course of business..."
    }
  }
]
"""
def build_rag_prompt(user_query, retrieved_docs):
    # Join every retrieved document so the model sees the full context, not only the first hit
    context = "\n\n".join(retrieved_docs)
    return f"""You are a financial analyst assistant. Extract structured data from the given context based on the format shown in the examples.

{EXAMPLES}

Context:
{context}

Question:
{user_query}

Output:"""



### RAG Query Pipeline
- The pipeline takes the user's query as input, generates an embedding for it, retrieves the five most similar documents from the ChromaDB collection, builds the RAG prompt, sends the prompt to the Google Generative AI model, and returns the response text.

In [13]:
def query_rag_pipeline(user_query, model='gemini-1.5-flash-latest'):

    # Embed the query and retrieve the five most similar documents
    query_embedding = embedding_model.encode([user_query], convert_to_numpy=True)
    results = collection.query(query_embeddings=query_embedding.tolist(), n_results=5)
    retrieved_docs = results['documents'][0]
    retrieved_context = "\n\n".join(retrieved_docs)
    print("Top retrieved docs:\n", retrieved_context)
    # Build the prompt once, passing the joined context so all retrieved documents are included
    prompt = build_rag_prompt(user_query, [retrieved_context])
    response = client.models.generate_content(
        model=model,
        contents=[{
            "role": "user",
            "parts": [{"text": prompt}]
        }]
    )
    return response.text if hasattr(response, 'text') else response.candidates[0].content.parts[0].text
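
The indexing `results['documents'][0]` relies on the shape of a ChromaDB query result: each field holds one inner list per query embedding. The sketch below shows one way to turn such a result into a context string, using a hand-written dictionary in place of a live collection. The helper name `format_context` and the optional `max_distance` cutoff are illustrative assumptions, not part of this notebook, and the distances in the sample are made up.

```python
def format_context(results, max_distance=None):
    """Join the documents retrieved for the first (only) query into one string.

    `results` mimics the dict returned by a ChromaDB `collection.query` call:
    every field holds one inner list per query embedding.
    `max_distance` is a hypothetical cutoff; None keeps every hit.
    """
    docs = results["documents"][0]
    # Distances may be absent if they were not requested; treat them as unknown.
    distances = (results.get("distances") or [[None] * len(docs)])[0]
    kept = [
        doc
        for doc, dist in zip(docs, distances)
        if max_distance is None or dist is None or dist <= max_distance
    ]
    return "\n\n".join(kept)


# Hand-written stand-in for a query result (distances are invented).
sample_results = {
    "documents": [[
        "Financial Metric: Total Net Sales (Consolidated Statements of Operations) for 2023. Value: 383285000000 USD.",
        "Financial Metric: Net Income (Consolidated Statements of Operations) for 2023. Value: 96995000000 USD.",
    ]],
    "distances": [[0.21, 0.58]],
}

print(format_context(sample_results, max_distance=0.5))
```

A cutoff like this is one option for keeping loosely related passages out of the prompt; a suitable threshold would have to be chosen empirically for the embedding model in use.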


#### Querying the RAG Pipeline

In [14]:
response = query_rag_pipeline("What was total Net Sales?")
print(response)


Top retrieved docs:
 Financial Metric: Total Net Sales (Consolidated Statements of Operations) for 2023. Value: 383285000000 USD. Summary: Slight decrease compared to 2022, but still strong overall.

Financial Metric: Property, Plant and Equipment, Net (Consolidated Balance Sheets) for 2023. Value: 43715000000 USD. Summary: Slight increase in net property, plant and equipment compared to 2022.

Financial Metric: Net Income (Consolidated Statements of Operations) for 2023. Value: 96995000000 USD. Summary: Slight decrease compared to 2022, but still strong overall.

Financial Metric: Net Cash Used in Financing Activities (Consolidated Statements of Cash Flows) for 2023. Value: -108488000000 USD. Summary: Significant net cash outflow from financing activities, likely due to share repurchases and dividends.

Financial Metric: Net Cash from Operating Activities (Consolidated Statements of Cash Flows) for 2023. Value: 110543000000 USD. Summary: Strong cash flow from operations.
```json
{
  "

In [17]:
response1 = query_rag_pipeline("What was Apple’s Major Risks?")
print(response1)

Top retrieved docs:
 Risk: Information Technology System Failures and Network Disruptions. Summary: Apple's operations are heavily reliant on information technology systems, making it vulnerable to disruptions caused by natural disasters, accidents, cyberattacks, or other events.  Such failures can severely impact its business. Source: The Company and its global supply chain are dependent on complex information technology systems and are exposed to information technology system failures or network disruptions caused by natural disasters, accidents, power disruptions, telecommunications failures, acts of terrorism or war, computer viruses, physical or electronic break-ins, ransomware or other cybersecurity incidents, or other events or disruptions.

Risk: Third-Party Intellectual Property Risks. Summary: Apple's reliance on third-party intellectual property for its products and services creates a risk of infringement claims or inability to obtain necessary licenses on commercially reaso

In [21]:
response3 = query_rag_pipeline("What was Apple's Gross Margin")
print(response3)

Top retrieved docs:
 Financial Metric: Gross Margin (Consolidated Statements of Operations) for 2023. Value: 169148000000 USD. Summary: Slightly decreased gross margin compared to 2022.

Risk: Intense Competition. Summary: Apple faces intense competition in the smartphone, personal computer, and tablet markets, with competitors employing aggressive pricing strategies, broader product lines, and larger installed bases.  This leads to downward pressure on gross margins. Source: Global markets for the Company’s products and services are highly competitive and subject to rapid technological change, and the Company may be unable to compete effectively in these markets. The Company’s products and services are offered in highly competitive global markets characterized by aggressive price competition and resulting downward pressure on gross margins, frequent introduction of new products and services, short product life cycles, evolving industry standards, continual improvement in product price

 - Our RAG pipeline consistently produces accurate and well-structured responses across multiple queries, demonstrating that it can retrieve relevant information from the 10-K filing, ground its responses in the source context, and generate output in a controlled JSON format.
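
Because the model wraps its answer in a Markdown `json` code fence (as in the truncated Net Sales output above), the response text needs a small cleanup step before it can be used as data. The sketch below assumes the reply contains a single JSON object. `parse_model_json` is a hypothetical helper name, and the sample reply is hand-written from the Net Sales figures shown above rather than captured model output.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this cell contains no literal fence


def parse_model_json(text):
    """Extract and parse a JSON object from a model reply, fenced or not."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


# Hand-written reply in the style of the few-shot examples.
sample_reply = (
    FENCE + "json\n"
    '{\n'
    '  "metric": "Total Net Sales",\n'
    '  "financial_statement": "Consolidated Statements of Operations",\n'
    '  "value": 383285000000,\n'
    '  "unit": "USD"\n'
    '}\n' + FENCE
)

record = parse_model_json(sample_reply)
print(record["metric"], record["value"])
```

`json.loads` raises `json.JSONDecodeError` on malformed output, which gives the calling code a natural place to retry or log the failure.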

# **5. Future Work**

The broader vision for this project includes:

- Evaluating LLM responses against gold-standard queries and ground-truth answers to ensure factuality and reliability (immediate priority).

- Extracting a wide range of financial, strategic, and operational data from filings.

- Supporting multiple companies and parsing both 10-K and 10-Q reports.

- Building pipelines using agents to autonomously extract, embed, and update data.

- Developing a web app with an interactive dashboard for querying and exploring filings.

- Benchmarking various LLMs (e.g., GPT-4, Gemini, Claude) on financial document understanding tasks.
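
The first item in this list, checking responses against ground-truth answers, could start as simply as comparing each extracted `value` with a reference figure. The sketch below assumes each gold entry pairs a question with an expected numeric value. `evaluate`, `fake_pipeline`, and the relative tolerance are illustrative assumptions, and the stub stands in for `query_rag_pipeline` followed by JSON parsing.

```python
import math

# Values taken from the retrieved documents shown earlier in this notebook;
# a real gold set should be verified by hand against the filing itself.
GOLD = [
    {"question": "What was total Net Sales?", "value": 383285000000},
    {"question": "What was Apple's Gross Margin", "value": 169148000000},
]


def evaluate(answer_fn, gold, rel_tol=1e-6):
    """Return the fraction of gold questions whose numeric value is matched."""
    correct = 0
    for item in gold:
        predicted = answer_fn(item["question"])
        value = predicted.get("value") if isinstance(predicted, dict) else None
        if isinstance(value, (int, float)) and math.isclose(
            value, item["value"], rel_tol=rel_tol
        ):
            correct += 1
    return correct / len(gold)


def fake_pipeline(question):
    """Stub answerer: right on Net Sales, wrong on everything else."""
    if "Net Sales" in question:
        return {"metric": "Total Net Sales", "value": 383285000000}
    return {"metric": "unknown", "value": 0}


print(evaluate(fake_pipeline, GOLD))
```

Exact-value matching only covers numeric metrics; risk summaries would need a different comparison, such as checking that the cited `source_text` actually appears in the filing.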