# Multi-Agent Document Extraction Pipeline

This notebook demonstrates a **two-agent pipeline** for extracting and analyzing financial documents:

1. **Extractor Agent** (Claude Sonnet 4.5) - Reads and extracts key information from Apple's 10-K SEC filing
2. **Report Agent** (GPT-4o) - Generates a brief analytical report based on the extracted data

## Architecture
```
PDF Document → Extractor Agent (Claude) → Structured Data → Report Agent (GPT-4o) → Final Report
```

## 1. Setup & Installation

In [None]:
# Install required packages
!pip install -q langchain langchain-openai langchain-anthropic langchain-community
!pip install -q langchain-text-splitters
!pip install -q pypdf tiktoken
!pip install -q pydantic

In [None]:
import os
from getpass import getpass

# Set API keys
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass('Enter your OpenAI API key: ')

if 'ANTHROPIC_API_KEY' not in os.environ:
    os.environ['ANTHROPIC_API_KEY'] = getpass('Enter your Anthropic API key: ')

## 2. Download Apple 10-K Filing

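We'll work from Apple's latest 10-K filing, which is available from SEC EDGAR. This demo does not fetch the filing automatically: save it as a PDF and upload it in the cells below.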

In [None]:
# Apple's 10-K filing URL (FY2024), kept here for reference only; nothing below fetches it
# You can replace this with any 10-K URL from SEC EDGAR
SEC_10K_URL = "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm"

# This demo expects a manually uploaded PDF rather than an automated download
# In production, you'd download the actual filing
print("Note: For this demo, please upload Apple's 10-K PDF manually or provide the file path.")
print("You can download it from: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=10-K")

In [None]:
# Upload PDF file in Colab
from google.colab import files

print("Please upload Apple's 10-K PDF file:")
uploaded = files.upload()

# Get the filename
PDF_PATH = list(uploaded.keys())[0]
print(f"\nUploaded: {PDF_PATH}")

## 3. Load and Process the PDF

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()

print(f"Loaded {len(documents)} pages from the PDF")

# Split into chunks for processing
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

## 4. Define Data Models

In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional

class FinancialMetrics(BaseModel):
    """Key financial metrics extracted from 10-K"""
    # default=None makes each field truly optional; without it Pydantic v2 still requires the key
    total_revenue: Optional[str] = Field(default=None, description="Total net revenue/sales")
    net_income: Optional[str] = Field(default=None, description="Net income")
    gross_margin: Optional[str] = Field(default=None, description="Gross margin percentage")
    operating_income: Optional[str] = Field(default=None, description="Operating income")
    eps: Optional[str] = Field(default=None, description="Earnings per share (diluted)")
    total_assets: Optional[str] = Field(default=None, description="Total assets")
    total_debt: Optional[str] = Field(default=None, description="Total debt/liabilities")
    cash_and_equivalents: Optional[str] = Field(default=None, description="Cash and cash equivalents")

class BusinessSegments(BaseModel):
    """Revenue breakdown by segment"""
    # default=None lets a segment be omitted when the text does not report it
    iphone_revenue: Optional[str] = Field(default=None, description="iPhone revenue")
    mac_revenue: Optional[str] = Field(default=None, description="Mac revenue")
    ipad_revenue: Optional[str] = Field(default=None, description="iPad revenue")
    wearables_revenue: Optional[str] = Field(default=None, description="Wearables, Home and Accessories revenue")
    services_revenue: Optional[str] = Field(default=None, description="Services revenue")

class RiskFactors(BaseModel):
    """Key risk factors identified"""
    risks: List[str] = Field(description="List of key risk factors")

class ExtractedData(BaseModel):
    """Complete extracted data from 10-K filing"""
    company_name: str = Field(description="Company name")
    fiscal_year: str = Field(description="Fiscal year of the filing")
    financial_metrics: FinancialMetrics
    business_segments: BusinessSegments
    risk_factors: RiskFactors
    business_overview: str = Field(description="Brief business description")
    key_developments: List[str] = Field(description="Key developments or highlights")
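
The models above store every metric as a string with its unit (for example `"$394.3 billion"`), so any numeric check on the extracted values needs a conversion step first. The helper below is a minimal sketch of one way to do that; `parse_amount` and its scale table are hypothetical names introduced for illustration, and it does not handle accounting-style negatives such as `(5) million`.

```python
import re
from typing import Optional

# Multipliers for the scale words commonly seen in filings (illustrative list)
_SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}
_AMOUNT_RE = re.compile(
    r"\$?\s*(-?\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|trillion)?",
    re.IGNORECASE,
)

def parse_amount(value: Optional[str]) -> Optional[float]:
    """Convert a string such as "$394.3 billion" to a float.

    Returns None for missing values, percentages, and text without a number.
    """
    if not value or "%" in value:
        return None
    match = _AMOUNT_RE.search(value)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    scale = _SCALES.get((match.group(2) or "").lower(), 1.0)
    return number * scale
```

With this in place, a later validation step could, for instance, compare the sum of the parsed segment revenues against the parsed total revenue.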

## 5. Initialize LLM Models

In [None]:
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

# Initialize Claude Sonnet 4.5 for extraction
extractor_llm = ChatAnthropic(
    model="claude-sonnet-4-5-20250929",
    temperature=0,
    max_tokens=4096
)

# Initialize GPT-4o for report generation
report_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0.3,
    max_tokens=2048
)

print("Models initialized:")
print("  - Extractor: Claude Sonnet 4.5")
print("  - Report: GPT-4o")

## 6. Agent 1: Document Extractor (Claude Sonnet 4.5)

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Create the extraction prompt
extraction_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert financial analyst specializing in SEC filings analysis.
Your task is to extract key information from Apple Inc.'s 10-K filing.

Extract the following information and return it as valid JSON:
- Company name and fiscal year
- Key financial metrics (revenue, net income, margins, EPS, assets, debt, cash)
- Revenue breakdown by business segment (iPhone, Mac, iPad, Wearables, Services)
- Top 5 risk factors
- Brief business overview (2-3 sentences)
- Key developments or highlights (3-5 bullet points)

If a specific value is not found in the provided text, use null.
Always include the unit (e.g., "$394.3 billion" or "45.2%").

Return ONLY valid JSON matching this schema:
{schema}"""),
    ("human", """Analyze the following sections from Apple's 10-K filing and extract the required information:

---
{document_text}
---

Return the extracted data as JSON:""")
])

# Create JSON parser (not wired into the chain below; the response is parsed manually in a later cell)
json_parser = JsonOutputParser(pydantic_object=ExtractedData)

In [None]:
import json

def extract_from_document(chunks, llm, prompt):
    """
    Extract information from document chunks using Claude.
    Sends the first 30 chunks in a single call, truncated to 100,000
    characters to keep the request small.
    """
    # Take the first 30 chunks; this is the start of the filing, not a
    # targeted selection of the financial sections.
    # In a production system, you'd use semantic search to find relevant sections
    combined_text = "\n\n---\n\n".join([chunk.page_content for chunk in chunks[:30]])

    # Truncate if too long (Claude has a 200K-token context but we want to be efficient)
    if len(combined_text) > 100000:
        combined_text = combined_text[:100000]

    print(f"Processing {len(combined_text):,} characters of text...")

    # Create the chain
    chain = prompt | llm

    # Run extraction; serialize the schema as JSON rather than a Python dict repr
    response = chain.invoke({
        "schema": json.dumps(ExtractedData.model_json_schema(), indent=2),
        "document_text": combined_text
    })

    return response.content
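
Because the function above simply takes the first chunks of the filing, the financial statements (which sit well into a 10-K) may never reach the model. The sketch below shows a lightweight alternative: plain keyword scoring, which is not the semantic search mentioned in the comment but needs no embeddings or vector store. The keyword list and the function names `score_text` and `select_relevant_texts` are assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical keyword list; tune it for the sections you care about
FINANCIAL_KEYWORDS = (
    "net sales", "net income", "gross margin", "operating income",
    "earnings per share", "total assets", "risk factors",
)

def score_text(text: str, keywords: Sequence[str] = FINANCIAL_KEYWORDS) -> int:
    """Count case-insensitive keyword occurrences in a piece of text."""
    lowered = text.lower()
    return sum(lowered.count(keyword) for keyword in keywords)

def select_relevant_texts(
    texts: List[str],
    top_k: int = 30,
    keywords: Sequence[str] = FINANCIAL_KEYWORDS,
) -> List[str]:
    """Return up to top_k highest-scoring texts, restored to document order."""
    scored = [(score_text(text, keywords), index) for index, text in enumerate(texts)]
    # Highest score first; ties go to the earlier position in the document
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    # Drop texts with no keyword hits, then put the rest back in reading order
    keep = sorted(index for score, index in best if score > 0)
    return [texts[index] for index in keep]
```

In this notebook it could be called as `select_relevant_texts([chunk.page_content for chunk in chunks])`, with the result joined in place of `chunks[:30]`.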

In [None]:
import json

print("="*60)
print("AGENT 1: Document Extractor (Claude Sonnet 4.5)")
print("="*60)
print("\nExtracting information from 10-K filing...")

# Run extraction
extraction_result = extract_from_document(chunks, extractor_llm, extraction_prompt)

# Parse the JSON response
try:
    # Slice from the first '{' to the last '}' to drop any surrounding prose
    json_start = extraction_result.find('{')
    json_end = extraction_result.rfind('}') + 1
    json_str = extraction_result[json_start:json_end]
    extracted_data = json.loads(json_str)
    print("\n✓ Extraction complete!")
except json.JSONDecodeError as e:
    # Report the parse error and fall back to the unparsed text
    print(f"\nWarning: Could not parse JSON response ({e}). Using raw response.")
    extracted_data = {"raw_response": extraction_result}

# Display extracted data
print("\n" + "-"*60)
print("EXTRACTED DATA:")
print("-"*60)
print(json.dumps(extracted_data, indent=2))
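
Slicing from the first `{` to the last `}` fails when the model wraps the JSON in a Markdown fence and then adds a remark that itself contains braces. The sketch below is one more tolerant approach, using only the standard library; `parse_llm_json` is a hypothetical helper name, not part of LangChain.

```python
import json
import re
from typing import Optional

# Build the Markdown fence marker programmatically to keep this cell easy to embed
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL | re.IGNORECASE)

def parse_llm_json(text: str) -> Optional[dict]:
    """Return the first JSON object found in text, or None if there is none."""
    # Prefer the contents of a fenced block when the model emitted one
    fenced = _FENCE_RE.search(text)
    candidate = fenced.group(1) if fenced else text

    decoder = json.JSONDecoder()
    start = candidate.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores any trailing text
            obj, _end = decoder.raw_decode(candidate, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = candidate.find("{", start + 1)
    return None
```

A `None` result can then be handled the same way as the `JSONDecodeError` branch above, by falling back to the raw response.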

## 7. Agent 2: Report Generator (GPT-4o)

In [None]:
# Create the report generation prompt
report_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a senior financial analyst writing an executive briefing report.
Your task is to create a concise, professional report based on extracted 10-K data.

The report should:
1. Be clear and actionable for executive leadership
2. Highlight key financial performance metrics
3. Identify trends and notable changes
4. Summarize material risks
5. Provide a brief outlook or recommendation

Use professional financial language but keep it accessible.
Format with clear sections and bullet points where appropriate."""),
    ("human", """Based on the following extracted data from Apple Inc.'s 10-K filing, 
generate a concise executive briefing report:

EXTRACTED DATA:
{extracted_data}

Generate a professional executive briefing report (approximately 500-700 words):""")
])

In [None]:
print("="*60)
print("AGENT 2: Report Generator (GPT-4o)")
print("="*60)
print("\nGenerating executive briefing report...")

# Create the report chain
report_chain = report_prompt | report_llm

# Generate the report
report_response = report_chain.invoke({
    "extracted_data": json.dumps(extracted_data, indent=2)
})

final_report = report_response.content

print("\n✓ Report generation complete!")

In [None]:
print("\n" + "="*60)
print("FINAL EXECUTIVE BRIEFING REPORT")
print("="*60)
print(final_report)

## 8. Save Results

In [None]:
from datetime import datetime

# Create output dictionary
output = {
    "metadata": {
        "source_document": PDF_PATH,
        "extraction_model": "claude-sonnet-4-5-20250929",
        "report_model": "gpt-4o",
        "timestamp": datetime.now().isoformat()
    },
    "extracted_data": extracted_data,
    "executive_report": final_report
}

# Save to JSON
output_filename = f"apple_10k_analysis_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_filename, 'w') as f:
    json.dump(output, f, indent=2)

print(f"Results saved to: {output_filename}")

# Download the file
files.download(output_filename)

## 9. Pipeline Summary

In [None]:
print("\n" + "="*60)
print("PIPELINE EXECUTION SUMMARY")
print("="*60)
print(f"""
Document: {PDF_PATH}
Pages processed: {len(documents)}
Chunks created: {len(chunks)}

Agent 1 (Extractor):
  - Model: Claude Sonnet 4.5
  - Task: Extract structured data from 10-K filing
  - Output: JSON with financial metrics, segments, risks

Agent 2 (Report Generator):
  - Model: GPT-4o
  - Task: Generate executive briefing report
  - Output: Professional narrative report

Pipeline: PDF → Claude (Extract) → GPT-4o (Report) → Output
""")

---

## Notes

### Model Selection Rationale
- **Claude Sonnet 4.5** for extraction: Excellent at structured data extraction, long context handling, and following complex schemas
- **GPT-4o** for report generation: Strong narrative generation and professional writing capabilities

### Production Considerations
1. Add retry logic with exponential backoff for API calls
2. Implement semantic chunking for better section identification
3. Add validation layer to verify extracted data accuracy
4. Use vector store for efficient retrieval of relevant sections
5. Add cost tracking and token usage monitoring
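
As an illustration of the first item, the sketch below wraps any zero-argument callable in retry logic with exponential backoff and a little jitter. The name `call_with_backoff` and its default values are assumptions chosen for this example; LangChain runnables also provide a `with_retry` method that may be a better fit inside a chain.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2x, 4x, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter spreads out retries from concurrent callers
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

In this notebook the extraction call could be wrapped as `call_with_backoff(lambda: chain.invoke(inputs))`. In practice `retry_on` should be narrowed to the rate-limit and timeout exceptions raised by the provider SDK, so that genuine bugs are not retried.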

### Extending the Pipeline
- Add a third agent for competitive analysis
- Implement RAG for historical comparison
- Add visualization agent for charts and graphs