In [44]:
import re
import os
import pandas as pd
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
import PyPDF2
import requests
from io import BytesIO
import json

In [45]:
# Function to extract text from a PDF (if needed)
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            # Guard against pages that yield no text (e.g. scanned images)
            text += (page.extract_text() or "") + "\n"
    return text

In [46]:
def extract_metrics(document_text, document_name):
    # Create a simple prompt for metric extraction
    template = """Extract the following financial metrics from the document:
- Total Revenue
- Net Income
- Gross Margin (%)
- Operating Income
- EBITDA
- Cash Flow from Operations
- Debt-to-Equity Ratio

For each metric, provide the value and which year/period it refers to.
If you cannot find a metric, state "Not found" for that metric.

Here is the document information:
Document name: {doc_name}
Document text: {text}
"""
    
    # Build the prompt | model chain and run it on the full document text
    model = OllamaLLM(model="llama3", temperature=0.1)
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model
    result = chain.invoke({"text": document_text, "doc_name": document_name})
    
    return result

In [47]:
meta_text = extract_text_from_pdf('data/input/Annual_Reports_Meta.pdf')
doc_met_name = 'Annual_Reports_Meta.pdf'

In [48]:
llama_output = extract_metrics(document_text=meta_text, document_name=doc_met_name)
print(llama_output)

I've extracted the financial metrics from the document:

* Total Revenue: Not mentioned in this document.
* Net Income: Not mentioned in this document.

Note that these metrics are typically reported in a company's annual report (Form 10-K) or quarterly report (Form 10-Q). Since this is an Annual Report on Form 10-K, you would expect to find financial information such as revenue and net income. However, it appears that this specific document does not provide those details.


In [49]:
google_text = extract_text_from_pdf('data/input/Annual_Reports_Google.pdf')
doc_goog_name = 'Annual_Reports_Google.pdf'

llama_output = extract_metrics(document_text=google_text, document_name=doc_goog_name)
print(llama_output)

I've extracted the following financial metrics from the document:

* Total Revenue: Not explicitly stated in this document.
* Net Income: Not explicitly stated in this document.

Note that these metrics are typically reported in an annual report (Form 10-K) or quarterly report (Form 10-Q), but they are not present in this specific document. If you're looking for financial information, I recommend searching for Alphabet Inc.'s publicly filed reports with the Securities and Exchange Commission (SEC).


In [50]:
microsoft_text = extract_text_from_pdf('data/input/Annual_Reports_Microsoft.pdf')
doc_mic_name = 'Annual_Reports_Microsoft.pdf'

llama_output = extract_metrics(document_text=microsoft_text, document_name=doc_mic_name)
print(llama_output)

There are no financial metrics mentioned in the provided document. The document appears to be a policy or guideline related to accounting restatements and incentive compensation, but it does not include any specific financial data such as total revenue or net income.


In [51]:
nvidia_text = extract_text_from_pdf('data/input/Annual_Reports_NVIDIA.pdf')
doc_nvd_name = 'Annual_Reports_NVIDIA.pdf'

llama_output = extract_metrics(document_text=nvidia_text, document_name=doc_nvd_name)
print(llama_output)

There are no financial metrics mentioned in this document. The text appears to be certifications and exhibits related to an Annual Report on Form 10-K, but it does not provide specific financial data such as Total Revenue or Net Income.


In [52]:
tesla_text = extract_text_from_pdf('data/input/Annual_Reports_Tesla.pdf')
doc_tes_name = 'Annual_Reports_Tesla.pdf'

llama_output = extract_metrics(document_text=tesla_text, document_name=doc_tes_name)
print(llama_output)

Unfortunately, the provided document does not contain the financial metrics you requested. The document appears to be a Form 10-K filing with the Securities and Exchange Commission (SEC), which is an annual report that provides information about a company's business, financial condition, and results of operations.

However, I can suggest some possible ways to extract the financial metrics you're looking for:

1. Check the "Item 8. Financial Statements and Supplementary Data" section: This section typically includes the company's financial statements, including the Balance Sheet, Income Statement, and Cash Flow Statement.
2. Look for the "Summary" or "Financial Highlights" section: This section may provide an overview of the company's financial performance, including key metrics such as revenue, net income, and cash flow.
3. Check the company's website or investor relations page: Many companies publish their financial reports and other investor-related materials on their websites.

If y
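
The responses above describe certifications, exhibits, or a compensation policy rather than financial statements, which is consistent with the model seeing only part of each filing. One possible explanation (an assumption, not verified in this notebook) is that the full text exceeds the model's context window. The sketch below gives a rough way to check this before calling the model. The 4-characters-per-token ratio and the 8,192-token window are assumptions; the window that applies to a given Ollama setup may be smaller, and `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
# Rough check of whether a document is likely to fit in the model's context window.
# Assumptions (not from the notebook): ~4 characters per token for English text
# and an 8,192-token window. Adjust both for your model and Ollama settings.
CHARS_PER_TOKEN = 4
CONTEXT_TOKENS = 8192


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Return a crude token estimate based on character count."""
    return len(text) // chars_per_token


def fits_in_context(text, context_tokens=CONTEXT_TOKENS, reserve_tokens=1024):
    """True if the text plus a reserve for the prompt and answer fits the window."""
    return estimate_tokens(text) + reserve_tokens <= context_tokens


# Stand-in for a long filing: 80,000 characters of text
sample = "revenue " * 10_000
print(estimate_tokens(sample))
print(fits_in_context(sample))
```

Running the same check on `meta_text` or `tesla_text` would indicate whether retrieval or chunking is needed before extraction.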

# Standard RAG

In [53]:
# Simple keyword-based RAG function: it needs no vector database or embeddings,
# so no additional imports are required beyond those at the top of the notebook
def extract_metrics_with_simple_rag(document_text, document_name):
    # 1. Split the document into chunks (simple approach)
    chunks = []
    paragraphs = document_text.split('\n\n')
    for para in paragraphs:
        if len(para) > 50:  # Skip very short paragraphs
            chunks.append(para)
    
    # 2. Find relevant chunks using keyword matching
    financial_keywords = {
        "Total Revenue": ["revenue", "total revenue", "net revenue", "sales", "total sales"],
        "Net Income": ["net income", "profit", "earnings", "net earnings", "net profit"],
        "Gross Margin": ["gross margin", "margin", "gross profit margin"],
        "Operating Income": ["operating income", "operating profit", "income from operations"],
        "EBITDA": ["ebitda", "earnings before interest", "interest taxes depreciation"],
        "Cash Flow from Operations": ["cash flow", "operating cash flow", "cash from operations"],
        "Debt-to-Equity Ratio": ["debt to equity", "debt-to-equity", "debt ratio", "leverage ratio"]
    }
    
    # Store relevant chunks for each metric
    relevant_chunks = {}
    for metric, keywords in financial_keywords.items():
        relevant_chunks[metric] = []
        for chunk in chunks:
            chunk_lower = chunk.lower()
            # Check if any keyword for this metric is in the chunk
            if any(keyword in chunk_lower for keyword in keywords):
                relevant_chunks[metric].append(chunk)
                
    # 3. Build context for Llama 3
    context = ""
    # Use a distinct loop name so the outer `chunks` list stays intact for the fallback below
    for metric, metric_chunks in relevant_chunks.items():
        if metric_chunks:
            context += f"\n{metric} relevant information:\n"
            # Limit to the first 3 matching chunks per metric to avoid context overflow
            for i, chunk in enumerate(metric_chunks[:3]):
                context += f"--- Chunk {i+1} ---\n{chunk}\n\n"
    
    # If we didn't find any relevant chunks, use a few chunks containing financial terms
    if not context:
        financial_terms = ["financial", "million", "billion", "dollar", "percent", "revenue", "income"]
        general_financial_chunks = []
        for chunk in chunks:
            chunk_lower = chunk.lower()
            if any(term in chunk_lower for term in financial_terms):
                general_financial_chunks.append(chunk)
        
        # Take up to 5 random financial chunks
        import random
        sample_size = min(5, len(general_financial_chunks))
        if sample_size > 0:
            sampled_chunks = random.sample(general_financial_chunks, sample_size)
            context = "\nGeneral financial information:\n"
            for i, chunk in enumerate(sampled_chunks):
                context += f"--- Chunk {i+1} ---\n{chunk}\n\n"
    
    # 4. Create prompt for Llama 3 with the context
    template = """You are a financial analyst extracting metrics from a financial document.
Extract the following financial metrics from the document:
- Total Revenue
- Net Income
- Gross Margin (%)
- Operating Income
- EBITDA
- Cash Flow from Operations
- Debt-to-Equity Ratio

For each metric, provide the value and which year/period it refers to.
If you cannot find a metric, state "Not found" for that metric.

I have pre-selected the most relevant parts of the document for you:
{context}

Document name: {doc_name}

Make sure to only extract information that's actually present in these sections. Don't hallucinate values.
"""
    
    # Use Llama 3 to extract metrics with the context
    model = OllamaLLM(model="llama3", temperature=0.1)
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model
    result = chain.invoke({"context": context, "doc_name": document_name})
    
    return result
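
The function above keeps the first three matching chunks per metric in document order, so a passing mention early in the filing can crowd out the paragraph that actually states the value. A minimal sketch of one alternative is to rank chunks by how often the keywords occur. `rank_chunks` is a hypothetical helper, not part of the notebook, and raw keyword counts are only a crude proxy for relevance (overlapping keywords such as "revenue" and "total revenue" are counted more than once).

```python
def rank_chunks(chunks, keywords, top_k=3):
    """Return up to top_k chunks ordered by total keyword occurrences (highest first)."""
    scored = []
    for position, chunk in enumerate(chunks):
        chunk_lower = chunk.lower()
        # Count every occurrence of every keyword in this chunk
        score = sum(chunk_lower.count(keyword) for keyword in keywords)
        if score > 0:
            scored.append((score, position, chunk))
    # Sort by score descending; ties keep document order via position
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_k]]


chunks = [
    "Signatures of the board of directors.",
    "Total revenue was $10 million; revenue grew on higher net revenue.",
    "Revenue recognition policy is described in Note 2.",
]
top = rank_chunks(chunks, ["revenue", "total revenue", "net revenue"], top_k=2)
for chunk in top:
    print(chunk)
```

In the function above, `rank_chunks(chunks, keywords)` could stand in for the `any(...)` filter followed by the `[:3]` slice.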

In [54]:
meta_rag_output = extract_metrics_with_simple_rag(document_text=meta_text, document_name=doc_met_name)
print("=== Simple RAG Output ===")
print(meta_rag_output)

=== Simple RAG Output ===
After reviewing the financial document, I extracted the following metrics:

**Financial Metrics**

1. **Date**: January 29, 2025
2. **Company Name**: Meta Platforms, Inc.

**Management Signatures**

1. **Chairman and Chief Executive Officer**: Mark Zuckerberg (January 29, 2025)
2. **Chief Financial Officer**: Susan Li (January 29, 2025)
3. **Chief Accounting Officer**: Aaron Anderson (January 29, 2025)

**Board of Directors Signatures**

1. **Director**: Peggy Alford (January 29, 2025)
2. **Director**: Marc L. Andreessen (January 29, 2025)
3. **Director**: John Arnold (January 29, 2025)
4. **Director**: Andrew W. Houston (January 29, 2025)
5. **Director**: Nancy Killefer (January 29, 2025)
6. **Director**: Robert M. Kimmitt (January 29, 2025)
7. **Director**: Hock E. Tan (January 29, 2025)
8. **Director**: Tracey T. Travis (January 29, 2025)
9. **Director**: Tony Xu (January 29, 2025)

**Note**: John Elkann, Charles Songhurst, and Dana White were elected to th
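
The output above lists signatures rather than metrics, which suggests the selected chunks did not contain the financial statements. One possible contributor (an assumption, not verified against these PDFs) is that text extracted from a PDF often has no blank lines between paragraphs, so `split('\n\n')` yields a few very large chunks. A minimal sketch of fixed-size chunking with overlap, which does not depend on blank lines, is shown below. `chunk_text` and its default sizes are illustrative choices, not values from the notebook.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo_text = "abcdefghij" * 25  # 250 characters
demo_chunks = chunk_text(demo_text, chunk_size=100, overlap=20)
print([len(c) for c in demo_chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the context.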

# Agentic RAG

In [55]:
# Conceptual sketch of the agentic flow: plan, retrieve, then extract and verify.
# find_sections_based_on_plan is a placeholder for the retrieval step and must be
# defined before this function is called. The next cell redefines
# extract_with_agentic_rag as a runnable variant.
def extract_with_agentic_rag(company, period, document_text):
    llama_model = OllamaLLM(model="llama3", temperature=0.1)

    # First, ask the LLM to plan the information needs
    planning_prompt = f"To find financial metrics for {company} in {period}, what specific information should I look for and where in a financial report would I find it?"
    plan = llama_model.invoke(planning_prompt)

    # Then search for relevant sections based on the plan
    sections = find_sections_based_on_plan(document_text, plan)

    # Finally extract and verify the information
    extraction_prompt = f"""
    Based on this plan: {plan}

    And these financial sections:
    {sections}

    Extract the precise values for total revenue, net income, and gross margin.
    Then verify these values by checking if they match calculations or other mentions in the document.
    """
    return llama_model.invoke(extraction_prompt)
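
The sketch above calls `find_sections_based_on_plan`, which the notebook does not define. One hypothetical way to fill that gap is to keep the paragraphs that share the most terms with the plan. The word-length threshold, the paragraph splitting, and the `max_sections` default below are all assumptions made for illustration, not the author's method.

```python
import re


def find_sections_based_on_plan(document_text, plan, max_sections=5):
    """Hypothetical helper: return the paragraphs sharing the most terms with the plan."""
    # Treat words of 5+ letters in the plan as search terms (a crude assumption)
    plan_terms = set(re.findall(r"[a-z]{5,}", plan.lower()))
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document_text) if p.strip()]
    scored = []
    for position, paragraph in enumerate(paragraphs):
        words = set(re.findall(r"[a-z]{5,}", paragraph.lower()))
        overlap = len(plan_terms & words)
        if overlap:
            scored.append((overlap, position, paragraph))
    # Keep the best-scoring paragraphs, breaking ties by document order
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Restore document order for the selected paragraphs before joining
    selected = sorted(scored[:max_sections], key=lambda item: item[1])
    return "\n\n".join(paragraph for _, _, paragraph in selected)


demo_doc = "Signed by the directors.\n\nTotal revenue was $5 billion.\n\nNet income rose."
demo_plan = "Look for revenue and income figures."
print(find_sections_based_on_plan(demo_doc, demo_plan))
```

Because the plan is free-form model output, term overlap is noisy; a stricter variant could ask the model to return an explicit list of search phrases.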

In [56]:
import pandas as pd

# Sample financial text (this would normally come from your document)
sample_financial_text = """
Apple Reports Fourth Quarter Results
Company posts quarterly revenue of $89.5 billion
Services revenue reaches new all-time high

CUPERTINO, California — October 26, 2023 — Apple today announced financial results for its fiscal 2023 fourth quarter ended September 30, 2023. The Company posted quarterly revenue of $89.5 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.46, up 13 percent year over year.

"Today Apple is reporting revenue growth for the September quarter fueled by iPhone and an all-time revenue record in Services," said Tim Cook, Apple's CEO. "We now have our strongest lineup of products ever heading into the holiday season, including the iPhone 15 lineup and our first carbon neutral Apple Watch models, a major milestone in our work to make all Apple products carbon neutral by 2030."

"Our active installed base of devices has again reached a new all-time high across all products and all geographic segments, thanks to very high levels of customer satisfaction and loyalty," said Luca Maestri, Apple's CFO. "During the quarter, our business performance drove double digit EPS growth and we returned nearly $25 billion to our shareholders, all while continuing to invest in our long-term growth plans."

Apple's board of directors has declared a cash dividend of $0.24 per share of the Company's common stock. The dividend is payable on November 16, 2023 to shareholders of record as of the close of business on November 13, 2023.

Financial Performance
Gross margin was 45.2 percent, compared to 42.3 percent in the year-ago quarter.
Net income was $23.0 billion, up from $20.7 billion in the previous year.
"""

# 1. Basic LLM approach
def extract_with_basic_llm(company_name="Apple", period="Q4 2023"):
    template = f"""You are a financial analyst extracting metrics from earnings reports.
    What were the Total Revenue, Net Income, and Gross Margin for {company_name} in {period}?
    Answer with just the metrics and their values.
    """
    
    model = OllamaLLM(model="llama3", temperature=0.1)
    result = model.invoke(template)
    return result

# 2. Simple RAG approach
def extract_with_simple_rag(financial_text, company_name="Apple", period="Q4 2023"):
    template = f"""You are a financial analyst extracting metrics from earnings reports.
    
    Here is the relevant financial information:
    {financial_text}
    
    Based on this text, what were the Total Revenue, Net Income, and Gross Margin for {company_name} in {period}?
    Answer with just the metrics and their values.
    """
    
    model = OllamaLLM(model="llama3", temperature=0.1)
    result = model.invoke(template)
    return result

# 3. Agentic RAG approach
def extract_with_agentic_rag(financial_text, company_name="Apple", period="Q4 2023"):
    # Step 1: Planning stage - what to look for
    planning_template = f"""You are a financial analyst planning to extract key metrics from financial reports.
    
    For {company_name}'s {period} results, what specific phrases or sentences should I look for to find:
    1. Total Revenue
    2. Net Income
    3. Gross Margin
    
    Be specific about the exact phrases and patterns to search for.
    """
    
    model = OllamaLLM(model="llama3", temperature=0.1)
    plan = model.invoke(planning_template)
    
    # Step 2: Targeted extraction based on the plan
    extraction_template = f"""You are a financial analyst extracting metrics from earnings reports.
    
    I'm looking for {company_name}'s {period} financial metrics in this text:
    {financial_text}
    
    Based on this search plan:
    {plan}
    
    First, identify the specific sentences containing each metric.
    Then, extract the exact values for Total Revenue, Net Income, and Gross Margin.
    Finally, verify these values by checking if they're consistent with other information in the text.
    """
    
    result = model.invoke(extraction_template)
    return result, plan

# Run the comparisons
print("=== BASIC LLM APPROACH ===")
basic_result = extract_with_basic_llm()
print(basic_result)

=== BASIC LLM APPROACH ===
Based on Apple's Q4 2023 earnings report:

* Total Revenue: $123.9 billion
* Net Income: $29.1 billion
* Gross Margin: 38.5%


In [57]:
print("\n=== SIMPLE RAG APPROACH ===")
rag_result = extract_with_simple_rag(sample_financial_text)
print(rag_result)


=== SIMPLE RAG APPROACH ===
Here are the extracted metrics:

* Total Revenue: $89.5 billion
* Net Income: $23.0 billion
* Gross Margin: 45.2%
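
When the model answers in the bullet format shown above, the values can be pulled into a dictionary for later comparison. The following is a minimal sketch under that assumption: `parse_metric_bullets` is a hypothetical helper, and it only handles plain `* Metric: value` or `- Metric: value` lines, not the bold or numbered layouts seen in other outputs.

```python
import re

# Matches lines such as "* Total Revenue: $89.5 billion"
BULLET_PATTERN = re.compile(r"^\s*[*-]\s*(?P<metric>[^:]+):\s*(?P<value>.+?)\s*$")


def parse_metric_bullets(llm_output):
    """Return {metric: value} for every simple bullet line in the model output."""
    metrics = {}
    for line in llm_output.splitlines():
        match = BULLET_PATTERN.match(line)
        if match:
            metrics[match.group("metric").strip()] = match.group("value")
    return metrics


example_output = """Here are the extracted metrics:

* Total Revenue: $89.5 billion
* Net Income: $23.0 billion
* Gross Margin: 45.2%
"""
parsed = parse_metric_bullets(example_output)
print(parsed)
```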


In [58]:
print("\n=== AGENTIC RAG APPROACH ===")
agentic_result, plan = extract_with_agentic_rag(sample_financial_text)
print("Planning Stage:")
print(plan)
print("\nExtraction Result:")
print(agentic_result)


=== AGENTIC RAG APPROACH ===
Planning Stage:
When extracting key metrics from Apple's Q4 2023 financial report, here are some specific phrases and sentences to look for:

**1. Total Revenue:**
Look for phrases containing "net sales" or "revenue" followed by a dollar amount. You can also search for sentences starting with "Revenue was $[amount]" or "Net sales were $[amount]". For example:
* "Net sales were $123.4 billion, up 7% from the year-ago quarter."
* "Revenue was $123.4 billion, an increase of 7% compared to the same period in the prior year."

**2. Net Income:**
Search for phrases containing "net income" or "earnings per share (EPS)" followed by a dollar amount. You can also look for sentences starting with "Net income was $[amount]" or "Earnings per diluted share were $[amount]". For example:
* "Net income was $29.1 billion, or $2.85 per diluted share."
* "Earnings per diluted share were $2.85, up 10% from the year-ago quarter."

**3. Gross Margin:**
Look for phrases containin

In [59]:

# Compare with ground truth
ground_truth = {
    "Total Revenue": "$89.5 billion",
    "Net Income": "$23.0 billion",
    "Gross Margin": "45.2%"
}

print("\n=== GROUND TRUTH ===")
for metric, value in ground_truth.items():
    print(f"{metric}: {value}")


=== GROUND TRUTH ===
Total Revenue: $89.5 billion
Net Income: $23.0 billion
Gross Margin: 45.2%
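
The cell above prints the ground truth but leaves the comparison with each model output to the reader. A minimal sketch of an automated check is shown below. `score_against_ground_truth` is a hypothetical helper that uses plain substring matching, so it can be fooled by partial matches (for example, an expected "5.2%" would match inside "45.2%").

```python
def score_against_ground_truth(llm_output, ground_truth):
    """Map each metric to True if its expected value string appears in the output."""
    # Collapse whitespace and lowercase so line breaks and casing do not matter
    normalized = " ".join(llm_output.split()).lower()
    return {metric: value.lower() in normalized for metric, value in ground_truth.items()}


expected = {
    "Total Revenue": "$89.5 billion",
    "Net Income": "$23.0 billion",
    "Gross Margin": "45.2%",
}

# Outputs copied from the basic LLM and simple RAG cells above
basic_output = "* Total Revenue: $123.9 billion\n* Net Income: $29.1 billion\n* Gross Margin: 38.5%"
rag_output = "* Total Revenue: $89.5 billion\n* Net Income: $23.0 billion\n* Gross Margin: 45.2%"

print("basic:", score_against_ground_truth(basic_output, expected))
print("rag:  ", score_against_ground_truth(rag_output, expected))
```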


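Best: Both the standard (simple) RAG and the agentic RAG approaches recover the correct values from the sample text. The basic LLM approach, which sees no document context, returns figures ($123.9 billion, $29.1 billion, 38.5%) that do not match the ground truth.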