### Post-Success: Serving Multiple Queries with Clean Display
- Run 5-10 different queries in quick succession and display each output cleanly.

In [2]:
# ============================================================================
# CELL 1: Setup - Path Resolution & Imports
# ============================================================================

from pathlib import Path
import sys
import logging

# Suppress noisy logs for clean notebook output
logging.getLogger().setLevel(logging.WARNING)
logging.getLogger("finrag_ml_tg1").setLevel(logging.INFO)

# Find ModelPipeline root and add to sys.path
current = Path.cwd()
for parent in [current] + list(current.parents):
    if parent.name == "ModelPipeline":
        model_root = parent
        break
else:
    raise RuntimeError("Cannot find 'ModelPipeline' root in path tree")

if str(model_root) not in sys.path:
    sys.path.insert(0, str(model_root))

print(f"ModelPipeline root: {model_root}")
print(f"Notebook location: {Path.cwd()}")

ModelPipeline root: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline
Notebook location: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\rag_modules_src\01_Isolation_Test_NBS


## P3 Question Analysis:

#### Key Criteria:
1. Question-Answer Alignment - Does the answer properly address what the question asks?
2. Evidence Precision - Is the answer clearly grounded in the evidence?
3. Testability - Can this be reliably evaluated in a RAG system?
4. Narrative Coherence - Does the answer do justice to the question's scope?

#### FACTOID but BAD for testing (single-sentence local factuals)
- P3.v2 questions P3V2-Q001, P3V2-Q002, P3V2-Q004, P3V2-Q010, Q011: bad for testability, but they serve a different purpose.
  - The question asks for "cash flow from operations", but the answer is auditor's opinion boilerplate.
  - The question asks for "net income", but the answer discusses valuation allowance sensitivity.
  - The question asks for "total revenue", but the answer is a cross-reference statement ("Reference is made to...").
  - These were created much earlier, before metric_pipe, entity extraction, or RAG, in the hope of answering "what is the best heuristic bundling-based match for these simple factoid questions?", but the questions and answers don't align well. Refer to the V1/V2/V3-5 bundles.
  
#### MEDIUM
- P3V2-Q003, Q005, Q007-Q009, Q013-Q016, Q019-Q021: Medium. Too broad for single-sentence answers OR massive bullet-list responses.

#### PERFECT or GOOD
- P3V2-Q006 (Microsoft Intelligent Cloud 2017): PERFECT
    - Question: "describe the change...including direction and magnitude"
    - Answer: "increased $2.4 billion or 10%, primarily due to higher revenue from server products and cloud services"

- P3.v3 Questions (Q001-Q010): Multi-hop synthesis
    - Walmart debt 2018-2020:
        - Cross-year debt strategy synthesis
        - 4 evidence sentences spanning debt extinguishment → Flipkart acquisition → funding strategy
    - Meta regulatory 2019-2024:
        - Temporal regulatory evolution narrative
        - 4 evidence sentences tracing FCPA → GDPR/privacy → EU AI Act
    - And so on.

#### Chosen 5 for testing:
- P3V3-Q001 (Walmart Debt Strategy 2018-2020), P3V2-Q006 (Microsoft Intelligent Cloud Revenue 2017), P3V3-Q004 (Cross-Company Cyber Risks 2009), P3V3-Q007 (Tesla Adjusted EBITDA Definition 2022), P3V3-Q002 (Meta Regulatory Evolution 2019-2024)

- Complexity Distribution:
  - Easy: 1 (Q007 - definition), Medium: 3 (Q006, Q004, Q001), Hard: 1 (Q002 - 5-year regulatory evolution)
- Scope Distribution:
    - Local: 2 (Q006, Q007), Cross-year: 2 (Q001, Q002), Cross-company: 1 (Q004)
    - Span: 3 (Q001, Q002, Q007), List: 1 (Q004), Causal single-sentence: 1 (Q006)
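
The complexity and scope tallies above can be re-derived mechanically. The following is a minimal sketch, assuming the scope and difficulty labels listed for the five chosen questions; the `chosen` dictionary is a hypothetical hand-copied stand-in, not the gold JSON file itself.

```python
from collections import Counter

# Hypothetical stand-in: labels copied by hand from the five chosen questions.
chosen = {
    "P3V3-Q001": {"scope": "cross_year", "difficulty": "medium"},
    "P3V2-Q006": {"scope": "local", "difficulty": "medium"},
    "P3V3-Q004": {"scope": "cross_company", "difficulty": "medium"},
    "P3V3-Q007": {"scope": "local", "difficulty": "easy"},
    "P3V3-Q002": {"scope": "cross_year", "difficulty": "hard"},
}

# Count how many chosen questions fall into each difficulty and scope bucket.
difficulty_counts = Counter(q["difficulty"] for q in chosen.values())
scope_counts = Counter(q["scope"] for q in chosen.values())

print("Difficulty:", dict(difficulty_counts))
print("Scope:", dict(scope_counts))
```

Running the same two `Counter` lines over the loaded test suite, rather than a hand-copied dictionary, would keep the tallies in sync if the selection changes.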




In [5]:
# ============================================================================
# CELL 1: Setup - Path Resolution & Load Gold Test Suite
# ============================================================================

from pathlib import Path
import sys
import logging
import json

# Suppress noisy logs for clean notebook output
logging.getLogger().setLevel(logging.WARNING)
logging.getLogger("finrag_ml_tg1").setLevel(logging.INFO)

# Find ModelPipeline root and add to sys.path
current = Path.cwd()
for parent in [current] + list(current.parents):
    if parent.name == "ModelPipeline":
        model_root = parent
        break
else:
    raise RuntimeError("Cannot find 'ModelPipeline' root in path tree")

if str(model_root) not in sys.path:
    sys.path.insert(0, str(model_root))

print(f"✓ ModelPipeline root: {model_root}")
print(f"✓ Notebook location: {Path.cwd()}\n")

# Construct absolute path to gold test suite
gold_path = model_root / "finrag_ml_tg1" / "data_cache" / "qa_manual_exports" / "goldp3_analysis" / "p3_gold_test_suite_31q.json"

if not gold_path.exists():
    raise FileNotFoundError(f"Gold test suite not found at: {gold_path}")

print(f"✓ Gold test suite: {gold_path}\n")

# Load all questions
with gold_path.open("r", encoding="utf-8") as f:
    all_questions = json.load(f)

# Selected question IDs for testing
SELECTED_IDS = [
    "P3V3-Q001",  # Walmart Debt Strategy 2018-2020 (cross-year, medium, 4 evidence)
    "P3V2-Q006",  # Microsoft Intelligent Cloud 2017 (local, medium, 1 evidence)
    "P3V3-Q004",  # Cross-Company Cyber 2009 (cross-company, medium, 3 evidence)
    "P3V3-Q007",  # Tesla Adjusted EBITDA 2022 (local, easy, 1 evidence)
    "P3V3-Q002",  # Meta Regulatory Evolution 2019-2024 (cross-year, hard, 4 evidence)


    "P3V2-Q015",  # Walmart Market/Competitive Risks 2021
                  # local, medium, 1 evidence # COVID-related risks - comprehensive but single long paragraph
    "P3V2-Q007",  # Genworth Regulatory Risks 2019
                  # local, medium, 1 evidence # Massive bullet list (14+ risk cues) - overwhelming detail
    "P3V2-Q013",  # Walmart Operational/Supply Chain Risks 2011
                  # local, hard, 1 evidence # Long narrative about natural disasters and disruptions
    
    "P3V2-Q001",  # Exxon Mobil Total Revenue 2008
                  # local, easy, 1 evidence
                  # BAD: Asks for revenue, answer is cross-reference boilerplate
    "P3V2-Q002",  # Eli Lilly Net Income 2006
                  # local, easy, 1 evidence
                  # BAD: Asks for net income, answer discusses valuation allowance
    "P3V2-Q004",  # Johnson & Johnson Cash Flow 2016
                  # local, easy, 1 evidence
                  # BAD: Asks for cash flow, answer is auditor's opinion statement
]

# Extract selected questions into structured dictionary
test_suite = {}
for q in all_questions:
    qid = q["question_id"]
    if qid in SELECTED_IDS:
        test_suite[qid] = {
            "question_text": q["question_text"],
            "gold_answer": q["answer_text"],
            "answer_type": q["answer_type"],
            "companies": q["company_name"],
            "years": q["years"],
            "retrieval_scope": q["retrieval_scope"],
            "difficulty": q["difficulty"],
            "evidence_count": len(q["evidence_sentence_ids"]),
            "evidence_ids": q["evidence_sentence_ids"],
        }

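# Fail fast if a selected ID is absent from the gold file (avoids a bare KeyError below)
missing_ids = [qid for qid in SELECTED_IDS if qid not in test_suite]
if missing_ids:
    raise KeyError(f"Selected IDs not found in gold test suite: {missing_ids}")

# Display summary
print("="*80)
print(f"LOADED {len(test_suite)} TEST QUESTIONS")
print("="*80)
for qid in SELECTED_IDS:
    q = test_suite[qid]
    print(f"\n{qid}:")
    print(f"  Companies: {', '.join(q['companies'])}")
    print(f"  Years: {q['years']}")
    print(f"  Scope: {q['retrieval_scope']} | Difficulty: {q['difficulty']} | Evidence: {q['evidence_count']} sentences")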

✓ ModelPipeline root: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline
✓ Notebook location: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\rag_modules_src\01_Isolation_Test_NBS

✓ Gold test suite: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\qa_manual_exports\goldp3_analysis\p3_gold_test_suite_31q.json

LOADED 11 TEST QUESTIONS

P3V3-Q001:
  Companies: Walmart Inc.
  Years: [2018, 2019, 2020]
  Scope: cross_year | Difficulty: medium | Evidence: 4 sentences

P3V2-Q006:
  Companies: MICROSOFT CORP
  Years: [2017]
  Scope: local | Difficulty: medium | Evidence: 1 sentences

P3V3-Q004:
  Companies: RADIAN GROUP INC, NETFLIX INC, Mastercard Inc
  Years: [2009]
  Scope: cross_company | Difficulty: medium | Evidence: 3 sentences

P3V3-Q007:
  Companies: Tesla, Inc.
  Years: [2022]
  Scope: local | Difficulty: easy | Evidence: 1 sentences

P3V3-Q002:
 

In [3]:
# ============================================================================
# CELL 2: Test Question 1 - Walmart Debt Strategy 2018-2020
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get question details
qid = "P3V3-Q001"
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
}

print("="*80)
print(f"TEST: {qid}")
print("="*80)
print(f"Processing query: {query[:80]}...")
print()

# Call orchestrator directly (no main.py wrapper)
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key=None,  # Use default from config
    export_context=True,
    export_response=True
)

# Extract clean data
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata (what main.py would have printed)
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison with Option B layout (Gold vs LLM top, Logs bottom)
display_qa_comparison(
    question_id=qid,
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,  # Pass logs separately
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid}")
print("="*80)


TEST: P3V3-Q001
Processing query: Across its fiscal 2018-2020 10-K filings, how does Walmart Inc. explain the main...

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V3-Q001
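
The result-handling logic in the test cell above is repeated in every later cell, so it may be worth factoring into a helper. This is a minimal sketch, assuming only the result keys the cell itself reads (`error`, `error_type`, `stage`, `answer`, `metadata`, `exports`); `summarize_result` and the stub dictionary are hypothetical names and made-up values, not part of the project's API.

```python
from typing import Any, Dict, Tuple


def summarize_result(result: Dict[str, Any]) -> Tuple[str, str]:
    """Return (llm_answer, stdout_logs) from an answer_query-style result dict."""
    if result.get("error"):
        # Error path: surface the error plus whatever diagnostics are present.
        llm_answer = f"Error: {result['error']}"
        stdout_logs = (
            f"Error Type: {result.get('error_type', 'Unknown')}\n"
            f"Stage: {result.get('stage', 'Unknown')}"
        )
        return llm_answer, stdout_logs

    # Success path: format model, token, cost, and export details.
    llm_meta = result["metadata"]["llm"]
    ctx_meta = result["metadata"]["context"]
    exports = result.get("exports", {})
    stdout_logs = "\n".join([
        f"Model: {llm_meta['model_id']}",
        f"Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out",
        f"Cost: ${llm_meta['cost']:.4f}",
        f"Context: {ctx_meta['context_length']:,} chars",
        "",
        "Exports:",
        f"  Context: {exports.get('context_file', 'N/A')}",
        f"  Response: {exports.get('response_file', 'N/A')}",
        f"  Logs: {exports.get('log_file', 'N/A')}",
    ])
    return result["answer"], stdout_logs


# Stub result with made-up values, used only to exercise the helper.
fake_result = {
    "answer": "stub answer",
    "metadata": {
        "llm": {"model_id": "stub-model", "input_tokens": 1200,
                "output_tokens": 300, "cost": 0.0123},
        "context": {"context_length": 4500},
    },
}
answer, logs = summarize_result(fake_result)
print(answer)
print(logs)
```

With a helper like this, each test cell would reduce to the `answer_query` call, one `summarize_result` call, and the `display_qa_comparison` call.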


In [7]:
# ============================================================================
# CELL 2b: Test Question 1 - Walmart Debt Strategy (SONNET 4.5 Comparison)
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get question details (same as Cell 2)
qid = "P3V3-Q001"
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
}

print("="*80)
print(f"TEST: {qid} (CLAUDE SONNET 4.5)")
print("="*80)
print(f"Processing query: {query[:80]}...")
print()

# Call orchestrator with SONNET model
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key="development_CL_SONN_4_5",  # !!
    export_context=True,
    export_response=True
)

# Extract clean data (same logic as Cell 2)
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison
display_qa_comparison(
    question_id=f"{qid} (Sonnet 4.5)",
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid} (Sonnet 4.5)")
print("="*80)

TEST: P3V3-Q001 (CLAUDE SONNET 4.5)
Processing query: Across its fiscal 2018-2020 10-K filings, how does Walmart Inc. explain the main...

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V3-Q001 (Sonnet 4.5)


### Analysis of Sonnet vs Haiku (manual, GPT 5.1 DEEP, and Opus DEEP reviews)

1. Chronological clarity: Sonnet explains the purpose of the 2018 extinguishment (reducing future interest expense) more explicitly.
2. Dividend tangent: Haiku dedicates 20% of its response to dividends, which is comprehensive but was not requested.
3. Writing style: Sonnet is more technical and precise; Haiku is more narrative, with emphasis formatting.
    

#### Core Differences
- Structural Philosophy: Sonnet uses importance hierarchy (Flipkart→2020→2018), Haiku uses thematic grouping with capital allocation lens
- Question Discipline: Sonnet stays laser-focused on debt/financing flows only, Haiku adds 20% dividend discussion not in scope
- Technical Precision: Sonnet provides $3.1B loss detail on 2018 extinguishment, Haiku glosses over the interest-reduction rationale
- Gold Alignment: Sonnet hits all three gold reference points explicitly, Haiku partially aligned with tangential additions
- Citation Depth: Sonnet includes liquidity buffers ($15B US facilities), Haiku tracks dividend progression ($2.08→$2.16)
- Writing Style: Sonnet technical/precise, Haiku narrative with italicized emphasis for readability
- Token Efficiency: Sonnet 777 tokens with tighter focus, Haiku 719 tokens but diluted with off-topic content
- Interpretive Lens: Sonnet read this as targeted debt analysis, Haiku as broader financing activities review

#### Final Verdict
- Winner: Sonnet 4.5 (~9/10 vs ~7.5-7.8/10)
- Sonnet provides superior response through better question alignment, technical precision, and disciplined scope management.

----

In [8]:
# ============================================================================
# CELL 3: Test Question 2 - Microsoft Intelligent Cloud Revenue 2017
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get question details
qid = "P3V2-Q006"
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
}

print("="*80)
print(f"TEST: {qid}")
print("="*80)
print(f"Processing query: {query[:80]}...")
print()

# Call orchestrator directly
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key="development_CH45",
    export_context=True,
    export_response=True
)

# Extract clean data
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison
display_qa_comparison(
    question_id=qid,
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid}")
print("="*80)

TEST: P3V2-Q006
Processing query: How does MICROSOFT CORP describe the change in its Intelligent Cloud revenue in ...

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V2-Q006


In [10]:
# ============================================================================
# CELL 4: Test Question 3 - Cross-Company Cyber Risks 2009
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get question details
qid = "P3V3-Q004"
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
}

print("="*80)
print(f"TEST: {qid}")
print("="*80)
print(f"Processing query: {query[:80]}...")
print()

# Call orchestrator directly
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key="development_CH45",
    export_context=True,
    export_response=True
)

# Extract clean data
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison
display_qa_comparison(
    question_id=qid,
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid}")
print("="*80)

TEST: P3V3-Q004
Processing query: In their 2009 Form 10-K risk-factor disclosures, how do Radian Group, Netflix an...

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V3-Q004


In [12]:
# ============================================================================
# CELL 5: Test Question 4 - Tesla Adjusted EBITDA Definition 2022
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get question details
qid = "P3V3-Q007"
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
}

print("="*80)
print(f"TEST: {qid}")
print("="*80)
print(f"Processing query: {query[:80]}...")
print()

# Call orchestrator directly
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key="development_CH45",
    export_context=True,
    export_response=True
)

# Extract clean data
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison
display_qa_comparison(
    question_id=qid,
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid}")
print("="*80)

TEST: P3V3-Q007
Processing query: Where does Tesla define Adjusted EBITDA in its 2022 Form 10-K, and how does the ...

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V3-Q007


In [14]:
# ============================================================================
# CELL 6: Test Question 5 - Meta Regulatory Evolution 2019-2024
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get question details
qid = "P3V3-Q002"
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
}

print("="*80)
print(f"TEST: {qid}")
print("="*80)
print(f"Processing query: {query[:80]}...")
print()

# Call orchestrator directly
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key="development_CH45",
    export_context=True,
    export_response=True
)

# Extract clean data
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison
display_qa_comparison(
    question_id=qid,
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid}")
print("="*80)

TEST: P3V3-Q002
Processing query: Over time, how does Meta Platforms describe the regulatory and policy risks that...

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V3-Q002


#### Throttling estimate (requests per minute, RPM)
- 5 queries × 1 request each = 5 requests in 60 seconds = 5 RPM ✓
- 5 queries × 3 requests each = 15 requests in 60 seconds = 15 RPM
- With retries: 15 × 1.5 (average retry multiplier) = 22.5, i.e. ~22 requests
- Result: a ~22 RPM burst, which is THROTTLED if the limit is 10-20 RPM
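
The burst arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the same assumptions as the estimate (all requests land within one 60-second window, and retries are modelled as a flat 1.5× multiplier); `estimate_burst_rpm` and `is_throttled` are hypothetical helper names, and the limits passed in are the 10-20 RPM range quoted in the estimate, not values read from any provider.

```python
def estimate_burst_rpm(
    num_queries: int,
    requests_per_query: int,
    retry_multiplier: float = 1.0,
    window_seconds: float = 60.0,
) -> float:
    """Estimate requests per minute for a burst of queries in one time window."""
    total_requests = num_queries * requests_per_query * retry_multiplier
    # Normalize to a per-minute rate in case the window is not 60 seconds.
    return total_requests * 60.0 / window_seconds


def is_throttled(rpm: float, limit_rpm: float) -> bool:
    """A burst is throttled when its rate exceeds the (assumed) RPM limit."""
    return rpm > limit_rpm


single = estimate_burst_rpm(5, 1)             # one request per query
multi = estimate_burst_rpm(5, 3)              # three requests per query
with_retries = estimate_burst_rpm(5, 3, 1.5)  # plus average retry overhead

for label, rpm in [("single", single), ("multi", multi), ("with retries", with_retries)]:
    print(f"{label}: {rpm:.1f} RPM, throttled at 20 RPM limit: {is_throttled(rpm, 20)}")
```

Spacing the five test cells out, or lowering the retry multiplier, are the two levers this model exposes for staying under the limit.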

In [6]:
# ============================================================================
# CELL: BAD QUESTION TEST - P3V2-Q001 with Sonnet 4.5
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get the BAD question
qid = "P3V2-Q001"
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
    "NOTE": "BAD GOLD ANSWER - Testing if Sonnet finds actual revenue despite poor curation"
}

print("="*80)
print(f"BAD QUESTION TEST: {qid} (SONNET 4.5)")
print("="*80)
print(f"Processing query: {query}")
print("\nNOTE: This question has a poor gold answer (cross-reference instead of actual revenue).")
print("Testing if Sonnet 4.5 can still find the correct information.\n")

# Call orchestrator with SONNET
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key="development_CL_SONN_4_5",  # Use Sonnet 4.5
    export_context=True,
    export_response=True
)

# Extract clean data
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison
display_qa_comparison(
    question_id=f"{qid} (BAD QUESTION - SONNET 4.5)",
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid} (Sonnet 4.5)")
print("="*80)
print("\nAnalysis: Does Sonnet find actual 2008 revenue despite the gold answer being")
print("a cross-reference statement instead of the actual metric?")

BAD QUESTION TEST: P3V2-Q001 (SONNET 4.5)
Processing query: What does EXXON MOBIL CORP report as its total revenue in 2008, and how is this figure described in the filing?

NOTE: This question has a poor gold answer (cross-reference instead of actual revenue).
Testing if Sonnet 4.5 can still find the correct information.

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V2-Q001 (Sonnet 4.5)

Analysis: Does Sonnet find actual 2008 revenue despite the gold answer being
a cross-reference statement instead of the actual metric?


In [7]:
# ============================================================================
# CELL: MED QUESTION TEST - P3V2-Q013 with Sonnet 4.5
# ============================================================================
from finrag_ml_tg1.rag_modules_src.synthesis_pipeline.orchestrator import answer_query
from finrag_ml_tg1.rag_modules_src.utilities.notebook_display import display_qa_comparison

# Get the MED question
qid = "P3V2-Q013" 
question_data = test_suite[qid]
query = question_data["question_text"]

# Prepare metadata for display
metadata = {
    "Companies": ", ".join(question_data["companies"]),
    "Years": str(question_data["years"]),
    "Scope": question_data["retrieval_scope"],
    "Difficulty": question_data["difficulty"],
    "Evidence": f"{question_data['evidence_count']} sentences",
    # This question concerns supply chain risks, not revenue
    "NOTE": "MEDIUM QUESTION - Testing if Sonnet covers the operational/supply chain risks despite complex evidence"
}

print("="*80)
print(f"MEDIUM QUESTION TEST: {qid} (SONNET 4.5)")
print("="*80)
print(f"Processing query: {query}")
print("\nNOTE: This question has a medium difficulty level and involves complex evidence.")
print("Testing if Sonnet 4.5 can handle this complexity and find the correct information.\n")

# Call orchestrator with SONNET
result = answer_query(
    query=query,
    model_root=model_root,
    include_kpi=True,
    include_rag=True,
    model_key="development_CL_SONN_4_5",  # Use Sonnet 4.5
    export_context=True,
    export_response=True
)

# Extract clean data
if result.get('error'):
    print(f"ERROR: {result['error']}")
    llm_answer = f"Error: {result['error']}"
    stdout_logs = f"Error Type: {result.get('error_type', 'Unknown')}\nStage: {result.get('stage', 'Unknown')}"
else:
    llm_answer = result['answer']
    
    # Build stdout_logs from metadata
    llm_meta = result['metadata']['llm']
    ctx_meta = result['metadata']['context']
    exports = result.get('exports', {})
    
    stdout_logs = f"""Model: {llm_meta['model_id']}
Tokens: {llm_meta['input_tokens']:,} in / {llm_meta['output_tokens']:,} out
Cost: ${llm_meta['cost']:.4f}
Context: {ctx_meta['context_length']:,} chars

Exports:
  Context: {exports.get('context_file', 'N/A')}
  Response: {exports.get('response_file', 'N/A')}
  Logs: {exports.get('log_file', 'N/A')}"""

# Display comparison
display_qa_comparison(
    question_id=f"{qid} (MEDIUM Q - SONNET 4.5)",
    question_text=query,
    gold_answer=question_data["gold_answer"],
    synthesis_output=llm_answer,
    metadata=metadata,
    stdout_logs=stdout_logs,
    max_height="600px",
)

print("\n" + "="*80)
print(f"TEST COMPLETE: {qid} (Sonnet 4.5)")
print("="*80)


MEDIUM QUESTION TEST: P3V2-Q013 (SONNET 4.5)
Processing query: What operational or supply chain risks does Walmart Inc. highlight in its 2011 Risk Factors section?

NOTE: This question has a medium difficulty level and involves complex evidence.
Testing if Sonnet 4.5 can handle this complexity and find the correct information.

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ KPI-JSON: Loaded 527 metric records
✓ KPI-JSON: Unique tickers: 2
✓ KPI-JSON: Year range: 2010-2025



TEST COMPLETE: P3V2-Q013 (Sonnet 4.5)
