### Gold Test P3: Business Realism Design Ideas

### Ideas:
- Pick 5 issuers from different industries (e.g., auto, fintech, semiconductor, retail, pharma) and one target year (say 2020).
  - (a) Export the relevant slice of data for those 5 companies in that year, and
  - (b) use that slice to curate a 30-question global gold set (P3).
  - **Problem**: cross-year comparisons need multi-year data, so a single-year slice cannot support them.

### Test questions would mirror actual analyst workflows:
- Temporal queries: "How did Microsoft's cloud revenue grow 2018→2020?"
- Comparative queries: "Which had higher capex in 2019: Apple or Google?"
- Section-aware queries: "What risks did Tesla highlight in 2020?" (expects ITEM_1A hits)
- KPI extraction: "Amazon's operating cash flow in fiscal 2019?"


### 1. Question Taxonomy: Extensive Ideas 

- A. Local factual (exact statement; single sentence evidence).
    - Examples: “What was Company A’s revenue in 2020?” Answer: numeric + unit; Evidence: one MD&A line.
- B. Local causal (“why/what drove”).
    - Example: “Why did operating cash flow increase in 2020 for Company B?” Evidence: MD&A causal sentence.
- C. Cross-section within year (global in structure, same year).
    - Example: “What major supply chain risk did Company C highlight in 2020?” Evidence may live in ITEM_1A, not MD&A.
- D. Cross-year within company (2019 vs 2020).
    - Example: “How did capex change from 2019 to 2020?” Answer requires two evidence sentences (one per year); allow a small numeric tolerance.
- E. Cross-company compare (same year).
    - Example: “Which of Company D or E reported higher gross margin in 2020?” Evidence: one sentence per company.
- F. Aggregation/summarization light.
    - Example: “List Company A’s top two stated 2020 growth drivers.” Evidence: two sentences; answer type multispan.
- G. Verification/consistency.
    - Example: “Does the MD&A explicitly state that FX drove revenue decline?” Yes/No with cite.    
- H. Definition/label grounding.
    - Example: “Where does Company A define ‘Adjusted EBITDA’?” Evidence: policy/definitions section; answer: quote span.
- I. Temporal reference disambiguation.
    - Example: “What period does ‘the year’ refer to in this filing?” Evidence: header/date sentence; answer: YYYY.
- J. Negative control (no-answer).
    - Example: “Does Company E disclose headcount in 2020?” Expected: “Not found” with empty evidence; validates abstention.


### Process (Potential Option 1):
- For each of the 5 companies, scan the KPI view and pick 2–3 canonical numeric statements (revenue, gross margin, cash from ops). That gives 10–15 easy local factual questions.
- From MD&A causal atlas, select 1–2 “why” statements per company (5–10 items).
- From Risk atlas, pick 1 specific risk per company; phrase the question to target it (5 items).
- Add 3–4 cross-company compares (gross margin, revenue, capex) and 2–3 cross-year trend questions for one or two issuers.
- Add 2 negative controls and 2 verification/definition items.
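
The per-category counts above can be totalled to check them against the 30-question target. This is a minimal sketch: the category keys are labels made up for this example, and the ranges are copied from the list above.

```python
# Planned question counts per category as (min, max), taken from the process list.
# The dictionary keys are illustrative labels, not names used elsewhere in the project.
PLAN = {
    "local_factual": (10, 15),           # 2-3 numeric statements x 5 companies
    "local_causal": (5, 10),             # 1-2 "why" statements x 5 companies
    "risk": (5, 5),                      # 1 specific risk per company
    "cross_company": (3, 4),
    "cross_year": (2, 3),
    "negative_control": (2, 2),
    "verification_definition": (2, 2),
}
TARGET = 30  # gold-set size from the design ideas

low = sum(lo for lo, _ in PLAN.values())
high = sum(hi for _, hi in PLAN.values())
print(f"planned range: {low}-{high} questions (target {TARGET})")
```

The lower bound lands just under the target and the upper bound well above it, so the easy local factual and causal categories are the natural place to trim.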


### Gold Set Expected Columns:
One row per question; multi-evidence handled as arrays. This schema is friendly to MLflow logging and ES/Kibana indexing.

| Column Name | Type | Description |
|-------------|------|-------------|
| **question_id** | `str` | Stable UUID |
| **cik_int[]** | `list[int]` | Usually one, multiple for cross-company |
| **company_name[]** | `list[str]` | Company names |
| **years[]** | `list[int]` | One or more fiscal years |
| **question_text** | `str` | Natural language query |
| **answer_type** | `enum` | numeric, span, list, boolean, no_answer |
| **answer_text** | `str \| list[str]` | Canonical answer or list |
| **answer_numeric** | `float \| null` | Numeric value if applicable |
| **answer_unit** | `str \| null` | Units (e.g., "USD millions") |
| **tolerance** | `float \| null` | Acceptable variance for numeric matches |
| **evidence_sentence_ids[]** | `list[str]` | Authoritative sentence IDs from corpus |
| **evidence_spans[]** | `list[str]` | Optional quoted substrings |
| **retrieval_scope** | `enum` | local, global, cross_company, cross_year |
| **difficulty** | `enum` | easy, medium, hard |
| **section_hints[]** | `list[str]` | Optional (MD&A, Item 1A, Notes) |
| **notes** | `str` | Curator comment |
| **gold_version** | `str` | e.g., "P3.v1" |
| **created_by** | `str` | Curator username |
| **created_at** | `timestamp` | Creation timestamp |
| **curation_confidence** | `float` | Range: 0.0 to 1.0 |
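
A gold row can be validated before it is logged. The sketch below covers only a subset of the columns and makes two assumptions that the schema does not state: `tolerance` is treated as an absolute difference in `answer_unit`, and a `no_answer` row must carry empty evidence (following the negative-control example in the taxonomy). `GoldRow`, `problems`, and `numeric_match` are hypothetical names, and the sample values are invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

ANSWER_TYPES = {"numeric", "span", "list", "boolean", "no_answer"}


@dataclass
class GoldRow:
    """Subset of the gold-set columns, enough to show the checks."""
    question_id: str
    question_text: str
    answer_type: str
    cik_int: list[int]
    years: list[int]
    answer_numeric: Optional[float] = None
    answer_unit: Optional[str] = None
    tolerance: Optional[float] = None
    evidence_sentence_ids: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable schema violations (empty list means OK)."""
        issues = []
        if self.answer_type not in ANSWER_TYPES:
            issues.append(f"unknown answer_type: {self.answer_type}")
        if self.answer_type == "numeric" and self.answer_numeric is None:
            issues.append("numeric answer without answer_numeric")
        if self.answer_type == "no_answer" and self.evidence_sentence_ids:
            issues.append("no_answer row should have empty evidence")
        if self.answer_type != "no_answer" and not self.evidence_sentence_ids:
            issues.append("answerable row needs at least one evidence sentence")
        return issues


def numeric_match(row: GoldRow, predicted: float) -> bool:
    """Assumption: tolerance is an absolute difference in answer_unit."""
    if row.answer_numeric is None:
        return False
    return abs(predicted - row.answer_numeric) <= (row.tolerance or 0.0)


# Invented example row; the figures are not taken from any filing.
row = GoldRow(
    question_id="q-0001",
    question_text="What was Company A's revenue in 2020?",
    answer_type="numeric",
    cik_int=[1],
    years=[2020],
    answer_numeric=1250.0,
    answer_unit="USD millions",
    tolerance=5.0,
    evidence_sentence_ids=["sent-123"],
)
print(row.problems(), numeric_match(row, 1253.0), numeric_match(row, 1260.0))
```

If a relative tolerance is preferred, only the comparison inside `numeric_match` needs to change.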

### Locked Analyst Map Queries: Which Views Do We Want?
- Save each view as a JSON/NDJSON file under data_cache/qa_manual_exports/goldp3_analysis/… 
- Together the views should cover the P3 taxonomy (numeric KPIs, risks, causal MD&A, cross-year trends, cross-company compares, negatives).

1. KPI numeric scan (per company/year/section)
    - Sentences with numbers + KPI-ish words (revenue, net income, cash from ops, etc).
    - Feeds: easy numeric Qs, basic trends, cross-company compares.
2. Risk atlas (ITEM 1A topics)
   - Sentences from ITEM_1A tagged into coarse risk topics (regulatory, liquidity, competition, cyber, other).
   - Feeds: risk Qs, “what are the main risks X faces?” style questions.
3. MD&A causal lines (ITEM 7/7A)
   - Sentences from MD&A with causal cues (“because”, “due to”, “as a result”, “driven by”).
   - Feeds: “why did metric X change?” / causal explanation Qs.
4. Template / boilerplate detector (near-duplicate narrative)
   - Normalized sentences (numbers stripped, case-folded) grouped across companies/years to surface boilerplate language.
   - Feeds: cross-year/cross-company global questions (e.g., “What standard going-concern language does the company use?”).
5. KPI trend outliers (revenue z-scores per company)
   - Extract revenue values per company/year, normalize to a numeric series, compute z-scores across time per company.
   - Feeds: cross-year trend Qs (“In which year did revenue spike/drop unusually?”).
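
Views 3 to 5 each rest on one small heuristic: a causal-cue match, a normalized sentence key, and a per-company z-score. The standard-library sketch below shows one possible form of each. The function names are hypothetical, the numeric placeholder `<num>` stands in for "numbers stripped", the z-score uses the population standard deviation (an assumption), and the sample revenue figures are invented.

```python
from __future__ import annotations

import re
from statistics import mean, pstdev

# View 3: causal cues listed in the view description.
CAUSAL_CUES = re.compile(r"\b(because|due to|as a result|driven by)\b", re.IGNORECASE)


def has_causal_cue(sentence: str) -> bool:
    return CAUSAL_CUES.search(sentence) is not None


# View 4: case-fold and replace numbers so near-duplicate sentences share a key.
def normalize_for_boilerplate(sentence: str) -> str:
    text = sentence.casefold()
    text = re.sub(r"\$?\d[\d,]*(\.\d+)?%?", "<num>", text)
    return re.sub(r"\s+", " ", text).strip()


# View 5: z-score of each year's value against the company's own history.
def zscores(series: dict[int, float]) -> dict[int, float]:
    values = list(series.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {year: 0.0 for year in series}
    return {year: (value - mu) / sigma for year, value in series.items()}


a = normalize_for_boilerplate("Revenue increased 12% to $1,250 million.")
b = normalize_for_boilerplate("Revenue increased 8% to $980 million.")
z = zscores({2018: 100.0, 2019: 100.0, 2020: 130.0})  # invented revenue series
print(a == b, has_causal_cue("Sales rose due to higher volume."), max(z, key=z.get))
```

Sentences with identical keys from `normalize_for_boilerplate` are boilerplate candidates, and the year with the largest absolute z-score is a candidate for a trend question.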

In [None]:
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent / 'loaders'))

from ml_config_loader import MLConfig
import polars as pl

# Polars display tuning for notebook
pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(200)
pl.Config.set_tbl_width_chars(1000)

config = MLConfig()

base_dir = Path.cwd().parent

# stage 1 facts 
stage1_path = base_dir / "data_cache" / "stage1_facts" / "finrag_fact_sentences.parquet"

if not stage1_path.exists():
    raise FileNotFoundError(stage1_path)

df = pl.read_parquet(stage1_path)

TEXT_COL = "sentence"  # text column in the Stage 1 fact table

qa_export_dir = base_dir / "data_cache" / "qa_manual_exports" / "goldp3_analysis"
qa_export_dir.mkdir(parents=True, exist_ok=True)

df.head()
df.collect_schema()


[DEBUG] ✓ AWS credentials loaded from aws_credentials.env


Schema([('cik', String),
        ('cik_int', Int32),
        ('name', String),
        ('tickers', List(String)),
        ('docID', String),
        ('sentenceID', String),
        ('section_ID', Int64),
        ('section_name', String),
        ('form', String),
        ('sic', String),
        ('sentence', String),
        ('filingDate', String),
        ('report_year', Int64),
        ('reportDate', String),
        ('temporal_bin', String),
        ('likely_kpi', Boolean),
        ('has_numbers', Boolean),
        ('has_comparison', Boolean),
        ('sample_created_at', Datetime(time_unit='us', time_zone='UTC')),
        ('last_modified_date', Datetime(time_unit='us', time_zone='UTC')),
        ('sample_version', String),
        ('source_file_path', String),
        ('load_method', String),
        ('row_hash', String)])

In [12]:
# =====================================================================
# VIEW 1: KPI SCAN (NUMERIC SENTENCES WITH KPI KEYWORDS)
# =====================================================================

import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent / "loaders"))

from ml_config_loader import MLConfig
import polars as pl

# Polars display for notebook
pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(200)
pl.Config.set_tbl_width_chars(1000)

cfg = MLConfig()

# ---------------------------------------------------------------------
# 1) Load Stage 1 facts (sentence-level fact table)
# ---------------------------------------------------------------------
stage1_local = Path.cwd().parent / "data_cache" / "stage1_facts" / "finrag_fact_sentences.parquet"

if stage1_local.exists():
    df = pl.read_parquet(stage1_local)
else:
    meta_uri = f"s3://{cfg.bucket}/{cfg.meta_embeds_path}"
    df = pl.read_parquet(meta_uri, storage_options=cfg.get_storage_options())

# Optional: restrict to a small subset of companies/years while designing views
TARGET_CIKS  = None   # e.g. [1318605, 1276520, 789019, 320193, 1326801]
TARGET_YEARS = None   # e.g. [2018, 2019, 2020]

if TARGET_CIKS is not None:
    df = df.filter(pl.col("cik_int").is_in(TARGET_CIKS))
if TARGET_YEARS is not None and "report_year" in df.columns:
    df = df.filter(pl.col("report_year").is_in(TARGET_YEARS))

# ---------------------------------------------------------------------
# 2) KPI keyword tagging + numeric heuristics
# ---------------------------------------------------------------------
TEXT_COL = "sentence"  # <-- actual text column in your table

# Basic numeric presence (the table already has `has_numbers`; this adds a separate `has_number` column and does not overwrite it)
df_kpi = df.with_columns([
    pl.col(TEXT_COL).str.contains(r"\d").alias("has_number"),
])

# KPI label via heuristic keyword patterns (case-insensitive)
kpi_expr = (
    pl.when(pl.col(TEXT_COL).str.contains(r"(?i)\b(net\s+sales|total\s+revenue|revenues?|sales)\b"))
      .then(pl.lit("revenue"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\bgross\s+(profit|margin)\b"))
      .then(pl.lit("gross_margin"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\boperating\s+(income|loss|margin)\b"))
      .then(pl.lit("operating_income"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\bnet\s+(income|loss)\b"))
      .then(pl.lit("net_income"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\bearnings\s+per\s+share|\bEPS\b"))
      .then(pl.lit("eps"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\b(cash\s+flows?|cash\s+provided\s+by\s+operating\s+activities)\b"))
      .then(pl.lit("cash_from_ops"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\bcapital\s+expenditures?\b|\bcapex\b"))
      .then(pl.lit("capex"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\b(debt|borrowings|indebtedness)\b"))
      .then(pl.lit("debt"))
    .when(pl.col(TEXT_COL).str.contains(r"(?i)\binterest\s+expense\b"))
      .then(pl.lit("interest_expense"))
    .otherwise(pl.lit(None))
)

df_kpi = df_kpi.with_columns([
    kpi_expr.alias("kpi_label"),
    pl.col(TEXT_COL).str.contains(r"(?i)%|percent").alias("is_percent"),
    pl.col(TEXT_COL).str.contains(r"(?i)\$|usd|million|billion").alias("has_amount_cue"),
])

# Crude first numeric span (for analyst eyeballing, not rigorous parsing)
df_kpi = df_kpi.with_columns([
    pl.when(pl.col("has_number"))
      .then(pl.col(TEXT_COL).str.extract(r"(\$?\d[\d,]*(\.\d+)?)", group_index=1))
      .otherwise(None)
      .alias("first_number_raw")
])

# ---------------------------------------------------------------------
# 3) Filter to KPI-ish sentences
# ---------------------------------------------------------------------
df_view1 = (
    df_kpi
    .filter(
        pl.col("has_number") &
        pl.col("kpi_label").is_not_null()
    )
    .select([
        "cik_int",
        "name",
        *(["report_year"] if "report_year" in df_kpi.columns else []),
        "section_name",
        "docID",
        "sentenceID",
        "kpi_label",
        "is_percent",
        "has_amount_cue",
        "first_number_raw",
        pl.col(TEXT_COL).alias("sentence_text"),
    ])
    .sort([
        "cik_int",
        "report_year" if "report_year" in df_kpi.columns else "cik_int",
        "section_name",
        "docID",
        "sentenceID",
    ])
)

print("=== VIEW 1: KPI SCAN (sample) ===")
print(df_view1.head(50))

# ---------------------------------------------------------------------
# 4) Export to JSON for manual/LLM-based curation
# ---------------------------------------------------------------------
export_root = Path.cwd().parent / "data_cache" / "analysis_exports" / "goldp3_views"
export_root.mkdir(parents=True, exist_ok=True)

view1_path = export_root / "view1_kpi_scan.json"
df_view1.write_json(view1_path)

print(f"\n[Saved KPI scan view] -> {view1_path}")
print(f"Rows: {df_view1.height}")


[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
=== VIEW 1: KPI SCAN (sample) ===
shape: (50, 11)
[table output truncated in export; columns: cik_int, name, report_year, section_name, docID, sentenceID, kpi_label, is_percent, has_amount_cue, first_number_raw, sentence_text]

In [14]:
# =====================================================================
# VIEW 2: RISK SENTENCE ATLAS (ITEM 1A BY TOPIC)
# =====================================================================

import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent / "loaders"))

from ml_config_loader import MLConfig
import polars as pl

# Polars display (if not already set)
pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(200)
pl.Config.set_tbl_width_chars(1000)

cfg = MLConfig()

stage1_local = Path.cwd().parent / "data_cache" / "stage1_facts" / "finrag_fact_sentences.parquet"

if stage1_local.exists():
    df = pl.read_parquet(stage1_local)
else:
    meta_uri = f"s3://{cfg.bucket}/{cfg.meta_embeds_path}"
    df = pl.read_parquet(meta_uri, storage_options=cfg.get_storage_options())

TEXT_COL = "sentence"  # <-- actual text column

# Optional restriction (same as View 1 if you want consistency)
TARGET_CIKS  = None   # e.g. [1318605, 1276520, 789019, 320193, 1326801]
TARGET_YEARS = None   # e.g. [2018, 2019, 2020]

if TARGET_CIKS is not None:
    df = df.filter(pl.col("cik_int").is_in(TARGET_CIKS))
if TARGET_YEARS is not None and "report_year" in df.columns:
    df = df.filter(pl.col("report_year").is_in(TARGET_YEARS))

# ---------------------------------------------------------------------
# 1) Focus on Risk Factors sections (Item 1A variants)
# ---------------------------------------------------------------------
df_risk = (
    df
    .with_columns([
        pl.col("section_name").cast(pl.Utf8).alias("section_name_str")
    ])
    .filter(
        pl.col("section_name_str")
          .str.to_uppercase()
          .str.contains(r"ITEM[\s_]*1A")  # after uppercasing, catches ITEM_1A, ITEM 1A, ITEM1A
    )
)

# ---------------------------------------------------------------------
# 2) Risk cue and topic tagging
# ---------------------------------------------------------------------
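# Note: the trailing \b forces a whole-word match, so inflected forms such as
# "risks" or "uncertainty" are not counted by this pattern.
risk_cue_pattern = (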
    r"(?i)\b(risk|uncertain|volatility|adverse|material adverse|"
    r"disruption|default|covenant|liquidity|leverage|cyber|security|"
    r"breach|litigation|regulatory|regulation|fine|sanction|"
    r"pandemic|recession|downturn)\b"
)

regulatory_pat  = r"(?i)\b(regulatory|regulation|compliance|sec|doj|fines?|sanctions?)\b"
liquidity_pat   = r"(?i)\b(liquidity|cash\s+flows?|refinanc(e|ing)|covenants?|default|going\s+concern)\b"
market_pat      = r"(?i)\b(competition|competitors?|market\s+share|pricing\s+pressure|volatility|macroeconomic)\b"
operational_pat = r"(?i)\b(supply\s+chain|operations?|manufactur(e|ing)|facilities|disruptions?)\b"
cyber_pat       = r"(?i)\b(cyber|information\s+security|data\s+breach|unauthorized\s+access|ransomware)\b"
legal_pat       = r"(?i)\b(litigation|lawsuits?|legal\s+proceedings?|claims?|patent|intellectual\s+property)\b"

topic_expr = (
    pl.when(pl.col(TEXT_COL).str.contains(regulatory_pat))
      .then(pl.lit("regulatory"))
    .when(pl.col(TEXT_COL).str.contains(liquidity_pat))
      .then(pl.lit("liquidity_credit"))
    .when(pl.col(TEXT_COL).str.contains(market_pat))
      .then(pl.lit("market_competitive"))
    .when(pl.col(TEXT_COL).str.contains(operational_pat))
      .then(pl.lit("operational_supply_chain"))
    .when(pl.col(TEXT_COL).str.contains(cyber_pat))
      .then(pl.lit("cybersecurity_tech"))
    .when(pl.col(TEXT_COL).str.contains(legal_pat))
      .then(pl.lit("legal_ip_litigation"))
    .otherwise(pl.lit("general_risk"))
)

df_risk_view = (
    df_risk
    .with_columns([
        # Count risk-cue regex matches per sentence (the Polars method is str.count_matches)
        pl.col(TEXT_COL).str.count_matches(risk_cue_pattern).alias("risk_cue_count"),
        topic_expr.alias("risk_topic"),
    ])
    .filter(pl.col("risk_cue_count") > 0)
    .select([
        "cik_int",
        "name",
        *(["report_year"] if "report_year" in df_risk.columns else []),
        "section_name_str",
        "docID",
        "sentenceID",
        "risk_topic",
        "risk_cue_count",
        pl.col(TEXT_COL).alias("sentence_text"),
    ])
    .sort([
        "cik_int",
        "report_year" if "report_year" in df_risk.columns else "cik_int",
        "risk_topic",
        "docID",
        "sentenceID",
    ])
)

print("=== VIEW 2: RISK SENTENCE ATLAS (sample) ===")
print(df_risk_view.head(50))

# Optional: quick aggregate for your own eyeballing
agg = (
    df_risk_view
    .group_by(["cik_int", "name", "risk_topic"])
    .agg([
        pl.len().alias("num_sentences"),
        pl.col("risk_cue_count").mean().alias("avg_risk_cues"),
    ])
    .sort(["cik_int", "risk_topic"])
)
print("\n=== VIEW 2: TOPIC COUNTS PER COMPANY ===")
print(agg)

# ---------------------------------------------------------------------
# 3) Export to JSON for curation
# ---------------------------------------------------------------------
export_root = Path.cwd().parent / "data_cache" / "analysis_exports" / "goldp3_views"
export_root.mkdir(parents=True, exist_ok=True)

view2_path = export_root / "view2_risk_atlas.json"
df_risk_view.write_json(view2_path)

print(f"\n[Saved Risk atlas view] -> {view2_path}")
print(f"Rows: {df_risk_view.height}")


[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
=== VIEW 2: RISK SENTENCE ATLAS (sample) ===
shape: (50, 9)
[table output truncated in export; columns: cik_int, name, report_year, section_name_str, docID, sentenceID, risk_topic, risk_cue_count, sentence_text]

### Table From V2:

```

=== VIEW 2: TOPIC COUNTS PER COMPANY ===
shape: (153, 5)
┌─────────┬──────────────────────┬──────────────────────────┬───────────────┬───────────────┐
│ cik_int ┆ name                 ┆ risk_topic               ┆ num_sentences ┆ avg_risk_cues │
│ ---     ┆ ---                  ┆ ---                      ┆ ---           ┆ ---           │
│ i32     ┆ str                  ┆ str                      ┆ u32           ┆ f64           │
╞═════════╪══════════════════════╪══════════════════════════╪═══════════════╪═══════════════╡
│ 34088   ┆ EXXON MOBIL CORP     ┆ general_risk             ┆ 129           ┆ 1.062016      │
│ 34088   ┆ EXXON MOBIL CORP     ┆ legal_ip_litigation      ┆ 1             ┆ 1.0           │
│ 34088   ┆ EXXON MOBIL CORP     ┆ liquidity_credit         ┆ 21            ┆ 1.047619      │
│ 34088   ┆ EXXON MOBIL CORP     ┆ market_competitive       ┆ 8             ┆ 1.5           │
│ 34088   ┆ EXXON MOBIL CORP     ┆ operational_supply_chain ┆ 54            ┆ 1.296296      │
│ 34088   ┆ EXXON MOBIL CORP     ┆ regulatory               ┆ 88            ┆ 1.545455      │
│ 59478   ┆ ELI LILLY & Co       ┆ cybersecurity_tech       ┆ 54            ┆ 1.425926      │
│ 59478   ┆ ELI LILLY & Co       ┆ general_risk             ┆ 98            ┆ 1.142857      │
│ 59478   ┆ ELI LILLY & Co       ┆ legal_ip_litigation      ┆ 74            ┆ 1.162162      │
│ 59478   ┆ ELI LILLY & Co       ┆ liquidity_credit         ┆ 44            ┆ 1.113636      │
│ 59478   ┆ ELI LILLY & Co       ┆ market_competitive       ┆ 33            ┆ 1.454545      │
│ 59478   ┆ ELI LILLY & Co       ┆ operational_supply_chain ┆ 66            ┆ 1.80303       │
│ 59478   ┆ ELI LILLY & Co       ┆ regulatory               ┆ 264           ┆ 1.458333      │
│ 104169  ┆ Walmart Inc.         ┆ cybersecurity_tech       ┆ 24            ┆ 1.708333      │
│ 104169  ┆ Walmart Inc.         ┆ general_risk             ┆ 86            ┆ 1.174419      │
│ 104169  ┆ Walmart Inc.         ┆ legal_ip_litigation      ┆ 33            ┆ 1.424242      │
│ 104169  ┆ Walmart Inc.         ┆ liquidity_credit         ┆ 34            ┆ 1.529412      │
│ 104169  ┆ Walmart Inc.         ┆ market_competitive       ┆ 11            ┆ 2.181818      │
│ 104169  ┆ Walmart Inc.         ┆ operational_supply_chain ┆ 102           ┆ 1.764706      │
│ 104169  ┆ Walmart Inc.         ┆ regulatory               ┆ 173           ┆ 1.653179      │
│ 200406  ┆ JOHNSON & JOHNSON    ┆ cybersecurity_tech       ┆ 22            ┆ 2.136364      │
│ 200406  ┆ JOHNSON & JOHNSON    ┆ general_risk             ┆ 44            ┆ 1.045455      │
│ 200406  ┆ JOHNSON & JOHNSON    ┆ legal_ip_litigation      ┆ 32            ┆ 1.3125        │
│ 200406  ┆ JOHNSON & JOHNSON    ┆ liquidity_credit         ┆ 15            ┆ 1.0           │
│ 200406  ┆ JOHNSON & JOHNSON    ┆ market_competitive       ┆ 10            ┆ 1.0           │
│ …       ┆ …                    ┆ …                        ┆ …             ┆ …             │
│ 1326801 ┆ Facebook Inc         ┆ operational_supply_chain ┆ 17            ┆ 1.176471      │
│ 1326801 ┆ Meta Platforms, Inc. ┆ operational_supply_chain ┆ 118           ┆ 1.161017      │
│ 1326801 ┆ Meta Platforms, Inc. ┆ regulatory               ┆ 393           ┆ 1.664122      │
│ 1326801 ┆ Facebook Inc         ┆ regulatory               ┆ 56            ┆ 1.553571      │
│ 1341439 ┆ ORACLE CORP          ┆ cybersecurity_tech       ┆ 27            ┆ 1.962963      │
│ 1341439 ┆ ORACLE CORP          ┆ general_risk             ┆ 205           ┆ 1.141463      │
│ 1341439 ┆ ORACLE CORP          ┆ legal_ip_litigation      ┆ 82            ┆ 1.158537      │
│ 1341439 ┆ ORACLE CORP          ┆ liquidity_credit         ┆ 51            ┆ 1.176471      │
│ 1341439 ┆ ORACLE CORP          ┆ market_competitive       ┆ 39            ┆ 1.205128      │
│ 1341439 ┆ ORACLE CORP          ┆ operational_supply_chain ┆ 108           ┆ 1.203704      │
│ 1341439 ┆ ORACLE CORP          ┆ regulatory               ┆ 181           ┆ 1.563536      │
│ 1403161 ┆ VISA INC.            ┆ cybersecurity_tech       ┆ 61            ┆ 1.540984      │
│ 1403161 ┆ VISA INC.            ┆ general_risk             ┆ 169           ┆ 1.088757      │
│ 1403161 ┆ VISA INC.            ┆ legal_ip_litigation      ┆ 119           ┆ 1.470588      │
│ 1403161 ┆ VISA INC.            ┆ liquidity_credit         ┆ 101           ┆ 1.326733      │
│ 1403161 ┆ VISA INC.            ┆ market_competitive       ┆ 27            ┆ 1.037037      │
│ 1403161 ┆ VISA INC.            ┆ operational_supply_chain ┆ 65            ┆ 1.461538      │
│ 1403161 ┆ VISA INC.            ┆ regulatory               ┆ 442           ┆ 1.371041      │
│ 1652044 ┆ Alphabet Inc.        ┆ cybersecurity_tech       ┆ 38            ┆ 1.710526      │
│ 1652044 ┆ Alphabet Inc.        ┆ general_risk             ┆ 255           ┆ 1.215686      │
│ 1652044 ┆ Alphabet Inc.        ┆ legal_ip_litigation      ┆ 105           ┆ 1.171429      │
│ 1652044 ┆ Alphabet Inc.        ┆ liquidity_credit         ┆ 30            ┆ 1.0           │
│ 1652044 ┆ Alphabet Inc.        ┆ market_competitive       ┆ 62            ┆ 1.048387      │
│ 1652044 ┆ Alphabet Inc.        ┆ operational_supply_chain ┆ 40            ┆ 1.075         │
│ 1652044 ┆ Alphabet Inc.        ┆ regulatory               ┆ 192           ┆ 1.598958      │
└─────────┴──────────────────────┴──────────────────────────┴───────────────┴───────────────┘
```

- **V1 and V2 are the two strongest, highest-signal views.**
- View 1 → Numeric KPI QA. This view is the canonical source for revenue, net income, operating income, cash from operations, margins, EPS, and capex, because it already isolates sentences with numeric cues, KPI keywords, extracted numbers, and section context (Item 7, Item 8, etc.).
- View 2 → Risk QA. This view isolates regulatory, liquidity, market/competition, operational/supply-chain, cybersecurity, and legal/IP/litigation risks.
- Risk QA is the other major component of P3: ~10 of the ~28–30 planned questions, or roughly a third.
- `GOLD P3 = View1_numeric + View2_risk + (small add-on for cause/trend/compare)`
- Numeric QA (View 1) → **~12 questions**
- Risk QA (View 2) → **~10 questions**
- Trend / causal / compare QA → **~6–8 questions**

## Results of V1 and V2:

- Both files loaded cleanly:
    - view1_kpi_scan.json: 25,310 KPI-ish sentences
    - view2_risk_atlas.json: 20,753 risk-factor sentences

1. Overall KPI label distribution
- Across all 25,310 rows, the kpi_label distribution is:
    ```
    revenue – 5,727 rows (~22.6%)
    net_income – 3,771 (~14.9%)
    operating_income – 3,061 (~12.1%)
    cash_from_ops – 2,770 (~10.9%)
    gross_margin – 1,759 (~7.0%)
    eps – 1,742 (~6.9%)
    debt – 1,545 (~6.1%)
    interest_expense – 1,309 (~5.2%)
    capex – 1,225 (~4.8%)
    remaining – 2,401 (~9.5%) not itemized above; the nine labels are mutually exclusive, so this gap should be re-checked against the Q1 output.
    ```
2. The strongest candidates for questions are revenue, net income, operating income, cash from ops, gross margin, EPS, and debt.
3. Section patterns (where these live): grouping by section_name + kpi_label shows these most common patterns:
   1. Revenue, net income, EPS, cash from ops, and gross margin concentrate in ITEM_7 (MD&A) and ITEM_8 (Financial Statements).
   2. Debt and interest expense show up heavily in ITEM_7, in the Notes to the financial statements (if present in the schema under ITEM_8 sublabels), and sometimes in specific “liquidity / capital resources” subsections within MD&A.
4. Per-company KPI coverage (for the 5 biggest issuers):
   1. The 5 highest-volume companies in V1 (by row count) each have a few thousand KPI sentences.
   2. Coverage is measured as the count of rows per (cik_int, name, kpi_label).
   3. All big names have strong coverage in revenue, net_income, operating_income, cash_from_ops, and gross_margin.
   4. Some issuers are particularly “chatty” about debt or interest_expense, which makes them ideal candidates for P3 questions on leverage, interest burden, etc.
5. Canonical KPI sentence candidates (for curation)
    ```
    For each (cik, year, kpi_label) in {revenue, net_income, cash_from_ops, gross_margin, eps}:
      filter rows with that kpi_label for the issuer/year;
      prefer rows where has_amount_cue == True (mentions $, “million”, “billion”, etc.);
      if multiple, take the earliest by (section_name, sentenceID). 
    ``` 

6. This yields a compact set of candidate lines like:
    ```
    A “revenues increased X% to $Y million/billion” sentence in ITEM_7.
    A “net income (loss) was $X” sentence in ITEM_8.
    A “cash provided by operating activities was $X” sentence in the cash-flow discussion.
    EPS lines like “diluted earnings per share were $X” or “basic and diluted net loss per share was $(Y)”.
    ```

7. View 2 – Risk sentence atlas: what the data looks like:
```   
    1. general_risk – big background class (sentences that mention risk terms but don’t match a more specific lexicon).
    2. market_competitive – competition, pricing pressure, macroeconomic volatility, etc.
    3. regulatory – regulation, compliance, fines, SEC/DOJ language.
    4. liquidity_credit – liquidity, covenants, refinancing, default risk.
    5. operational_supply_chain – operations, manufacturing, facilities, disruptions.
    6. cybersecurity_tech – cyber, security, unauthorized access, data breach.
    7. legal_ip_litigation – litigation, lawsuits, IP disputes.
```

8. Shape P3 risk questions as:
```
    “According to the company’s Risk Factors, what are the main regulatory risks it highlights?”
    “What liquidity and covenant risks does the company describe?”
    “How does the company describe cybersecurity or data breach risks?”
```

9. Ranked sentences by risk_cue_count (how many risk cue terms appear in the sentence). The top-ranked sentences show:
   1. Multi-cue phrases: “material adverse”, “significant volatility”, “liquidity”, “covenant”, “default”, “regulatory actions” all in one sentence.
   2. Dense descriptions (e.g., “A failure to comply with regulatory requirements or a downgrade in our credit rating could materially and adversely affect our liquidity, increase our cost of capital....”).
   3. These are ideal for P3 questions that ask an analyst to summarize a key risk in their own words.
   4. They also suit QA that expects retrieval of that exact sentence as evidence.
    


In [16]:
import polars as pl
from pathlib import Path

# -------------------------------------------------------------------
# Load artifacts
# -------------------------------------------------------------------
export_root = Path.cwd().parent / "data_cache" / "analysis_exports" / "goldp3_views"

view1_path = export_root / "view1_kpi_scan.json"
view2_path = export_root / "view2_risk_atlas.json"

df_kpi  = pl.read_json(view1_path)
df_risk = pl.read_json(view2_path)

print("Loaded:")
print(f" - KPI view rows:  {df_kpi.height}")
print(f" - Risk view rows: {df_risk.height}")
print("\n")

# Make sure expected columns exist
print("KPI columns:", df_kpi.columns)
print("Risk columns:", df_risk.columns)
print("\n")


# ===================================================================
# Q1: Global KPI label distribution (top 20)
# ===================================================================
print("=== Q1: Global KPI label distribution (top 20) ===")
q1 = (
    df_kpi
    .group_by("kpi_label")
    .agg([
        pl.len().alias("num_sentences"),
    ])
    .sort("num_sentences", descending=True)
)

print(q1.head(20))
print("\n")


# ===================================================================
# Q2: KPI labels per company – top 15 (company, kpi_label combos)
# ===================================================================
print("=== Q2: KPI label counts per company (top 15 combos) ===")
q2 = (
    df_kpi
    .group_by(["cik_int", "name", "kpi_label"])
    .agg([
        pl.len().alias("num_sentences"),
    ])
    .sort("num_sentences", descending=True)
    .head(15)
)

print(q2)
print("\n")


# ===================================================================
# Q3: For a few core KPIs, which companies talk about them the most?
#     (top 3 companies within each priority KPI)
# ===================================================================
priority_kpis = ["revenue", "net_income", "operating_income", "cash_from_ops", "gross_margin", "eps"]

print("=== Q3: Top companies per core KPI (top 3 per KPI) ===")
q3 = (
    df_kpi
    .filter(pl.col("kpi_label").is_in(priority_kpis))
    .group_by(["kpi_label", "cik_int", "name"])
    .agg([
        pl.len().alias("num_sentences"),
    ])
    .sort(["kpi_label", "num_sentences"], descending=[False, True])
    # A plain .head(10) after this sort would return rows for only the first
    # KPI label alphabetically, so take the top rows within each label instead.
    .group_by("kpi_label", maintain_order=True)
    .head(3)
)

print(q3)
print("\n")


# ===================================================================
# Q4: Global risk-topic distribution
# ===================================================================
print("=== Q4: Global risk-topic distribution ===")
q4 = (
    df_risk
    .group_by("risk_topic")
    .agg([
        pl.len().alias("num_sentences"),
        pl.col("risk_cue_count").mean().alias("avg_risk_cue_count"),
    ])
    .sort("num_sentences", descending=True)
)

print(q4)
print("\n")


# ===================================================================
# Q5: Risk topics per company – top 15 company–topic combos
# ===================================================================
print("=== Q5: Risk topics per company (top 15 combos) ===")
q5 = (
    df_risk
    .group_by(["cik_int", "name", "risk_topic"])
    .agg([
        pl.len().alias("num_sentences"),
        pl.col("risk_cue_count").mean().alias("avg_risk_cue_count"),
    ])
    .sort("num_sentences", descending=True)
    .head(15)
)

print(q5)
print("\n")


# ===================================================================
# Q6: Spiciest risk sentences – highest risk_cue_count
# ===================================================================
print("=== Q6: Spiciest risk sentences (top 20 by risk_cue_count) ===")
q6 = (
    df_risk
    .sort(["risk_cue_count"], descending=True)
    .select([
        "cik_int",
        "name",
        *(["report_year"] if "report_year" in df_risk.columns else []),
        "section_name_str",
        "sentenceID",
        "risk_topic",
        "risk_cue_count",
        "sentence_text",
    ])
    .head(20)
)

print(q6)
print("\nDone.")


Loaded:
 - KPI view rows:  25310
 - Risk view rows: 20753


KPI columns: ['cik_int', 'name', 'report_year', 'section_name', 'docID', 'sentenceID', 'kpi_label', 'is_percent', 'has_amount_cue', 'first_number_raw', 'sentence_text']
Risk columns: ['cik_int', 'name', 'report_year', 'section_name_str', 'docID', 'sentenceID', 'risk_topic', 'risk_cue_count', 'sentence_text']


=== Q1: Global KPI label distribution (top 20) ===
shape: (9, 2)
┌──────────────────┬───────────────┐
│ kpi_label        ┆ num_sentences │
│ ---              ┆ ---           │
│ str              ┆ u32           │
╞══════════════════╪═══════════════╡
│ revenue          ┆ 12551         │
│ debt             ┆ 5537          │
│ cash_from_ops    ┆ 2782          │
│ net_income       ┆ 1526          │
│ operating_income ┆ 1179          │
│ interest_expense ┆ 511           │
│ gross_margin     ┆ 434           │
│ capex            ┆ 408           │
│ eps              ┆ 382           │
└──────────────────┴───────────────┘


=== Q2

In [20]:
# =====================================================================
# EXPORTS FOR P3 GOLD CREATION
# Query A: KPI canonical candidates (from View1)
# Query B: Risk canonical candidates (from View2)
# =====================================================================

import polars as pl
from pathlib import Path

# Export directory
export_root = Path.cwd().parent / "data_cache" / "analysis_exports" / "goldp3_views"
export_root.mkdir(parents=True, exist_ok=True)

# ---------------------------------------------------------------------
# Query A — KPI canonical candidates (View1)
# ---------------------------------------------------------------------
priority = ["revenue", "net_income", "operating_income", "cash_from_ops", "eps"]
sections = ["ITEM_7", "ITEM_8"]

qA = (
    df_kpi
    .filter(pl.col("kpi_label").is_in(priority))
    .filter(
        pl.col("section_name")
          .str.to_uppercase()
          .is_in(sections)
    )
    .sort(["cik_int", "report_year", "kpi_label"])
    .select([
        "cik_int",
        "name",
        "report_year",
        "kpi_label",
        "first_number_raw",
        "sentenceID",
        "sentence_text"
    ])
)

qa_path = export_root / "p3_candidates_kpi.json"
qA.write_json(qa_path)

print(f"[Saved Query A KPI candidates] -> {qa_path}")
print(f"Rows: {qA.height}")


# ---------------------------------------------------------------------
# Query B — Strong risk evidence sentences (View2)
# ---------------------------------------------------------------------
qB = (
    df_risk
    .sort(["risk_cue_count"], descending=True)
    .select([
        "cik_int",
        "name",
        "report_year",
        "risk_topic",
        "risk_cue_count",
        "sentenceID",
        "sentence_text"
    ])
)

# Export full list (no .head(200) truncation)
qb_path = export_root / "p3_candidates_risk.json"
qB.write_json(qb_path)

print(f"[Saved Query B Risk candidates] -> {qb_path}")
print(f"Rows: {qB.height}")


[Saved Query A KPI candidates] -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\p3_candidates_kpi.json
Rows: 14990
[Saved Query B Risk candidates] -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\p3_candidates_risk.json
Rows: 20753


In [26]:
import polars as pl
from pathlib import Path
import json
from datetime import datetime, date

def json_serializer(obj):
    """Handle datetime serialization"""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

# Define base directory
base_dir = Path.cwd().parent
dim_dir = base_dir / "data_cache" / "dimensions"

# File paths
sections_parquet = dim_dir / "finrag_dim_sec_sections.parquet"
companies_parquet = dim_dir / "finrag_dim_companies_21.parquet"

# Read and convert sections
df_sections = pl.read_parquet(sections_parquet)
sections_json = dim_dir / "finrag_dim_sec_sections.json"

with open(sections_json, 'w') as f:
    json.dump(df_sections.to_dicts(), f, indent=2, default=json_serializer)
print(f"✓ Sections JSON created: {df_sections.shape[0]} rows")

# Read and convert companies
df_companies = pl.read_parquet(companies_parquet)
companies_json = dim_dir / "finrag_dim_companies_21.json"

with open(companies_json, 'w') as f:
    json.dump(df_companies.to_dicts(), f, indent=2, default=json_serializer)
print(f"✓ Companies JSON created: {df_companies.shape[0]} rows")

# Preview
print("\n--- Sections Preview ---")
print(df_sections.head(3))

print("\n--- Companies Preview ---")
print(df_companies.head(3))

✓ Sections JSON created: 21 rows
✓ Companies JSON created: 21 rows

--- Sections Preview ---
shape: (3, 10)
┌────────────────────┬─────────────────┬──────────────────┬────────────────┬────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────┬──────────────────┬─────────────┬──────────┬───────────────┐
│ sec_item_canonical ┆ hf_section_code ┆ api_section_code ┆ section_code   ┆ section_name                       ┆ section_description                                                              ┆ section_category ┆ part_number ┆ priority ┆ has_sub_items │
│ ---                ┆ ---             ┆ ---              ┆ ---            ┆ ---                                ┆ ---                                                                              ┆ ---              ┆ ---         ┆ ---      ┆ ---           │
│ str                ┆ i32             ┆ str              ┆ str            ┆ str                                ┆ str    

## Better shape for P3.v2:

1. 6–9 numeric KPI questions (as above), covering “local factual + light causal”: categories A and B from the taxonomy.
    - Questions are spread across different KPI labels: revenue, net_income, operating_income, gross_margin, eps, debt.
    - Each is phrased to ask for “number + short explanation” as one span from an MD&A sentence.
    - The answer is treated as a span (the whole sentence), not just a number.

2. 15 Risk-factor narrative questions across topics like regulatory, liquidity/credit, market, operational/supply-chain, cyber, legal/IP, and general risk.



| Field | Value/Rule | Notes |
|-------|------------|-------|
| **question_id** | `"P3V2-Q001"` ... `"P3V2-Q021"` | Sequential IDs for this batch |
| **cik_int** | `list[int]` (single CIK) | Taken from candidate row |
| **company_name** | `list[str]` (single name) | Taken from candidate row |
| **years** | `[report_year]` | From candidate row |
| **question_text** | Template varies by type: | |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ KPI Items | "What does {COMPANY} report as its {metric} in {YEAR}, and how is this figure described in the filing?" | Example: "What does Microsoft report as its Intelligent Cloud revenue in 2021..." |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ Cross-Year KPI | "How does {COMPANY} describe the change in its {metric} in {YEAR}...?" | Example: "How does Microsoft describe the change in its Intelligent Cloud revenue in 2021?" |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ Risk Items | "What {risk_category} risks does {COMPANY} highlight in its {YEAR} Risk Factors section?" | Example: "What regulatory risks does Walmart highlight in its 2021 Risk Factors section?" |
| **answer_type** | `"span"` (all entries) | Testing retrieval + narrative grounding, not numeric parsing |
| **answer_text** | Exact `sentence_text` from candidates | No rewriting - preserve original filing language |
| **answer_numeric** | `null` | Avoiding mis-parsing; can fill later if needed |
| **answer_unit** | `null` | Not parsing units in this version |
| **tolerance** | `null` | N/A for span-type answers |
| **evidence_sentence_ids** | `list[str]` (1 sentenceID per question) | Taken from `p3_candidates_kpi.json` / `p3_candidates_risk.json` |
| **evidence_spans** | `[]` (empty) | Using full sentences, not sub-spans |
| **retrieval_scope** | Varies by question type: | |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ KPI Questions | `"local"` | `section_hints: ["ITEM_7", "ITEM_8"]` |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ Risk Questions | `"local"` | `section_hints: ["ITEM_1A"]` |
| **difficulty** | Varies by question type: | |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ KPI | Mostly `"easy"` | Cross-year segment revenue = `"medium"` |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ Risk | Mix of `"medium"` and `"hard"` | Depends on topic complexity |
| **section_hints** | See retrieval_scope above | Section filtering for retrieval |
| **notes** | (optional curator comments) | Free text field |
| **gold_version** | `"P3.v2"` | Version identifier for this gold set |
| **created_by** | `"joel_curator"` | Curator attribution |
| **created_at** | Single ISO timestamp | Applied to all entries in this generation batch |
| **curation_confidence** | Varies by difficulty: | |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ KPI | `0.85` | High confidence for structured metrics |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ Risk (medium) | `0.80` | Moderate confidence |
| &nbsp;&nbsp;&nbsp;&nbsp;↳ Risk (hard) | `0.75` | Lower confidence for complex topics |
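
The field rules in the table can be read as a mapping from one candidate row to one gold entry. Below is a minimal sketch of that mapping, not the curation code actually used: `build_gold_entry`, `QUESTION_TEMPLATES`, `SECTION_HINTS`, and the `kind` argument are hypothetical names introduced only for illustration, and the empty-string default for `notes` is an assumption. The field values themselves come from the table.

```python
from datetime import datetime, timezone

# Templates and section hints copied from the table above; the dict names are illustrative.
QUESTION_TEMPLATES = {
    "kpi": (
        "What does {company} report as its {topic} in {year}, "
        "and how is this figure described in the filing?"
    ),
    "risk": (
        "What {topic} risks does {company} highlight in its {year} "
        "Risk Factors section?"
    ),
}
SECTION_HINTS = {"kpi": ["ITEM_7", "ITEM_8"], "risk": ["ITEM_1A"]}


def build_gold_entry(candidate, seq, kind, topic, difficulty, created_at):
    """Turn one candidate row (dict) into a P3.v2 gold entry (hypothetical helper)."""
    if kind == "kpi":
        confidence = 0.85
    elif difficulty == "hard":
        confidence = 0.75
    else:
        confidence = 0.80
    return {
        "question_id": f"P3V2-Q{seq:03d}",
        "cik_int": [candidate["cik_int"]],
        "company_name": [candidate["name"]],
        "years": [candidate["report_year"]],
        "question_text": QUESTION_TEMPLATES[kind].format(
            company=candidate["name"], topic=topic, year=candidate["report_year"]
        ),
        "answer_type": "span",
        "answer_text": candidate["sentence_text"],  # exact filing sentence, no rewriting
        "answer_numeric": None,
        "answer_unit": None,
        "tolerance": None,
        "evidence_sentence_ids": [candidate["sentenceID"]],
        "evidence_spans": [],
        "retrieval_scope": "local",
        "difficulty": difficulty,
        "section_hints": SECTION_HINTS[kind],
        "notes": "",  # assumption: empty string when the curator leaves no comment
        "gold_version": "P3.v2",
        "created_by": "joel_curator",
        "created_at": created_at,
        "curation_confidence": confidence,
    }


# Candidate values taken from the Microsoft 2017 example shown later in this notebook.
candidate = {
    "cik_int": 789019,
    "name": "MICROSOFT CORP",
    "report_year": 2017,
    "sentenceID": "0000789019_10-K_2017_section_7_104",
    "sentence_text": (
        "Intelligent Cloud Revenue increased $2.4 billion or 10%, primarily due "
        "to higher revenue from server products and cloud services."
    ),
}
batch_timestamp = datetime.now(timezone.utc).isoformat()  # one timestamp per batch
entry = build_gold_entry(
    candidate, 3, "kpi", "Intelligent Cloud revenue", "medium", batch_timestamp
)
print(entry["question_id"], "-", entry["question_text"])
```

Passing the timestamp in, rather than generating it inside the helper, keeps `created_at` identical across all entries of a batch, as the table requires.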

In [None]:
## Inspecting the gold-set JSON file loaded below:
## File: p3_gold_p3v2_group_all.json


import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent / 'loaders'))

from ml_config_loader import MLConfig
import polars as pl

# Polars display tuning for notebook
pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(200)
pl.Config.set_tbl_width_chars(1000)

config = MLConfig()

base_dir = Path.cwd().parent

# stage 1 facts 
stage1_path = base_dir / "data_cache" / "stage1_facts" / "finrag_fact_sentences.parquet"

if not stage1_path.exists():
    raise FileNotFoundError(stage1_path)

df = pl.read_parquet(stage1_path)

TEXT_COL = "sentence"  # name of the sentence-text column in the fact table

qa_export_dir = base_dir / "data_cache" / "qa_manual_exports" / "goldp3_analysis"
qa_export_dir.mkdir(parents=True, exist_ok=True)

goldset_path = qa_export_dir / "p3_gold_p3v2_group_all.json"

df_gold = pl.read_json(goldset_path)

# Valid columns:
#   ["question_id", "cik_int", "company_name", "years", "question_text", "answer_type", "answer_text", "answer_numeric",
#    "answer_unit", "tolerance", "evidence_sentence_ids", "evidence_spans", "retrieval_scope", "difficulty", "section_hints",
#    "notes", "gold_version", "created_by", "created_at", "curation_confidence"]

# Keep only the columns needed for manual inspection: IDs, question, evidence IDs, answer fields, and section hints.
df_gold_subset = df_gold.select([
    "question_id", "cik_int", "question_text", "evidence_sentence_ids", "answer_text", "answer_numeric", "section_hints"
])

# df_gold

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env


### MANUAL INSPECTION OF GOLD SET 21 ENTRIES: 

In [30]:

"""
    "question_text": "What operational or supply chain risks does Walmart Inc. highlight in its 2021 Risk Factors section?",
    "answer_text": "Such events could result in physical damage to, or the complete loss of, one or more of our properties, the closure of one or more stores, clubs and distribution or fulfillment centers, limitations on store or club operating hours, the lack of an adequate work force in a market, the inability of customers and associates to reach or have transportation to our stores and clubs affected by such events, the evacuation of the populace from areas in which our stores, clubs and distribution and fulfillment centers are located, the unavailability of our digital platforms to our customers, changes in the purchasing patterns of consumers (including the frequency of visits by consumers to physical retail locations, whether as a result of limitations on large gatherings, travel and movement limitations or otherwise) and in consumers' disposable income, the temporary or long-term disruption in the supply of products from some suppliers, the disruption in the transport of goods from overseas, the disruption or delay in the delivery of goods to our distribution and fulfillment centers or stores within a country in which we are operating, the reduction in the availability of products in our stores, the disruption of utility services to our stores and our facilities, and the disruption in our communications with our stores.",
    "evidence_sentence_ids": [
      "0000104169_10-K_2021_section_1A_47"
    ],


    "question_text": "What does Apple Inc. report as its earnings per share in 2006, and how is this figure described in the filing?",
    "answer_type": "span",
    "answer_text": "The following table sets forth the computation of basic and diluted earnings per share: Potentially dilutive securities representing approximately 3.9 million, 12.7 million (as restated(1)), and 8.9 million (as restated(1)) shares of common stock for the years ended September 30, 2006, September 24, 2005, and September 25, 2004, respectively, were excluded from the computation of diluted earnings per share for these periods because their effect would have been antidilutive.",
    "evidence_sentence_ids": [
      "0000320193_10-K_2006_section_8_174"
    ],


    "question_text": "How does MICROSOFT CORP describe the change in its Intelligent Cloud revenue in 2017, including both the direction and magnitude of the change?",
    "answer_type": "span",
    "answer_text": "Intelligent Cloud Revenue increased $2.4 billion or 10%, primarily due to higher revenue from server products and cloud services.",
    "evidence_sentence_ids": [
      "0000789019_10-K_2017_section_7_104"
    ],


    "question_text": "What does ELI LILLY & Co report as its net income in 2006, and how is this figure described in the filing?",
    "answer_type": "span",
    "answer_text": "A 5 percent change in the valuation allowance would result in a change in net income of approximately $25 million.",
    "evidence_sentence_ids": [
      "0000059478_10-K_2006_section_7_326"
    ],

"""


## Simple quick analysis query. Display cik, year, filingID or docID, sentenceID and sentence from main fact table


import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent / 'loaders'))

from ml_config_loader import MLConfig
import polars as pl

pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(1500)
pl.Config.set_tbl_width_chars(1000)

config = MLConfig()

base_dir = Path.cwd().parent

facts_path = base_dir / "data_cache" / "meta_embeds" / "finrag_fact_sentences_meta_embeds.parquet"

if not facts_path.exists():
    raise FileNotFoundError(facts_path)

df = pl.read_parquet(facts_path)

TEXT_COL = "sentence"  

df_filtered = (
    df
    .filter(pl.col("sentenceID") == "0000059478_10-K_2006_section_7_326")
    .select([
        "cik_int",
        "report_year", 
        "reportDate",
        "docID",
        "sentenceID",
        "section_name",
        "sentence",
        "form",
        "sic"
    ])
)

df_filtered

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env


cik_int,report_year,reportDate,docID,sentenceID,section_name,sentence,form,sic
i32,i64,str,str,str,str,str,str,str
59478,2006,"""2006-12-31""","""0000059478_10-K_2006""","""0000059478_10-K_2006_section_7_326""","""ITEM_7""","""A 5 percent change in the valuation allowance would result in a change in net income of approximately $25 million.""","""10-K""","""2834"""
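
The sentence IDs in the gold entries appear to encode the same metadata that the lookup above returns (CIK, form, year, section, position). The sketch below parses an ID so a gold entry's `cik_int`, `years`, and `section_hints` can be cross-checked without loading the fact table. The ID layout is inferred from the examples in this notebook, not from a documented specification, and `parse_sentence_id` and the `ITEM_` prefix mapping are assumptions made for illustration.

```python
import re

# Assumed layout, inferred from IDs such as "0000059478_10-K_2006_section_7_326":
#   <10-digit CIK>_<form>_<year>_section_<item>_<sentence index>
SENTENCE_ID_RE = re.compile(
    r"^(?P<cik>\d{10})_(?P<form>[^_]+)_(?P<year>\d{4})"
    r"_section_(?P<item>[0-9A-Za-z]+)_(?P<index>\d+)$"
)


def parse_sentence_id(sentence_id):
    """Split a sentenceID into its metadata parts (hypothetical helper)."""
    m = SENTENCE_ID_RE.match(sentence_id)
    if m is None:
        raise ValueError(f"Unrecognized sentenceID layout: {sentence_id!r}")
    return {
        "cik_int": int(m["cik"]),           # leading zeros dropped
        "form": m["form"],
        "report_year": int(m["year"]),
        "section_name": f"ITEM_{m['item'].upper()}",  # assumed mapping to section codes
        "sentence_index": int(m["index"]),
    }


parsed = parse_sentence_id("0000059478_10-K_2006_section_7_326")
print(parsed)
```

For the Eli Lilly ID this yields CIK 59478, year 2006, and section `ITEM_7`, which agrees with the row returned by the fact-table lookup above.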


## V3/V4/V5 would add on top of the 21:

1. P3.v2 set: the current 21 entries, as “local, single-sentence, single-answer” QA.
    - Scope is usually one company, one year, one section. Each question asks “What does X say about Y?” rather than about trends or comparisons.
2. View 3 – MD&A causal / “why” questions (taxonomy: B, F)
    - Mines: MD&A sentences with causal connectors: “increased due to…”, “declined primarily because…”, “driven by…”.
    - B-type (local causal), F-type (light summarization).
    - Pulls explanations, not just numbers; the retriever has to find the right cause sentences.
3. View 4 – Cross-year / trend questions (taxonomy: D, I)
    - Mines: pairs of sentences across years for the same company, often from similar MD&A/Note templates.
    - D-type (cross-year change), I-type (temporal reference clarification).
    - Tests global retrieval within a company across multiple years.
4. View 5 – Cross-company, definitions, and verification (taxonomy: E, G, H)
    - Mines: “templated” or repeated sentences across different CIKs and definitions sections.
    - E-type (cross-company compare), G-type (verification/consistency), H-type (definition / label grounding).
    - Sends the retriever to different issuers, not just different years of one issuer.
    - Brings in semantic nuance (“FX drivers”, “Adjusted EBITDA definition”), not just numbers.
5. **More multi-evidence (2–3 sentence IDs per question).**
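
To make the causal-connector mining concrete, the sketch below applies the connector list from the V3 cell later in this notebook to two answer sentences from the manual inspection above. Python's `re` is used here purely for illustration (the notebook itself applies the pattern through Polars), and `causal_connector` is a hypothetical helper name.

```python
import re

# Same connector list as the causal_pattern in the V3 cell of this notebook.
CAUSAL_RE = re.compile(
    r"(?i)\b(due to|because|as a result of|resulted from|"
    r"driven by|attributed to|primarily due to|"
    r"primarily as a result of)\b"
)


def causal_connector(sentence):
    """Return the first causal connector found in the sentence, or None."""
    m = CAUSAL_RE.search(sentence)
    return m.group(1).lower() if m else None


msft = (
    "Intelligent Cloud Revenue increased $2.4 billion or 10%, primarily due to "
    "higher revenue from server products and cloud services."
)
lilly = (
    "A 5 percent change in the valuation allowance would result in a change in "
    "net income of approximately $25 million."
)

print(causal_connector(msft))   # connector found: a View 3 candidate
print(causal_connector(lilly))  # no connector: not a causal sentence under this pattern
```

The Microsoft sentence carries an explicit driver and qualifies as a "why" candidate, while the Eli Lilly sensitivity sentence does not match any connector.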


### What is needed?
- Stage-3 sentence table joined with View1 and/or Risk Atlas.
- Bundles like: (cik, company_name, kpi_label, years=[2019,2020], [sent2019, sent2020, maybe a causal MD&A sentence]).
- Materialize warehouse-ish data as JSON (or Parquet), with each row being a trend bundle for one company+KPI.

The cell below builds three mini-warehouses:
- goldp3_v3_trend_bundles.json (cross-year / trend, KPI + risk)
- goldp3_v4_cross_company_bundles.json (cross-company KPI + risk)
- goldp3_v5_def_verify_candidates.json (definitions + FX/supply-chain/COVID/regulatory verification)

Each row is a bundle with metadata plus a list of candidate sentences (structs). Actual P3 V3/V4/V5 gold entries are then subselected from these bundles.

Some Schemas:
- KPI: cik_int, name, kpi_label, years, sentences, bundle_type, v_group, topic_label (8 cols)
- Risk: cik_int, name, topic_label, years, sentences, bundle_type, v_group (7 cols)
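
As a sketch of the "subselect from these bundles" step, the function below picks one sentence per requested year from a trend bundle so the resulting gold entry can carry multi-evidence IDs. It assumes a bundle has been loaded as a plain dict with the fields listed above. `pick_cross_year_evidence` is a hypothetical helper, taking the first sentence per year is only a placeholder selection rule (a curator would choose manually), and the bundle contents are made-up placeholders, not real filing data.

```python
def pick_cross_year_evidence(bundle, years):
    """Return one sentence struct per requested year, or None if a year is missing.

    Placeholder rule: the first sentence seen for each year wins.
    """
    first_by_year = {}
    for sent in bundle["sentences"]:
        first_by_year.setdefault(sent["report_year"], sent)
    if any(year not in first_by_year for year in years):
        return None
    return [first_by_year[year] for year in years]


# Placeholder bundle shaped like the schema above (values are NOT real data).
bundle = {
    "cik_int": 0,
    "name": "EXAMPLE CO",
    "topic_label": "capex",
    "years": [2019, 2020],
    "sentences": [
        {"report_year": 2019, "sentenceID": "EX_2019_a", "sentence_text": "placeholder"},
        {"report_year": 2020, "sentenceID": "EX_2020_a", "sentence_text": "placeholder"},
        {"report_year": 2020, "sentenceID": "EX_2020_b", "sentence_text": "placeholder"},
    ],
    "bundle_type": "kpi_trend",
    "v_group": "v3_trend",
}

picked = pick_cross_year_evidence(bundle, [2019, 2020])
evidence_sentence_ids = [sent["sentenceID"] for sent in picked]
print(evidence_sentence_ids)
```

Returning `None` when any requested year is absent keeps incomplete pairs out of cross-year questions instead of silently producing single-year evidence.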




In [8]:
# =====================================================================
# GOLD P3 V3/V4/V5 CANDIDATE WAREHOUSES
#  - V3: cross-year trend bundles (KPI + Risk)
#  - V4: cross-company comparison bundles (KPI + Risk)
#  - V5: definition + verification candidates (Stage-1 text)
# =====================================================================

import sys
from pathlib import Path

import polars as pl

# ---------------------------------------------------------------------
# 0) Setup and config
# ---------------------------------------------------------------------
# Add loaders path for MLConfig (if needed)
sys.path.append(str(Path.cwd().parent / "loaders"))
from ml_config_loader import MLConfig  # noqa: E402

# Polars display (for notebook work)
pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(200)
pl.Config.set_tbl_width_chars(1000)

cfg = MLConfig()

# Base paths
project_root = Path.cwd().parent
export_root = project_root / "data_cache" / "analysis_exports" / "goldp3_views"
export_root.mkdir(parents=True, exist_ok=True)

# Existing views
view1_path = export_root / "view1_kpi_scan.json"
view2_path = export_root / "view2_risk_atlas.json"

if not view1_path.exists():
    raise FileNotFoundError(f"Missing KPI view at {view1_path} – run View 1 generation first.")
if not view2_path.exists():
    raise FileNotFoundError(f"Missing Risk view at {view2_path} – run View 2 generation first.")

df_kpi = pl.read_json(view1_path)
df_risk = pl.read_json(view2_path)

print(f"[Loaded] View1 KPI scan -> {view1_path} (rows={df_kpi.height})")
print(f"[Loaded] View2 Risk atlas -> {view2_path} (rows={df_risk.height})")

# ---------------------------------------------------------------------
# Optional restriction knobs for all downstream bundles
# (leave as None for full universe)
# ---------------------------------------------------------------------
TARGET_CIKS  = None  # e.g. [1318605, 1276520, 789019, 320193, 1326801]
TARGET_YEARS = None  # e.g. [2018, 2019, 2020]

if TARGET_CIKS is not None:
    df_kpi = df_kpi.filter(pl.col("cik_int").is_in(TARGET_CIKS))
    df_risk = df_risk.filter(pl.col("cik_int").is_in(TARGET_CIKS))

if TARGET_YEARS is not None and "report_year" in df_kpi.columns:
    df_kpi = df_kpi.filter(pl.col("report_year").is_in(TARGET_YEARS))
if TARGET_YEARS is not None and "report_year" in df_risk.columns:
    df_risk = df_risk.filter(pl.col("report_year").is_in(TARGET_YEARS))





# ---------------------------------------------------------------------
# 1) V3 – Cross-year / trend bundles (KPI + Risk)
# ---------------------------------------------------------------------
causal_pattern = (
    r"(?i)\b(due to|because|as a result of|resulted from|"
    r"driven by|attributed to|primarily due to|"
    r"primarily as a result of)\b"
)

if "report_year" not in df_kpi.columns:
    raise ValueError("df_kpi is missing 'report_year'; ensure View 1 was generated with report_year.")
if "report_year" not in df_risk.columns:
    raise ValueError("df_risk is missing 'report_year'; ensure View 2 was generated with report_year.")

# --- V3 KPI trend bundles ------------------------------------------------
df_kpi_causal = (
    df_kpi
    .filter(pl.col("sentence_text").str.contains(causal_pattern))
    .with_columns(
        pl.col("section_name").alias("section_name_generic")
    )
)

df_v3_kpi = (
    df_kpi_causal
    .group_by(["cik_int", "name", "kpi_label"])
    .agg([
        pl.col("report_year").unique().sort().alias("years"),
        pl.struct(
            [
                "report_year",
                "section_name_generic",  # unified field name
                "docID",
                "sentenceID",
                "sentence_text",
            ]
        ).alias("sentences"),
    ])
    .filter(pl.col("years").list.len() >= 2)
    .with_columns([
        pl.col("kpi_label").alias("topic_label"),
        pl.lit("kpi_trend").alias("bundle_type"),
        pl.lit("v3_trend").alias("v_group"),
    ])
    .drop("kpi_label")
)

print(f"[V3 KPI] trend bundles: {df_v3_kpi.height}")

# --- V3 Risk trend bundles ----------------------------------------------
df_risk_causal = (
    df_risk
    .filter(pl.col("sentence_text").str.contains(causal_pattern))
    .with_columns(
        pl.col("section_name_str").alias("section_name_generic")
    )
)

df_v3_risk = (
    df_risk_causal
    .group_by(["cik_int", "name", "risk_topic"])
    .agg([
        pl.col("report_year").unique().sort().alias("years"),
        pl.struct(
            [
                "report_year",
                "section_name_generic",  # same field name as KPI side
                "docID",
                "sentenceID",
                "sentence_text",
            ]
        ).alias("sentences"),
    ])
    .filter(pl.col("years").list.len() >= 2)
    .with_columns([
        pl.col("risk_topic").alias("topic_label"),
        pl.lit("risk_trend").alias("bundle_type"),
        pl.lit("v3_trend").alias("v_group"),
    ])
    .drop("risk_topic")
)

print(f"[V3 Risk] trend bundles: {df_v3_risk.height}")

# --- Align schemas explicitly before concat -----------------------------
# Ensure identical column names, order, and dtypes
common_cols_v3 = ["cik_int", "name", "years", "sentences", "topic_label", "bundle_type", "v_group"]

df_v3_kpi = (
    df_v3_kpi
    .select(common_cols_v3)
    .with_columns([
        pl.col("cik_int").cast(pl.Int64),
        pl.col("years").cast(pl.List(pl.Int64)),
        pl.col("topic_label").cast(pl.Utf8),
        pl.col("bundle_type").cast(pl.Utf8),
        pl.col("v_group").cast(pl.Utf8),
    ])
)

df_v3_risk = (
    df_v3_risk
    .select(common_cols_v3)
    .with_columns([
        pl.col("cik_int").cast(pl.Int64),
        pl.col("years").cast(pl.List(pl.Int64)),
        pl.col("topic_label").cast(pl.Utf8),
        pl.col("bundle_type").cast(pl.Utf8),
        pl.col("v_group").cast(pl.Utf8),
    ])
)

df_v3_all = pl.concat(
    [df_v3_kpi, df_v3_risk],
    how="vertical",
    rechunk=True,
)

v3_path = export_root / "goldp3_v3_trend_bundles.json"
df_v3_all.write_json(v3_path)
print(f"[Saved V3 trend bundles] -> {v3_path} (rows={df_v3_all.height})")






# ---------------------------------------------------------------------
# 2) V4 – Cross-company comparison bundles (KPI + Risk)
# ---------------------------------------------------------------------
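# Safety cap so bundles don't explode.
# NOTE: list.slice and list.len below operate on sentences, not distinct companies,
# so a bundle can pass the ">= 2" filter with several sentences from one issuer.
MAX_COMPANIES_PER_BUNDLE = 5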

# --- V4 KPI cross-company bundles ---------------------------------------
df_kpi_cc = df_kpi.with_columns(
    pl.col("section_name").alias("section_name_generic")
)

df_v4_kpi = (
    df_kpi_cc
    .select(
        [
            "report_year",
            "kpi_label",
            "cik_int",
            "name",
            "docID",
            "sentenceID",
            "section_name_generic",
            "sentence_text",
        ]
    )
    .group_by(["report_year", "kpi_label"])
    .agg([
        pl.struct(
            [
                "cik_int",
                "name",
                "docID",
                "sentenceID",
                "section_name_generic",  # unified field
                "sentence_text",
            ]
        ).alias("company_sentences")
    ])
    .with_columns([
        pl.col("kpi_label").alias("topic_label"),
        pl.lit("kpi_cross_company").alias("bundle_type"),
        pl.lit("v4_cross_company").alias("v_group"),
    ])
    .drop("kpi_label")
    .with_columns(
        pl.col("company_sentences")
        .list.slice(0, MAX_COMPANIES_PER_BUNDLE)
        .alias("company_sentences")
    )
    .filter(pl.col("company_sentences").list.len() >= 2)
)

print(f"[V4 KPI] cross-company bundles: {df_v4_kpi.height}")

# --- V4 Risk cross-company bundles --------------------------------------
df_risk_cc = df_risk.with_columns(
    pl.col("section_name_str").alias("section_name_generic")
)

df_v4_risk = (
    df_risk_cc
    .select(
        [
            "report_year",
            "risk_topic",
            "cik_int",
            "name",
            "docID",
            "sentenceID",
            "section_name_generic",
            "sentence_text",
        ]
    )
    .group_by(["report_year", "risk_topic"])
    .agg([
        pl.struct(
            [
                "cik_int",
                "name",
                "docID",
                "sentenceID",
                "section_name_generic",  # same unified field
                "sentence_text",
            ]
        ).alias("company_sentences")
    ])
    .with_columns([
        pl.col("risk_topic").alias("topic_label"),
        pl.lit("risk_cross_company").alias("bundle_type"),
        pl.lit("v4_cross_company").alias("v_group"),
    ])
    .drop("risk_topic")
    .with_columns(
        pl.col("company_sentences")
        .list.slice(0, MAX_COMPANIES_PER_BUNDLE)
        .alias("company_sentences")
    )
    .filter(pl.col("company_sentences").list.len() >= 2)
)

print(f"[V4 Risk] cross-company bundles: {df_v4_risk.height}")

# --- Align schemas explicitly before concat ------------------------------
common_cols_v4 = ["report_year", "topic_label", "company_sentences", "bundle_type", "v_group"]

df_v4_kpi = (
    df_v4_kpi
    .select(common_cols_v4)
    .with_columns([
        pl.col("report_year").cast(pl.Int64),
        pl.col("topic_label").cast(pl.Utf8),
        pl.col("bundle_type").cast(pl.Utf8),
        pl.col("v_group").cast(pl.Utf8),
    ])
)

df_v4_risk = (
    df_v4_risk
    .select(common_cols_v4)
    .with_columns([
        pl.col("report_year").cast(pl.Int64),
        pl.col("topic_label").cast(pl.Utf8),
        pl.col("bundle_type").cast(pl.Utf8),
        pl.col("v_group").cast(pl.Utf8),
    ])
)

df_v4_all = pl.concat(
    [df_v4_kpi, df_v4_risk],
    how="vertical",
    rechunk=True,
)

v4_path = export_root / "goldp3_v4_cross_company_bundles.json"
df_v4_all.write_json(v4_path)
print(f"[Saved V4 cross-company bundles] -> {v4_path} (rows={df_v4_all.height})")












# ---------------------------------------------------------------------
# 3) V5 – Definitions + verification / attribution candidates
# ---------------------------------------------------------------------
# For this we go back to Stage-1 sentences (full universe, not only KPI/Risk views).

stage1_local = project_root / "data_cache" / "stage1_facts" / "finrag_fact_sentences.parquet"

if stage1_local.exists():
    df_stage1 = pl.read_parquet(stage1_local)
else:
    meta_uri = f"s3://{cfg.bucket}/{cfg.meta_embeds_path}"
    df_stage1 = pl.read_parquet(meta_uri, storage_options=cfg.get_storage_options())

TEXT_COL = "sentence"  # consistent with the earlier cells

if TARGET_CIKS is not None:
    df_stage1 = df_stage1.filter(pl.col("cik_int").is_in(TARGET_CIKS))
if TARGET_YEARS is not None and "report_year" in df_stage1.columns:
    df_stage1 = df_stage1.filter(pl.col("report_year").is_in(TARGET_YEARS))

# --- Definition candidates ----------------------------------------------
def_pattern = (
    r"(?i)\b(we define|is defined as|are defined as|means|refers to|"
    r"we refer to|herein referred to as|non-GAAP|adjusted EBITDA|"
    r"adjusted earnings|adjusted income)\b"
)

df_defs = df_stage1.filter(
    pl.col(TEXT_COL).str.contains(def_pattern)
)

# --- Verification / attribution candidates ------------------------------
verify_pattern = (
    r"(?i)\b(foreign currency|FX|exchange rate|exchange-rate|"
    r"supply chain|supply-chain|logistics|pandemic|COVID-19|COVID 19|"
    r"coronavirus|regulatory|regulation|compliance|cybersecurity|"
    r"information security|data breach|ransomware)\b"
)

df_verify = df_stage1.filter(
    pl.col(TEXT_COL).str.contains(verify_pattern)
)

# Restrict columns to keep JSON reasonable
base_cols = [
    "cik_int",
    "name",
    *(["report_year"] if "report_year" in df_stage1.columns else []),
    "section_name",
    "docID",
    "sentenceID",
]

df_v5_defs = (
    df_defs
    .select(
        base_cols
        + [
            pl.col(TEXT_COL).alias("sentence_text"),
        ]
    )
    .with_columns(
        [
            pl.lit("definition").alias("candidate_type"),
            pl.lit("v5_def_verify").alias("v_group"),
        ]
    )
)

df_v5_verify = (
    df_verify
    .select(
        base_cols
        + [
            pl.col(TEXT_COL).alias("sentence_text"),
        ]
    )
    .with_columns(
        [
            pl.lit("verification").alias("candidate_type"),
            pl.lit("v5_def_verify").alias("v_group"),
        ]
    )
)

df_v5_all = (
    pl.concat([df_v5_defs, df_v5_verify], how="vertical", rechunk=True)
    .unique(subset=["cik_int", "docID", "sentenceID", "candidate_type"])
)

v5_path = export_root / "goldp3_v5_def_verify_candidates.json"
df_v5_all.write_json(v5_path)
print(f"[Saved V5 def/verify candidates] -> {v5_path} (rows={df_v5_all.height})")

# ---------------------------------------------------------------------
# Summary
# ---------------------------------------------------------------------
print("\n=== SUMMARY ===")
print(f"V3 trend bundles      : {df_v3_all.height} rows -> {v3_path.name}")
print(f"V4 cross-company      : {df_v4_all.height} rows -> {v4_path.name}")
print(f"V5 def/verify cand.   : {df_v5_all.height} rows -> {v5_path.name}")


[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
[Loaded] View1 KPI scan -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\view1_kpi_scan.json (rows=25310)
[Loaded] View2 Risk atlas -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\view2_risk_atlas.json (rows=20753)
[V3 KPI] trend bundles: 127
[V3 Risk] trend bundles: 97
[Saved V3 trend bundles] -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\goldp3_v3_trend_bundles.json (rows=224)
[V4 KPI] cross-company bundles: 178
[V4 Risk] cross-company bundles: 138
[Saved V4 cross-company bundles] -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\goldp3_v4_cross_company_bundles.json 
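The `def_pattern` regex in the cell above mixes explicit definitional phrases (`we define`, `is defined as`) with very common words such as `means`, so the V5 definition candidates will likely include many non-definitions. A minimal sketch of how to probe that on a few sentences before curating, assuming Python's `re` behaves like the regex engine behind Polars' `str.contains` for this pattern; the sample sentences are hypothetical and not taken from any filing:

```python
import re

# Pattern copied from the cell above. Python's `re` stands in for the
# Polars regex engine here, which is an assumption for this simple pattern.
DEF_PATTERN = re.compile(
    r"(?i)\b(we define|is defined as|are defined as|means|refers to|"
    r"we refer to|herein referred to as|non-GAAP|adjusted EBITDA|"
    r"adjusted earnings|adjusted income)\b"
)

# Hypothetical sentences written for illustration only.
samples = {
    "definition": "We define Adjusted EBITDA as net income before interest and taxes.",
    "false_positive": "This means that demand may soften next year.",
    "no_match": "Revenue increased 5% compared to the prior year.",
}

# True when the pattern would keep the sentence as a definition candidate.
hits = {label: bool(DEF_PATTERN.search(text)) for label, text in samples.items()}
print(hits)
```

The second sample is kept even though it defines nothing, which is why the exported candidates still need manual review.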

### Next New File: p3_gold_p3v3_new10


- **V3 – Cross-year trend / narrative (3 questions)**
- P3V3-Q001 – Walmart long-term debt trend (2018–2020)
- P3V3-Q002 – Meta regulatory & policy risk over time
  - Spans 2019, 2022, 2024 and asks how Meta describes regulatory risks to ads/data.
- P3V3-Q003 – J&J COVID-related sales trend (2022–2024)
  - Narrates COVID-19 as a recovery tailwind (2022), a headwind in some consumer categories (2023), and then a specific decline in infectious disease sales primarily driven by lower COVID-19 vaccine revenue (2024).
  - Uses 3 ITEM_7 sentences (2022, 2023, 2024).
- **V4 – Cross-company comparisons (3 questions)**
- P3V3-Q004 – 2009 cyber/data-protection risk (Radian, Netflix, Mastercard)
  - Compares how each issuer frames data protection / security / privacy risk in ITEM_1A.
  - Evidence: one sentence each from RADIAN GROUP INC, NETFLIX INC, Mastercard Inc.
- Cross-company liquidity/credit framing (2010: Walmart, Apple, Microsoft, Icahn Enterprises)
  - Compares legal/contingency risk to Walmart’s liquidity, Apple’s investment credit/valuation risk, Microsoft’s tax/regulatory effects on cash flow, and Icahn’s leveraged investments / counterparty default risk.
  - Evidence: 4 ITEM_1A sentences (one per issuer).
- **V5 – Definitions & verification (4 questions)**
- P3V3-Q007 – Tesla Adjusted EBITDA definition (2022)
  - Uses Tesla’s ITEM_8 line (section_8_592) explicitly defining Adjusted EBITDA as net income/loss adjusted for interest, taxes, D&A, stock-based comp.
- P3V3-Q008 – Icahn Enterprises Adjusted EBITDA definition (2011)
  - Uses section_6_9 in ITEM_6, where Icahn defines Adjusted EBITDA and notes exclusions like discontinued operations and gains/losses on debt extinguishment.
- P3V3-Q009 – J&J COVID-19 vaccine revenue verification (2024)
  - Yes/no question: does J&J explicitly say the infectious disease sales decline is primarily driven by lower COVID-19 vaccine revenue?
  - Evidence: 0000200406_10-K_2024_section_7_61.
- P3V3-Q010 – Meta FX remeasurement verification (2015 & 2016)
  - Yes/no question across two years: does Meta say that movements in “Other income/(expense), net” are primarily due to foreign currency remeasurement?
  - Evidence: 2015_section_7_249 and 2016_section_7_216.
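
The evidence IDs cited in this list appear to follow one layout: zero-padded CIK, form type, report year, section label, sentence index. A small sketch of splitting an ID into those parts, which helps when checking that a question's years and sections agree with its evidence. The layout is inferred from the IDs shown here, and `SentenceRef` and `parse_sentence_id` are hypothetical names:

```python
import re
from typing import NamedTuple


class SentenceRef(NamedTuple):
    cik: int
    form: str
    year: int
    section: str
    index: int


# Layout inferred from IDs such as "0000104169_10-K_2018_section_7_142".
# IDs with a different shape are rejected rather than guessed at.
_ID_RE = re.compile(r"^(\d{10})_([^_]+)_(\d{4})_section_([0-9A-Za-z]+)_(\d+)$")


def parse_sentence_id(sentence_id: str) -> SentenceRef:
    """Split a sentenceID into CIK, form, year, section label and index."""
    m = _ID_RE.match(sentence_id)
    if m is None:
        raise ValueError(f"Unrecognized sentenceID layout: {sentence_id!r}")
    cik, form, year, section, index = m.groups()
    return SentenceRef(int(cik), form, int(year), section, int(index))


ref = parse_sentence_id("0000200406_10-K_2024_section_7_61")
print(ref)
```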

### MANUAL INSPECTION OF GOLD SET NEXT 15 ENTRIES: 

In [None]:

"""
EXTRACT FROM JSON:
    "question_text": "Across its fiscal 2018\u20132020 10-K filings, how does Walmart Inc. explain the main drivers behind changes in its long-term debt and related cash flows from financing activities?",
    "answer_type": "span",
    "answer_text": "Walmart links changes in long-term debt over this period to deliberate capital structure actions. In 2018 it highlights losses and cash outflows from the early extinguishment of higher\u2011rate debt, which were intended to reduce future interest expense. In 2019 and 2020 it explains that large new issuances and subsequent movements in long\u2011term debt were primarily driven by financing the Flipkart acquisition and funding broader corporate and operating needs, with short\u2011term borrowings and repayments shifting as those long\u2011term debt transactions settled.",
    "answer_numeric": null,
    "answer_unit": null,
    "tolerance": null,
    "evidence_sentence_ids": [
      "0000104169_10-K_2018_section_7_142",
      "0000104169_10-K_2018_section_7_224",
      "0000104169_10-K_2019_section_7_209",
      "0000104169_10-K_2020_section_7_219"
    ],

TRUE DATA:    
cik_int	sentenceID	section_name	sentence
i32	str	str	str
104169	"0000104169_10-K_2018_section_7_224"	"ITEM_7"	"Net cash used in financing activities for fiscal 2018 increased due to premiums paid for early extinguishment of debt."
104169	"0000104169_10-K_2019_section_7_209"	"ITEM_7"	"Long-term Debt The following table provides the changes in our long-term debt for fiscal 2019: Our total long-term debt increased $11.6 billion for fiscal 2019, primarily due to the net proceeds from issuance of long-term debt to fund a portion of the purchase price for Flipkart and for general corporate purposes."
104169	"0000104169_10-K_2020_section_7_219"	"ITEM_7"	"The increase was primarily due to the $15.9 billion of net proceeds received in the prior year from the issuance of long-term debt to fund a portion of the purchase price for Flipkart partially offset by $5.5 billion of additional long-term debt in the current year to fund general business operations."
104169	"0000104169_10-K_2018_section_7_142"	"ITEM_7"	"For fiscal 2018, loss on extinguishment of debt was $3.1 billion, due to the early extinguishment of long-term debt which allowed us to retire higher rate debt to reduce interest expense in future periods."


EXTRACT FROM JSON:
        "question_text": "For 2010, how do Walmart, Apple, Microsoft and Icahn Enterprises describe their exposure to liquidity and credit\u2011related risks in their Item 1A risk\u2011factor discussions?",
    "answer_type": "list",
    "answer_text": [
      "Walmart notes that certain legal proceedings and contingencies could adversely affect its results of operations, financial condition and liquidity.",
      "Apple explains that the credit quality, ratings and pricing of its investment portfolio expose it to potential deterioration or losses that could impact financial results.",
      "Microsoft highlights that tax and regulatory developments can have a material adverse impact on its tax expense and cash flows.",
      "Icahn Enterprises warns that difficult market conditions for highly leveraged investments and defaults by counterparties or institutions could significantly affect funds it manages and, in turn, its own liquidity and results."
    ],
    "answer_numeric": null,
    "answer_unit": null,
    "tolerance": null,
    "evidence_sentence_ids": [
      "0000104169_10-K_2010_section_1A_78",
      "0000320193_10-K_2010_section_1A_204",
      "0000789019_10-K_2010_section_1A_145",
      "0000813762_10-K_2010_section_1A_155"
    ],

TRUE DATA:    
cik_int	sentenceID	section_name	sentence
i32	str	str	str
320193	"0000320193_10-K_2010_section_1A_204"	"ITEM_1A"	"Credit ratings and pricing of these investments can be negatively impacted by liquidity, credit deterioration or losses, financial results, or other factors."
789019	"0000789019_10-K_2010_section_1A_145"	"ITEM_1A"	"Although we cannot predict whether or in what form this proposed legislation will pass, if enacted it could have a material adverse impact on our tax expense and cash flow."
104169	"0000104169_10-K_2010_section_1A_78"	"ITEM_1A"	"We are subject to certain legal proceedings that may adversely affect our results of operations, financial condition and liquidity."
813762	"0000813762_10-K_2010_section_1A_155"	"ITEM_1A"	"Furthermore, difficult market conditions may also increase the risk of default with respect to investments held by the Private Funds that have significant debt investments."


EXTRACT FROM JSON:
    "question_text": "In its 2015 and 2016 Form 10\u2011K filings, does Meta explicitly attribute movements in 'Other income/(expense), net' to foreign currency remeasurement, and what does it say?",
    "answer_type": "boolean",
    "answer_text": "Yes. Meta explains that changes in 'Other income/(expense), net' in both 2015 and 2016 were driven primarily by the periodic remeasurement of its foreign currency balances. One year it notes an increase and in the other a decrease, but in each case it points to foreign currency remeasurement as the main cause.",
    "answer_numeric": null,
    "answer_unit": null,
    "tolerance": null,
    "evidence_sentence_ids": [
      "0001326801_10-K_2016_section_7_216",
      "0001326801_10-K_2015_section_7_249"
    ],

TRUE DATA:
cik_int	sentenceID	section_name	sentence
i32	str	str	str
1326801	"0001326801_10-K_2015_section_7_249"	"ITEM_7"	"Other income/(expense), net decreased primarily due to $87 million in foreign exchange losses resulting from the periodic re-measurement of our foreign currency balances."
1326801	"0001326801_10-K_2016_section_7_216"	"ITEM_7"	"Other income/(expense), net increased primarily due to a decrease in foreign exchange losses resulting from the periodic re-measurement of our foreign currency balances."



EXTRACT FROM JSON:
    "question_text": "Between 2022 and 2024, how does Johnson & Johnson describe the impact of COVID-19 on selected consumer and infectious disease product sales in its 10-K MD&A discussions?",
    "answer_type": "span",
    "answer_text": "Johnson & Johnson initially portrays COVID-19 as a tailwind and then a headwind for different parts of the portfolio. In 2022 it notes that certain consumer franchises benefited from innovation, e-commerce strength and COVID-19 recovery. By 2023 it reports operational declines in some personal care categories, citing negative COVID-19 impacts in China and broader pressure on consumption. In its 2024 MD&A the company explains that infectious disease product sales fell versus the prior year primarily because COVID-19 vaccine revenue declined, making pandemic-related demand a key driver of the trend.",
    "evidence_sentence_ids": [
      "0000200406_10-K_2022_section_7_54",
      "0000200406_10-K_2023_section_7_54",
      "0000200406_10-K_2024_section_7_61"
    ],
TRUE DATA:
cik_int	sentenceID	section_name	sentence
i32	str	str	str
200406	"0000200406_10-K_2023_section_7_54"	"ITEM_7"	"The operational decline was due to portfolio simplification in the U.S., competitive pressures in EMEA and China, category decline and pricing pressures in EMEA, as well as suspension of personal care sales in Russia and negative COVID-19 impacts in China."
200406	"0000200406_10-K_2024_section_7_61"	"ITEM_7"	"Infectious disease products sales were $3.4 billion in 2024, a decline of 23.1% as compared to the prior year primarily driven by a decline in COVID-19 vaccine revenue."
200406	"0000200406_10-K_2022_section_7_54"	"ITEM_7"	"The Baby Care franchise sales of $1.6 billion increased 3.2% compared to the prior year. Growth was driven by AVEENO® Asia Pacific eCommerce strength, innovation and COVID-19 recovery."
    

EXTRACT FROM JSON:
    "question_text": "In their 2009 Form 10-K risk-factor disclosures, how do Radian Group, Netflix and Mastercard each describe their exposure to data protection, information security and customer privacy risks?",
    "answer_type": "list",
    "answer_text": [
      "Radian Group warns that, despite believing it has appropriate information security, failures or breaches by employees or third parties could still occur and adversely affect the business.",
      "Netflix emphasizes that privacy concerns and potential compromises of subscriber data could limit its ability to grow, damage its reputation and negatively affect its business.",
      "Mastercard explicitly flags data protection and information security as a distinct risk area, underscoring the importance of safeguarding transaction and cardholder data in its network."
    ],
    "evidence_sentence_ids": [
      "0000890926_10-K_2009_section_1A_410",
      "0001065280_10-K_2009_section_1A_235",
      "0001141391_10-K_2009_section_1A_119"
    ],
TRUE DATA:
cik_int	sentenceID	section_name	sentence
i32	str	str	str
1141391	"0001141391_10-K_2009_section_1A_119"	"ITEM_1A"	"Data Protection and Information Security."
1065280	"0001065280_10-K_2009_section_1A_235"	"ITEM_1A"	"Privacy concerns could limit our ability to leverage our subscriber data and our disclosure of or unauthorized access to subscriber data could adversely impact our business and reputation."
890926	"0000890926_10-K_2009_section_1A_410"	"ITEM_1A"	"While we believe we have appropriate information security policies and systems to prevent unauthorized disclosure, there can be no assurance that unauthorized disclosure, either through the actions of third parties or our employees, will not occur."
    
"""


## Simple quick analysis query. Display cik, year, filingID or docID, sentenceID and sentence from main fact table


import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent / 'loaders'))

from ml_config_loader import MLConfig
import polars as pl

pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(1500)
pl.Config.set_tbl_width_chars(1000)

config = MLConfig()

base_dir = Path.cwd().parent

facts_path = base_dir / "data_cache" / "meta_embeds" / "finrag_fact_sentences_meta_embeds.parquet"

if not facts_path.exists():
    raise FileNotFoundError(facts_path)

df = pl.read_parquet(facts_path)

TEXT_COL = "sentence"  

df_filtered = (
    df
    .filter(
        pl.col("sentenceID").is_in([
            "0000890926_10-K_2009_section_1A_410",
            "0001065280_10-K_2009_section_1A_235",
            "0001141391_10-K_2009_section_1A_119",
        ])
    )
    .select([
        "cik_int",
        # "reportDate",
        # "docID",
        "sentenceID",
        "section_name",
        TEXT_COL,
    ])
)

df_filtered





cik_int,sentenceID,section_name,sentence
i32,str,str,str
1141391,"0001141391_10-K_2009_section_1A_119","ITEM_1A","Data Protection and Information Security."
1065280,"0001065280_10-K_2009_section_1A_235","ITEM_1A","Privacy concerns could limit our ability to leverage our subscriber data and our disclosure of or unauthorized access to subscriber data could adversely impact our business and reputation."
890926,"0000890926_10-K_2009_section_1A_410","ITEM_1A","While we believe we have appropriate information security policies and systems to prevent unauthorized disclosure, there can be no assurance that unauthorized disclosure, either through the actions of third parties or our employees, will not occur."
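
The lookup above confirms three evidence IDs by hand. To cover a whole gold file in one pass, the same check can be written as a set-membership test. A minimal sketch, assuming each gold record carries `question_id` and `evidence_sentence_ids` as in the JSON extracts; `find_missing_evidence` is a hypothetical helper and the `DEMO-TYPO` record is made up:

```python
from typing import Iterable


def find_missing_evidence(
    gold_records: Iterable[dict], known_ids: set[str]
) -> dict[str, list[str]]:
    """Map question_id -> evidence IDs that are absent from known_ids."""
    missing: dict[str, list[str]] = {}
    for rec in gold_records:
        evidence = rec.get("evidence_sentence_ids") or []
        absent = [sid for sid in evidence if sid not in known_ids]
        if absent:
            missing[rec.get("question_id", "<no id>")] = absent
    return missing


# IDs that exist in the fact table (here: the two Meta sentences shown earlier).
known_ids = {
    "0001326801_10-K_2015_section_7_249",
    "0001326801_10-K_2016_section_7_216",
}

gold_records = [
    {
        "question_id": "P3V3-Q010",
        "evidence_sentence_ids": [
            "0001326801_10-K_2016_section_7_216",
            "0001326801_10-K_2015_section_7_249",
        ],
    },
    # Hypothetical record with a mistyped sentence index.
    {
        "question_id": "DEMO-TYPO",
        "evidence_sentence_ids": ["0001326801_10-K_2016_section_7_999"],
    },
]

missing = find_missing_evidence(gold_records, known_ids)
print(missing)
```

In the notebook, `known_ids` could be built from the fact table with something like `set(df["sentenceID"].to_list())`.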


### Gold Test P3v3 10 Queries - Quality Assessment is Excellent.
### Gold Test - Fidelity to Evidence analysis. 
- Question: How does Walmart explain drivers behind changes in long-term debt across 2018-2020?

**Answer Text Claims:**
- 2018: Early extinguishment losses/outflows to reduce future interest expense
- 2019-2020: Issuances for Flipkart acquisition and corporate needs
- Short-term borrowings shifted as long-term debt settled
- **True Data Evidence:**
    - 2018_142: "loss on extinguishment of debt was $3.1 billion, due to early extinguishment... to retire higher rate debt to reduce interest expense"
    - 2018_224: "increased due to premiums paid for early extinguishment of debt"
    - 2019_209: "increased $11.6 billion... due to net proceeds... to fund Flipkart and for general corporate purposes"
    - 2020_219: "$15.9 billion net proceeds... for Flipkart partially offset by $5.5 billion... to fund general business operations"

**Quality Assessment:**
- ✅ Accurately captures 2018 debt extinguishment strategy and intent
- ✅ Correctly links 2019-2020 issuances to Flipkart
- ✅ Synthesizes the temporal progression well
- ✅ Preserves causality (early extinguishment → lower future interest)


- Example 4: J&J COVID-19 Product Impact (P3V3-Q003)
- Question: How does J&J describe COVID-19 impact on consumer/infectious disease sales 2022-2024?

**Answer Text Claims:**
- 2022: Consumer franchises benefited (innovation, e-commerce, COVID recovery)
- 2023: Operational declines (China COVID impacts, consumption pressure)
- 2024: Infectious disease sales fell due to COVID vaccine revenue decline
- **True Data Evidence:**
- 2022_54: "Baby Care... increased 3.2%... driven by AVEENO Asia Pacific eCommerce strength, innovation and COVID-19 recovery"
- 2023_54: "operational decline... suspension of personal care in Russia and negative COVID-19 impacts in China"
- 2024_61: "Infectious disease products sales... decline of 23.1%... primarily driven by decline in COVID-19 vaccine revenue"
- **Quality Assessment:** Good.
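
One part of this fidelity check is mechanical: every year a trend question claims to cover should have at least one evidence sentence from that year's filing. A minimal sketch, assuming the year can be read from the `_<YYYY>_section_` segment of the sentence ID; `uncovered_years` is a hypothetical helper:

```python
import re

# The year is assumed to sit directly before "_section_" in each sentence ID.
_YEAR_IN_ID = re.compile(r"_(\d{4})_section_")


def uncovered_years(expected_years, evidence_ids):
    """Return the expected years that no evidence sentence ID refers to."""
    covered = set()
    for sid in evidence_ids:
        m = _YEAR_IN_ID.search(sid)
        if m:
            covered.add(int(m.group(1)))
    return sorted(set(expected_years) - covered)


# Evidence IDs of the J&J trend question (P3V3-Q003).
jnj_ids = [
    "0000200406_10-K_2022_section_7_54",
    "0000200406_10-K_2023_section_7_54",
    "0000200406_10-K_2024_section_7_61",
]

full_gap = uncovered_years([2022, 2023, 2024], jnj_ids)
partial_gap = uncovered_years([2022, 2023, 2024], jnj_ids[2:])
print(full_gap)     # no gaps when all three sentences are cited
print(partial_gap)  # years left without evidence if only the 2024 sentence is cited
```

This only confirms coverage. Whether the cited sentence actually supports the claim still needs the manual read shown above.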


## Prepare final json:
- 1. ModelPipeline\finrag_ml_tg1\data_cache\qa_manual_exports\goldp3_analysis\p3_gold_p3v2_group_all.json
- 2. ModelPipeline\finrag_ml_tg1\data_cache\qa_manual_exports\goldp3_analysis\p3_gold_p3v3_new10.json

## Check for Keyset. 
```
Schema. Key set:
['answer_numeric', 'answer_text', 'answer_type', 'answer_unit', 'cik_int', 'company_name', 'created_at', 'created_by', 'curation_confidence', 'difficulty', 'evidence_sentence_ids', 'evidence_spans', 'gold_version', 'notes', 'question_id', 'question_text', 'retrieval_scope', 'section_hints', 'tolerance', 'years']
```

In [1]:
import sys
from pathlib import Path
import json

import polars as pl

# --------------------------------------------------------------------
# Setup
# --------------------------------------------------------------------
sys.path.append(str(Path.cwd().parent / "loaders"))
from ml_config_loader import MLConfig  

pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(1500)
pl.Config.set_tbl_width_chars(1000)

config = MLConfig()

base_dir = Path.cwd().parent

g1path = base_dir / "data_cache" / "qa_manual_exports" / "goldp3_analysis" / "p3_gold_p3v2_group_all.json"
g2path = base_dir / "data_cache" / "qa_manual_exports" / "goldp3_analysis" / "p3_gold_p3v3_new10.json"

if not g1path.exists():
    raise FileNotFoundError(g1path)
if not g2path.exists():
    raise FileNotFoundError(g2path)

# --------------------------------------------------------------------
# Load the two JSON files as Python lists
# --------------------------------------------------------------------
with g1path.open("r", encoding="utf-8") as f:
    g1_data = json.load(f)

with g2path.open("r", encoding="utf-8") as f:
    g2_data = json.load(f)

if not isinstance(g1_data, list):
    raise ValueError(f"Expected list in {g1path.name}, got {type(g1_data)}")
if not isinstance(g2_data, list):
    raise ValueError(f"Expected list in {g2path.name}, got {type(g2_data)}")

# --------------------------------------------------------------------
# Schema check: ensure all records in each file share the same key set,
# and that both files have the same schema.
# --------------------------------------------------------------------
def collect_key_sets(records, label):
    """Return set of frozenset(keys) over all dict records."""
    key_sets = set()
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            raise ValueError(f"{label}: element {i} is not a dict (got {type(rec)})")
        key_sets.add(frozenset(rec.keys()))
    return key_sets

g1_key_sets = collect_key_sets(g1_data, "p3v2")
g2_key_sets = collect_key_sets(g2_data, "p3v3")

print("Unique key sets in p3v2:", len(g1_key_sets))
print("Unique key sets in p3v3:", len(g2_key_sets))

if len(g1_key_sets) != 1:
    print("WARNING: p3v2 has multiple key patterns:", g1_key_sets)

if len(g2_key_sets) != 1:
    print("WARNING: p3v3 has multiple key patterns:", g2_key_sets)

# For strict comparison, just compare the *first* schema from each
schema_v2 = next(iter(g1_key_sets))
schema_v3 = next(iter(g2_key_sets))

if schema_v2 != schema_v3:
    missing_in_v3 = schema_v2 - schema_v3
    missing_in_v2 = schema_v3 - schema_v2
    raise ValueError(
        f"Schemas differ.\n"
        f"  Keys in v2 not in v3: {missing_in_v3}\n"
        f"  Keys in v3 not in v2: {missing_in_v2}"
    )

print("Schemas match. Key set:")
print(sorted(schema_v2))

# --------------------------------------------------------------------
# Concatenate and save
# --------------------------------------------------------------------
merged_data = g1_data + g2_data

merged_path = base_dir / "data_cache" / "qa_manual_exports" / "goldp3_analysis" / "p3_gold_test_suite_31q.json"

with merged_path.open("w", encoding="utf-8") as f:
    json.dump(merged_data, f, ensure_ascii=False, indent=2)

print(f"Saved merged JSON with {len(merged_data)} entries to: {merged_path}")


Unique key sets in p3v2: 1
Unique key sets in p3v3: 1
Schemas match. Key set:
['answer_numeric', 'answer_text', 'answer_type', 'answer_unit', 'cik_int', 'company_name', 'created_at', 'created_by', 'curation_confidence', 'difficulty', 'evidence_sentence_ids', 'evidence_spans', 'gold_version', 'notes', 'question_id', 'question_text', 'retrieval_scope', 'section_hints', 'tolerance', 'years']
Saved merged JSON with 31 entries to: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\qa_manual_exports\goldp3_analysis\p3_gold_test_suite_31q.json
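
The merge cell checks that both files share a key set, but plain concatenation would also keep any `question_id` that appears in both files. A small sketch of a duplicate check that could run before saving; `duplicate_question_ids` is a hypothetical helper and the two batches below are toy data, not the real gold files:

```python
from collections import Counter


def duplicate_question_ids(*record_lists):
    """Return question_id values that occur more than once across all lists."""
    counts = Counter(
        rec["question_id"] for records in record_lists for rec in records
    )
    return sorted(qid for qid, n in counts.items() if n > 1)


# Toy batches for illustration only.
batch_a = [{"question_id": "P3V3-Q001"}, {"question_id": "P3V3-Q002"}]
batch_b = [{"question_id": "P3V3-Q002"}, {"question_id": "P3V3-Q003"}]

dupes = duplicate_question_ids(batch_a, batch_b)
print(dupes)
```

In the merge cell this would be called as `duplicate_question_ids(g1_data, g2_data)`, raising if the result is non-empty.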


In [4]:
import json
from pathlib import Path
import polars as pl

base_dir = Path.cwd().parent
merged_path = base_dir / "data_cache" / "qa_manual_exports" / "goldp3_analysis" / "p3_gold_test_suite_31q.json"

if not merged_path.exists():
    raise FileNotFoundError(merged_path)

with merged_path.open("r", encoding="utf-8") as f:
    merged_data = json.load(f)

print("Total questions:", len(merged_data))

# ---------------------------------------------------------------------
# Normalize for Polars:
# - Keep the on-disk JSON untouched.
# - For analytics, convert answer_text to a *string* in all rows.
#   If it's already a string -> keep it.
#   If it's a list[str]      -> join with a separator.
# ---------------------------------------------------------------------
normalized_rows = []
for rec in merged_data:
    row = rec.copy()
    at = row.get("answer_text")
    if isinstance(at, list):
        # Join list answers into a single readable string for analytics only
        row["answer_text"] = " || ".join(at)
    normalized_rows.append(row)

# Now Polars sees a consistent schema and will not choke
df = pl.DataFrame(normalized_rows)

print("\nColumns (schema):")
print(df.columns)

# A few small, useful breakdowns
print("\nCount by gold_version:")
print(df.select(pl.col("gold_version").value_counts()))

print("\nCount by answer_type:")
print(df.select(pl.col("answer_type").value_counts()))

print("\nDistinct companies:", df["company_name"].explode().n_unique())
print("Distinct years:", df["years"].explode().n_unique())

# Optional: quick peek at question_ids and types
print("\nSample questions (id, type, company, years):")
print(
    df.select(
        "question_id",
        "answer_type",
        pl.col("company_name").list.get(0).alias("primary_company"),
        pl.col("years").list.min().alias("min_year"),
        pl.col("years").list.max().alias("max_year"),
    ).head(10)
)


Total questions: 31

Columns (schema):
['question_id', 'cik_int', 'company_name', 'years', 'question_text', 'answer_type', 'answer_text', 'answer_numeric', 'answer_unit', 'tolerance', 'evidence_sentence_ids', 'evidence_spans', 'retrieval_scope', 'difficulty', 'section_hints', 'notes', 'gold_version', 'created_by', 'created_at', 'curation_confidence']

Count by gold_version:
shape: (2, 1)
┌──────────────┐
│ gold_version │
│ ---          │
│ struct[2]    │
╞══════════════╡
│ {"P3.v2",21} │
│ {"P3.v3",10} │
└──────────────┘

Count by answer_type:
shape: (3, 1)
┌───────────────┐
│ answer_type   │
│ ---           │
│ struct[2]     │
╞═══════════════╡
│ {"span",26}   │
│ {"list",3}    │
│ {"boolean",2} │
└───────────────┘

Distinct companies: 15
Distinct years: 18

Sample questions (id, type, company, years):
shape: (10, 5)
┌─────────────┬─────────────┬────────────────────────┬──────────┬──────────┐
│ question_id ┆ answer_type ┆ primary_company        ┆ min_year ┆ max_year │
│ ---         ┆ 

In [None]:
"""    
    "question_text": "What does EXXON MOBIL CORP report as its total revenue in 2008, and how is this figure described in the filing?",
    "answer_type": "span",
    "answer_text": "Reference is made to the following in the Financial Section of this report: • Consolidated financial statements, together with the report thereon of PricewaterhouseCoopers LLP dated February 27, 2009, beginning with the section entitled “Report of Independent Registered Public Accounting Firm” and continuing through “Note 18: Income, Sales-Based and Other Taxes”; • “Quarterly Information” (unaudited); • “Supplemental Information on Oil and Gas Exploration and Production Activities” (unaudited); and • “Frequently Used Terms” (unaudited).",
    "answer_numeric": null,
    "answer_unit": null,
    "tolerance": null,
    "evidence_sentence_ids": [
      "0000034088_10-K_2008_section_8_2"
    ],
"""


## Simple quick analysis query. Display cik, year, filingID or docID, sentenceID and sentence from main fact table


import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent / 'loaders'))

# from ml_config_loader import MLConfig
import polars as pl

pl.Config.set_tbl_rows(50)
pl.Config.set_tbl_cols(20)
pl.Config.set_fmt_str_lengths(1500)
pl.Config.set_tbl_width_chars(1000)

# config = MLConfig()

base_dir = Path.cwd().parent

facts_path = base_dir / "data_cache" / "meta_embeds" / "finrag_fact_sentences_meta_embeds.parquet"

if not facts_path.exists():
    raise FileNotFoundError(facts_path)

df = pl.read_parquet(facts_path)

TEXT_COL = "sentence"  


# valid columns: ["cik", "cik_int", "name", "tickers", "docID", "sentenceID", "section_ID", "section_name", "form", "sic", "sentence", "filingDate", "report_year", "reportDate", "temporal_bin", "likely_kpi", "has_numbers", "has_comparison", "sample_created_at", "last_modified_date", "sample_version", "source_file_path", "load_method", "row_hash", "prev_sentenceID", "next_sentenceID", "sentence_char_length", "sentence_token_count", "section_sentence_count", "embedding_id", "embedding_model", "embedding_dims", "embedding_date", "embedding_ref"]

df_filtered = (
    df
    .filter(pl.col("sentenceID").is_in(["0000034088_10-K_2008_section_8_2"]))
    .select([
        "cik_int",
        # "reportDate",
        "name", "tickers", "docID",
        "section_name", "form", "sic",
        "sentenceID",
        TEXT_COL,
    ])
)
df_filtered

cik_int,name,tickers,docID,section_name,form,sic,sentenceID,sentence
i32,str,list[str],str,str,str,str,str,str
34088,"EXXON MOBIL CORP",["XOM"],"0000034088_10-K_2008","ITEM_8","10-K","2911","0000034088_10-K_2008_section_8_2","Reference is made to the following in the Financial Section of this report: • Consolidated financial statements, together with the report thereon of PricewaterhouseCoopers LLP dated February 27, 2009, beginning with the section entitled “Report of Independent Registered Public Accounting Firm” and continuing through “Note 18: Income, Sales-Based and Other Taxes”; • “Quarterly Information” (unaudited); • “Supplemental Information on Oil and Gas Exploration and Production Activities” (unaudited); and • “Frequently Used Terms” (unaudited)."
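
The entry inspected above pairs a "total revenue" question with a cross-reference sentence that contains no revenue figure. A rough heuristic for surfacing similar entries, assuming KPI questions use the "report as its ..." phrasing and that a usable answer mentions a dollar amount or a scaled number. Both regexes and the name `lacks_amount` are assumptions for illustration:

```python
import re

# Heuristic patterns (assumptions): an amount is "$" followed by a digit,
# or a number followed by a scale word.
_AMOUNT_RE = re.compile(
    r"\$\s?\d|\b\d[\d,.]*\s*(?:thousand|million|billion)\b", re.IGNORECASE
)
_KPI_QUESTION_RE = re.compile(r"\breport as its\b", re.IGNORECASE)


def lacks_amount(record):
    """True when a 'report as its <KPI>' question has an answer with no amount."""
    question = record.get("question_text", "")
    answer = record.get("answer_text", "")
    if isinstance(answer, list):
        answer = " ".join(answer)
    return bool(_KPI_QUESTION_RE.search(question)) and not _AMOUNT_RE.search(answer)


# Exxon entry from above, with the answer text abbreviated to its opening clause.
exxon = {
    "question_text": "What does EXXON MOBIL CORP report as its total revenue in 2008, "
                     "and how is this figure described in the filing?",
    "answer_text": "Reference is made to the following in the Financial Section of "
                   "this report: Consolidated financial statements, together with the "
                   "report thereon of PricewaterhouseCoopers LLP dated February 27, 2009",
}

# Hypothetical record with a made-up amount, for contrast.
demo = {
    "question_text": "What does Example Corp report as its operating income in 2018?",
    "answer_text": "Operating income was $1.0 billion.",
}

flags = [lacks_amount(exxon), lacks_amount(demo)]
print(flags)
```

A flag here only marks the entry for review. Some valid answers may be phrased without a dollar sign or scale word.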


In [3]:
## Get all question_text strings from JSON file
## Simple script to extract just the question texts

import json
from pathlib import Path

base_dir = Path.cwd().parent
gold_p3_path = base_dir / "data_cache" / "qa_manual_exports" / "goldp3_analysis" / "p3_gold_test_suite_31q.json"

with gold_p3_path.open("r", encoding="utf-8") as f:
    gold_p3_data = json.load(f)

question_texts = [rec["question_text"] for rec in gold_p3_data if "question_text" in rec]

for q in question_texts:
    print(f'"{q}",')


"What does EXXON MOBIL CORP report as its total revenue in 2008, and how is this figure described in the filing?",
"What does ELI LILLY & Co report as its net income in 2006, and how is this figure described in the filing?",
"What does Walmart Inc. report as its operating income in 2018, and how is this figure described in the filing?",
"What does JOHNSON & JOHNSON report as its cash flow from operations in 2016, and how is this figure described in the filing?",
"What does Apple Inc. report as its earnings per share in 2006, and how is this figure described in the filing?",
"How does MICROSOFT CORP describe the change in its Intelligent Cloud revenue in 2017, including both the direction and magnitude of the change?",
"What regulatory and compliance risks does GENWORTH FINANCIAL INC highlight in its 2019 Risk Factors section?",
"What regulatory and compliance risks does GENWORTH FINANCIAL INC highlight in its 2018 Risk Factors section?",
"What regulatory and compliance risks does JOHNS