## RAG Pipeline Modules Notes
- Config + wiring, Query → embedding + Supply1 / Supply2.
- Embedding → S3 Vectors search wrapper: search_s3_vectors(query_emb, top_k) -> pl.DataFrame
- Reranker stub: rerank(query, candidates_df) -> candidates_df_sorted.
- Evidence → answer synthesis
- Eval harness integration
- Gold set → RAG pipeline → Measurable metrics.

**Potential Ideas**:
1. Flow below or a better version.
    ```
    User question → embed → ANN search → 100 hits
            ↓
    Hybrid/BM25 filter (optional)
            ↓
    Cross-encoder reranker → 10 best passages
            ↓
    Context-window expansion ±N ?? ( Rethink. Contexts might vary. )
            ↓
    LLM prompt assembly
    ```
2. Possibly: config-driven crews, prompt versioning, auto-refinement, and central logging.

**Notes**:
- LLMs are extremely sensitive to the quality and diversity of retrieved context.
- Top-10 good chunks → coherent answer.
- Top-10 semantically-close-but-irrelevant → hallucination.

**Ideas for Sub Directory/Package designs**:
- RouterPipeline/ – the 1/0 decision maker and entity detection.
- AnalyticsPipeline/ – the “true” KPI / JSON layer you highlighted in yellow.
- RAGPipeline/ – embedding, S3 Vectors, similarity + re-rank.
- LLMPipeline/ – synthesis / explanation crews.


A) “Range queries (2020–2023) should be basic — can S3 Vectors do this?”
- Yes. Use numeric report_year and filter with $gte/$lte. Example: {"$and": [{"cik_int": 320193}, {"report_year": {"$gte": 2020, "$lte": 2023}}]}. 
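
A minimal sketch of building that filter from parsed values. The helper name `build_range_filter` is hypothetical; only the `$and` / `$gte` / `$lte` shape comes from the example above.

```python
from typing import Any, Optional


def build_range_filter(
    cik_int: int,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> dict[str, Any]:
    """Combine an exact company match with an inclusive report_year range."""
    clauses: list[dict[str, Any]] = [{"cik_int": cik_int}]

    year_bounds: dict[str, int] = {}
    if year_from is not None:
        year_bounds["$gte"] = year_from
    if year_to is not None:
        year_bounds["$lte"] = year_to
    if year_bounds:
        clauses.append({"report_year": year_bounds})

    # A single clause does not need an $and wrapper.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


print(build_range_filter(320193, 2020, 2023))
```

The result can be passed as the filter of a query; open-ended ranges work by leaving one bound as `None`.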

B) “Bring the most related sentences around this query, top-k around this query.”
- S3 Vectors returns the top-K hits. “Around this query” in the semantic sense is exactly what ANN already returns.
- “Around each hit” in the document-context sense (±N sentences, same paragraph/section) is our own post-processing; S3 Vectors does not perform it.

**About SOTA cross-encoders (CE), bi-encoders, and hybrid stacks**:

C) State-of-the-art and practical models
- Cross-encoders: ms-marco-MiniLM-L-6-v2, bge-reranker-large, monoT5-3B, E5-Reranker.
- Bi-encoders: bge-base, e5-large-v2.
- Hybrid stacks: hybrid_fusion(bm25, dense, crossencoder) for robustness.
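
The notes only name `hybrid_fusion(bm25, dense, crossencoder)` and do not say how the lists are fused. One common option is reciprocal rank fusion (RRF); the sketch below assumes RRF with the conventional constant `k=60` and leaves the cross-encoder as a later rerank over the fused top results. Function name and inputs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of sentence IDs; higher fused score is better."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, sentence_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[sentence_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["s3", "s1", "s2"]   # toy BM25 ranking
dense_hits = ["s1", "s2", "s4"]  # toy ANN ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print([sentence_id for sentence_id, _ in fused])
```

RRF only needs ranks, so BM25 scores and vector distances never have to be put on a common scale.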

## Recap on N4 operations and index.

- single, sentence-level vector index where each vector record corresponds to exactly one row from lean vector table.
- index granularity is determined entirely by what you PUT - there is no automatic aggregation or hierarchy creation by S3 Vectors.
- each vector has: a unique key (sentenceID), 1024-d embedding, and metadata (cik_int, report_year, section_code, etc.)

**Does S3 Vectors Create Internal Shards or Sub-Indexes?**
- system handles internal partitioning to scale, but you never manage or address these shards separately.
- AWS handles internal distribution across their infrastructure
- interact with one logical index via the API
- no user-visible "shards" or "secondary indexes" on metadata columns like report_year or section_code 

#### Prod level: how many logical indexes to create, and are they necessary?
- Global index: one index for all companies/years
- Company-specific: one index per CIK
- Hybrid: one index per (company, year) combination

### About indexing recommendations and strategies:
1. metadata filters to simulate local recall: narrow by cik_int, report_year, section_code.
2. current index as the “global” FinRAG index: narrow by cik_int, report_year, section_code, etc.
3. multiple indexes when you have a real need: sentence-level vs section-level embeddings.
4. S3 Vectors doesn’t solve rerank/context logic. Use it as the ANN + metadata filter engine.
- One S3 Vectors index is enough to support the “global vs local recall + rerank” design.
- The intuitive worry (“my open-regime hit rates are low, so I must need another index”) is not a reason to add one.
- Low scores are more about evaluation regime and chunk granularity than about “not enough indexes”.

### same index, different “lens” via filters and rerank
5. K and reranking strategy
   1. Local: k=50, strong focus; rerank within small candidate set.
   2. Global: k=200+, over-fetch; use reranker + windowing to find cross-company narrative.

### low hits on open-regime queries explained:
- low open-regime hit@k does not mean “we need more indexes.”
- it means: “global search over a large, heterogeneous space is hard; we need rerankers, filters, and careful eval definitions.”

### multi-level / (Sentence vs Section/Paragraph):
- Document → Section → Paragraph → Sentence.
- we store and embed only sentence-level units, and possibly do context reconstruction (±2 windows) after retrieval.
- RAG questions are analytical, narrative, risk-based, trend-based, cross-company
  
### When we need multi-level indexing and a dataset re-grain / reset:
- narrative spans 10–20+ sentences
- topic is clear but any single sentence is weak: embeddings of individual sentences may scatter in vector space
- need “semantic neighborhoods” at the section level: entire MD&A section, entire Risk Factor section
- cross-year narrative shifts across LARGE-SCALE chunks, docs, sections. 
- fine-grained atomic units can't capture discourse-level meaning, topic continuity, or section-level intention.




---

### Step plan maps like this:

1. Intent classification – dev coding
2. Entity extraction – dev coding
3. Query embedding – Cohere v4 via Bedrock does this.
    - Result: Send Supply1 (Analytical Metric Info from Metric tables), Supply2 (User Query as Cohere Embedding) & Extracted parameters

4. Metadata filter construction – dev coding, using S3 Vectors filter syntax
5. S3 Vectors ANN search – this is exactly the QueryVectors call on index, with filters
6. Context window expansion – dev coding, using the meta parquet / sentenceID neighbors
7. Fetch full text – dev coding, join on sentenceID
8. Hybrid reranking – dev coding, external reranker

9. Deduplication – dev coding - check on this, study.

10. Context assembly – dev coding
11. Prepare Config Prompt, LLM synthesis – dev coding + Bedrock Claude calls.
12. Evaluation harness integration – dev coding, on P3 Gold or P2 Gold.
13. MLflow or ELK integration – logging – dev coding.

---

### Study on BM25, Needs, Complexity - Planning it for Phase 2, Not now:

- local BM25 index is very feasible on a laptop with 16–32 GB RAM:
  - Index build time: on the order of seconds to a minute once, if you use a simple Python BM25 library and store just tokenized sentences.
  - RAM footprint: think a few hundred MB, not tens of GB, if you keep representation lean (IDs + postings + term frequencies).
  - Query latency: typically tens of milliseconds per query for 50–100 hits.

- Easiest route is a small Python library (rank_bm25, whoosh, or even elasticlunr-style).
    - Load sentenceID + sentence from local parquet cache.
    - Tokenize once.
    - Build a BM25 index object.
    - Serialize it to disk (pickle) so it doesn’t rebuild each run.
- Query side is trivial: bm25.get_top_n(query_tokens, sentences, n=70) plus mapping back to sentenceID.
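
To make the moving parts concrete without pulling in a dependency, here is a tiny pure-Python BM25 stand-in. It is not the rank_bm25 implementation; the class name `TinyBM25`, the tokenizer, and the `k1` / `b` defaults are assumptions for illustration. It keeps only IDs, term frequencies, and IDF values, which is the "lean representation" idea above.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    def __init__(self, sentences: dict[str, str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.ids = list(sentences)
        self.term_freqs = [Counter(tokenize(sentences[sid])) for sid in self.ids]
        self.lengths = [sum(tf.values()) for tf in self.term_freqs]
        # Fall back to 1.0 so an empty corpus cannot cause a division by zero.
        self.avg_len = (sum(self.lengths) / max(len(self.lengths), 1)) or 1.0
        doc_freq = Counter(term for tf in self.term_freqs for term in tf)
        n_docs = len(self.ids)
        # "Plus one" IDF variant: always positive, even for very common terms.
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def top_n(self, query: str, n: int = 70) -> list[tuple[str, float]]:
        """Return up to n (sentenceID, score) pairs, best first."""
        terms = tokenize(query)
        scored = []
        for sid, tf, length in zip(self.ids, self.term_freqs, self.lengths):
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            if score > 0:
                scored.append((sid, score))
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:n]


corpus = {  # toy sentences keyed by a made-up sentenceID
    "s1": "Revenue increased due to higher iPhone sales",
    "s2": "Liquidity risk may adversely affect operations",
    "s3": "Net sales and revenue grew",
}
index = TinyBM25(corpus)
print(index.top_n("liquidity risk", n=2))
```

Because results come back as sentenceIDs, they can be unioned with the S3 Vectors hits before reranking.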

### Minimal Enhancement Plans:

- Window expansion / Multi-hop: for each hit sentenceID, pull neighbors ±N and group into paragraphs/sections. That’s just Polars + sentenceID prefix logic. It massively improves context without new infra.
- Semantic query variants into S3 Vectors: cheap to implement:
  - Call LLM once to generate 2–4 paraphrases of the user question.
  - Embed each and query S3 Vectors with k=30 for each variant.
  - Union by sentenceID, then window-expand + rerank.

- “phase 2” bump: BM25 hybrid layer, Facet filters “generator” ?
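
The "union by sentenceID" step for semantic variants could look like the sketch below. It assumes each hit is a dict with `sentenceID` and `distance` keys and that a smaller distance means a closer match; `merge_variant_hits` is a hypothetical name.

```python
def merge_variant_hits(hit_lists: list[list[dict]]) -> list[dict]:
    """Union hits from several query variants, keeping the best distance per sentenceID."""
    best: dict[str, dict] = {}
    for variant_idx, hits in enumerate(hit_lists):
        for hit in hits:
            sid = hit["sentenceID"]
            current = best.get(sid)
            if current is None:
                # First time we see this sentence: copy it and start tracking variants.
                best[sid] = {**hit, "variants": [variant_idx]}
            else:
                current["variants"].append(variant_idx)
                if hit["distance"] < current["distance"]:
                    current["distance"] = hit["distance"]
    # Closest first; sentenceID breaks ties deterministically.
    return sorted(best.values(), key=lambda h: (h["distance"], h["sentenceID"]))


base_hits = [{"sentenceID": "a", "distance": 0.20}, {"sentenceID": "b", "distance": 0.35}]
paraphrase_hits = [{"sentenceID": "b", "distance": 0.30}, {"sentenceID": "c", "distance": 0.40}]
merged = merge_variant_hits([base_hits, paraphrase_hits])
print(merged)
```

The `variants` list doubles as a cheap agreement signal: a sentence found by several paraphrases is a stronger candidate for the reranker.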

### About routing:
- When should we trigger global tricks? (Routing).
   - not wise to run all “global recall” machinery for every query. You want a small routing layer that decides:
   - “This is a narrow, local question → just use filters + one S3 call.”
   - “This is a broad analytic question → fire semantic variants, possibly BM25.”

- Decision tree ideas:
  - Detect explicit CIK/year in the query or in UI parameters.
  - If query has company but no year → “broad within company”.
  - If query has neither company nor year → global analytic.
  - small “mode selector” in your query module: mode = local | company_broad | global. (just basic ideas for now.)

- Window expansion (multi-hop): not optional if we care about narrative quality.
- Without windowing, the LLM sees disjoint, contextless fragments. Even if we use semantic variants, we still need window expansion to build coherent context.
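
A very small version of the mode selector from the decision tree above. The notes do not say what to do with a year but no company; this sketch routes that case to `global` as an assumption. `select_mode` is a hypothetical name.

```python
from typing import Sequence


def select_mode(ciks: Sequence[int], years: Sequence[int]) -> str:
    """Map detected entities to a retrieval mode: local | company_broad | global."""
    if ciks and years:
        return "local"          # explicit company + year: filters + one S3 call
    if ciks:
        return "company_broad"  # company but no year: broad within company
    # No company detected (with or without a year): treat as a global analytic query.
    return "global"


print(select_mode([320193], [2021]))
print(select_mode([320193], []))
print(select_mode([], []))
```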

### First concrete plans:

- Implement window expansion cleanly and make it a reusable helper: given a set of sentenceIDs, return expanded, deduped, grouped contexts.
  
- Implement semantic query variants → S3 Vectors:
    - N = 2–3 variants per query.
    - Each variant uses the same filter logic you already have.
    - Merge candidates, window-expand, rerank.
- Add a tiny mode switch:
    - mode="local" → no variants, strong filters.
    - mode="global" → variants, broader filters.
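
A sketch of the reusable window-expansion helper described in the first bullet. The real sentenceID layout is not spelled out in these notes, so the sketch assumes a lookup from sentenceID to `(doc_key, position)` built from the meta parquet; `expand_windows` and the toy IDs are hypothetical.

```python
from collections import defaultdict


def expand_windows(
    hit_ids: list[str],
    positions: dict[str, tuple[str, int]],
    n: int = 2,
) -> list[list[str]]:
    """Expand each hit by ±n neighbors, dedupe, and group contiguous runs per document."""
    by_position = {pos: sid for sid, pos in positions.items()}

    wanted: dict[str, set[int]] = defaultdict(set)
    for sid in hit_ids:
        doc_key, idx = positions[sid]
        for neighbor in range(idx - n, idx + n + 1):
            if (doc_key, neighbor) in by_position:  # skip positions past document edges
                wanted[doc_key].add(neighbor)

    groups: list[list[str]] = []
    for doc_key in sorted(wanted):
        run: list[int] = []
        for idx in sorted(wanted[doc_key]):
            if run and idx != run[-1] + 1:
                # Gap found: close the current contiguous run.
                groups.append([by_position[(doc_key, i)] for i in run])
                run = []
            run.append(idx)
        groups.append([by_position[(doc_key, i)] for i in run])
    return groups


# Toy layout: document d1 has 10 sentences, d2 has 2.
positions = {f"d1_{i}": ("d1", i) for i in range(10)}
positions.update({"d2_0": ("d2", 0), "d2_1": ("d2", 1)})
contexts = expand_windows(["d1_2", "d1_3", "d1_8", "d2_0"], positions, n=1)
print(contexts)
```

Overlapping windows merge automatically because neighbors are collected into a set before grouping.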

### Big pic analysis on Vishak's Code:
- metric_pipeline/ → query → [ticker, year, metrics] → look up numeric values in a precomputed JSON table → return structured results and a formatted string.
- synthesis_pipeline/ → orchestrator coordinates:
  - metric pipeline (if available), Cohere-via-Bedrock query embedding, vector search (currently mocked), Claude-via-Bedrock answer generation.

### Metric_pipeline.py:
- Decides if a query is “numeric metric-style”.
- Extracts ticker, year, metric names (with fuzzy typo handling).
- Looks them up in a metrics JSON table.
- Returns structured results and/or a formatted English string.

**COMPANY_TO_TICKER:**
- Lowercase company names → ticker codes, e.g. "nvidia": "NVDA", "apple": "AAPL".

**METRIC_MAPPINGS:**
- Natural-language phrases → canonical metric codes
  - (e.g. "income_stmt_Revenue", "balance_sheet_Total Assets", "Return on Assets (ROA) %").
- "revenue", "sales", "total revenue", "total sales" → income_stmt_Revenue.
- "net income", "profit", "earnings", "bottom line" → income_stmt_Net Income.
- Asset, liability, equity, cash flow, gross profit, operating expenses, COGS, interest expense, tax, ROA, margins.

**METRIC_KEYWORDS:**
- A smaller set of generic terms: revenue, income, profit, loss, assets, liabilities, equity, debt, cash flow, ....

**QUANTITATIVE_INDICATORS:**
- Phrases like "how much", "what is", "total", "amount", used to detect “how much is X” style queries.

### FilterExtractor – turning text into filters:
- `src/filter_extractor.py` uses COMPANY_TO_TICKER and METRIC_MAPPINGS, plus a built-in fuzzy matcher.
- simple_fuzzy_match(word, choices, threshold=0.8): no external fuzzywuzzy dependency; just internal edit distance.
- FilterExtractor.extract(query: str) -> Dict:  
  - ` { "ticker": ..., "year": ..., "metrics": [...], "query": original_query, "confidence": 0.0–1.0 }`
- _extract_ticker(query): tries to find any 2–5 letter token and treat it as a ticker (uppercasing it), filtering out obvious English words (IN, IT, ARE, THE, etc.).
- _extract_year(query): Regex for 4-digit years in 1900–2030, returns int or None.
- _extract_metrics(query): lowercases the query and sorts METRIC_MAPPINGS by key length so that longer phrases match first.
- Confidence: simply counts how many of [ticker, year, metrics] were found, divides by 3.
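
The pieces above are small enough to sketch. This is a reconstruction from the description, not the actual code in `src/filter_extractor.py`: the similarity formula (1 − distance / longest length) and the function bodies are assumptions; only the names, the `threshold=0.8` default, the 1900–2030 year range, and the divide-by-3 confidence come from the notes.

```python
import re
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling row (no external dependency)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simple_fuzzy_match(word: str, choices: Sequence[str], threshold: float = 0.8) -> Optional[str]:
    """Return the most similar choice if its similarity reaches the threshold."""
    best_choice, best_score = None, 0.0
    for choice in choices:
        longest = max(len(word), len(choice)) or 1
        score = 1.0 - edit_distance(word.lower(), choice.lower()) / longest
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice if best_score >= threshold else None


def extract_year(query: str) -> Optional[int]:
    """First standalone 4-digit year in 1900-2030, else None."""
    match = re.search(r"\b(19\d{2}|20[0-2]\d|2030)\b", query)
    return int(match.group(1)) if match else None


def confidence(ticker: Optional[str], year: Optional[int], metrics: Sequence[str]) -> float:
    """Fraction of [ticker, year, metrics] that were found."""
    found = sum([ticker is not None, year is not None, bool(metrics)])
    return found / 3


print(simple_fuzzy_match("revenu", ["revenue", "net income"]))
print(extract_year("Apple revenue in 2021"))
```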


### Analysis Takeaways:
- externalize domain knowledge into metric_mappings.py instead of hard-coding strings all over.
- implement fuzzy matching with a clean, local Levenshtein distance so you don’t drag in extra dependencies.
- have a clear, narrow interface: FilterExtractor.extract(query) and MetricPipeline.process(query); the latter returns a dict with a data field that you can pass directly into the LLM side.
- 'needs_metric_layer' is, in practice, a very usable first-pass intent classifier for “should I even bother hitting numeric tables”.

## Extracting full constants or better keys:

#### p3_candidates_kpi.json → numeric / KPI / MD&A metric language
#### p3_candidates_risk.json → risk / uncertainty / regulatory / liquidity language

Aiming for two upgraded vocabularies:
- Metric vocabulary / mappings:
  - richer set of natural phrases to map into METRIC_MAPPINGS and METRIC_KEYWORDS.
- Section / scope vocabulary:
  - specific sections: MD&A, Risk Factors, Liquidity & Capital Resources, Notes, etc.,
  - risk vs MD&A vs Business vs Footnotes cues in the sentence text, not just in your hard section_name field.
  - for FX/supply chain/COVID/defs: goldp3_v5_def_verify_candidates.json (to extend into more exotic cues).

## If you see:
- a quantitative phrase, plus at least one metric keyword, plus a valid ticker / company from dim_companies, plus a year or range,
- then there is high confidence that you must invoke the numeric KPI layer (Supply1) and not just RAG.
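
As a rough sketch, the rule above is a four-way AND. The constants below are just the examples listed earlier in these notes, not the full tables, and `needs_metric_layer` here is an illustrative reimplementation rather than the existing function. Matching is plain substring search, so short keywords like "loss" can over-trigger.

```python
from typing import Sequence

# Sample values taken from the notes; the real tables are larger.
QUANTITATIVE_INDICATORS = ("how much", "what is", "total", "amount")
METRIC_KEYWORDS = (
    "revenue", "income", "profit", "loss", "assets",
    "liabilities", "equity", "debt", "cash flow",
)


def needs_metric_layer(query: str, tickers: Sequence[str], years: Sequence[int]) -> bool:
    """True only when all four signals are present."""
    text = query.lower()
    has_quantitative = any(phrase in text for phrase in QUANTITATIVE_INDICATORS)
    has_metric = any(keyword in text for keyword in METRIC_KEYWORDS)
    return has_quantitative and has_metric and bool(tickers) and bool(years)


print(needs_metric_layer("How much revenue did Apple report in 2021?", ["AAPL"], [2021]))
print(needs_metric_layer("Describe Apple's supply chain risks", ["AAPL"], []))
```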


In [3]:
import polars as pl
from pathlib import Path

# ---------------------------------------------------------
# Setup: paths
# ---------------------------------------------------------
project_root = Path.cwd().parent  # assuming notebook is in .../finrag_ml_tg1/notebooks/*
export_root   = project_root / "data_cache" / "analysis_exports" / "goldp3_views"

kpi_path      = export_root / "p3_candidates_kpi.json"
risk_path     = export_root / "p3_candidates_risk.json"

# ModelPipeline\finrag_ml_tg1\data_cache\dimensions\finrag_dim_sec_sections.parquet
sections_path = project_root / "data_cache" / "dimensions" / "finrag_dim_sec_sections.parquet"


# ---------------------------------------------------------
# Helper: tokenize sentences into lowercase tokens
# - strips non-alphanumerics to space
# - splits on space
# - filters out very short tokens (len <= 2)
# ---------------------------------------------------------
def tokenize_column(df: pl.DataFrame, text_col: str, out_col: str = "token") -> pl.DataFrame:
    return (
        df
        .with_columns(
            pl.col(text_col)
              .str.to_lowercase()
              .str.replace_all(r"[^a-z0-9]+", " ")
              .str.strip_chars()
              .alias("_norm_text")
        )
        .with_columns(
            pl.col("_norm_text").str.split(" ").alias("_tokens")
        )
        # FIX: avoid selecting _tokens twice
        .select(
            pl.all().exclude(["_norm_text", "_tokens"]),
            pl.col("_tokens"),
        )
        .explode("_tokens")
        .rename({"_tokens": out_col})
        # len_chars() > 2 also drops empty strings, so no separate != "" filter is needed
        .filter(pl.col(out_col).str.len_chars() > 2)
    )


# =========================================================
# 1) KPI candidates: distinct tokens + counts
# =========================================================
df_kpi = pl.read_json(kpi_path)

# tokenize based on sentence_text
df_kpi_tokens = tokenize_column(df_kpi, text_col="sentence_text", out_col="token")

# overall distinct KPI tokens with counts
df_kpi_token_counts = (
    df_kpi_tokens
    .group_by("token")
    .agg([
        pl.len().alias("freq"),
        pl.n_unique("sentenceID").alias("num_sentences"),
        pl.n_unique("kpi_label").alias("num_kpi_labels"),
    ])
    .sort("freq", descending=True)
)

# optional: save to JSON for inspection
kpi_tokens_out = export_root / "analysis_keywords_kpi_tokens.json"
df_kpi_token_counts.write_json(kpi_tokens_out)
print(f"[KPI tokens] rows={df_kpi_token_counts.height} -> {kpi_tokens_out}")

# per-KPI label token counts (words used around each KPI label)
df_kpi_by_label = (
    df_kpi_tokens
    .group_by(["kpi_label", "token"])
    .agg([
        pl.len().alias("freq"),
        pl.n_unique("sentenceID").alias("num_sentences"),
    ])
    .sort(["kpi_label", "freq"], descending=[False, True])
)

kpi_tokens_by_label_out = export_root / "analysis_keywords_kpi_by_label.json"
df_kpi_by_label.write_json(kpi_tokens_by_label_out)
print(f"[KPI tokens by label] rows={df_kpi_by_label.height} -> {kpi_tokens_by_label_out}")


# =========================================================
# 2) Risk candidates: distinct tokens + counts
# =========================================================
df_risk = pl.read_json(risk_path)

df_risk_tokens = tokenize_column(df_risk, text_col="sentence_text", out_col="token")

df_risk_token_counts = (
    df_risk_tokens
    .group_by("token")
    .agg([
        pl.len().alias("freq"),
        pl.n_unique("sentenceID").alias("num_sentences"),
        pl.n_unique("risk_topic").alias("num_risk_topics"),
    ])
    .sort("freq", descending=True)
)

risk_tokens_out = export_root / "analysis_keywords_risk_tokens.json"
df_risk_token_counts.write_json(risk_tokens_out)
print(f"[Risk tokens] rows={df_risk_token_counts.height} -> {risk_tokens_out}")

# per-risk-topic tokens
df_risk_by_topic = (
    df_risk_tokens
    .group_by(["risk_topic", "token"])
    .agg([
        pl.len().alias("freq"),
        pl.n_unique("sentenceID").alias("num_sentences"),
    ])
    .sort(["risk_topic", "freq"], descending=[False, True])
)

risk_tokens_by_topic_out = export_root / "analysis_keywords_risk_by_topic.json"
df_risk_by_topic.write_json(risk_tokens_by_topic_out)
print(f"[Risk tokens by topic] rows={df_risk_by_topic.height} -> {risk_tokens_by_topic_out}")


# =========================================================
# 3) Section-dimension keywords:
#    - from section_name + section_description
# =========================================================

import json

# If your sections dimension is stored as PARQUET (as per your debug path):
sections_path = project_root / "data_cache" / "dimensions" / "finrag_dim_sec_sections.parquet"

print(f"[DEBUG] Sections path: {sections_path}")

# Read depending on extension
if sections_path.suffix == ".parquet":
    df_sections = pl.read_parquet(sections_path)
elif sections_path.suffix == ".json":
    with open(sections_path, "r", encoding="utf-8") as f:
        sections_raw = json.load(f)  # JSON array -> list[dict]
    df_sections = pl.DataFrame(sections_raw)
else:
    raise ValueError(f"Unsupported sections file type: {sections_path}")

# Basic schema sanity check (optional but helpful once)
print("[DEBUG] Sections schema:", df_sections.schema)

# unify text: section_name + section_description
df_sec_text = df_sections.with_columns(
    (pl.col("section_name") + pl.lit(" ") + pl.col("section_description"))
      .str.to_lowercase()
      .alias("sec_text")
)

# tokenize
df_sec_tokens = tokenize_column(df_sec_text, text_col="sec_text", out_col="token")

# distinct tokens per section_code (for token -> section mapping)
df_sec_token_by_section = (
    df_sec_tokens
    .group_by(["section_code", "sec_item_canonical", "section_category", "token"])
    .agg([
        pl.len().alias("freq"),
    ])
    .sort(["section_code", "freq"], descending=[False, True])
)

sec_tokens_by_section_out = export_root / "analysis_keywords_section_by_section.json"
df_sec_token_by_section.write_json(sec_tokens_by_section_out)
print(f"[Section tokens by section] rows={df_sec_token_by_section.height} -> {sec_tokens_by_section_out}")

# global distinct section tokens
df_sec_token_global = (
    df_sec_tokens
    .group_by("token")
    .agg([
        pl.n_unique("section_code").alias("num_sections"),
        pl.n_unique("section_category").alias("num_categories"),
    ])
    .sort("num_sections", descending=True)
)

sec_tokens_global_out = export_root / "analysis_keywords_section_global.json"
df_sec_token_global.write_json(sec_tokens_global_out)
print(f"[Section tokens global] rows={df_sec_token_global.height} -> {sec_tokens_global_out}")


[KPI tokens] rows=6426 -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\analysis_keywords_kpi_tokens.json
[KPI tokens by label] rows=12258 -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\analysis_keywords_kpi_by_label.json
[Risk tokens] rows=6639 -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\analysis_keywords_risk_tokens.json
[Risk tokens by topic] rows=19162 -> d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\analysis_exports\goldp3_views\analysis_keywords_risk_by_topic.json
[DEBUG] Sections path: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\dimensions\finrag_dim_sec_sections.parquet
[DEBUG] Sec

In [None]:
"""

PART_III_ITEM_10 : ['governance', 'officers', 'directors', 'executive', 'structure', 'board', 'corporate', 'item', 'and']
PART_III_ITEM_11 : ['executive', 'compensation', 'analysis', 'pay', 'grants', 'discussion', 'and', 'equity', 'item']
PART_III_ITEM_12 : ['ownership', 'management', 'security', 'share', 'plans', 'equity', 'major', 'owners', 'directors', 'certain', 'and', 'item', 'beneficial', 'officers', 'shareholders']
PART_III_ITEM_13 : ['related', 'transactions', 'and', 'relationships', 'item', 'certain', 'party', 'independence', 'director']
PART_III_ITEM_14 : ['fees', 'and', 'audit', 'services', 'auditor', 'accountant', 'principal', 'non', 'item']
PART_II_ITEM_5 : ['stock', 'market', 'repurchases', 'item', 'dividends', 'equity', 'for', 'registrant', 'performance', 'share', 'data', 'common']
PART_II_ITEM_6 : ['financial', 'deprecated', 'selected', 'filers', '2020', 'item', 'after', 'reserved', 'data', 'summary', 'multi', 'year', 'for', 'smaller']
PART_II_ITEM_7 : ['analysis', 'trends', 'results', 'critical', 'item', 'outlook', 'revenue', 'operating', 'discussion', 'capital', 'liquidity', 'management', 'resources']
PART_II_ITEM_7A : ['risk', 'disclosures', 'item', 'about', 'interest', 'qualitative', 'foreign', 'commodity', 'currency', 'rate', 'exposures', 'market', 'and', 'quantitative']
PART_II_ITEM_8 : ['statements', 'financial', 'cash', 'notes', 'data', 'income', 'statement', 'balance', 'critical', 'supplementary', 'audited', 'flow', 'and', 'sheet', 'item']
PART_II_ITEM_9 : ['disagreements', 'with', 'auditors', 'changes', 'item', 'accountants', 'and', 'accounting', 'none', 'usually', 'rare']
PART_II_ITEM_9A : ['controls', 'internal', 'item', 'effectiveness', 'and', 'sox', 'icfr', '404', 'disclosures', 'procedures']
PART_II_ITEM_9B : ['information', 'item', 'other', 'disclosed', 'not', 'previously', 'material']
PART_IV_ITEM_15 : ['financial', 'exhibits', 'schedules', 'list', 'exhibit', 'index', 'statement', 'item', 'and']
PART_IV_ITEM_16 : ['summary', 'item', 'form', 'used', 'optional', 'rarely']
PART_I_ITEM_1 : ['business', 'item', 'competition', 'services', 'segments', 'products', 'description', 'strategy', 'market']
PART_I_ITEM_1A : ['factors', 'risk', 'risks', 'statement', 'item', 'looking', 'uncertainties', 'forward']
PART_I_ITEM_1B : ['staff', 'comments', 'none', 'unresolved', 'outstanding', 'sec', 'item', 'typically']
PART_I_ITEM_2 : ['properties', 'holdings', 'item', 'physical', 'facilities', 'real', 'estate']
PART_I_ITEM_3 : ['legal', 'proceedings', 'item', 'litigation', 'material', 'matters']
PART_I_ITEM_4 : ['safety', 'mine', 'companies', 'disclosures', 'only', 'mining', 'statistics', 'item', 'for']


=== topic liquidity_credit ===
['liquidity', 'financial', 'cash', 'results', 'default', 'operations', 'have', 'other', 'capital', 'company', 'credit', 'business', 'which', 'debt', 'condition', 'risk', 'adversely', 'adverse', 'insurance', 'including', 'any', 'flows', 'ability', 'obligations', 'affect', 'result', 'material', 'mbia', 'mortgage', 'effect', 'risks', 'additional', 'market', 'terms', 'investment', 'investments', 'subsidiaries', 'future', 'interest', 'impact']

=== topic regulatory ===
['regulatory', 'other', 'business', 'including', 'which', 'products', 'regulation', 'have', 'requirements', 'financial', 'subject', 'data', 'operations', 'result', 'changes', 'legal', 'insurance', 'laws', 'any', 'regulations', 'compliance', 'company', 'adverse', 'new', 'costs', 'actions', 'results', 'ability', 'risks', 'services', 'impact', 'related', 'litigation', 'affect', 'adversely', 'risk', 'practices', 'additional', 'government', 'capital']

=== topic market_competitive ===
['volatility', 'price', 'market', 'business', 'other', 'which', 'stock', 'have', 'financial', 'including', 'products', 'markets', 'company', 'economic', 'competitors', 'securities', 'impact', 'future', 'affect', 'result', 'factors', 'adverse', 'results', 'credit', 'risk', 'competition', 'litigation', 'customers', 'global', 'changes', 'experience', 'conditions', 'adversely', 'has', 'any', 'operations', 'significant', 'risks', 'insurance', 'ability']

=== topic operational_supply_chain ===
['operations', 'business', 'results', 'have', 'financial', 'adverse', 'condition', 'material', 'effect', 'which', 'other', 'impact', 'any', 'disruption', 'including', 'result', 'adversely', 'risk', 'affect', 'company', 'systems', 'products', 'insurance', 'future', 'risks', 'security', 'significant', 'litigation', 'services', 'loss', 'factors', 'pandemic', 'costs', 'information', 'mortgage', 'failure', 'materially', 'ability', 'events', 'operating']

=== topic cybersecurity_tech ===
['security', 'data', 'information', 'cyber', 'other', 'attacks', 'systems', 'access', 'have', 'third', 'unauthorized', 'breaches', 'including', 'threats', 'measures', 'breach', 'business', 'which', 'parties', 'service', 'use', 'networks', 'party', 'continue', 'customers', 'user', 'subject', 'products', 'incidents', 'technology', 'risks', 'services', 'result', 'financial', 'cybersecurity', 'failures', 'detect', 'significant', 'privacy', 'any']

=== topic legal_ip_litigation ===
['litigation', 'claims', 'other', 'have', 'business', 'which', 'products', 'property', 'result', 'intellectual', 'risk', 'company', 'adverse', 'rights', 'certain', 'subject', 'financial', 'any', 'related', 'significant', 'costs', 'adversely', 'against', 'impact', 'legal', 'including', 'future', 'liability', 'management', 'class', 'insurance', 'investigations', 'stock', 'security', 'proceedings', 'also', 'product', 'material', 'data', 'there']

=== topic general_risk ===
['risk', 'adverse', 'have', 'security', 'other', 'business', 'which', 'company', 'financial', 'products', 'insurance', 'including', 'investment', 'factors', 'data', 'impact', 'credit', 'mortgage', 'result', 'any', 'capital', 'effect', 'information', 'material', 'significant', 'services', 'changes', 'funds', 'risks', 'tax', 'results', 'subject', 'ability', 'systems', 'pandemic', 'certain', 'new', 'also', 'future', 'economic']

"""

### Three axes of filtering / extraction:
- Section axis – “Where in the 10-K?” (ITEM_1, ITEM_1A, ITEM_7, ITEM_8, …)
- Metric axis – “Which KPI / numeric concept?” (revenue, net income, cash from ops, debt, etc.)
- Topic axis – “What semantic theme?” (liquidity risk, regulatory risk, FX risk, supply chain disruption, etc.)
- (The risk-topic keywords live on axis 3. They’re useful even if you don’t use them as hard filters, and they’re not “competing” with section keywords.)
  
### Under the hood:
- Use analysis_keywords_kpi_by_label.json to ground revenue / net income / operating income / cash-from-ops / EPS synonyms in what actually appears in filings.
- Use analysis_keywords_section_by_section.json and the dim-sections table to decide which natural phrases should map to PART_I_ITEM_1, PART_II_ITEM_7, PART_II_ITEM_8, etc.
- Use analysis_keywords_risk_by_topic.json to build a clean RISK_TOPIC_KEYWORDS map keyed by your topic labels (liquidity_credit, regulatory, …).
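
One possible way to get from the per-topic token counts to a clean RISK_TOPIC_KEYWORDS map: drop tokens that show up under many topics (e.g. "business", "have", "which" in the dump above) and keep the most frequent remaining tokens per topic. The thresholds `max_topics` and `top_n` and the toy rows are assumptions; the row shape mirrors the `risk_topic` / `token` / `freq` columns written by the notebook.

```python
from collections import defaultdict


def build_risk_topic_keywords(
    rows: list[dict], max_topics: int = 2, top_n: int = 15
) -> dict[str, list[str]]:
    """Keep, per topic, the most frequent tokens that occur in at most max_topics topics."""
    topics_per_token: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        topics_per_token[row["token"]].add(row["risk_topic"])

    per_topic: dict[str, list[str]] = defaultdict(list)
    # Walk rows from most to least frequent so each topic list is already ordered.
    for row in sorted(rows, key=lambda r: (-r["freq"], r["token"])):
        if len(topics_per_token[row["token"]]) <= max_topics:
            per_topic[row["risk_topic"]].append(row["token"])
    return {topic: tokens[:top_n] for topic, tokens in per_topic.items()}


toy_rows = [  # made-up frequencies, for illustration only
    {"risk_topic": "cybersecurity_tech", "token": "cyber", "freq": 40},
    {"risk_topic": "cybersecurity_tech", "token": "business", "freq": 90},
    {"risk_topic": "regulatory", "token": "business", "freq": 80},
    {"risk_topic": "liquidity_credit", "token": "business", "freq": 70},
    {"risk_topic": "regulatory", "token": "compliance", "freq": 30},
    {"risk_topic": "liquidity_credit", "token": "liquidity", "freq": 60},
]
RISK_TOPIC_KEYWORDS = build_risk_topic_keywords(toy_rows)
print(RISK_TOPIC_KEYWORDS)
```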

### Even if the initial v1 filter only uses section_name and company/year, have multiple hooks:
- Risk language is not confined to ITEM 1A
- Companies often talk about “liquidity risk”, “market risk”, “regulatory risk” in MD&A, 7A, or even notes.
- Risk-topic keywords can fire even in ITEM_7 or ITEM_7A, not just ITEM_1A.
- An extra semantic axis is needed for section filters to be usable: section alone (ITEM_1A) isn’t enough, because it’s all risk, but which risk?

### MD&A vs Risk Factors:
- MD&A is a free-form narrative about operations, trends, liquidity, capital resources, strategic commentary, expenses, revenue drivers, restructuring, acquisitions, product updates, macro commentary, guidance, and infinitely more.
- MD&A ≠ a small, enumerable ontology.
- Risk ≈ a taxonomizable, clusterable ontology.
- Risk-topic keywords matter because “risk” is the single section of a 10-K where semantic themes are highly interpretable, highly clustered, and highly stable across companies and across years.
- MD&A and other sections are not like that — they are semantically broad, multi-topic, and do not benefit from manual keyword catalogs.

## entity_adapter/

1. models.py – shared dataclasses / plain structs describing the outputs.
2. *_universe.py – loaders that talk to dim / parquet / JSON tables and build in-memory lookup structures.
3. *_extractor.py – pure logic: “given a query string and a universe, return lists of entities”.
4. entity_adapter.py – one high-level class that orchestrates the four extractors and produces a single, clean “parsed query” object.

#### Conceptually, QueryEntities (in models.py) would hold:

- ciks: list of ints (primary thing S3 will use)
- tickers: list of strings
- company_names: list of canonical company names
- years: list of ints (unique, unsorted or sorted)
- sections: list of standard section codes (e.g. PART_I_ITEM_1A, PART_II_ITEM_7, not “md&a”)
- metrics: list of canonical metric IDs from metric_mapping_v2 / metrics dim
- (later) risk_topics, mdna_topics if we ever use them
- maybe a debug / raw field to stash intermediate info
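
A possible shape for this struct as a dataclass. The field names follow the list above; the `is_empty` helper is a hypothetical convenience, not something the notes require.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class QueryEntities:
    """Parsed entity-level view of one user query."""

    ciks: list[int] = field(default_factory=list)
    tickers: list[str] = field(default_factory=list)
    company_names: list[str] = field(default_factory=list)
    years: list[int] = field(default_factory=list)
    sections: list[str] = field(default_factory=list)   # canonical codes only
    metrics: list[str] = field(default_factory=list)    # canonical metric IDs
    risk_topics: list[str] = field(default_factory=list)  # reserved for later use
    debug: dict[str, Any] = field(default_factory=dict)   # intermediate extractor info

    def is_empty(self) -> bool:
        """True when no extractor found anything usable for filtering."""
        return not any(
            [self.ciks, self.tickers, self.company_names,
             self.years, self.sections, self.metrics, self.risk_topics]
        )


entities = QueryEntities(ciks=[320193], tickers=["AAPL"], years=[2021])
print(entities)
```

Using `default_factory` keeps every instance's lists independent, so extractors can append without sharing state.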

#### Central abstraction to keep in mind: “parsed query entities” should hold all entity-level information needed for downstream filters.

#### Final Flow:
- Create a CompanyUniverse (load dim file).
- Instantiate CompanyExtractor, YearExtractor, SectionExtractor, MetricExtractorAdapter.
- Expose a method along the lines of:
  - “given a query string, run all extractors, combine outputs, produce a QueryEntities object”.

#### Extractor details: CompanyExtractor:
- dynamic alias index,
- exact alias matching (apple → Apple Inc.),
- conservative fuzzy alias matching (microsft → MICROSOFT CORP),
- handling of possessives like "nvidia's", "apple's".
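
A hedged sketch of token-level alias matching with these behaviors. It uses `difflib.get_close_matches` from the standard library for the fuzzy step; the `cutoff=0.85` value, the minimum length of 5 for fuzzy matching, and the function names are assumptions chosen to stay conservative.

```python
import difflib
import re
from typing import Optional


def normalize_token(token: str) -> str:
    """Lowercase, trim punctuation, and drop a trailing possessive ('s or ’s)."""
    token = token.lower().strip(".,;:!?\"()")
    return re.sub(r"['’]s$", "", token)


def match_company(token: str, alias_to_name: dict[str, str], cutoff: float = 0.85) -> Optional[str]:
    """Exact alias match first, then a conservative fuzzy match for longer tokens."""
    key = normalize_token(token)
    if key in alias_to_name:
        return alias_to_name[key]
    if len(key) < 5:
        # Short tokens are too ambiguous for fuzzy matching.
        return None
    close = difflib.get_close_matches(key, list(alias_to_name), n=1, cutoff=cutoff)
    return alias_to_name[close[0]] if close else None


aliases = {"apple": "Apple Inc.", "microsoft": "MICROSOFT CORP"}
print(match_company("apple's", aliases))
print(match_company("microsft", aliases))
print(match_company("maple", aliases))
```

In a real CompanyExtractor the alias index would be built from the company dimension rather than a literal dict.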

#### Section Intelligence:

- Embeddings + ANN already solve 70–80% of “section intelligence”
- Embedding vectors already contain: semantics, sentence meaning, section tone, linguistic features.
- RAG systems rely on vector retrieval to pick the right sections implicitly.

#### User side filter triggers:
- They return the correct canonical filter value → ITEM_7, ITEM_1A, etc.
- They accept many NL variations → “Item 7”, “ITEM-7”, “item_7”, “7A”, “7-A”, etc.
- They allow fuzzy phrasing → “management discussion”, “ops results”, “liquidity resources”.
- They map all SECTION_KEYWORDS → sec_item_canonical, not section_code.

#### Patterns to cover:
- PATTERNS: item 7 item-7 item_7 item7 "7" "7A" "item 7a" "item7A" "ITEM 7A"
```
| Input         | Group | Canonical |
| ------------- | ----- | --------- |
| “item 7”      | 7     | ITEM_7    |
| “item-7”      | 7     | ITEM_7    |
| “item 7A”     | 7A    | ITEM_7A   |
| “item-1a”     | 1A    | ITEM_1A   |
| “item 8”      | 8     | ITEM_8    |
| “7A”          | 7A    | ITEM_7A   |
| “see item 12” | 12    | ITEM_12   |
| “ITEM_13”     | 13    | ITEM_13   |
```
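
A regex sketch covering the table rows. Two choices here are assumptions: a bare number without "item" is only accepted when it carries a letter suffix (so "7A" matches but a lone "7" does not, since it is too easy to confuse with counts), and matches are validated against the item numbers that appear in the section codes listed earlier.

```python
import re

# Item numbers present in the section dimension dump above.
KNOWN_ITEMS = {
    "1", "1A", "1B", "2", "3", "4", "5", "6", "7", "7A", "8",
    "9", "9A", "9B", "10", "11", "12", "13", "14", "15", "16",
}

# Either "item" + optional separator + number, or a bare number with a letter suffix.
ITEM_PATTERN = re.compile(
    r"\bitem[\s_\-]*(\d{1,2}[a-c]?)\b|\b(\d{1,2}[a-c])\b",
    re.IGNORECASE,
)


def extract_items(query: str) -> list[str]:
    """Return canonical ITEM_* codes in order of first appearance."""
    found: list[str] = []
    for match in ITEM_PATTERN.finditer(query):
        group = (match.group(1) or match.group(2)).upper()
        canonical = f"ITEM_{group}"
        if group in KNOWN_ITEMS and canonical not in found:
            found.append(canonical)
    return found


for text in ["item 7", "item-1a", "7A", "see item 12", "ITEM_13"]:
    print(text, "->", extract_items(text))
```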


### Plan:
- Step 1: keyword phase : Pull all ITEM_* from SECTION_KEYWORDS_v2
- Step 2: regex phase : Extract any explicit item mention → ITEM_#
- Step 3: semantic signals : If risk topic is present → add ITEM_1A
    - If financial metrics present → add ITEM_7 and ITEM_8
    - If cash flow metrics → add ITEM_7 and ITEM_8
    - If query metric is only revenue → either ITEM_7 OR rely solely on ANN (both OK)
- Step 4: default : If empty → use ITEM_7
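
The four steps can be combined in a few lines. This sketch assumes the keyword and regex phases have already produced lists of canonical items and that the semantic signals arrive as booleans; `resolve_sections` is a hypothetical name, and the revenue-only special case is left out.

```python
from typing import Sequence


def resolve_sections(
    keyword_items: Sequence[str],
    regex_items: Sequence[str],
    has_risk_topic: bool = False,
    has_financial_metric: bool = False,
    default: str = "ITEM_7",
) -> list[str]:
    """Merge section signals in plan order, deduplicated, with a default fallback."""
    sections: list[str] = []

    def add(item: str) -> None:
        if item not in sections:
            sections.append(item)

    for item in [*keyword_items, *regex_items]:  # steps 1 and 2
        add(item)
    if has_risk_topic:                           # step 3: risk topic present
        add("ITEM_1A")
    if has_financial_metric:                     # step 3: financial / cash flow metrics
        add("ITEM_7")
        add("ITEM_8")
    if not sections:                             # step 4: default
        add(default)
    return sections


print(resolve_sections([], ["ITEM_7A"], has_risk_topic=True))
```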

---

## STEP 4-12+ design ideas, Retrieval Spine Designs:

- design the whole retrieval spine now, with concrete decisions about:
  - local vs “global” retrieval, how semantic variants fit in, what S3 hits look like for us, 
  - and how that flows into windowing, full text, rerank, dedup, assembly.

#### Ground facts
- Index is 1024-d Cohere v4 (“bedrock_cohere_v4_1024d_…”) and S3 Vectors index finrag-sentence-fact-embed-1024d has dimension = 1024.
- All doc embeddings were created with output_dimension=1024.
- Query side is now also Cohere v4 via Bedrock, explicitly forcing output_dimension = cfg.dimensions = 1024, so we are geometrically aligned.
- S3 Vectors QueryVectors response (from AWS docs):
```
{
  "vectors": [
    {
      "key": "string",
      "data": { ... },          // only if returnMetadata / returnData
      "distance": number,       // only if returnDistance = true
      "metadata": { ... }       // our fields live here
    }
  ]
}
```
- When querying, we call query_vectors with `returnDistance=True, returnMetadata=True`.
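
A sketch of a thin query wrapper, assuming a boto3 `s3vectors` client is passed in. The keyword names (`vectorBucketName`, `indexName`, `queryVector`, `topK`, `filter`) follow the QueryVectors API as I understand it and should be checked against the current SDK; `query_hits` itself is a hypothetical helper.

```python
def query_hits(client, bucket, index, embedding, top_k, metadata_filter=None):
    """Run one S3 Vectors query and return a flat list of hit dicts."""
    kwargs = {
        "vectorBucketName": bucket,
        "indexName": index,
        "queryVector": {"float32": list(embedding)},
        "topK": top_k,
        "returnDistance": True,
        "returnMetadata": True,
    }
    # Only send a filter when one was built; an unfiltered call omits the key.
    if metadata_filter is not None:
        kwargs["filter"] = metadata_filter
    response = client.query_vectors(**kwargs)
    return [
        {
            "key": vector["key"],
            "distance": vector.get("distance"),
            "metadata": vector.get("metadata", {}),
        }
        for vector in response.get("vectors", [])
    ]
```

Passing the client in keeps the wrapper testable with a stub instead of a live AWS call.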

#### ! Filtering design:
- Local (filtered) retrieval: S3 query with strong metadata filters - cik_int, report_year, sec_item_canonical
- Company-global retrieval: relax years but keep the company when we have it, i.e. replace the exact year set with report_year >= recent_years_threshold.
- Truly global retrieval: only makes sense when: no company was detected, or you explicitly want multi-company analytic behavior.
- If query has companies:
  - Filtered call: strong local {cik_int ∈ C, report_year ∈ Y}.
  - Global call: company-global {cik_int ∈ C, report_year >= threshold}.
- **So we never ignore company in v1; “global” is “global in time”, not “global across issuers”.**
- If query has no companies:
  - Filtered call: whatever filters we can build (years / sections).
  - Global call: truly global with recency only.
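
A sketch of how the two filter shapes could be built. The function names and the `_and` helper are hypothetical; the operators (`$and`, `$in`, `$gte`) are the S3 Vectors metadata filter operators, and the field names are the ones used in these notes.

```python
def _and(clauses):
    """Combine non-empty clauses; avoid wrapping a single clause in $and."""
    clauses = [clause for clause in clauses if clause]
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

def build_local_filter(ciks, years, items):
    """Strong local filter: company set, exact year set, section set."""
    return _and([
        {"cik_int": {"$in": sorted(ciks)}} if ciks else None,
        {"report_year": {"$in": sorted(years)}} if years else None,
        {"sec_item_canonical": {"$in": sorted(items)}} if items else None,
    ])

def build_global_filter(ciks, recent_years_threshold):
    """Global in time: keep the company when present, relax years to a recency floor."""
    return _and([
        {"cik_int": {"$in": sorted(ciks)}} if ciks else None,
        {"report_year": {"$gte": recent_years_threshold}},
    ])
```

With no companies detected, `build_global_filter([], 2015)` reduces to the recency clause alone, which matches the "truly global with recency only" case.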

#### Semantic variants: filters or open?
- Purpose of variants = phrase coverage, not “turn the retriever into random noise”. 
- Variants should use the same filters as the base query !!
- **different embeddings for the same conceptual question, not a different scope.**
  - Base embedding: filtered call, global call
  - variant embedding: filtered call only
- Base global call already gives cross-year / cross-company support.
- Variants are mainly for “did the embedding miss some synonyms?” within the intended scope; filtered calls capture that well.
- Doing global+variants everywhere multiplies noise and rerank effort for relatively small marginal gain.
- **add toggles for:** enable_global: bool (default True), enable_variants: bool (default False initially).
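
The toggle behaviour above, as a small sketch. `RetrievalToggles` and `plan_calls` are hypothetical names; the point is only that variants never get a global call.

```python
from dataclasses import dataclass

@dataclass
class RetrievalToggles:
    enable_global: bool = True
    enable_variants: bool = False

def plan_calls(n_variants, toggles):
    """Return (embedding_label, scope) pairs for the S3 Vectors calls to make."""
    calls = [("base", "filtered")]
    if toggles.enable_global:
        calls.append(("base", "global"))
    if toggles.enable_variants:
        # Variants reuse the filtered scope only: same scope, different phrasing.
        calls += [(f"variant_{i}", "filtered") for i in range(n_variants)]
    return calls
```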
---

### Files, modules etc:
- map steps 4–10 to concrete modules under rag_modules_src/rag_pipeline/:
  - metadata_filters.py – Step 4: build S3 Vectors filter JSON(s) from EntityExtractionResult.
  - s3_retriever.py – Step 5: call S3 Vectors (filtered + global, base + variants) and produce structured hits.
  - context_window.py – Step 6: map sentence-level hits to windowed spans.
  - text_fetcher.py – Step 7: join spans to text (Stage1/Stage2 parquet).
  - reranker.py – Step 8: hybrid reranking (MVP simple).
  - deduplicator.py – Step 9: drop duplicates/boilerplate.
  - context_assembler.py – Step 10: final prompt-ready context.



---
### Thoughts on DEDUPS:

1. Duplication is almost guaranteed once you move past “one ANN call → one sentence”:
   - the same logical sentence or paragraph gets hit via different paths: filtered vs global S3 Vectors calls, or variants;
   - obvious repetition from SEC boilerplate: forward-looking statements, safe-harbor disclaimers.
2. Context budget: repeated text eats tokens and shrinks the room for genuinely diverse evidence.
3. Answer quality: Claude gets a distorted signal because one paragraph is over-represented compared with others; it tends to over-anchor on boilerplate and repeat it back.
4. Evaluation noise: your P3 gold harness will see apparent “multiple supporting passages” that are actually the same paragraph duplicated through different retrieval paths; that muddies recall/precision diagnostics.
---

1. Do all the complex stuff first: filtered + global + semantic variants + window expansion + reranking.
2. Before truncating to the top N chunks for the prompt, run a dedup pass that uses one of:
   1. a canonical key per chunk, (cik_int, report_year, sec_item_canonical, start_sentence_id, end_sentence_id), when operating on spans or windows;
   2. normalized_text (lowercased, stripped, maybe trimmed to 1–2k chars), when operating purely on text (not recommended).
3. Keep the single highest-score version of any chunk sharing that key. (??)
4. This pass sits naturally right at the end of the retriever module, just before rag_context is handed to the LLM prompt builder.
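
A sketch of the span-keyed dedup pass, assuming each chunk is a dict carrying the key fields above plus a `score_final`. `dedup_blocks` is a hypothetical name.

```python
def dedup_blocks(blocks):
    """Keep the highest-scoring block per canonical span key, best first."""
    best = {}
    for block in blocks:
        key = (
            block["cik_int"],
            block["report_year"],
            block["sec_item_canonical"],
            block["start_sentence_id"],
            block["end_sentence_id"],
        )
        if key not in best or block["score_final"] > best[key]["score_final"]:
            best[key] = block
    # Return survivors sorted so a later top-N truncation keeps the strongest.
    return sorted(best.values(), key=lambda b: b["score_final"], reverse=True)
```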
---

1. BedrockClient is a focused wrapper around Claude on Bedrock: Owns the bedrock-runtime client and model_id, max_tokens, temperature.
2. It builds a pure string prompt from: user_query, a compact analytical string (metric pipeline output), and a flat list of RAG chunks (rag_context), each being a dict with keys like text, company, year, section, similarity_score.
3. QueryOrchestrator is the high-level “glue”: 
   1. self.metric_pipeline (from metric_pipeline/src/pipeline.py) to run KPI extraction over your Stage 2/warehouse tables.
   2. self.embedder (currently wired to the old QueryEmbedder that takes (query, input_type) and hits Bedrock for embeddings).
   3. self.rag_search (intended to be a S3 Vectors search client, but currently None with a mock branch in _search_documents).
   4. self.llm_client as a BedrockClient instance.
4. S3 filters are extracted from the analytic string in _extract_s3_filters. (??)
5. final answer via _generate_response.

--- 
#### orchestrator already has slots for:
- “Step 3: query embedding” → _generate_embedding.
- “Step 4–5: metadata filters + S3 Vectors ANN” → _extract_s3_filters + _search_documents.
- “Step 10: context assembly + LLM synthesis” → _generate_response + BedrockClient._build_prompt.


```
Dedup policy (MVP): Primary key: (cik_int, report_year, sec_item_canonical, sentenceID) → keep highest score_final.
```

---
## High-level plan - 1117, Quick:

````
EntityExtractionResult
  ↓
[Step 4] MetadataFilterBuilder → {filtered_filters, global_filters}
  ↓
[Step 5] S3VectorsRetriever → RetrievalBundle
         ├─ base_embedding: filtered_call + global_call
         └─ variant_embeddings: filtered_call only (if enabled)
  ↓
         Deduplicate by sentence_id → union_hits (List[S3Hit])
         Minimal state, early reduction, clear early semantics, and cheaper to deduplicate by sentenceID here.
         Second deduplication will happen later.
  ↓
[Step 6] ContextWindowExpander → List[ContextSpan]
         (±N sentence window, merge overlaps)
  ↓
[Step 7] TextFetcher → List[ContextBlock]
         (join to Stage1/2 parquet for full text)
  ↓
[Step 8] HybridReranker → List[ContextBlock] (scored)
         (ANN score + keyword overlap + source bonuses)
  ↓
[Step 9] BlockDeduplicator → List[ContextBlock] (cleaned)
         Primary key: (cik_int, report_year, sec_item, sentenceID)
         Keep highest score_final per key
  ↓
[Step 10] ContextAssembler → str (formatted context)
          Format: === NVDA 2021 ITEM_7 ===\n<text>
````

### S3 Retriever Analysis:

- **Evidence:**
- All thresholds (0.0 → 0.5): 45 hits, NO rejections
- Similarity range: [0.674, 0.737]
- Mean: 0.693, Median: 0.687
- Conclusion: Even the "weakest" hit (similarity=0.674) is actually quite strong. S3 Vectors' ANN algorithm already returns high-quality matches.
- 64% of hits in 0.6-0.7 range
- 36% of hits in 0.7-0.8 range
- Zero hits below 0.6 similarity
- Zero hits above 0.8 similarity
- The 45th best result (0.674) is still semantically relevant
- **no "long tail" of weak matches to filter out** AT ALL.


### Rerank Corrections:
- Original flow (BAD DESIGN):
  - S3 returns 45 hits (filtered + global)
  - Window expansion: 45 hits × 7 sentences/window = 315 sentence lookups in Stage 2 meta
  - Text fetch: materialize 315 full text blocks
  - Rerank: score all 315 blocks
  - Problem: expensive I/O (315 parquet row fetches + string concatenations) on data you'll throw away.
- Corrected flow:
  - S3 returns 45 hits → rerank → take the top 15-20 hits only → window expansion: 20 hits × 7 = 140 sentence lookups
  - Stage-2 dedup: remove overlapping windows from the small set
  - Talk to LLM


```
# Step 4-5: Retrieval (DONE)!! YIPPIEIEIEE. 
filtered_hits, global_hits = s3_retriever.retrieve(...)
union_hits = deduplicate_stage1(filtered_hits + global_hits)  # by (sentence_id, embedding_id)

# Step 6: IMMEDIATE RERANKING (sentence-level)
scored_hits = reranker.score_hits(query, union_hits)  
# Input: List[S3Hit], Output: List[S3Hit] with .final_score

# Step 7: TOP-K SELECTION
top_hits = sorted(scored_hits, key=lambda h: h.final_score, reverse=True)[:config.top_k_for_expansion]
# e.g., keep top 20 hits for window expansion

# Step 8: WINDOW EXPANSION (only on survivors)
spans = window_expander.expand(top_hits)
# Creates ContextSpan objects with sentence ranges

# Step 9: TEXT FETCH (only for expanded spans)
blocks = text_fetcher.materialize_blocks(spans)
# Joins to Stage 2 meta, concatenates text

# Step 10: STAGE-2 DEDUPLICATION (remove overlapping windows)
unique_blocks = deduplicator.dedup(blocks)
# Key: (cik, year, section, min_sentence_id, max_sentence_id)

# Step 11: FINAL SELECTION & ASSEMBLY
final_blocks = unique_blocks[:config.max_context_blocks]  # e.g., top 10
context_str = assembler.assemble(final_blocks)
```

### Retrieval variants analysis:
1. We are combining results from different retrieval contexts:
    - Filtered hits: "What's most similar within (NVDA, 2021-2023, ITEM_7)?"
    - Global hits: "What's most similar within (NVDA, any year ≥2015)?"
    - Variant hits: "What's most similar to a rephrased version of the query?"
2. The pools differ in size, so the top hits did not face the same competition:
    ```
    Filtered pool: 5,000 sentences (NVDA 2021-2023 ITEM_7)
      → Top hit: distance=0.15 (very close match)
    Global pool: 50,000 sentences (NVDA all years)
      → Top hit: distance=0.25 (less close, but from larger pool)

    !! Filtered hit is closer (0.15 < 0.25). But global hit competed against 10x more candidates, hm?
    ```
3. S3 Vectors doesn't tell us: "This was rank 1 out of 5,000" vs "This was rank 1 out of 50,000".
4. Simple reranking would go bad: we need provenance-aware scoring, e.g. "filtered hits match user intent by definition".
5. Injecting domain knowledge ??
6. Dense embeddings can retrieve semantically similar but lexically misaligned results; lexical overlap catches this: "Does this sentence actually SAY the words I care about?"
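
One way points 4 and 6 could combine into a single score, as a sketch. The weights and per-source bonuses are assumed placeholder values, not tuned numbers, and `final_score` is a hypothetical helper; similarity is taken as 1 - cosine distance.

```python
import re

# Assumed bonuses: filtered hits match user intent by definition.
SOURCE_BONUS = {"filtered": 0.05, "variant": 0.02, "global": 0.0}

def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_overlap(query, text):
    """Fraction of query tokens that literally appear in the text."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(text)) / len(query_tokens)

def final_score(query, text, distance, source, w_sim=0.7, w_lex=0.3):
    """Blend ANN similarity, lexical overlap, and a provenance bonus."""
    similarity = 1.0 - distance
    return (
        w_sim * similarity
        + w_lex * lexical_overlap(query, text)
        + SOURCE_BONUS.get(source, 0.0)
    )
```

With these placeholder values, a filtered hit at distance 0.25 edges out a global hit at 0.20 on the same text, which is the kind of provenance awareness item 4 asks for.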

---
### WINDOW EXPANSION- but with limits:
- ! Take only the top 20 BEFORE window expansion: top_hits = sorted_hits[:20]
- Example: total after union: 30 + 15 + 45 = 90 hits
- After Stage-1 dedup: ~60 unique hits (assuming some overlap)
- 60 hits × 7 sentences/window = 420 sentence lookups in Stage 2 meta parquet
- Most of those 60 hits won't make it to the final context anyway. !! Expensive.
- Set the cutoff high initially (e.g., 40-50) so you're not too aggressive


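### Quick note on who's doing the hit calling:
```python
class S3VectorsRetriever:
    def retrieve(self, embedding, variant_embeddings, filtered_filters, global_filters,
                 top_k_filtered=30, top_k_global=15):
        # Base embedding, filtered scope
        filtered_hits = self._query_s3(embedding, filtered_filters, top_k_filtered)

        # Base embedding, global scope
        global_hits = self._query_s3(embedding, global_filters, top_k_global)

        # Each variant embedding, filtered scope only
        variant_hits = []
        for variant in variant_embeddings:
            variant_hits.extend(self._query_s3(variant, filtered_filters, top_k_filtered))

        return filtered_hits, global_hits, variant_hits

    def _query_s3(self, embedding, filters, top_k):
        # Thin wrapper around the S3 Vectors query call
        raise NotImplementedError
```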

### Again, Embed things- Drop or not, TopK or not?

```
# Why this is WRONG:
Filtered hit: distance=0.25 (from pool of 5,000 candidates)
Global hit:   distance=0.20 (from pool of 50,000 candidates)
```
- Which is "better"?
- Global has LOWER distance (0.20 < 0.25).
- But filtered searched a SMALLER pool: fewer candidates, so a very low distance is harder to come by.
- Distances are NOT comparable across different search spaces!

1. S3 Vectors returns cosine distance relative to the query embedding, but pool size affects the distance distribution. A larger pool has more candidates, so its top-K distances tend to be lower (with exact search over a superset they can only be equal or lower; ANN approximation can blur this):
   - Small pool (5K): top-30 distances might be [0.20, 0.40]
   - Large pool (50K): top-30 distances might be [0.15, 0.30]
   - Same embedding, different distributions!

2. Don't do min-max normalization or any other scale normalization on these scores, even though it looks standard. Standard scalers make sense for continuous, fine-grained float features in a dataset; for semantic similarity scores drawn from different search spaces, I'm not sure what the correct approach is. ??
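
One rank-based option that avoids comparing raw distances at all is reciprocal rank fusion (RRF). This is a swapped-in technique, not something these notes have settled on. Each list contributes 1 / (k + rank) per hit, and k=60 is the commonly used default.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of sentence IDs using ranks only, never distances."""
    scores = {}
    for hits in ranked_lists:
        for rank, sentence_id in enumerate(hits, start=1):
            scores[sentence_id] = scores.get(sentence_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda sid: (-scores[sid], sid))
```

Here a sentence found by both the filtered and the global call rises to the top because it collects a contribution from each list, regardless of how the two distance distributions differ.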




### High level End to End Skelly v1:

```python

# 1–3 already done:
entities = adapter.extract(query)
embedding = embedder.embed_query(query, entities)

# 4. Filters
local_filters = filter_builder.build_local_filters(entities)
global_filters = filter_builder.build_global_filters(entities)

# 5. Retrieval (base + variants later)
bundle = retriever.retrieve_many(
    embeddings=[embedding],      # later: [embedding] + variant_embeddings
    entities=entities,
    enable_global=True,
    top_k_filtered=30,
    top_k_global=15,
    local_filters=local_filters,
    global_filters=global_filters,
)

# 6. Context windows
spans = window_expander.expand(bundle.union_hits)

# 7. Full text
blocks = text_fetcher.materialize_blocks(spans)

# 8. Rerank
reranked_blocks = reranker.rerank(query, blocks)

# 9. Dedup
clean_blocks = deduplicator.dedup(reranked_blocks)

# 10. Assemble final context for LLM
context_str = assembler.assemble(clean_blocks)

```

### Window Expansion + Hit Merging Logic -- Tracking IS true hit or not, etc.

```

Hit A (pos=44, distance=0.18) → window [41, 42, 43, 44, 45, 46, 47]
  Creates 7 rows:
    sentenceID="doc_1A_0041", is_core_hit=False, parent_hit_distance=0.18
    sentenceID="doc_1A_0042", is_core_hit=False, parent_hit_distance=0.18
    sentenceID="doc_1A_0043", is_core_hit=False, parent_hit_distance=0.18
    sentenceID="doc_1A_0044", is_core_hit=True,  parent_hit_distance=0.18  ← Core
    sentenceID="doc_1A_0045", is_core_hit=False, parent_hit_distance=0.18
    sentenceID="doc_1A_0046", is_core_hit=False, parent_hit_distance=0.18
    sentenceID="doc_1A_0047", is_core_hit=False, parent_hit_distance=0.18

Hit B (pos=46, distance=0.22) → window [43, 44, 45, 46, 47, 48, 49]
  Creates 7 rows:
    sentenceID="doc_1A_0043", is_core_hit=False, parent_hit_distance=0.22
    sentenceID="doc_1A_0044", is_core_hit=False, parent_hit_distance=0.22
    sentenceID="doc_1A_0045", is_core_hit=False, parent_hit_distance=0.22
    sentenceID="doc_1A_0046", is_core_hit=True,  parent_hit_distance=0.22  ← Core
    sentenceID="doc_1A_0047", is_core_hit=False, parent_hit_distance=0.22
    sentenceID="doc_1A_0048", is_core_hit=False, parent_hit_distance=0.22
    sentenceID="doc_1A_0049", is_core_hit=False, parent_hit_distance=0.22

Total: 14 rows (with duplicates)

Sentence "doc_1A_0043":
  - From Hit A: distance=0.18
  - From Hit B: distance=0.22
  → Keep Hit A's version (0.18 < 0.22) ✅

Sentence "doc_1A_0044":
  - From Hit A: distance=0.18, is_core_hit=True
  - From Hit B: distance=0.22, is_core_hit=False
  → Keep Hit A's version ✅

Sentence "doc_1A_0046":
  - From Hit A: distance=0.18, is_core_hit=False
  - From Hit B: distance=0.22, is_core_hit=True
  → Keep Hit A's distance (better score), but carry over is_core_hit=True from B ✅

# After dedup, we have individual sentences
# Group them into contiguous runs by (doc, section, sentence_pos)

Sentences: [41, 42, 43, 44, 45, 46, 47, 48, 49]
           ↑_________________________↑  ↑______↑
           Block 1: [41-47]              Block 2: [48-49]

contiguous? Merge into ONE block: [41-49]
```
---
### Algorithm Idea for now:

```
Step 6-7: ContextBuilder.build_blocks()
  ↓
  For each S3Hit:
    1. Calculate window [pos-3, pos+3]
    2. Fetch sentences from Stage 2 meta (by sentence_pos range)
    3. For each sentence in window:
       - Mark is_core_hit (True if sentence_id == hit.sentence_id)
       - Tag with parent_hit_distance
       - Tag with parent_hit_sources, parent_hit_variant_ids
  ↓
  Flatten to: List[SentenceRecord] (one row per sentence, many duplicates)

Step 8: Sentence-Level Deduplication
  ↓
  Group by: (sentenceID, cik_int, report_year, section_name)
  Keep: Best parent_hit_distance
  Preserve: is_core_hit, sources, variant_ids
  ↓
  Result: List[SentenceRecord] (deduplicated, each sentence appears once)

Step 9: Contiguous Grouping
  ↓
  Group by: (cik_int, report_year, section_name)
  Within each group:
    - Sort by sentence_pos
    - Find contiguous runs (pos[i+1] == pos[i] + 1)
    - Each run becomes a ContextBlock
  ↓
  Result: List[ContextBlock] (natural paragraphs)

Step 10: Final Selection & Assembly
  ↓
  Sort blocks by: best sentence distance in block (or first core_hit distance)
  Take top 10 blocks
  Format for LLM
```

---

```python
# rag_pipeline/context_builder.py

class ContextBuilder:
    def build_blocks(self, hits):
        # Step 1: Expand all hits to sentence records
        sentence_records = self._expand_to_sentences(hits)

        # Step 2: Deduplicate at sentence level
        unique_sentences = self._deduplicate_sentences(sentence_records)

        # Step 3: Group into contiguous blocks
        blocks = self._group_into_blocks(unique_sentences)

        return blocks

    def _expand_to_sentences(self, hits):
        # For each hit, create SentenceRecord for core + neighbors
        raise NotImplementedError

    def _deduplicate_sentences(self, records):
        # Group by sentenceID, keep best parent_hit_distance
        raise NotImplementedError

    def _group_into_blocks(self, sentences):
        # Find contiguous runs, create ContextBlocks
        raise NotImplementedError
```


### Score Purpose Analysis.
```
Steps 1-5: Score determines WHICH sentences to retrieve
  ↓
Step 6-7: Score determines WHICH version to keep during dedup
  ↓
Step 8-9: Score determines PRIORITY for topK selection
  ↓
Step 10: Score is IRRELEVANT - just format text cleanly
```

---
### Functional order; Sorting; NO custom-forced hop-window context blocks.

```
Multiple hits create overlapping windows:
  Hit A: [s1, s2, s3, s4, s10, s12, s20]
  Hit B: [s20, s22, s28, s36]
  Hit C: [s100, s22, s1, s2, s20]

After expansion: ~210 SentenceRecords (with duplicates)
  s1 appears 2 times (from Hit A and Hit C)
  s2 appears 2 times (from Hit A and Hit C)
  s20 appears 3 times (from A, B, C)
  etc.

After sentence dedup: ~140 unique sentences
  s1 (once, best distance)
  s2 (once, best distance)
  s3 (once)
  s4 (once)
  s10 (once)
  s12 (once)
  ...

```

After sentence dedup, just sort by functional order:

```python
sorted_sentences = sorted(unique_sentences, key=lambda s: (
    s.cik_int,
    s.report_year,      # Descending? (most recent first)
    s.section_name,
    s.doc_id,
    s.sentence_pos      # Within document, natural order
))
```



- Result: One clean ordered list !!
- ALWAYS FOCUS ON NATURAL ORDER

### Ideal ASSEMBLY:
```
    """
    === NVIDIA CORP | 2020 | ITEM_1A ===
    We face supply chain risks... (pos 45)
    Competition in data centers... (pos 89)
    Our manufacturing partners... (pos 102)

    === MICROSOFT CORP | 2020 | ITEM_7 ===
    Cloud revenue grew 50%... (pos 12)
    Azure subscriptions increased... (pos 45)

    === NVIDIA CORP | 2019 | ITEM_1A ===
    GPU demand fluctuations... (pos 34)
    Cryptocurrency mining impact... (pos 67)
    """
```
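
A sketch of an assembler that produces this layout. It assumes the sentences arrive already in their final order and are plain dicts with `company`, `year`, `section`, `pos`, and `text` keys; `assemble` is a hypothetical name. No sorting happens inside, so block order is whatever the upstream sort decided.

```python
from itertools import groupby

def assemble(sentences):
    """Group consecutive sentences under one header per (company, year, section)."""
    def header_key(sentence):
        return (sentence["company"], sentence["year"], sentence["section"])

    parts = []
    for (company, year, section), group in groupby(sentences, key=header_key):
        lines = [f"=== {company} | {year} | {section} ==="]
        lines += [f'{s["text"]} (pos {s["pos"]})' for s in group]
        parts.append("\n".join(lines))
    # Blank line between blocks, none at the end.
    return "\n\n".join(parts)

sentences = [
    {"company": "NVIDIA CORP", "year": 2020, "section": "ITEM_1A", "pos": 45,
     "text": "We face supply chain risks..."},
    {"company": "NVIDIA CORP", "year": 2020, "section": "ITEM_1A", "pos": 89,
     "text": "Competition in data centers..."},
    {"company": "MICROSOFT CORP", "year": 2020, "section": "ITEM_7", "pos": 12,
     "text": "Cloud revenue grew 50%..."},
]
context = assemble(sentences)
```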
