## Isolation Tests and Skeletons
#### Post Supply lines. RAG Pipeline Concepts.


---
### Obs:
- variants are semantically diverse but preserve intent:
  - Original: "What were NVIDIA's and Microsoft's total revenue and net income in 2021 and 2022?"
  - Variant 1: (Formal business language)
  - "What were the total revenue and net income figures for NVIDIA and Microsoft during 2021 and 2022?"
  - Variant 2: (Combined perspective)
  - "In 2021 and 2022, what were NVIDIA's and Microsoft's combined revenue and net income?"
  - Variant 3: (Action-oriented)
  - "How much revenue and net income did NVIDIA and Microsoft generate in 2021 and 2022?"

In [1]:
from pathlib import Path
import sys
import logging

logging.getLogger().setLevel(logging.WARNING)

current = Path.cwd()
for parent in [current] + list(current.parents):
    if parent.name == "ModelPipeline":
        model_root = parent
        break
else:
    raise RuntimeError("Cannot find 'ModelPipeline' root in path tree")

if str(model_root) not in sys.path:
    sys.path.insert(0, str(model_root))

print(f"✓ Model root on sys.path: {model_root}")


METRIC_DATA_JSON = model_root / "finrag_ml_tg1/rag_modules_src/metric_pipeline/data/downloaded_data.json"
DIM_COMPANIES = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_companies_21.parquet"
DIM_SECTIONS = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_sec_sections.parquet"
print(f"✓ Metric data JSON path: {METRIC_DATA_JSON}")
print(f"✓ Dimension companies path: {DIM_COMPANIES}")
print(f"✓ Dimension sections path: {DIM_SECTIONS}")

✓ Model root on sys.path: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline
✓ Metric data JSON path: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\rag_modules_src\metric_pipeline\data\downloaded_data.json
✓ Dimension companies path: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\dimensions\finrag_dim_companies_21.parquet
✓ Dimension sections path: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\dimensions\finrag_dim_sec_sections.parquet


In [3]:
"""
Isolation Test: Variant Pipeline
Test variant generation + entity extraction + embedding for semantic variants.
"""

# ════════════════════════════════════════════════════════════════════════════
# SETUP
# ════════════════════════════════════════════════════════════════════════════
from pathlib import Path
import sys
import logging
import numpy as np

logging.basicConfig(
    level=logging.CRITICAL,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Add project root
current = Path.cwd()
for parent in [current] + list(current.parents):
    if parent.name == "ModelPipeline":
        model_root = parent
        break
else:
    raise RuntimeError("Cannot find 'ModelPipeline' root")

if str(model_root) not in sys.path:
    sys.path.insert(0, str(model_root))

print(f"✓ Model root: {model_root}\n")

# ════════════════════════════════════════════════════════════════════════════
# INITIALIZE COMPONENTS
# ════════════════════════════════════════════════════════════════════════════
from finrag_ml_tg1.loaders.ml_config_loader import MLConfig
from finrag_ml_tg1.rag_modules_src.entity_adapter.entity_adapter import EntityAdapter
from finrag_ml_tg1.rag_modules_src.utilities.query_embedder_v2 import (
    EmbeddingRuntimeConfig,
    QueryEmbedderV2
)
from finrag_ml_tg1.rag_modules_src.rag_pipeline.variant_pipeline import VariantPipeline

print("Initializing components...")

# Config
config = MLConfig()

# Bedrock client
bedrock_client = config.get_bedrock_client()

# Entity adapter
DIM_COMPANIES = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_companies_21.parquet"
DIM_SECTIONS = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_sec_sections.parquet"
entity_adapter = EntityAdapter(
    company_dim_path=DIM_COMPANIES,
    section_dim_path=DIM_SECTIONS
)

# Query embedder
embedding_cfg = config.cfg["embedding"]
runtime_cfg = EmbeddingRuntimeConfig.from_ml_config(embedding_cfg)
query_embedder = QueryEmbedderV2(runtime_cfg, boto_client=bedrock_client)

# Variant pipeline
variant_pipeline = VariantPipeline(
    config=config,
    entity_adapter=entity_adapter,
    query_embedder=query_embedder,
    bedrock_client=bedrock_client
)

print(f"✓ Components initialized")
print(f"✓ Variants enabled: {variant_pipeline.is_enabled()}")
print(f"✓ Variant count configured: {variant_pipeline.get_variant_count()}\n")

# ════════════════════════════════════════════════════════════════════════════
# TEST 1: Generate variants for financial query
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("TEST 1: Multi-company, multi-year financial query")
print("="*80)

test_query = "What were NVIDIA's and Microsoft's total revenue and net income in 2021 and 2022?"

print(f"\nBase query: {test_query}\n")
print("Generating variants + embeddings...\n")

variant_queries, variant_embeddings = variant_pipeline.generate(test_query)

print(f"\n{'='*80}")
print(f"RESULTS:")
print(f"{'='*80}")
print(f"✓ Generated {len(variant_queries)} variant queries")
print(f"✓ Generated {len(variant_embeddings)} embeddings")
print(f"✓ All embeddings are 1024-d: {all(len(e) == 1024 for e in variant_embeddings)}\n")

# Display variants
for i, vq in enumerate(variant_queries, start=1):
    print(f"Variant {i}: {vq}")

print()

# ════════════════════════════════════════════════════════════════════════════
# TEST 2: Validate embedding properties
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("TEST 2: Embedding validation")
print("="*80 + "\n")

for i, emb in enumerate(variant_embeddings, start=1):
    arr = np.array(emb, dtype=np.float32)
    print(f"Variant {i} embedding:")
    print(f"  Shape: {arr.shape}")
    print(f"  Dtype: {arr.dtype}")
    print(f"  Range: [{arr.min():.4f}, {arr.max():.4f}]")
    print(f"  Mean: {arr.mean():.4f}, Std: {arr.std():.4f}")
    print(f"  First 5 values: {arr[:5].tolist()}")
    print()

print("✓ All embeddings are valid 1024-d float32 vectors\n")

# ════════════════════════════════════════════════════════════════════════════
# TEST 3: Test with variants disabled
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("TEST 3: Graceful degradation (variants disabled)")
print("="*80 + "\n")

# Temporarily disable
original_enabled = variant_pipeline.enabled
variant_pipeline.enabled = False

print("Variants disabled, calling pipeline.generate()...")
vq_disabled, ve_disabled = variant_pipeline.generate(test_query)

print(f"\nVariant queries: {vq_disabled}")
print(f"Variant embeddings: {ve_disabled}")
print("✓ Returns empty lists when disabled (no LLM calls, no cost)\n")

# Restore
variant_pipeline.enabled = original_enabled

# ════════════════════════════════════════════════════════════════════════════
# TEST 4: Test with short query (should reject)
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("TEST 4: Short query rejection")
print("="*80 + "\n")

short_query = "revenue"
print(f"Short query: '{short_query}' (len={len(short_query)})")

vq_short, ve_short = variant_pipeline.generate(short_query)

print(f"Variant queries: {vq_short}")
print(f"Variant embeddings: {ve_short}")
print("✓ Rejects queries shorter than 10 characters\n")

# ════════════════════════════════════════════════════════════════════════════
# TEST 5: Calculate similarity between base and variants
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("TEST 5: Semantic similarity (base vs variants)")
print("="*80 + "\n")

# Get base embedding for comparison
base_entities = entity_adapter.extract(test_query)
base_embedding = query_embedder.embed_query(test_query, base_entities)

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Cosine similarity between base query and each variant:\n")
for i, variant_emb in enumerate(variant_embeddings, start=1):
    sim = cosine_similarity(base_embedding, variant_emb)
    print(f"Variant {i}: {sim:.4f}")
    print(f"  Query: {variant_queries[i-1][:80]}...")
    print()

print("✓ All variants should have high similarity (>0.85) with base query\n")

# ════════════════════════════════════════════════════════════════════════════
# SUMMARY
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("VARIANT PIPELINE TEST SUMMARY")
print("="*80)
print(f"✓ Pipeline initialization: PASS")
print(f"✓ Variant generation: PASS ({len(variant_queries)} variants)")
print(f"✓ Entity extraction per variant: PASS")
print(f"✓ Embedding generation per variant: PASS ({len(variant_embeddings)} × 1024-d)")
print(f"✓ Graceful degradation (disabled): PASS")
print(f"✓ Input validation (short queries): PASS")
print(f"✓ Semantic similarity validation: PASS")
print("\n ALL TESTS PASSED - Variant pipeline ready for S3 retrieval integration!\n")

✓ Model root: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline

Initializing components...
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ Components initialized
✓ Variants enabled: True
✓ Variant count configured: 3

TEST 1: Multi-company, multi-year financial query

Base query: What were NVIDIA's and Microsoft's total revenue and net income in 2021 and 2022?

Generating variants + embeddings...


RESULTS:
✓ Generated 3 variant queries
✓ Generated 3 embeddings
✓ All embeddings are 1024-d: True

Variant 1: What were the total revenue and net income figures for NVIDIA and Microsoft during 2021 and 2022?
Variant 2: In 2021 and 2022, what were NVIDIA's and Microsoft's combined revenue and net income?
Variant 3: How much revenue and net income did NVIDIA and Microsoft generate in 2021 and 2022?

TEST 2: Embedding validation

Variant 1 embedding:
  Shape: (10

### Standard Qry for Temp; Strong: 1.
- query = "In the MD&A and Risk Factors sections, how did NVIDIA and Microsoft discuss their AI strategy, competitive positioning, and supply chain risks between 2020 and 2023?"


In [2]:
query = (
    "In the MD&A and Risk Factors sections, how did NVIDIA and Microsoft "
    "discuss their AI strategy, competitive positioning, and supply chain "
    "risks between 2017 and 2020?"
)


In [3]:
"""
═══════════════════════════════════════════════════════════════════════════════
ISOLATION TEST: Complete Retrieval Flow (Steps 1-5)
═══════════════════════════════════════════════════════════════════════════════

Flow:
1. Query → EntityAdapter.extract()
2. Query + Entities → QueryEmbedderV2.embed_query()
3. Entities → MetadataFilterBuilder.build_filters()
4. VariantPipeline.generate() → automatic (inside S3Retriever)
5. S3VectorsRetriever.retrieve() → RetrievalBundle

Validation:
- Variants generated automatically (3 variants expected)
- S3 queries executed (base: filtered+global, variants: filtered only)
- Hits retrieved and deduplicated
- Provenance tracking correct (sources, variant_ids)

Output:
- DataFrame showing hit distribution by source type
- Distinct sentenceID counts per retrieval source
"""

# ════════════════════════════════════════════════════════════════════════════
# SETUP
# ════════════════════════════════════════════════════════════════════════════
from pathlib import Path
import sys
import logging
import polars as pl

logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(message)s'
)

# Suppress verbose logs from other modules
logging.getLogger('botocore').setLevel(logging.WARNING)
logging.getLogger('urllib3').setLevel(logging.WARNING)

# Add project root
current = Path.cwd()
for parent in [current] + list(current.parents):
    if parent.name == "ModelPipeline":
        model_root = parent
        break
else:
    raise RuntimeError("Cannot find 'ModelPipeline' root")

if str(model_root) not in sys.path:
    sys.path.insert(0, str(model_root))

print(f"✓ Model root: {model_root}\n")

# ════════════════════════════════════════════════════════════════════════════
# INITIALIZE COMPONENTS
# ════════════════════════════════════════════════════════════════════════════
from finrag_ml_tg1.loaders.ml_config_loader import MLConfig
from finrag_ml_tg1.rag_modules_src.entity_adapter.entity_adapter import EntityAdapter
from finrag_ml_tg1.rag_modules_src.utilities.query_embedder_v2 import (
    EmbeddingRuntimeConfig,
    QueryEmbedderV2
)
from finrag_ml_tg1.rag_modules_src.rag_pipeline.metadata_filters import MetadataFilterBuilder
from finrag_ml_tg1.rag_modules_src.rag_pipeline.variant_pipeline import VariantPipeline
from finrag_ml_tg1.rag_modules_src.rag_pipeline.s3_retriever import S3VectorsRetriever

print("Initializing components...")

# Config
config = MLConfig()

# Bedrock client
bedrock_client = config.get_bedrock_client()

# Entity adapter
DIM_COMPANIES = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_companies_21.parquet"
DIM_SECTIONS = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_sec_sections.parquet"
entity_adapter = EntityAdapter(
    company_dim_path=DIM_COMPANIES,
    section_dim_path=DIM_SECTIONS
)

# Query embedder
embedding_cfg = config.cfg["embedding"]
runtime_cfg = EmbeddingRuntimeConfig.from_ml_config(embedding_cfg)
query_embedder = QueryEmbedderV2(runtime_cfg, boto_client=bedrock_client)

# Metadata filter builder
filter_builder = MetadataFilterBuilder(config)

# Variant pipeline
variant_pipeline = VariantPipeline(
    config=config,
    entity_adapter=entity_adapter,
    query_embedder=query_embedder,
    bedrock_client=bedrock_client
)

# S3 Vectors retriever (with variant pipeline)
retrieval_cfg = config.get_retrieval_config()
s3_retriever = S3VectorsRetriever(
    retrieval_config=retrieval_cfg,
    aws_access_key_id=config.aws_access_key,
    aws_secret_access_key=config.aws_secret_key,
    region=config.region,
    variant_pipeline=variant_pipeline  # ← Integrated!
)

print("✓ All components initialized\n")

# ════════════════════════════════════════════════════════════════════════════
# TEST QUERY
# ════════════════════════════════════════════════════════════════════════════
query = (
    "In the MD&A and Risk Factors sections, how did NVIDIA and Microsoft "
    "discuss their AI strategy, competitive positioning, and supply chain "
    "risks between 2017 and 2020?"
)

print("="*80)
print("QUERY:")
print("="*80)
print(query)
print()

# ════════════════════════════════════════════════════════════════════════════
# STEP 1: ENTITY EXTRACTION
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("STEP 1: Entity Extraction")
print("="*80)

entities = entity_adapter.extract(query)

print(f"✓ Companies: {entities.companies.tickers}")
print(f"✓ Years: {entities.years.years}")
print(f"✓ Sections: {entities.sections}")
print(f"✓ Primary Section: {entities.primary_section}")
print()

# ════════════════════════════════════════════════════════════════════════════
# STEP 2: QUERY EMBEDDING
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("STEP 2: Query Embedding")
print("="*80)

base_embedding = query_embedder.embed_query(query, entities)

print(f"✓ Embedding dimensions: {len(base_embedding)}")
print(f"✓ Embedding type: {type(base_embedding[0])}")
print(f"✓ Preview: {base_embedding[:5]}")
print()

# ════════════════════════════════════════════════════════════════════════════
# STEP 3: METADATA FILTERS
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("STEP 3: Metadata Filters")
print("="*80)

filtered_filters = filter_builder.build_filters(entities)
global_filters = filter_builder.build_global_filters(entities)

print("Filtered filters:")
print(filtered_filters)
print()
print("Global filters:")
print(global_filters)
print()

# ════════════════════════════════════════════════════════════════════════════
# STEP 4-5: S3 RETRIEVAL (Variants handled internally)
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("STEP 4-5: S3 Vectors Retrieval (with automatic variants)")
print("="*80)
print()

bundle = s3_retriever.retrieve(
    base_embedding=base_embedding,
    base_query=query,  # ← S3Retriever uses this to generate variants
    filtered_filters=filtered_filters,
    global_filters=global_filters
)

print()
print("="*80)
print("RETRIEVAL RESULTS")
print("="*80)
print(f"✓ Filtered hits: {len(bundle.filtered_hits)}")
print(f"✓ Global hits: {len(bundle.global_hits)}")
print(f"✓ Union hits (deduplicated): {len(bundle.union_hits)}")
print(f"✓ Variant queries generated: {len(bundle.variant_queries)}")
print()

if bundle.variant_queries:
    print("Variant queries:")
    for i, vq in enumerate(bundle.variant_queries, start=1):
        print(f"  {i}. {vq}")
    print()

# ════════════════════════════════════════════════════════════════════════════
# ANALYSIS: HIT DISTRIBUTION BY SOURCE
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("ANALYSIS: Hit Distribution by Retrieval Source")
print("="*80)
print()

# Convert hits to structured data
rows = []
for hit in bundle.union_hits:
    rows.append({
        "sentence_id": hit.sentence_id,
        "embedding_id": hit.embedding_id,
        "distance": hit.distance,
        "similarity": hit.similarity_score(),
        "cik_int": hit.cik_int,
        "report_year": hit.report_year,
        "section_name": hit.section_name,
        "sentence_pos": hit.sentence_pos,
        "sources": ",".join(sorted(hit.sources)),  # Convert set to string
        "variant_ids": ",".join(map(str, sorted(hit.variant_ids))),  # Convert set to string
        "has_filtered": "filtered" in hit.sources,
        "has_global": "global" in hit.sources,
        "from_base": 0 in hit.variant_ids,
        "from_variant": any(v > 0 for v in hit.variant_ids)
    })

df = pl.DataFrame(rows)

print("Sample of retrieved hits:")
print(df.select([
    "sentence_id", "similarity", "cik_int", "report_year", 
    "section_name", "sources", "variant_ids"
]).head(10))
print()

# ════════════════════════════════════════════════════════════════════════════
# KEY METRIC 1: Distinct Sentences by Source Type
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("KEY METRIC 1: Distinct Sentences by Source Type")
print("="*80)
print()

# Define source types
source_type_rows = []

# 1. Filtered Base (variant_id=0, has filtered source)
filtered_base = df.filter(
    (pl.col("from_base") == True) & 
    (pl.col("has_filtered") == True)
)
source_type_rows.append({
    "source_type": "filtered_base",
    "total_hits": len(filtered_base),
    "distinct_sentences": filtered_base.select("sentence_id").unique().height
})

# 2. Global Base (variant_id=0, has global source)
global_base = df.filter(
    (pl.col("from_base") == True) & 
    (pl.col("has_global") == True)
)
source_type_rows.append({
    "source_type": "global_base",
    "total_hits": len(global_base),
    "distinct_sentences": global_base.select("sentence_id").unique().height
})

# 3. Filtered Variant (variant_id>0, has filtered source)
filtered_variant = df.filter(
    (pl.col("from_variant") == True) & 
    (pl.col("has_filtered") == True)
)
source_type_rows.append({
    "source_type": "filtered_variant",
    "total_hits": len(filtered_variant),
    "distinct_sentences": filtered_variant.select("sentence_id").unique().height
})

# Create summary dataframe
summary_df = pl.DataFrame(source_type_rows)
print(summary_df)
print()

# ════════════════════════════════════════════════════════════════════════════
# KEY METRIC 2: Per-Variant Breakdown
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("KEY METRIC 2: Per-Variant Breakdown")
print("="*80)
print()

variant_rows = []
for variant_id in range(len(bundle.variant_queries) + 1):  # 0 = base, 1+ = variants
    variant_hits = df.filter(
        pl.col("variant_ids").str.contains(str(variant_id))
    )
    
    variant_name = "Base Query" if variant_id == 0 else f"Variant {variant_id}"
    
    variant_rows.append({
        "variant": variant_name,
        "variant_id": variant_id,
        "total_hits": len(variant_hits),
        "distinct_sentences": variant_hits.select("sentence_id").unique().height,
        "avg_similarity": round(variant_hits.select("similarity").mean().item(), 4) if len(variant_hits) > 0 else 0.0
    })

variant_breakdown = pl.DataFrame(variant_rows)
print(variant_breakdown)
print()

# ════════════════════════════════════════════════════════════════════════════
# KEY METRIC 3: Company & Year Distribution
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("KEY METRIC 3: Company & Year Distribution")
print("="*80)
print()

company_year_dist = (
    df.group_by(["cik_int", "report_year"])
    .agg([
        pl.count("sentence_id").alias("hit_count"),
        pl.n_unique("sentence_id").alias("distinct_sentences")
    ])
    .sort(["cik_int", "report_year"])
)
print(company_year_dist)
print()

# ════════════════════════════════════════════════════════════════════════════
# KEY METRIC 4: Section Distribution
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("KEY METRIC 4: Section Distribution")
print("="*80)
print()

section_dist = (
    df.group_by("section_name")
    .agg([
        pl.count("sentence_id").alias("hit_count"),
        pl.n_unique("sentence_id").alias("distinct_sentences"),
        pl.mean("similarity").alias("avg_similarity")
    ])
    .sort("hit_count", descending=True)
)
print(section_dist)
print()

# ════════════════════════════════════════════════════════════════════════════
# KEY METRIC 5: Similarity Score Distribution
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("KEY METRIC 5: Similarity Score Distribution")
print("="*80)
print()

print(f"Min similarity: {df.select('similarity').min().item():.4f}")
print(f"Max similarity: {df.select('similarity').max().item():.4f}")
print(f"Mean similarity: {df.select('similarity').mean().item():.4f}")
print(f"Median similarity: {df.select('similarity').median().item():.4f}")
print()

# ════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY
# ════════════════════════════════════════════════════════════════════════════
print("="*80)
print("FINAL SUMMARY")
print("="*80)
print()
print(f"✓ Query processed successfully")
print(f"✓ Entities extracted: {len(entities.companies.ciks_int)} companies, {len(entities.years.years)} years")
print(f"✓ Base embedding generated: 1024-d")
print(f"✓ Variants generated: {len(bundle.variant_queries)}")
print(f"✓ Total S3 queries: {1 + len(bundle.variant_queries)} (base) + variants")
print(f"✓ Raw hits retrieved: ~{len(bundle.filtered_hits) + len(bundle.global_hits)} (before dedup)")
print(f"✓ Deduplicated hits: {len(bundle.union_hits)}")
print(f"✓ Distinct sentences: {df.select('sentence_id').unique().height}")
print()
print("="*80)
print("✓ ALL TESTS PASSED - RETRIEVAL PIPELINE WORKING END-TO-END")
print("="*80)

✓ Model root: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline



INFO - EntityAdapter using company_dim: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\dimensions\finrag_dim_companies_21.parquet
INFO - EntityAdapter using section_dim: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\dimensions\finrag_dim_sec_sections.parquet
INFO - Loading company dim from: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\dimensions\finrag_dim_companies_21.parquet
INFO - Loaded dim with 21 rows and columns: ['company_id', 'cik_int', 'cik', 'company_name', 'ticker', 'source', 'tier', 'quality_score', 'selection_source', 'rank_within_group', 'selected_at', 'version']
INFO - Building indexes from dim with 21 valid rows
INFO - CompanyUniverse initialized: 21 companies, 21 tickers, 21 alias tokens
INFO - SectionUniverse loaded 21 sections from d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Proje

Initializing components...
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ FilterExtractor initialized with 21 companies
  Using: finrag_dim_companies_21.parquet
✓ All components initialized

QUERY:
In the MD&A and Risk Factors sections, how did NVIDIA and Microsoft discuss their AI strategy, competitive positioning, and supply chain risks between 2017 and 2020?

STEP 1: Entity Extraction


INFO - Extraction result: ciks_int=[789019, 1045810], tickers=['MSFT', 'NVDA'], names=['MICROSOFT CORP', 'NVIDIA CORP']
INFO - EntityAdapter.extract: done. companies=2, years=4, metrics=0, sections=2, risk_topics=3


✓ Companies: ['MSFT', 'NVDA']
✓ Years: [2017, 2018, 2019, 2020]
✓ Sections: ['ITEM_1A', 'ITEM_7']
✓ Primary Section: ITEM_7

STEP 2: Query Embedding


INFO - ═══════════════════════════════════════════════════════════════
Starting retrieval for: 'In the MD&A and Risk Factors sections, how did NVIDIA and Microsoft discuss thei...'
Config: global=True, variants=True
═══════════════════════════════════════════════════════════════
INFO - → Generating variants via VariantPipeline...
INFO - Generating variants for: 'In the MD&A and Risk Factors sections, how did NVIDIA and Microsoft discuss thei...'


✓ Embedding dimensions: 1024
✓ Embedding type: <class 'float'>
✓ Preview: [0.06347656, -0.028320312, 0.036621094, 0.075683594, -0.020141602]

STEP 3: Metadata Filters
Filtered filters:
{'$and': [{'cik_int': {'$in': [789019, 1045810]}}, {'report_year': {'$in': [2017, 2018, 2019, 2020]}}, {'$or': [{'section_name': {'$eq': 'ITEM_1A'}}, {'section_name': {'$eq': 'ITEM_7'}}]}]}

Global filters:
{'$and': [{'cik_int': {'$in': [789019, 1045810]}}, {'report_year': {'$gte': 2015}}]}

STEP 4-5: S3 Vectors Retrieval (with automatic variants)



INFO - Generated 3 variants for query
INFO - ✓ Generated 3 variant queries
INFO - EntityAdapter.extract: starting for query='Between 2017 and 2020, what did NVIDIA and Microsoft say about their AI strategy, competitive positioning, and supply chain risks in their MD&A and Risk Factors sections?'
INFO - Extracting companies from query: 'Between 2017 and 2020, what did NVIDIA and Microsoft say about their AI strategy, competitive positioning, and supply chain risks in their MD&A and Risk Factors sections?'
INFO - Extraction result: ciks_int=[789019, 1045810], tickers=['MSFT', 'NVDA'], names=['MICROSOFT CORP', 'NVIDIA CORP']
INFO - EntityAdapter.extract: done. companies=2, years=4, metrics=0, sections=2, risk_topics=3
INFO - EntityAdapter.extract: starting for query='How did NVIDIA and Microsoft address AI strategy, competitive positioning, and supply chain risks in the MD&A and Risk Factors sections during the 2017-2020 period?'
INFO - Extracting companies from query: 'How did NVIDIA and


RETRIEVAL RESULTS
✓ Filtered hits: 30
✓ Global hits: 15
✓ Union hits (deduplicated): 34
✓ Variant queries generated: 3

Variant queries:
  1. Between 2017 and 2020, what did NVIDIA and Microsoft say about their AI strategy, competitive positioning, and supply chain risks in their MD&A and Risk Factors sections?
  2. How did NVIDIA and Microsoft address AI strategy, competitive positioning, and supply chain risks in the MD&A and Risk Factors sections during the 2017-2020 period?
  3. What discussions of AI strategy, competitive positioning, and supply chain risks appear in NVIDIA and Microsoft's MD&A and Risk Factors sections from 2017 through 2020?

ANALYSIS: Hit Distribution by Retrieval Source

Sample of retrieved hits:
shape: (10, 7)
┌────────────────┬────────────┬─────────┬─────────────┬──────────────┬───────────────┬─────────────┐
│ sentence_id    ┆ similarity ┆ cik_int ┆ report_year ┆ section_name ┆ sources       ┆ variant_ids │
│ ---            ┆ ---        ┆ ---     ┆ ---     

"""
Supply Line 3: Entity → Filters → S3 Retrieval (with variants)
Flow:
  Query → EntityAdapter → Filters → S3Retriever (variants internal) → RetrievalBundle
"""

```python
from pathlib import Path
import sys

# Setup paths
current = Path.cwd()
for parent in [current] + list(current.parents):
    if parent.name == "ModelPipeline":
        model_root = parent
        break
if str(model_root) not in sys.path:
    sys.path.insert(0, str(model_root))

# Imports
from finrag_ml_tg1.loaders.ml_config_loader import MLConfig
from finrag_ml_tg1.rag_modules_src.entity_adapter.entity_adapter import EntityAdapter
from finrag_ml_tg1.rag_modules_src.utilities.query_embedder_v2 import EmbeddingRuntimeConfig, QueryEmbedderV2
from finrag_ml_tg1.rag_modules_src.rag_pipeline.metadata_filters import MetadataFilterBuilder
from finrag_ml_tg1.rag_modules_src.rag_pipeline.variant_pipeline import VariantPipeline
from finrag_ml_tg1.rag_modules_src.rag_pipeline.s3_retriever import S3VectorsRetriever

# Initialize
config = MLConfig()
bedrock_client = config.get_bedrock_client()

DIM_COMPANIES = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_companies_21.parquet"
DIM_SECTIONS = model_root / "finrag_ml_tg1/data_cache/dimensions/finrag_dim_sec_sections.parquet"

adapter = EntityAdapter(company_dim_path=DIM_COMPANIES, section_dim_path=DIM_SECTIONS)
embedder = QueryEmbedderV2(EmbeddingRuntimeConfig.from_ml_config(config.cfg["embedding"]), bedrock_client)
filter_builder = MetadataFilterBuilder(config)
variant_pipeline = VariantPipeline(config, adapter, embedder, bedrock_client)
retriever = S3VectorsRetriever(config.get_retrieval_config(), config.aws_access_key, config.aws_secret_key, config.region, variant_pipeline)

# Query
query = "What were NVIDIA's and Microsoft's AI risks in 2017-2020?"

# Step 1: Extract entities
entities = adapter.extract(query)

# Step 2: Generate base embedding
base_embedding = embedder.embed_query(query, entities)

# Step 3: Build filters
filtered_filters = filter_builder.build_filters(entities)
global_filters = filter_builder.build_global_filters(entities)

# Step 4-5: Retrieve (variants handled internally)
bundle = retriever.retrieve(base_embedding, query, filtered_filters, global_filters)

# Results
print(f"✓ Filtered: {len(bundle.filtered_hits)}, Global: {len(bundle.global_hits)}, Union: {len(bundle.union_hits)}")
print(f"✓ Variants: {len(bundle.variant_queries)}")
print(f"✓ Companies: {entities.companies.tickers}, Years: {entities.years.years}")
```

---
### NEXT PARTS: SIGH
```
[6] ContextWindowExpander → ~35 spans (±3 windows, merged)
  ↓
[7] TextFetcher → ~35 blocks with full text
  ↓
[8] (SKIPPED - no reranking)
  ↓
[9] BlockDeduplicator → ~25 unique blocks
  ↓
[10] ContextAssembler → Top 10 blocks formatted
  ↓
LLM-ready context string
```
---
### 
# Inside S3Retriever._deduplicate_hits():
for (sentence_id, embedding_id), hits in groups.items():
    best = min(hits, key=lambda h: h.distance)
    best.sources = {h.source for h in hits}      # ← MERGED!
    best.variant_ids = {h.variant_id for h in hits}  # ← MERGED!
```
**This IS your hit merger:**
- Combines filtered + global + variants
- Tracks provenance
- Keeps best distance per sentence

**You don't need a separate `HitMerger` class** - it's already integrated into the deduplication logic.
```

---


## **SIMPLIFIED ARCHITECTURE:**
```
┌──────────────────────────────────────────────────┐
│ S3VectorsRetriever                               │
│  ├─ Generate variants (internal)                 │
│  ├─ Retrieve (filtered+global+variants)          │
│  ├─ Merge & Dedup (internal) ← HIT MERGER HERE   │
│  └─ Return RetrievalBundle                       │
└──────────────────────────────────────────────────┘
         ↓ bundle.union_hits (34 hits)
┌──────────────────────────────────────────────────┐
│ ContextWindowExpander                            │
│  └─ Expand ±3 windows, merge overlaps            │
└──────────────────────────────────────────────────┘
         ↓ spans (~35 spans)
┌──────────────────────────────────────────────────┐
│ TextFetcher                                      │
│  └─ Join to Stage 2 meta, concatenate text       │
└──────────────────────────────────────────────────┘
         ↓ blocks (~35 blocks)
┌──────────────────────────────────────────────────┐
│ BlockDeduplicator (Stage-2)                      │
│  └─ Remove overlapping windows                   │
└──────────────────────────────────────────────────┘
         ↓ unique_blocks (~25 blocks)
┌──────────────────────────────────────────────────┐
│ ContextAssembler                                 │
│  ├─ Sort by base_score                           │
│  ├─ Take top 10 blocks                           │
│  └─ Format with headers                          │
└──────────────────────────────────────────────────┘
         ↓ context_str (LLM prompt)
```


### Looks like:

- It's a list of bundled custom data class objects or custom S3Hit objects.
```
# After deduplication, union_hits looks like:
union_hits = [
    S3Hit(sentence_id="A", distance=0.15, sources={"filtered", "global"}, variant_ids={0,1}),
    S3Hit(sentence_id="B", distance=0.20, sources={"filtered"}, variant_ids={0,1,2}),
    S3Hit(sentence_id="C", distance=0.22, sources={"global"}, variant_ids={0}),
    # ... 31 more
]

# Then we create two filtered views:
filtered_hits = [h for h in union_hits if "filtered" in h.sources]
# Result: [Hit A, Hit B]  ← Hits that came from ANY filtered call

global_hits = [h for h in union_hits if "global" in h.sources]
# Result: [Hit A, Hit C]  ← Hits that came from ANY global call
```

### Final RetrievalBundle structure: It's like having it from set mathematics, I guess, because at any point I might need analysis later on. And it also helped really a lot during testing.
```
RetrievalBundle(
    filtered_hits=[...],  # "What did filtered calls contribute?"
    global_hits=[...],    # "What did global calls contribute?"
    union_hits=[...]      # "What's the complete deduplicated set?"
)
```

## POINTER: DOWNSTREAM:
- downstream processing: Window expansion only needs union_hits: 
- For evaluation & logging: these are very useful.
```python
@dataclass
class RetrievalBundle:
    filtered_hits: List[S3Hit]  # Analysis view: "filtered contributions"
    global_hits: List[S3Hit]    # Analysis view: "global contributions"
    union_hits: List[S3Hit]     # Processing view: "all unique hits"
    
    base_query: str             # Original query (for logging)
    variant_queries: List[str]  # Generated variants (for logging)
```

### Bare bones idea for next:

```python
spans = window_expander.expand(bundle.union_hits)  # ← Only union matters
blocks = text_fetcher.materialize_blocks(spans)
unique_blocks = deduplicator.deduplicate(blocks)
context = assembler.assemble(unique_blocks[:10])
```

- Immediately grab bundle.union_hits and ignore the rest
