## Post-PutVectors Validation and Sanity Checks - S3 Vectors Bucket

1. Test 1: list_vectors (sample) for ingestion and schema sanity. Confirms that the index is not empty and that the loader's PutVectors calls succeeded.
   
2. Test 2: get_vectors(keys=[…]) for id-mapping correctness. Takes a specific key from Stage 3 (sentenceID_numsurrogate) and round-trips it through the vector index: the returned key must equal the one inserted, and the attached sentenceID must match the source Parquet.
   
3. Test 3: query_vectors with a random 1024-d vector to confirm that the ANN path is live. The query plane works if the index accepts a valid-shape query vector and returns topK results.
   
4. Test 4: single-column equality filter (cik_int). Confirms that filtering works as expected and returns only the matching subset of vectors.
   - Distances are now present (~0.909–0.917…), confirming that the earlier “NULL distance” was caused by not requesting distances (returnDistance was not set).

5. Test 5: multi-column compound filters (cik_int, report_year, section_name). Confirms that compound filtering works as expected and returns only the matching subset of vectors.
   1. 5A: AND(cik==1276520, year∈{2016..2020}, section==ITEM_1A)
   2. 5B: range(2016 ≤ year ≤ 2020)
   3. 5C: AND(cik==1276520, OR(section==ITEM_7, ITEM_1A))
   4. 5D: negative control (filter on a non-filterable key), which returns a clean HTTP 400

6. Strong gold test design: Deterministic Neighbor Test (gold = local window in the same document section).
   - Builds an automatic gold set around a sample of anchors and evaluates Self@1, Hit@k, and MRR@k for the filtered and open regimes. It also lists the hardest cases.
   ``` 
   - anchor :: tuple (cik, year, section, key, pos). 
   - gold set G(anchor) = { sentences from the same (cik, year, section) whose sentence_pos lies in [pos−W, pos+W] \ {pos}, where W is a small window }
   - Self@1: the anchor itself is returned at rank 1 when you query with its embedding. Sanity check for ID alignment and distance.
   - Hit@k: does any member of the gold set G(anchor) appear within the top-k results, excluding the anchor itself? We report Hit@1, Hit@3, Hit@5.
   - MRR@k (Mean Reciprocal Rank): for each anchor, find the first rank r (excluding self) where a gold neighbor appears; contribute 1/r if r ≤ k, else 0. Average over anchors. (Rewards earlier hits more strongly than Hit@k.)
   - “Hardest cases”: anchors whose first gold hit rank is ∞ (no gold found in top-k) or very large.
   ```


### Understanding distance and metrics in S3 Vectors results:
- We were seeing distance = null because the API does not return distances unless you explicitly ask for them. In S3 Vectors, returnDistance defaults to false even when returnMetadata is true; you must set both flags when you want scores plus metadata in the same response.
- On filter grammar: S3 Vectors supports a JSON filter language with $and, $or, $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $exists.
- SQL-style and DynamoDB-style filter expressions failed.
- Redo: the tests below (a) request distances correctly, (b) exercise the official filter grammar (including compound AND/OR and ranges), (c) verify types and monotonic ranking, and (d) catch the permission edge case when you ask for metadata (you need both s3vectors:QueryVectors and s3vectors:GetVectors).
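
To cross-check that filtered hits really satisfy a filter, one option is to re-evaluate the filter on the client against each hit's returned metadata. The sketch below is a hypothetical helper: `matches_filter` is not part of boto3 or the S3 Vectors API, and its handling of missing fields is an assumption rather than documented service behavior.

```python
from typing import Any, Mapping

# Comparison operators from the filter grammar listed above.
_COMPARATORS = {
    "$eq":  lambda actual, expected: actual == expected,
    "$ne":  lambda actual, expected: actual != expected,
    "$gt":  lambda actual, expected: actual is not None and actual > expected,
    "$gte": lambda actual, expected: actual is not None and actual >= expected,
    "$lt":  lambda actual, expected: actual is not None and actual < expected,
    "$lte": lambda actual, expected: actual is not None and actual <= expected,
    "$in":  lambda actual, expected: actual in expected,
    "$nin": lambda actual, expected: actual not in expected,
}

def matches_filter(flt: Mapping[str, Any], metadata: Mapping[str, Any]) -> bool:
    """Client-side approximation (hypothetical): does `metadata` satisfy `flt`?"""
    for field, cond in flt.items():
        if field == "$and":
            if not all(matches_filter(sub, metadata) for sub in cond):
                return False
        elif field == "$or":
            if not any(matches_filter(sub, metadata) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # One field may carry several operators, e.g. {"$gte": 2016, "$lte": 2020}
            for op, expected in cond.items():
                if op == "$exists":
                    if (field in metadata) != bool(expected):
                        return False
                elif not _COMPARATORS[op](metadata.get(field), expected):
                    return False
        elif metadata.get(field) != cond:
            # Bare value means implicit equality, e.g. {"cik_int": 1276520}
            return False
    return True

# Toy metadata shaped like the fields used in this notebook.
hit = {"cik_int": 1276520, "report_year": 2018, "section_name": "ITEM_1A"}
flt_and = {"$and": [{"cik_int": {"$eq": 1276520}},
                    {"report_year": {"$in": [2016, 2017, 2018]}}]}
flt_range = {"report_year": {"$gte": 2019, "$lte": 2020}}

print(matches_filter(flt_and, hit))    # hit satisfies both predicates
print(matches_filter(flt_range, hit))  # 2018 is outside the range
```

Applying such a check to every returned hit gives the same guarantee as the per-column assertions in Tests 4 and 5, with one code path for all filter shapes.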
  


In [2]:
## Make the project root importable (for loaders/) and list the public methods of the s3vectors client.

import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Now import works
from loaders.ml_config_loader import MLConfig
print("✓ Config loader ready")

import boto3

s3vectors = boto3.client('s3vectors', region_name='us-east-1')

# List all available methods
methods = [m for m in dir(s3vectors) if not m.startswith('_')]
print("Available S3 Vectors methods:")
for m in sorted(methods):
    print(f"  - {m}")

✓ Config loader ready
Available S3 Vectors methods:
  - can_paginate
  - close
  - create_index
  - create_vector_bucket
  - delete_index
  - delete_vector_bucket
  - delete_vector_bucket_policy
  - delete_vectors
  - exceptions
  - generate_presigned_url
  - get_index
  - get_paginator
  - get_vector_bucket
  - get_vector_bucket_policy
  - get_vectors
  - get_waiter
  - list_indexes
  - list_vector_buckets
  - list_vectors
  - meta
  - put_vector_bucket_policy
  - put_vectors
  - query_vectors
  - waiter_names


In [3]:
# ============================================================================
# TEST 1: List Vectors - Verify Data Exists
# ============================================================================

import boto3
from loaders.ml_config_loader import MLConfig

config = MLConfig()
s3vectors = boto3.client("s3vectors", region_name=config.region,
                         aws_access_key_id=config.aws_access_key,
                         aws_secret_access_key=config.aws_secret_key)

VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME = "finrag-sentence-fact-embed-1024d"

print("="*70)
print("TEST 1: List Vectors (Sample)")
print("="*70)

try:
    # List first 10 vectors
    response = s3vectors.list_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        maxResults=10,
        returnData=True,       # Include vector data
        returnMetadata=True    # Include metadata
    )
    
    vectors = response.get('vectors', [])
    
    print(f"\n✓ Successfully retrieved {len(vectors)} vectors")
    print(f"  Total in index: ~203,076 (not returned by list API)")
    
    if vectors:
        print(f"\n[Sample Vector 1]")
        v = vectors[0]
        print(f"  Key: {v['key']}")
        print(f"  Data type: {v.get('data', {}).keys()}")
        print(f"  Vector dimensions: {len(v.get('data', {}).get('float32', []))}d")
        
        metadata = v.get('metadata', {})
        print(f"\n  Metadata ({len(metadata)} fields):")
        for key, value in metadata.items():
            value_str = str(value)[:50] + "..." if len(str(value)) > 50 else str(value)
            print(f"    - {key}: {value_str}")
        
        print(f"\n[Sample Vector 2]")
        if len(vectors) > 1:
            v2 = vectors[1]
            print(f"  Key: {v2['key']}")
            print(f"  CIK: {v2.get('metadata', {}).get('cik_int')}")
            print(f"  Year: {v2.get('metadata', {}).get('report_year')}")
            print(f"  Section: {v2.get('metadata', {}).get('section_name')}")
    
    print(f"\n✓ TEST 1 PASSED: Vectors exist and have correct structure")
    
except Exception as e:
    print(f"\n✗ TEST 1 FAILED: {e}")
    import traceback
    traceback.print_exc()

print("="*70)

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
TEST 1: List Vectors (Sample)

✓ Successfully retrieved 10 vectors
  Total in index: ~203,076 (not returned by list API)

[Sample Vector 1]
  Key: -2660538714400376423
  Data type: dict_keys(['float32'])
  Vector dimensions: 1024d

  Metadata (8 fields):
    - sentence_pos: 70
    - sic: 6351
    - section_sentence_count: 378
    - embedding_id: bedrock_cohere_v4_1024d_20251109_1355
    - section_name: ITEM_1A
    - report_year: 2018
    - cik_int: 890926
    - sentenceID: 0000890926_10-K_2018_section_1A_70

[Sample Vector 2]
  Key: 6984152936648599200
  CIK: 1065280
  Year: 2020
  Section: ITEM_6

✓ TEST 1 PASSED: Vectors exist and have correct structure


In [4]:
# ============================================================================
# TEST 2: Get Vector by Key - Direct Lookup
# ============================================================================

import boto3
from loaders.ml_config_loader import MLConfig
import polars as pl
from pathlib import Path

config = MLConfig()
s3vectors = boto3.client("s3vectors", region_name=config.region,
                         aws_access_key_id=config.aws_access_key,
                         aws_secret_access_key=config.aws_secret_key)

VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME = "finrag-sentence-fact-embed-1024d"

print("="*70)
print("TEST 2: Get Vector by Key")
print("="*70)

# Load Stage 3 to get a real key
cache_path = config.get_s3vectors_cache_path("cohere_1024d")
df_stage3 = pl.read_parquet(cache_path)
sample_key = str(df_stage3['sentenceID_numsurrogate'][0])
sample_sentenceID = df_stage3['sentenceID'][0]

print(f"\n[Test Parameters]")
print(f"  Key to retrieve: {sample_key}")
print(f"  Expected sentenceID: {sample_sentenceID}")

try:
    response = s3vectors.get_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        keys=[sample_key],
        returnData=True,
        returnMetadata=True
    )
    
    vectors = response.get('vectors', [])
    
    if len(vectors) == 0:
        print(f"\n✗ TEST 2 FAILED: Vector not found for key {sample_key}")
    else:
        v = vectors[0]
        retrieved_key = v['key']
        retrieved_sentenceID = v.get('metadata', {}).get('sentenceID')
        
        print(f"\n✓ Vector retrieved successfully")
        print(f"  Retrieved key: {retrieved_key}")
        print(f"  Retrieved sentenceID: {retrieved_sentenceID}")
        print(f"  Match: {retrieved_sentenceID == sample_sentenceID}")
        
        # Check embedding
        embedding = v.get('data', {}).get('float32', [])
        print(f"\n[Embedding Check]")
        print(f"  Dimension: {len(embedding)}d")
        print(f"  First 5 values: {embedding[:5]}")
        print(f"  Data type: {type(embedding[0]) if embedding else 'N/A'}")
        
        # Check all metadata
        metadata = v.get('metadata', {})
        print(f"\n[Metadata Check]")
        print(f"  Fields present: {list(metadata.keys())}")
        print(f"  cik_int: {metadata.get('cik_int')}")
        print(f"  report_year: {metadata.get('report_year')}")
        print(f"  section_name: {metadata.get('section_name')}")
        
        print(f"\n✓ TEST 2 PASSED: Vector retrieval works correctly")
        
except Exception as e:
    print(f"\n✗ TEST 2 FAILED: {e}")
    import traceback
    traceback.print_exc()

print("="*70)

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
TEST 2: Get Vector by Key

[Test Parameters]
  Key to retrieve: -6100147366912333961
  Expected sentenceID: 0001403161_10-K_2020_section_1A_90

✓ Vector retrieved successfully
  Retrieved key: -6100147366912333961
  Retrieved sentenceID: 0001403161_10-K_2020_section_1A_90
  Match: True

[Embedding Check]
  Dimension: 1024d
  First 5 values: [0.0257568359375, 0.0015716552734375, 0.052490234375, -0.0189208984375, 0.0029754638671875]
  Data type: <class 'float'>

[Metadata Check]
  Fields present: ['section_name', 'sentence_pos', 'embedding_id', 'section_sentence_count', 'sentenceID', 'report_year', 'sic', 'cik_int']
  cik_int: 1403161
  report_year: 2020
  section_name: ITEM_1A

✓ TEST 2 PASSED: Vector retrieval works correctly


In [5]:
# ============================================================================
# TEST 2.1: Index Populated Test: Check Index Vector Count
# ============================================================================

import boto3
from loaders.ml_config_loader import MLConfig

config = MLConfig()
s3vectors = boto3.client("s3vectors", region_name=config.region,
                         aws_access_key_id=config.aws_access_key,
                         aws_secret_access_key=config.aws_secret_key)

VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME = "finrag-sentence-fact-embed-1024d"

print("="*70)
print("DIAGNOSTIC: Index Status Check")
print("="*70)

# Check index details
response = s3vectors.get_index(
    vectorBucketName=VECTOR_BUCKET,
    indexName=INDEX_NAME
)

index_data = response['index']

print(f"\n[Index Configuration]")
print(f"  Name: {index_data['indexName']}")
print(f"  Status: Check console for vector count")
print(f"  Created: {index_data['creationTime']}")
print(f"  Dimension: {index_data['dimension']}")
print(f"  Distance: {index_data['distanceMetric']}")

# Try list_vectors to see if ANY vectors exist
print(f"\n[Attempting to list vectors...]")
try:
    list_response = s3vectors.list_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        maxResults=5
    )
    
    vectors = list_response.get('vectors', [])
    print(f"  Vectors found: {len(vectors)}")
    
    if vectors:
        print(f"  ✓ Index contains vectors")
        print(f"  Sample key: {vectors[0]['key']}")
    else:
        print(f"  ✗ Index appears EMPTY!")
        print(f"  This explains zero query results")
        
except Exception as e:
    print(f"  Error listing vectors: {e}")

print("="*70)

[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
DIAGNOSTIC: Index Status Check

[Index Configuration]
  Name: finrag-sentence-fact-embed-1024d
  Status: Check console for vector count
  Created: 2025-11-10 14:52:15-05:00
  Dimension: 1024
  Distance: cosine

[Attempting to list vectors...]
  Vectors found: 5
  ✓ Index contains vectors
  Sample key: -2660538714400376423


In [6]:
# ============================================================================
# TEST 3: Query Vectors - DataFrame Output
# ============================================================================


import boto3, numpy as np, polars as pl
from loaders.ml_config_loader import MLConfig

cfg = MLConfig()
s3v = boto3.client("s3vectors", region_name=cfg.region,
                   aws_access_key_id=cfg.aws_access_key,
                   aws_secret_access_key=cfg.aws_secret_key)

VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME    = "finrag-sentence-fact-embed-1024d"
DIM           = 1024

# random 1024-d probe just to test shape/latency; replace with a real query embed to validate semantics
probe = np.random.randn(DIM).astype(np.float32).tolist()

resp = s3v.query_vectors(
    vectorBucketName=VECTOR_BUCKET,
    indexName=INDEX_NAME,
    queryVector={"float32": probe},
    topK=10,
    returnMetadata=True,     # include  cik/year/section metadata
    returnDistance=True      # <-- REQUIRED to see numeric distances
)

rows = []
for i, v in enumerate(resp.get("vectors", []), 1):
    rows.append({
        "rank": i,
        "key": v["key"],
        "distance": v.get("distance"),
        "cik_int": v.get("metadata", {}).get("cik_int"),
        "report_year": v.get("metadata", {}).get("report_year"),
        "section_name": v.get("metadata", {}).get("section_name")
    })

df = pl.DataFrame(rows)

# Assertions: distances present and monotonic nondecreasing by rank
assert df["distance"].is_not_null().all(), "distance missing; set returnDistance=True"
assert (df["distance"].diff().fill_null(0) >= 0).all(), "ranking not monotonic by distance"

print(df)



[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
shape: (10, 6)
┌──────┬──────────────────────┬──────────┬─────────┬─────────────┬──────────────┐
│ rank ┆ key                  ┆ distance ┆ cik_int ┆ report_year ┆ section_name │
│ ---  ┆ ---                  ┆ ---      ┆ ---     ┆ ---         ┆ ---          │
│ i64  ┆ str                  ┆ f64      ┆ i64     ┆ i64         ┆ str          │
╞══════╪══════════════════════╪══════════╪═════════╪═════════════╪══════════════╡
│ 1    ┆ 6322569510623725774  ┆ 0.886695 ┆ 1318605 ┆ 2016        ┆ ITEM_8       │
│ 2    ┆ -8217675960081250299 ┆ 0.888923 ┆ 1318605 ┆ 2018        ┆ ITEM_8       │
│ 3    ┆ -1701519652995191919 ┆ 0.888923 ┆ 1318605 ┆ 2017        ┆ ITEM_8       │
│ 4    ┆ -1663766618521785107 ┆ 0.888923 ┆ 1318605 ┆ 2019        ┆ ITEM_8       │
│ 5    ┆ -1315218437024154496 ┆ 0.892382 ┆ 1318605 ┆ 2016        ┆ ITEM_8       │
│ 6    ┆ 3163729756936668591  ┆ 0.892447 ┆ 1318605 ┆ 2020        ┆ ITEM_8       │
│ 7    ┆ -804885395003797

In [7]:
# ============================================================================
# TEST 4: Query Vectors - Simple Filter (Single Column)
# ============================================================================



import boto3, numpy as np, polars as pl
from loaders.ml_config_loader import MLConfig

cfg = MLConfig()
s3v = boto3.client("s3vectors", region_name=cfg.region,
                   aws_access_key_id=cfg.aws_access_key,
                   aws_secret_access_key=cfg.aws_secret_key)

VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME    = "finrag-sentence-fact-embed-1024d"
DIM           = 1024

# random 1024-d probe just to test shape/latency; replace with a real query embed to validate semantics
probe = np.random.randn(DIM).astype(np.float32).tolist()


flt = {"cik_int": 1276520}  # equality uses implicit $eq

resp = s3v.query_vectors(
    vectorBucketName=VECTOR_BUCKET,
    indexName=INDEX_NAME,
    queryVector={"float32": probe},
    topK=10,
    filter=flt,              # server-side filtering
    returnMetadata=True,
    returnDistance=True
)

hits = resp.get("vectors", [])
assert hits, "no hits returned"

# Every hit must satisfy cik_int == 1276520
assert all(h.get("metadata", {}).get("cik_int") == 1276520 for h in hits), "filter not applied on cik_int"
# Distances present and monotonic
dists = [h.get("distance") for h in hits]
assert all(d is not None for d in dists), "distance missing; set returnDistance=True"
assert all((dists[i] - dists[i-1]) >= 0 for i in range(1, len(dists))), "non-monotonic ranking"

print(pl.DataFrame([{
    "rank": i+1,
    "distance": round(dists[i], 6),
    "year": hits[i]["metadata"].get("report_year"),
    "section": hits[i]["metadata"].get("section_name"),
    "sic": hits[i]["metadata"].get("sic")
} for i in range(len(hits))]))




[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
shape: (10, 5)
┌──────┬──────────┬──────┬─────────┬──────┐
│ rank ┆ distance ┆ year ┆ section ┆ sic  │
│ ---  ┆ ---      ┆ ---  ┆ ---     ┆ ---  │
│ i64  ┆ f64      ┆ i64  ┆ str     ┆ str  │
╞══════╪══════════╪══════╪═════════╪══════╡
│ 1    ┆ 0.908764 ┆ 2020 ┆ ITEM_7  ┆ 6311 │
│ 2    ┆ 0.927659 ┆ 2020 ┆ ITEM_1  ┆ 6311 │
│ 3    ┆ 0.928635 ┆ 2020 ┆ ITEM_1  ┆ 6311 │
│ 4    ┆ 0.929487 ┆ 2019 ┆ ITEM_1A ┆ 6311 │
│ 5    ┆ 0.929487 ┆ 2020 ┆ ITEM_1A ┆ 6311 │
│ 6    ┆ 0.929487 ┆ 2018 ┆ ITEM_1A ┆ 6311 │
│ 7    ┆ 0.930932 ┆ 2016 ┆ ITEM_8  ┆ 6311 │
│ 8    ┆ 0.931147 ┆ 2018 ┆ ITEM_8  ┆ 6311 │
│ 9    ┆ 0.931147 ┆ 2016 ┆ ITEM_8  ┆ 6311 │
│ 10   ┆ 0.931147 ┆ 2017 ┆ ITEM_8  ┆ 6311 │
└──────┴──────────┴──────┴─────────┴──────┘


In [8]:
# ============================================================================
# TEST 5: Compound filter suite (5A–5D) for S3 Vectors
# - AND with IN set
# - Numeric range (gte/lte)
# - OR combined with AND
# - Negative: filter on a non-filterable key -> expect 400
# ============================================================================

import boto3, numpy as np, polars as pl
from botocore.exceptions import ClientError
from loaders.ml_config_loader import MLConfig

# --- Config & client ---
cfg = MLConfig()
s3v = boto3.client(
    "s3vectors",
    region_name=cfg.region,
    aws_access_key_id=cfg.aws_access_key,
    aws_secret_access_key=cfg.aws_secret_key,
)

VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME    = "finrag-sentence-fact-embed-1024d"
DIM           = 1024

# Random probe just to exercise the path; replace with a real query embedding to test semantic quality
probe = np.random.randn(DIM).astype(np.float32).tolist()
qvec = {"float32": probe}

# --- Utilities ---

def run_query(name, flt, topk=20, show_cols=("distance","cik_int","report_year","section_name"), max_rows=10):
    """Run a query with distance+metadata, return Polars DataFrame and raw hits."""
    print(f"\n===== {name} =====")
    resp = s3v.query_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        queryVector=qvec,
        topK=topk,
        filter=flt,
        returnMetadata=True,
        returnDistance=True,
    )
    hits = resp.get("vectors", [])
    print(f"Hits: {len(hits)}")

    # Build a compact table for inspection
    rows = []
    for i, v in enumerate(hits, 1):
        md = v.get("metadata", {}) or {}
        row = {"rank": i, "distance": v.get("distance")}
        row.update({ "cik_int": md.get("cik_int"),
                     "report_year": md.get("report_year"),
                     "section_name": md.get("section_name"),
                     "sic": md.get("sic") })
        rows.append(row)

    df = pl.DataFrame(rows)
    if len(df) > 0:
        # Keep a small selection of columns for display
        cols = ["rank"] + [c for c in show_cols if c in df.columns]
        print(df.select(cols).head(max_rows))
    else:
        print("No results.")

    # Core invariants
    if len(df) > 0:
        assert df["distance"].is_not_null().all(), "distance missing; set returnDistance=True"
        dist = df["distance"].to_list()
        assert all((dist[i] - dist[i-1]) >= -1e-12 for i in range(1, len(dist))), "ranking not monotonic by distance"

    return df, hits

def expect_400_on_filter(name, flt):
    """Run a query expected to fail due to non-filterable metadata; assert HTTP 400."""
    print(f"\n===== {name} (expect 400) =====")
    try:
        s3v.query_vectors(
            vectorBucketName=VECTOR_BUCKET,
            indexName=INDEX_NAME,
            queryVector=qvec,
            topK=5,
            filter=flt,
            returnMetadata=True,
            returnDistance=True,
        )
        raise AssertionError("Expected 400 when filtering on a non-filterable key, but query succeeded")
    except ClientError as e:
        status = e.response.get("ResponseMetadata", {}).get("HTTPStatusCode", 0)
        if status != 400:
            raise AssertionError(f"Expected HTTP 400, got {status}")
        print("Got expected HTTP 400 for non-filterable filter.")

# --- 5A. AND of three predicates (CIK, year IN set, section exact) ---
flt_5A = {
    "$and": [
        {"cik_int": {"$eq": 1276520}},
        {"report_year": {"$in": [2016, 2017, 2018, 2019, 2020]}},
        {"section_name": {"$eq": "ITEM_1A"}},
    ]
}
df_5A, hits_5A = run_query("5A: AND(cik == 1276520, year ∈ {2016..2020}, section == ITEM_1A)", flt_5A)
if len(df_5A) > 0:
    assert (df_5A["cik_int"] == 1276520).all(), "CIK mismatch in 5A"
    assert df_5A["report_year"].is_in([2016, 2017, 2018, 2019, 2020]).all(), "Year not in set in 5A"
    assert (df_5A["section_name"] == "ITEM_1A").all(), "Section mismatch in 5A"
print("5A: PASS")

# --- 5B. Numeric range only (year between 2016 and 2020 inclusive) ---
flt_5B = {"report_year": {"$gte": 2016, "$lte": 2020}}
df_5B, hits_5B = run_query("5B: range(2016 ≤ year ≤ 2020)", flt_5B)
if len(df_5B) > 0:
    years = df_5B["report_year"].drop_nulls().to_list()
    assert all(2016 <= y <= 2020 for y in years), "Year out of range in 5B"
print("5B: PASS")

# --- 5C. OR between two sections, combined with a CIK ---
flt_5C = {
    "$and": [
        {"cik_int": {"$eq": 1276520}},
        {"$or": [
            {"section_name": {"$eq": "ITEM_7"}},
            {"section_name": {"$eq": "ITEM_1A"}},
        ]}
    ]
}
df_5C, hits_5C = run_query("5C: AND(cik == 1276520, OR(section == ITEM_7, ITEM_1A))", flt_5C)
if len(df_5C) > 0:
    allowed = {"ITEM_7", "ITEM_1A"}
    assert (df_5C["cik_int"] == 1276520).all(), "CIK mismatch in 5C"
    assert set(df_5C["section_name"].drop_nulls().to_list()).issubset(allowed), "Section mismatch in 5C"
print("5C: PASS")

# --- 5D. Negative: filtering on a non-filterable key should 400 ---
# Adjust the key below to one that was actually configured as non-filterable at index creation time.
# From the config: sentenceID, embedding_id, and section_sentence_count were non-filterable.
flt_5D = {"embedding_id": {"$eq": "some-id"}}
expect_400_on_filter("5D: filter on non-filterable metadata (embedding_id)", flt_5D)
print("5D: PASS")

print("\nAll Test 5 sub-cases completed.")


[DEBUG] ✓ AWS credentials loaded from aws_credentials.env

===== 5A: AND(cik == 1276520, year ∈ {2016..2020}, section == ITEM_1A) =====
Hits: 20
shape: (10, 5)
┌──────┬──────────┬─────────┬─────────────┬──────────────┐
│ rank ┆ distance ┆ cik_int ┆ report_year ┆ section_name │
│ ---  ┆ ---      ┆ ---     ┆ ---         ┆ ---          │
│ i64  ┆ f64      ┆ i64     ┆ i64         ┆ str          │
╞══════╪══════════╪═════════╪═════════════╪══════════════╡
│ 1    ┆ 0.920175 ┆ 1276520 ┆ 2020        ┆ ITEM_1A      │
│ 2    ┆ 0.923693 ┆ 1276520 ┆ 2019        ┆ ITEM_1A      │
│ 3    ┆ 0.924222 ┆ 1276520 ┆ 2018        ┆ ITEM_1A      │
│ 4    ┆ 0.924348 ┆ 1276520 ┆ 2016        ┆ ITEM_1A      │
│ 5    ┆ 0.924348 ┆ 1276520 ┆ 2017        ┆ ITEM_1A      │
│ 6    ┆ 0.927017 ┆ 1276520 ┆ 2016        ┆ ITEM_1A      │
│ 7    ┆ 0.927017 ┆ 1276520 ┆ 2017        ┆ ITEM_1A      │
│ 8    ┆ 0.927017 ┆ 1276520 ┆ 2019        ┆ ITEM_1A      │
│ 9    ┆ 0.927017 ┆ 1276520 ┆ 2018        ┆ ITEM_1A      │
│ 10   ┆ 0.932

### Gold test / “quality” tests on top:
(1) a deterministic neighbor test that uses a small gold set of sentences and their known nearest neighbors to compute Hit@k, and
(2) potentially later, a distance calibration script or a Platt-style mapping, so that low-confidence hits can be rejected consistently across CIK/year slices.

### Design - Attempt Gold Test P1, Gold Test P2, Gold Test P3 (Business Realism). 
### Gold Test P1 Explanation: 
- We build a gold set automatically from our own corpus using structure we already have (cik_int, report_year, section_name, sentence_pos).
- For each anchor sentence, “true neighbors” are defined as sentences from the same document slice (same cik_int, report_year, and section_name) that lie within a small positional window (e.g., ±3) around sentence_pos.
- This exploits the fact that adjacent sentences in the same section are very likely to be semantically close. It gives us deterministic, explainable positives without hand labels.
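
As a minimal illustration of the gold-window rule, the sketch below builds the gold set for one anchor on a toy in-memory corpus. The `Sentence` tuple and `gold_window` function are hypothetical names used only for this example; the notebook cell further down does the same thing against the Stage 3 Parquet with Polars.

```python
from typing import NamedTuple

class Sentence(NamedTuple):
    key: int
    cik: int
    year: int
    section: str
    pos: int

def gold_window(corpus, anchor, window=3):
    """Keys in the anchor's (cik, year, section) slice within ±window positions, anchor excluded."""
    return {
        s.key
        for s in corpus
        if (s.cik, s.year, s.section) == (anchor.cik, anchor.year, anchor.section)
        and 0 < abs(s.pos - anchor.pos) <= window
    }

# Ten sentences from one issuer/section, plus one sentence from another issuer
# at the same position, which must never enter the gold set.
corpus = [Sentence(key=100 + p, cik=1, year=2020, section="ITEM_1A", pos=p) for p in range(10)]
corpus.append(Sentence(key=999, cik=2, year=2020, section="ITEM_1A", pos=5))

anchor = corpus[5]  # pos 5, key 105
gold = gold_window(corpus, anchor, window=3)
print(sorted(gold))  # [102, 103, 104, 106, 107, 108]
```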

**What this measures:**
- Self@1: whether the index returns the exact anchor as the top-1 neighbor (sanity and id/shape check).
- Hit@k vs the gold window: with and without a server-side filter.
- Filtered regime restricts candidates to the same (cik, year, section)—tests pure ranking quality.
- Open regime allows global candidates—tests robustness when relevant text competes with other issuers/years/sections.
- MRR@k and distance stats: median, percentile bands to see calibration drift.
- Hard negatives: we also report the best-ranked false positive so we can spot systematic confusions (e.g., “ITEM_7” bleeding into “ITEM_1A”).
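
The distance bands and the best-ranked false positive can be computed with the standard library alone. This is a sketch under assumptions: `distance_bands` and `best_false_positive` are hypothetical helper names, the 10th/90th percentile band is one arbitrary choice, and the numbers are toy values rather than results from the index.

```python
import statistics

def distance_bands(distances):
    """Median plus 10th/90th percentile of a list of distances (needs at least 2 values)."""
    deciles = statistics.quantiles(distances, n=10, method="inclusive")
    return {"p10": deciles[0], "median": statistics.median(distances), "p90": deciles[-1]}

def best_false_positive(ranked_keys, gold, anchor_key):
    """Raw 1-based rank and key of the first hit that is neither the anchor nor a gold neighbor."""
    for rank, key in enumerate(ranked_keys, start=1):
        if key != anchor_key and key not in gold:
            return rank, key
    return None  # every non-anchor hit was a gold neighbor

# Toy example: anchor 105 with gold neighbors {104, 106}.
ranked_keys = [105, 104, 777, 106]
distances = [0.10, 0.42, 0.45, 0.47, 0.52]

print(distance_bands(distances))
print(best_false_positive(ranked_keys, {104, 106}, 105))  # (3, 777)
```

Comparing the bands across CIK/year slices is one way to spot calibration drift, and looking up the metadata of the returned false-positive key shows which section or issuer is bleeding in.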

--------------------------------------------------------------------------------------------------------------------
- **Deeper Explanation:** for any anchor sentence, its closest “true” neighbors should be nearby sentences from the same issuer (cik_int), same year, and same section, typically within a small positional window around the anchor’s sentence_pos.
- That gives us **inexpensive, deterministic ground truth without hand labels.**
- anchor :: tuple (cik, year, section, key, pos).
- gold set G(anchor) = { sentences from the same (cik, year, section) whose sentence_pos lies in [pos−W, pos+W] \ {pos}, where W is a small window }
- Self@1: the anchor itself is returned at rank 1 when we query with its embedding. Sanity check for ID alignment and distance.
- Hit@k: does any member of the gold set G(anchor) appear within the top-k results, excluding the anchor itself? We report Hit@1, Hit@3, Hit@5.
- MRR@k (Mean Reciprocal Rank): for each anchor, find the first rank r (excluding self) where a gold neighbor appears; contribute 1/r if r ≤ k, else 0. Average over anchors. (Rewards earlier hits more strongly than Hit@k.)
- “Hardest cases”: anchors whose first gold hit rank is ∞ (no gold found in top-k) or very large.
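
A small worked example of Hit@k and MRR@k on toy ranked lists may help make the definitions concrete. It assumes the reading that ranks are counted after removing the anchor from the result list, so a gold neighbor directly behind the anchor has rank 1. The function names are illustrative and the keys are made up.

```python
def first_gold_rank(ranked_keys, gold, anchor_key):
    """1-based rank of the first gold neighbor, counted after dropping the anchor itself."""
    rank = 0
    for key in ranked_keys:
        if key == anchor_key:
            continue  # the anchor does not consume a rank
        rank += 1
        if key in gold:
            return rank
    return None  # no gold neighbor in the list (rank treated as infinite)

def hit_at_k(first_ranks, k):
    """Fraction of anchors whose first gold rank is within k."""
    return sum(1 for r in first_ranks if r is not None and r <= k) / len(first_ranks)

def mrr_at_k(first_ranks, k):
    """Mean of 1/r over anchors, contributing 0 when r is missing or beyond k."""
    return sum(1.0 / r if r is not None and r <= k else 0.0 for r in first_ranks) / len(first_ranks)

# Three toy anchors: (anchor key, gold set, ranked result keys)
cases = [
    (105, {104, 106}, [105, 104, 777]),       # gold right behind the anchor -> rank 1
    (205, {204, 206}, [205, 900, 901, 206]),  # two false positives first    -> rank 3
    (305, {304},      [305, 1, 2]),           # no gold returned             -> None
]
first_ranks = [first_gold_rank(keys, gold, anchor) for anchor, gold, keys in cases]

print(first_ranks)                  # [1, 3, None]
print(hit_at_k(first_ranks, 1))     # 1 of 3 anchors
print(hit_at_k(first_ranks, 3))     # 2 of 3 anchors
print(mrr_at_k(first_ranks, 5))     # (1 + 1/3 + 0) / 3
```

The third anchor is what the list above calls a hardest case: it contributes 0 to every metric and is worth inspecting by hand.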


In [None]:
# ============================================================================
# Deterministic Neighbor Test (Gold = local window in same doc section)

# - Builds automatic gold sets from Stage 3 Parquet
# - Computes Self@1, Hit@k, MRR@k for filtered and open regimes
# - Prints failure cases for inspection
# ============================================================================


import math
import statistics as stats
from pathlib import Path

import boto3
import numpy as np
import polars as pl
from botocore.exceptions import ClientError

from loaders.ml_config_loader import MLConfig


# ---------------------------
# Configuration
# ---------------------------
cfg = MLConfig()
VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME    = "finrag-sentence-fact-embed-1024d"
DIM           = cfg.s3vectors_dimensions("cohere_1024d")  # 1024

# -------------------------------------------------------------------------------------------
# WINDOW        = 3       # gold neighbor window on sentence_pos (±W)
# TOPK_OPEN     = 20      # topK for open-regime queries
# TOPK_FILT     = 20      # topK for filtered-regime queries
# MAX_ANCHORS   = 40      # number of anchors to evaluate (raise to 200+ later)
# SECTION_MINLEN= 40      # only sample sections with at least this many sentences
# SEED          = 7

WINDOW        = 5        # widen local gold to ±5 sentences
TOPK_FILT     = 30       # give the filtered regime more search budget
TOPK_OPEN     = 30       # mild bump for open regime to surface a few true locals
MAX_ANCHORS   = 60       # slightly larger sample for stability of metrics
SECTION_MINLEN= 20       # avoid tiny/fragmented sections as anchors/gold sources
SEED          = 7        # keep deterministic comparability

# Client
s3v = boto3.client("s3vectors",
                   region_name=cfg.region,
                   aws_access_key_id=cfg.aws_access_key,
                   aws_secret_access_key=cfg.aws_secret_key)

# Load Stage 3 cache (local Parquet)
stage3_path = cfg.get_s3vectors_cache_path("cohere_1024d")
df = pl.read_parquet(stage3_path)

# ---------------------------
# Helper functions
# ---------------------------
def cosine_distance(a, b):
    # For sanity checks only (we rely on service distances for metrics)
    a = np.asarray(a, dtype=np.float32); b = np.asarray(b, dtype=np.float32)
    na = np.linalg.norm(a); nb = np.linalg.norm(b)
    if na == 0 or nb == 0: return 1.0
    return 1.0 - float(np.dot(a, b) / (na * nb))

def pick_anchors(df, max_anchors=MAX_ANCHORS, section_minlen=SECTION_MINLEN, seed=SEED):
    # Choose anchors deterministically but spaced across sections/doc-years
    # 1) group by (cik, year, section), keep only reasonably long sections
    gcols = ["cik_int", "report_year", "section_name"]
    # pl.len() replaces the deprecated pl.count() in recent Polars releases
    sec_sizes = df.group_by(gcols).agg(pl.len().alias("n")).filter(pl.col("n") >= section_minlen)
    long_secs = sec_sizes.join(df, on=gcols, how="inner")

    # Stable order ensures determinism
    long_secs = long_secs.sort(by=["cik_int","report_year","section_name","sentence_pos"])

    # stride-pick within each section to avoid adjacent anchors
    anchors = []
    for (cik, yr, sec), sub in long_secs.group_by(gcols, maintain_order=True):
        arr = sub.select(["sentenceID_numsurrogate","sentence_pos"]).to_dicts()
        # aim for roughly 5 anchors per section, but never step by fewer than 4 rows
        stride = max(len(arr) // 5, 4)
        for idx in range(0, len(arr), stride):
            anchors.append((cik, yr, sec, arr[idx]["sentenceID_numsurrogate"], arr[idx]["sentence_pos"]))
            if len(anchors) >= max_anchors:
                break
        if len(anchors) >= max_anchors:
            break
    return anchors

def gold_neighbors(df, anchor_key, cik, yr, sec, pos, window=WINDOW):
    # All neighbors in the same doc slice with |pos' - pos| in (1..W)
    sub = df.filter(
        (pl.col("cik_int")==cik) &
        (pl.col("report_year")==yr) &
        (pl.col("section_name")==sec) &
        (pl.col("sentence_pos").is_between(pos-window, pos+window))
    ).select(["sentenceID_numsurrogate","sentence_pos"])
    gold = [int(k) for k in sub["sentenceID_numsurrogate"].to_list()
            if int(k) != int(anchor_key)]
    return set(gold)  # set for fast membership

def query_neighbors(query_vec, flt=None, topk=20):
    # Only send `filter` when one is given; the open regime calls this with flt=None
    kwargs = dict(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        queryVector={"float32": query_vec},
        topK=topk,
        returnMetadata=True,
        returnDistance=True,
    )
    if flt is not None:
        kwargs["filter"] = flt
    resp = s3v.query_vectors(**kwargs)
    hits = resp.get("vectors", [])
    # return (keys, distances, metadata list)
    keys = [int(h["key"]) for h in hits]
    dists = [h.get("distance") for h in hits]
    meta  = [h.get("metadata", {}) for h in hits]
    return keys, dists, meta

def mrr_at_k(ranks, k):
    # ranks is a list of 1-based rank, None where no hit
    rr = []
    for r in ranks:
        if r is None or r > k:
            rr.append(0.0)
        else:
            rr.append(1.0/float(r))
    return sum(rr)/len(rr) if rr else 0.0

# ---------------------------
# Build anchor set and run evaluation
# ---------------------------
anchors = pick_anchors(df, max_anchors=MAX_ANCHORS, section_minlen=SECTION_MINLEN, seed=SEED)

results = []
for (cik, yr, sec, key, pos) in anchors:
    row = df.filter(pl.col("sentenceID_numsurrogate")==key).select(
        ["embedding","sentenceID_numsurrogate","sentence_pos"]
    ).to_dicts()[0]
    qvec = np.asarray(row["embedding"], dtype=np.float32).tolist()

    # Gold set = window neighbors in same doc slice
    G = gold_neighbors(df, key, cik, yr, sec, pos, window=WINDOW)

    # Filtered regime: restrict to same (cik, year, section)
    flt = {"$and":[
        {"cik_int":{"$eq": int(cik)}},
        {"report_year":{"$eq": int(yr)}},
        {"section_name":{"$eq": str(sec)}}
    ]}
    keys_f, dists_f, meta_f = query_neighbors(qvec, flt=flt, topk=TOPK_FILT)
    # Open regime: no filter
    keys_o, dists_o, meta_o = query_neighbors(qvec, flt=None, topk=TOPK_OPEN)

    # Record metrics
    # Self@1: expect the anchor itself at rank 1 (service returns the same key)
    self1_f = (len(keys_f)>0 and keys_f[0]==key)
    self1_o = (len(keys_o)>0 and keys_o[0]==key)

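    # Hit@k against G, ignoring self.
    # Note: ranks are positions in the raw result list, so the anchor still occupies a slot.
    # When self is returned at rank 1, the best attainable rank for a gold neighbor is 2,
    # which caps Hit@1 at 0 and the reciprocal rank at 0.5 for that anchor.
    def first_hit_rank(keys, gold):
        for i,k_ in enumerate(keys, start=1):
            if k_==key:  # skip self
                continue
            if k_ in gold:
                return i
        return None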

    r1_f = first_hit_rank(keys_f, G)
    r1_o = first_hit_rank(keys_o, G)

    hit1_f = (r1_f is not None and r1_f<=1)
    hit3_f = (r1_f is not None and r1_f<=3)
    hit5_f = (r1_f is not None and r1_f<=5)
    hit1_o = (r1_o is not None and r1_o<=1)
    hit3_o = (r1_o is not None and r1_o<=3)
    hit5_o = (r1_o is not None and r1_o<=5)

    # Distances (exclude self when computing stats)
    def dist_wo_self(keys, dists):
        out = []
        for k_, d in zip(keys, dists):
            if k_==key: continue
            if d is not None: out.append(float(d))
        return out

    Df = dist_wo_self(keys_f, dists_f)
    Do = dist_wo_self(keys_o, dists_o)

    results.append({
        "cik": int(cik), "year": int(yr), "sec": sec, "anchor_pos": int(pos),
        "self@1_filt": int(self1_f), "self@1_open": int(self1_o),
        "hit1_filt": int(hit1_f), "hit3_filt": int(hit3_f), "hit5_filt": int(hit5_f),
        "hit1_open": int(hit1_o), "hit3_open": int(hit3_o), "hit5_open": int(hit5_o),
        "rank_first_hit_filt": (r1_f or math.inf),
        "rank_first_hit_open": (r1_o or math.inf),
        "dist_med_filt": (stats.median(Df) if Df else float("nan")),
        "dist_med_open": (stats.median(Do) if Do else float("nan")),
    })

res = pl.DataFrame(results)

# ---------------------------
# Summaries
# ---------------------------
def pct(x): 
    return f"{100.0*float(x):.1f}%"

N = len(res)
print(f"\nAnchors evaluated: {N}")
print("\n=== Filtered regime (same CIK/year/section) ===")
print(f"Self@1:  {pct(res['self@1_filt'].mean())}")
print(f"Hit@1:   {pct(res['hit1_filt'].mean())}")
print(f"Hit@3:   {pct(res['hit3_filt'].mean())}")
print(f"Hit@5:   {pct(res['hit5_filt'].mean())}")
mrr_f = mrr_at_k([int(r) if r!=math.inf else None for r in res['rank_first_hit_filt'].to_list()], k=TOPK_FILT)
print(f"MRR@{TOPK_FILT}: {mrr_f:.3f}")
print(f"Median distance (non-self): {res['dist_med_filt'].drop_nans().median():.3f}")  # empty results are stored as NaN, not null

print("\n=== Open regime (no filter) ===")
print(f"Self@1:  {pct(res['self@1_open'].mean())}")
print(f"Hit@1:   {pct(res['hit1_open'].mean())}")
print(f"Hit@3:   {pct(res['hit3_open'].mean())}")
print(f"Hit@5:   {pct(res['hit5_open'].mean())}")
mrr_o = mrr_at_k([int(r) if r!=math.inf else None for r in res['rank_first_hit_open'].to_list()], k=TOPK_OPEN)
print(f"MRR@{TOPK_OPEN}: {mrr_o:.3f}")
print(f"Median distance (non-self): {res['dist_med_open'].drop_nans().median():.3f}")  # empty results are stored as NaN, not null

# Hardest misses (top few where rank_first_hit is large)
hard_f = res.sort(["rank_first_hit_filt"], descending=True).head(5)
hard_o = res.sort(["rank_first_hit_open"], descending=True).head(5)
print("\n=== Hardest cases (filtered) ===")
print(hard_f.select(["cik","year","sec","anchor_pos","rank_first_hit_filt","dist_med_filt"]))
print("\n=== Hardest cases (open) ===")
print(hard_o.select(["cik","year","sec","anchor_pos","rank_first_hit_open","dist_med_open"]))


[DEBUG] ✓ AWS credentials loaded from aws_credentials.env





Anchors evaluated: 60

=== Filtered regime (same CIK/year/section) ===
Self@1:  96.7%
Hit@1:   0.0%
Hit@3:   58.3%
Hit@5:   66.7%
MRR@30: 0.311
Median distance (non-self): 0.559

=== Open regime (no filter) ===
Self@1:  58.3%
Hit@1:   0.0%
Hit@3:   0.0%
Hit@5:   0.0%
MRR@30: 0.036
Median distance (non-self): 0.376

=== Hardest cases (filtered) ===
shape: (5, 6)
┌───────┬──────┬─────────┬────────────┬─────────────────────┬───────────────┐
│ cik   ┆ year ┆ sec     ┆ anchor_pos ┆ rank_first_hit_filt ┆ dist_med_filt │
│ ---   ┆ ---  ┆ ---     ┆ ---        ┆ ---                 ┆ ---           │
│ i64   ┆ i64  ┆ str     ┆ i64        ┆ f64                 ┆ f64           │
╞═══════╪══════╪═════════╪════════════╪═════════════════════╪═══════════════╡
│ 34088 ┆ 2015 ┆ ITEM_15 ┆ 2          ┆ inf                 ┆ 0.607892      │
│ 34088 ┆ 2015 ┆ ITEM_2  ┆ 1          ┆ inf                 ┆ 0.593437      │
│ 34088 ┆ 2015 ┆ ITEM_2  ┆ 81         ┆ inf                 ┆ 0.590238      │
│ 34088 ┆ 2
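
The 0.0% Hit@1 in both regimes above is at least partly a counting effect: `first_hit_rank` skips the anchor but still numbers positions in the raw result list, so whenever the anchor comes back at rank 1 the earliest possible gold rank is 2. The sketch below contrasts the two ways of counting. The keys are hypothetical, and `first_hit_rank_excluding_self` is an illustrative helper, not part of the notebook.

```python
def first_hit_rank_raw(keys, gold, anchor_key):
    """Rank as counted in the cell above: self is skipped but still occupies a position."""
    for rank, k in enumerate(keys, start=1):
        if k == anchor_key:
            continue
        if k in gold:
            return rank
    return None


def first_hit_rank_excluding_self(keys, gold, anchor_key):
    """1-based rank of the first gold key, counted after removing the anchor from the list."""
    others = [k for k in keys if k != anchor_key]
    for rank, k in enumerate(others, start=1):
        if k in gold:
            return rank
    return None


# Hypothetical result list: 101 is the anchor, 102 and 103 are gold neighbors.
keys = [101, 102, 250, 103]
gold = {102, 103}

raw = first_hit_rank_raw(keys, gold, anchor_key=101)
adj = first_hit_rank_excluding_self(keys, gold, anchor_key=101)
print(raw, adj)  # 2 1
```

With the self-excluded count, a gold neighbor directly behind the anchor scores as Hit@1 and contributes a reciprocal rank of 1.0 instead of 0.5. Metrics computed this way would not be directly comparable with the numbers recorded above.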

### Gold Test P2: What changes if you include tiny sections

- Metric semantics drift. The gold window is built on sentence position, so sections with 3–20 sentences yield very few non-self "gold" neighbors. That pushes Hit@k down (no positives are available) or, if W is large relative to the section length, inflates Hit@k (everything becomes gold).
- Distance distribution skews. In small sections, the nearest non-self neighbors are often structural (headers, enumerations) rather than semantic. First hits therefore sit at higher median distances, and if those distances are used to calibrate an acceptance threshold, runtime RAG becomes more permissive than it should be.
- Anchor representativeness drops. Without stratification, a handful of short sections can dominate the anchor set, over-weighting boilerplate or footers and masking performance where retrieval actually matters (long narrative sections such as 1A, 7, 7A).
- It doesn't mirror production RAG. Production still queries across all sections, but success on tiny sections depends mostly on global retrieval plus reranking, not on local nearest-neighbor behavior inside a 10-line section.
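
The first bullet can be made concrete with a little arithmetic. The sketch below assumes sentence positions are contiguous and start at 0 (an assumption; the real `sentence_pos` column may have gaps), and the helper names are illustrative only. It shows that for a fixed window the gold set is empty in a one-sentence section, while in a short section it covers every other sentence.

```python
def gold_size(section_len, pos, window):
    """Count non-self sentences within ±window of pos, for positions 0..section_len-1."""
    lo = max(0, pos - window)
    hi = min(section_len - 1, pos + window)
    return (hi - lo + 1) - 1  # subtract the anchor itself


def gold_fraction(section_len, pos, window):
    """Share of the section's other sentences that count as gold."""
    others = section_len - 1
    return gold_size(section_len, pos, window) / others if others else 0.0


for n in (1, 3, 8, 40):
    pos = n // 2  # anchor in the middle of the section
    print(n, gold_size(n, pos, 5), round(gold_fraction(n, pos, 5), 2))
# 1 0 0.0
# 3 2 1.0
# 8 7 1.0
# 40 10 0.26
```

When the fraction is 1.0, any non-self result from a section-filtered query is a hit, so Hit@k says nothing about retrieval quality. When the gold set is empty, a hit is impossible.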

#### It's fair to "trust the embedding", but evaluation must separate two questions:

- Can the index find tight paraphrases inside a topical neighborhood? (primary suite)
- What happens when the neighborhood is too small to contain paraphrases? (small-sections suite)

### Gold Test P2: Concept upgrades
- Adaptive windowing (per-anchor): Instead of a fixed ±W, compute an effective window that expands just enough to yield a minimum number of candidate gold neighbors for that anchor, capped at a maximum width.
```
    Base window: W_BASE (e.g., 5), Max window cap: W_MAX (e.g., 12)
    Gold target: G_TARGET (e.g., at least 2 non-self neighbors)
    Logic: start at W_BASE; widen ±w until gold size ≥ G_TARGET or w == W_MAX or section boundaries reached.
```
- Cardinality-aware reporting: Small sections can have no available gold neighbors even after expansion, so anchors are split into two groups:
```
    Covered anchors (|G| ≥ 1): contribute to Hit@k and MRR.
    Uncovered anchors (|G| = 0): excluded from those aggregates, but reported via Coverage%.
```
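- Section-balanced anchor sampling: cap the number of anchors drawn from any one (cik, year, section) at MAX_PER_SEC so that no single section dominates the anchor set.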
- Bucket results by section_len (e.g., <10, 10–19, 20–39, 40+) and print bucket-level Coverage% and Hit@5.
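
The adaptive-window logic can be sketched in plain Python, including the "section boundaries reached" stop condition. This is a simplified illustration, not the notebook's implementation: `adaptive_gold` and its `positions` argument (the sentence positions present in the anchor's section) are assumptions made for the example.

```python
def adaptive_gold(positions, anchor_pos, w_base=5, w_max=12, g_target=2):
    """Return (gold_positions, effective_window) for one anchor."""
    lo_sec, hi_sec = min(positions), max(positions)
    gold, w = set(), w_base
    for w in range(w_base, w_max + 1):
        # Non-self sentences within ±w of the anchor
        gold = {p for p in positions if p != anchor_pos and abs(p - anchor_pos) <= w}
        if len(gold) >= g_target:
            break
        # The window already spans the whole section: widening cannot add neighbors
        if anchor_pos - w <= lo_sec and anchor_pos + w >= hi_sec:
            break
    return gold, w


long_gold, long_w = adaptive_gold(list(range(40)), anchor_pos=20)
sparse_gold, sparse_w = adaptive_gold([0, 9], anchor_pos=0)
single_gold, single_w = adaptive_gold([3], anchor_pos=3)

print(len(long_gold), long_w)    # 10 5
print(sparse_gold, sparse_w)     # {9} 9
print(single_gold, single_w)     # set() 5
```

The long section stops at the base window. The sparse section widens until the window reaches the section edge and still falls short of G_TARGET. The single-sentence section ends with an empty gold set, which is the case the cardinality-aware reporting is meant to expose.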

### Coverage results:
- Coverage% tells you how often adaptive windowing could assemble at least one non-self gold neighbor. If coverage is low in a bucket (e.g., <10), those sections simply don't contain enough nearby sentences to act as local paraphrases; you'll rely on global retrieval + reranking in production.
- Filtered metrics (covered anchors only) remain a measure of local semantic tightness, apples-to-apples with P1.
- The stratified table gives you immediate visibility into small sections without letting them redefine your main aggregates.
```
Keep: TOPK_FILT=30, TOPK_OPEN=30
P2-specific: W_BASE=5, W_MAX=12, G_TARGET=2, MAX_PER_SEC=3, MAX_ANCHORS=60
```
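
To show why uncovered anchors are reported separately, here is a minimal sketch of the aggregation on made-up data. The `summarize` helper and the toy records are hypothetical, not taken from the notebook. An anchor with an empty gold set can never score a hit, so including it in the denominator lowers Hit@k for reasons unrelated to retrieval quality.

```python
import math


def summarize(anchors, k=5):
    """anchors: list of dicts with 'gold_size' and 'rank' (first gold rank, or None)."""
    def is_hit(a):
        return a["rank"] is not None and a["rank"] <= k

    covered = [a for a in anchors if a["gold_size"] > 0]
    return {
        "coverage": len(covered) / len(anchors) if anchors else 0.0,
        "hit_covered": sum(is_hit(a) for a in covered) / len(covered) if covered else math.nan,
        "hit_all": sum(is_hit(a) for a in anchors) / len(anchors) if anchors else math.nan,
    }


toy = [
    {"gold_size": 3, "rank": 2},
    {"gold_size": 2, "rank": 4},
    {"gold_size": 1, "rank": None},   # covered, but no gold neighbor retrieved
    {"gold_size": 0, "rank": None},   # uncovered: a hit is impossible
]
summary = summarize(toy)
print(summary)
```

On this toy set coverage is 0.75, Hit@5 over covered anchors is 2/3, and Hit@5 over all anchors is 0.5. The gap between the last two numbers comes entirely from the uncovered anchor.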


In [3]:
# ============================================================================
# Gold Test P2 — Adaptive windowing, cardinality-aware metrics, small-sections included
# Keeps S3 Vectors usage identical; only the gold and reporting logic change.
# ============================================================================

import boto3, numpy as np, polars as pl, math, statistics as stats
from pathlib import Path
from botocore.exceptions import ClientError
from loaders.ml_config_loader import MLConfig

# ---------------------------
# Configuration (P2-specific)
# ---------------------------
cfg = MLConfig()
VECTOR_BUCKET = "finrag-embeddings-s3vectors"
INDEX_NAME    = "finrag-sentence-fact-embed-1024d"
DIM           = cfg.s3vectors_dimensions("cohere_1024d")

# Retrieval knobs (unchanged service behavior)
TOPK_FILT     = 30
TOPK_OPEN     = 30

# Anchor/gold construction knobs (P2)
W_BASE        = 5       # start here
W_MAX         = 12      # do not exceed this
G_TARGET      = 2       # try to have at least this many non-self gold neighbors
MAX_ANCHORS   = 60      # total anchors to evaluate
MAX_PER_SEC   = 3       # max anchors sampled per (cik,year,section)
SEED          = 7       # determinism

# Client + data
s3v = boto3.client("s3vectors",
                   region_name=cfg.region,
                   aws_access_key_id=cfg.aws_access_key,
                   aws_secret_access_key=cfg.aws_secret_key)

stage3_path = cfg.get_s3vectors_cache_path("cohere_1024d")
df = pl.read_parquet(stage3_path)

# Basic section size table
gcols = ["cik_int","report_year","section_name"]
sec_sizes = (df.group_by(gcols)
               .agg(pl.len().alias("section_len"))
               .sort(gcols))
df = df.join(sec_sizes, on=gcols, how="inner")

# ---------------------------
# Helpers
# ---------------------------
def query_vectors(query_vec, flt=None, topk=20):
    # Clamp per service contract
    topk = max(1, min(30, int(topk)))
    resp = s3v.query_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        queryVector={"float32": query_vec},
        topK=topk,
        filter=flt,
        returnMetadata=True,
        returnDistance=True
    )
    hits = resp.get("vectors", [])
    keys  = [int(h["key"]) for h in hits]
    dists = [h.get("distance") for h in hits]
    meta  = [h.get("metadata", {}) for h in hits]
    return keys, dists, meta

def effective_gold(df_slice, anchor_key, anchor_pos, w_base, w_max, g_target):
    """Expand ±window from w_base until we have >= g_target non-self neighbors or hit w_max."""
    sec_len = int(df_slice["section_len"][0])
    # Early exit: single-sentence sections
    if sec_len <= 1:
        return set(), w_base, sec_len
    # Expand window; `cand` always holds the neighbors found at the widest window tried
    # (assumes w_base <= w_max, as in the configuration above)
    cand = []
    for w in range(w_base, w_max+1):
        sub = df_slice.filter(pl.col("sentence_pos").is_between(anchor_pos - w, anchor_pos + w)) \
                      .select(["sentenceID_numsurrogate","sentence_pos"])
        cand = [int(k) for k in sub["sentenceID_numsurrogate"].to_list() if int(k) != int(anchor_key)]
        if len(cand) >= g_target:
            return set(cand), w, sec_len
    # Not enough candidates even at max window: return what the last (widest) pass found
    return set(cand), w_max, sec_len

def mrr_at_k(ranks, k):
    rr = []
    for r in ranks:
        if r is None or r > k:
            rr.append(0.0)
        else:
            rr.append(1.0/float(r))
    return sum(rr)/len(rr) if rr else 0.0

# ---------------------------
# Anchor sampling (section-balanced, no min-len cut)
# ---------------------------
# Deterministic order
df_sorted = df.sort(by=["cik_int","report_year","section_name","sentence_pos"])

anchors = []
per_sec_count = {}

for (cik, yr, sec), sub in df_sorted.group_by(gcols, maintain_order=True):
    keypos = sub.select(["sentenceID_numsurrogate","sentence_pos","section_len"]).to_dicts()
    # stride by section length to avoid adjacency; at least stride 4
    stride = max(4, len(keypos)//5)
    for idx in range(0, len(keypos), stride):
        if per_sec_count.get((cik,yr,sec), 0) >= MAX_PER_SEC:
            break
        anchors.append((int(cik), int(yr), str(sec),
                        int(keypos[idx]["sentenceID_numsurrogate"]),
                        int(keypos[idx]["sentence_pos"])))
        per_sec_count[(cik,yr,sec)] = per_sec_count.get((cik,yr,sec),0) + 1
        if len(anchors) >= MAX_ANCHORS:
            break
    if len(anchors) >= MAX_ANCHORS:
        break

# ---------------------------
# Evaluation
# ---------------------------
rows = []
for (cik, yr, sec, key, pos) in anchors:
    # Slice for this section once
    sec_slice = df.filter((pl.col("cik_int")==cik) & (pl.col("report_year")==yr) & (pl.col("section_name")==sec)) \
                  .select(["sentenceID_numsurrogate","sentence_pos","embedding","section_len"])

    # Anchor embedding
    arow = sec_slice.filter(pl.col("sentenceID_numsurrogate")==key).to_dicts()[0]
    qvec = np.asarray(arow["embedding"], dtype=np.float32).tolist()

    # Adaptive gold
    G, w_eff, sec_len = effective_gold(sec_slice, key, pos, W_BASE, W_MAX, G_TARGET)

    # Filtered query
    flt = {"$and":[
        {"cik_int":{"$eq": int(cik)}},
        {"report_year":{"$eq": int(yr)}},
        {"section_name":{"$eq": str(sec)}}
    ]}
    keys_f, dists_f, _ = query_vectors(qvec, flt=flt, topk=TOPK_FILT)
    keys_o, dists_o, _ = query_vectors(qvec, flt=None, topk=TOPK_OPEN)

    # Self@1
    self1_f = (len(keys_f)>0 and keys_f[0]==key)
    self1_o = (len(keys_o)>0 and keys_o[0]==key)

    # First non-self hit in gold
    def first_hit_rank(keys, gold):
        for i,k_ in enumerate(keys, start=1):
            if k_==key: continue
            if k_ in gold: return i
        return None

    r1_f = first_hit_rank(keys_f, G)
    r1_o = first_hit_rank(keys_o, G)

    # Distances without self
    def dist_wo_self(keys, dists):
        out = []
        for k_, d in zip(keys, dists):
            if k_==key: continue
            if d is not None: out.append(float(d))
        return out

    Df = dist_wo_self(keys_f, dists_f)
    Do = dist_wo_self(keys_o, dists_o)

    rows.append({
        "cik": cik, "year": yr, "sec": sec, "anchor_pos": pos,
        "section_len": sec_len, "w_eff": w_eff, "gold_size": len(G),
        "covered": int(len(G) > 0),
        "self@1_filt": int(self1_f), "self@1_open": int(self1_o),
        "r1_filt": (r1_f or math.inf), "r1_open": (r1_o or math.inf),
        "hit1_filt": int(r1_f is not None and r1_f<=1),
        "hit3_filt": int(r1_f is not None and r1_f<=3),
        "hit5_filt": int(r1_f is not None and r1_f<=5),
        "hit1_open": int(r1_o is not None and r1_o<=1),
        "hit3_open": int(r1_o is not None and r1_o<=3),
        "hit5_open": int(r1_o is not None and r1_o<=5),
        "dist_med_filt": (stats.median(Df) if Df else float("nan")),
        "dist_med_open": (stats.median(Do) if Do else float("nan")),
    })

res = pl.DataFrame(rows)

# ---------------------------
# Summaries (cardinality-aware)
# ---------------------------
def pct(x): return f"{100.0*float(x):.1f}%"

N_all = len(res)
N_cov = int(res["covered"].sum())
coverage = N_cov / max(1, N_all)

print(f"\nAnchors evaluated: {N_all}")
print(f"Coverage (anchors with ≥1 gold after adaptive window): {pct(coverage)}")

# Filtered (covered only)
cov = res.filter(pl.col("covered")==1)
def mrr(series_ranks, k): 
    return mrr_at_k([int(r) if r!=math.inf else None for r in series_ranks], k)

print("\n=== Filtered regime (same CIK/year/section) — covered anchors only ===")
if len(cov) > 0:
    print(f"Self@1:  {pct(cov['self@1_filt'].mean())}")
    print(f"Hit@1:   {pct(cov['hit1_filt'].mean())}")
    print(f"Hit@3:   {pct(cov['hit3_filt'].mean())}")
    print(f"Hit@5:   {pct(cov['hit5_filt'].mean())}")
    print(f"MRR@{TOPK_FILT}: {mrr(cov['r1_filt'], TOPK_FILT):.3f}")
    print(f"Median distance (non-self): {cov['dist_med_filt'].drop_nans().median():.3f}")  # empty results are stored as NaN, not null
else:
    print("No covered anchors (unexpected with adaptive window).")

# Open (covered only)
print("\n=== Open regime (no filter) — covered anchors only ===")
if len(cov) > 0:
    print(f"Self@1:  {pct(cov['self@1_open'].mean())}")
    print(f"Hit@1:   {pct(cov['hit1_open'].mean())}")
    print(f"Hit@3:   {pct(cov['hit3_open'].mean())}")
    print(f"Hit@5:   {pct(cov['hit5_open'].mean())}")
    print(f"MRR@{TOPK_OPEN}: {mrr(cov['r1_open'], TOPK_OPEN):.3f}")
    print(f"Median distance (non-self): {cov['dist_med_open'].drop_nans().median():.3f}")  # empty results are stored as NaN, not null
else:
    print("No covered anchors.")

# Hardest cases (covered only, by largest rank)
hard_f = cov.sort(["r1_filt"], descending=True).head(5)
hard_o = cov.sort(["r1_open"], descending=True).head(5)
print("\n=== Hardest cases (filtered, covered) ===")
print(hard_f.select(["cik","year","sec","anchor_pos","section_len","w_eff","gold_size","r1_filt","dist_med_filt"]))
print("\n=== Hardest cases (open, covered) ===")
print(hard_o.select(["cik","year","sec","anchor_pos","section_len","w_eff","gold_size","r1_open","dist_med_open"]))

# Stratified by section length (light view)
def bucket_len(n):
    if n < 10: return "<10"
    if n < 20: return "10-19"
    if n < 40: return "20-39"
    return "40+"

res_b = res.with_columns(pl.col("section_len").map_elements(bucket_len, return_dtype=pl.Utf8).alias("len_bucket"))
grp = (res_b.group_by("len_bucket")
           .agg([
               pl.len().alias("anchors"),
               pl.col("covered").mean().alias("coverage"),
               pl.col("hit5_filt").mean().alias("hit5_filt"),
               pl.col("hit5_open").mean().alias("hit5_open"),
           ])
           .sort("len_bucket"))
print("\n=== Stratified summary by section length ===")
print(grp)


[DEBUG] ✓ AWS credentials loaded from aws_credentials.env





Anchors evaluated: 60
Coverage (anchors with ≥1 gold after adaptive window): 66.7%

=== Filtered regime (same CIK/year/section) — covered anchors only ===
Self@1:  100.0%
Hit@1:   0.0%
Hit@3:   75.0%
Hit@5:   80.0%
MRR@30: 0.379
Median distance (non-self): 0.540

=== Open regime (no filter) — covered anchors only ===
Self@1:  62.5%
Hit@1:   0.0%
Hit@3:   0.0%
Hit@5:   2.5%
MRR@30: 0.032
Median distance (non-self): 0.333

=== Hardest cases (filtered, covered) ===
shape: (5, 9)
┌───────┬──────┬─────────┬────────────┬───┬───────┬───────────┬─────────┬───────────────┐
│ cik   ┆ year ┆ sec     ┆ anchor_pos ┆ … ┆ w_eff ┆ gold_size ┆ r1_filt ┆ dist_med_filt │
│ ---   ┆ ---  ┆ ---     ┆ ---        ┆   ┆ ---   ┆ ---       ┆ ---     ┆ ---           │
│ i64   ┆ i64  ┆ str     ┆ i64        ┆   ┆ i64   ┆ i64       ┆ f64     ┆ f64           │
╞═══════╪══════╪═════════╪════════════╪═══╪═══════╪═══════════╪═════════╪═══════════════╡
│ 34088 ┆ 2015 ┆ ITEM_1  ┆ 19         ┆ … ┆ 5     ┆ 3         ┆ inf 

