# DAT Dataset Generation: SQuAD Reduced Dataset

This notebook builds a reduced dataset, inspired by the DAT (Dynamic Alpha Tuning) paper, from SQuAD v1.1 already converted to BEIR format.

## Objective

Create a reduced dataset with:
- **Reduced corpus**: ~585 paragraphs
- **Compatible queries**: ~2,976 questions whose answers are in those paragraphs
- **Hybrid-Sensitive subset**: ~1,111 queries where BM25 top-1 ≠ Dense top-1

These figures are targets. The run recorded in this notebook yields 585 paragraphs, 2,823 queries, and 1,273 hybrid-sensitive queries.

Everything is saved to `data/squad_small/processed/beir/` following the standard BEIR format.


## 1. Setup & Configuration

Initial configuration: imports, constants, and paths.


In [1]:
import sys
from pathlib import Path
import pandas as pd
import hashlib
import json
from datetime import datetime
import random

# Add repo root to path
repo_root = Path.cwd().parent if Path.cwd().name == "experiments" else Path.cwd()
repo_root_str = str(repo_root)
if repo_root_str not in sys.path:
    sys.path.insert(0, repo_root_str)

from src.datasets.loader import load_beir_dataset, as_documents, as_queries
from src.retrievers.bm25_basic import BM25Basic
from src.retrievers.dense_faiss import DenseFaiss

# Configuration constants
SEED = 42
TARGET_DOCS = 585
EXPECTED_QUERIES = 2976
EXPECTED_HYBRID_SENSITIVE = 1111
K = 20  # Retrieval depth for hybrid-sensitive identification

# Paths
input_root = repo_root / "data" / "squad" / "processed" / "beir"
output_root = repo_root / "data" / "squad_small" / "processed" / "beir"
output_root.mkdir(parents=True, exist_ok=True)

# Dense model configuration (can be changed)
DENSE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Local model by default
# Alternative: "text-embedding-3-large" with provider="openai" (requires API key)

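# Set seed for reproducibility
random.seed(SEED)
# Seed NumPy unconditionally instead of relying on a pandas attribute check;
# NumPy is a hard dependency of pandas, so the import is always available.
import numpy as np
np.random.seed(SEED)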

print(f"✅ Setup complete")
print(f"   Input:  {input_root}")
print(f"   Output: {output_root}")
print(f"   Seed:   {SEED}")

✅ Setup complete
   Input:  /Users/thiago/Documents/GitHub/hybrid-retrieval/data/squad/processed/beir
   Output: /Users/thiago/Documents/GitHub/hybrid-retrieval/data/squad_small/processed/beir
   Seed:   42


## 2. Stage A: Load & Validate Input Data

Load and validate the original SQuAD data.


In [2]:
def md5sum(path: Path) -> str:
    """Compute MD5 hash of a file."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Load SQuAD dataset
print("📥 Loading SQuAD dataset...")
corpus, queries, qrels = load_beir_dataset(input_root)

print(f"\n✅ Loaded:")
print(f"   Corpus:  {len(corpus):,} documents")
print(f"   Queries: {len(queries):,} queries")
print(f"   Qrels:   {len(qrels):,} pairs")

# Validate columns
print("\n🔍 Validating structure...")
required_corpus_cols = ["doc_id", "title", "text"]
required_queries_cols = ["query_id", "query"]
required_qrels_cols = ["query_id", "doc_id", "score", "split"]

missing_corpus = set(required_corpus_cols) - set(corpus.columns)
missing_queries = set(required_queries_cols) - set(queries.columns)
missing_qrels = set(required_qrels_cols) - set(qrels.columns)

if missing_corpus or missing_queries or missing_qrels:
    raise ValueError(f"Missing columns: corpus={missing_corpus}, queries={missing_queries}, qrels={missing_qrels}")

# Check uniqueness
print("\n🔍 Checking uniqueness...")
duplicate_docs = corpus["doc_id"].duplicated().sum()
duplicate_queries = queries["query_id"].duplicated().sum()

if duplicate_docs > 0:
    print(f"   ⚠️  Warning: {duplicate_docs} duplicate doc_ids found")
if duplicate_queries > 0:
    print(f"   ⚠️  Warning: {duplicate_queries} duplicate query_ids found")

# Verify 1 qrel per query (handle duplicates)
print("\n🔍 Checking qrels integrity...")
qrels_per_query = qrels.groupby("query_id").size()
multi_qrels = (qrels_per_query > 1).sum()
if multi_qrels > 0:
    print(f"   ⚠️  Warning: {multi_qrels} queries have multiple qrels")
    print(f"   Keeping first qrel for each query")
    qrels = qrels.drop_duplicates(subset=["query_id"], keep="first")
    print(f"   Qrels after deduplication: {len(qrels):,}")

# Check for nulls/empty strings
print("\n🔍 Checking data quality...")
null_docs = corpus[["doc_id", "text"]].isnull().any(axis=1).sum()
null_queries = queries[["query_id", "query"]].isnull().any(axis=1).sum()
empty_texts = (corpus["text"].astype(str).str.strip() == "").sum()
empty_queries = (queries["query"].astype(str).str.strip() == "").sum()

if null_docs > 0:
    print(f"   ⚠️  Warning: {null_docs} documents with null critical fields")
if null_queries > 0:
    print(f"   ⚠️  Warning: {null_queries} queries with null critical fields")
if empty_texts > 0:
    print(f"   ⚠️  Warning: {empty_texts} documents with empty text")
if empty_queries > 0:
    print(f"   ⚠️  Warning: {empty_queries} queries with empty query")

# Compute input file hashes
print("\n📊 Computing input file hashes...")
input_hashes = {}
for file_name in ["corpus.parquet", "queries.parquet", "qrels.parquet"]:
    file_path = input_root / file_name
    if file_path.exists():
        hash_val = md5sum(file_path)
        input_hashes[file_name.replace(".parquet", "")] = hash_val
        print(f"   {file_name}: {hash_val[:16]}...")

print("\n✅ Stage A complete: Input data validated")


📥 Loading SQuAD dataset...

✅ Loaded:
   Corpus:  20,958 documents
   Queries: 98,169 queries
   Qrels:   98,169 pairs

🔍 Validating structure...

🔍 Checking uniqueness...

🔍 Checking qrels integrity...

🔍 Checking data quality...

📊 Computing input file hashes...
   corpus.parquet: e4066c82e40d3c30...
   queries.parquet: f4b77397b6e47091...
   qrels.parquet: add4d02020230b05...

✅ Stage A complete: Input data validated
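
The Stage A checks (required columns, duplicate keys, null or empty text) can be sketched without pandas on a list of dicts. This is only an illustration under stated assumptions: `validate_records`, its parameters, and the toy rows are hypothetical and not part of the repository, and unlike the notebook it treats a `None` text as empty.

```python
from collections import Counter


def validate_records(records, required, key, text_field):
    """Return a small report mirroring the Stage A checks (hypothetical helper)."""
    # Union of all field names seen across the records.
    present = set().union(*(r.keys() for r in records)) if records else set()
    key_counts = Counter(r.get(key) for r in records)
    return {
        # Required columns that never appear.
        "missing_columns": sorted(set(required) - present),
        # Occurrences beyond the first, like `Series.duplicated().sum()`.
        "duplicate_keys": sum(c - 1 for c in key_counts.values() if c > 1),
        # Rows with a null key, or with null / whitespace-only text.
        "null_or_empty": sum(
            1
            for r in records
            if r.get(key) is None or not str(r.get(text_field) or "").strip()
        ),
    }


# Toy rows: one duplicate doc_id, one blank text, one null text.
toy_corpus = [
    {"doc_id": "d1", "title": "T", "text": "some text"},
    {"doc_id": "d1", "title": "T", "text": "   "},
    {"doc_id": "d2", "title": "T", "text": None},
]
report = validate_records(
    toy_corpus, required=["doc_id", "title", "text", "url"], key="doc_id", text_field="text"
)
print(report)
```

Returning a report rather than printing warnings makes it easy to fail fast (for example, raise when `missing_columns` is non-empty) while still only logging the softer issues.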


## 3. Stage B: Reduce Corpus (~585 paragraphs)

Reduce the corpus by grouping paragraphs by title and selecting whole titles, largest first, until roughly 585 documents are reached.


In [3]:
print("📉 Reducing corpus to ~585 documents...")

# Group by title
title_groups = corpus.groupby("title", dropna=False)
title_counts = title_groups.size().sort_values(ascending=False)

print(f"\n📊 Title distribution:")
print(f"   Total titles: {len(title_counts)}")
print(f"   Total paragraphs: {len(corpus)}")
print(f"   Avg paragraphs per title: {title_counts.mean():.1f}")
print(f"   Max paragraphs per title: {title_counts.max()}")
print(f"   Min paragraphs per title: {title_counts.min()}")

# Strategy: Select titles until we reach ~TARGET_DOCS
selected_titles = []
cumulative_count = 0

for title, count in title_counts.items():
    if cumulative_count + count <= TARGET_DOCS * 1.1:  # Allow 10% over
        selected_titles.append(title)
        cumulative_count += count
        if cumulative_count >= TARGET_DOCS:
            break
    else:
        # If adding this title would exceed too much, skip it
        continue

print(f"\n📋 Selected {len(selected_titles)} titles")
print(f"   Cumulative paragraphs: {cumulative_count}")

# Filter corpus to selected titles
corpus_selected = corpus[corpus["title"].isin(selected_titles)].copy()

# If we're over target, trim deterministically
if len(corpus_selected) > TARGET_DOCS:
    print(f"\n✂️  Trimming from {len(corpus_selected)} to {TARGET_DOCS} documents...")
    corpus_selected = corpus_selected.sort_values("doc_id").head(TARGET_DOCS)
    sampling_method = "by_title_trimmed"
elif len(corpus_selected) < TARGET_DOCS * 0.9:  # If less than 90% of target
    print(f"\n⚠️  Only {len(corpus_selected)} docs selected, falling back to random sampling...")
    corpus_selected = corpus.sample(n=TARGET_DOCS, random_state=SEED)
    sampling_method = "random"
else:
    sampling_method = "by_title"

DOCS_STAR = set(corpus_selected["doc_id"].tolist())

print(f"\n✅ Stage B complete:")
print(f"   Selected documents: {len(DOCS_STAR)}")
print(f"   Sampling method: {sampling_method}")


📉 Reducing corpus to ~585 documents...

📊 Title distribution:
   Total titles: 490
   Total paragraphs: 20958
   Avg paragraphs per title: 42.8
   Max paragraphs per title: 149
   Min paragraphs per title: 5

📋 Selected 5 titles
   Cumulative paragraphs: 621

✂️  Trimming from 621 to 585 documents...

✅ Stage B complete:
   Selected documents: 585
   Sampling method: by_title_trimmed
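
The greedy title pass in Stage B can be sketched on toy counts with only the standard library. The function name `select_titles`, the `tolerance` parameter, and the alphabetical tie-break are assumptions introduced for this illustration; the notebook itself iterates in whatever order `sort_values` returns.

```python
from collections import Counter


def select_titles(title_counts, target, tolerance=0.10):
    """Greedy pass over titles, largest first (hypothetical helper).

    A title is taken only if the running total stays within
    target * (1 + tolerance); the pass stops once the target is reached.
    """
    selected, total = [], 0
    # Sort by descending count, then by title so ties have a defined order.
    ordered = sorted(title_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for title, count in ordered:
        if total + count <= target * (1 + tolerance):
            selected.append(title)
            total += count
            if total >= target:
                break
        # Otherwise the title would overshoot too far, so it is skipped.
    return selected, total


# Toy paragraph counts per title.
toy_counts = Counter({"A": 5, "B": 4, "C": 3, "D": 1})
selected, total = select_titles(toy_counts, target=10)
print(selected, total)
```

In the toy run, title `C` is skipped because taking it would push the total past the 10% allowance, and the smaller title `D` then fills the gap. Because the pass starts from the largest titles, the resulting corpus covers few topics (5 titles in the run above), which is worth keeping in mind when interpreting retrieval results on it.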


## 4. Stage C: Select Consistent Queries (~2,976)

Filter the queries, keeping only those whose relevant paragraph is in the reduced corpus.


In [4]:
print("🔍 Filtering queries to match reduced corpus...")

# Filter qrels to only include doc_id in DOCS_STAR
qrels_filtered = qrels[qrels["doc_id"].isin(DOCS_STAR)].copy()

# Get unique query_ids from filtered qrels
valid_query_ids = set(qrels_filtered["query_id"].unique())

print(f"\n📊 Filter results:")
print(f"   Qrels matching reduced corpus: {len(qrels_filtered):,}")
print(f"   Unique queries: {len(valid_query_ids):,}")

# Verify 1 qrel per query in filtered set
qrels_per_query_filtered = qrels_filtered.groupby("query_id").size()
multi_qrels_filtered = (qrels_per_query_filtered > 1).sum()
if multi_qrels_filtered > 0:
    print(f"\n⚠️  Warning: {multi_qrels_filtered} queries have multiple qrels in filtered set")
    print(f"   Keeping first qrel for each query")
    qrels_filtered = qrels_filtered.drop_duplicates(subset=["query_id"], keep="first")
    valid_query_ids = set(qrels_filtered["query_id"].unique())

# Filter queries DataFrame
queries_small = queries[queries["query_id"].isin(valid_query_ids)].copy()

# Filter corpus to DOCS_STAR
corpus_small = corpus[corpus["doc_id"].isin(DOCS_STAR)].copy()

# Final qrels (already filtered)
qrels_small = qrels_filtered.copy()

print(f"\n✅ Stage C complete:")
print(f"   Corpus (small):  {len(corpus_small):,} documents")
print(f"   Queries (small): {len(queries_small):,} queries")
print(f"   Qrels (small):   {len(qrels_small):,} pairs")
print(f"   Expected queries: ~{EXPECTED_QUERIES}")


🔍 Filtering queries to match reduced corpus...

📊 Filter results:
   Qrels matching reduced corpus: 2,823
   Unique queries: 2,823

✅ Stage C complete:
   Corpus (small):  585 documents
   Queries (small): 2,823 queries
   Qrels (small):   2,823 pairs
   Expected queries: ~2976
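
The query filter can be illustrated on toy data in plain Python. The sketch follows the notebook's order of operations (keep the first qrel per query, as in Stage A, then keep only qrels pointing into the reduced corpus); `filter_to_corpus` and the toy IDs are hypothetical names used only here.

```python
def filter_to_corpus(qrels, queries, docs_star):
    """Keep queries whose (first) relevant doc is in the reduced corpus.

    qrels:     list of (query_id, doc_id) pairs, in file order
    queries:   dict mapping query_id -> query text
    docs_star: set of doc_ids kept in the reduced corpus
    """
    # Step 1: one qrel per query, keeping the first occurrence.
    first_qrel = {}
    for query_id, doc_id in qrels:
        first_qrel.setdefault(query_id, doc_id)

    # Step 2: drop qrels whose document fell outside the reduced corpus.
    qrels_small = {q: d for q, d in first_qrel.items() if d in docs_star}

    # Step 3: keep only the queries that still have a qrel.
    queries_small = {q: text for q, text in queries.items() if q in qrels_small}
    return qrels_small, queries_small


toy_qrels = [("q1", "d1"), ("q2", "d3"), ("q3", "d2"), ("q3", "d1")]
toy_queries = {"q1": "who?", "q2": "when?", "q3": "where?"}
qrels_small_toy, queries_small_toy = filter_to_corpus(
    toy_qrels, toy_queries, docs_star={"d1", "d2"}
)
print(qrels_small_toy)
print(sorted(queries_small_toy))
```

Here `q2` disappears because its only relevant paragraph (`d3`) is not in the reduced corpus. The same mechanism likely explains why the run above keeps 2,823 queries rather than the ~2,976 target: the count depends entirely on which paragraphs survive Stage B.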


## 5. Stage D: Sanity Checks

Re-validate the invariants and compute statistics for the reduced dataset.


In [5]:
print("🔍 Running sanity checks on reduced dataset...")

# Re-validate invariants
print("\n1. Integrity checks:")
# All doc_ids in qrels exist in corpus
qrels_doc_ids = set(qrels_small["doc_id"].unique())
corpus_doc_ids = set(corpus_small["doc_id"].unique())
missing_docs = qrels_doc_ids - corpus_doc_ids
if missing_docs:
    print(f"   ❌ ERROR: {len(missing_docs)} doc_ids in qrels not found in corpus")
else:
    print(f"   ✅ All qrels doc_ids exist in corpus")

# All query_ids in qrels exist in queries
qrels_query_ids = set(qrels_small["query_id"].unique())
queries_query_ids = set(queries_small["query_id"].unique())
missing_queries = qrels_query_ids - queries_query_ids
if missing_queries:
    print(f"   ❌ ERROR: {len(missing_queries)} query_ids in qrels not found in queries")
else:
    print(f"   ✅ All qrels query_ids exist in queries")

# 1 qrel per query
qrels_per_query = qrels_small.groupby("query_id").size()
if (qrels_per_query > 1).any():
    print(f"   ❌ ERROR: Some queries have multiple qrels")
else:
    print(f"   ✅ Exactly 1 qrel per query")

# Uniqueness
print("\n2. Uniqueness checks:")
duplicate_docs = corpus_small["doc_id"].duplicated().sum()
duplicate_queries = queries_small["query_id"].duplicated().sum()
if duplicate_docs == 0 and duplicate_queries == 0:
    print(f"   ✅ No duplicate doc_ids or query_ids")
else:
    print(f"   ⚠️  Duplicates: docs={duplicate_docs}, queries={duplicate_queries}")

# Statistics
print("\n3. Statistics:")
print(f"   Final counts:")
print(f"      Documents: {len(corpus_small):,}")
print(f"      Queries:   {len(queries_small):,}")
print(f"      Qrels:     {len(qrels_small):,}")

# Split distribution
if "split" in qrels_small.columns:
    split_counts = qrels_small["split"].value_counts()
    print(f"\n   Split distribution:")
    for split, count in split_counts.items():
        print(f"      {split}: {count:,} ({count/len(qrels_small)*100:.1f}%)")

# Average text lengths
avg_doc_chars = corpus_small["text"].astype(str).str.len().mean()
avg_query_chars = queries_small["query"].astype(str).str.len().mean()
print(f"\n   Average lengths:")
print(f"      Document text: {avg_doc_chars:.0f} characters")
print(f"      Query text:    {avg_query_chars:.0f} characters")

# Only report success when every check above actually passed
all_ok = not (
    missing_docs or missing_queries or (qrels_per_query > 1).any()
    or duplicate_docs or duplicate_queries
)
if all_ok:
    print("\n✅ Stage D complete: All sanity checks passed")
else:
    print("\n❌ Stage D complete: Some sanity checks failed (see above)")


🔍 Running sanity checks on reduced dataset...

1. Integrity checks:
   ✅ All qrels doc_ids exist in corpus
   ✅ All qrels query_ids exist in queries
   ✅ Exactly 1 qrel per query

2. Uniqueness checks:
   ✅ No duplicate doc_ids or query_ids

3. Statistics:
   Final counts:
      Documents: 585
      Queries:   2,823
      Qrels:     2,823

   Split distribution:
      train: 2,344 (83.0%)
      test: 479 (17.0%)

   Average lengths:
      Document text: 629 characters
      Query text:    57 characters

✅ Stage D complete: All sanity checks passed
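
The integrity invariants checked in Stage D amount to a few set comparisons. A minimal sketch, assuming qrels are available as `(query_id, doc_id)` pairs; `check_integrity` and the toy IDs are hypothetical and only illustrate the idea of collecting failures instead of printing them one by one.

```python
from collections import Counter


def check_integrity(corpus_ids, query_ids, qrels):
    """Return a list of human-readable problems; an empty list means all passed."""
    problems = []

    # Every doc_id referenced by a qrel must exist in the corpus.
    missing_docs = {d for _, d in qrels} - set(corpus_ids)
    if missing_docs:
        problems.append(f"{len(missing_docs)} doc_ids in qrels not found in corpus")

    # Every query_id referenced by a qrel must exist in the queries.
    missing_queries = {q for q, _ in qrels} - set(query_ids)
    if missing_queries:
        problems.append(f"{len(missing_queries)} query_ids in qrels not found in queries")

    # Exactly one qrel per query.
    per_query = Counter(q for q, _ in qrels)
    multi = [q for q, n in per_query.items() if n > 1]
    if multi:
        problems.append(f"{len(multi)} queries have multiple qrels")

    return problems


good = check_integrity({"d1", "d2"}, {"q1", "q2"}, [("q1", "d1"), ("q2", "d2")])
bad = check_integrity({"d1"}, {"q1"}, [("q1", "d1"), ("q1", "d9"), ("q2", "d1")])
print(good)
print(bad)
```

Collecting problems in a list lets the caller decide whether to stop (for example `assert not problems`) before the dataset is written in Stage F.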


## 6. Stage E: Generate Hybrid-Sensitive Subset

Run BM25 and Dense retrieval to identify the queries where BM25 top-1 ≠ Dense top-1.


In [6]:
print("🔍 Generating Hybrid-Sensitive subset...")
print(f"   This will run BM25 and Dense retrieval on {len(queries_small):,} queries...")

# Convert to Document and Query objects
documents = as_documents(corpus_small)
query_objects = as_queries(queries_small)

print(f"\n📦 Converted to objects:")
print(f"   Documents: {len(documents)}")
print(f"   Queries:   {len(query_objects)}")

# Initialize retrievers
print("\n🔧 Initializing retrievers...")
bm25_retriever = BM25Basic(k1=0.9, b=0.4)
dense_retriever = DenseFaiss(
    model_name=DENSE_MODEL,
    use_faiss=True,
    artifact_dir=str(output_root.parent.parent / "artifacts" / "squad_small_dense"),
    index_name="dense.index"
)

# Build indexes
print("\n🏗️  Building indexes...")
print("   Building BM25 index...")
bm25_retriever.build_index(documents)

print("   Building Dense index...")
dense_retriever.build_index(documents)

print("✅ Indexes built")


🔍 Generating Hybrid-Sensitive subset...
   This will run BM25 and Dense retrieval on 2,823 queries...

📦 Converted to objects:
   Documents: 585
   Queries:   2823

🔧 Initializing retrievers...

🏗️  Building indexes...
   Building BM25 index...
   Building Dense index...
2025-11-05 21:42:26 | INFO     | retriever.dense | [dense_faiss.py:79] | 🚀 Building Dense Index (585 documentos)
2025-11-05 21:42:26 | INFO     | retriever.dense | [logging.py:199] | ⏱️  Encoding documents - iniciando...
2025-11-05 21:42:33 | INFO     | retriever.dense | [logging.py:220] | ✓ Encoding documents - concluído em 6.97s
2025-11-05 21:42:33 | INFO     | retriever.dense | [logging.py:199] | ⏱️  Construindo FAISS IndexFlatIP - iniciando...
2025-11-05 21:42:33 | INFO     | retriever.dense | [logging.py:220] | ✓ Construindo FAISS IndexFlatIP - concluído em 0.6ms
2025-11-05 21:42:33 | INFO     | retriever.dense | [dense_faiss.py:108] |   ✓ FAISS IndexFlatIP: 585 vetores, dim=384
✅ Indexes built


In [7]:
# Run retrieval for all queries
print(f"\n🔍 Running retrieval (K={K}) for all queries...")

# Process in batches to show progress
batch_size = 100
all_bm25_results = {}
all_dense_results = {}

for i in range(0, len(query_objects), batch_size):
    batch_queries = query_objects[i:i+batch_size]

    # BM25 retrieval
    bm25_results = bm25_retriever.retrieve(batch_queries, k=K)
    all_bm25_results.update(bm25_results)

    # Dense retrieval
    dense_results = dense_retriever.retrieve(batch_queries, k=K)
    all_dense_results.update(dense_results)

    if (i + batch_size) % 500 == 0 or (i + batch_size) >= len(query_objects):
        print(f"   Processed {min(i + batch_size, len(query_objects)):,} / {len(query_objects):,} queries")

print("✅ Retrieval complete")



🔍 Running retrieval (K=20) for all queries...
   Processed 500 / 2,823 queries
   Processed 1,000 / 2,823 queries
   Processed 1,500 / 2,823 queries
   Processed 2,000 / 2,823 queries
   Processed 2,500 / 2,823 queries
   Processed 2,823 / 2,823 queries
✅ Retrieval complete


In [8]:
# Extract top-1 from each method
print("\n📊 Extracting top-1 results...")

# Get all query IDs from query_objects to ensure we check all queries
all_query_ids = {q.query_id for q in query_objects}

top1_bm25 = {}
top1_dense = {}

for query_id in all_query_ids:
    bm25_results = all_bm25_results.get(query_id, [])
    dense_results = all_dense_results.get(query_id, [])

    top1_bm25[query_id] = bm25_results[0][0] if bm25_results else None
    top1_dense[query_id] = dense_results[0][0] if dense_results else None

# Identify Hybrid-Sensitive queries: top1_bm25 ≠ top1_dense
hybrid_sensitive_query_ids = []
for query_id in top1_bm25.keys():
    if top1_bm25[query_id] != top1_dense[query_id]:
        hybrid_sensitive_query_ids.append(query_id)

hybrid_sensitive_query_ids = set(hybrid_sensitive_query_ids)

print(f"\n✅ Hybrid-Sensitive identification complete:")
print(f"   Total queries analyzed: {len(top1_bm25):,}")
print(f"   Hybrid-Sensitive queries: {len(hybrid_sensitive_query_ids):,}")
print(f"   Proportion: {len(hybrid_sensitive_query_ids)/len(top1_bm25)*100:.1f}%")
print(f"   Expected: ~{EXPECTED_HYBRID_SENSITIVE}")

# Check retrievability (ground truth in top-K)
print("\n📊 Retrievability analysis:")
ground_truth_map = dict(zip(qrels_small["query_id"], qrels_small["doc_id"]))

unretrievable_bm25 = 0
unretrievable_dense = 0

for query_id in hybrid_sensitive_query_ids:
    gt_doc = ground_truth_map.get(query_id)
    # Compare against None so a falsy but valid doc_id (e.g. 0) is not skipped
    if gt_doc is not None:
        bm25_topk = [r[0] for r in all_bm25_results.get(query_id, [])]
        dense_topk = [r[0] for r in all_dense_results.get(query_id, [])]

        if gt_doc not in bm25_topk:
            unretrievable_bm25 += 1
        if gt_doc not in dense_topk:
            unretrievable_dense += 1

# Report the actual retrieval depth instead of a hard-coded 20
print(f"   Ground truth not in BM25@{K}: {unretrievable_bm25:,} queries")
print(f"   Ground truth not in Dense@{K}: {unretrievable_dense:,} queries")

# Filter to Hybrid-Sensitive subset
queries_hybrid = queries_small[queries_small["query_id"].isin(hybrid_sensitive_query_ids)].copy()
qrels_hybrid = qrels_small[qrels_small["query_id"].isin(hybrid_sensitive_query_ids)].copy()

print(f"\n✅ Stage E complete:")
print(f"   Hybrid-Sensitive queries: {len(queries_hybrid):,}")
print(f"   Hybrid-Sensitive qrels:   {len(qrels_hybrid):,}")



📊 Extracting top-1 results...

✅ Hybrid-Sensitive identification complete:
   Total queries analyzed: 2,823
   Hybrid-Sensitive queries: 1,273
   Proportion: 45.1%
   Expected: ~1111

📊 Retrievability analysis:
   Ground truth not in BM25@20: 148 queries
   Ground truth not in Dense@20: 109 queries

✅ Stage E complete:
   Hybrid-Sensitive queries: 1,273
   Hybrid-Sensitive qrels:   1,273
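
The hybrid-sensitive criterion and the retrievability count can be sketched on toy results. The sketch assumes each retriever returns a dict mapping `query_id` to a list of `(doc_id, score)` pairs sorted best first, which is inferred from the `results[0][0]` indexing above rather than from the retriever code; the helper names and toy scores are hypothetical.

```python
def top1(results):
    """doc_id of the best hit, or None when the list is empty."""
    return results[0][0] if results else None


def hybrid_sensitive(bm25, dense, query_ids):
    """Queries on which the two retrievers disagree about the top-1 document."""
    return {
        q for q in query_ids
        if top1(bm25.get(q, [])) != top1(dense.get(q, []))
    }


def missed_at_k(results, ground_truth, query_ids, k):
    """Number of queries whose ground-truth doc is absent from the top-k hits."""
    missed = 0
    for q in query_ids:
        top_k_ids = [doc_id for doc_id, _ in results.get(q, [])[:k]]
        if ground_truth[q] not in top_k_ids:
            missed += 1
    return missed


# Toy results; scores are made up and on different scales on purpose.
toy_bm25 = {"q1": [("d1", 9.1), ("d2", 3.0)], "q2": [("d3", 5.0), ("d1", 4.0)], "q3": []}
toy_dense = {"q1": [("d1", 0.8), ("d3", 0.5)], "q2": [("d1", 0.7), ("d2", 0.6)], "q3": [("d2", 0.4)]}
toy_truth = {"q1": "d1", "q2": "d1", "q3": "d2"}

sensitive = hybrid_sensitive(toy_bm25, toy_dense, toy_truth.keys())
bm25_missed = missed_at_k(toy_bm25, toy_truth, sensitive, k=2)
dense_missed = missed_at_k(toy_dense, toy_truth, sensitive, k=2)
print(sorted(sensitive), bm25_missed, dense_missed)
```

Note that a query with no hits from one retriever (`q3` for BM25) counts as a disagreement, exactly as in the notebook, because `None` differs from any document ID. Only the top-1 documents are compared; `K` matters solely for the retrievability count.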


## 7. Stage F: Write Artifacts & Metadata

Save all parquet files and generate METADATA.json with traceability information.


In [9]:
print("💾 Writing artifacts...")

# Write parquet files
output_files = {
    "corpus.parquet": corpus_small,
    "queries.parquet": queries_small,
    "qrels.parquet": qrels_small,
    "queries_hybrid.parquet": queries_hybrid,
    "qrels_hybrid.parquet": qrels_hybrid,
}

for filename, df in output_files.items():
    filepath = output_root / filename
    df.to_parquet(filepath, index=False, engine="pyarrow")
    print(f"   ✅ Saved {filename} ({len(df):,} rows)")

print("\n✅ All parquet files saved")


💾 Writing artifacts...
   ✅ Saved corpus.parquet (585 rows)
   ✅ Saved queries.parquet (2,823 rows)
   ✅ Saved qrels.parquet (2,823 rows)
   ✅ Saved queries_hybrid.parquet (1,273 rows)
   ✅ Saved qrels_hybrid.parquet (1,273 rows)

✅ All parquet files saved


In [10]:
# Generate METADATA.json
metadata = {
    "seed": SEED,
    "sampling_method": sampling_method,
    "target_docs": TARGET_DOCS,
    "final_counts": {
        "docs": len(corpus_small),
        "queries": len(queries_small),
        "queries_hybrid": len(queries_hybrid)
    },
    "k": K,
    "retrieval_models": {
        "bm25": {
            "k1": 0.9,
            "b": 0.4
        },
        "dense": {
            "model": DENSE_MODEL,
            "provider": "huggingface"  # Change if using OpenAI
        }
    },
    "created_at": datetime.now().isoformat(),
    "input_hashes": input_hashes,
    "hybrid_sensitive_stats": {
        "count": len(hybrid_sensitive_query_ids),
        "proportion": len(hybrid_sensitive_query_ids) / len(top1_bm25) if top1_bm25 else 0,
        "unretrievable_bm25": unretrievable_bm25,
        "unretrievable_dense": unretrievable_dense
    }
}

# Write METADATA.json
metadata_path = output_root / "METADATA.json"
with open(metadata_path, "w") as f:
    json.dump(metadata, f, indent=2)

print("✅ METADATA.json saved")
print(f"\n📋 Metadata summary:")
print(f"   Seed: {metadata['seed']}")
print(f"   Sampling method: {metadata['sampling_method']}")
print(f"   Final counts: {metadata['final_counts']}")
print(f"   Hybrid-Sensitive: {metadata['hybrid_sensitive_stats']['count']} queries")
print(f"   Created at: {metadata['created_at']}")

print(f"\n✅ Stage F complete: All artifacts saved to {output_root}")


✅ METADATA.json saved

📋 Metadata summary:
   Seed: 42
   Sampling method: by_title_trimmed
   Final counts: {'docs': 585, 'queries': 2823, 'queries_hybrid': 1273}
   Hybrid-Sensitive: 1273 queries
   Created at: 2025-11-05T21:47:05.453831

✅ Stage F complete: All artifacts saved to /Users/thiago/Documents/GitHub/hybrid-retrieval/data/squad_small/processed/beir
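
One use of the `input_hashes` entry in METADATA.json is to detect whether the source files changed since the reduced dataset was generated. The following is a sketch under stated assumptions, not part of the repository: `changed_inputs` is a hypothetical helper, and the toy round trip writes placeholder bytes rather than real parquet data. It relies only on the key layout written above (file name without the `.parquet` suffix mapped to its MD5).

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5sum(path: Path) -> str:
    """MD5 of a file, read in 1 MiB chunks as in Stage A."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_inputs(metadata_path: Path, input_dir: Path) -> list[str]:
    """Names in `input_hashes` whose file is missing or no longer matches."""
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    changed = []
    for name, recorded in metadata["input_hashes"].items():
        path = input_dir / f"{name}.parquet"
        if not path.exists() or md5sum(path) != recorded:
            changed.append(name)
    return sorted(changed)


# Toy round trip in a temporary directory (placeholder bytes, not real parquet).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    corpus_file = root / "corpus.parquet"
    corpus_file.write_bytes(b"placeholder corpus bytes")

    meta_file = root / "METADATA.json"
    meta_file.write_text(
        json.dumps({"input_hashes": {"corpus": md5sum(corpus_file)}}),
        encoding="utf-8",
    )

    before = changed_inputs(meta_file, root)   # nothing changed yet
    corpus_file.write_bytes(b"edited corpus bytes")
    after = changed_inputs(meta_file, root)    # the edit is detected

print(before, after)
```

MD5 is adequate here because the goal is detecting accidental drift in the inputs, not resisting deliberate tampering.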
