# 🚀 Improved Chunking Pipeline (with Dataset-Specific Chunking)

## 📋 Overview

This pipeline uses **dataset-specific chunking strategies** that were tuned in **Notebook 2**:

### 🎯 Key Features:

1. **Dataset-Specific Chunking** 🔧
   - Each dataset uses its **own best-performing** chunking method
   - Choices are based on the evaluation in notebook 2
   - Pre-chunked corpora are loaded automatically (faster and more consistent)

2. **Hybrid Retrieval** 🔍
   - Dense retrieval (BAAI/bge-large-en-v1.5)
   - BM25 (lexical matching)
   - Weighted combination

3. **Advanced Reranking** 📊
   - BAAI/bge-reranker-v2-m3
   - Re-score top candidates

4. **Smart Aggregation** 🔗
   - Chunk-level retrieval
   - Document-level aggregation (max score)

### 📂 Pre-Chunked Data Source:

```
../data/chunked_corpus/
├── convfinqa_corpus_chunked_optimal.jsonl      (recursive 1536/200)
├── financebench_corpus_chunked_optimal.jsonl   (recursive 768/75)
├── finder_corpus_chunked_optimal.jsonl         (recursive 512/50)
├── finqa_corpus_chunked_optimal.jsonl          (preserve_tables 2048/200)
├── finqabench_corpus_chunked_optimal.jsonl     (recursive 512/50)
├── multiheirtt_corpus_chunked_optimal.jsonl    (preserve_tables 3000/300)
├── tatqa_corpus_chunked_optimal.jsonl          (no_chunking)
└── best_chunking_config_per_dataset.json       (configs)
```
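
Each `*_chunked_optimal.jsonl` file holds one JSON object per line. As a rough sketch of how such a file maps chunks back to their parent documents, assuming the `_id`, `original_id`, and `text` fields that the pipeline code in this notebook reads (the sample records are made up for illustration):

```python
import json

# Hypothetical sample lines; real files live under ../data/chunked_corpus/
sample_lines = [
    '{"_id": "doc1_chunk0", "original_id": "doc1", "text": "Revenue grew 5%."}',
    '{"_id": "doc1_chunk1", "original_id": "doc1", "text": "Costs fell 2%."}',
    '{"_id": "doc2_chunk0", "original_id": "doc2", "text": "Net income was flat."}',
]

def build_chunk_to_doc(lines):
    """Map each chunk ID to its parent document ID."""
    chunk_to_doc = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # Fall back to the chunk ID itself when no parent ID is stored
        chunk_to_doc[record["_id"]] = record.get("original_id", record["_id"])
    return chunk_to_doc

chunk_to_doc = build_chunk_to_doc(sample_lines)
print(chunk_to_doc)
```

This mapping is what later lets chunk-level scores be rolled up into document-level scores.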

### ⚡ Performance:

- ✅ **Faster**: No need to chunk on-the-fly
- ✅ **Consistent**: Same chunks for all runs
- ✅ **Optimized**: Each dataset uses its best strategy

---

## 📦 Installation

In [None]:
# Install required packages
!pip install -q sentence-transformers faiss-cpu FlagEmbedding rank-bm25

## 📚 Imports

In [4]:
# Core Libraries
import os
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from typing import List, Dict, Tuple
import re

# Embedding & Retrieval
from sentence_transformers import SentenceTransformer
import faiss

# Reranking
from FlagEmbedding import FlagReranker

# BM25
from rank_bm25 import BM25Okapi

# Utils
import warnings
warnings.filterwarnings('ignore')
import logging
logging.disable(logging.CRITICAL)

In [5]:
# Check GPU
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    device = 'cuda'
else:
    print("Running on CPU")
    device = 'cpu'

CUDA available: True
GPU: NVIDIA GeForce RTX 3050 Laptop GPU


## ⚙️ Configuration

In [7]:
CONFIG = {
    'data_dir': '../data',
    'output_file': 'submission_optimal_chunking.csv',
    
    'datasets': [
        'convfinqa', 'financebench', 'finder',
        'finqa', 'finqabench', 'multiheirtt', 'tatqa'
    ],
    
    # Models - SOTA
    'embedding_model': 'BAAI/bge-large-en-v1.5',
    'reranker_model': 'BAAI/bge-reranker-v2-m3',
    
    # Chunking - OPTIMAL (from notebook 2 evaluation)
    'use_prechunked': True,  # Load pre-chunked data from notebook 2
    'chunked_corpus_dir': '../data/chunked_corpus',  # Pre-chunked files location
    'chunking_config_file': '../data/chunked_corpus/best_chunking_config_per_dataset.json',  # Dataset-specific configs
    'use_chunking': True,
    'chunking_method': 'fixed',  # Fallback if pre-chunked not available
    'chunk_size': 512,  # characters
    'chunk_overlap': 50,  # characters
    'chunk_aggregation': 'max',  # max score aggregation
    'preserve_tables': True,  # Keep tables intact
    
    # Hybrid Search
    'use_hybrid': True,
    'hybrid_alpha': 0.6,  # 60% dense, 40% BM25
    
    # Retrieval Parameters
    'top_k_retrieval': 100,  # Initial retrieval
    'top_k_rerank': 50,  # Send to reranker
    'top_k_final': 10,  # Final results per query
    
    # Batch Sizes
    'embed_batch_size': 16,
    'rerank_batch_size': 16,
    'max_length': 512,
    
    # Evaluation
    'eval_on_qrels': True,
}

print("✅ Configuration loaded")
print("\n🎯 CHUNKING STRATEGY:")
if CONFIG['use_prechunked']:
    print(f"   ✅ Using PRE-CHUNKED corpus from notebook 2")
    print(f"   📂 Source: {CONFIG['chunked_corpus_dir']}")
    print(f"   📋 Config: {CONFIG['chunking_config_file']}")
    print(f"   ℹ️ Each dataset uses its OPTIMAL chunking method")
else:
    print(f"   Method: {CONFIG['chunking_method']}")
    print(f"   Chunk Size: {CONFIG['chunk_size']} characters")
    print(f"   Overlap: {CONFIG['chunk_overlap']} characters")
    print(f"   Table Preservation: {CONFIG['preserve_tables']}")
print(f"\n🔗 Aggregation: {CONFIG['chunk_aggregation']} (chunks → docs)")

✅ Configuration loaded

🎯 CHUNKING STRATEGY:
   ✅ Using PRE-CHUNKED corpus from notebook 2
   📂 Source: ../data/chunked_corpus
   📋 Config: ../data/chunked_corpus/best_chunking_config_per_dataset.json
   ℹ️ Each dataset uses its OPTIMAL chunking method

🔗 Aggregation: max (chunks → docs)
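
The configuration mixes per-dataset settings (read from `best_chunking_config_per_dataset.json`) with global fallback values (`chunking_method`, `chunk_size`, `chunk_overlap`). A minimal sketch of how the two could be resolved for one dataset is shown below. The helper `resolve_chunking` and the JSON excerpt are hypothetical; only the key names `method`, `chunk_size`, and `chunk_overlap` come from the notebook's own code.

```python
import json

# Hypothetical excerpt of best_chunking_config_per_dataset.json; the key names
# mirror those read by the notebook (method, chunk_size, chunk_overlap)
configs = json.loads("""
{
  "finder": {"method": "recursive", "chunk_size": 512, "chunk_overlap": 50},
  "tatqa":  {"method": "no_chunking"}
}
""")

# Mirrors the fallback values in CONFIG
FALLBACK = {"method": "fixed", "chunk_size": 512, "chunk_overlap": 50}

def resolve_chunking(dataset, configs, fallback=FALLBACK):
    """Return (settings, source) for a dataset, using the fallback if it is absent."""
    if dataset in configs:
        return configs[dataset], "per-dataset"
    return dict(fallback), "fallback"

for name in ["finder", "tatqa", "convfinqa"]:
    settings, source = resolve_chunking(name, configs)
    print(f"{name:<10} {source:<12} {settings}")
```

Returning the source alongside the settings makes it easy to log which datasets silently fell back to the defaults.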


## 🔧 Helper Functions

In [8]:
def load_jsonl_data(dataset_name: str, data_dir: str):
    """Load corpus, queries, and qrels"""
    corpus_path = os.path.join(data_dir, f"{dataset_name}_corpus.jsonl", "corpus.jsonl")
    queries_path = os.path.join(data_dir, f"{dataset_name}_queries.jsonl", "queries.jsonl")
    
    # Map dataset names to qrels files
    qrels_mapping = {
        'convfinqa': 'ConvFinQA_qrels.tsv',
        'financebench': 'FinanceBench_qrels.tsv',
        'finder': 'FinDER_qrels.tsv',
        'finqa': 'FinQA_qrels.tsv',
        'finqabench': 'FinQABench_qrels.tsv',
        'multiheirtt': 'MultiHeirtt_qrels.tsv',
        'tatqa': 'TATQA_qrels.tsv'
    }
    
    qrels_path = os.path.join(data_dir, qrels_mapping.get(dataset_name, f"{dataset_name}_qrels.tsv"))
    
    corpus_df = pd.read_json(corpus_path, lines=True)
    queries_df = pd.read_json(queries_path, lines=True)
    
    qrels_df = None
    if os.path.exists(qrels_path):
        qrels_df = pd.read_csv(qrels_path, sep='\t')
    
    print(f"  Loaded {len(corpus_df)} docs, {len(queries_df)} queries")
    return corpus_df, queries_df, qrels_df


def load_prechunked_corpus(dataset_name: str, chunked_dir: str, config_file: str = None):
    """Load pre-chunked corpus from notebook 2 evaluation"""
    import json
    
    chunked_path = os.path.join(chunked_dir, f"{dataset_name}_corpus_chunked_optimal.jsonl")
    
    if not os.path.exists(chunked_path):
        print(f"  ⚠️ Pre-chunked file not found: {chunked_path}")
        return None, None
    
    # Load chunking config to show which method was used
    chunking_method = "unknown"
    if config_file and os.path.exists(config_file):
        with open(config_file, 'r') as f:
            configs = json.load(f)
            if dataset_name in configs:
                cfg = configs[dataset_name]
                method = cfg.get('method', 'unknown')
                if method != 'no_chunking' and cfg.get('chunk_size'):
                    chunking_method = f"{method} ({cfg['chunk_size']}/{cfg['chunk_overlap']})"
                else:
                    chunking_method = method
    
    # Load pre-chunked corpus (one JSON object per line; skip blank lines)
    chunks = []
    with open(chunked_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                chunks.append(json.loads(line))
    
    print(f"  ✅ Loaded {len(chunks)} pre-chunked chunks")
    print(f"  📋 Method used: {chunking_method}")
    return chunks, chunking_method


print("✅ Data loading functions defined")

✅ Data loading functions defined
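
The qrels files are tab-separated relevance judgments, and the evaluation code turns them into a nested dictionary keyed by query ID. A minimal standard-library sketch of that shape follows, assuming a header row with `query_id`, `corpus_id`, and `score` columns (the rows are invented, and `parse_qrels` is a hypothetical helper, not part of the notebook):

```python
import csv
import io
from collections import defaultdict

# Invented sample; real files such as FinDER_qrels.tsv use their own IDs
sample_tsv = "query_id\tcorpus_id\tscore\nq1\td1\t1\nq1\td7\t1\nq2\td3\t1\n"

def parse_qrels(handle):
    """Read TSV relevance judgments into {query_id: {corpus_id: score}}."""
    qrels = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        qrels[row["query_id"]][row["corpus_id"]] = int(row["score"])
    return dict(qrels)

qrels = parse_qrels(io.StringIO(sample_tsv))
print(qrels)
```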


## ✅ Verify Pre-Chunked Data

Before running the pipeline, let's verify that pre-chunked data from notebook 2 is available:

In [9]:
import json
import os

# Check pre-chunked files
chunked_dir = CONFIG['chunked_corpus_dir']
config_file = CONFIG['chunking_config_file']

print("="*80)
print("🔍 VERIFICATION: Pre-Chunked Data from Notebook 2")
print("="*80)

# Check if config file exists
if os.path.exists(config_file):
    print(f"\n✅ Config file found: {config_file}")
    
    with open(config_file, 'r') as f:
        chunk_configs = json.load(f)
    
    print(f"\n📋 Dataset-Specific Chunking Configurations:")
    print("-"*80)
    print(f"{'Dataset':<15} {'Method':<25} {'NDCG@10':<12} {'Status':<10}")
    print("-"*80)
    
    for dataset in CONFIG['datasets']:
        if dataset in chunk_configs:
            cfg = chunk_configs[dataset]
            method = cfg['method']
            
            # Format method string
            if method != 'no_chunking' and cfg.get('chunk_size'):
                method_str = f"{method} ({cfg['chunk_size']}/{cfg['chunk_overlap']})"
            else:
                method_str = method
            
            ndcg = cfg.get('ndcg_10', 0.0)
            
            # Check if chunked file exists
            chunked_file = os.path.join(chunked_dir, f"{dataset}_corpus_chunked_optimal.jsonl")
            status = "✅ Ready" if os.path.exists(chunked_file) else "❌ Missing"
            
            print(f"{dataset:<15} {method_str:<25} {ndcg:<12.4f} {status:<10}")
        else:
            print(f"{dataset:<15} {'N/A':<25} {'N/A':<12} {'❌ Missing':<10}")
    
    print("-"*80)
    
    # Count available files
    available = sum(1 for d in CONFIG['datasets'] 
                   if os.path.exists(os.path.join(chunked_dir, f"{d}_corpus_chunked_optimal.jsonl")))
    print(f"\n📊 Summary: {available}/{len(CONFIG['datasets'])} datasets have pre-chunked corpus")
    
    if available == len(CONFIG['datasets']):
        print("✅ All datasets ready! Pipeline will use pre-chunked data.")
    else:
        print("⚠️ Some datasets missing. Pipeline will chunk on-the-fly for missing datasets.")
else:
    print(f"\n❌ Config file not found: {config_file}")
    print("⚠️ Pipeline will use fallback chunking settings.")

print("\n" + "="*80)

🔍 VERIFICATION: Pre-Chunked Data from Notebook 2

✅ Config file found: ../data/chunked_corpus/best_chunking_config_per_dataset.json

📋 Dataset-Specific Chunking Configurations:
--------------------------------------------------------------------------------
Dataset         Method                    NDCG@10      Status    
--------------------------------------------------------------------------------
convfinqa       recursive (1536/200)      0.6081       ✅ Ready   
financebench    recursive (768/75)        1.0330       ✅ Ready   
finder          recursive (512/50)        0.5777       ✅ Ready   
finqa           preserve_tables (2048/200) 0.5591       ✅ Ready   
finqabench      recursive (512/50)        1.3488       ✅ Ready   
multiheirtt     preserve_tables (3000/300) 0.1948       ✅ Ready   
tatqa           no_chunking               0.3408       ✅ Ready   
--------------------------------------------------------------------------------

📊 Summary: 7/7 datasets 
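
NDCG is the ratio of DCG to the ideal DCG, so a consistent implementation yields values in [0, 1]. The values above 1 in the table (financebench, finqabench) therefore look suspicious. One possible cause, which is only an assumption here because notebook 2 is not shown, is that several chunks of the same document were each counted as a hit. A small check along these lines could flag such entries before the stored configs are trusted (`out_of_range` is a hypothetical helper):

```python
# NDCG@10 values copied from the verification table above
reported_ndcg = {
    "convfinqa": 0.6081,
    "financebench": 1.0330,
    "finder": 0.5777,
    "finqa": 0.5591,
    "finqabench": 1.3488,
    "multiheirtt": 0.1948,
    "tatqa": 0.3408,
}

def out_of_range(scores, low=0.0, high=1.0):
    """Return the datasets whose score falls outside [low, high]."""
    return sorted(name for name, s in scores.items() if not (low <= s <= high))

suspicious = out_of_range(reported_ndcg)
print("Out-of-range NDCG@10:", suspicious)
```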

## ✂️ Optimal Chunking Functions

In [10]:
def detect_tables(text: str) -> List[Tuple[int, int]]:
    """Detect table regions using heuristics"""
    lines = text.split('\n')
    table_regions = []
    in_table = False
    table_start = 0
    
    for i, line in enumerate(lines):
        # Detect table markers: pipes, tabs, or multiple spaces
        is_table = (
            line.count('|') >= 2 or
            line.count('\t') >= 2 or
            len(re.findall(r'\s{3,}', line)) >= 2
        )
        
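        if is_table and not in_table:
            in_table = True
            # Include the line before, but never reach back into the previous
            # region (adjacent tables would otherwise share, and duplicate, lines)
            prev_region_end = table_regions[-1][1] if table_regions else 0
            table_start = max(prev_region_end, i - 1)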
        elif not is_table and in_table:
            in_table = False
            table_end = min(len(lines), i + 1)  # Include line after
            if table_end - table_start >= 3:  # At least 3 lines
                table_regions.append((table_start, table_end))
    
    if in_table:
        table_regions.append((table_start, len(lines)))
    
    return table_regions


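def chunk_text_fixed(text: str, chunk_size: int, overlap: int,
                     min_chars: int = 100) -> List[str]:
    """Fixed-size character chunking with overlap"""
    if overlap >= chunk_size:
        # A non-positive step would make range() fail or never advance
        raise ValueError("overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]
    
    chunks = []
    step = chunk_size - overlap
    
    for i in range(0, len(text), step):
        chunk = text[i:i + chunk_size].strip()
        is_last = i + chunk_size >= len(text)
        # Drop tiny fragments, but keep a short final chunk: part of it is
        # not covered by the previous window and would otherwise be lost
        if len(chunk) >= min_chars or (is_last and chunk):
            chunks.append(chunk)
        
        # Stop if we've reached the end
        if is_last:
            break
    
    return chunks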


def chunk_document_optimal(doc_id: str, title: str, text: str, 
                          chunk_size: int, overlap: int, preserve_tables: bool):
    """Optimal chunking: fixed-size with table preservation"""
    chunks = []
    full_text = f"{title}\n{text}" if title else text
    
    # For short documents, no chunking needed
    if len(full_text) < chunk_size * 1.5:
        chunks.append({
            'chunk_id': f"{doc_id}_c0",
            'text': full_text,
            'doc_id': doc_id,
            'is_table': False
        })
        return chunks
    
    # Detect tables
    if preserve_tables:
        lines = full_text.split('\n')
        table_regions = detect_tables(full_text)
        
        if table_regions:
            chunk_idx = 0
            prev_end = 0
            
            for table_start, table_end in table_regions:
                # Chunk text before table
                if table_start > prev_end:
                    before_text = '\n'.join(lines[prev_end:table_start])
                    if before_text.strip():
                        for chunk_text in chunk_text_fixed(before_text, chunk_size, overlap):
                            chunks.append({
                                'chunk_id': f"{doc_id}_c{chunk_idx}",
                                'text': chunk_text,
                                'doc_id': doc_id,
                                'is_table': False
                            })
                            chunk_idx += 1
                
                # Keep table as single chunk
                table_text = '\n'.join(lines[table_start:table_end])
                chunks.append({
                    'chunk_id': f"{doc_id}_t{chunk_idx}",
                    'text': f"[TABLE]\n{table_text}",
                    'doc_id': doc_id,
                    'is_table': True
                })
                chunk_idx += 1
                prev_end = table_end
            
            # Chunk text after last table
            if prev_end < len(lines):
                after_text = '\n'.join(lines[prev_end:])
                if after_text.strip():
                    for chunk_text in chunk_text_fixed(after_text, chunk_size, overlap):
                        chunks.append({
                            'chunk_id': f"{doc_id}_c{chunk_idx}",
                            'text': chunk_text,
                            'doc_id': doc_id,
                            'is_table': False
                        })
                        chunk_idx += 1
            
            return chunks
    
    # No tables detected or preservation disabled - simple fixed chunking
    for i, chunk_text in enumerate(chunk_text_fixed(full_text, chunk_size, overlap)):
        chunks.append({
            'chunk_id': f"{doc_id}_c{i}",
            'text': chunk_text,
            'doc_id': doc_id,
            'is_table': False
        })
    
    return chunks


print("✅ Optimal chunking functions defined")

✅ Optimal chunking functions defined
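
To make the fixed-size strategy concrete: with `chunk_size=512` and `chunk_overlap=50`, each window starts 462 characters after the previous one. The sketch below only computes the window offsets. It is a simplified illustration rather than the notebook's `chunk_text_fixed`, which also strips whitespace and filters short chunks, and `chunk_windows` is a hypothetical name.

```python
def chunk_windows(text_len, chunk_size, overlap):
    """Return (start, end) character offsets of overlapping fixed-size windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, text_len, step):
        end = min(start + chunk_size, text_len)
        windows.append((start, end))
        if end == text_len:
            break  # the last window reached the end of the text
    return windows

print(chunk_windows(1200, 512, 50))
# [(0, 512), (462, 974), (924, 1200)]
```

Consecutive windows share exactly `overlap` characters, which is what lets a sentence cut at a boundary still appear whole in one of the two chunks.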


## 🔍 Retrieval Functions

In [11]:
def normalize_scores(scores: np.ndarray) -> np.ndarray:
    """Normalize scores to [0, 1]"""
    if scores.max() == scores.min():
        return np.ones_like(scores)
    return (scores - scores.min()) / (scores.max() - scores.min())


def hybrid_search(query_emb, query_text, faiss_index, bm25, corpus_texts, top_k, alpha=0.6):
    """Hybrid search: Dense + BM25"""
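    # Dense retrieval: over-fetch 2x, but never request more vectors than the
    # index holds, because FAISS pads missing results with index -1
    k = min(top_k * 2, faiss_index.ntotal)
    dense_scores, indices = faiss_index.search(
        query_emb.reshape(1, -1).astype('float32'), k
    )
    dense_scores = dense_scores[0]
    indices = indices[0]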
    
    # BM25 retrieval
    query_tokens = query_text.lower().split()
    bm25_scores = bm25.get_scores(query_tokens)
    bm25_subset = bm25_scores[indices]
    
    # Normalize both
    dense_norm = normalize_scores(dense_scores)
    bm25_norm = normalize_scores(bm25_subset)
    
    # Combine with alpha weighting
    hybrid = alpha * dense_norm + (1 - alpha) * bm25_norm
    
    # Re-sort by hybrid scores
    sorted_idx = np.argsort(hybrid)[::-1][:top_k]
    return hybrid[sorted_idx], indices[sorted_idx]


def aggregate_chunk_scores(chunk_scores: Dict, method='max') -> Dict:
    """Aggregate chunk scores to document scores"""
    aggregated = {}
    for doc_id, scores in chunk_scores.items():
        if method == 'max':
            aggregated[doc_id] = max(scores)
        elif method == 'mean':
            aggregated[doc_id] = np.mean(scores)
        elif method == 'weighted':
            # Weighted by position (higher weight for top chunks)
            weights = np.array([1.0 / (i + 1) for i in range(len(scores))])
            weights = weights / weights.sum()
            aggregated[doc_id] = np.dot(scores, weights)
        else:
            aggregated[doc_id] = max(scores)
    return aggregated


print("✅ Retrieval functions defined")

✅ Retrieval functions defined
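
As a worked illustration of the weighting, here is a dependency-free sketch of the same min-max normalization and `alpha` blend on made-up scores for three candidates. With `alpha=0.6`, a candidate that is only second-best on dense similarity can still rank first if BM25 strongly favors it. The helper names `min_max` and `blend` are illustrative, not part of the notebook.

```python
def min_max(scores):
    """Scale scores to [0, 1]; constant inputs map to all ones."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def blend(dense, bm25, alpha=0.6):
    """Weighted sum of normalized dense and BM25 scores."""
    return [alpha * d + (1 - alpha) * b
            for d, b in zip(min_max(dense), min_max(bm25))]

# Made-up scores for candidates A, B, C
dense = [0.80, 0.70, 0.60]   # cosine similarities
bm25 = [2.0, 12.0, 0.0]      # raw BM25 scores

hybrid = blend(dense, bm25)
ranking = sorted(zip("ABC", hybrid), key=lambda x: x[1], reverse=True)
print(ranking)  # B ranks first, then A, then C
```

Because both score lists are normalized over the candidate set only, the blended value is relative to the current query's candidates and is not comparable across queries.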


## 📊 Evaluation Functions

In [12]:
def compute_ndcg(qrels: Dict, results: Dict, k=10) -> float:
    """Compute NDCG@k"""
    ndcg_scores = []
    for query_id, retrieved in results.items():
        if query_id not in qrels:
            continue
        
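        relevant = qrels[query_id]
        # De-duplicate while preserving rank order: a repeated doc ID would be
        # counted at every rank and could push NDCG above 1
        retrieved_k = list(dict.fromkeys(retrieved))[:k]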
        
        # DCG
        dcg = sum(relevant.get(doc_id, 0) / np.log2(i + 2) 
                  for i, doc_id in enumerate(retrieved_k))
        
        # IDCG
        ideal = sorted(relevant.values(), reverse=True)[:k]
        idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal))
        
        if idcg > 0:
            ndcg_scores.append(dcg / idcg)
    
    return np.mean(ndcg_scores) if ndcg_scores else 0.0


def evaluate_results(results_df, qrels_df, k=10):
    """Evaluate results with qrels"""
    # Handle both column naming conventions
    query_col = 'query_id' if 'query_id' in qrels_df.columns else 'query-id'
    corpus_col = 'corpus_id' if 'corpus_id' in qrels_df.columns else 'corpus-id'
    
    qrels = qrels_df.groupby(query_col).apply(
        lambda x: dict(zip(x[corpus_col], x['score']))
    ).to_dict()
    
    results = results_df.groupby('query_id')['corpus_id'].apply(list).to_dict()
    
    ndcg = compute_ndcg(qrels, results, k)
    return {'NDCG@10': ndcg, 'num_queries': len(results), 'num_qrels': len(qrels)}


print("✅ Evaluation functions defined")

✅ Evaluation functions defined
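
A small worked example may help when sanity-checking the metric. The sketch below computes NDCG@k for a single query using only the standard library. It de-duplicates retrieved IDs before scoring so that a repeated hit is counted once. The function name `ndcg_at_k` and the toy data are illustrative.

```python
import math

def ndcg_at_k(relevant, retrieved, k=10):
    """NDCG@k for one query; repeated doc IDs are counted once."""
    ranked = list(dict.fromkeys(retrieved))[:k]  # de-duplicate, keep order
    dcg = sum(relevant.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

relevant = {"d1": 1, "d2": 1}

# Relevant docs at ranks 2 and 3: DCG = 1/log2(3) + 1/log2(4), IDCG = 1 + 1/log2(3)
print(round(ndcg_at_k(relevant, ["d3", "d1", "d2"]), 4))  # 0.6934

# A duplicated hit must not be rewarded more than once
print(ndcg_at_k({"d1": 1}, ["d1", "d1", "d1"]))  # 1.0
```

Without the de-duplication step, the second call would score 1 + 1/log2(3) + 1/log2(4), which is about 2.13 and not a valid NDCG.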


## 🤖 Load Models

In [13]:
print("Loading models...")

# Embedding model
print(f"\n1. Loading embedding model: {CONFIG['embedding_model']}")
embed_model = SentenceTransformer(CONFIG['embedding_model'], device=device)
print("   ✅ Done")

# Reranker model
print(f"\n2. Loading reranker: {CONFIG['reranker_model']}")
reranker = FlagReranker(CONFIG['reranker_model'], use_fp16=(device=='cuda'))
print("   ✅ Done")

print("\n✅ All models loaded!")

Loading models...

1. Loading embedding model: BAAI/bge-large-en-v1.5
   ✅ Done

2. Loading reranker: BAAI/bge-reranker-v2-m3
   ✅ Done

✅ All models loaded!


## 🔄 Main Pipeline

In [14]:
def process_dataset_with_optimal_chunking(dataset_name: str, config: Dict):
    """Process dataset with optimal chunking strategy"""
    print(f"\n{'='*70}")
    print(f"📊 Processing: {dataset_name.upper()}")
    print(f"{'='*70}")
    
    # Load data
    corpus_df, queries_df, qrels_df = load_jsonl_data(dataset_name, config['data_dir'])
    
    # Step 1: Load or Create Chunks
    all_chunks = []
    chunk_to_doc = {}
    
    # Try to load pre-chunked data first (from notebook 2)
    chunking_method_used = None
    if config.get('use_prechunked', False):
        print(f"\n📂 Loading pre-chunked corpus from notebook 2...")
        print(f"   Source: {config['chunked_corpus_dir']}")
        
        all_chunks, chunking_method_used = load_prechunked_corpus(
            dataset_name, 
            config['chunked_corpus_dir'],
            config.get('chunking_config_file')
        )
        
        if all_chunks:
            # Build chunk-to-doc mapping (pre-chunked format uses 'original_id')
            for c in all_chunks:
                chunk_id = c.get('_id', c.get('chunk_id', ''))
                doc_id = c.get('original_id', c.get('doc_id', chunk_id))
                chunk_to_doc[chunk_id] = doc_id
            
            expansion_ratio = len(all_chunks) / len(corpus_df)
            print(f"   📈 Expansion: {len(corpus_df)} docs → {len(all_chunks)} chunks ({expansion_ratio:.2f}x)")
        else:
            print(f"   ⚠️ Pre-chunked data not found, will chunk on-the-fly")
            # load_prechunked_corpus returned None; reset so the fallback can append.
            # Do not flip config['use_prechunked']: the dict is shared, and doing so
            # would disable pre-chunked loading for every later dataset.
            all_chunks = []
    
    # Fallback: chunk on-the-fly if pre-chunked not available (or disabled)
    if not all_chunks:
        print(f"\n✂️ Chunking corpus on-the-fly...")
        print(f"   Method: {config['chunking_method']}")
        print(f"   Size: {config['chunk_size']} chars, Overlap: {config['chunk_overlap']} chars")
        
        if config['use_chunking']:
            for _, row in tqdm(corpus_df.iterrows(), total=len(corpus_df), desc="Chunking"):
                chunks = chunk_document_optimal(
                    row['_id'], 
                    str(row.get('title', '')), 
                    str(row.get('text', '')),
                    config['chunk_size'], 
                    config['chunk_overlap'], 
                    config['preserve_tables']
                )
                for c in chunks:
                    all_chunks.append(c)
                    chunk_to_doc[c['chunk_id']] = c['doc_id']
            
            num_tables = sum(1 for c in all_chunks if c.get('is_table', False))
            expansion_ratio = len(all_chunks) / len(corpus_df)
            print(f"   ✅ Created {len(all_chunks)} chunks from {len(corpus_df)} docs")
            print(f"   📈 Expansion: {expansion_ratio:.2f}x")
            print(f"   📊 Tables preserved: {num_tables}")
        else:
            for _, row in corpus_df.iterrows():
                doc_id = row['_id']
                text = f"[{row.get('title', '')}] {row.get('text', '')}"
                all_chunks.append({'chunk_id': doc_id, 'text': text, 'doc_id': doc_id})
                chunk_to_doc[doc_id] = doc_id
    
    # Extract texts and IDs (handle both formats)
    chunk_texts = [c.get('text', '') for c in all_chunks]
    chunk_ids = [c.get('_id', c.get('chunk_id', '')) for c in all_chunks]
    
    # Ensure chunk_to_doc mapping is complete
    if not chunk_to_doc:
        for c in all_chunks:
            cid = c.get('_id', c.get('chunk_id', ''))
            did = c.get('original_id', c.get('doc_id', cid))
            chunk_to_doc[cid] = did
    
    # Step 2: Embedding
    print(f"\n🔢 Encoding chunks...")
    chunk_embeddings = embed_model.encode(
        chunk_texts, 
        batch_size=config['embed_batch_size'],
        show_progress_bar=True, 
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    print(f"   ✅ Encoded {len(chunk_embeddings)} chunks")
    
    # Step 3: Build FAISS Index
    # Inner product on L2-normalized vectors equals cosine similarity
    print(f"\n🔍 Building FAISS index...")
    index = faiss.IndexFlatIP(chunk_embeddings.shape[1])
    index.add(chunk_embeddings.astype('float32'))
    print(f"   ✅ Index built with {index.ntotal} vectors")
    
    # Step 4: Build BM25 Index (for hybrid)
    bm25 = None
    if config['use_hybrid']:
        print(f"\n🔤 Building BM25 index...")
        tokenized = [t.lower().split() for t in chunk_texts]
        bm25 = BM25Okapi(tokenized)
        print(f"   ✅ BM25 index built")
    
    # Free memory
    del chunk_embeddings
    if device == 'cuda':
        torch.cuda.empty_cache()
    
    # Step 5: Process Queries
    print(f"\n🎯 Processing queries...")
    query_texts = [str(r.get('text', '')) for _, r in queries_df.iterrows()]
    query_ids = queries_df['_id'].tolist()
    
    query_embeddings = embed_model.encode(
        query_texts, 
        batch_size=config['embed_batch_size'],
        show_progress_bar=True, 
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    print(f"   ✅ Encoded {len(query_embeddings)} queries")
    
    # Step 6: Retrieve & Rerank
    print(f"\n🔎 Retrieving and reranking...")
    results = []
    
    for i, query_id in enumerate(tqdm(query_ids, desc="Retrieve+Rerank")):
        query_emb = query_embeddings[i]
        query_text = query_texts[i]
        
        # Hybrid or dense retrieval
        if config['use_hybrid'] and bm25:
            scores, chunk_indices = hybrid_search(
                query_emb, query_text, index, bm25, chunk_texts,
                config['top_k_retrieval'], config['hybrid_alpha']
            )
        else:
            scores, chunk_indices = index.search(
                query_emb.reshape(1, -1).astype('float32'),
                config['top_k_retrieval']
            )
            scores, chunk_indices = scores[0], chunk_indices[0]
        
        # Aggregate chunks to documents
        doc_scores = {}
        for idx, score in zip(chunk_indices, scores):
            doc_id = chunk_to_doc[chunk_ids[idx]]
            if doc_id not in doc_scores:
                doc_scores[doc_id] = []
            doc_scores[doc_id].append(float(score))
        
        doc_agg = aggregate_chunk_scores(doc_scores, config['chunk_aggregation'])
        sorted_docs = sorted(doc_agg.items(), key=lambda x: x[1], reverse=True)[:config['top_k_rerank']]
        
        # Rerank top documents
        candidate_ids = [d[0] for d in sorted_docs]
        candidate_texts = [
            str(corpus_df[corpus_df['_id']==d]['text'].values[0])[:2048]
            for d in candidate_ids
        ]
        
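        pairs = [[query_text, t] for t in candidate_texts]
        # Pass the configured batch size and max length, which were otherwise unused
        rerank_scores = reranker.compute_score(
            pairs,
            batch_size=config['rerank_batch_size'],
            max_length=config['max_length']
        )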
        
        if not isinstance(rerank_scores, list):
            rerank_scores = [rerank_scores]
        
        scored = list(zip(candidate_ids, rerank_scores))
        scored.sort(key=lambda x: x[1], reverse=True)
        
        # Store top-k final results
        for doc_id, score in scored[:config['top_k_final']]:
            results.append({
                'query_id': query_id,
                'corpus_id': doc_id,
                'score': float(score)
            })
    
    results_df = pd.DataFrame(results)
    print(f"   ✅ Generated {len(results_df)} results")
    
    # Step 7: Evaluate
    eval_metrics = {}
    if config['eval_on_qrels'] and qrels_df is not None:
        print(f"\n📊 Evaluating...")
        eval_metrics = evaluate_results(results_df, qrels_df)
        print(f"   NDCG@10: {eval_metrics['NDCG@10']:.4f}")
        print(f"   Queries evaluated: {eval_metrics['num_queries']}")
    
    # Clean up
    del query_embeddings, index
    if device == 'cuda':
        torch.cuda.empty_cache()
    
    return results_df, eval_metrics


print("✅ Pipeline function defined")

✅ Pipeline function defined
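
The chunk-to-document step inside the pipeline can be looked at in isolation. The sketch below applies max aggregation to a list of chunk hits and additionally remembers which chunk produced each document's score. That extra bookkeeping is not in the notebook; it is shown only as one possible way to hand the reranker the best-matching chunk instead of the first 2048 characters of the document. All IDs and scores are invented, and `aggregate_max` is a hypothetical helper.

```python
def aggregate_max(chunk_hits, chunk_to_doc):
    """Collapse (chunk_id, score) hits to {doc_id: (best_score, best_chunk_id)}."""
    best = {}
    for chunk_id, score in chunk_hits:
        doc_id = chunk_to_doc[chunk_id]
        # Keep only the highest-scoring chunk seen so far for this document
        if doc_id not in best or score > best[doc_id][0]:
            best[doc_id] = (score, chunk_id)
    return best

chunk_to_doc = {"a_c0": "a", "a_c1": "a", "b_c0": "b"}
chunk_hits = [("a_c1", 0.91), ("b_c0", 0.85), ("a_c0", 0.40)]

doc_scores = aggregate_max(chunk_hits, chunk_to_doc)
ranked_docs = sorted(doc_scores.items(), key=lambda kv: kv[1][0], reverse=True)
print(ranked_docs)
```

Whether reranking on the best chunk helps is an empirical question for each dataset; the sketch only shows that the information is cheap to keep.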


## 🚀 Run Complete Pipeline

In [None]:
all_results = []
all_eval = {}
failed = []

print("\n" + "="*70)
print("🚀 STARTING OPTIMAL CHUNKING PIPELINE")
print("="*70)

for dataset in CONFIG['datasets']:
    try:
        df_res, metrics = process_dataset_with_optimal_chunking(dataset, CONFIG)
        all_results.append(df_res)
        if metrics:
            all_eval[dataset] = metrics
    except Exception as e:
        print(f"\n❌ Error processing {dataset}: {e}")
        import traceback
        traceback.print_exc()
        failed.append(dataset)

print(f"\n{'='*70}")
print(f"✅ Pipeline completed: {len(all_results)}/{len(CONFIG['datasets'])} datasets")
if failed:
    print(f"❌ Failed datasets: {failed}")
print(f"{'='*70}")


🚀 STARTING OPTIMAL CHUNKING PIPELINE

📊 Processing: CONVFINQA
  Loaded 2066 docs, 421 queries

📂 Loading pre-chunked corpus from notebook 2...
   Source: ../data/chunked_corpus
  ✅ Loaded 8667 pre-chunked chunks
  📋 Method used: recursive (1536/200)
   📈 Expansion: 2066 docs → 8667 chunks (4.20x)

🔢 Encoding chunks...

Batches:   0%|          | 0/542 [00:00<?, ?it/s]

   ✅ Encoded 8667 chunks

🔍 Building FAISS index...
   ✅ Index built with 8667 vectors

🔤 Building BM25 index...
   ✅ BM25 index built

🎯 Processing queries...

Batches:   0%|          | 0/27 [00:00<?, ?it/s]

   ✅ Encoded 421 queries

🔎 Retrieving and reranking...

Retrieve+Rerank:   0%|          | 0/421 [00:00<?, ?it/s]

   ✅ Generated 4210 results

📊 Evaluating...
   NDCG@10: 0.5068
   Queries evaluated: 421

📊 Processing: FINANCEBENCH
  Loaded 180 docs, 150 queries

📂 Loading pre-chunked corpus from notebook 2...
   Source: ../data/chunked_corpus
  ✅ Loaded 539 pre-chunked chunks
  📋 Method used: recursive (768/75)
   📈 Expansion: 180 docs → 539 chunks (2.99x)

🔢 Encoding chunks...

Batches:   0%|          | 0/34 [00:00<?, ?it/s]

   ✅ Encoded 539 chunks

🔍 Building FAISS index...
   ✅ Index built with 539 vectors

🔤 Building BM25 index...
   ✅ BM25 index built

🎯 Processing queries...

Batches:   0%|          | 0/10 [00:00<?, ?it/s]

   ✅ Encoded 150 queries

🔎 Retrieving and reranking...

Retrieve+Rerank:   0%|          | 0/150 [00:00<?, ?it/s]

   ✅ Generated 1500 results

📊 Evaluating...
   NDCG@10: 0.7429
   Queries evaluated: 150

📊 Processing: FINDER
  Loaded 13867 docs, 216 queries

📂 Loading pre-chunked corpus from notebook 2...
   Source: ../data/chunked_corpus
  ✅ Loaded 30511 pre-chunked chunks
  📋 Method used: recursive (512/50)
   📈 Expansion: 13867 docs → 30511 chunks (2.20x)

🔢 Encoding chunks...

Batches:   0%|          | 0/1907 [00:00<?, ?it/s]

   ✅ Encoded 30511 chunks

🔍 Building FAISS index...
   ✅ Index built with 30511 vectors

🔤 Building BM25 index...
   ✅ BM25 index built

🎯 Processing queries...

Batches:   0%|          | 0/14 [00:00<?, ?it/s]

   ✅ Encoded 216 queries

🔎 Retrieving and reranking...

Retrieve+Rerank:   0%|          | 0/216 [00:00<?, ?it/s]

   ✅ Generated 2160 results

📊 Evaluating...
   NDCG@10: 0.3903
   Queries evaluated: 216

📊 Processing: FINQA
  Loaded 2789 docs, 1147 queries

📂 Loading pre-chunked corpus from notebook 2...
   Source: ../data/chunked_corpus
  ✅ Loaded 6036 pre-chunked chunks
  📋 Method used: preserve_tables (2048/200)
   📈 Expansion: 2789 docs → 6036 chunks (2.16x)

🔢 Encoding chunks...

Batches:   0%|          | 0/378 [00:00<?, ?it/s]

   ✅ Encoded 6036 chunks

🔍 Building FAISS index...
   ✅ Index built with 6036 vectors

🔤 Building BM25 index...
   ✅ BM25 index built

🎯 Processing queries...

Batches:   0%|          | 0/72 [00:00<?, ?it/s]

   ✅ Encoded 1147 queries

🔎 Retrieving and reranking...

Retrieve+Rerank:   0%|          | 0/1147 [00:00<?, ?it/s]

   ✅ Generated 11470 results

📊 Evaluating...
   NDCG@10: 0.4190
   Queries evaluated: 1147

📊 Processing: FINQABENCH
  Loaded 92 docs, 100 queries

📂 Loading pre-chunked corpus from notebook 2...
   Source: ../data/chunked_corpus
  ✅ Loaded 419 pre-chunked chunks
  📋 Method used: recursive (512/50)
   📈 Expansion: 92 docs → 419 chunks (4.55x)

🔢 Encoding chunks...

Batches:   0%|          | 0/27 [00:00<?, ?it/s]

   ✅ Encoded 419 chunks

🔍 Building FAISS index...
   ✅ Index built with 419 vectors

🔤 Building BM25 index...
   ✅ BM25 index built

🎯 Processing queries...

Batches:   0%|          | 0/7 [00:00<?, ?it/s]

   ✅ Encoded 100 queries

🔎 Retrieving and reranking...

Retrieve+Rerank:   0%|          | 0/100 [00:00<?, ?it/s]

   ✅ Generated 1000 results

📊 Evaluating...
   NDCG@10: 0.8759
   Queries evaluated: 100

üìä Processing: MULTIHEIRTT
  Loaded 10475 docs, 974 queries

📂 Loading pre-chunked corpus from notebook 2...
   Source: ../data/chunked_corpus
  ✅ Loaded 11241 pre-chunked chunks
  📋 Method used: preserve_tables (3000/300)
   📈 Expansion: 10475 docs → 11241 chunks (1.07x)

🔢 Encoding chunks...


Batches:   0%|          | 0/703 [00:00<?, ?it/s]

   ✅ Encoded 11241 chunks

🔍 Building FAISS index...
   ✅ Index built with 11241 vectors

🔤 Building BM25 index...
   ✅ BM25 index built

🎯 Processing queries...


Batches:   0%|          | 0/61 [00:00<?, ?it/s]

   ✅ Encoded 974 queries

🔎 Retrieving and reranking...


Retrieve+Rerank:   0%|          | 0/974 [00:00<?, ?it/s]

   ✅ Generated 9740 results

📊 Evaluating...
   NDCG@10: 0.1435
   Queries evaluated: 974

📊 Processing: TATQA
  Loaded 2756 docs, 1663 queries

📂 Loading pre-chunked corpus from notebook 2...
   Source: ../data/chunked_corpus
  ✅ Loaded 2756 pre-chunked chunks
  📋 Method used: no_chunking
   📈 Expansion: 2756 docs → 2756 chunks (1.00x)

🔢 Encoding chunks...


Batches:   0%|          | 0/173 [00:00<?, ?it/s]

   ✅ Encoded 2756 chunks

🔍 Building FAISS index...
   ✅ Index built with 2756 vectors

🔤 Building BM25 index...
   ✅ BM25 index built

🎯 Processing queries...


Batches:   0%|          | 0/104 [00:00<?, ?it/s]

   ✅ Encoded 1663 queries

🔎 Retrieving and reranking...


Retrieve+Rerank:   0%|          | 0/1663 [00:00<?, ?it/s]
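
Each completed dataset block in the log above ends with an NDCG@10 value. As a reference for reading those numbers, here is a minimal sketch of NDCG@10 for a single query. It assumes linear gains (the relevance label itself, not `2**rel - 1`) and a `qrels` dict mapping document IDs to relevance labels; the function names are hypothetical and the notebook's own evaluation code may use a different variant.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_ids: document IDs in ranked order (best first).
    qrels: dict {doc_id: relevance}, missing IDs count as 0 (assumed format).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy query: two relevant documents, retrieved at ranks 2 and 3.
toy_qrels = {"d1": 1, "d2": 1}
toy_score = ndcg_at_k(["d3", "d1", "d2"], toy_qrels)
perfect_score = ndcg_at_k(["d1", "d2", "d3"], toy_qrels)
print(f"toy: {toy_score:.4f}, perfect: {perfect_score:.4f}")
```

A dataset-level score is then the mean of this value over the evaluated queries.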

## 📊 Evaluation Summary

In [None]:
if all_eval:
    print("\n" + "="*70)
    print("📊 EVALUATION SUMMARY (NDCG@10)")
    print("="*70)
    
    total_ndcg = 0
    total_weight = 0
    
    # Weight each dataset's NDCG@10 by its number of qrels
    for ds, m in sorted(all_eval.items()):
        print(f"\n{ds.upper():15s}: {m['NDCG@10']:.4f} ({m['num_queries']} queries)")
        total_ndcg += m['NDCG@10'] * m['num_qrels']
        total_weight += m['num_qrels']
    
    if total_weight > 0:
        avg_ndcg = total_ndcg / total_weight
        print(f"\n{'='*70}")
        print(f"📈 WEIGHTED AVERAGE NDCG@10: {avg_ndcg:.4f}")
        print(f"{'='*70}")
        
        # Compare with baseline
        baseline = 0.328  # Original baseline
        gain = avg_ndcg - baseline
        gain_pct = (gain / baseline) * 100
        
        print("\n🎯 Performance vs Baseline:")
        print(f"   Baseline (no chunking): {baseline:.4f}")
        print(f"   Optimal chunking: {avg_ndcg:.4f}")
        # Let the format spec supply the sign so a regression does not print "+-"
        print(f"   Improvement: {gain:+.4f} ({gain_pct:+.1f}%)")
        
        # Performance tier
        if avg_ndcg >= 0.58:
            print("\n🏆 EXCELLENT! Likely TOP 3 performance!")
        elif avg_ndcg >= 0.50:
            print("\n✅ VERY GOOD! Strong improvement!")
        elif avg_ndcg >= 0.40:
            print("\n✅ GOOD! Solid improvement!")
        else:
            print("\n⚠️ More tuning needed")
        
        print(f"\n{'='*70}")
else:
    print("\n⚠️ No evaluation metrics available")


📊 EVALUATION SUMMARY (NDCG@10)

CONVFINQA      : 0.4858 (421 queries)

FINANCEBENCH   : 0.7362 (150 queries)

FINDER         : 0.3953 (216 queries)

FINQA          : 0.4570 (1147 queries)

FINQABENCH     : 0.8662 (100 queries)

MULTIHEIRTT    : 0.1467 (974 queries)

TATQA          : 0.4768 (1663 queries)

📈 WEIGHTED AVERAGE NDCG@10: 0.4168

🎯 Performance vs Baseline:
   Baseline (no chunking): 0.3280
   Optimal chunking: 0.4168
   Improvement: +0.0888 (+27.1%)

✅ GOOD! Solid improvement!
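
The weighted average can be checked by hand from the per-dataset lines in the summary. The sketch below is a standalone recomputation, not the notebook's code: it assumes the query counts printed in the summary can stand in for the `num_qrels` weights, and `weighted_ndcg` is a hypothetical helper name.

```python
def weighted_ndcg(per_dataset):
    """Weighted mean of NDCG@10, given {dataset: (ndcg, weight)}."""
    total_weight = sum(weight for _, weight in per_dataset.values())
    if total_weight == 0:
        raise ValueError("no weights to average over")
    return sum(score * weight for score, weight in per_dataset.values()) / total_weight

# Scores and query counts copied from the summary output
# (assumption: query counts approximate the qrels-based weights).
reported = {
    "convfinqa": (0.4858, 421),
    "financebench": (0.7362, 150),
    "finder": (0.3953, 216),
    "finqa": (0.4570, 1147),
    "finqabench": (0.8662, 100),
    "multiheirtt": (0.1467, 974),
    "tatqa": (0.4768, 1663),
}

avg = weighted_ndcg(reported)
baseline = 0.328
gain = avg - baseline
print(f"weighted NDCG@10: {avg:.4f}")
print(f"gain vs baseline: {gain:+.4f} ({gain / baseline * 100:+.1f}%)")
```

Because the three largest datasets (TATQA, FINQA, MULTIHEIRTT) carry about 81% of the weight, the average sits far below the scores of the small datasets such as FINQABENCH.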



## 💾 Generate Submission

In [None]:
if all_results:
    final_df = pd.concat(all_results, ignore_index=True)
    submission_df = final_df[['query_id', 'corpus_id']]
    
    submission_df.to_csv(CONFIG['output_file'], index=False)
    
    print(f"\n✅ Submission saved: {CONFIG['output_file']}")
    print(f"   Total entries: {len(submission_df):,}")
    print(f"   Unique queries: {submission_df['query_id'].nunique():,}")
    
    print("\n📋 Sample results:")
    print(submission_df.head(15))
    
    # Validation: every query should have exactly 10 results
    counts = submission_df.groupby('query_id').size()
    print("\n🔍 Validation:")
    print(f"   Results per query: {dict(counts.value_counts().sort_index())}")
    
    if (counts == 10).all():
        print("   ✅ All queries have exactly 10 results")
    else:
        print("   ⚠️ Warning: Some queries don't have 10 results")
        print(f"   Queries with != 10 results: {(counts != 10).sum()}")
else:
    print("\n❌ No results to save")


✅ Submission saved: submission_optimal_chunking.csv
   Total entries: 46,710
   Unique queries: 4,671

📋 Sample results:
     query_id  corpus_id
0   qd4982518  dd4c4f7aa
1   qd4982518  dd4bb016e
2   qd4982518  dd4b9f7f6
3   qd4982518  dd4bb5506
4   qd4982518  dd4b87d18
5   qd4982518  dd4be45d6
6   qd4982518  dd4bd3790
7   qd4982518  dd4c0119a
8   qd4982518  dd4b89cbc
9   qd4982518  dd4971510
10  qd49795a8  dd4befb5c
11  qd49795a8  dd4c05bc8
12  qd49795a8  dd4bd7b9c
13  qd49795a8  dd4979602
14  qd49795a8  dd4c4cf78

🔍 Validation:
   Results per query: {10: 4671}
   ✅ All queries have exactly 10 results
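
The validation above only counts rows per query. Since results are aggregated from chunks back to documents, it may also be worth checking that no document appears twice for the same query. The sketch below is a dependency-free illustration on plain `(query_id, corpus_id)` tuples; `validate_submission` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def validate_submission(rows, k=10):
    """Report queries with the wrong number of results or repeated documents.

    rows: iterable of (query_id, corpus_id) pairs.
    k: expected number of results per query.
    """
    per_query = defaultdict(list)
    for query_id, corpus_id in rows:
        per_query[query_id].append(corpus_id)

    wrong_count = sorted(q for q, docs in per_query.items() if len(docs) != k)
    with_duplicates = sorted(
        q for q, docs in per_query.items() if len(set(docs)) != len(docs)
    )
    return {"wrong_count": wrong_count, "with_duplicates": with_duplicates}

# Toy data with k=2: q1 is fine, q2 repeats a document, q3 is one result short.
toy_rows = [
    ("q1", "d1"), ("q1", "d2"),
    ("q2", "d3"), ("q2", "d3"),
    ("q3", "d4"),
]
report = validate_submission(toy_rows, k=2)
print(report)
```

With the real submission, the rows could be taken from `submission_df.itertuples(index=False)`.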


## 🎯 Final Summary

In [None]:
print("\n" + "="*70)
print("🎉 OPTIMAL CHUNKING PIPELINE COMPLETED!")
print("="*70)

# Feature list kept in sync with the dataset-specific chunking configs from notebook 2
print("\n✅ Key Features:")
print("   1. Dataset-specific chunking (method and size chosen per dataset in notebook 2)")
print("   2. Table-aware chunking (preserve_tables) for finqa and multiheirtt")
print("   3. Recursive chunking for the other chunked datasets; tatqa left unchunked")
print("   4. BGE-large embeddings (BAAI/bge-large-en-v1.5)")
print("   5. Reranking with BAAI/bge-reranker-v2-m3")
print("   6. Hybrid search (Dense + BM25)")
print("   7. Max-score aggregation")

if all_eval:
    total_ndcg = sum(m['NDCG@10'] * m['num_qrels'] for m in all_eval.values())
    total_weight = sum(m['num_qrels'] for m in all_eval.values())
    if total_weight > 0:
        avg = total_ndcg / total_weight
        print(f"\n📊 Final NDCG@10: {avg:.4f}")

print(f"\n💾 Output: {CONFIG['output_file']}")
print("\n🚀 Next: Upload to Kaggle!")
print("="*70)


🎉 OPTIMAL CHUNKING PIPELINE COMPLETED!

✅ Key Features:
   1. Dataset-specific chunking (method and size chosen per dataset in notebook 2)
   2. Table-aware chunking (preserve_tables) for finqa and multiheirtt
   3. Recursive chunking for the other chunked datasets; tatqa left unchunked
   4. BGE-large embeddings (BAAI/bge-large-en-v1.5)
   5. Reranking with BAAI/bge-reranker-v2-m3
   6. Hybrid search (Dense + BM25)
   7. Max-score aggregation

📊 Final NDCG@10: 0.4168

💾 Output: submission_optimal_chunking.csv

🚀 Next: Upload to Kaggle!


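
The feature list above names hybrid search and max-score aggregation. A minimal sketch of how these two steps could fit together is shown below. The min-max normalisation, the weight `alpha=0.7`, and the helper names are illustrative assumptions, not the notebook's actual settings.

```python
import numpy as np

def minmax(scores):
    """Scale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return np.zeros_like(scores) if span == 0 else (scores - scores.min()) / span

def hybrid_scores(dense, bm25, alpha=0.7):
    """Weighted sum of normalised dense and BM25 scores (alpha is an assumed weight)."""
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)

def aggregate_max(chunk_doc_ids, chunk_scores, top_k=10):
    """Collapse chunk scores to document scores by keeping each document's best chunk."""
    best = {}
    for doc_id, score in zip(chunk_doc_ids, chunk_scores):
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = float(score)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Toy example: four chunks from three documents (d1 contributes two chunks).
doc_ids = ["d1", "d1", "d2", "d3"]
dense = [0.82, 0.80, 0.40, 0.10]
bm25 = [2.0, 9.0, 1.0, 5.0]

fused = hybrid_scores(dense, bm25, alpha=0.7)
top_docs = aggregate_max(doc_ids, fused, top_k=2)
print(top_docs)
```

Taking the maximum means a document is ranked by its single most relevant chunk, so long documents are not rewarded merely for producing many moderately scored chunks.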