# 04 - Retrieval: Hierarchical Multi-Modal Search

This notebook implements a two-stage hierarchical retrieval system with table-aware query routing: Stage A selects candidate sections, and Stage B runs a hybrid dense + BM25 search inside them.

**Objectives:**
- Stage A: Section-level retrieval
- Stage B: Intra-section dense + BM25 hybrid search
- Query routing for table-centric questions
- Retrieval fusion and reranking
- Evidence collection for QA

In [11]:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'  # Workaround: allow duplicate OpenMP runtimes to load (must be set before the imports below)

import sys
import json
import pickle
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi

sys.path.append(str(Path.cwd().parent / 'src'))

from utils.config import PARSED_DATA_DIR, INDICES_DIR, MODEL_DIR
from retrieval.text_chunker import TextChunker
from retrieval.embedding_generator import EmbeddingGenerator
from retrieval.index_builder import IndexBuilder
from retrieval.query_router import QueryRouter
from retrieval.hierarchical_retriever import HierarchicalRetriever
from retrieval.hybrid_search import HybridSearcher

## 1. Load Indices and Data

In [12]:
print("Loading indices and data...")

# Load FAISS indices
section_index = faiss.read_index(str(INDICES_DIR / "section_index.faiss"))
text_index = faiss.read_index(str(INDICES_DIR / "text_index.faiss"))
table_index = faiss.read_index(str(INDICES_DIR / "table_index.faiss"))

# Load content and metadata
with open(INDICES_DIR / "section_data.pkl", 'rb') as f:
    section_data = pickle.load(f)

with open(INDICES_DIR / "text_data.pkl", 'rb') as f:
    text_data = pickle.load(f)

with open(INDICES_DIR / "table_data.pkl", 'rb') as f:
    table_data = pickle.load(f)

# Load index config
with open(INDICES_DIR / "index_config.json", 'r') as f:
    index_config = json.load(f)

print(f"Loaded {section_index.ntotal} section vectors")
print(f"Loaded {text_index.ntotal} text chunk vectors")
print(f"Loaded {table_index.ntotal} table sentence vectors")

Loading indices and data...
Loaded 14810 section vectors
Loaded 17706 text chunk vectors
Loaded 15838 table sentence vectors
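
FAISS returns integer row positions, so each index is only usable if its vectors line up one-to-one with the pickled records. The sketch below is a minimal sanity check under the assumption that each `*_data.pkl` holds a sequence with one record per vector; the notebook does not show the pickle layout, and `check_alignment` is a hypothetical helper, not part of the project code.

```python
def check_alignment(name, n_vectors, records):
    """Raise if a vector index and its metadata store differ in length."""
    n_records = len(records)
    if n_vectors != n_records:
        raise ValueError(
            f"{name}: index holds {n_vectors} vectors but metadata has {n_records} records"
        )
    return n_records


# Toy stand-ins; in this notebook one might pass section_index.ntotal and
# section_data instead (assuming section_data supports len() per record).
print(check_alignment("toy", 3, ["a", "b", "c"]))
```

Running a check like this right after loading would catch a stale pickle paired with a rebuilt index before any query is issued.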


## 2. Initialize Retrieval Components

In [13]:
# Load the embedding model used at indexing time so queries and documents share one vector space
embedding_model = SentenceTransformer(index_config['embedding_model'])

# Initialize query router
query_router = QueryRouter()

# Initialize hierarchical retriever
hierarchical_retriever = HierarchicalRetriever(
    section_index=section_index,
    text_index=text_index,
    table_index=table_index,
    section_data=section_data,
    text_data=text_data,
    table_data=table_data,
    embedding_model=embedding_model
)

# Initialize hybrid searcher (dense + BM25 fusion)
hybrid_searcher = HybridSearcher(
    dense_weight=0.7,
    bm25_weight=0.3
)

# Optional: Load cross-encoder for reranking
USE_RERANKING = True
if USE_RERANKING:
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    print("Cross-encoder loaded for reranking")

Cross-encoder loaded for reranking
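
The `dense_weight=0.7` and `bm25_weight=0.3` settings imply a weighted combination of two score lists on very different scales. The internals of `HybridSearcher` are not shown here, so the following is only a sketch of one common approach: min-max normalize each list, then take the weighted sum. The function names `min_max` and `fuse` are illustrative assumptions.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant or empty list maps to zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense_scores, bm25_scores, dense_weight=0.7, bm25_weight=0.3):
    """Weighted sum of normalized dense and BM25 scores for the same candidates."""
    if len(dense_scores) != len(bm25_scores):
        raise ValueError("score lists must cover the same candidates")
    d = min_max(dense_scores)
    b = min_max(bm25_scores)
    return [dense_weight * x + bm25_weight * y for x, y in zip(d, b)]


dense = [0.54, 0.53, 0.50]  # cosine-like similarities, tightly clustered
bm25 = [0.0, 7.0, 3.5]      # unbounded lexical scores
fused = fuse(dense, bm25)
print([round(s, 3) for s in fused])
```

Without the normalization step, a raw BM25 score of 7.0 would swamp dense similarities that differ only in the second decimal place, regardless of the weights.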


## 3. Query Routing: Classify Query Type

In [14]:
# Test query routing
test_queries = [
    "Report the YoY change in R&D expense for 2022 to 2024",
    "What is the ratio of long-term debt to equity in 2023?",
    "Which operating segment contributed most to 2024 revenue growth?",
    "Explain the company's business strategy",
    "What are the main risk factors?"
]

print("=== Query Routing ===")
for query in test_queries:
    route_info = query_router.route(query)
    print(f"\nQuery: {query}")
    print(f"  Type: {route_info['query_type']}")
    print(f"  Table-centric: {route_info['is_table_centric']}")
    print(f"  Requires math: {route_info['requires_math']}")
    print(f"  Confidence: {route_info['confidence']:.2f}")

=== Query Routing ===

Query: Report the YoY change in R&D expense for 2022 to 2024
  Type: numeric_table
  Table-centric: True
  Requires math: True
  Confidence: 0.80

Query: What is the ratio of long-term debt to equity in 2023?
  Type: numeric_table
  Table-centric: True
  Requires math: True
  Confidence: 0.80

Query: Which operating segment contributed most to 2024 revenue growth?
  Type: numeric_table
  Table-centric: True
  Requires math: True
  Confidence: 0.80

Query: Explain the company's business strategy
  Type: narrative
  Table-centric: False
  Requires math: False
  Confidence: 0.60

Query: What are the main risk factors?
  Type: narrative
  Table-centric: False
  Requires math: False
  Confidence: 0.60
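
The routing output above has four fields: `query_type`, `is_table_centric`, `requires_math`, and `confidence`. The rules inside `QueryRouter` are not shown in this notebook, so the sketch below is a hypothetical keyword-based router that merely reproduces the same output shape. The cue lists and the 0.8 / 0.6 confidence constants are placeholders chosen for illustration.

```python
import re

# Hypothetical cue lists; the real QueryRouter's rules are not shown here.
NUMERIC_CUES = {"ratio", "change", "growth", "yoy", "percent", "margin", "most", "total"}
MATH_CUES = {"ratio", "change", "growth", "yoy", "percent", "most"}


def route(query):
    """Classify a query as table-centric or narrative from surface cues."""
    tokens = set(re.findall(r"[a-z&]+", query.lower()))
    has_year = re.search(r"\b(19|20)\d{2}\b", query) is not None
    is_table = bool(tokens & NUMERIC_CUES) or has_year
    return {
        "query_type": "numeric_table" if is_table else "narrative",
        "is_table_centric": is_table,
        "requires_math": bool(tokens & MATH_CUES),
        "confidence": 0.8 if is_table else 0.6,  # placeholder constants
    }


for q in ["What is the ratio of long-term debt to equity in 2023?",
          "What are the main risk factors?"]:
    print(q, "->", route(q)["query_type"])
```

A rule-based router of this kind is cheap and transparent, but it treats any mention of a year as a table cue, which may over-route narrative questions about a specific fiscal year.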


## 4. Hierarchical Retrieval: Stage A (Section Selection)

In [15]:
# Test Stage A retrieval
query = "What is the ratio of long-term debt to equity in 2023?"

print(f"Query: {query}\n")

# Stage A: Retrieve relevant sections
top_sections = hierarchical_retriever.retrieve_sections(
    query=query,
    k=5
)

print("=== Stage A: Top Sections ===")
for i, section_result in enumerate(top_sections):
    meta = section_result['metadata']
    score = section_result['score']
    print(f"{i+1}. [{meta['ticker']} {meta['fiscal_year']}] {meta['section_title']}")
    print(f"   Score: {score:.4f}\n")

Query: What is the ratio of long-term debt to equity in 2023?

=== Stage A: Top Sections ===
1. [APA 2024] Section 1
   Score: 0.5384

2. [APA 2022] Section 1
   Score: 0.5326

3. [APA 2023] Section 1
   Score: 0.5309

4. [CTVA 2024] Section 1
   Score: 0.5283

5. [TSN 2024] Section 1
   Score: 0.5269



## 5. Hierarchical Retrieval: Stage B (Intra-Section Search)

In [16]:
# Stage B: Hybrid search within selected sections
query = "What is the ratio of long-term debt to equity in 2023?"
route_info = query_router.route(query)

# Retrieve with routing
results = hierarchical_retriever.retrieve(
    query=query,
    route_info=route_info,
    top_k_sections=5,
    top_k_content=10,
    use_hybrid=True
)

print("=== Stage B: Top Content (Text + Tables) ===")
print(f"Query type: {route_info['query_type']}\n")

for i, result in enumerate(results['content'][:10]):
    meta = result['metadata']
    content_type = meta['content_type']
    score = result['score']
    
    print(f"{i+1}. [{content_type.upper()}] Score: {score:.4f}")
    print(f"   {meta['ticker']} {meta['fiscal_year']} - {meta.get('section', meta.get('section_title', 'N/A'))}")
    print(f"   Content: {result['content'][:150]}...\n")

=== Stage B: Top Content (Text + Tables) ===
Query type: numeric_table

1. [TABLE] Score: 6.9865
   MTB 2023 - Unknown
   Content: : II. Investments in debt securities   None...

2. [TABLE] Score: 6.6350
   D 2024 - Unknown
   Content: LTIP:  Long-term incentive program...

3. [TABLE] Score: 4.2281
   HCA 2024 - Unknown
   Content: :  2023   Ratio   2022   Ratio   2021   Ratio  None None None None None None...

4. [TABLE] Score: 4.2263
   HCA 2023 - Unknown
   Content: :  2022   Ratio   2021   Ratio   2020   Ratio  None None None None None None...

5. [TABLE] Score: 3.2434
   MKTX 2024 - Unknown
   Content: (3): For emerging markets debt, the amount of new issuance is according to J.P. Morgan Markets. The amount of new issuance excludes debt issued by eme...

6. [TABLE] Score: 3.0117
   KIM 2022 - Unknown
   Content: : ● improving debt metrics and upgraded unsecured debt ratings...

7. [TABLE] Score: 2.9716
   MKTX 2023 - Unknown
   Content: (2): For emerging markets debt, ADTV is as m

## 6. Hybrid Search: Dense + BM25 Fusion

In [17]:
# Demonstrate hybrid search fusion
query = "R&D expense growth 2023 to 2024"

print(f"Query: {query}\n")

# Dense retrieval only
dense_results = hierarchical_retriever.retrieve(
    query=query,
    route_info={'is_table_centric': True},
    top_k_content=5,
    use_hybrid=False
)

print("=== Dense Retrieval Only ===")
for i, result in enumerate(dense_results['content'][:5]):
    print(f"{i+1}. [{result['metadata']['content_type']}] {result['content'][:100]}...")

# Hybrid retrieval (dense + BM25)
hybrid_results = hierarchical_retriever.retrieve(
    query=query,
    route_info={'is_table_centric': True},
    top_k_content=5,
    use_hybrid=True
)

print("\n=== Hybrid Retrieval (Dense + BM25) ===")
for i, result in enumerate(hybrid_results['content'][:5]):
    print(f"{i+1}. [{result['metadata']['content_type']}] {result['content'][:100]}...")

Query: R&D expense growth 2023 to 2024

=== Dense Retrieval Only ===
1. [table] Research and development expense:   109,181    87,581    219,112    82,310    107,182    41,301    6...
2. [table] Research and development expense:   122,389    105,021    236,380    92,801    149,094    47,919    ...
3. [table] Research and development expense:   145,500    129,626    246,050    114,604    204,244    54,989   ...
4. [table] : • business outlook for 2023 and beyond;...
5. [table] : • business outlook for 2024 and beyond;...

=== Hybrid Retrieval (Dense + BM25) ===
1. [table] ​:  2024   2023   2022  ...
2. [table] ​:  2024  2023  2022  2021  2020...
3. [table] (in millions):  2024   2023  None None...
4. [table] :  2024   2023   Percent Change  None None None...
5. [table] :  2024   2023   Actual   ConstantCurrency  None None None None...
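
In the comparison above, the hybrid run surfaces year-header rows such as `2024   2023   2022`, while the dense run surfaces the R&D expense rows. One plausible reason is that short rows made up mostly of query tokens score very highly under BM25. The sketch below is a pure-Python, textbook BM25 that reproduces this effect on a toy corpus. It is not the `rank_bm25` implementation, and `k1=1.5`, `b=0.75` are commonly used values assumed for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with a textbook BM25 formula."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "2024 2023 2022".split(),                                  # bare header row
    "research and development expense 145,500 129,626".split(),
    "business outlook for 2024 and beyond".split(),
]
query = "r&d expense growth 2023 to 2024".split()
scores = bm25_scores(query, docs)
print([round(s, 3) for s in scores])
```

The header row wins because it matches two query terms and is shorter than average, so length normalization boosts it further. Down-weighting year tokens or filtering out header-only rows before fusion are possible mitigations.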


## 7. Cross-Encoder Reranking

In [18]:
if USE_RERANKING:
    query = "operating segment revenue contribution 2024"
    
    # Initial retrieval
    initial_results = hierarchical_retriever.retrieve(
        query=query,
        route_info={'is_table_centric': True},
        top_k_content=20,
        use_hybrid=True
    )
    
    # Prepare pairs for reranking
    pairs = [[query, result['content']] for result in initial_results['content']]
    
    # Rerank
    rerank_scores = cross_encoder.predict(pairs)
    
    # Sort by rerank scores
    reranked_indices = np.argsort(rerank_scores)[::-1]
    
    print(f"Query: {query}\n")
    print("=== Reranked Results ===")
    for i, idx in enumerate(reranked_indices[:5]):
        result = initial_results['content'][idx]
        original_score = result['score']
        rerank_score = rerank_scores[idx]
        
        print(f"{i+1}. [{result['metadata']['content_type']}]")
        print(f"   Original: {original_score:.4f}, Reranked: {rerank_score:.4f}")
        print(f"   {result['content'][:120]}...\n")

Query: operating segment revenue contribution 2024

=== Reranked Results ===
1. [table]
   Original: 7.5742, Reranked: -4.1282
   Contracted Assets:  Contracted Assets operating segment...

2. [table]
   Original: 3.1250, Reranked: -4.3758
   Contracted Energy:  Contracted Energy operating segment, formerly known as the Contracted Assets operating segment...

3. [table]
   Original: 0.4222, Reranked: -4.5615
   Percentage of Total Revenues:    2023  2022  2021...

4. [table]
   Original: 2.4149, Reranked: -5.0173
   ​:  2024   2023   2022  ...

5. [table]
   Original: 2.2467, Reranked: -5.0614
   June 12, 2024:   August 15, 2024    September 12, 2024   $ 0.75   $ 5,575 ...
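
The reranking step is independent of the scorer: re-score each `(query, content)` pair, then sort. The sketch below isolates that pattern with a toy token-overlap scorer standing in for `cross_encoder.predict`; the `rerank`, `overlap_score`, and `sigmoid` helpers are illustrative assumptions. The negative reranked values above look like unnormalized model outputs. A logistic squash maps them to (0, 1) for readability without changing their order.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rerank(query, candidates, score_fn, top_k=5):
    """Re-score candidates with score_fn(query, text) and return the best top_k."""
    scored = [
        {**cand, "rerank_score": score_fn(query, cand["content"])}
        for cand in candidates
    ]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_k]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query tokens found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


candidates = [
    {"content": "2024 2023 2022", "score": 2.41},
    {"content": "Contracted Assets operating segment", "score": 7.57},
    {"content": "Percentage of total revenue by segment 2024", "score": 0.42},
]
top = rerank("operating segment revenue contribution 2024", candidates, overlap_score, top_k=2)
for item in top:
    print(item["rerank_score"], item["content"])
```

Keeping the original `score` next to `rerank_score`, as the notebook cell does, makes it easy to see where the reranker disagrees with first-stage retrieval.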



## 8. Multi-Query Test Suite

In [19]:
# Comprehensive test queries
test_suite = [
    {
        'query': "Report the YoY change in R&D expense for 2022 to 2024",
        'expected_content': ['table', 'text'],
        'expected_sections': ['Financial Statements', "Management's Discussion"]
    },
    {
        'query': "What is the ratio of long-term debt to equity in 2023?",
        'expected_content': ['table'],
        'expected_sections': ['Balance Sheet', 'Financial Statements']
    },
    {
        'query': "Which operating segment contributed most to revenue growth?",
        'expected_content': ['table', 'text'],
        'expected_sections': ['Segment Information', "Management's Discussion"]
    },
    {
        'query': "What are the main business risks?",
        'expected_content': ['text'],
        'expected_sections': ['Risk Factors']
    }
]

print("=== Retrieval Test Suite ===")

for test in test_suite:
    query = test['query']
    route_info = query_router.route(query)
    
    results = hierarchical_retriever.retrieve(
        query=query,
        route_info=route_info,
        top_k_sections=3,
        top_k_content=5,
        use_hybrid=True
    )
    
    print(f"\n{'='*80}")
    print(f"Query: {query}")
    print(f"Route: {route_info['query_type']} (table-centric: {route_info['is_table_centric']})")
    
    # Check content types
    content_types = [r['metadata']['content_type'] for r in results['content']]
    print(f"\nContent types retrieved: {set(content_types)}")
    
    # Show top results
    print("\nTop 3 results:")
    for i, result in enumerate(results['content'][:3]):
        meta = result['metadata']
        print(f"{i+1}. [{meta['content_type']}] {meta.get('section', meta.get('section_title', 'N/A'))}")
        print(f"   {result['content'][:100]}...")

=== Retrieval Test Suite ===

Query: Report the YoY change in R&D expense for 2022 to 2024
Route: numeric_table (table-centric: True)

Content types retrieved: {'table'}

Top 3 results:
1. [table] Unknown
   ​:  2024   2023   2022  ...
2. [table] Unknown
   :  2024   2023   Percent Change  None None None...
3. [table] Unknown
   ​:  2024  2023  2022  2021  2020...

Query: What is the ratio of long-term debt to equity in 2023?
Route: numeric_table (table-centric: True)

Content types retrieved: {'table'}

Top 3 results:
1. [table] Unknown
   : II. Investments in debt securities   None...
2. [table] Unknown
   :  2022   Ratio   2021   Ratio   2020   Ratio  None None None None None None...
3. [table] Unknown
   :  2023   Ratio   2022   Ratio   2021   Ratio  None None None None None None...

Query: Which operating segment contributed most to revenue growth?
Route: numeric_table (table-centric: True)

Content types retrieved: {'table'}

Top 3 results:
1. [table] Unknown
   Contracted Assets
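
Each test case declares `expected_content` and `expected_sections`, but the loop above only prints what was retrieved. A minimal sketch of an automated comparison follows. `check_expectations` is a hypothetical helper, and it assumes results carry the same `metadata` keys that the loop reads.

```python
def check_expectations(test, retrieved):
    """Compare retrieved results against a test case's expected types and sections."""
    got_types = {r["metadata"]["content_type"] for r in retrieved}
    got_sections = {
        str(r["metadata"].get("section", r["metadata"].get("section_title", "N/A")))
        for r in retrieved
    }
    expected_types = set(test["expected_content"])
    return {
        "missing_content_types": sorted(expected_types - got_types),
        # Case-insensitive substring match between expected and retrieved section names.
        "section_hit": any(
            exp.lower() in sec.lower()
            for exp in test["expected_sections"]
            for sec in got_sections
        ),
    }


# Toy input shaped like the first test case and its table-only results.
test_case = {
    "expected_content": ["table", "text"],
    "expected_sections": ["Financial Statements", "Management's Discussion"],
}
retrieved = [
    {"metadata": {"content_type": "table", "section": "Unknown"}},
    {"metadata": {"content_type": "table", "section": "Unknown"}},
]
print(check_expectations(test_case, retrieved))
```

On results like those printed above (table-only content with `Unknown` sections), this check would report `text` as missing and no section hit.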

## 9. Retrieval Metrics

In [20]:
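# Calculate retrieval statistics
import time

retrieval_stats = {
    'total_queries': len(test_suite),
    'avg_results_per_query': [],   # per-query result counts, averaged when printed
    'content_type_distribution': {'text': 0, 'table': 0},
    'avg_retrieval_time': []       # per-query latencies in seconds, averaged when printed
}

for test in test_suite:
    query = test['query']
    route_info = query_router.route(query)
    
    # perf_counter() is a monotonic, high-resolution clock suited to timing short calls
    start_time = time.perf_counter()
    # top_k_sections and use_hybrid are left at the retriever's defaults here
    results = hierarchical_retriever.retrieve(
        query=query,
        route_info=route_info,
        top_k_content=10
    )
    retrieval_time = time.perf_counter() - start_time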
    
    retrieval_stats['avg_results_per_query'].append(len(results['content']))
    retrieval_stats['avg_retrieval_time'].append(retrieval_time)
    
    for result in results['content']:
        content_type = result['metadata']['content_type']
        if content_type in retrieval_stats['content_type_distribution']:
            retrieval_stats['content_type_distribution'][content_type] += 1

print("\n=== Retrieval Statistics ===")
print(f"Total queries tested: {retrieval_stats['total_queries']}")
print(f"Avg results per query: {np.mean(retrieval_stats['avg_results_per_query']):.2f}")
print(f"Avg retrieval time: {np.mean(retrieval_stats['avg_retrieval_time'])*1000:.2f} ms")
print(f"Content type distribution: {retrieval_stats['content_type_distribution']}")


=== Retrieval Statistics ===
Total queries tested: 4
Avg results per query: 10.00
Avg retrieval time: 61.95 ms
Content type distribution: {'text': 10, 'table': 30}
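
The statistics above cover volume and latency but not relevance. If relevance labels were available (the notebook has none, so the IDs below are invented for illustration), recall@k and mean reciprocal rank could be computed as in this sketch. `recall_at_k` and `reciprocal_rank` are hypothetical helpers.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical (retrieved, relevant) ID lists for two queries.
runs = [
    (["t7", "t2", "t9"], ["t2"]),
    (["t1", "t4", "t5"], ["t5", "t8"]),
]
mean_recall = sum(recall_at_k(r, rel, 3) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(f"recall@3={mean_recall:.2f}  MRR={mrr:.3f}")
```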


## Next Steps

Proceed to **05_qa_generation.ipynb** to implement answer generation with the LLM reader and math verification.