# Retrieval Module Verification

This notebook tests the `src.retrieval` module, which handles:
- **Dense Vector Search** using Qdrant + BGE embeddings
- **Cross-Encoder Reranking** for high precision (top 20 → top 5)
- **Parent Content Extraction** via `format_docs_for_gen()`

Uses a **small subset (50 docs)** for fast testing.
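
The "top 20 → top 5" reranking step listed above can be illustrated with a minimal sketch. It is not the `src.retrieval` implementation: `rerank` and `token_overlap_score` are hypothetical names, and the token-overlap scorer is only a toy stand-in for a real cross-encoder, which would score each (query, passage) pair with a neural model.

```python
from typing import Callable, List, Tuple


def token_overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float] = token_overlap_score,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_k highest-scoring ones."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    # list.sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# 20 first-stage candidates, only two of which mention the query terms.
candidates = [f"filler passage number {i}" for i in range(18)]
candidates += ["state government regulations on permits", "government offices"]
top = rerank("government regulations", candidates, top_k=5)
```

Passing the scorer as a parameter keeps the selection logic independent of the model, so a real cross-encoder could be swapped in without changing `rerank`.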

In [1]:
import sys
import os
import json
import zipfile

sys.path.append(os.path.abspath(".."))
PROJECT_ROOT = os.path.abspath("..")
QDRANT_PATH = os.path.join(PROJECT_ROOT, "qdrant_test_db")
MAX_DOCS = 50

print(f"Project root: {PROJECT_ROOT}")
print(f"Test subset size: {MAX_DOCS} documents")

Project root: /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8
Test subset size: 50 documents


## Step 0: Prepare Test Collection

Create a small Qdrant collection for testing retrieval.

In [2]:
# Extract corpus if needed
corpus_dir = os.path.join(PROJECT_ROOT, "dataset/corpora/passage_level")
jsonl_file = os.path.join(corpus_dir, "govt.jsonl")
zip_file = os.path.join(corpus_dir, "govt.jsonl.zip")

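if os.path.exists(jsonl_file):
    print("✅ Corpus ready: govt.jsonl")
elif os.path.exists(zip_file):
    print("📦 Extracting corpus...")
    with zipfile.ZipFile(zip_file, 'r') as zf:
        zf.extractall(corpus_dir)
    print("✅ Corpus extracted")
else:
    # Fail early instead of reporting a corpus that is not there
    raise FileNotFoundError(f"Neither {jsonl_file} nor {zip_file} was found")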

✅ Corpus ready: govt.jsonl


In [3]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore

# Check if collection exists
need_create = True
if os.path.exists(QDRANT_PATH):
    client = QdrantClient(path=QDRANT_PATH)
    try:
        info = client.get_collection("mtrag_test")
        print(f"✅ Test collection exists: {info.points_count} points")
        need_create = False
    except Exception:
        # Collection is missing or unreadable; rebuild it below
        pass
    finally:
        # Always release the local storage handle before it is reopened
        client.close()

if need_create:
    print("🔄 Creating test collection...")
    
    # Load subset
    docs = []
    with open(jsonl_file, 'r') as f:
        for i, line in enumerate(f):
            if i >= MAX_DOCS:
                break
            item = json.loads(line)
            text = item.get("text", "").strip()
            if text:
                docs.append(Document(page_content=text, metadata={"doc_id": item.get("id", str(i))}))
    print(f"   • Loaded {len(docs)} documents")
    
    # Chunk
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(docs)
    print(f"   • Split into {len(chunks)} chunks")
    
    # Build embeddings (using small model for speed)
    print("   • Building embeddings (bge-small-en)...")
    embedding_model = HuggingFaceEmbeddings(
        model_name="BAAI/bge-small-en-v1.5",
        model_kwargs={"device": "cpu"}
    )
    
    # Create collection
    client = QdrantClient(path=QDRANT_PATH)
    if client.collection_exists("mtrag_test"):
        client.delete_collection("mtrag_test")
    client.create_collection(
        collection_name="mtrag_test",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE)
    )
    
    vectorstore = QdrantVectorStore(client=client, collection_name="mtrag_test", embedding=embedding_model)
    vectorstore.add_documents(chunks)
    print(f"\n✅ Test collection created with {len(chunks)} chunks")


✅ Test collection exists: 294 points


## Step 1: Initialize Retriever

Create a retriever from the vector store.
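
The test collection is configured with cosine distance, so a top-k query conceptually ranks stored vectors by cosine similarity to the query embedding. The brute-force sketch below illustrates that ranking on toy 2-D vectors. The function names are hypothetical, and this is not how Qdrant is implemented internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k vectors most similar to the query."""
    scored = [(idx, cosine_similarity(query_vec, vec)) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings"; the real model in this notebook produces 384 dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = top_k_by_cosine([1.0, 0.0], doc_vecs, k=2)
```

Here the vector pointing in the same direction as the query ranks first, followed by the one at 45 degrees.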

In [4]:
# Reinitialize for clean state
embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    model_kwargs={"device": "cpu"}
)

client = QdrantClient(path=QDRANT_PATH)
vectorstore = QdrantVectorStore(
    client=client, 
    collection_name="mtrag_test", 
    embedding=embedding_model
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

print(f"📊 Retriever Configuration:")
print(f"   • Type: {type(retriever).__name__}")
print(f"   • Top-K: 5 documents")
print(f"   • Embedding model: bge-small-en-v1.5")

print("\n✅ Retriever initialized!")

📊 Retriever Configuration:
   • Type: VectorStoreRetriever
   • Top-K: 5 documents
   • Embedding model: bge-small-en-v1.5

✅ Retriever initialized!


## Step 2: Test Retrieval

Query the retriever and examine results.

In [5]:
query = "government regulations"
print(f"🔍 Query: '{query}'")

docs = retriever.invoke(query)

print(f"\n📊 Retrieved {len(docs)} documents:")
for i, doc in enumerate(docs[:3]):
    print(f"\n   Document {i+1}:")
    print(f"   • Content: {doc.page_content[:100]}...")
    print(f"   • Doc ID: {doc.metadata.get('doc_id', 'N/A')}")

print("\n✅ Retrieval working correctly!")

🔍 Query: 'government regulations'

📊 Retrieved 5 documents:

   Document 1:
   • Content: Regulatory

Regulatory

DEC uses policies and regulations to limit environmental impacts. We issue p...
   • Doc ID: 50136ccb92494130-1579-3721

   Document 2:
   • Content: Search

 
 
 

 
HomeFAQs

Search

All categories
Code Enforcement
Construction Inspections
Drone FA...
   • Doc ID: b4521304b77788b6-2-2025

   Document 3:
   • Content: recreating, please review the map and rules that apply to the places you want to use....
   • Doc ID: 50136ccb92494130-5960-8099

✅ Retrieval working correctly!


## Step 3: Test `format_docs_for_gen()`

This function from `src.retrieval`:
1. Extracts parent content from retrieved documents
2. Deduplicates to avoid repetition
3. Concatenates into context string for LLM
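
The three steps above can be sketched as follows. This is an illustration under assumptions, not the actual `src.retrieval` code: `format_docs_sketch`, the `Doc` stand-in class, the `parent_text` metadata key, and the blank-line separator are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_docs_sketch(docs: List[Doc], separator: str = "\n\n") -> str:
    """Extract parent text, drop duplicates in order, and join into one context string."""
    seen = set()
    parts = []
    for doc in docs:
        # "parent_text" is a hypothetical metadata key; fall back to the chunk itself.
        text = doc.metadata.get("parent_text", doc.page_content).strip()
        if text and text not in seen:
            seen.add(text)
            parts.append(text)
    return separator.join(parts)


sample = [
    Doc("chunk a1", {"parent_text": "Parent A"}),
    Doc("chunk a2", {"parent_text": "Parent A"}),  # same parent, should appear once
    Doc("chunk b1"),                               # no parent, chunk text is used
]
context_sketch = format_docs_sketch(sample)
```

Two chunks that share a parent contribute that parent only once, which is why the context can be shorter than the sum of the retrieved chunks.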

In [6]:
from src.retrieval import format_docs_for_gen

context = format_docs_for_gen(docs)

print(f"📊 Context Statistics:")
print(f"   • Total length: {len(context)} characters")
# len(docs) counts retrieved chunks; deduplication happens inside format_docs_for_gen()
print(f"   • Retrieved documents: {len(docs)}")

print(f"\n📄 Context Preview (first 300 chars):")
print(f"   {context[:300]}...")

print("\n✅ format_docs_for_gen() working correctly!")

📊 Context Statistics:
   • Total length: 1745 characters
   • Retrieved documents: 5

📄 Context Preview (first 300 chars):
   Regulatory

Regulatory

DEC uses policies and regulations to limit environmental impacts. We issue permits and licenses to individuals, municipalities, and corporations so they can comply with these regulations.

Regulations and Enforcement

Regulations

Regulatory Agenda

Guidance And Policy Docume...

✅ format_docs_for_gen() working correctly!


## Cleanup
Remove test files after verification.
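
One way to make sure the cleanup runs even if an earlier cell raises is to wrap the test database directory in a context manager. The sketch below uses only the standard library. `temporary_db_dir` is a hypothetical helper, not part of this project.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_db_dir(prefix: str = "qdrant_test_"):
    """Create a scratch directory and remove it on exit, even if the body raises."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


with temporary_db_dir() as db_path:
    # A local vector store could be created under db_path here.
    existed_inside = os.path.isdir(db_path)
exists_after = os.path.exists(db_path)
```

Any client holding the directory open should still be closed before the block exits, as the cleanup cell does.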

In [7]:
import shutil

# Close client first
client.close()

# Remove test database
if os.path.exists(QDRANT_PATH):
    shutil.rmtree(QDRANT_PATH)
    print(f"🗑️ Removed test database: {QDRANT_PATH}")

print("\n✅ All retrieval tests passed!")

🗑️ Removed test database: /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/qdrant_test_db

✅ All retrieval tests passed!
