# SemEval 2026 Task 8 - Task A: Retrieval

This notebook implements **Task A: Retrieval** for MTRAGEval.

It indexes all four domains (govt, clapnq, fiqa, cloud) into a single **unified collection**.

**Goal:** Given a conversation, retrieve the top-K most relevant documents.

## 1. Setup & Imports

In [1]:
import os
import sys
import json
import zipfile
from tqdm import tqdm
from pathlib import Path

if os.path.exists("src"):
    PROJECT_ROOT = os.getcwd()
else:
    PROJECT_ROOT = os.path.abspath("..")

if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from src.ingestion import load_and_chunk_data, build_vector_store
from src.retrieval import get_retriever, get_qdrant_client


In [2]:
# --- CONFIGURATION ---
TEAM_NAME = "Gbgers"
DOMAINS = ["govt", "clapnq", "fiqa", "cloud"]
TOP_K_RETRIEVE = 20
TOP_K_RERANK = 5

# UNIFIED COLLECTION NAME
COLLECTION_NAME = "mtrag_unified"

# TEST MODE: Set to True for quick verification
TEST_MODE = True
TEST_SUBSET_SIZE = 1000   # Number of chunks to index per domain
TEST_QUERY_LIMIT = 10     # Number of queries to process per domain

CORPUS_BASE_DIR = os.path.join(PROJECT_ROOT, "dataset/corpora/passage_level")
CONVERSATIONS_FILE = os.path.join(PROJECT_ROOT, "dataset/human/conversations/conversations.json")
QDRANT_PATH = os.path.join(PROJECT_ROOT, "qdrant_db")
OUTPUT_DIR = os.path.join(PROJECT_ROOT, "data/submissions")
OUTPUT_FILE = os.path.join(OUTPUT_DIR, f"submission_TaskA_{TEAM_NAME}.jsonl")

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(QDRANT_PATH, exist_ok=True)

if TEST_MODE:
    print(f"⚠️ TEST MODE: Indexing {TEST_SUBSET_SIZE} chunks/domain, {TEST_QUERY_LIMIT} queries/domain")

⚠️ TEST MODE: Indexing 1000 chunks/domain, 10 queries/domain


## 2. Helper Functions

In [3]:
def extract_last_query(messages):
    """Extract last user question from messages."""
    for msg in reversed(messages):
        if msg.get("speaker") == "user":
            return msg.get("text", "")
    return ""

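def get_corpus_file(domain):
    """Return the corpus .jsonl path for a domain, extracting it from the zip if needed.

    Returns None when neither the .jsonl file nor a usable .zip archive exists.
    """
    jsonl_path = os.path.join(CORPUS_BASE_DIR, f"{domain}.jsonl")
    zip_path = os.path.join(CORPUS_BASE_DIR, f"{domain}.jsonl.zip")

    if not os.path.exists(jsonl_path):
        if not os.path.exists(zip_path):
            return None
        print(f"📦 Extracting {domain}.jsonl...")
        with zipfile.ZipFile(zip_path, 'r') as zf:
            zf.extractall(CORPUS_BASE_DIR)
        # The archive may not contain a file with the expected name
        if not os.path.exists(jsonl_path):
            return None
    return jsonl_path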
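
As a quick illustration of `extract_last_query`, the snippet below runs it on a hypothetical three-turn conversation. The message contents are invented for the example; only the `speaker` and `text` keys follow the notebook's code. The function definition is repeated so the snippet runs on its own.

```python
def extract_last_query(messages):
    """Extract last user question from messages."""
    for msg in reversed(messages):
        if msg.get("speaker") == "user":
            return msg.get("text", "")
    return ""

# Hypothetical conversation, invented for illustration
sample_messages = [
    {"speaker": "user", "text": "What is a Roth IRA?"},
    {"speaker": "agent", "text": "A Roth IRA is a retirement account funded with after-tax money."},
    {"speaker": "user", "text": "Can I withdraw from it early?"},
]

last_query = extract_last_query(sample_messages)
print(last_query)  # Can I withdraw from it early?

# A conversation without any user turn yields an empty string
no_user_query = extract_last_query([{"speaker": "agent", "text": "Hello"}])
```

Because only the final user turn is returned, a follow-up question that depends on earlier turns (the "it" above) reaches the retriever without that context.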

## 3. Build Unified Collection

In [4]:
# Check if collection already exists
need_build = True

if os.path.exists(QDRANT_PATH):
    try:
        client = get_qdrant_client(QDRANT_PATH)
        if client.collection_exists(COLLECTION_NAME):
            info = client.get_collection(COLLECTION_NAME)
            print(f"✅ Unified collection '{COLLECTION_NAME}' exists ({info.points_count} vectors)")
            need_build = False
    except Exception as e:
        print(f"⚠️ Warning: {e}")

if need_build:
    print(f"🔄 Building unified collection '{COLLECTION_NAME}' with all domains...")
    all_docs = []
    
    for domain in DOMAINS:
        corpus_path = get_corpus_file(domain)
        if not corpus_path:
            print(f"⚠️ Corpus not found for {domain}, skipping...")
            continue
        
        print(f"📂 Loading {domain}...")
        docs = load_and_chunk_data(corpus_path)
        
        # Add domain metadata
        for doc in docs:
            doc.metadata["domain"] = domain
        
        if TEST_MODE and len(docs) > TEST_SUBSET_SIZE:
            print(f"✂️ TEST MODE: Slicing {domain} to {TEST_SUBSET_SIZE} chunks")
            docs = docs[:TEST_SUBSET_SIZE]
        
        all_docs.extend(docs)
        print(f"   Added {len(docs)} chunks")
    
    print(f"📊 Total documents: {len(all_docs)}")
    build_vector_store(all_docs, persist_dir=QDRANT_PATH, collection_name=COLLECTION_NAME)
    print("✅ Unified collection built")

🔄 Building unified collection 'mtrag_unified' with all domains...
📂 Loading govt...
--- LOADING DATA FROM /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/dataset/corpora/passage_level/govt.jsonl ---
Loaded 49607 documents.
--- STARTING PARENT-CHILD SPLITTING ---
✂️ TEST MODE: Slicing govt to 1000 chunks
   Added 1000 chunks
📂 Loading clapnq...
--- LOADING DATA FROM /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/dataset/corpora/passage_level/clapnq.jsonl ---
Loaded 183408 documents.
--- STARTING PARENT-CHILD SPLITTING ---
✂️ TEST MODE: Slicing clapnq to 1000 chunks
   Added 1000 chunks
📂 Loading fiqa...
--- LOADING DATA FROM /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/dataset/corpora/passage_level/fiqa.jsonl ---
Loaded 60984 documents.
--- STARTING PARENT-CHILD SPLITTING ---
✂️ TEST MODE: Slicing fiqa to 1000 chunks
   Added 1000 chunks
📂 Loading cloud...
--- LOADING DATA FROM /home/marcantoniolopez


   Adding 4000 documents in batches of 64...


Indexing: 100%|██████████| 63/63 [01:59<00:00,  1.90s/it]

--- VECTOR STORE BUILT AND SAVED ---
✅ Unified collection built
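
The log above reports 4000 documents added in batches of 64, and the progress bar counts 63 steps. How `build_vector_store` batches internally is not shown in this notebook; the sketch below only reproduces that arithmetic with a generic, hypothetical `iter_batches` helper and placeholder items.

```python
import math
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder items standing in for the 4000 chunks from the log
placeholder_docs = list(range(4000))
batches = list(iter_batches(placeholder_docs, 64))

n_batches = len(batches)           # ceil(4000 / 64) = 63
last_batch_size = len(batches[-1])  # 4000 - 62 * 64 = 32
print(n_batches, last_batch_size)
```

The final batch is only half full, which is why the step count rounds up from 62.5 to 63.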





## 4. Initialize Unified Retriever

In [5]:
print("🔍 Initializing unified retriever...")
retriever = get_retriever(
    qdrant_path=QDRANT_PATH,
    collection_name=COLLECTION_NAME,
    top_k_retrieve=TOP_K_RETRIEVE,
    top_k_rerank=TOP_K_RERANK
)
print("✅ Retriever ready")

🔍 Initializing unified retriever...
🔧 Loading reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
✅ Retriever ready
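
The `top_k_retrieve` and `top_k_rerank` parameters, together with the cross-encoder reranker in the log, suggest a two-stage pipeline: fetch a wider candidate set, then re-score it and keep the best few. The internals of `get_retriever` are not shown here, so the following is a minimal sketch of that general pattern only. The function `retrieve_then_rerank` and both toy scorers are hypothetical stand-ins, not the project's implementation.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k_retrieve: int = 20,
    top_k_rerank: int = 5,
) -> List[Tuple[str, float]]:
    """Keep top_k_retrieve candidates by a cheap score, then re-score them
    with a second scorer and return the top_k_rerank best (doc, score) pairs."""
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d), reverse=True)
    candidates = candidates[:top_k_retrieve]
    reranked = sorted(
        ((doc, rerank_score(query, doc)) for doc in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:top_k_rerank]

# Toy scorers: stand-ins for embedding similarity and a cross-encoder
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def overlap_ratio(query: str, doc: str) -> float:
    doc_words = doc.lower().split()
    return word_overlap(query, doc) / len(doc_words) if doc_words else 0.0

toy_corpus = [
    "cloud storage pricing tiers",
    "how to open a savings account",
    "cloud storage",
    "government benefits for veterans",
]
top = retrieve_then_rerank(
    "cloud storage pricing", toy_corpus, word_overlap, overlap_ratio,
    top_k_retrieve=3, top_k_rerank=2,
)
print(top)  # [('cloud storage', 1.0), ('cloud storage pricing tiers', 0.75)]
```

In this toy run the second stage reorders the candidates: the first stage ranks "cloud storage pricing tiers" highest, while the reranker prefers "cloud storage".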


## 5. Run Retrieval

In [6]:
all_results = []

# Load ALL conversations
print("📂 Loading conversations...")
with open(CONVERSATIONS_FILE, 'r', encoding='utf-8') as f:
    all_conversations = json.load(f)
print(f"Total conversations: {len(all_conversations)}")

for domain in DOMAINS:
    print(f"\n{'='*40}\n🌍 DOMAIN: {domain.upper()}\n{'='*40}")
    
    # Filter by domain
    domain_convs = [c for c in all_conversations if domain.lower() in c.get("domain", "").lower()]
    print(f"Found {len(domain_convs)} conversations")
    
    if not domain_convs:
        continue
        
    if TEST_MODE:
        print(f"✂️ TEST MODE: Processing {TEST_QUERY_LIMIT} queries")
        domain_convs = domain_convs[:TEST_QUERY_LIMIT]

    print("🚀 Running retrieval...")
    for conv in tqdm(domain_convs):
        messages = conv.get("messages", [])
        query = extract_last_query(messages)
        if not query: 
            continue
            
        try:
            docs = retriever.invoke(query)
        except Exception as e:
            print(f"Error: {e}")
            docs = []
            
        # Format output
        contexts = []
        for i, doc in enumerate(docs):
            meta = doc.metadata
            contexts.append({
                "document_id": str(meta.get("doc_id") or meta.get("parent_id") or f"{domain}_{i}"),
                "score": float(meta.get("relevance_score") or 0.0),
                "text": meta.get("parent_text") or doc.page_content
            })
            
        all_results.append({
            "conversation_id": conv.get("author"),
            "task_id": f"{conv.get('author')}::1",
            "Collection": f"mt-rag-{domain}",
            "input": [{"speaker": m["speaker"], "text": m["text"]} for m in messages],
            "contexts": contexts
        })

print(f"\n✅ Total results: {len(all_results)}")

📂 Loading conversations...
Total conversations: 110

🌍 DOMAIN: GOVT
Found 28 conversations
✂️ TEST MODE: Processing 10 queries
🚀 Running retrieval...

100%|██████████| 10/10 [00:00<00:00, 14.48it/s]

🌍 DOMAIN: CLAPNQ
Found 29 conversations
✂️ TEST MODE: Processing 10 queries
🚀 Running retrieval...

100%|██████████| 10/10 [00:00<00:00, 15.66it/s]

🌍 DOMAIN: FIQA
Found 27 conversations
✂️ TEST MODE: Processing 10 queries
🚀 Running retrieval...

100%|██████████| 10/10 [00:00<00:00, 21.70it/s]

🌍 DOMAIN: CLOUD
Found 26 conversations
✂️ TEST MODE: Processing 10 queries
🚀 Running retrieval...

100%|██████████| 10/10 [00:00<00:00, 19.50it/s]

✅ Total results: 40





## 6. Save Results

In [7]:
print(f"💾 Saving {len(all_results)} results to {OUTPUT_FILE}...")
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    for item in all_results:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
print("✅ Done!")

# Validation
if all_results:
    sample = all_results[0]
    if "contexts" in sample and isinstance(sample["contexts"], list):
        print("\033[92mVALIDATION PASS: Structure correct.\033[0m")
    else:
        print("\033[91mVALIDATION FAIL: Key 'contexts' missing or invalid.\033[0m")

💾 Saving 40 results to /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/data/submissions/submission_TaskA_Gbgers.jsonl...
✅ Done!
VALIDATION PASS: Structure correct.
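
The validation in the cell above inspects only the first record. A stricter check could read the saved JSONL file back and test every line. The sketch below is one possible way to do that. The helper `validate_submission` is hypothetical, and the required key sets simply mirror what this notebook writes; they are an assumption, not the official task schema.

```python
import json
import os
import tempfile

# Assumption: keys mirror the records built in this notebook, not an official schema
REQUIRED_KEYS = {"conversation_id", "task_id", "Collection", "input", "contexts"}
CONTEXT_KEYS = {"document_id", "score", "text"}

def validate_submission(path):
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {line_no}: invalid JSON ({e.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: record is not an object")
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append(f"line {line_no}: missing keys {sorted(missing)}")
            if "contexts" in record:
                contexts = record["contexts"]
                if not isinstance(contexts, list):
                    problems.append(f"line {line_no}: 'contexts' is not a list")
                    continue
                for i, ctx in enumerate(contexts):
                    if not isinstance(ctx, dict) or not CONTEXT_KEYS <= ctx.keys():
                        problems.append(f"line {line_no}: context {i} is malformed")
    return problems

# Demo with two invented records: one complete, one without 'contexts'
good = {
    "conversation_id": "c1", "task_id": "c1::1", "Collection": "mt-rag-govt",
    "input": [{"speaker": "user", "text": "hi"}],
    "contexts": [{"document_id": "d1", "score": 0.9, "text": "example passage"}],
}
bad = {k: v for k, v in good.items() if k != "contexts"}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for rec in (good, bad):
            f.write(json.dumps(rec) + "\n")
    problems = validate_submission(demo_path)

print(problems)  # ["line 2: missing keys ['contexts']"]
```

Calling `validate_submission(OUTPUT_FILE)` after saving would return an empty list if every record has the expected shape.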
