# Data Ingestion Module Verification

This notebook tests the `src.ingestion` module, which handles:
- **Loading mtRAG data** from JSONL files
- **Parent-Child Chunking** for optimal retrieval (large chunks for context, small for search)
- **Vector Store Creation** using Qdrant + BGE-M3 embeddings

Uses a **small subset (50 docs)** for fast testing.

In [1]:
import sys
import os
import json
import zipfile

PROJECT_ROOT = os.path.abspath("..")
sys.path.append(PROJECT_ROOT)  # make the `src` package importable
QDRANT_PATH = os.path.join(PROJECT_ROOT, "qdrant_ingestion_test")
MAX_DOCS = 50

print(f"Project root: {PROJECT_ROOT}")
print(f"Test subset size: {MAX_DOCS} documents")

Project root: /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8
Test subset size: 50 documents


## Step 0: Prepare Test Data
Extract the corpus and create a small subset for fast testing.

In [2]:
# Extract corpus if needed
corpus_dir = os.path.join(PROJECT_ROOT, "dataset/corpora/passage_level")
jsonl_file = os.path.join(corpus_dir, "govt.jsonl")
zip_file = os.path.join(corpus_dir, "govt.jsonl.zip")

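if os.path.exists(jsonl_file):
    print("✅ Corpus ready: govt.jsonl")
elif os.path.exists(zip_file):
    print("📦 Extracting corpus...")
    with zipfile.ZipFile(zip_file, 'r') as zf:
        zf.extractall(corpus_dir)
    print("✅ Corpus extracted")
else:
    # Fail early instead of reporting a corpus that is not there
    raise FileNotFoundError(f"Neither {jsonl_file} nor {zip_file} exists")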

# Create test subset
test_file = os.path.join(PROJECT_ROOT, "data/test_subset.jsonl")
os.makedirs(os.path.dirname(test_file), exist_ok=True)

print(f"📝 Creating test subset with {MAX_DOCS} documents...")
# Copy the first MAX_DOCS lines; explicit UTF-8 avoids platform-dependent decoding
with open(jsonl_file, 'r', encoding='utf-8') as f_in, open(test_file, 'w', encoding='utf-8') as f_out:
    for i, line in enumerate(f_in):
        if i >= MAX_DOCS:
            break
        f_out.write(line)
print(f"✅ Test file created: {test_file}")

✅ Corpus ready: govt.jsonl
📝 Creating test subset with 50 documents...
✅ Test file created: /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/data/test_subset.jsonl


## Step 1: Test `load_and_chunk_data()`

This function:
1. Loads documents from JSONL
2. Applies **Parent-Child Chunking**:
   - Parent chunks: 1200 chars (full context for LLM)
   - Child chunks: 400 chars (indexed for search)
3. Stores parent text in child metadata for retrieval
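
The steps above can be illustrated with a minimal, dependency-free sketch. It is not the code in `src.ingestion`: the `Chunk` class, the `split_fixed` helper, the fixed-width splitting, and the `doc0-p0` style of parent ID are all assumptions made for illustration, and the real module may split on separators or use overlap. Only the 1200/400 sizes and the `parent_text` and `parent_id` metadata keys come from this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for the document objects returned by the real module."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def parent_child_chunks(doc_id, text, parent_size=1200, child_size=400):
    """Return child chunks, each carrying its parent's full text in metadata."""
    chunks = []
    for p_idx, parent in enumerate(split_fixed(text, parent_size)):
        for child in split_fixed(parent, child_size):
            chunks.append(Chunk(
                page_content=child,  # small piece that gets embedded and searched
                metadata={
                    "doc_id": doc_id,
                    "parent_id": f"{doc_id}-p{p_idx}",  # illustrative ID format
                    "parent_text": parent,  # larger context handed to the LLM
                },
            ))
    return chunks


demo = parent_child_chunks("doc0", "x" * 3000)
print(len(demo))                               # number of child chunks
print(len(demo[0].metadata["parent_text"]))    # size of the first parent
```

Because each child stores its parent's text, a search hit on a 400-character child can return the surrounding parent without a second lookup, at the cost of duplicating the parent text in every child's payload.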

In [3]:
from src.ingestion import load_and_chunk_data

print("🔄 Loading and chunking data...")
docs = load_and_chunk_data(test_file)

print("\n📊 Results:")
print(f"   • Total chunks created: {len(docs)}")
print(f"   • Avg chunks per document: {len(docs) / MAX_DOCS:.1f}")

print("\n📄 Sample chunk:")
sample = docs[0]
print(f"   • Child content (indexed): {sample.page_content[:100]}...")
print(f"   • Parent text length: {len(sample.metadata.get('parent_text', ''))} chars")
print(f"   • Metadata keys: {list(sample.metadata.keys())}")

print("\n✅ load_and_chunk_data() working correctly!")

🔄 Loading and chunking data...
--- LOADING DATA FROM /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/data/test_subset.jsonl ---
Loaded 50 documents.
--- STARTING PARENT-CHILD SPLITTING ---

📊 Results:
   • Total chunks created: 409
   • Avg chunks per document: 8.2

📄 Sample chunk:
   • Child content (indexed): FAQs • What type of seats can I book?
FAQs • What type of seats can I book?

 

Skip to Main Content...
   • Parent text length: 776 chars
   • Metadata keys: ['doc_id', 'title', 'url', 'source', 'parent_text', 'parent_title', 'parent_url', 'parent_id']

✅ load_and_chunk_data() working correctly!


## Step 2: Test `build_vector_store()`

This function:
1. Creates HuggingFace embeddings (BGE-M3)
2. Initializes Qdrant local database
3. Indexes all chunks with their embeddings

In [4]:
from src.ingestion import build_vector_store

# Use only first 30 chunks for speed
docs_subset = docs[:30]
print(f"🔄 Building vector store with {len(docs_subset)} chunks...")
print("   (Using subset for faster testing)")

vectorstore = build_vector_store(docs_subset, persist_dir=QDRANT_PATH)

print("\n✅ build_vector_store() working correctly!")

🔄 Building vector store with 30 chunks...
   (Using subset for faster testing)
--- BUILDING VECTOR STORE ---
--- VECTOR STORE BUILT AND SAVED ---

✅ build_vector_store() working correctly!


## Step 3: Verify Qdrant Collection

In [5]:
info = vectorstore.client.get_collection("mtrag_collection")

print("📊 Collection Statistics:")
print(f"   • Points (vectors): {info.points_count}")
print(f"   • Status: {info.status}")

print("\n✅ Collection created and verified!")

📊 Collection Statistics:
   • Points (vectors): 30
   • Status: green

✅ Collection created and verified!


## Step 4: Test Similarity Search

Verify that the vector store can find relevant documents.
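
As a rough picture of what a scored top-k search computes, here is a pure-Python sketch that ranks toy vectors by cosine similarity. It assumes the collection uses cosine similarity, which this notebook does not confirm. The `cosine` and `top_k` helpers and the two-dimensional vectors are invented for illustration; the real store compares BGE-M3 embeddings inside Qdrant.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Score every (text, vector) pair against the query and keep the k best."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy index: hand-written 2-D vectors stand in for real embeddings
toy_index = [
    ("city ordinances", [1.0, 0.0]),
    ("seat booking FAQ", [0.0, 1.0]),
    ("council resolutions", [1.0, 1.0]),
]

hits = top_k([1.0, 0.0], toy_index, k=2)
for text, score in hits:
    print(f"{text}: {score:.4f}")
```

Under this assumption a higher score means a closer match, so results come back in descending order, as they do in the output of the cell below.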

In [6]:
query = "government regulations"
print(f"🔍 Testing search with query: '{query}'")

results = vectorstore.similarity_search_with_score(query, k=3)

print(f"\n📊 Found {len(results)} results:")
for i, (doc, score) in enumerate(results):
    print(f"\n   Result {i+1} (similarity: {score:.4f}):")
    print(f"   • Child chunk: {doc.page_content[:80]}...")
    parent = doc.metadata.get('parent_text', '')
    print(f"   • Parent context: {parent[:80]}..." if parent else "   • No parent")

print("\n✅ Similarity search working correctly!")

🔍 Testing search with query: 'government regulations'

📊 Found 3 results:

   Result 1 (similarity: 0.4917):
   • Child chunk: Adopted City Council Ordinances

Adopted City Council Resolutions

 

 
HomeFAQs...
   • Parent context: FAQs • What type of seats can I book?
FAQs • What type of seats can I book?

 

...

   Result 2 (similarity: 0.4531):
   • Child chunk: FAQs • What type of seats can I book?
FAQs • What type of seats can I book?

 

...
   • Parent context: FAQs • What type of seats can I book?
FAQs • What type of seats can I book?

 

...

   Result 3 (similarity: 0.4520):
   • Child chunk: 3.
What type of seats can I book?...
   • Parent context: 2.
Where can I board MoGo?

Pick-up and drop-off is available at designated virt...

✅ Similarity search working correctly!


## Cleanup
Remove test files after verification.
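
One caveat: a cleanup cell only runs if every earlier cell succeeded. A possible alternative, sketched below under the assumption that the checks can be wrapped in a single function, is to create a scratch directory and delete it in a `finally` block so that nothing is left behind after a failure. The `run_with_scratch_dir` and `fake_test` names are hypothetical and not part of `src.ingestion`.

```python
import os
import shutil
import tempfile


def run_with_scratch_dir(test_fn):
    """Create a scratch directory, run test_fn(path), and always delete the directory."""
    scratch = tempfile.mkdtemp(prefix="ingestion_test_")
    try:
        return test_fn(scratch)
    finally:
        # Runs whether test_fn returned normally or raised
        shutil.rmtree(scratch, ignore_errors=True)


def fake_test(path):
    """Stand-in for the real ingestion checks: write one file and report its location."""
    target = os.path.join(path, "test_subset.jsonl")
    with open(target, "w", encoding="utf-8") as fh:
        fh.write("{}\n")
    return target


leftover = run_with_scratch_dir(fake_test)
print(os.path.exists(leftover))  # the file is gone once the helper returns
```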

In [7]:
import shutil

# Close client first
vectorstore.client.close()

# Remove test files
if os.path.exists(QDRANT_PATH):
    shutil.rmtree(QDRANT_PATH)
    print(f"🗑️ Removed test database: {QDRANT_PATH}")

if os.path.exists(test_file):
    os.remove(test_file)
    print(f"🗑️ Removed test subset: {test_file}")

print("\n✅ All ingestion tests passed!")

🗑️ Removed test database: /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/qdrant_ingestion_test
🗑️ Removed test subset: /home/marcantoniolopez/Documenti/github/projects/llm-semeval-task8/data/test_subset.jsonl

✅ All ingestion tests passed!
