This notebook processes semantic chunks generated from PDFs and creates ground truth data
for retrieval evaluation:
1. Reads chunked semantic data from a JSON file.
2. Filters and renames specific fields for cleaner processing.
3. Generates a short ID for each chunk from an MD5 hash of its location, document ID,
   and the start of its content.
4. Uses an AI model to generate 5 travel questions for each record.
5. Saves the resulting ground truth dataset to CSV for retrieval evaluation.

#### Helper functions

In [1]:
import json
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import os
import sys
from dotenv import load_dotenv

In [2]:
file_path = '../data/processed/semantic_chunks.json'
output_csv = '../data/processed/ground-truth-retreival.csv'

In [3]:
# ------------------------------------------------------------
# Load semantic chunks JSON file
# ------------------------------------------------------------
with open(file_path, "r", encoding="utf-8") as f:
    chunks = json.load(f)

In [4]:
chunks[0]

{'metadata': {'document_id': 'd4402d82c0',
  'pdf_name': 'Andhra_Pradesh.pdf',
  'pdf_part': 0},
 'chunk_text': 'Asia > South Asia > India > Southern India > Andhra Pradesh  \n![0_image_0.png](0_image_0.png)',
 'token_count': 93}
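
The later cells assume every chunk has a `metadata` dictionary with `document_id` and `pdf_name`, plus a non-empty `chunk_text`. A minimal sketch of a check for that shape is shown here; `validate_chunk` is a hypothetical helper, not part of the notebook, and the rules it applies are assumptions based on the sample record above.

```python
def validate_chunk(chunk):
    """Return a list of problems found in one chunk record (empty if none)."""
    problems = []
    metadata = chunk.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("missing or non-dict 'metadata'")
    else:
        for key in ("document_id", "pdf_name"):
            if not metadata.get(key):
                problems.append(f"metadata lacks '{key}'")
    text = chunk.get("chunk_text")
    if not isinstance(text, str) or not text.strip():
        problems.append("empty or missing 'chunk_text'")
    return problems


# Record shaped like chunks[0] above, followed by a deliberately broken one.
sample = {
    "metadata": {"document_id": "d4402d82c0", "pdf_name": "Andhra_Pradesh.pdf", "pdf_part": 0},
    "chunk_text": "Asia > South Asia > India > Southern India > Andhra Pradesh",
    "token_count": 93,
}
broken = {"metadata": {"pdf_name": "x.pdf"}, "chunk_text": " "}

print(validate_chunk(sample))
print(validate_chunk(broken))
```

Running it over all of `chunks` before building the DataFrame would surface malformed records early instead of as empty strings in `location` or `doc_id`.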

In [5]:
# ------------------------------------------------------------
# Create DataFrame with required fields
# ------------------------------------------------------------
df_filtered = pd.DataFrame(chunks)
df_filtered["location"] = df_filtered["metadata"].apply(lambda m: m.get("pdf_name", "").replace(".pdf", ""))
df_filtered["doc_id"] = df_filtered["metadata"].apply(lambda m: m.get("document_id", ""))
df_filtered["content"] = df_filtered["chunk_text"]
df_filtered = df_filtered[["location", "doc_id", "content"]]

In [6]:
display(df_filtered.head())

   location        doc_id      content
0  Andhra_Pradesh  d4402d82c0  Asia > South Asia > India > Southern India > A...
1  Andhra_Pradesh  d4402d82c0  Andhra Pradesh (AP) is a state in Southern Ind...
2  Andhra_Pradesh  d4402d82c0  Northern Coast (Alluri Sitharama Raju, Anakapa...
3  Andhra_Pradesh  d4402d82c0  Here are some of the most notable cities.
4  Andhra_Pradesh  ed240be0ed  Amaravati - the capital of Andhra Pradesh whic...


In [44]:
# Convert back to list of dictionaries for processing
documents = df_filtered.to_dict('records')

In [45]:
documents[1]

{'location': 'Andhra_Pradesh',
 'doc_id': 'd4402d82c0',
 'content': 'Andhra Pradesh (AP) is a state in Southern India, with Bay of Bengal on the east and shares boundaries with Telangana on the north, Chhattisgarh and Odisha on the north-east, Tamil Nadu on the south and Karnataka on the west. Vijayawada is the capital of this state.'}

In [46]:
print(f"Processed {len(documents)} documents")


Processed 149 documents


In [48]:
# ------------------------------------------------------------
# Generate unique document IDs (using existing doc_id + content hash)
# ------------------------------------------------------------
import hashlib
def generate_document_id(doc):
    # Hash location + doc_id + the first 20 characters of content.
    # Chunks from the same document that share those 20 characters get the same ID.
    combined = f"{doc['location']}-{doc['doc_id']}-{doc['content'][:20]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    document_id = hash_hex[:8]
    return document_id

for doc in documents:
    doc['id'] = generate_document_id(doc)

print("Generated IDs for all documents")

Generated IDs for all documents


In [50]:
# ------------------------------------------------------------
# Check for hash collisions
# ------------------------------------------------------------
from collections import defaultdict
hashes = defaultdict(list)

for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)

collision_count = 0
for k, values in hashes.items():
    if len(values) > 1:
        print(f"Hash collision: {k} appears {len(values)} times")
        collision_count += 1

print(f"Total documents: {len(documents)}, Unique IDs: {len(hashes)}, Collisions: {collision_count}")

Hash collision: 76f6a7f0 appears 2 times
Hash collision: 919660f8 appears 2 times
Total documents: 149, Unique IDs: 147, Collisions: 2


In [52]:
hashes['76f6a7f0']

[{'location': 'Karnataka',
  'doc_id': '9a1dcdf649',
  'content': "Karnataka is known for its wealth of flora and fauna. Around 25% of India's elephants and 10% of its tigers are in Karnataka. Many regions in Karnataka are as yet unexplored, leading to discoveries of flora and fauna. The Western Ghats teem with wildlife and are an acknowledged *biodiversity hotspot*.",
  'id': '76f6a7f0'},
 {'location': 'Karnataka',
  'doc_id': '9a1dcdf649',
  'content': 'Karnataka is known for a wide variety of **local fast foods** which are available at outlets everywhere. Some are standalone outlets while others are well known names with branch locations. The outlets are self-service where you stand and eat, although some of the newer ones provide sit-down table service for a higher charge. The outlets are open throughout the day and popular during breakfast and lunch. Some outlets specialize, while others have more variety as indicated in their menus.Udupi',
  'id': '76f6a7f0'}]

In [53]:
hashes['919660f8']

[{'location': 'Karnataka',
  'doc_id': '9a1dcdf649',
  'content': 'Kempegowda International Airport (BLR IATA) in northern Bangalore is the 3rd-busiest airport in India, connecting with major state capitals and cities as well as 20 international destinations.  \n![5_image_0.png](5_image_0.png)',
  'id': '919660f8'},
 {'location': 'Karnataka',
  'doc_id': '9a1dcdf649',
  'content': 'Kempegowda International Airport serves most domestic routes in India. Mangalore International Airport also serves domestic destinations. Domestic-only airports in Karnataka with varying service levels include Hubli, Belgaum, Hampi, Vidhyanagar and Mysore. These airports also connect with the Kempegowda International Airport in the state capital of Bangalore.',
  'id': '919660f8'}]
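
Both collisions come from chunks of the same document that start with the same 20 characters ("Karnataka is known f" and "Kempegowda Internati"). One way to avoid this is to hash the complete content instead of a prefix. The following is a sketch only: `generate_document_id_full` is a hypothetical variant, and switching to it would change every ID shown in this notebook.

```python
import hashlib


def generate_document_id_full(doc, length=8):
    """Hash location, doc_id and the complete content (hypothetical variant)."""
    combined = f"{doc['location']}-{doc['doc_id']}-{doc['content']}"
    return hashlib.md5(combined.encode("utf-8")).hexdigest()[:length]


# Two chunks that share their first 20 characters, as in the collisions above
# (content shortened for the example).
doc_a = {"location": "Karnataka", "doc_id": "9a1dcdf649",
         "content": "Karnataka is known for its wealth of flora and fauna."}
doc_b = {"location": "Karnataka", "doc_id": "9a1dcdf649",
         "content": "Karnataka is known for a wide variety of local fast foods."}

same_prefix = doc_a["content"][:20] == doc_b["content"][:20]
distinct_ids = generate_document_id_full(doc_a) != generate_document_id_full(doc_b)
print(same_prefix, distinct_ids)
```

Truncating the digest to 8 hex characters still leaves a small chance of accidental collisions, so the collision check is worth keeping either way.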

In [54]:
documents[0]

{'location': 'Andhra_Pradesh',
 'doc_id': 'd4402d82c0',
 'content': 'Asia > South Asia > India > Southern India > Andhra Pradesh  \n![0_image_0.png](0_image_0.png)',
 'id': '4f80b327'}

In [55]:
# ------------------------------------------------------------
# Save documents with IDs
# ------------------------------------------------------------
with open('../data/processed/documents-with-ids.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2)

print("Saved documents with IDs")

Saved documents with IDs


In [56]:
# ------------------------------------------------------------
# Prompt template for generating questions
# ------------------------------------------------------------
prompt_template = """
You are helping create training data for a travel information retrieval system.
Based on the travel content provided, formulate 5 questions that a traveler might ask.
The questions should be answerable from the given content and should be practical travel-related queries.
Use varied phrasing and avoid copying exact words from the content when possible.

Content about {location}:
{content}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", "question3", "question4", "question5"]
""".strip()

In [59]:
prompt = prompt_template.format(**documents[1])

In [60]:
print(prompt)

You are helping create training data for a travel information retrieval system.
Based on the travel content provided, formulate 5 questions that a traveler might ask.
The questions should be answerable from the given content and should be practical travel-related queries.
Use varied phrasing and avoid copying exact words from the content when possible.

Content about Andhra_Pradesh:
Andhra Pradesh (AP) is a state in Southern India, with Bay of Bengal on the east and shares boundaries with Telangana on the north, Chhattisgarh and Odisha on the north-east, Tamil Nadu on the south and Karnataka on the west. Vijayawada is the capital of this state.

Provide the output in parsable JSON without using code blocks:

["question1", "question2", "question3", "question4", "question5"]


In [61]:
# os and load_dotenv are already imported in the first cell
import requests

load_dotenv()  # Loads variables from .env
API_KEY = os.getenv("PERPLEXITY_API_KEY")

In [62]:
def llm(prompt):
    response = requests.post(
        'https://api.perplexity.ai/chat/completions',
        headers={
            'Authorization': f'Bearer {API_KEY}',
            'Content-Type': 'application/json'
        },
        json={
            'model': 'sonar',
            'messages': [{"role": "user", "content": prompt}]
        }
    )
    
    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content']
    else:
        raise Exception(f"API call failed with status {response.status_code}: {response.text}")
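
Because `llm` raises on any non-200 status, a single rate-limit or server error aborts the call. A minimal sketch of a retry wrapper with exponential backoff follows; `call_with_retries` and its parameters are hypothetical, and the `flaky` function merely stands in for `llm` so the example runs without network access.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on any exception wait and retry with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for llm(): fails twice, then succeeds.
calls = {"n": 0}


def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise Exception("API call failed with status 429")
    return '["q1"]'


delays = []  # Record the requested delays instead of actually sleeping.
answer = call_with_retries(flaky, "prompt", sleep=delays.append)
print(answer, delays)
```

In the notebook this would be used as `call_with_retries(llm, prompt)`. Passing a `timeout` to `requests.post` inside `llm` is a sensible companion, since a hung connection never raises on its own.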

In [63]:
questions = llm(prompt)

In [64]:
json.loads(questions)

['What are the must-see religious sites for pilgrims visiting Andhra Pradesh?',
 'Which beaches in Andhra Pradesh are recommended for tourists looking to relax by the sea?',
 'What are the key attractions for nature lovers visiting the Araku Valley area?',
 'Could you suggest historical forts or museums to visit in Vijayawada?',
 'What is the best time of year to plan a trip to popular Andhra Pradesh temples like Tirumala or Srikalahasti?']

In [65]:
def generate_questions(doc):
    """Return the model's raw JSON string of questions for one document."""
    prompt = prompt_template.format(
        location=doc['location'],
        content=doc['content']
    )

    try:
        # Reuse llm(), which raises on any non-200 response
        return llm(prompt)
    except Exception as e:
        print(f"Error generating questions for doc {doc['id']}: {e}")
        # Placeholder that still parses as a JSON list downstream
        return '["Error generating questions"]'

In [66]:
results = {}

print("Generating questions using LLM")

for doc in tqdm(documents): 
    doc_id = doc['id']
    if doc_id in results:
        continue  # Skip duplicates from hash collisions

    questions = generate_questions(doc)
    results[doc_id] = questions

print(f"Generated questions for {len(results)} unique documents")

Generating questions using LLM


Generated questions for 147 unique documents
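
The loop above keeps everything in memory, so an interruption partway through 149 API calls loses all progress. A sketch of a resumable version follows, assuming the raw responses are plain strings and can therefore be stored as JSON. `generate_with_checkpoint`, `save_every`, and the checkpoint path are hypothetical names, and `fake_generate` replaces the real API call.

```python
import json
import os
import tempfile


def generate_with_checkpoint(documents, generate, checkpoint_path, save_every=10):
    """Generate questions per document ID, persisting progress as JSON."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            results = json.load(f)  # Resume from an earlier run.

    new_since_save = 0
    for doc in documents:
        if doc["id"] in results:
            continue  # Already generated, or a duplicate ID.
        results[doc["id"]] = generate(doc)
        new_since_save += 1
        if new_since_save >= save_every:
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(results, f)
            new_since_save = 0

    # Final save so the file always reflects the returned dictionary.
    with open(checkpoint_path, "w", encoding="utf-8") as f:
        json.dump(results, f)
    return results


# Demonstration with a stand-in generator instead of the real API call.
docs = [{"id": "a1"}, {"id": "b2"}, {"id": "a1"}]
seen = []


def fake_generate(doc):
    seen.append(doc["id"])
    return '["question"]'


path = os.path.join(tempfile.mkdtemp(), "results.json")
first = generate_with_checkpoint(docs, fake_generate, path)
second = generate_with_checkpoint(docs, fake_generate, path)  # Nothing left to do.
print(len(first), len(second), seen)
```

The second call reads the checkpoint and makes no further calls to the generator, which is the behaviour you want after a crash or a manual stop.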


In [69]:
results['76f6a7f0']

'[\n  "Which national parks in Karnataka offer opportunities to see elephants and tigers in their natural habitats?",\n  "What are some top historical sites or forts that tourists should visit when traveling through Karnataka?",\n  "Where in Karnataka can I experience rich biodiversity, especially in the Western Ghats region?",\n  "What are the key tourist attractions in Bangalore that combine nature and historical significance?",\n  "During which months is it best to visit scenic places like Kudremukh or Agumbe in Karnataka for trekking and nature exploration?"\n]'

In [73]:
# ------------------------------------------------------------
# Save raw results
# ------------------------------------------------------------
import pickle

with open('../data/processed/results.bin', 'wb') as f_out:
    pickle.dump(results, f_out)

print("Saved raw results")

Saved raw results


In [74]:
# ------------------------------------------------------------
# Parse JSON responses
# ------------------------------------------------------------
parsed_results = {}

for doc_id, json_questions in results.items():
    try:
        parsed_results[doc_id] = json.loads(json_questions)
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON for doc {doc_id}: {e}")
        print(f"Raw response: {json_questions}")
        # Create fallback questions
        parsed_results[doc_id] = ["What can you tell me about this location?"]

print(f"Parsed {len(parsed_results)} question sets")

Parsed 147 question sets
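
`json.loads` only catches syntactically invalid replies. A reply wrapped in a Markdown code fence, or a valid list with the wrong number of items (such as the `'["Error generating questions"]'` placeholder), needs extra handling. A sketch of a stricter parser follows; `parse_questions` is a hypothetical helper, and returning `None` for unusable replies is an assumption about how callers would want to treat them.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the listing readable.
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_questions(raw, expected=5):
    """Return a list of question strings, or None if the reply is unusable."""
    text = raw.strip()
    # Drop a surrounding code fence if the model added one despite the prompt.
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list) or len(data) != expected:
        return None
    if not all(isinstance(q, str) and q.strip() for q in data):
        return None
    return [q.strip() for q in data]


plain = '["a?", "b?", "c?", "d?", "e?"]'
wrapped = FENCE + "json\n" + plain + "\n" + FENCE

print(parse_questions(plain))
print(parse_questions(wrapped))
print(parse_questions('["Error generating questions"]'))
```

Documents whose reply parses to `None` could then be retried or left out of the ground truth rather than receiving a generic fallback question.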


In [75]:
# ------------------------------------------------------------
# Create document index and final results
# ------------------------------------------------------------
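# Keep the first document per ID: when two chunks collide, questions were
# generated from the first one only
doc_index = {}
for d in documents:
    doc_index.setdefault(d['id'], d)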

final_results = []

for doc_id, questions in parsed_results.items():
    if doc_id not in doc_index:
        continue
        
    location = doc_index[doc_id]['location']
    original_doc_id = doc_index[doc_id]['doc_id']
    
    for q in questions:
        final_results.append((q, location, doc_id, original_doc_id))

print(f"Created {len(final_results)} question-document pairs")

Created 735 question-document pairs
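
The total of 735 pairs equals 147 × 5, which is consistent with every document contributing exactly five questions, but a total alone cannot confirm it. A small sketch of a per-document audit follows; `audit_pairs` and `PLACEHOLDERS` are hypothetical names, and the placeholder strings are the two fallback texts used earlier in this notebook.

```python
from collections import Counter

PLACEHOLDERS = {
    "Error generating questions",
    "What can you tell me about this location?",
}


def audit_pairs(pairs, expected_per_doc=5):
    """Summarise question counts per generated ID and list placeholder rows."""
    per_doc = Counter(generated_id for _, _, generated_id, _ in pairs)
    wrong_count = {doc_id: n for doc_id, n in per_doc.items() if n != expected_per_doc}
    placeholders = [doc_id for q, _, doc_id, _ in pairs if q in PLACEHOLDERS]
    return wrong_count, placeholders


# Toy data in the same (question, location, generated_id, original_doc_id) layout.
pairs = [(f"Question {i}?", "Karnataka", "76f6a7f0", "9a1dcdf649") for i in range(5)]
pairs.append(("Error generating questions", "Karnataka", "919660f8", "9a1dcdf649"))

wrong_count, placeholders = audit_pairs(pairs)
print(wrong_count, placeholders)
```

Calling `audit_pairs(final_results)` would show whether any document has fewer or more than five questions and whether any fallback text reached the dataset.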


In [82]:
final_results[:5]

[('What are the must-see religious sites in Andhra Pradesh for pilgrims?',
  'Andhra_Pradesh',
  '4f80b327',
  'd4402d82c0'),
 ('Which natural attractions and caves can tourists explore in Andhra Pradesh?',
  'Andhra_Pradesh',
  '4f80b327',
  'd4402d82c0'),
 ('Where in Andhra Pradesh can I visit museums related to military and tribal culture?',
  'Andhra_Pradesh',
  '4f80b327',
  'd4402d82c0'),
 ('What beaches and coastal towns in Andhra Pradesh offer historical landmarks?',
  'Andhra_Pradesh',
  '4f80b327',
  'd4402d82c0'),
 ('What are some recommended hill stations and forested areas for nature lovers in Andhra Pradesh?',
  'Andhra_Pradesh',
  '4f80b327',
  'd4402d82c0')]

In [100]:
df_ground_truth = pd.DataFrame(final_results, columns=['question', 'location', 'generated_id', 'original_doc_id'])

In [101]:
df_ground_truth.rename(columns={'generated_id': 'id'}, inplace=True)

In [102]:
display(df_ground_truth.head())

   question                                           location        id        original_doc_id
0  What are the must-see religious sites in Andhr...  Andhra_Pradesh  4f80b327  d4402d82c0
1  Which natural attractions and caves can touris...  Andhra_Pradesh  4f80b327  d4402d82c0
2  Where in Andhra Pradesh can I visit museums re...  Andhra_Pradesh  4f80b327  d4402d82c0
3  What beaches and coastal towns in Andhra Prade...  Andhra_Pradesh  4f80b327  d4402d82c0
4  What are some recommended hill stations and fo...  Andhra_Pradesh  4f80b327  d4402d82c0


In [103]:
df_ground_truth_filt = df_ground_truth.loc[:, ['question', 'id']]

In [104]:
display(df_ground_truth_filt.head())

   question                                           id
0  What are the must-see religious sites in Andhr...  4f80b327
1  Which natural attractions and caves can touris...  4f80b327
2  Where in Andhra Pradesh can I visit museums re...  4f80b327
3  What beaches and coastal towns in Andhra Prade...  4f80b327
4  What are some recommended hill stations and fo...  4f80b327


In [None]:
# Save to CSV
df_ground_truth_filt.to_csv(output_csv, index=False)
print(f"Saved ground truth data to {output_csv}")

Saved ground truth data to ../data/processed/ground-truth-retreival.csv
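
For context on how a `question`/`id` file like this is typically consumed, the sketch below computes hit rate and mean reciprocal rank (MRR) for a retriever. It is an illustration under assumptions, not part of this notebook: `evaluate` and `toy_search` are hypothetical, and a real `search` function would query whatever index holds the chunks and return their generated IDs in ranked order.

```python
def evaluate(ground_truth, search, k=5):
    """Compute hit rate and MRR over rows shaped like {'question': ..., 'id': ...}."""
    n = len(ground_truth)
    if n == 0:
        return {"hit_rate": 0.0, "mrr": 0.0}
    hits = 0
    reciprocal_ranks = 0.0
    for row in ground_truth:
        retrieved_ids = search(row["question"])[:k]
        if row["id"] in retrieved_ids:
            hits += 1
            # Rank is 1-based: the first result contributes 1, the second 1/2, ...
            reciprocal_ranks += 1.0 / (retrieved_ids.index(row["id"]) + 1)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}


# Toy retriever that returns a fixed ranking, purely for illustration.
def toy_search(question):
    return ["4f80b327", "76f6a7f0", "919660f8"]


rows = [
    {"question": "q1", "id": "4f80b327"},  # relevant chunk at rank 1
    {"question": "q2", "id": "76f6a7f0"},  # relevant chunk at rank 2
    {"question": "q3", "id": "ffffffff"},  # relevant chunk not retrieved
    {"question": "q4", "id": "919660f8"},  # relevant chunk at rank 3
]
scores = evaluate(rows, toy_search)
print(scores)
```

Rows from the CSV can be passed in via `df.to_dict('records')`. Note that chunks dropped because of ID collisions have no questions of their own, so they can never count as a hit.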


In [109]:
# ------------------------------------------------------------
# Save some statistics
# ------------------------------------------------------------
stats = {
    'total_documents': len(documents),
    'unique_documents': len(hashes),
    'hash_collisions': collision_count,
    'total_questions_generated': len(df_ground_truth),
    'unique_locations': df_ground_truth['location'].nunique(),
    'questions_per_location': len(df_ground_truth) / df_ground_truth['location'].nunique()
}

In [110]:
print(stats)

{'total_documents': 149, 'unique_documents': 147, 'hash_collisions': 2, 'total_questions_generated': 735, 'unique_locations': 2, 'questions_per_location': 367.5}
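
The cell comment mentions saving the statistics, but the code shown only prints them. A minimal sketch of writing them to JSON follows; `save_stats` and the file name are hypothetical, and a temporary directory is used so the example has no side effects on the project tree.

```python
import json
import os
import tempfile


def save_stats(stats, path):
    """Write the statistics dictionary to a JSON file and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stats, f, indent=2)
    return path


# Values copied from the printed output above.
stats_example = {
    "total_documents": 149,
    "unique_documents": 147,
    "hash_collisions": 2,
    "total_questions_generated": 735,
    "unique_locations": 2,
    "questions_per_location": 367.5,
}

stats_path = save_stats(
    stats_example,
    os.path.join(tempfile.mkdtemp(), "ground-truth-stats.json"),  # hypothetical name
)
with open(stats_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded == stats_example)
```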
