This notebook/script processes semantic chunks generated from PDFs and creates ground truth data 
for retrieval evaluation. It:
1. Reads chunked semantic data from a JSON file.
2. Filters and renames specific fields for cleaner processing.
3. Uses an AI model to generate 5 sustainability-focused travel questions 
   for each record.
4. Saves the resulting ground truth dataset to CSV for retrieval evaluation.

#### Helper functions

In [1]:
import json
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import os
import sys
from dotenv import load_dotenv

In [2]:
file_path = '../data/processed/semantic_chunks.json'
output_csv = '../data/processed/ground-truth-retreival.csv'

In [3]:
# ------------------------------------------------------------
# Load semantic chunks JSON file
# ------------------------------------------------------------
with open(file_path, "r", encoding="utf-8") as f:
    chunks = json.load(f)

In [4]:
chunks[0]

{'metadata': {'document_id': 'd4402d82c0',
  'pdf_name': 'Andhra_Pradesh.pdf',
  'pdf_part': 0},
 'chunk_text': 'Asia > South Asia > India > Southern India > Andhra Pradesh  \n![0_image_0.png](0_image_0.png)',
 'token_count': 93}

In [5]:
# ------------------------------------------------------------
# Create DataFrame with required fields
# ------------------------------------------------------------
df_filtered = pd.DataFrame(chunks)
df_filtered["location"] = df_filtered["metadata"].apply(lambda m: m.get("pdf_name", "").replace(".pdf", ""))
df_filtered["doc_id"] = df_filtered["metadata"].apply(lambda m: m.get("document_id", ""))
df_filtered["content"] = df_filtered["chunk_text"]
df_filtered = df_filtered[["location", "doc_id", "content"]]

In [6]:
display(df_filtered.head())

| | location | doc_id | content |
|---|---|---|---|
| 0 | Andhra_Pradesh | d4402d82c0 | Asia > South Asia > India > Southern India > A... |
| 1 | Andhra_Pradesh | d4402d82c0 | Andhra Pradesh (AP) is a state in Southern Ind... |
| 2 | Andhra_Pradesh | d4402d82c0 | Northern Coast (Alluri Sitharama Raju, Anakapa... |
| 3 | Andhra_Pradesh | d4402d82c0 | Here are some of the most notable cities. |
| 4 | Andhra_Pradesh | ed240be0ed | Amaravati - the capital of Andhra Pradesh whic... |


In [7]:
documents = df_filtered.to_dict(orient="records")

In [11]:
documents[1]

{'location': 'Andhra_Pradesh',
 'doc_id': 'd4402d82c0',
 'content': 'Andhra Pradesh (AP) is a state in Southern India, with Bay of Bengal on the east and shares boundaries with Telangana on the north, Chhattisgarh and Odisha on the north-east, Tamil Nadu on the south and Karnataka on the west. Vijayawada is the capital of this state.'}
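The same records can also be built without a DataFrame. The sketch below is one possible alternative, not the notebook's method. It uses `Path(...).stem` to drop the file extension, which only removes a trailing suffix, whereas `str.replace(".pdf", "")` removes the substring wherever it occurs in the name. `sample_chunks`, `to_record` and `sample_documents` are hypothetical names, and the sample values other than the ids and the PDF name are made up for illustration.

```python
from pathlib import Path

# Hypothetical sample shaped like the entries of semantic_chunks.json.
sample_chunks = [
    {
        "metadata": {"document_id": "d4402d82c0", "pdf_name": "Andhra_Pradesh.pdf", "pdf_part": 0},
        "chunk_text": "Andhra Pradesh (AP) is a state in Southern India.",
        "token_count": 12,
    },
    {
        # No pdf_name here, to show the fallback to an empty string.
        "metadata": {"document_id": "ed240be0ed"},
        "chunk_text": "Amaravati - the capital of Andhra Pradesh",
        "token_count": 9,
    },
]


def to_record(chunk):
    """Flatten one chunk into the location/doc_id/content shape used above."""
    metadata = chunk.get("metadata", {})
    return {
        "location": Path(metadata.get("pdf_name", "")).stem,
        "doc_id": metadata.get("document_id", ""),
        "content": chunk.get("chunk_text", ""),
    }


sample_documents = [to_record(c) for c in sample_chunks]
```

Applied to the real data, this would read `documents = [to_record(c) for c in chunks]`.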

In [12]:
# ------------------------------------------------------------
# Prompt template for generating questions
# ------------------------------------------------------------
prompt_template = """
You are an AI model assisting in developing a sustainable tourism recommender system for India.  
Your task is to generate 5 questions that a user might ask when planning a trip to this region with a focus on sustainability. 
The questions should be based on the provided record, which contains information from sources like WikiVoyage about travel destinations, ethical travel practices, and sustainable tourism tips. 

The record includes:

location: {location}
content: {content}

Formulate 5 clear and complete questions based on the provided record. These questions should be relevant to sustainable travel and tourism in India and should encourage users to think about ethical and eco-friendly travel options. 
Ensure the questions are varied and concise, using as few words as possible from the original text.

Provide the output in parsable JSON format without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [15]:
prompt = prompt_template.format(**documents[1])

In [16]:
print(prompt)

You are an AI model assisting in developing a sustainable tourism recommender system for India.  
Your task is to generate 5 questions that a user might ask when planning a trip to this region with a focus on sustainability. 
The questions should be based on the provided record, which contains information from sources like WikiVoyage about travel destinations, ethical travel practices, and sustainable tourism tips. 

The record includes:

location: Andhra_Pradesh
content: Andhra Pradesh (AP) is a state in Southern India, with Bay of Bengal on the east and shares boundaries with Telangana on the north, Chhattisgarh and Odisha on the north-east, Tamil Nadu on the south and Karnataka on the west. Vijayawada is the capital of this state.

Formulate 5 clear and complete questions based on the provided record. These questions should be relevant to sustainable travel and tourism in India and should encourage users to think about ethical and eco-friendly travel options. 
Ensure the questions are varied and concise, using as few words as possible from the original text.

Provide the output in parsable JSON format without using code blocks:

["question1", "question2", ..., "question5"]

In [21]:
from dotenv import load_dotenv
import os
import requests

load_dotenv()  # Loads variables from .env
API_KEY = os.getenv("PERPLEXITY_API_KEY")
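`os.getenv` returns `None` when the variable is absent, so a missing key would likely only surface later, when the API rejects the request. A small guard can fail earlier with a clearer message. This is a sketch; `require_env` is a hypothetical helper, not part of `python-dotenv`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value
```

It would be used in place of the plain lookup, after `load_dotenv()`: `API_KEY = require_env("PERPLEXITY_API_KEY")`.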

In [24]:
def llm(prompt):
    """Send a single-turn prompt to the Perplexity chat completions API and return the reply text."""
    response = requests.post(
        'https://api.perplexity.ai/chat/completions',
        headers={
            'Authorization': f'Bearer {API_KEY}',
            'Content-Type': 'application/json'
        },
        json={
            'model': 'sonar',
            'messages': [{"role": "user", "content": prompt}]
        },
        timeout=60,  # avoid hanging indefinitely on a stalled connection
    )

    # Anything other than HTTP 200 is treated as a failure
    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content']
    raise RuntimeError(f"API call failed with status {response.status_code}: {response.text}")

In [None]:
# def llm(prompt):
#     response = client.chat.completions.create(
#         model='gpt-4o-mini',
#         messages=[{"role": "user", "content": prompt}]
#     )
    
#     return response.choices[0].message.content

In [25]:
questions = llm(prompt)
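The `llm(prompt)` call makes a single attempt, so one transient network or rate-limit error stops the run. One common mitigation is to retry with exponential backoff. The sketch below is a generic wrapper under that assumption; `call_with_retries` and its parameters are hypothetical names, and retrying on every exception type is a simplification.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, then 4x ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

With this in place, the call would become `questions = call_with_retries(llm, prompt)`. The `sleep` parameter exists only so the delay can be replaced in tests.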

In [26]:
json.loads(questions)

['What eco-friendly accommodations are available in Andhra Pradesh that support local communities?',
 'How can I participate in cultural preservation while visiting historical sites like Lepakshi Temple and Gandikota Fort sustainably?',
 'What wildlife or nature conservation activities can I engage in when visiting places like Irakam Island?',
 'How do homestays in Andhra Pradesh contribute to sustainable tourism and benefit local residents?',
 'What sustainable travel infrastructure or services exist to minimize the environmental impact while exploring Andhra Pradesh?']
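`json.loads` succeeds here because the model returned a bare JSON list, but model output is not guaranteed to follow the prompt. The sketch below shows one way to validate the reply before using it, assuming the most likely deviations are a Markdown code fence around the JSON or a wrong number of items. `parse_questions` is a hypothetical helper.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_questions(raw, expected=5):
    """Parse model output into a list of question strings, or raise ValueError."""
    text = raw.strip()

    # Some models wrap JSON in a code fence despite the instructions.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)

    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc

    if not isinstance(parsed, list) or not all(isinstance(q, str) for q in parsed):
        raise ValueError("Expected a JSON list of strings")
    if len(parsed) != expected:
        raise ValueError(f"Expected {expected} questions, got {len(parsed)}")
    return parsed
```

Calling `parse_questions(questions)` instead of `json.loads(questions)` would turn a malformed reply into a single, descriptive `ValueError`.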

In [None]:
# def generate_questions(doc):
#     prompt = prompt_template.format(**doc)

#     response = client.chat.completions.create(
#         model='gpt-4o-mini',
#         messages=[{"role": "user", "content": prompt}]
#     )

#     json_response = response.choices[0].message.content
#     return json_response

In [27]:
def generate_questions(doc, prompt_template):
    """Return the model's raw JSON string of questions for one document record."""
    # Format the prompt using the doc dict (extra keys such as doc_id are ignored)
    prompt = prompt_template.format(**doc)

    # Reuse llm() defined earlier, which makes the Perplexity API call
    # and raises on a non-200 response
    return llm(prompt)

In [28]:
from tqdm.auto import tqdm

In [29]:
results = {}

In [32]:
print(documents[0])

{'location': 'Andhra_Pradesh', 'doc_id': 'd4402d82c0', 'content': 'Asia > South Asia > India > Southern India > Andhra Pradesh  \n![0_image_0.png](0_image_0.png)'}


In [35]:
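for doc in tqdm(documents):
    doc_id = doc["doc_id"]
    # Skip ids that already have questions so the loop can be re-run after an
    # interruption. Several chunks share a doc_id, so only the first chunk
    # per id is sent to the model.
    if doc_id in results:
        continue

    questions = generate_questions(doc, prompt_template)
    results[doc_id] = questions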

  0%|          | 0/149 [00:00<?, ?it/s]
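In the records shown earlier, several chunks carry the same `doc_id` (`d4402d82c0` appears for `documents[0]`, `documents[1]` and `documents[20]`), so the `if doc_id in results` check skips every chunk after the first for that id. If questions are wanted for each chunk, one option is to derive a per-chunk key. The sketch below assumes that appending a running counter to `doc_id` is acceptable; `assign_chunk_ids` and the `chunk_id` field are hypothetical.

```python
from collections import Counter


def assign_chunk_ids(documents):
    """Return copies of the documents with a 'chunk_id' that is unique within the list."""
    seen = Counter()
    out = []
    for doc in documents:
        n = seen[doc["doc_id"]]  # how many chunks with this doc_id came before
        seen[doc["doc_id"]] += 1
        out.append({**doc, "chunk_id": f"{doc['doc_id']}-{n}"})
    return out
```

The loop would then key `results` on `doc["chunk_id"]`. Since `str.format` ignores unused keyword arguments, the extra field does not affect `prompt_template.format(**doc)`.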

In [39]:
documents[20]

{'location': 'Andhra_Pradesh',
 'doc_id': 'd4402d82c0',
 'content': 'Carnatic music - Carnatic music is born in the rich Telugu language Kuchipudi dance - one of the world famous classical dance forms of India.'}

In [40]:
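final_results = []

for doc_id, questions in results.items():
    # Each value is the raw JSON string returned by the model. Parse it into
    # a list first; iterating over the string itself yields single characters.
    for q in json.loads(questions):
        final_results.append((doc_id, q))

In [43]:
final_results[5]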

In [None]:
df_results = pd.DataFrame(final_results, columns=['id', 'question'])

In [None]:
df_results.to_csv(output_csv, index=False, encoding="utf-8")
print(f"✅ Ground truth data saved to {output_csv}")
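The saved `id`/`question` pairs can later be used to score a retriever: for each question, check whether the search results contain the id the question was generated from. The sketch below shows two common metrics, hit rate and mean reciprocal rank (MRR), under the assumption that the retriever is exposed as a function returning a ranked list of ids. `evaluate`, `hit_rate`, `mrr` and the `search` callable are hypothetical names.

```python
def hit_rate(relevance):
    """Fraction of queries with the expected id anywhere in the results."""
    return sum(any(row) for row in relevance) / len(relevance)


def mrr(relevance):
    """Mean of 1/rank of the expected id (0 when it was not retrieved)."""
    total = 0.0
    for row in relevance:
        for rank, hit in enumerate(row, start=1):
            if hit:
                total += 1 / rank
                break
    return total / len(relevance)


def evaluate(ground_truth, search):
    """Score search(question) -> ranked list of ids against the ground truth rows."""
    relevance = []
    for row in ground_truth:
        retrieved_ids = search(row["question"])
        relevance.append([rid == row["id"] for rid in retrieved_ids])
    return {"hit_rate": hit_rate(relevance), "mrr": mrr(relevance)}
```

The rows could come from `df_results.to_dict(orient="records")`, since the CSV uses the same `id` and `question` column names.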