This notebook processes semantic chunks extracted from PDFs and builds a ground-truth dataset 
for retrieval evaluation. It:
1. Reads chunked semantic data from a JSON file.
2. Filters and renames specific fields for cleaner processing.
3. Uses an LLM to generate 5 sustainability-focused travel questions 
   for each record.
4. Saves the resulting ground-truth dataset to CSV.

#### Helper functions

In [37]:
import json
import os

import pandas as pd
from dotenv import load_dotenv
from tqdm.auto import tqdm

# Read the API key from a local .env file.
# The key must belong to whichever provider `client` talks to.
load_dotenv()
API_KEY = os.getenv("OPENAI_API_KEY")

In [13]:
file_path = '../data/processed/semantic_chunks.json'
output_csv = '../data/processed/ground-truth-retrieval.csv'

In [14]:
# ------------------------------------------------------------
# Load semantic chunks JSON file
# ------------------------------------------------------------
with open(file_path, "r", encoding="utf-8") as f:
    chunks = json.load(f)
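
The later cells assume every chunk has a `metadata` dict with a `document_id` and a `chunk_text` field. A minimal sketch of a pre-flight check for that assumption follows; `find_malformed_chunks`, `REQUIRED_KEYS`, and the sample records are illustrative names and data, not part of the original pipeline.

```python
from typing import Any

# Top-level keys the DataFrame cell below relies on (assumption based on chunks[0]).
REQUIRED_KEYS = {"metadata", "chunk_text"}


def find_malformed_chunks(chunks: list[dict[str, Any]]) -> list[int]:
    """Return the indices of chunks missing a required key or a document_id."""
    bad = []
    for i, chunk in enumerate(chunks):
        if not REQUIRED_KEYS.issubset(chunk):
            bad.append(i)
        elif "document_id" not in chunk["metadata"]:
            bad.append(i)
    return bad


# Illustrative records: one well-formed, two deliberately broken.
sample_chunks = [
    {
        "metadata": {"document_id": "d4402d82c0", "pdf_name": "Andhra_Pradesh.pdf", "pdf_part": 0},
        "chunk_text": "Asia > South Asia > India",
    },
    {"metadata": {"pdf_name": "Andhra_Pradesh.pdf"}, "chunk_text": "no id here"},
    {"chunk_text": "no metadata"},
]

malformed = find_malformed_chunks(sample_chunks)
print(malformed)  # [1, 2]
```

Running `find_malformed_chunks(chunks)` on the loaded data before building the DataFrame would surface bad records early instead of as empty strings in `location` or `doc_id`.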

In [15]:
chunks[0]

{'metadata': {'document_id': 'd4402d82c0',
  'pdf_name': 'Andhra_Pradesh.pdf',
  'pdf_part': 0},
 'chunk_text': 'Asia > South Asia > India > Southern India > Andhra Pradesh  \n![0_image_0.png](0_image_0.png)',
 'token_count': 93}

In [16]:
# ------------------------------------------------------------
# Create DataFrame with required fields
# ------------------------------------------------------------
df_filtered = pd.DataFrame(chunks)
df_filtered["location"] = df_filtered["metadata"].apply(lambda m: m.get("pdf_name", "").replace(".pdf", ""))
df_filtered["doc_id"] = df_filtered["metadata"].apply(lambda m: m.get("document_id", ""))
df_filtered["content"] = df_filtered["chunk_text"]
df_filtered = df_filtered[["location", "doc_id", "content"]]

In [17]:
display(df_filtered.head())

   location        doc_id      content
0  Andhra_Pradesh  d4402d82c0  Asia > South Asia > India > Southern India > A...
1  Andhra_Pradesh  d4402d82c0  Andhra Pradesh (AP) is a state in Southern Ind...
2  Andhra_Pradesh  d4402d82c0  Northern Coast (Alluri Sitharama Raju, Anakapa...
3  Andhra_Pradesh  d4402d82c0  Here are some of the most notable cities.
4  Andhra_Pradesh  ed240be0ed  Amaravati - the capital of Andhra Pradesh whic...


In [18]:
documents = df_filtered.to_dict(orient="records")

In [25]:
documents[0]

{'location': 'Andhra_Pradesh',
 'doc_id': 'd4402d82c0',
 'content': 'Asia > South Asia > India > Southern India > Andhra Pradesh  \n![0_image_0.png](0_image_0.png)'}

In [22]:
# ------------------------------------------------------------
# Prompt template for generating questions
# ------------------------------------------------------------
prompt_template = """
You are an AI model assisting in developing a sustainable tourism recommender system for India.  
Your task is to generate 5 questions that a user might ask when planning a trip to this region with a focus on sustainability. 
The questions should be based on the provided record, which contains information from sources like WikiVoyage about travel destinations, ethical travel practices, and sustainable tourism tips. 

The record includes:

location: {location}
content: {content}

Formulate 5 clear and complete questions based on the provided record. These questions should be relevant to sustainable travel and tourism in India and should encourage users to think about ethical and eco-friendly travel options. 
Ensure the questions are varied and concise, using as few words as possible from the original text.

Provide the output in parsable JSON format without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [23]:
prompt = prompt_template.format(**documents[0])

In [24]:
print(prompt)

You are an AI model assisting in developing a sustainable tourism recommender system for India.  
Your task is to generate 5 questions that a user might ask when planning a trip to this region with a focus on sustainability. 
The questions should be based on the provided record, which contains information from sources like WikiVoyage about travel destinations, ethical travel practices, and sustainable tourism tips. 

The record includes:

location: Andhra_Pradesh
content: Asia > South Asia > India > Southern India > Andhra Pradesh  
![0_image_0.png](0_image_0.png)

Formulate 5 clear and complete questions based on the provided record. These questions should be relevant to sustainable travel and tourism in India and should encourage users to think about ethical and eco-friendly travel options. 
Ensure the questions are varied and concise, using as few words as possible from the original text.

Provide the output in parsable JSON format without using code blocks:

["question1", 

In [38]:
from openai import OpenAI

# Point the OpenAI SDK at Groq's OpenAI-compatible endpoint.
# base_url is set on the OpenAI client itself. Setting it only on a custom
# httpx.Client did not take effect: the request still reached api.openai.com
# and failed with "401 invalid_api_key" because an OpenAI "sk-proj-..." key was sent.
client = OpenAI(
    api_key=API_KEY,  # must be a Groq API key when base_url points at Groq
    base_url="https://api.groq.com/openai/v1",
)

# Example request (the message text is just a placeholder)
resp = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": "content"}]
)
# In openai>=1.0 the message is an object, so use attribute access, not ["content"].
print(resp.choices[0].message.content)

In [None]:
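def llm(prompt):
    """Send a single-turn prompt and return the reply text."""
    # The model must be one served by the endpoint `client` points to;
    # 'gpt-4o-mini' is an OpenAI-hosted model name.
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content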

In [None]:
questions = llm(prompt)

In [None]:
json.loads(questions)
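
A bare `json.loads` raises as soon as the model wraps its reply in a code fence or returns the wrong number of items. The sketch below shows one way to validate the reply before using it. `parse_questions` and `FENCE` are hypothetical names, and the fence-stripping rule is an assumption about how a model might deviate from the prompt, not observed behaviour.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to avoid a literal fence here


def parse_questions(raw: str, expected: int = 5) -> list[str]:
    """Parse a model reply into a list of question strings, or raise ValueError."""
    text = raw.strip()
    # Drop a surrounding code fence (with optional "json" tag) if the model added one.
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)].strip()
        if text.lower().startswith("json"):
            text = text[4:].strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not all(isinstance(q, str) for q in data):
        raise ValueError("reply is not a JSON list of strings")
    if len(data) != expected:
        raise ValueError(f"expected {expected} questions, got {len(data)}")
    return data


# Illustrative replies (placeholder text, not real model output).
plain_reply = '["q1", "q2", "q3", "q4", "q5"]'
fenced_reply = FENCE + 'json\n["q1", "q2", "q3", "q4", "q5"]\n' + FENCE

print(parse_questions(plain_reply))
print(parse_questions(fenced_reply) == parse_questions(plain_reply))
```

Raising `ValueError` for every failure mode lets a caller catch a single exception type and decide whether to retry the request or skip the record.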

In [7]:
# import openai

# # Initialize Groq client
# client = openai.OpenAI(
#     api_key="your-groq-api-key",
#     base_url="https://api.groq.com/openai/v1"
# )

In [None]:
# def generate_questions(doc):
#     prompt = prompt_template.format(**doc)
    
#     # Use Groq API with OpenAI-compatible interface
#     response = client.chat.completions.create(
#         model="openai/gpt-oss-20b",  # Popular Groq model (you can change this)
#         messages=[{"role": "user", "content": prompt}],
#     )
    
#     json_response = response.choices[0].message.content
#     return json_response

In [None]:
def generate_questions(doc):
    """Fill the prompt template with one record and return the model's raw JSON string."""
    prompt = prompt_template.format(**doc)
    return llm(prompt)

In [None]:
from tqdm.auto import tqdm

In [None]:
results = {}

In [None]:
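for doc in tqdm(documents):

    # Records built from df_filtered expose the id as "doc_id"; they have no "metadata" key.
    doc_id = doc["doc_id"]
    # Several chunks can share one doc_id, so only the first chunk of each document is processed.
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions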
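
Rows 0 to 3 of the `df_filtered` preview all carry `d4402d82c0`, so keying `results` by `doc_id` alone yields questions for one chunk per document. If questions are wanted for every chunk, one option is a per-chunk key. The sketch below is an assumption about how that could look: `chunk_key`, `fake_generate`, and the sample records are hypothetical, and the stub stands in for the model call so the example runs offline.

```python
def chunk_key(doc_id: str, position: int) -> str:
    """Build a per-chunk key such as 'd4402d82c0:2' (hypothetical scheme)."""
    return f"{doc_id}:{position}"


def fake_generate(doc: dict) -> str:
    """Stand-in for the model call; returns a JSON list with one placeholder."""
    return '["placeholder question about ' + doc["location"] + '"]'


# Illustrative records in the same shape as `documents`.
sample_documents = [
    {"location": "Andhra_Pradesh", "doc_id": "d4402d82c0", "content": "first chunk"},
    {"location": "Andhra_Pradesh", "doc_id": "d4402d82c0", "content": "second chunk"},
    {"location": "Andhra_Pradesh", "doc_id": "ed240be0ed", "content": "third chunk"},
]

chunk_results = {}
seen_per_doc = {}  # how many chunks of each document have been visited so far
for doc in sample_documents:
    position = seen_per_doc.get(doc["doc_id"], 0)
    seen_per_doc[doc["doc_id"]] = position + 1
    key = chunk_key(doc["doc_id"], position)
    if key in chunk_results:  # already processed, e.g. when resuming an interrupted run
        continue
    chunk_results[key] = fake_generate(doc)

print(list(chunk_results))
# ['d4402d82c0:0', 'd4402d82c0:1', 'ed240be0ed:0']
```

The keys are deterministic as long as `documents` keeps its order, so the skip check still lets an interrupted run resume without repeating requests.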

In [None]:
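final_results = []

for doc_id, questions_json in results.items():
    # generate_questions returns a JSON string; parse it before iterating,
    # otherwise the loop walks over individual characters.
    for q in json.loads(questions_json):
        final_results.append((doc_id, q))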

In [None]:
final_results[0]

In [None]:
df_results = pd.DataFrame(final_results, columns=['id', 'question'])

In [None]:
df_results.to_csv(output_csv, index=False, encoding="utf-8")
print(f"✅ Ground truth data saved to {output_csv}")
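
Before the CSV feeds a retrieval evaluation, it can help to confirm it has the expected `id,question` header and no blank questions. The following is a minimal sketch using only the standard library; `validate_ground_truth` is a hypothetical helper, and the rows written to the temporary file are placeholders rather than real generated questions.

```python
import csv
import tempfile
from pathlib import Path


def validate_ground_truth(path: Path) -> int:
    """Check the header and that no question is blank; return the row count."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "question"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        count = 0
        for row in reader:
            if not (row["question"] or "").strip():
                raise ValueError(f"blank question for id {row['id']}")
            count += 1
    return count


# Write a tiny placeholder file and validate it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    with open(sample_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "question"])
        writer.writerow(["d4402d82c0", "placeholder question one"])
        writer.writerow(["ed240be0ed", "placeholder question two"])
    row_count = validate_ground_truth(sample_path)

print(row_count)  # 2
```

Calling `validate_ground_truth(Path(output_csv))` after the save cell would apply the same check to the real file.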