# Generate Ground Truth Examples

This notebook creates a ground truth dataset from the available PDF files. Questions can be generated with either OpenAI models or open-source models served through Ollama.

## Import libraries

In [1]:
# imports
import json
import os
import pickle
import sys

from dotenv import load_dotenv
from openai import OpenAI
from tqdm import tqdm

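# Load environment variables (such as OPENAI_API_KEY) from a .env file
load_dotenv()

# Make the project's utils directory importable
project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.insert(0, os.path.join(project_dir, "utils"))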

## Load documents from pickle

In [2]:
with open("../data/docs_processed.pickle", "rb") as f:
    documents = pickle.load(f)

In [3]:
len(documents)

4428

In [4]:
documents[0]

{'metadata': {'document_id': 'f392923b41',
  'pdf_name': 'Responsible_travel.pdf',
  'pdf_part': 0,
  'Header 2': 'Responsible Travel'},
 'content': 'See Sustainable travel for the ecological and appropriate technology dimension of travel sustainability.'}

In [5]:
file_names = set()

for doc in documents:
    file_names.add(doc["metadata"]["pdf_name"])

In [6]:
len(file_names)

12

In [7]:
file_names

{'Brunei.pdf',
 'Cambodia.pdf',
 'Indonesia.pdf',
 'Laos.pdf',
 'Malaysia.pdf',
 'Myanmar.pdf',
 'Philippines.pdf',
 'Responsible_travel.pdf',
 'Singapore.pdf',
 'Sustainable_travel.pdf',
 'Thailand.pdf',
 'Vietnam.pdf'}

## Generate ground truth questions from text chunks

### Ollama

In [8]:
client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)
model_name = "phi3"

### OpenAI

In [8]:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
model_name = "gpt-4o"

In [9]:
prompt_template = """
You are an AI model assisting in developing a sustainable tourism recommender system for Southeast Asia. 
Your task is to generate 4 questions that a user might ask when planning a trip to this region with a focus on sustainability. 
The questions should be based on the provided record, which contains information from sources like WikiVoyage about travel destinations, ethical travel practices, and sustainable tourism tips.

The record includes:

topic: {metadata}
text: {content}

Formulate 4 clear and complete questions based on the provided record. These questions should be relevant to sustainable travel and tourism in Southeast Asia and should encourage users to think about ethical and eco-friendly travel options. 
Ensure the questions are varied and concise, using as few words as possible from the original text.

Provide the output in parsable JSON format without using code blocks:

["question1", "question2", "question3", "question4"]
""".strip()

In [10]:
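def generate_questions(doc):
    """Ask the model for questions about one document and return the raw reply text."""
    # doc has the keys "metadata" and "content", matching the template placeholders
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model=model_name, messages=[{"role": "user", "content": prompt}]
    )

    # The reply is expected to be a JSON array, but it is returned unparsed
    json_response = response.choices[0].message.content
    return json_response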
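
`generate_questions` returns the model's reply as a plain string, and smaller models in particular may wrap it in a code fence or return something that is not valid JSON. The following is a minimal sketch of a tolerant parser; `parse_questions` and `FENCE` are hypothetical names that are not part of this notebook.

```python
import json

# Three backticks, built programmatically so this snippet stays easy to embed
FENCE = "`" * 3


def parse_questions(raw):
    """Return the questions in a model reply as a list of strings, or [] on failure."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with an optional language tag)
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-empty string entries
    return [q.strip() for q in parsed if isinstance(q, str) and q.strip()]
```

Returning an empty list on failure makes it easy to spot documents whose reply needs to be regenerated.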

In [11]:
results = {}

for doc in tqdm(documents):
    doc_id = doc["metadata"]["document_id"]
    # Skip documents that already have questions, so the cell can be re-run safely
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions

100%|██████████| 4428/4428 [06:12<00:00, 11.88it/s]  
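
The `if doc_id in results` check only helps while `results` is still in memory. One possible way to survive a kernel restart or an API error is to write the dictionary to disk periodically. The sketch below assumes a hypothetical helper `run_with_checkpoints`, a hypothetical checkpoint file path, and a `generate` callable with the same signature as `generate_questions`; none of these names come from the notebook.

```python
import json
import os


def run_with_checkpoints(docs, generate, checkpoint_path, every=100):
    """Generate questions per document, saving progress to a JSON file."""
    results = {}
    # Resume from an earlier run if a checkpoint file exists
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            results = json.load(f)

    for i, doc in enumerate(docs, start=1):
        doc_id = doc["metadata"]["document_id"]
        if doc_id in results:
            continue
        results[doc_id] = generate(doc)
        # Persist progress every `every` documents
        if i % every == 0:
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(results, f)

    # Final save so the file always reflects the complete run
    with open(checkpoint_path, "w", encoding="utf-8") as f:
        json.dump(results, f)
    return results
```

With this shape, a call such as `run_with_checkpoints(documents, generate_questions, "results_checkpoint.json")` could stand in for the loop above; the file name is only an example.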


In [12]:
documents[1004]

{'metadata': {'document_id': '7f0cdf842b',
  'pdf_name': 'Malaysia.pdf',
  'pdf_part': 1,
  'Header 2': 'Cities'},
 'content': "old town and tin mining area Johor Bahru - capital of Johor and Malaysia's third largest city Kuantan - capital of Pahang and commercial centre of the East Coast Kota Kinabalu - close to tropical islands, lush rain forest and Mount Kinabalu Kuching - capital of Sarawak, and largest city in East Malaysia 7 8"}

### Join generated questions with documents

In [13]:
for doc in tqdm(documents):
    doc_id = doc["metadata"]["document_id"]
    if doc_id in results:
        doc["ground_truth"] = results[doc_id]

100%|██████████| 4428/4428 [00:00<00:00, 2700258.52it/s]


In [14]:
documents[4]

{'metadata': {'document_id': 'f36c94667a',
  'pdf_name': 'Responsible_travel.pdf',
  'pdf_part': 2,
  'Header 2': 'Understand'},
 'content': "bottom line. When you shop, you're putting your money in the hands of locals in a sustainable way, not staying at chain hotels, where revenue isn't spread around. Most principles of responsible tourism were put forth in the Cape Town Declaration on Responsible Tourism in Destinations (http://responsibletourismpartnership.org/cape-town-declaration-on-responsible -tourism/) (Responsible organization (http://www.icrtourism.org/)). In the development of many tourism projects, indigenous people have",
 'ground_truth': '[\n    "What are the benefits of shopping locally when traveling in Southeast Asia?",\n    "How does staying in locally-owned accommodations contribute to sustainable tourism?",\n    "What key principles of responsible tourism should be considered when visiting Southeast Asia?",\n    "How can indigenous communities be supported through 

In [15]:
# Save the documents together with their generated questions, one file per model
with open("../data/GT_docs_{}.bin".format(model_name), "wb") as file:
    pickle.dump(documents, file)
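
For retrieval evaluation it is often more convenient to have one record per question rather than one JSON string per document. The sketch below shows one way to flatten the saved structure; `flatten_ground_truth` is a hypothetical helper, and the sample document is invented for illustration and only mirrors the `metadata` / `ground_truth` layout shown above.

```python
import json


def flatten_ground_truth(docs):
    """Turn documents with a JSON-string 'ground_truth' into per-question records."""
    records = []
    for doc in docs:
        raw = doc.get("ground_truth")
        if raw is None:
            continue
        try:
            questions = json.loads(raw)
        except json.JSONDecodeError:
            # Skip replies that are not valid JSON
            continue
        if not isinstance(questions, list):
            continue
        for question in questions:
            records.append(
                {
                    "question": question,
                    "document_id": doc["metadata"]["document_id"],
                    "pdf_name": doc["metadata"]["pdf_name"],
                }
            )
    return records


# Illustrative input only; the id and questions here are made up
sample_docs = [
    {
        "metadata": {"document_id": "abc123", "pdf_name": "Example.pdf"},
        "content": "Example text.",
        "ground_truth": '["First question?", "Second question?"]',
    }
]
records = flatten_ground_truth(sample_docs)
```

Each record keeps the `document_id`, so a retrieved chunk can be compared against the document the question was generated from.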