# Evaluation data lab

The aim of this notebook is to experiment with and build the evaluation dataset (ground truth) for retrieval.

## Read previously cleaned/processed data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../../data/data.csv')
documents = df.to_dict(orient='records')

In [3]:
documents[0]

{'id': 39,
 'title': "Pokémon: Let's Go, Pikachu! and Pokémon: Let's Go, Eevee! Trailer",
 'description': "Return to the Kanto region and experience a classic Pokémon journey in a whole new way with Pokémon: Let's Go, Pikachu! and Pokémon: Let's Go, Eevee! on Nintendo Switch! Coming November 16, 2018: http://bit.ly/2ISUMVXOfficial site: https://www.pokemon.com/PokemonLetsGoShop: http://www.pokemoncenter.comFacebook: http://www.facebook.com/Pokemon Twitter: http://www.twitter.com/Pokemon Instagram: http://www.instagram.com/pokemon Tumblr: http://www.pokemon.tumblr.com",
 'tags': 'Pokémon|"Pokemon"|"Pokémon Let\'s Go"|"Pokémon Lets Go"|"Pokemon Lets Go"|"Pokemon Let\'s Go"|"Pikachu"|"Eevee"|"Pokémon Let\'s Go Pikachu"|"Pokémon Let\'s Go Eevee"|"Pokémon Let\'s Go Pikachu and Eevee"|"Pokémon game"|"New Pokémon game"|"Nintendo Switch"|"Pokémon Nintendo Switch"|"Pokémon video game"|"Nintendo"|"Pokémon for Switch"'}
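
When a `title`, `description`, or `tags` cell is empty in the CSV, pandas loads it as `NaN`, and `str.format` later renders it as the text `nan` inside the prompt. Whether `data.csv` contains such cells is not shown here, so the following is only a minimal sketch of one way to guard against it. `clean_record` is a hypothetical helper, not part of the notebook.

```python
import math


def clean_record(doc: dict) -> dict:
    """Return a copy of doc with missing values (None or NaN) replaced by ''."""
    cleaned = {}
    for key, value in doc.items():
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        cleaned[key] = "" if is_missing else value
    return cleaned


# Made-up record with two missing fields.
example = {"id": 1, "title": "Some title", "description": float("nan"), "tags": None}
print(clean_record(example))
```

Applied to the loaded data, this would look like `documents = [clean_record(d) for d in documents]`.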

## Ground truth or evaluation data

In [4]:
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI()

### Evaluation prompt template

In [5]:
prompt_template = """
You emulate a youtuber engaging title generator to achieve >1M views.
Formulate 5 video queries that a user might input to obtain its desired video title.
The record should contain the YouTube video title.

The record:

title: {title}
description: {description}
tags: {tags}

Provide the output in parsable JSON without using code blocks:

{{"queries": ["query1", "query2", ..., "query5"]}}
""".strip()

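The template mixes single braces (placeholders filled from the record) with doubled braces (literal braces in the JSON example). A small sketch of how `str.format` treats the two, using a shortened stand-in for the template and a made-up record:

```python
# Shortened stand-in for prompt_template; the record below is made up.
demo_template = """
title: {title}
description: {description}
tags: {tags}

{{"queries": ["query1", "query2", ..., "query5"]}}
""".strip()

demo_doc = {
    "id": 1,  # keys without a placeholder are ignored by format(**doc)
    "title": "Example title",
    "description": "Example description",
    "tags": "example|demo",
}

rendered = demo_template.format(**demo_doc)
print(rendered)
```

The doubled braces come out as single braces, so the model sees a well-formed JSON example, and the extra `id` key in the record does not cause an error.
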
### Evaluation data generation

In [6]:
import json

def generate_queries(doc: dict) -> str:
    """Ask the model for queries for one record and return the raw reply text."""
    prompt = prompt_template.format(**doc)
    
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [7]:
import ast

from tqdm import tqdm

results = {}

for doc in tqdm(documents):
    doc_id = doc['id']
    if doc_id in results:
        continue
    
    queries_raw = generate_queries(doc)

    # Parse the reply as a Python literal, which also accepts single-quoted
    # strings. The former dumps/loads round-trip returned this same dict.
    queries = ast.literal_eval(queries_raw)
            
    results[doc_id] = queries['queries']

100%|██████████| 200/200 [04:15<00:00,  1.28s/it]
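
The parsing step in the loop assumes the reply is always a clean dict literal. A possible, more defensive variant is sketched below: it tries strict JSON first, falls back to Python literal syntax, and tolerates a Markdown code fence around the reply. `parse_queries` is a hypothetical helper, and the failure modes it handles are assumptions about what the model might return, not behaviour observed in this run.

```python
import ast
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_queries(raw: str) -> list:
    """Parse a model reply into a list of query strings."""
    text = raw.strip()
    # Drop a code fence if the model added one despite the prompt.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (e.g. single-quoted strings).
        parsed = ast.literal_eval(text)
    queries = parsed["queries"]
    if not isinstance(queries, list) or not all(isinstance(q, str) for q in queries):
        raise ValueError("expected a list of strings under 'queries'")
    return queries


print(parse_queries('{"queries": ["a", "b"]}'))
print(parse_queries("{'queries': ['a', 'b']}"))
```

Inside the loop, this would replace the parsing lines with `results[doc_id] = parse_queries(queries_raw)`.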


In [8]:
# Flatten {doc_id: [query, ...]} into a list of (doc_id, query) pairs
final_results = [
    (doc_id, query)
    for doc_id, queries in results.items()
    for query in queries
]

In [9]:
len(final_results)

1000
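
A total of 1000 pairs is consistent with 200 records times 5 queries each, but the total alone would not reveal a record with 4 queries offset by another with 6. The sketch below checks the count per record and looks for repeated pairs. `check_ground_truth` is a hypothetical helper and the sample data is made up.

```python
from collections import Counter


def check_ground_truth(pairs, expected_per_doc=5):
    """Return ids with an unexpected query count and any repeated (id, query) pairs."""
    per_doc = Counter(doc_id for doc_id, _ in pairs)
    wrong_count = {doc_id: n for doc_id, n in per_doc.items() if n != expected_per_doc}
    duplicates = [pair for pair, n in Counter(pairs).items() if n > 1]
    return wrong_count, duplicates


# Made-up pairs: id 1 has two queries (one repeated), id 2 has only one.
sample = [(1, "a"), (1, "a"), (2, "b")]
print(check_ground_truth(sample, expected_per_doc=2))
```

On the notebook's data the call would be `check_ground_truth(final_results)`; two empty results would mean every record has exactly 5 distinct queries.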

In [10]:
final_results[0]

(39, "best Pokémon Let's Go Pikachu trailer reaction")

In [11]:
df_results = pd.DataFrame(final_results, columns=['id', 'query'])

In [12]:
df_results.head(10)

|   | id | query |
|---|---|---|
| 0 | 39 | best Pokémon Let's Go Pikachu trailer reaction |
| 1 | 39 | Pokémon Let's Go Eevee gameplay review |
| 2 | 39 | Pokémon Let's Go features explained |
| 3 | 39 | how to catch rare Pokémon in Let's Go |
| 4 | 39 | ultimate guide to Pokémon Let's Go on Nintendo... |
| 5 | 156 | Sidemen football challenge ideas |
| 6 | 156 | Total Wipeout football challenge highlights |
| 7 | 156 | Epic Sidemen sports challenges |
| 8 | 156 | Behind the scenes of Sidemen Total Wipeout |
| 9 | 156 | Sidemen Ultimate Challenge Compilation |


In [13]:
df_results.shape

(1000, 2)

In [14]:
df_results.to_csv('../../data/ground-truth-retrieval.csv', index=False)
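
For reference, a downstream evaluation step could read the saved file back and group the queries by record id. The sketch below uses the standard `csv` module on a made-up in-memory sample with the same `id,query` layout, so it runs without the real file; the grouping itself is an assumption about how the file might be consumed.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the same two-column layout as ground-truth-retrieval.csv.
sample_csv = 'id,query\n1,first query\n1,"second, with a comma"\n2,third query\n'

queries_by_id = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    queries_by_id[int(row["id"])].append(row["query"])

print(dict(queries_by_id))
```

A query containing a comma is quoted in the file and comes back as a single field when parsed.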