# Obtaining the Ground Truth Data

Strategy used:
- for doc in documents:
    generate 5 questions

The results are stored as: question, course, document

## Fetching the documents

In [34]:
import requests

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_rpta= requests.get(docs_url)
docu_raw= docs_rpta.json()

documentos= []
# extract only the documents, adding the course name to each one
for i in docu_raw:
    curso= i['course']
    for doc in i['documents']:
        doc['course']= curso # add a new field with the course name
        documentos.append(doc)

In [2]:
documentos[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

## Generating an id for each document

One option is to assign ascending ids, using the index:

In [3]:
import copy

# deep copy, so that adding the ids does not modify the dicts inside documentos
documentos_c = copy.deepcopy(documentos)
for i, doc in enumerate(documentos_c):
    doc['id'] = i

In [4]:
documentos_c[25]

{'text': 'For uniformity at least, but you’re not restricted to GCP, you can use other cloud platforms like AWS if you’re comfortable with other cloud platforms, since you get every service that’s been provided by GCP in Azure and AWS or others..\nBecause everyone has a google account, GCP has a free trial period and gives $300 in credits  to new users. Also, we are working with BigQuery, which is a part of GCP.\nNote that to sign up for a free GCP account, you must have a valid credit card.',
 'section': 'General course-related questions',
 'question': 'Environment - Why are we using GCP and not other cloud providers?',
 'course': 'data-engineering-zoomcamp',
 'id': 25}

The problem with this solution is that the id depends on the order of the list. For the identifier to be robust, it is preferable that it depend on the content.

**Another option**
The `hashlib` library computes hash digests: it maps an input of any length to a fixed-size value.
Specifically, MD5 takes the input as bytes and returns a 128-bit digest, usually written as 32 hexadecimal characters, that serves as a fingerprint of the content.
MD5 is no longer considered safe for security-sensitive uses such as storing passwords, but it is sufficient here, where the goal is only a compact content-based identifier.

In [5]:
# Example using the text of the first document
import hashlib
# input
texto= documentos[0]['text']

# encode: converts the string to bytes, because MD5 only accepts bytes
# hashlib.md5: applies the algorithm
# hexdigest: returns the digest in hexadecimal format

encode_texto= texto.encode()
hash_algo= hashlib.md5(encode_texto)
hash_hexa= hash_algo.hexdigest()
hash_hexa

'e43a0a720f3665e082784e80a2f08be6'

So, as input we take the following fields of each document:
- id: course + question + first 10 characters of text

This way the *id* is determined by the course, the question and the beginning of the answer.
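To illustrate why a content-based id is more robust than an index, here is a minimal sketch on toy data. The helper `content_id` and the sample records are hypothetical and only serve the comparison; they are not the function used in this notebook.

```python
import hashlib

def content_id(doc):
    # Hypothetical helper: hash of the course, question and text fields
    raw = f"{doc['course']}|{doc['question']}|{doc['text']}"
    return hashlib.md5(raw.encode()).hexdigest()[:8]

docs = [
    {'course': 'c1', 'question': 'q1', 'text': 'a1'},
    {'course': 'c1', 'question': 'q2', 'text': 'a2'},
]

# Index-based ids: the same document gets a different id after reordering
index_ids = {d['question']: i for i, d in enumerate(docs)}
index_ids_reversed = {d['question']: i for i, d in enumerate(reversed(docs))}

# Content-based ids: unaffected by the position in the list
hash_ids = {d['question']: content_id(d) for d in docs}
hash_ids_reversed = {d['question']: content_id(d) for d in reversed(docs)}

print(index_ids == index_ids_reversed)  # False
print(hash_ids == hash_ids_reversed)    # True
```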

In [3]:
# first define the hash function,
# then use it to generate the encoded id
def hash_encode(texto):
    encode_texto = texto.encode()
    hash_algoritmo = hashlib.md5(encode_texto)
    hash_hexa = hash_algoritmo.hexdigest()
    return hash_hexa


In [35]:
for doc in documentos:
    # the id is built from the course, the question and the first 10 characters of the text
    contenido_id = f"{doc['course']}{doc['question']}{doc['text'][:10]}"
    # keep only the first 8 hexadecimal characters as a short id
    doc['id'] = hash_encode(contenido_id)[:8]

In [36]:
documentos[10]

{'text': 'It depends on your background and previous experience with modules. It is expected to require about 5 - 15 hours per week. [source1] [source2]\nYou can also calculate it yourself using this data and then update this answer.',
 'section': 'General course-related questions',
 'question': 'Course - \u200b\u200bHow many hours per week am I expected to spend on this  course?',
 'course': 'data-engineering-zoomcamp',
 'id': 'de2be51c'}

Are the generated ids unique? This has to be checked.
To do so, we use the `defaultdict` subclass from the `collections` module.
- It is a dictionary that takes a default factory (for example `list`, `str` or `int`) used to create the value of a missing key
- An example:

In [8]:
from collections import defaultdict

grouped_data= defaultdict(list)
data = [ ('apple',1), ('banana',2),('apple',3),('orange',4)]
for key, value in data:
    grouped_data[key].append(value)
print(grouped_data)
# groups the values by key

defaultdict(<class 'list'>, {'apple': [1, 3], 'banana': [2], 'orange': [4]})


In [54]:
id_hash = defaultdict(list)

for doc in documentos:
    doc_id = doc['id']
    id_hash[doc_id].append(doc)  # documents with the same id end up under the same key


In [33]:
print(f"Total documents: {len(documentos)}")
print(f"Distinct ids: {len(id_hash)}")

Total documents: 948
Distinct ids: 947
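As an alternative way to run the same check, `collections.Counter` can count how often each id occurs. This is only a sketch on a toy list of ids (hypothetical values standing in for `doc['id']`), not the approach used in this notebook.

```python
from collections import Counter

# Toy ids standing in for doc['id'] values (hypothetical data)
ids = ['a1', 'b2', 'c3', 'b2']

counts = Counter(ids)
duplicated = [doc_id for doc_id, n in counts.items() if n > 1]

print(len(ids), len(counts))  # 4 3
print(duplicated)             # ['b2']
```

A difference between the two lengths signals that at least one id is shared, and `duplicated` lists which ones.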


There are 948 documents but only 947 distinct ids, so two documents share the same id. We identify which ones:

In [55]:
# find the key in the dictionary whose list holds more than one document
id_repetido = None
for key, value in id_hash.items():
    if len(value) > 1:
        id_repetido = key  # only one id is duplicated here, so keeping the last match is enough

doc_sameid = []
for indice, doc in enumerate(documentos):
    if doc['id'] == id_repetido:
        print(f"Index in documentos: documentos[{indice}]")
        doc_sameid.append(doc)

doc_sameid


Index in documentos: documentos[654]
Index in documentos: documentos[657]


[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '7f22da472c'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '7f22da472c'}]

Both records receive the same id even though they are different records. This happens because the fields used to build the id coincide: the same course, the same question and the same first 10 characters of text.
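The collision can be reproduced on toy data. In this sketch, `make_id` is a hypothetical variant of the id function with an optional prefix length, and the two records are invented; the full MD5 digest is compared to keep the example simple.

```python
import hashlib

def make_id(doc, text_chars=None):
    # text_chars=None hashes the full text; an integer keeps only that prefix
    text = doc['text'] if text_chars is None else doc['text'][:text_chars]
    raw = f"{doc['course']}{doc['question']}{text}"
    return hashlib.md5(raw.encode()).hexdigest()

a = {'course': 'ml', 'question': 'q', 'text': 'They both do the same. Asked by someone'}
b = {'course': 'ml', 'question': 'q', 'text': 'They both do the same.'}

print(make_id(a, 10) == make_id(b, 10))  # True: the first 10 characters match
print(make_id(a) == make_id(b))          # False: the full texts differ
```

Hashing the full text would tell the two records apart, at the cost of the id changing whenever the answer is edited.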

**Solution**: one of the records can be ignored or removed, depending on the criterion. Here the record at index 657 is removed.

In [56]:
del documentos[657]
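Deleting by a hard-coded index only works while the list keeps its current order. A possible alternative, sketched here with a hypothetical helper `drop_duplicate_ids` on toy data, is to keep the first document seen for each id:

```python
def drop_duplicate_ids(docs):
    # Keep the first document seen for each id, preserving the original order
    seen = set()
    unique = []
    for doc in docs:
        if doc['id'] not in seen:
            seen.add(doc['id'])
            unique.append(doc)
    return unique

docs = [
    {'id': 'x', 'text': 'first'},
    {'id': 'y', 'text': 'other'},
    {'id': 'x', 'text': 'second'},
]
unique_docs = drop_duplicate_ids(docs)
print([d['text'] for d in unique_docs])  # ['first', 'other']
```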

**Export** the documents with their ids

In [62]:
# Save the documents
import json
with open('documentos_idhash.json','wt') as file:
    json.dump(documentos, file, indent=2)
    # json.dump serializes a dict or list to JSON text and writes it to the file

In [None]:
# check the first lines of the file
!head documentos_idhash.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "d3067a4159"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",
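The `\u201c` and `\u2019` sequences in the file appear because `json.dump` escapes non-ASCII characters by default. A minimal sketch of a round trip that keeps those characters readable, assuming a temporary file and a one-record sample (both hypothetical):

```python
import json
import os
import tempfile

docs = [{'id': 'abc12345', 'text': 'Don’t forget to register'}]

path = os.path.join(tempfile.mkdtemp(), 'docs.json')  # hypothetical file name
with open(path, 'wt', encoding='utf-8') as f:
    # ensure_ascii=False writes the curly apostrophe as-is instead of \u2019
    json.dump(docs, f, indent=2, ensure_ascii=False)

with open(path, 'rt', encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == docs)  # True
```

Either setting loads back to the same Python objects; the option only changes how the file looks on disk.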


## Generating the Ground Truth Data with an LLM (GPT)

### Prompt template

In [11]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record. 

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks and not to consider '\\n':
["question1", "question2", ..., "question5"]
""".strip()

In [12]:
doc1= documentos[10]
doc1

{'text': 'It depends on your background and previous experience with modules. It is expected to require about 5 - 15 hours per week. [source1] [source2]\nYou can also calculate it yourself using this data and then update this answer.',
 'section': 'General course-related questions',
 'question': 'Course - \u200b\u200bHow many hours per week am I expected to spend on this  course?',
 'course': 'data-engineering-zoomcamp',
 'id': 'f1e7d212aa'}

In [None]:
prompt= prompt_template.format(**doc1) # fills in the matching fields from the dict, without passing them one by one
print(prompt)

You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record. 

The record:

section: General course-related questions
question: Course - ​​How many hours per week am I expected to spend on this  course?
answer: It depends on your background and previous experience with modules. It is expected to require about 5 - 15 hours per week. [source1] [source2]
You can also calculate it yourself using this data and then update this answer.

Provide the output in parsable JSON without using code blocks:
["question1", "question2", ..., "question5"]
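The record passed to `format(**doc)` has more keys (`course`, `id`) than the template uses. A short sketch with a reduced, hypothetical template and record shows how `str.format` treats extra and missing keys:

```python
template = "section: {section}\nquestion: {question}\nanswer: {text}"

record = {
    'section': 'General',
    'question': 'When does it start?',
    'text': 'In January.',
    'course': 'demo-course',  # not referenced by the template, silently ignored
    'id': 'abc12345',         # same
}

print(template.format(**record))

# A field required by the template but absent from the arguments raises KeyError
try:
    template.format(section='General')
except KeyError as err:
    print(f"missing field: {err}")
```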


### Function that calls the LLM

In [14]:
# check that the API key is set
import os
api_key= os.getenv("OPENAI_API_KEY")
# api_key

In [15]:
from openai import OpenAI

cliente= OpenAI()

In [None]:
# Function that generates the 5 questions for one document
def generate_questions(doc):
    prompt_doc= prompt_template.format(**doc)

    response = cliente.chat.completions.create(
        model= 'gpt-4o-mini',
        messages= [ {"role": "user", "content":prompt_doc}]
    )
    return response.choices[0].message.content
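The function returns the raw text of the reply, which is not guaranteed to be clean JSON. The following sketch shows one possible way to validate it before use; `parse_questions` is a hypothetical helper, and the assumption that a reply may arrive wrapped in a Markdown code fence is mine, not something observed in this notebook.

```python
import json

FENCE = '`' * 3  # Markdown code fence marker

def parse_questions(raw, expected=5):
    # Hypothetical helper: turn the model's text reply into a list of questions
    cleaned = raw.strip()
    # Assumption: a reply may be wrapped in a code fence despite the prompt
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip('`')
        if cleaned.lower().startswith('json'):
            cleaned = cleaned[len('json'):]
    questions = json.loads(cleaned)
    if not isinstance(questions, list) or len(questions) != expected:
        raise ValueError(f"expected a list of {expected} questions")
    return questions

reply = FENCE + 'json\n["q1", "q2"]\n' + FENCE
print(parse_questions(reply, expected=2))  # ['q1', 'q2']
```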
 

Recall that the Ground Truth consists of:
- Query (what we are generating)
- Relevant documents
- Structure: { 'query1': ['doc1', 'doc7'], ...}

The process takes a while because questions have to be generated for every document, so the `tqdm` library is used to display the progress.

In [114]:
from tqdm.auto import tqdm

querys= {}
# generate the questions
for doc in tqdm(documentos):
    preguntas  = generate_questions(doc)
    # from JSON string to list
    lista_questions= json.loads(preguntas) 

    # associate each id with its generated questions
    querys[doc['id']]= lista_questions

    # for question in lista_questions:
    #     querys[question] = doc['id']


  from .autonotebook import tqdm as notebook_tqdm
 24%|██▍       | 231/947 [05:41<17:38,  1.48s/it]


KeyboardInterrupt: 

Only 24% was processed before the run was interrupted, so the [results.bin](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-evaluation/search_evaluation/results.bin) file generated in the course is used instead, to avoid further OpenAI costs.
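An interrupted run does not have to start over if partial results are saved along the way. The sketch below shows one possible pattern; `generate_all`, the checkpoint file and `fake_generate` (a stand-in for `generate_questions`, so that no API call is made) are all hypothetical.

```python
import json
import os
import tempfile

def generate_all(docs, generate, checkpoint_path):
    # Load whatever a previous, interrupted run already produced
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, 'rt') as f:
            results = json.load(f)

    for doc in docs:
        if doc['id'] in results:
            continue  # already generated, no new call
        results[doc['id']] = generate(doc)
        # Save after every document so an interruption loses at most one call
        with open(checkpoint_path, 'wt') as f:
            json.dump(results, f)
    return results

# Stand-in for generate_questions so the sketch runs without an API key
calls = []
def fake_generate(doc):
    calls.append(doc['id'])
    return [f"question about {doc['id']}"]

docs = [{'id': 'a'}, {'id': 'b'}, {'id': 'c'}]
path = os.path.join(tempfile.mkdtemp(), 'checkpoint.json')

generate_all(docs[:2], fake_generate, path)        # simulates a run that stopped early
results = generate_all(docs, fake_generate, path)  # resumes: only 'c' is generated
print(calls)  # ['a', 'b', 'c']
```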

In [120]:
querys['d3067a4159']

['What is the scheduled start date and time for the course?',
 'Where can I find the course calendar to keep track of sessions?',
 'Is there a specific registration link I need to use before the course begins?',
 'How can I stay updated with course announcements and information?',
 "Which platform do I need to join for the course's community discussions?"]

The `pickle` module is used to deserialize the file (bytes) into a Python object.
- Caveat: the ids in this file differ from the ones generated in this notebook, so the lookup by id further down only succeeds for ids that exist in both.

In [22]:
import pickle

with open('results.bin', 'rb') as file:
    resultados= pickle.load(file)

In [23]:
print(type(resultados['96606db2']))
print(resultados['96606db2'])

<class 'str'>
[
  "How can I persist pgAdmin configuration using Docker-Compose?",
  "What do I need to add to the Docker-Compose YAML file to persist pgAdmin settings?",
  "Where should the pgAdmin data be stored on the host machine for it to persist?",
  "What permissions are required for pgAdmin to write to the folder on the host machine?",
  "Which Docker-Compose command is used before running docker-compose up for pgAdmin configuration?"
]


Convert the results from JSON strings to lists


In [27]:
import ast

resultados_lista = {}
# for id, questions in resultados.items():
#     resultados_lista[id]= json.loads(questions) 

resultados_lista = {key: ast.literal_eval(value) for key, value in resultados.items()}
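The cell uses `ast.literal_eval` where `json.loads` is commented out. As a sketch on hypothetical strings, the two agree on a flat list of double-quoted strings but differ on tokens that exist only in JSON:

```python
import ast
import json

raw = '["When does the course begin?", "How can I get the schedule?"]'

# For a flat list of double-quoted strings both parsers agree
print(json.loads(raw) == ast.literal_eval(raw))  # True

# They diverge on JSON-only tokens: true/false/null are not Python literals
print(json.loads('[true, null]'))  # [True, None]
try:
    ast.literal_eval('[true, null]')
except ValueError as err:
    print(f"literal_eval failed: {type(err).__name__}")
```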



In [39]:
resultados_lista['c02e79ef']

['When does the course begin?',
 'How can I get the course schedule?',
 'What is the link for course registration?',
 'How can I receive course announcements?',
 'Where do I join the Slack channel?']

Final structure of the **Ground Truth Data**:
- question, course, document id

In [None]:
ground_truth_data = []

# index the documents by id for direct lookup
doc_index = {d['id']: d for d in documentos}

for doc_id, questions in resultados_lista.items():
    curso = doc_index[doc_id]['course']  # look up the course from the id
    for q in questions:
        ground_truth_data.append((q, curso, doc_id))
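If an id in the results has no matching document, the direct lookup raises `KeyError`. A defensive variant, sketched here on toy data with hypothetical names (`questions_by_id`, `rows`, `missing`), collects the unmatched ids instead:

```python
docs = [
    {'id': 'a1', 'course': 'course-x'},
    {'id': 'b2', 'course': 'course-y'},
]
questions_by_id = {
    'a1': ['q1', 'q2'],
    'zz': ['q3'],  # id with no matching document
}

doc_index = {d['id']: d for d in docs}

rows, missing = [], []
for doc_id, questions in questions_by_id.items():
    doc = doc_index.get(doc_id)
    if doc is None:
        missing.append(doc_id)  # collect instead of raising KeyError
        continue
    for q in questions:
        rows.append((q, doc['course'], doc_id))

print(rows)     # [('q1', 'course-x', 'a1'), ('q2', 'course-x', 'a1')]
print(missing)  # ['zz']
```

Inspecting `missing` before exporting makes it visible when the two id schemes do not line up.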

In [None]:
import pandas as pd

df = pd.DataFrame(ground_truth_data, columns=['question', 'course', 'document'])

df.to_csv('ground-truth-data.csv', index=False)
