# 3. Automatic Evaluation Dataset Generation for Ragas

In this notebook, we generate a synthetic evaluation dataset of question, context, and ground-truth answer triples that we can use to evaluate our RAG pipeline with Ragas. Ragas is a framework for evaluating Retrieval Augmented Generation (RAG) pipelines.

Authors:
- Luis Bernardo Hernandez Salinas
- Juan R. Terven

In [71]:
import os
import random  # used later to sample splits for question generation
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain.prompts import ChatPromptTemplate
from tqdm import tqdm
import pandas as pd
from datasets import Dataset

In [53]:
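# Model to use
llm_name = "gpt-4o"

# API key (raises KeyError if the environment variable is not set)
api_key = os.environ['OPENAI_API_KEY']

# Chat model for question generation; later cells call `llm`.
# Default sampling settings are an assumption. ChatOpenAI reads OPENAI_API_KEY from the environment.
llm = ChatOpenAI(model=llm_name)

print(f"Using model {llm_name}")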

Using model gpt-4o


## Get the document splits

In [10]:
loader = PyPDFLoader('chevrolet-spark.pdf')

# load pdf pages
pages = loader.load()
print(f"The document has {len(pages)} pages")

# RecursiveCharacterTextSplitter with overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,  # chunk size in characters
    chunk_overlap = 150 # characters of overlap between consecutive chunks
)

# split documents
splits = text_splitter.split_documents(pages)

print(f"Generated {len(splits)} splits")

The document has 204 pages
Generated 243 splits
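
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class first tries to split on separators such as paragraph and line breaks); the function name `sliding_window_chunks` is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny illustrative input (not taken from the manual).
demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each chunk starts with the last character of the previous one, which is the same idea as the 150-character overlap used above: text that falls on a chunk boundary still appears whole in at least one chunk.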


In [47]:
# index of the first split with useful content
first_split = 46
# index of the last split with useful content (exclusive when slicing)
last_split = 200

splits[first_split]

Document(page_content='SEATS AND OCCUPANT PROTECTION SYSTEMS  2–3\nHOW SAFETY BELTS WORK!\nSafety belts cannot work unless they are\nworn and worn properly.\nVehicle occupants are injured if the forces\napplied to the body’s structures are greaterthan the body can tolerate without being\ninjured. If a person’s body is stopped\nabruptly, the forces applied to the bodywill be high, whereas if the body is slowed\ndown gradually over some distance, the\nforces will be much lower. Thus, in orderto protect an occupant from injury in a\ncrash, the idea is to give the person as\nmuch time and distance as possible incoming to a stop.\nImagine a person running at 15 miles per\nhour (25 km/h) head first into a concrete\nwall. Imagine a second person running at15 miles per hour (25 km/h) into a wall cov-\nered by a 3-feet (90 cm) thick deformable\ncushion. In the first instance the personcould be seriously injured or even killed.\nIn the second, the runner could expect to\nwalk away uninjured. Why

## Generating a synthetic question for testing the question generator

In [22]:
# ResponseSchema describes one field (its name and a description)
# that we expect in the model's structured output.
question_schema = ResponseSchema(
    name='question',
    description='a question about the context.'
)
question_response_schemas = [question_schema]

# StructuredOutputParser builds format instructions from the schemas and
# parses the model's reply (a JSON object) into a Python dict.
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()

# Define a template for the question generator
qa_templates = """\
You are a car expert creating a test for car users. For each context, create a question that is specific to the context.
Avoid creating generic or general questions. All the questions must be in English.

question: a question about the context.

Format the output as JSON with the following keys:
question

context: {context}
"""
prompt_template = ChatPromptTemplate.from_template(template=qa_templates)

# Generate a question from the provided context
messages = prompt_template.format_messages(
    context=splits[50],
    format_instructions=format_instructions
)
response = llm.invoke(messages) # Call the language model to generate a response.
output_dict = question_output_parser.parse(response.content) # Parse the reply into a structured dict.

In [24]:

for k, v in output_dict.items():
    print(k)  # Print the key of the current pair.
    print("")
    print(v)  # Print the value associated with the key.

question

What are the potential risks associated with having unbelted occupants in a vehicle during a crash, as described in the SEATS AND OCCUPANT PROTECTION SYSTEMS section?
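
The parsing step can be hard to picture, so here is a minimal, standard-library sketch of what it amounts to: pull a JSON object out of the model's reply and check that the expected keys are present. This is a simplified stand-in, not the actual implementation of `StructuredOutputParser`; the function name `parse_json_response` and the sample reply are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable

# Matches the contents of a fenced block, with or without a "json" language tag.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_response(text: str, expected_keys: list[str]) -> dict:
    """Extract a JSON object from a model reply and check that it has the expected keys."""
    match = FENCED_BLOCK.search(text)
    # Prefer the fenced block if there is one; otherwise treat the whole reply as JSON.
    payload = match.group(1) if match else text
    data = json.loads(payload.strip())
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"Response is missing keys: {missing}")
    return data


# Made-up reply in the shape a chat model might return.
reply = FENCE + 'json\n{"question": "How often should tire pressure be checked?"}\n' + FENCE
parsed = parse_json_response(reply, ["question"])
print(parsed["question"])
```

Raising on missing keys is what makes it possible to wrap the parse call in `try`/`except` and skip malformed replies.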


## Generate 20 synthetic questions

In [49]:
# Initialize the list that will hold the question, answer, and context triples.
qac_triples = []

# Process a random sample of 20 splits from the useful range, showing a progress bar.
for text in tqdm(random.sample(splits[first_split:last_split], 20)):
    # Format the prompt messages with the current split as context.
    messages = prompt_template.format_messages(
        context=text,
        format_instructions=format_instructions
    )

    # Generate a response with the language model.
    response = llm.invoke(messages)

    try:
        # Try to parse the response into structured data.
        output_dict = question_output_parser.parse(response.content)
    except Exception:
        # Skip this split if the response cannot be parsed.
        continue

    # Store the original split as the context in the output dictionary.
    output_dict['context'] = text
    # Append the updated dictionary to the list of triples.
    qac_triples.append(output_dict)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:44<00:00,  2.21s/it]
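
Because the splits are drawn at random, rerunning the cell produces a different set of questions. If a repeatable dataset is preferred, one option is to sample with a dedicated, seeded generator. The sketch below assumes a hypothetical helper `sample_reproducibly` and uses a list of integers as a stand-in for `splits[first_split:last_split]`; the seed value is arbitrary.

```python
import random


def sample_reproducibly(items: list, k: int, seed: int = 42) -> list:
    """Return up to k distinct items chosen with a local, seeded random generator."""
    rng = random.Random(seed)  # local generator: leaves the global random state untouched
    return rng.sample(items, min(k, len(items)))


# Stand-in for splits[first_split:last_split]; real code would pass the Document list.
candidate_ids = list(range(46, 200))

first_run = sample_reproducibly(candidate_ids, k=20)
second_run = sample_reproducibly(candidate_ids, k=20)
print(first_run == second_run)
```

With a fixed seed the same splits are selected on every run, which makes it easier to compare evaluation results across notebook sessions.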


In [52]:
qac_triples[1]

{'question': 'According to the Chevrolet Spark manual, how often should you check your tire pressure, and what tool should you use for this check?',
 'context': Document(page_content='SERVICE AND VEHICLE CARE  7–25\nSee “VEHICLE SPECIFICATIONS” on\npage 9-6 for proper tire inflation pressure.\nTire condition should be inspected before\ndriving and tire pressure should be\nchecked each time you fill your fuel tankor at least once a month using a tire pres-sure gauge.\nIncorrect tire inflation pressures will:\n• Increase tire wear.\n• Impair vehicle handling and safe opera-\ntion.\n• Affect ride comfort.\n• Reduce fuel economy.\nIf tire pressure is too low, tires can over-\nheat and suffer internal damage, treadseparation, and even a blowout at high\nspeeds. Even if you later adjust the infla-\ntion pressure of your tires, previous driv-ing with low pressure may have damagedthe tires.Caring for your tires and wheels\nDriving over sharp objects can damage the\ntires and wheels. If some ob

## Insert the answer to each question 

In [60]:
answer_generation_llm = ChatOpenAI(model=llm_name, temperature=0)
answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)
answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a car expert creating a test for car users. For each question and context, create an answer.
answer: an answer about the context.
Format the output as JSON with the following keys:
answer
question: {question}
context: {context}
"""
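# Note: this reassigns prompt_template and format_instructions from the question-generation cells.
prompt_template = ChatPromptTemplate.from_template(template=qa_template)
# The "chain" here is just the chat model; the prompt is formatted separately before each call.
answer_generation_chain = answer_generation_llm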

### Let's first try with a single one and check the results

In [61]:
messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)
response = answer_generation_chain.invoke(messages)
output_dict = answer_output_parser.parse(response.content)

In [62]:
for k, v in output_dict.items():
    print(k)
    print("-----")
    print(v)

answer
-----
Before using the air conditioning system, open the windows for a few minutes to permit hot air to escape if the vehicle has been parked in direct sunlight.
question
-----
What should you do before using the air conditioning system if your vehicle has been parked in direct sunlight?
context
-----
page_content='4–10  HEATING & AIR CONDITIONING
OPERATING TIPS
• Before using the air conditioning sys-
tem, open the windows for a few min-
utes to permit hot air to escape if the
vehicle has been parked in direct sun-light.
• For maximum cooling, select the venti-
lation mode and the highest fan speed.
Make sure that the air conditioningcompressor is turned on. Then rotatethe temperature control knob to selectthe coolest temperature and select therecirculation mode.
• To defog the windows on rainy days
or in high humidity, turn on the airconditioning compressor.
• Turn on the air conditioning for a few
minutes at least once a week, even inthe winter or when the air conditioningsys

### Now get the answers on all the questions

In [64]:
for triple in tqdm(qac_triples):
    messages = prompt_template.format_messages(
        context=triple['context'],
        question=triple['question'],
        format_instructions=format_instructions
    )
    response = answer_generation_chain.invoke(messages)
    
    try:
        output_dict = answer_output_parser.parse(response.content)
    except Exception:
        # Skip this triple if the response cannot be parsed; it will have no 'answer' key.
        continue
    
    # Update the current triple with the generated answer.
    triple['answer'] = output_dict['answer']

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [02:19<00:00,  7.00s/it]
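
The loop above skips a triple when its response cannot be parsed, so some entries may end up without an `answer`. A minimal sketch for separating complete triples from incomplete ones before building the dataset follows; the helper name `split_by_answer` and the sample records are hypothetical.

```python
def split_by_answer(triples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate triples that have a non-empty 'answer' from those that do not."""
    complete = [t for t in triples if t.get("answer")]
    incomplete = [t for t in triples if not t.get("answer")]
    return complete, incomplete


# Made-up records: the second one mimics a triple whose answer could not be parsed.
demo_triples = [
    {"question": "q1", "context": "c1", "answer": "a1"},
    {"question": "q2", "context": "c2"},
]

complete, incomplete = split_by_answer(demo_triples)
print(f"{len(complete)} complete, {len(incomplete)} without an answer")
```

Keeping only the complete triples avoids missing values in the `ground_truth` column later on; the incomplete ones can be retried or inspected by hand.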


## Combine questions, contexts, and answers for evaluation dataset

In [67]:
# To pandas
ground_truth_qac_set = pd.DataFrame(qac_triples)

# Keep only the page text of each Document as the context string
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))

# Rename answer to ground_truth
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})

# Convert to Hugging Face Dataset
eval_dataset = Dataset.from_pandas(ground_truth_qac_set)

In [68]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 20
})

In [69]:
eval_dataset[9]

{'question': 'What steps should you follow to use the bi-level ventilation setting on a cool, sunny day?',
 'context': 'HEATING & AIR CONDITIONING  4–9\nNormal heating\n1. Turn off air conditioning (A/C). (Indi-\ncator goes off)\n2. Slide the recirculation lever to outside\nair mode.\n3. Turn air distribution knob to FLOOR\n(     ) or BI-LEVEL (     ).\n4. Turn temperature control knob to red\narea for heating.\n5. Turn fan speed control knob to desired\nspeed.VENTILATION\nBi-level\nUse this setting on cool, but sunny days.\nWarmer air will flow into the floor area and\ncool outside air will flow towards yourupper body.\nTo use this setting:\n1. Slide the recirculation lever to outside\nair mode.\n2. Turn air distribution knob to BI-LEVEL\n(     ).\n3. Adjust temperature control knob to the\ndesired temperature.\n4. Turn fan speed control knob to the de-\nsired speed.Ventilation\nTo direct air through the center and side\nvents:\n1. Turn off air conditioning (A/C). (Indi-\ncator goes o

In [70]:
eval_dataset.to_csv('ground_truth_qac_set_spark_2.csv')

Creating CSV from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 178.15ba/s]
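
Before handing the file to an evaluation run, it can be worth reading it back and confirming that the three expected columns are present and populated. The sketch below uses only the standard library and an in-memory CSV with made-up rows; the helper name `find_invalid_rows` is hypothetical, and real code would open the exported file instead of a `StringIO` object.

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "context", "ground_truth")


def find_invalid_rows(csv_file, required=REQUIRED_COLUMNS) -> list[int]:
    """Return the 0-based indices of data rows with a missing or empty required field."""
    reader = csv.DictReader(csv_file)
    missing_columns = [c for c in required if c not in (reader.fieldnames or [])]
    if missing_columns:
        raise ValueError(f"CSV is missing columns: {missing_columns}")
    return [
        index
        for index, row in enumerate(reader)
        if any(not (row[c] or "").strip() for c in required)
    ]


# Made-up CSV content; the second data row has an empty ground_truth field.
sample_csv = io.StringIO(
    "question,context,ground_truth\n"
    "How often should tire pressure be checked?,Check tire pressure at least once a month.,At least once a month.\n"
    "What does the bi-level setting do?,Use this setting on cool but sunny days.,\n"
)

invalid_rows = find_invalid_rows(sample_csv)
print(invalid_rows)
```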

