# LAB34: Rag Evaluation & RAGas

En este notebook veremos estos puntos:
- 🧪 Creación de Datos Sintéticos: Entender el proceso y la importancia de generar datos sintéticos para la evaluación RAG.
- 🛠️ Utilizando la Herramienta Ragas: Aprender cómo usar Ragas para una evaluación completa del rendimiento del modelo RAG a través de varias métricas.
- 🔍 Impacto de los Métodos de Recuperación: Explorar cómo diferentes enfoques de recuperación influyen en la efectividad y precisión de los modelos RAG.
- 💡 Aplicación Práctica: Aplicar estos conceptos a través de ejemplos y ejercicios para consolidar la comprensión y las habilidades en la evaluación RAG.


In [1]:
!pip install -qU langchain openai ragas arxiv pymupdf chromadb tiktoken accelerate bitsandbytes datasets sentence_transformers FlagEmbedding ninja  tqdm rank_bm25 transformers

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/163.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m163.8/163.8 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m28.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m680.4/680.4 kB[0m [31m47.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90

In [2]:
!pip install -U flash_attn --no-build-isolation

Collecting flash_attn
  Downloading flash_attn-2.7.4.post1.tar.gz (6.0 MB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/6.0 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.3/6.0 MB[0m [31m9.5 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/6.0 MB[0m [31m37.5 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m6.0/6.0 MB[0m [31m67.6 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.0/6.0 MB[0m [31m54.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: flash_attn
  Building wheel for flash_attn (setup.py) ... [?25l[?25hdone
  Created wheel for flash_attn: filename=flash_attn-2.7.4.post1-cp311-cp311-linux_x86_64.whl size=187831595 sha256=588

## Creamos nuestro RAG Pipeline

Primero crearemos nuestro pipeline de RAG para poder después evaluarlo.

Obtenemos de ArxivLoader algunos artículos que hablan sobre RAG:

- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks
- A Survey on Evaluation of Large Language Models
- An Evaluation on Large Language Model Outputs: Discourse and Memorization
- A Closer Look into Automatic Evaluation Using Large Language Models
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
- RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models
- Benchmarking Large Language Models in Retrieval-Augmented Generation
- Evaluating the Effectiveness of Retrieval-Augmented Large Language Models in Scientific Document Reasoning


In [3]:
from langchain.document_loaders import ArxivLoader
from langchain.document_loaders.merge import MergedDataLoader

papers = ["2310.13800", "2307.03109", "2304.08637", "2310.05657", "2305.13091", "2311.09476", "2308.10633", "2309.01431", "2311.04348"]

docs_to_merge = []

for paper in papers:
    loader = ArxivLoader(query=paper)
    docs_to_merge.append(loader)

all_loaders = MergedDataLoader(loaders=docs_to_merge)
all_docs = all_loaders.load()

In [4]:
for doc in all_docs:
  print(doc.metadata)

{'Published': '2023-10-20', 'Title': 'Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks', 'Authors': 'Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan', 'Summary': "Large Language Models (LLMs) evaluation is a patchy and inconsistent\nlandscape, and it is becoming clear that the quality of automatic evaluation\nmetrics is not keeping up with the pace of development of generative models. We\naim to improve the understanding of current models' performance by providing a\npreliminary and hybrid evaluation on a range of open and closed-source\ngenerative LLMs on three NLP benchmarks: text summarisation, text\nsimplification and grammatical error correction (GEC), using both automatic and\nhuman evaluation. We also explore the potential of the recently released GPT-4\nto act as an evaluator. We find that ChatGPT consistently outperforms many\nother popular models according to human reviewers on the majority of metrics,\nwhile scori

Ahora crearemos la Base de Datos Vectorial con los documentos que hemos recuperado.

Usaremos [ChromaDB](https://www.trychroma.com/) como base de datos y [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) como modelo de embedding.


In [5]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
encode_kwargs = {'normalize_embeddings': True}

hf_bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512,
                                               chunk_overlap = 128,
                                               length_function=len)
docs = text_splitter.split_documents(all_docs)
vectorstore = Chroma.from_documents(docs, hf_bge_embeddings)

  hf_bge_embeddings = HuggingFaceBgeEmbeddings(
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/94.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/779 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/191 [00:00<?, ?B/s]

In [6]:
len(docs)

1728

In [7]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 5})

In [8]:
relevant_docs = base_retriever.get_relevant_documents("What are the challenges in evaluating Retrieval Augmented Generation pipelines?")

  relevant_docs = base_retriever.get_relevant_documents("What are the challenges in evaluating Retrieval Augmented Generation pipelines?")


In [9]:
len(relevant_docs)

5

In [10]:
for doc in relevant_docs:
  print(doc.page_content)
  print('\n')

ARES: An Automated Evaluation Framework for Retrieval-Augmented
Generation Systems
Jon Saad-Falcon
Stanford University ∗
jonsaadfalcon@stanford.edu
Omar Khattab
Stanford University
okhattab@stanford.edu
Christopher Potts
Stanford University
cgpotts@stanford.edu
Matei Zaharia
Databricks and UC Berkeley
matei@databricks.com
Abstract
Evaluating
retrieval-augmented
generation
(RAG) systems traditionally relies on hand
annotations for input queries, passages to re-
trieve, and responses to generate. We intro-


augmented generation in LLMs: noise robustness, nega-
tive rejection, information integration, and counterfactual
robustness. To conduct the evaluation, we built Retrieval-
Augmented Generation Benchmark (RGB). The instances of
RGB are generated from latest news articles and the external
documents obtained from search engines. The experimental
results suggest that current LLMs have limitations in the 4
abilities. This indicates that there is still a significant amount


on downstream

## Preguntas y Respuestas con un modelo preparado para RAG

Ahora, utilizando el pipeline de RAG que hemos creado, y una LLM afinada especialmente para tareas de RAG, crearemos un Q&A.

Usaremos el modelo [llmware/dragon-deci-7b-v0](https://huggingface.co/llmware/dragon-deci-7b-v0). Un modelo de la serie DRAGON afinado especialmente para tareas relacionadas con extracción de información a partir de un contexto.

Los creadores del modelo han creado algunos conjuntos de datos para entrenar modelos en áreas específicas:
- **Dominios Enfocados:** Concentrándose en sectores como los servicios financieros, seguros, legal, cumplimiento y regulación.
- **Análisis de Contexto Cerrado:** Buscando respuestas derivadas de documentos de fuente específicos en lugar de conocimiento general.
- **Preguntas y Respuestas Basadas en Hechos:** Mejorando habilidades en la extracción de clave-valor, Q&A conciso, análisis básico, y resúmenes tanto de forma corta como larga.
- **Habilidades RAG Esenciales:** Construyendo conjuntos de entrenamiento dirigidos para Yes/No Booleano, reconocimiento de "no encontrado", matemáticas y lógica de sentido común, lectura de tablas y preguntas de elección múltiple.
- **Respuestas Claras y Concisas:** Enfocándose en respuestas breves para facilitar el manejo programático, correlación con fuentes de evidencia, riesgo reducido de alucinaciones, y procesamiento de inferencia más rápido.

Es un claro ejemplo de que el afinado de modelos pequeños en dominios y habilidades específicas, les permite rendir de formas similares a los modelos más generales y más grandes. Haciéndolos altamente efectivos y eficientes en costos en flujos de trabajo RAG y automatización relacionada en entornos de nube privada.


In [11]:
from langchain.prompts import ChatPromptTemplate
template = """<human>: Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':
### CONTEXT
{context}
### QUESTION
Question: {question}
\n
<bot>:
"""
prompt = ChatPromptTemplate.from_template(template)

In [12]:
from operator import itemgetter
import torch
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GenerationConfig, pipeline
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained("llmware/dragon-deci-7b-v0",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained("llmware/dragon-deci-7b-v0",
                                          trust_remote_code=True)

generation_config = GenerationConfig(
    max_length=4096,
    temperature=1e-3,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id
)

pipeline = pipeline("text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    max_length=4096,
                    temperature=1e-3,
                    do_sample=True,
                    eos_token_id=tokenizer.eos_token_id,
                    pad_token_id=tokenizer.eos_token_id
                    )

deci_dragon = HuggingFacePipeline(pipeline=pipeline)

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

configuration_decilm.py:   0%|          | 0.00/576 [00:00<?, ?B/s]

version_check.py:   0%|          | 0.00/371 [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/llmware/dragon-deci-7b-v0:
- version_check.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


(…)sformers_v4_35_2__configuration_llama.py:   0%|          | 0.00/9.20k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/llmware/dragon-deci-7b-v0:
- transformers_v4_35_2__configuration_llama.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/llmware/dragon-deci-7b-v0:
- configuration_decilm.py
- version_check.py
- transformers_v4_35_2__configuration_llama.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_decilm.py:   0%|          | 0.00/14.3k [00:00<?, ?B/s]

transformers_v4_35_2__modeling_llama.py:   0%|          | 0.00/56.6k [00:00<?, ?B/s]

(…)ers_v4_35_2__modeling_attn_mask_utils.py:   0%|          | 0.00/10.1k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/llmware/dragon-deci-7b-v0:
- transformers_v4_35_2__modeling_attn_mask_utils.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/llmware/dragon-deci-7b-v0:
- transformers_v4_35_2__modeling_llama.py
- transformers_v4_35_2__modeling_attn_mask_utils.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/llmware/dragon-deci-7b-v0:
- modeling_decilm.py
- transformers_v4_35_2__modeling_llama.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


pytorch_model.bin:   0%|          | 0.00/14.1G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/14.1G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/915 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Device set to use cuda:0
  deci_dragon = HuggingFacePipeline(pipeline=pipeline)


Creamos el pipeline

In [13]:
retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | deci_dragon, "context": itemgetter("context")}
)

Testeamos el modelo

In [14]:
question = "Describe evaluation criteria for retrieval augmented generation pipelines"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result['response'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
  self.gen = func(*args, **kwds)


Human: <human>: Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':
### CONTEXT
[Document(metadata={'Title': 'ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems', 'Summary': 'Evaluating retrieval-augmented generation (RAG) systems traditionally relies\non hand annotations for input queries, passages to retrieve, and responses to\ngenerate. We introduce ARES, an Automated RAG Evaluation System, for evaluating\nRAG systems along the dimensions of context relevance, answer faithfulness, and\nanswer relevance. By creating its own synthetic training data, ARES finetunes\nlightweight LM judges to assess the quality of individual RAG components. To\nmitigate potential prediction errors, ARES utilizes a small set of\nhuman-annotated datapoints for prediction-powered inference (PPI). Across eight\ndifferent knowledge-intensive tasks in KILT, SuperGLUE, and AIS, ARES\naccu

## Creación de un dataset de evaluación

Podemos evaluar el modelo en formato `batch` o en `realtime`. Para hacerlo en formato batch necesitamos un dataset de preguntas y respuestas fiables para luego poder comparar el resultado de nuestro modelo con la respuesta esperada.

Este dataset lo podemos crear nosotros manualmente o... ¡usar un modelo más grande con altas capacidades para generarlo!

Usaremos GPT-3.5 para generar las preguntas y GPT-4 para contestarlas.

Finalmente, nuestro dataset deberá tener:

- **Preguntas:** Estos son los prompts que tu modelo RAG tratará. Asegúrate de que tu conjunto de datos incluya una amplia variedad de preguntas. Esta diversidad prueba la capacidad del modelo para manejar una amplia gama de temas y complejidades de preguntas.
- **Verdades Fundamentales:** Estas son las respuestas correctas a tus preguntas. Las utilizarás como referencia para medir con qué precisión responde tu modelo RAG.
- **Respuestas Predichas:** Estas son las respuestas que genera tu modelo RAG. Tu tarea clave es comparar estas respuestas con las verdades fundamentales para evaluar la precisión del modelo.
- **Contextos:** Estos proporcionan el antecedente o información suplementaria necesaria que tu modelo RAG utiliza para elaborar sus respuestas. Entender cómo tu modelo aprovecha este contexto es vital para evaluar su eficacia al incorporar información externa en sus respuestas.


In [18]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

KeyboardInterrupt: Interrupted by user

In [None]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [None]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)

format_instructions = question_output_parser.get_format_instructions()

In [None]:
question_generation_llm = ChatOpenAI(model="gpt-4.1-mini")

bare_prompt_template = "{content}"

bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)

  question_generation_llm = ChatOpenAI(model="gpt-4o-mini")


In [None]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
  You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.
  question: a question about the context.
  Format the output as JSON with the following keys:
  question
  context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})

output_dict = question_output_parser.parse(response.content)

In [None]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What specific findings did the authors report regarding the performance of ChatGPT compared to other models in human evaluations across the three NLP benchmarks?
context
{'page_content': "Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks Andrea Sottana1 Bin Liang1 Kai Zou1 Zheng Yuan2,1 1NetMind.AI 2Department of Informatics, King’s College London {andrea.sottana, bin.liang, kz}@netmind.ai zheng.yuan@kcl.ac.uk Abstract Large Language Models (LLMs) evaluation is a patchy and inconsistent landscape, and it is becoming clear that the quality of automatic evaluation metrics is not keeping up with the pace of development of generative models. We aim to improve the understanding of current models' performance by providing a preliminary and hybrid evaluation on a range of open and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification and grammatical error correction (GEC), using bot

In [None]:
from tqdm import tqdm
import random

random.seed(42)
qac_triples = []

loop = 5

for text in tqdm(random.sample(docs, loop)):

  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )

  response = question_generation_chain.invoke({"content" : messages})

  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue

  output_dict["context"] = text
  qac_triples.append(output_dict)

100%|██████████| 5/5 [00:50<00:00, 10.08s/it]


In [None]:
for qac in qac_triples:
  print(qac)

{'question': 'What are the key dimensions along which ARES evaluates retrieval-augmented generation (RAG) systems, and how does it mitigate prediction errors during evaluation?', 'context': Document(metadata={'Published': '2024-03-31', 'Title': 'ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems', 'Authors': 'Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei Zaharia', 'Summary': 'Evaluating retrieval-augmented generation (RAG) systems traditionally relies\non hand annotations for input queries, passages to retrieve, and responses to\ngenerate. We introduce ARES, an Automated RAG Evaluation System, for evaluating\nRAG systems along the dimensions of context relevance, answer faithfulness, and\nanswer relevance. By creating its own synthetic training data, ARES finetunes\nlightweight LM judges to assess the quality of individual RAG components. To\nmitigate potential prediction errors, ARES utilizes a small set of\nhuman-annotated datapoints for pred

In [None]:
answer_generation_llm = ChatOpenAI(model="gpt-4o", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
  You are a University Professor creating a test for advanced students. For each question and context, create an answer.
  answer: a answer about the context.
  Format the output as JSON with the following keys:
  answer
  question: {question}
  context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)
messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm
response = answer_generation_chain.invoke({"content" : messages})
output_dict = answer_output_parser.parse(response.content)

In [None]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What are the key dimensions along which ARES evaluates retrieval-augmented generation (RAG) systems, and how does it mitigate prediction errors during evaluation?
context
page_content='Lora Aroyo, Michael Collins, Dipanjan Das, Slav
Petrov, Gaurav Singh Tomar, Iulia Turc, and David
Reitter. 2022. Measuring attribution in natural lan-
guage generation models.
Jon Saad-Falcon, Omar Khattab, Keshav Santhanam,
Radu Florian, Martin Franz, Salim Roukos, Avirup
Sil, Md Arafat Sultan, and Christopher Potts. 2023.
Udapdr: Unsupervised domain adaptation via llm
prompting and distillation of rerankers.
arXiv
preprint arXiv:2303.00807.
David P Sander and Laura Dietz. 2021. Exam: How' metadata={'Published': '2024-03-31', 'Title': 'ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems', 'Authors': 'Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei Zaharia', 'Summary': 'Evaluating retrieval-augmented generation (RAG) systems traditionally relies\non hand an

In [None]:
for triple in tqdm(qac_triples):

  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )

  response = answer_generation_chain.invoke({"content" : messages})

  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue

  triple["answer"] = output_dict["answer"]

100%|██████████| 5/5 [00:36<00:00,  7.21s/it]


In [None]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})

eval_dataset = Dataset.from_pandas(ground_truth_qac_set)

In [None]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 5
})

In [None]:
eval_dataset[0]

{'question': 'What are the key dimensions along which ARES evaluates retrieval-augmented generation (RAG) systems, and how does it mitigate prediction errors during evaluation?',
 'context': 'Lora Aroyo, Michael Collins, Dipanjan Das, Slav\nPetrov, Gaurav Singh Tomar, Iulia Turc, and David\nReitter. 2022. Measuring attribution in natural lan-\nguage generation models.\nJon Saad-Falcon, Omar Khattab, Keshav Santhanam,\nRadu Florian, Martin Franz, Salim Roukos, Avirup\nSil, Md Arafat Sultan, and Christopher Potts. 2023.\nUdapdr: Unsupervised domain adaptation via llm\nprompting and distillation of rerankers.\narXiv\npreprint arXiv:2303.00807.\nDavid P Sander and Laura Dietz. 2021. Exam: How',
 'ground_truth': 'ARES evaluates retrieval-augmented generation (RAG) systems along the dimensions of context relevance, answer faithfulness, and answer relevance. To mitigate prediction errors during evaluation, ARES uses a small set of human-annotated datapoints for prediction-powered inference (P

In [None]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
eval_dataset.push_to_hub("ericrisco/ragas-eval-dataset")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/344 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/ericrisco/ragas-eval-dataset/commit/3569519d68b411894f49ace95e1ce49aaef35002', commit_message='Upload dataset', commit_description='', oid='3569519d68b411894f49ace95e1ce49aaef35002', pr_url=None, pr_revision=None, pr_num=None)

## RAG Evaluation using RAGas

Recordemos las métricas que se evalúan:
- **Relevancia de la Respuesta:** La pertinencia de la respuesta del modelo RAG al prompt dado.
- **Fidelidad:** Si las respuestas son fieles a los hechos proporcionados en el contexto.
- **Precisión del Contexto:** La capacidad del modelo para clasificar la información relevante al principio.
- **Corrección de la Respuesta:** La exactitud de la respuesta en comparación con la verdad objetiva.


In [None]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity
)
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):

  rag_dataset = []

  for row in tqdm(eval_dataset):
    answer = rag_pipeline.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"],
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )

  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)

  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):

  result = evaluate(
    ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        answer_correctness,
        answer_similarity
    ],
  )

  return result

Ahora ya podemos evaluar el dataset. Si quisiéramos evaluar solo un registro, lo haríamos con un solo registro ¿no?


In [None]:
from tqdm import tqdm
import pandas as pd
basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

  self.gen = func(*args, **kwds)
  self.gen = func(*args, **kwds)
  self.gen = func(*args, **kwds)
  self.gen = func(*args, **kwds)
  self.gen = func(*args, **kwds)
100%|██████████| 5/5 [11:54<00:00, 142.91s/it]


In [None]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

ValueError: The metric [context_precision] that is used requires the following additional columns ['reference'] to be present in the dataset.

## Resultados gráficos


In [None]:
import matplotlib.pyplot as plt
def plot_metrics_with_values(metrics_dict, title='RAG Metrics'):
    """
    Plots a bar chart for metrics contained in a dictionary and annotates the values on the bars.
    Args:
    metrics_dict (dict): A dictionary with metric names as keys and values as metric scores.
    title (str): The title of the plot.
    """
    names = list(metrics_dict.keys())
    values = list(metrics_dict.values())
    plt.figure(figsize=(10, 6))
    bars = plt.barh(names, values, color='skyblue')
    # Adding the values on top of the bars
    for bar in bars:
        width = bar.get_width()
        plt.text(width + 0.01,  # x-position
                 bar.get_y() + bar.get_height() / 2,  # y-position
                 f'{width:.4f}',  # value
                 va='center')
    plt.xlabel('Score')
    plt.title(title)
    plt.xlim(0, 1)  # Setting the x-axis limit to be from 0 to 1
    plt.show()

In [None]:
plot_metrics_with_values(basic_qa_result, "Base Retriever ragas Metrics")


- **Context Precision:** Esta métrica evalúa qué tan bien el sistema puede seleccionar información relevante del contexto proporcionado. Un valor alto indica que el sistema es capaz de distinguir y priorizar la información más relevante para la consulta.

- **Faithfulness:** Mide la fidelidad de las respuestas generadas respecto al contexto original. Una puntuación alta significa que la mayoría de la información presente en las respuestas puede ser rastreada de manera fiable al contexto, garantizando que las respuestas son factualmente consistentes.

- **Answer Relevancy:** Esta métrica determina qué tan relevantes son las respuestas a las preguntas formuladas. Valores altos indican que el sistema entiende bien la consulta y proporciona respuestas que se ajustan estrechamente a la necesidad de información del usuario.

- **Context Recall:** Evalúa la capacidad del sistema para recuperar toda la información relevante disponible en el contexto o base de datos para una consulta específica. Una puntuación alta aquí sugeriría que el sistema es muy eficiente en encontrar y utilizar toda la información pertinente.

- **Context Relevancy:** Esta métrica examina si el contexto recuperado y utilizado por el sistema para responder a una consulta es realmente pertinente para la pregunta hecha. Un valor bajo podría indicar que el sistema está recuperando mucha información que, aunque es relevante para el contexto en general, no es útil para la consulta específica.

- **Answer Correctness:** Mide la precisión o corrección de las respuestas dadas. Un valor moderadamente alto indica que una buena parte de las respuestas son correctas, pero también hay espacio para la mejora en la precisión de las respuestas.

- **Answer Similarity:** Esta métrica compara las respuestas generadas con las respuestas esperadas o ideales para ver qué tan cercanas son en términos de contenido y contexto. Un valor alto indica que las respuestas generadas por el sistema se asemejan mucho a las que se desearían o esperarían, mostrando una buena comprensión del problema.
