This notebook was written against Ragas 0.2.13. Earlier or later versions may break parts of it.  

This notebook describes the predefined Ragas metrics, shows how to make simple calls to compute them, analyzes the execution time of one of them, gives an example of how to modify the prompts used to compute one of them, and explains how to define custom metrics.

Ragas ships a set of predefined metrics for evaluating RAG systems:  
* Context Precision
* Context Recall
* Context Entities Recall
* Noise Sensitivity
* Response Relevancy
* Faithfulness
* Multimodal Faithfulness
* Multimodal Relevance  

Some of them require an LLM, others an embeddings model, and some need both.  

A detailed analysis of how these metrics are computed is available at https://pixion.co/blog/ragas-evaluation-in-depth-insights. The computation can also be checked directly in the code, since Ragas is open source.

# Predefined metrics in Ragas  

## Faithfulness  

Given the **question** and the **response**, an LLM is asked to extract a list of _statements_. That list is then passed to an LLM together with the **context**, and the LLM is asked for a verdict (_true_ or _false_) on whether each _statement_ is grounded in the context. The **faithfulness** score is the mean of the verdicts.  
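
The final averaging step can be sketched in plain Python. This only illustrates the arithmetic: the `faithfulness_from_verdicts` helper is hypothetical (it is not part of Ragas), and the verdicts are assumed to have already been returned by the LLM.

```python
from statistics import mean


def faithfulness_from_verdicts(verdicts: list[bool]) -> float:
    """Return the fraction of statements judged as supported by the context."""
    if not verdicts:
        # Assumption: with no statements there is nothing to score.
        return float("nan")
    return mean(1.0 if verdict else 0.0 for verdict in verdicts)


# Two statements: one supported by the context, one not.
score = faithfulness_from_verdicts([True, False])
print("Faithfulness:", score)
```

A response split into two statements, only one of which is supported, therefore scores halfway between fully faithful and fully unfaithful.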

## Context Precision  

**Context Precision** measures how useful a set of **contexts** is for answering a **question**, given a **ground truth**. To compute it, the **question**, the **ground truth** and each of the **contexts** are passed to an LLM (one call per context). The LLM is asked to decide whether the **context** is useful for reaching the correct answer. The verdict is again _true_ or _false_, and it comes with the LLM's reasoning. The verdicts for all **contexts** are then averaged in a rank-aware way (the mean of precision@k over the positions that hold a useful context), and the result is the value of **Context Precision**.  
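
The Ragas documentation describes this aggregation as Context Precision@K: precision@k is computed at every rank k that holds a useful context, and those values are averaged. A minimal sketch of that aggregation follows. The `context_precision` helper is hypothetical, and the verdicts are assumed to be listed in retrieval order.

```python
def context_precision(verdicts: list[bool]) -> float:
    """Average precision over ranked contexts.

    verdicts[k] is True when the context at rank k+1 was judged useful.
    """
    useful_so_far = 0
    weighted_sum = 0.0
    for rank, is_useful in enumerate(verdicts, start=1):
        if is_useful:
            useful_so_far += 1
            # precision@rank, counted only at ranks holding a useful context
            weighted_sum += useful_so_far / rank
    if useful_so_far == 0:
        return 0.0
    return weighted_sum / useful_so_far


# The same two useful contexts score higher when they are ranked first.
print(context_precision([True, True, False]))
print(context_precision([False, True, True]))
```

With this aggregation, a retriever is rewarded for placing useful contexts near the top, not only for retrieving them.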

## Context Recall  

**Context Recall** measures how well a **context** supports a **response** with respect to a **ground truth**. It is the ratio between the number of **ground truth** sentences that can be inferred from the **context** and the total number of **ground truth** sentences. To compute it, the **question**, **context** and **ground truth** are passed to an LLM. The LLM is asked to extract each _statement_ from the **ground truth** and to give a _true_ or _false_ verdict on whether it can be inferred from the **context**, together with the reasoning behind the verdict. **Context Recall** is the mean of those verdicts.  

## Context Entity Recall  

This metric is based on entity extraction. The same _prompt_ is used with an LLM to extract the entities of the **context** and of the **ground truth**. Given the two entity lists, the number of entities common to both is counted and divided by the total number of entities in the **ground truth**.  

This metric can fail: because the entities are extracted in two separate calls, two identical entities may be worded differently and end up being treated as different entities. For this reason **I do not recommend using the Context Entity Recall metric**.
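
The ratio and the failure mode described above can be reproduced with plain sets. This is a sketch under assumptions: `context_entity_recall` is a hypothetical helper, the entities are invented for illustration, and matching is assumed to be an exact string comparison.

```python
def context_entity_recall(context_entities: set[str], reference_entities: set[str]) -> float:
    """Share of ground-truth entities that also appear among the context entities."""
    if not reference_entities:
        return 0.0
    return len(context_entities & reference_entities) / len(reference_entities)


reference = {"Albert Einstein", "Ulm", "Germany"}

# Both extractions word the person identically.
same_wording = {"Albert Einstein", "Ulm", "Germany", "1879"}
# The second extraction abbreviates the name, so exact matching misses it.
other_wording = {"A. Einstein", "Ulm", "Germany", "1879"}

print(context_entity_recall(same_wording, reference))
print(context_entity_recall(other_wording, reference))
```

The second call scores lower even though both contexts mention the same person, which is the instability that motivates the recommendation above.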

## Context Relevancy  

**Context Relevancy** sends the **context** and the **question** to an LLM and asks it to extract the sentences of the **context** that are relevant for answering the **question**. The metric is the ratio between the number of extracted sentences and the total number of sentences.  

This metric is slated for deprecation in favor of **Context Precision**, so **I do not recommend using it**.  

## Answer Relevancy  

To compute this metric, the **response** and the **context** are first passed to an LLM, which is asked to generate a question along with a verdict on whether the **response** is _noncommittal_ (evasive, vague or ambiguous). Next, the **question** and the questions generated by that LLM call are embedded, and their cosine similarity is computed. If there are multiple **responses**, and therefore multiple generated questions, **Answer Relevancy** is the mean of the similarities. Finally, if any of the **responses** is _noncommittal_ the metric is set to 0, since such a response is not adequate; if no **response** is _noncommittal_ the metric keeps its value.  
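
The scoring step can be sketched as follows, assuming the embeddings and the _noncommittal_ verdicts are already available. The `answer_relevancy` helper and the toy two-dimensional vectors are hypothetical and only illustrate the arithmetic.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def answer_relevancy(
    question_emb: list[float],
    generated_question_embs: list[list[float]],
    noncommittal_flags: list[bool],
) -> float:
    """Mean similarity to the original question, zeroed if any response is noncommittal."""
    sims = [cosine_similarity(question_emb, g) for g in generated_question_embs]
    mean_similarity = sum(sims) / len(sims)
    return 0.0 if any(noncommittal_flags) else mean_similarity


question = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]  # one identical direction, one orthogonal

committal = answer_relevancy(question, generated, [False, False])
evasive = answer_relevancy(question, generated, [False, True])
print(committal, evasive)
```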

## Answer Similarity  

This metric is the cosine similarity between the _embeddings_ of the **response** and of the **ground truth**. The similarity naturally lies between -1 and 1, but a threshold can be set so that any score above it becomes 1 and any score below it becomes 0. This metric does not call an LLM; it only needs an _embeddings_ model, so it runs considerably faster than metrics that do call an LLM.  
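
The optional thresholding can be illustrated with a small helper. This is a sketch: `apply_threshold` is a hypothetical name, and treating a score exactly equal to the threshold as a pass is an assumption made here.

```python
from typing import Optional


def apply_threshold(similarity: float, threshold: Optional[float] = None) -> float:
    """Return the raw similarity, or a binary score when a threshold is given."""
    if threshold is None:
        return similarity
    # Assumption: a score equal to the threshold counts as passing.
    return 1.0 if similarity >= threshold else 0.0


print(apply_threshold(0.83))        # raw cosine similarity
print(apply_threshold(0.83, 0.8))   # binarized against a lenient threshold
print(apply_threshold(0.83, 0.9))   # binarized against a stricter threshold
```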

## Answer Correctness  

The metric first passes the **question**, **response** and **ground truth** to an LLM and asks it to extract statements and classify them as TP (_True Positive_, _statements_ present in both the **response** and the **ground truth**), FP (_False Positive_, _statements_ present in the **response** but not in the **ground truth**) and FN (_False Negative_, relevant _statements_ from the **ground truth** that the **response** does not mention). The F1 score is then computed from the number of TP, FP and FN, where |·| denotes the number of statements in each class.  

$$
F1 \text{ Score} = \frac{|\text{TP}|}{|\text{TP}| + 0.5 \times (|\text{FP}| + |\text{FN}|)}
$$

Next, it takes the weighted average of the F1 score and the **Answer Similarity**, with weights given by Bias.  
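
Both steps can be checked with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and the weights `(0.75, 0.25)` are illustrative values chosen for the example, not values confirmed by this notebook.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = |TP| / (|TP| + 0.5 * (|FP| + |FN|)), as in the formula above."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(
    tp: int,
    fp: int,
    fn: int,
    similarity: float,
    weights: tuple[float, float] = (0.75, 0.25),  # illustrative assumption
) -> float:
    """Weighted average of the statement-level F1 and the answer similarity."""
    w_f1, w_sim = weights
    f1 = f1_from_counts(tp, fp, fn)
    return (w_f1 * f1 + w_sim * similarity) / (w_f1 + w_sim)


# 2 statements shared, 1 only in the response, 1 only in the ground truth.
f1 = f1_from_counts(tp=2, fp=1, fn=1)
score = answer_correctness(tp=2, fp=1, fn=1, similarity=0.9)
print(f1, score)
```

Dividing by the sum of the weights keeps the result on the same scale even when the weights do not add up to 1.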

## Aspect Critique  

This metric uses an LLM that receives the **question**, **response** and **context**. The LLM is asked to classify the sample (_true_ or _false_) according to whether it meets a criterion, either predefined or custom. Along with the verdict, the LLM provides its reasoning.

# Configuration  

Next we configure the LLM and the embeddings model we will need.

In [48]:
import os
from dotenv import load_dotenv
import asyncio
from langchain_openai.chat_models import AzureChatOpenAI
from langchain_openai.embeddings import AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample 
from ragas.metrics import Faithfulness
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import ResponseRelevancy
load_dotenv()

azure_llm = AzureChatOpenAI(
    api_version=os.getenv("RAGAS_AZURE_OPENAI_API_VERSION"),
    base_url=os.getenv("RAGAS_AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("RAGAS_AZURE_OPENAI_DEPLOYMENT"),
    model=os.getenv("RAGAS_AZURE_OPENAI_DEPLOYMENT"),
    validate_base_url=False,
    api_key=os.getenv("RAGAS_AZURE_OPENAI_API_KEY"),
)

azure_embeddings = AzureOpenAIEmbeddings(
    openai_api_version=os.getenv("RAGAS_AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("EMBEDDINGS_AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("EMBEDDINGS_AZURE_OPENAI_DEPLOYMENT"),
    model=os.getenv("EMBEDDINGS_AZURE_OPENAI_MODEL"),
    api_key=os.getenv("RAGAS_AZURE_OPENAI_API_KEY"),
)

azure_llm = LangchainLLMWrapper(azure_llm)
azure_embeddings = LangchainEmbeddingsWrapper(azure_embeddings)


# Async function that runs the evaluation (reads the global `sample`)
async def main():
    scorer = Faithfulness(llm=azure_llm)
    score = await scorer.single_turn_ascore(sample)
    print("Faithfulness Score:", score)


async def main_2():
    scorer = ResponseRelevancy(llm=azure_llm, embeddings=azure_embeddings)
    score = await scorer.single_turn_ascore(sample)
    print("Response Relevancy Score:", score)


# Faithfulness (requires only an LLM)

In [29]:
user_input="What is the speed of light in vacuum?"
retrieved_contexts=["The speed of light in vacuum is approximately 299,792,458 meters per second"]

response_array=[
    "The speed of light in vacuum is approximately 299,792,458 meters per second",
    "In vacuum, light travels at about 299.79 million meters per second",
    "The speed of light in vacuum is 250,000,000 meters per second",
    "The speed of light in vacuum is 299,792,458 kilometers per second",
    "Light travels at 299,792,458 meters per second in vacuum, but it can be much faster in certain materials",
    "Scientists recently discovered that the speed of light can exceed 299,792,458 meters per second in certain quantum experiments",
    "The speed of light in vacuum is approximately 299,792,458 meters per second, a fundamental constant in physics known as 'c' in Einstein's equations",
    "The speed of light is 500,000 meters per second, and it was first measured by Isaac Newton",
]

In [30]:
for i in range(len(response_array)):
    sample = SingleTurnSample(
        user_input=user_input,
        retrieved_contexts=retrieved_contexts,
        response=response_array[i]
    )
    print('____________________________________________________________')
    print(response_array[i])
    asyncio.run(main())

____________________________________________________________
The speed of light in vacuum is approximately 299,792,458 meters per second
Faithfulness Score: 1.0
____________________________________________________________
In vacuum, light travels at about 299.79 million meters per second
Faithfulness Score: 1.0
____________________________________________________________
The speed of light in vacuum is 250,000,000 meters per second
Faithfulness Score: 0.0
____________________________________________________________
The speed of light in vacuum is 299,792,458 kilometers per second
Faithfulness Score: 0.0
____________________________________________________________
Light travels at 299,792,458 meters per second in vacuum, but it can be much faster in certain materials
Faithfulness Score: 0.5
____________________________________________________________
Scientists recently discovered that the speed of light can exceed 299,792,458 meters per second in certain quantum experiments
Faithfulnes

In [2]:
user_input="Cuánto cuesta la multisim?"
retrieved_contexts=["La multisim cuesta aproximadamente 5 euros"]

response_array="La multisim cuesta aproximadamente 5,30 euros"

sample = SingleTurnSample(
    user_input=user_input,
    retrieved_contexts=retrieved_contexts,
    response=response_array
)
print('____________________________________________________________')
print(response_array)
asyncio.run(main())

____________________________________________________________
La multisim cuesta aproximadamente 5,30 euros
Faithfulness Score: 0.0


# Response Relevancy (requires an LLM and an embeddings model)

In [49]:
sample = SingleTurnSample(
        user_input="When was Albert Einstein born?",
        response="Albert Einstein was born on March 14, 1879.",
        retrieved_contexts=[
            "Albert Einstein was born on March 14, 1879, in Ulm, Germany."
        ]
    )

asyncio.run(main_2())

Response Relevancy Score: 0.9999999999999991


# Execution time analysis  

Call times vary a lot (the cause still needs to be investigated), so I made 10 calls to the LLM (10 _faithfulness_ computations) and averaged them. The results are:  

* gpt-4-turbo-o (PTUs):  

    t<sub>i</sub> = [2.0546483993530273, 1.4517347812652588, 1.6175198554992676, 1.4341368675231934, 1.4332411289215088, 1.4326183795928955, 1.537599802017212, 1.6392836570739746, 1.6329600811004639, 1.6404914855957031] s  

    t<sub>mean</sub> = 1.5874234437942505 s
  

* gpt-4o-mini:  

    t<sub>i</sub> = [11.061832189559937, 4.377746105194092, 8.399289608001709, 7.4798994064331055, 7.182506084442139, 10.782715082168579, 10.346308708190918, 1.5289039611816406, 8.980695724487305, 12.011392831802368] s  

    t<sub>mean</sub> = 8.215128970146178 s


* Phi 4:

    t<sub>i</sub> = [2.545086622238159, 2.3074698448181152, 2.306349515914917, 2.310863971710205, 2.307391405105591, 2.3078808784484863, 2.3036231994628906, 2.306018352508545, 2.3118233680725098, 2.3261337280273438] s  

    t<sub>mean</sub> = 2.3332640886306764 s  


* Phi 3.5 mini instruct:

    t<sub>i</sub> = [5.998244762420654, 5.708659410476685, 5.777618885040283, 5.827743053436279, 5.928270101547241, 5.527622222900391, 5.957532167434692, 5.757034778594971, 5.778102159500122, 5.799544334411621] s  

    t<sub>mean</sub> = 5.806037187576294 s 


* Ministral 3B:

    t<sub>i</sub> = [0.9343471527099609, 0.6317324638366699, 0.6269583702087402, 0.63470458984375, 0.629509687423706, 0.6346712112426758, 0.6314301490783691, 0.6341307163238525, 0.6263759136199951, 0.6268112659454346] s  

    t<sub>mean</sub> = 0.6610671520233155 s 


* Groundedness:

    t<sub>i</sub> = [0.13205432891845703, 0.10495495796203613, 0.3103790283203125, 0.15050053596496582, 0.14232754707336426, 0.1385951042175293, 0.08972954750061035, 0.12947559356689453, 0.09534430503845215, 0.1014101505279541] s  

    t<sub>mean</sub> = 0.13947710990905762 s 



**With PTUs, the Faithfulness metric takes about 1.6 s to run (mean over 10 calls). That is far from Microsoft's Groundedness, which averaged about 0.14 s with a worst case of 0.31 s. Given that mean execution time, Groundedness looks like a method that does not use any LLM and relies exclusively on an embeddings model.**
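
The measurement code is not included in this notebook. A harness along the following lines would produce lists of per-call times and their mean like the ones above. It is only a sketch: `time_calls` is a hypothetical helper, and `fake_metric_call` is a stand-in that sleeps instead of calling an LLM.

```python
import time
from statistics import mean


def time_calls(fn, repeats: int = 10) -> list[float]:
    """Call fn() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_metric_call() -> None:
    # Stand-in for one faithfulness computation; replace with the real call.
    time.sleep(0.01)


durations = time_calls(fake_metric_call, repeats=3)
print("t_i =", durations)
print("t_mean =", mean(durations))
```

In the notebook, `fake_metric_call` would be replaced by a function that runs one scoring call, for example wrapping `asyncio.run(main())`.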

# Custom Prompts  

## Viewing the prompts used by a Ragas metric

In [3]:
from ragas.metrics import Faithfulness

scorer = Faithfulness(llm=azure_llm)
scorer.get_prompts()

{'n_l_i_statement_prompt': NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context., examples=[(NLIStatementInput(context='John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.', statements=['John is majoring in Biology.', 'John is taking a course on Artificial Intelligence.', 'John is a dedicated student.', 'John has a part-time job.']), NLIStatementOutput(statements=[StatementFaithfulnessAnswer(statement='John is majoring in Biology.', reas

In [4]:
prompts = scorer.get_prompts()
print(prompts["statement_generator_prompt"].to_string())

Given a question, an answer, and sentences from the answer analyze the complexity of each sentence given under 'sentences' and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Format the outputs in JSON.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{'properties': {'statements': {'description': 'The generated statements', 'items': {'type': 'string'}, 'title': 'Statements', 'type': 'array'}}, 'required': ['statements'], 'title': 'StatementGeneratorOutput', 'type': 'object'}Do not use single quotes in your response but double quotes,properly escaped with a backslash.

--------EXAMPLES-----------
Example 1
Input: {
    "question": "Who was Albert Einstein and what is he best known for?",
    "answer": "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was

In [5]:
prompts = scorer.get_prompts()
print(prompts["n_l_i_statement_prompt"].to_string())

Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{'$defs': {'StatementFaithfulnessAnswer': {'properties': {'statement': {'description': 'the original statement, word-by-word', 'title': 'Statement', 'type': 'string'}, 'reason': {'description': 'the reason of the verdict', 'title': 'Reason', 'type': 'string'}, 'verdict': {'description': 'the verdict(0/1) of the faithfulness.', 'title': 'Verdict', 'type': 'integer'}}, 'required': ['statement', 'reason', 'verdict'], 'title': 'StatementFaithfulnessAnswer', 'type': 'object'}}, 'properties': {'statements': {'items': {'$ref': '#/$defs/StatementFaithfulnessAnswer'}, 'title': 'Statements', 'type': 'array'}}, 'requi

## Changing the prompts of a Ragas metric  

Changing the system prompts of a predefined metric is not permanent: once the kernel is shut down or the environment is exited, the change reverts.

### The instructions

In [8]:
prompt=scorer.get_prompts()["statement_generator_prompt"]
prompt.instruction="Given a question, an answer, and sentences from the answer analyze the complexity of each sentence given under 'sentences' and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Then, reverse the meaning of each statement. Format the outputs in JSON."
scorer.set_prompts(**{"statement_generator_prompt": prompt})

prompts=scorer.get_prompts()
print(prompts["statement_generator_prompt"].to_string())

Given a question, an answer, and sentences from the answer analyze the complexity of each sentence given under 'sentences' and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Then, reverse the meaning of each statement. Format the outputs in JSON.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{'properties': {'statements': {'description': 'The generated statements', 'items': {'type': 'string'}, 'title': 'Statements', 'type': 'array'}}, 'required': ['statements'], 'title': 'StatementGeneratorOutput', 'type': 'object'}Do not use single quotes in your response but double quotes,properly escaped with a backslash.

--------EXAMPLES-----------
Example 1
Input: {
    "question": "Who was Albert Einstein and what is he best known for?",
    "answer": "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and mo

### The examples

In [18]:
prompt = scorer.get_prompts()["statement_generator_prompt"]
prompt.examples

[(StatementGeneratorInput(question='Who was Albert Einstein and what is he best known for?', answer='He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.'),
  StatementGeneratorOutput(statements=['Albert Einstein was a German-born theoretical physicist.', 'Albert Einstein is recognized as one of the greatest and most influential physicists of all time.', 'Albert Einstein was best known for developing the theory of relativity.', 'Albert Einstein also made important contributions to the development of the theory of quantum mechanics.']))]

In [21]:
from ragas.metrics._faithfulness import StatementGeneratorInput
from ragas.metrics._faithfulness import StatementGeneratorOutput

new_example = [
    (
        StatementGeneratorInput(
            question="Who was Albert Einstein and what is he best known for?",
            answer="He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics."
        ),
        StatementGeneratorOutput(
            statements=[
                "Albert Einstein was not a German-born theoretical physicist.",
                "Albert Einstein is not recognized as one of the greatest and most influential physicists of all time.",
                "Albert Einstein was not best known for developing the theory of relativity.",
                "Albert Einstein did not make important contributions to the development of the theory of quantum mechanics."
            ]
        )
    )
]

prompt.examples = new_example
scorer.set_prompts(**{"statement_generator_prompt": prompt})
print(scorer.get_prompts()["statement_generator_prompt"].examples)


[(StatementGeneratorInput(question='Who was Albert Einstein and what is he best known for?', answer='He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.'), StatementGeneratorOutput(statements=['Albert Einstein was not a German-born theoretical physicist.', 'Albert Einstein is not recognized as one of the greatest and most influential physicists of all time.', 'Albert Einstein was not best known for developing the theory of relativity.', 'Albert Einstein did not make important contributions to the development of the theory of quantum mechanics.']))]


### A sample implementation

In [None]:
from ragas.metrics._faithfulness import StatementGeneratorInput
from ragas.metrics._faithfulness import StatementGeneratorOutput

async def main():
    scorer = Faithfulness(llm=azure_llm)

    ################################################################# Prompt change
    prompt=scorer.get_prompts()["statement_generator_prompt"]
    prompt.instruction="Given a question, an answer, and sentences from the answer analyze the complexity of each sentence given under 'sentences' and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Then, reverse the meaning of each statement. Format the outputs in JSON."
    scorer.set_prompts(**{"statement_generator_prompt": prompt})



    new_example = [
        (
            StatementGeneratorInput(
                question="Who was Albert Einstein and what is he best known for?",
                answer="He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics."
            ),
            StatementGeneratorOutput(
                statements=[
                    "Albert Einstein was not a German-born theoretical physicist.",
                    "Albert Einstein is not recognized as one of the greatest and most influential physicists of all time.",
                    "Albert Einstein was not best known for developing the theory of relativity.",
                    "Albert Einstein did not make important contributions to the development of the theory of quantum mechanics."
                ]
            )
        )
    ]
    prompt.examples = new_example
    scorer.set_prompts(**{"statement_generator_prompt": prompt})


    prompts=scorer.get_prompts()
    print(prompts["statement_generator_prompt"].to_string())
    #################################################################

    score = await scorer.single_turn_ascore(sample)
    print('____________________________________________________________')
    print("Faithfulness Score:", score)





user_input="Cuánto cuesta la multisim?"
retrieved_contexts=["La multisim cuesta aproximadamente 5 euros"]
response_array="La multisim cuesta aproximadamente 5 euros"

sample = SingleTurnSample(
    user_input=user_input,
    retrieved_contexts=retrieved_contexts,
    response=response_array
)


asyncio.run(main())
print(response_array)

Given a question, an answer, and sentences from the answer analyze the complexity of each sentence given under 'sentences' and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Then, reverse the meaning of each statement. Format the outputs in JSON.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{'properties': {'statements': {'description': 'The generated statements', 'items': {'type': 'string'}, 'title': 'Statements', 'type': 'array'}}, 'required': ['statements'], 'title': 'StatementGeneratorOutput', 'type': 'object'}Do not use single quotes in your response but double quotes,properly escaped with a backslash.

--------EXAMPLES-----------
Example 1
Input: {
    "question": "Who was Albert Einstein and what is he best known for?",
    "answer": "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and mo

# Creating a custom metric  

As an example, we create a Hallucinations metric.

## Aspect Critique/Simple Criteria Scoring  

https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/aspect_critic/

Aspect Critique checks whether a response meets specific criteria through a binary evaluation that is configurable via its strictness (ideal range between 2 and 4; the default appears to be 1, which I have not confirmed).  
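
The Ragas documentation describes strictness as the number of self-consistency checks that are made. Assuming the repeated verdicts are combined by majority vote (an assumption that was not checked against the 0.2.13 source), the aggregation can be sketched as follows. The `majority_verdict` helper is hypothetical.

```python
from collections import Counter


def majority_verdict(verdicts: list[int]) -> int:
    """Combine repeated binary verdicts (0 or 1) into one by majority vote."""
    counts = Counter(verdicts)
    # Assumption: a tie resolves to 0, i.e. the criterion is treated as not met.
    return 1 if counts[1] > counts[0] else 0


# Three repeated judgments of the same sample.
print(majority_verdict([1, 1, 0]))
print(majority_verdict([0, 1, 0]))
```

An odd number of repetitions avoids ties, which is one reason to prefer 3 over 2 or 4 if this assumption holds.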

The predefined Ragas tests are: 
* harmfulness
* maliciousness
* coherence
* correctness
* conciseness

In [None]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="What is the acceleration due to gravity on Earth?",
    response="The acceleration due to gravity on Earth is approximately 9.8 m/s².",
    reference="The standard acceleration due to gravity is 9.8 m/s².",
)

scorer =  AspectCritic(
        name="maliciousness",
        definition="Is the submission intended to harm, deceive, or exploit users?",
        #strictness=2,
        llm=azure_llm,
    )
await scorer.single_turn_ascore(sample)

0

In [45]:
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="What is the acceleration due to gravity on Earth?",
    response="The acceleration due to gravity on Earth is approximately 9.8 m/s² and on Pluto is way lower.",
    reference="The standard acceleration due to gravity is 9.8 m/s².",
)

hallucinations_binary = AspectCritic(
    name="hallucinations_binary",
    definition="Did the model hallucinate or add any information that was not present in the retrieved context?",
    llm=azure_llm,
)

await hallucinations_binary.single_turn_ascore(sample)

1

## Domain Specific Metrics/Rubric based Metrics  

These metrics return non-binary results. Any number of results or scores can be defined, each with its own description.

In [47]:
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    user_input="What is the acceleration due to gravity on Earth?",
    response="The acceleration due to gravity on Earth is approximately 9.8 m/s² and on Pluto is way lower.",
    reference="The standard acceleration due to gravity is 9.8 m/s².",
)

rubric = {
    "score1_description": "There is no hallucination in the response. All the information in the response is present in the retrieved context.",
    "score2_description": "There are no factual statements that are not present in the retrieved context but the response is not fully accurate and lacks important details.",
    "score3_description": "There are many factual statements that are not present in the retrieved context.",
    "score4_description": "The response contains some factual errors and lacks important details.",
    "score5_description": "The model adds new information and statements that contradict the retrieved context.",
}

hallucinations_rubric = RubricsScore(
    name="hallucinations_rubric", llm=azure_llm, rubrics=rubric
)

await hallucinations_rubric.single_turn_ascore(sample)

2