# NVIDIA Metrics

## Answer Accuracy

Answer Accuracy measures how closely a model’s response agrees with a reference ground truth for a given question. Two distinct "LLM-as-a-judge" prompts each return a rating of 0, 2, or 4. The metric converts each rating to a [0, 1] scale and then averages the two results. Higher scores indicate that the model’s answer matches the reference more closely.

- 0 → The response is inaccurate or does not address the same question as the reference.
- 2 → The response partially aligns with the reference.
- 4 → The response exactly aligns with the reference.
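
The conversion described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the Ragas implementation: the function name `combine_judge_ratings` and the validation step are assumptions, and how the library handles an invalid or missing judge rating is not covered here.

```python
def combine_judge_ratings(rating_a: int, rating_b: int, max_rating: int = 4) -> float:
    """Normalize two judge ratings to [0, 1] and return their mean.

    Hypothetical helper for illustration; it mirrors the description
    in the text, not the library's internal code.
    """
    allowed = {0, max_rating // 2, max_rating}  # 0, 2, 4 for Answer Accuracy
    for rating in (rating_a, rating_b):
        if rating not in allowed:
            raise ValueError(f"rating must be one of {sorted(allowed)}, got {rating}")
    # Divide each rating by the maximum, then average the two normalized values
    return (rating_a / max_rating + rating_b / max_rating) / 2


print(combine_judge_ratings(4, 4))  # 1.0  (both judges: exact match)
print(combine_judge_ratings(4, 2))  # 0.75 (one exact, one partial)
print(combine_judge_ratings(0, 2))  # 0.25 (one inaccurate, one partial)
```

Because the two judges can disagree, the final score can land between the three rating levels, which is why values such as 0.75 appear even though no single judge can return them.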

In [1]:
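import os
from dotenv import load_dotenv

# Load variables from a local .env file into the process environment
load_dotenv()

# Fail early with a clear message if the key is missing
# (assigning None to os.environ would raise a TypeError)
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file or environment.")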

In [7]:
from ragas import SingleTurnSample
from ragas.metrics import AnswerAccuracy
from langchain_groq import ChatGroq
from ragas.llms import LangchainLLMWrapper

# Create your sample
sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born in 1879.",
    reference="Albert Einstein was born in 1879."
)

# Initialize Groq LLM and wrap it
groq_llm = ChatGroq(model="llama3-8b-8192")
evaluator_llm = LangchainLLMWrapper(groq_llm)

# Run metric evaluation
# (top-level await works in a notebook; in a plain script, wrap the call in asyncio.run())
scorer = AnswerAccuracy(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)

print(score)

1.0

## Context Relevance

Context Relevance evaluates whether the retrieved_contexts (chunks or passages) are pertinent to the user_input. This is done via two independent "LLM-as-a-judge" prompt calls that each rate the relevance on a scale of 0, 1, or 2. The ratings are then converted to a [0,1] scale and averaged to produce the final score. Higher scores indicate that the contexts are more closely aligned with the user's query.

- 0 → The retrieved contexts are not relevant to the user’s query at all.
- 1 → The contexts are partially relevant.
- 2 → The contexts are completely relevant.
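
To make the averaging step concrete, the sketch below enumerates every combination of two judge ratings on the 0/1/2 scale and the final score each produces. It assumes only what the description states (divide by the maximum rating, then average); the names `RATINGS`, `MAX_RATING`, and `final_score` are illustrative and not part of Ragas.

```python
from itertools import combinations_with_replacement

RATINGS = (0, 1, 2)  # ratings a single judge can return
MAX_RATING = 2       # used to map a rating onto [0, 1]


def final_score(first: int, second: int) -> float:
    """Average of two ratings after normalizing each to [0, 1]."""
    return (first / MAX_RATING + second / MAX_RATING) / 2


# Order of the judges does not matter for the mean, so unordered pairs suffice
score_table = {
    pair: final_score(*pair)
    for pair in combinations_with_replacement(RATINGS, 2)
}

for pair, value in score_table.items():
    print(pair, value)
# (0, 0) 0.0
# (0, 1) 0.25
# (0, 2) 0.5
# (1, 1) 0.5
# (1, 2) 0.75
# (2, 2) 1.0
```

Under these assumptions the metric can take only five distinct values, so a score of 0.5 may mean either two "partially relevant" ratings or one "not relevant" and one "completely relevant".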

In [8]:
from ragas.metrics import ContextRelevance

sample = SingleTurnSample(
    user_input="When and where was Albert Einstein born?",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
    ]
)

scorer1 = ContextRelevance(llm=evaluator_llm)
score1 = await scorer1.single_turn_ascore(sample)
print(score1)

1.0


## Response Groundedness

Response Groundedness measures how well a response is supported or "grounded" by the retrieved contexts. It assesses whether each claim in the response can be found, either wholly or partially, in the provided contexts.

- 0 → The response is not grounded in the context at all.
- 1 → The response is partially grounded.
- 2 → The response is fully grounded (every statement can be found in or inferred from the retrieved contexts).

In [11]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ResponseGroundedness

sample = SingleTurnSample(
    response="Albert Einstein was born in 1879.",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
    ]
)

scorer2 = ResponseGroundedness(llm=evaluator_llm)
score2 = await scorer2.single_turn_ascore(sample)
print(score2)

1.0
