# Notebook 3: Evaluation with Ragas


Using a strong LLM as a judge for reference-free evaluation is an emerging approach that has shown a lot of promise. LLM-based scores correlate better with human judgment than traditional metrics and require less human annotation. Papers such as G-Eval have reported promising results with this approach, but it has certain shortcomings too.

LLMs tend to prefer their own outputs, and when they are asked to compare several outputs, the relative position of those outputs influences the verdict. LLMs can also be biased toward particular values when asked to score within a given range, and they tend to prefer longer responses.

[Ragas](https://docs.ragas.io/en/latest/) aims to work around these limitations of using LLMs to evaluate your QA pipelines, while providing actionable metrics that need as little annotated data as possible and are cheaper and faster to compute.

In this notebook, we will use the Llama 3 70B LLM from NVIDIA AI Playground as the judge and evaluation model. **NVIDIA AI Playground** on NGC allows developers to experience state-of-the-art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT and Triton Inference Server. Developers get **free credits for 10K requests** to any of the available models. The sign-up process is easy; follow the instructions [here](../docs/rag/aiplayground.md).

### Step 1: Set NVIDIA AI Playground API key

In [None]:
import os
os.environ['NVIDIA_API_KEY'] = "nvapi-*"
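
Hard-coding a key in a notebook cell makes it easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that reads the key from the environment and prompts for it only when it is missing. The helper name `load_nvidia_api_key` is hypothetical (it is not part of any NVIDIA or LangChain package), and the `nvapi-` prefix check is an assumption based on the placeholder above.

```python
import getpass
import os


def load_nvidia_api_key(env_var="NVIDIA_API_KEY", prompt_fn=getpass.getpass):
    """Return the API key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; not part of any library.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # prompt_fn hides the typed value when it is getpass.getpass.
        key = prompt_fn(f"Enter {env_var}: ").strip()
    # Assumption: valid keys start with "nvapi-", as in the placeholder above.
    if not key.startswith("nvapi-"):
        raise ValueError(f"{env_var} does not look like an NVIDIA API key")
    # Store the cleaned key so later cells can read it from the environment.
    os.environ[env_var] = key
    return key
```

Calling `load_nvidia_api_key()` once before creating the `ChatNVIDIA` client would leave the key in `os.environ['NVIDIA_API_KEY']`, the same variable the cell above sets by hand.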

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
llm = ChatNVIDIA(
    model="meta/llama3-70b-instruct",
    temperature=0.2,
    max_tokens=300,
)
embeddings = NVIDIAEmbeddings(model="ai-embed-qa-4", model_type="passage")

### Bring your own LLMs
Ragas uses LangChain under the hood to connect to LLMs for the metrics that require them. This means you can swap out the default LLM (gpt-3.5) for Llama 3 70B from the API catalog.

In [None]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
nvpl_llm = LangchainLLMWrapper(langchain_llm=llm)
nvpl_embeddings = LangchainEmbeddingsWrapper(embeddings)

### Step 2: Import Eval Data and Reformat It

In [None]:
import json
with open('eval.json', 'r') as file:
    json_data = json.load(file)
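
Before reformatting, it can help to confirm that every record has the fields the next cell reads. The following is a minimal sketch under the assumption that `eval.json` holds a list of objects with the keys `question`, `answer`, `contexts`, and `gt_answer` (the keys used below), and that `contexts` is a list of strings. The helper `find_invalid_entries` and the toy records are hypothetical and only illustrate the check.

```python
REQUIRED_KEYS = ("question", "answer", "contexts", "gt_answer")


def find_invalid_entries(entries):
    """Return (index, problem) pairs for records that would break the reformat step."""
    problems = []
    for i, entry in enumerate(entries):
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            problems.append((i, f"missing keys: {missing}"))
            continue
        contexts = entry["contexts"]
        # Assumption: each record carries its retrieved chunks as a list of strings.
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append((i, "contexts must be a list of strings"))
    return problems


# Toy records, invented for illustration only.
sample = [
    {"question": "q1", "answer": "a1", "contexts": ["c1"], "gt_answer": "g1"},
    {"question": "q2", "answer": "a2", "contexts": "c2"},
]
print(find_invalid_entries(sample))
# prints [(1, "missing keys: ['gt_answer']")]
```

Running `find_invalid_entries(json_data)` and getting an empty list would suggest the file matches the layout assumed here.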

In [None]:
# Collect each field of the eval records into parallel lists.
eval_questions = []
eval_answers = []
ground_truths = []
vdb_contexts = []
for entry in json_data:
    eval_questions.append(entry["question"])
    eval_answers.append(entry["answer"])
    vdb_contexts.append(entry["contexts"])
    # Each ground truth is wrapped in a list, one list per question.
    ground_truths.append([entry["gt_answer"]])

In [None]:
data_samples = {
    'question': eval_questions,
    'answer': eval_answers,
    'contexts' : vdb_contexts,
    'ground_truths': ground_truths
}

In [None]:
from ragas import evaluate
from datasets import Dataset

dataset = Dataset.from_dict(data_samples)

In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
evaluate(dataset, llm=nvpl_llm, embeddings=nvpl_embeddings, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])

### Step 3: View and Interpret Results

A Ragas score comprises the following components:
![ragas](imgs/ragas.png)

#### Metrics explained 
1. **Faithfulness**: measures the factual accuracy of the generated answer with respect to the context provided. This is done in two steps. First, given a question and generated answer, Ragas uses an LLM to extract the statements that the generated answer makes. This gives a list of statements whose validity we have to check. Second, given the list of statements and the context returned, Ragas uses an LLM to check whether each statement is supported by the context. The number of supported statements is divided by the total number of statements in the generated answer to obtain the score for a given example.
   
2. **Answer Relevancy**: measures how relevant and to the point the answer is to the question. For a given generated answer, Ragas uses an LLM to generate the probable questions that the answer would be a response to, and computes their similarity to the actual question asked.
   
3. **Context Precision**: measures how precisely the retrieved context provides information relevant to generating the answer. Given a question, answer, and retrieved context, Ragas calls an LLM to check sentences from the ground truth answer against the retrieved context. The score is the ratio of relevant sentences in the retrieved context to the total number of sentences in the ground truth answer.

4. **Context Recall**: measures the ability of the retriever to retrieve all the information needed to answer the question. Ragas calculates this by taking the provided ground_truth answer and using an LLM to check whether each statement from it can be found in the retrieved context. If a statement is not found, the retriever was not able to retrieve the information needed to support it.
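
The arithmetic behind these descriptions can be illustrated with a small sketch. It is not the Ragas implementation: the LLM verdicts and embedding vectors below are invented, the helper names are hypothetical, and the use of mean cosine similarity for answer relevancy is an assumption about how "similarity" is aggregated.

```python
import math


def supported_ratio(verdicts):
    """Fraction of statements judged supported; `verdicts` is a list of booleans."""
    if not verdicts:
        # Assumption: score an example with no statements as 0.0.
        return 0.0
    return sum(verdicts) / len(verdicts)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Faithfulness: 3 of 4 statements in the answer are supported by the context.
faithfulness_score = supported_ratio([True, True, False, True])  # 0.75

# Context recall: 1 of 2 ground truth statements is found in the retrieved context.
context_recall_score = supported_ratio([True, False])  # 0.5

# Answer relevancy: compare the real question with questions generated from the answer.
question_vec = [1.0, 0.0]                   # made-up embedding of the actual question
generated_vecs = [[1.0, 0.0], [0.0, 1.0]]   # made-up embeddings of generated questions
answer_relevancy_score = sum(
    cosine_similarity(question_vec, g) for g in generated_vecs
) / len(generated_vecs)  # (1.0 + 0.0) / 2 = 0.5

print(faithfulness_score, context_recall_score, answer_relevancy_score)
```

In the real pipeline, the verdicts come from the judge LLM and the vectors from the embedding model configured earlier; only the final ratios are shown here.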
