# Notebook 3: Evaluation with Ragas


Using a strong LLM for reference-free evaluation is an emerging approach that has shown a lot of promise. LLM-based evaluators correlate better with human judgment than traditional metrics and require less human annotation. Papers such as G-Eval have experimented with this approach and reported promising results, but it has certain shortcomings too.

LLMs tend to prefer their own outputs, and when they are asked to compare different outputs, the relative position of those outputs can sway the verdict. LLMs can also be biased toward particular values when asked to score within a range, and they tend to prefer longer responses.

[Ragas](https://docs.ragas.io/en/latest/) aims to work around these limitations of using LLMs to evaluate your QA pipelines, while providing actionable metrics that need as little annotated data as possible and are cheaper and faster to compute.

In this notebook, we will use a Llama 70B LLM from NVIDIA AI Playground as the judge and evaluation model. **NVIDIA AI Playground** on NGC allows developers to experience state-of-the-art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT and Triton Inference Server. Developers get **free credits for 10K requests** to any of the available models. The sign-up process is easy: follow the instructions [here](../docs/rag/aiplayground.md).

### Step 1: Set NVIDIA AI Playground API key

In [None]:
!pip install ragas --upgrade

In [None]:
import os
os.environ["OPENAI_API_KEY"] = ""
import sys
from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.evaluators import RetrievalEvaluator, GenerationEvaluator
from datasets import Dataset
import json
import pandas as pd
from continuous_eval.metrics import PrecisionRecallF1, RankedRetrievalMetrics, DeterministicAnswerCorrectness, DeterministicFaithfulness, BertAnswerRelevance, BertAnswerSimilarity, DebertaAnswerScores

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
os.environ['NVIDIA_API_KEY'] = ""
os.environ['NVAPI_KEY'] = ""
# make sure to export your NVIDIA AI Playground key as NVIDIA_API_KEY!
llm = ChatNVIDIA(model="playground_steerlm_llama_70b")
nv_embedder = NVIDIAEmbeddings(model="nvolveqa_40k")
nv_document_embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="passage")
nv_query_embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="query")

model_name = "intfloat/e5-large-v2"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": True}
e5_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

from langchain.embeddings import HuggingFaceBgeEmbeddings
model_name = "facebook/dragon-plus-context-encoder"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
dragon_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages: "
)
dragon_embeddings.query_instruction = "Represent this sentence for searching relevant passages: "

In [None]:
print(llm.available_models)

### Bring your own LLMs
Ragas uses LangChain under the hood to connect to LLMs for the metrics that require them. This means you can swap out the default LLM (gpt-3.5) for the Llama 2 70B model from AI Playground.

In [None]:
from ragas.llms import LangchainLLM
nvpl_llm = LangchainLLM(llm=llm)

In [None]:
# for i in data_samples['retrieved_contexts']:
#     print((i))

### Step 2: Import Eval Data and Reformat It.
#### Let's start with E5 evals

In [None]:
evals_file = '../synthetic_data_openai.json'
with open(evals_file, 'r') as file:
    json_data_syn_e5 = json.load(file)
print(json_data_syn_e5[10])
eval_questions = []
eval_answers = []
ground_truths = []
vdb_contexts = []
gt_contexts = []
counter = 0
for entry in json_data_syn_e5:
    entry["contexts"] = [str(r) for r in entry["contexts"]]
    entry["gt_answer"] = str(entry["gt_answer"])
    eval_questions.append(entry["question"])
    eval_answers.append(entry["answer"])
    vdb_contexts.append(entry["contexts"][0:3])
    ground_truths.append([entry["gt_answer"]])
    gt_contexts.append([entry["gt_context"]])

data_samples = {
    'question': eval_questions,
    'answer': eval_answers,
    'retrieved_contexts' : vdb_contexts,
    'ground_truths': ground_truths,
    'ground_truth_contexts':gt_contexts
}

dataset_syn_E5 = pd.DataFrame.from_dict(data_samples) # Dataset.from_dict(data_samples)
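
The reformatting in this step can be sketched as a small standalone helper. This is a minimal sketch rather than part of the notebook: `to_columns` and the sample record are hypothetical, while the field names (`question`, `answer`, `contexts`, `gt_answer`, `gt_context`) and the top-3 truncation are taken from the loop above.

```python
from typing import Any, Dict, List


def to_columns(records: List[Dict[str, Any]], top_k: int = 3) -> Dict[str, List[Any]]:
    """Turn per-question records into column-oriented lists.

    Hypothetical helper mirroring the notebook loop: contexts are cast to
    strings and truncated to the first `top_k`, and the ground-truth answer
    and context are each wrapped in a single-element list.
    """
    columns: Dict[str, List[Any]] = {
        "question": [],
        "answer": [],
        "retrieved_contexts": [],
        "ground_truths": [],
        "ground_truth_contexts": [],
    }
    for entry in records:
        contexts = [str(c) for c in entry["contexts"]]
        columns["question"].append(entry["question"])
        columns["answer"].append(entry["answer"])
        columns["retrieved_contexts"].append(contexts[:top_k])
        columns["ground_truths"].append([str(entry["gt_answer"])])
        columns["ground_truth_contexts"].append([entry["gt_context"]])
    return columns


# Made-up record, for illustration only.
sample = [{
    "question": "example question",
    "answer": "example generated answer",
    "contexts": ["c1", "c2", "c3", "c4"],
    "gt_answer": "example ground-truth answer",
    "gt_context": "c2",
}]
columns = to_columns(sample)
print(columns["retrieved_contexts"])  # only the first three contexts are kept
```

Keeping the conversion in one function makes it easier to reuse the same records for both the continuous-eval DataFrame and the Ragas `Dataset`, which expect slightly different column names.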

In [None]:
# Use the continuous-eval library to generate retriever/generator metrics
evaluator = RetrievalEvaluator(
    dataset=dataset_syn_E5,
    metrics=[
        PrecisionRecallF1(),
        RankedRetrievalMetrics(),
    ],
)
# Run the eval!
r_results = evaluator.run(k=3,batch_size=50)
# Peeking at the results
print(evaluator.aggregated_results)
# Saving the results for future use
# evaluator.save("retrieval_evaluator_results.jsonl")

evaluator = GenerationEvaluator(
    dataset=dataset_syn_E5,
    metrics=[
        DeterministicAnswerCorrectness(),
        DeterministicFaithfulness(),
        # DebertaAnswerScores()
    ],
)
# Run the eval!
g_results = evaluator.run(batch_size=50)
# Peeking at the results
print(evaluator.aggregated_results)

In [None]:
# convert results to dataframe
g_results[0]
df_g_results = pd.DataFrame(g_results)
df_g_results

### Step 3: View and Interpret Results with RAGAS

A Ragas score comprises the following:
![ragas](imgs/ragas.png)

#### Metrics explained
1. **Faithfulness**: measures the factual accuracy of the generated answer against the provided context. This is done in two steps. First, given a question and the generated answer, Ragas uses an LLM to extract the statements that the generated answer makes. This gives a list of statements whose validity we have to check. Second, given the list of statements and the returned context, Ragas uses an LLM to check whether each statement is supported by the context. The number of supported statements is divided by the total number of statements in the generated answer to obtain the score for a given example.

2. **Answer Relevancy**: measures how relevant and to the point the answer is to the question. For a given generated answer, Ragas uses an LLM to generate the probable questions that the answer would respond to, and computes their similarity to the question actually asked.

3. **Context Relevancy**: measures the signal-to-noise ratio in the retrieved contexts. Given a question, Ragas calls an LLM to identify the sentences in the retrieved context that are needed to answer the question. The ratio of required sentences to total sentences in the context gives you the score.

4. **Context Recall**: measures the ability of the retriever to retrieve all the information needed to answer the question. Ragas calculates this by taking the provided ground_truth answer and using an LLM to check whether each statement in it can be found in the retrieved context. If a statement is not found, the retriever was not able to retrieve the information needed to support that statement.
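
The arithmetic behind these four metrics can be sketched in plain Python. This is a simplified illustration, not the Ragas implementation: in Ragas the per-statement and per-sentence verdicts come from LLM calls and the vectors come from an embedding model, whereas here they are hard-coded. All function names are hypothetical, and using cosine similarity averaged over the generated questions is an assumption about how "similarity" is computed.

```python
import math
from typing import Sequence


def ratio(verdicts: Sequence[bool]) -> float:
    """Fraction of True verdicts; returns 0.0 for an empty list (an assumption)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def faithfulness_score(statement_supported: Sequence[bool]) -> float:
    # One verdict per statement in the answer: is it supported by the context?
    return ratio(statement_supported)


def context_relevancy_score(sentence_needed: Sequence[bool]) -> float:
    # One verdict per context sentence: is it needed to answer the question?
    return ratio(sentence_needed)


def context_recall_score(gt_statement_found: Sequence[bool]) -> float:
    # One verdict per ground-truth statement: is it found in the retrieved context?
    return ratio(gt_statement_found)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec: Sequence[float],
                           generated_question_vecs: Sequence[Sequence[float]]) -> float:
    # Mean similarity between the real question and the LLM-generated questions.
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Hard-coded verdicts and toy 2-d vectors, for illustration only.
print(faithfulness_score([True, True, False, True]))             # 3 of 4 supported
print(context_relevancy_score([True, False, False, False, True]))  # 2 of 5 needed
print(context_recall_score([True, False]))                        # 1 of 2 found
print(answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

All four scores fall between 0 and 1 for these inputs, which is why the plots later in this notebook fix the score axis to that range.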

In [None]:
with open(evals_file, 'r') as file:
    json_data = json.load(file)
eval_questions = []
eval_answers = []
ground_truths = []
vdb_contexts = []
# gt_contexts = []
counter = 0
for entry in json_data:
    # print(type([entry["gt_context"]]))
    entry["contexts"] = [str(r) for r in entry["contexts"]]
    entry["gt_answer"] = str(entry["gt_answer"])
    eval_questions.append(entry["question"])
    eval_answers.append(entry["answer"])
    vdb_contexts.append(entry["contexts"][0:3])
    ground_truths.append([entry["gt_answer"]])
    # gt_contexts.append([entry["gt_context"]])

data_samples = {
    'question': eval_questions,
    'answer': eval_answers,
    'contexts' : vdb_contexts,
    'ground_truths': ground_truths,
    # 'ground_truth_contexts':gt_contexts
}

dataset_syn_e5 = Dataset.from_dict(data_samples)
dataset_syn_e5[57]

In [None]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
    ContextRecall,
)
from ragas.metrics import Faithfulness
from ragas.metrics import AnswerCorrectness
from ragas.metrics import ContextRelevancy
from ragas.metrics import AnswerRelevancy

context_relevancy = ContextRelevancy(llm=nvpl_llm)

faithfulness = Faithfulness(
    batch_size = 15
)
faithfulness.llm = nvpl_llm

answer_correctness = AnswerCorrectness(llm=nvpl_llm,
    weights=[0.4,0.6]
)
answer_correctness.llm = nvpl_llm

context_recall = ContextRecall(llm=nvpl_llm)
context_recall.llm = nvpl_llm

context_precision.llm = nvpl_llm


# answer_correctness.embeddings = nv_query_embedder
## using NVIDIA embedding
from ragas.metrics import AnswerSimilarity

answer_similarity = AnswerSimilarity(llm=nvpl_llm, embeddings=nv_query_embedder)
answer_relevancy = AnswerRelevancy(embeddings=nv_query_embedder,llm=nvpl_llm) #embeddings=nv_query_embedder,
# init_model to load models used
answer_relevancy.init_model()
answer_correctness.init_model()

In [None]:
from ragas import evaluate
# openai.api_key = os.getenv(INSERT YOUR OPENAI_API_KEY)
results_1 = evaluate(dataset_syn_e5,metrics=[faithfulness,answer_similarity,context_precision,answer_relevancy,context_relevancy])
# results_2 = evaluate(dataset_syn_e5,metrics=[answer_relevancy])
# results_3 = evaluate(dataset_syn_e5,metrics=[context_precision])
# results_4 = evaluate(dataset_syn_e5,metrics=[context_relevancy])
results_5 = evaluate(dataset_syn_e5,metrics=[context_recall])
# results_6 = evaluate(dataset_syn_e5,metrics=[answer_correctness])


In [None]:
df2 = results_5.to_pandas()
df2
results_5['context_recall']=df2['context_recall'].mean(skipna=True)
print(results_5)
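
The `skipna=True` mean above averages only the rows that actually received a score. A minimal standard-library sketch of the same idea follows. The scores are made up, `nan_mean` is a hypothetical helper, and the idea that unscored rows appear as NaN is an assumption rather than something this notebook states.

```python
import math
from typing import Sequence


def nan_mean(values: Sequence[float]) -> float:
    """Mean over the non-NaN entries; NaN if nothing is left to average."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Made-up per-row scores; the third row stands in for a row with no score.
scores = [1.0, 0.5, float("nan"), 0.0]

naive = sum(scores) / len(scores)   # a single NaN poisons the plain mean
robust = nan_mean(scores)           # averages the three valid rows only

print(naive, robust)
```

Note that skipping NaN rows changes the denominator, so the aggregate describes only the rows that were scored; it is worth checking how many rows were dropped before comparing runs.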

In [None]:
df = results_1.to_pandas()
df
# results
df_merge = pd.concat([df, df2['context_recall'], df_g_results], axis = 1)
results_1.update(evaluator.aggregated_results)
results_1.update(results_5)
del results_1['rouge_faithfulness']
del results_1['token_overlap_faithfulness']
del results_1['bleu_faithfulness']

In [None]:
df_merge['ground_truths'] = df_merge['ground_truths'].apply(', '.join)
df_merge['contexts'] = df_merge['contexts'].apply(', '.join)
df_merge = df_merge.drop(columns=['rouge_faithfulness', 'token_overlap_faithfulness','bleu_faithfulness','rouge_p_by_sentence','token_overlap_p_by_sentence','blue_score_by_sentence'])
df_merge.to_pickle('metrics_df.pkl')
df_merge

#### Plot base evals with the above Metrics

In [None]:
# @title
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
plt.rcParams["backend"]='agg'
def plot_metrics_with_values(metrics_dict, title='RAG Metrics',figsize=(10, 6),name='save.png'):
    """
    Plots a bar chart for metrics contained in a dictionary and annotates the values on the bars.
    Args:
    metrics_dict (dict): A dictionary with metric names as keys and values as metric scores.
    title (str): The title of the plot.
    """
    names = list(metrics_dict.keys())
    values = list(metrics_dict.values())
    plt.figure(figsize=figsize)
    bars = plt.barh(names, values, color='green')
    # Annotate each bar with its value, just inside the bar's right edge
    for bar in bars:
        width = bar.get_width()
        plt.text(width - 0.2,  # x-position
                 bar.get_y() + bar.get_height() / 2,  # y-position
                 f'{width:.4f}',  # value
                 va='center')
    plt.xlabel('Score')
    plt.title(title)
    plt.xlim(0, 1)  # Setting the x-axis limit to be from 0 to 1
    # plt.show()
    plt.savefig(name, bbox_inches='tight', dpi=150)
    # bars = plt.bar(names, values, color='skyblue')
    # # Adding the values on top of the bars
    # for bar in bars:
    #     width = bar.get_height()
    #     # plt.text(width + 0.01,  # x-position
    #     #          bar.get_x() + bar.get_width() / 2,  # y-position
    #     #          f'{width:.4f}',  # value
    #     #          va='center')
    # plt.ylabel('Score')
    # plt.title(title)
    # plt.xticks(rotation=30)
    # plt.ylim(0, 1)  # Setting the x-axis limit to be from 0 to 1
    # plt.show()

In [None]:
generator_metrics = ['faithfulness','answer_relevancy']
retriever_metrics = ['context_precision','context_relevancy','context_recall']
endtoend = ['answer_similarity','token_overlap_f1','token_overlap_precision','token_overlap_recall','rouge_l_f1','rouge_l_precision','rouge_l_recall','bleu_score']
results_generator = {}
results_retriever = {}
results_endtoend = {}
for i in generator_metrics:
    results_generator[i] = results_1[i]
for i in retriever_metrics:
    results_retriever[i] = results_1[i]
for i in endtoend:
    results_endtoend[i] = results_1[i]
print(results_generator)
print(results_retriever)
print(results_endtoend)
plot_metrics_with_values(results_generator, "Base Generator Metrics",figsize=(6, 3),name='gen.jpg')
plot_metrics_with_values(results_retriever, "Base Retriever Metrics",figsize=(6, 3),name='ret.jpg')
plot_metrics_with_values(results_endtoend, "Base End-to-End Metrics",figsize=(10, 6),name='end.jpg')
