# **Evaluating RAG Applications with Ragas**

This notebook demonstrates how to evaluate Retrieval-Augmented Generation (RAG) applications using the Ragas framework. We will install the necessary libraries, test each metric individually, and evaluate all metrics on a sample dataset.

## Table of Contents
- Install Libraries
- Setup
- Test Metrics Individually
- Evaluate Full Dataset
- Notes

## Install Libraries

In [None]:
%pip install -r requirements.txt

## Setup

In this example, we’ll use:

🔹 Groq as the LLM — fast and free with an API key

🔹 Hugging Face for embeddings — lightweight and open-source

Ragas uses both to compute its scores: the LLM acts as the judge for the LLM-based metrics, and the embeddings measure semantic similarity, which Answer Relevancy relies on.

In [None]:
from langchain_groq import ChatGroq
from ragas.llms import LangchainLLMWrapper
import os

os.environ["GROQ_API_KEY"] = "gsk_***"

llm_groq = ChatGroq(
    model="llama-3.1-8b-instant",
)

llm = LangchainLLMWrapper(llm_groq)
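
Hardcoding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead is shown below. `require_env` is a hypothetical helper written for this example; it is not part of Groq, LangChain, or Ragas.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a non-empty environment variable or fail with a clear message.

    Hypothetical helper for this notebook, not a library function.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return value
```

Calling `require_env("GROQ_API_KEY")` before constructing `ChatGroq` would surface a missing key early instead of at the first metric call.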

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

EMBEDDING_MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
embeddings_hf = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    # Requires a CUDA-capable GPU; switch to "cpu" if none is available.
    model_kwargs={"device": "cuda"},
)

embeddings = LangchainEmbeddingsWrapper(embeddings_hf)
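
The `"cuda"` device setting fails on machines without a GPU. One way to make the cell portable is to pick the device at runtime, sketched here under the assumption that PyTorch is the backend behind the embedding model. `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    Hypothetical helper: falls back to "cpu" if PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The result could then be passed as `model_kwargs={"device": pick_device()}`.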

## Test Metrics Individually

### Context Precision

#### LLM-Based Context Precision

In [None]:
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

context_precision = LLMContextPrecisionWithoutReference(llm=llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

await context_precision.single_turn_ascore(sample)

#### Non-LLM-Based Context Precision

In [None]:
from ragas import SingleTurnSample
from ragas.metrics import NonLLMContextPrecisionWithReference

context_precision = NonLLMContextPrecisionWithReference()

sample = SingleTurnSample(
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
    reference_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is one of the most famous landmarks in Paris.",
    ],
)

await context_precision.single_turn_ascore(sample)
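
For intuition, Context Precision can be read as the mean of precision@k taken over the ranks k that hold a relevant chunk. The sketch below reproduces that arithmetic in plain Python, assuming the per-chunk relevance verdicts (which Ragas derives from the LLM or from string comparison) are already known. `context_precision_at_k` is a hypothetical helper for illustration, not a Ragas function, and the library's own implementation may differ in details.

```python
from typing import Sequence


def context_precision_at_k(verdicts: Sequence[int]) -> float:
    """Mean of precision@k over the ranks that hold a relevant chunk.

    verdicts: 1 (relevant) or 0 (not relevant) per retrieved chunk, in rank order.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for k, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_so_far += 1
            # precision@k counts only at ranks where the chunk is relevant
            weighted_sum += relevant_so_far / k
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0
```

Because each term is divided by its rank, relevant chunks that appear early in the retrieved list raise the score more than the same chunks appearing late.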

### Context Recall

#### LLM-Based Context Recall

In [None]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["Paris is the capital of France."], 
)

context_recall = LLMContextRecall(llm=llm)
await context_recall.single_turn_ascore(sample)

#### Non-LLM-Based Context Recall

In [None]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextRecall

sample = SingleTurnSample(
    retrieved_contexts=["Paris is the capital of France."],
    reference_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is one of the most famous landmarks in Paris.",
    ],
)

context_recall = NonLLMContextRecall()
await context_recall.single_turn_ascore(sample)
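
To show the idea behind the non-LLM variant, the sketch below counts a reference context as recalled when some retrieved context is sufficiently similar to it. This is a simplified stand-in, not Ragas's implementation: it uses `difflib.SequenceMatcher` from the standard library as the similarity measure, and the `0.5` threshold and the name `simple_context_recall` are assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Sequence


def simple_context_recall(
    retrieved: Sequence[str],
    references: Sequence[str],
    threshold: float = 0.5,  # assumed cut-off, chosen for illustration
) -> float:
    """Fraction of reference contexts matched by at least one retrieved context."""
    if not references:
        return 0.0
    recalled = 0
    for ref in references:
        # Best similarity between this reference and any retrieved chunk
        best = max(
            (SequenceMatcher(None, ref, chunk).ratio() for chunk in retrieved),
            default=0.0,
        )
        if best > threshold:
            recalled += 1
    return recalled / len(references)
```

Recall is computed over the reference contexts, so retrieving extra irrelevant chunks does not lower it; that is what Context Precision penalizes.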

### Faithfulness

In [None]:
from ragas.dataset_schema import SingleTurnSample 
from ragas.metrics import Faithfulness

sample = SingleTurnSample(
    user_input="When was the first Super Bowl?",
    response="The first Super Bowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ],
)
scorer = Faithfulness(llm=llm)

await scorer.single_turn_ascore(sample)
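
Conceptually, Faithfulness is the share of claims in the response that the retrieved contexts support. The LLM does the hard part (splitting the response into claims and judging each one); the final arithmetic is sketched below. `faithfulness_from_verdicts` is a hypothetical helper, and the verdicts are assumed to be given.

```python
from typing import Dict


def faithfulness_from_verdicts(claim_verdicts: Dict[str, bool]) -> float:
    """Supported claims divided by total claims.

    claim_verdicts maps each claim extracted from the response to True when
    the retrieved contexts support it. Returns 0.0 when there are no claims.
    """
    if not claim_verdicts:
        return 0.0
    supported = sum(1 for is_supported in claim_verdicts.values() if is_supported)
    return supported / len(claim_verdicts)
```

A response that adds even one unsupported detail to otherwise grounded claims scores below 1 under this definition.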

### Answer Relevancy

In [None]:
from ragas import SingleTurnSample 
from ragas.metrics import ResponseRelevancy

sample = SingleTurnSample(
    user_input="When was the first Super Bowl?",
    response="The first Super Bowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ],
)

scorer = ResponseRelevancy(llm=llm, embeddings=embeddings)
await scorer.single_turn_ascore(sample)
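
This metric needs both the LLM and the embeddings: the LLM generates questions that the response would answer, and the score is based on how close those questions are to the original one in embedding space. The sketch below shows only the similarity step in plain Python, assuming the embedding vectors are already computed. The function names are hypothetical, and the sketch ignores any extra handling the library applies (for example for evasive answers).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_question_similarity(
    question_vec: Sequence[float],
    generated_question_vecs: Sequence[Sequence[float]],
) -> float:
    """Average similarity between the user question and the generated questions."""
    if not generated_question_vecs:
        return 0.0
    total = sum(cosine_similarity(question_vec, v) for v in generated_question_vecs)
    return total / len(generated_question_vecs)
```

A response that drifts off topic leads to generated questions that differ from the user's question, which pulls the average down.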

## Evaluate Full Dataset

In [None]:
from datasets import Dataset

data_samples = {
    'question': [
        "When was Soekarno born?",
        "What is Pancasila?",
        "Where is Mount Bromo located?",
        "When did Indonesia gain independence?",
        "What is the capital city of Indonesia?"
    ],
    'answer': [
        "Soekarno was born on June 6, 1901.",
        "Pancasila is the foundation of Indonesia.",
        "Mount Bromo is located in East Java, Indonesia.",
        "Indonesia declared independence in 1945.",
        "Jakarta is the capital city of Indonesia."
    ],
    'contexts': [
        ["Soekarno, the first President of Indonesia, was born on June 6, 1901, in Surabaya."],
        ["Pancasila is the philosophical foundation of the Indonesian state, consisting of five principles introduced by Sukarno in 1945."],
        ["Mount Bromo is an active volcano situated in East Java, Indonesia, and is part of the Tengger massif."],
        ["Indonesia proclaimed its independence from Dutch colonial rule on August 17, 1945, following the end of World War II."],
        ["Jakarta is the current capital city of Indonesia, located on the northwest coast of the island of Java."]
    ],

    # Optional, but required for evaluating context precision or context recall
    'ground_truth': [
        "June 6, 1901",
        "The philosophical foundation of the Indonesian state consisting of five principles.",
        "East Java, Indonesia",
        "August 17, 1945",
        "Jakarta"
    ]
}

dataset = Dataset.from_dict(data_samples)
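
The column names above (`question`, `answer`, `contexts`, `ground_truth`) differ from the field names used by `SingleTurnSample` earlier in this notebook (`user_input`, `response`, `retrieved_contexts`, `reference`). If the installed Ragas version expects the latter, a small conversion like the following could be used. `to_ragas_rows` and the `LEGACY_TO_CURRENT` mapping are hypothetical names for this sketch.

```python
from typing import Any, Dict, List

# Assumed mapping from the column names used above to the SingleTurnSample fields
LEGACY_TO_CURRENT = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def to_ragas_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a dict of equal-length columns into a list of renamed row dicts."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same number of rows.")
    n_rows = lengths.pop() if lengths else 0
    return [
        # Unknown column names are passed through unchanged
        {LEGACY_TO_CURRENT.get(name, name): values[i] for name, values in columns.items()}
        for i in range(n_rows)
    ]
```

The resulting rows from `to_ragas_rows(data_samples)` share the field names of `SingleTurnSample`, so each row could be passed to it as keyword arguments.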

In [None]:
import os
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

score = evaluate(
    dataset,
    llm=llm,
    embeddings=embeddings,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ]
)
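df = score.to_pandas()

# Create the output folder first; to_csv does not create missing directories.
os.makedirs('result', exist_ok=True)
df.to_csv('result/score.csv', index=False)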
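
The saved file holds one row per sample. To summarize a run, the per-metric averages are a reasonable starting point. The sketch below assumes the rows are available as dictionaries keyed by metric name, and it skips missing or NaN scores; `summarize_scores` is a hypothetical helper.

```python
import math
from typing import Any, Dict, Iterable, List


def summarize_scores(rows: Iterable[Dict[str, Any]], metric_names: List[str]) -> Dict[str, float]:
    """Average each metric over the rows, ignoring missing and NaN values."""
    rows = list(rows)  # allow generators to be iterated once per metric
    summary: Dict[str, float] = {}
    for name in metric_names:
        values = [
            float(row[name])
            for row in rows
            if isinstance(row.get(name), (int, float)) and not math.isnan(row[name])
        ]
        # NaN signals that no valid score was available for this metric
        summary[name] = sum(values) / len(values) if values else float("nan")
    return summary
```

With the DataFrame from the cell above, this could be called as `summarize_scores(df.to_dict("records"), ["context_precision", "context_recall", "faithfulness", "answer_relevancy"])`.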

## Notes

- **Inputs**: Ragas scores the outputs of a RAG application, not the application itself. Run your own pipeline first and collect, for each sample, the question, the generated answer, and the retrieved contexts, plus a ground truth where the metric needs one.
- **Ragas API**: Class and function names (such as `LLMContextPrecisionWithoutReference`, `ResponseRelevancy`, and `evaluate`) can change between Ragas releases, so check them against the version pinned in `requirements.txt`.
- **Results Interpretation**: Each of these metrics is a score between 0 and 1, where higher is better. Use the per-sample values in `result/score.csv` to summarize the findings.
