# Evaluate with RAGAS

**Note**

The code in this notebook requires the evaluation dataset to be available as a JSON file. The file holds a JSON object in which each key maps to a list with one entry per evaluation sample:
```
{
    "user_input" : user_input,
    "retrieved_contexts"  : retrieved_contexts,
    "reference" : reference,
    "response" : responses
}
```
The JSON file is generated with the notebook: **ragas-prepare-eval-dataset**
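
A quick sanity check on the loaded file can catch shape problems before the evaluation starts. The sketch below assumes the file holds one JSON object whose four keys each map to a list of the same length, which is the layout `Dataset.from_dict` accepts. `validate_eval_json` and the sample values are hypothetical and not part of RAGAS.

```python
REQUIRED_KEYS = ("user_input", "retrieved_contexts", "reference", "response")


def validate_eval_json(data):
    """Return a list of problems found in the loaded eval JSON (empty if none)."""
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object (dict)"]

    problems = []
    for key in REQUIRED_KEYS:
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], list):
            problems.append(f"{key} must be a list")

    # All columns that are lists should have one entry per sample.
    lengths = {len(data[k]) for k in REQUIRED_KEYS if isinstance(data.get(k), list)}
    if len(lengths) > 1:
        problems.append(f"columns have different lengths: {sorted(lengths)}")
    return problems


# Hypothetical two-sample dataset, for illustration only.
sample = {
    "user_input": ["question 1", "question 2"],
    "retrieved_contexts": [["chunk a", "chunk b"], ["chunk c"]],
    "reference": ["expected answer 1", "expected answer 2"],
    "response": ["generated answer 1", "generated answer 2"],
}
print(validate_eval_json(sample))
```

In the notebook, the same function could be called on `eval_dataset_json` right after `json.load`.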

**Evaluation steps**

1. Read the eval file and create the evaluation dataset *(see the last cell of the ragas-prepare-eval-dataset notebook)*
2. Set up the RAGAS configuration and metrics
3. Run the evaluation

**Sample results**

Embedding model = "all-MiniLM-L6-v2"
RAG LLM = Cohere Command
Judge LLM = Google Gemini 1.5 Flash
Chunk size = 900
Chunk overlap = 150
{'answer_relevancy': 0.9053, 'context_precision': 0.4000, 'context_recall': 0.2143, 'context_entity_recall': 0.0686, 'semantic_similarity': 0.8116, 'answer_correctness': 0.4810}

## Set up packages

In [1]:
from datasets import Dataset
import json
from ragas import EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

from IPython.display import JSON
from dotenv import load_dotenv
import os
from langchain.prompts import PromptTemplate
import warnings
import sys

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")


# Load the file that contains the API keys
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

# setting path
sys.path.append('../')

from utils.create_llm import create_gpt_llm, create_cohere_llm, create_ollama_llm, create_hugging_face_llm, create_google_llm, create_ai21_llm

## 1. Create the eval dataset by reading the file

In [2]:

# Specify the path to your JSON file
eval_filepath = "amzn-10k/amz-10k-2024-eval-dataset-cohere-chunk900-overlap150-k5.json"

# Open the JSON file and load its contents into a variable
with open(eval_filepath, 'r', encoding='utf-8') as file:
    eval_dataset_json = json.load(file)

eval_dataset_hf = Dataset.from_dict(eval_dataset_json)
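
`Dataset.from_dict` expects a mapping from column name to a list of values. If an eval file instead stores a list of per-question records, it has to be turned into that column layout first. The following is a minimal sketch of such a conversion; `records_to_columns` is a hypothetical helper and the sample records are made up.

```python
def records_to_columns(records):
    """Turn [{"user_input": ..., ...}, ...] into {"user_input": [...], ...}."""
    if not records:
        return {}

    keys = list(records[0].keys())
    columns = {key: [] for key in keys}
    for i, record in enumerate(records):
        # Every record must carry the same keys, otherwise columns would misalign.
        if set(record) != set(keys):
            raise ValueError(
                f"record {i} has keys {sorted(record)}, expected {sorted(keys)}"
            )
        for key in keys:
            columns[key].append(record[key])
    return columns


# Hypothetical records, for illustration only.
records = [
    {"user_input": "q1", "retrieved_contexts": ["c1"], "reference": "r1", "response": "a1"},
    {"user_input": "q2", "retrieved_contexts": ["c2"], "reference": "r2", "response": "a2"},
]
columns = records_to_columns(records)
print(columns["user_input"])
```

The returned dictionary can then be passed to `Dataset.from_dict` in place of the raw JSON.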

## 2. Set up RAGAS config & metrics

In [3]:
from ragas import evaluate, RunConfig
from ragas.metrics import (
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness
)

# Specify the metrics here; the trailing comments list the dataset columns each metric reads
metrics = [
    answer_relevancy,        # ['user_input', 'response']
    context_precision,       # ['user_input', 'retrieved_contexts', 'reference']
    context_recall,          # ['user_input', 'retrieved_contexts', 'reference']
    context_entity_recall,   # ['retrieved_contexts', 'reference']
    answer_similarity,       # ['response', 'reference']
    answer_correctness       # ['user_input', 'response', 'reference']
]
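
Each metric reads a specific set of dataset columns, so a missing column usually surfaces only once the evaluation is running. The sketch below checks this up front. The `REQUIRED_COLUMNS` table is hand-written and is an assumption about what each metric needs, so it should be verified against the installed RAGAS version. `missing_columns` is a hypothetical helper.

```python
# Assumed column requirements per metric (hand-written, not read from RAGAS).
REQUIRED_COLUMNS = {
    "answer_relevancy": {"user_input", "response"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "answer_similarity": {"response", "reference"},
    "answer_correctness": {"user_input", "response", "reference"},
}


def missing_columns(dataset_columns, metric_names):
    """Map each metric name to the columns it needs but the dataset lacks."""
    available = set(dataset_columns)
    report = {}
    for name in metric_names:
        missing = REQUIRED_COLUMNS[name] - available
        if missing:
            report[name] = sorted(missing)
    return report


# Example: a dataset where the 'reference' column was left out on purpose.
columns = ["user_input", "retrieved_contexts", "response"]
print(missing_columns(columns, REQUIRED_COLUMNS))
```

With the notebook's dataset, `eval_dataset_hf.column_names` could serve as the first argument.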

## 3. Run evaluation

https://docs.ragas.io/en/stable/references/evaluate/

In [4]:

# Choose the judge LLM used by the RAGAS metrics
# llm_for_evaluation = create_google_llm()
llm_for_evaluation = create_cohere_llm()
# llm_for_evaluation = create_gpt_llm(args={"temperature": 0, "max_tokens": 2000})

# Wrap the LangChain LLM so RAGAS can call it
llm_for_evaluation_wrapper = LangchainLLMWrapper(llm_for_evaluation)

# Create a Ragas Embeddings wrapper
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings_model_wrapper = LangchainEmbeddingsWrapper(embedding_function)

# Convert dataset to Ragas eval dataset
eval_dataset_ragas = EvaluationDataset.from_hf_dataset(eval_dataset_hf)



# Optional: validate or inspect the dataset before running
# eval_dataset_ragas.validate_samples()
# eval_dataset_ragas.to_csv()
# eval_dataset_ragas

result = evaluate(
    dataset=eval_dataset_ragas,
    metrics=metrics,
    llm=llm_for_evaluation_wrapper,
    embeddings=embeddings_model_wrapper,
    run_config=RunConfig(max_workers=2)  # run at most 2 evaluation jobs concurrently
)

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt correctness_classifier failed to parse output: The output parser failed to parse the output including retries.
Exception raised in Job[5]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt correctness_classifier failed to parse output: The output parser failed to pa
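
The parser failures above mean that some jobs produced no score, so an averaged metric may rest on fewer samples than the dataset contains. The sketch below counts how many samples contribute to a mean. It assumes that a failed job shows up as NaN in the per-sample scores, which is an assumption about how RAGAS reports failures when exceptions are suppressed. `summarize_scores` and the sample values are hypothetical.

```python
import math


def summarize_scores(scores):
    """Average the valid scores and count the entries that failed (None or NaN)."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    failed = len(scores) - len(valid)
    mean = sum(valid) / len(valid) if valid else float("nan")
    return {"mean": mean, "scored": len(valid), "failed": failed}


# Made-up per-sample scores for one metric; the NaN stands for a failed job.
per_sample = [1.0, float("nan"), 0.5, 0.0]
print(summarize_scores(per_sample))
```

Reporting the `failed` count next to each mean makes it easier to judge how much a score such as `answer_correctness` can be trusted.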

In [7]:
print(result)

{'answer_relevancy': 0.9102, 'context_precision': 0.4000, 'context_recall': 0.2143, 'context_entity_recall': 0.0686, 'semantic_similarity': 0.8116, 'answer_correctness': 0.4810}
