<a href="https://colab.research.google.com/github/ShankarChavan/RAG-Evaluation/blob/main/RAG_Evals/openai_ragas_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction


Ragas is the de facto open-source standard for RAG evaluation. It provides features and methods to help you evaluate RAG applications. In this notebook we will cover the basic steps for evaluating your RAG application with Ragas.

### Contents
- [Prerequisites](#prerequisites)
- [Dataset preparation](#dataset-preparation)
- [Evaluation](#evaluation)
- [Analysis](#analysis)

### Prerequisites
- Ragas is a Python package that we can install with pip.
- Some documents to build our simple RAG pipeline.
- Ragas uses model-guided techniques under the hood to produce scores for each metric, so it needs access to an LLM and an embedding model. OpenAI `gpt-3.5-turbo` and `text-embedding-ada-002` are the default models used in ragas; the code cells in this notebook set `gpt-4o` explicitly. You can use any LLM or embedding model of your choice by referring to this [guide](https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html). I highly recommend that you first try this notebook with OpenAI models so that you get a feel for it with ease.

In [1]:
! pip install -q ragas llama-index


In [2]:
! pip install isort

Collecting isort
  Downloading isort-6.0.0-py3-none-any.whl.metadata (11 kB)
Downloading isort-6.0.0-py3-none-any.whl (94 kB)
Installing collected packages: isort
Successfully installed isort-6.0.0


In [3]:
!git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-guide-papers

Cloning into 'prompt-engineering-guide-papers'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 32 (delta 9), reused 0 (delta 0), pack-reused 4 (from 1)
Unpacking objects: 100% (32/32), 3.11 MiB | 3.20 MiB/s, done.


In [4]:
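import os
# os.environ["OPENAI_API_KEY"] = "<your-key-here>"

# Use the Colab path when running on Google Colab, otherwise a local path.
try:
    import google.colab  # noqa: F401
    PATH = "/content/prompt-engineering-guide-papers"
except ImportError:
    PATH = "./prompt-engineering-guide-papers"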

And that's it. You're ready to go.

## Dataset preparation

Evaluating any ML pipeline requires several data points that constitute a test dataset. For Ragas, the data points required to evaluate your RAG completely are:

- `question`: A question or query that is relevant to your RAG.
- `contexts`: The retrieved contexts corresponding to each question. Across the dataset this is a `list[list]` (one list of text chunks per question), since each question can retrieve multiple text chunks.
- `answer`: The answer generated by your RAG corresponding to each question.
- `ground_truth`: The expected correct answer corresponding to each question.
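To make the expected shape concrete, here is a minimal sketch of a single evaluation record as a plain Python dict, together with a small validator. `REQUIRED_FIELDS` and `validate_record` are hypothetical names used for illustration only; they are not part of the Ragas API, and the sample text is made up.

```python
# Hypothetical mapping of the four fields described above to their Python types.
REQUIRED_FIELDS = {
    "question": str,
    "contexts": list,      # list of retrieved text chunks for this question
    "answer": str,
    "ground_truth": str,
}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one evaluation record."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # Every retrieved context must itself be a string chunk.
    contexts = record.get("contexts")
    if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
        problems.append("contexts should contain only strings")
    return problems


# Made-up sample record, for illustration only.
sample = {
    "question": "What is Zero-shot-CoT?",
    "contexts": ["first retrieved chunk", "second retrieved chunk"],
    "answer": "A prompting method.",
    "ground_truth": "A zero-shot prompting method.",
}

print(validate_record(sample))
print(validate_record({"question": "Only a question"}))
```

A full test dataset is simply many such records; stored column-wise, the `contexts` column then becomes a list of lists.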

For the purpose of this notebook, I have this dataset prepared from a simple RAG that I created myself to help me with NLP research. Let's use it.

In [5]:
from datasets import load_dataset

In [6]:
eval_dataset = load_dataset("explodinggradients/prompt-engineering-guide-papers",trust_remote_code=True)
eval_dataset = eval_dataset['test'].to_pandas()
eval_dataset.head()



| | question | ground_truth | answer | contexts |
|---|---|---|---|---|
| 0 | How does instruction tuning affect the zero-sh... | For larger models on the order of 100B paramet... | For larger models with around 100B parameters,... | `[Published as a conference paper at ICLR 2022\...` |
| 1 | What is the Zero-shot-CoT method and how does ... | Zero-shot-CoT is a zero-shot template-based pr... | The Zero-shot-CoT method is a zero-shot templa... | `[Similar to\nFew-shot-CoT, Zero-shot-CoT facil...` |
| 2 | How does prompt tuning affect model performanc... | Prompt tuning improves model performance in im... | Prompt tuning has been shown to enhance model ... | `[The orange bars indicate standard deviation a...` |
| 3 | What is the purpose of instruction tuning in l... | The purpose of instruction tuning in language ... | The purpose of instruction tuning in language ... | `[Although one might\nexpect labeled data to ha...` |
| 4 | What distinguishes Zero-shot-CoT from Few-shot... | Zero-shot-CoT differs from Few-shot-CoT in tha... | Zero-shot-CoT requires prompting LLMs twice bu... | `[Baselines We compare our Zero-shot-CoT mainly...` |


As you can see, the dataset already contains `question` and `ground_truth`, two of the required attributes. It also ships with `answer` and `contexts` columns, but we will regenerate those two with our own pipeline. Now we can move on to the next step and collect them.

**Note:**
We know that it's hard to put together test data containing question and ground-truth answer pairs when starting out. Ragas addresses this with its synthetic test data generation feature.

#### Simple RAG pipeline

With the step above we have two of the attributes needed for evaluation: `question` and `ground_truth`. We now need to feed these test questions to our RAG pipeline to collect the other two, i.e. `contexts` and `answer`. Let's build a simple RAG using llama-index to do that.

In [38]:
import nest_asyncio
from llama_index.core.indices import VectorStoreIndex
from llama_index.core.readers import SimpleDirectoryReader
#from llama_index.core.service_context import ServiceContext
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from datasets import Dataset
from google.colab import userdata
nest_asyncio.apply()


token = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = token

Settings.llm = OpenAI(model="gpt-4o",api_key=token,api_base="https://models.inference.ai.azure.com")
#Settings.embed_model = OpenAIEmbedding(api_base="https://models.inference.ai.azure.com",model_name='text-embedding-3-large')

def build_query_engine(documents):
    vector_index = VectorStoreIndex.from_documents(
        documents,
    )

    query_engine = vector_index.as_query_engine(similarity_top_k=3)
    return query_engine

# Run every test question through the query engine, one at a time (synchronously),
# and collect the answers and retrieved contexts into a Hugging Face Dataset.
def generate_responses(query_engine, test_questions, test_answers):
    responses = [query_engine.query(q) for q in test_questions]

    answers = []
    contexts = []
    for r in responses:
        answers.append(r.response)
        # Each response carries the source nodes (text chunks) it was built from.
        contexts.append([c.node.get_content() for c in r.source_nodes])
    dataset_dict = {
        "question": test_questions,
        "answer": answers,
        "contexts": contexts,
    }
    # Ground-truth answers are optional.
    if test_answers is not None:
        dataset_dict["ground_truth"] = test_answers
    ds = Dataset.from_dict(dataset_dict)
    return ds

In [27]:
reader = SimpleDirectoryReader(PATH,num_files_limit=1, required_exts=[".pdf"])
documents = reader.load_data()


In [25]:
test_questions = eval_dataset['question'].values.tolist()
test_answers = eval_dataset['ground_truth'].values.tolist()

In [39]:
query_engine1 = build_query_engine(documents[0:10])
result_ds = generate_responses(query_engine1, test_questions, test_answers)
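Since every call to `query_engine.query` is a paid API request, it can help to dry-run the collection logic with a stand-in engine first. The sketch below mirrors the loop in `generate_responses` but returns a plain dict instead of a `Dataset`. `StubNode`, `StubSourceNode`, `StubResponse`, `StubEngine`, and `collect` are hypothetical names; they only imitate the attributes used above (`response`, `source_nodes`, `node.get_content()`) and are not llama-index classes.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    text: str

    def get_content(self) -> str:
        return self.text


@dataclass
class StubSourceNode:
    node: StubNode


@dataclass
class StubResponse:
    response: str
    source_nodes: list = field(default_factory=list)


class StubEngine:
    """Hypothetical stand-in for a query engine; makes no network calls."""

    def query(self, question: str) -> StubResponse:
        chunks = [
            StubSourceNode(StubNode(f"chunk {i} for: {question}")) for i in range(3)
        ]
        return StubResponse(response=f"stub answer to: {question}", source_nodes=chunks)


def collect(query_engine, questions, ground_truths=None) -> dict:
    """Same collection logic as generate_responses, returning a plain dict."""
    answers, contexts = [], []
    for q in questions:
        r = query_engine.query(q)
        answers.append(r.response)
        contexts.append([c.node.get_content() for c in r.source_nodes])
    data = {"question": list(questions), "answer": answers, "contexts": contexts}
    if ground_truths is not None:
        data["ground_truth"] = list(ground_truths)
    return data


dry_run = collect(StubEngine(), ["q1", "q2"], ["gt1", "gt2"])
print(dry_run["contexts"][0])
```

If the dry run produces one answer and one list of contexts per question, the same loop should yield a well-formed dataset when pointed at the real query engine.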

## Evaluation
For evaluation, ragas provides several metrics that aim to quantify both the end-to-end performance of the pipeline and the performance of its individual components. For this tutorial, let's consider a few of them.

**Note**: *Refer to our [metrics](https://docs.ragas.io/en/stable/concepts/metrics/index.html) docs to read more about different metrics.*
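As a rough intuition for one of the metrics used below: the Ragas docs describe faithfulness as the fraction of claims in the generated answer that can be inferred from the retrieved contexts. In Ragas, an LLM extracts the claims and judges each one; the sketch below assumes those verdicts are already available and only shows the final ratio. `faithfulness_score` and the sample verdicts are hypothetical, not the library's implementation.

```python
def faithfulness_score(claim_verdicts: list) -> float:
    """Fraction of answer claims judged as supported by the retrieved contexts."""
    if not claim_verdicts:
        # No claims to check: the ratio is undefined.
        return float("nan")
    supported = sum(1 for verdict in claim_verdicts if verdict)
    return supported / len(claim_verdicts)


# Made-up verdicts: 3 of the 4 claims in the answer are supported by the contexts.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```

A score of 1.0 therefore means every claim in the answer was judged to be grounded in the retrieved contexts.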

In [44]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(base_url="https://models.inference.ai.azure.com",model="gpt-4o",api_key=token))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [46]:
from ragas.metrics import answer_correctness, faithfulness
from ragas import evaluate

lst_metrics = [answer_correctness, faithfulness]
# Point every metric at the LLM (and embeddings, where used) configured above.
for m in lst_metrics:
    m.llm = generator_llm
    if hasattr(m, "embeddings"):
        m.embeddings = generator_embeddings

ragas_results = evaluate(
    result_ds,
    metrics=lst_metrics,
    llm=generator_llm,
    embeddings=generator_embeddings,
)


ERROR:ragas.executor:Exception raised in Job[1]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[3]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[4]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[2]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[5]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[0]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[6]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[7]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[8]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[9]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[10]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[11]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[12]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[14]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[13]: TimeoutError()
ERROR:ragas.executor:Exception rais

## Analysis
You can export the individual scores to a pandas DataFrame and analyse them. You can also add [callbacks and tracing](https://docs.ragas.io/en/latest/howtos/applications/tracing.html) to ragas for in-depth analysis.

In [None]:
ragas_results.to_pandas().head(5)

| | question | answer | contexts | ground_truth | answer_correctness | faithfulness |
|---|---|---|---|---|---|---|
| 0 | How does instruction tuning affect the zero-sh... | Instruction tuning enhances the zero-shot perf... | `[34\nthe effectiveness of different constructi...` | For larger models on the order of 100B paramet... | 0.781983 | 1.0 |
| 1 | What is the Zero-shot-CoT method and how does ... | Zero-shot-CoT is a method that involves append... | `[Plan-and-Solve Prompting: Improving Zero-Shot...` | Zero-shot-CoT is a zero-shot template-based pr... | 0.667026 | 1.0 |
| 2 | How does prompt tuning affect model performanc... | Prompt tuning can impact model performance in ... | `[4 C. Liu et al.\nto generate results directly...` | Prompt tuning improves model performance in im... | 0.39604 | 1.0 |
| 3 | What is the purpose of instruction tuning in l... | The purpose of instruction tuning in language ... | `[In practice,\ninstruction tuning offers a gen...` | The purpose of instruction tuning in language ... | 0.694074 | 1.0 |
| 4 | What distinguishes Zero-shot-CoT from Few-shot... | Zero-shot-CoT conditions the LM on a single pr... | `[Wei et al. (2022b ) observe that the success ...` | Zero-shot-CoT differs from Few-shot-CoT in tha... | 0.530018 | 1.0 |
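A first pass over the scores usually means averaging each metric and finding the weakest rows. The sketch below does that with the standard library only, using the `answer_correctness` and `faithfulness` values copied from the five rows shown above; the `rows` list and variable names are illustrative assumptions.

```python
from statistics import mean

# Scores copied from the five rows shown above (row index, then metric values).
rows = [
    {"idx": 0, "answer_correctness": 0.781983, "faithfulness": 1.0},
    {"idx": 1, "answer_correctness": 0.667026, "faithfulness": 1.0},
    {"idx": 2, "answer_correctness": 0.39604, "faithfulness": 1.0},
    {"idx": 3, "answer_correctness": 0.694074, "faithfulness": 1.0},
    {"idx": 4, "answer_correctness": 0.530018, "faithfulness": 1.0},
]

# Average each metric over the sample.
avg_correctness = mean(r["answer_correctness"] for r in rows)
avg_faithfulness = mean(r["faithfulness"] for r in rows)

# The row with the lowest answer_correctness is the first one worth inspecting.
worst = min(rows, key=lambda r: r["answer_correctness"])

print(f"mean answer_correctness: {avg_correctness:.3f}")
print(f"mean faithfulness: {avg_faithfulness:.3f}")
print(f"lowest answer_correctness: row {worst['idx']}")
```

On the full DataFrame returned by `ragas_results.to_pandas()`, the same summary can be obtained with pandas methods such as `mean()` and `sort_values()`.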


**If you liked this tutorial, check out [ragas](https://github.com/explodinggradients/ragas) and consider leaving a star!**