# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain

In the following notebook we'll discuss a major component of LLM Ops:

- Evaluation

We're going to be leveraging the [RAGAS]() framework for our evaluations today as it's becoming a standard method of evaluating (at least directionally) RAG systems.

We're also going to discuss a few more powerful Retrieval Systems that can potentially improve the quality of our generations!

Let's start as we always do: Grabbing our dependencies!

In [1]:
!pip install vllm -Uqq

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ydata-profiling 4.5.1 requires numpy<1.24,>=1.16.0, but you have numpy 1.24.3 which is incompatible.

In [2]:
!pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.0.3 which is incompatible.
cudf 23.8.0 requires protobuf<5,>=4.21, but you have protobuf 3.20.3 which is incompatible.
cuml 23.8.0 requires dask==2023.7.1, but you have dask 2023.12.0 which is incompatible.
cuml 23.8.0 requires distributed==2023.7.1, but you have distributed 2023.12.0 which is incompatible.
dask-cuda 23.8.0 requires dask==2023.7

In [4]:
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available!")
else:
    print("CUDA is not available.")

# Get the number of GPUs
num_gpus = torch.cuda.device_count()
print("Number of GPUs:", num_gpus)
# Note: torch.device() only creates a handle; it does not check that the GPU is usable
device = torch.device("cuda:0")  # Use the first GPU
device



CUDA is not available.
Number of GPUs: 2


device(type='cuda', index=0)

In [5]:
import os
import openai
from getpass import getpass

# openai.api_key = getpass("Please provide your OpenAI Key: ")
# os.environ["OPENAI_API_KEY"] = openai.api_key

### Data Collection

We're going to be using papers from Arxiv as our context today.

We can collect these documents rather straightforwardly with the `ArxivLoader` document loader from LangChain.

Let's grab and load 5 documents.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.arxiv.ArxivLoader.html)

In [6]:
!pip install -U -q langchain ragas arxiv pymupdf chromadb wandb tiktoken
!pip install sentence-transformers -Uqq
!pip install typing_extensions



ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.
google-cloud-bigquery 2.34.4 requires packaging<22.0dev,>=14.3, but you have packaging 23.2 which is incompatible.
google-cloud-pubsublite 1.8.3 requires overrides<7.0.0,>=6.0.1, but you have overrides 7.4.0 which is incompatible.
tensorflow 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.7.1 which is incompatible.


In [48]:
!pip install ngrok
# !ngrok --version
!pip show ngrok

# !ngrok http 8080


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)






Name: ngrok
Version: 0.12.1
Summary: The ngrok Agent SDK for Python
Home-page: 
Author: 
Author-email: 
License: MIT OR Apache-2.0
Location: /opt/conda/lib/python3.10/site-packages
Requires: 
Required-by: 


In [49]:
# start the vLLM server
!python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceH4/zephyr-7b-beta \
    --host 0.0.0.0 \
    --port 5000

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO 12-23 09:37:22 api_server.py:719] args: Namespace(host='0.0.0.0', port=5000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='HuggingFaceH4/zephyr-7b-beta', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 12-23 09:37:22 llm_engine.py:73] Initializing an LLM engine with config: model='HuggingFaceH4/zephyr-7b-beta', tokenizer='HuggingF

In [None]:
# from ragas.metrics import (
#     context_precision,
#     faithfulness,
#     context_recall,
# )
# from ragas.metrics.critique import harmfulness

# # change the LLM

# faithfulness.llm = vllm
# context_precision.llm = vllm
# context_recall.llm = vllm
# harmfulness.llm = vllm

# # evaluate
# from ragas import evaluate

# result = evaluate(
#     fiqa_eval["baseline"].select(range(5)),  # showing only 5 for demonstration
#     metrics=[faithfulness],
# )

# result

In [52]:
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM

inference_server_url = "http://localhost:5000/v1"

# create vLLM Langchain instance
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-beta",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=5,
    temperature=0,
)

# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLM(llm=chat)

In [51]:
chat

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7954b7dae9b0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7954d8c47160>, model_name='HuggingFaceH4/zephyr-7b-alpha', temperature=0.0, openai_api_key='no-key', openai_api_base='http://localhost:5000/v1', openai_proxy='', max_tokens=5)

In [9]:
from langchain.document_loaders import ArxivLoader

base_docs = ArxivLoader(query="Retrieval Augmented Generation", load_max_docs=5).load()
len(base_docs)

5

In [10]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable approaches according to different tasks\nincluding dialogue response generation, machine translation, and other\ngeneration tasks. Finally, it points out some important directions on top of\nrecent methods to facilitate future research.'}
{'Published': '2023-12-09', 'Title': 'Context Tuning for Retrieval Augmented Generation', 'Autho

### Creating an Index

Let's use a naive index creation strategy: split our documents with `RecursiveCharacterTextSplitter` and embed each chunk into our `VectorStore`. The cell below uses `HuggingFaceEmbeddings()` as a locally run alternative to `OpenAIEmbeddings()`.

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`OpenAIEmbeddings()`](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html?highlight=openaiembeddings#langchain-embeddings-openai-openaiembeddings)
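
To make the chunking step concrete, here is a minimal, self-contained sketch of recursive character splitting. It is a simplified illustration and not LangChain's implementation: the function name `split_recursive`, the separator list, and the greedy merging are all assumptions made for this example.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer
    separators for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate      # keep growing the current chunk
            continue
        if current:
            chunks.append(current)   # flush what we have so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Retrieval augmented generation.\n\n"
    "It pairs a retriever with a generator.\n"
    "Chunks stay small."
)
chunks = split_recursive(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Splitting on paragraph breaks before line breaks before spaces tends to keep related sentences together, which is the main reason to prefer this over a plain fixed-width cut.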

In [11]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)

docs = text_splitter.split_documents(base_docs)

vectorstore = Chroma.from_documents(docs, HuggingFaceEmbeddings())




In [20]:
len(docs)

4756

In [21]:
print(max([len(chunk.page_content) for chunk in docs]))

249


Let's convert our `Chroma` vectorstore into a retriever with the `.as_retriever()` method.

In [22]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 2})
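
Conceptually, a dense retriever with `k = 2` scores every stored chunk embedding against the query embedding and keeps the two best. The sketch below shows that idea with toy 3-dimensional vectors, assuming cosine similarity as the scoring function; the metric a real vector store uses depends on its configuration, and the names `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" standing in for real chunk vectors.
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(nearest)
```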

Now to give it a test!

In [23]:
relevant_docs = base_retriever.get_relevant_documents("What is Retrieval Augmented Generation?")


In [24]:
len(relevant_docs)

2

### Creating a Retrieval Augmented Generation Prompt

Now we can set up a prompt template that will be used to provide the LLM with the necessary contexts, user query, and instructions!

In [42]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://download.pytorch.org/whl/cu121


In [53]:
from langchain.llms import VLLM

primary_qa_llm = VLLM(
    model="HuggingFaceH4/zephyr-7b-beta",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)




INFO 12-23 09:39:10 llm_engine.py:73] Initializing an LLM engine with config: model='HuggingFaceH4/zephyr-7b-beta', tokenizer='HuggingFaceH4/zephyr-7b-beta', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)


RuntimeError: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.

In [54]:
!pip3 install text_generation

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting text_generation
  Obtaining dependency information for text_generation from https://files.pythonhosted.org/packages/14/f7/cadf3a0fc619a72d7c667d16e96ef0a5b4c557e6e2b4788a0360dfba4fee/text_generation-0.6.1-py3-none-any.whl.metadata
  Downloading text_generation-0.6.1-py3-none-any.whl.metadata (7.8 kB)
Downloading text_generation-0.6.1-py3-none-any.whl (10 kB)
Installing collected packages: text_generation
Successfully installed text_generation-0.6.1


In [57]:
from langchain.llms import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8010/",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)
llm("What did foo say about bar?")

ConnectionError: HTTPConnectionPool(host='localhost', port=8010): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7954b7529780>: Failed to establish a new connection: [Errno 111] Connection refused'))

In [25]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
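
To see what the template produces once its two placeholders are filled, here is a plain-Python sketch that uses `str.format` instead of `ChatPromptTemplate`. The helper `render_prompt`, the name `toy_template`, and the sample chunks are hypothetical; in the real chain the context is a list of `Document` objects rather than joined strings.

```python
toy_template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""


def render_prompt(context_chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(context_chunks)
    return toy_template.format(context=context, question=question)


rendered = render_prompt(
    [
        "RAG pairs a retriever with a generator.",
        "Retrieved text is placed in the prompt.",
    ],
    "What is RAG?",
)
print(rendered)
```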

### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll follow *exactly* the chain we made on Tuesday to keep things simple for now - if you need a refresher on what it looked like - check out last week's notebook!

In [29]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

primary_qa_llm = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-beta",  # must match the model served by vLLM above
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=128,  # a limit of 5 tokens would cut every answer short
    temperature=0.2,
)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by passing the value of the "question" key to the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned unchanged with RunnablePassthrough.assign, so every key
    #              from the previous step is passed through to the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object,
    #              which is piped into the LLM; the result is stored under "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
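
The data flow of the chain can be easier to follow when written as ordinary function calls. The sketch below mirrors the three steps with hypothetical stand-ins (`toy_retriever`, `toy_prompt`, `toy_llm`); it does not use LangChain and only illustrates which keys exist at each step.

```python
def toy_retriever(question):
    # Hypothetical stand-in for base_retriever: always returns one fixed chunk.
    return ["RAG pairs a retriever with a generator."]


def toy_prompt(inputs):
    # Hypothetical stand-in for the prompt template.
    return f"CONTEXT: {inputs['context']}\nQUESTION: {inputs['question']}"


def toy_llm(prompt_text):
    # Hypothetical stand-in for the chat model: just reports the prompt size.
    return f"(model output for a {len(prompt_text)}-character prompt)"


def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question.
    step1 = {
        "context": toy_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: keep every key and re-assign "context" unchanged.
    step2 = {**step1, "context": step1["context"]}
    # Step 3: format the prompt, call the model, and return the context alongside.
    return {"response": toy_llm(toy_prompt(step2)), "context": step2["context"]}


toy_result = run_chain({"question": "What is RAG?"})
print(toy_result)
```

Returning the context next to the response is what later lets the evaluation code read both the answer and the retrieved chunks from a single call.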

Let's test it out!

In [30]:
question = "What is RAG?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)


APIConnectionError: Connection error.

### Ground Truth Dataset Creation Using GPT-3.5-turbo and GPT-4

The next section might take you a long time to run, so the evaluation dataset is provided.

The basic idea is that we can use LangChain to create questions based on our contexts, and then answer those questions.

Let's look at how that works in the code!

In [34]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [35]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()
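
The parser's job is to turn the model's text reply into a dictionary with the expected keys. As a rough sketch of that idea (an assumption about the general approach, not the actual `StructuredOutputParser` code), the hypothetical helper `parse_json_block` below pulls a JSON object out of a fenced reply and checks its keys.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_json_block(text, expected_keys):
    """Extract a flat JSON object from an LLM reply and check its keys."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    payload = match.group(1) if match else text.strip()
    data = json.loads(payload)  # raises ValueError on malformed JSON
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


reply = f'{FENCE}json\n{{"question": "What does the survey cover?"}}\n{FENCE}'
parsed = parse_json_block(reply, ["question"])
print(parsed)
```

Because parsing can fail when the model ignores the requested format, wrapping the call in `try`/`except` and skipping bad rows is a reasonable way to handle it in a generation loop.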

In [36]:
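question_generation_llm = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-beta",  # must match the model served by vLLM above
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=128,  # a limit of 5 tokens cannot hold a full JSON-formatted question
    temperature=0.2,
)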
bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)

In [37]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

Format the output as JSON with the following keys:
question

{format_instructions}

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})
output_dict = question_output_parser.parse(response.content)

APIConnectionError: Connection error.

In [None]:
for k, v in output_dict.items():
  print(k)
  print(v)

question

What is the aim of the paper 'A Survey on Retrieval-Augmented Text Generation'?


In [None]:
!pip install -q -U tqdm

In [None]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(docs[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = question_generation_chain.invoke({"content" : messages})
  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)

100%|██████████| 10/10 [00:35<00:00,  3.55s/it]


In [None]:
qac_triples[5]

{'question': 'What is the main focus of this paper?',
 'context': Document(page_content='thisisjcykcd@gmail.com, brandenwang@tencent.com\nlemaoliu@gmail.com\nAbstract\nRecently, retrieval-augmented text generation\nattracted increasing attention of the compu-\ntational linguistics community.\nCompared', metadata={'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable approaches according 

In [None]:
answer_generation_llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer about the context.

Format the output as JSON with the following keys:
answer

{format_instructions}

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content" : messages})
output_dict = answer_output_parser.parse(response.content)

In [None]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer

The focus of this paper is on retrieval-augmented text generation, which has gained increasing attention in the computational linguistics community. The paper conducts a survey of this field, highlighting the generic paradigm of retrieval-augmented generation, reviewing notable approaches across various tasks such as dialogue response generation and machine translation, and discussing future research directions.

question

What is the focus of the paper titled 'A Survey on Retrieval-Augmented Text Generation'?


In [None]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = answer_generation_chain.invoke({"content" : messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]

100%|██████████| 10/10 [01:10<00:00,  7.09s/it]


In [None]:
!pip install -q -U datasets

In [None]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)

In [None]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 10
})

In [None]:
eval_dataset[0]

{'question': 'What is the focus of this paper?',
 'context': 'A Survey on Retrieval-Augmented Text Generation\nHuayang Li♥,∗\nYixuan Su♠,∗\nDeng Cai♦,∗\nYan Wang♣,∗\nLemao Liu♣,∗\n♥Nara Institute of Science and Technology\n♠University of Cambridge\n♦The Chinese University of Hong Kong\n♣Tencent AI Lab',
 'ground_truth': 'The focus of this paper is on retrieval-augmented text generation, which has gained increasing attention in the computational linguistics community. The paper surveys the paradigm of retrieval-augmented generation, reviews notable approaches across various NLP tasks such as dialogue response generation and machine translation, and discusses future research directions in this area.'}

In [None]:
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

7359

### Evaluating RAG Pipelines

If you skipped ahead and need to load the `.csv` directly - uncomment the code below.

If you're running this notebook in Colab, please make sure you upload the `.csv` to your session files.

In [None]:
# from datasets import Dataset
# eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

In [None]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 10
})

### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.

In [None]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)

from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"].content,
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        context_relevancy,
        answer_correctness,
        answer_similarity
    ],
  )
  return result
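
Before spending time and API calls on an evaluation run, it can help to confirm that every row has the shape `create_ragas_dataset` produces. The helper below is a hypothetical sanity check and not part of RAGAS; the column names are taken from the function above, and other RAGAS versions may expect different ones.

```python
REQUIRED_COLUMNS = {
    "question": str,
    "answer": str,
    "contexts": list,
    "ground_truths": list,
}


def check_ragas_rows(rows):
    """Return a list of human-readable problems; an empty list means the rows look fine."""
    problems = []
    for index, row in enumerate(rows):
        for column, expected_type in REQUIRED_COLUMNS.items():
            if column not in row:
                problems.append(f"row {index}: missing '{column}'")
            elif not isinstance(row[column], expected_type):
                problems.append(
                    f"row {index}: '{column}' should be {expected_type.__name__}"
                )
    return problems


good_row = {
    "question": "What is RAG?",
    "answer": "Retrieval-Augmented Generation.",
    "contexts": ["RAG pairs a retriever with a generator."],
    "ground_truths": ["RAG stands for Retrieval-Augmented Generation."],
}
bad_row = {"question": "What is RAG?", "answer": "...", "contexts": "not a list"}

print(check_ragas_rows([good_row, bad_row]))
```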

Let's create our dataset first:

In [None]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:12<00:00,  1.30s/it]


In [None]:
basic_qa_ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 10
})

In [None]:
basic_qa_ragas_dataset[0]

{'question': 'What is the focus of this paper?',
 'answer': 'The focus of this paper is on improving the zero-shot generalization ability of language models through the use of Mixture-Of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora.',
 'contexts': ['paper are used for illustration only, they do not represent\nthe ethical attitude of the authors.\nReferences\nPayal Bajaj, Daniel Campos, Nick Craswell, Li Deng,\nJianfeng Gao, Xiaodong Liu, Rangan Majumder,\nAndrew McNamara, Bhaskar Mitra, Tri Nguyen,',
  '2Microsoft Research, Redmond, USA\n3Beijing National Research Center for Information Science and Technology, Beijing, China\n{yuzc19, yus21}@mails.tsinghua.edu.cn; chenyan.xiong@microsoft.com\nliuzy@tsinghua.edu.cn\nAbstract'],
 'ground_truths': ['The focus of this paper is on retrieval-augmented text generation, which has gained increasing attention in the computational linguistics community. The paper surveys the parad

Save it for later:

In [None]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

11397

And finally - evaluate how it did!

In [None]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [00:00<00:00,  1.04it/s]


evaluating with [faithfulness]


100%|██████████| 1/1 [00:07<00:00,  7.44s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:05<00:00,  5.91s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:07<00:00,  7.35s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [00:02<00:00,  2.26s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:09<00:00,  9.06s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.74it/s]


In [None]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.4000, 'answer_relevancy': 0.9535, 'context_recall': 1.0000, 'context_relevancy': 0.0559, 'answer_correctness': 0.6167, 'answer_similarity': 1.0000}

### Testing Other Retrievers

Now we can test how changing our Retriever impacts our RAGAS evaluation!

We'll build this simple qa_chain factory to create standardized qa_chains where the only different component will be the retriever.

In [None]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
  created_qa_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question")
    }
    | RunnablePassthrough.assign(
        context=itemgetter("context")
      )
    | {
         "response": prompt | primary_qa_llm,
         "context": itemgetter("context"),
      }
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways we can imagine improving a retriever is to split our documents into small chunks for embedding and then, at query time, return the larger parent chunk that "surrounds" each matching small chunk.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

The basic outline of this retrieval method is as follows:

1. Obtain User Question
2. Retrieve child documents using Dense Vector Retrieval
3. Merge the child documents based on their parents. If they have the same parents - they become merged.
4. Replace the child documents with their respective parent documents from an in-memory-store.
5. Use the parent documents to augment generation.
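
The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: fixed-width character splitting replaces the real splitters, a word-overlap score stands in for dense vector similarity, and every name (`split_fixed`, `build_index`, `word_overlap`, `retrieve_parents`) is hypothetical.

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(documents, parent_size, child_size):
    """Return (parents, children); each child is stored with its parent id."""
    parents, children = {}, []
    for doc in documents:
        for parent_text in split_fixed(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent_text
            for child_text in split_fixed(parent_text, child_size):
                children.append((child_text, parent_id))
    return parents, children


def word_overlap(query, text):
    # Toy relevance score standing in for dense vector similarity.
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, children, k=2):
    """Find the k best children, then return their de-duplicated parents."""
    ranked = sorted(children, key=lambda child: word_overlap(query, child[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # children sharing a parent are merged
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


doc_a = "alpha beta gamma delta epsilon zeta eta theta"
doc_b = "apples oranges bananas cherries grapes melons"
parents, children = build_index([doc_a, doc_b], parent_size=50, child_size=15)

found = retrieve_parents("zeta theta", parents, children, k=2)
print(found)
```

In this example both of the top-ranked children belong to the same parent, so a single parent chunk is returned: the small chunk is used for matching and the large one for generation.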

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())

store = InMemoryStore()

In [None]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [None]:
parent_document_retriever.add_documents(base_docs)

Let's create, test, and then evaluate our new chain!

In [None]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [None]:
parent_document_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'RAG stands for Retrieval-Augmented Generation.'

In [None]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:17<00:00,  1.80s/it]


In [None]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

55620

In [None]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [01:01<00:00, 61.80s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:09<00:00, 69.23s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [01:04<00:00, 64.76s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:06<00:00,  6.54s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [00:04<00:00,  4.02s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:06<00:00,  6.83s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.54it/s]


In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}
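
To compare this run with the baseline, a small helper that subtracts the two result dictionaries makes the changes easier to read. The sketch below hard-codes the two sets of scores reported in this notebook; the function name `metric_deltas` is hypothetical. With only 10 evaluation rows, the differences should be read as directional at best.

```python
baseline_scores = {
    "context_precision": 0.5000,
    "faithfulness": 0.4000,
    "answer_relevancy": 0.9535,
    "context_recall": 1.0000,
    "context_relevancy": 0.0559,
    "answer_correctness": 0.6167,
    "answer_similarity": 1.0000,
}
parent_document_scores = {
    "context_precision": 0.6972,
    "faithfulness": 0.3500,
    "answer_relevancy": 0.9439,
    "context_recall": 1.0000,
    "context_relevancy": 0.0134,
    "answer_correctness": 0.6000,
    "answer_similarity": 1.0000,
}


def metric_deltas(before, after):
    """Return after - before for every metric present in both results."""
    return {
        name: round(after[name] - before[name], 4)
        for name in before
        if name in after
    }


deltas = metric_deltas(baseline_scores, parent_document_scores)
for name, delta in deltas.items():
    print(f"{name:<20}{delta:+.4f}")
```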

#### Ensemble Retrieval

Next let's look at ensemble retrieval!

You can read more about this [here](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)!

The basic idea is as follows:

1. Obtain User Question
2. Hit the Retriever Pair
    - Retrieve Documents with BM25 Sparse Vector Retrieval
    - Retrieve Documents with Dense Vector Retrieval Method
3. Collect the retrieved docs and "fuse" them into a single ranked list using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm, with each retriever's contribution scaled by its weight.
4. Use those documents to augment our generation.

Ensure your `weights` list - the relative weighting of each retriever - sums to 1!
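To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. It illustrates the idea rather than reproducing LangChain's internal code: `weighted_rrf` is a hypothetical helper, the document IDs are made up, and `c=60` is the constant suggested in the RRF paper linked above.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (c + rank) from every list it appears in,
    where rank starts at 1 for the top hit of that list.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for doc_ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(doc_ids, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical document IDs standing in for retrieved chunks.
bm25_hits = ["doc_a", "doc_b"]            # sparse retriever, k=2
dense_hits = ["doc_b", "doc_c", "doc_d"]  # dense retriever, k=3

fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.75, 0.25])
print(fused)
```

A document that both retrievers return (`doc_b` here) collects score from both lists, which is why it can outrank a document that only one retriever placed first.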

In [None]:
!pip install -q -U rank_bm25

In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=75)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.75, 0.25])

In [None]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [None]:
ensemble_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'RAG stands for Retrieval-Augmented Generation.'

In [None]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:20<00:00,  2.07s/it]


In [None]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

22820

In [None]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [01:01<00:00, 61.76s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:08<00:00, 68.62s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:05<00:00,  5.37s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:11<00:00, 11.67s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [01:02<00:00, 62.45s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:08<00:00,  9.00s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.57it/s]


In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

### Conclusion

Let's review the aggregate scores for each pipeline, then collect them in a table!

In [None]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.4000, 'answer_relevancy': 0.9535, 'context_recall': 1.0000, 'context_relevancy': 0.0559, 'answer_correctness': 0.6167, 'answer_similarity': 1.0000}

In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

We can also zoom in on each result and find specific information about each of the questions and answers.

In [None]:
ensemble_qa_result_df = ensemble_qa_result.to_pandas()

In [None]:
ensemble_qa_result_df

| | question | contexts | answer | ground_truths | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the focus of this paper? | [has to make an important career decision.\nNe... | The focus of this paper is on a framework call... | [The focus of this paper is on retrieval-augme... | 1.0 | 0.666667 | 0.784617 | 1.0 | 0.0 | 0.5 | True |
| 1 | What is the title of the paper? | [of War. The game was released worldwide in\nG... | Title: Self-RAG: Learning to Retrieve, Generat... | [The title of the paper is 'A Survey on Retrie... | 0.5 | 1.0 | 0.976911 | 1.0 | 0.0 | 0.5 | True |
| 2 | What is the aim of this paper? | [has to make an important career decision.\nNe... | The aim of this paper is to introduce a new fr... | [The aim of this paper is to conduct a compreh... | 1.0 | 0.333333 | 0.800732 | 1.0 | 0.078947 | 0.75 | True |
| 3 | What is the main focus of the paper 'A Survey ... | [A Survey on Retrieval-Augmented Text Generati... | The main focus of the paper 'A Survey on Retri... | [The main focus of the paper 'A Survey on Retr... | 0.679167 | 1.0 | 0.982435 | 0.8 | 0.017857 | 1.0 | True |
| 4 | What is the main focus of this paper? | [example of completions of the prompt by diffe... | I don't know. | [The main focus of this paper is to conduct a ... | 1.0 | 0.0 | 0.742601 | 1.0 | 0.0 | 0.5 | True |
| 5 | What is the main focus of this paper? | [example of completions of the prompt by diffe... | I don't know. | [The main focus of this paper is to conduct a ... | 1.0 | 0.0 | 0.742601 | 1.0 | 0.0 | 0.5 | True |
| 6 | What are the advantages of retrieval-augmented... | [attracted increasing attention of the compu-\... | The advantages of retrieval-augmented text gen... | [The advantages of retrieval-augmented text ge... | 1.0 | 1.0 | 0.968712 | 1.0 | 0.025641 | 1.0 | True |
| 7 | What is the main focus of the paper 'A Survey ... | [A Survey on Retrieval-Augmented Text Generati... | The main focus of the paper 'A Survey on Retri... | [The main focus of the paper 'A Survey on Retr... | 0.679167 | 1.0 | 0.982422 | 1.0 | 0.017857 | 1.0 | True |
| 8 | What are the advantages of retrieval-augmented... | [attracted increasing attention of the compu-\... | The advantages of retrieval-augmented text gen... | [The advantages of retrieval-augmented text ge... | 1.0 | 1.0 | 0.968731 | 1.0 | 0.025641 | 1.0 | True |
| 9 | What are the advantages of retrieval-augmented... | [attracted increasing attention of the compu-\... | The advantages of retrieval-augmented text gen... | [The advantages of retrieval-augmented text ge... | 1.0 | 1.0 | 0.968692 | 1.0 | 0.025641 | 1.0 | True |


Let's also combine the aggregate results into a single table so we can compare the pipelines side by side!

In [None]:
def create_df_dict(pipeline_name, pipeline_items):
    """Build one table row: the pipeline name plus each (metric, score) pair."""
    df_dict = {"name": pipeline_name}
    for name, score in pipeline_items:
        df_dict[name] = score
    return df_dict

In [None]:
basic_rag_df_dict = create_df_dict("basic_rag", basic_qa_result.items())

In [None]:
pdr_rag_df_dict = create_df_dict("pdr_rag", pdr_qa_result.items())

In [None]:
ensemble_rag_df_dict = create_df_dict("ensemble_rag", ensemble_qa_result.items())

In [None]:
results_df = pd.DataFrame([basic_rag_df_dict, pdr_rag_df_dict, ensemble_rag_df_dict])

In [None]:
results_df.sort_values("answer_correctness", ascending=False)

| | name | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|
| 2 | ensemble_rag | 0.885833 | 0.7 | 0.891845 | 0.98 | 0.019158 | 0.775 | 1.0 |
| 0 | basic_rag | 0.5 | 0.4 | 0.953475 | 1.0 | 0.055904 | 0.616667 | 1.0 |
| 1 | pdr_rag | 0.697222 | 0.35 | 0.943909 | 1.0 | 0.013386 | 0.6 | 1.0 |
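One way to read this table is to express each pipeline as a difference from the baseline. The sketch below does that in plain Python using the aggregate scores copied from the table above; `deltas_vs_baseline` is a hypothetical helper, not part of RAGAS or pandas. With only 10 evaluation questions, treat the differences as directional rather than conclusive.

```python
# Aggregate scores copied from the results table above.
results = {
    "basic_rag": {
        "context_precision": 0.5, "faithfulness": 0.4, "answer_relevancy": 0.953475,
        "context_recall": 1.0, "context_relevancy": 0.055904,
        "answer_correctness": 0.616667, "answer_similarity": 1.0,
    },
    "pdr_rag": {
        "context_precision": 0.697222, "faithfulness": 0.35, "answer_relevancy": 0.943909,
        "context_recall": 1.0, "context_relevancy": 0.013386,
        "answer_correctness": 0.6, "answer_similarity": 1.0,
    },
    "ensemble_rag": {
        "context_precision": 0.885833, "faithfulness": 0.7, "answer_relevancy": 0.891845,
        "context_recall": 0.98, "context_relevancy": 0.019158,
        "answer_correctness": 0.775, "answer_similarity": 1.0,
    },
}

def deltas_vs_baseline(results, baseline="basic_rag"):
    """Return score minus baseline score for every metric of every other pipeline."""
    base = results[baseline]
    return {
        name: {metric: round(score - base[metric], 4) for metric, score in scores.items()}
        for name, scores in results.items()
        if name != baseline
    }

deltas = deltas_vs_baseline(results)
for name, diff in deltas.items():
    print(name, diff)
```

Positive values mean the pipeline scored higher than `basic_rag` on that metric, and negative values mean it scored lower.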


### ❓QUESTION❓

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.

In [None]:
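from operator import itemgetter
from langchain_core.runnables import RunnableParallel

# Assumes `base_retriever`, `prompt`, `primary_qa_llm`, and `parser` were defined in earlier cells.
retrieval_augmented_qa_chain = (
    # Step 1: retrieve context for the question and carry the question text forward.
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        # Pass only the question string; RunnablePassthrough() would forward the whole input dict.
        'question': itemgetter('question')
    }) | {
        # Step 2: generate the response and return the retrieved context alongside it.
        'response': prompt | primary_qa_llm | parser,
        'context': itemgetter('context')
    }
)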