# **Run the following two cells on their own in a separate Colab notebook or on a GPU serving instance**

In [None]:
!pip install langchain datasets ragas  vllm -Uqq

In [None]:
!python -u -m vllm.entrypoints.openai.api_server \
       --host 0.0.0.0 \
       --model TheBloke/zephyr-7B-beta-AWQ \
       --dtype half \
       --max-num-batched-tokens 8192 \
       --max-model-len 8192 \
       --quantization awq \
       --tensor-parallel-size 1 \
       --port 8000 | grep "Uvicorn" & npx localtunnel --port 8000
# Local tunneling is required: in my case, the server was only reachable through the localtunnel proxy endpoint.
# For multiple GPUs, set --tensor-parallel-size to the number of available GPU devices.
# For a GPTQ model, change --quantization to gptq.
# If you get a "bfloat16 not supported" error, set --dtype to half (i.e. float16/fp16).
# If you get a CUDA out-of-memory error, lower --max-model-len and --max-num-batched-tokens.

In [None]:
# Note: the cell above has to run in a separate session and be exposed through localtunnel so that its URL can be used as the inference endpoint in our ChatOpenAI pipeline.
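
The flags discussed in the comments above can be collected in one place. The following is a minimal sketch and not part of the original notebook: `build_vllm_command` is a hypothetical helper that only assembles the argument list already used in the server cell, so the tensor-parallel size, quantization type, and context length can be changed without editing the shell line by hand.

```python
import shlex


def build_vllm_command(model, quantization="awq", dtype="half",
                       max_model_len=8192, num_gpus=1, port=8000):
    """Assemble the argument list for vLLM's OpenAI-compatible server.

    Hypothetical helper: it only mirrors the flags used in the server cell.
    """
    return [
        "python", "-u", "-m", "vllm.entrypoints.openai.api_server",
        "--host", "0.0.0.0",
        "--model", model,
        "--dtype", dtype,
        "--max-num-batched-tokens", str(max_model_len),
        "--max-model-len", str(max_model_len),
        "--quantization", quantization,
        "--tensor-parallel-size", str(num_gpus),
        "--port", str(port),
    ]


# Example: halve the context length after a CUDA out-of-memory error.
cmd = build_vllm_command("TheBloke/zephyr-7B-beta-AWQ", max_model_len=4096)
print(shlex.join(cmd))
```

The printed string can be pasted after `!` in a Colab cell; the `grep` and `npx localtunnel` parts of the original command are left out of this sketch.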

##########################################################################################################################################################################################################

# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain with `vLLM`-Served Hugging Face Models as an Alternative to OpenAI

In the following notebook we'll discuss a major component of LLM Ops:

- Evaluation

We're going to be leveraging the RAGAS framework for our evaluations today, as it's becoming a standard method of evaluating (at least directionally) RAG systems.

We're also going to discuss a few more powerful Retrieval Systems that can potentially improve the quality of our generations!

Let's start as we always do: Grabbing our dependencies!

**Thanks to AI Makerspace for the original Colab notebook, which runs on OpenAI. Here we extend their resource to run the RAGAS evaluation with an open-source model served by vLLM.**

https://colab.research.google.com/drive/1TZo2sgf1YFzI4_U-tGppg_ylHAR3MXF_?usp=sharing#scrollTo=bGHipYsf7phk

AI MAKERSPACE Tutorial using OPENAI:
https://youtu.be/mEv-2Xnb_Wk?feature=shared

In [120]:
!pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken
!pip install vllm -Uqqq

In [2]:
import os
import openai
from getpass import getpass

# openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = "api-key"

In [118]:
# # Use this if the documents fail to load and the following cell shows zero documents
# !pip uninstall arxiv
# !pip install arxiv


### Data Collection

We're going to be using papers from Arxiv as our context today.

We can collect these documents rather straightforwardly with the `ArxivLoader` document loader from LangChain.

Let's grab and load 5 documents.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.arxiv.ArxivLoader.html)

In [3]:
from langchain.document_loaders import ArxivLoader

base_docs = ArxivLoader(query="Retrieval Augmented Generation", load_max_docs=5).load()
len(base_docs)

5

In [4]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable approaches according to different tasks\nincluding dialogue response generation, machine translation, and other\ngeneration tasks. Finally, it points out some important directions on top of\nrecent methods to facilitate future research.'}
{'Published': '2023-12-09', 'Title': 'Context Tuning for Retrieval Augmented Generation', 'Autho

### Creating an Index

Let's use a naive index creation strategy: split our documents with `RecursiveCharacterTextSplitter` and embed each chunk into our `VectorStore` using `HuggingFaceEmbeddings()`.

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`HuggingFaceEmbeddings()`](https://huggingface.co/blog/getting-started-with-embeddings)
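
To make the effect of `chunk_size` concrete, here is a simplified stand-in written in plain Python. It is not LangChain's implementation: `RecursiveCharacterTextSplitter` tries several separators in turn and supports overlap between chunks, while the hypothetical `split_by_words` below only packs whitespace-separated words greedily. It does show the same guarantee, namely that no chunk exceeds `chunk_size` characters.

```python
def split_by_words(text, chunk_size=250):
    """Greedily pack whitespace-separated words into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for word in text.split():
        # A single word longer than chunk_size is cut into hard slices.
        while len(word) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:chunk_size])
            word = word[chunk_size:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "retrieval augmented generation " * 40
chunks = split_by_words(sample, chunk_size=50)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

Smaller chunks give more precise matches at retrieval time but carry less surrounding context, which is the trade-off the later retrievers in this notebook try to address.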

In [121]:
!pip install sentence-transformers -Uqq #HuggingFaceEmbeddings

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)

docs = text_splitter.split_documents(base_docs)

vectorstore = Chroma.from_documents(docs, HuggingFaceEmbeddings())


In [7]:
len(docs)

4756

In [8]:
print(max([len(chunk.page_content) for chunk in docs]))

249


Let's convert our `Chroma` vectorstore into a retriever with the `.as_retriever()` method.

In [9]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 2})

Now to give it a test!

In [10]:

relevant_docs = base_retriever.get_relevant_documents("What is Retrieval Augmented Generation?")

In [11]:
len(relevant_docs)

2
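
The `k` in `search_kwargs` controls how many of the closest chunks come back. As a rough illustration of what "closest" means, the sketch below ranks toy vectors by cosine similarity in plain Python. The vectors and the helper names are made up for the example, and the actual distance metric depends on how the Chroma collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], toy_vectors, k=2))
```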

## Creating a Retrieval Augmented Generation Prompt

Now we can set up a prompt template that will be used to provide the LLM with the necessary contexts, user query, and instructions!

In [12]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
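
To see roughly what text the model ends up receiving, the same template can be filled with Python's built-in `str.format`. This is only a sketch: `render_prompt` is a hypothetical helper, and plain strings stand in for the retrieved documents.

```python
rag_template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""


def render_prompt(context_chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(context_chunks)
    return rag_template.format(context=context, question=question)


rendered = render_prompt(["RAG combines retrieval with generation."], "What is RAG?")
print(rendered)
```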

### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll follow *exactly* the chain we made on Tuesday to keep things simple for now - if you need a refresher on what it looked like - check out last week's notebook!

In [122]:
!pip install litellm[proxy] -Uqq

In [None]:
# !litellm --test

# **LiteLLM can also be used as an alternative, but I ran into errors (such as a model-parallel instantiation error), so I'm skipping it and continuing with vLLM.**

In [None]:
# from litellm import completion

# model_name = "TheBloke/zephyr-7B-beta-AWQ"
# provider = "vllm"
# messages = [[{"role": "user", "content": "Hey, how's it going"}] for _ in range(5)]


# from litellm import batch_completion



# response_list = batch_completion(
#             model=model_name,
#             custom_llm_provider=provider, # can easily switch to huggingface, replicate, together ai, sagemaker, etc.
#             messages=messages,
#             api_base="https://beige-donuts-dream.loca.lt/v1",
#             temperature=0.2,
#             max_tokens=80,
#         )
# print(response_list)

INFO 12-25 05:25:40 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/zephyr-7B-beta-AWQ', tokenizer='TheBloke/zephyr-7B-beta-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.


APIConnectionError: ignored

In [91]:
from openai import OpenAI
from IPython.display import display, HTML

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "api-key"
# The tunnel URL is dynamic: update it every time you start a new vLLM API endpoint in the other notebook.
# Append /v1 to the URL of your localtunneled inference server.
openai_api_base = "https://proud-colts-film.loca.lt/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="TheBloke/zephyr-7B-beta-AWQ",
                                      prompt="Lionel Scaloni is a")
print("Completion result:", completion)
output_html = f"<div style='max-height: 300px; overflow-y: auto;'>{completion}</div>"
display(HTML(output_html))

Completion result: Completion(id='cmpl-898e54536f6e4b2588d802d64f2dc3d7', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' Certified FIFA Coach and is coaching young football players for the last few years.')], created=568, model='TheBloke/zephyr-7B-beta-AWQ', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=9, total_tokens=25))
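
Under the hood, the client sends a JSON `POST` to the `/completions` route below the base URL, which is why the tunnel URL has to end in `/v1`. The sketch below builds such a request with the standard library without sending it. The host `example.invalid` is a placeholder and `build_completion_request` is a hypothetical helper, not part of the `openai` package.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, api_key="api-key", max_tokens=16):
    """Build (but do not send) a POST request for the /completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_completion_request(
    "https://example.invalid/v1",  # placeholder: use your own tunnel URL ending in /v1
    "TheBloke/zephyr-7B-beta-AWQ",
    "Lionel Scaloni is a",
)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
```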


In [56]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
from langchain.llms import VLLM
from ragas.llms import LangchainLLM

# llm = VLLM(
#     model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
#     trust_remote_code=True,  # mandatory for hf models
#     max_new_tokens=128,
#     top_k=10,
#     top_p=0.95,
#     temperature=0.8,
# )

inference_server_url = "https://red-hotels-camp.loca.lt/v1"

# create vLLM Langchain instance
chat = ChatOpenAI(
    model="TheBloke/zephyr-7B-beta-AWQ",
    openai_api_key="api-key",
    openai_api_base=inference_server_url,
    max_tokens=4096,
    temperature=0.2,
)

# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLM(llm=chat)

primary_qa_llm = chat

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
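
The data flow of this chain can be mimicked with ordinary functions, which may help when debugging what each step passes on. The sketch below is an assumption-laden simplification: `run_rag_chain` and the three stand-in components are hypothetical and involve no LangChain objects, but the returned dictionary has the same two keys, `response` and `context`.

```python
def run_rag_chain(question, retrieve, render, generate):
    """Mimic the data flow of the chain with plain callables."""
    # Step 1: look up context for the question and carry the question forward.
    state = {"context": retrieve(question), "question": question}
    # Step 2: format the prompt, call the model, and keep the context for later inspection.
    return {
        "response": generate(render(state["context"], state["question"])),
        "context": state["context"],
    }


# Stand-in components, all hypothetical.
def fake_retrieve(question):
    return ["RAG augments a language model with retrieved passages."]


def fake_render(context, question):
    return "CONTEXT: " + " ".join(context) + "\nQUESTION: " + question


def fake_generate(prompt_text):
    return f"(model output for a prompt of {len(prompt_text)} characters)"


demo_result = run_rag_chain("What is RAG?", fake_retrieve, fake_render, fake_generate)
print(sorted(demo_result))
```

Returning the context alongside the response is what later lets the evaluation compare each answer against the passages it was generated from.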

As an alternative to vLLM, we can also use Hugging Face Text Generation Inference.

In [None]:
# !pip install text-generation

In [None]:
# from langchain.llms import HuggingFaceTextGenInference

# llm = HuggingFaceTextGenInference(
#     inference_server_url="https://petite-eagles-shake.loca.lt/v1/",
#     max_new_tokens=4096,
#     top_k=10,
#     top_p=0.95,
#     typical_p=0.95,
#     temperature=0.01,
#     repetition_penalty=1.03,
# )
# llm

In [17]:
from IPython.display import display, HTML
question = "What is RAG?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)
# Create a scrollable area for the output
output_html = f"<div style='max-height: 300px; overflow-y: auto;'>{result}</div>"
display(HTML(output_html))

{'response': AIMessage(content="Answer: RAG, according to the context provided, is a retrieval augmented model that shows promise in enhancing traditional language models by improving their contextual understanding, integrating private data, and reducing hallucination. It requires additional memory for processing, but the Hybrid Retrieval-Augmented Generation (HybridRAG) framework proposed in the text aims to overcome the limitation of RAG's processing time required for real-time responses by incorporating asynchronously generated retrieval-augmented memory from a Large Language Model (LLM) in the cloud. The text also mentions a RAG baseline, which is a RAG model without any ideal reference label for comparison, and in this case, the same label as the HybridRAG approach is used."), 'context': [Document(page_content='without additional memory. For RAG, reference\nlabels are generated by GPT3 with full text. In the\ncase of GPT3-zeroshot baseline, since there is no\nideal reference label

### Ground Truth Dataset Creation Using Zephyr beta and Mistral Instruct

The next section might take you a long time to run, so the evaluation dataset is provided.

The basic idea is that we can use LangChain to create questions based on our contexts, and then answer those questions.

Let's look at how that works in the code!

In [18]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [19]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()
question_output_parser

StructuredOutputParser(response_schemas=[ResponseSchema(name='question', description='a question about the context.', type='string')])

In [20]:
question_generation_llm = chat

bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)

In [47]:
from langchain.prompts import ChatPromptTemplate
import json

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context.
question: a question about the context.

Format the output as JSON with the following keys:
question

context: {context}
"""



prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})



output_dict = question_output_parser.parse(response.content)
# # Create a scrollable area for the output
output_html = f"<div style='max-height: 300px; overflow-y: auto;'>{response.content}</div>"
display(HTML(output_html))

In [48]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What is the generic paradigm of retrieval-augmented text generation, and how does it differ from conventional text generation models?
context
{'metadata': {'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention of the computational linguistics community. Compared with conventional generation models, retrieval-augmented text generation has remarkable advantages and particularly has achieved state-of-the-art performance in many NLP tasks. This paper aims to conduct a survey about retrieval-augmented text generation. It firstly highlights the generic paradigm of retrieval-augmented generation, and then it reviews notable approaches according to different tasks including dialogue response generation, machine translation, and other generation tasks. Finally, it points out some important directions

In [39]:
!pip install -q -U tqdm

In [49]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(docs[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = question_generation_chain.invoke({"content" : messages})
  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)
  output_html = f"<div style='max-height: 300px; overflow-y: auto;'>{response.content}</div>"
  display(HTML(output_html))
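
The loop above silently skips any response the parser rejects, so a model that wraps its JSON in extra prose can shrink the dataset. One possible mitigation is a more forgiving extraction step before parsing. The helper below is a sketch using only the standard library; `extract_json_object` is a hypothetical name and is not part of LangChain.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed


def extract_json_object(text):
    """Return the first JSON object found in text, or None if nothing parses."""
    # Prefer the contents of a fenced json block when one is present.
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    fenced = re.search(pattern, text, re.DOTALL)
    candidates = [fenced.group(1)] if fenced else []
    # Fall back to the span between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


print(extract_json_object('Sure! {"question": "What is RAG?"} Hope that helps.'))
```

Counting how often this returns `None` is also a cheap way to notice when a model struggles with the requested format.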


100%|██████████| 10/10 [02:19<00:00, 13.95s/it]


In [50]:
qac_triples

[{'question': "In the context of 'A Survey on Retrieval-Augmented Text Generation' by Huayang Li et al., what are some notable approaches to retrieval-augmented text generation for dialogue response generation, machine translation, and other generation tasks?",
  'context': Document(page_content='Huayang Li♥,∗\nYixuan Su♠,∗\nDeng Cai♦,∗\nYan Wang♣,∗\nLemao Liu♣,∗\n♥Nara Institute of Science and Technology\n♠University of Cambridge\n♦The Chinese University of Hong Kong\n♣Tencent AI Lab\nli.huayang.lh6@is.naist.jp, ys484@cam.ac.uk', metadata={'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performa

In [51]:
qac_triples[1]

{'question': 'What is retrieval-augmented text generation and how does it differ from conventional text generation models?',
 'context': Document(page_content='Yan Wang♣,∗\nLemao Liu♣,∗\n♥Nara Institute of Science and Technology\n♠University of Cambridge\n♦The Chinese University of Hong Kong\n♣Tencent AI Lab\nli.huayang.lh6@is.naist.jp, ys484@cam.ac.uk\nthisisjcykcd@gmail.com, brandenwang@tencent.com', metadata={'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the g

# **We can also host two different models on separate vLLM API endpoints in two GPU sessions, loading one as the evaluator and the other as the RAG model. For example, Mixtral MoE AWQ from TheBloke could serve as the evaluator, since it is a state-of-the-art open-source model as of December 2023.**

In [58]:
chat = ChatOpenAI(
    model="TheBloke/zephyr-7B-beta-AWQ",
    openai_api_key="api-key",
    openai_api_base=inference_server_url,
    max_tokens=4096,
    temperature=0.2,
)
answer_generation_llm = chat

# In place of GPT-4 as the evaluator, we can use the SOTA Mixtral AWQ model from TheBloke, if you can host Mixtral on Colab Pro or on a local high-VRAM GPU server.
# chat = ChatOpenAI(
#     model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
#     openai_api_key="api-key",
#     openai_api_base=inference_server_url,  # the two models should be hosted in two different notebooks, each with its own vLLM tunnel URL
#     max_tokens=4096,
#     temperature=0.2,
# )

# https://red-hotels-camp.loca.lt
answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer about the context.

Format the output as JSON with the following keys:


question: {question}
answer
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content" : messages})
print(response.content)
output_dict = answer_output_parser.parse(response.content)
output_html = f"<div style='max-height: 300px; overflow-y: auto;'>{output_dict}</div>"
display(HTML(output_html))


{
  "question": "In the context of 'A Survey on Retrieval-Augmented Text Generation' by Huayang Li et al., what are some notable approaches to retrieval-augmented text generation for dialogue response generation, machine translation, and other generation tasks?",
  "answer": "According to the survey by Huayang Li et al., some notable approaches to retrieval-augmented text generation for dialogue response generation include the Retrieval-Augmented Generation (RAG) model by Qu et al. (2020), which combines a retrieval component with a generation component to improve the accuracy and fluency of responses, and the Retrieval-Assisted Response Generation (RARG) model by Zhang et al. (2020), which uses a retrieval-based ranking mechanism to select the most relevant response from a retrieved set. For machine translation, the Retrieval-Augmented Neural Machine Translation (RANMT) model by Gu et al. (2018) incorporates a retrieval component into the translation process to improve the quality of 

In [59]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
In the context of 'A Survey on Retrieval-Augmented Text Generation' by Huayang Li et al., what are some notable approaches to retrieval-augmented text generation for dialogue response generation, machine translation, and other generation tasks?
answer
According to the survey by Huayang Li et al., some notable approaches to retrieval-augmented text generation for dialogue response generation include the Retrieval-Augmented Generation (RAG) model by Qu et al. (2020), which combines a retrieval component with a generation component to improve the accuracy and fluency of responses, and the Retrieval-Assisted Response Generation (RARG) model by Zhang et al. (2020), which uses a retrieval-based ranking mechanism to select the most relevant response from a retrieved set. For machine translation, the Retrieval-Augmented Neural Machine Translation (RANMT) model by Gu et al. (2018) incorporates a retrieval component into the translation process to improve the quality of translations, wh

In [60]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = answer_generation_chain.invoke({"content" : messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]

100%|██████████| 9/9 [01:44<00:00, 11.62s/it]


In [61]:
!pip install -q -U datasets

In [62]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)

In [63]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

In [64]:
eval_dataset[0]

{'question': "In the context of 'A Survey on Retrieval-Augmented Text Generation' by Huayang Li et al., what are some notable approaches to retrieval-augmented text generation for dialogue response generation, machine translation, and other generation tasks?",
 'context': 'Huayang Li♥,∗\nYixuan Su♠,∗\nDeng Cai♦,∗\nYan Wang♣,∗\nLemao Liu♣,∗\n♥Nara Institute of Science and Technology\n♠University of Cambridge\n♦The Chinese University of Hong Kong\n♣Tencent AI Lab\nli.huayang.lh6@is.naist.jp, ys484@cam.ac.uk',
 'ground_truth': 'According to the survey by Huayang Li et al., some notable approaches to retrieval-augmented text generation for dialogue response generation include the Retrieval-based Response Generation (RRG) model by Zhang et al. (2018), which combines a retrieval component and a generation component, and the Retrieval-augmented Generative Response (RaGR) model by Zhang et al. (2019), which uses a retrieval-augmented decoding algorithm. For machine translation, the authors men

In [65]:
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

13368

In [66]:
from google.colab import files
files.download("groundtruth_eval_dataset.csv")

### Evaluating RAG Pipelines

If you skipped ahead and need to load the `.csv` directly, run the cell below.

If you're using Colab for this notebook, please make sure the file is added to your session files.

In [67]:
from datasets import Dataset
eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

Generating train split: 0 examples [00:00, ? examples/s]

In [68]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.
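
Before spending GPU time on an evaluation run, it can be worth checking that every row has the expected shape. The sketch below assumes the column names used in this notebook (`question`, `answer`, `contexts`, `ground_truths`); other RAGAS versions may expect different names, and `validate_ragas_row` is a hypothetical helper rather than a RAGAS function.

```python
REQUIRED_COLUMNS = {
    "question": str,
    "answer": str,
    "contexts": list,       # list of retrieved chunk texts
    "ground_truths": list,  # list of reference answers
}


def validate_ragas_row(row):
    """Return a list of problems found in one evaluation row (empty if none)."""
    problems = []
    for column, expected_type in REQUIRED_COLUMNS.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"{column} should be {expected_type.__name__}")
        elif expected_type is list and not all(isinstance(x, str) for x in row[column]):
            problems.append(f"{column} should contain only strings")
    return problems


good_row = {
    "question": "What is RAG?",
    "answer": "A retrieval augmented model.",
    "contexts": ["RAG pairs a retriever with a generator."],
    "ground_truths": ["Retrieval augmented generation."],
}
print(validate_ragas_row(good_row))
print(validate_ragas_row({"question": "What is RAG?", "contexts": "not a list"}))
```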

In [69]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)

from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"].content,
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        context_relevancy,
        answer_correctness,
        answer_similarity
    ],
  )
  return result

Let's create our dataset first:

In [70]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

100%|██████████| 9/9 [01:12<00:00,  8.11s/it]


In [71]:
basic_qa_ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 9
})

In [72]:
basic_qa_ragas_dataset[0]

{'question': "In the context of 'A Survey on Retrieval-Augmented Text Generation' by Huayang Li et al., what are some notable approaches to retrieval-augmented text generation for dialogue response generation, machine translation, and other generation tasks?",
 'answer': "Answer: According to the context provided, some notable approaches to retrieval-augmented text generation for dialogue response generation include the one proposed by Wei et al. (2018), as mentioned in the paper. For machine translation, the paper cites the approach proposed by Gu et al. (2018). The paper also mentions the approach proposed by Hashimoto et al. (2018) for other generation tasks. However, the specific details of these approaches are not provided in the context given, as the focus is on summarizing the paper's overview of retrieval-augmented text generation. If you need more information about these approaches, you may need to refer to the original papers or consult other sources.",
 'contexts': ['when re

Save it for later:

In [73]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

23538

In [75]:
from google.colab import files
files.download("basic_qa_ragas_dataset.csv")

In [115]:
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM
inference_server_url = "https://proud-colts-film.loca.lt/v1"

# create vLLM Langchain instance
chat = ChatOpenAI(
    model="TheBloke/zephyr-7B-beta-AWQ",
    openai_api_key="api-key",
    openai_api_base=inference_server_url,
    max_tokens=4096,
    temperature=0.2,
)


# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLM(llm=chat)

In [93]:
from ragas.metrics import (
    context_precision,
    faithfulness,
    context_recall,
)
from ragas import evaluate
from ragas.metrics.critique import harmfulness

# change the LLM

faithfulness.llm = vllm
context_precision.llm = vllm
context_recall.llm = vllm
harmfulness.llm = vllm
answer_relevancy.llm = vllm
context_relevancy.llm = vllm
answer_correctness.llm = vllm
answer_similarity.llm = vllm
# evaluate

And finally - evaluate how it did!

In [123]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

In [None]:
basic_qa_result

### Testing Other Retrievers

Now we can test how changing our Retriever impacts our RAGAS evaluation!

We'll build this simple qa_chain factory to create standardized qa_chains where the only component that differs is the retriever.

In [94]:
def create_qa_chain(retriever):
  primary_qa_llm = chat
  created_qa_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question")
    }
    | RunnablePassthrough.assign(
        context=itemgetter("context")
      )
    | {
         "response": prompt | primary_qa_llm,
         "context": itemgetter("context"),
      }
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways we can imagine improving a retriever is to split our documents into small chunks, embed those chunks, and then at query time retrieve a significant amount of additional context that "surrounds" the found chunk.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

The basic outline of this retrieval method is as follows:

1. Obtain User Question
2. Retrieve child documents using Dense Vector Retrieval
3. Merge the child documents based on their parents: children that share a parent are merged.
4. Replace the child documents with their respective parent documents from an in-memory-store.
5. Use the parent documents to augment generation.
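
The steps above can be sketched without LangChain to show the child-to-parent bookkeeping. This is a minimal illustration, not the `ParentDocumentRetriever` implementation: the helper names (`split_into_chunks`, `overlap_score`, `build_index`, `retrieve_parents`) are hypothetical, and a toy word-overlap score stands in for dense vector similarity.

```python
def split_into_chunks(text, chunk_size):
    """Split text into consecutive character windows of at most chunk_size."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def overlap_score(query, chunk):
    """Toy relevance score: how many query words appear in the chunk.
    Assumption: this stands in for dense vector similarity."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def build_index(parent_docs, child_size):
    """Store parents by id and index every child chunk with its parent id."""
    docstore = {}      # plays the role of the in-memory store
    child_index = []   # plays the role of the vector store
    for parent_id, parent_text in enumerate(parent_docs):
        docstore[parent_id] = parent_text
        for child in split_into_chunks(parent_text, child_size):
            child_index.append((child, parent_id))
    return docstore, child_index


def retrieve_parents(query, docstore, child_index, k=4):
    # Step 2: rank the child chunks against the question and keep the top k.
    ranked = sorted(child_index,
                    key=lambda pair: overlap_score(query, pair[0]),
                    reverse=True)
    # Step 3: merge children that share a parent, keeping rank order.
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Step 4: replace the children with their parents from the store.
    return [docstore[parent_id] for parent_id in parent_ids]


parents = [
    "Retrieval augmented generation combines a retriever with a generator. "
    "The retriever finds passages and the generator conditions on them.",
    "BM25 is a sparse ranking function based on term frequency. "
    "It is often used as a baseline retriever.",
]
docstore, child_index = build_index(parents, child_size=60)
results = retrieve_parents("what is retrieval augmented generation",
                           docstore, child_index, k=2)
```

Note that the number of parents returned can be smaller than `k`, because several matching children may collapse into the same parent.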

In [95]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(collection_name="split_parents", embedding_function=HuggingFaceEmbeddings())

store = InMemoryStore()

In [96]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [97]:
parent_document_retriever.add_documents(base_docs)

Let's create, test, and then evaluate our new chain!

In [98]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [99]:
parent_document_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'Answer: RAG, or Retrieval Augmented Generation, is a methodology that aims to alleviate the issue of outdated information in large language models (LLMs) by augmenting their input with retrieved information, such as relevant tools, through three primary components: Tool Retrieval, Plan Generation, and Execution. (Context from: "In this study, we focus on enhancing tool retrieval, with the goal of achieving subsequent improvements in plan generation.")'

In [100]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

100%|██████████| 9/9 [01:36<00:00, 10.71s/it]


In [101]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

38246

In [102]:
from google.colab import files
files.download("pdr_qa_ragas_dataset.csv")

In [124]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

In [None]:
pdr_qa_result

#### Ensemble Retrieval

Next let's look at ensemble retrieval!

You can read more about this [here](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)!

The basic idea is as follows:

1. Obtain User Question
2. Hit the Retriever Pair
    - Retrieve Documents with BM25 Sparse Vector Retrieval
    - Retrieve Documents with Dense Vector Retrieval Method
3. Collect and "fuse" the retrieved docs based on their weighting using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm into a single ranked list.
4. Use those documents to augment our generation.

Ensure your `weights` list - the relative weighting of each retriever - sums to 1!
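
To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. It assumes each retriever contributes `weight / (c + rank)` for every document it returns, with `c = 60` as the smoothing constant; the function name `weighted_rrf` and the document ids are hypothetical, and the exact formula inside `EnsembleRetriever` may differ in detail.

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document's score is the sum, over the lists it appears in,
    of weight / (c + rank), where rank starts at 1.
    """
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a sparse (BM25) and a dense retriever.
bm25_ranking = ["doc_a", "doc_b"]
dense_ranking = ["doc_b", "doc_c", "doc_a"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.75, 0.25])
```

With the weights at `[0.75, 0.25]` the BM25 favourite `doc_a` ends up first; with equal weights `doc_b` overtakes it, which is why the choice of weights matters when the two retrievers disagree.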

In [103]:
!pip install -q -U rank_bm25

In [104]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=75)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

embedding = HuggingFaceEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.75, 0.25])

In [105]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [106]:
ensemble_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'RAG, mentioned in the context, is a retrieval augmented generation model proposed in the paper "Hybrid Retrieval-Augmented Generation for Real-time Composition Assistance" by Xuchao Zhang, Menglin Xia, Camille Couturier, Guoqing Zheng, Saravan Rajmohan, and Victor Ruhle. It is a language model that improves contextual understanding, integrates private data, and reduces hallucination by incorporating retrieval-augmented memory generated asynchronously by a Large Language Model (LLM) in the cloud. The client model is capable of delivering real-time responses to user requests without the need to wait for memory synchronization from the cloud. The paper compares the performance of RAG with other models, including HybridRAG, GPT3 zero-shot, and Vanilla OPT, on Wikitext and Pile subsets.'

In [107]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 9/9 [01:55<00:00, 12.79s/it]


In [108]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

33712

In [110]:
from google.colab import files
files.download("ensemble_qa_ragas_dataset.csv")

In [None]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [01:01<00:00, 61.76s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:08<00:00, 68.62s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:05<00:00,  5.37s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:11<00:00, 11.67s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [01:02<00:00, 62.45s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:08<00:00,  9.00s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.57it/s]


In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

### Conclusion

Observe your results in a table!

In [None]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.4000, 'answer_relevancy': 0.9535, 'context_recall': 1.0000, 'context_relevancy': 0.0559, 'answer_correctness': 0.6167, 'answer_similarity': 1.0000}

In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

We can also zoom in on each result and find specific information about each of the questions and answers.

In [None]:
ensemble_qa_result_df = ensemble_qa_result.to_pandas()

In [None]:
ensemble_qa_result_df

| Unnamed: 0 | question | contexts | answer | ground_truths | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the focus of this paper? | [has to make an important career decision.\nNe... | The focus of this paper is on a framework call... | [The focus of this paper is on retrieval-augme... | 1.0 | 0.666667 | 0.784617 | 1.0 | 0.0 | 0.5 | True |
| 1 | What is the title of the paper? | [of War. The game was released worldwide in\nG... | Title: Self-RAG: Learning to Retrieve, Generat... | [The title of the paper is 'A Survey on Retrie... | 0.5 | 1.0 | 0.976911 | 1.0 | 0.0 | 0.5 | True |
| 2 | What is the aim of this paper? | [has to make an important career decision.\nNe... | The aim of this paper is to introduce a new fr... | [The aim of this paper is to conduct a compreh... | 1.0 | 0.333333 | 0.800732 | 1.0 | 0.078947 | 0.75 | True |
| 3 | What is the main focus of the paper 'A Survey ... | [A Survey on Retrieval-Augmented Text Generati... | The main focus of the paper 'A Survey on Retri... | [The main focus of the paper 'A Survey on Retr... | 0.679167 | 1.0 | 0.982435 | 0.8 | 0.017857 | 1.0 | True |
| 4 | What is the main focus of this paper? | [example of completions of the prompt by diffe... | I don't know. | [The main focus of this paper is to conduct a ... | 1.0 | 0.0 | 0.742601 | 1.0 | 0.0 | 0.5 | True |
| 5 | What is the main focus of this paper? | [example of completions of the prompt by diffe... | I don't know. | [The main focus of this paper is to conduct a ... | 1.0 | 0.0 | 0.742601 | 1.0 | 0.0 | 0.5 | True |
| 6 | What are the advantages of retrieval-augmented... | [attracted increasing attention of the compu-\... | The advantages of retrieval-augmented text gen... | [The advantages of retrieval-augmented text ge... | 1.0 | 1.0 | 0.968712 | 1.0 | 0.025641 | 1.0 | True |
| 7 | What is the main focus of the paper 'A Survey ... | [A Survey on Retrieval-Augmented Text Generati... | The main focus of the paper 'A Survey on Retri... | [The main focus of the paper 'A Survey on Retr... | 0.679167 | 1.0 | 0.982422 | 1.0 | 0.017857 | 1.0 | True |
| 8 | What are the advantages of retrieval-augmented... | [attracted increasing attention of the compu-\... | The advantages of retrieval-augmented text gen... | [The advantages of retrieval-augmented text ge... | 1.0 | 1.0 | 0.968731 | 1.0 | 0.025641 | 1.0 | True |
| 9 | What are the advantages of retrieval-augmented... | [attracted increasing attention of the compu-\... | The advantages of retrieval-augmented text gen... | [The advantages of retrieval-augmented text ge... | 1.0 | 1.0 | 0.968692 | 1.0 | 0.025641 | 1.0 | True |


We'll also look at combining the results and looking at them in a single table so we can make inferences about them!

In [None]:
def create_df_dict(pipeline_name, pipeline_items):
  df_dict = {"name" : pipeline_name}
  for name, score in pipeline_items:
    df_dict[name] = score
  return df_dict

In [None]:
basic_rag_df_dict = create_df_dict("basic_rag", basic_qa_result.items())

In [None]:
pdr_rag_df_dict = create_df_dict("pdr_rag", pdr_qa_result.items())

In [None]:
ensemble_rag_df_dict = create_df_dict("ensemble_rag", ensemble_qa_result.items())

In [None]:
results_df = pd.DataFrame([basic_rag_df_dict, pdr_rag_df_dict, ensemble_rag_df_dict])

In [None]:
results_df.sort_values("answer_correctness", ascending=False)

| Unnamed: 0 | name | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|
| 2 | ensemble_rag | 0.885833 | 0.7 | 0.891845 | 0.98 | 0.019158 | 0.775 | 1.0 |
| 0 | basic_rag | 0.5 | 0.4 | 0.953475 | 1.0 | 0.055904 | 0.616667 | 1.0 |
| 1 | pdr_rag | 0.697222 | 0.35 | 0.943909 | 1.0 | 0.013386 | 0.6 | 1.0 |


### ❓QUESTION❓

What conclusions can you draw from the above results?

Describe in your own words what the metrics are expressing.

In [None]:
retrieval_augmented_qa_chain = (
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        # Pass only the question string to the prompt, not the whole input dict
        'question': itemgetter('question')
    }) | {
        'response': prompt | primary_qa_llm | parser,
        'context': itemgetter('context')
    }
)