# Retrieval Augmented Generation (RAG)



![RAG](https://miro.medium.com/v2/resize:fit:1200/0*zMIZgSKLG7uPb7eQ.png "Retrieval Augmented Generation")
<small> Image source: https://miro.medium.com/v2/resize:fit:1200/0*zMIZgSKLG7uPb7eQ.png </small>

In this scenario you are going to build a Retrieval Augmented Generation pipeline for the [question answering task](https://paperswithcode.com/task/question-answering). You will learn:
* What are the parts of the RAG pipeline?
* How to implement the pipeline using the [dspy](https://github.com/stanfordnlp/dspy) framework?
* What can be done to automate the evaluation of your solution?

The advent of Large Language Models (LLMs) has brought substantial progress on many well-known NLP tasks. One such task is Question Answering, where the input is a question in natural language and the expected output is the answer to that question.
LLMs achieve very good results in zero-shot or few-shot scenarios, i.e., when the model is given zero or only a few annotated examples. However, due to the nature of text generation, LLMs sometimes output confident but incorrect answers. This phenomenon is called [hallucination](https://arxiv.org/pdf/2311.05232.pdf).
One way to reduce hallucination is to enrich the question with paragraphs of text that contain the answer. The model can then use the paragraphs in the prompt to produce a correct answer tailored to the use case. This technique is called [Retrieval Augmented Generation](https://arxiv.org/pdf/2005.11401.pdf).
While the technique can be trained end to end, due to time and resource constraints we are going to focus on the zero-shot variant based on a pretrained retriever and generator. This approach is often a very good first candidate in real use cases.
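
To make the idea concrete before any framework is involved, here is a minimal sketch of retrieval-augmented prompting. It is a toy under stated assumptions: the helper names `word_overlap` and `build_rag_prompt` are hypothetical, the two passages are made up for illustration, and plain word overlap stands in for the embedding-based retriever used later in this scenario.

```python
def word_overlap(question: str, passage: str) -> int:
    """Count distinct lowercase words shared by the question and the passage."""
    return len(set(question.lower().split()) & set(passage.lower().split()))


def build_rag_prompt(question: str, passages: list[str], k: int = 1) -> str:
    """Select the k passages with the highest overlap and prepend them to the question."""
    ranked = sorted(passages, key=lambda p: word_overlap(question, p), reverse=True)
    context_block = "\n".join(ranked[:k])
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"


# Toy passages (illustrative only, not the SQuAD corpus).
toy_passages = [
    "The Denver Broncos represented the AFC at Super Bowl 50.",
    "Gold was used to emphasize the 50th anniversary.",
]
toy_prompt = build_rag_prompt(
    "Which team represented the AFC at Super Bowl 50?", toy_passages
)
print(toy_prompt)
```

The generator only ever sees `toy_prompt`, so the quality of the answer depends on whether the retrieval step surfaced a passage that actually contains it.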

This laboratory scenario uses [dspy](https://github.com/stanfordnlp/dspy), a framework by Stanford NLP. According to the documentation, "DSPy is a framework for algorithmically optimizing LM prompts and weights, especially when LMs are used one or more times within a pipeline." The framework enables rapid development of pipelines based on pretrained models and has an internal compiler that optimizes intermediate prompts to maximize the desired result. In this laboratory we are going to focus on constructing dspy pipelines and are not going to use the compiler. Please see [here](https://github.com/stanfordnlp/dspy?tab=readme-ov-file#4b-asking-dspy-to-automatically-optimize-your-program-with-dspyteleprompt) if you are interested in automated optimization.

In [None]:
! pip install dspy-ai

In [1]:
import json
import os
from typing import Any, Optional

import boto3
import numpy as np
from dsp.utils import dotdict
from dspy import (
    Example,
    InputField,
    Module,
    OpenAI,
    OutputField,
    Predict,
    Prediction,
    Retrieve,
    Signature,
    context,
)
from dspy.datasets import DataLoader
from dspy.evaluate import Evaluate
from langchain.embeddings import OpenAIEmbeddings


In [2]:
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"

# Data

We are going to work with the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset. To limit the use of resources, only a subset of 30 questions will be used.

The corpus consists of the context paragraphs of the first 100 questions in the validation split.


In [3]:
data = DataLoader().from_huggingface(
    dataset_name="rajpurkar/squad",
)
validation = data["validation"][:30]

In [4]:
corpus = [example.context for example in data["validation"][:100]]
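
SQuAD asks several questions about each paragraph, so a list built from one context per question will likely contain the same passage more than once, and a top-k retriever may then return identical passages. The sketch below shows one way to drop repeats while keeping the original order. The helper name `deduplicate` and the toy strings are assumptions for illustration, and the results reported in this notebook were produced without this step.

```python
def deduplicate(passages: list[str]) -> list[str]:
    """Drop repeated passages, keeping the first occurrence of each in order."""
    # dict preserves insertion order, so this keeps the first copy of every passage.
    return list(dict.fromkeys(passages))


# Toy stand-in for a list of context paragraphs with repeats.
toy_contexts = ["paragraph A", "paragraph A", "paragraph B", "paragraph A"]
unique_contexts = deduplicate(toy_contexts)
print(unique_contexts)
```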

We are going to work in a zero-shot scenario, i.e., the LLMs will not be trained to answer our questions. The LLM used to answer the questions is GPT-3.5 Turbo.
GPT-4 will be used during the evaluation.

In [5]:
gpt35turbo = OpenAI(model='gpt-3.5-turbo-1106', max_tokens=300)
gpt4 = OpenAI(model='gpt-4', max_tokens=300)

# Simple Question Answering pipeline

For reference, you are given an implementation of a vanilla question answering pipeline, which will be used as a baseline.
The pipeline sends the question to the LLM and returns the model's answer. It is defined by a dspy [Signature](https://dspy-docs.vercel.app/docs/deep-dive/signature/understanding-signatures) and run with `Predict`.

In [6]:
class VanillaQuestionAnswering(Signature):
    """Answer questions with short factually correct answers."""

    question = InputField()
    answer = OutputField(desc="Answer is often short and educational.")

In [7]:
with context(lm=gpt35turbo):
    qa_pipeline = Predict(VanillaQuestionAnswering)
    prediction = qa_pipeline(
        question="Which NFL team represented the AFC at Super Bowl 50?"
    )

In [8]:
prediction.answer

'Denver Broncos.'

# Retrieval Augmented Generation (RAG)

You will implement the Retrieval Augmented Generation pipeline. You can read more about RAG [here](https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/).
Your tasks are:
* Implement a custom [Retriever](https://dspy-docs.vercel.app/docs/deep-dive/retrieval_models_clients/custom-rm-client) (follow the interface below).
* Implement a custom [dspy.Module](https://dspy-docs.vercel.app/docs/deep-dive/modules/guide) (follow the interface below).

In [9]:
class VanillaRetriever(Retrieve):
    def __init__(self, corpus: list[str], k: int = 3):
        super().__init__(k=k)
        self.corpus = corpus
        self.embedder = OpenAIEmbeddings(model="text-embedding-3-large")
        self.embeddings = self._normalize_embeddings(
            np.array(self.embedder.embed_documents(texts=corpus))
        )

    def _normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:
        if len(embeddings.shape) == 1:
            embeddings = np.expand_dims(embeddings, axis=0)
        return embeddings / np.expand_dims(np.linalg.norm(embeddings, axis=1), axis=1)

    def forward(self, query_or_queries: str, k: Optional[int] = None) -> Prediction:
        k = k if k is not None else self.k
        embedding = self._normalize_embeddings(
            np.array(self.embedder.embed_query(query_or_queries))
        )

        similarity = (self.embeddings @ embedding.T).T
        top_k_indices = np.argsort(similarity.ravel())[::-1][:k].tolist()

        return Prediction(
            passages=[dotdict({"long_text": self.corpus[idx]}) for idx in top_k_indices]
        )
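
The retriever above normalizes every embedding to unit length, so the matrix product in `forward` yields cosine similarities, and `argsort` then picks the `k` most similar passages. The following dependency-free sketch spells out the same ranking on tiny two-dimensional vectors. The function names and the toy vectors are assumptions for illustration and do not come from dspy or from real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(doc_vectors: list[list[float]], query: list[float], k: int) -> list[int]:
    """Return the indices of the k documents most similar to the query."""
    scores = [cosine_similarity(doc, query) for doc in doc_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy "embeddings": vector length does not matter, only direction does.
toy_docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_query = [2.0, 1.0]
toy_top = top_k_indices(toy_docs, toy_query, k=2)
print(toy_top)
```

Normalizing the corpus embeddings once in `__init__`, as the retriever does, avoids recomputing the document norms for every query.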

In [10]:
retrieval = VanillaRetriever(corpus=corpus, k=2)



In [11]:
class RAGSignature(Signature):
    """Answer questions with short answers. Output only answer."""

    context = InputField(desc="may contain relevant information.")
    question = InputField(desc="User question")
    answer = OutputField(desc="answer is often concise and educational.")

In [12]:
class RAG(Module):
    def __init__(self, corpus: list[str], num_passages: int) -> None:
        super().__init__()

        self.retrieve = VanillaRetriever(corpus=corpus, k=num_passages)
        self.generate_answer = Predict(RAGSignature)
    
    def forward(self, question: str, **kwargs: Any) -> Prediction:
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return Prediction(context=context, answer=prediction.answer)

In [13]:
with context(lm=gpt35turbo):
    rag_pipeline = RAG(corpus=corpus, num_passages=2)
    prediction = rag_pipeline(
        question="Which NFL team represented the AFC at Super Bowl 50?"
    )

In [14]:
prediction.answer

'Denver Broncos'

# LLM as a judge

Evaluation of tasks based on text generation poses a challenge. One can:
* Evaluate manually
* Develop automated metrics

Recently, a third option has emerged:
* Ask the LLM to do the evaluation for us!
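
For comparison, an automated metric can be as simple as a normalized exact match. The sketch below is loosely modeled on the normalization commonly applied to SQuAD answers (lowercasing, removing punctuation and articles, collapsing whitespace). The function names are assumptions for illustration and the metric is not used elsewhere in this scenario.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golden: str) -> int:
    """Return 1 if both answers are identical after normalization, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(golden))


print(exact_match("The Denver Broncos", "Denver Broncos"))
print(exact_match("The Denver Broncos won Super Bowl 50.", "Denver Broncos"))
```

The second call scores 0 even though the answer is correct, because exact match penalizes verbose answers. This strictness is one motivation for asking an LLM to judge whether the candidate contains the golden answer.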

In this section you will implement an LLM-as-a-judge evaluation of the vanilla QA and RAG pipelines. Your task is:
* Implement the `factuality_metric`, which asks the LLM to assess whether the candidate answer is factually correct (follow the interface below).

In [15]:
class LLMJudge(Signature):
    """Assess the quality of the answer along the specified criterion."""

    answer = InputField(desc="Candidate answer for the question")
    question = InputField(desc="Question to be answered")
    golden_answer = InputField(desc="The golden correct answer for the question")
    criterion = InputField(desc="criterion:")
    judgement = OutputField(
        desc="Answer Yes or No based on the criterion",
        prefix="Yes or No",
    )


judge = Predict(LLMJudge)


def factuality_metric(example: Example, pred: Prediction) -> int:
    factual = (
        "Does the candidate answer to the question contain the golden correct answer?"
    )
    assessment = judge(
        answer=pred.answer,
        question=example.question,
        golden_answer=example.golden_answer,
        criterion=factual,
    )
    return int("yes" == assessment.judgement.lower())
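
The metric above compares the judgement to `"yes"` exactly, so a reply such as `"Yes."` or `" yes"` would be scored 0. If the judge turns out to vary its formatting, a more tolerant parser could be used instead. The sketch below is one possible approach: `parse_verdict` is a hypothetical helper, not part of dspy, and the results reported in this notebook were produced with the strict comparison.

```python
import re


def parse_verdict(judgement: str) -> int:
    """Return 1 if the reply starts with the word 'yes' (any case), else 0."""
    match = re.match(r"\s*(yes|no)\b", judgement, flags=re.IGNORECASE)
    return int(match is not None and match.group(1).lower() == "yes")


for reply in ["Yes", "yes.", " Yes, it does", "No", "Yesterday", ""]:
    print(repr(reply), parse_verdict(reply))
```

Anything that does not clearly start with "yes" is treated as a negative verdict, which keeps the metric conservative.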

In [16]:
devset = [
    Example(question=e.question, golden_answer=e.answers["text"][0]).with_inputs(
        "question", "golden_answer"
    )
    for e in validation
]

In [17]:
with context(lm=gpt4):
    evaluate_pipeline = Evaluate(
        devset=devset,
        metric=factuality_metric,
        num_threads=1,
        display_progress=True,
        display_table=100,
    )
    results_rag = evaluate_pipeline(rag_pipeline)
    results_qa = evaluate_pipeline(qa_pipeline)



Average Metric: 29 / 30  (96.7%)


| # | question | golden_answer | context | answer | factuality_metric |
|---|---|---|---|---|---|
| 0 | Which NFL team represented the AFC at Super Bowl 50? | Denver Broncos | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | The Denver Broncos | 1 |
| 1 | Which NFL team represented the NFC at Super Bowl 50? | Carolina Panthers | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | The Carolina Panthers represented the NFC at Super Bowl 50. | 1 |
| 2 | Where did Super Bowl 50 take place? | Santa Clara, California | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | Super Bowl 50 took place at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. | 1 |
| 3 | Which NFL team won Super Bowl 50? | Denver Broncos | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | The Denver Broncos won Super Bowl 50. | 1 |
| 4 | What color was used to emphasize the 50th anniversary of the Super Bowl? | gold | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | Gold | 1 |
| 5 | What was the theme of Super Bowl 50? | "golden anniversary" | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | The theme of Super Bowl 50 was the "golden anniversary". | 1 |
| 6 | What day was the game played on? | February 7, 2016 | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | The game was played on February 7, 2016. | 1 |
| 7 | What is the AFC short for? | American Football Conference | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | American Football Conference | 1 |
| 8 | What was the theme of Super Bowl 50? | "golden anniversary" | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | The theme of Super Bowl 50 was the "golden anniversary". | 1 |
| 9 | What does AFC stand for? | American Football Conference | [{'long_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American... | American Football Conference | 1 |



Average Metric: 22 / 30  (73.3%)





| # | question | golden_answer | answer | factuality_metric |
|---|---|---|---|---|
| 0 | Which NFL team represented the AFC at Super Bowl 50? | Denver Broncos | The Denver Broncos represented the AFC at Super Bowl 50. | 1 |
| 1 | Which NFL team represented the NFC at Super Bowl 50? | Carolina Panthers | The Carolina Panthers represented the NFC at Super Bowl 50. | 1 |
| 2 | Where did Super Bowl 50 take place? | Santa Clara, California | Super Bowl 50 took place at Levi's Stadium in Santa Clara, California. | 1 |
| 3 | Which NFL team won Super Bowl 50? | Denver Broncos | The Denver Broncos won Super Bowl 50. | 1 |
| 4 | What color was used to emphasize the 50th anniversary of the Super Bowl? | gold | Gold was used to emphasize the 50th anniversary of the Super Bowl. | 1 |
| 5 | What was the theme of Super Bowl 50? | "golden anniversary" | The theme of Super Bowl 50 was "gold" to commemorate the game's golden anniversary. | 1 |
| 6 | What day was the game played on? | February 7, 2016 | The information provided does not specify the day the game was played on. | 0 |
| 7 | What is the AFC short for? | American Football Conference | AFC stands for Asian Football Confederation. | 0 |
| 8 | What was the theme of Super Bowl 50? | "golden anniversary" | The theme of Super Bowl 50 was "gold" to commemorate the game's golden anniversary. | 1 |
| 9 | What does AFC stand for? | American Football Conference | AFC stands for Asian Football Confederation. | 0 |


In [18]:
results_qa

73.33

In [19]:
results_rag

96.67
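
The two scores are consistent with the share of positive judgements expressed as a percentage and rounded to two decimals: 29 of 30 for RAG and 22 of 30 for the vanilla pipeline. The short check below reproduces the arithmetic. The helper `average_metric` is a hypothetical name, and the position of the zeros in the toy score lists is arbitrary.

```python
def average_metric(scores: list[int]) -> float:
    """Percentage of examples judged correct, rounded to two decimals."""
    return round(100 * sum(scores) / len(scores), 2)


# Per-example judgements reconstructed from the reported totals only.
rag_scores = [1] * 29 + [0] * 1
qa_scores = [1] * 22 + [0] * 8

print(average_metric(rag_scores))
print(average_metric(qa_scores))
```

With only 30 questions, each example is worth about 3.3 percentage points, so small differences between pipelines should be read with caution.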