# Retrieval Augmented Generation (RAG)



![RAG](https://miro.medium.com/v2/resize:fit:1200/0*zMIZgSKLG7uPb7eQ.png "Retrieval Augmented Generation")
<small> Image source: https://miro.medium.com/v2/resize:fit:1200/0*zMIZgSKLG7uPb7eQ.png </small>

In this scenario you are going to build a Retrieval Augmented Generation pipeline for the [question answering task](https://paperswithcode.com/task/question-answering). You will learn:
* What are the parts of the RAG pipeline?
* How to implement the pipeline using the [dspy](https://github.com/stanfordnlp/dspy) framework?
* What can be done to automate the evaluation of your solution?

The advent of Large Language Models (LLMs) has brought remarkable progress on many well-known NLP tasks. One such task is Question Answering: the input is a question in natural language and the expected output is the answer to that question.
LLMs achieve very good results in zero-shot and few-shot scenarios, i.e., when the model is given zero or only a few annotated examples. However, due to the nature of text generation, LLMs sometimes output confident but incorrect answers. This phenomenon is called [hallucination](https://arxiv.org/pdf/2311.05232.pdf).
One way to reduce hallucination is to enrich the question with paragraphs of text that contain the answer. The model can then use the paragraphs in the prompt to produce a correct answer tailored to the use case. This technique is called [Retrieval Augmented Generation](https://arxiv.org/pdf/2005.11401.pdf).
While the technique can be trained end to end, due to time and resource constraints we are going to focus on the zero-shot variant based on a pretrained retriever and a pretrained generator. This approach is often a very good first candidate in real use cases.
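The core idea, placing retrieved paragraphs in front of the question before it reaches the model, can be sketched in plain Python. The function name `build_rag_prompt`, the prompt layout, and the example passage are illustrative assumptions; dspy assembles its prompts on its own and in a different format.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Prepend numbered context passages to the question (illustrative layout)."""
    # Number the passages so they can be told apart in the prompt.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical retrieved passage, written for this example only.
prompt = build_rag_prompt(
    "Which NFL team represented the AFC at Super Bowl 50?",
    ["The AFC champion Denver Broncos played in Super Bowl 50."],
)
print(prompt)
```

The generator only sees this enriched prompt, so the quality of the answer depends directly on which passages the retriever selects.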

This laboratory scenario uses [dspy](https://github.com/stanfordnlp/dspy), a framework by Stanford NLP. According to the documentation, "DSPy is a framework for algorithmically optimizing LM prompts and weights, especially when LMs are used one or more times within a pipeline." The framework enables rapid development of pipelines based on pretrained models, and has an internal compiler that optimizes intermediate prompts to maximize the desired result. In this laboratory we are going to focus on the construction of dspy pipelines and are not going to use the compiler. Please see [here](https://github.com/stanfordnlp/dspy?tab=readme-ov-file#4b-asking-dspy-to-automatically-optimize-your-program-with-dspyteleprompt) if you are interested in automated optimization.

In [3]:
! pip install cohere boto3 dspy sentence-transformers



In [4]:
import json
import os
from typing import Any, Optional

import boto3
import numpy as np
import dspy
from dspy.dsp.utils import dotdict
from dspy import (
    Example,
    InputField,
    Module,
    OutputField,
    Predict,
    Prediction,
    Retrieve,
    Signature,
    context,
)
from dspy.datasets import DataLoader
from dspy.evaluate import Evaluate
from sentence_transformers import SentenceTransformer



In [5]:
# Do not commit real API keys to a notebook; paste your own key here or export it in the shell.
os.environ["COHERE_API_KEY"] = "<YOUR_COHERE_API_KEY>"

# Data

We are going to work with the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset. To limit the use of resources, only a subset of 10 questions will be used.

The retrieval corpus consists of the context paragraphs of the first 100 questions in the validation split.
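In SQuAD several questions share the same context paragraph, so a list of contexts collected question by question contains repeats, and a top-k retriever may then return the same paragraph more than once. A minimal sketch of order-preserving deduplication follows; the helper name `unique_in_order` and the toy data are assumptions, and the notebook itself keeps the raw list.

```python
def unique_in_order(paragraphs: list[str]) -> list[str]:
    """Drop repeated paragraphs while keeping their first-seen order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set.
    return list(dict.fromkeys(paragraphs))


# Toy stand-in for the SQuAD contexts: three questions share paragraph "A".
toy_contexts = ["A", "A", "B", "A", "C"]
print(unique_in_order(toy_contexts))  # ['A', 'B', 'C']
```

Whether to deduplicate is a design choice: it makes the k retrieved passages more diverse, but it also changes the corpus size.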


In [6]:
data = DataLoader().from_huggingface(
    dataset_name="rajpurkar/squad",
)
validation = data["validation"][:10]

In [7]:
corpus = [example.context for example in data["validation"][:100]]

We are going to work in a zero-shot scenario, i.e., the LLM will not be trained to answer our questions. The LLM used to answer the questions is Command-r-plus from Cohere.

In [8]:
lm = dspy.LM('cohere/command-r-plus', api_key=os.environ["COHERE_API_KEY"])
dspy.configure(lm=lm)

# Simple Question Answering pipeline

For reference, you are given an implementation of a vanilla question answering pipeline. It will be used as a baseline.
The pipeline sends the question to the LLM and returns the model's answer. The pipeline can be created from a dspy [Signature](https://dspy-docs.vercel.app/docs/deep-dive/signature/understanding-signatures).

In [9]:
class VanillaQuestionAnswering(Signature):
    """Answer questions with short factually correct answers."""

    question = InputField()
    answer = OutputField(desc="Answer is often short and educational.")

In [10]:
with context(lm=lm):
    qa_pipeline = Predict(VanillaQuestionAnswering)
    prediction = qa_pipeline(
        question="Which NFL team represented the AFC at Super Bowl 50?"
    )

In [11]:
prediction.answer, f"Correct answer: Denver Broncos"

('The Denver Broncos represented the AFC at Super Bowl 50.',
 'Correct answer: Denver Broncos')

# Retrieval Augmented Generation (RAG)

You will implement the Retrieval Augmented Generation pipeline. You can read more about RAG [here](https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/).
Your tasks are:
* Implement a custom [Retriever](https://dspy-docs.vercel.app/docs/deep-dive/retrieval_models_clients/custom-rm-client) (follow the interface below).
* Implement a custom [dspy.Module](https://dspy-docs.vercel.app/docs/deep-dive/modules/guide) (follow the interface below).
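A dense retriever of this kind usually ranks paragraphs by cosine similarity between the query embedding and each paragraph embedding. The sketch below shows that ranking in pure Python on hypothetical 2-d vectors; the names `cosine` and `top_k` are illustrative and not part of dspy. Note that the dot product is divided by the vector norms, not by the squared norms, so a long vector does not win only because of its length.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query: list[float], corpus: list[list[float]], k: int) -> list[int]:
    """Indices of the k corpus vectors most similar to the query, best first."""
    scores = [cosine(query, doc) for doc in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical 2-d "embeddings", for illustration only.
docs = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
print(top_k([2.0, 1.0], docs, k=2))  # [2, 0]
```

With embeddings normalized to unit length in advance, the cosine similarity reduces to a plain dot product, which is why retrievers often normalize the corpus once at construction time.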

In [42]:
class VanillaRetriever(Retrieve):
    def __init__(self, corpus: list[str], k: int = 3):
        ############### TODO ###############
        super().__init__(k=k)

        # Wrap a local sentence-transformers model as the embedding function.
        self.embedder = dspy.Embedder(
            SentenceTransformer('all-MiniLM-L6-v2').encode,
            batch_size=100,
        )

        self.corpus = list(corpus)

        # Unit-length corpus embeddings, shape (len(corpus), dim).
        self.embeddings = self._normalize_embeddings(self.embedder(self.corpus))
        ####################################

    def _normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:
        ############### TODO ###############
        # Divide each row by its L2 norm (not the squared norm).
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        return embeddings / norms
        ####################################

    def forward(self, query_or_queries: str, k: Optional[int] = None) -> Prediction:
        ############### TODO ###############
        if k is None:
            k = self.k

        # Embed and normalize the single query; shape (1, dim).
        query_embedding = self._normalize_embeddings(
            self.embedder([query_or_queries])
        )

        # Cosine similarity between the query and every paragraph; shape (len(corpus),).
        scores = (self.embeddings @ query_embedding.T).ravel()

        # Indices of the k best-scoring paragraphs, best first.
        k = min(k, len(scores))
        indices = np.argsort(scores)[::-1][:k].tolist()

        return Prediction(
            passages=[dotdict({"long_text": self.corpus[idx]}) for idx in indices]
        )
        ####################################

In [None]:
retrieval = VanillaRetriever(corpus=corpus, k=2)

In [48]:
class RAGSignature(Signature):
    """Answer questions with short answers. Output only answer."""

    context = InputField(desc="may contain relevant information.")
    question = InputField(desc="User question")
    answer = OutputField(desc="answer is often concise and educational.")

In [None]:
class RAG(Module):
    def __init__(self, corpus: list[str], num_passages: int) -> None:
        super().__init__()
        ############### TODO ###############
        # Retrieve as many passages as the caller asked for.
        self.retriever = VanillaRetriever(corpus, k=num_passages)
        self.answer = Predict(RAGSignature)
        ####################################

    def forward(self, question: str, **kwargs: Any) -> Prediction:
        ############### TODO ###############
        # Keep only the paragraph text of each retrieved passage.
        context = [p.long_text for p in self.retriever(question).passages]
        pred = self.answer(context=context, question=question)

        return Prediction(context=context, answer=pred.answer)
        ####################################

In [52]:
with context(lm=lm):
    rag_pipeline = RAG(corpus=corpus, num_passages=2)
    prediction = rag_pipeline(
        question="Which NFL team represented the AFC at Super Bowl 50?"
    )


In [None]:
prediction.answer, f"Correct answer: Denver Broncos"

('The Denver Broncos represented the AFC at Super Bowl 50.',
 'Correct answer: Denver Broncos')

# LLM as a judge

Evaluation of tasks based on text generation such as Question Answering poses a challenge. One can:
* Evaluate manually
* Develop automated metrics

Recently, a third option has emerged:
* Ask the LLM to do the evaluation for us!
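For contrast with the LLM-based option, an automated metric can be as simple as a normalized string comparison against the golden answer. The sketch below is written in the spirit of SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the function names are assumptions and this is not the official evaluation script.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golden: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(golden))


def contains_golden(prediction: str, golden: str) -> int:
    """1 if the normalized golden answer occurs inside the prediction, else 0."""
    return int(normalize(golden) in normalize(prediction))


long_answer = "The Denver Broncos represented the AFC at Super Bowl 50."
print(exact_match("The Denver Broncos.", "Denver Broncos"))  # 1
print(exact_match(long_answer, "Denver Broncos"))            # 0
print(contains_golden(long_answer, "Denver Broncos"))        # 1
```

The second call shows why string metrics are a poor fit for generative answers: a correct but verbose sentence scores 0 under exact match, which motivates asking an LLM to judge instead.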

In this section you will implement the LLM-as-a-judge evaluation of the vanilla QA and RAG pipelines. Your tasks are:
* Implement the factuality_metric, which asks the LLM to assess if the candidate answer is factually correct (follow the interface below)
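Since the judge replies in free text, its verdict has to be mapped to a number before it can be averaged over a dataset. One possible mapping is sketched below; the helper name `judgement_to_score` is hypothetical, and treating every reply that does not start with "yes" as 0 is an assumption that errs on the strict side.

```python
def judgement_to_score(judgement: str) -> int:
    """Return 1 for a verdict that starts with 'yes', 0 for anything else."""
    # Strip whitespace, then surrounding quotes, markdown asterisks and punctuation.
    cleaned = judgement.strip().strip("\"'.!*").lower()
    return int(cleaned.startswith("yes"))


for reply in ["Yes", " yes, the answers agree.", "**Yes**", "No.", ""]:
    print(repr(reply), judgement_to_score(reply))
```

Counting unparseable replies as 0 keeps the metric from overstating accuracy when the judge ignores the requested format.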

In [None]:
class LLMJudge(Signature):
    """Assess the quality of the answer along the specified criterion."""

    answer = InputField(desc="Candidate answer for the question")
    question = InputField(desc="Question to be answered")
    golden_answer = InputField(desc="The golden correct answer for the question")
    criterion = InputField(desc="criterion:")
    judgement = OutputField(
        desc="Answer Yes or No based on the criterion",
        prefix="Yes or No",
    )


judge = Predict(LLMJudge)


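def factuality_metric(example: Example, pred: Prediction) -> int:
    ############### TODO ###############
    # One possible solution: ask the judge whether the candidate answer
    # agrees with the golden answer. The criterion wording is an assumption.
    result = judge(
        answer=pred.answer,
        question=example.question,
        golden_answer=example.golden_answer,
        criterion="Is the candidate answer factually correct given the golden answer?",
    )
    # Map the Yes/No verdict to 1/0; anything other than "yes" counts as 0.
    return int(result.judgement.strip().lower().startswith("yes"))
    ####################################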

In [None]:
devset = [
    Example(question=e.question, golden_answer=e.answers["text"][0]).with_inputs(
        "question", "golden_answer"
    )
    for e in validation
]

In [None]:
with context(lm=lm):
    evaluate_pipeline = Evaluate(
        devset=devset,
        metric=factuality_metric,
        num_threads=1,
        display_progress=True,
        display_table=100,
    )
    results_rag = evaluate_pipeline(rag_pipeline)
    results_qa = evaluate_pipeline(qa_pipeline)

In [None]:
results_qa

In [None]:
results_rag