# Information Retrieval and Text Mining Course - Retrieval Augmented Generation (RAG) Tutorial
Authors: Abderrahmane Issam and Jan Scholtes


Version 2024-2025


In this notebook we will learn how to implement a Retrieval Augmented Generation (RAG) pipeline and evaluate its performance. At the beginning of the tutorial, we demonstrate how to use [Distilabel](https://github.com/argilla-io/distilabel) to annotate Wikipedia documents and use them for fine-tuning ColBERT. [RAGatouille](https://github.com/AnswerDotAI/RAGatouille) offers a very simple API for fine-tuning and using ColBERT, so we will use it in this tutorial as well. For the RAG part, we will use [DSPy](https://github.com/stanfordnlp/dspy), a framework for programming language models.

## Setup

In [None]:
!pip install --upgrade distilabel

In [None]:
!pip install ragatouille

In [None]:
!pip install dspy

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
!pip install -U bm25s PyStemmer

## Transforming Unstructured Data to a Structured Dataset

In this part of the tutorial, we will transform Wikipedia documents into a dataset that we can use for fine-tuning ColBERT. The process starts with retrieving the documents, then chunking them, annotating them using an LLM, and finally fine-tuning ColBERT.

We will retrieve the following 3 Wikipedia pages.

In [None]:
from ragatouille.utils import get_wikipedia_page

my_full_corpus = [get_wikipedia_page("Hayao_Miyazaki")]
my_full_corpus += [get_wikipedia_page("Studio_Ghibli")]
my_full_corpus += [get_wikipedia_page("Toei_Animation")]

A Wikipedia document often contains many different kinds of information about a topic, and it is usually more than we can afford to feed to an LLM. Although some LLMs support a very large context window, feeding large documents into the model requires more computational resources. Furthermore, passing a full document when only one segment is needed to answer the prompt might end up confusing the model.

In the following code, we use the RAGatouille `CorpusProcessor` to chunk the corpus into multiple segments of 180 tokens. By default, it makes sure that the chunks overlap to prevent losing important context.

In [None]:
from ragatouille.data import CorpusProcessor, llama_index_sentence_splitter

corpus_processor = CorpusProcessor(document_splitter_fn=llama_index_sentence_splitter)
documents = corpus_processor.process_corpus(my_full_corpus, chunk_size=180)

len(documents)
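
To make the idea of overlapping chunks concrete, here is a minimal sketch in plain Python. It is not RAGatouille's implementation (which delegates to a sentence splitter); the helper name `chunk_with_overlap` and the `overlap` parameter are hypothetical, and whitespace-style "tokens" stand in for a real tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=180, overlap=20):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens.

    Hypothetical helper for illustration only, not part of RAGatouille.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy example: ten fake tokens, windows of 4 that share 1 token.
words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary is still seen in full by at least one chunk when the overlap is large enough.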

An example document:

In [None]:
documents[0]

Distilabel expects a HuggingFace dataset with a text column `anchor`.

In [None]:
import pandas as pd
from datasets import Dataset

df = pd.DataFrame.from_dict(documents)

dataset = Dataset.from_pandas(df)
dataset = dataset.rename_column("content", "anchor")

df.head()

We will run a small LLM locally through the transformers library for demo purposes, but distilabel can also be used with paid LLM APIs such as OpenAI or Anthropic.

In [None]:
from distilabel.models import TransformersLLM

llm = TransformersLLM(model="microsoft/Phi-3-mini-4k-instruct")

`GenerateSentencePair` is a component for generating datasets for embedding models, covering retrieval, reranking, and feature extraction. To fine-tune ColBERT we only need queries, which is why we set `action` to `"query"`. Other supported actions are `"paraphrase"`, `"semantically-similar"`, and `"answer"`. If we need both positive and negative examples, we can set `triplet` to `True`.

Distilabel will create a prompt based on these parameters and extract the LLM answer for us, which in this case is the query. The context is a description of our corpus that will be included in the prompt as well.

In [None]:
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import GenerateSentencePair

context = (
"""
The text is a chunk from wikipedia pages that we want to use for fine-tuning a retrieval model.
"""
)

with Pipeline(name="generate") as pipeline:
    generate_retrieval_pairs = GenerateSentencePair(
        name="generate_retrieval_pairs",
        triplet=False,
        action="query",
        llm=llm,
        input_batch_size=10,
        context=context,
    )

The following code takes approximately 11 minutes to finish.

In [None]:
distiset = pipeline.run(dataset=dataset, use_cache=False)

You can check the `distilabel_metadata` column to see the full prompt and model output for each anchor:

In [None]:
df = distiset['default']['train'].to_pandas()
df

We instantiate `RAGTrainer`:

In [None]:
from ragatouille import RAGTrainer

trainer = RAGTrainer(model_name="GhibliColBERTv2.0", pretrained_model_name="colbert-ir/colbertv2.0")

We pass query-anchor pairs from the dataset to the trainer.

In [None]:
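# Each pair is [generated query, passage it was generated from].
pairs = [[q, doc] for q, doc in zip(df.positive, df.anchor)]

trainer.prepare_training_data(
    raw_data=pairs,
    # Pass plain strings: `documents` is a list of dicts with a "content" field.
    all_documents=[doc["content"] for doc in documents],
    num_new_negatives=10,
    mine_hard_negatives=True,
)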

### Exercise 1
Explain what `mine_hard_negatives` does.

Answer here.

We fine-tune the model for a maximum of 1000 steps. You can change this number or play with other hyperparameters if you like.

In [None]:
finetuned_colbert_path = trainer.train(maxsteps=1000)
finetuned_colbert_path

Now it is time to compare our fine-tuned model to the original ColBERT model on a few queries. This is more of a qualitative analysis to showcase the effect of fine-tuning. In a real-world setup, it is important to have a dedicated test dataset for evaluation.

The fine-tuning we did is often referred to as domain adaptation, where we take a general-purpose retrieval model and adapt it to a specific domain. This often leads to better performance on our domain but loses some of the generalization capabilities of the original model.

In [None]:
from ragatouille import RAGPretrainedModel
colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

With RAGatouille, we can index our dataset as follows:

In [None]:
colbert.index(collection=[doc['content'] for doc in documents], document_ids=["docno"+str(i) for i in range(len(documents))], split_documents=False)

We use `search` to query the index:

In [None]:
colbert.search(query = "what's studio ghibli's most famous movie?", k=3)

Indexing makes retrieval more efficient, but it is also time consuming. In our case the dataset is small, so indexing offers no advantage over keeping our encodings in memory. We can use `encode` to skip the indexing step:

In [None]:
colbert.encode([x['content'] for x in documents], document_metadatas=[{"about": "ghibli"} for _ in range(len(documents))])

To search the encoded documents we use `search_encoded_docs` instead of `search`:

In [None]:
colbert.search_encoded_docs(query="what's studio ghibli's most famous movie?", k=3)

### Exercise 2
Use the fine-tuned ColBERT model to encode the documents and run the previous query. Try multiple queries with the original and fine-tuned ColBERT models and describe any differences between the two.

Answer here

### Exercise 3

List 3 ways to improve the fine-tuning of the ColBERT model.

Answer here.

## Retrieval Augmented Generation (RAG)

DSPy expects an API that is running the model, and we can achieve this locally using Ollama. Ideally, you would run this on a server or use a paid LLM API (DSPy supports multiple APIs), but for learning purposes we can run Ollama in the background and use it in this notebook:

In [None]:
!nohup ollama serve > ollama.log 2>&1 &

We run phi-3-instruct. Feel free to try other models supported by Ollama: https://ollama.com/search

In [None]:
!ollama run phi3:3.8b-instruct &

If you encounter any issues with the model along the way, it might be worth restarting Ollama, beginning from `ollama serve`. First, we need to kill the running process to free the port. You can find the PID and kill the process as follows:
```
!lsof -i :11434
!kill PID     # PID from previous step
```

After this, run the `ollama serve` cell followed by the `ollama run` cell and try again.
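
Before or after restarting, a quick reachability check can tell you whether the server is listening at all. The sketch below uses only the standard library; the helper name `ollama_is_up` is hypothetical, and it assumes that a running Ollama server answers a plain GET on its base URL with HTTP 200.

```python
import http.client
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at `base_url` with status 200.

    Hypothetical helper; assumes Ollama replies with 200 on its root URL.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and URL errors all end up here.
        return False


print("Ollama reachable:", ollama_is_up())
```

If this prints `False`, the server is not listening on port 11434, and the `ollama serve` cell needs to be run again.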

Download question-answer pairs from the RAG-QA Arena "Tech" dataset:

In [None]:
import ujson
from dspy.utils import download

download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
  data = [ujson.loads(line) for line in f]

data[0]

`Example` is the DSPy data type used to represent items in our data. We can specify the input field using `with_inputs`. In this case the input is the question, and the label is the response.

In [None]:
import dspy

data = [dspy.Example(**d).with_inputs('question') for d in data]

example = data[2]
example

We split our dataset as follows. We will only use 20 examples for evaluation (devset), since both generating responses and scoring them with the LLM are time consuming.

In [None]:
import random

random.Random(123).shuffle(data)
trainset, devset, testset = data[:200], data[200:220], data[220:400]

len(trainset), len(devset), len(testset)

We connect DSPy to the running Ollama API as follows:

In [None]:
import dspy
lm = dspy.LM('ollama_chat/phi3:3.8b-instruct', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

We create a ChainOfThought module which will use the LM we provide and instruct it to reason before generating the answer. In the Prediction output below, we can see the reasoning generated by the model, and the final answer:

In [None]:
cot = dspy.ChainOfThought('question -> response')
cot(question="should curly braces appear on their own line?")

Another question :)

In [None]:
cot(question="how to install python on mac?")

Semantic F1 is a metric that attempts to capture the following: how well does the system response cover all key facts in the gold response (recall)? And the other way around, how well does the system response avoid saying things that aren't in the gold response (precision)?

The semantic part comes from the fact that we use an LLM to measure this. We will use the same LLM that we use for generation.

In [None]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Produce a prediction from our `cot` module
example = data[2]
pred = cot(**example.inputs())

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")
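
To see how precision and recall turn into a single number, here is a minimal sketch. It replaces the LLM judgment with exact matching on hand-listed facts, which is a deliberate simplification, and it assumes the two values are combined with the usual harmonic mean. The function names and the example facts are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def fact_overlap_f1(gold_facts, predicted_facts):
    """Toy stand-in for Semantic F1: exact matching replaces the LLM judge."""
    gold, pred = set(gold_facts), set(predicted_facts)
    if not gold or not pred:
        return 0.0
    covered = gold & pred
    recall = len(covered) / len(gold)      # share of gold facts that were covered
    precision = len(covered) / len(pred)   # share of predicted facts that are in the gold
    return f1_score(precision, recall)


# Made-up facts for illustration only.
gold = ["use homebrew", "run brew install python", "check with python3 --version", "update PATH"]
pred = ["use homebrew", "run brew install python", "reboot the machine"]

toy_score = fact_overlap_f1(gold, pred)
print(f"Toy F1: {toy_score:.2f}")
```

Here recall is 2/4 and precision is 2/3, so the response is penalized both for the facts it missed and for the unsupported one it added.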


The following shows how to evaluate our model using DSPy. We should take the results with a grain of salt because the dataset is tiny.

In [None]:
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=1,
                         display_progress=True, display_table=2, provide_traceback=True)

evaluate(cot)


Now we will implement a RAG pipeline that starts by retrieving 2 documents as context, then generates the answer using our Chain of Thought model:

In [None]:
from ragatouille import RAGPretrainedModel

class RAG(dspy.Module):
  def __init__(self, ir_model, documents, topk=2):
    super().__init__()
    self.ir_model = RAGPretrainedModel.from_pretrained(ir_model)
    self.ir_model.encode(documents)
    self.generate_answer = dspy.ChainOfThought('context, question -> response')
    self.topk = topk
    self.documents = documents

  def forward(self, question):
    context = self.ir_model.search_encoded_docs(query=question, k=self.topk)
    context = [doc['content'] for doc in context]
    prediction = self.generate_answer(context=context, question=question)
    return prediction

We will be using all the documents in our dataset for retrieval:

In [None]:
docs = [doc["response"] for doc in data]
rag = RAG("colbert-ir/colbertv2.0", docs)

We try a query:

In [None]:
rag(question="how to install python on mac?")

And finally, we evaluate the model. We can see that we got a ~1 point improvement by using RAG. But again, since we are using only 20 examples for validation, the results are not conclusive:

In [None]:
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=1,
                         display_progress=True, display_table=2, provide_traceback=True)

evaluate(rag)
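
One way to judge how much a small difference means on 20 examples is to look at the standard error of the mean score. The sketch below uses made-up per-example scores, not results from this notebook, and the helper name `mean_with_standard_error` is hypothetical.

```python
import math
import statistics


def mean_with_standard_error(scores):
    """Return (mean, standard error of the mean) for a list of per-example scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the spread")
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # sample std dev / sqrt(n)
    return mean, sem


# Made-up scores for illustration only (NOT measured in this notebook).
hypothetical_scores = [0.2, 0.8, 0.5, 0.6, 0.4] * 4  # 20 values

mean, sem = mean_with_standard_error(hypothetical_scores)
print(f"mean = {mean:.3f}, standard error = {sem:.3f}")
```

With scores spread like these, the standard error is several points on a 0-100 scale, so a difference of about one point between two systems falls well inside the noise.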


### Exercise 4

Fine-tune ColBERT on the train set and use it in the RAG pipeline above. Explain why it does or doesn't improve the results.

Answer here.

Let's try using BM25 for retrieval instead of ColBERT, which is slower. We will use `bm25s` to create an index from our documents as follows:

In [None]:
import bm25s
import Stemmer

stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(docs, stopwords="en", stemmer=stemmer)

retriever = bm25s.BM25(k1=0.9, b=0.4)
retriever.index(corpus_tokens)

We can retrieve documents from the index we created as follows:

In [None]:
tokens = bm25s.tokenize("should curly braces appear on their own line?", stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=2, n_threads=1, show_progress=False)
run = [docs[doc] for doc, score in zip(results[0], scores[0])]
run[0]
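
To give some intuition for the `k1` and `b` parameters used above, here is a minimal sketch of textbook Okapi BM25 scoring in plain Python. It is not how `bm25s` is implemented, and library variants may differ in constants and IDF smoothing; the function name `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=0.9, b=0.4):
    """Score each tokenized document against the query with textbook BM25."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs or 1.0
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # b controls how strongly long documents are penalized.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 controls how quickly repeated occurrences stop adding score.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Toy, already-tokenized corpus for illustration only.
corpus = [
    ["curly", "brace", "line"],
    ["install", "python", "mac"],
    ["python", "brace", "style", "guide", "line", "line"],
]
scores = bm25_scores(["curly", "brace"], corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The first document matches both query terms and is short, so it ranks first; the second shares no terms with the query and scores zero.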

In [None]:
from ragatouille import RAGPretrainedModel

class RAG_bm25(dspy.Module):
  def __init__(self, documents, topk=2):
    super().__init__()
    self.stemmer = Stemmer.Stemmer("english")
    corpus_tokens = bm25s.tokenize(documents, stopwords="en", stemmer=self.stemmer)
    self.retriever = bm25s.BM25(k1=0.9, b=0.4)
    self.retriever.index(corpus_tokens)

    self.generate_answer = dspy.ChainOfThought('context, question -> response')
    self.topk = topk
    self.documents = documents

  def bm25_search(self, question: str) -> list[str]:
    tokens = bm25s.tokenize(question, stopwords="en", stemmer=self.stemmer, show_progress=False)
    results, scores = self.retriever.retrieve(tokens, k=self.topk, n_threads=1, show_progress=False)
    run = [self.documents[idx] for idx in results[0]]  # use the module's own documents, not the global `docs`
    return run

  def forward(self, question):
    context = self.bm25_search(question)
    prediction = self.generate_answer(context=context, question=question)
    return prediction

In [None]:
rag_bm25 = RAG_bm25(docs)

In [None]:
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=1,
                         display_progress=True, display_table=2, provide_traceback=True)

evaluate(rag_bm25)

### Exercise 5
Add a reranking step to the RAG pipeline. Start by retrieving 10 documents using BM25, then rerank them using ColBERT and return 2 documents as context. Evaluate the model and compare it against the other results.

Answer here.