![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab (Graded Lab): Experimenting with Retrieval as Part of a RAG System

Total: 44 points

**Objectives:** We build the retrieval part of a RAG system and compare the performance of classic KNN retrieval against KNN retrieval with additional cross-encoder reranking. Finally, we write two prompts for generation and test them on an LLM.

**Useful documentation:** Since you'll use LangChain for this assignment, [their documentation](https://python.langchain.com/docs/introduction/) might be helpful.

## Students

Künzi Dominic, Matzinger Jaron

## Setup

First, we need to install the required packages for this assignment.

In [1]:
pip install pyperclip pandas langchain-community langchain-huggingface faiss-cpu --quiet

In [2]:
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

import pyperclip
import pandas as pd

from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We will use the [DRAGONBall Dataset](https://github.com/OpenBMB/RAGEval) as the basis for this assignment and load a subset of its documents. These will be the stored knowledge of the RAG system. To store them in the vector store, we will later create embeddings from them directly, since they are already the size of suitable chunks. Each document consists of a unique ID and the actual content.

In [4]:
ground_path = 'drive/MyDrive/data/'

In [5]:
documents = pd.read_csv(ground_path + 'docs.csv', index_col=0)
documents

id,content
40,Acme Government Solutions is a government indu...
41,Entertainment Enterprises Inc. is an entertain...
42,"Advanced Manufacturing Solutions Inc., establi..."
43,"EcoGuard Solutions, established on April 15, 2..."
44,"Green Fields Agriculture Ltd., established on ..."
...,...
211,Hospitalization Record:\n\nBasic Information:\...
212,**Hospitalization Record**\n\n**Basic Informat...
213,Hospitalization Record\n\nBasic Information:\n...
214,Hospitalization Record\n----------------------...


The main goal of the assignment is to evaluate the retrieval component of the RAG system. For that, we also load a dataset of queries, which we can use to retrieve matching documents. Each query is also assigned a list of ground-truth document IDs, which refer to the documents loaded before. We can use these to evaluate whether the retrieval found the correct documents.

In [6]:
queries = pd.read_csv(ground_path + 'queries.csv', index_col=0)
queries['ground_truth_doc_ids'] = queries['ground_truth_doc_ids'].apply(lambda x: x.split(';'))
queries

query_id,query,ground_truth_doc_ids
2286,When was Sparkling Clean Housekeeping Services...,[64]
2433,How did HealthPro Innovations' strategic partn...,[54]
6266,According to the hospitalization records of Br...,[212]
4499,"According to the judgment of Norwood, Unionvil...",[124]
2448,Based on HealthLife Solutions' 2020 corporate ...,[73]
...,...,...
2186,How did the severe drought in August 2018 lead...,[65]
3251,Compare the large-scale financing activities o...,"[58, 55]"
2268,How did CleanCo Housekeeping Services' investm...,[47]
3311,What were the outcomes of the debt restructuri...,"[56, 53]"


## 1. Recall@N

**1a) [2 points]** We will evaluate the retrieval by comparing the retrieved documents with the ground truth documents assigned to the query. For that, we will use the Recall@N metric. Please describe in 1-2 sentences how we can interpret this metric in our case.

**Your Answer:**

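Recall@N is the fraction of a query's ground-truth documents that appear among the top N retrieved documents. The higher the value, the more of the relevant documents were found within the first N positions; a value of 1.0 means all of them were found.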
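
As a formula (the notation here is our own and not part of the assignment): for a query $q$ with ground-truth set $G_q$, let $R_N(q)$ be the set of the first $N$ retrieved document IDs. Then

$$
\text{Recall@}N(q) = \frac{|R_N(q) \cap G_q|}{|G_q|}
$$

For instance, with $G_q = \{0, 1, 20\}$ and $R_3(q) = \{0, 1, 2\}$, two of the three ground-truth documents are found, so Recall@3 is $2/3$.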

**1b) [4 points]** Implement the Recall@N metric and test it with the following code.

In [7]:
def recall_at_n(retrieved_docs, relevant_doc_ids, n):
    """
    Calculate Recall@N.

    Parameters:
    - retrieved_docs: Sorted list of retrieved documents as LangChain Document objects
    - relevant_doc_ids: List of relevant document IDs
    - n: Number of top documents to consider

    Returns:
    - Recall@N
    """
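    # TODO YOUR CODE HERE
    # Without any relevant documents recall is undefined; return 0.0 by convention.
    if not relevant_doc_ids:
        return 0.0

    # IDs of the N highest-ranked documents
    top_n_ids = [doc.metadata['id'] for doc in retrieved_docs[:n]]

    # Count the retrieved documents that are part of the ground truth
    relevant_in_top_n = sum(1 for doc_id in top_n_ids if doc_id in relevant_doc_ids)

    return relevant_in_top_n / len(relevant_doc_ids)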

In [8]:
### Test

recall_at_n(
    [Document(page_content='', metadata={'id': str(id)}) for id in range(10)],
    ['0', '1', '20'],
    3
)

0.6666666666666666

## 2. Embedding Model

**2a) [3 points]** Each document will be converted to an embedding representing the semantic meaning of the document. In this assignment, we will use the model `sentence-transformers/all-MiniLM-L6-v2` from HuggingFace. Please answer the following questions about this model:

**Your Answers:**

Embedding Length: 384

Number of Parameters: 22.7 Million

Maximum Sequence Length: 256 word pieces (during training, they limited it to 128 tokens, as stated on the HF model card)

## 3. Vector Store

**3a) [4 points]** Use LangChain to create a FAISS vector store and embed the documents with the above-mentioned embedding model. Load the documents again, but this time with a Loader object from LangChain. Finally, print the number of documents in the vector store.

In [9]:
# TODO YOUR CODE HERE
loader = CSVLoader(file_path=ground_path + 'docs.csv', source_column='content', metadata_columns=['id'])
documents = loader.load()

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')
vectorstore = FAISS.from_documents(documents, embeddings)

print(f'Nr. of docs in the vector store: {len(vectorstore.index_to_docstore_id)}')

Nr. of docs in the vector store: 108


**3b) [3 points]** Retrieve the Top-3 documents for this query: "According to the hospitalization records of Bridgewater General Hospital, summarize the present illness of J. Reyes." and print the documents' ID and L2 distance.

In [10]:
# TODO YOUR CODE HERE
query = 'According to the hospitalization records of Bridgewater General Hospital, summarize the present illness of J. Reyes.'

results = vectorstore.similarity_search_with_score(query, k=3)

print('Top-3 matching documents:')
for doc, score in results:
    doc_id = doc.metadata.get('id', 'N/A')
    print(f'ID: {doc_id}, L2 Distance: {score:.4f}')

Top-3 matching documents:
ID: 212, L2 Distance: 0.7302
ID: 213, L2 Distance: 0.9898
ID: 210, L2 Distance: 1.0050


**3c) [2 points]** Check and show if a suitable document is found for the query in the Top-3 retrieved documents and show the relevant ones.

In [11]:
# TODO YOUR CODE HERE
for doc, score in results:
    print(f'Doc ID: {doc.metadata.get("id", "N/A")}')
    print(f'{doc.page_content[:350]}...')
    print('-' * 40)

Doc ID: 212
content: **Hospitalization Record**

**Basic Information:**
Name: J. Reyes
Gender: Male
Age: 52
Ethnicity: Hispanic
Marital Status: Married
Occupation: Construction Worker
Address: 22, Sunnyvale street, Bridgewater
Admission Time: 7th, September
Record Time: 8th, September
Historian: Self
Hospital Name: Bridgewater General Hospital

**Chief Complai...
----------------------------------------
Doc ID: 213
content: Hospitalization Record

Basic Information:
Name: K. Ramos
Gender: Female
Age: 82
Ethnicity: Caucasian
Marital Status: Widowed
Occupation: Retired
Address: 21, Greenfield Street, Windsor
Admission Time: 26th October
Record Time: 26th October, 10:00 AM
Historian: S. Martinez, MD
Hospital Name: Windsor General Hospital

Chief Complaint:
Sever...
----------------------------------------
Doc ID: 210
content: Hospitalization Record

Basic Information:
Name: J. Alvarez
Gender: Female
Age: 83
Ethnicity: Hispanic
Marital Status: Widowed
Occupation: Retired
Address: 23, Yarmo

**Your Answer:**

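Yes. The document with the smallest L2 distance (ID 212) is the hospitalization record of J. Reyes at Bridgewater General Hospital, so a suitable document is found at rank 1. The other two documents (IDs 213 and 210) are hospitalization records of other patients.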
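
A side note on reading these scores, under two assumptions that are not stated in the notebook: that the FAISS flat L2 index reports squared L2 distances, and that the embedding model returns unit-length vectors. If both hold, the score relates to cosine similarity via $\|a - b\|^2 = 2 - 2\cos(a, b)$, and the top score of 0.7302 would correspond to a cosine similarity of roughly 0.63. The sketch below only checks the identity on toy vectors; the helper names are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy unit vectors (not real embeddings)
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

d2 = squared_l2(a, b)
cos = cosine(a, b)

# For unit vectors, both sides of the identity agree
print(d2, 2 - 2 * cos)
```

Under the same assumptions, a smaller score always means a higher cosine similarity, so the ranking by L2 distance and the ranking by cosine similarity coincide.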

## 4. Vector Store Evaluation

**4a) [4 points]** Now, we will search with each of the queries for the most relevant documents in the vector store, and calculate Recall@N with them and the assigned ground truth document IDs. To aggregate the results over all queries, we will calculate the mean. We will do this four times, using a different value for $N$ each time: $N \in \{ 1, 3, 5, 25\}$.

In [12]:
# TODO YOUR CODE HERE
N = [1, 3, 5, 25]
recall_scores = {n: [] for n in N}

for _, row in queries.iterrows():
    query = row['query']
    relevant_doc_ids = row['ground_truth_doc_ids']

    for n in N:
        results = vectorstore.similarity_search(query, k=n)

        recall = recall_at_n(results, relevant_doc_ids, n)
        recall_scores[n].append(recall)

print('Mean Recall@N across all queries:')
for n in N:
    if recall_scores[n]:
        avg_recall = sum(recall_scores[n]) / len(recall_scores[n])
        print(f'Recall@{n}: {avg_recall:.4f}')
    else:
        print(f'Recall@{n}: No valid queries')

Mean Recall@N across all queries:
Recall@1: 0.6650
Recall@3: 0.8150
Recall@5: 0.8600
Recall@25: 1.0000


**4b) [2 points]** When looking at the four calculated Recall@N scores, what do you observe and how can you explain this?

**Your Answer:**
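Recall@N increases monotonically with N (0.665, 0.815, 0.860, 1.000): the more documents are considered, the higher the chance that the relevant ones are among them. With N = 25, every ground-truth document is found for every query. Recall@1 is lowest, partly because queries with two ground-truth documents can reach at most 0.5 when only one document is retrieved.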
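
A small toy example may make this concrete. The helper `recall_at_n_ids` below is a hypothetical variant of `recall_at_n` that works on plain ID lists instead of LangChain documents, and the ranked IDs are invented for illustration rather than taken from the dataset.

```python
def recall_at_n_ids(ranked_ids, relevant_ids, n):
    """Recall@N on plain ID lists (illustrative helper, not part of the lab)."""
    if not relevant_ids:
        return 0.0
    hits = set(ranked_ids[:n]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)


# Invented ranking for a query with two ground-truth documents
ranked_ids = ['7', '58', '3', '9', '55']
relevant_ids = ['58', '55']

scores = {n: recall_at_n_ids(ranked_ids, relevant_ids, n) for n in (1, 3, 5)}
print(scores)
```

Since the top-N list for a larger N always contains the top-N list for a smaller N, the number of hits can only stay the same or grow.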

## 5. Cross Encoder

**5a) [3 points]** We want to use a cross encoder model to rerank the retrieved documents. Describe in 1-2 sentences how a new document order can be determined using a cross encoder.

**Your Answer:**

A cross encoder reranks retrieved documents by jointly encoding query-document pairs and scoring their relevance using a classification or regression head (e.g. similarity score). By computing scores for each pair, the encoder can reorder the documents so that those most semantically aligned with the query are ranked highest.
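
The reranking flow can be sketched without any model. In the snippet below, `toy_pair_score` is a deliberately simple word-overlap function standing in for a real cross encoder, and `rerank` and the candidate texts are invented for illustration; only the score-each-pair-then-sort structure is the point.

```python
def toy_pair_score(query, text):
    """Stand-in for a cross encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, score_fn, top_n):
    """Score every (query, document) pair and return the IDs sorted by score."""
    scored = [(score_fn(query, text), doc_id) for doc_id, text in candidates]
    # Highest score first; ties keep their original retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


# Invented candidates in their initial retrieval order
candidates = [
    ('a', 'annual report of a housekeeping company'),
    ('b', 'hospitalization record of a patient with chest pain'),
    ('c', 'court judgment on a property dispute'),
]

order = rerank('hospitalization record chest pain', candidates, toy_pair_score, top_n=3)
print(order)
```

A real cross encoder replaces the overlap score with a learned relevance score computed from the concatenated query and document, which is more expensive and is therefore usually applied only to a small candidate set.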

**5b) [4 points]** Now again, we want to calculate Recall@N for all queries and the same $N$ as before. This time, we want to rerank the Top-25 retrieved documents using the cross encoder model `BAAI/bge-reranker-base`. Implement this using LangChain components and report the average Recall for $N \in \{ 1, 3, 5, 25\}$.

In [13]:
# TODO YOUR CODE HERE
cross_encoder = HuggingFaceCrossEncoder(model_name='BAAI/bge-reranker-base')
reranker = CrossEncoderReranker(model=cross_encoder, top_n=25)

N = [1, 3, 5, 25]
recall_scores = {n: [] for n in N}

for _, row in queries.iterrows():
    query = row['query']
    relevant_doc_ids = row['ground_truth_doc_ids']

    initial_results = vectorstore.similarity_search(query, k=25)
    reranked_results = reranker.compress_documents(initial_results, query)

    for n in N:
        recall = recall_at_n(reranked_results, relevant_doc_ids, n)
        recall_scores[n].append(recall)

print('Mean Recall@N after reranking with Cross-Encoder:')
for n in N:
    if recall_scores[n]:
        avg_recall = sum(recall_scores[n]) / len(recall_scores[n])
        print(f'Recall@{n}: {avg_recall:.4f}')
    else:
        print(f'Recall@{n}: No valid queries')

Mean Recall@N after reranking with Cross-Encoder:
Recall@1: 0.7400
Recall@3: 0.9600
Recall@5: 0.9800
Recall@25: 1.0000


**5c) [2 points]** What do you observe when you compare the Recall@N scores after reranking with the scores without reranking? Write 1-2 sentences about this and why this might happen.

**Your Answer:**
After reranking with the Cross-Encoder, the Recall@N improved as follows:
- Recall@1 from 0.665 to 0.740
- Recall@3 from 0.815 to 0.960
- Recall@5 from 0.860 to 0.980


This improvement occurs because the cross encoder processes the query and each document jointly, so it can model their interaction directly and rank the most relevant documents higher than the embedding similarity alone does. Recall@25 stays at 1.0 because reranking only reorders the same 25 candidates.

## 6. Generation

**6a) [6 points]** After improving the retrieval part of the RAG system, we want to finally generate an answer for our query. Retrieve the most relevant document for query "How much funding did HealthPro Innovations raise in February 2021?" and print its ID. Then write the instruction message of a prompt to answer this query including all necessary elements before running it using your favourite LLM (ChatGPT GPT-4o, etc.). Please paste the answer from the model and indicate which model you used.

In [14]:
# TODO YOUR CODE HERE
query = 'How much funding did HealthPro Innovations raise in February 2021?'

initial_results = vectorstore.similarity_search(query, k=25)
reranked_results = reranker.compress_documents(initial_results, query)

top_doc = reranked_results[0]
doc_id = top_doc.metadata.get('id', 'N/A')
print(f'Most relevant document ID: {doc_id}')

try:
    pyperclip.copy(top_doc.page_content)
    print('Page content has been successfully copied to the clipboard.')
except Exception:
    # Copying can fail when no clipboard mechanism is available, e.g. on a remote runtime
    print('Unable to copy page content to the clipboard.')

Most relevant document ID: 54
Unable to copy page content to the clipboard.


**Your Prompt:**

You are an expert financial analyst assistant. Based on the following document, answer the user's question with a direct and concise response. If the answer is not clearly found in the document, reply with 'Information not available in the provided document.'

Question: How much funding did HealthPro Innovations raise in February 2021?

Document:
...

**Generated Answer:**
HealthPro Innovations raised \$150 million in February 2021.

**Used Model:**
GPT-4o


**6b) [3 points]** We want to use in-context learning and provide the LLM with one example of a possible answer. Use the same prompt and extend it so that the model follows this example answer: "Yep, they sold a lot in that year. Over 50 million units as I can see — pretty big move, respect!". Use the same model, create a fresh chat and run this new prompt. Highlight the changes in the prompt using **bold style** or <span style="color:red;">color</span>.

**Your Prompt:**

You are an expert financial analyst assistant. Based on the following document, answer the user's question with a direct and concise response. If the answer is not clearly found in the document, reply with "Information not available in the provided document."

**Here is an example of how to answer:**

**Q: How many units did the company sell last year?**

**A: Yep, they sold a lot in that year. Over 50 million units as I can see — pretty big move, respect!**

**Now use the same tone and style to answer the following.**

Question: How much funding did HealthPro Innovations raise in February 2021?

Document:
...

**Generated Answer:**
They pulled in a solid \$150 million in February 2021 — big-time funding move right there.
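
Both prompts share the same skeleton, so they could also be assembled in code. The following is a minimal sketch, not the way the prompts above were actually produced: `build_prompt` and `INSTRUCTION` are hypothetical names, the document string is a placeholder rather than the retrieved text, and the example answer is shortened.

```python
INSTRUCTION = (
    "You are an expert financial analyst assistant. Based on the following "
    "document, answer the user's question with a direct and concise response. "
    "If the answer is not clearly found in the document, reply with "
    "'Information not available in the provided document.'"
)


def build_prompt(document, question, example=None):
    """Assemble the prompt; `example` is an optional (question, answer) pair."""
    parts = [INSTRUCTION]
    if example is not None:
        example_q, example_a = example
        # One-shot block: a single demonstration of the desired answer style
        parts.append(
            "Here is an example of how to answer:\n\n"
            f"Q: {example_q}\n\n"
            f"A: {example_a}\n\n"
            "Now use the same tone and style to answer the following."
        )
    parts.append(f"Question: {question}")
    parts.append(f"Document:\n{document}")
    return "\n\n".join(parts)


question = "How much funding did HealthPro Innovations raise in February 2021?"
document = "<page content of the retrieved document>"  # placeholder, not real data

zero_shot = build_prompt(document, question)
one_shot = build_prompt(
    document,
    question,
    example=(
        "How many units did the company sell last year?",
        "Yep, they sold a lot in that year.",  # shortened example answer
    ),
)
print(one_shot)
```

In the notebook, `top_doc.page_content` would take the place of the placeholder document.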

**6c) [2 points]** Please check if the two answers are correct according to the document and how they differ. Does the model follow the example in the second prompt?

**Your Answer:**

1. Both answers are correct according to the document.
2. The first answer is more formal and neutral, focusing on the facts. In contrast, the second answer is more casual and mimics the tone of the example.
3. Yes, the model clearly followed the tone and instruction from the example in the second prompt.

## End of AdvNLP Lab

Please make sure all cells have been executed, save this completed notebook, and upload it to Moodle.