# Project: Question-Answering using Retrieval Augmented Generation
by L.Arduini, D.N.Ghaneh, L.Menchini, C.Petruzzella

## Instructions to Run

### Prerequisites
1. Python 3.10 or above.
2. Access to a runtime environment with GPU support (e.g., NVIDIA T4 on Google Colab) for optimal performance.

### Running the project
- Switch the runtime to GPU (e.g., NVIDIA T4) for enhanced performance

---
This document outlines the implementation details, providing step-by-step guidance and explanations.

# Intelligent Retrieval and QA System Evaluation

## Overview

This project focuses on the comparative analysis of two advanced retrieval methods: **Rank Fusion** and **Cascading Retrieval**. The aim is to build a pipeline capable of retrieving relevant documents for a given query, generating responses using a large language model (Llama 3.2-1B), and evaluating those responses for quality and relevance.

## Key Components

1. **Document Retrieval**:
    - Implementation of multiple retrieval strategies:
        - **Sparse Retrieval**: BM25 for term-based matching.
        - **Dense Retrieval**: Embedding-based semantic similarity.
        - **Rank Fusion**: Combining sparse and dense retrieval scores.
        - **Cascading Retrieval**: Two-stage retrieval in which candidates from sparse and dense retrieval are reranked by a neural cross-encoder.
    - Each method is evaluated using *NDCG* and *Recall* as quantitative metrics.
    - Dataset: **TREC-COVID**, simulating real-world information retrieval challenges.

2. **Question-Answering with LLM**:
    - Responses are generated for each query using the Llama 3.2-1B language model. Retrieved documents serve as context to provide more accurate answers.

3. **Evaluation Framework**:
    - Human-centric evaluation to assess response quality:
        - A numerical relevance score ranging from **0 (irrelevant)** to **2 (highly relevant)**.
        - A brief **motivation** accompanying each evaluation score for better interpretability.

4. **Consolidated Output**:
    - For each query, the following information is logged:
        - Query text.
        - Retrieved context.
        - Model-generated response.
        - Human evaluation score and motivation.
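
The logged fields above can be pictured as one record per query. The following is a minimal sketch, assuming a hypothetical `QAEvaluationRecord` dataclass; the class and field names are illustrative and do not appear in the notebook.

```python
from dataclasses import dataclass, asdict


# Hypothetical record type: names are illustrative, not taken from the notebook.
@dataclass
class QAEvaluationRecord:
    query_text: str
    retrieved_context: list[str]
    model_response: str
    relevance_score: int  # 0 (irrelevant) to 2 (highly relevant)
    motivation: str

    def __post_init__(self):
        # Enforce the 0-2 relevance scale described in the evaluation framework
        if self.relevance_score not in (0, 1, 2):
            raise ValueError("relevance_score must be 0, 1, or 2")


record = QAEvaluationRecord(
    query_text="example question",
    retrieved_context=["Document 1: example passage"],
    model_response="example answer",
    relevance_score=1,
    motivation="Partially answers the question.",
)
print(asdict(record))
```

Validating the score at construction time keeps out-of-range annotations from silently entering the consolidated output.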

## Objective

The primary objective of this notebook is to demonstrate the **end-to-end design and evaluation of a retrieval-enhanced QA system**, highlighting its applicability in real-world scenarios such as medical information retrieval. 

Through this work, we aim to:
- Explore and enhance document retrieval strategies by combining traditional and modern techniques.
- Analyze the strengths and limitations of LLMs in QA tasks.
- Provide a structured framework for evaluating the relevance and quality of LLM-generated responses.

---

### Structure

- **Document Retrieval**: Implementation and comparison of retrieval methods.
- **QA Generation**: Leveraging an LLM to generate context-aware responses.
- **Response Evaluation**: Scoring and analysis of the responses with human-annotated motivations.

This notebook is designed to offer insights into retrieval-augmented QA systems while enabling practical understanding of their performance and potential improvements.


In [1]:
# Install required Python packages
!pip install ir_datasets
!pip install rank_bm25
!pip install sentence_transformers
!pip install pytrec_eval
!pip install PyStemmer
!pip install --upgrade gdown



In [2]:
# Import necessary libraries and initialize global configurations
from tqdm import tqdm
import json
import ir_datasets
import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
from rank_bm25 import BM25Okapi
from transformers import pipeline
import torch
from huggingface_hub import login
import pytrec_eval
import collections
import itertools
import heapq
from google.colab import drive
import shutil

# Read the Hugging Face token from the environment instead of hard-coding it in the notebook
api_key = os.environ["HF_TOKEN"]
login(token=api_key)

# Check GPU availability
def get_device():
    if torch.cuda.is_available():
        device = "cuda"
        cuda_version = torch.version.cuda  # Retrieve CUDA version
        gpu_properties = torch.cuda.get_device_properties(torch.cuda.current_device())
        print(f"Using GPU: {gpu_properties.name}")
        print(f"CUDA Version: {cuda_version}")
        print(f"Streaming Multiprocessors: {gpu_properties.multi_processor_count}")
        print(f"Total Memory: {gpu_properties.total_memory / 1e9:.2f} GB")
        print(f"Compute Capability: {gpu_properties.major}.{gpu_properties.minor}")
    elif torch.backends.mps.is_available():
        device = "mps"
        print("Using MPS (Metal Performance Shaders)")
    else:
        device = "cpu"
        print("Using CPU")
    return device

device = get_device()


Using GPU: Tesla T4
CUDA Version: 12.1
Streaming Multiprocessors: 40
Total Memory: 15.84 GB
Compute Capability: 7.5


# Section 1: Dataset loading and preparation

This section focuses on loading and preparing the dataset for the QA model.

In [3]:
# Define functions for text preprocessing and tokenization
from functools import lru_cache
import re
import string
import Stemmer
import nltk
nltk.download("stopwords", quiet=True)

# ------- Pre Initialization -------
# 1. Compile regex patterns once globally
# 2. Preload stopwords set
# 3. Initialize stemmer

ACRONYM_REGEX = re.compile(r"(?<!\w)\.(?!\d)")
PUNCTUATION_TRANS = str.maketrans("", "", string.punctuation)
STOPWORDS = set(nltk.corpus.stopwords.words('english'))
STEMMER = Stemmer.Stemmer('english')

# Define a cached function to stem individual words
@lru_cache(maxsize=1000)
def stem(word):
    return STEMMER.stemWord(word)

# ----------------------------------

def preprocess(s):
    """
    Preprocess a string for indexing or querying.

    Args:
        s: The input string.

    Returns:
        A list of preprocessed tokens.
    """

    s = s.lower()
    s = s.replace("&", " and ")
    # normalize quotes and dashes
    s = s.translate(str.maketrans("‘’´“”–-", "'''\"\"--"))
    # remove unnecessary dots in acronyms (but not decimals)
    s = ACRONYM_REGEX.sub("", s)
    # remove punctuation
    s = s.translate(PUNCTUATION_TRANS)
    # strip and remove extra spaces
    s = " ".join(s.split())

    tokens = s.split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = STEMMER.stemWords(tokens)
    return tokens

In [4]:
# Load dataset
print("Loading the trec covid dataset...")
dataset = ir_datasets.load("cord19/trec-covid")

Loading the trec covid dataset...


In [5]:
import pandas as pd

# Convert the dataset to a pandas DataFrame for easier manipulation
df = pd.DataFrame(dataset.docs_iter(), columns=['doc_id', 'title', 'doi', 'date', 'abstract'])

# Check length of the dataset
print(f"Dataset length: {len(df)}")

# Check number of documents with duplicate abstracts
print(f"Number of documents with duplicate abstracts: {df['abstract'].duplicated().sum()}")

# Remove documents with empty or null abstracts
print("Removing documents with empty or null abstracts...")
data_cleaned = df[~df['abstract'].isnull() & (df['abstract'].str.strip() != '')]

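# Remove documents with duplicate abstracts
# (applied to df rather than data_cleaned, so the empty-abstract filter above does not affect docs_dataset)
print("Removing documents with duplicate abstracts...")
docs_dataset = df.drop_duplicates(subset='abstract')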

# Check dataset length
print(f"Cleaned dataset length: {len(docs_dataset)}")

Dataset length: 192509
Number of documents with duplicate abstracts: 66793
Removing documents with empty or null abstracts...
Removing documents with duplicate abstracts...
Cleaned dataset length: 125716


In [6]:
# Prepare documents and queries
print("Preparing documents and queries...")

# Put all documents and queries in lists of dictionaries
all_docs = []
for index, row in docs_dataset.iterrows():
    abstract = f"{row['title']} {row['abstract']}"
    all_docs.append({"doc_id": row['doc_id'], "abstract": abstract, "context": row['abstract']})

all_queries = []
for query in dataset.queries_iter():
    query_text = f"{query.description}"
    all_queries.append({"query_id": query.query_id, "title": query_text})

# Print dataset size information
print(f"Summary: {len(all_docs)} documents and {len(all_queries)} queries are available in the dataset.")

# Tokenize documents and queries
tokenized_docs = [preprocess(doc["abstract"]) for doc in all_docs]
tokenized_queries = [preprocess(query["title"]) for query in all_queries]
print("Tokenization of documents is done.")

# Build the BM25 index over the tokenized documents
bm25 = BM25Okapi(tokenized_docs)

Preparing documents and queries...
Summary: 125716 documents and 50 queries are available in the dataset.
Tokenization of documents is done.


In [7]:
# convert qrels to a dictionary
qrels_dict = collections.defaultdict(dict)
for qrel in dataset.qrels_iter():
    qrels_dict[qrel.query_id][qrel.doc_id] = int(qrel.relevance)

# Section 2: Embeddings generation

In [8]:
# Download the precomputed embeddings from Google Drive
drive.mount('/content/drive')

repository = "1his04UkuSdcF9UUV5HMMnHKBviXd9_vE"
repository_name = "lm-project-files"
!gdown --folder $repository

# Copy the files from the downloaded folder to /content/
for item in os.listdir(repository_name):
  s = os.path.join(repository_name, item)
  d = os.path.join('/content/', item)
  if os.path.isfile(s):  # Copy only regular files
    shutil.copy2(s, d)

# Remove the downloaded folder (optional)
shutil.rmtree(repository_name)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Retrieving folder contents
Processing file 1qaX7rC1TcT1smqs6TROv2LsoF1FLHiI3 trec_covid_doc_embeddings.csv
Processing file 1WGH8XgI4TXsaymzt9ji4-0_Mg7NVsXPo trec_covid_query_embeddings.csv
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From (original): https://drive.google.com/uc?id=1qaX7rC1TcT1smqs6TROv2LsoF1FLHiI3
From (redirected): https://drive.google.com/uc?id=1qaX7rC1TcT1smqs6TROv2LsoF1FLHiI3&confirm=t&uuid=b829000c-0652-4fdf-aa93-868b8834aa2d
To: /content/lm-project-files/trec_covid_doc_embeddings.csv
100% 591M/591M [00:06<00:00, 92.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1WGH8XgI4TXsaymzt9ji4-0_Mg7NVsXPo
To: /content/lm-project-files/trec_covid_query_embeddings.csv
100% 236k/236k [00:00<00:00, 4.09MB/s]
Download completed


In [9]:
# Load or generate embeddings
force_generate = False

def generate_embeddings():
    if not force_generate and os.path.exists("trec_covid_doc_embeddings.csv") and os.path.exists("trec_covid_query_embeddings.csv"):
        print("Loading precomputed embeddings...")
        doc_embeddings = pd.read_csv("trec_covid_doc_embeddings.csv").values
        query_embeddings = pd.read_csv("trec_covid_query_embeddings.csv").values
    else:
        print("No precomputed embeddings found.")
        print("Generating new embeddings using SentenceTransformer model 'sentence-transformers/all-MiniLM-L6-v2'.")
        model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
        doc_embeddings = model.encode([doc["abstract"] for doc in all_docs], batch_size=32, show_progress_bar=True, normalize_embeddings=True)
        query_embeddings = model.encode([query['title'] for query in all_queries], batch_size=32, show_progress_bar=True, normalize_embeddings=True)

        # Save embeddings for future use
        pd.DataFrame(doc_embeddings).to_csv("trec_covid_doc_embeddings.csv", index=False)
        pd.DataFrame(query_embeddings).to_csv("trec_covid_query_embeddings.csv", index=False)

    return doc_embeddings, query_embeddings

doc_embeddings, query_embeddings = generate_embeddings()

Loading precomputed embeddings...


# Section 3: Retrieval Implementation

In this section, we implement all the previously introduced retrieval methods. First, we define a helper that converts the retrieval results into the run format expected by `pytrec_eval`, so that every method is evaluated in the same way.

We then execute the experimental queries using each retrieval method, collecting the outcomes in a dictionary keyed by method.

For neural reranking, we leverage a pretrained cross-encoder and tokenizer from the `ms-marco-MiniLM-L-6-v2` model. This step is crucial for improving the ranking quality by precisely evaluating the relevance of query-document pairs through deep contextual understanding.
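
The two-stage idea can be sketched without loading any model. The example below is an illustration only: `cascade_rerank` and `token_overlap` are hypothetical names, and the token-overlap scorer is a toy stand-in for the cross-encoder, chosen so the snippet runs without downloads.

```python
def cascade_rerank(query, sparse_ids, dense_ids, docs, score_fn, top_k=5):
    """Rerank the union of first-stage candidates with a pairwise scorer."""
    # Union of candidates, de-duplicated while preserving first-seen order
    candidates = list(dict.fromkeys([*sparse_ids, *dense_ids]))
    # Score every (query, document) pair with the second-stage scorer
    scored = [(doc_id, score_fn(query, docs[doc_id])) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_overlap(query, document):
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


docs = {
    "d1": "coronavirus immunity in children",
    "d2": "porcine vaccine development",
    "d3": "cross immunity after coronavirus exposure",
}
ranked = cascade_rerank(
    "coronavirus cross immunity", ["d1", "d2"], ["d3", "d1"], docs, token_overlap, top_k=2
)
print(ranked)
```

In the actual pipeline the scorer would be the cross-encoder's relevance logit for each query-document pair; only the scoring function changes, not the candidate-union and sort steps.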


In [10]:
# Function to prepare run data for pytrec_eval
def prepare_run_data(results):
    """
    Prepares the run data in the format expected by pytrec_eval.
    Converts numpy scores to native Python float for compatibility.
    """
    run = {}
    for query_results in results:
        query_id = query_results['query']['query_id']
        run[query_id] = {}
        for doc_id, score in zip(query_results['results'], query_results['scores']):
            run[query_id][doc_id] = float(score)  # Convert numpy type to float
    return run

### Document Retrieval Methods

1. **BM25 Sparse Retrieval**:
   - The **BM25 algorithm** is used to perform sparse retrieval on tokenized documents by calculating a relevance score for each document based on the query. It then returns the indices and relevance scores of the top-k most relevant documents.

2. **Dense Retrieval**:
   - **Dense retrieval** is performed by calculating the cosine similarity between the query embedding and the document embeddings. The top-k documents with the highest similarity scores are returned.

3. **Rank Fusion Retrieval**:
   - Results from **BM25** and **dense retrieval** are combined with a CombSUM-style **rank fusion**: the scores of each method are z-score normalized and summed per document, and the documents are then ranked by the combined score, returning the top-k documents.

4. **Cascading Retrieval**:
   - A candidate set is first retrieved with both sparse and dense retrieval. The union of these candidates is then reranked by a cross-encoder reranker model.
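
As an illustration of score-level fusion, here is a minimal standard-library sketch. The helper names `zscores` and `combsum` and the toy scores are assumptions made for this example; normalization uses the population standard deviation, which is also an assumption of the sketch rather than a statement about the notebook's exact settings.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores(scores):
    """Standardize scores to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def combsum(sparse_run, dense_run, top_k=5):
    """Fuse two runs, each a list of (doc_id, raw_score), by summing normalized scores."""
    fused = defaultdict(float)
    for run in (sparse_run, dense_run):
        if not run:
            continue
        doc_ids, raw_scores = zip(*run)
        for doc_id, z in zip(doc_ids, zscores(raw_scores)):
            fused[doc_id] += z  # documents found by both methods accumulate both scores
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy runs: BM25-like scores and cosine-like scores live on different scales
sparse_run = [("d1", 12.0), ("d2", 9.0), ("d3", 6.0)]
dense_run = [("d2", 0.9), ("d1", 0.8), ("d4", 0.5)]
fused = combsum(sparse_run, dense_run, top_k=3)
print(fused)
```

Normalizing before summing matters because raw BM25 scores are typically much larger than cosine similarities; without it, the sparse scores would dominate the combined ranking.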

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Move the cross-encoder to the device selected by get_device()
cross_encoder_model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2').to(device)
cross_encoder_tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [12]:
# Define the retrieval functions
from scipy.stats import zscore

# BM25 Sparse Retrieval
def bm25_retrieve(query, bm25, top_k=5):
    """
    Perform sparse retrieval using BM25 on the tokenized documents.
    Returns the indices and scores of the top-k documents.
    """
    tokenized_query = preprocess(query)                                     # Tokenize the query into words
    scores = bm25.get_scores(tokenized_query)                                   # Get BM25 scores for all documents
    top_k_indices = np.argsort(scores)[-top_k:][::-1]                           # Get indices of top-k documents based on BM25 score
    return top_k_indices, scores[top_k_indices]

# Dense Retrieval
def dense_retrieve(query_embedding, doc_embeddings, top_k=5):
    """
    Perform dense retrieval using cosine similarity between query and document embeddings.
    Returns the indices and similarities of the top-k documents.
    """
    similarities = cosine_similarity([query_embedding], doc_embeddings)[0]      # Compute cosine similarity
    top_k_indices = np.argsort(similarities)[-top_k:][::-1]                     # Get top-k indices based on similarity
    return top_k_indices, similarities[top_k_indices]


# rank fusion retrieval
def combsum_fusion(dense_indices, dense_scores, sparse_indices, sparse_scores, top_k=5):
    # Combine indices and scores from dense and sparse sources
    all_doc_ids = np.concatenate((sparse_indices, dense_indices))
    all_scores = np.concatenate((sparse_scores, dense_scores))

    # Aggregate scores for each document
    combined_scores = collections.defaultdict(float)
    for doc_id, score in zip(all_doc_ids, all_scores):
        combined_scores[doc_id] += score

    # Retrieve top-k documents based on combined scores
    top_docs = heapq.nlargest(top_k, combined_scores.items(), key=lambda x: x[1])

    # return top k indices and scores
    return [doc[0] for doc in top_docs], [doc[1] for doc in top_docs]

# Neural reranking for cascading retrieval
def neural_rerank(query_text, dense_indices, dense_scores, sparse_indices, sparse_scores, top_k=5):
    """
    Rerank the union of sparse and dense candidates with the cross-encoder.
    The first-stage scores are accepted as arguments but are not used.
    Returns the indices and cross-encoder scores of the top-k documents.
    """
    doc_ids = np.concatenate((sparse_indices, dense_indices))
    documents = [all_docs[doc]['abstract'] for doc in doc_ids]
    features = cross_encoder_tokenizer([query_text]*len(documents), documents, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        scores = cross_encoder_model(**features).logits

    # Rerank the documents by score (duplicate candidates collapse to a single dictionary entry)
    doc_scores = {doc_id: score.item() for doc_id, score in zip(doc_ids, scores)}
    reranked_doc_scores = dict(sorted(doc_scores.items(), key=lambda item: item[1], reverse=True))

    # Return the top-k indices and scores
    return list(reranked_doc_scores.keys())[:top_k], list(reranked_doc_scores.values())[:top_k]

After defining all the functions, we run the retrieval experiments and collect the results of each method in a dictionary.

In [13]:
# Run retrieval experiments
def run_retrieval_experiments():
    """
    Execute sparse, dense, rank fusion, and cascading retrieval for all queries.
    Return the results as a dictionary keyed by retrieval method.
    """
    results = {"sparse": [], "dense": [], "rank_fusion": [], "cascade": []}

    print("Running retrieval experiments on all queries.")

    # Iterate over each query and its embedding
    for query, query_embedding in tqdm(zip(all_queries, query_embeddings), total=len(all_queries)):
        # Extract the query ID and text for the current query
        query_id = query['query_id']
        query_text = query['title']

        # Sparse Retrieval using BM25
        sparse_indices, sparse_scores = bm25_retrieve(query_text, bm25)                 # Retrieve the top-k BM25 documents and their scores
        sparse_docs = [all_docs[idx]['doc_id'] for idx in sparse_indices]               # Get document IDs from the indices

        # Dense Retrieval using cosine similarity
        dense_indices, dense_scores = dense_retrieve(query_embedding, doc_embeddings)   # Retrieve the top-k documents based on cosine similarity of embeddings
        dense_docs = [all_docs[idx]['doc_id'] for idx in dense_indices]

        # Normalize scores
        sparse_scores = zscore(sparse_scores)
        dense_scores = zscore(dense_scores)
        results["sparse"].append({"query": query, "results": sparse_docs, "scores": sparse_scores}) # Store the BM25 results for the current query
        results["dense"].append({"query": query, "results": dense_docs, "scores": dense_scores})

        # Rank Fusion Retrieval by combining sparse (BM25) and dense result
        fusion_indices, fusion_scores = combsum_fusion(dense_indices, dense_scores, sparse_indices, sparse_scores)
        fusion_docs = [all_docs[idx]['doc_id'] for idx in fusion_indices]
        results["rank_fusion"].append({"query": query, "results": fusion_docs, "scores": fusion_scores})

        # Cascade Retrieval: compute sparse and dense retrieval, then use reranker
        cascade_indices, cascade_scores = neural_rerank(query_text, dense_indices, dense_scores, sparse_indices, sparse_scores)
        cascade_docs = [all_docs[idx]['doc_id'] for idx in cascade_indices]
        results["cascade"].append({"query": query, "results": cascade_docs, "scores": cascade_scores})
    return results

results = run_retrieval_experiments()


Running retrieval experiments on all queries.


100%|██████████| 50/50 [00:54<00:00,  1.09s/it]


After the experiment, we evaluate each method by passing its run results to `pytrec_eval`.
We also aggregate the metrics to make the differences between methods easier to compare.
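
The aggregation step amounts to averaging each metric over all queries. A minimal sketch, assuming per-query results shaped as a dictionary of `{query_id: {metric: value}}`; the helper name `mean_metrics` and the toy values are hypothetical.

```python
def mean_metrics(per_query):
    """Average every metric over all queries (a macro-average)."""
    if not per_query:
        return {}
    # Take the metric names from the first query's results
    metric_names = list(next(iter(per_query.values())))
    return {
        metric: sum(res[metric] for res in per_query.values()) / len(per_query)
        for metric in metric_names
    }


# Toy per-query results for two queries
toy_results = {
    "1": {"recall_5": 0.0, "ndcg_cut_5": 0.5},
    "2": {"recall_5": 0.5, "ndcg_cut_5": 1.0},
}
summary = mean_metrics(toy_results)
print(summary)
```

Because every query contributes equally to the mean, a handful of queries with unusually many relevant documents can pull the averaged recall down noticeably.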

In [14]:
run_sparse = prepare_run_data(results["sparse"])
run_dense = prepare_run_data(results["dense"])
run_rank_fusion = prepare_run_data(results["rank_fusion"])
run_cascade = prepare_run_data(results["cascade"])

# Evaluate results with pytrec_eval
evaluator = pytrec_eval.RelevanceEvaluator(qrels_dict, {'recall.5', 'ndcg_cut.5'})
eval_results_sparse = evaluator.evaluate(run_sparse)
eval_results_dense = evaluator.evaluate(run_dense)
eval_results_rank_fusion = evaluator.evaluate(run_rank_fusion)
eval_results_cascade = evaluator.evaluate(run_cascade)

# Aggregate metrics for overall performance (mean over all queries)
eval_results = {
    "sparse": eval_results_sparse,
    "dense": eval_results_dense,
    "rank_fusion": eval_results_rank_fusion,
    "cascade": eval_results_cascade,
}
aggregated_results = {
    method: {
        metric: sum(res[metric] for res in per_query.values()) / len(per_query)
        for metric in next(iter(per_query.values()))
    }
    for method, per_query in eval_results.items()
}

print("Aggregated results:", json.dumps(aggregated_results, indent=4))
print("Retrieval results and metrics saved to files.")

Aggregated results: {
    "sparse": {
        "recall_5": 0.008623629150449596,
        "ndcg_cut_5": 0.6805147106092604
    },
    "dense": {
        "recall_5": 0.008256141329265207,
        "ndcg_cut_5": 0.6636285607092102
    },
    "rank_fusion": {
        "recall_5": 0.008793696779750117,
        "ndcg_cut_5": 0.7045910900616086
    },
    "cascade": {
        "recall_5": 0.010368315231514647,
        "ndcg_cut_5": 0.7947962701630851
    }
}
Retrieval results and metrics saved to files.



### Interpretation and Observations:

1. **Recall**: Across all methods, the recall is notably low. This can be attributed to the evaluation cutoff and to specific characteristics of the TREC-COVID dataset:
   - **Small Cutoff**: Recall@5 can be at most 5 divided by the number of relevant documents for a query, so when a query has hundreds of judged relevant documents the value stays near zero even if all five retrieved documents are relevant.
   - **Sparse Content**: Many documents in the dataset lack substantial textual content or have poorly formatted abstracts, making it challenging for both sparse and dense retrieval methods to capture relevant information.
   - **Duplicated Entries**: Some documents appear as near duplicates, causing retrieval methods to focus on similar results, potentially missing diverse relevant documents.
   - **Dataset Complexity**: The nature of the dataset's queries and documents may also contribute to difficulty in achieving higher recall, particularly if relevant documents are highly context-specific or dispersed across the corpus.

2. **NDCG (Normalized Discounted Cumulative Gain)**: 
   - NDCG values are relatively higher, with the cascade method showing the best performance (0.7948). This metric reflects that the top-ranked documents, when retrieved, align better with the relevance judgments, indicating that the retrieval methods are effective in ranking relevant documents higher in the results.

3. **Method Performance**:
   - **Sparse Retrieval** (BM25): Performs decently in terms of NDCG but struggles with recall, possibly due to its reliance on exact term matches, which can miss semantically relevant documents.
   - **Dense Retrieval**: Provides slightly lower recall and NDCG compared to sparse methods, highlighting challenges in embedding-based approaches when dealing with sparse or noisy data.
   - **Rank Fusion**: Combines strengths of sparse and dense methods, leading to moderate improvements in both metrics.
   - **Cascade Retrieval**: Achieves the best overall performance, benefiting from first-stage candidates drawn from both sparse and dense retrieval followed by neural reranking, in which a pretrained cross-encoder refines the ranking by scoring each query-document pair jointly.
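
To see how a high NDCG@5 can coexist with a very low Recall@5, consider the sketch below. It uses a common formulation of the two metrics (linear gain, log2 rank discount); the function names and toy judgments are assumptions of this example, and the formulas are not guaranteed to match `pytrec_eval` in every detail.

```python
import math


def recall_at_k(ranked_ids, qrels, k=5):
    """Fraction of all relevant documents that appear in the top-k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, qrels, k=5):
    """DCG of the ranking divided by the DCG of an ideal ranking of the judged documents."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments: 200 relevant documents for one query, all with grade 1
qrels = {f"d{i}": 1 for i in range(200)}
ranking = ["d0", "d1", "d2", "d3", "d4"]  # five retrieved documents, all relevant

recall = recall_at_k(ranking, qrels)
ndcg = ndcg_at_k(ranking, qrels)
print(recall, ndcg)
```

In this toy case the ranking is ideal at depth 5, yet recall is only 5 out of 200, which mirrors the pattern in the aggregated results.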

# Section 4: QA with Language Model

In [15]:
# QA for a single query (QUERY_INDEX is 1-based, so 3 selects the third query)
QUERY_INDEX = 3                                                     # Index of the query to be used for retrieval
query = all_queries[QUERY_INDEX - 1]                                # Select the query from the list based on the index
query_text = query['title'] if isinstance(query, dict) else query   # Get the query text

# Perform dense retrieval using query embedding and document embeddings
dense_top_k_indices, dense_top_k_scores = dense_retrieve(query_embeddings[QUERY_INDEX-1], doc_embeddings)
# Perform sparse retrieval using BM25 on the query text
sparse_top_k_indices, sparse_top_k_scores = bm25_retrieve(query_text, bm25)
# Perform rank fusion retrieval by combining BM25 and dense retrieval results
rank_top_k_indices, rank_top_k_scores = combsum_fusion(dense_top_k_indices, dense_top_k_scores, sparse_top_k_indices, sparse_top_k_scores)
# Perform cascading retrieval: rerank the sparse and dense candidates with the cross-encoder
cascading_top_k_indices, cascading_top_k_scores = neural_rerank(query_text, dense_top_k_indices, dense_top_k_scores, sparse_top_k_indices, sparse_top_k_scores)

# Get retrieved documents for each method
dense_retrieved_docs = [f"Document {i+1}: {all_docs[idx]['context']}" for i, idx in enumerate(dense_top_k_indices)]
sparse_retrieved_docs = [f"Document {i+1}: {all_docs[idx]['context']}" for i, idx in enumerate(sparse_top_k_indices)]
rank_retrieved_docs = [f"Document {i+1}: {all_docs[idx]['context']}" for i, idx in enumerate(rank_top_k_indices)]
cascading_retrieved_docs = [f"Document {i+1}: {all_docs[idx]['context']}" for i, idx in enumerate(cascading_top_k_indices)]

# Definition of the model that will be used to generate the various responses.
lm_pipeline = pipeline("text-generation",
                      model="meta-llama/Llama-3.2-1B",
                      device=0 if device == "cuda" else -1)

tokenizer = lm_pipeline.tokenizer
tokenizer.pad_token = tokenizer.eos_token
lm_pipeline.tokenizer = tokenizer


Device set to use cuda:0


In [16]:
INSTRUCTIONS = "Answer the user's QUESTION using the CONTEXT text above in a clear and conversational tone. Keep your answer grounded in the facts of the CONTEXT. Avoid structured formats. If the CONTEXT doesn’t contain the facts to answer the QUESTION return {NONE}"
ANSWER = "Answer:\n"

def build_prompt(query_text, retrieved_docs):
  context = "\n".join(retrieved_docs)
  prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{query_text}\n\nINSTRUCTIONS:\n{INSTRUCTIONS}\n\n{ANSWER}"
  return prompt

#### Question-answering using DENSE RETRIEVAL

In [17]:
print("------------------ DENSE RETRIEVAL ----------------------\n")
prompt = build_prompt(query_text, dense_retrieved_docs)

print(f"----------------- Length of the prompt -----------------\n{len(prompt.split())} words")
print(f"------------------------ Prompt ------------------------\n{prompt}")

# Generate response
response = lm_pipeline(prompt,
                       max_new_tokens=150,
                       temperature=0.3,
                       pad_token_id=tokenizer.eos_token_id,
                       truncation=True,
                       padding=True
                       )[0]["generated_text"]

response = response.split(prompt)[1].strip()
print(f"------------------ Response ------------------\n{response}")

------------------ DENSE RETRIEVAL ----------------------

----------------- Length of the prompt -----------------
309 words
------------------------ Prompt ------------------------
CONTEXT:
Document 1: Understanding the properties and mechanisms by which antibodies provide protection is essential to defining immunity. Although neutralizing antibodies have been proposed as a potential key mechanism of protection against many viral pathogens, antibodies mediate additional immune functions that may have both protective and pathological consequences. Dissecting these properties against SARS-CoV-2 is likely necessary for defining metrics of immunity that will inform the design of vaccines and therapeutics and improve clinical management.
Document 2: We thank Dr McDonald ([1][1]) for his close reading of our paper ([2][2]) and acknowledge that he makes important arguments for exercising precautions in order to prevent built environment-mediated transmission of SARS-CoV-2 We would like to a

#### Question-answering using SPARSE RETRIEVAL

In [18]:
print("------------------ SPARSE RETRIEVAL ----------------------\n")
prompt = build_prompt(query_text, sparse_retrieved_docs)

print(f"----------------- Length of the prompt -----------------\n{len(prompt.split())} words")
print(f"------------------------ Prompt ------------------------\n{prompt}")

# Generate response
response = lm_pipeline(prompt,
                       max_new_tokens=150,
                       temperature=0.3,
                       pad_token_id=tokenizer.eos_token_id,
                       truncation=True,
                       padding=True
                       )[0]["generated_text"]

response = response.split(prompt)[1].strip()
print(f"------------------ Response ------------------\n{response}")

------------------ SPARSE RETRIEVAL ----------------------

----------------- Length of the prompt -----------------
872 words
------------------------ Prompt ------------------------
CONTEXT:
Document 1: It has been unclear why the new severe acute respiratory syndrome coronavirus (sars‐CoV‐2) hits a small minority hard, while the vast majority of children appear to be protected and develop mild or no disease (1,2). The editorial by Brodin suggests some possible mechanisms why it is so (1). I would like to emphasize the significance of cross immunity due to previous exposure to seasonal coronavirus; it may be a plausible explanation for why children appear to be protected (2,3).
Document 2: Actinobacillus pleuropneumoniae (A. pleuropneumoniae/APP) is the pathogen that causes porcine contagious pleuropneumonia. Actinobacillus pleuropneumoniae is divided into 18 serovars, and the cross protection efficacy of epitopes is debatable, which has resulted in the slow development of a vaccine.

#### Question-answering using RANK FUSION

In [19]:
print("------------------ RANK FUSION ----------------------\n")
prompt = build_prompt(query_text, rank_retrieved_docs)

print(f"----------------- Length of the prompt -----------------\n{len(prompt.split())} words")
print(f"------------------------ Prompt ------------------------\n{prompt}")

# Generate response
response = lm_pipeline(prompt,
                       max_new_tokens=150,
                       temperature=0.3,
                       pad_token_id=tokenizer.eos_token_id,
                       truncation=True,
                       padding=True
                       )[0]["generated_text"]

response = response.split(prompt)[1].strip()
print(f"------------------ Response ------------------\n{response}")

------------------ RANK FUSION ----------------------

----------------- Length of the prompt -----------------
872 words
------------------------ Prompt ------------------------
CONTEXT:
Document 1: It has been unclear why the new severe acute respiratory syndrome coronavirus (sars‐CoV‐2) hits a small minority hard, while the vast majority of children appear to be protected and develop mild or no disease (1,2). The editorial by Brodin suggests some possible mechanisms why it is so (1). I would like to emphasize the significance of cross immunity due to previous exposure to seasonal coronavirus; it may be a plausible explanation for why children appear to be protected (2,3).
Document 2: Actinobacillus pleuropneumoniae (A. pleuropneumoniae/APP) is the pathogen that causes porcine contagious pleuropneumonia. Actinobacillus pleuropneumoniae is divided into 18 serovars, and the cross protection efficacy of epitopes is debatable, which has resulted in the slow development of a vaccine. Cons
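
The cell above recovers the answer with `response.split(prompt)[1]`, which raises an `IndexError` whenever the generated text does not contain the prompt verbatim (this could happen, for example, if the input is truncated). A minimal sketch of a more defensive extraction follows; `extract_answer` and the demo strings are hypothetical and not part of the project code:

```python
def extract_answer(generated_text: str, prompt: str) -> str:
    """Return the text generated after the prompt.

    Falls back to the full generated text when the prompt
    is not found verbatim (hypothetical helper, for illustration only).
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # partition returns (before, separator, after); the separator is empty
    # when the prompt does not occur in the generated text
    _, sep, after = generated_text.partition(prompt)
    return after.strip() if sep else generated_text.strip()


# Made-up strings, used only to exercise both branches
demo_prompt = "Question:\nIs cross protection possible?\n\nAnswer:"
print(extract_answer(demo_prompt + " It may be.", demo_prompt))
print(extract_answer("Unrelated output", demo_prompt))
```

With the Hugging Face text-generation pipeline, passing `return_full_text=False` in the call is another way to receive only the completion, which avoids the string handling altogether.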

#### Question-answering using CASCADING RETRIEVAL

In [20]:
print("------------------ CASCADING RETRIEVAL ----------------------\n")
prompt = build_prompt(query_text, cascading_retrieved_docs)

print(f"----------------- Length of the prompt -----------------\n{len(prompt.split())} words")
print(f"------------------------ Prompt ------------------------\n{prompt}")

# Generate response
response = lm_pipeline(prompt,
                       max_new_tokens=150,
                       temperature=0.3,
                       pad_token_id=tokenizer.eos_token_id,
                       truncation=True,
                       padding=True)[0]["generated_text"]

response = response.split(prompt)[1].strip()
print(f"------------------ Response ------------------\n{response}")

------------------ CASCADING RETRIEVAL ----------------------

----------------- Length of the prompt -----------------
552 words
------------------------ Prompt ------------------------
CONTEXT:
Document 1: It has been unclear why the new severe acute respiratory syndrome coronavirus (sars‐CoV‐2) hits a small minority hard, while the vast majority of children appear to be protected and develop mild or no disease (1,2). The editorial by Brodin suggests some possible mechanisms why it is so (1). I would like to emphasize the significance of cross immunity due to previous exposure to seasonal coronavirus; it may be a plausible explanation for why children appear to be protected (2,3).
Document 2: In contrast with adults, children infected by severe acute respiratory syndrome-corona virus (SARS-CoV) develop milder clinical symptoms. Because of this, it is speculated that children vaccinated with various childhood vaccines might develop cross immunity against SARS-CoV. Antisera and T cells

#### Question-answering WITHOUT RAG (NO CONTEXT PROVIDED)

In [21]:
print("------------------ RESPONSE WITHOUT RAG ----------------------\n")
prompt = f"""Question:\n{query_text}\n\nAnswer in a concise and clear manner without repetition (if no direct answer, provide a general summary):"""

print(f"----------------- Length of the prompt -----------------\n{len(prompt.split())} words")
print(f"------------------------ Prompt ------------------------\n{prompt}")

response = lm_pipeline(prompt,
                       max_new_tokens=150,
                       temperature=0.3,
                       pad_token_id=tokenizer.eos_token_id,
                       truncation=True,
                       padding=True)[0]["generated_text"]

response = response.split("Answer in a concise and clear manner without repetition (if no direct answer, provide a general summary):")[1].strip()
print(f"------------------ Response ------------------\n{response}")

------------------ RESPONSE WITHOUT RAG ----------------------

----------------- Length of the prompt -----------------
28 words
------------------------ Prompt ------------------------
Question:
will SARS-CoV2 infected people develop immunity? Is cross protection possible?

Answer in a concise and clear manner without repetition (if no direct answer, provide a general summary):
------------------ Response ------------------



## Model Response Evaluation
In this section, we analyze the responses generated by Llama. Each response receives a numerical relevance score, accompanied by a short textual motivation that explains the rating. The details for each query are saved in a structured CSV file for further analysis.

In [42]:
# Helper for appending evaluation entries to a CSV file
import csv

def add_entry_to_file(filename, query, context_rank, context_casc, response_rank, response_casc, evaluation_rank, evaluation_casc, motivation):
    """
    Append one evaluation entry to the CSV file.
    The header row is written first if the file is still empty.
    """

    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)

        # In append mode the position is 0 only when the file is empty
        if file.tell() == 0:
            writer.writerow(["Query", "Context_rank_fusion", "context_cascading", "Response_rank_fusion", "Response_cascading", "Evaluation_rank_fusion", "Evaluation_cascading", "Motivation"])

        writer.writerow([query['title'], context_rank, context_casc, response_rank, response_casc, evaluation_rank, evaluation_casc, motivation])

filename = "model_evaluation.csv"
evaluation = "TODO"
motivation = "TODO"
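
Once the `TODO` placeholders have been replaced with human scores, the file can be summarized per retrieval method. The following is a minimal sketch, assuming the column names written by `add_entry_to_file` above and integer scores from 0 to 2. The helper name `mean_scores`, the demo file name, and the sample rows are hypothetical and serve only to make the example runnable:

```python
import csv
import os
import tempfile

SCORE_COLUMNS = ["Evaluation_rank_fusion", "Evaluation_cascading"]

def mean_scores(filename):
    """Return the mean relevance score per method, skipping unscored rows."""
    totals = {col: 0 for col in SCORE_COLUMNS}
    counts = {col: 0 for col in SCORE_COLUMNS}
    with open(filename, newline='', encoding='utf-8') as file:
        for row in csv.DictReader(file):
            for col in SCORE_COLUMNS:
                value = (row.get(col) or "").strip()
                if value.isdigit():  # ignores placeholders such as "TODO"
                    totals[col] += int(value)
                    counts[col] += 1
    return {col: (totals[col] / counts[col] if counts[col] else None)
            for col in SCORE_COLUMNS}

# Hypothetical rows: two scored queries and one still marked as TODO
header = ["Query", "Context_rank_fusion", "context_cascading",
          "Response_rank_fusion", "Response_cascading",
          "Evaluation_rank_fusion", "Evaluation_cascading", "Motivation"]
rows = [
    ["query A", "ctx", "ctx", "resp", "resp", "2", "1", "example motivation"],
    ["query B", "ctx", "ctx", "resp", "resp", "1", "1", "example motivation"],
    ["query C", "ctx", "ctx", "resp", "resp", "TODO", "TODO", "TODO"],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_evaluation_demo.csv")
    with open(path, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(rows)
    summary = mean_scores(path)

print(summary)
```

Rows that still contain the `TODO` placeholder are skipped rather than counted as zero, so a partially annotated file does not drag the averages down.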

In [46]:
import random

picked_queries = random.sample(all_queries, 20)


for q in picked_queries:

    # For each query, retrieve and rank documents independently
    query_text = q['title']

    # First-stage retrieval runs before re-ranking, so both methods use the candidates of the current query
    dense_top_k_indices, dense_top_k_scores = dense_retrieve(query_embeddings[int(q['query_id'])-1], doc_embeddings)
    sparse_top_k_indices, sparse_top_k_scores = bm25_retrieve(query_text, bm25)

    # Cascading retrieval: re-rank the first-stage candidates
    cascading_top_k_indices, cascading_top_k_scores = neural_rerank(query_text, dense_top_k_indices, dense_top_k_scores, sparse_top_k_indices, sparse_top_k_scores)
    cascading_retrieved_docs = [f"Document {i+1}: {all_docs[idx]['abstract']}" for i, idx in enumerate(cascading_top_k_indices)]
    cascading_context = "\n".join(cascading_retrieved_docs)

    # Rank fusion: combine dense and sparse scores
    rank_top_k_indices, rank_top_k_scores = combsum_fusion(dense_top_k_indices, dense_top_k_scores, sparse_top_k_indices, sparse_top_k_scores)
    rank_retrieved_docs = [f"Document {i+1}: {all_docs[idx]['abstract']}" for i, idx in enumerate(rank_top_k_indices)]
    rank_fusion_context = "\n".join(rank_retrieved_docs)

    cascading_prompt = build_prompt(query_text, cascading_retrieved_docs)
    rank_fusion_prompt = build_prompt(query_text, rank_retrieved_docs)
    # cascading_prompt = f"Context:\n{cascading_context}\n\nQuestion:\n{query_text}\n\nAnswer in a concise and clear manner without repetition (if no direct answer, provide a general summary):"
    # rank_fusion_prompt = f"Context:\n{rank_fusion_context}\n\nQuestion:\n{query_text}\n\nAnswer in a concise and clear manner without repetition (if no direct answer, provide a general summary):"

    # Generate response using language model
    cascading_response = lm_pipeline(cascading_prompt,
                                      max_new_tokens=150,
                                      temperature=0.3,
                                      pad_token_id=tokenizer.eos_token_id,
                                      truncation=True,
                                      padding=True)[0]["generated_text"]

    rank_fusion_response = lm_pipeline(rank_fusion_prompt,
                                        max_new_tokens=150,
                                        temperature=0.3,
                                        pad_token_id=tokenizer.eos_token_id,
                                        truncation=True,
                                        padding=True)[0]["generated_text"]

    # Extract the answer from the response
    cascading_response = cascading_response.split(cascading_prompt)[1].strip()
    rank_fusion_response = rank_fusion_response.split(rank_fusion_prompt)[1].strip()

    # Print the results
    print("\n------------------------------")
    print(f"QUERY: {query_text}")
    print(f"CASCADING RESPONSE: {cascading_response}")
    print(f"RANK FUSION RESPONSE: {rank_fusion_response}")
    print("------------------------------\n")

    # Store the results in the CSV file (scores and motivation are placeholders until annotated)
    add_entry_to_file(filename, q, rank_fusion_context, cascading_context, rank_fusion_response, cascading_response, evaluation, evaluation, motivation)


------------------------------
QUERY: What new public datasets are available related to COVID-19?
CASCADING RESPONSE: The COVID-19 pandemic is a global health crisis that has affected millions of people worldwide. The virus has been identified as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes coronavirus disease 2019 (COVID-19). The virus has been found to spread rapidly and has been linked to severe respiratory symptoms, including pneumonia, in many people. The virus has also been found to spread through contact with infected individuals, and it has been reported that it can be transmitted through the air. The virus has been found to be highly contagious and can be spread from person to person through close contact, such as touching or coughing on someone. The virus has also been found to be highly contagious and can be spread through contact with
RANK FUSION RESPONSE: 1. The editorial by Brodin suggests some possible mechanisms why it is so (1). I would l