Implementation of a comparison between RAG and T5 based on their generated answers. This is a project by Konstantina Ellina and Pablo de Vicente Abad for the course Case Studies in Data Science and AI at the University of Antwerp.

ATTENTION!
If you run the notebook in Google Colab, make sure you have uploaded all the required files (dataset_processing.py, handle_models.py, eval_data.json) provided on GitHub so that all functions can be imported. For more details on the code, see the corresponding Python files used in each step.

In [1]:
# Requirements
!pip -q install kagglehub
!pip -q install transformers
!pip -q install datasets
!pip -q install bert_score
!pip -q install evaluate


In [2]:
# some preparation steps (needed when running in Google Colab)
!mkdir -p logs
import sys

# clone the repository
!rm -rf Pleias-RAG-Library/
!git clone --quiet https://github.com/Pleias/Pleias-RAG-Library

# install the cloned package in development mode
%cd Pleias-RAG-Library
!pip install -e . -q

%cd ..
sys.path.append('/content/Pleias-RAG-Library')

/content/Pleias-RAG-Library
  Preparing metadata (setup.py) ... done

In [3]:
# import the functions from the python files
import dataset_processing, handle_models
from dataset_processing import transform_paper_to_source, create_demo_jsonl, load_sources_jsonl
from handle_models import load_qa_models, query_rag, query_t5

In [4]:
# download the dataset and create the json file for the sources
import kagglehub
import os

print("Downloading ArXiv dataset from Kaggle...")
path = kagglehub.dataset_download("Cornell-University/arxiv")
print(f"Dataset downloaded to: {path}")

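print("\nFiles in the downloaded directory:")
name_file = None
for root, dirs, files in os.walk(path):
    for file in files:
        name_file = os.path.join(root, file)
        print(name_file, '\n')

# the dataset ships a single metadata file, so the last path found is the one we need
if name_file is None:
    raise FileNotFoundError(f"No files found under {path}")

input_path = os.path.expanduser(name_file)
corpus = 'arxiv_demo_20_sources.jsonl'
create_demo_jsonl(input_path, corpus)  # write the first 20 sources to a JSONL file in the format RAG accepts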

Downloading ArXiv dataset from Kaggle...
Dataset downloaded to: /kaggle/input/arxiv

Files in the downloaded directory:
/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json 

Saved 20 demo sources to arxiv_demo_20_sources.jsonl
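
The cell above relies on create_demo_jsonl to take only the first 20 records of a very large metadata file. The general idea of reading a fixed number of records without loading the whole file can be sketched as follows. This is only an illustration, not the code in dataset_processing.py: head_jsonl is a hypothetical helper, and it assumes the snapshot stores one JSON object per line.

```python
import json
from itertools import islice


def head_jsonl(path, n=20):
    """Return the first n non-empty records of a JSON Lines file.

    Hypothetical helper: assumes one JSON object per line.
    The file is read lazily, so the remaining lines are never parsed.
    """
    with open(path, "r", encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_empty, n)]
```

For example, head_jsonl(input_path, 20) would return a list of at most 20 dictionaries, which could then be converted to the source format that RAG expects.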


In [5]:
# add one more line for the trivial question
new_line = '{"text": "Brussels is the capital city of Belgium. Other major cities of Belgium are Antwerp, Ghent, Charleroi, Liège, Bruges, Namur, and Leuven. Situated in a coastal lowland region known as the Low Countries, it is bordered by the Netherlands to the north, Germany to the east, Luxembourg to the southeast, France to the south, and the North Sea to the west.", "metadata": {"authors": "Wikipedia", "title": "An introduction about Belgium", "update_date": "2025-05-17", "reliability": "high"}}\n'

input_file = "arxiv_demo_20_sources.jsonl"
json_file = "arxiv_21_sources.jsonl"

with open(input_file, "r", encoding="utf-8") as infile, open(json_file, "w", encoding="utf-8") as outfile:
    # Write the new line first
    outfile.write(new_line)
    # Then copy the rest of the original file
    for line in infile:
        outfile.write(line)
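
Because the extra source is written by hand as a raw string, a quick sanity check of the resulting file can catch a malformed line before the models are loaded. The sketch below assumes that every source line is a JSON object with the "text" and "metadata" keys used in new_line above; validate_sources_jsonl is a hypothetical helper and is not part of dataset_processing.py.

```python
import json


def validate_sources_jsonl(path, required_keys=("text", "metadata")):
    """Count the valid source lines in a JSONL file.

    Hypothetical helper: raises ValueError on the first line that is not
    a JSON object or that lacks one of the required keys.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no}: invalid JSON ({exc})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = set(required_keys) - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            count += 1
    return count
```

Calling validate_sources_jsonl(json_file) returns the number of valid lines, which can be compared with the number of sources reported when the file is loaded.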

In [6]:
# load sources and models
sources = load_sources_jsonl(json_file)
print(f"Loaded {len(sources)} sources")

rag, t5 = load_qa_models() # load rag and t5 models

Loaded 21 sources
CUDA available: False
Loading model with transformers from PleIAs/Pleias-RAG-350M...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Model loaded successfully with transformers
-------RAG Loaded correctly-------



Device set to use cpu


--------T5 Loaded correctly---------


In [7]:
# call this function to get the results of both models for a specific query
def rag_t5(query, sources, rag, t5):

    # --- RAG ---
    rag_result = query_rag(query, sources, rag)
    if rag_result.get('error'):
        print("RAG error:", rag_result['error'])
    else:
        print("RAG answer:", rag_result['response']['processed']['clean_answer'])
        print(f"(took {rag_result['time']:.2f}s)\n\n")

    # --- T5 ---
    t5_out = query_t5(query, t5)
    if t5_out.get('error'):
        print("T5 error:", t5_out['error'])
    else:
        print("T5 answer:", t5_out['response'])
        print(f"(took {t5_out['time']:.2f}s)")
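
rag_t5 only prints its results. When the answers are needed later, for example to compare timings across several queries, a variant that returns them may be more convenient. The sketch below is a hypothetical helper, not part of handle_models.py. It assumes the result dictionaries have the same keys as those read in rag_t5 ('error', 'response', 'time'), and it takes the two query functions as arguments so that it can be run without the project files.

```python
def collect_answers(query, sources, rag, t5, rag_fn, t5_fn):
    """Run both models on one query and return their answers and timings.

    Hypothetical helper: rag_fn and t5_fn are expected to behave like
    query_rag and query_t5, returning a dict with either an 'error' key
    or 'response' and 'time' keys.
    """

    def unpack(result, extract):
        # Normalise one model's result into answer / time / error fields.
        if result.get("error"):
            return {"answer": None, "time": None, "error": result["error"]}
        return {"answer": extract(result), "time": result.get("time"), "error": None}

    rag_result = rag_fn(query, sources, rag)
    t5_result = t5_fn(query, t5)
    return {
        "query": query,
        "rag": unpack(rag_result, lambda r: r["response"]["processed"]["clean_answer"]),
        "t5": unpack(t5_result, lambda r: r["response"]),
    }
```

In this notebook it could be called as collect_answers(query, sources, rag, t5, query_rag, query_t5).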

In [8]:
# standard question
query = "Does the dark matter field describe the evolution of the Earth-Moon system?"
rag_t5(query, sources, rag, t5)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


RAG answer: The dark matter field in the Earth-Moon system represents a fascinating intersection of astrophysics and cosmology. Here's what we know about its evolution and impact:

The dark matter field in the Earth-Moon system has been extensively studied, with recent research showing that the dark matter field can be described by a specific mathematical formulation that accounts for the evolution of the Earth-Moon system[1].

The dark matter field follows a specific pattern: it follows a particular pattern of behavior, with the dark matter field being approximately equal to the total dark matter density. This density is typically around 25,000 km/s at 4.39 x 10^3 solar masses per square degree, which is far less than the Roche's limit of 25,000 km/s[2][3].

The dark matter field's behavior is influenced by various factors, including:
- The particle's velocity
- The particle's density
- The particle's spin
- The particle's spin-orbit angle
- The particle's orbital angular momentum[4]


In [9]:
# trivial question
query = "What is the capital of Belgium?"
rag_t5(query, sources, rag, t5)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG answer: The capital city of Belgium is Brussels [1]. This is confirmed by the fact that Brussels is mentioned in the context of the European Union [2].

**Citations**
[1] "Brussels is the capital city of Belgium" [Source 1]
[2] "The European Union is a supranational union of 27 member states" [Source 1]

(took 161.95s)


T5 answer: brussels
(took 6.74s)


In [10]:
# refusal question
query = "Does the Earth–Moon system cause the color of the sky?"
rag_t5(query, sources, rag, t5)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG answer: I notice you're asking about the relationship between Earth-Moon system colors and sky color. Unfortunately, after reviewing the provided documents, I cannot provide any information about this connection. While there is a brief mention of the Earth-Moon system in one of the sources, it doesn't contain any information about its color or its relationship to the sky.

The only reference to the Earth-Moon system in the sources is a brief mention of it being "the North Star of the Earth" and being "the North Star of the Earth"[1], but this doesn't help answer your question about color.

To properly answer your question about the relationship between Earth-Moon system colors and sky color, we would need sources that specifically discuss:
- The color of Earth-Moon system
- How this color relates to the sky
- The relationship between these two concepts

If you're interested in learning more about this topic, I'd recommend consulting astronomical texts or resources that specifically

In [11]:
# Evaluation of the models with a new json file
import json
from evaluate import load

with open("eval_data.json", "r", encoding="utf-8") as f:
    eval_data = json.load(f)

# generate answers
results = []
for item in eval_data:
    q = item["query"]
    rag_out = query_rag(q, sources, rag)
    t5_out = query_t5(q, t5)
    # fall back to an empty string when a model returned an error instead of a response
    rag_ans = rag_out["response"]["processed"]["answer"] if rag_out.get("response") else ""
    t5_ans = t5_out.get("response") or ""
    results.append((item["reference"], rag_ans, t5_ans))

refs, rag_preds, t5_preds = zip(*results)

# compute BERTScore
bertscore = load("bertscore")
b_rag = bertscore.compute(predictions=rag_preds, references=refs, lang="en")
b_t5  = bertscore.compute(predictions=t5_preds, references=refs, lang="en")

print("RAG:")
print("  Precision:", sum(b_rag["precision"]) / len(b_rag["precision"]))
print("  Recall:   ", sum(b_rag["recall"]) / len(b_rag["recall"]))
print("  F1:       ", sum(b_rag["f1"]) / len(b_rag["f1"]))

print("T5:")
print("  Precision:", sum(b_t5["precision"]) / len(b_t5["precision"]))
print("  Recall:   ", sum(b_t5["recall"]) / len(b_t5["recall"]))
print("  F1:       ", sum(b_t5["f1"]) / len(b_t5["f1"]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RAG:
  Precision: 0.803324818611145
  Recall:    0.9043115178743998
  F1:        0.8508170048395792
T5:
  Precision: 0.8366188406944275
  Recall:    0.8372902274131775
  F1:        0.836655835310618
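
The evaluation cell repeats the same averaging expression six times. A small helper can compute the mean precision, recall and F1 once per model and print both models side by side. This is a sketch under the assumption that each result is a dictionary with "precision", "recall" and "f1" lists, as used above; summarize_bertscore and format_comparison are hypothetical names.

```python
from statistics import fmean


def summarize_bertscore(scores):
    """Return the mean precision, recall and F1 of one BERTScore result."""
    return {key: fmean(scores[key]) for key in ("precision", "recall", "f1")}


def format_comparison(named_scores):
    """Build a plain-text table with one row of mean scores per model."""
    lines = [f"{'model':<6} {'precision':>9} {'recall':>9} {'f1':>9}"]
    for name, scores in named_scores.items():
        s = summarize_bertscore(scores)
        lines.append(
            f"{name:<6} {s['precision']:>9.4f} {s['recall']:>9.4f} {s['f1']:>9.4f}"
        )
    return "\n".join(lines)
```

With the variables from the cell above, print(format_comparison({"RAG": b_rag, "T5": b_t5})) would show the two models in one table.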
