# 📘 Nuggetizer: A lightweight nugget-based evaluation framework for pyterrier-rag

## 📌 Introduction
In this notebook, we demonstrate how to evaluate a Retrieval-Augmented Generation (RAG) system
with a semantic, nugget-based evaluation framework inspired by the "AutoNuggetizer" used in the TREC 2024 RAG track.
The goal is to assess the factual informativeness of generated answers through fine-grained nugget detection
and scoring. The setup is not tied to a particular environment and runs on Google Colab with a T4 GPU.

## 🎯 Motivation and Background
- The Problem: Traditional RAG evaluations rely on lexical-overlap metrics such as ROUGE, which can miss semantic correctness.
- The Solution: Nugget evaluation, originally proposed for the TREC QA track in 2003 and revived by AutoNuggetizer, uses semantically atomic facts (“nuggets”) to evaluate answers.
- Inspiration: This library reimplements a simplified, local version of AutoNuggetizer with modular hooks into PyTerrier and HuggingFace models.
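
To make the idea concrete, the sketch below scores a toy answer against a handful of nuggets. It assumes the labelling scheme described for AutoNuggetizer (each nugget is `vital` or `okay`, and each is assigned `support`, `partial_support`, or `not_support`). The `Nugget` class, the `vital_strict` function, and the nugget texts are hypothetical illustrations, not part of this library's API.

```python
from dataclasses import dataclass


@dataclass
class Nugget:
    text: str
    importance: str  # assumed labels: "vital" or "okay"
    assignment: str  # assumed labels: "support", "partial_support", "not_support"


def vital_strict(nuggets: list[Nugget]) -> float:
    """Fraction of vital nuggets that the answer fully supports."""
    vital = [n for n in nuggets if n.importance == "vital"]
    if not vital:
        return 0.0
    return sum(n.assignment == "support" for n in vital) / len(vital)


# Toy nuggets for one query (invented for illustration only).
toy_nuggets = [
    Nugget("toy fact A", "vital", "support"),
    Nugget("toy fact B", "vital", "not_support"),
    Nugget("toy fact C", "okay", "support"),
]
print(vital_strict(toy_nuggets))
```

Only the two vital nuggets count here, and partial support earns no credit, which is what makes this variant "strict".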

## ⚙️ Installation and Setup

In [1]:
!pip install git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag

Collecting git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag
  Cloning https://github.com/MattiWe/ir_datasets.git (to revision add-msmarco-v2.1-trec-rag) to /tmp/pip-req-build-mi7hhf8g
  Running command git clone --filter=blob:none --quiet https://github.com/MattiWe/ir_datasets.git /tmp/pip-req-build-mi7hhf8g
  Running command git checkout -b add-msmarco-v2.1-trec-rag --track origin/add-msmarco-v2.1-trec-rag
  Switched to a new branch 'add-msmarco-v2.1-trec-rag'
  Branch 'add-msmarco-v2.1-trec-rag' set up to track remote branch 'add-msmarco-v2.1-trec-rag' from 'origin'.
  Resolved https://github.com/MattiWe/ir_datasets.git to commit bd018b783e3d25942b69290f7be19eeb929022c2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [2]:
!pip install git+https://github.com/terrier-org/pyterrier

Collecting git+https://github.com/terrier-org/pyterrier
  Cloning https://github.com/terrier-org/pyterrier to /tmp/pip-req-build-iltyqsds
  Running command git clone --filter=blob:none --quiet https://github.com/terrier-org/pyterrier /tmp/pip-req-build-iltyqsds
  Resolved https://github.com/terrier-org/pyterrier to commit b5a7910386a0e860283d08c4f971b93de603bf15
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [3]:
!pip install -q pyterrier_t5 pyterrier_pisa


In [4]:
!pip install -q git+https://github.com/terrierteam/pyterrier_rag.git


In [5]:
!pip install -q --no-deps ../.


In [6]:
import pyterrier as pt
from pyterrier_rag.backend import Backend



# Dataset

In [7]:
import ir_datasets
dataset = ir_datasets.load('msmarco-segment-v2.1')

In [8]:
pt_dataset = pt.get_dataset("irds:msmarco-segment-v2.1")

# Pipelines

In [9]:
def rename_segment(run):
    run = run.rename(columns={"segment": "text"})
    return run
rename_pipe = pt.apply.generic(rename_segment)

In [10]:
import pyterrier_alpha as pta
from pyterrier_pisa import PisaIndex
from pyterrier_t5 import MonoT5ReRanker

index = pta.Artifact.from_hf('namawho/msmarco-segment-v2.1.pisa')
bm25_ret = index.bm25() >> pt.text.get_text(pt_dataset, "segment") >> rename_pipe

monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)
monoT5_ret = bm25_ret % 10 >> monoT5

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
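
In the pipeline above, `>>` chains transformers and `% 10` applies a rank cutoff, so only the ten best BM25 results per query reach monoT5. The pandas sketch below emulates just the cutoff step on a toy run. `rank_cutoff` and the toy data are hypothetical, and this is not how PyTerrier implements the operator internally.

```python
import pandas as pd


def rank_cutoff(run: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep the k best-ranked rows per query (ranks start at 0)."""
    return run[run["rank"] < k].reset_index(drop=True)


# Toy run with two queries (invented identifiers).
toy_run = pd.DataFrame(
    {
        "qid": ["q1", "q1", "q1", "q2", "q2"],
        "docno": ["d1", "d2", "d3", "d4", "d5"],
        "rank": [0, 1, 2, 0, 1],
    }
)
top2 = rank_cutoff(toy_run, 2)
print(top2)
```

Cutting before the re-ranker keeps the expensive monoT5 pass bounded by the cutoff rather than by the size of the BM25 result list.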


# Building a baseline retrieval run to generate baseline nuggets

In [11]:
from datasets import load_dataset
dataset = load_dataset("namawho/trec-raggy-dev")["validation"].to_pandas()
topics_df  = dataset[["qid", "query"]]
answers_df = dataset[["qid", "query", "gold_answer"]]

In [12]:
baseline = (monoT5_ret)(topics_df.head(10))
baseline

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Unnamed: 0,qid,query,docno,text,score,rank
0,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_02_759557285#0_1325339642,Is a landlord liable if a tenant or visitor is...,-0.002110,0
1,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#3_1529122925,"1996), reh'g denied (1996).) If a landlord is ...",-0.035341,6
2,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_02_759557285#1_1325342568,"To do this, the injured person must show that:...",-0.003268,1
3,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#12_1529136555,But if the tenant has a month-to-month rental ...,-0.027581,5
4,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#11_1529134815,most courts hold landlords liable for knowing ...,-0.065491,8
...,...,...,...,...,...,...
95,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_43_701261032#3_1477785431,Multiple Intelligences Test\nBased on the work...,-4.660838,8
96,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_45_911384240#2_1740648328,"He is the director of Harvard Project Zero , A...",-0.016688,2
97,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_01_1630976798#1_2378503407,and some of the issues around its conceptualiz...,-0.748225,6
98,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_44_376498252#0_936606965,Multiple Intelligences (Howard Gardner) - Inst...,-5.776030,9
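
The `score` column above is never positive. Assuming monoT5 reports the log-probability of its "relevant" label (our reading of the re-ranker, not something this notebook states), exponentiating a score gives a rough relevance probability. The two values below are copied from the first and last rows of the table, which belong to different queries.

```python
import math

# Scores copied from the table above (row 0 and row 98).
scores = {"qid 23287, rank 0": -0.002110, "qid 395948, rank 9": -5.776030}

# Assumption: each score is a log-probability, so exp() maps it into [0, 1].
probs = {label: math.exp(score) for label, score in scores.items()}
for label, prob in probs.items():
    print(f"{label}: p(relevant) ~ {prob:.4f}")
```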


# Nuggetizer setup

In [13]:
from pyterrier_rag.backend import HuggingFaceBackend

backend = HuggingFaceBackend(
    "hugging-quants/gemma-2-9b-it-AWQ-INT4",
    max_new_tokens=2048,
    model_args={"device_map": "cuda"},
)
# backend = HuggingFaceBackend(
#     "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
#     max_new_tokens=2048,
#     model_args={"device_map": "auto"},
# )

--- Logging error ---
Traceback (most recent call last):
  File "/opt/miniconda3/envs/nuggetizer/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/opt/miniconda3/envs/nuggetizer/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/opt/miniconda3/envs/nuggetizer/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/opt/miniconda3/envs/nuggetizer/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/opt/miniconda3/envs/nuggetizer/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/miniconda3/envs/nuggetizer/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/miniconda3/envs/nuggetizer/lib/python3.10/site-packages/ipykernel_launcher

In [14]:
from fastchat.conversation import register_conv_template, get_conv_template, Conversation, SeparatorStyle

register_conv_template(
    Conversation(
        name="meta-llama-3.1-sp",
        system_message="",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.ADD_COLON_SINGLE,
        sep="\n",
        messages=[],
    )
)

conv_template = get_conv_template("meta-llama-3.1-sp")
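
The template registered above determines how the system message and the turns are flattened into one prompt string. The plain-Python sketch below emulates that layout without FastChat, assuming `ADD_COLON_SINGLE` emits the system message followed by `role: message` turns, each terminated by `sep`. `render_add_colon_single` is a hypothetical helper written for this illustration.

```python
def render_add_colon_single(system_message, messages, sep="\n"):
    """Emulate a 'role: message<sep>' prompt layout (assumed behaviour)."""
    prompt = system_message + sep
    for role, message in messages:
        if message:
            prompt += f"{role}: {message}{sep}"
        else:
            # An empty turn leaves the prompt open for the model to complete.
            prompt += f"{role}:"
    return prompt


toy_prompt = render_add_colon_single(
    "You are a helpful assistant.",
    [("user", "What is a nugget?"), ("assistant", None)],
)
print(toy_prompt)
```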

In [15]:
import pandas as pd

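def save_csv(path, content):
    """Write a DataFrame to `path` without the index column."""
    content.to_csv(path, index=False)

def load_csv(path):
    """Return the cached DataFrame at `path`, or None if there is no usable cache."""
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        # No usable cache yet: the caller recomputes and saves the result.
        return None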

In [16]:
from open_nuggetizer.nuggetizer import Nuggetizer

nuggetizer = Nuggetizer(
    backend=backend, 
    conversation_template=conv_template,
    verbose=True
)

nuggets = load_csv("nuggets.csv")
if nuggets is None:
    nuggets = nuggetizer.create(baseline)
    save_csv("nuggets.csv", nuggets)

scored_nuggets = load_csv("scored_nuggets.csv")
if scored_nuggets is None:
    scored_nuggets = nuggetizer.score(nuggets)
    save_csv("scored_nuggets.csv", scored_nuggets)
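
The cell above follows a load-or-compute pattern: reuse the CSV if it exists, otherwise run the expensive LLM step and cache the result. A minimal sketch of the same pattern as one helper is shown below. `load_or_compute`, `expensive`, and the toy frame are hypothetical names and data, not part of `open_nuggetizer`.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute):
    """Return the frame cached at `path`, computing and saving it on a miss."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = compute()
    frame.to_csv(path, index=False)
    return frame


calls = []


def expensive():
    """Stand-in for a costly step such as nugget creation."""
    calls.append(1)
    return pd.DataFrame({"qid": [1, 2], "nugget": ["a", "b"]})


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "toy_nuggets.csv")
    first = load_or_compute(cache, expensive)   # cache miss: computes and saves
    second = load_or_compute(cache, expensive)  # cache hit: reads the CSV
print(len(calls))
```

A CSV round trip re-infers dtypes, so identifier columns such as `qid` may come back as integers. Passing `dtype={"qid": str}` to `pd.read_csv` keeps them as strings when later joins expect that.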

# Evaluation

In [17]:
from pyterrier_rag.prompt import Concatenator
from pyterrier_rag.readers import Reader
from pyterrier_rag.prompt import PromptTransformer
from jinja2 import Template

def make_callable_template(template: Template):
    def template_call(**kwargs):
        return template.render(**kwargs)

    return template_call

GENERIC_PROMPT = Template(
    "Use the context information to answer the Question: \n Context: {{ context }} \n Question: {{ query }} \n Answer:"
)

prompt = PromptTransformer(
    instruction=make_callable_template(GENERIC_PROMPT),
    system_message="You are an helpful assistant.",
    conversation_template=conv_template,
    input_fields=["context", "query"],
)

reader = Reader(backend, prompt)
rag_pipeline = monoT5_ret % 3 >> Concatenator() >> reader

results = (rag_pipeline)(topics_df.head(2))
results

The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


Unnamed: 0,prompt,qid,query_0,qanswer
0,You are an helpful assistant.\nuser: Use the c...,23287,are landlords liable if someone breaks in a hu...,\n\n\nThe provided text focuses on landlord li...
1,You are an helpful assistant.\nuser: Use the c...,30611,average age of men at marriage,analysis showed the vast majority of American...
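
To see what the reader receives as its instruction, the sketch below renders the same Jinja2 template on its own, outside the pipeline. The context and query strings are toy values chosen for illustration.

```python
from jinja2 import Template

# Same template text as GENERIC_PROMPT in the cell above.
GENERIC_PROMPT = Template(
    "Use the context information to answer the Question: \n Context: {{ context }} \n Question: {{ query }} \n Answer:"
)

rendered = GENERIC_PROMPT.render(
    context="Toy passage about nuggets.",  # toy value
    query="what is a nugget",              # toy value
)
print(rendered)
```

The keyword names passed to `render` must match the placeholders in the template, which is why `input_fields` lists `context` and `query`.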


In [18]:
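scored_nuggets = scored_nuggets.rename(columns={"qid": "query_id"})
# NOTE: `results` (see the output above) still names its key column `qid`, while
# `scored_nuggets` now uses `query_id`; this mismatch may be behind the KeyError below.
for element in nuggetizer.VitalScore().iter_calc(scored_nuggets, results):
    print(f"Query ID: {element['query_id']}, Measure: {element['measure']}, Value: {element['value']}")
    break  # inspect only the first measurement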

KeyError: 'query_id'
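
One common way to get a `KeyError` like this in pandas is joining two frames whose key columns carry different names. The sketch below reproduces that situation on toy frames and shows that aligning the names resolves it. Whether this is the actual cause inside `iter_calc` is an assumption; the frames and values are invented.

```python
import pandas as pd

# Toy frames: one keyed by `query_id`, the other by `qid`.
nuggets = pd.DataFrame({"query_id": ["q1", "q1"], "nugget": ["fact A", "fact B"]})
answers = pd.DataFrame({"qid": ["q1"], "qanswer": ["toy answer mentioning fact A"]})

try:
    pd.merge(nuggets, answers, on="query_id")
    merge_failed = False
except KeyError:
    # `answers` has no `query_id` column, so the join key cannot be resolved.
    merge_failed = True

# Align the key name on both sides before joining.
aligned = answers.rename(columns={"qid": "query_id"})
joined = pd.merge(nuggets, aligned, on="query_id")
print(merge_failed, joined.shape)
```

Checking `results.columns` and `scored_nuggets.columns` before calling the measure is a cheap way to confirm or rule out this kind of mismatch.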

In [None]:
"""
import pyterrier_rag.measures

results = pt.Experiment(
    [
        rag_pipeline
    ],
    topics_df.head(2), 
    answers_df,
    [pyterrier_rag.measures.F1, nuggetizer.VitalScore()],
    #batch_size=25,
    names=['baseline retriever'],
)
"""