# 📘 Nuggetizer: A lightweight nugget-based evaluation framework for pyterrier-rag

## 📌 Introduction
In this notebook, we demonstrate how to evaluate a Retrieval-Augmented Generation (RAG) system
using a semantic nugget-based evaluation framework inspired by the "AutoNuggetizer" used in TREC 2024.
The goal is to assess the factual informativeness of generated answers through fine-grained nugget detection
and scoring. This setup is general and compatible with Google Colab (T4 GPU).

## 🎯 Motivation and Background
- The Problem: Traditional RAG evaluations rely on lexical-overlap metrics such as ROUGE, which can miss whether an answer is semantically correct.
- The Solution: Nugget evaluation, originally proposed in TREC QA 2003 and revived by AutoNuggetizer, uses semantically atomic facts (“nuggets”) to evaluate answers.
- Inspiration: This library reimplements a simplified, local version of AutoNuggetizer with modular hooks into PyTerrier and Hugging Face models.

## ⚙️ Installation and Setup

In [1]:
!pip install git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag datasets

Collecting git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag
  Cloning https://github.com/MattiWe/ir_datasets.git (to revision add-msmarco-v2.1-trec-rag) to /tmp/pip-req-build-fu_8v688
  Running command git clone --filter=blob:none --quiet https://github.com/MattiWe/ir_datasets.git /tmp/pip-req-build-fu_8v688
  Running command git checkout -b add-msmarco-v2.1-trec-rag --track origin/add-msmarco-v2.1-trec-rag
  Switched to a new branch 'add-msmarco-v2.1-trec-rag'
  Branch 'add-msmarco-v2.1-trec-rag' set up to track remote branch 'add-msmarco-v2.1-trec-rag' from 'origin'.
  Resolved https://github.com/MattiWe/ir_datasets.git to commit bd018b783e3d25942b69290f7be19eeb929022c2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [2]:
!pip install --no-deps -e ../.

Obtaining file:///mnt/primary/projects/open-nuggetizer
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: open_nuggetizer
  Building editable for open_nuggetizer (pyproject.toml) ... done
  Created wheel for open_nuggetizer: filename=open_nuggetizer-0.0.1-0.editable-py3-none-any.whl size=7000 sha256=2f9de6a7dd99b08551a3e8c0f85ec78883ab161639acb870718fb65d0c560ba5
  Stored in directory: /tmp/pip-ephem-wheel-cache-qzcrlsfc/wheels/6e/ea/aa/fdd765af96c15c323c9c5cab3ddfdff595a6eaa2f890e08273
Successfully built open_nuggetizer
Installing collected packages: open_nuggetizer
  Attempting uninstall: open_nuggetizer
    Found existing installation: open_nuggetizer 0.0.1
    Uninstalling open_nuggetizer-0.0.1:
      Successfully uninstalled

In [3]:
!pip install ir_measures


In [4]:
!pip install -q python-terrier


In [5]:
!pip install -q git+https://github.com/terrierteam/pyterrier_rag.git


In [6]:
!pip install -q pyterrier_t5 pyterrier_pisa


In [7]:
import pyterrier as pt
from pyterrier_rag.backend import Backend



## Dataset

In [8]:
import ir_datasets
dataset = ir_datasets.load('msmarco-segment-v2.1')

In [9]:
pt_dataset = pt.get_dataset("irds:msmarco-segment-v2.1")

## Pipelines

In [10]:
def rename_segment(run):
    # MonoT5 reads passage text from a "text" column, but this dataset stores it as "segment".
    return run.rename(columns={"segment": "text"})

rename_pipe = pt.apply.generic(rename_segment)

In [11]:
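import pyterrier_alpha as pta
from pyterrier_pisa import PisaIndex
from pyterrier_t5 import MonoT5ReRanker

# Load a prebuilt PISA index from the Hugging Face Hub.
index = pta.Artifact.from_hf('namawho/msmarco-segment-v2.1.pisa')
# BM25 retrieval, then attach the passage text and expose it under "text".
bm25_ret = index.bm25() >> pt.text.get_text(pt_dataset, "segment") >> rename_pipe

# Re-rank the top 10 BM25 results with MonoT5.
monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)
monoT5_ret = bm25_ret % 10 >> monoT5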

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
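
The pipeline above relies on two PyTerrier operators: `>>` chains transformers, and `% k` keeps the top `k` results per query. Because Python gives `%` higher precedence than `>>`, `bm25_ret % 10 >> monoT5` is parsed as `(bm25_ret % 10) >> monoT5`. The toy model below illustrates these semantics on plain lists of dicts. `ToyTransformer`, `retrieve`, and `rerank` are hypothetical stand-ins written for this sketch; this is not how PyTerrier is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]  # one result row: qid, docno, score
Stage = Callable[[List[Row]], List[Row]]


@dataclass
class ToyTransformer:
    """Toy stand-in for a PyTerrier transformer (not the real class)."""

    fn: Stage

    def __call__(self, rows: List[Row]) -> List[Row]:
        return self.fn(rows)

    def __rshift__(self, other: "ToyTransformer") -> "ToyTransformer":
        # a >> b: feed the output of a into b.
        return ToyTransformer(lambda rows: other(self(rows)))

    def __mod__(self, k: int) -> "ToyTransformer":
        # a % k: keep the k highest-scoring rows for each query.
        def cutoff(rows: List[Row]) -> List[Row]:
            by_qid: Dict[object, List[Row]] = {}
            for row in self(rows):
                by_qid.setdefault(row["qid"], []).append(row)
            kept: List[Row] = []
            for group in by_qid.values():
                group.sort(key=lambda r: r["score"], reverse=True)
                kept.extend(group[:k])
            return kept

        return ToyTransformer(cutoff)


# Hypothetical first stage: five documents per topic, scored 0.0 to 4.0.
retrieve = ToyTransformer(
    lambda topics: [
        {"qid": t["qid"], "docno": f"d{i}", "score": float(i)}
        for t in topics
        for i in range(5)
    ]
)
# Hypothetical re-ranker: simply negates the first-stage score.
rerank = ToyTransformer(lambda rows: [dict(r, score=-r["score"]) for r in rows])

pipeline = retrieve % 2 >> rerank  # parsed as (retrieve % 2) >> rerank
out = pipeline([{"qid": "q1"}])
print(out)
```

Only the two documents that survive the cutoff ever reach the re-ranker, which is why the cutoff is placed before the more expensive stage.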


## Building a baseline retrieval run to generate baseline nuggets

In [12]:
from datasets import load_dataset

# Use a distinct name so the ir_datasets handle loaded earlier is not overwritten.
dev_df = load_dataset("namawho/trec-raggy-dev")["validation"].to_pandas()
topics_df = dev_df[["qid", "query"]]
answers_df = dev_df[["qid", "query", "gold_answer"]]

In [13]:
# Retrieve and re-rank passages for the first 10 topics.
baseline = monoT5_ret(topics_df.head(10))
baseline

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Unnamed: 0,qid,query,docno,text,score,rank
0,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_02_759557285#0_1325339642,Is a landlord liable if a tenant or visitor is...,-0.002110,0
1,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#3_1529122925,"1996), reh'g denied (1996).) If a landlord is ...",-0.035341,6
2,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_02_759557285#1_1325342568,"To do this, the injured person must show that:...",-0.003268,1
3,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#12_1529136555,But if the tenant has a month-to-month rental ...,-0.027580,5
4,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#11_1529134815,most courts hold landlords liable for knowing ...,-0.065491,8
...,...,...,...,...,...,...
95,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_43_701261032#3_1477785431,Multiple Intelligences Test\nBased on the work...,-4.660834,8
96,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_45_911384240#2_1740648328,"He is the director of Harvard Project Zero , A...",-0.016688,2
97,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_01_1630976798#1_2378503407,and some of the issues around its conceptualiz...,-0.748221,6
98,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_44_376498252#0_936606965,Multiple Intelligences (Howard Gardner) - Inst...,-5.776045,9


## Nuggetizer setup

In [14]:
from pyterrier_rag.backend import HuggingFaceBackend

# Smaller alternative backend:
# backend = HuggingFaceBackend(
#     "hugging-quants/gemma-2-9b-it-AWQ-INT4",
#     max_new_tokens=2048,
#     model_args={"device_map": "cuda"},
# )
backend = HuggingFaceBackend(
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    max_new_tokens=2048,
    model_args={"device_map": "auto"},
)

I have left this message as the final dev message to help you transition.

Important Notice:
- AutoAWQ is officially deprecated and will no longer be maintained.
- The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
- If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:
- AutoAWQ has been adopted by the vLLM Project: https://github.com/vllm-project/llm-compressor

For further inquiries, feel free to reach out:
- X: https://x.com/casper_hansen_
- LinkedIn: https://www.linkedin.com/in/casper-hansen-804005170/

Loading checkpoint shards: 100%|██████████| 9/9 [05:55<00:00, 39.46s/it]


In [15]:
from fastchat.conversation import register_conv_template, get_conv_template, Conversation, SeparatorStyle

register_conv_template(
    Conversation(
        name="meta-llama-3.1-sp",
        system_message="",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.ADD_COLON_SINGLE,
        sep="\n",
        messages=[],
    )
)

conv_template = get_conv_template("meta-llama-3.1-sp")
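
To see what kind of prompt string this template yields, the sketch below approximates how a colon-separated template with a single separator is flattened into text. `render_add_colon_single` is a hypothetical helper written for this illustration; it mirrors our understanding of FastChat's `ADD_COLON_SINGLE` style and is not FastChat's own code.

```python
from typing import List, Optional, Tuple


def render_add_colon_single(
    system_message: str,
    messages: List[Tuple[str, Optional[str]]],
    sep: str = "\n",
) -> str:
    """Approximate the flattening of a colon-separated, single-separator template."""
    prompt = system_message + sep
    for role, message in messages:
        if message:
            prompt += f"{role}: {message}{sep}"
        else:
            # An empty slot leaves the role open for the model to complete.
            prompt += f"{role}:"
    return prompt


example = render_add_colon_single(
    "Answer briefly.",
    [("user", "What is a nugget?"), ("assistant", None)],
)
print(example)
```

The rendered prompt has the shape `system message`, then `user: ...`, then an open `assistant:` turn.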

In [16]:
import pandas as pd

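def save_csv(path, content):
    # Write without the index so a reloaded frame has the same columns.
    content.to_csv(path, index=False)

def load_csv(path):
    # Return None when no usable cache exists so the caller can recompute.
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        return None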

In [17]:
from open_nuggetizer.nuggetizer import Nuggetizer

nuggetizer = Nuggetizer(
    backend=backend, 
    conversation_template=conv_template,
    verbose=True
)

# Create nuggets from the baseline run, or reuse a cached copy if one exists.
nuggets = load_csv("nuggets.csv")
if nuggets is None:
    nuggets = nuggetizer.create(baseline)
    save_csv("nuggets.csv", nuggets)

# Score the nuggets, again reusing a cached copy if one exists.
scored_nuggets = load_csv("scored_nuggets.csv")
if scored_nuggets is None:
    scored_nuggets = nuggetizer.score(nuggets)
    save_csv("scored_nuggets.csv", scored_nuggets)

Registered measures:
Accuracy
AP
MAP
BPM
Bpref
BPref
Compat
ERR_IA
nERR_IA
alpha_DCG
α_DCG
alpha_nDCG
α_nDCG
NRBP
nNRBP
AP_IA
MAP_IA
P_IA
StRecall
ERR
INST
INSQ
infAP
IPrec
Judged
nDCG
NDCG
NERR8
NERR9
NERR10
NERR11
NumQ
NumRel
NumRet
NumRelRet
P
Precision
R
Recall
RBP
Rprec
RPrec
RR
MRR
SDCG
SetP
SetRelP
SetR
SetF
SetAP
Success
AllScore
VitalScore
WeightedScore
Registered measures with details:
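
The load-or-compute pattern used for `nuggets.csv` and `scored_nuggets.csv` can be factored into a single helper. The sketch below uses only the standard library so it runs anywhere; `load_or_compute`, `expensive_step`, and the `nugget` column are hypothetical names for illustration, and the row content is a placeholder rather than real nuggetizer output. Values read back through the `csv` module are always strings.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


def load_or_compute(path: Path, compute: Callable[[], Rows]) -> Rows:
    """Return cached rows from `path`, or compute, cache, and return them."""
    if path.exists():
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    rows = compute()
    # Assumes compute() returns at least one row, whose keys become the header.
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []


def expensive_step() -> Rows:
    calls.append(1)  # count how often the expensive step really runs
    return [{"qid": "23287", "nugget": "placeholder nugget text"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "nuggets.csv"
    first = load_or_compute(cache, expensive_step)   # computes and writes the cache
    second = load_or_compute(cache, expensive_step)  # served from the cache

print(len(calls))
```

A stale cache is never invalidated in this sketch, so the file has to be deleted by hand whenever the baseline run or the model changes.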


## Evaluation

In [18]:
from pyterrier_rag.prompt import Concatenator
from pyterrier_rag.readers import Reader
from pyterrier_rag.prompt import PromptTransformer
from jinja2 import Template

def make_callable_template(template: Template):
    def template_call(**kwargs):
        return template.render(**kwargs)

    return template_call

GENERIC_PROMPT = Template(
    "Use the context information to answer the Question: \n Context: {{ context }} \n Question: {{ query }} \n Answer:"
)

prompt = PromptTransformer(
            instruction=make_callable_template(GENERIC_PROMPT),
            system_message="You are an helpful assistant.",
            conversation_template=conv_template,
            input_fields=[
                "qcontext",
                "query",
            ],
        )

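reader = Reader(backend, prompt)
# Keep the top 3 re-ranked passages, concatenate them into one context, and generate an answer.
rag_pipeline = monoT5_ret % 3 >> Concatenator() >> reader

# Run the RAG pipeline on the first 2 topics.
results = rag_pipeline(topics_df.head(2))
results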

The following generation flags are not valid and may be ignored: ['top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Unnamed: 0,prompt,qid,query_0,qanswer
0,You are an helpful assistant.\nuser: Use the c...,23287,are landlords liable if someone breaks in a hu...,Landlords may be liable if someone breaks in ...
1,You are an helpful assistant.\nuser: Use the c...,30611,average age of men at marriage,assistant\n\nAccording to the context informat...


In [21]:
# Align column names between the scored nuggets and the generated answers.
scored_nuggets = scored_nuggets.rename(columns={"query_id": "qid"})
results = results.rename(columns={"query_id": "qid", "query_0": "query"})
# Compute the per-query VitalScore of the generated answers.
for element in nuggetizer.VitalScore().iter_calc(scored_nuggets, results):
    print(f"Query ID: {element.query_id}, Measure: {element.measure}, Value: {element.value}")

Measure: VitalScore
Supported measures: [AllScore(partial_rel=ANY,strict=ANY), VitalScore(rel=ANY,partial_rel=ANY,strict=ANY), WeightedScore(rel=ANY,partial_rel=ANY,partial_weight=ANY)]



Query ID: 23287, Measure: VitalScore, Value: 0.0
Query ID: 30611, Measure: VitalScore, Value: 0.0
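
To help interpret these values, here is a minimal sketch of how a vital-nugget score could be computed. It assumes the AutoNuggetizer convention: each nugget is labelled `vital` or `okay`, each answer-nugget pair is judged `support`, `partial_support`, or `not_support`, and partial support earns half credit unless strict scoring is requested. The label names, the 0.5 credit, and the `vital_score` function are assumptions for illustration and were not read from `open_nuggetizer`. Under this reading, a value of 0.0 means that no vital nugget was judged as supported by the answer.

```python
from typing import List, Tuple

# (importance, assignment) for each nugget of one query.
Nugget = Tuple[str, str]

# Assumed credit per assignment label.
SUPPORT_CREDIT = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}


def vital_score(nuggets: List[Nugget], strict: bool = False) -> float:
    """Average credit over vital nuggets; strict mode ignores partial support."""
    vital = [assignment for importance, assignment in nuggets if importance == "vital"]
    if not vital:
        # Design choice of this sketch: no vital nuggets yields 0.0.
        return 0.0
    credits = [
        (1.0 if assignment == "support" else 0.0) if strict else SUPPORT_CREDIT[assignment]
        for assignment in vital
    ]
    return sum(credits) / len(vital)


# Made-up judgements for a single query.
judged = [
    ("vital", "support"),
    ("vital", "partial_support"),
    ("okay", "support"),
    ("vital", "not_support"),
]
print(vital_score(judged))               # 0.5
print(vital_score(judged, strict=True))  # one of three vital nuggets fully supported
```

The `okay` nugget is ignored here; an all-nuggets or weighted variant would include it in the average.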





In [20]:
"""
import pyterrier_rag.measures

results = pt.Experiment(
    [
        rag_pipeline
    ],
    topics_df.head(2), 
    answers_df,
    [pyterrier_rag.measures.F1, nuggetizer.VitalScore()],
    #batch_size=25,
    names=['baseline retriever'],
)
"""