# 📘 Nuggetizer: A lightweight nugget-based evaluation framework for pyterrier-rag

## 📌 Introduction
- Objective: Demonstrate how to use NuggetizerRAG, a personal library for nugget-based evaluation in Retrieval-Augmented Generation (RAG).
- Context: Inspired by the AutoNuggetizer framework from the TREC 2024 RAG Track.
- Use case: Provide interpretable and automatic evaluation metrics for open-domain QA and generation tasks using PyTerrier pipelines.

## 🎯 Motivation and Background
- The Problem: Traditional RAG evaluations rely on lexical-overlap metrics such as ROUGE, which miss semantic correctness.
- The Solution: Nugget evaluation, originally proposed in the TREC QA track in 2003 and revived by AutoNuggetizer, uses semantically atomic facts (“nuggets”) to evaluate answers.
- Inspiration: This library reimplements a simplified, local version of AutoNuggetizer with modular hooks into PyTerrier and HuggingFace models.
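
To make the idea concrete, the sketch below scores one answer against a set of labelled nuggets. It assumes the scheme described for AutoNuggetizer: each nugget carries an importance label (`vital` or `okay`) and an assignment (`support`, `partial_support`, or `not_support`), and the strict vital score is the fraction of vital nuggets that the answer fully supports. The function name, the data layout, and the labels in the example are illustrative assumptions, not this library's API or real judgments.

```python
from typing import Dict, List


def vital_strict_score(nuggets: List[Dict[str, str]]) -> float:
    """Fraction of vital nuggets that the answer fully supports."""
    vital = [n for n in nuggets if n["importance"] == "vital"]
    if not vital:
        return 0.0  # assumption: score is 0 when no nugget is vital
    supported = sum(1 for n in vital if n["assignment"] == "support")
    return supported / len(vital)


# Nugget texts taken from the scoring log; the labels are made up.
example = [
    {"text": "Landlord liable for injuries on rental property",
     "importance": "vital", "assignment": "support"},
    {"text": "Tenant must prove landlord negligence",
     "importance": "vital", "assignment": "not_support"},
    {"text": "Landlord must maintain common areas",
     "importance": "okay", "assignment": "support"},
]

print(vital_strict_score(example))  # 0.5: one of two vital nuggets is supported
```

Okay nuggets do not affect this score, which is why the third nugget is ignored even though the answer supports it.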

## ⚙️ Installation and Setup

In [1]:
!pip install git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag

Collecting git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag
  Cloning https://github.com/MattiWe/ir_datasets.git (to revision add-msmarco-v2.1-trec-rag) to /tmp/pip-req-build-d6j4viou
  Running command git clone --filter=blob:none --quiet https://github.com/MattiWe/ir_datasets.git /tmp/pip-req-build-d6j4viou
  Running command git checkout -b add-msmarco-v2.1-trec-rag --track origin/add-msmarco-v2.1-trec-rag
  Switched to a new branch 'add-msmarco-v2.1-trec-rag'
  Branch 'add-msmarco-v2.1-trec-rag' set up to track remote branch 'add-msmarco-v2.1-trec-rag' from 'origin'.
  Resolved https://github.com/MattiWe/ir_datasets.git to commit bd018b783e3d25942b69290f7be19eeb929022c2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [2]:
!pip install -q python-terrier pyterrier_t5 pyterrier_pisa

In [3]:
!pip install -q git+https://github.com/terrierteam/pyterrier_rag.git

In [4]:
!pip install -q --no-deps ../.

In [5]:
import pyterrier as pt
from pyterrier_rag.backend import Backend



## Dataset

In [6]:
import ir_datasets
dataset = ir_datasets.load('msmarco-segment-v2.1')

In [7]:
pt_dataset = pt.get_dataset("irds:msmarco-segment-v2.1")

## Pipelines

In [8]:
from pyterrier_pisa import PisaIndex
from pyterrier_t5 import MonoT5ReRanker

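def rename_segment(run):
    # monoT5 reads passage text from a "text" column by default.
    return run.rename(columns={"segment": "text"})

rename_pipe = pt.apply.generic(rename_segment)

index = PisaIndex('/mnt/indices/msmarco-segment-v2.1.pisa/')
# First stage: BM25 retrieval, then attach the segment text to each result.
bm25_ret = index.bm25() >> pt.text.get_text(pt_dataset, "segment") >> rename_pipe
monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)
# Second stage: re-rank only the top 10 BM25 results per query.
monoT5_ret = bm25_ret % 10 >> monoT5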

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
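
The pipeline `bm25_ret % 10 >> monoT5` reads as: keep the ten highest-ranked BM25 results per query, then hand them to monoT5 for re-scoring. The plain-Python sketch below mimics those two steps on a list of dicts so the data flow is visible without an index or a GPU. `rank_cutoff`, `rerank`, and `toy_scorer` are hypothetical stand-ins, not PyTerrier functions, and the toy scorer only counts shared terms.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]


def rank_cutoff(rows: List[Row], k: int) -> List[Row]:
    """Keep the k highest-scoring rows for each qid (the role of `% k`)."""
    by_query: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        by_query[row["qid"]].append(row)
    kept: List[Row] = []
    for qid_rows in by_query.values():
        kept.extend(sorted(qid_rows, key=lambda r: r["score"], reverse=True)[:k])
    return kept


def rerank(rows: List[Row], scorer: Callable[[str, str], float]) -> List[Row]:
    """Replace each row's score with scorer(query, text), as a re-ranker would."""
    return [{**row, "score": scorer(row["query"], row["text"])} for row in rows]


def toy_scorer(query: str, text: str) -> float:
    """Count query terms that occur in the text; stands in for a neural model."""
    tokens = text.lower().split()
    return float(sum(term in tokens for term in query.lower().split()))


first_stage = [
    {"qid": "1", "query": "sprained wrist", "docno": "a",
     "text": "wrist sprain recovery", "score": 3.0},
    {"qid": "1", "query": "sprained wrist", "docno": "b",
     "text": "a sprained wrist heals in weeks", "score": 2.0},
    {"qid": "1", "query": "sprained wrist", "docno": "c",
     "text": "ankle injuries", "score": 1.0},
]

reranked = rerank(rank_cutoff(first_stage, k=2), toy_scorer)
best = max(reranked, key=lambda r: r["score"])
print(best["docno"])  # "b": it matches both query terms after re-ranking
```

The cutoff runs before the expensive scorer, which is the point of the design: document `c` is never scored by the second stage.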


## Building a baseline retrieval run to generate baseline nuggets

In [9]:
import pandas as pd
df = pd.read_csv("../../diversification/diversy-rag/datasets/TREC-RAGgy/raggy-dev.tsv", sep="\t", dtype=str)
topics_df  = df[["qid", "query"]]
answers_df = df[["qid", "query", "gold_answer"]]

baseline = monoT5_ret(topics_df.head(10))
baseline

## Nuggetizer setup

In [10]:
from pyterrier_rag.backend import HuggingFaceBackend

backend = HuggingFaceBackend(
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    max_new_tokens=2048,
    model_args={"device_map": "cuda"},
)

Loading checkpoint shards: 100%|██████████| 9/9 [00:05<00:00,  1.72it/s]


In [11]:
from fastchat.conversation import register_conv_template, get_conv_template, Conversation, SeparatorStyle

register_conv_template(
    Conversation(
        name="meta-llama-3.1-sp",
        system_message="",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.ADD_COLON_SINGLE,
        sep="\n",
        messages=[],
    )
)

conv_template = get_conv_template("meta-llama-3.1-sp")
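
The separator style decides how the system message and the turns are flattened into one prompt string. The sketch below imitates what `ADD_COLON_SINGLE` with `sep="\n"` appears to produce, judging by the prompts printed in the scoring log of this notebook (`...search query.\nuser: Based on the query...`). It is a simplified stand-in rather than FastChat's implementation, and `render_add_colon_single` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def render_add_colon_single(
    system_message: str,
    messages: List[Tuple[str, Optional[str]]],
    sep: str = "\n",
) -> str:
    """Join the system message and turns as 'role: message' separated by sep."""
    prompt = system_message + sep
    for role, message in messages:
        if message:
            prompt += f"{role}: {message}{sep}"
        else:
            # An empty turn leaves the role open for the model to complete.
            prompt += f"{role}:"
    return prompt


prompt = render_add_colon_single(
    "You are NuggetizeScoreLLM.",
    [("user", "Label each nugget as vital or okay.\nLabels:")],
)
print(repr(prompt))
```

With an empty `system_message`, as registered above, the rendered prompt would simply start with the separator; the system text seen in the log therefore has to be supplied elsewhere.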

In [12]:
import pandas as pd

def save_csv(path, content):
    content.to_csv(path, index=False)

def load_csv(path):
    # Return None when no usable cache exists so the caller can recompute.
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        return None
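
These two helpers support a load-or-compute pattern: reuse a CSV if it exists, otherwise run the expensive step and save the result. A generic version of that pattern might look like the sketch below. It uses only the standard library so it runs without pandas, and `load_or_compute` is a hypothetical helper, not part of this library. The same path must be used for loading and saving, otherwise the cache is never hit.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


def load_or_compute(path: str, compute: Callable[[], Rows]) -> Rows:
    """Return rows cached at path, or compute them and write the cache."""
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    rows = compute()
    # Assumes compute() returns at least one row, which supplies the header.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []


def expensive_step() -> Rows:
    calls.append(1)  # record each real computation
    return [{"qid": "1", "nugget": "example nugget"}]


cache_path = os.path.join(tempfile.mkdtemp(), "nuggets.csv")
first = load_or_compute(cache_path, expensive_step)
second = load_or_compute(cache_path, expensive_step)
print(len(calls))  # 1: the second call is served from the cache
```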

In [13]:
from open_nuggetizer.nuggetizer import Nuggetizer

nuggetizer = Nuggetizer(
    backend=backend, 
    conversation_template=conv_template,
    verbose=True
)

nuggets = load_csv("nuggets.csv")
if nuggets is None:
    nuggets = nuggetizer.create(baseline)
    save_csv("nuggets.csv", nuggets)

scored_nuggets = load_csv("scored_nuggets.csv")
if scored_nuggets is None:
    scored_nuggets = nuggetizer.score(nuggets)
    save_csv("scored_nuggets.csv", scored_nuggets)

pt.apply.by_query():   0%|          | 0/225 [00:00<?, ?it/s]

LEN:12 - 10





PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 2 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: are landlords liable if someone breaks in a hurts tenant\nNugget List: [\'Landlord not liable for conditions arising after tenant takes possession\', "Landlord can be held liable for tenant\'s behavior if aware and does nothing"]\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n']


From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.

 50%|█████     | 1/2 [00:03<00:03,  3.18s/window]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: are landlords liable if someone breaks in a hurts tenant\nNugget List: [\'Landlord liable for injuries on rental property\', \'Tenant must prove landlord negligence\', \'Landlord must maintain common areas\', \'Tenant can sue for medical bills and lost earnings\', "Landlord liable for injuries caused by tenant\'s dog", \'Landlord must know dog is dangerous\'


100%|██████████| 2/2 [00:10<00:00,  5.02s/window]
pt.apply.by_query():   6%|▌         | 13/225 [00:10<02:43,  1.29it/s]

LEN:21 - 10



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 1 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: average age of men at marriage\nNugget List: ['average age of men at marriage has increased in the last 60 years']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]



 33%|███▎      | 1/3 [00:01<00:03,  1.57s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: average age of men at marriage\nNugget List: ['average age of men at marriage in Western Europe is around 30 years', 'average age of men at marriage has increased over the past two decades', 'average age of men at marriage has increased by two years since 1980', 'average age of men at marriage is higher in countries with higher social status', 'average age o


 67%|██████▋   | 2/3 [00:08<00:04,  4.52s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: average age of men at marriage\nNugget List: ['average age of men at marriage is 26.8 years', 'average age of men at marriage in US is 26.8 years', 'average age of men at marriage in UK is 30.8 years', 'average age of men at marriage in Alabama is 25.5 years', 'average age of men at marriage in District of Columbia is 30 years', 'average age of men at marria


100%|██████████| 3/3 [00:14<00:00,  4.87s/window]
pt.apply.by_query():  15%|█▌        | 34/225 [00:24<02:17,  1.39it/s]

LEN:30 - 10



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: crest syndrome esophageal dysfunction\nNugget List: ['CENP-C', 'Connective tissue disease', 'Scleroderma variant', 'Clinical signs', 'Laboratory test', 'Diagnostic criteria', 'Prognosis', 'Treatment options', 'Complications', 'Pulmonary function testing']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]



 33%|███▎      | 1/3 [00:06<00:12,  6.31s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: crest syndrome esophageal dysfunction\nNugget List: ['Antinuclear antibodies', 'Centromere antibodies', 'Esophageal hypomotility', 'Reflux esophagitis', 'Nonpitting digital edema', 'Pulmonary hypertension', 'Biliary cirrhosis', 'HLA-DR1', 'CENP-A', 'CENP-B']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]



 67%|██████▋   | 2/3 [00:12<00:06,  6.53s/window]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: crest syndrome esophageal dysfunction\nNugget List: [\'CREST syndrome\', \'Calcinosis cutis\', "Raynaud\'s phenomenon", \'Esophageal dysfunction\', \'Sclerodactyly\', \'Telangiectasia\', \'Systemic sclerosis\', \'Limited scleroderma\', \'Diffuse scleroderma\', \'Autoimmune disease\']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n']


100%|██████████| 3/3 [00:19<00:00,  6.51s/window]
pt.apply.by_query():  28%|██▊       | 64/225 [00:44<01:49,  1.47it/s]

LEN:21 - 10



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 1 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: does light intensity or concentration of carbon dioxide have a higher rate of photosynthesis\nNugget List: ['Photosynthesis rate levels off at high light intensities and carbon dioxide concentrations']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]



 33%|███▎      | 1/3 [00:01<00:03,  1.63s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: does light intensity or concentration of carbon dioxide have a higher rate of photosynthesis\nNugget List: ['Chlorophyll concentration affects photosynthesis rate', 'Carbon dioxide concentration limits photosynthesis rate at high light intensities', 'Carbon dioxide concentration of 0.03 to 0.04 percent is sufficient for photosynthesis', 'Carbon dioxide conce


 67%|██████▋   | 2/3 [00:08<00:04,  4.86s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: does light intensity or concentration of carbon dioxide have a higher rate of photosynthesis\nNugget List: ['Light intensity affects photosynthesis rate', 'Carbon dioxide concentration affects photosynthesis rate', 'Temperature affects photosynthesis rate', 'Higher light intensity increases photosynthesis rate', 'Higher carbon dioxide concentration increases


100%|██████████| 3/3 [00:15<00:00,  5.29s/window]
pt.apply.by_query():  38%|███▊      | 85/225 [01:00<01:39,  1.41it/s]

LEN:30 - 10



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how does my baby get submitted for medicaid after birth\nNugget List: [\'full Medicaid coverage\', \'Medicaid managed care plan\', "baby\'s Medicaid number", "baby\'s Medicaid ID number", "baby\'s Medicaid ID card", "baby of mother\'s name", \'card control number\', \'Medical Assistance Referral Form\', \'Unborn Activation Form\', \'proof of eligibility\']\n


 33%|███▎      | 1/3 [00:07<00:14,  7.31s/window]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how does my baby get submitted for medicaid after birth\nNugget List: ["baby\'s plan", \'MMA plan enrollment\', \'Medicaid services\', \'fee for service\', \'Medicaid fiscal agent\', "baby\'s eligibility", \'MMA plan\', "mother\'s eligibility category", \'MU\', \'FP\']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n']



 67%|██████▋   | 2/3 [00:14<00:07,  7.04s/window]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how does my baby get submitted for medicaid after birth\nNugget List: ["baby\'s Medicaid eligibility", "mother\'s Medicaid eligibility", \'Medicaid managed care plan\', \'newborn activation request\', \'Florida Medicaid Secure Web Portal\', \'Florida Health Plan Portal\', \'Medicaid fiscal agent\', "baby\'s name, gender, and birth date", \'new Medicaid gold 


100%|██████████| 3/3 [00:21<00:00,  7.06s/window]
pt.apply.by_query():  51%|█████     | 115/225 [01:21<01:17,  1.41it/s]

LEN:20 - 10



  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how is the oil pollution act of 1990 effect oil companies\nNugget List: ['Enhanced federal response capability', 'Increased potential liabilities', 'Limited liability for companies to $75 million', 'Prevented oil spills from vessels and facilities', 'Assigned liability for cleanup and damage costs', 'Defined responsible parties and financial liability', 'Imp


 50%|█████     | 1/2 [00:07<00:07,  7.89s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how is the oil pollution act of 1990 effect oil companies\nNugget List: ['Oil Pollution Act of 1990', 'Passed by 101st US Congress', 'Signed by President George H.W. Bush', 'Effective August 18, 1990', 'Amended Clean Water Act', 'Increased penalties for oil spills', 'Required double hulls for oil tankers', 'Established Oil Spill Liability Trust Fund', 'Provi


100%|██████████| 2/2 [00:15<00:00,  7.74s/window]
pt.apply.by_query():  60%|██████    | 135/225 [01:36<01:05,  1.38it/s]

LEN:24 - 10



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 4 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how long does it take for a sprained wrist to heal\nNugget List: ['sprained pinky healing time 2-3 weeks', 'AC sprain healing time 4-6 weeks', 'sprained rib ligament healing time 3-6 weeks', 'sprained foot healing time varies']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]



 33%|███▎      | 1/3 [00:03<00:06,  3.34s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how long does it take for a sprained wrist to heal\nNugget List: ['mild sprain heals 1 month over 50', 'sprained wrist ligament healing time 2-10 weeks', 'sprained wrist healing time 2-3 weeks to be normal', 'sprained wrist healing time 5-6 days with rest', 'sprained hand healing time 7-10 days', 'sprained foot healing time 1 week', 'sprained arm healing tim


 67%|██████▋   | 2/3 [00:10<00:05,  5.67s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how long does it take for a sprained wrist to heal\nNugget List: ['wrist sprain healing time 2-10 weeks', 'mild wrist sprain heals 2-3 days', 'moderate wrist sprain heals 1-2 weeks', 'severe wrist sprain heals several weeks to months', 'rest ice compression helps wrist sprain healing', 'wrist sprain occurs due to ligament injury', 'wrist sprain common due to


100%|██████████| 3/3 [00:17<00:00,  5.90s/window]
pt.apply.by_query():  71%|███████   | 159/225 [01:54<00:48,  1.37it/s]

LEN:18 - 10



  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 8 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how many years in jail for money laundering\nNugget List: ['Money laundering is a felony offense', 'Money laundering is a white-collar crime', 'Money laundering involves deceit and financial gain', 'First-degree money laundering is a class 2 felony', 'Second-degree money laundering is a class 3 felony', 'Money laundering penalties depend on prior convictions'


 50%|█████     | 1/2 [00:06<00:06,  6.03s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how many years in jail for money laundering\nNugget List: ['Money laundering jail time depends on amount laundered', 'Texas penal code requires 180 days to 2 years in jail for $1,500 to $20,000', '2 to 10 years in prison for $20,000 to $100,000', '2 to 20 years in prison for $100,000 to $200,000', '5 to 99 years in prison for over $100,000', 'California misd


100%|██████████| 2/2 [00:13<00:00,  6.96s/window]
pt.apply.by_query():  79%|███████▊  | 177/225 [02:08<00:35,  1.35it/s]

LEN:26 - 10



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 6 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how to help a jammed finger\nNugget List: ['Ibuprofen reduces swelling', 'Follow dosage recommendations', 'Collateral ligaments support joint', 'Jammed finger not serious', 'At-home treatments help', 'Medical treatments help']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]



 33%|███▎      | 1/3 [00:04<00:09,  4.84s/window]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how to help a jammed finger\nNugget List: [\'Rest is key\', \'Ice for 20 minutes\', \'Repeat as needed\', \'Avoid slamming doors\', \'Finger protection strips help\', \'Door guards help\', "Check children\'s hands", \'Teach children door safety\', \'Arthritis symptoms similar\', \'No pulling a jammed finger\']\n\nOnly return the list of labels (List[str]). D


 67%|██████▋   | 2/3 [00:11<00:06,  6.11s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how to help a jammed finger\nNugget List: ['Jammed finger causes pain', 'Swelling and stiffness', 'Difficulty moving finger', 'Ice pack helps', 'Epsom salt helps', 'Aloe Vera Gel helps', 'Apple Cider Vinegar helps', 'Turmeric helps', 'Tape injured finger', 'Immobilize finger']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]



100%|██████████| 3/3 [00:19<00:00,  6.34s/window]
pt.apply.by_query(): 100%|██████████| 225/225 [02:27<00:00,  1.53it/s]


LEN:23 - 10


  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 3 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: information about who howard gardner and what does he do\nNugget List: ['He has been the co-director of The Good Project since 1995', 'Gardner retired from teaching in 2019', 'He published his intellectual memoir A Synthesizing Mind in 2020']\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"]


 33%|███▎      | 1/3 [00:02<00:05,  2.91s/window]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: information about who howard gardner and what does he do\nNugget List: ["Gardner\'s theory of multiple intelligences includes eight different types of intelligences", \'Linguistic-Verbal intelligence\', \'Logical-Mathematical intelligence\', \'Visual-Spatial intelligence\', \'Bodily-Kinesthetic intelligence\', \'Musical-Rhythmic intelligence\', \'Interperson

 67%|██████▋   | 2/3 [00:09<00:05,  5.26s/window]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 10 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: information about who howard gardner and what does he do\nNugget List: ['Howard Gardner is an American psychologist', 'He specializes in cognitive and developmental psychology', 'He is best known for his theory of multiple intelligences', 'Gardner believes that the way people usually think about intelligence is too narrow', 'He was born on July 11, 1943 in S

100%|██████████| 3/3 [00:17<00:00,  5.92s/window]
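
The log above suggests that nuggets are scored in windows of at most 10, with the short window first (`LEN:12 - 10` yields prompts for 2 and then 10 nuggets; `LEN:21 - 10` yields 1, 10, and 10), and that each reply is expected to be a Python list literal of labels. The sketch below reproduces that window sizing and a defensive parse of such a reply. Both functions are hypothetical illustrations; how the library actually assigns nuggets to windows and handles malformed replies is not shown in the log, and raising `ValueError` here is an assumption.

```python
import ast
from typing import List


def windows_from_end(items: List[str], size: int) -> List[List[str]]:
    """Split items into chunks of at most `size`, with the short chunk first."""
    remainder = len(items) % size
    chunks = [items[:remainder]] if remainder else []
    for start in range(remainder, len(items), size):
        chunks.append(items[start:start + size])
    return chunks


def parse_labels(response: str, expected: int) -> List[str]:
    """Parse a reply such as "['vital', 'okay']" and validate it."""
    try:
        labels = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError) as err:
        raise ValueError(f"not a Python list literal: {response!r}") from err
    if not isinstance(labels, list) or len(labels) != expected:
        raise ValueError(f"expected {expected} labels, got {labels!r}")
    labels = [str(label).lower() for label in labels]
    if any(label not in {"vital", "okay"} for label in labels):
        raise ValueError(f"unexpected label in {labels!r}")
    return labels


nuggets = [f"nugget {i}" for i in range(12)]
print([len(w) for w in windows_from_end(nuggets, 10)])  # [2, 10]
print(parse_labels("['vital', 'okay']", expected=2))    # ['vital', 'okay']
```

Using `ast.literal_eval` rather than `eval` keeps the parse safe, since model output is untrusted text.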


## Evaluation

In [14]:
from pyterrier_rag.prompt import Concatenator
from pyterrier_rag.readers import Reader

reader = Reader(backend)
rag_pipeline = monoT5_ret % 3 >> Concatenator() >> reader

rag_pipeline(topics_df.head(1))

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Unnamed: 0,prompt,qid,query_0,qanswer
0,A chat between a curious human and an artifici...,23287,are landlords liable if someone breaks in a hu...,Landlords are generally not liable for injuri...


In [15]:
import pyterrier_rag.measures

results = pt.Experiment(
    [
        rag_pipeline
    ],
    topics_df.head(2), 
    answers_df,
    [pyterrier_rag.measures.F1, nuggetizer.VitalScore()],
    #batch_size=25,
    names=['baseline retriever'],
)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


KeyError: 'qid'
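
For reference, the `F1` measure requested in the experiment is, in QA evaluation generally, computed over the tokens shared between the generated answer and the gold answer. The sketch below shows that common definition. Whether `pyterrier_rag.measures.F1` tokenises and normalises text in exactly this way is an assumption, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty strings match perfectly; one empty string matches nothing.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("landlords are generally not liable", "landlords are not liable")
print(round(score, 3))  # 0.889: precision 4/5, recall 4/4
```

A paraphrased but correct answer can score low under this measure, which is the gap that nugget-based scores such as `VitalScore` are meant to cover.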