# 📘 Nuggetizer: A lightweight nugget-based evaluation framework for pyterrier-rag

## 📌 Introduction
- Objective: Demonstrate how to use NuggetizerRAG, a personal library for nugget-based evaluation in Retrieval-Augmented Generation (RAG).
- Context: Inspired by the AutoNuggetizer framework from the TREC 2024 RAG Track.
- Use case: Provide interpretable and automatic evaluation metrics for open-domain QA and generation tasks using PyTerrier pipelines.

## 🎯 Motivation and Background
- The Problem: Traditional RAG evaluations rely on lexical-overlap metrics such as ROUGE, which can miss semantic correctness.
- The Solution: Nugget evaluation, originally proposed in the TREC 2003 QA track and revived by AutoNuggetizer, uses semantically atomic facts (“nuggets”) to evaluate answers.
- Inspiration: This library reimplements a simplified, local version of AutoNuggetizer with modular hooks into PyTerrier and HuggingFace models.
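
To make the idea concrete, here is a minimal sketch of a nugget-based score. It assumes each nugget carries an importance label (`vital` or `okay`, as in the scoring output later in this notebook) and an assignment label saying whether an answer supports it. The assignment label names, the weights, and the helper `weighted_nugget_score` are illustrative assumptions, not this library's API.

```python
from typing import Dict, List

# Illustrative weights (assumptions, not taken from this library).
IMPORTANCE_WEIGHT = {"vital": 1.0, "okay": 0.5}
SUPPORT_CREDIT = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}


def weighted_nugget_score(nuggets: List[Dict[str, str]]) -> float:
    """Return the importance-weighted fraction of nuggets an answer supports."""
    total = sum(IMPORTANCE_WEIGHT[n["importance"]] for n in nuggets)
    if total == 0:
        return 0.0
    earned = sum(
        IMPORTANCE_WEIGHT[n["importance"]] * SUPPORT_CREDIT[n["assignment"]]
        for n in nuggets
    )
    return earned / total


# Hypothetical labels for three nuggets of one query.
example = [
    {"text": "Tenant must prove landlord negligence",
     "importance": "vital", "assignment": "support"},
    {"text": "Landlord must maintain common areas",
     "importance": "vital", "assignment": "not_support"},
    {"text": "Tenant can sue for medical bills and lost earnings",
     "importance": "okay", "assignment": "partial_support"},
]
score = weighted_nugget_score(example)
```

Unlike a lexical-overlap metric, the score depends only on which facts the answer covers, so a paraphrased answer is not penalised for its wording.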

## ⚙️ Installation and Setup

In [1]:
!pip install git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag

Collecting git+https://github.com/MattiWe/ir_datasets.git@add-msmarco-v2.1-trec-rag
  Cloning https://github.com/MattiWe/ir_datasets.git (to revision add-msmarco-v2.1-trec-rag) to /tmp/pip-req-build-btwtay6w
  Running command git clone --filter=blob:none --quiet https://github.com/MattiWe/ir_datasets.git /tmp/pip-req-build-btwtay6w
  Running command git checkout -b add-msmarco-v2.1-trec-rag --track origin/add-msmarco-v2.1-trec-rag
  Switched to a new branch 'add-msmarco-v2.1-trec-rag'
  Branch 'add-msmarco-v2.1-trec-rag' set up to track remote branch 'add-msmarco-v2.1-trec-rag' from 'origin'.
  Resolved https://github.com/MattiWe/ir_datasets.git to commit 24a983d51b04f11a11c2f654dab3c275905c67a0
  Preparing metadata (setup.py) ... done

In [2]:
!pip install -q python-terrier pyterrier_t5 pyterrier_pisa


In [3]:
!pip install -q git+https://github.com/terrierteam/pyterrier_rag.git


In [4]:
!pip install -q --no-deps ../.


In [5]:
import pyterrier as pt
from pyterrier_rag.backend import Backend



## Dataset

In [6]:
import ir_datasets
dataset = ir_datasets.load('msmarco-segment-v2.1')

In [7]:
pt_dataset = pt.get_dataset("irds:msmarco-segment-v2.1")

## Pipelines

In [8]:
from pyterrier_pisa import PisaIndex
from pyterrier_t5 import MonoT5ReRanker

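def rename_segment(run):
    # monoT5 reads passages from a "text" column, so rename "segment".
    return run.rename(columns={"segment": "text"})

rename_pipe = pt.apply.generic(rename_segment)

index = PisaIndex('/mnt/indices/msmarco-segment-v2.1.pisa/')
# BM25 retrieval, then attach each segment's text from the dataset.
bm25_ret = index.bm25() >> pt.text.get_text(pt_dataset, "segment") >> rename_pipe
monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)
# Re-rank only the top 10 BM25 results per query with monoT5.
monoT5_ret = bm25_ret % 10 >> monoT5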

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
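
In the pipeline above, `% 10` is PyTerrier's rank-cutoff operator, so monoT5 only re-scores the ten highest-ranked BM25 segments for each query. The effect can be sketched in plain Python without an index. The helper `rank_cutoff` and the toy rows are hypothetical, and the sketch assumes ranks follow descending score.

```python
from collections import defaultdict
from typing import Dict, List


def rank_cutoff(rows: List[Dict], k: int) -> List[Dict]:
    """Keep the k highest-scoring rows for each qid, ranked from 0."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["qid"]].append(row)

    kept = []
    for group in by_query.values():
        top = sorted(group, key=lambda r: r["score"], reverse=True)[:k]
        for rank, row in enumerate(top):
            kept.append({**row, "rank": rank})
    return kept


# Toy run with two queries (made-up docnos and scores).
toy_run = [
    {"qid": "1", "docno": "a", "score": 15.4},
    {"qid": "1", "docno": "b", "score": 15.1},
    {"qid": "1", "docno": "c", "score": 13.7},
    {"qid": "2", "docno": "d", "score": 9.0},
]
top2 = rank_cutoff(toy_run, k=2)
```

Cutting off before the re-ranker keeps the cost of the neural model bounded: ten segments per query instead of the full BM25 result list.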


## Building a baseline retrieval run to generate baseline nuggets

In [9]:
import pandas as pd
df = pd.read_csv("../../diversification/diversy-rag/datasets/TREC-RAGgy/raggy-dev.tsv", sep="\t", dtype=str)
topics_df = df[["qid", "query"]]
answers_df = df[["qid", "query", "gold_answer"]]

In [10]:
bm25_ret.search("hello world")

Unnamed: 0,qid,query,docno,score,rank,text
0,1,hello world,msmarco_v2.1_doc_30_206096153#9_462542248,15.437883,0,"However, if you’re looking for something a bit..."
1,1,hello world,msmarco_v2.1_doc_30_206096153#10_462544673,15.426567,1,"In other words, I explore languages almost ran..."
2,1,hello world,msmarco_v2.1_doc_56_166221946#0_369357076,15.149368,2,Hello World Program in Java\n\n\n\n\n\n\n\nRel...
3,1,hello world,msmarco_v2.1_doc_24_1026769510#2_2162703600,15.075923,3,"1\n""Hello World"" -ieq ""hello world""\nEqual Che..."
4,1,hello world,msmarco_v2.1_doc_01_1568473385#1_2272488579,15.056515,4,nmake | msbuild\n\nmake\nrake\nant\ngradle\nve...
...,...,...,...,...,...,...
995,1,hello world,msmarco_v2.1_doc_28_1238377198#4_2648854685,13.674044,995,Share\nedited Dec 3 '19 at 16:23\nVadim Shkabe...
996,1,hello world,msmarco_v2.1_doc_09_332951182#16_455124934,13.672531,996,"WAIT KEY\nPBasic\nDEBUG ""Hello, World!"", CR\no..."
997,1,hello world,msmarco_v2.1_doc_25_897484862#11_1695398270,13.672531,997,If you don't already have Cargo installed on y...
998,1,hello world,msmarco_v2.1_doc_51_1184196669#3_2398216740,13.672531,998,It does nothing. It takes the decorated functi...


In [11]:
baseline = (monoT5_ret)(topics_df.head(10))
baseline

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Unnamed: 0,qid,query,docno,text,score,rank
0,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_02_759557285#0_1325339642,Is a landlord liable if a tenant or visitor is...,-0.002110,0
1,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#3_1529122925,"1996), reh'g denied (1996).) If a landlord is ...",-0.035341,6
2,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_02_759557285#1_1325342568,"To do this, the injured person must show that:...",-0.003268,1
3,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#12_1529136555,But if the tenant has a month-to-month rental ...,-0.027581,5
4,23287,are landlords liable if someone breaks in a hu...,msmarco_v2.1_doc_48_841527758#11_1529134815,most courts hold landlords liable for knowing ...,-0.065491,8
...,...,...,...,...,...,...
95,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_43_701261032#3_1477785431,Multiple Intelligences Test\nBased on the work...,-4.660836,8
96,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_45_911384240#2_1740648328,"He is the director of Harvard Project Zero , A...",-0.016688,2
97,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_01_1630976798#1_2378503407,and some of the issues around its conceptualiz...,-0.748225,6
98,395948,information about who howard gardner and what ...,msmarco_v2.1_doc_44_376498252#0_936606965,Multiple Intelligences (Howard Gardner) - Inst...,-5.776029,9


## Nuggetizer setup

In [12]:
from pyterrier_rag.backend import HuggingFaceBackend

backend = HuggingFaceBackend(
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    max_new_tokens=2048,
    model_args={"device_map": "cuda"},
)

Loading checkpoint shards: 100%|██████████| 9/9 [00:05<00:00,  1.73it/s]


In [13]:
from fastchat.conversation import register_conv_template, get_conv_template, Conversation, SeparatorStyle

register_conv_template(
    Conversation(
        name="meta-llama-3.1-sp",
        system_message="",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.ADD_COLON_SINGLE,
        sep="\n",
        messages=[],
    )
)

conv_template = get_conv_template("meta-llama-3.1-sp")
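
The template registered above uses `SeparatorStyle.ADD_COLON_SINGLE` with a newline separator. As a rough sketch of how that style is understood to lay out a prompt (system message first, then one `role: message` turn per line), the hypothetical helper below rebuilds the layout by hand. It is not FastChat's API, only an approximation that matches the `...\nuser: ...` prompts printed in the verbose log of the nuggetizer cell.

```python
from typing import List, Optional, Tuple


def render_add_colon_single(
    system_message: str,
    messages: List[Tuple[str, Optional[str]]],
    sep: str = "\n",
) -> str:
    """Approximate the ADD_COLON_SINGLE layout: system text, then 'role: message' turns."""
    prompt = system_message + sep
    for role, message in messages:
        if message:
            prompt += f"{role}: {message}{sep}"
        else:
            # An empty turn leaves the role open for the model to complete.
            prompt += f"{role}:"
    return prompt


prompt = render_add_colon_single(
    "You are NuggetizeLLM.",
    [("user", "Update the list of atomic nuggets."), ("assistant", None)],
)
```

Because the registered template has an empty `system_message`, the system text seen in the log is presumably set by the nuggetizer at prompt time.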

In [14]:
from open_nuggetizer.nuggetizer import Nuggetizer

nuggetizer = Nuggetizer(
    backend=backend, 
    conversation_template=conv_template,
    verbose=True
)
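# Stage 1: extract nuggets per query from the retrieved passages.
nuggets = nuggetizer.create(baseline)
# Stage 2: label each nugget as "vital" or "okay".
nuggetizer.score(nuggets)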
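
With `verbose=True`, the call prints each prompt and the raw model response. Both prompt types ask for "a Pythonic list format", so each response has to be parsed back into a list of strings. The sketch below shows one way to do that defensively. `parse_nugget_list` is a hypothetical helper, not necessarily how `open_nuggetizer` parses its output.

```python
import ast
import re
from typing import List


def parse_nugget_list(raw_text: str) -> List[str]:
    """Parse a model response that should contain a Python list of strings."""
    text = raw_text.strip()
    try:
        parsed = ast.literal_eval(text)
        if isinstance(parsed, list):
            return [str(item).strip() for item in parsed]
    except (ValueError, SyntaxError):
        pass
    # Fallback for truncated or malformed output: keep every complete
    # double-quoted string and drop any unterminated tail.
    pattern = r'"([^"\\]*(?:\\.[^"\\]*)*)"'
    return [match.strip() for match in re.findall(pattern, text)]


complete = parse_nugget_list(' ["CREST syndrome", "Calcinosis cutis"] ')
labels = parse_nugget_list("['vital', 'okay', 'okay']")
truncated = parse_nugget_list('["CREST syndrome", "Calcinosis cutis", "Raynaud')
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text produced by a model. The regular-expression fallback only recovers double-quoted items, which is the quoting the nugget-creation responses use.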

pt.apply.by_query():   0%|          | 0/100 [00:00<?, ?it/s]


PROMPT:


From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.

100%|██████████| 1/1 [00:29<00:00, 29.36s/window]
pt.apply.by_query():  11%|█         | 11/100 [00:29<03:57,  2.67s/it]

--- RAW TEXT: ---
 ["Landlord liable for injuries on rental property", "Tenant must prove landlord negligence", "Landlord must maintain common areas", "Tenant can sue for medical bills and lost earnings", "Landlord liable for injuries caused by tenant's dog", "Landlord must know dog is dangerous", "Landlord must have power to remove dog", "Landlord liable for injuries off rental property", "Landlord must maintain property in safe condition", "Landlord liable for faulty wiring and toxic mold", "Landlord not liable for conditions arising after tenant takes possession", "Landlord can be held liable for tenant's behavior if aware and does nothing"] 
 --- END OF RAW TEXT---



  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: average age of men at marriage\nContext:\n[1] The Nations Of Europe By The Average Age At First Marriage - WorldAtlas\nThe Nations Of Europe By The Average Age At First Marriage\nA happily married couple celebrating their new re


100%|██████████| 1/1 [01:03<00:00, 63.77s/window]
pt.apply.by_query():  21%|██        | 21/100 [01:33<06:16,  4.76s/it]

--- RAW TEXT: ---
 ["average age of men at marriage is 26.8 years", "average age of men at marriage in US is 26.8 years", "average age of men at marriage in UK is 30.8 years", "average age of men at marriage in Alabama is 25.5 years", "average age of men at marriage in District of Columbia is 30 years", "average age of men at marriage in Moldova is 26 years", "average age of men at marriage in Mexico is 23.3 years", "average age of men at marriage in Europe varies by country", "average age of men at marriage in Northern Europe is around 30 years", "average age of men at marriage in Southern Europe is around 30 years", "average age of men at marriage in Western Europe is around 30 years", "average age of men at marriage has increased over the past two decades", "average age of men at marriage has increased by two years since 1980", "average age of men at marriage is higher in countries with higher social status", "average age of men at marriage is higher in countries with better educati


  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: crest syndrome esophageal dysfunction\nContext:\n[1] DermIS - CREST Syndrome (information on the diagnosis)\nCREST Syndrome\ndefinition\nThis form of systemic scleroderma is usually less severe than other forms, consisting of ca


100%|██████████| 1/1 [00:42<00:00, 42.43s/window]
pt.apply.by_query():  31%|███       | 31/100 [02:15<05:12,  4.53s/it]

--- RAW TEXT: ---
 ["CREST syndrome", "Calcinosis cutis", "Raynaud's phenomenon", "Esophageal dysfunction", "Sclerodactyly", "Telangiectasia", "Systemic sclerosis", "Limited scleroderma", "Diffuse scleroderma", "Autoimmune disease", "Antinuclear antibodies", "Centromere antibodies", "Esophageal hypomotility", "Reflux esophagitis", "Nonpitting digital edema", "Pulmonary hypertension", "Biliary cirrhosis", "HLA-DR1", "CENP-A", "CENP-B", "CENP-C", "Connective tissue disease", "Scleroderma variant", "Clinical signs", "Laboratory test", "Diagnostic criteria", "Prognosis", "Treatment options", "Complications", "Pulmonary function testing", "Echocardiogram"] 
 --- END OF RAW TEXT---



  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: does light intensity or concentration of carbon dioxide have a higher rate of photosynthesis\nContext:\n[1] Light\nAs light intensity increases so too does the rate of photosynthesis until a certain point where the graph levels 


100%|██████████| 1/1 [00:48<00:00, 48.50s/window]
pt.apply.by_query():  41%|████      | 41/100 [03:04<04:34,  4.66s/it]

--- RAW TEXT: ---
 ["Light intensity affects photosynthesis rate", "Carbon dioxide concentration affects photosynthesis rate", "Temperature affects photosynthesis rate", "Higher light intensity increases photosynthesis rate", "Higher carbon dioxide concentration increases photosynthesis rate", "Higher temperature increases photosynthesis rate", "Optimum temperature range is 25 to 35 o C", "Water availability affects photosynthesis rate", "Carbon dioxide is essential for photosynthesis", "Light wavelength affects photosynthesis rate", "Chlorophyll concentration affects photosynthesis rate", "Carbon dioxide concentration limits photosynthesis rate at high light intensities", "Carbon dioxide concentration of 0.03 to 0.04 percent is sufficient for photosynthesis", "Carbon dioxide concentration above 0.5 percent is injurious to plants", "Light intensity is limiting factor at low light intensities", "Carbon dioxide concentration is limiting factor at high light intensities", "Temperature is 


  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: how does my baby get submitted for medicaid after birth\nContext:\n[1] You\n \nneed\n \nto\n \nsee\n \na\n \ndoctor\n \nin\n \norder\n \nto\n \nsubmit\n \nan\n \napplication. At\n \nCHEMED,\n \nyou\n \ncannot\n \napply\n \nfor\n


100%|██████████| 1/1 [00:47<00:00, 47.98s/window]
pt.apply.by_query():  51%|█████     | 51/100 [03:52<03:50,  4.71s/it]

--- RAW TEXT: ---
 ["baby's Medicaid eligibility", "mother's Medicaid eligibility", "Medicaid managed care plan", "newborn activation request", "Florida Medicaid Secure Web Portal", "Florida Health Plan Portal", "Medicaid fiscal agent", "baby's name, gender, and birth date", "new Medicaid gold card", "mother's plan", "baby's plan", "MMA plan enrollment", "Medicaid services", "fee for service", "Medicaid fiscal agent", "baby's eligibility", "MMA plan", "mother's eligibility category", "MU", "FP", "full Medicaid coverage", "Medicaid managed care plan", "baby's Medicaid number", "baby's Medicaid ID number", "baby's Medicaid ID card", "baby of mother's name", "card control number", "Medical Assistance Referral Form", "Unborn Activation Form", "proof of eligibility", "recipient eligibility", "baby's Medicaid ID number assignment", "newborn's Medicaid ID number activation"] 
 --- END OF RAW TEXT---



  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: how is the oil pollution act of 1990 effect oil companies\nContext:\n[1] Oil Pollution Act of 1990 - Wikipedia\nOil Pollution Act of 1990\nLanguage\nWatch\nEdit\nThe Oil Pollution Act of 1990 (OPA) (101 H.R.1465, P.L. 101-380) w


100%|██████████| 1/1 [00:41<00:00, 41.47s/window]
pt.apply.by_query():  61%|██████    | 61/100 [04:33<02:56,  4.52s/it]

--- RAW TEXT: ---
 ["Oil Pollution Act of 1990", "Passed by 101st US Congress", "Signed by President George H.W. Bush", "Effective August 18, 1990", "Amended Clean Water Act", "Increased penalties for oil spills", "Required double hulls for oil tankers", "Established Oil Spill Liability Trust Fund", "Provided financial responsibility requirements", "Mandated contingency planning", "Enhanced federal response capability", "Increased potential liabilities", "Limited liability for companies to $75 million", "Prevented oil spills from vessels and facilities", "Assigned liability for cleanup and damage costs", "Defined responsible parties and financial liability", "Implemented processes for measuring damages", "Specified damages for which violators are liable", "Established fund for damages, cleanup, and removal costs", "Resulted in changes to oil production, transportation, and distribution industries"] 
 --- END OF RAW TEXT---



  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: how long does it take for a sprained wrist to heal\nContext:\n[1] How long for a sprained wrist ligament take to heal? - Answers\nA wrist sprain can take anywhere from 2 to 10 weeks to heal\ncompletely\nHome\n        Science\n  


100%|██████████| 1/1 [00:56<00:00, 56.43s/window]
pt.apply.by_query():  71%|███████   | 71/100 [05:29<02:21,  4.88s/it]

--- RAW TEXT: ---
 ["wrist sprain healing time 2-10 weeks", "mild wrist sprain heals 2-3 days", "moderate wrist sprain heals 1-2 weeks", "severe wrist sprain heals several weeks to months", "rest ice compression helps wrist sprain healing", "wrist sprain occurs due to ligament injury", "wrist sprain common due to falling", "sprained wrist symptoms swelling stiffness", "wrist sprain healing time varies with age", "mild sprain heals 10 days under 25", "mild sprain heals 1 month over 50", "sprained wrist ligament healing time 2-10 weeks", "sprained wrist healing time 2-3 weeks to be normal", "sprained wrist healing time 5-6 days with rest", "sprained hand healing time 7-10 days", "sprained foot healing time 1 week", "sprained arm healing time 2-3 weeks", "sprained knee healing time few weeks to months", "sprained ankle healing time 1 week", "sprained toe healing time couple weeks", "sprained pinky healing time 2-3 weeks", "AC sprain healing time 4-6 weeks", "sprained rib ligament healing 


  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: how many years in jail for money laundering\nContext:\n[1] When it comes to money laundering jail time, Texas penal code requirements state that your punishment depends on how much money is involved. If you’re accused of launder


100%|██████████| 1/1 [00:50<00:00, 50.91s/window]
pt.apply.by_query():  81%|████████  | 81/100 [06:20<01:34,  4.95s/it]

--- RAW TEXT: ---
 ["Money laundering jail time depends on amount laundered", "Texas penal code requires 180 days to 2 years in jail for $1,500 to $20,000", "2 to 10 years in prison for $20,000 to $100,000", "2 to 20 years in prison for $100,000 to $200,000", "5 to 99 years in prison for over $100,000", "California misdemeanor money laundering carries up to 1 year in jail", "California felony money laundering carries 16 months to 4 years in jail", "Federal money laundering carries up to 20 years in prison", "Fines for money laundering vary by state and federal cases", "Repeat offenders can face up to 35 years in jail", "Money laundering is a felony offense", "Money laundering is a white-collar crime", "Money laundering involves deceit and financial gain", "First-degree money laundering is a class 2 felony", "Second-degree money laundering is a class 3 felony", "Money laundering penalties depend on prior convictions", "Money laundering can be charged as a misdemeanor or felony in Califo


  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: how to help a jammed finger\nContext:\n[1] Before becoming a professional writer, Michael worked as an English tutor, poet, voice-over artist, and DJ. 📦Amazon Doesn\'t Want You to Know About This Plugin\nLearn about a little kno


100%|██████████| 1/1 [00:33<00:00, 33.18s/window]
pt.apply.by_query(): 100%|██████████| 100/100 [06:54<00:00,  4.14s/it]


--- RAW TEXT: ---
 ["Jammed finger causes pain", "Swelling and stiffness", "Difficulty moving finger", "Ice pack helps", "Epsom salt helps", "Aloe Vera Gel helps", "Apple Cider Vinegar helps", "Turmeric helps", "Tape injured finger", "Immobilize finger", "Rest is key", "Ice for 20 minutes", "Repeat as needed", "Avoid slamming doors", "Finger protection strips help", "Door guards help", "Check children's hands", "Teach children door safety", "Arthritis symptoms similar", "No pulling a jammed finger", "Ibuprofen reduces swelling", "Follow dosage recommendations", "Collateral ligaments support joint", "Jammed finger not serious", "At-home treatments help", "Medical treatments help"] 
 --- END OF RAW TEXT---


  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeLLM, an intelligent assistant that can update a list of atomic nuggets to best provide all the information required for the query.\nuser: Update the list of atomic nuggets of information (1-12 words), if needed, so they best provide the information required for the query. Leverage only the initial list of nuggets (if exists) and the provided context (this is an iterative process).  Return only the final list of all nuggets in a Pythonic list format (even if no updates). Make sure there is no redundant information. Ensure the updated nugget list has at most 30 nuggets (can be less), keeping only the most vital ones. Order them in decreasing order of importance. Prefer nuggets that provide more interesting information.\n\nSearch Query: information about who howard gardner and what does he do\nContext:\n[1] Howard Gardner (Intelligence Psychologist Biography) - Practical Psychology\nHoward Gardner (Intelligence Psychologist Biography)\nHoward Gardner (Intellige

100%|██████████| 1/1 [00:59<00:00, 59.68s/window]


--- RAW TEXT: ---
 ["Howard Gardner is an American psychologist", "He specializes in cognitive and developmental psychology", "He is best known for his theory of multiple intelligences", "Gardner believes that the way people usually think about intelligence is too narrow", "He was born on July 11, 1943 in Scranton, Pennsylvania", "His parents were Ralph and Hilde Gardner", "They were German-Jewish immigrants who fled from Nazi persecution in Germany", "Gardner is the John H. and Elisabeth A. Hobbs Research Professor of Cognition and Education at Harvard University", "He is the senior director of Harvard Project Zero", "He has written hundreds of research articles and thirty books", "Gardner's theory of multiple intelligences includes eight different types of intelligences", "Linguistic-Verbal intelligence", "Logical-Mathematical intelligence", "Visual-Spatial intelligence", "Bodily-Kinesthetic intelligence", "Musical-Rhythmic intelligence", "Interpersonal intelligence", "Intrapersonal 

pt.apply.by_query():   0%|          | 0/225 [00:00<?, ?it/s]
  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 488 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: are landlords liable if someone breaks in a hurts tenant\nNugget List: [1] Landlord liable for injuries on rental property\n[2] Tenant must prove landlord negligence\n[3] Landlord must maintain common areas\n[4] Tenant can sue for medical bills and lost earnings\n[5] Landlord liable for injuries caused by tenant's dog\n[6] Landlord must know dog is dangerou


100%|██████████| 1/1 [00:07<00:00,  7.57s/window]
pt.apply.by_query():   6%|▌         | 13/225 [00:07<02:03,  1.72it/s]

--- RAW TEXT: ---
 ['vital', 'vital', 'vital', 'okay', 'okay', 'okay', 'okay', 'okay', 'vital', 'vital'] 
 --- END OF RAW TEXT---



  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 616 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: average age of men at marriage\nNugget List: [1] average age of men at marriage is 26.8 years\n[2] average age of men at marriage in US is 26.8 years\n[3] average age of men at marriage in UK is 30.8 years\n[4] average age of men at marriage in Alabama is 25.5 years\n[5] average age of men at marriage in District of Columbia is 30 years\n[6] average age of 


100%|██████████| 2/2 [00:07<00:00,  3.56s/window]
pt.apply.by_query():  15%|█▌        | 34/225 [00:14<01:18,  2.43it/s]

--- RAW TEXT: ---
 ['vital', 'okay', 'okay', 'okay', 'okay', 'okay', 'okay', 'okay', 'okay', 'okay'] 
 --- END OF RAW TEXT---



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 223 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: crest syndrome esophageal dysfunction\nNugget List: [1] CREST syndrome\n[2] Calcinosis cutis\n[3] Raynaud's phenomenon\n[4] Esophageal dysfunction\n[5] Sclerodactyly\n[6] Telangiectasia\n[7] Systemic sclerosis\n[8] Limited scleroderma\n[9] Diffuse scleroderma\n[10] Autoimmune disease\n\nOnly return the list of labels (List[str]). Do not explain.\nLabels:\n"


100%|██████████| 3/3 [00:06<00:00,  2.30s/window]
pt.apply.by_query():  28%|██▊       | 64/225 [00:21<00:49,  3.28it/s]

--- RAW TEXT: ---
 ['vital', 'okay', 'okay', 'vital', 'okay', 'okay', 'vital', 'okay', 'okay', 'okay'] 
 --- END OF RAW TEXT---



  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 530 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: does light intensity or concentration of carbon dioxide have a higher rate of photosynthesis\nNugget List: [1] Light intensity affects photosynthesis rate\n[2] Carbon dioxide concentration affects photosynthesis rate\n[3] Temperature affects photosynthesis rate\n[4] Higher light intensity increases photosynthesis rate\n[5] Higher carbon dioxide concentratio


100%|██████████| 2/2 [00:07<00:00,  3.71s/window]
pt.apply.by_query():  38%|███▊      | 85/225 [00:28<00:45,  3.10it/s]

--- RAW TEXT: ---
 ['vital', 'vital', 'okay', 'vital', 'vital', 'okay', 'okay', 'okay', 'vital', 'okay'] 
 --- END OF RAW TEXT---



  0%|          | 0/3 [00:00<?, ?window/s]

PROMPT:
 ["You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 309 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how does my baby get submitted for medicaid after birth\nNugget List: [1] baby's Medicaid eligibility\n[2] mother's Medicaid eligibility\n[3] Medicaid managed care plan\n[4] newborn activation request\n[5] Florida Medicaid Secure Web Portal\n[6] Florida Health Plan Portal\n[7] Medicaid fiscal agent\n[8] baby's name, gender, and birth date\n[9] new Medicaid 


100%|██████████| 3/3 [00:07<00:00,  2.41s/window]
pt.apply.by_query():  51%|█████     | 115/225 [00:36<00:31,  3.48it/s]

--- RAW TEXT: ---
 ['vital', 'vital', 'okay', 'vital', 'okay', 'okay', 'okay', 'vital', 'vital', 'okay'] 
 --- END OF RAW TEXT---



  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 374 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how is the oil pollution act of 1990 effect oil companies\nNugget List: [1] Oil Pollution Act of 1990\n[2] Passed by 101st US Congress\n[3] Signed by President George H.W. Bush\n[4] Effective August 18, 1990\n[5] Amended Clean Water Act\n[6] Increased penalties for oil spills\n[7] Required double hulls for oil tankers\n[8] Established Oil Spill Liability Tr


100%|██████████| 2/2 [00:07<00:00,  3.87s/window]
pt.apply.by_query():  60%|██████    | 135/225 [00:43<00:28,  3.16it/s]

--- RAW TEXT: ---
 ['vital', 'okay', 'okay', 'okay', 'vital', 'vital', 'vital', 'vital', 'vital', 'vital'] 
 --- END OF RAW TEXT---



  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 444 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how long does it take for a sprained wrist to heal\nNugget List: [1] wrist sprain healing time 2-10 weeks\n[2] mild wrist sprain heals 2-3 days\n[3] moderate wrist sprain heals 1-2 weeks\n[4] severe wrist sprain heals several weeks to months\n[5] rest ice compression helps wrist sprain healing\n[6] wrist sprain occurs due to ligament injury\n[7] wrist sprai


100%|██████████| 2/2 [00:07<00:00,  3.54s/window]
pt.apply.by_query():  71%|███████   | 159/225 [00:51<00:20,  3.23it/s]

--- RAW TEXT: ---
 ['vital', 'okay', 'okay', 'vital', 'okay', 'okay', 'okay', 'okay', 'okay', 'okay'] 
 --- END OF RAW TEXT---



  0%|          | 0/1 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 617 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how many years in jail for money laundering\nNugget List: [1] Money laundering jail time depends on amount laundered\n[2] Texas penal code requires 180 days to 2 years in jail for $1,500 to $20,000\n[3] 2 to 10 years in prison for $20,000 to $100,000\n[4] 2 to 20 years in prison for $100,000 to $200,000\n[5] 5 to 99 years in prison for over $100,000\n[6] Ca


100%|██████████| 1/1 [00:07<00:00,  7.97s/window]
pt.apply.by_query():  79%|███████▊  | 177/225 [00:59<00:16,  2.90it/s]

--- RAW TEXT: ---
 ['vital', 'vital', 'vital', 'vital', 'vital', 'okay', 'okay', 'vital', 'okay', 'okay'] 
 --- END OF RAW TEXT---



  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 245 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: how to help a jammed finger\nNugget List: [1] Jammed finger causes pain\n[2] Swelling and stiffness\n[3] Difficulty moving finger\n[4] Ice pack helps\n[5] Epsom salt helps\n[6] Aloe Vera Gel helps\n[7] Apple Cider Vinegar helps\n[8] Turmeric helps\n[9] Tape injured finger\n[10] Immobilize finger\n\nOnly return the list of labels (List[str]). Do not explain.


100%|██████████| 2/2 [00:07<00:00,  3.61s/window]
pt.apply.by_query(): 100%|██████████| 225/225 [01:06<00:00,  3.40it/s]


--- RAW TEXT: ---
 ['vital', 'vital', 'vital', 'okay', 'okay', 'okay', 'okay', 'okay', 'vital', 'vital'] 
 --- END OF RAW TEXT---


  0%|          | 0/2 [00:00<?, ?window/s]

PROMPT:
 ['You are NuggetizeScoreLLM, an intelligent assistant that can label a list of atomic nuggets based on their importance for a given search query.\nuser: Based on the query, label each of the 681 nuggets either a vital or okay based on the following criteria. Vital nuggets represent concepts that must be present in a “good” answer; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. Return the list of labels in a Pythonic list format (type: List[str]). The list should be in the same order as the input nuggets. Make sure to provide a label for each nugget.\n\nSearch Query: information about who howard gardner and what does he do\nNugget List: [1] Howard Gardner is an American psychologist\n[2] He specializes in cognitive and developmental psychology\n[3] He is best known for his theory of multiple intelligences\n[4] Gardner believes that the way people usually think about intelligence is too narrow\n[5] He was born on July 11

100%|██████████| 2/2 [00:08<00:00,  4.05s/window]

--- RAW TEXT: ---
 ['vital', 'vital', 'vital', 'vital', 'okay', 'okay', 'okay', 'vital', 'vital', 'vital'] 
 --- END OF RAW TEXT---
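
The progress bars above count "windows", and each prompt lists ten nuggets numbered from 1. A minimal sketch of that batching, assuming a fixed window size of 10 (inferred from the prompts, not confirmed by the library); the helper name `iter_windows` is hypothetical:

```python
def iter_windows(nuggets, window_size=10):
    """Yield (global_start_index, chunk) pairs over fixed-size windows.

    window_size=10 is an assumption based on the prompts shown in the log.
    """
    for start in range(0, len(nuggets), window_size):
        # start + 1 is the 1-based position of the first nugget in the window
        yield start + 1, nuggets[start:start + window_size]


# Illustrative data only: 23 placeholder nuggets.
nuggets = [f"nugget {i}" for i in range(1, 24)]
windows = list(iter_windows(nuggets))
print([(first, len(chunk)) for first, chunk in windows])
```

Keeping the global start index alongside each chunk makes it possible to map per-window labels back to stable nugget ids after the model has answered.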





| Unnamed: 0 | qid | query | nugget_id | nugget | importance |
|---|---|---|---|---|---|
| 0 | 23287 | are landlords liable if someone breaks in a hu... | 23287_1 | Landlord liable for injuries on rental property | 1 |
| 1 | 23287 | are landlords liable if someone breaks in a hu... | 23287_2 | Tenant must prove landlord negligence | 1 |
| 2 | 23287 | are landlords liable if someone breaks in a hu... | 23287_3 | Landlord must maintain common areas | 1 |
| 3 | 23287 | are landlords liable if someone breaks in a hu... | 23287_4 | Tenant can sue for medical bills and lost earn... | 0 |
| 4 | 23287 | are landlords liable if someone breaks in a hu... | 23287_5 | Landlord liable for injuries caused by tenant'... | 0 |
| ... | ... | ... | ... | ... | ... |
| 95 | 395948 | information about who howard gardner and what ... | 395948_6 | His parents were Ralph and Hilde Gardner | 0 |
| 96 | 395948 | information about who howard gardner and what ... | 395948_7 | They were German-Jewish immigrants who fled fr... | 0 |
| 97 | 395948 | information about who howard gardner and what ... | 395948_8 | Gardner is the John H. and Elisabeth A. Hobbs ... | 1 |
| 98 | 395948 | information about who howard gardner and what ... | 395948_9 | He is the senior director of Harvard Project Zero | 1 |
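
The `importance` column lines up with the raw label lists printed earlier: for qid 395948, positions 6 to 9 were labelled `okay`, `okay`, `vital`, `vital`, and the table shows 0, 0, 1, 1. A minimal sketch of that conversion, assuming `vital` maps to 1 and `okay` to 0 and that nugget ids follow the `<qid>_<position>` pattern seen in the table. The helper name `labels_to_importance` is hypothetical and not part of the library.

```python
import ast

# Assumed mapping, inferred from the table output; not taken from the library source.
LABEL_TO_IMPORTANCE = {"vital": 1, "okay": 0}


def labels_to_importance(raw_text, qid, nuggets, start=1):
    """Parse an LLM reply such as "['vital', 'okay']" into table-like rows."""
    # literal_eval only accepts Python literals, so arbitrary code is never run.
    labels = ast.literal_eval(raw_text.strip())
    if not isinstance(labels, list) or len(labels) != len(nuggets):
        raise ValueError("expected exactly one label per nugget")
    rows = []
    for offset, (nugget, label) in enumerate(zip(nuggets, labels)):
        rows.append({
            "qid": qid,
            "nugget_id": f"{qid}_{start + offset}",
            "nugget": nugget,
            "importance": LABEL_TO_IMPORTANCE[label.strip().lower()],
        })
    return rows


# Placeholder nuggets for illustration only.
rows = labels_to_importance(
    " ['vital', 'okay', 'vital'] ",
    qid="395948",
    nuggets=["nugget A", "nugget B", "nugget C"],
)
print([(r["nugget_id"], r["importance"]) for r in rows])
```

Checking the list length before building rows catches replies where the model skipped or added a label, which would otherwise silently shift every importance value.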


## 📊 Evaluation

In [15]:
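# Compare the baseline pipeline on answer F1 and the two nugget-based measures.
# NOTE: `load_baseline` is not defined in this session (see the NameError below);
# the baseline pipeline has to be created in an earlier cell before this one runs.
results = pt.Experiment(
    [
        load_baseline,
    ],
    topics_df,
    answers_df,
    [pyterrier_rag.measures.F1, nuggetizer.VitalScore(), nuggetizer.StrictVitalScore()],
    # batch_size=25,
    names=['baseline retriever'],
)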

NameError: name 'load_baseline' is not defined
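
How `nuggetizer.VitalScore()` and `nuggetizer.StrictVitalScore()` compute their values is not shown in this notebook. As a rough sketch, assuming they follow the AutoNuggetizer definitions from the TREC 2024 RAG Track (an assumption, not confirmed here), the vital score averages per-nugget credit over vital nuggets only, giving 1 for `support`, 0.5 for `partial_support`, and 0 otherwise; the strict variant gives credit for full support only. The function name and the assignment labels below are illustrative.

```python
def vital_score(importance, assignments, strict=False):
    """Average credit over vital nuggets (importance == 1).

    Assumed credit scheme: support=1.0, partial_support=0.5 (0.0 when strict),
    anything else=0.0. This mirrors the AutoNuggetizer description and is not
    taken from the library's source.
    """
    credit = {"support": 1.0, "partial_support": 0.0 if strict else 0.5}
    vital = [a for imp, a in zip(importance, assignments) if imp == 1]
    if not vital:
        return 0.0  # design choice for this sketch: no vital nuggets scores 0
    return sum(credit.get(a, 0.0) for a in vital) / len(vital)


# Illustrative data only: three vital nuggets and one okay nugget.
importance = [1, 1, 0, 1]
assignments = ["support", "partial_support", "support", "not_support"]
print(vital_score(importance, assignments))
print(vital_score(importance, assignments, strict=True))
```

Under these assumptions the okay nugget is ignored entirely, so the strict score can only be lower than or equal to the non-strict one.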