# CheckThat Task 2

The goal of this notebook is to find a method to extract a claim from a passage of text. For example:


- **Passage**: Hydrate YOURSELF W After Waking Up Water 30 min Before a Meal DRINK Before Taking a Shower →→ Before Going to Bed at the correct time T A YE Helps activate internal organs Helps digestion Helps lower blood pressure Helps to avoid heart attack Health+ by Punjab Kesari

- **Claim**: Drinking water at specific times can have different health benefits


The passage of text will be the input to the method, and the claim will be the output. To evaluate our method, we will use the **METEOR** metric on the **CLEF2025** dataset.
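
To make the metric concrete, here is a minimal sketch of how a METEOR-style score is computed. It is a simplified illustration, not the implementation in `utils.metrics`: the function name `simple_meteor` is hypothetical, tokens are matched by exact lowercase string only (full METEOR also matches stems and WordNet synonyms), the alignment is greedy rather than chunk-minimizing, and the defaults `alpha=0.9`, `beta=3.0`, `gamma=0.5` are assumed to be the commonly used ones.

```python
def simple_meteor(reference: str, hypothesis: str,
                  alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    """Simplified METEOR: exact unigram matches only, greedy alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Greedily align each hypothesis token to the first unused identical reference token.
    used_ref = set()
    alignment = []  # list of (hypothesis_index, reference_index)
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if j not in used_ref and token == ref_token:
                used_ref.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Weighted harmonic mean; alpha=0.9 weights recall more heavily than precision.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (h0, r0), (h1, r1) in zip(alignment, alignment[1:]):
        if h1 != h0 + 1 or r1 != r0 + 1:
            chunks += 1

    # Fragmentation penalty: more chunks per match means worse word order.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)
```

Under this sketch, a hypothesis that shares no words with the reference scores 0, and a hypothesis with the right words in a scrambled order scores lower than the same words in the reference order.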




## Data Acquisition

The dataset is a collection of text passages paired with the claims that have been extracted from them. Let's go ahead and download it.

In [1]:
import os
from utils.data_utils import download

TRAIN_URL = "https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task2/data/train/train-eng.csv?inline=false"
TEST_URL = "https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task2/data/dev/dev-eng.csv?inline=false"

os.makedirs("data", exist_ok=True)
download(TRAIN_URL, "data")
download(TEST_URL, "data")
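
The `download` helper in `utils.data_utils` is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch using only the standard library. The names `filename_from_url` and `download_sketch` are hypothetical, and the skip-if-present behavior is an assumption rather than the helper's confirmed behavior.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring the query string."""
    return os.path.basename(urlparse(url).path)


def download_sketch(url: str, dest_dir: str) -> str:
    """Download `url` into `dest_dir` unless the file is already there."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, filename_from_url(url))
    if not os.path.exists(dest_path):
        # Assumption: a plain HTTP GET is enough (no authentication needed).
        urllib.request.urlretrieve(url, dest_path)
    return dest_path
```

Deriving the file name from the URL path is what lets the later cells refer to `data/train-eng.csv` and `data/dev-eng.csv` even though the URLs end in `?inline=false`.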

## Dataset

Now we wrap the downloaded CSV data in a dataset object, which makes it easier to iterate through. Each index of the dataset returns a text-claim pair.

In [2]:
from utils.dataset import ClaimVerificationDataset

train_dataset = ClaimVerificationDataset("data/train-eng.csv")
test_dataset = ClaimVerificationDataset("data/dev-eng.csv")

print(f"Train dataset length: {len(train_dataset)}")
print(f"Test dataset length: {len(test_dataset)}")

Train dataset length: 11374
Test dataset length: 1171
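
`ClaimVerificationDataset` lives in `utils.dataset` and is not shown here. The sketch below illustrates one way such a class could expose text-claim pairs through indexing. The class name `ClaimDatasetSketch` is hypothetical, and the default column names `post` and `normalized claim` are assumptions about the CSV header; only the returned keys `"text"` and `"claim"` are taken from how the notebook uses the dataset.

```python
import csv


class ClaimDatasetSketch:
    """Minimal CSV-backed dataset that returns {"text": ..., "claim": ...} items."""

    def __init__(self, csv_path, text_column="post", claim_column="normalized claim"):
        # Assumption: the file has a header row naming both columns.
        with open(csv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))
        self.text_column = text_column
        self.claim_column = claim_column

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # An out-of-range index raises IndexError, which also ends `for` loops.
        row = self.rows[index]
        return {"text": row[self.text_column], "claim": row[self.claim_column]}
```

Because the class defines `__len__` and `__getitem__`, it can be indexed, iterated over in a `for` loop, and wrapped in `tqdm` with a known total.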


## Method 1: Base Model Few-Shot Prompting

For this method, we will get a baseline score to see how well a model can do without finetuning. A larger model is chosen here in the hope that its greater parameter count will give better results. The model will be accessed via the Together API; we will use Llama 3.3 with 70 billion parameters. Our prompting strategy is few-shot prompting: input-output pairs are taken from the training dataset and placed in the prompt.

In [3]:
# use a diverse set of training samples in the model prompt
few_shot_prompt = [
    {"role": "system", "content": "You are an AI assistant designed to extract claims from a given passage of text. Keep it short and return the claim in the text. Only return the big idea and exclude unneeded details."},
    {"role": "user", "content": train_dataset[0]["text"]},
    {"role": "assistant", "content": train_dataset[0]["claim"]},
    {"role": "user", "content": train_dataset[1]["text"]},
    {"role": "assistant", "content": train_dataset[1]["claim"]},
    {"role": "user", "content": train_dataset[15]["text"]},
    {"role": "assistant", "content": train_dataset[15]["claim"]},
    {"role": "user", "content": train_dataset[23]["text"]},
    {"role": "assistant", "content": train_dataset[23]["claim"]},
    {"role": "user", "content": train_dataset[29]["text"]},
    {"role": "assistant", "content": train_dataset[29]["claim"]},
]
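
The same message list can be built programmatically, which makes it easier to try different example indices. The sketch below is one possible way to do this; `build_few_shot_prompt` is a hypothetical helper, not part of the notebook's `utils` package, and it assumes only that each dataset item is a dict with `"text"` and `"claim"` keys.

```python
def build_few_shot_prompt(dataset, indices, system_prompt):
    """Build a chat-style message list: one system message, then user/assistant pairs."""
    messages = [{"role": "system", "content": system_prompt}]
    for i in indices:
        example = dataset[i]
        # The passage plays the user turn and its claim plays the assistant reply.
        messages.append({"role": "user", "content": example["text"]})
        messages.append({"role": "assistant", "content": example["claim"]})
    return messages
```

Calling it with the indices `[0, 1, 15, 23, 29]`, `train_dataset`, and the system prompt used above should produce a list of the same shape as `few_shot_prompt`: one system message followed by five user/assistant pairs.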

In [4]:
from agents.together_api_agent import TogetherAgent
from utils.metrics import evaluate_on_dataset

together_agent = TogetherAgent()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jwilder\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
data, avg_score = evaluate_on_dataset(test_dataset, together_agent, few_shot_prompt, limit=200)

Evaluating LLM claim extraction:  17%|█▋        | 199/1171 [27:16<2:13:13,  8.22s/it]


In [None]:
print(f"Average METEOR score for base LLM few-shot prompting method: {avg_score}") # 0.2916

Average METEOR score for base LLM few-shot prompting method: 0.291637


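## Method 2: LLM Finetuning

For this method, we evaluate a finetuned `google/flan-t5-large` checkpoint, loaded from a local directory, on the test dataset. Note that Method 1 was run with `limit=200`, while this evaluation covers all 1171 test samples, so the two averages are not computed over the same set of examples.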

In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm

from utils.metrics import evaluate_claim_extraction

model_checkpoint = 'google/flan-t5-large'
model_code = model_checkpoint.split("/")[-1]

model = AutoModelForSeq2SeqLM.from_pretrained(f"./{model_code}/finetuned_{model_code}")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [11]:
def evaluate_finetuned_model(limit: int = 2000):
    """Generate a claim for each test entry and score it against the ground truth with METEOR."""
    data = []
    scores = []

    model.eval()  # inference mode; set once rather than on every iteration
    for counter, entry in enumerate(tqdm(test_dataset, desc="Evaluating LLM claim extraction")):
        if counter >= limit:  # evaluate at most `limit` entries
            break

        user_prompt = (
            "Please read the following social media post and extract the claim made within it. "
            "Normalize the claim by rephrasing it in a clear and concise manner.\n\n"
            f"Post: {entry['text']}\n\nExtracted Claim:"
        )

        # Inputs longer than 128 tokens are truncated, which can cut off the end of the prompt.
        inputs = tokenizer(user_prompt, return_tensors="pt", padding=True, truncation=True, max_length=128)
        with torch.no_grad():
            # Pass the attention mask along with the input ids.
            generated_ids = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)

        output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

        meteor_score = evaluate_claim_extraction(entry["claim"], output)

        data.append({
            "ground_truth_claim": entry["claim"],
            "generated_claim": output,
            "meteor_score": meteor_score
        })
        scores.append(meteor_score)

    avg_score = sum(scores) / len(scores) if scores else 0

    return data, avg_score

In [12]:
finetuned_data, finetuned_avg_score = evaluate_finetuned_model()

Evaluating LLM claim extraction: 100%|██████████| 1171/1171 [49:24<00:00,  2.53s/it]


In [13]:
print(f"Average METEOR score for finetuned LLM method: {finetuned_avg_score}") # 0.5569

Average METEOR score for finetuned LLM method: 0.5568712211784799
