# CheckThat Task 2

The goal here is to find a method to extract a claim from a passage of text. For example:


- **Passage**: Hydrate YOURSELF W After Waking Up Water 30 min Before a Meal DRINK Before Taking a Shower →→ Before Going to Bed at the correct time T A YE Helps activate internal organs Helps digestion Helps lower blood pressure Helps to avoid heart attack Health+ by Punjab Kesari

- **Claim**: Drinking water at specific times can have different health benefits


To evaluate our method, we will use the **METEOR** metric on the **CLEF2025** dataset.
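
To make the metric more concrete, here is a minimal sketch of a METEOR-style score that uses exact word matches only. It is a simplification, not the metric used for the numbers in this notebook: the full METEOR (for example the NLTK implementation) also matches stems and WordNet synonyms, so its scores will differ. The function name `simple_meteor`, the whitespace tokenization, and the greedy left-to-right alignment are assumptions made for illustration; the default parameters follow the NLTK defaults.

```python
def simple_meteor(reference: str, hypothesis: str,
                  alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    """Simplified METEOR: exact unigram matches plus a fragmentation penalty."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0

    # Greedily align each hypothesis word to the first unused identical reference word.
    used = [False] * len(ref)
    matches = []  # list of (hypothesis_index, reference_index)
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if not used[j] and token == ref_token:
                used[j] = True
                matches.append((i, j))
                break

    m = len(matches)
    if m == 0:
        return 0.0

    precision = m / len(hyp)
    recall = m / len(ref)
    # Weighted harmonic mean that favors recall (alpha = 0.9).
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (h1, r1), (h2, r2) in zip(matches, matches[1:]):
        if h2 != h1 + 1 or r2 != r1 + 1:
            chunks += 1

    penalty = gamma * (chunks / m) ** beta
    return fmean * (1 - penalty)
```

A generated claim that reuses the reference words in the same order scores close to 1, while a claim with no shared words scores 0.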




## Data Acquisition

The dataset is a collection of text passages and the corresponding claims extracted from them. We use the English train split for training and the English dev split as our test set. Let's go ahead and download both files.

In [1]:
import os
from utils.data_utils import download

TRAIN_URL = "https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task2/data/train/train-eng.csv?inline=false"
TEST_URL = "https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task2/data/dev/dev-eng.csv?inline=false"

os.makedirs("data", exist_ok=True)
download(TRAIN_URL, "data")
download(TEST_URL, "data")
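
The `download` helper lives in `utils.data_utils` and is not shown here. As a rough idea of what such a helper might do, the sketch below derives the file name from the URL (dropping the `?inline=false` query string) and saves the file into the target directory. The names `filename_from_url` and `download_sketch`, and the choice to skip files that already exist, are assumptions for illustration rather than the project's actual implementation.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)


def download_sketch(url: str, target_dir: str) -> str:
    """Download `url` into `target_dir` and return the local file path."""
    os.makedirs(target_dir, exist_ok=True)
    local_path = os.path.join(target_dir, filename_from_url(url))
    # Assumed behavior: do not download again if the file is already there.
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

With the URLs above, this naming scheme would produce `data/train-eng.csv` and `data/dev-eng.csv`, which are the paths the next section reads from.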

## Dataset

Now we wrap the downloaded CSV data in a dataset class, which makes it easier to iterate through the data. Each index of the dataset returns a text-claim pair.

In [4]:
from utils.dataset import ClaimVerificationDataset

train_dataset = ClaimVerificationDataset("data/train-eng.csv")
test_dataset = ClaimVerificationDataset("data/dev-eng.csv")

print(f"Train dataset length: {len(train_dataset)}")
print(f"Test dataset length: {len(test_dataset)}")

Train dataset length: 11374
Test dataset length: 1171
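
`ClaimVerificationDataset` is defined in `utils.dataset` and is not reproduced in this notebook. The sketch below shows one way such a class could work, assuming it only needs `len()`, indexing, and iteration, and that each item is a dictionary with `text` and `claim` keys (as used later in the notebook). The class name `ClaimDatasetSketch` and the CSV column names are hypothetical; the real file may use different headers.

```python
import csv


class ClaimDatasetSketch:
    """Minimal CSV-backed dataset that returns {"text": ..., "claim": ...} items."""

    def __init__(self, csv_path: str, text_column: str = "text", claim_column: str = "claim"):
        # Column names are assumptions; pass the real header names if they differ.
        with open(csv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))
        self.text_column = text_column
        self.claim_column = claim_column

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> dict:
        # Indexing past the end raises IndexError, which also lets `for` loops terminate.
        row = self.rows[index]
        return {"text": row[self.text_column], "claim": row[self.claim_column]}
```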


### Subset of Dataset for quick initial testing

In [3]:
train_dataset.export_subset_to_csv("data/reduced-train-eng.csv")
test_dataset.export_subset_to_csv("data/reduced-test-eng.csv")

Subset successfully exported to data/reduced-train-eng.csv
Subset successfully exported to data/reduced-test-eng.csv


In [6]:
from utils.dataset import ClaimVerificationDataset

reduced_train_dataset = ClaimVerificationDataset("data/reduced-train-eng.csv")
reduced_test_dataset = ClaimVerificationDataset("data/reduced-test-eng.csv")

print(f"Reduced Train dataset length: {len(reduced_train_dataset)}")
print(f"Reduced Test dataset length: {len(reduced_test_dataset)}")

Reduced Train dataset length: 568
Reduced Test dataset length: 58
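
The reduced split sizes above are consistent with keeping roughly 5% of each split: `int(11374 * 0.05)` is 568 and `int(1171 * 0.05)` is 58. How `export_subset_to_csv` chooses its rows is not shown, so the sketch below simply assumes a seeded random sample. The helper names `subset_size` and `sample_subset`, the 5% default, and the seed are all assumptions for illustration.

```python
import random


def subset_size(n_rows: int, fraction: float = 0.05) -> int:
    """Number of rows kept when retaining `fraction` of the data, rounded down."""
    return int(n_rows * fraction)


def sample_subset(rows: list, fraction: float = 0.05, seed: int = 0) -> list:
    """Return a reproducible random sample of `rows` (assumed sampling strategy)."""
    rng = random.Random(seed)
    return rng.sample(rows, subset_size(len(rows), fraction))
```

Fixing the seed keeps the reduced split identical between runs, so scores on it stay comparable.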

## Method 1: Base Model Prompt Engineering

For this method, we establish a baseline score to see how well a model can do without finetuning.

In [5]:
index = 80
print(train_dataset[index]["text"], "\n", train_dataset[index]["claim"])

Removing the death penalty for so-called "hate speech" is not enough. You must scrap the entire Bill, send it back to Singapore where it came from, get on your grubby and scabby knees and ask God and the Nigerian people for forgiveness for having the temerity to try to deprive them of their right to speak freely.

(Femi Fani-Kayode) Removing the death penalty for so-called "hate speech" is not enough. You must scrap the entire Bill, send it back to Singapore where it came from, get on your grubby and scabby knees and ask God and the Nigerian people for forgiveness for having the temerity to try to deprive them of their right to speak freely.

(Femi Fani-Kayode) Removing the death penalty for so-called "hate speech" is not enough. You must scrap the entire Bill, send it back to Singapore where it came from, get on your grubby and scabby knees and ask God and the Nigerian people for forgiveness for having the temerity to try to deprive them of their right to speak freely.

(Femi Fani-Kay

In [6]:
# System instruction followed by five few-shot examples taken from the training set.
base_prompt = [
    {"role": "system", "content": "You are an AI assistant designed to extract claims from a given passage of text. Keep it short and return the claim in the text. Only return the big idea and exclude unneeded details."},
    {"role": "user", "content": train_dataset[0]["text"]},
    {"role": "assistant", "content": train_dataset[0]["claim"]},
    {"role": "user", "content": train_dataset[1]["text"]},
    {"role": "assistant", "content": train_dataset[1]["claim"]},
    {"role": "user", "content": train_dataset[15]["text"]},
    {"role": "assistant", "content": train_dataset[15]["claim"]},
    {"role": "user", "content": train_dataset[23]["text"]},
    {"role": "assistant", "content": train_dataset[23]["claim"]},
    {"role": "user", "content": train_dataset[29]["text"]},
    {"role": "assistant", "content": train_dataset[29]["claim"]},
]
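
The few-shot prompt above follows a regular pattern: one system message, then alternating user (passage) and assistant (claim) messages. A small helper can build the same structure from any list of examples. This is only a sketch: `build_few_shot_prompt` and `with_query` are hypothetical names, and the assumption that the evaluation code appends the new passage as a final user message is not confirmed by the notebook.

```python
def build_few_shot_prompt(system_prompt: str, examples: list) -> list:
    """Build a chat prompt: system message, then one user/assistant pair per example."""
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        messages.append({"role": "user", "content": example["text"]})
        messages.append({"role": "assistant", "content": example["claim"]})
    return messages


def with_query(few_shot_prompt: list, passage: str) -> list:
    """Append the passage to extract a claim from as the final user message."""
    # A new list is returned so the shared few-shot prompt is not mutated between calls.
    return few_shot_prompt + [{"role": "user", "content": passage}]
```

Under these assumptions, `build_few_shot_prompt(system_prompt, [train_dataset[i] for i in (0, 1, 15, 23, 29)])` would give the same messages as the hand-written list, and changing the number of examples becomes a one-line edit.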

In [7]:
from agents.together_api_agent import TogetherAgent

together_agent = TogetherAgent()

In [9]:
from utils.metrics import evaluate_on_dataset

[nltk_data] Downloading package wordnet to C:\Users\Joseph
[nltk_data]     Wilder\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
data, avg_score = evaluate_on_dataset(reduced_test_dataset, together_agent, base_prompt)

Average score before 3 more samples were added: 0.288298275862069

In [None]:
print(avg_score)

## Method 2: LLM Finetuning

Please view testing.ipynb for the finetuning process.


In [None]:
# Unsloth asks to be imported before transformers so that its patches are applied.
from unsloth import FastLanguageModel
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class FinetunedAgent():
    def __init__(self):
        max_seq_length = 1024
        dtype = None
        load_in_4bit = True
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name="model-good/3B_finetuned_llama3.2",
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit
        )

        self.model = FastLanguageModel.for_inference(self.model)

    def ask(self, prompt: str):
        instruction = "You are an AI assistant designed to extract claims from a given passage of text. Keep it short and return the claim in the text. Only return the big idea and exclude unneeded details."
        
        class StopOnTokens(StoppingCriteria):
            def __init__(self, stop_ids):
                self.stop_ids = stop_ids
            
            def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
                for stop_id in self.stop_ids:
                    if input_ids[0][-1] == stop_id:
                        return True
                return False
        
        if not prompt or len(prompt.strip()) == 0:
            return "No input provided"
        
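        # Keep the template lines flush-left: indentation inside a triple-quoted string
        # becomes part of the prompt, and the Alpaca-style template this prompt follows
        # (assumed to match the format used during finetuning) has none.
        inference_prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{prompt}

### Response:
"""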
        
        try:
            inputs = self.tokenizer(
                [inference_prompt], 
                return_tensors="pt"
            ).to("cuda")
            
            
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=128,
                min_new_tokens=10,
                use_cache=True,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                num_return_sequences=1,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id,
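                # generate() expects a StoppingCriteriaList, not a list that wraps one.
                # Only the EOS id is used as a stop id: "### Input:" and "### Instruction:"
                # are not single vocabulary tokens, so convert_tokens_to_ids cannot map
                # them to a usable id. Any trailing "###" section is trimmed after decoding.
                stopping_criteria=StoppingCriteriaList([
                    StopOnTokens(stop_ids=[self.tokenizer.eos_token_id])
                ])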
            )
            
            # Drop special tokens (such as the EOS marker) so they do not leak into the claim.
            full_response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
            print("Full raw response:", full_response)

            # The decoded text still contains the prompt, so keep only what follows the
            # response marker and cut off any further "###" section the model generated.
            answer = full_response.split("### Response:", 1)[-1].split("###")[0].strip()
            
            if not answer:
                answer = "No claim extracted"
            
            return answer
        
        except Exception as e:
            print(f"Error in claim extraction: {e}")
            return "Error during claim extraction"


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.6.0+cu118)
    Python  3.10.11 (you have 3.10.0)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
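
The parsing step at the end of `ask` can be checked without a GPU by pulling it into a small standalone function. The sketch below mirrors that logic: keep the text after the `### Response:` marker, cut it at the next `###` section, and fall back to a placeholder when nothing is left. The function name `extract_response` is hypothetical and not part of the class above.

```python
def extract_response(full_text: str, marker: str = "### Response:") -> str:
    """Return the generated claim from a decoded prompt-plus-completion string."""
    # Everything after the first marker is the model's completion.
    after_marker = full_text.split(marker, 1)[-1]
    # Trim any extra "### ..." section the model kept generating.
    answer = after_marker.split("###")[0].strip()
    return answer or "No claim extracted"
```

Running this on a few raw responses is a quick way to tell whether a low score comes from the generation itself or from the post-processing.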


In [2]:
finetuned_agent = FinetunedAgent()

  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.50.0.
   \\   /|    NVIDIA GeForce GTX 1660 Ti. Num GPUs = 1. Max memory: 6.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.6.0+cu118. CUDA: 7.5. CUDA Toolkit: 11.8. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.3.18 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [9]:
from tqdm import tqdm
from utils.metrics import evaluate_claim_extraction
from utils.dataset import ClaimVerificationDataset

def evaluate_on_dataset_finetuned(test_dataset: ClaimVerificationDataset, agent: FinetunedAgent, limit = None):
    data = []
    scores = []

    counter = 0
    for entry in tqdm(test_dataset, desc="Evaluating LLM claim extraction"):
        user_prompt = entry["text"]
        output = agent.ask(user_prompt)
        meteor_score = evaluate_claim_extraction(entry["claim"], output)

        data.append({
            "ground_truth_claim": entry["claim"],
            "generated_claim": output,
            "meteor_score": meteor_score
        })

        scores.append(meteor_score)

        counter += 1
        if limit and counter >= limit:
            break

    avg_score = sum(scores) / len(scores) if scores else 0

    return data, avg_score

In [10]:
data, avg_score = evaluate_on_dataset_finetuned(test_dataset, finetuned_agent, limit=20)

Evaluating LLM claim extraction:   2%|▏         | 19/1171 [02:08<2:10:15,  6.78s/it]


In [11]:
print(avg_score)

0.00401


The model seems to need more training. It struggles with longer passages, while shorter prompts can give good results. The best score I have seen with a finetuned model so far is ~0.2.
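
One way to check the observation about longer passages is to compare the average score for short and long inputs. The sketch below assumes the passages are available in the same order as the scores (the `data` entries returned by the evaluation function do not store the passage, so it would have to be collected separately). The function name `mean_score_by_length` and the 100-word threshold are arbitrary choices for illustration.

```python
from statistics import mean


def mean_score_by_length(passages: list, scores: list, max_words: int = 100) -> dict:
    """Average the scores separately for passages up to and above `max_words` words."""
    short = [s for p, s in zip(passages, scores) if len(p.split()) <= max_words]
    long = [s for p, s in zip(passages, scores) if len(p.split()) > max_words]
    return {
        "short": mean(short) if short else None,
        "long": mean(long) if long else None,
    }
```

A clearly lower "long" average would support the idea that passage length is the main problem.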