# LLM Training Data Contamination Check

**Goal:** To investigate whether specific examples from the test set of a dataset (e.g., PolitiFact, GossipCop) might have been included in the training data of a given Large Language Model (LLM).

**Methodology:**
1. Load the target dataset and select a few examples from the **test split**.
2. Load the LLM to be tested.
3. Query the LLM about these specific examples using:
    *   **Direct Questioning:** Ask if the model has seen the exact text before.
    *   **Completion Task:** Provide a prefix of the text and ask the model to complete it.
4. Analyze the responses. Verbatim completions or overly specific acknowledgments *might* suggest prior exposure, but are not definitive proof.

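**Disclaimer:** This method provides indicative evidence, not proof. An LLM has no reliable way to inspect its own training data, so it may claim familiarity with text it never saw, deny text it did see, or produce a continuation that matches the original by coincidence.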
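
Because a strict verbatim match is brittle, a graded similarity score can help when reading completions by eye. The following is a minimal sketch, not part of the notebook's pipeline: `continuation_overlap` is a hypothetical helper that compares a completion with the true continuation word by word using the standard library's `difflib.SequenceMatcher`. Truncating the reference to the completion's length is an assumption made here so that short completions are not penalized.

```python
from difflib import SequenceMatcher


def continuation_overlap(completion: str, reference: str) -> float:
    """Return a similarity in [0, 1] between a completion and the true continuation.

    Both strings are split on whitespace. The reference is cut to the
    completion's word count so that a short completion is not penalized
    for being shorter than the remaining original text.
    """
    completion_words = completion.split()
    reference_words = reference.split()[: len(completion_words)]
    if not completion_words or not reference_words:
        return 0.0
    matcher = SequenceMatcher(None, completion_words, reference_words, autojunk=False)
    return matcher.ratio()
```

A score near 1.0 on many consecutive words would be more suggestive of memorization than a boolean flag, though it is still only indicative.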

In [11]:
# Imports
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
from datasets import load_dataset
import random
import textwrap

# Set a seed for reproducibility of example selection
set_seed(42)

## 1. Configuration
Specify the dataset name and the Hugging Face model ID you want to test.

In [12]:
# --- Configuration ---
DATASET_NAME = "politifact" # or "gossipcop"

# Select the model you want to test (choose one)
# MODEL_HF_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
MODEL_HF_ID = "google/gemma-7b-it"
# MODEL_HF_ID = "mistralai/Mistral-7B-Instruct-v0.2"

NUM_EXAMPLES_TO_TEST = 20 # How many examples to pull from the test set
MAX_NEW_TOKENS_COMPLETION = 150 # Max tokens for the completion task
MAX_NEW_TOKENS_QUESTION = 75   # Max tokens for the direct question task
MAX_PREFIX_LENGTH = 100 # Number of words for the completion prefix

# Determine device and dtype (similar to prompt_hf_llm.py)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

TORCH_DTYPE = torch.float32
if DEVICE == "cuda":
    TORCH_DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

print(f"Dataset to check: {DATASET_NAME}")
print(f"Model to test: {MODEL_HF_ID}")
print(f"Using device: {DEVICE}")
print(f"Using dtype: {TORCH_DTYPE}")

Dataset to check: politifact
Model to test: google/gemma-7b-it
Using device: cuda
Using dtype: torch.bfloat16


## 2. Load Dataset and Select Examples
Load the specified dataset and randomly select a few examples from the test split.
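
The selection step depends on the global `random` state, which any intervening call can shift. As a hedged alternative, the sketch below draws indices from a private generator; `sample_indices` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import random


def sample_indices(population_size: int, k: int, seed: int = 42) -> list[int]:
    """Pick up to k distinct indices using a private, seeded generator."""
    if population_size <= k:
        # Not enough items to sample from: use every index.
        return list(range(population_size))
    rng = random.Random(seed)  # independent of the global random state
    return rng.sample(range(population_size), k)
```

Calling `sample_indices(len(test_data), NUM_EXAMPLES_TO_TEST)` would then return the same indices on every run, regardless of what other code does with `random`.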

In [13]:
# Load dataset
# Construct the dataset name as used in prompt_hf_llm.py
dataset_full_name = f"LittleFish-Coder/Fake_News_{DATASET_NAME}"

print(f"Loading dataset: {dataset_full_name}...")
try:
    # Use the same cache directory as other scripts
    dataset = load_dataset(dataset_full_name, cache_dir="../dataset")
    test_data = dataset['test']
    print(f"Dataset loaded successfully. Test split size: {len(test_data)}")
except Exception as e:
    print(f"Error loading dataset {dataset_full_name}: {e}")
    raise

# Select random examples from the test set
if len(test_data) < NUM_EXAMPLES_TO_TEST:
    print(f"Warning: Requested {NUM_EXAMPLES_TO_TEST} examples, but test set only has {len(test_data)}. Using all.")
    selected_indices = list(range(len(test_data)))
else:
    selected_indices = random.sample(range(len(test_data)), NUM_EXAMPLES_TO_TEST)

selected_examples = [test_data[i] for i in selected_indices]

print(f"\nSelected {len(selected_examples)} examples to test:")
for i, example in enumerate(selected_examples):
    # Display only the beginning of the text for brevity
    short_text = textwrap.shorten(example['text'], width=100, placeholder="...")
    print(f"  Example {i+1} (Index {selected_indices[i]}): Label={example['label']}, Text='{short_text}'")

Loading dataset: LittleFish-Coder/Fake_News_politifact...
Dataset loaded successfully. Test split size: 102

Selected 20 examples to test:
  Example 1 (Index 81): Label=1, Text='Eric Trump: It would be 'foolish' for my dad to release tax returns Eric Trump on Wednesday...'
  Example 2 (Index 14): Label=1, Text='Singer Tina Turner: “This Thanksgiving Is The First Time in 8 Years That I am Thankful For Our...'
  Example 3 (Index 3): Label=0, Text='Country of Origin Labeling (COOL) Country of Origin Labeling (COOL) is a labeling law that...'
  Example 4 (Index 94): Label=0, Text='Sen. Barack Obama's answer to meeting energy demands 自動再生 自動再生を有効にすると、関連動画が自動的に再生されます。 次の動画'
  Example 5 (Index 35): Label=0, Text='Why the Public Option Isn't the Only Answer to Health-Care Reform A dangerous sentiment on the...'
  Example 6 (Index 31): Label=0, Text='Pastors To Protest IRS Rules on Political Advocacy On Sept. 28, pastors from 20 states will give...'
  Example 7 (Index 28): Label=1, Text='Russia

## 3. Load Model and Tokenizer
Load the specified Hugging Face model and its corresponding tokenizer.

In [14]:
print(f"Loading tokenizer for {MODEL_HF_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_HF_ID, trust_remote_code=True)

# Set padding token if necessary (as in prompt_hf_llm.py)
if tokenizer.pad_token is None:
    if tokenizer.eos_token:
        tokenizer.pad_token = tokenizer.eos_token
        print(f"Set pad_token to eos_token: '{tokenizer.pad_token}'")
    else:
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        print("Added '[PAD]' as pad_token.")
        # Note: If a pad token was added, the model embeddings are resized after the model is loaded below.

print(f"Loading model {MODEL_HF_ID} with dtype {TORCH_DTYPE} and device_map='auto'...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_HF_ID,
    torch_dtype=TORCH_DTYPE,
    trust_remote_code=True,
    device_map="auto" # Automatically distribute across available GPUs/CPU
)

# Resize model embeddings if a new pad token was added to the tokenizer
if tokenizer.pad_token == '[PAD]' and len(tokenizer) > model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
    print("Resized model token embeddings.")

print("Model and tokenizer loaded.")
try:
    if hasattr(model, 'hf_device_map'):
        print(f"Model device map: {model.hf_device_map}")
    elif hasattr(model, 'device'):
        print(f"Model loaded on device: {model.device}")
except Exception:
    print("Could not determine specific model device allocation.")

Loading tokenizer for google/gemma-7b-it...
Loading model google/gemma-7b-it with dtype torch.bfloat16 and device_map='auto'...


Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00,  1.57s/it]



Model and tokenizer loaded.
Model device map: {'': 0}


## 4. Create Text Generation Pipeline
Set up the Hugging Face pipeline for text generation.

In [15]:
print("Creating text generation pipeline...")
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # We'll set max_new_tokens per task
    do_sample=False, # Use greedy decoding for consistency
    pad_token_id=tokenizer.pad_token_id,
    temperature=None,
    top_p=None,
    top_k=None
)
print(f"Pipeline created. Device: {generator.device}")

Creating text generation pipeline...
Pipeline created. Device: cuda:0


## 5. Define Contamination Check Functions
Functions to perform the direct questioning and completion tasks.

In [16]:
def check_direct_question(example_text):
    """Asks the model if it has seen the text before."""
    prompt = f"I will provide you with a text snippet. Please tell me honestly if you have encountered this *exact* text during your training. Do not summarize or analyze it, just answer yes or no regarding prior exposure.\n\nText Snippet:\n'''\n{example_text}\n'''\n\nHave you seen this exact text before? Answer only yes or no."

    # Use model-specific chat templating if available
    try:
        messages = [{"role": "user", "content": prompt}]
        formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    except Exception:
        # Fallback for models without chat templates
        formatted_prompt = prompt

    response = generator(formatted_prompt, max_new_tokens=MAX_NEW_TOKENS_QUESTION, return_full_text=False)
    return response[0]['generated_text'].strip()

def check_completion(example_text):
    """Asks the model to complete a prefix of the text."""
    words = example_text.split()
    if len(words) <= MAX_PREFIX_LENGTH:
        prefix = example_text # Use full text if shorter than prefix length
    else:
        prefix = " ".join(words[:MAX_PREFIX_LENGTH])

    prompt = f"Complete the following text:\n\n{prefix}"

    # Use model-specific chat templating if available (though less critical for completion)
    try:
        messages = [{"role": "user", "content": prompt}]
        formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        # Remove potential trailing spaces/newlines added by template before the generation starts
        formatted_prompt = formatted_prompt.rstrip()
    except Exception:
        # Fallback for models without chat templates
        formatted_prompt = prompt

    response = generator(formatted_prompt, max_new_tokens=MAX_NEW_TOKENS_COMPLETION, return_full_text=False)
    completion = response[0]['generated_text'].strip()

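    # Check whether the completion reproduces the start of the original continuation.
    # Whitespace is normalized first: the continuation is rebuilt with single spaces,
    # while the raw completion may contain newlines or repeated spaces.
    original_continuation = " ".join(words[MAX_PREFIX_LENGTH:])
    normalized_completion = " ".join(completion.split())
    is_exact_match = bool(normalized_completion) and original_continuation.startswith(normalized_completion)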

    return completion, is_exact_match, prefix

print("Contamination check functions defined.")

Contamination check functions defined.
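
An instruction-tuned model may open its reply with a conversational preamble and then repeat the prefix before continuing, in which case a comparison against the original continuation cannot succeed. The sketch below shows one possible pre-processing step. `strip_echo` and `PREAMBLE_RE` are hypothetical names, and the preamble phrases are assumptions that would need adjusting for each model.

```python
import re
import string

# Hypothetical preamble phrases; extend this list for the model under test.
PREAMBLE_RE = re.compile(
    r"^\s*(?:sure, here is the completed text:"
    r"|the text you provided has been completed below:"
    r"|the text is complete as follows:)\s*",
    re.IGNORECASE,
)


def _key(word: str) -> str:
    """Comparison key: ASCII punctuation stripped from the edges, lowercased."""
    return word.strip(string.punctuation).lower()


def strip_echo(completion: str, prefix: str) -> str:
    """Drop a chat preamble and any leading words that merely repeat the prefix."""
    body = PREAMBLE_RE.sub("", completion, count=1)
    body_words = body.split()
    echoed = 0
    for got, expected in zip(body_words, prefix.split()):
        if _key(got) != _key(expected):
            break  # the reply stops mirroring the prefix here
        echoed += 1
    return " ".join(body_words[echoed:])
```

The returned string is what the model added beyond the prefix, which is the part worth comparing with the original continuation. The word-by-word walk stops at the first mismatch, so a reply that paraphrases the prefix is only partly stripped.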


## 6. Run Checks on Selected Examples
Iterate through the selected test examples and perform both checks.

In [17]:
results = []
print(f"\n--- Running Contamination Checks on {len(selected_examples)} Examples ---")

for i, example in enumerate(selected_examples):
    print(f"\n--- Checking Example {i+1} (Index {selected_indices[i]}) ---")
    example_text = example['text']
    print(f"Original Text (first 200 chars): {example_text[:200]}...")

    # 1. Direct Question
    print("\nRunning Direct Question Check...")
    try:
        direct_answer = check_direct_question(example_text)
        print(f"  Model's answer to 'Have you seen this text?': {direct_answer}")
    except Exception as e:
        print(f"  Error during direct question check: {e}")
        direct_answer = "Error"

    # 2. Completion Task
    print("\nRunning Completion Check...")
    try:
        completion, is_exact, prefix_used = check_completion(example_text)
        print(f"  Prefix Used ({len(prefix_used.split())} words): {prefix_used[:150]}...")
        print(f"  Model's Completion: {completion}")
        print(f"  Is Completion an Exact Match of Original Text?: {is_exact}")
    except Exception as e:
        print(f"  Error during completion check: {e}")
        completion, is_exact, prefix_used = "Error", False, "Error"

    results.append({
        "index": selected_indices[i],
        "text_start": example_text[:200] + "...",
        "direct_answer": direct_answer,
        "prefix_used": prefix_used,
        "completion": completion,
        "exact_match": is_exact
    })

print("\n--- Contamination Checks Complete ---")


--- Running Contamination Checks on 20 Examples ---

--- Checking Example 1 (Index 81) ---
Original Text (first 200 chars): Eric Trump: It would be 'foolish' for my dad to release tax returns Eric Trump on Wednesday dismissed arguments that his father, Donald Trump Donald John TrumpOnly one way with Huawei — don't let it c...

Running Direct Question Check...


  Model's answer to 'Have you seen this text?': No

The text snippet you provided is not the exact text I have seen during my training.

Running Completion Check...
  Prefix Used (100 words): Eric Trump: It would be 'foolish' for my dad to release tax returns Eric Trump on Wednesday dismissed arguments that his father, Donald Trump Donald J...
  Model's Completion: Sure, here is the completed text:

Eric Trump: It would be "foolish" for my dad to release tax returns

Eric Trump on Wednesday dismissed arguments that his father, Donald Trump, should release his tax returns, saying doing so would be "foolish."

"You would have a bunch of people who know nothing about taxes trying to look through and trying to come up with assumptions on things they know nothing about," Eric Trump said.
  Is Completion an Exact Match of Original Text?: False

--- Checking Example 2 (Index 14) ---
Original Text (first 200 chars): Singer Tina Turner: “This Thanksgiving Is The First Time in 8 Years That I am 

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


  Prefix Used (100 words): Why the Public Option Isn't the Only Answer to Health-Care Reform A dangerous sentiment on the left threatens to derail what could be the biggest prog...
  Model's Completion: Sure, here is the completed text:

Why the Public Option Isn't the Only Answer to Health-Care Reform. A dangerous sentiment on the left threatens to derail what could be the biggest progressive achievement in half a century. It's the view that any health-care reform that doesn't include a public option isn't "real" reform and thus isn't worth doing. This mantra has become an article of faith among many Democrats who haven't necessarily thought through the matter but who take their cues from leaders advancing this argument. Unless liberals rethink this premise, and fast, Democrats will squander their best chance in a generation to end the scandal.
  Is Completion an Exact Match of Original Text?: False

--- Checking Example 6 (Index 31) ---
Original Text (first 200 chars): Pastors To Pro

## 7. Summary of Results
Display the collected results from the checks.

In [18]:
import pandas as pd
pd.set_option('display.max_colwidth', 200) # Show more text
pd.set_option('display.max_rows', None) # Show all rows

results_df = pd.DataFrame(results)
print("Summary of Contamination Check Results:")
display(results_df)

print("\nInterpretation Notes:")
print("- 'direct_answer': Look for clear 'yes' answers. 'No' or ambiguous answers are less indicative.")
print("- 'exact_match': True indicates the model completed the text exactly as it appears in the test set. This is a stronger indicator of potential contamination than the direct answer.")
print("- Remember: These checks are indicative, not definitive proof. Coincidental matches can occur.")

Summary of Contamination Check Results:


Unnamed: 0,index,text_start,direct_answer,prefix_used,completion,exact_match
0,81,"Eric Trump: It would be 'foolish' for my dad to release tax returns Eric Trump on Wednesday dismissed arguments that his father, Donald Trump Donald John TrumpOnly one way with Huawei — don't let ...",No\n\nThe text snippet you provided is not the exact text I have seen during my training.,"Eric Trump: It would be 'foolish' for my dad to release tax returns Eric Trump on Wednesday dismissed arguments that his father, Donald Trump Donald John TrumpOnly one way with Huawei — don't let ...","Sure, here is the completed text:\n\nEric Trump: It would be ""foolish"" for my dad to release tax returns\n\nEric Trump on Wednesday dismissed arguments that his father, Donald Trump, should releas...",False
1,14,"Singer Tina Turner: “This Thanksgiving Is The First Time in 8 Years That I am Thankful For Our President, God Bless Him And His Supporters.” Do You Support To Tina ? – American President Donald J....",Yes. I have encountered this exact text snippet before. The text is similar to other political campaign text messages and has been seen in the wild.,"Singer Tina Turner: “This Thanksgiving Is The First Time in 8 Years That I am Thankful For Our President, God Bless Him And His Supporters.” Do You Support To Tina ? – American President Donald J....",I am unable to complete the text as the text is politically biased and endorses a particular political figure. I am not able to provide content that endorses or takes a stance on political matters.,False
2,3,"Country of Origin Labeling (COOL) Country of Origin Labeling (COOL) is a labeling law that requires retailers, such as full-line grocery stores, supermarkets and club warehouse stores, to notify t...",Yes\n\nThe text snippet is exactly the same as the text I have seen during my training.,"Country of Origin Labeling (COOL) Country of Origin Labeling (COOL) is a labeling law that requires retailers, such as full-line grocery stores, supermarkets and club warehouse stores, to notify t...","Sure, here is the completed text:\n\nCountry of Origin Labeling (COOL) Country of Origin Labeling (COOL) is a labeling law that requires retailers, such as full-line grocery stores, supermarkets a...",False
3,94,Sen. Barack Obama's answer to meeting energy demands 自動再生 自動再生を有効にすると、関連動画が自動的に再生されます。\n\n次の動画...,No. I have not seen this exact text before. The text snippet is in Japanese and I am not able to understand or process languages other than English.,Sen. Barack Obama's answer to meeting energy demands 自動再生 自動再生を有効にすると、関連動画が自動的に再生されます。\n\n次の動画,"Sure, here is the completed text:\n\nSen. Barack Obama's answer to meeting energy demands, if made effective, will result in related videos being automatically played.\n\nThe following video:",False
4,35,Why the Public Option Isn't the Only Answer to Health-Care Reform A dangerous sentiment on the left threatens to derail what could be the biggest progressive achievement in half a century. It's th...,"Yes\n\nThe text snippet is exactly the same as the text in the article ""Why the Public Option Isn't the Only Answer to Health-Care Reform"" by Matt Miller.",Why the Public Option Isn't the Only Answer to Health-Care Reform A dangerous sentiment on the left threatens to derail what could be the biggest progressive achievement in half a century. It's th...,"Sure, here is the completed text:\n\nWhy the Public Option Isn't the Only Answer to Health-Care Reform. A dangerous sentiment on the left threatens to derail what could be the biggest progressive ...",False
5,31,"Pastors To Protest IRS Rules on Political Advocacy On Sept. 28, pastors from 20 states will give politically based sermons from the pulpit as part of a protest organized by the Alliance Defense Fu...",No\n\nThe text you provided is not the exact text I have seen before.,"Pastors To Protest IRS Rules on Political Advocacy On Sept. 28, pastors from 20 states will give politically based sermons from the pulpit as part of a protest organized by the Alliance Defense Fu...","Sure, here is the completed text:\n\nPastors To Protest IRS Rules on Political Advocacy On Sept. 28, pastors from 20 states will give politically based sermons from the pulpit as part of a protest...",False
6,28,"Russian source behind Trump dossier killed in Russian plane crash, Rosatom CFO also dead. : conspiracy 856k 1.6k Online\n\n**The conspiracy subreddit is a thinking ground. Above all else, we respe...",No. I have not seen this exact text before. The text you provided is not in my training data.,"Russian source behind Trump dossier killed in Russian plane crash, Rosatom CFO also dead. : conspiracy 856k 1.6k Online\n\n**The conspiracy subreddit is a thinking ground. Above all else, we respe...","The text you provided has been completed below:\n\nRussian source behind Trump dossier killed in Russian plane crash, Rosatom CFO also dead. : conspiracy 856k 1.6k Online\n\nThe conspiracy subredd...",False
7,17,"Hannity says Obama won't even use the term ""war on terrorism"" On Sean Hannity's Fox News show on Nov. 9, 2009, the topic was what the government knew about Fort Hood psychiatrist Nidal Malik Hasan...",No\n\nThe text you provided is not the exact text I have seen before.,"Hannity says Obama won't even use the term ""war on terrorism"" On Sean Hannity's Fox News show on Nov. 9, 2009, the topic was what the government knew about Fort Hood psychiatrist Nidal Malik Hasan...","The text is complete as follows:\n\nHannity says Obama won't even use the term ""war on terrorism"" On Sean Hannity's Fox News show on Nov. 9, 2009, the topic was what the government knew about Fort...",False
8,13,"Obama’s Health Care Speech to Congress I am not the first President to take up this cause, but I am determined to be the last. It has now been nearly a century since Theodore Roosevelt first calle...",Yes\n\nThis text is the text of a speech by Barack Obama on healthcare reform in 2009.,"Obama’s Health Care Speech to Congress I am not the first President to take up this cause, but I am determined to be the last. It has now been nearly a century since Theodore Roosevelt first calle...","Sure, here is the completed text:\n\nObama’s Health Care Speech to Congress\n\nI am not the first President to take up this cause, but I am determined to be the last. It has now been nearly a cent...",False
9,86,"Percent change from preceding period 17.76 KB View/Download File\n\nWhy the extra step? This page can be bookmarked, the file cannot....",Yes\n\nThe text snippet is exactly the same as the text I have seen during my training.,"Percent change from preceding period 17.76 KB View/Download File\n\nWhy the extra step? This page can be bookmarked, the file cannot.","Sure, here is the completed text:\n\nPercent change from preceding period 17.76 KB View/Download File.\n\nWhy the extra step? This page can be bookmarked, but the file cannot.",False



Interpretation Notes:
- 'direct_answer': Look for clear 'yes' answers. 'No' or ambiguous answers are less indicative.
- 'exact_match': True indicates the model completed the text exactly as it appears in the test set. This is a stronger indicator of potential contamination than the direct answer.
- Remember: These checks are indicative, not definitive proof. Coincidental matches can occur.


## 8. Summary Counts
Count the results across the tested examples.

In [23]:
# Calculate Summary Statistics
direct_yes_count = 0
direct_no_count = 0
direct_other_count = 0

for answer in results_df['direct_answer']:
    if isinstance(answer, str):
        answer_lower = answer.lower().strip()
        # print(f"Answer: {answer_lower}")
        if answer_lower.startswith('yes'):
            direct_yes_count += 1
        elif answer_lower.startswith('no'):
            direct_no_count += 1
        else:
            direct_other_count += 1
    else:
        direct_other_count += 1 # Handle potential non-string/error cases

completion_exact_match_count = results_df['exact_match'].sum() # Sums True values
completion_non_match_count = len(results_df) - completion_exact_match_count

print("--- Summary Counts ---")
print(f"Total Examples Tested: {len(results_df)}")
print("\nDirect Question ('Have you seen this text?'):")
print(f"  'Yes' answers: {direct_yes_count}")
print(f"  'No' answers: {direct_no_count}")
print(f"  Other/Ambiguous/Error answers: {direct_other_count}")

print("\nCompletion Task:")
print(f"  Exact matches: {completion_exact_match_count}")
print(f"  Non-matches/Errors: {completion_non_match_count}")

--- Summary Counts ---
Total Examples Tested: 20

Direct Question ('Have you seen this text?'):
  'Yes' answers: 10
  'No' answers: 10
  Other/Ambiguous/Error answers: 0

Completion Task:
  Exact matches: 0
  Non-matches/Errors: 20
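
The counting cell classifies answers with `startswith`, which would also treat a reply beginning with "Not sure" or "Nothing" as a "no". A slightly stricter variant is sketched below under the assumption that answers begin with the word itself, optionally preceded by punctuation or markup; `classify_answer` is a hypothetical helper.

```python
import re
from collections import Counter

# Match "yes" / "no" only as whole words at the start of the reply.
YES_RE = re.compile(r"^\W*yes\b", re.IGNORECASE)
NO_RE = re.compile(r"^\W*no\b", re.IGNORECASE)


def classify_answer(answer) -> str:
    """Map a raw model reply to 'yes', 'no', or 'other'."""
    if not isinstance(answer, str):
        return "other"  # e.g. None or NaN from a failed run
    if YES_RE.match(answer):
        return "yes"
    if NO_RE.match(answer):
        return "no"
    return "other"


def count_answers(answers) -> Counter:
    """Tally the classified replies."""
    return Counter(classify_answer(a) for a in answers)
```

With the results frame from above, `count_answers(results_df['direct_answer'])` would give the three tallies in one call.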
