# Task 3: Hallucination Detection Methods Comparison

**Student:** Syed Rayan Alam NASSER   
**Group:** DSA 9  

---

## 1. Introduction

### Research Context

This task compares automated techniques for detecting hallucinations in Large Language Model (LLM) outputs when the model extracts structured fields from resumes. Huang et al. (2024) is cited here for hallucination rates of 8-15% in LLM resume extraction tasks, which raise trust and reliability concerns for automated hiring systems.

**Relevant Papers:**
- Hallucination detection survey (Huang et al., 2024)
- Semantic similarity methods (BERTScore)
- Natural Language Inference approaches
- Ensemble detection techniques

**Research Gap:** Hallucination detection is well studied in general NLP, but a systematic comparison of detection methods for structured information extraction (resume parsing) is still missing.

### Objectives

1. Compare the effectiveness of four automated hallucination detection methods (BERTScore, token overlap, NLI, and an ensemble) against a ground-truth baseline
2. Evaluate trade-offs between detection accuracy and computational cost
3. Identify optimal detection approach for production resume screening systems
4. Achieve target hallucination detection rate >90% F1 with <200ms latency

### Research Question

**Primary:** Which automated hallucination detection method most effectively identifies LLM-generated misinformation in resume extraction while maintaining acceptable computational efficiency?

**Secondary:** How do semantic similarity methods (BERTScore) compare to logical inference methods (NLI) in detecting different hallucination types (intrinsic vs. extrinsic)?

### Contribution to Project

This task validates Core Deliverable 2 of the MVP: demonstrating that hybrid extraction systems can achieve <2% hallucination rate versus 8%+ LLM baseline. Detection methods will be integrated as confidence scoring mechanisms, flagging low-confidence extractions for human review.

---

## 2. Setup


### Environment Configuration

**Platform:** Google Colab (T4 GPU)  
**Python Version:** 3.10  
**Key Libraries:** Transformers, PyTorch, NLTK (installed unpinned; the run below reports PyTorch 2.9.0+cu126)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create project folder
!mkdir -p /content/drive/MyDrive/task3_hallucination
%cd /content/drive/MyDrive/task3_hallucination

print("Project folder created!")

Mounted at /content/drive
/content/drive/MyDrive/task3_hallucination
Project folder created!


### Reproducibility Settings

In [None]:
import random
import numpy as np
import torch

# Set random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4


### Library Installation

In [None]:
!pip install -q google-generativeai transformers torch accelerate \
    bert-score nltk scikit-learn statsmodels matplotlib seaborn \
    pandas numpy tqdm kagglehub

# Download NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)


True

### Core Imports

In [None]:
import pandas as pd
import numpy as np
import json
import os
import re
import time
from tqdm import tqdm

# Deep learning
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from bert_score import score as bertscore_compute
from huggingface_hub import login

# NLP
from nltk.tokenize import word_tokenize

# Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_recall_fscore_support,
    confusion_matrix,
    cohen_kappa_score,
    roc_auc_score
)
from statsmodels.stats.contingency_tables import mcnemar

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

os.environ['JUPYTER_WIDGETS_STATE'] = '0'

---

## 3. Data Loading & Exploration


### Dataset Source

**Dataset:** Kaggle AI-Powered Resume Screening 2025  
**URL:** https://www.kaggle.com/datasets/mdtalhask/ai-powered-resume-screening-dataset-2025  
**Size:** 1000 resumes total, using 100 for this task  
**Fields:** Resume_ID, Name, Skills, Experience (Years), Education, Certifications, Job Role, Recruiter Decision, Salary Expectation ($), Projects Count, AI Score (0-100)

### Load Dataset

In [None]:
import kagglehub

# Download dataset
dataset_dir = kagglehub.dataset_download(
    "mdtalhask/ai-powered-resume-screening-dataset-2025"
)
csv_path = os.path.join(dataset_dir, "AI_Resume_Screening.csv")

# Load data
df = pd.read_csv(csv_path)

print(f"Dataset loaded: {len(df)} resumes")
print(f"\nColumns: {list(df.columns)}")
print(f"\nDataset Info:")
print(df.info())

Downloading from https://www.kaggle.com/api/v1/datasets/download/mdtalhask/ai-powered-resume-screening-dataset-2025?dataset_version_number=1...


100%|██████████| 22.8k/22.8k [00:00<00:00, 33.2MB/s]

Extracting files...
Dataset loaded: 1000 resumes

Columns: ['Resume_ID', 'Name', 'Skills', 'Experience (Years)', 'Education', 'Certifications', 'Job Role', 'Recruiter Decision', 'Salary Expectation ($)', 'Projects Count', 'AI Score (0-100)']

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Resume_ID               1000 non-null   int64 
 1   Name                    1000 non-null   object
 2   Skills                  1000 non-null   object
 3   Experience (Years)      1000 non-null   int64 
 4   Education               1000 non-null   object
 5   Certifications          726 non-null    object
 6   Job Role                1000 non-null   object
 7   Recruiter Decision      1000 non-null   object
 8   Salary Expectation ($)  1000 non-null   int64 
 9   Projects Count          1000 non-null   int64 
 10  AI Score




### Data Exploration

In [None]:
# Display first few rows
print("\nSample resumes:")
print(df.head())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Basic statistics
print("\nExperience distribution:")
print(df['Experience (Years)'].describe())

print("\nEducation distribution:")
print(df['Education'].value_counts())

print("\nTop skills:")
all_skills = ','.join(df['Skills'].dropna()).split(',')
skill_counts = pd.Series([s.strip() for s in all_skills]).value_counts()
print(skill_counts.head(10))


Sample resumes:
   Resume_ID              Name                                        Skills  \
0          1        Ashley Ali                      TensorFlow, NLP, Pytorch   
1          2      Wesley Roman  Deep Learning, Machine Learning, Python, SQL   
2          3     Corey Sanchez         Ethical Hacking, Cybersecurity, Linux   
3          4  Elizabeth Carney                   Python, Pytorch, TensorFlow   
4          5        Julie Hill                              SQL, React, Java   

   Experience (Years) Education                Certifications  \
0                  10      B.Sc                           NaN   
1                  10       MBA                     Google ML   
2                   1       MBA  Deep Learning Specialization   
3                   7    B.Tech                 AWS Certified   
4                   4       PhD                           NaN   

                Job Role Recruiter Decision  Salary Expectation ($)  \
0          AI Researcher               H

---

## 4. Preprocessing

### Create Extraction Tasks

Transform structured CSV data into resume text format suitable for LLM extraction.

In [None]:
def create_resume_instances(df, num_resumes=100):
    """
    Convert structured resume data into text format for LLM extraction.

    Args:
        df: DataFrame with resume data
        num_resumes: Number of resumes to process

    Returns:
        List of dictionaries containing resume text and ground truth
    """
    df_sample = df.head(num_resumes).copy()
    extraction_tasks = []

    for idx, row in df_sample.iterrows():
        # Construct resume text from structured fields
        resume_text = f"""
Name: {row['Name']}
Skills: {row['Skills']}
Experience: {row['Experience (Years)']} years
Education: {row['Education']}
Certifications: {row['Certifications']}
Job Role: {row['Job Role']}
Projects: {row['Projects Count']} projects completed
Salary Expectation: ${row['Salary Expectation ($)']}
AI Score: {row['AI Score (0-100)']}/100
""".strip()

        # Ground truth labels (what should be extracted)
        ground_truth = {
            'skills': [s.strip() for s in str(row['Skills']).split(',') if s.strip()],
            'education': [str(row['Education'])],
            'experience': [f"{row['Experience (Years)']} years in {row['Job Role']}"],
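            # NOTE: a missing certification arrives as NaN, which str() turns into 'nan';
            # the `!= 'None'` filter does not catch it, so 'nan' is kept as an item.
            'certifications': [s.strip() for s in str(row['Certifications']).split(',')
                             if s.strip() and s != 'None'],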
            'achievements': [f"{row['Projects Count']} projects completed"]
        }

        extraction_tasks.append({
            'resume_id': row['Resume_ID'],
            'resume_text': resume_text,
            'ground_truth': ground_truth,
            'name': row['Name']
        })

    return extraction_tasks

# Create extraction tasks
extraction_tasks = create_resume_instances(df, num_resumes=100)

# Save for reproducibility
with open('extraction_tasks.json', 'w') as f:
    json.dump(extraction_tasks, f, indent=2)

print(f"Created {len(extraction_tasks)} extraction tasks")
print(f"\nExample task:")
print(f"Resume ID: {extraction_tasks[0]['resume_id']}")
print(f"Name: {extraction_tasks[0]['name']}")
print(f"\nResume Text Preview:")
print(extraction_tasks[0]['resume_text'][:300])
print(f"\nGround Truth:")
print(json.dumps(extraction_tasks[0]['ground_truth'], indent=2))

Created 100 extraction tasks

Example task:
Resume ID: 1
Name: Ashley Ali

Resume Text Preview:
Name: Ashley Ali
Skills: TensorFlow, NLP, Pytorch
Experience: 10 years
Education: B.Sc
Certifications: nan
Job Role: AI Researcher
Projects: 8 projects completed
Salary Expectation: $104895
AI Score: 100/100

Ground Truth:
{
  "skills": [
    "TensorFlow",
    "NLP",
    "Pytorch"
  ],
  "education": [
    "B.Sc"
  ],
  "experience": [
    "10 years in AI Researcher"
  ],
  "certifications": [
    "nan"
  ],
  "achievements": [
    "8 projects completed"
  ]
}
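
The example output above shows a missing certification ending up as the string `"nan"` in the ground truth. One possible way to avoid that is sketched below. The helper name `split_list_field` and the choice to treat the literal strings `nan` and `none` as missing are assumptions for illustration, not part of the original notebook, and switching to it would change the downstream labels.

```python
import math

def split_list_field(value, sep=","):
    """Split a delimited cell into clean items; missing cells give []."""
    # None and float NaN (how pandas represents empty cells) count as missing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    items = [part.strip() for part in str(value).split(sep)]
    # Drop empty fragments and literal placeholder strings.
    return [item for item in items if item and item.lower() not in {"nan", "none"}]

print(split_list_field(float("nan")))
print(split_list_field("AWS Certified, Google ML"))
```

With a helper like this, a resume without certifications would get an empty ground-truth list instead of a one-item list.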


---

## 5. Model Definition


### LLM Baseline: Gemma 2B Instruct

Using Google's Gemma 2B Instruct model as the baseline LLM for resume extraction.

**Model Specifications:**
- **Architecture:** Gemma 2B (Google)
- **Parameters:** 2 billion
- **Context Length:** 8192 tokens
- **Purpose:** Generate structured extractions from resume text
- **Justification:** Open-source, efficient, strong instruction-following capabilities

### Authentication and Model Loading

In [None]:
# HuggingFace authentication
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')
login(token=HF_TOKEN)

print("Authenticated with HuggingFace")

Authenticated with HuggingFace


In [None]:
# Load Gemma 2B Instruct
print("Loading Gemma 2B Instruct model...")

gemma_generator = pipeline(
    "text-generation",
    model="google/gemma-2b-it",
    torch_dtype=torch.float16,
    device_map="auto",
    token=HF_TOKEN
)

print("Model loaded successfully")

# Test model
test_messages = [{"role": "user", "content": "Extract skills from: Python, Java, SQL"}]
test_output = gemma_generator(test_messages, max_new_tokens=50)
print(f"\nModel test successful")
print(f"Test output: {test_output[0]['generated_text'][-1]['content'][:100]}")

Loading Gemma 2B Instruct model...


`torch_dtype` is deprecated! Use `dtype` instead!


Device set to use cuda:0


Model loaded successfully

Model test successful
Test output: **Python**

* Programming
* Data analysis
* Machine learning
* Problem-solving

**Java**

* Object-o


### Extraction Prompt Template

In [None]:
EXTRACTION_PROMPT = """Extract information from this resume and return ONLY a valid JSON object.

Resume:
{resume_text}

Return this exact JSON format:
{{
  "skills": ["skill1", "skill2"],
  "education": ["degree1"],
  "experience": ["job description"],
  "certifications": ["cert1"],
  "achievements": ["achievement1"]
}}

JSON:"""
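
The doubled braces in the template are what keep `str.format` from treating the JSON schema as placeholders. The following is a small sketch of that behavior using a shortened, hypothetical template (`TEMPLATE` is not the notebook's prompt, only the same escaping convention).

```python
import json

# Hypothetical miniature template using the same brace-escaping convention.
TEMPLATE = """Resume:
{resume_text}

Return this exact JSON format:
{{
  "skills": ["skill1", "skill2"]
}}

JSON:"""

rendered = TEMPLATE.format(resume_text="Skills: Python, SQL")

# After formatting, "{{" and "}}" have become single literal braces,
# so the schema portion of the prompt is itself valid JSON.
schema_text = rendered[rendered.index("{"): rendered.rindex("}") + 1]
schema = json.loads(schema_text)
print(schema)
```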

---

## 6. Training/Fine-tuning (LLM Extraction)


### Batch Processing Configuration

In [None]:
# Configuration
BATCH_SIZE = 5  # Resumes per progress-bar step; each resume is still generated individually
MAX_NEW_TOKENS = 300
TEMPERATURE = 0.1  # Not applied while do_sample=False (greedy decoding); see the warning in the run log

print(f"Batch processing configuration:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Max tokens: {MAX_NEW_TOKENS}")
print(f"  Temperature: {TEMPERATURE}")

Batch processing configuration:
  Batch size: 5
  Max tokens: 300
  Temperature: 0.1


### Run LLM Baseline Extraction

In [None]:
print("Running Gemma 2B extraction on 100 resumes...")

llm_extractions = []
errors = []

# Process in batches
for batch_start in tqdm(range(0, len(extraction_tasks), BATCH_SIZE),
                        desc="Processing batches"):
    batch_end = min(batch_start + BATCH_SIZE, len(extraction_tasks))
    batch = extraction_tasks[batch_start:batch_end]

    for task in batch:
        try:
            # Format prompt
            prompt = EXTRACTION_PROMPT.format(resume_text=task['resume_text'])
            messages = [{"role": "user", "content": prompt}]

            # Generate extraction
            output = gemma_generator(
                messages,
                max_new_tokens=MAX_NEW_TOKENS,
                temperature=TEMPERATURE,
                do_sample=False,
            )

            # Extract response
            response_text = output[0]['generated_text'][-1]['content'].strip()

            # Clean markdown formatting
            response_text = re.sub(r'```json\s*', '', response_text)
            response_text = re.sub(r'```\s*', '', response_text)

            # Parse JSON
            json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)

            if json_match:
                try:
                    extraction = json.loads(json_match.group(0))

                    # Ensure all required fields exist
                    for field in ['skills', 'education', 'experience',
                                'certifications', 'achievements']:
                        if field not in extraction:
                            extraction[field] = []

                except json.JSONDecodeError:
                    extraction = {
                        "skills": [], "education": [], "experience": [],
                        "certifications": [], "achievements": []
                    }
                    errors.append(task['resume_id'])
            else:
                extraction = {
                    "skills": [], "education": [], "experience": [],
                    "certifications": [], "achievements": []
                }
                errors.append(task['resume_id'])

            llm_extractions.append({
                'resume_id': task['resume_id'],
                'name': task['name'],
                'extraction': extraction,
                'ground_truth': task['ground_truth']
            })

        except Exception as e:
            print(f"\nError on resume {task['resume_id']}: {str(e)[:100]}")
            llm_extractions.append({
                'resume_id': task['resume_id'],
                'name': task['name'],
                'extraction': {
                    "skills": [], "education": [], "experience": [],
                    "certifications": [], "achievements": []
                },
                'ground_truth': task['ground_truth']
            })
            errors.append(task['resume_id'])

# Save results
with open('llm_extractions.json', 'w') as f:
    json.dump(llm_extractions, f, indent=2)

successful = len([x for x in llm_extractions if x['resume_id'] not in errors])

print(f"\nLLM Extraction Complete:")
print(f"  Total: {len(llm_extractions)}/100")
print(f"  Successful: {successful}/100")
print(f"  Failed: {len(errors)}/100")
print(f"\nResults saved: llm_extractions.json")

Running Gemma 2B extraction on 100 resumes...


Processing batches:   0%|          | 0/20 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Processing batches:   5%|▌         | 1/20 [00:12<03:55, 12.39s/it]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Processing batches: 100%|██████████| 20/20 [03:21<00:00, 10.08s/it]


LLM Extraction Complete:
  Total: 100/100
  Successful: 98/100
  Failed: 2/100

Results saved: llm_extractions.json
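
The extraction loop locates JSON with a non-greedy `\{.*?\}` pattern, which stops at the first closing brace and would cut a reply short if the model ever returned a nested object. The sketch below shows an alternative based on the standard library's `json.JSONDecoder.raw_decode`. The function names and the sample reply are hypothetical, and this is not the parsing used to produce the results above.

```python
import json

REQUIRED_FIELDS = ("skills", "education", "experience",
                   "certifications", "achievements")

def parse_first_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None

def normalize_extraction(obj):
    """Keep only the required fields, filling missing ones with []."""
    obj = obj or {}
    return {field: obj.get(field, []) for field in REQUIRED_FIELDS}

# Hypothetical model reply with markdown fences and a nested object.
reply = 'Sure! ```json\n{"skills": ["Python"], "meta": {"source": "resume"}}\n``` Done.'
print(normalize_extraction(parse_first_json_object(reply)))
```

Because the decoder tracks brace nesting itself, the markdown-fence cleanup becomes optional in this variant.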





---

## 7. Evaluation (Detection Methods Implementation)


### 7.1 Train/Test Split


In [None]:
# Flatten data: Each resume has 5 fields = 5 instances
instances = []
for extraction in llm_extractions:
    for field in ['skills', 'education', 'experience', 'certifications', 'achievements']:
        instances.append({
            'resume_id': extraction['resume_id'],
            'name': extraction['name'],
            'field': field,
            'extraction': extraction['extraction'].get(field, []),
            'ground_truth': extraction['ground_truth'].get(field, [])
        })

# Convert to DataFrame
df_instances = pd.DataFrame(instances)

# 80/20 split
train_df, test_df = train_test_split(
    df_instances,
    test_size=0.2,
    random_state=SEED,
    stratify=df_instances['field']
)

# Save splits
train_df.to_json('train_data.json', orient='records', indent=2)
test_df.to_json('test_data.json', orient='records', indent=2)

print(f"Train/Test Split:")
print(f"  Train: {len(train_df)} instances")
print(f"  Test: {len(test_df)} instances")
print(f"\nField distribution in test set:")
print(test_df['field'].value_counts())

Train/Test Split:
  Train: 400 instances
  Test: 100 instances

Field distribution in test set:
field
education         20
experience        20
achievements      20
certifications    20
skills            20
Name: count, dtype: int64
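
Because each resume contributes five field-level instances, a row-level split can place fields from the same resume on both sides of the split. If that is a concern, one option is to split on resume IDs first. The sketch below uses only the standard library; `split_by_resume` and its parameters are illustrative names, and this is not the split used for the results in this notebook.

```python
import random

def split_by_resume(resume_ids, test_fraction=0.2, seed=42):
    """Assign whole resumes to train or test so no resume spans both."""
    unique_ids = sorted(set(resume_ids))
    rng = random.Random(seed)          # local RNG, leaves global state untouched
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids

# 100 resumes with 5 field instances each, mirroring the flattened data.
ids = [rid for rid in range(1, 101) for _ in range(5)]
train_ids, test_ids = split_by_resume(ids)
print(len(train_ids), len(test_ids))
```

Rows would then be assigned by checking whether their `resume_id` is in `test_ids`.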


### 7.2 Auto-Labeling (Ground Truth Baseline)

In [None]:
def auto_label_hallucinations(extracted, ground_truth):
    """
    Compare LLM extraction to ground truth and label hallucinations.

    Labels:
        0 = Correct extraction
        1 = Intrinsic hallucination (contradicts ground truth)
        2 = Extrinsic hallucination (information not in ground truth)

    Returns:
        int: Hallucination label (0, 1, or 2)
    """
    # Convert to lowercase sets for comparison
    extracted_set = set([str(x).lower().strip() for x in extracted if x])
    truth_set = set([str(x).lower().strip() for x in ground_truth if x])

    if not extracted_set:
        return 0  # No extraction = no hallucination

    # Check for matches
    matches = extracted_set & truth_set

    if matches == extracted_set:
        return 0  # All extracted items match ground truth

    # Determine hallucination type
    hallucinated_items = extracted_set - truth_set

    if not hallucinated_items:
        return 0  # Subset of truth = correct

    # Check if any hallucinated item overlaps with truth (intrinsic)
    for item in hallucinated_items:
        for truth_item in truth_set:
            # Word-level overlap indicates contradiction
            item_words = set(item.split())
            truth_words = set(truth_item.split())
            if item_words & truth_words:
                return 1  # Intrinsic: contradicts truth

    return 2  # Extrinsic: completely fabricated

# Apply labeling
print("Auto-labeling hallucinations...")
train_df['label'] = train_df.apply(
    lambda row: auto_label_hallucinations(row['extraction'], row['ground_truth']),
    axis=1
)
test_df['label'] = test_df.apply(
    lambda row: auto_label_hallucinations(row['extraction'], row['ground_truth']),
    axis=1
)

# Save labeled data
train_df.to_json('train_labeled.json', orient='records', indent=2)
test_df.to_json('test_labeled.json', orient='records', indent=2)

# Compute baseline hallucination rate
hallucination_rate = (test_df['label'] > 0).mean()

print(f"\nBaseline Hallucination Rate: {hallucination_rate:.2%}")
print(f"\nLabel distribution in test set:")
print(test_df['label'].value_counts())
print(f"\n  0 = Correct: {(test_df['label'] == 0).sum()}")
print(f"  1 = Intrinsic: {(test_df['label'] == 1).sum()}")
print(f"  2 = Extrinsic: {(test_df['label'] == 2).sum()}")

Auto-labeling hallucinations...

Baseline Hallucination Rate: 38.00%

Label distribution in test set:
label
0    62
1    20
2    18
Name: count, dtype: int64

  0 = Correct: 62
  1 = Intrinsic: 20
  2 = Extrinsic: 18
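
To make the labeling rule concrete, here is a compact re-statement of the same logic with a few hand-made inputs. `label_field` is a hypothetical name and the inputs are invented for illustration; the rule itself follows the function above (exact set match, then word overlap to separate intrinsic from extrinsic).

```python
def label_field(extracted, ground_truth):
    """0 = correct, 1 = intrinsic (shares words with truth), 2 = extrinsic."""
    extracted_set = {str(x).lower().strip() for x in extracted if x}
    truth_set = {str(x).lower().strip() for x in ground_truth if x}

    # Items the model produced that do not match any truth item exactly.
    unmatched = extracted_set - truth_set
    if not unmatched:
        return 0

    # Any shared word between an unmatched item and the truth counts as intrinsic.
    truth_words = {word for item in truth_set for word in item.split()}
    if any(set(item.split()) & truth_words for item in unmatched):
        return 1
    return 2

print(label_field(["TensorFlow", "NLP"], ["TensorFlow", "NLP", "Pytorch"]))
print(label_field(["10 years"], ["10 years in AI Researcher"]))
print(label_field(["Kubernetes"], ["TensorFlow", "NLP"]))
print(label_field([], ["B.Sc"]))
```

The second call returns 1: under this rule a shortened extraction that shares words with the truth is labeled intrinsic even when it is factually consistent, which is worth keeping in mind when reading the label counts.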


### 7.3 Method 1: BERTScore Detection

In [None]:
print("Computing BERTScore for all test instances...")

import logging
import os
import sys
import warnings

# Silence library warnings and verbose model-loading logs
warnings.filterwarnings('ignore')
logging.getLogger('transformers').setLevel(logging.CRITICAL)
logging.getLogger('bert_score').setLevel(logging.CRITICAL)

# Avoid tokenizer parallelism warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

bertscore_results = []

for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="BERTScore", file=sys.stdout):
    extraction_text = ' '.join([str(x) for x in row['extraction']])
    ground_truth_text = ' '.join([str(x) for x in row['ground_truth']])

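    # If either side is empty the score defaults to 1.0 (treated as a match).
    # Unlike the token-overlap method, this also covers an empty ground truth.
    if not extraction_text or not ground_truth_text:
        bertscore_f1 = 1.0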
    else:
        P, R, F1 = bertscore_compute(
            [extraction_text],
            [ground_truth_text],
            lang='en',
            model_type='roberta-large',
            verbose=False
        )
        bertscore_f1 = F1.item()

    bertscore_results.append({
        'resume_id': row['resume_id'],
        'field': row['field'],
        'bertscore_f1': bertscore_f1
    })

# Save results
with open('bertscore_results.json', 'w') as f:
    json.dump(bertscore_results, f, indent=2)

test_df['bertscore_f1'] = [r['bertscore_f1'] for r in bertscore_results]

print(f"\nBERTScore computed for {len(bertscore_results)} instances")
print(f"Mean BERTScore F1: {test_df['bertscore_f1'].mean():.3f}")
print(f"Results saved: bertscore_results.json")

Computing BERTScore for all test instances...
BERTScore:   0%|          | 0/100 [00:00<?, ?it/s]

BERTScore: 100%|██████████| 100/100 [03:06<00:00,  1.86s/it]

BERTScore computed for 100 instances
Mean BERTScore F1: 0.940
Results saved: bertscore_results.json


### 7.4 Method 2: Token Overlap (Jaccard Similarity)

In [None]:
print("Computing Token Overlap (Jaccard Similarity)...")

def compute_token_overlap(extraction, ground_truth):
    """
    Compute Jaccard similarity between extraction and ground truth.

    Jaccard = |intersection| / |union|

    Returns:
        float: Similarity score between 0 and 1
    """
    extraction_text = ' '.join([str(x) for x in extraction])
    ground_truth_text = ' '.join([str(x) for x in ground_truth])

    if not extraction_text:
        return 1.0  # Empty extraction = perfect match

    if not ground_truth_text:
        return 0.0  # Extracting from empty ground truth = hallucination

    # Tokenize
    extraction_tokens = set(word_tokenize(extraction_text.lower()))
    ground_truth_tokens = set(word_tokenize(ground_truth_text.lower()))

    # Jaccard similarity
    intersection = extraction_tokens & ground_truth_tokens
    union = extraction_tokens | ground_truth_tokens

    if not union:
        return 1.0

    return len(intersection) / len(union)

# Compute for all test instances
test_df['token_overlap'] = test_df.apply(
    lambda row: compute_token_overlap(row['extraction'], row['ground_truth']),
    axis=1
)

# Save results
token_overlap_results = test_df[['resume_id', 'field', 'token_overlap']].to_dict('records')
with open('token_overlap_results.json', 'w') as f:
    json.dump(token_overlap_results, f, indent=2)

print(f"Token Overlap computed for {len(test_df)} instances")
print(f"Mean Token Overlap: {test_df['token_overlap'].mean():.3f}")
print(f"Results saved: token_overlap_results.json")

Computing Token Overlap (Jaccard Similarity)...
Token Overlap computed for 100 instances
Mean Token Overlap: 0.688
Results saved: token_overlap_results.json
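
A worked example of the Jaccard computation may help when reading the 0.70 cutoff used later. This sketch swaps NLTK's `word_tokenize` for a simple `\w+` regex so it runs without downloads; that substitution is an assumption and can tokenize punctuation differently from the notebook.

```python
import re

def jaccard(a_text, b_text):
    """Jaccard similarity over lowercase word tokens."""
    a_tokens = set(re.findall(r"\w+", a_text.lower()))
    b_tokens = set(re.findall(r"\w+", b_text.lower()))
    union = a_tokens | b_tokens
    return len(a_tokens & b_tokens) / len(union) if union else 1.0

# Two of three skills shared: |intersection| = 2, |union| = 4.
print(jaccard("TensorFlow NLP Pytorch", "TensorFlow NLP Keras"))

# A shortened experience string: |intersection| = 2, |union| = 5.
print(jaccard("10 years", "10 years in AI Researcher"))
```

The two calls give 0.5 and 0.4, so both would fall below a 0.70 threshold.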


### 7.5 Method 3: NLI-Based Detection


In [None]:
print("Loading NLI model for hallucination detection...")

# Load NLI model
nli_pipeline = pipeline(
    "text-classification",
    model="roberta-large-mnli",
    device=0 if torch.cuda.is_available() else -1
)

print("NLI model loaded successfully\n")

Loading NLI model for hallucination detection...


NLI model loaded successfully



In [None]:
print("Running NLI-based detection on all test instances...")
print("Estimated time: ~10-15 minutes\n")

nli_results = []

for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="NLI Detection"):
    extraction_text = ' '.join([str(x) for x in row['extraction']])
    ground_truth_text = ' '.join([str(x) for x in row['ground_truth']])

    if not extraction_text:
        nli_label = "ENTAILMENT"
        nli_score = 1.0
    elif not ground_truth_text:
        nli_label = "CONTRADICTION"
        nli_score = 0.0
    else:
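        # Check entailment: premise (ground truth) and hypothesis (extraction)
        # are joined into one string.
        # NOTE: "[SEP]" is a BERT-style marker; RoBERTa's tokenizer reads it as plain text.
        input_text = f"{ground_truth_text} [SEP] {extraction_text}"

        # Truncate by characters (not tokens) if too long
        if len(input_text) > 512:
            input_text = input_text[:512]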

        result = nli_pipeline(input_text)[0]
        nli_label = result['label']
        nli_score = result['score']

    nli_results.append({
        'resume_id': row['resume_id'],
        'field': row['field'],
        'nli_label': nli_label,
        'nli_score': nli_score
    })

# Save results
with open('nli_results.json', 'w') as f:
    json.dump(nli_results, f, indent=2)

# Add to test_df
test_df['nli_label'] = [r['nli_label'] for r in nli_results]
test_df['nli_score'] = [r['nli_score'] for r in nli_results]

print(f"\nNLI detection completed for {len(nli_results)} instances")
print(f"\nNLI label distribution:")
print(test_df['nli_label'].value_counts())
print(f"\nResults saved: nli_results.json")

Running NLI-based detection on all test instances...
Estimated time: ~10-15 minutes



NLI Detection: 100%|██████████| 100/100 [00:01<00:00, 60.05it/s]


NLI detection completed for 100 instances

NLI label distribution:
nli_label
ENTAILMENT       78
NEUTRAL          20
CONTRADICTION     2
Name: count, dtype: int64

Results saved: nli_results.json
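
NLI models expect a premise and a hypothesis as a sentence pair. The Hugging Face text-classification pipeline can take a dictionary with `text` and `text_pair` keys, which lets the tokenizer insert the model's own separator tokens. The sketch below wraps that calling convention in a hypothetical helper, `nli_vote`, and exercises it with a stand-in classifier so it runs without downloading a model. It is not the call used to produce the label distribution above, and results with the real model would likely differ.

```python
def nli_vote(classify, premise, hypothesis):
    """Return 0 (supported) or 1 (flagged) for one premise/hypothesis pair.

    `classify` is any callable that accepts {"text": ..., "text_pair": ...}
    and returns a dict (or a one-item list of dicts) with a "label" key.
    """
    if not hypothesis.strip():
        return 0  # nothing was extracted, so nothing to flag
    if not premise.strip():
        return 1  # something was extracted with no supporting truth
    result = classify({"text": premise, "text_pair": hypothesis})
    if isinstance(result, list):  # some pipeline versions wrap the result in a list
        result = result[0]
    return 0 if result["label"].upper() == "ENTAILMENT" else 1

def fake_classifier(pair):
    # Stand-in for a real NLI model: "entails" only on substring match.
    entailed = pair["text_pair"].lower() in pair["text"].lower()
    return [{"label": "ENTAILMENT" if entailed else "NEUTRAL", "score": 1.0}]

print(nli_vote(fake_classifier, "TensorFlow NLP Pytorch", "NLP"))
print(nli_vote(fake_classifier, "TensorFlow NLP Pytorch", "Kubernetes"))
```

In the notebook, `nli_pipeline` could be passed as `classify`, assuming the installed Transformers version accepts the dictionary form.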





### 7.6 Method 4: Ensemble Detection

In [None]:
print("Creating Ensemble Hallucination Detector...")
# The message lists ground-truth labels, but only the three detectors below cast votes.
print("Combining: Ground Truth Labels + BERTScore + Token Overlap + NLI\n")

def ensemble_detection(row, bertscore_threshold=0.80, overlap_threshold=0.70):
    """
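    Ensemble detection using majority voting (at least 2 of 3 votes flag a hallucination).

    The default BERTScore cutoff here (0.80) differs from the 0.95 cutoff
    applied to the standalone BERTScore predictions in Section 7.7.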

    Combines:
    - BERTScore (semantic similarity)
    - Token Overlap (lexical similarity)
    - NLI (logical entailment)

    Returns:
        int: 0 (correct) or 1 (hallucination)
    """
    votes = []

    # BERTScore vote
    if row['bertscore_f1'] >= bertscore_threshold:
        votes.append(0)  # Correct
    else:
        votes.append(1)  # Hallucination

    # Token Overlap vote
    if row['token_overlap'] >= overlap_threshold:
        votes.append(0)
    else:
        votes.append(1)

    # NLI vote
    if row['nli_label'] == 'ENTAILMENT':
        votes.append(0)
    else:
        votes.append(1)

    # Majority voting (2 out of 3)
    return 1 if sum(votes) >= 2 else 0

# Apply ensemble
test_df['ensemble_pred'] = test_df.apply(ensemble_detection, axis=1)

# Save results
ensemble_results = test_df[['resume_id', 'field', 'ensemble_pred']].to_dict('records')
with open('ensemble_results.json', 'w') as f:
    json.dump(ensemble_results, f, indent=2)

print(f"Ensemble predictions completed")
print(f"\nEnsemble prediction distribution:")
print(test_df['ensemble_pred'].value_counts())
print(f"  0 = Correct: {(test_df['ensemble_pred'] == 0).sum()}")
print(f"  1 = Hallucination: {(test_df['ensemble_pred'] == 1).sum()}")
print(f"\nResults saved: ensemble_results.json")

Creating Ensemble Hallucination Detector...
Combining: Ground Truth Labels + BERTScore + Token Overlap + NLI

Ensemble predictions completed

Ensemble prediction distribution:
ensemble_pred
0    78
1    22
Name: count, dtype: int64
  0 = Correct: 78
  1 = Hallucination: 22

Results saved: ensemble_results.json
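
The voting rule can be traced by hand on a single instance. The sketch below restates it as a standalone function that also returns the individual votes. `majority_vote` and the input values are illustrative, while the default thresholds are copied from `ensemble_detection`.

```python
def majority_vote(bertscore_f1, token_overlap, nli_label,
                  bertscore_threshold=0.80, overlap_threshold=0.70):
    """Return (votes, decision); each vote is 1 when a detector flags the instance."""
    votes = [
        int(bertscore_f1 < bertscore_threshold),   # semantic similarity too low
        int(token_overlap < overlap_threshold),    # lexical overlap too low
        int(nli_label != "ENTAILMENT"),            # not entailed by the truth
    ]
    return votes, int(sum(votes) >= 2)

# Same scores, different NLI outcome.
print(majority_vote(0.94, 0.40, "ENTAILMENT"))
print(majority_vote(0.94, 0.40, "NEUTRAL"))
```

In this example the BERTScore vote is 0 both times because 0.94 clears the 0.80 cutoff, so the decision flips with the NLI label: the first call yields `([0, 1, 0], 0)` and the second `([0, 1, 1], 1)`.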


### 7.7 Create Binary Labels for Evaluation

In [None]:
# Convert multi-class labels (0, 1, 2) to binary (0, 1)
# 0 = Correct, 1 = Hallucination (intrinsic or extrinsic)
test_df['binary_label'] = (test_df['label'] > 0).astype(int)

# Convert detection scores to binary predictions
# BERTScore
test_df['bertscore_pred'] = (test_df['bertscore_f1'] < 0.95).astype(int)

# Token Overlap
test_df['token_pred'] = (test_df['token_overlap'] < 0.70).astype(int)

# NLI
test_df['nli_pred'] = (test_df['nli_label'] != 'ENTAILMENT').astype(int)

print("Binary labels created for all detection methods")
print(f"\nGround truth distribution:")
print(test_df['binary_label'].value_counts())

Binary labels created for all detection methods

Ground truth distribution:
binary_label
0    62
1    38
Name: count, dtype: int64
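
The cutoffs above (0.95 for BERTScore, 0.70 for token overlap) are fixed constants. One common alternative is to pick each cutoff on the training split and only then apply it to the test split. The sketch below shows that idea with invented scores. The function names and data are hypothetical, and it assumes detector scores would also be computed for the training instances, which this notebook does not do.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the rule 'score below threshold means hallucination (1)'."""
    preds = [int(score < threshold) for score in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels, candidates):
    """Candidate cutoff with the highest F1 on the given (training) data."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Invented training scores: low similarity goes with hallucination labels.
train_scores = [0.99, 0.97, 0.96, 0.93, 0.90, 0.85]
train_labels = [0, 0, 0, 1, 1, 1]
candidates = [0.80, 0.90, 0.95, 0.98]

print(best_threshold(train_scores, train_labels, candidates))
```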


---

## 8. Comparison & Analysis


### 8.1 Performance Metrics

In [None]:
print("Computing detection performance metrics for all methods...\n")

methods = {
    'Ground Truth': 'binary_label',  # Perfect detector (baseline)
    'BERTScore': 'bertscore_pred',
    'Token Overlap': 'token_pred',
    'NLI': 'nli_pred',
    'Ensemble': 'ensemble_pred'
}

results = []

for method_name, pred_col in methods.items():
    # 'Ground Truth' maps to binary_label itself, so it scores as a perfect detector
    y_true = test_df['binary_label']
    y_pred = test_df[pred_col]

    # Compute metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='binary', zero_division=0
    )

    # Confusion matrix; fixing the label order keeps it 2x2 even if a class is absent
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()

    # Additional metrics
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

    # Cohen's Kappa (agreement with ground truth)
    kappa = cohen_kappa_score(y_true, y_pred)

    results.append({
        'Method': method_name,
        'Accuracy': f'{accuracy:.3f}',
        'Precision': f'{precision:.3f}',
        'Recall': f'{recall:.3f}',
        'F1': f'{f1:.3f}',
        'FPR': f'{fpr:.3f}',
        'TP': int(tp),
        'TN': int(tn),
        'FP': int(fp),
        'FN': int(fn),
        'Kappa': f'{kappa:.3f}'
    })

# Create results DataFrame
results_df = pd.DataFrame(results)

# Save results
results_df.to_csv('detection_performance.csv', index=False)
with open('detection_performance.json', 'w') as f:
    json.dump(results, f, indent=2)

print("Detection Performance Results:")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)
print(f"\nResults saved: detection_performance.csv, detection_performance.json")

Computing detection performance metrics for all methods...

Detection Performance Results:
       Method Accuracy Precision Recall    F1   FPR  TP  TN  FP  FN Kappa
 Ground Truth    1.000     1.000  1.000 1.000 0.000  38  62   0   0 1.000
    BERTScore    0.970     0.927  1.000 0.962 0.048  38  59   3   0 0.937
Token Overlap    1.000     1.000  1.000 1.000 0.000  38  62   0   0 1.000
          NLI    0.840     1.000  0.579 0.733 0.000  22  62   0  16 0.630
     Ensemble    0.840     1.000  0.579 0.733 0.000  22  62   0  16 0.630

Results saved: detection_performance.csv, detection_performance.json
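
Each row of the table above can be reproduced by hand from its four confusion counts. The sketch below recomputes the NLI row (TP=22, TN=62, FP=0, FN=16) in plain Python. `metrics_from_counts` is a hypothetical helper written for this illustration, not part of the notebook.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Derive the table's metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the label and prediction marginals
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else float('nan')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'fpr': fpr,
        'kappa': kappa,
    }


# NLI row from the table: TP=22, TN=62, FP=0, FN=16
nli = metrics_from_counts(tp=22, tn=62, fp=0, fn=16)
for name, value in nli.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values match the NLI row: recall is 22/38 = 0.579, and kappa is (0.84 - 0.5672) / (1 - 0.5672) = 0.630.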


### 8.2 Statistical Significance Testing (McNemar's Test)

In [None]:
from itertools import combinations

from statsmodels.stats.contingency_tables import mcnemar

print("\nTesting statistical significance between method pairs...")
print("Using McNemar's test (p < 0.05 indicates significant difference)\n")

mcnemar_results = []

# Compare all pairs of detection methods (excluding Ground Truth)
detection_methods = ['bertscore_pred', 'token_pred', 'nli_pred', 'ensemble_pred']
method_names = ['BERTScore', 'Token Overlap', 'NLI', 'Ensemble']

# Ground truth is shared by every comparison
y_true = test_df['binary_label']

for (pred1, name1), (pred2, name2) in combinations(zip(detection_methods, method_names), 2):
    # Per-instance correctness of each method
    correct1 = test_df[pred1] == y_true
    correct2 = test_df[pred2] == y_true

    # Build 2x2 contingency table of correctness;
    # only the off-diagonal (discordant) cells drive the test
    both_correct = (correct1 & correct2).sum()
    only_method1_correct = (correct1 & ~correct2).sum()
    only_method2_correct = (~correct1 & correct2).sum()
    both_wrong = (~correct1 & ~correct2).sum()

    table = [[both_correct, only_method1_correct],
             [only_method2_correct, both_wrong]]

    # Run the exact (binomial) McNemar's test, suited to small discordant counts
    result = mcnemar(table, exact=True)

    mcnemar_results.append({
        'Method 1': name1,
        'Method 2': name2,
        'p-value': f'{result.pvalue:.4f}',
        'Significant': 'Yes' if result.pvalue < 0.05 else 'No'
    })

# Create DataFrame
mcnemar_df = pd.DataFrame(mcnemar_results)

# Save results
mcnemar_df.to_csv('mcnemar_tests.csv', index=False)

print("McNemar's Test Results:")
print("="*80)
print(mcnemar_df.to_string(index=False))
print("="*80)
print(f"\nResults saved: mcnemar_tests.csv")


Testing statistical significance between method pairs...
Using McNemar's test (p < 0.05 indicates significant difference)

McNemar's Test Results:
     Method 1      Method 2 p-value Significant
    BERTScore Token Overlap  0.2500          No
    BERTScore           NLI  0.0044         Yes
    BERTScore      Ensemble  0.0044         Yes
Token Overlap           NLI  0.0000         Yes
Token Overlap      Ensemble  0.0000         Yes
          NLI      Ensemble  1.0000          No

Results saved: mcnemar_tests.csv
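
The p-values above can be checked from the confusion counts in Section 8.1. BERTScore's only errors are 3 false positives and NLI's only errors are 16 false negatives, so no instance is misclassified by both and the discordant counts are 16 and 3. Token Overlap makes no errors, so its discordant counts against NLI are 16 and 0. The sketch below recomputes the exact two-sided test in plain Python. `mcnemar_exact_p` is a hypothetical helper, and the Bonferroni adjustment for the six pairwise tests is an addition for illustration, not something the notebook applies.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts inferred from the confusion counts in Section 8.1
pairs = {
    "BERTScore vs Token Overlap": (0, 3),
    "BERTScore vs NLI": (16, 3),
    "Token Overlap vs NLI": (16, 0),
}

bonferroni_alpha = 0.05 / 6  # six pairwise tests in the table above

for name, (b, c) in pairs.items():
    p = mcnemar_exact_p(b, c)
    print(f"{name}: p = {p:.6f}, significant after Bonferroni: {p < bonferroni_alpha}")
```

The `0.0000` entries in the table are a rounding artifact of the `.4f` format: with 16 and 0 discordant instances the exact p-value is 2 × 0.5^16, roughly 3.1e-5. The BERTScore vs NLI result (p ≈ 0.0044) stays below the Bonferroni-adjusted level of about 0.0083.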


---

## Appendix A: Hyperparameters

**LLM Extraction (Gemma 2B):**
- Model: google/gemma-2b-it
- Max new tokens: 300
- Temperature: 0.1
- Batch size: 5

**Detection Thresholds:**
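- BERTScore F1: 0.80 (the binary predictions in Section 7.7 use a cutoff of 0.95; scores below the cutoff are flagged)
- Token Overlap: 0.70 (scores below the cutoff are flagged)
- NLI: ENTAILMENT vs. NON-ENTAILMENT (any non-ENTAILMENT label is flagged)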
- Ensemble: 2/3 majority voting
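
As a rough illustration of how these thresholds could combine under 2/3 majority voting, the sketch below flags an extraction when at least two of the three detectors flag it. This is one possible reading of the rule, not necessarily the notebook's ensemble cell. `ensemble_vote` and its parameter names are hypothetical, the default cutoffs are the values listed above, and the score triples are made-up examples.

```python
def ensemble_vote(bertscore_f1, token_overlap, nli_label,
                  bertscore_threshold=0.80, token_threshold=0.70):
    """Return 1 (hallucination) when at least 2 of the 3 detectors flag the extraction."""
    votes = [
        bertscore_f1 < bertscore_threshold,  # low semantic similarity
        token_overlap < token_threshold,     # low lexical overlap
        nli_label != "ENTAILMENT",           # not entailed
    ]
    return int(sum(votes) >= 2)


# Hypothetical score triples, for illustration only
print(ensemble_vote(0.92, 0.85, "ENTAILMENT"))     # no detector flags
print(ensemble_vote(0.75, 0.60, "ENTAILMENT"))     # two detectors flag
print(ensemble_vote(0.92, 0.85, "CONTRADICTION"))  # only NLI flags
```

On a DataFrame with the `bertscore_f1`, `token_overlap`, and `nli_label` columns used in Section 7.7, the same function could be applied row-wise with `test_df.apply(lambda r: ensemble_vote(r['bertscore_f1'], r['token_overlap'], r['nli_label']), axis=1)`.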

**Train/Test Split:**
- Test size: 20% (80 instances)
- Stratification: By field type
- Random seed: 42

---

## Appendix B: File Outputs

**Data Files:**
- `extraction_tasks.json` - Original 100 resume extraction tasks
- `llm_extractions.json` - Gemma 2B extraction results
- `train_labeled.json` - Training set with ground truth labels
- `test_labeled.json` - Test set with ground truth labels

**Detection Results:**
- `bertscore_results.json` - BERTScore similarity scores
- `token_overlap_results.json` - Jaccard similarity scores
- `nli_results.json` - NLI entailment predictions
- `ensemble_results.json` - Ensemble method predictions

**Evaluation:**
- `detection_performance.csv`, `detection_performance.json` - Metrics for all methods
- `mcnemar_tests.csv` - Statistical significance tests

**Visualizations:**
- `dataset_exploration.png` - Input data analysis
- `detection_comparison.png` - Method performance comparison
- `confusion_matrices.png` - All method confusion matrices
- `score_distributions.png` - BERTScore and Token Overlap distributions
- `hallucination_distribution.png` - Field-wise hallucination rates
- `final_summary_dashboard.png` - Comprehensive results dashboard