# Hallucination Detection

<img src="https://live.staticflickr.com/65535/54208132682_73767c3560_b.jpg" alt="Embedded Photo" width="500">

*Image generated using DALL-E model.*

## Introduction

Language models help us with daily tasks such as correcting texts, writing code, or answering questions. 
They are also increasingly used in fields like medicine and education.

However, how can we know if the answers generated by them are correct? Language models do not always have full knowledge of a given topic, yet they can formulate answers that sound plausible but are actually misleading. Such incorrect answers are called hallucinations.

## Task

In this task, you will face the challenge of detecting hallucinations in factual answers generated by large language models (LLMs).
You will analyze a dataset to help assess whether the answers generated by the language model are indeed correct or contain hallucinations.

Each example in the dataset contains:

- **Question** e.g. "What is the main responsibility of the US Department of Defense?"
- **Language model answer** e.g. "The main responsibility is national defense."
- **Tokens** associated with answer generation.
- **Four alternative answers** generated by the same model with higher temperature.
- **Alternative answer tokens** generated by the same model with higher temperature.
- **Alternative answer probabilities** generated by the same model with higher temperature.
- **Label (`is_correct`)** indicating whether the main answer is correct according to a trusted source.


Example:
```json
[
    {
        "question_id": 34,
        "question": "What is the name of the low-cost carrier that operates as a wholly owned subsidiary of Singapore Airlines?",
        "answer": "Scoot is the low-cost carrier that operates as a wholly owned subsidiary of Singapore Airlines.",
        "tokens": [" Sco", "ot", " is", ..., " Airlines", ".", "\n"],
        "supporting_answers": [
            "As a wholly owned subsidiary of Singapore Airlines, <answer> Scoot </answer> stands as a low-cost carrier that revolutionized air travel in the region.",
            "Scoot, a subsidiary of <answer> Singapore Airlines </answer> , is the low-cost carrier that operates under the same brand.",
            "<answer> Scoot </answer> is the low-cost carrier that operates as a wholly owned subsidiary of Singapore Airlines.",
            "Singapore Airlines operates a low-cost subsidiary named <answer> Scoot </answer> , offering affordable and efficient air travel options to passengers."
        ],
        "supporting_tokens": [
            [" As", " a", ..., ".", "<answer>"],
            [" Sco", "ot", ..., " brand", ".", "\n"],
            ["<answer>", " Sco", ..., ".", "\n"],
            [" Singapore", " Airlines", ..., ".", "\n"]
        ],
        "supporting_probabilities": [
            [0.0029233775567263365, 0.8621460795402527, ..., 0.018515007570385933],
            [0.42073577642440796, 0.9999748468399048, ..., 0.9166142344474792],
            [0.3258324861526489, 0.9969879984855652, ..., 0.921079695224762],
            [0.11142394691705704, 0.960810661315918, ..., 0.9557166695594788]
        ],
        "is_correct": true
    },
    .
    .
    .
]
```
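
The `<answer>` tags in `supporting_answers` mark the span each alternative generation commits to. As a minimal sketch (the helper name `tagged_answer` and the regular expression are illustrative assumptions, not part of the task's starter code), the tagged spans from the example above can be pulled out and counted like this:

```python
import re
from collections import Counter

# Hypothetical helper: captures the text between <answer> and </answer>.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def tagged_answer(text):
    """Return the normalized tagged span, or None if the tags are missing."""
    match = ANSWER_TAG.search(text)
    return match.group(1).strip().lower() if match else None

# Supporting answers copied from the example record above.
supporting_answers = [
    "As a wholly owned subsidiary of Singapore Airlines, <answer> Scoot </answer> stands as a low-cost carrier that revolutionized air travel in the region.",
    "Scoot, a subsidiary of <answer> Singapore Airlines </answer> , is the low-cost carrier that operates under the same brand.",
    "<answer> Scoot </answer> is the low-cost carrier that operates as a wholly owned subsidiary of Singapore Airlines.",
    "Singapore Airlines operates a low-cost subsidiary named <answer> Scoot </answer> , offering affordable and efficient air travel options to passengers.",
]

tagged = [tagged_answer(sa) for sa in supporting_answers]
counts = Counter(t for t in tagged if t is not None)
print(counts.most_common())  # [('scoot', 3), ('singapore airlines', 1)]
```

Here three of the four alternative generations agree on the same span, which is the kind of signal a detector can build on.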

### Data
The data available to you in this task are:

* `train.json` - dataset containing 2967 questions and answers.
* `valid.json` - 990 additional questions.


### Evaluation Criteria

ROC AUC (*Receiver Operating Characteristic Area Under Curve*) is a measure of binary classifier quality. It shows the model's ability to distinguish between two classes - here hallucination (false) and correct answer (true).

- **ROC (Receiver Operating Characteristic)**: A plot showing the relationship between *True Positive Rate* (sensitivity) and *False Positive Rate* (1-specificity) at different decision thresholds.
- **AUC (Area Under Curve)**: The area under the ROC curve, which takes values from 0 to 1:
  - **1.0**: Perfect model.
  - **0.5**: Random model (no ability to distinguish classes).

The higher the AUC value, the better the model performs at classification.
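
One way to build intuition for this metric: ROC AUC equals the probability that a randomly chosen positive example (here, a correct answer) receives a higher score than a randomly chosen negative one (a hallucination), with ties counted as one half. The sketch below computes it directly from that definition. The function name `pairwise_auc` and the toy data are assumptions made for illustration. For binary labels the result should agree with `sklearn.metrics.roc_auc_score`, which is what the evaluation code in this notebook uses.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data: 1 = correct answer, 0 = hallucination.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(round(pairwise_auc(labels, scores), 4))  # 0.8333 (5 of 6 pairs ordered correctly)
```

Because only the ordering of the scores matters, the predictions do not need to be calibrated probabilities to obtain a high ROC AUC.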

For this task, you can score between 0 and 100 points. The score will be scaled linearly depending on the ROC AUC value:

- **ROC AUC ≤ 0.7**: 0 points.
- **ROC AUC ≥ 0.82**: 100 points.
- **Values between 0.7 and 0.82**: scaled linearly.

Score Formula:  
$$
\text{Points} = 
\begin{cases} 
0 & \text{for } \text{ROC AUC} \leq 0.7 \\
100 \times \frac{\text{ROC AUC} - 0.7}{0.82 - 0.7} & \text{for } 0.7 < \text{ROC AUC} < 0.82 \\
100 & \text{for } \text{ROC AUC} \geq 0.82
\end{cases}
$$
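
As a worked example with a hypothetical ROC AUC of 0.76 (a value chosen only for illustration), the middle branch of the formula gives:

$$
\text{Points} = 100 \times \frac{0.76 - 0.7}{0.82 - 0.7} = 100 \times \frac{0.06}{0.12} = 50
$$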


## Constraints
* Your solution will be tested on the Competition Platform without internet access and in a GPU-free environment.
* The evaluation of your final solution on the Competition Platform cannot take longer than 5 minutes without a GPU.
* List of allowed libraries: `xgboost`, `scikit-learn`, `numpy`, `pandas`, `matplotlib`.


## Submission Files
Submit this notebook, supplemented with your solution (see the `predict_hallucinations` function).

## Evaluation
Remember that during checking, the `FINAL_EVALUATION_MODE` flag will be set to `True`.

For this task, you can score between 0 and 100 points. The score will be calculated on a (secret) test set on the Competition Platform using the formula given above, rounded to an integer. If your solution does not meet the above criteria or does not execute correctly, you will receive 0 points for the task.


# Starter Code
In this section, we initialize the environment by importing the necessary libraries and functions. The prepared code helps you work with the data and build your solution.

In [None]:
FINAL_EVALUATION_MODE = False
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import shutil

def download_data(train=("1TGEDaxw4GKfSq0fpqSk0wRpUSc8GgZN0", "train.json"),
                  valid=("1qrr7bZk6Uct8DeC-V8Bc1qD5su56ryFd", "valid.json")):
    """Downloads dataset from Google Drive and saves it in 'data' folder."""
    import gdown
    
    # Create or reset 'data' folder
    if not os.path.exists('data'):
        os.makedirs('data')
    else:
        shutil.rmtree('data')
        os.makedirs('data')

    GDRIVE_DATA = [train, valid]
    
    for file_id, file_name in GDRIVE_DATA:        
        # Download file from Google Drive and save it in 'data' folder
        url = f'https://drive.google.com/uc?id={file_id}'
        output = f'data/{file_name}'
        gdown.download(url, output, quiet=False)
        
        print(f"Downloaded: {file_name}")

# Download data only if not in FINAL_EVALUATION_MODE
if not FINAL_EVALUATION_MODE:
    download_data()


## Loading Data
The code below loads the training and validation sets from the `data` folder.

In [None]:
def load_data(folder='data'):
    # Load data from files
    train_path = os.path.join(folder, 'train.json')
    valid_path = os.path.join(folder, 'valid.json')
    
    with open(train_path, 'r') as f:
        train = json.load(f)
    with open(valid_path, 'r') as f:
        valid = json.load(f)

    return train, valid

train, valid = load_data("data")

# print(json.dumps(train[0], indent=2))

print(f"\nAll training examples: {len(train)}")
print(f"All validation examples: {len(valid)}")

## Code with Evaluation Criteria

Code similar to the one below will be used to evaluate the solution on the test set.

In [None]:
def compute_score(roc_auc: float) -> int:
    """
    Compute score based on ROC AUC.

    :param roc_auc: Float value in range [0.0, 1.0]
    :return: Integer score in range [0, 100], following the score formula above
    """
    if roc_auc <= 0.7:
        return 0
    elif 0.7 < roc_auc < 0.82:
        return int(round(100 * (roc_auc - 0.7) / (0.82 - 0.7)))
    else:
        return 100


def evaluate_algorithm(dataset, algorithm, verbose=False):
    """
    Evaluate algorithm for hallucination detection on the given dataset.

    Parameters
    ----------
    dataset : list
        Annotated dataset, where each element is a dictionary containing the 'is_correct' key.
    algorithm : callable
        Function that takes a single sample (dictionary) and returns the probability that the answer
        is correct (the positive class, `is_correct == True`).
    verbose : bool
        If True, prints additional information for each sample and summary.

    Returns
    -------
    points : int
        Score (0-100) computed from the ROC AUC of the predictions.
    """
    predicted_ys = [] # List storing predicted probabilities of a correct answer

    for i, entry in enumerate(dataset):
        # Create a copy of the sample and remove the label to get unlabeled input
        sample_unlabeled = dict(entry)
        sample_unlabeled.pop('is_correct', None)

        try:
            # Prediction of probability for a single sample
            pred_prob = algorithm(sample_unlabeled)
            predicted_ys.append(pred_prob)

        except Exception as e:
            # If an error occurs, default probability is set to 0.5
            predicted_ys.append(0.5)
            if verbose:
                print(f"Sample {i} => Error: {e}")

    predicted_ys = np.array(predicted_ys, dtype=np.float32)
    ys = []
    for entry in dataset:
        ys.append(1 if entry.get('is_correct') else 0)
    ys = np.array(ys, dtype=np.int32)
    
    # Calculate ROC AUC metric
    roc_auc = roc_auc_score(ys, predicted_ys)

    # Calculate final score based on ROC AUC
    points = compute_score(roc_auc)

    if verbose:
        print(f"\nNumber of samples: {len(dataset)}")
        print(f"ROC AUC: {roc_auc:.4f}")
        print(f"Score: {points}")

    return points
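
To try the evaluation loop before building a full model, a feature-free baseline can help. The sketch below is an assumption made for illustration, not part of the provided starter code: it scores a sample by the mean probability pooled over all `supporting_probabilities`, so that a higher value means "more likely correct", matching the positive class used by `evaluate_algorithm`. No claim is made about the ROC AUC it would reach.

```python
def mean_probability_baseline(sample):
    """Hypothetical baseline: mean probability over all supporting answers."""
    probs = [p for row in sample.get("supporting_probabilities", []) for p in row]
    # Fall back to an uninformative 0.5 when no probabilities are available.
    return sum(probs) / len(probs) if probs else 0.5

# Toy sample containing only the field the baseline reads.
toy_sample = {"supporting_probabilities": [[0.9, 0.8], [0.7, 0.6]]}
print(mean_probability_baseline(toy_sample))
```

In the notebook, such a function could be passed directly as the `algorithm` argument, for example `evaluate_algorithm(valid, mean_probability_baseline, verbose=True)`.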

# Your Solution
In this section, you should place your solution. Make changes only here!

## Technical Breakdown of the Solution

The implemented approach relies on **advanced feature engineering** and a **Gradient Boosting (XGBoost)** classifier. Below is a detailed breakdown of the components:

### 1. Feature Engineering (`extract_features`)
To distinguish between correct answers and hallucinations, we extract several types of features from the provided samples:

- **Probability Statistics**: We pool all probabilities from `supporting_probabilities` and compute their mean, standard deviation, minimum, maximum, range, and skewness. The working assumption is that hallucinations tend to show lower mean confidence and higher variance across alternative generations.
- **Semantic Consistency**: We check how often the key phrase (the last three words of the main answer) or the key word (its last word) appears in the `supporting_answers`. A high consistency ratio suggests a stable, and likely correct, fact.
- **Agreement Analysis**: We extract the text inside the `<answer>` tags of each supporting answer and calculate an `agreement_ratio`, the share of tagged answers equal to the most common one. If the model keeps producing the same answer across runs, even at higher temperature, the answer is more likely to be true.
- **Lexical Overlap**: We measure the fraction of question words that also appear in the answer, to capture whether the model is mostly repeating the prompt or adding new (potentially hallucinated) information.
- **Entropy & Specificity**: We calculate the Shannon entropy of the answer's word distribution and a specificity metric (unique words divided by total words). Both describe how repetitive the answer is, which may carry some signal about hallucination.
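
Two of the features above can be sketched in isolation. The function names `agreement_ratio` and `word_entropy` and the toy inputs are assumptions chosen for illustration. The logic mirrors the description in the list, but it is a simplified standalone version rather than the exact code of `extract_features`.

```python
import math
from collections import Counter

def agreement_ratio(tagged_answers):
    """Share of tagged answers that equal the most common tagged answer."""
    if not tagged_answers:
        return 0.0
    most_common_count = Counter(tagged_answers).most_common(1)[0][1]
    return most_common_count / len(tagged_answers)

def word_entropy(text):
    """Shannon entropy (in bits) of the word distribution of a text."""
    words = text.lower().split()
    if not words:
        return 0.0
    total = len(words)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(words).values())

# Three of four alternative generations agree on the same tagged span.
print(agreement_ratio(["scoot", "singapore airlines", "scoot", "scoot"]))  # 0.75
# Two words, each used half of the time, carry exactly one bit.
print(word_entropy("a b a b"))  # 1.0
```

An agreement ratio near 1 means the model converges on one answer across samples, while the entropy is 0 for an answer that repeats a single word and grows as the vocabulary becomes more varied.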

### 2. Model Configuration
We use an `XGBClassifier` with the following optimizations:
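- **AUC Monitoring**: `eval_metric='auc'` makes ROC AUC, the metric from the evaluation criteria, the quantity tracked on the evaluation set. The training objective itself remains XGBoost's default binary logistic loss.
- **Regularization**: `reg_alpha` (L1) and `reg_lambda` (L2), together with row and column subsampling, limit overfitting on the limited training data.
- **Early Stopping**: `early_stopping_rounds=20` stops training once the AUC on the evaluation set has not improved for 20 consecutive rounds. In the code below the evaluation set (`valid`) is also part of the training data, so this signal is optimistic.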
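
The patience rule behind early stopping can be illustrated without any library. The sketch below is a simplified illustration of the general idea and not XGBoost's internal implementation. The function name `early_stopping_round` and the AUC values are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a metric where higher is better.

    Training stops at the first round that is `patience` rounds past the
    best round without having produced a strictly better score.
    """
    best_round = 0
    best_score = float("-inf")
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            return best_round, i
    # The patience budget was never exhausted: all rounds were run.
    return best_round, len(scores) - 1

# Hypothetical validation AUC after each boosting round.
aucs = [0.70, 0.74, 0.77, 0.76, 0.77, 0.75, 0.78]
print(early_stopping_round(aucs, patience=3))  # (2, 5)
```

With a patience of 3 the loop stops at round 5 and never sees the better score at round 6. This is the trade-off a larger patience value, such as the 20 rounds used here, is meant to soften.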

In [None]:
def extract_features(sample):
    question = sample['question']
    answer = sample['answer']
    tokens = sample['tokens']
    supporting_answers = sample['supporting_answers']
    supporting_probabilities = sample['supporting_probabilities']
    supporting_tokens = sample['supporting_tokens']
    
    token_count = len(tokens)
    answer_words = len(answer.split())
    answer_words_list = answer.lower().split()
    question_words = set(question.lower().split())
    
    all_probs = [p for sublist in supporting_probabilities for p in sublist]
    prob_mean = np.mean(all_probs) if all_probs else 0
    prob_std = np.std(all_probs) if all_probs else 0
    prob_min = np.min(all_probs) if all_probs else 0
    prob_max = np.max(all_probs) if all_probs else 0
    prob_range = prob_max - prob_min if all_probs else 0
    
    prob_skew = pd.Series(all_probs).skew() if all_probs else 0
    
    key_phrase = ' '.join(answer_words_list[-3:]) if len(answer_words_list) >= 3 else answer.lower()
    key_word = answer_words_list[-1]
    consistency_count_phrase = sum(1 for sa in supporting_answers if key_phrase in sa.lower().replace('<answer>', '').replace('</answer>', ''))
    consistency_count_word = sum(1 for sa in supporting_answers if key_word in sa.lower().replace('<answer>', '').replace('</answer>', ''))
    consistency_ratio_phrase = consistency_count_phrase / len(supporting_answers)
    consistency_ratio_word = consistency_count_word / len(supporting_answers)
    
    answer_words_set = set(answer_words_list)
    overlap = len(question_words & answer_words_set) / max(len(question_words), 1)
    
    prob_vars = [np.var(sublist) for sublist in supporting_probabilities if sublist]
    prob_var_mean = np.mean(prob_vars) if prob_vars else 0
    
    question_len = len(question.split())
    rel_length = answer_words / max(question_len, 1)
    
    tagged_answers = [sa[sa.find('<answer>')+8:sa.find('</answer>')].strip().lower() 
                     for sa in supporting_answers if '<answer>' in sa and '</answer>' in sa]
    most_common_answer = max(set(tagged_answers), key=tagged_answers.count, default='')
    agreement_ratio = tagged_answers.count(most_common_answer) / len(tagged_answers) if tagged_answers else 0
    answer_matches_most_common = 1 if most_common_answer and most_common_answer in answer.lower() else 0
    
    prob_conf_dev = abs(prob_mean - 0.5)
    high_conf_ratio = sum(1 for p in all_probs if p > 0.95) / len(all_probs) if all_probs else 0  # Tighter threshold
    
    specificity = len(answer_words_set) / max(answer_words, 1)
    
    supp_tokens_flat = ' '.join([t.lower() for sublist in supporting_tokens for t in sublist]).strip()
    token_mismatch = sum(1 for t in tokens if t.strip().lower() not in supp_tokens_flat) / max(token_count, 1)
    
    word_freq = pd.Series(answer_words_list).value_counts(normalize=True)
    answer_entropy = -sum(p * np.log2(p) for p in word_freq if p > 0)
    
    supp_prob_means = [np.mean(sublist) for sublist in supporting_probabilities if sublist]
    prob_discrepancy = np.std(supp_prob_means) if supp_prob_means else 0
    
    key_question_words = [w for w in question.lower().split() if len(w) > 3 and w not in {'what', 'where', 'when', 'who', 'how'}]
    key_question_presence = sum(1 for w in key_question_words if w in answer.lower()) / max(len(key_question_words), 1)
    
    supp_lengths = [len(sa.replace('<answer>', '').replace('</answer>', '').split()) for sa in supporting_answers]
    supp_len_std = np.std(supp_lengths) if supp_lengths else 0
    supp_len_mean = np.mean(supp_lengths) if supp_lengths else 0
    answer_len_deviation = abs(answer_words - supp_len_mean) / max(supp_len_mean, 1)
    
    return [
        token_count, answer_words, prob_mean, prob_std, prob_min, prob_max, prob_range, prob_skew,
        consistency_ratio_phrase, consistency_ratio_word, overlap, prob_var_mean, rel_length,
        agreement_ratio, answer_matches_most_common, prob_conf_dev, high_conf_ratio, specificity,
        token_mismatch, answer_entropy, prob_discrepancy, key_question_presence, supp_len_std,
        answer_len_deviation
    ]

def prepare_data(dataset):
    X = []
    y = []
    for sample in dataset:
        features = extract_features(sample)
        X.append(features)
        y.append(1 if sample['is_correct'] else 0)
    return np.array(X), np.array(y)

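# Note: the final model is fit on train + valid together, so `valid` is not a
# held-out set here and scores computed on it afterwards are optimistic.
combined_data = train + valid
train_X, train_y = prepare_data(combined_data)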

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.015,
    subsample=0.7,
    colsample_bytree=0.7,
    reg_alpha=0.5,
    reg_lambda=1.5,
    random_state=42,
    eval_metric='auc',
    early_stopping_rounds=20
)

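# `valid` serves as the evaluation set for early stopping (monitored metric: AUC),
# even though its samples are also contained in the data the model is fit on.
valid_X, valid_y = prepare_data(valid)
model.fit(train_X, train_y, eval_set=[(valid_X, valid_y)], verbose=False)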

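def predict_hallucinations(sample):
    """Return the predicted probability that the answer is correct (class 1)."""
    features = extract_features(sample)
    features_array = np.array([features])
    # Column 1 of predict_proba corresponds to the positive class (is_correct == True)
    prob = model.predict_proba(features_array)[0][1]
    return prob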

# Evaluation

Running the cell below lets you check how many points your solution would score on the validation data. Before submitting, ensure that the entire notebook executes from start to finish without errors and without user intervention after selecting "Run All".

In [None]:
if not FINAL_EVALUATION_MODE:
    points = evaluate_algorithm(valid, predict_hallucinations, verbose=True)


Number of samples: 990
ROC AUC: 0.8977
Score: 100


During checking, the model will be saved as `your_model.pkl` and evaluated on the test set.

In [None]:
if FINAL_EVALUATION_MODE:      
    import cloudpickle
      
    OUTPUT_PATH = "file_output"
    FUNCTION_FILENAME = "your_model.pkl"
    FUNCTION_OUTPUT_PATH = os.path.join(OUTPUT_PATH, FUNCTION_FILENAME)

    if not os.path.exists(OUTPUT_PATH):
        os.makedirs(OUTPUT_PATH)

    with open(FUNCTION_OUTPUT_PATH, "wb") as f:
        cloudpickle.dump(predict_hallucinations, f)