This repository contains the code and data for the paper Unsupervised Token-level Hallucination Detection from Summary Generation By-products by Andreas Marfurt and James Henderson, presented at the GEM workshop at EMNLP 2022.
Our method BART-GBP gives token-level hallucination probabilities for summaries generated by BART. We use the facebook/bart-large-cnn model made available by Hugging Face on their model hub. We first align the summary and source document with the help of BART's cross-attention, then classify aligned tokens for intrinsic hallucination and unaligned tokens for extrinsic hallucination. Our method was evaluated on CNN/DailyMail, but we expect it to perform similarly on other comparably extractive summarization datasets.
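The alignment step can be illustrated with a minimal sketch: align each summary token to the source token that receives the most cross-attention, and treat tokens whose maximum attention falls below a threshold as unaligned. The threshold value and the exact procedure here are illustrative assumptions, not the repository's actual implementation; see the paper for details.

```python
import numpy as np

def align_tokens(cross_attn, threshold=0.1):
    """Toy alignment sketch (threshold is a hypothetical value).

    cross_attn: (summary_len, source_len) array of averaged
    cross-attention weights. Returns, for each summary token, the index
    of the aligned source token, or None if it is unaligned.
    """
    cross_attn = np.asarray(cross_attn)
    best = cross_attn.argmax(axis=1)   # strongest-attended source token
    maxes = cross_attn.max(axis=1)     # strength of that attention
    return [int(b) if m >= threshold else None
            for b, m in zip(best, maxes)]
```

Aligned tokens would then be checked for intrinsic hallucination, unaligned ones for extrinsic hallucination.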
We provide the following data:
- data/frank_annotations.jsonl: Token-level hallucination annotations of 250 CNN/DM summaries with 15700 words, of which 57 (0.4%) are hallucinations (31 intrinsic, 26 extrinsic).
- data/tlhd-cnndm_annotations.jsonl: Token-level hallucination annotations of 150 CNN/DM summaries, one selected sentence per summary, 2100 words with 299 (14.2%) hallucinations (51 intrinsic, 248 extrinsic).
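The annotation files are in JSON Lines format, one record per line. A minimal loader might look like the following; the record field names are not shown here, so inspect the files for the actual schema.

```python
import json

def load_annotations(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```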
First, install conda, e.g. from Miniconda. Then create and activate the environment:
conda env create -f environment.yml
conda activate hallucination-detection
To reproduce the results of BART-GBP on the FRANK dataset, run the following steps:
- Get BART's outputs (attentions and decoding entropies): Get BART Outputs
- Compute the scores (association strength, fraction unaligned, inverse decoding entropy): Compute BART-GBP Scores
- Evaluate the scores by computing the points on the precision/recall curve: Evaluate Scores
- Plot the results: Plot Results
For the TLHD-CNNDM dataset, adjust the paths for the outputs, scores, predictions, and results accordingly.
First we need to save BART's outputs from summary generation.
python save_bart_outputs_for_alignment.py --cross_attention_layers 9 10 --encoder_layers 9 10
Since we use beam search decoding, we have to select the cross-attentions, encoder self-attentions and decoding probabilities of the eventually selected beam. This is taken care of in the BartSummarizer model by the GenerationMixinEncoderDecoder mixin.
As this is a research project, we store these outputs so that we can run multiple experiments on them. In production, one would integrate the subsequent steps into summary generation to compute hallucination probabilities in an online fashion.
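The beam-selection idea can be sketched with toy tensors: beam search produces attention tensors for every beam at every decoding step, and only the slices belonging to the eventually selected beam are kept. Shapes and names below are illustrative, not the repository's actual API.

```python
import numpy as np

def select_beam_outputs(step_attentions, beam_trace):
    """Keep only the selected beam's attention slices.

    step_attentions: list over decoding steps of arrays with shape
    (num_beams, source_len); beam_trace: the index of the selected beam
    at each step. Returns an array of shape (num_steps, source_len).
    """
    return np.stack([att[b] for att, b in zip(step_attentions, beam_trace)])
```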
The following script computes the scores for each token in the summary (sentence):
python bart_gbp_scores.py
To run evaluation, we first convert the token scores into word probabilities of hallucination:
python convert_token_scores_to_word_probs.py
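Conceptually, this step aggregates subword-token scores into one probability per word. A minimal sketch, assuming a token-to-word mapping as produced by fast tokenizers (`word_ids`) and max-aggregation, which is an assumption here, not necessarily the script's actual choice:

```python
def token_to_word_probs(token_probs, word_ids, agg=max):
    """Aggregate per-token scores into per-word probabilities.

    word_ids maps each token to its word index (None for special tokens).
    """
    by_word = {}
    for p, w in zip(token_probs, word_ids):
        if w is None:          # skip special tokens
            continue
        by_word.setdefault(w, []).append(p)
    return [agg(by_word[w]) for w in sorted(by_word)]
```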
Then we compute the points on the precision/recall curve, which result from varying the hallucination probability threshold above which a data point is classified as a hallucination. The script also computes the ROC curve:
python evaluate_bart_gbp.py
Once all results are computed, we plot the PR and ROC curves for intrinsic/extrinsic/all hallucinations with:
python plot_results.py
We've uploaded our model predictions and results for BART-GBP and the baselines here.
In case of problems or questions, open a GitHub issue or write an email to andreas.marfurt [at] idiap.ch.
The work was supported as a part of the grant Automated interpretation of political and economic policy documents: Machine learning using semantic and syntactic information, funded by the Swiss National Science Foundation (grant number CRSII5_180320).
If you use our code, data or models, please cite us.
@inproceedings{marfurt-etal-2022-corpus,
title = "Unsupervised Token-level Hallucination Detection from Summary Generation By-products",
author = "Marfurt, Andreas and
Henderson, James",
booktitle = "Proceedings of the Second Workshop on Generation, Evaluation and Metrics",
month = dec,
year = "2022",
publisher = "Association for Computational Linguistics",
}