# A Japanese Faithfulness Evaluation Dataset for LLM Summarization
A Japanese evaluation dataset for hallucination detection in LLM-generated summaries, with sentence-level faithfulness annotations.
For details, see our paper:
Hikari Tanaka, Atsushi Keyaki, and Mamoru Komachi. 2026. Constructing a Dataset for Hallucination Detection in Japanese Summarization with Fine-grained Faithfulness Labels. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 207–218, Rabat, Morocco. Association for Computational Linguistics.
For the Japanese version of this README, see README_ja.md.
This dataset consists of:
- Japanese news articles (source documents)
- Summaries generated by three LLMs
- Sentence-level faithfulness annotations by human annotators
Each summary sentence is annotated with a faithfulness label indicating whether it contains a hallucination with respect to the source document.
| Item | Count |
|---|---|
| Articles | 130 |
| Summaries (examples) | 390 |
| Summary sentences | 1,949 |
Examples by model:
| Model | Count |
|---|---|
| GPT-4o | 130 (33.3%) |
| Swallow | 130 (33.3%) |
| LLM-jp | 130 (33.3%) |
Distribution of hallucination_present:
| Label | Count | % |
|---|---|---|
| faithful | 220 | 56.4% |
| hallucinated | 148 | 37.9% |
| paraphrase_error_only | 22 | 5.6% |
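Once the JSONL file is loaded, this distribution can be recomputed in a few lines. The sketch below substitutes a few hypothetical in-memory records for the real file, filling in only the hallucination_present field:

```python
from collections import Counter

# Hypothetical records standing in for the real JSONL examples;
# only the summary-level label field is shown.
examples = [
    {"hallucination_present": "faithful"},
    {"hallucination_present": "hallucinated"},
    {"hallucination_present": "faithful"},
    {"hallucination_present": "paraphrase_error_only"},
]

# Count each summary-level label and compute its share of all summaries.
dist = Counter(ex["hallucination_present"] for ex in examples)
total = sum(dist.values())
shares = {label: count / total for label, count in dist.items()}
```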
Distribution of sentence-level hallucination_type:
| Label | Count | % |
|---|---|---|
| Not-hallucination | 1,512 | 77.6% |
| Unresolved-Single | 163 | 8.4% |
| Intrinsic Hallucination | 130 | 6.7% |
| Extrinsic Hallucination | 49 | 2.5% |
| Paraphrase Error | 45 | 2.3% |
| Unresolved-Disagreement | 39 | 2.0% |
| Mixed | 11 | 0.6% |
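The sentence-level counts come from the nested sentences lists described below. A minimal sketch, again using hypothetical in-memory records rather than the real file:

```python
from collections import Counter

# Hypothetical records; only the nested "sentences" field needed here
# is filled in.
examples = [
    {"sentences": [
        {"hallucination_type": "Not-hallucination"},
        {"hallucination_type": "Intrinsic Hallucination"},
    ]},
    {"sentences": [
        {"hallucination_type": "Not-hallucination"},
    ]},
]

# Flatten the nested sentence lists and count each aggregated label.
sentence_labels = Counter(
    s["hallucination_type"] for ex in examples for s in ex["sentences"]
)
```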
The dataset is distributed as a JSONL file. Each line is a JSON object representing one summary (example).
| Field | Type | Description |
|---|---|---|
| example_id | string | Unique identifier. Format: {input_id:03d}_{model} (e.g., 000_GPT-4o) |
| input_id | int | ID of the source article |
| source_url | string | URL of the source news article |
| source_text | string | Full text of the source article |
| model | string | LLM used to generate the summary (see Model Details) |
| summary_text | string | Full summary text |
| hallucination_present | string | Summary-level faithfulness label (see Label Definitions) |
| annotation_mode | string | Annotation procedure: validated or independent |
| sentences | list | List of sentence-level annotation objects (see below) |
| xlsum_split | string | Original split in XL-Sum (train or validation) |
| xlsum_summary | string | Original XL-Sum summary for the article |
Each element of the sentences list is an object with the following fields:

| Field | Type | Description |
|---|---|---|
| sentence_id | int | Index of the sentence within the summary |
| text | string | Sentence text |
| start_char | int | Start character offset in summary_text |
| end_char | int | End character offset in summary_text |
| hallucination_type | string | Aggregated faithfulness label (see Label Definitions) |
| votes | object | Number of annotators who assigned each non-faithful label (e.g., {"Intrinsic Hallucination": 2}) |
| individual_annotations | list | Per-annotator labels: a list of {"annotator_id": ..., "label": ...} |
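The snippet below illustrates these fields on a single hypothetical record (all values are made up for demonstration): the character offsets slice summary_text back into the sentence texts, and a non-empty votes object marks sentences that received at least one non-faithful label.

```python
# Hypothetical record illustrating the sentence-level fields.
example = {
    "summary_text": "要約の一文目。二文目です。",
    "sentences": [
        {
            "sentence_id": 0,
            "text": "要約の一文目。",
            "start_char": 0,
            "end_char": 7,
            "hallucination_type": "Not-hallucination",
            "votes": {},
            "individual_annotations": [],
        },
        {
            "sentence_id": 1,
            "text": "二文目です。",
            "start_char": 7,
            "end_char": 13,
            "hallucination_type": "Intrinsic Hallucination",
            "votes": {"Intrinsic Hallucination": 2},
            "individual_annotations": [
                {"annotator_id": "a1", "label": "Intrinsic Hallucination"},
                {"annotator_id": "a2", "label": "Intrinsic Hallucination"},
            ],
        },
    ],
}

# The offsets index into summary_text, so each sentence can be recovered
# by slicing.
for s in example["sentences"]:
    assert example["summary_text"][s["start_char"]:s["end_char"]] == s["text"]

# Collect sentences that at least one annotator flagged as non-faithful.
flagged = [s for s in example["sentences"] if s["votes"]]
```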
## Label Definitions

Summary-level (hallucination_present):

| Label | Description |
|---|---|
| faithful | No unfaithful sentences in the summary |
| hallucinated | The summary contains at least one hallucinated sentence |
| paraphrase_error_only | The summary contains unfaithful sentences, but all are Paraphrase Errors (no hallucination) |
Aggregation rule:
- If all 6 annotators assigned Not-hallucination → Not-hallucination
- If 2 or more annotators assigned a non-faithful label and a majority agreed on the same label → that label
- If only 1 annotator assigned a non-faithful label → Unresolved-Single
- If 2 or more annotators assigned non-faithful labels but no majority was reached → Unresolved-Disagreement
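The rule can be sketched in code. Note one assumption in this sketch: "majority" is read as a strict majority among the annotators who assigned a non-faithful label.

```python
from collections import Counter

def aggregate(labels):
    """Sketch of the aggregation rule; "majority" is interpreted as a
    strict majority among the non-faithful votes (an assumption)."""
    non_faithful = [l for l in labels if l != "Not-hallucination"]
    if not non_faithful:
        # All annotators judged the sentence faithful.
        return "Not-hallucination"
    if len(non_faithful) == 1:
        # A single dissenting annotator does not determine a label.
        return "Unresolved-Single"
    label, count = Counter(non_faithful).most_common(1)[0]
    if count > len(non_faithful) / 2:
        return label
    return "Unresolved-Disagreement"
```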
Sentence-level (hallucination_type):

| Label | Description |
|---|---|
| Not-hallucination | All annotators judged the sentence as faithful |
| Intrinsic Hallucination | The sentence contradicts information stated in the source document |
| Extrinsic Hallucination | The sentence contains information not present in the source document |
| Paraphrase Error | The sentence is unfaithful due to paraphrasing that distorts meaning, but is not strictly a hallucination |
| Mixed | The sentence contains multiple types of non-faithful errors |
| Unresolved-Single | Only one annotator judged the sentence as non-faithful; label not determined |
| Unresolved-Disagreement | Multiple annotators judged the sentence as non-faithful, but no majority label was reached |
## Model Details

| Model ID in dataset | Model |
|---|---|
| GPT-4o | GPT-4o |
| Swallow | Swallow |
| LLM-jp | LLM-jp |
```python
import json

def load_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

dataset = load_jsonl("dataset.jsonl")

# Filter examples by model
gpt4o_examples = [ex for ex in dataset if ex["model"] == "GPT-4o"]
```

The hallucination_present field has three values. Depending on your task, you may want to treat paraphrase_error_only as either faithful or unfaithful:
```python
def to_binary(ex, paraphrase_as_unfaithful=True):
    """Map the summary-level label to 0 (faithful) / 1 (unfaithful)."""
    label = ex["hallucination_present"]
    if label == "faithful":
        return 0
    elif label == "hallucinated":
        return 1
    else:  # paraphrase_error_only
        return 1 if paraphrase_as_unfaithful else 0

# Option A: treat paraphrase_error_only as unfaithful
labels = [to_binary(ex, paraphrase_as_unfaithful=True) for ex in dataset]

# Option B: treat paraphrase_error_only as faithful
labels = [to_binary(ex, paraphrase_as_unfaithful=False) for ex in dataset]
```

The individual_annotations field allows you to re-aggregate labels using your own rules:
```python
from collections import Counter

def custom_aggregate(sentence, min_votes=2):
    """Require at least min_votes annotators to agree on a non-faithful label."""
    labels = [ann["label"] for ann in sentence["individual_annotations"]]
    if not labels:
        return "Not-hallucination"
    counter = Counter(labels)
    most_common_label, count = counter.most_common(1)[0]
    if count >= min_votes:
        return most_common_label
    return "Unresolved"
```

## License

This repository is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Copyright:
- The annotation data and accompanying code (e.g., utils.py) are copyrighted by the authors.
- The source articles are derived from XL-Sum (CC BY-NC-SA 4.0). Copyright of the original article contents belongs to the respective copyright holders.
- No copyright is claimed over the LLM-generated summaries.
If you use this dataset, please cite our paper:
```bibtex
@inproceedings{tanaka-etal-2026-constructing,
    title = "Constructing a Dataset for Hallucination Detection in {J}apanese Summarization with Fine-grained Faithfulness Labels",
    author = "Tanaka, Hikari and
      Keyaki, Atsushi and
      Komachi, Mamoru",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 4: Student Research Workshop)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-srw.15/",
    doi = "10.18653/v1/2026.eacl-srw.15",
    pages = "207--218",
}
```

Please also cite XL-Sum if you use the source articles:
```bibtex
@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid and
      Bhattacharjee, Abhik and
      Islam, Md. Saiful and
      Mubasshir, Kazi and
      Li, Yuan-Fang and
      Kang, Yong-Bin and
      Rahman, M. Sohel and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.82",
    doi = "10.18653/v1/2021.findings-acl.82",
    pages = "4693--4703",
}
```