# A Japanese Faithfulness Evaluation Dataset for LLM Summarization

A Japanese evaluation dataset for hallucination detection in LLM-generated summarization, with sentence-level faithfulness annotations.

For details, see our paper:

Hikari Tanaka, Atsushi Keyaki, and Mamoru Komachi. 2026. Constructing a Dataset for Hallucination Detection in Japanese Summarization with Fine-grained Faithfulness Labels. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 207–218, Rabat, Morocco. Association for Computational Linguistics.

The Japanese version of this README is available here → README_ja.md


## Overview

This dataset consists of:

- Japanese news articles (source documents)
- Summaries generated by three LLMs
- Sentence-level faithfulness annotations by human annotators

Each summary sentence is annotated with a faithfulness label indicating whether it contains a hallucination with respect to the source document.


## Dataset Statistics

| Item | Count |
|---|---|
| Articles | 130 |
| Summaries (examples) | 390 |
| Summary sentences | 1,949 |

Examples by model:

| Model | Count |
|---|---|
| GPT-4o | 130 (33.3%) |
| Swallow | 130 (33.3%) |
| LLM-jp | 130 (33.3%) |

Distribution of hallucination_present:

| Label | Count | % |
|---|---|---|
| faithful | 220 | 56.4% |
| hallucinated | 148 | 37.9% |
| paraphrase_error_only | 22 | 5.6% |

Distribution of sentence-level hallucination_type:

| Label | Count | % |
|---|---|---|
| Not-hallucination | 1,512 | 77.6% |
| Unresolved-Single | 163 | 8.4% |
| Intrinsic Hallucination | 130 | 6.7% |
| Extrinsic Hallucination | 49 | 2.5% |
| Paraphrase Error | 45 | 2.3% |
| Unresolved-Disagreement | 39 | 2.0% |
| Mixed | 11 | 0.6% |

## Data Format

The dataset is distributed as a JSONL file. Each line is a JSON object representing one summary (example).

### Top-level Fields

| Field | Type | Description |
|---|---|---|
| `example_id` | string | Unique identifier. Format: `{input_id:03d}_{model}` (e.g., `000_GPT-4o`) |
| `input_id` | int | ID of the source article |
| `source_url` | string | URL of the source news article |
| `source_text` | string | Full text of the source article |
| `model` | string | LLM used to generate the summary (see Model Details) |
| `summary_text` | string | Full summary text |
| `hallucination_present` | string | Summary-level faithfulness label (see Label Definitions) |
| `annotation_mode` | string | Annotation procedure: `validated` or `independent` |
| `sentences` | list | List of sentence-level annotation objects (see below) |
| `xlsum_split` | string | Original split in XL-Sum (`train` or `validation`) |
| `xlsum_summary` | string | Original XL-Sum summary for the article |
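
Since `example_id` encodes both the article ID and the model, it can be split back apart if needed. A minimal sketch (the helper below is illustrative, not part of the released code):

```python
def parse_example_id(example_id: str) -> tuple[int, str]:
    """Split an example_id such as "000_GPT-4o" into (input_id, model)."""
    number, model = example_id.split("_", 1)  # split at the first underscore only
    return int(number), model
```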

### Sentence-level Fields (`sentences`)

| Field | Type | Description |
|---|---|---|
| `sentence_id` | int | Index of the sentence within the summary |
| `text` | string | Sentence text |
| `start_char` | int | Start character offset in `summary_text` |
| `end_char` | int | End character offset in `summary_text` |
| `hallucination_type` | string | Aggregated faithfulness label (see Label Definitions) |
| `votes` | object | Number of annotators who assigned each non-faithful label (e.g., `{"Intrinsic Hallucination": 2}`) |
| `individual_annotations` | list | Per-annotator labels: list of `{"annotator_id": ..., "label": ...}` |
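
The offsets are relative to `summary_text`. A small sketch of walking the nested structure, assuming `ex` is one parsed example (see Loading the dataset below) and that the offsets are half-open Python-style slice indices:

```python
# `ex` is one parsed JSONL line (see "Loading the dataset" below)
for sent in ex["sentences"]:
    # Assumed half-open span: summary_text[start_char:end_char] should equal text
    span = ex["summary_text"][sent["start_char"]:sent["end_char"]]
    print(sent["sentence_id"], sent["hallucination_type"], span == sent["text"])
```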

## Label Definitions

### `hallucination_present` (summary level)

| Label | Description |
|---|---|
| `faithful` | No unfaithful sentences in the summary |
| `hallucinated` | The summary contains at least one hallucinated sentence |
| `paraphrase_error_only` | The summary contains unfaithful sentences, but all are Paraphrase Errors (no hallucination) |

### `hallucination_type` (sentence level)

Aggregation rule (a minimal code sketch follows the label table below):

- If all 6 annotators assigned Not-hallucination → Not-hallucination
- If 2 or more annotators assigned a non-faithful label and a majority agreed on the same label → that label
- If only 1 annotator assigned a non-faithful label → Unresolved-Single
- If 2 or more annotators assigned non-faithful labels but no majority was reached → Unresolved-Disagreement

| Label | Description |
|---|---|
| Not-hallucination | All annotators judged the sentence as faithful |
| Intrinsic Hallucination | The sentence contradicts information stated in the source document |
| Extrinsic Hallucination | The sentence contains information not present in the source document |
| Paraphrase Error | The sentence is unfaithful due to paraphrasing that distorts meaning, but is not strictly a hallucination |
| Mixed | The sentence contains multiple types of non-faithful errors |
| Unresolved-Single | Only one annotator judged the sentence as non-faithful; label not determined |
| Unresolved-Disagreement | Multiple annotators judged the sentence as non-faithful, but no majority label was reached |
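
For reference, the rule above can be approximated from the per-sentence `votes` field. This is a sketch only: it assumes "a majority" means more than half of the non-faithful votes and it does not reproduce the Mixed label, so it may disagree with the released `hallucination_type` in edge cases.

```python
def aggregate_from_votes(votes: dict) -> str:
    """Sketch of the documented aggregation rule (not the official implementation)."""
    total = sum(votes.values())       # number of non-faithful votes
    if total == 0:
        return "Not-hallucination"    # every annotator judged the sentence faithful
    if total == 1:
        return "Unresolved-Single"    # a single non-faithful vote
    label, count = max(votes.items(), key=lambda kv: kv[1])
    if count > total / 2:             # assumed reading of "a majority agreed"
        return label
    return "Unresolved-Disagreement"
```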

## Model Details

| Model ID in dataset | Model |
|---|---|
| `GPT-4o` | GPT-4o |
| `Swallow` | Swallow |
| `LLM-jp` | LLM-jp |

## Usage

### Loading the dataset

```python
import json

def load_jsonl(path):
    """Load a JSONL file: one JSON object (summary example) per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

dataset = load_jsonl("dataset.jsonl")
```

### Filtering by model

```python
gpt4o_examples = [ex for ex in dataset if ex["model"] == "GPT-4o"]
```
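
As a quick sanity check, the per-model counts from the Dataset Statistics section can be recomputed (this assumes the file contains all 390 examples):

```python
from collections import Counter

# Expected per the statistics above: 130 examples for each of the three models
print(Counter(ex["model"] for ex in dataset))
```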

### Using `hallucination_present` as a binary label

The hallucination_present field has three values. Depending on your task, you may want to treat paraphrase_error_only as either faithful or unfaithful:

```python
def to_binary(ex, paraphrase_as_unfaithful=True):
    label = ex["hallucination_present"]
    if label == "faithful":
        return 0
    elif label == "hallucinated":
        return 1
    else:  # paraphrase_error_only
        return 1 if paraphrase_as_unfaithful else 0

# Option A: treat paraphrase_error_only as unfaithful
labels = [to_binary(ex, paraphrase_as_unfaithful=True) for ex in dataset]

# Option B: treat paraphrase_error_only as faithful
labels = [to_binary(ex, paraphrase_as_unfaithful=False) for ex in dataset]
```
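
The choice only affects the 22 paraphrase_error_only summaries; as a quick check (again assuming the full 390-example file), the positive counts under the two options differ by exactly that amount:

```python
# Number of summaries treated as unfaithful under each convention
n_option_a = sum(to_binary(ex, paraphrase_as_unfaithful=True) for ex in dataset)
n_option_b = sum(to_binary(ex, paraphrase_as_unfaithful=False) for ex in dataset)
print(n_option_a, n_option_b)  # 170 vs. 148, per the hallucination_present distribution
```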

### Applying a custom aggregation rule to sentence-level annotations

The individual_annotations field allows you to re-aggregate labels using your own rules:

```python
from collections import Counter

def custom_aggregate(sentence, min_votes=2):
    """Require at least min_votes annotators to agree on a non-faithful label."""
    labels = [ann["label"] for ann in sentence["individual_annotations"]]
    if not labels:
        return "Not-hallucination"
    counter = Counter(labels)
    most_common_label, count = counter.most_common(1)[0]
    if count >= min_votes:
        return most_common_label
    return "Unresolved"
```

## License

This repository is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

## Copyright

- The annotation data and accompanying code (e.g., `utils.py`) are copyrighted by the authors.
- The source articles are derived from XL-Sum (CC BY-NC-SA 4.0). Copyright of the original article contents belongs to the respective copyright holders.
- No copyright is claimed over the LLM-generated summaries.

## Citation

If you use this dataset, please cite our paper:

```bibtex
@inproceedings{tanaka-etal-2026-constructing,
    title = "Constructing a Dataset for Hallucination Detection in {J}apanese Summarization with Fine-grained Faithfulness Labels",
    author = "Tanaka, Hikari  and
      Keyaki, Atsushi  and
      Komachi, Mamoru",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 4: Student Research Workshop)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-srw.15/",
    doi = "10.18653/v1/2026.eacl-srw.15",
    pages = "207--218",
}
```

Please also cite XL-Sum if you use the source articles:

```bibtex
@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Islam, Md. Saiful  and
      Mubasshir, Kazi  and
      Li, Yuan-Fang  and
      Kang, Yong-Bin  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.82",
    doi = "10.18653/v1/2021.findings-acl.82",
    pages = "4693--4703",
}
```
