# A Japanese Faithfulness Evaluation Dataset for LLM Summarization
A Japanese evaluation dataset for hallucination detection in LLM-generated summaries, with sentence-level faithfulness annotations.
For details, see our paper:
Hikari Tanaka, Atsushi Keyaki, and Mamoru Komachi. 2026. Constructing a Dataset for Hallucination Detection in Japanese Summarization with Fine-grained Faithfulness Labels. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 207–218, Rabat, Morocco. Association for Computational Linguistics.
For the Japanese version of this README, see README_ja.md.
This dataset consists of:
- Japanese news articles (source documents)
- Summaries generated by three LLMs
- Sentence-level faithfulness annotations by human annotators
Each summary sentence is annotated with a faithfulness label indicating whether it contains a hallucination with respect to the source document.
| Item | Count |
|---|---|
| Articles | 130 |
| Summaries (examples) | 390 |
| Summary sentences | 1,949 |
Examples by model:
| Model | Count |
|---|---|
| GPT-4o | 130 (33.3%) |
| Swallow | 130 (33.3%) |
| LLM-jp | 130 (33.3%) |
Distribution of hallucination_present:
| Label | Count | % |
|---|---|---|
| faithful | 220 | 56.4% |
| hallucinated | 148 | 37.9% |
| paraphrase_error_only | 22 | 5.6% |
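Once the JSONL file is loaded, this distribution can be recomputed in a few lines. The sketch below substitutes a few hypothetical in-memory records for the real file, filling in only the hallucination_present field:

```python
from collections import Counter

# Hypothetical records standing in for the real JSONL examples;
# only the summary-level label field is shown.
examples = [
    {"hallucination_present": "faithful"},
    {"hallucination_present": "hallucinated"},
    {"hallucination_present": "faithful"},
    {"hallucination_present": "paraphrase_error_only"},
]

# Count each summary-level label and compute its share of all summaries.
dist = Counter(ex["hallucination_present"] for ex in examples)
total = sum(dist.values())
shares = {label: count / total for label, count in dist.items()}
```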
Distribution of sentence-level hallucination_type:
| Label | Count | % |
|---|---|---|
| Not-hallucination | 1,512 | 77.6% |
| Unresolved-Single | 163 | 8.4% |
| Intrinsic Hallucination | 130 | 6.7% |
| Extrinsic Hallucination | 49 | 2.5% |
| Paraphrase Error | 45 | 2.3% |
| Unresolved-Disagreement | 39 | 2.0% |
| Mixed | 11 | 0.6% |
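The sentence-level counts come from the nested sentences lists described below. A minimal sketch, again using hypothetical in-memory records rather than the real file:

```python
from collections import Counter

# Hypothetical records; only the nested "sentences" field needed here
# is filled in.
examples = [
    {"sentences": [
        {"hallucination_type": "Not-hallucination"},
        {"hallucination_type": "Intrinsic Hallucination"},
    ]},
    {"sentences": [
        {"hallucination_type": "Not-hallucination"},
    ]},
]

# Flatten the nested sentence lists and count each aggregated label.
sentence_labels = Counter(
    s["hallucination_type"] for ex in examples for s in ex["sentences"]
)
```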
The dataset is distributed as a JSONL file. Each line is a JSON object representing one summary (example).
| Field | Type | Description |
|---|---|---|
| example_id | string | Unique identifier. Format: {input_id:03d}_{model} (e.g., 000_GPT-4o) |
| input_id | int | ID of the source article |
| source_url | string | URL of the source news article |
| source_text | string | Full text of the source article |
| model | string | LLM used to generate the summary (see Model Details) |
| summary_text | string | Full summary text |
| hallucination_present | string | Summary-level faithfulness label (see Label Definitions) |
| annotation_mode | string | Annotation procedure: validated or independent |
| sentences | list | List of sentence-level annotation objects (see below) |
| xlsum_split | string | Original split in XL-Sum (train or validation) |
| xlsum_summary | string | Original XL-Sum summary for the article |
Each element of the sentences list is an object with the following fields:

| Field | Type | Description |
|---|---|---|
| sentence_id | int | Index of the sentence within the summary |
| text | string | Sentence text |
| start_char | int | Start character offset in summary_text |
| end_char | int | End character offset in summary_text |
| hallucination_type | string | Aggregated faithfulness label (see Label Definitions) |
| votes | object | Number of annotators who assigned each non-faithful label (e.g., {"Intrinsic Hallucination": 2}) |
| individual_annotations | list | Per-annotator labels: a list of {"annotator_id": ..., "label": ...} |
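The snippet below illustrates these fields on a single hypothetical record (all values are made up for demonstration): the character offsets slice summary_text back into the sentence texts, and a non-empty votes object marks sentences that received at least one non-faithful label.

```python
# Hypothetical record illustrating the sentence-level fields.
example = {
    "summary_text": "要約の一文目。二文目です。",
    "sentences": [
        {
            "sentence_id": 0,
            "text": "要約の一文目。",
            "start_char": 0,
            "end_char": 7,
            "hallucination_type": "Not-hallucination",
            "votes": {},
            "individual_annotations": [],
        },
        {
            "sentence_id": 1,
            "text": "二文目です。",
            "start_char": 7,
            "end_char": 13,
            "hallucination_type": "Intrinsic Hallucination",
            "votes": {"Intrinsic Hallucination": 2},
            "individual_annotations": [
                {"annotator_id": "a1", "label": "Intrinsic Hallucination"},
                {"annotator_id": "a2", "label": "Intrinsic Hallucination"},
            ],
        },
    ],
}

# The offsets index into summary_text, so each sentence can be recovered
# by slicing.
for s in example["sentences"]:
    assert example["summary_text"][s["start_char"]:s["end_char"]] == s["text"]

# Collect sentences that at least one annotator flagged as non-faithful.
flagged = [s for s in example["sentences"] if s["votes"]]
```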
## Label Definitions

Summary-level (hallucination_present):

| Label | Description |
|---|---|
| faithful | No unfaithful sentences in the summary |
| hallucinated | The summary contains at least one hallucinated sentence |
| paraphrase_error_only | The summary contains unfaithful sentences, but all are Paraphrase Errors (no hallucination) |
Aggregation rule:
- If all 6 annotators assigned Not-hallucination → Not-hallucination
- If 2 or more annotators assigned a non-faithful label and a majority agreed on the same label → that label
- If only 1 annotator assigned a non-faithful label → Unresolved-Single
- If 2 or more annotators assigned non-faithful labels but no majority was reached → Unresolved-Disagreement
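The rule can be sketched in code. Note one assumption in this sketch: "majority" is read as a strict majority among the annotators who assigned a non-faithful label.

```python
from collections import Counter

def aggregate(labels):
    """Sketch of the aggregation rule; "majority" is interpreted as a
    strict majority among the non-faithful votes (an assumption)."""
    non_faithful = [l for l in labels if l != "Not-hallucination"]
    if not non_faithful:
        # All annotators judged the sentence faithful.
        return "Not-hallucination"
    if len(non_faithful) == 1:
        # A single dissenting annotator does not determine a label.
        return "Unresolved-Single"
    label, count = Counter(non_faithful).most_common(1)[0]
    if count > len(non_faithful) / 2:
        return label
    return "Unresolved-Disagreement"
```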
Sentence-level (hallucination_type):

| Label | Description |
|---|---|
| Not-hallucination | All annotators judged the sentence as faithful |
| Intrinsic Hallucination | The sentence contradicts information stated in the source document |
| Extrinsic Hallucination | The sentence contains information not present in the source document |
| Paraphrase Error | The sentence is unfaithful due to paraphrasing that distorts meaning, but is not strictly a hallucination |
| Mixed | The sentence contains multiple types of non-faithful errors |
| Unresolved-Single | Only one annotator judged the sentence as non-faithful; label not determined |
| Unresolved-Disagreement | Multiple annotators judged the sentence as non-faithful, but no majority label was reached |
## Model Details

| Model ID in dataset | Model |
|---|---|
| GPT-4o | GPT-4o |
| Swallow | Swallow |
| LLM-jp | LLM-jp |
```python
import json

def load_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

dataset = load_jsonl("dataset.jsonl")

# Filter examples by model
gpt4o_examples = [ex for ex in dataset if ex["model"] == "GPT-4o"]
```

The hallucination_present field has three values. Depending on your task, you may want to treat paraphrase_error_only as either faithful or unfaithful:
```python
def to_binary(ex, paraphrase_as_unfaithful=True):
    """Map the summary-level label to 0 (faithful) / 1 (unfaithful)."""
    label = ex["hallucination_present"]
    if label == "faithful":
        return 0
    elif label == "hallucinated":
        return 1
    else:  # paraphrase_error_only
        return 1 if paraphrase_as_unfaithful else 0

# Option A: treat paraphrase_error_only as unfaithful
labels = [to_binary(ex, paraphrase_as_unfaithful=True) for ex in dataset]

# Option B: treat paraphrase_error_only as faithful
labels = [to_binary(ex, paraphrase_as_unfaithful=False) for ex in dataset]
```

The individual_annotations field allows you to re-aggregate labels using your own rules:
```python
from collections import Counter

def custom_aggregate(sentence, min_votes=2):
    """Require at least min_votes annotators to agree on a non-faithful label."""
    labels = [ann["label"] for ann in sentence["individual_annotations"]]
    if not labels:
        return "Not-hallucination"
    counter = Counter(labels)
    most_common_label, count = counter.most_common(1)[0]
    if count >= min_votes:
        return most_common_label
    return "Unresolved"
```

## License

This repository is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Copyright:
- The annotation data and accompanying code (e.g., utils.py) are copyrighted by the authors.
- The source articles are derived from XL-Sum (CC BY-NC-SA 4.0). Copyright of the original article contents belongs to the respective copyright holders.
- No copyright is claimed over the LLM-generated summaries.
If you use this dataset, please cite our paper:
```bibtex
@inproceedings{tanaka-etal-2026-constructing,
    title = "Constructing a Dataset for Hallucination Detection in {J}apanese Summarization with Fine-grained Faithfulness Labels",
    author = "Tanaka, Hikari and
      Keyaki, Atsushi and
      Komachi, Mamoru",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 4: Student Research Workshop)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-srw.15/",
    doi = "10.18653/v1/2026.eacl-srw.15",
    pages = "207--218",
}
```

Please also cite XL-Sum if you use the source articles:
```bibtex
@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid and
      Bhattacharjee, Abhik and
      Islam, Md. Saiful and
      Mubasshir, Kazi and
      Li, Yuan-Fang and
      Kang, Yong-Bin and
      Rahman, M. Sohel and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.82",
    doi = "10.18653/v1/2021.findings-acl.82",
    pages = "4693--4703",
}
```