The full pipeline for creating the UHGEval hallucination dataset

1. Collect the raw news

Status: Full data; Available.
Data location: ./sources/xinhua/raw/
Number: 75 txt files, 737,766 news articles in total
Note: These data belong to Xinhua News Agency and are used for research purposes only.

Status: No data; must be generated using the script.
Script: ./sources/xinhua/preprocessor.py
Data location: ./sources/xinhua/processed; Use the script to generate the data
Number: Retained 25,005 news articles (constituting 3.39% of the raw news).
Filtering settings:
- Only includes news categories such as: '政治', '法律', '军事', '教育', '体育', '经济', '市场', '科学', '技术', '医疗', '卫生', '社会', '文化', '艺术', '娱乐', '天气', '环保', '灾害', '事故' ('Politics', 'Law', 'Military', 'Education', 'Sports', 'Economics', 'Market', 'Science', 'Technology', 'Medical', 'Health', 'Society', 'Culture', 'Art', 'Entertainment', 'Weather', 'Environmental Protection', 'Disaster', 'Accident').
- The length of newsBeginning + newsRemainder is between [630, 870].
- newsBeginning has [2, 5] sentences. Note: sentence-ending symbols include "。；：？！"
- The length of newsBeginning is between [80, 120].
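The filtering settings above can be sketched as a single predicate. This is a minimal illustration, not the code in preprocessor.py: the function names, the use of character counts for "length", the exact-match category check, and the treatment of an unterminated trailing fragment as a sentence are all assumptions.

```python
import re

# Categories listed in the filtering settings.
ALLOWED_CATEGORIES = {
    "政治", "法律", "军事", "教育", "体育", "经济", "市场", "科学", "技术", "医疗",
    "卫生", "社会", "文化", "艺术", "娱乐", "天气", "环保", "灾害", "事故",
}

# Sentence-ending symbols named in the settings.
SENTENCE_END = "。；：？！"


def count_sentences(text: str) -> int:
    """Count non-empty segments separated by sentence-ending symbols (assumed rule)."""
    parts = re.split(f"[{SENTENCE_END}]", text)
    return sum(1 for part in parts if part.strip())


def keep_article(category: str, news_beginning: str, news_remainder: str) -> bool:
    """Return True if an article satisfies all four preprocessing filters.

    Lengths are measured in characters, which is an assumption.
    """
    total_length = len(news_beginning) + len(news_remainder)
    return (
        category in ALLOWED_CATEGORIES
        and 630 <= total_length <= 870
        and 2 <= count_sentences(news_beginning) <= 5
        and 80 <= len(news_beginning) <= 120
    )


if __name__ == "__main__":
    beginning = ("甲" * 49 + "。") * 2   # 100 characters, 2 sentences
    remainder = "乙" * 600               # total length 700
    print(keep_article("体育", beginning, remainder))
```

Keeping each condition as a separate clause makes it easy to adjust one threshold without touching the others.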

Status: No data; must be generated using the script.
Script: ./gen_candidates.py
Data location: ./candidates/
Number: Retained 17,503 news articles (constituting 70.00% of the preprocessed news).
Filtering settings:
- keywordPrecision is in the open interval (0, 1), and should generally fall within (0.2, 0.6).
- candidateHallucinatedContinuation consists of only 1 sentence.
- The length of candidateHallucinatedContinuation is between [20, 70].
- appearedKeywords has at least 2 keywords.
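As a rough sketch, the candidate filters above could be applied to one record as follows. It assumes each candidate is a dict that already carries the fields keywordPrecision, candidateHallucinatedContinuation, and appearedKeywords; the function names, the character-based length, and the reuse of the sentence-ending symbols "。；：？！" are assumptions and may differ from gen_candidates.py.

```python
import re

# Assumed to match the sentence-ending symbols used in preprocessing.
SENTENCE_END = "。；：？！"


def count_sentences(text: str) -> int:
    """Count non-empty segments separated by sentence-ending symbols (assumed rule)."""
    parts = re.split(f"[{SENTENCE_END}]", text)
    return sum(1 for part in parts if part.strip())


def keep_candidate(candidate: dict) -> bool:
    """Return True if a candidate record passes all four candidate filters."""
    continuation = candidate["candidateHallucinatedContinuation"]
    return (
        0 < candidate["keywordPrecision"] < 1          # open interval (0, 1)
        and count_sentences(continuation) == 1         # exactly one sentence
        and 20 <= len(continuation) <= 70              # length in characters (assumed)
        and len(candidate["appearedKeywords"]) >= 2    # at least 2 keywords
    )


if __name__ == "__main__":
    example = {
        "keywordPrecision": 0.4,
        "candidateHallucinatedContinuation": "甲" * 29 + "。",
        "appearedKeywords": ["关键词一", "关键词二"],
    }
    print(keep_candidate(example))
```

The stricter (0.2, 0.6) range mentioned for keywordPrecision is left out here because the settings describe it as a general guideline, not a hard cutoff.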

Status: Partial data provided as examples; the rest must be generated using the script.
Script: ./gen_machine_annotations.py
Data location: ./machine_annotations/keyword_hallucinated
Note: Only articles labeled as containing hallucinations are kept for subsequent processing; those without hallucinations are located in ./machine_annotations/unhallucinated

Label Studio is a multi-type data labeling and annotation tool with a standardized output format.

Relevant files can be found in ./label_studio_annotations/.
