QALD-9-plus Dataset Description

QALD-9-plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9 dataset.

QALD-9-plus makes it possible to train and test KGQA systems over DBpedia and Wikidata using questions in 9 different languages: English, German, Russian, French, Armenian, Belarusian, Lithuanian, Bashkir, and Ukrainian.

Some of the questions have several alternative phrasings in a given language, which makes it possible to evaluate the robustness of KGQA systems and to train paraphrasing models.
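As a rough sketch of how such alternative phrasings could be collected, the snippet below groups the question strings of each entry by language and builds same-language paraphrase pairs. It assumes the QALD-style JSON layout (a top-level `questions` list whose entries hold a `question` list of objects with `language` and `string` keys). The sample record and the helper names `phrasings_by_language` and `paraphrase_pairs` are made up for illustration and are not part of the dataset or its tooling.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical record in the assumed QALD-style layout (not taken from the dataset).
sample = {
    "questions": [
        {
            "id": "1",
            "question": [
                {"language": "en", "string": "Who wrote Faust?"},
                {"language": "de", "string": "Wer schrieb Faust?"},
                {"language": "de", "string": "Wer hat Faust geschrieben?"},
            ],
        }
    ]
}


def phrasings_by_language(entry):
    """Map each language code to the list of its question strings."""
    grouped = defaultdict(list)
    for item in entry["question"]:
        grouped[item["language"]].append(item["string"])
    return dict(grouped)


def paraphrase_pairs(dataset, language):
    """Return all unordered pairs of same-language phrasings of one question."""
    pairs = []
    for entry in dataset["questions"]:
        variants = phrasings_by_language(entry).get(language, [])
        pairs.extend(combinations(variants, 2))
    return pairs


print(paraphrase_pairs(sample, "de"))
```

Languages with a single phrasing per question simply yield no pairs, so the same function can be run over every language code without special cases.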

Because the translations of the questions were provided by native speakers, they are considered a "gold standard"; machine translation tools can therefore be trained and evaluated on the dataset.
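For the machine translation use case, one possible way to derive source/target sentence pairs is sketched below. It again assumes the QALD-style JSON layout (`questions` entries with a `question` list of `language`/`string` objects). The sample record and the function name `parallel_pairs` are hypothetical.

```python
from itertools import product

# Hypothetical record in the assumed QALD-style layout (not taken from the dataset).
sample = {
    "questions": [
        {
            "id": "1",
            "question": [
                {"language": "en", "string": "Who wrote Faust?"},
                {"language": "de", "string": "Wer schrieb Faust?"},
                {"language": "de", "string": "Wer hat Faust geschrieben?"},
            ],
        }
    ]
}


def parallel_pairs(dataset, source, target):
    """Pair every source-language string with every target-language string
    of the same question."""
    pairs = []
    for entry in dataset["questions"]:
        src = [q["string"] for q in entry["question"] if q["language"] == source]
        tgt = [q["string"] for q in entry["question"] if q["language"] == target]
        pairs.extend(product(src, tgt))
    return pairs


for en, de in parallel_pairs(sample, "en", "de"):
    print(en, "->", de)
```

When a question has several translations in the target language, each of them is paired with the source string, which gives multiple references per source sentence.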

Dataset Statistics

	en	de	fr	ru	uk	lt	be	ba	hy	# questions DBpedia	# questions Wikidata
Train	408	543	260	1203	447	468	441	284	80	408	371
Test	150	176	26	348	176	186	155	117	20	150	136

The numbers show that some languages are covered more than once, i.e., there is more than one translation for a particular question. For example, there are 1203 Russian translations available while only 408 unique questions exist in the training subset (about 2.9 Russian translations per question). The availability of such parallel corpora allows researchers, developers, and other dataset users to address the paraphrasing task.
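The ratio quoted for Russian can be reproduced for every language from the training row of the statistics table. The snippet below is only an illustration of that arithmetic: the counts are copied from the table, 408 is used as the number of unique training questions as stated in the text, and the function name `translations_per_question` is made up.

```python
# Counts copied from the "Train" row of the Dataset Statistics table.
train_translations = {
    "en": 408, "de": 543, "fr": 260, "ru": 1203, "uk": 447,
    "lt": 468, "be": 441, "ba": 284, "hy": 80,
}
train_unique_questions = 408  # unique questions in the training subset


def translations_per_question(counts, unique_questions):
    """Average number of translations per unique question, per language."""
    return {lang: round(n / unique_questions, 1) for lang, n in counts.items()}


ratios = translations_per_question(train_translations, train_unique_questions)
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio}")
```

A ratio above 1 indicates that at least some questions have several translations in that language. A ratio below 1, as computed here for French and Armenian, suggests that not every question is translated into that language.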

Evaluation

We used the GERBIL system to evaluate the dataset. Detailed information on each experiment is available via the individual links (click the value in the corresponding cell).

Wikidata

QAnswer

	en	de	ru	fr
Test	link	link	link	link
Train	link	link	link	link

DeepPavlov

	en	ru
Test	link	link
Train	link	link

Platypus

	en	fr
Test	link	link
Train	link	link

DBpedia

QAnswer

	en	de	ru	fr
Test	link	link	link	link
Train	link	link	link	link

Wikidata Original Translations

QAnswer

	de	ru	fr
Test	link	link	link
Train	link	link	link

DeepPavlov

	ru
Test	link
Train	link

Platypus

	fr
Test	link
Train	link

DBpedia Original Translations

QAnswer

	de	ru	fr
Test	link	link	link
Train	link	link	link

Cite

@inproceedings{perevalov2022qald9plus,
      author={Perevalov, Aleksandr and Diefenbach, Dennis and Usbeck, Ricardo and Both, Andreas},
      booktitle={2022 IEEE 16th International Conference on Semantic Computing (ICSC)},
      title={QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers},
      year={2022},
      pages={229-234},
      doi={10.1109/ICSC52841.2022.00045}
}

Useful Links

ArXiv link
Papers with Code: Paper, Dataset
Video presentation on YouTube: https://youtu.be/W1w7CJTV48c
Presentation slides
Google Colab notebook

Licence

This work is licensed under a Creative Commons Attribution 4.0 International License.

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property value

name QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers

alternateName QALD-9-plus

url https://github.com/Perevalov/qald_9_plus/tree/main/data

description QALD-9-plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9 dataset. QALD-9-plus makes it possible to train and test KGQA systems over DBpedia and Wikidata using questions in 9 different languages: English, German, Russian, French, Armenian, Belarusian, Lithuanian, Bashkir, and Ukrainian. Some of the questions have several alternative phrasings in a given language, which makes it possible to evaluate the robustness of KGQA systems and to train paraphrasing models. Because the translations of the questions were provided by native speakers, they are considered a "gold standard"; machine translation tools can therefore be trained and evaluated on the dataset.

license

property	value
name	`CC-BY-4.0`
url	`https://creativecommons.org/licenses/by/4.0/`

citation Perevalov, Aleksandr, Diefenbach, Dennis, Usbeck, Ricardo, Both, Andreas: QALD-9-plus: A multilingual dataset for question answering over DBpedia and Wikidata translated by native speakers. In: 2022 IEEE 16th International Conference on Semantic Computing (ICSC). IEEE (2022)
