A data set of natural language queries with corresponding SPARQL queries
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.github/ISSUE_TEMPLATE
resources
LICENSE.txt
README.md
test-data.json
train-data.json

README.md

LC-QuAD

Largescale Complex Question Answering Dataset

Download

🐣 Train, Test Data

Links

🌍 Webpage | 📄 Paper | 🏢 Lab

Introduction

We release, and maintain a gold standard KBQA (Question Answering over Knowledge Base) dataset containing 5000 Question and SPARQL queries. LC-QuAD uses DBpedia v04.16 as the target KB.

Usage

License: You can download the dataset (released with a GPL 3.0 License), or read below to know more.

Versioning: We use DBpedia version 04-2016 as our target KB. The public DBpedia endpoint (http://dbpedia.org/sparql) no longer uses this version, which might cause many SPARQL queries to not retrieve any answer. We strongly recommend hosting this version locally. To do so, see this guide

Splits: We release the dataset split into training, and test in a 80:20 fashion.

Format: The dataset is released in JSON dumps, where the key corrected_question contains the question, and query contains the corresponding SPARQL query.

The dataset generated has the following JSON structure, kept intact for .

{
 	'_id': 'Unique ID of this datapoint',
  	'corrected_question': 'Corrected, Final Question',
	'id': 'Template ID',
	'query': 'SPARQL Query',
	'template': 'Template used to create SPARQL Query',
	'intermediary_question': 'Automatically generated, grammatically incorrect question'
}

Cite

@inproceedings{trivedi2017lc,
  title={Lc-quad: A corpus for complex question answering over knowledge graphs},
  author={Trivedi, Priyansh and Maheshwari, Gaurav and Dubey, Mohnish and Lehmann, Jens},
  booktitle={International Semantic Web Conference},
  pages={210--218},
  year={2017},
  organization={Springer}
}

Benchmarking/Leaderboard

We're in the process of automating the benchmarking process (and updating results on our webpage). In the meantime, please get in touch with us at priyansh.trivedi@uni-bonn.de, and we'll do it manually. Apologies for this inconvinience.

Methodology

Overview

  • Automatically create SPARQL queries.
  • Convert SPARQL queries to intermediary NLQs.
  • Manually correct intermediary NLQs to create Questions

We start with a set of Seed Entities, and Predicate Whitelist. Using the whitelist, we generate 2-hop subgraphs around seed entities. With a seed entity as supposed answer, we juxtapose SPARQL Templates onto the subgraph, and generate SPARQL queries.

Corresponding to SPARQL template, and based on certain conditions, we assign hand-made NL question templates to the SPARQLs. Refer to this diagram to understand the nomenclature used in templates.

Finally, we follow a two-step (Correct, Review) system to generate a grammatically correct question for every template-generated one.

Changelog

0.1.3 - 19-06-2018

  • Published train-test splits
  • Website Updated

0.1.2 - 28-01-2018

  • Updated public website
  • Dataset now available in QALD format
  • Leaderboard underway

0.1.1 - 27-10-2017

  • Fixed a bug with rdf:type filter in SPARQL
  • data_set.json updated
  • updated templates.py

0.1.0 - 01-05-2017