AI-team-UoA/GeoQuestions1089


GeoQuestions1089


A crowdsourced geospatial question-answering dataset that contains 1089 triples of natural language questions, SPARQL/GeoSPARQL queries, and their answers over YAGO2geo.

Overview

GeoQuestions1089 is a crowdsourced geospatial question-answering dataset that targets the Knowledge Graph YAGO2geo. It contains 1089 triples of geospatial questions, their answers, and the respective SPARQL/GeoSPARQL queries.

It has been used to benchmark two state-of-the-art question answering engines: GeoQA2 and the engine of Hamzei et al.

Repository information

The repository is organized as follows:

  • engines/: contains the versions of the engines that were used in the benchmark.
  • GoST/: contains a transpiler that rewrites queries to use materialized relations.
  • results/generated-queries/: contains the queries generated by the engines for the benchmark.
  • GeoQuestions1089.json: contains 1089 natural language questions and their queries.
  • GeoQuestions1089_answers.json: contains the results of the queries.
  • GeoQuestions1089.csv: the entire dataset in CSV format.

Dataset

The dataset is described in the following paper (also used to cite the dataset):

@inproceedings{10.1007/978-3-031-47243-5_15,
  title = {Benchmarking Geospatial Question Answering Engines Using the Dataset GeoQuestions1089},
  author = {Sergios-Anestis Kefalidis and Dharmen Punjani and Eleni Tsalapati and
         Konstantinos Plas and Mariangela Pollali and Michail Mitsios and
         Myrto Tsokanaridou and Manolis Koubarakis and Pierre Maret},
  booktitle = {The Semantic Web - {ISWC} 2023 - 22nd International Semantic Web Conference,
            Athens, Greece, November 6-10, 2023, Proceedings, Part {II}},
  year = {2023}
}

In short, the GeoQuestions1089 dataset consists of two parts, which we will refer to as GeoQuestions_c and GeoQuestions_w, both of which target the union of YAGO2 and YAGO2geo.

GeoQuestions_c consists of 1017 entries and GeoQuestions_w of 72 entries. The difference between the two is that the natural language questions of GeoQuestions_w contain grammatical, syntactic, and spelling mistakes.

Description                                                   | Range
--------------------------------------------------------------|----------
Triples targeting YAGO2geo (GeoQuestions_c)                   | 1-895
Triples targeting YAGO2 + YAGO2geo (GeoQuestions_c)           | 896-1017
Triples with questions that contain mistakes (GeoQuestions_w) | 1018-1089
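
The ranges in the table above can be turned into a small lookup. The following is a minimal sketch, assuming that triples are identified by the 1-based position given in the table; the names `RANGES` and `part_of` are hypothetical and not part of the repository.

```python
from typing import List, Tuple

# (first id, last id, part, description), copied from the table above.
RANGES: List[Tuple[int, int, str, str]] = [
    (1, 895, "GeoQuestions_c", "targets YAGO2geo"),
    (896, 1017, "GeoQuestions_c", "targets YAGO2 + YAGO2geo"),
    (1018, 1089, "GeoQuestions_w", "question contains mistakes"),
]


def part_of(triple_id: int) -> Tuple[str, str]:
    """Return (part, description) for a 1-based triple id."""
    for first, last, part, description in RANGES:
        if first <= triple_id <= last:
            return part, description
    raise ValueError(f"triple id {triple_id} is outside 1-1089")


def part_size(part: str) -> int:
    """Number of triples whose range is labelled with the given part."""
    return sum(last - first + 1 for first, last, p, _ in RANGES if p == part)
```

For example, `part_of(900)` yields the GeoQuestions_c entry for YAGO2 + YAGO2geo, and the sizes computed from the ranges agree with the 1017 and 72 entries stated above.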

Categories

The questions of the dataset are split into 9 categories:

  A. Asking for a thematic or a spatial attribute of a feature, e.g., Where is Loch Goil located?
  B. Asking whether a feature is in a geospatial relation with another feature or features, e.g., Is Liverpool east of Ireland?
  C. Asking for features of a given class that are in a geospatial relation with another feature, e.g., Which counties border county Lincolnshire?
  D. Asking for features of a given class that are in a geospatial relation with any features of another class, e.g., Which churches are near castles?
  E. Asking for features of a given class that are in a geospatial relation with an unspecified feature of another class, where either or both of them are in a further geospatial relation with an explicitly specified feature, e.g., Which churches are near a castle in Scotland?
  F. As in categories C, D and E above, plus more thematic and/or geospatial characteristics of the features expected as answers, e.g., Which mountains in Scotland have height more than 1000 meters?
  G. Questions with quantities and aggregates, e.g., What is the total area of lakes in Monaghan? or How many lakes are there in Monaghan?
  H. Questions with superlatives or comparatives, e.g., Which is the largest island in Greece?
  I. Questions with quantities, aggregates, and superlatives/comparatives, e.g., Which city in the UK has the most hospitals?

Category | GeoQuestions_c | GeoQuestions_w
---------|----------------|---------------
A        | 173            | 16
B        | 139            | 11
C        | 176            | 14
D        | 22             | 1
E        | 138            | 6
F        | 24             | 2
G        | 174            | 11
H        | 145            | 9
I        | 26             | 2

You can read more about these categories in the paper.
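
As a quick consistency check, the per-category counts can be summed and compared with the sizes of the two parts. This is a minimal sketch; `CATEGORY_COUNTS` and `totals` are hypothetical names, and the numbers are copied from the category table above.

```python
from typing import Dict, Tuple

# category -> (questions in GeoQuestions_c, questions in GeoQuestions_w)
CATEGORY_COUNTS: Dict[str, Tuple[int, int]] = {
    "A": (173, 16),
    "B": (139, 11),
    "C": (176, 14),
    "D": (22, 1),
    "E": (138, 6),
    "F": (24, 2),
    "G": (174, 11),
    "H": (145, 9),
    "I": (26, 2),
}


def totals(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the two columns of the category table."""
    total_c = sum(c for c, _ in counts.values())
    total_w = sum(w for _, w in counts.values())
    return total_c, total_w
```

The column sums are 1017 and 72, matching the sizes of GeoQuestions_c and GeoQuestions_w, which add up to 1089.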

Current version of the dataset

The aforementioned paper describes version 1.0. The latest available version is 1.1.

Version 1.1 includes several enhancements:

  • Uniform query format and variable naming
  • Fixes in natural language capitalization
  • Corrections in query categorization
  • Replacement of stSPARQL functions with GeoSPARQL functions where applicable
  • Minor improvements in query correctness of existing queries
  • A few triples that were erroneous (resulting from incorrect file modifications and text editing) have been replaced by correct ones.

These updates ensure greater consistency and accuracy in the dataset, making it a more reliable resource for geospatial QA research.

Benchmark (Version 1.1)

We have used the dataset to evaluate GeoQA2 and the engine of Hamzei et al. The results of the evaluation are presented below:

GeoQA2

Combined Table: Evaluation of GeoQA2 over GeoQuestions_C and GeoQuestions_W

Category | Executable Queries (C) | Correct Answers (C) | Correct Answers*(1) (C) | Executable Queries (W) | Correct Answers (W) | Correct Answers*(1) (W)
---------|------------------------|---------------------|-------------------------|------------------------|---------------------|------------------------
A        | 83.81%                 | 50.86%              | 60.68%                  | 75.00%                 | 50.00%              | 66.67%
B        | 74.82%                 | 60.43%              | 80.76%                  | 81.81%                 | 45.45%              | 55.56%
C        | 81.25%                 | 45.45%              | 55.94%                  | 85.71%                 | 50.00%              | 58.34%
D        | 54.54%                 | 9.09%               | 16.67%                  | 100.00%                | 0.00%               | 0.00%
E        | 76.08%                 | 24.63%              | 32.38%                  | 50.00%                 | 33.33%              | 66.67%
F        | 58.33%                 | 25.00%              | 42.85%                  | 50.00%                 | 0.00%               | 0.00%
G        | 73.56%                 | 33.33%              | 45.31%                  | 36.36%                 | 0.00%               | 0.00%
H        | 66.89%                 | 18.62%              | 27.83%                  | 66.67%                 | 0.00%               | 0.00%
I        | 80.76%                 | 19.23%              | 23.80%                  | 50.00%                 | 0.00%               | 0.00%
Total    | 75.61%                 | 37.75%              | 49.93%                  | 68.05%                 | 30.55%              | 44.89%

(1) Correct Answers* is the percentage of correct answers calculated over the number of executable queries generated by the engine.
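
The starred column is the ratio of the other two columns. The sketch below shows the arithmetic, assuming that both input percentages are measured over the same set of questions; the function name `correct_over_executable` is hypothetical.

```python
def correct_over_executable(correct_pct: float, executable_pct: float) -> float:
    """Percentage of correct answers relative to executable queries.

    Both arguments are percentages over all questions of a category,
    so their ratio is the share of executable queries that were correct.
    """
    if executable_pct <= 0:
        raise ValueError("no executable queries, the ratio is undefined")
    return 100.0 * correct_pct / executable_pct


# Category A of GeoQA2 over GeoQuestions_c: 50.86% correct, 83.81% executable.
category_a = correct_over_executable(50.86, 83.81)
```

For category A this gives roughly 60.68%, in line with the table. Because the published percentages are already shortened to two decimals, values recomputed this way may differ slightly in the last digit for other rows.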

System of Hamzei et al.

Combined Table: Evaluation of the system of Hamzei et al. over GeoQuestions_C and GeoQuestions_W

Category | Executable Queries (C) | Correct Answers (C) | Correct Answers* (C) | Executable Queries (W) | Correct Answers (W) | Correct Answers* (W)
---------|------------------------|---------------------|----------------------|------------------------|---------------------|---------------------
A        | 82.08%                 | 23.12%              | 28.16%               | 93.75%                 | 6.25%               | 6.67%
B        | 94.96%                 | 53.23%              | 56.06%               | 100.00%                | 54.54%              | 54.54%
C        | 81.81%                 | 26.13%              | 31.94%               | 100.00%                | 14.28%              | 14.28%
D        | 81.81%                 | 4.54%               | 5.55%                | 100.00%                | 0.00%               | 0.00%
E        | 92.75%                 | 6.52%               | 7.03%                | 83.34%                 | 0.00%               | 0.00%
F        | 62.50%                 | 12.50%              | 20.00%               | 90.90%                 | 0.00%               | 0.00%
G        | 80.45%                 | 10.34%              | 12.85%               | 100.00%                | 0.00%               | 0.00%
H        | 77.93%                 | 26.89%              | 34.51%               | 77.78%                 | 0.00%               | 0.00%
I        | 84.61%                 | 7.96%               | 9.09%                | 50.00%                 | 0.00%               | 0.00%
Total    | 83.97%                 | 22.81%              | 27.28%               | 93.05%                 | 12.50%              | 13.43%

Additional benchmark results exist and we are working on publishing them. Until then, if you want to see more, please send a message to:

s[dot]kefalidis[at]di[dot]uoa[dot]gr

Tools

Materialization and Transpiler

To improve the time performance of query execution, we pre-computed and materialized certain relations between entities in the YAGO2geo KG.

The geospatial relations within, crosses, intersects and borders (and their extensions, e.g., overlaps and covers) are the most expensive to compute, whereas north, south, east and west are easily computed. Hence, we materialized the expensive relations.
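
The idea of materialization can be illustrated with a toy example. The sketch below is not the procedure used for YAGO2geo: it assumes features are plain axis-aligned bounding boxes, and the predicate name `ex:within` and the feature names are made up for illustration. It only shows the principle of computing a relation once and storing it as triples, so that later lookups need no geometry computation.

```python
from itertools import permutations
from typing import Dict, Set, Tuple

# Toy geometry: (min_x, min_y, max_x, max_y). Real features have
# arbitrary geometries, which is what makes the relations expensive.
Box = Tuple[float, float, float, float]
Triple = Tuple[str, str, str]


def box_within(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def materialize_within(features: Dict[str, Box]) -> Set[Triple]:
    """Compute every 'within' pair once and store it as a triple."""
    return {
        (a, "ex:within", b)  # hypothetical predicate name
        for a, b in permutations(features, 2)
        if box_within(features[a], features[b])
    }


features = {
    "city": (1.0, 1.0, 2.0, 2.0),
    "county": (0.0, 0.0, 5.0, 5.0),
    "lake": (4.0, 4.0, 6.0, 6.0),
}
triples = materialize_within(features)

# At query time the relation is a set lookup instead of a geometry test.
city_in_county = ("city", "ex:within", "county") in triples
```

The pairwise computation is quadratic in the number of features, but it is paid once, which is the trade-off materialization makes.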

To ease the transformation of GeoSPARQL/stSPARQL FILTERs into materialized triples, we have developed and made publicly available a transpiler that rewrites queries to use the materialized triples where possible.

To use the provided binary, run the command:

java -cp PATH/TO/GOST_EXECUTABLE gr.uoa.di.ai.Transpiler QUERY
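
When the transpiler is called from a script, the same command can be assembled as an argument list, which avoids shell quoting problems with multi-line queries. This is a minimal sketch: the helper names are hypothetical, and it assumes that the transpiler prints the rewritten query to standard output, which this README does not state.

```python
import subprocess
from typing import List


def build_transpiler_command(gost_executable: str, query: str) -> List[str]:
    """Mirror `java -cp PATH/TO/GOST_EXECUTABLE gr.uoa.di.ai.Transpiler QUERY`."""
    return ["java", "-cp", gost_executable, "gr.uoa.di.ai.Transpiler", query]


def run_transpiler(gost_executable: str, query: str) -> str:
    """Run the transpiler and return whatever it writes to stdout.

    Assumption: the rewritten query is written to standard output.
    Raises subprocess.CalledProcessError if java exits with an error.
    """
    completed = subprocess.run(
        build_transpiler_command(gost_executable, query),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the query as a single list element keeps its whitespace and quotes intact, since no shell is involved.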

RDF Store

To run the experiments and generate the answers for the gold and generated queries, we used GraphDB. Because GraphDB does not support stSPARQL functions, we have extended the GeoSPARQL plugin of GraphDB.

Notes

About the definition of near for distance calculations

We define near according to the concept (feature class) that the question refers to, as listed below. This is consistent with the definition of near in GeoQuestions201.

Near to      | Distance
-------------|---------
a City       | 5 km
a Town       | 5 km
a Bay        | 1 km
a Beach      | 1 km
a Forest     | 1 km
a Hotel      | 1 km
a Lake       | 1 km
a Landmark   | 1 km
a Village    | 1 km
a Restaurant | 500 m
a Park       | 500 m
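
The thresholds above can be kept in a lookup table and turned into a distance filter. The following is a minimal sketch: `geof:distance` and `uom:metre` are standard GeoSPARQL terms, but the helper names, the variable names, and the use of `<=` are assumptions and may differ from the gold queries of the dataset.

```python
from typing import Dict

# Thresholds from the table above, converted to metres.
NEAR_DISTANCE_METRES: Dict[str, int] = {
    "City": 5000,
    "Town": 5000,
    "Bay": 1000,
    "Beach": 1000,
    "Forest": 1000,
    "Hotel": 1000,
    "Lake": 1000,
    "Landmark": 1000,
    "Village": 1000,
    "Restaurant": 500,
    "Park": 500,
}


def near_filter(geom_a: str, geom_b: str, concept: str) -> str:
    """Build a GeoSPARQL FILTER expressing 'near' for the given concept.

    `geom_a` and `geom_b` are SPARQL variables bound to geometry literals.
    """
    if concept not in NEAR_DISTANCE_METRES:
        raise KeyError(f"no 'near' distance defined for concept {concept!r}")
    metres = NEAR_DISTANCE_METRES[concept]
    return f"FILTER(geof:distance({geom_a}, {geom_b}, uom:metre) <= {metres})"
```

For instance, `near_filter("?g1", "?g2", "City")` produces a filter with a 5000 metre threshold.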

Prefixes used in GeoQuestions1089:

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX yago: <http://yago-knowledge.org/resource/>
PREFIX y2geor: <http://kr.di.uoa.gr/yago2geo/resource/>
PREFIX y2geoo: <http://kr.di.uoa.gr/yago2geo/ontology/>
PREFIX strdf: <http://strdf.di.uoa.gr/ontology#>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
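
When queries are processed programmatically, it can help to keep these prefixes in one place. The sketch below is one possible way to do that; the helper names `prefix_header` and `expand` are hypothetical, and the mapping simply mirrors the declarations listed above.

```python
from typing import Dict

PREFIXES: Dict[str, str] = {
    "geo": "http://www.opengis.net/ont/geosparql#",
    "geof": "http://www.opengis.net/def/function/geosparql/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "yago": "http://yago-knowledge.org/resource/",
    "y2geor": "http://kr.di.uoa.gr/yago2geo/resource/",
    "y2geoo": "http://kr.di.uoa.gr/yago2geo/ontology/",
    "strdf": "http://strdf.di.uoa.gr/ontology#",
    "uom": "http://www.opengis.net/def/uom/OGC/1.0/",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def prefix_header() -> str:
    """Render the PREFIX declarations to prepend to a query body."""
    return "\n".join(f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items())


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'geof:distance' into a full IRI."""
    prefix, separator, local = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local
```

A full query string can then be assembled as `prefix_header() + "\n" + body`, where `body` is the SELECT or ASK part of the query.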

Team & Authors

This is a research project by the AI-Team of the Department of Informatics and Telecommunications at the University of Athens.

Funding

This project is being/has been funded in the context of:

  • the first call for H.F.R.I. Research Projects to support faculty members and researchers and the procurement of high-cost research equipment grant (HFRI-FM17-2351)
  • the ESA project DA4DTE (subcontract 202320239)
  • the Horizon 2020 project AI4Copernicus (GA No. 101016798)
  • the Marie Skłodowska-Curie project QuAre (GA No. 101032307)
               

License

Released under the CC0 Attribution 4.0 International license (see LICENSE).

Copyright © 2024 AI-Team, University of Athens