## GeoQuestions1089 v1.1

GeoQuestions1089 is a geospatial question-answering dataset of 1089 triples, each consisting of a geospatial question, its answer, and the corresponding SPARQL/GeoSPARQL query targeting YAGO2geo. It is currently the largest geospatial QA benchmark and is freely available to the research community.

### Version 1.1 Improvements

Version 1.1 includes several enhancements:
- Uniform query format and variable naming
- Fixes in natural language capitalization
- Corrections in query categorization
- Replacement of stSPARQL functions with GeoSPARQL functions where applicable
- Minor improvements in query correctness

These updates ensure greater consistency and accuracy in the dataset, making it a more reliable resource for geospatial QA research.
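To make the function replacement listed above more concrete, here is a minimal sketch that renames a few stSPARQL function calls to GeoSPARQL ones. The mapping table and the helper `to_geosparql` are assumptions made for illustration, not the procedure used to produce version 1.1. Functions missing from the table are left untouched, which is one possible reading of "where applicable".

```python
import re

# Assumed correspondences, for illustration only; not taken from the dataset.
STSPARQL_TO_GEOSPARQL = {
    "strdf:within": "geof:sfWithin",
    "strdf:contains": "geof:sfContains",
    "strdf:intersects": "geof:sfIntersects",
}


def to_geosparql(query: str) -> str:
    """Rename the stSPARQL function calls listed in the mapping above."""
    for old, new in STSPARQL_TO_GEOSPARQL.items():
        # Match the name only when it starts a token and is used as a call.
        pattern = r"(?<![\w:])" + re.escape(old) + r"(?=\s*\()"
        query = re.sub(pattern, new, query)
    return query


# Hypothetical filter expressions, used only to exercise the helper.
print(to_geosparql("FILTER(strdf:within(?g1, ?g2))"))
print(to_geosparql("BIND(strdf:area(?g1) AS ?a)"))  # not in the table, unchanged
```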

### Packages

In [2]:
import json
import subprocess
import pandas as pd

from tqdm import tqdm
from collections import OrderedDict

### Prefixes

In [33]:
PREFIXES = """PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX yago: <http://yago-knowledge.org/resource/>
PREFIX y2geor: <http://kr.di.uoa.gr/yago2geo/resource/>
PREFIX y2geoo: <http://kr.di.uoa.gr/yago2geo/ontology/>
PREFIX strdf: <http://strdf.di.uoa.gr/ontology#>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
"""

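Because the prefix block is prepended to every query, a query that uses a prefix missing from the block will not parse. The sketch below shows one way to spot such prefixes before running anything. `undeclared_prefixes` is a hypothetical helper, the regular expressions are a rough heuristic rather than a SPARQL parser, and the prefix block is shortened to two entries to keep the example small.

```python
import re

# Shortened stand-in for the PREFIXES block above (two of its entries).
PREFIXES_SAMPLE = """PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
"""


def undeclared_prefixes(query: str, prefixes: str) -> set:
    """Return prefixes that appear in `query` but are not declared in `prefixes`."""
    declared = set(re.findall(r"PREFIX\s+(\w+):", prefixes))
    # Heuristic: a name followed by ':' and a letter or underscore is a prefixed name.
    used = set(re.findall(r"\b([A-Za-z]\w*):[A-Za-z_]", query))
    return used - declared


# Hypothetical query body, used only for the demonstration.
query = "SELECT ?g WHERE { ?s geo:hasGeometry ?g . FILTER(strdf:within(?g, ?h)) }"
print(undeclared_prefixes(query, PREFIXES_SAMPLE))
```

With the full prefix block from the cell above, the same query would report no undeclared prefixes, since `strdf` is declared there.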
### Materialization helper

In [34]:
GOST_EXECUTABLE = "../GoST/GoST-1.0-SNAPSHOT.jar"


def gost_materialize_query(query: str) -> str:
    """
    Replace expensive geospatial functions with their materialized triple equivalents.
    The PREFIXES block is prepended to the given query before it is transpiled.

    Returns:
        The materialized query as a string if the given query is valid,
        or an empty string if the given query is invalid.
    """
    # Run the GoST transpiler, passing the complete query as a single argument.
    result = subprocess.run(
        ["java", "-cp", GOST_EXECUTABLE, "gr.uoa.di.ai.Transpiler", PREFIXES + query],
        capture_output=True,
    )
    return result.stdout.decode("utf-8")  # stdout is bytes; decode to str
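The helper above signals an invalid query with an empty string and discards everything else the process reports. When debugging a rejected query it can help to keep the exit code and stderr as well. The following is a minimal sketch of that idea, not part of the notebook: `run_capture` is a hypothetical name, and it is an assumption that the transpiler writes useful diagnostics to stderr. A Python one-liner stands in for the Java command so the example runs without GoST.

```python
import subprocess
import sys


def run_capture(cmd: list, timeout_s: float = 60.0) -> tuple:
    """Run a command and return (exit code, stdout, stderr) as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.returncode, result.stdout, result.stderr


# Stand-in command: prints nothing to stdout and writes a message to stderr,
# mimicking a tool that rejects its input.
code, out, err = run_capture(
    [sys.executable, "-c", "import sys; sys.stderr.write('parse error')"]
)
if out == "":
    print("no output; stderr says:", err)
```

The `timeout` argument makes `subprocess.run` raise `subprocess.TimeoutExpired` if the process hangs, which is useful when validating over a thousand queries in a loop.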

### Categories

In [39]:
categories = OrderedDict()

# GeoQuestions1089_c
for key in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']:
    categories[key] = 0
with open('../GeoQuestions1089.json') as geoq:
    data = json.load(geoq)
    for i in range(1, 1018):
        category = data[str(i)]['Category']
        categories[category] += 1
print("GeoQuestions1089_c:\n\t", categories)

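# GeoQuestions1089_w
# Entries 1018-1089 carry no category letter of their own: the part of
# 'Category' after the colon is the ID of another entry whose category is reused.
for key in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']:
    categories[key] = 0  # reset the counters from the previous pass
with open('../GeoQuestions1089.json') as geoq:
    data = json.load(geoq)
    for i in range(1018, 1090):
        category_same_as = data[str(i)]['Category'].split(':')[1]
        category = data[category_same_as]['Category']
        categories[category] += 1
print("GeoQuestions1089_w:\n\t", categories)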

GeoQuestions1089_c:
	 OrderedDict([('A', 173), ('B', 139), ('C', 176), ('D', 22), ('E', 138), ('F', 24), ('G', 174), ('H', 145), ('I', 26)])
GeoQuestions1089_w:
	 OrderedDict([('A', 16), ('B', 11), ('C', 14), ('D', 1), ('E', 6), ('F', 2), ('G', 11), ('H', 9), ('I', 2)])
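As a quick sanity check on the output above, the per-category counts should add up to the sizes of the two ID ranges (1-1017 and 1018-1089). A short sketch, with the counts copied from the printed dictionaries:

```python
# Counts copied from the cell output above.
counts_c = {'A': 173, 'B': 139, 'C': 176, 'D': 22, 'E': 138,
            'F': 24, 'G': 174, 'H': 145, 'I': 26}
counts_w = {'A': 16, 'B': 11, 'C': 14, 'D': 1, 'E': 6,
            'F': 2, 'G': 11, 'H': 9, 'I': 2}

total_c = sum(counts_c.values())
total_w = sum(counts_w.values())

# IDs 1-1017 form the first subset, IDs 1018-1089 the second.
assert total_c == 1017
assert total_w == 72
assert total_c + total_w == 1089
print(total_c, total_w, total_c + total_w)  # 1017 72 1089
```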


### Check query validity

In [40]:
valid_query = "SELECT * WHERE { ?s ?p ?o }"
invalid_query = "SELECT * WHERE { ?s ?p "

assert gost_materialize_query(valid_query) != ''
assert gost_materialize_query(invalid_query) == ''

# Check the validity of queries 1-1017.
with open('../GeoQuestions1089.json') as geoq:
    data = json.load(geoq)
    for i in tqdm(range(1, 1018)):
        query = data[str(i)]['Query']
        materialized = gost_materialize_query(query)
        if materialized == '':
            print("Invalid query:", i)

# Queries 1018-1089 refer to the previous queries, see the dataset file.

100%|██████████| 1017/1017 [06:07<00:00,  2.77it/s]
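The loop above prints each invalid ID as it is found. If the IDs are needed afterwards, for example to re-run only the failing queries, they can be collected into a list instead. The sketch below takes the materializer as a parameter so it can be tried without GoST. `find_invalid_ids`, the toy data, and `fake_materialize` are all hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, Iterable, List


def find_invalid_ids(
    data: Dict[str, dict],
    ids: Iterable[int],
    materialize: Callable[[str], str],
) -> List[int]:
    """Return the IDs whose query makes `materialize` return an empty string."""
    return [i for i in ids if materialize(data[str(i)]["Query"]) == ""]


# Toy data mirroring the two queries used in the assertions above.
toy_data = {
    "1": {"Query": "SELECT * WHERE { ?s ?p ?o }"},
    "2": {"Query": "SELECT * WHERE { ?s ?p "},
}


def fake_materialize(query: str) -> str:
    """Stand-in for GoST (NOT a real validator): rejects unbalanced braces."""
    return query if query.count("{") == query.count("}") else ""


print(find_invalid_ids(toy_data, range(1, 3), fake_materialize))  # [2]
```

In the notebook itself the call would presumably look like `find_invalid_ids(data, range(1, 1018), gost_materialize_query)`.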


### Export CSV file

In [14]:
with open('../GeoQuestions1089.json') as geoq:
    data = json.load(geoq)
with open('../GeoQuestions1089_answers.json') as geoq_answers:
    data_answers = json.load(geoq_answers)
    
questions = []
queries = []
answers = []
for i in tqdm(range(1, 1090)):
    questions.append(data[str(i)]['Question'])
    queries.append(data[str(i)]['Query'])
    answers.append(data_answers[str(i)])
    
df = pd.DataFrame({"Question": questions, "Query": queries, "Answer": answers})
df.to_csv('../GeoQuestions1089.csv', index=False)

100%|██████████| 1089/1089 [00:00<00:00, 822249.70it/s]
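After exporting, it can be worth confirming that the file has the expected header and one row per entry. The sketch below uses only the standard library, so it is independent of pandas. `check_export` is a hypothetical helper, and the single data row is made up for the demonstration.

```python
import csv
import io

EXPECTED_COLUMNS = ["Question", "Query", "Answer"]


def check_export(csv_file, expected_rows: int) -> None:
    """Raise ValueError if the header or the row count is not as expected."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = sum(1 for _ in reader)
    if n_rows != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {n_rows}")


# Toy stand-in for the exported file, built in memory.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(EXPECTED_COLUMNS)
writer.writerow(["Is X within Y, or not?", "ASK { ?s ?p ?o }", "yes"])
buffer.seek(0)

check_export(buffer, expected_rows=1)  # passes silently
```

For the real file, the equivalent call would open `'../GeoQuestions1089.csv'` with `newline=''` (so quoted multi-line queries are read correctly) and pass `expected_rows=1089`.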
