This repository contains SLUICE (SPARQL for Learning and Understanding Interrupted Customer Enquiries), a corpus of 21,000 artificially interrupted questions paired with their underspecified SPARQL queries. The corpus was originally created for a paper published at CUI 2023.
The paper, "Understanding and Answering Incomplete Questions" by Angus Addlesee and Marco Damonte, can be found here.
If you use this repository in your work, please cite us:
Harvard:
Addlesee, A. and Damonte, M., 2023. Understanding and Answering Incomplete Questions. Proceedings of the 5th Conference on Conversational User Interfaces (CUI).
BibTeX:
@inproceedings{addlesee2023understanding,
title={Understanding and Answering Incomplete Questions},
author={Addlesee, Angus and Damonte, Marco},
booktitle={Proceedings of the 5th Conference on Conversational User Interfaces (CUI)},
year={2023}
}
In the corpus directory, you will find the corpus train and test set. Each instance in the SLUICE corpus contains the chopped question with the underspecified SPARQL query, the original question with the full SPARQL query, and the metadata from the chopping (NER tag, chopped entity name, and chopped entity Wikidata ID).
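To make the fields concrete, a single instance might look like the following. This is a hypothetical illustration: the key names and example values are invented for this sketch and are not the corpus's actual schema.

```python
# A hypothetical SLUICE instance. The fields follow the description above,
# but the key names and example values are illustrative only, not the
# corpus's actual schema.
instance = {
    "original_question": "Who is the mayor of Paris?",
    "full_sparql": "SELECT ?x WHERE { wd:Q90 wdt:P6 ?x . }",
    "chopped_question": "Who is the mayor of",
    "underspecified_sparql": "SELECT ?x WHERE { ?unknown wdt:P6 ?x . }",
    "metadata": {
        "ner_tag": "LOC",           # NER tag of the chopped entity
        "chopped_entity": "Paris",  # entity text removed from the question
        "wikidata_id": "Q90",       # Wikidata ID of the chopped entity
    },
}
```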
The metadata directory contains variations of the corpus discussed in the paper that were ultimately rejected.
This work used SPARQL, a query language that is executed over RDF knowledge graphs. We published a follow-up paper (using AMR) at Interspeech, titled "Understanding Disrupted Sentences Using Underspecified Abstract Meaning Representation", which can be found here.
In order to create similar corpora, you need to follow these general steps:
- You need a graph representation, such as RDF or AMR, and a corpus of sentences paired with their graph meaning representations.
- There must be some way to align the text's words to the parts of the graph that those words contribute to. Here is Figure 1 from our AMR paper:
- With the RDF in our SPARQL queries, we used the nodes' textual labels to fuzzy-match with words in the sentence (matching "Beyonce" and "Beyoncé", for example). Alternatively, you can use the output of an alignment model, as shown in the image above.
- Once a match is found, remove the matching text from the question or sentence, then underspecify the graph MRL by deleting the nodes and edges that the removed text provided.
- The removed content should be replaced with some UNKNOWN tag or variable (we used the ?unknown variable in RDF).
- The removed text and replaced graph content can now be paired to represent the completion turn. For example, "Europe" and its graph content in RDF.
- You now have a collection of incomplete questions and their underspecified MRLs, with their corresponding completions (and the MRL of each completion).
- The new task is to parse the incomplete question, and then parse the completion with your semantic parser. The (hopefully) generated UNKNOWN tag must then be replaced by the completion parse to conjoin the representations. The resulting parse is correct if the UNKNOWN tag was in the correct location in the MRL structure. Naturally, both parses also have to be correct to accurately reconstruct the original question's meaning.
- This can be done with basically any graph MRL corpus, and evaluation results should be compared to the original task's SotA system when given the full original question or sentence (as this is the upper bound). So in the SLUICE corpus, the primary evaluation metric is the correct answer rate. In the AMR paper, it is Smatch.
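The recipe above can be sketched end to end on a toy example. This is a minimal illustration, not the paper's implementation: the word-level matching, the similarity threshold, and all helper names are assumptions made for this sketch.

```python
import difflib


def fuzzy_match(label: str, word: str, threshold: float = 0.8) -> bool:
    """Fuzzy-match a graph node's textual label against a question word
    (e.g. "Beyonce" vs "Beyoncé"). Threshold is an illustrative choice."""
    ratio = difflib.SequenceMatcher(None, label.lower(), word.lower()).ratio()
    return ratio >= threshold


def chop(question: str, sparql: str, label: str, node_id: str):
    """Remove the words matching a node's label from the question, and
    underspecify the query by replacing that node with ?unknown."""
    kept = [w for w in question.split() if not fuzzy_match(label, w.strip("?,."))]
    chopped_question = " ".join(kept)
    underspecified = sparql.replace(node_id, "?unknown")
    completion = (label, node_id)  # the completion turn: text + its graph content
    return chopped_question, underspecified, completion


def reconstruct(underspecified: str, completion_node: str) -> str:
    """Conjoin the two parses: substitute the completion back for ?unknown."""
    return underspecified.replace("?unknown", completion_node)


# Toy question/query pair (Paris is wd:Q90; wdt:P6 is "head of government").
question = "Who is the mayor of Paris?"
full_query = "SELECT ?x WHERE { wd:Q90 wdt:P6 ?x . }"

chopped_q, under, (text, node) = chop(question, full_query, "Paris", "wd:Q90")
# chopped_q == "Who is the mayor of"
# under     == "SELECT ?x WHERE { ?unknown wdt:P6 ?x . }"
assert reconstruct(under, node) == full_query
```

Replacing the node identifier with `?unknown` is what makes the query underspecified but still executable structure: reconstruction is then just the inverse substitution, which is why correctness reduces to whether the parser put the UNKNOWN tag in the right place.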