A Neural QA Model for DBpedia using Neural SPARQL Machines.

Neural SPARQL Machines: Attention-based Transformer with Automatic Templates Generation


An Attention-based Transformer for Neural Question-Answering on Knowledge Graphs, via a Machine Translation Approach, with an Automatic Templates Generator from Long Text.

This project is Stuart Chen's research for Google Summer of Code (GSoC) 2019, in collaboration with DBpedia and the AKSW Research Group.

Here is the blog that documents the research progress.

Natural Language DBpedia. Semantic triple flag.


Dependencies


The Pipeline of Architecture

The workflow and the model architecture are shown below:

architecture and pipeline

  • To begin with, install the dependencies listed in requirements.txt. Before running any script, make sure that this repository folder has been added to the system path $PYTHONPATH. You also need to download the spaCy model en-core-web-sm==2.1.0; see the instructions on the official spaCy page. Finally, create a folder called 'glove2wordvec' in 'neural-qa/data' and put the word2vec file into it.
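The setup requirements above can be checked before running anything. The sketch below is only an illustration: the function name `check_setup` is hypothetical and not part of the repository, and it verifies just the two conditions that can be tested from the file system (the repository folder being on $PYTHONPATH and the 'glove2wordvec' folder being present and non-empty).

```python
import os
from pathlib import Path


def check_setup(repo_root, pythonpath=None):
    """Return a list of problems found in the local setup (hypothetical helper)."""
    repo_root = Path(repo_root).resolve()
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    problems = []

    # The repository folder is expected to be listed in $PYTHONPATH.
    entries = [Path(p).resolve() for p in pythonpath.split(os.pathsep) if p]
    if repo_root not in entries:
        problems.append("repository folder is not on PYTHONPATH")

    # The word-vector file is expected under data/glove2wordvec.
    vec_dir = repo_root / "data" / "glove2wordvec"
    if not vec_dir.is_dir():
        problems.append("missing folder: " + str(vec_dir))
    elif not any(vec_dir.iterdir()):
        problems.append("folder is empty: " + str(vec_dir))

    return problems
```

Calling `check_setup(".")` from the neural-qa folder returns an empty list when both conditions hold.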

1. Automatic Templates Generator

This component automates template generation from long text, with the help of Universal Sentence Encoder, DBpedia-Spotlight, DBpedia-Lookup, NLTK, and spaCy.

1.1. Extraction Of Wiki pages and article pre-processing

We need abundant natural language text to obtain more questions that match DBpedia RDF triples, which we then transform into templates.

For example, to get the articles about Barack Obama (dbr:Barack_Obama), set DBR_NAME=Barack_Obama, then run

neural-qa/templates_generator> python questions_generate_main.py --dbo_class=$DBR_NAME

Here, the variable $DBR_NAME should be a specific entity, such as Barack_Obama.

The script will automatically create a Bank directory in the neural-qa/data/ folder to save the articles.

1.2. Filtering of the sentences in articles to match the DBpedia triple RDFs

The script sentences_filter.py selects the sentences that are pertinent to the RDF triples we need.

1.3. Convert sentences containing DBpedia entities to questions with placeholders

The script question_convertor.py converts the selected sentences into template questions with entity placeholders.

    e.g. She was born in France? --> where <A> was born in ?
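The placeholder substitution can be illustrated with a minimal sketch. It assumes the entity's surface form is already known (for instance from an entity linker) and only shows the replacement step; `to_template` is a hypothetical name, not the function used in question_convertor.py, and turning a statement into a question is not covered here.

```python
import re


def to_template(sentence, entity_label, placeholder="<A>"):
    """Replace every whole-word mention of entity_label with a placeholder."""
    # Lookarounds keep the match from firing inside a longer word.
    pattern = r"(?<!\w)" + re.escape(entity_label) + r"(?!\w)"
    return re.sub(pattern, lambda match: placeholder, sentence, flags=re.IGNORECASE)


print(to_template("where was Barack Obama born ?", "Barack Obama"))
# where was <A> born ?
```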

1.4. Matching these questions against the template questions in existing template sets with Universal Sentence Encoder

The script sentence_encoder.py builds on the implementation of Universal Sentence Encoder[1], which is efficient at semantic sentence matching. It checks whether an existing template already corresponds to the new question.

1.5. If the similarity score does not pass the threshold, the questions go to the query composing part
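The matching step, together with the threshold check named in 1.5, can be sketched as a cosine-similarity lookup. This is an illustration under assumptions: the toy vectors stand in for sentence embeddings, the names `cosine` and `best_match` are hypothetical, and the threshold value 0.8 is made up for the example rather than taken from the project.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(question_vec, template_vecs, threshold=0.8):
    """Return (template, score) for the closest template, or (None, score) below threshold."""
    best_name, best_score = None, -1.0
    for name, vec in template_vecs.items():
        score = cosine(question_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        # No existing template is close enough: the question needs a new query.
        return None, best_score
    return best_name, best_score


# Toy 2-dimensional vectors standing in for real sentence embeddings.
templates = {
    "where was <A> born ?": [1.0, 0.0],
    "who is the spouse of <A> ?": [0.0, 1.0],
}
match, score = best_match([0.9, 0.1], templates)
```

A question whose best score stays below the threshold returns `None`, which corresponds to handing it over to the query composing part.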

To use the pipeline, please run templates_generate_main.py after step 1 above:

python templates_generate_main.py --dbo_class=$DBO_CLASS  --temps_fpath=$EXISTING_TEMPLATES_FILE_PATH  --text_fpath=$TEXT_FILE_PATH  --ntriple_fpath=$NTRIPLES_FILE_PATH  --train_vec=$WHETHER_TO_TRAIN_THE_VECTOR  --vecpath=$FILE_PATH_THAT_SAVES_VECTORS   --temp_save_path=$FILE_PATH_SAVING_RESULTS 

which will automatically initiate the pipeline.

Please have a look at the parameters:

    1. For --dbo_class=$DBO_CLASS, $DBO_CLASS should be an ontology category, such as Person, Monument, etc.
    2. For --temps_fpath=$EXISTING_TEMPLATES_FILE_PATH, $EXISTING_TEMPLATES_FILE_PATH should be the file path to the template set for the DBpedia entity resource (dbr); for example, for Barack_Obama we should use the template set for Person.
    3. For --text_fpath=$TEXT_FILE_PATH, $TEXT_FILE_PATH should be the text article extracted from the Wikipedia page.
    4. For --ntriple_fpath=$NTRIPLES_FILE_PATH, it should be the ntriple file.
    5. For --train_vec=$WHETHER_TO_TRAIN_THE_VECTOR, the default is to use the prepared vectors; if you want, you can set it to True, which trains the vectors with Universal Sentence Encoder.
    6. For --vecpath=$FILE_PATH_THAT_SAVES_VECTORS, it is the file path where the vectors are stored.
    7. For --temp_save_path=$FILE_PATH_SAVING_RESULTS, please set the file path where you want to save the newly generated template set.

    To find the automatically saved ntriple files and text files, go to neural-qa/data/Bank/DBresources/, where you will see a folder corresponding to the entity's ontology category. For example, Barack_Obama is in the category Person, so the folder is neural-qa/data/Bank/DBresources/Person/Barack_Obama, and the ntriple file and the text file can be found there.

  • One result of our work can be seen here; it helps clarify the structure of the Templates Bank directory, with the output results inside Bank\DBresourses\Person\Barack_Obama.

For example, to run the program for dbr_Barack_Obama, use the command below:

neural-qa/templates_generator> python templates_generate_main.py  --dbo_class=Person --temps_fpath=../data/annotations_Person.csv --text_fpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.txt --ntriple_fpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.ntriples  --vecpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.vectors   --temp_save_fpath=../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.template.csv
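Since all per-entity files in the command above follow the same naming pattern, the paths can be derived from the ontology class and the entity name. The helper below is a sketch only; `entity_paths` is a hypothetical name, and the default `bank_root` simply copies the directory spelling used in the example command.

```python
from pathlib import PurePosixPath


def entity_paths(dbo_class, entity, bank_root="../data/Bank/DBresourses"):
    """Build the per-entity file paths used in the example command (hypothetical helper)."""
    base = PurePosixPath(bank_root) / dbo_class / entity
    return {
        "text": str(base / (entity + ".txt")),
        "ntriples": str(base / (entity + ".ntriples")),
        "vectors": str(base / (entity + ".vectors")),
        "templates": str(base / (entity + ".template.csv")),
    }


paths = entity_paths("Person", "Barack_Obama")
print(paths["text"])
# ../data/Bank/DBresourses/Person/Barack_Obama/Barack_Obama.txt
```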

1.6. Guidance


2. Transformer

The implementation of this neural transformer part is inspired by the paper Attention Is All You Need[2] and its official TensorFlow model[3].

Attention Is All You Need. Figure 1
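For readers unfamiliar with the paper, its core operation is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The following is a plain-Python sketch of that formula for small lists of vectors; it is not the code of this repository or of the official TensorFlow model, and it leaves out multi-head projection, masking, and batching.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```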

2.1. Data Preparation

2.1.1. To generate the data

We use the templates in CSV format provided by SPARQL as a Foreign Language[4] to generate the training data for the experiments.

The generated data consists of two parts: data.en, the source data, and data.sparql, the target data.

The file data.en contains the natural language questions, with RDF entities annotated, that are to be translated into the RDF query language SPARQL, as in this example:

    "who is the spouse of dbr_Barack_Obama ?"
    "who is the partner of dbr_Audrey_Hepburn ?"
    ...
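How a template turns into such a training pair can be sketched as a simple placeholder substitution. The sketch rests on assumptions: `instantiate` is a hypothetical name, the `<A>` placeholder follows the template-question example in section 1.3, and the target side is shown as plain SPARQL for readability, which may differ from the encoding that generator.py actually writes to data.sparql.

```python
def instantiate(question_template, query_template, entity, placeholder="<A>"):
    """Fill one entity into a (question, query) template pair (hypothetical helper)."""
    # Source side: entities appear as dbr_<name>, as in the data.en examples.
    question = question_template.replace(placeholder, "dbr_" + entity)
    # Target side: plain prefixed SPARQL, used here for illustration only.
    query = query_template.replace(placeholder, "dbr:" + entity)
    return question, query


question, query = instantiate(
    "who is the spouse of <A> ?",
    "select ?x where { <A> dbo:spouse ?x }",
    "Barack_Obama",
)
print(question)
# who is the spouse of dbr_Barack_Obama ?
```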

To begin with, please run the data generation:

  • This command must be run in Python 2.7, since it comes from the previous project.
cd  neural-qa/
mkdir data/QALD7
neural-qa> python generator.py  --transformer=True  --templates data/QALD-7.csv  --output data/QALD7

After that, a script converts the data into a training set and a validation set and builds the vocabulary. First, change to the transformer folder:

cd neural-qa/transformer_atten/transformer

Then, make a folder named 'data' in the transformer folder, and a folder QALD7 inside it. Copy the generated data files into the ./data/QALD7/ folder and run:

neural-qa/transformer_atten/transformer> python data_preprocess.py --data_dir=./data/QALD7

Once the data has been pre-processed, the vocabulary file built, and the data split into a training set and a validation set, the model can be trained:

neural-qa/transformer_atten/transformer> python transformer_main.py --data_dir=./data/QALD7/DATA_DIR --model_dir=./data/QALD7/model_QALD7   --vocab_file=./data/QALD7/vocab.en_sparql   --param_set=big 
  • Please make sure the folders and paths that have been set in the commands already exist.
  • We strongly encourage using a previously generated dataset, which can be found here. Put it in neural-qa/transformer_atten/transformer, then decompress the zipped file and put its contents in neural-qa/transformer_atten/transformer/data/QALD7.
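The train/validation split mentioned above has one important property: the source and target lines must stay aligned. The sketch below shows that idea only; `split_parallel` is a hypothetical name, the 10% validation share is an arbitrary example value, and the actual data_preprocess.py may split and encode the data differently.

```python
import random


def split_parallel(sources, targets, dev_fraction=0.1, seed=0):
    """Shuffle aligned (source, target) lines and split them into train and dev pairs."""
    if len(sources) != len(targets):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(sources, targets))
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(pairs)
    n_dev = int(len(pairs) * dev_fraction)
    return pairs[n_dev:], pairs[:n_dev]
```

Here `sources` would be the lines of data.en and `targets` the lines of data.sparql, read in the same order.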

2.2. Model Training

To run the training, please note the parameters to set:

  • Please put all the tfrecord files in neural-qa/transformer_atten/transformer/data/QALD7/DATA_DIR/ to prevent issues at run time.
   PARAM_SET=big
   DATA_DIR=$path/to/the/data
   MODEL_DIR=$path/to/your/model
   VOCAB_FILE=$DATA_DIR/vocab.en_sparql
  • As a side note, please make sure the generated training data is put in a folder that contains nothing but the data. The generated tfrecords should go into a DATA_DIR folder in transformer/data/QALD7; otherwise tf.errors.DataLossError might be raised. The model is prone to problems with thread handling and corrupted-data errors. The only workable solution we know of is to put the tfrecords in a separate folder and to make sure that read/write access to the files inside is granted.
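The conditions in this note can be checked before training starts. The following sketch is an assumption-laden illustration: `check_data_dir` is a hypothetical helper, and because the naming of the tfrecord files is not stated here, it only flags sub-folders, empty files, and files without read/write access instead of trying to recognise tfrecords by name.

```python
import os
from pathlib import Path


def check_data_dir(data_dir):
    """Return a list of problems that might disturb reading the training data."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return ["not a directory: " + str(data_dir)]
    problems = []
    entries = sorted(data_dir.iterdir())
    if not entries:
        problems.append("directory is empty")
    for entry in entries:
        if entry.is_dir():
            # The data folder should contain the data files and nothing else.
            problems.append("unexpected sub-folder: " + entry.name)
        elif entry.stat().st_size == 0:
            problems.append("empty file: " + entry.name)
        elif not os.access(entry, os.R_OK | os.W_OK):
            problems.append("no read/write access: " + entry.name)
    return problems
```

An empty result for `check_data_dir("./data/QALD7/DATA_DIR")` does not guarantee that the records are valid; it only rules out the obvious causes listed above.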

In our experiment, we use the command below:

python transformer_main.py --data_dir=./data/QALD7/DATA_DIR --model_dir=./data/QALD7/model_QALD7   --vocab_file=./data/QALD7/vocab.en_sparql   --param_set=big 
  • For more instructions, please refer to the official model.

  • NOTE: Since the current official TensorFlow model still has a potential issue, we strongly recommend training it on CPU, or checking the CUDA environment first, in case the memory runs out and the threads get killed.

2.3. Model Results

  • Training Time

Training Time.

  • Loss

    This shows the cross-entropy loss while training:

loss.

  • GERBIL Evaluation

    The table shows the evaluation result for the QALD-7 benchmark:

    GERBIL is an online platform for evaluating question answering by F1-score with a confusion matrix, and this table shows the answering accuracy of the model's output.

Evaluation. For better comparison, we have a blog about the results of QALD evaluation of original NSpM model.
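As a reminder of what the F1-score measures, the sketch below computes precision, recall, and F1 for a single question by comparing the set of gold answers with the set of predicted answers. It is a generic illustration with a hypothetical function name; it does not reproduce how GERBIL averages over questions or treats empty answer sets.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 of a predicted answer set against the gold answer set."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, with gold answers {a, b} and predicted answers {a, c}, one of two predictions is correct and one of two gold answers is found, so precision, recall, and F1 are all 0.5.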


Summary

I am so glad to have had this experience this summer with my excellent mentors. I gained a lot of knowledge during our research in natural language processing, knowledge graphs, and deep learning. It has deepened my thinking about scientific research and ignited an unquenchable curiosity about artificial intelligence.

So here, I want to talk about our project. We use long natural language text to generate the templates because we know how important the templates are in training the neural SPARQL machine on the knowledge graph.

Also, we tried to employ the state-of-the-art model, the attention-based Transformer, as the learner that translates natural language questions into SPARQL queries.

What's more, we want to make the system a never-ending-learner, like the Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases[5], to keep the long loop of accumulating knowledge. I believe this is a crucial key to artificial general intelligence.

Evolution of the Project

In the beginning, that is, in the initial proposal, we wanted to use DBpedia embeddings for the SQuAD machine reading comprehension tasks with reinforcement learning. Gradually we realized that the performance of the neural SPARQL machine depends heavily on the training data, which makes it crucial to automate template generation from long contextual passages. Wikipedia is a wonderful source of articles relevant to DBpedia RDF triples, so we decided to evolve an intelligent neural SPARQL machine with automated template generation, comparison, and accumulation, as an attempt to approach a never-ending-learning intelligent agent.

Of course, during the coding we encountered many difficulties, such as running the benchmark evaluations and some tough impediments, but looking back at these problems, I think they made me grow thoroughly. I learned more and more about the newest products in the industry and became more familiar with international coding standards, which opened the door to a bigger world for me. For example, to calculate the vector similarity for matching existing templates, we first used word mover's distance with GloVe vectors via gensim, but we found it too heavy and too slow; then we used spaCy and found it much faster. Soon after that, we found that the Universal Sentence Encoder is even better at this task, which was a huge step in our development.

Another thing that I still remember is the paraphrasing of the predicates. We originally planned to load all the phrases into RAM and do the matching there, and I still remember that the file was huge, more than 17.6 GB. Then I found that WordNet from NLTK can accomplish this paraphrasing task without such a huge cost, which is a smart solution.

Future Works

We hope to continue the work on making the question generation even better, and on including ASK queries, queries that require a filter (how many, how much, etc.), and complex queries as well, because we believe this can further improve the performance of the neural SPARQL machines.


References

[1] Daniel Cer et al. (2018) Universal Sentence Encoder

[2] Ashish Vaswani et al. (2017) Attention Is All You Need

[3] TensorFlow - Official Models: https://github.com/tensorflow/models

[4] Tommaso Soru et al. (2017) SPARQL as a Foreign Language

[5] Abdalghani Abujabal et al. (2018) Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases

[6] Rajarshi Das et al. (2017) Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks

[7] Haitian Sun et al. (2018) Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text

[8] Svetlana Stenchikova et al. (2018) QASR: Spoken Question Answering Using Semantic Role Labeling

