
EmbedKGQA: Reproduction and Extended Study

Additional Experiments

  • Knowledge Graph Embedding model
    • TuckER
    • Tested on {MetaQA_full, MetaQA_half} datasets
  • Question embedding models
    • ALBERT, XLNet, Longformer, SentenceTransformer (in addition to RoBERTa)
    • Tested on {WebQSP_full, WebQSP_half} datasets

Requirements

  • Python >= 3.7.5, pip
  • zip, unzip
  • Docker (Recommended)
  • PyTorch version 1.3.0a0+24ae9b5. For more info, visit here.
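The prerequisites above can be sanity-checked with a few commands (a minimal sketch; the PyTorch line only reports a version once torch has been installed):

```shell
# Verify the Python interpreter meets the minimum version (>= 3.7.5)
python3 -c 'import sys; assert sys.version_info >= (3, 7, 5), sys.version'

# Verify zip/unzip are on PATH (needed to unpack the downloaded artifacts)
for tool in zip unzip; do
  command -v "$tool" >/dev/null && echo "$tool: ok" || echo "$tool: missing"
done

# Report the installed PyTorch version (expected: 1.3.0a0+24ae9b5)
python3 -c 'import torch; print(torch.__version__)' 2>/dev/null || echo "torch: not installed yet"
```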

Helpful pointers

  • Docker Image: Cuda-Python[2] can be used. Use the runtime tag.

    • docker run -itd --rm --runtime=nvidia -v /raid/kgdnn/:/raid/kgdnn/ --name embedkgqa__4567 -e NVIDIA_VISIBLE_DEVICES=4,5,6,7  -p 7777:7777 qts8n/cuda-python:runtime
  • Alternatively, Docker Image: Embed_KGQA[3] can be used as well. It is built upon [2] and contains all the packages needed to run the experiments.

    • Use env tag for image without models.
    • Use env-models tag for image with models.
    • docker run -itd --rm --runtime=nvidia -v /raid/kgdnn/:/raid/kgdnn/ --name embedkgqa__4567 -e NVIDIA_VISIBLE_DEVICES=4,5,6,7  -p 7777:7777 jishnup/embed_kgqa:env
    • All the required packages and models (from the extended study with better performance) are readily available in [3].
      • Model location within the docker container: /raid/mlrc2020models/
        • /raid/mlrc2020models/embeddings/ contain the KG embedding models.
        • /raid/mlrc2020models/qa_models/ contain the QA models.
  • The experiments were done using [2]; the package versions in requirements.txt are set accordingly and may differ from those in [1].

  • KGQA/LSTM and KGQA/RoBERTa directory nomenclature hasn't been changed to avoid unnecessary confusion w.r.t. the original codebase[1].

  • fbwq_full and fbwq_full_new are the same dataset, but both names are required because

    • the pretrained ComplEx model uses fbwq_full_new as the dataset name
    • the trained SimplE model uses fbwq_full as the dataset name
  • No fbwq_full_new dataset was found in the data shared by the author[1], so we went ahead with this setting.
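One way to satisfy both dataset names without duplicating data is a symlink (a sketch under the assumption that datasets live under $EMBED_KGQA_DIR/data/; adjust the paths to your layout):

```shell
# Demo root; in practice this is your MLRC2020-EmbedKGQA checkout
ROOT="${EMBED_KGQA_DIR:-/tmp/embedkgqa-demo}"
mkdir -p "$ROOT/data/fbwq_full"

# Expose the same directory under the second expected name
ln -sfn "$ROOT/data/fbwq_full" "$ROOT/data/fbwq_full_new"
ls -l "$ROOT/data/"
```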

  • Pretrained qa_models were also absent from the shared data; the reproduction results are therefore based on our own training scheme.

  • For training QA datasets, use batch_size >= 2.

Get started

# Clone the repo
git clone https://github.com/jishnujayakumar/MLRC2020-EmbedKGQA && cd "$_"

# Set a new env variable called EMBED_KGQA_DIR with MLRC2020-EmbedKGQA/ directory's absolute path as value
# If using the bash shell, run (double quotes so that $(pwd) expands to the repo path now, not when the profile is sourced later)
echo "export EMBED_KGQA_DIR=$(pwd)" >> ~/.bash_profile && source ~/.bash_profile

# Change script permissions
chmod -R 700 scripts/

# Initial setup
./scripts/initial_setup.sh

# Download and unzip data and pretrained_models from the original EmbedKGQA paper
./scripts/download_artifacts.sh

# Install LibKGE
./scripts/install_libkge.sh

Train KG Embeddings

  • Steps to train KG embeddings.

Train QA Datasets

Hyperparameters in the following commands are set w.r.t. [1].

MetaQA

# Method: 1
cd $EMBED_KGQA_DIR/KGQA/LSTM;
# Note: a comment cannot follow a trailing backslash, so the flags are annotated here instead:
#   --gpu: GPU-ID | --hops: n-hops | --model: KGE model | --use_cuda: enable CUDA
python main.py  --mode train \
            --nb_epochs 100 \
            --relation_dim 200 \
            --hidden_dim 256 \
            --gpu 0 \
            --freeze 0 \
            --batch_size 64 \
            --validate_every 4 \
            --hops <1/2/3> \
            --lr 0.0005 \
            --entdrop 0.1 \
            --reldrop 0.2 \
            --scoredrop 0.2 \
            --decay 1.0 \
            --model <ComplEx/TuckER> \
            --patience 10 \
            --ls 0.0 \
            --use_cuda True \
            --kg_type <half/full>

        
# Method: 2
# Modify the hyperparameters in the script file w.r.t. your usecase
$EMBED_KGQA_DIR/scripts/train_metaQA.sh \
    <ComplEx/TuckER> \
    <half/full> \
    <1/2/3> \
    <batch_size> \
    <gpu_id> \
    <relation_dim>
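For instance, a hypothetical 2-hop run with TuckER embeddings on the half KG (example values only) would be invoked as:

```shell
# Example argument values (hypothetical; adjust to your experiment and hardware)
MODEL=TuckER        # <ComplEx/TuckER>
KG_TYPE=half        # <half/full>
HOPS=2              # <1/2/3>
BATCH_SIZE=64
GPU_ID=0
RELATION_DIM=200

cmd="$EMBED_KGQA_DIR/scripts/train_metaQA.sh $MODEL $KG_TYPE $HOPS $BATCH_SIZE $GPU_ID $RELATION_DIM"
echo "$cmd"   # inspect the command, then run it with: eval "$cmd"
```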

WebQuestionsSP

# Method: 1
cd $EMBED_KGQA_DIR/KGQA/RoBERTa;
python main.py  --mode train \
                --relation_dim 200 \
                --que_embedding_model RoBERTa \
                --do_batch_norm 0 \
                --gpu 0 \
                --freeze 1 \
                --batch_size 16 \
                --validate_every 10 \
                --hops webqsp_half \
                --lr 0.00002 \
                --entdrop 0.0 \
                --reldrop 0.0 \
                --scoredrop 0.0 \
                --decay 1.0 \
                --model ComplEx \
                --patience 20 \
                --ls 0.0 \
                --l3_reg 0.001 \
                --nb_epochs 200 \
                --outfile delete

# Method: 2
# Modify the hyperparameters in the script file w.r.t. your usecase
$EMBED_KGQA_DIR/scripts/train_webqsp.sh \
    <ComplEx/SimplE> \
    <RoBERTa/ALBERT/XLNet/Longformer/SentenceTransformer> \
    <half/full> \
    <batch_size> \
    <gpu_id> \
    <relation_dim>
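Similarly, a hypothetical WebQSP run with ComplEx embeddings and RoBERTa question embeddings on the half KG might look like:

```shell
# Example argument values (hypothetical; adjust to your experiment and hardware)
KGE_MODEL=ComplEx   # <ComplEx/SimplE>
QUE_MODEL=RoBERTa   # <RoBERTa/ALBERT/XLNet/Longformer/SentenceTransformer>
KG_TYPE=half        # <half/full>
BATCH_SIZE=16
GPU_ID=0
RELATION_DIM=200

cmd="$EMBED_KGQA_DIR/scripts/train_webqsp.sh $KGE_MODEL $QUE_MODEL $KG_TYPE $BATCH_SIZE $GPU_ID $RELATION_DIM"
echo "$cmd"   # inspect the command, then run it with: eval "$cmd"
```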

Test QA Datasets

Set the --mode parameter to test (keep the other hyperparameters the same as those used in training).

Helpful links

Citation:

Please cite the following if you incorporate our work.

@article{P:2021,
  author = {P, Jishnu Jaykumar and Sardana, Ashish},
  title = {{[Re] Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings}},
  journal = {ReScience C},
  year = {2021},
  month = may,
  volume = {7},
  number = {2},
  pages = {{#15}},
  doi = {10.5281/zenodo.4834942},
  url = {https://zenodo.org/record/4834942/files/article.pdf},
  code_url = {https://github.com/jishnujayakumar/MLRC2020-EmbedKGQA},
  code_doi = {},
  code_swh = {swh:1:dir:c95bc4fec7023c258c7190975279b5baf6ef6725},
  data_url = {},
  data_doi = {},
  review_url = {https://openreview.net/forum?id=VFAwCMdWY7},
  type = {Replication},
  language = {Python},
  domain = {ML Reproducibility Challenge 2020},
  keywords = {knowledge graph, embeddings, multi-hop, question-answering, deep learning}
}

The following 3 options are available for any clarification, comments, or suggestions

About

This is the code for the MLRC2020 challenge w.r.t. the ACL 2020 paper Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings
