
Knowledge Graph Embedding Models in Drug Discovery

DOI: 10.1016/j.ailsci.2022.100036

This repository accompanies our paper Understanding the Performance of Knowledge Graph Embeddings in Drug Discovery and enables replication of the key results.

Overview

In this work we investigate the predictive performance of five KGE models on two public drug discovery-oriented KGs. Our goal is not to identify the best overall model or configuration; instead, we take a deeper look at how performance can be affected by changes in the training setup, the choice of hyperparameters, the random seed used to initialise model parameters, and different splits of the datasets. Our results highlight that these factors have a significant impact on performance and can even change the relative ranking of models. We therefore argue that these factors should be reported alongside model architectures to ensure complete reproducibility and fair comparison in future work, and that this is critical for the acceptance, use, and impact of KGEs in a biomedical setting.

The models we investigate are:

- ComplEx
- DistMult
- RotatE
- TransE
- TransH

The following datasets are used:

- Hetionet
- BioKG

Hyperparameter Values

Part of this study was a detailed search over the hyperparameter space for all models across both datasets. The best overall values for each dataset are reported below. Note that these values are taken from experiments with a fixed training setup consisting of the Adagrad optimiser and Margin Ranking Loss with no inverse relations.

Hetionet

| Model    | Embedding Size | Num Epochs | Learning Rate | Num Negatives |
|----------|----------------|------------|---------------|---------------|
| ComplEx  | 272            | 700        | 0.03          | 91            |
| DistMult | 80             | 400        | 0.02          | 41            |
| RotatE   | 512            | 500        | 0.03          | 41            |
| TransE   | 304            | 500        | 0.02          | 61            |
| TransH   | 480            | 800        | 0.005         | 1             |

BioKG

| Model    | Embedding Size | Num Epochs | Learning Rate | Num Negatives |
|----------|----------------|------------|---------------|---------------|
| ComplEx  | 464            | 600        | 0.09          | 91            |
| DistMult | 480            | 100        | 0.05          | 71            |
| RotatE   | 448            | 900        | 0.06          | 31            |
| TransE   | 448            | 600        | 0.1           | 91            |
| TransH   | 368            | 900        | 0.06          | 31            |
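
As an illustration of how these values map onto training code, the following is a minimal sketch using PyKEEN's pipeline API with the Hetionet RotatE row above. It is not the repository's own training script, and the loss/optimiser string names shown are PyKEEN's standard resolver names:

```python
from pykeen.pipeline import pipeline

# Sketch: train RotatE on Hetionet with the best hyperparameters
# reported in the table above, using the fixed training setup
# (Adagrad + Margin Ranking Loss, no inverse relations).
result = pipeline(
    dataset='Hetionet',
    model='RotatE',
    model_kwargs=dict(embedding_dim=512),
    optimizer='Adagrad',
    optimizer_kwargs=dict(lr=0.03),
    loss='marginranking',
    negative_sampler_kwargs=dict(num_negs_per_pos=41),
    training_kwargs=dict(num_epochs=500),
)
result.save_to_directory('results/rotate_hetionet')
```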

Installation Dependencies

The dependencies required to run the notebooks can be installed as follows:

$ pip install -r requirements.txt

The code relies primarily on the PyKEEN package, which uses PyTorch behind the scenes for gradient computation. If you want to run the experiments, it is advisable to first install a GPU-enabled version of PyTorch. Details on how to do this are provided here.

Note that all of the models and both of the datasets are provided as part of the PyKEEN package, so there is no need to download the datasets separately.
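
As a quick sanity check before running anything (this snippet is an illustration, not part of the repository's scripts), you can verify that a CUDA-enabled PyTorch build is installed and that the datasets load through PyKEEN, which downloads them on first use:

```python
import torch
from pykeen.datasets import BioKG, Hetionet

# Check that a CUDA-enabled PyTorch build is available.
print('CUDA available:', torch.cuda.is_available())

# Both datasets ship with PyKEEN and are downloaded on first use.
dataset = Hetionet()  # or BioKG()
print('Hetionet training triples:', dataset.training.num_triples)
```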

Reproducing Experiments

This repository contains code to replicate the experiments detailed in the accompanying manuscript. Each experiment is run using a combination of a Python script and an associated YAML configuration file. The general format is experiment.py, with the path to the config file providing the remaining information in the format experiment/dataset/model.

We now provide examples of running each experiment. Please note that the results of these experiments will be saved in the results directory at the root of this repository.

Baseline Experiments

The baseline experiments are run using sensible default hyperparameters and can be used as a point of comparison against more optimised values. Each baseline experiment can be run as follows:

$ python src/baseline.py -c config/baseline/hetionet/rotate.yaml

Both the dataset and model can be chosen from those available.

Hyperparameter Optimisation Experiments

The HPO experiments will perform 100 repeats to find the optimal hyperparameters for a given dataset and model combination. Each HPO experiment can be run as follows:

$ python src/hpo.py -c config/hpo/hetionet/rotate.yaml

Both the dataset and model can be chosen from those available. Note that by default we use the TPE search method; however, a random search can also be used by changing the value of the sampler configuration option in the YAML files.
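
For reference, the same kind of search can be expressed directly against PyKEEN's HPO API. This is a hedged sketch of the underlying mechanism rather than the repository's hpo.py:

```python
from pykeen.hpo import hpo_pipeline

# Sketch: a 100-trial search over RotatE on Hetionet with the TPE
# sampler; passing sampler='random' switches to a random search,
# mirroring the sampler option in the YAML configs.
hpo_result = hpo_pipeline(
    n_trials=100,
    dataset='Hetionet',
    model='RotatE',
    sampler='tpe',
)
hpo_result.save_to_directory('results/hpo_rotate_hetionet')
```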

Task Specific HPO

The previous experiment optimised the hyperparameters across all relation types. In these experiments, we optimise over only the edges between gene and disease entities.

$ python src/hpo_relation.py -c config/hpo_relation/hetionet/rotate.yaml

Note that only the Hetionet dataset can be used for these experiments.

Model Random Seed Experiments

The model seed experiments are designed to assess how variation in the random seed used to initialise the model parameters affects predictive performance. To this end, ten repeats with different random seeds are performed. Each model random seed experiment can be run as follows:

$ python src/model_seed_repeats.py -c config/model_repeats/hetionet/rotate.yaml

Both the dataset and model can be chosen from those available.
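
Conceptually, the experiment boils down to re-running an identical configuration while varying only the seed. A minimal sketch using PyKEEN's pipeline (whose random_seed argument seeds parameter initialisation, among other things) might look like:

```python
from pykeen.pipeline import pipeline

# Sketch: train the same configuration ten times, varying only the
# random seed, and compare a ranking metric across runs.
for seed in range(10):
    result = pipeline(
        dataset='Hetionet',
        model='RotatE',
        random_seed=seed,
        training_kwargs=dict(num_epochs=500),
    )
    print(seed, result.metric_results.get_metric('hits@10'))
```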

Dataset Split Experiments

These experiments are designed to assess how changes in the train/test split of the dataset can affect predictive performance. Each experiment can be run as follows:

$ python src/dataset_repeats.py -c config/datasplits/hetionet/rotate.yaml

Both the dataset and model can be chosen from those available.
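
One way to produce such alternative splits with PyKEEN is to recombine a dataset's canonical triples and re-split them with different random states. The following is a hedged sketch of that idea, not the repository's dataset_repeats.py:

```python
import torch
from pykeen.datasets import Hetionet
from pykeen.triples import TriplesFactory

dataset = Hetionet()

# Recombine the canonical train/validation/test triples.
all_triples = torch.cat([
    dataset.training.mapped_triples,
    dataset.validation.mapped_triples,
    dataset.testing.mapped_triples,
])
tf = TriplesFactory(
    mapped_triples=all_triples,
    entity_to_id=dataset.training.entity_to_id,
    relation_to_id=dataset.training.relation_to_id,
)

# Generate several alternative 80/10/10 splits.
for seed in range(3):
    training, validation, testing = tf.split([0.8, 0.1, 0.1], random_state=seed)
```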

Ablation Experiments

In these experiments, we investigate how the training setup (loss function, optimiser, and use of inverse relations) can impact predictive performance. Please note that this experiment runs the repeats over all models and datasets from a single script, so there is no need to specify a specific model or dataset.

$ python src/ablation.py
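
For intuition, the kind of grid this covers can be sketched with plain PyKEEN calls. The loss and optimiser names below are illustrative choices, not necessarily the exact set used in the paper:

```python
from itertools import product

from pykeen.pipeline import pipeline

# Sketch: sweep the training setup (loss, optimiser, inverse relations)
# for one model/dataset combination; ablation.py covers all of them.
for loss, optimizer, inverse in product(
    ['marginranking', 'softplus'],  # illustrative losses
    ['Adagrad', 'Adam'],            # illustrative optimisers
    [False, True],                  # without/with inverse relations
):
    result = pipeline(
        dataset='Hetionet',
        dataset_kwargs=dict(create_inverse_triples=inverse),
        model='RotatE',
        loss=loss,
        optimizer=optimizer,
        training_kwargs=dict(num_epochs=100),
    )
```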

Citation

Please consider citing the paper if you find this repository useful:

@article{bonner2022understanding,
  title={Understanding the performance of knowledge graph embeddings in drug discovery},
  author={Bonner, Stephen and Barrett, Ian P and Ye, Cheng and Swiers, Rowan and Engkvist, Ola and Hoyt, Charles Tapley and Hamilton, William L},
  journal={Artificial Intelligence in the Life Sciences},
  pages={100036},
  year={2022},
  publisher={Elsevier}
}

License