This is the code for the paper "On the Theoretical Limitations of Embedding-based Link Prediction", published at ICML 2026.
TL;DR: To reproduce the results from the paper, run scripts/prepare_experiments.sh to download all the data and experiment scripts.
Then, run the experiment scripts that you will find under scripts/experiments/.
Make sure you have set up your wandb account and project in the main config file, config/config.yaml.
Download, e.g., FB15K-237 or WN18RR from here.
And put it in a source directory:
data/
|-- src/
|   |-- FB15K-237/
|   |   |-- train.tsv
|   |   |-- valid.tsv
|   |   |-- test.tsv
Then process it with
uv run scripts/preprocess.py --datasets FB15K-237 WN18RR --data-folder data/src --output-folder data/processed
The preprocessing script ensures that the entities and relations are mapped to consecutive integers starting from 0.
The same can be done for other datasets with tab-separated files following the same format.
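As a rough illustration of what "mapped to consecutive integers starting from 0" means, here is a minimal sketch. It is not the code in scripts/preprocess.py: the function names, the (head, relation, tail) column order, and the first-appearance ordering of IDs are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # assumed column order: head, relation, tail


def build_id_maps(triples: Iterable[Triple]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Assign consecutive integer IDs (starting at 0) in order of first appearance."""
    entity_to_id: Dict[str, int] = {}
    relation_to_id: Dict[str, int] = {}
    for head, relation, tail in triples:
        for entity in (head, tail):
            # setdefault keeps the existing ID if the entity was already seen
            entity_to_id.setdefault(entity, len(entity_to_id))
        relation_to_id.setdefault(relation, len(relation_to_id))
    return entity_to_id, relation_to_id


def encode(
    triples: Iterable[Triple],
    entity_to_id: Dict[str, int],
    relation_to_id: Dict[str, int],
) -> List[Tuple[int, int, int]]:
    """Replace string labels by their integer IDs."""
    return [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in triples]


# Toy triples with made-up labels
raw = [
    ("/m/01", "born_in", "/m/02"),
    ("/m/02", "located_in", "/m/03"),
    ("/m/01", "lives_in", "/m/03"),
]
entity_to_id, relation_to_id = build_id_maps(raw)
encoded = encode(raw, entity_to_id, relation_to_id)
print(encoded)
```

In practice the maps would be built over the train, validation, and test files together so that every split shares the same IDs.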
Download Hetionet and openbiolink and dump the processed tsv files by running
uv run data/download_hetionet.py --output_dir data/processed/Hetionet --seed 42
uv run data/download_openbiolink.py --output_dir data/processed/openbiolink
This uses PyKEEN to download the datasets and split them into train/validation/test sets. The original datasets do not come with standard train/validation/test splits, so we use PyKEEN with a fixed seed to make the splits reproducible.
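The role of the fixed seed can be sketched with the standard library alone. This is a stand-in for illustration, not PyKEEN's splitting routine: the function name and the 80/10/10 ratios are hypothetical, and the sketch does not try to keep every entity in the training split.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[int, int, int]


def split_triples(
    triples: Sequence[Triple],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # hypothetical ratios
    seed: int = 42,
) -> Tuple[List[Triple], List[Triple], List[Triple]]:
    """Shuffle with a seeded RNG, then cut into train/valid/test."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # same seed -> same permutation
    n_train = int(ratios[0] * len(shuffled))
    n_valid = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Toy triples
triples = [(i, i % 3, (i + 1) % 20) for i in range(20)]
first = split_triples(triples, seed=42)
second = split_triples(triples, seed=42)
print(first == second)
```

Running the split twice with the same seed yields identical partitions, which is the property the download scripts rely on for reproducibility.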
For ogbl datasets, there is no need to download the data manually: just run the training scripts and the data will be downloaded automatically.
Run with
uv run scripts/main.py data_folder=data/processed dataset=FB15K-237 model=tail dimension=200 model/fusing_function=hadamard model.fusing_dropout=0.1 engine_config.learning_rate=1e-3 device=cuda:0
Choose the dataset from FB15K-237, WN18RR, ogbl-biokg, Hetionet, openbiolink, and ogbl-wikikg2.
- Note that ogbl-wikikg2 is very large and requires substantial compute resources. We did not use it in the paper due to computational constraints.
- For Hetionet and openbiolink, we highly recommend setting engine_config.valid_sample_size=10000 or a similarly small number, because these datasets are large (unlike FB15K-237 and WN18RR) and do not have an efficient validation/testing setup (unlike ogbl-biokg or ogbl-wikikg2). The whole test set is still used for evaluation at the end of training, which can take several hours.
- The evaluation code assumes datasets include inverse relations (add_inverse=True). This means that head prediction is evaluated through tail prediction on inverse triples. For example, to evaluate "given (r,o), predict s" on triple (s,r,o), the code evaluates "given (o,r_inv), predict s" on the inverse triple (o,r_inv,s).
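The inverse-triple convention in the last note can be sketched as follows. The helper names are hypothetical, and the ID scheme r_inv = r + num_relations is an assumption chosen for the example; the repository may number inverse relations differently.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[int, int, int]  # (subject, relation, object)


def add_inverse_triples(triples: Sequence[Triple], num_relations: int) -> List[Triple]:
    """For every (s, r, o), also emit (o, r_inv, s) with r_inv = r + num_relations."""
    inverse = [(o, r + num_relations, s) for s, r, o in triples]
    return list(triples) + inverse


def head_query_as_tail_query(triple: Triple, num_relations: int) -> Tuple[Tuple[int, int], int]:
    """Rewrite 'given (r, o), predict s' as 'given (o, r_inv), predict s'."""
    s, r, o = triple
    return (o, r + num_relations), s


# Toy data: two relations, two triples
triples = [(0, 0, 1), (2, 1, 0)]
augmented = add_inverse_triples(triples, num_relations=2)
query, target = head_query_as_tail_query((0, 0, 1), num_relations=2)
print(augmented)
print(query, target)
```

With this rewriting, a model only ever needs a tail-prediction scorer: head prediction on (s,r,o) becomes tail prediction on the inverse triple.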
Set up the config as you like by modifying config/config.yaml (type of model, embedding dimension, learning rate, etc.) as well as the sub-config files under config/model (settings for the specific model).
Under config/config.yaml, you can specify the family of model to use by setting the model field:
defaults:
- model: mixture # pipeline, mixture
...
The model types are:
- pipeline: regular KGE models including DistMult, ComplEx, ConvE, etc. You can choose which one to use by editing the config/model/pipeline.yaml file.
- mixture: high-rank variations of the KGE models extended with a mixture layer. You can choose which one to use by editing the config/model/mixture.yaml file.
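To give an intuition for the mixture family, here is a sketch of one common form of mixture-of-softmaxes output layer. It is an assumption about the general technique, not a description of this repository's mixture layer: the function names, the way component queries are produced, and the shapes are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def mixture_of_softmaxes(
    component_queries: Sequence[Sequence[float]],  # one query vector per component
    mixture_logits: Sequence[float],               # unnormalized component weights
    entity_embeddings: Sequence[Sequence[float]],  # one vector per candidate entity
) -> List[float]:
    """p(o | s, r) = sum_k pi_k * softmax_o(<q_k, e_o>)."""
    weights = softmax(mixture_logits)
    probs = [0.0] * len(entity_embeddings)
    for weight, query in zip(weights, component_queries):
        component = softmax([dot(query, e) for e in entity_embeddings])
        for i, p in enumerate(component):
            probs[i] += weight * p
    return probs


# Toy example: 2 components, 3 candidate entities, dimension 2
entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, -1.0], [-1.0, 2.0]]
probs = mixture_of_softmaxes(queries, [0.0, 0.0], entities)
print(probs)
```

Because the final distribution is a weighted sum of several softmaxes, its log-probabilities are no longer given by a single dot product between one query and the entity embeddings.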
In pipeline.yaml and mixture.yaml, you can specify the embedding type, fusing function, and grammatical encoder to use.
Here are some common configurations:
- DistMult:
  - embedding: real
  - fusing_function: hadamard # element-wise product
  - grammatical_encoder: identity
- ComplEx:
  - embedding: complex
  - fusing_function: hadamard
  - grammatical_encoder: complex # takes the conjugate of the complex embeddings
- ConvE:
  - embedding: real
  - fusing_function: conve
  - grammatical_encoder: identity
pipeline models then project to the objects using a linear dot product, whereas mixture models use a more elaborate function based on a mixture layer.
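The DistMult and ComplEx configurations above correspond to the textbook scoring functions sketched below: fuse subject and relation with an element-wise product, then take a dot product with the object. The function names are hypothetical, and conjugating the object embedding follows the standard ComplEx formulation; which side the repository's grammatical encoder conjugates is an assumption here.

```python
from typing import List, Sequence


def hadamard(u: Sequence[complex], v: Sequence[complex]) -> List[complex]:
    """Element-wise product (works for real and complex entries)."""
    return [a * b for a, b in zip(u, v)]


def distmult_score(s: Sequence[float], r: Sequence[float], o: Sequence[float]) -> float:
    """Real embeddings, hadamard fusing, plain dot product with the object."""
    fused = hadamard(s, r)
    return sum(f * x for f, x in zip(fused, o))


def complex_score(s: Sequence[complex], r: Sequence[complex], o: Sequence[complex]) -> float:
    """Complex embeddings, hadamard fusing, dot product with the conjugated object."""
    fused = hadamard(s, r)
    return sum(f * x.conjugate() for f, x in zip(fused, o)).real


# DistMult is symmetric in subject and object
s, r, o = [1.0, 2.0], [0.5, -1.0], [2.0, 3.0]
print(distmult_score(s, r, o), distmult_score(o, r, s))

# ComplEx can score (s, r, o) and (o, r, s) differently
cs, cr, co = [1 + 1j], [1j], [2 + 0j]
print(complex_score(cs, cr, co), complex_score(co, cr, cs))
```

The toy values show the usual contrast: DistMult gives the same score when subject and object are swapped, while ComplEx does not have to.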
results/ contains undocumented code which can be useful to aggregate results from different runs on wandb.
In addition to the ranking-based engine above, the repository includes a binary classification engine that trains the same KGE models with a per-entity binary cross-entropy objective (instead of ranking / cross-entropy) and evaluates them with classification metrics (F1, AUROC, average precision).
Run it with its own entry point and config (config/bc_config.yaml):
uv run scripts/bc_main.py dataset=ogbl-biokg model=tail_model dimension=1000 device=cuda:0
Choose model from tail_model or tail_mixture_sig (which uses a mixture of sigmoids instead of a mixture of softmaxes). Relevant settings such as neg_keep_ratio, pos_weight_alpha, and validation_metric (f1, ap, auroc) can be tuned in config/bc_config.yaml.
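For readers unfamiliar with the per-entity binary objective, the sketch below computes a plain binary cross-entropy over candidate entities for one query, plus F1 at a fixed threshold. It is a generic illustration with hypothetical function names and a hypothetical 0.5 threshold; it deliberately leaves out neg_keep_ratio and pos_weight_alpha, whose exact behavior is defined in the repository's code and config.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy: each candidate entity is its own yes/no decision."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


def f1_score(logits: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """F1 of the thresholded sigmoid predictions."""
    preds = [1 if sigmoid(z) >= threshold else 0 for z in logits]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One query, four candidate entities; label 1 marks a true object
logits = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 1]
loss = bce_loss(logits, labels)
f1 = f1_score(logits, labels)
print(loss, f1)
```

Unlike a softmax over all entities, each sigmoid output is independent, so several entities can be predicted as correct answers for the same query.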