On the Theoretical Limitations of Embedding-based Link Prediction

This is the code for the paper "On the Theoretical Limitations of Embedding-based Link Prediction", published at ICML 2026.

TL;DR: To reproduce the results from the paper, run scripts/prepare_experiments.sh to download all the data and experiment scripts, then run the experiment scripts under scripts/experiments/. Before running, set your wandb account and project in the main config file config/config.yaml.

Data

TSV-formatted KGs

Download, e.g., FB15K-237 or WN18RR from here.

And put it in a source directory:

data/
|-- src/
|   |-- FB15K-237/
|       |-- train.tsv
|       |-- valid.tsv
|       |-- test.tsv

Then process with

uv run scripts/preprocess.py --datasets FB15K-237 WN18RR --data-folder data/src --output-folder data/processed

The preprocessing script will ensure that the entities and relations are mapped to consecutive integers starting from 0.
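
As a rough illustration of that mapping, the sketch below assigns ids in order of first appearance. It is not the code in scripts/preprocess.py: the function names (`build_index`, `encode`, `parse_tsv`) are hypothetical, and the real script may order ids differently or handle the three split files in another way.

```python
def parse_tsv(text):
    """Parse tab-separated (head, relation, tail) lines into tuples."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line.strip()]


def build_index(triples):
    """Map entity and relation labels to consecutive integer ids starting from 0.

    Assumption: ids follow order of first appearance.
    """
    entity_to_id, relation_to_id = {}, {}
    for head, relation, tail in triples:
        for entity in (head, tail):
            entity_to_id.setdefault(entity, len(entity_to_id))
        relation_to_id.setdefault(relation, len(relation_to_id))
    return entity_to_id, relation_to_id


def encode(triples, entity_to_id, relation_to_id):
    """Replace every label in the triples with its integer id."""
    return [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in triples]


# Toy stand-ins for train.tsv and valid.tsv.
train = parse_tsv("a\tlikes\tb\nb\tknows\tc\n")
valid = parse_tsv("c\tlikes\ta\n")

# Build the index over all splits so every split shares the same ids.
entity_to_id, relation_to_id = build_index(train + valid)
encoded_train = encode(train, entity_to_id, relation_to_id)
```

Building the index over all splits at once keeps an entity that only occurs in valid.tsv or test.tsv from being left without an id.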

The same can be done for other datasets with tab-separated files following the same format.

Hetionet and openbiolink

Download Hetionet and openbiolink and dump the processed tsv files by running

uv run data/download_hetionet.py --output_dir data/processed/Hetionet --seed 42
uv run data/download_openbiolink.py --output_dir data/processed/openbiolink

This uses PyKEEN to download the datasets and split them into train/validation/test sets. The original datasets do not come with standard train/validation/test splits, so we use PyKEEN and a fixed seed to split them for reproducibility.
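
To show why a fixed seed makes the split reproducible, here is a minimal pure-Python sketch. It does not reproduce PyKEEN's splitting procedure: the function `split_triples` and the 80/10/10 ratios are assumptions for illustration only.

```python
import random


def split_triples(triples, seed, ratios=(0.8, 0.1, 0.1)):
    """Shuffle with a seeded generator, then cut into train/valid/test.

    Assumption: a plain random split; the ratios are illustrative.
    """
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # private RNG, global state untouched
    n_train = round(ratios[0] * len(shuffled))
    n_valid = round(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Toy triples (head, relation, tail) using integer ids.
triples = [(i, 0, i + 1) for i in range(100)]
first = split_triples(triples, seed=42)
second = split_triples(triples, seed=42)
```

Calling the function twice with the same seed yields identical partitions, which is the property the fixed seed provides.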

OGBL-biokg

For ogbl datasets there is no need to download the data manually: run the training scripts and the data will be downloaded automatically.

Running models

Run with

uv run scripts/main.py data_folder=data/processed dataset=FB15K-237 model=tail dimension=200 model/fusing_function=hadamard model.fusing_dropout=0.1 engine_config.learning_rate=1e-3 device=cuda:0

Choose the datasets from FB15K-237, WN18RR, ogbl-biokg, Hetionet, openbiolink, and ogbl-wikikg2.

Note that ogbl-wikikg2 is very large and you'll need a lot of resources to run it. We did not use it in the paper due to computational constraints.
For Hetionet and openbiolink, we highly recommend setting engine_config.valid_sample_size=10000 or a similarly small number, because these datasets are large (unlike FB15K-237 and WN18RR) and lack an efficient validation/testing setup (unlike ogbl-biokg or ogbl-wikikg2). The whole test set is still used for evaluation at the end of training, which can take several hours.
The evaluation code assumes datasets include inverse relationships (add_inverse=True). This means that head prediction is evaluated through tail prediction on inverse triples. For example, to evaluate "given (r,o), predict s" on triple (s,r,o), the code evaluates "given (o,r_inv), predict s" on the inverse triple (o,r_inv,s).
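
The inverse-triple convention described above can be sketched as follows. This is not the repository's implementation: the helper names are hypothetical, and the choice of `r + num_relations` as the id of the inverse relation is an assumption (a common convention, but the code may use a different one).

```python
def add_inverse_triples(triples, num_relations):
    """Return the triples plus one inverse triple (o, r_inv, s) per (s, r, o).

    Assumption: the inverse of relation r gets id r + num_relations.
    """
    inverses = [(o, r + num_relations, s) for s, r, o in triples]
    return list(triples) + inverses


def head_query_as_tail_query(triple, num_relations):
    """Rewrite "given (r, o), predict s" as "given (o, r_inv), predict s"."""
    s, r, o = triple
    query = (o, r + num_relations)  # subject and relation of the inverse triple
    answer = s                      # the original head is now the tail to predict
    return query, answer


triples = [(0, 0, 1), (1, 1, 2)]  # (s, r, o) with two relations
augmented = add_inverse_triples(triples, num_relations=2)
query, answer = head_query_as_tail_query((0, 0, 1), num_relations=2)
```

With this rewriting, a single tail-prediction routine covers both prediction directions.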

Set up the config as you like by modifying config/config.yaml (type of model, embedding dimension, learning rate, etc.) as well as the sub-config files under config/model (settings for the specific model).

Under config/config.yaml, you can specify the family of model to use by setting the model field:

defaults:
  - model: mixture # pipeline, mixture
...

The model types are:

pipeline: regular KGE models including DistMult, ComplEx, ConvE, etc. Select the specific model by editing config/model/pipeline.yaml.
mixture: high-rank variations of the KGE models expanded with a mixture layer. Select the specific model by editing config/model/mixture.yaml.

In pipeline.yaml and mixture.yaml, you can specify the embedding type, fusing function, and grammatical encoder to use. Here are some common configurations:

DistMult:
- embedding: real
- fusing_function: hadamard # element-wise product
- grammatical_encoder: identity
ComplEx:
- embedding: complex
- fusing_function: hadamard
- grammatical_encoder: complex # takes the conjugate of the complex embeddings
ConvE:
- embedding: real
- fusing_function: conve
- grammatical_encoder: identity

pipeline models then project to the objects using a linear dot product, whereas mixture models use a more elaborate function using a mixture layer.
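
For intuition, the scoring path of a DistMult-style pipeline model (real embeddings, hadamard fusing, identity encoder, dot-product projection) might look like the sketch below. The function names and the toy embeddings are hypothetical, and the classes under kge/ are likely organized differently.

```python
def hadamard(subject, relation):
    """Fuse subject and relation embeddings with an element-wise product."""
    return [s * r for s, r in zip(subject, relation)]


def score_all_objects(subject, relation, entity_embeddings):
    """Score every candidate object by a dot product with the fused vector."""
    fused = hadamard(subject, relation)
    return [sum(f * e for f, e in zip(fused, obj)) for obj in entity_embeddings]


# Toy 2-dimensional embeddings for three entities and one relation.
entity_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relation = [2.0, 0.5]

# Scores of all entities as objects for the query (entity 0, relation).
scores = score_all_objects(entity_embeddings[0], relation, entity_embeddings)
```

Here the fused vector is [2.0, 0.0], so the scores are [2.0, 0.0, 2.0]. Swapping in a different fusing function while keeping the final dot product gives other pipeline models.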

Aggregating results

results/ contains undocumented code which can be useful to aggregate results from different runs on wandb.

Binary classification engine

In addition to the ranking-based engine above, the repository includes a binary classification engine that trains the same KGE models with a per-entity binary cross-entropy objective (instead of ranking / cross-entropy) and evaluates them with classification metrics (F1, AUROC, average precision).
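
A per-entity binary cross-entropy objective treats each candidate entity as an independent yes/no decision. The sketch below shows the plain, unweighted form for a single query. It is an assumption-laden illustration: the function `per_entity_bce` is hypothetical, and it leaves out negative subsampling and positive-class weighting.

```python
import math


def sigmoid(x):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def per_entity_bce(scores, true_objects):
    """Mean binary cross-entropy over all candidate entities for one query.

    scores: one raw score per entity.
    true_objects: set of entity ids that are correct answers.
    """
    loss = 0.0
    for entity_id, score in enumerate(scores):
        p = sigmoid(score)
        label = 1.0 if entity_id in true_objects else 0.0
        loss -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return loss / len(scores)


uninformed = per_entity_bce([0.0, 0.0, 0.0], {0})   # every probability is 0.5
confident = per_entity_bce([5.0, -5.0, -5.0], {0})  # correct and confident
wrong = per_entity_bce([-5.0, 5.0, 5.0], {0})       # confident but wrong
```

Unlike a softmax cross-entropy, the per-entity probabilities need not sum to one, so a query with several correct objects can assign high probability to all of them.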

Run it with its own entry point and config (config/bc_config.yaml):

uv run scripts/bc_main.py dataset=ogbl-biokg model=tail_model dimension=1000 device=cuda:0

Choose model from tail_model or tail_mixture_sig (which uses a Mixture of Sigmoids instead of a Mixture of Softmaxes). Relevant settings such as neg_keep_ratio, pos_weight_alpha, and validation_metric (f1, ap, auroc) can be tuned in config/bc_config.yaml.
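
The difference between the two mixture outputs can be sketched generically as below. This assumes the textbook form, a weighted sum of K component outputs. How the repository computes the component scores and mixture weights is not described here, so all names and inputs are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixture_of_softmaxes(component_scores, mixture_weights):
    """Weighted sum of K softmax distributions over entities."""
    probs = [0.0] * len(component_scores[0])
    for weight, scores in zip(mixture_weights, component_scores):
        for i, p in enumerate(softmax(scores)):
            probs[i] += weight * p
    return probs


def mixture_of_sigmoids(component_scores, mixture_weights):
    """Weighted sum of K element-wise sigmoids, one probability per entity."""
    probs = [0.0] * len(component_scores[0])
    for weight, scores in zip(mixture_weights, component_scores):
        for i, s in enumerate(scores):
            probs[i] += weight * sigmoid(s)
    return probs


# Two mixture components scoring three entities; weights sum to 1.
component_scores = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
mixture_weights = [0.5, 0.5]
mos = mixture_of_softmaxes(component_scores, mixture_weights)
msig = mixture_of_sigmoids(component_scores, mixture_weights)
```

The softmax mixture is a single distribution over entities, while the sigmoid mixture gives each entity its own probability, which matches a per-entity binary objective.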
