SonyResearch/KGEMoS


On the Theoretical Limitations of Embedding-based Link Prediction

This is the code for the paper "On the Theoretical Limitations of Embedding-based Link Prediction", published at ICML 2026.

TL;DR: To reproduce the results from the paper, run scripts/prepare_experiments.sh to download all the data and experiment scripts. Then run the experiment scripts found under scripts/experiments/. Before running them, make sure your wandb account and project are set in the main config file, config/config.yaml.

Data

TSV-formatted KGs

Download, e.g., FB15K-237 or WN18RR from here.

Place each dataset in a source directory:

data/
|-- src/
|   |-- FB15K-237/
|       |-- train.tsv
|       |-- valid.tsv
|       |-- test.tsv

Then preprocess with:

uv run scripts/preprocess.py --datasets FB15K-237 WN18RR --data-folder data/src --output-folder data/processed

The preprocessing script will ensure that the entities and relations are mapped to consecutive integers starting from 0.

The same can be done for other datasets with tab-separated files following the same format.
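As a rough illustration of what "mapped to consecutive integers starting from 0" means, here is a minimal sketch. It is not the code in scripts/preprocess.py: the function names are hypothetical, and it assumes each TSV row is ordered as head, relation, tail and that IDs are assigned in order of first appearance.

```python
import csv
import io


def build_id_maps(rows):
    """Assign consecutive integer IDs (starting at 0) in order of first appearance."""
    entity_to_id, relation_to_id = {}, {}
    for head, relation, tail in rows:
        for entity in (head, tail):
            # len(...) is evaluated before insertion, so new keys get the next free ID.
            entity_to_id.setdefault(entity, len(entity_to_id))
        relation_to_id.setdefault(relation, len(relation_to_id))
    return entity_to_id, relation_to_id


def encode(rows, entity_to_id, relation_to_id):
    """Replace string labels with their integer IDs."""
    return [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in rows]


# Toy stand-in for a train.tsv file (assumed column order: head, relation, tail).
sample_tsv = "a\tlikes\tb\nb\tlikes\tc\na\tknows\tc\n"
rows = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t"))

entity_to_id, relation_to_id = build_id_maps(rows)
triples = encode(rows, entity_to_id, relation_to_id)
print(triples)
```

In practice the maps would be built once over all splits, so that validation and test triples reuse the same IDs as training.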

Hetionet and openbiolink

Download Hetionet and openbiolink and dump the processed tsv files by running

uv run data/download_hetionet.py --output_dir data/processed/Hetionet --seed 42
uv run data/download_openbiolink.py --output_dir data/processed/openbiolink

This uses PyKEEN to download the datasets and split them into train/validation/test sets. The original datasets do not come with standard train/validation/test splits, so we use PyKEEN and a fixed seed to split them for reproducibility.
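The role of the fixed seed can be sketched with the standard library alone. This is not PyKEEN's splitting routine, which may apply additional steps; the function name and the 80/10/10 ratios are assumptions chosen for illustration.

```python
import random


def split_triples(triples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle with a seeded generator, then cut into train/valid/test."""
    shuffled = list(triples)
    # A dedicated Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_valid = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Toy triples: (head, relation, tail) as integer IDs.
triples = [(i, i % 3, (i + 1) % 20) for i in range(20)]

first = split_triples(triples, seed=42)
second = split_triples(triples, seed=42)
print(first == second)  # the same seed reproduces the same split
```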

OGBL-biokg

For ogbl datasets, there is no need to download the data manually: run the training scripts and the data will be downloaded automatically.

Running models

Run with

uv run scripts/main.py data_folder=data/processed dataset=FB15K-237 model=tail dimension=200 model/fusing_function=hadamard model.fusing_dropout=0.1 engine_config.learning_rate=1e-3 device=cuda:0

Choose the datasets from FB15K-237, WN18RR, ogbl-biokg, Hetionet, openbiolink, and ogbl-wikikg2.

  • Note that ogbl-wikikg2 is very large and you'll need a lot of resources to run it. We did not use it in the paper due to computational constraints.
  • For Hetionet and openbiolink, we highly recommend setting engine_config.valid_sample_size=10000 or a similarly small number. These datasets are large (unlike FB15K-237 and WN18RR) and do not have an efficient validation/testing setup (unlike ogbl-biokg or ogbl-wikikg2). The full test sets are still used for evaluation at the end of training, which can take several hours.
  • The evaluation code assumes datasets include inverse relationships (add_inverse=True). This means that head prediction is evaluated through tail prediction on inverse triples. For example, to evaluate "given (r,o), predict s" on triple (s,r,o), the code evaluates "given (o,r_inv), predict s" on the inverse triple (o,r_inv,s).
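The inverse-triple convention in the last note can be sketched as follows. The ID scheme used here (r_inv = r + num_relations) is a common convention and an assumption; the repository may number inverse relations differently, and add_inverse_triples is a hypothetical helper.

```python
def add_inverse_triples(triples, num_relations):
    """Append (o, r_inv, s) for every (s, r, o), with r_inv = r + num_relations (assumed)."""
    inverse = [(o, r + num_relations, s) for s, r, o in triples]
    return list(triples) + inverse


triples = [(0, 0, 1), (1, 1, 2)]
augmented = add_inverse_triples(triples, num_relations=2)
print(augmented)

# Head prediction on (s=0, r=0, o=1), i.e. "given (r, o), predict s",
# becomes tail prediction on the inverse triple (o=1, r_inv=2, s=0).
```

With this augmentation, a model only ever needs to answer tail queries, which is why the evaluation code can treat both directions uniformly.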

Set up the config as you like by modifying config/config.yaml (type of model, embedding dimension, learning rate, etc.) as well as the sub-config files under config/model (settings for the specific model).

Under config/config.yaml, you can specify the family of model to use by setting the model field:

defaults:
  - model: mixture # pipeline, mixture
...

The model types are:

  • pipeline: regular KGE models including DistMult, ComplEx, ConvE, etc. You can specify the one to use by setting the config/model/pipeline.yaml file.
  • mixture: high-rank variations of the KGE models expanded with a mixture layer. You can specify the one to use by setting the config/model/mixture.yaml file.

In pipeline.yaml and mixture.yaml, you can specify the embedding type, fusing function, and grammatical encoder to use. Here are some common configurations:

  • DistMult:
    • embedding: real
    • fusing_function: hadamard # element-wise product
    • grammatical_encoder: identity
  • ComplEx:
    • embedding: complex
    • fusing_function: hadamard
    • grammatical_encoder: complex # takes the conjugate of the complex embeddings
  • ConvE:
    • embedding: real
    • fusing_function: conve
    • grammatical_encoder: identity

pipeline models then project to the objects using a linear dot product, whereas mixture models use a more elaborate function using a mixture layer.
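To make the pipeline case concrete, here is a minimal pure-Python sketch of the textbook DistMult and ComplEx scores: fuse subject and relation with a Hadamard product, then take a dot product with each candidate object. The function names are hypothetical, and the sketch assumes the conjugate is applied to the object embedding; it is not the repository's implementation.

```python
def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def distmult_scores(subject, relation, objects):
    """Real embeddings: score(s, r, o) = <s * r, o> for every candidate o."""
    query = hadamard(subject, relation)
    return [dot(query, o) for o in objects]


def complex_scores(subject, relation, objects):
    """Complex embeddings: score(s, r, o) = Re(<s * r, conj(o)>)."""
    query = hadamard(subject, relation)
    return [dot(query, [x.conjugate() for x in o]).real for o in objects]


# DistMult on toy 2-dimensional embeddings, scored against three candidate objects.
dm = distmult_scores([1.0, 2.0], [0.5, -1.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

# ComplEx is not symmetric in subject and object, unlike DistMult.
forward = complex_scores([1 + 1j], [1j], [[1 - 1j]])[0]
backward = complex_scores([1 - 1j], [1j], [[1 + 1j]])[0]
print(dm, forward, backward)
```

The conjugate is what lets ComplEx model asymmetric relations: swapping subject and object changes the score, whereas the DistMult score stays the same.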

Aggregating results

results/ contains undocumented code that can be useful for aggregating results from different wandb runs.

Binary classification engine

In addition to the ranking-based engine above, the repository includes a binary classification engine that trains the same KGE models with a per-entity binary cross-entropy objective (instead of ranking / cross-entropy) and evaluates them with classification metrics (F1, AUROC, average precision).
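The difference between the two objectives can be sketched in plain Python. This is an illustrative sketch with hypothetical function names, not the repository's loss code; it omits settings such as neg_keep_ratio and pos_weight_alpha, whose exact semantics are defined in the repository.

```python
import math


def softmax_cross_entropy(scores, target):
    """Ranking-style loss: one correct entity competes against all others."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def per_entity_bce(scores, positives):
    """Binary cross-entropy on logits, averaged over entities.

    Each entity is an independent yes/no decision, so several entities
    can be positive for the same query.
    """
    total = 0.0
    for entity, s in enumerate(scores):
        label = 1.0 if entity in positives else 0.0
        # Numerically stable form of -[y*log(sig(s)) + (1-y)*log(1-sig(s))].
        total += max(s, 0.0) - label * s + math.log1p(math.exp(-abs(s)))
    return total / len(scores)


scores = [2.0, -1.0, 0.5]
ce = softmax_cross_entropy(scores, target=0)
bce = per_entity_bce(scores, positives={0, 2})
print(ce, bce)
```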

Run it with its own entry point and config (config/bc_config.yaml):

uv run scripts/bc_main.py dataset=ogbl-biokg model=tail_model dimension=1000 device=cuda:0

Choose model from tail_model or tail_mixture_sig (which uses a Mixture of Sigmoids instead of a Mixture of Softmaxes). Relevant settings such as neg_keep_ratio, pos_weight_alpha, and validation_metric (f1, ap, auroc) can be tuned in config/bc_config.yaml.
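The contrast between the two mixture outputs can be sketched as below. This is a generic illustration, not the repository's layer: the function names are hypothetical, and the mixing weights are passed in as fixed numbers summing to 1, whereas a real mixture layer would typically compute them from the query.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixture_of_softmaxes(weights, component_scores):
    """Weighted average of K softmax distributions over the entities."""
    comps = [softmax(s) for s in component_scores]
    n = len(component_scores[0])
    return [sum(w * c[i] for w, c in zip(weights, comps)) for i in range(n)]


def mixture_of_sigmoids(weights, component_scores):
    """Weighted average of K per-entity sigmoid probabilities."""
    n = len(component_scores[0])
    return [
        sum(w * sigmoid(s[i]) for w, s in zip(weights, component_scores))
        for i in range(n)
    ]


weights = [0.7, 0.3]                              # K = 2 mixture weights (assumed fixed)
component_scores = [[2.0, 0.0, -1.0], [-1.0, 3.0, 0.5]]  # K x num_entities logits

mos = mixture_of_softmaxes(weights, component_scores)
mosig = mixture_of_sigmoids(weights, component_scores)
print(sum(mos))  # a distribution over entities: sums to 1 up to rounding
print(mosig)     # independent per-entity probabilities, each in (0, 1)
```

A mixture of softmaxes yields one distribution over entities, which suits ranking. A mixture of sigmoids yields a separate probability per entity, which matches the per-entity binary objective.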

About

Official implementation of "Breaking Rank Bottlenecks in Knowledge Graph Completion"
