VecLink: cross-embedding vector correspondence via iterative geometric embedding hashing. Given two partially overlapping sets of embeddings of the same data, each produced by a different model, VecLink identifies which vectors correspond to the same underlying entities using only a small set of seed anchors.
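To make the problem setup concrete, the sketch below builds a toy instance using only item ids: two collections that share a fraction of their entities, plus a few seed anchors drawn from the shared part. The helper name `make_partial_overlap`, the interpretation of the overlap as a fraction of all items, and the even split of non-shared items are assumptions for illustration; VecLink's own data preparation may differ.

```python
import random


def make_partial_overlap(n_items, overlap_ratio, n_seeds, seed=None):
    """Split item ids into two partially overlapping sets plus seed anchors.

    Hypothetical helper: VecLink's own splitting logic may differ.
    """
    rng = random.Random(seed)
    ids = list(range(n_items))
    rng.shuffle(ids)

    # Assumption: the overlap ratio is measured against all items.
    n_shared = round(overlap_ratio * n_items)
    shared = ids[:n_shared]
    rest = ids[n_shared:]

    # Assumption: items outside the overlap are divided evenly.
    half = len(rest) // 2
    set_a = shared + rest[:half]
    set_b = shared + rest[half:]

    # Seed anchors are known correspondences, so they come from the overlap.
    seeds = rng.sample(shared, min(n_seeds, len(shared)))
    return set_a, set_b, shared, seeds


set_a, set_b, shared, seeds = make_partial_overlap(1000, 0.3, 15, seed=42)
print(len(set_a), len(set_b), len(shared), len(seeds))  # 650 650 300 15
```

The matching task is then to recover the `shared` correspondences between the two sets while being told only the pairs in `seeds`.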
Requires Python 3.9–3.10.
```bash
# Install dependencies
uv sync
# PyTorch Geometric extensions (must be installed separately)
uv pip install torch-cluster torch-scatter torch-sparse torch-spline-conv \
  -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
```

```bash
uv run veclink.py \
  --dataset scifact \
  --emb1 mistral \
  --emb2 openai \
  --overlap_ratio 0.3 \
  --n_seeds 15 \
  --seed 42 \
  --use_bernoulli_trials
```

Place embedding files in the `embeddings/` directory as NumPy `.npy` files, named as:
`corpus_embeddings_{model}_{dataset}.npy`

For example: `corpus_embeddings_mistral_scifact.npy`, `corpus_embeddings_openai_scifact.npy`.
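The naming convention can be checked before a run with a small helper like the one below. The function names `embedding_path` and `missing_embeddings` are hypothetical and not part of VecLink; the sketch only assumes the `embeddings/` directory and file pattern described above.

```python
from pathlib import Path


def embedding_path(model, dataset, root="embeddings"):
    """Return the expected .npy path for one model/dataset pair."""
    return Path(root) / f"corpus_embeddings_{model}_{dataset}.npy"


def missing_embeddings(models, dataset, root="embeddings"):
    """List expected embedding files that are not on disk yet."""
    paths = [embedding_path(m, dataset, root) for m in models]
    return [p for p in paths if not p.is_file()]


print(embedding_path("mistral", "scifact").name)
# corpus_embeddings_mistral_scifact.npy
for path in missing_embeddings(["mistral", "openai"], "scifact"):
    print(f"missing: {path}")
```

Once the files are in place, each one can be read with `numpy.load(path)`.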
| Argument | Default | Description |
|---|---|---|
| `--dataset` | `scifact` | Dataset name |
| `--emb1` / `--emb2` | `mistral` / `openai` | Embedding model names |
| `--overlap_ratio` | `0.3` | Fraction of data shared between the two sets |
| `--n_seeds` | `None` | Number of seed anchor pairs |
| `--use_bernoulli_trials` | `True` | Beta–Bernoulli posterior ensemble selection (paper default; pass `False` for the raw-vote baseline) |
| `--max_iter` | `100` | Maximum refinement iterations |
| `--seed` | `None` | Random seed for reproducibility |
| `--use_gpu` | `True` | Enable GPU acceleration |
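The `--use_bernoulli_trials` option contrasts Beta–Bernoulli posterior selection with a raw-vote baseline. As a rough illustration of why the two can disagree, the sketch below treats each ensemble vote for a candidate match as a Bernoulli trial and scores the candidate by its posterior mean under a Beta prior. The uniform Beta(1, 1) prior, the function names, and the toy vote counts are all assumptions; the exact selection rule VecLink uses is not specified here.

```python
def raw_vote_score(successes, trials):
    """Raw-vote baseline: fraction of trials that voted for the candidate."""
    return successes / trials if trials else 0.0


def beta_posterior_mean(successes, trials, alpha=1.0, beta=1.0):
    """Posterior mean of a Bernoulli rate under a Beta(alpha, beta) prior."""
    return (alpha + successes) / (alpha + beta + trials)


# Toy candidates: (votes for the match, number of trials it appeared in).
candidates = {"a": (2, 2), "b": (18, 20)}

for name, (s, n) in candidates.items():
    print(name, round(raw_vote_score(s, n), 3), round(beta_posterior_mean(s, n), 3))
# a 1.0 0.75
# b 0.9 0.864
```

With these counts the raw vote prefers `a` (1.0 vs 0.9), while the posterior mean prefers `b` (0.864 vs 0.75) because two trials are weak evidence compared with twenty.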
Datasets (BEIR benchmarks): `scifact`, `scidocs`, `fiqa`, `nfcorpus`, `arguana`.
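To run the same configuration across all of these datasets, one option is to generate the command lines programmatically. The sketch below only prints the commands; the `build_command` helper is hypothetical, and it uses only the flags documented in the table above, with the values from the example invocation as defaults.

```python
import shlex

DATASETS = ["scifact", "scidocs", "fiqa", "nfcorpus", "arguana"]


def build_command(dataset, emb1="mistral", emb2="openai",
                  overlap_ratio=0.3, n_seeds=15, seed=42):
    """Build the argv list for one VecLink run on the given dataset."""
    return [
        "uv", "run", "veclink.py",
        "--dataset", dataset,
        "--emb1", emb1,
        "--emb2", emb2,
        "--overlap_ratio", str(overlap_ratio),
        "--n_seeds", str(n_seeds),
        "--seed", str(seed),
    ]


for dataset in DATASETS:
    # Print instead of executing; pass the list to subprocess.run to launch it.
    print(shlex.join(build_command(dataset)))
```

Each run still expects the matching embedding files for both models to be present in `embeddings/`.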