DBgroup-Edinburgh/VecLinking

VecLink

Cross-embedding vector correspondence via iterative geometric embedding hashing. Given two sets of embeddings, produced by different models over partially overlapping subsets of the same data, VecLink identifies which vectors correspond to the same underlying entities, using only a small set of seed anchors.

Installation

Requires Python 3.9–3.10.

# Install dependencies
uv sync

# PyTorch Geometric extensions (must be installed separately)
uv pip install torch-cluster torch-scatter torch-sparse torch-spline-conv \
  -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

Quick Start

uv run veclink.py \
  --dataset scifact \
  --emb1 mistral \
  --emb2 openai \
  --overlap_ratio 0.3 \
  --n_seeds 15 \
  --seed 42 \
  --use_bernoulli_trials

Embeddings

Place embedding files in the embeddings/ directory as NumPy .npy files, named as:

corpus_embeddings_{model}_{dataset}.npy

For example: corpus_embeddings_mistral_scifact.npy, corpus_embeddings_openai_scifact.npy.
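
The naming convention above can be wrapped in a small helper so that files always land where VecLink looks for them. This is a minimal sketch, not code from the repository: the function names `embedding_path` and `save_embeddings` are hypothetical, and the 2-D `(n_items, dim)` array layout is an assumption about what the script expects.

```python
from pathlib import Path

import numpy as np

EMB_DIR = Path("embeddings")  # directory name taken from the README


def embedding_path(model: str, dataset: str, root: Path = EMB_DIR) -> Path:
    """Build the file path that follows the naming convention above."""
    return root / f"corpus_embeddings_{model}_{dataset}.npy"


def save_embeddings(vectors: np.ndarray, model: str, dataset: str,
                    root: Path = EMB_DIR) -> Path:
    """Save an array under the conventional file name and return its path.

    Assumption: embeddings are stored as a 2-D (n_items, dim) array.
    """
    if vectors.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_items, dim)")
    root.mkdir(parents=True, exist_ok=True)
    path = embedding_path(model, dataset, root)
    np.save(path, vectors)
    return path
```

For instance, `save_embeddings(vectors, "mistral", "scifact")` would write `embeddings/corpus_embeddings_mistral_scifact.npy`, which `np.load` can read back.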

Key Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --dataset | scifact | Dataset name |
| --emb1 / --emb2 | mistral / openai | Embedding model names |
| --overlap_ratio | 0.3 | Fraction of data shared between the two sets |
| --n_seeds | None | Number of seed anchor pairs |
| --use_bernoulli_trials | True | Beta–Bernoulli posterior ensemble selection (paper default; pass False for the raw-vote baseline) |
| --max_iter | 100 | Maximum refinement iterations |
| --seed | None | Random seed for reproducibility |
| --use_gpu | True | Enable GPU acceleration |
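
To make the roles of `--overlap_ratio`, `--n_seeds`, and `--seed` concrete, the sketch below builds two partially overlapping index sets and draws seed anchors from the shared part. It is an illustration under stated assumptions, not the repository's sampling code: `split_with_overlap` is a hypothetical name, the overlap ratio is interpreted as the fraction of the whole corpus present in both sets, and the remaining items are split evenly between the two sets. VecLink may define the split differently.

```python
import numpy as np


def split_with_overlap(n_items: int, overlap_ratio: float, n_seeds: int,
                       seed: int = 42):
    """Split item indices into two sets sharing round(overlap_ratio * n_items) items.

    Returns (ids_a, ids_b, seed_ids): the two index sets and the shared
    items revealed as seed anchors. Assumed interpretation, see lead-in.
    """
    if not 0.0 < overlap_ratio <= 1.0:
        raise ValueError("overlap_ratio must be in (0, 1]")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_items)

    n_shared = int(round(overlap_ratio * n_items))
    if n_seeds > n_shared:
        raise ValueError("cannot draw more seeds than shared items")

    # Shared items appear in both sets; the rest are divided between them.
    shared, rest = perm[:n_shared], perm[n_shared:]
    half = len(rest) // 2
    ids_a = np.concatenate([shared, rest[:half]])
    ids_b = np.concatenate([shared, rest[half:]])

    # Seed anchors are a small known subset of the shared items.
    seed_ids = rng.choice(shared, size=n_seeds, replace=False)
    return ids_a, ids_b, seed_ids
```

Under this interpretation, a corpus of 1,000 items with `overlap_ratio=0.3` yields 300 shared items, of which `n_seeds` are known in advance and the rest are left for the method to recover.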

Supported Datasets

BEIR benchmarks: scifact, scidocs, fiqa, nfcorpus, arguana.
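
To run the same configuration across all listed datasets, the Quick Start command can be generated programmatically. The sketch below only uses flags documented in this README; `build_command` is a hypothetical helper, and `--use_bernoulli_trials` is deliberately omitted so that the documented default applies.

```python
import shlex

# Dataset names as listed in the README.
DATASETS = ["scifact", "scidocs", "fiqa", "nfcorpus", "arguana"]


def build_command(dataset: str, emb1: str = "mistral", emb2: str = "openai",
                  overlap_ratio: float = 0.3, n_seeds: int = 15,
                  seed: int = 42) -> list[str]:
    """Return the argument list for one VecLink run on the given dataset."""
    return [
        "uv", "run", "veclink.py",
        "--dataset", dataset,
        "--emb1", emb1,
        "--emb2", emb2,
        "--overlap_ratio", str(overlap_ratio),
        "--n_seeds", str(n_seeds),
        "--seed", str(seed),
    ]


if __name__ == "__main__":
    # Print one shell-ready command per dataset without executing anything.
    for name in DATASETS:
        print(shlex.join(build_command(name)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the run. This assumes the matching embedding files for every dataset already exist in `embeddings/`.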

About

Code repository for the ICML 2026 paper "Vector Linking via Cross-Model Local Isometric Consistency".
