arXiv 2025 · [Paper] · [Dataset (HuggingFace)]
A retrieval benchmark evaluating whether embedding models can bridge idiomatic and literal expressions: matching queries to documents that share the same conceptual meaning regardless of figurative vs. literal usage.
107 idioms | 10,700 documents | 2,140 queries | 24 models | 4 query configurations
git clone <repo-url>
cd IdioLink
pip install -r requirements.txt
python download_data.py
python run_dense.py --model sentence-transformers/all-MiniLM-L6-v2 --query_mode sentence
Expected output: R-Precision and nDCG@10 on the test set (1,500 queries, 7,500 documents).
Hardware: Models with 7B+ parameters require a GPU with ≥24 GB VRAM.
This repo ships with Claude Code skills for streamlined experiment workflows. Type / in Claude Code to see them.
| Command | What it does | Example |
|---|---|---|
| /onboard | Install deps, verify env, explain project | /onboard |
| /run-experiment | Run one model × one mode | /run-experiment bge-m3 sentence |
| /run-all | Run full experiment grid | /run-all --debug |
| /fine-tune | Contrastive fine-tuning | /fine-tune e5-base-v2 sentence |
| /evaluate | View/regenerate metrics | /evaluate bge-m3 span |
| /reproduce-paper | Full paper reproduction | /reproduce-paper |
| /debug-model | Diagnose model issues | /debug-model GritLM/GritLM-7B |
Skills are defined in .claude/skills/ and work for any collaborator who clones the repo.
Research question: Can embedding models retrieve documents that share conceptual meaning with a query containing a potentially idiomatic expression (PIE), regardless of literal vs. figurative usage?
| Query type | Relevant documents |
|---|---|
| Literal query | All literal docs for the same PIE |
| Idiomatic query | All idiomatic + simplification + sense docs for the same PIE |
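The relevance rule in the table above can be sketched as a small filter. This is a minimal illustration only: the field names `id`, `pie`, and `usage` are assumptions made for the example and may not match the dataset schema documented in data/README.md.

```python
# Relevance rule from the table above.
# Field names ("id", "pie", "usage") are illustrative assumptions.
LITERAL = {"literal"}
FIGURATIVE = {"idiomatic", "simplification", "sense"}

def relevant_doc_ids(query, docs):
    """Return the ids of documents that count as relevant for `query`."""
    wanted = LITERAL if query["usage"] == "literal" else FIGURATIVE
    return {
        d["id"]
        for d in docs
        if d["pie"] == query["pie"] and d["usage"] in wanted
    }

docs = [
    {"id": "d1", "pie": "break the ice", "usage": "literal"},
    {"id": "d2", "pie": "break the ice", "usage": "idiomatic"},
    {"id": "d3", "pie": "break the ice", "usage": "sense"},
    {"id": "d4", "pie": "spill the beans", "usage": "idiomatic"},
]
query = {"pie": "break the ice", "usage": "idiomatic"}
print(sorted(relevant_doc_ids(query, docs)))  # ['d2', 'd3']
```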
- R-Precision: precision at R, where R = number of relevant documents for the query
- nDCG@10: Normalized Discounted Cumulative Gain at rank 10
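For reference, both metrics can be computed from a ranked list of document ids and a set of relevant ids. The sketch below assumes binary relevance and the standard log2 discount; the repository's own implementation may differ in details such as tie handling.

```python
import math

def r_precision(ranked_ids, relevant_ids):
    """Fraction of the top-R results that are relevant, with R = |relevant|."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant_ids)
    return hits / r

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k with a log2(rank + 1) discount (1-based rank)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based here
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d2", "d9", "d3", "d7"]
relevant = {"d2", "d3"}
print(r_precision(ranked, relevant))  # 0.5
print(round(ndcg_at_k(ranked, relevant), 4))
```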
Each metrics.json also carries two diagnostic breakdowns:
- by_usage (recommended diagnostic): the headline metrics restricted to literal-only and idiomatic-only queries. This is the primary breakdown for understanding model behavior across PIE usage types.
- by_subject (supplementary sanity check): a topical-coherence signal using shared subject as binary relevance. Intentionally weaker than the idiom-relevance gold: it ignores idiom identity and usage type. Use for diagnostic comparison only, not as a headline metric.
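A usage-type breakdown of this kind amounts to averaging per-query scores within each group. The sketch below assumes a hypothetical list of per-query rows with `usage`, `r_precision`, and `ndcg@10` keys; the actual layout of metrics.json is not specified here.

```python
from collections import defaultdict
from statistics import mean

def breakdown_by_usage(per_query):
    """Average per-query metrics separately for each query usage type.

    `per_query` is an assumed layout: a list of dicts with the keys
    "usage", "r_precision", and "ndcg@10".
    """
    groups = defaultdict(list)
    for row in per_query:
        groups[row["usage"]].append(row)
    return {
        usage: {
            "r_precision": mean(r["r_precision"] for r in rows),
            "ndcg@10": mean(r["ndcg@10"] for r in rows),
            "n_queries": len(rows),
        }
        for usage, rows in groups.items()
    }

per_query = [
    {"usage": "literal", "r_precision": 0.5, "ndcg@10": 0.5},
    {"usage": "literal", "r_precision": 1.0, "ndcg@10": 1.0},
    {"usage": "idiomatic", "r_precision": 0.25, "ndcg@10": 0.5},
]
print(breakdown_by_usage(per_query)["literal"]["r_precision"])  # 0.75
```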
| Type | Description | Share |
|---|---|---|
| literal | PIE used with its word-by-word meaning | 40% |
| idiomatic | PIE used figuratively | 20% |
| simplification | Paraphrase of the idiomatic meaning | 20% |
| sense | Dictionary-sense rephrasing | 20% |
| Split | PIEs | Documents | Queries |
|---|---|---|---|
| Train | 22 | 2,200 | 440 |
| Val | 10 | 1,000 | 200 |
| Test | 75 | 7,500 | 1,500 |
| Total | 107 | 10,700 | 2,140 |
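As a quick consistency check on the split table, the per-split counts sum to the stated totals, and every split works out to 100 documents and 20 queries per PIE. The snippet only re-derives numbers from the table above.

```python
# (PIEs, documents, queries) per split, copied from the table above.
splits = {
    "train": (22, 2200, 440),
    "val": (10, 1000, 200),
    "test": (75, 7500, 1500),
}

totals = tuple(sum(column) for column in zip(*splits.values()))
print(totals)  # (107, 10700, 2140)

# Documents and queries per PIE in each split.
per_pie = {
    name: (docs / pies, queries / pies)
    for name, (pies, docs, queries) in splits.items()
}
print(per_pie["test"])  # (100.0, 20.0)
```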
- 10 subject domains (Politics, Sport, Business, Technology, Science, Education, Health, Entertainment, Environment, Daily Life)
- Quality tiers: gold (human-verified) and silver (LLM-validated)
- Dataset on HuggingFace: Intellexus/IdioLink
- See data/README.md for full schema documentation
Four zero-shot query configurations. Documents are always indexed as full sentences; only query encoding changes.
| Config | Query Encoding | Runner |
|---|---|---|
| sentence | Full sentence | run_dense.py |
| span | Late-chunking on idiom span | run_dense.py |
| instruction_sentence | Instruction prefix + full sentence | run_instruction.py |
| instruction_span | Instruction prefix + late-chunking span | run_instruction.py |
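The span configurations rely on late chunking: the full sentence is encoded once, and only the token embeddings that fall inside the idiom span are pooled into the query vector. The sketch below illustrates the idea with toy vectors. Mean pooling and character-offset alignment are assumptions made for the example, not a description of what run_dense.py does internally.

```python
def span_pool(token_embeddings, token_offsets, span_start, span_end):
    """Mean-pool embeddings of tokens overlapping [span_start, span_end).

    token_embeddings: one vector per token, produced from the FULL sentence.
    token_offsets: (start, end) character offsets for each token.
    Mean pooling is an assumption for illustration.
    """
    selected = [
        emb
        for emb, (start, end) in zip(token_embeddings, token_offsets)
        if start < span_end and end > span_start
    ]
    if not selected:
        raise ValueError("span does not overlap any token")
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]

sentence = "They broke the ice quickly"
span = "broke the ice"
start = sentence.index(span)
end = start + len(span)

# Toy contextual embeddings, one per whitespace token.
offsets = [(0, 4), (5, 10), (11, 14), (15, 18), (19, 26)]
embeddings = [[1, 0], [2, 2], [4, 0], [0, 4], [9, 9]]
print(span_pool(embeddings, offsets, start, end))  # [2.0, 2.0]
```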
Instruction template:
"Based on the literal/idiomatic usage of the span '{span}' in the query, retrieve documents that contain a span conveying the same conceptual meaning."
Plus:
- BM25 lexical baseline (run_bm25.py)
- Contrastive fine-tuning for 5 models × 3 seeds (run_fine_tune.py)
python run_bm25.py --query_mode sentence # default params
python run_bm25.py --query_mode sentence --tune # grid search on val
python run_bm25.py --query_mode span --tune

python run_dense.py --model BAAI/bge-m3 --query_mode sentence
python run_dense.py --model BAAI/bge-m3 --query_mode span

python run_instruction.py --model intfloat/multilingual-e5-large-instruct --query_mode instruction_sentence
python run_instruction.py --model intfloat/multilingual-e5-large-instruct --query_mode instruction_span

python run_all.py # all 24 models × 4 modes
python run_all.py --debug # smoke test (5 queries)
python run_all.py --models BAAI/bge-m3 intfloat/e5-base-v2 # subset

run_ablation.py runs the two-preset ablation (lit_sim_sense and lit_idiom) across the model matrix. It is resumable: per-(preset, model, mode) metrics.json files are written atomically, and re-running skips combos that already exist on disk. After every model, the aggregated CSV is rebuilt from disk so partial runs are always consistent.
python run_ablation.py # all <7B models × both presets × 4 modes
python run_ablation.py --debug # smoke test (5 queries)
python run_ablation.py --models BAAI/bge-m3
python run_ablation.py --presets lit_idiom
python run_ablation.py --no_bm25 # skip BM25 baseline
python run_ablation.py --force # recompute even if metrics exist

Two presets:
- lit_sim_sense: drop idiomatic docs (keep literal + simplification + sense)
- lit_idiom: drop simplification + sense paraphrases (keep literal + idiomatic)
Results land under results/ablation/<preset>/{<model_slug>|bm25}/<mode>/metrics.json and the rebuilt aggregate is results/ablation/full_results.csv. For one-off ad-hoc ablations on a single (model, mode), run_dense.py and run_bm25.py accept the same --index_filter <preset|csv-list> flag and write to the same directory layout.
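The resumable behaviour of the ablation runner (atomic writes, skipping combos that already exist, `--force` to recompute) follows a common pattern that can be sketched as below. The helper names `write_json_atomic` and `run_combo` are hypothetical and are not taken from run_ablation.py.

```python
import json
import os
import tempfile
from pathlib import Path

def write_json_atomic(path, payload):
    """Write JSON to a temp file in the target directory, then rename it.

    os.replace is atomic on the same filesystem, so readers never see a
    half-written metrics.json.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise

def run_combo(root, preset, model_slug, mode, compute, force=False):
    """Run one (preset, model, mode) combo unless its metrics already exist.

    Returns True if metrics were computed, False if the combo was skipped.
    """
    out = Path(root) / preset / model_slug / mode / "metrics.json"
    if out.exists() and not force:
        return False
    write_json_atomic(out, compute())
    return True

with tempfile.TemporaryDirectory() as root:
    compute = lambda: {"ndcg@10": 0.0}  # placeholder metrics, not real results
    print(run_combo(root, "lit_idiom", "bge-m3", "sentence", compute))  # True
    print(run_combo(root, "lit_idiom", "bge-m3", "sentence", compute))  # False
```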
results/
├── zero_shot/{model_slug}/{mode}/metrics.json
├── bm25/{mode}/metrics.json
├── ablation/{preset}/{model_slug}/{mode}/metrics.json
├── ablation/full_results.csv
├── fine_tuning/{model_slug}/{mode}/seed_{N}/metrics.json
└── full_results.csv
All CLI arguments override values from config.yaml.
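Given the results layout above, zero-shot metrics can be gathered by walking the `zero_shot/{model_slug}/{mode}/metrics.json` pattern. The helper below is a hypothetical sketch that treats each metrics.json as an opaque dictionary, since its exact keys are not specified here.

```python
import json
import tempfile
from pathlib import Path

def collect_zero_shot(results_dir):
    """Load every zero_shot/{model_slug}/{mode}/metrics.json under results_dir."""
    rows = []
    for path in sorted(Path(results_dir).glob("zero_shot/*/*/metrics.json")):
        mode = path.parent.name
        model_slug = path.parent.parent.name
        with path.open(encoding="utf-8") as handle:
            metrics = json.load(handle)
        rows.append({"model_slug": model_slug, "mode": mode, "metrics": metrics})
    return rows

# Demo on a throwaway directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "zero_shot" / "bge-m3" / "span"
    target.mkdir(parents=True)
    (target / "metrics.json").write_text(json.dumps({"placeholder": 1}))
    for row in collect_zero_shot(tmp):
        print(row["model_slug"], row["mode"])  # bge-m3 span
```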
Results will be updated upon paper publication. Run python run_all.py to reproduce.
| Model | Size | Sentence | Span | Instr+Sent | Instr+Span |
|---|---|---|---|---|---|
| SBERT | 110M | -- | -- | -- | -- |
| Contriever | 110M | -- | -- | -- | -- |
| E5-base-v2 | 110M | -- | -- | -- | -- |
| TART | 110M | -- | -- | -- | -- |
| BGE-base | 326M | -- | -- | -- | -- |
| Instructor-base | 335M | -- | -- | -- | -- |
| Nomic-v2 | 475M | -- | -- | -- | -- |
| Multilingual-E5-large | 560M | -- | -- | -- | -- |
| BGE-M3 | 568M | -- | -- | -- | -- |
| Qwen3-Embed-0.6B | 600M | -- | -- | -- | -- |
| DRAMA-1B | 1B | -- | -- | -- | -- |
| Stella-1.5B | 1.5B | -- | -- | -- | -- |
| Instructor-xl | 1.5B | -- | -- | -- | -- |
| Lychee-embed | 1.5B | -- | -- | -- | -- |
| GTE-Qwen2-1.5B | 1.5B | -- | -- | -- | -- |
| Qwen3-Embed-4B | 4B | -- | -- | -- | -- |
| Linq-Embed-Mistral | 7B | -- | -- | -- | -- |
| SFR-Embedding-Mistral | 7B | -- | -- | -- | -- |
| E5-Mistral-7B | 7B | -- | -- | -- | -- |
| GritLM-7B | 7B | -- | -- | -- | -- |
| GTE-Qwen2-7B | 7B | -- | -- | -- | -- |
| Qwen3-Embed-8B | 8B | -- | -- | -- | -- |
| Nemotron-8B | 8B | -- | -- | -- | -- |
| BGE-Gemma2 | 9B | -- | -- | -- | -- |
Columns show nDCG@10. See full results with R-Precision in results/full_results.csv.
Models: SBERT, DRAMA-1B, E5-base-v2, BGE-M3, Qwen3-Embedding-0.6B
Protocol:
- InfoNCE loss, temperature = 0.05
- AdamW, lr = 2e-5, linear warmup (100 steps)
- Early stopping: patience = 3, monitor val nDCG@10
- Seeds: 42, 43, 44 (results reported as mean ± std)
- Hard negatives: same PIE, opposite usage type
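The InfoNCE objective in the protocol above can be written out for a single query as a cross-entropy over one positive and a set of negatives (here, documents with the same PIE but the opposite usage type). The sketch below is dependency-free and assumes cosine similarity as the scoring function, which the protocol does not state explicitly.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's logit.

    Cosine similarity is an assumption; temperature = 0.05 follows the
    protocol above.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]

query = [1.0, 0.0]
aligned = [2.0, 0.0]      # same direction as the query
orthogonal = [0.0, 3.0]   # unrelated direction

easy = info_nce(query, aligned, [orthogonal])
hard = info_nce(query, orthogonal, [aligned])
print(easy < hard)  # True
```

With a temperature this low, a cosine gap of 1.0 becomes a logit gap of 20, so the loss is close to zero when the positive is ranked first and close to 20 when a negative outranks it.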
python run_fine_tune.py --model intfloat/e5-base-v2 --mode sentence --seeds 42 43 44
python run_fine_tune.py --model BAAI/bge-m3 --mode span --seeds 42 43 44

Full paper reproduction in 4 steps:
# 1. BM25 baselines (tuned on validation)
python run_bm25.py --query_mode sentence --tune
python run_bm25.py --query_mode span --tune
# 2. Zero-shot dense retrieval (all 24 models × 4 modes)
python run_all.py
# 3. Fine-tuning (5 models × 4 modes × 3 seeds)
python run_fine_tune.py --model sentence-transformers/all-MiniLM-L6-v2 --mode sentence
python run_fine_tune.py --model sentence-transformers/all-MiniLM-L6-v2 --mode span
# ... (repeat for all 5 models × 4 modes)
# 4. Generate paper tables and figures
python analysis/generate_zero_shot_table.py
python analysis/generate_finetuning_table.py
python analysis/generate_dataset_stats.py
python analysis/plot_performance.py
python analysis/generate_variant_tables.py # per-variant table with by_usage and by_subject splits
python analysis/generate_ablation_table.py # per-mode ablation tables (requires run_ablation.py)
python analysis/lexical_overlap.py # keyword overlap diagnostic

Or use the skill: /reproduce-paper
All 24 evaluated models:
| # | Model | HF ID | Size | Class | Instruction |
|---|---|---|---|---|---|
| 1 | SBERT | sentence-transformers/all-MiniLM-L6-v2 | 110M | sentence_transformer | - |
| 2 | Contriever | facebook/contriever | 110M | sentence_transformer | - |
| 3 | E5-base-v2 | intfloat/e5-base-v2 | 110M | sentence_transformer | e5_inline |
| 4 | TART | orionweller/tart-dual-contriever-msmarco | 110M | instruction | tart_sep |
| 5 | BGE-base | BAAI/bge-base-en-v1.5 | 326M | instruction | prompt_prefix |
| 6 | Instructor-base | hkunlp/instructor-base | 335M | instruction | instructor_pairs |
| 7 | Nomic-v2 | nomic-ai/nomic-embed-text-v2-moe | 475M | instruction | nomic_prefix |
| 8 | Multilingual-E5-large | intfloat/multilingual-e5-large-instruct | 560M | instruction | e5_inline |
| 9 | BGE-M3 | BAAI/bge-m3 | 568M | sentence_transformer | - |
| 10 | Qwen3-Embed-0.6B | Qwen/Qwen3-Embedding-0.6B | 600M | qwen | e5_inline_no_space |
| 11 | DRAMA-1B | facebook/drama-1b | 1B | sentence_transformer | - |
| 12 | Stella-1.5B | NovaSearch/stella-en-1.5B-v5 | 1.5B | sentence_transformer | - |
| 13 | Instructor-xl | hkunlp/instructor-xl | 1.5B | instruction | instructor_pairs |
| 14 | Lychee-embed | vec-ai/lychee-embed | 1.5B | instruction | e5_inline_no_space |
| 15 | GTE-Qwen2-1.5B | Alibaba-NLP/gte-Qwen2-1.5B-instruct | 1.5B | qwen | e5_inline |
| 16 | Qwen3-Embed-4B | Qwen/Qwen3-Embedding-4B | 4B | qwen | e5_inline_no_space |
| 17 | Linq-Embed-Mistral | Linq-AI-Research/Linq-Embed-Mistral | 7B | instruction | e5_inline_no_space |
| 18 | SFR-Embedding-Mistral | Salesforce/SFR-Embedding-Mistral | 7B | instruction | e5_inline |
| 19 | E5-Mistral-7B | intfloat/e5-mistral-7b-instruct | 7B | instruction | e5_inline |
| 20 | GritLM-7B | GritLM/GritLM-7B | 7B | gritlm | instructor_pairs |
| 21 | GTE-Qwen2-7B | Alibaba-NLP/gte-Qwen2-7B-instruct | 7B | qwen | e5_inline |
| 22 | Qwen3-Embed-8B | Qwen/Qwen3-Embedding-8B | 8B | qwen | e5_inline_no_space |
| 23 | Nemotron-8B | nvidia/llama-embed-nemotron-8b | 8B | instruction | e5_inline_no_space |
| 24 | BGE-Gemma2 | BAAI/bge-multilingual-gemma2 | 9B | instruction | bge_gemma |
BGE-base-en-v1.5 uses its canonical pretrained prefix ("Represent this sentence for searching relevant passages: ") via the prompt_prefix instruction format. The Qwen3 family, Lychee, Linq, and Nemotron use the no-space Instruct: {task}\nQuery:{query} variant per their model cards; verify against each card's "Usage" section before paper publication.
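The difference between the two inline formats is a single space after `Query:`. The sketch below fills the benchmark's instruction template and renders both variants. The function name `format_query` is hypothetical, and the spaced `e5_inline` layout is assumed from the format's name and the E5 instruct convention; the other instruction formats in the table are not covered.

```python
TASK_TEMPLATE = (
    "Based on the literal/idiomatic usage of the span '{span}' in the query, "
    "retrieve documents that contain a span conveying the same conceptual meaning."
)

def format_query(query, span, style):
    """Render an instruction-prefixed query (hypothetical helper).

    Only the two inline styles are sketched; the spaced variant is an
    assumption based on the E5 instruct convention.
    """
    task = TASK_TEMPLATE.format(span=span)
    if style == "e5_inline":
        return f"Instruct: {task}\nQuery: {query}"
    if style == "e5_inline_no_space":
        return f"Instruct: {task}\nQuery:{query}"
    raise ValueError(f"unsupported instruction style: {style}")

text = "The host told a joke to break the ice."
print(format_query(text, "break the ice", "e5_inline_no_space"))
```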
The benchmark data is pre-generated in data/. To reproduce the dataset from scratch (requires Gemini API key):
- MAGPIE corpus filtering (min_occurrences=30, ambiguity 25-75%)
- Variant generation via Gemini (10 domains × 4 usage types)
- LLM validation (3× majority vote)
- Stratified splitting (22/10/75 PIEs for train/val/test)
See data_generation/README.md for full instructions.
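Two of the generation steps above reduce to simple rules: the MAGPIE filter and the 3× majority vote. The sketch below shows one plausible reading. Treating "ambiguity" as the idiomatic share of a PIE's occurrences, and using inclusive bounds, are assumptions; the function names are hypothetical.

```python
def keep_pie(n_idiomatic, n_literal, min_occurrences=30, low=0.25, high=0.75):
    """Keep a PIE if it is frequent enough and genuinely ambiguous.

    Assumption: "ambiguity" is the share of idiomatic occurrences, and the
    25-75% bounds are inclusive.
    """
    total = n_idiomatic + n_literal
    if total < min_occurrences:
        return False
    idiomatic_share = n_idiomatic / total
    return low <= idiomatic_share <= high

def majority_vote(votes):
    """Accept a generated variant if more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)

print(keep_pie(20, 20))                     # True: 40 occurrences, 50% idiomatic
print(keep_pie(90, 10))                     # False: 90% idiomatic
print(majority_vote([True, True, False]))   # True
```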
If you use IdioLink in your research, please cite:
BibTeX:
@misc{hashiloni2026idiolinkretrievingmeaningwords,
title={IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions},
author={Kai Golan Hashiloni and Daniel Fadlon and Lior Livyatan and Ofri Hefetz and Jiahuan Pei and Kfir Bar},
year={2026},
eprint={2605.22247},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.22247},
}

APA:
Hashiloni, K. G., Fadlon, D., Livyatan, L., Hefetz, O., Pei, J., & Bar, K. (2026). IdioLink: Retrieving meaning beyond words across idiomatic and literal expressions. arXiv. https://arxiv.org/abs/2605.22247
Paper under review. Citation will be updated upon publication.
Code: Apache 2.0. See LICENSE.
Data: CC BY-NC-SA 4.0
Kai Golan Hashiloni et al. (Intellexus Project)
For questions or contributions: kai.golanhashiloni@post.runi.ac.il · daniel.fadlon@post.runi.ac.il