lexEN is a lexicographer-reviewed English Word Sense Disambiguation (WSD) benchmark derived from Maru et al. 2022 ALL_NEW / ALLamended. It is designed for evaluating systems that assign WordNet 3.0 sense keys to ambiguous English words in context.
The benchmark keeps the standard Raganato-style evaluation format while adding a richer canonical JSONL artifact with source-label lineage, professional review evidence, and Glite coarse-sense mappings.
Current release: lexen-v1.
Use lexEN to evaluate English all-words WSD systems against a benchmark whose highest-risk labels were manually checked by professional lexicographers.
The primary scoring label is labels.lexen_gold. Each retained item
also keeps:
labels.raganato_original: the original Raganato ALL label for the same instance ID.labels.maru2022: the Maru2022 ALLamended source label.labels.lexen_gold: the lexEN label after applying the three-reviewer policy.review: review provenance for audited items.glite: coarse Glite concept mappings for labels, reviewer selections, and candidate senses, using the public Maru2022/lexEN coarsening subset.
| Count | Value |
|---|---|
| Maru2022 source items | 4,917 |
| Retained lexEN benchmark items | 4,861 |
| Suspicious items reviewed by lexicographers | 363 |
| Unreviewed items kept from Maru2022 | 4,554 |
| Reviewed retained items | 307 |
| Reviewed items removed | 56 |
| Retained labels changed from Maru2022 | 211 |
Reviewed items are removed when at least two reviewers mark the item unanswerable, or when no fine-grained sense set receives support from at least two reviewers.
lexEN starts from Maru et al. 2022 ALL_NEW / ALLamended:
- Paper: Nibbling at the Hard Core of Word Sense Disambiguation (local PDF, BibTeX)
- Source repository: SapienzaNLP/wsd-hard-benchmark
- Source files in this repository:
sources/maru2022/original/files/
The original Raganato ALL XML and gold-key files are also preserved under
sources/raganato/original/files/ so every lexEN item can carry
both the original Raganato label and the Maru2022 label.
The construction process is:
- Take Maru2022 ALLamended as the base benchmark.
- Run a WSD prediction panel over all 4,917 Maru2022 items.
- Flag 363 suspicious items where the model panel strongly disagrees with the Maru2022 label using the selection procedure.
- Review those 363 items at marureview.com.
- Give the same reviewer brief (repository copy) to three professional lexicographers: Robert Farren, Patrick White, and Penny Hands.
- Apply the three-reviewer policy:
- keep the Maru2022 label for unreviewed items;
- replace the label when at least two reviewers choose the same non-empty sense set;
- remove the reviewed item when at least two reviewers mark it unanswerable;
- remove the reviewed item when no fine-grained sense set receives two-reviewer support.
- Build the canonical JSONL artifact, Raganato-compatible export, SenseBench export, and removed-items audit sidecar.
Selection: a model panel of three families - SANDWiCH; GPT-5.5; and a CatBoost ensemble of WSD models (ConSeC, ESCHER, BEM, MFS, gpt-5-mini) - flagged 363 items for review via an S1-S6 waterfall (S1 55, S2 138, S3 41, S4 75, S5 48, S6 6; not flagged 4,554).
The S1-S6 waterfall classes are recomputed from committed source predictions in
sources/model-panel/. The GPT prediction files keep prompts and normalized
outputs for reproducibility, but omit transport telemetry such as provider request IDs, token
counts, costs, and latency. Operationally, bucket membership is computed from the eight GPT-5.5
variants; SANDWiCH and CatBoost participate in the modal vote, and S1 requires both to agree with
the GPT-5.5 consensus.
Selection source:
- Generated selection package:
sources/selection/lexicographer_review.jsonl.gz - Selection documentation:
docs/selection.md - Verification script:
scripts/build_selection_source.py
Selection counts:
| Suspicion Set | Items |
|---|---|
| S1 | 55 |
| S2 | 138 |
| S3 | 41 |
| S4 | 75 |
| S5 | 48 |
| S6 | 6 |
| Total reviewed | 363 |
The review website is https://marureview.com/.
The exact reviewer-facing brief is preserved in:
Raw review exports are stored in:
The three-reviewer agreement report is available in:
- HTML:
reports/rf-pw-ph-2026-06-13/lexicographer_agreement_2026_06_13.html - PDF:
reports/rf-pw-ph-2026-06-13/lexicographer_agreement_2026_06_13.pdf - Markdown:
reports/rf-pw-ph-2026-06-13/lexicographer_agreement_2026_06_13.md - Metrics JSON:
reports/rf-pw-ph-2026-06-13/metrics.json - Per-item agreement JSONL:
reports/rf-pw-ph-2026-06-13/per_item_agreement.jsonl
Primary artifact:
Each line is one retained benchmark item. The schema is documented in
docs/schema.md.
Additional release files:
| Path | Purpose |
|---|---|
data/lexen-v1/dataset.json |
Release metadata, counts, source hashes, output hashes, and label policy. |
data/lexen-v1/reviews.jsonl |
Normalized RF/PW/PH review evidence for all 363 reviewed items. |
exports/raganato/lexen-v1/lexen-v1.data.xml |
Raganato-style XML with removed reviewed targets excluded as scoring instances. |
exports/raganato/lexen-v1/lexen-v1.gold.key.txt |
Raganato-style gold key using lexen_gold. |
exports/raganato/lexen-v1/lexen-v1.removed.json |
Audit sidecar for reviewed items removed from scoring. |
exports/sensebench/lexen-v1/items.jsonl |
Compact SenseBench-compatible export. |
sources/manifest.json |
Machine-readable source manifest with package-level provenance and hashes. |
Read the canonical JSONL artifact:
import json
from pathlib import Path
items_path = Path("data/lexen-v1/items.jsonl")
items = []
with items_path.open(encoding="utf-8") as handle:
for line in handle:
items.append(json.loads(line))
print(len(items))
print(items[0]["item_id"], items[0]["labels"]["lexen_gold"]["sense_keys"])Create a simple scoring lookup:
gold_by_id = {
item["item_id"]: item["labels"]["lexen_gold"]["sense_keys"]
for item in items
}Use the Raganato-compatible export with existing WSD evaluation tools:
python your_wsd_system.py \
--input exports/raganato/lexen-v1/lexen-v1.data.xml \
--output predictions.key.txt
python your_scorer.py \
--gold exports/raganato/lexen-v1/lexen-v1.gold.key.txt \
--predictions predictions.key.txtThis is an abbreviated retained item from data/lexen-v1/items.jsonl.
The Maru2022 and original Raganato labels were field%1:17:00::; two reviewers selected
field%1:15:00::, so lexEN uses that sense as lexen_gold.
{
"schema_version": "lexen.item.v2",
"item_id": "senseval2.d000.s003.t009",
"source_dataset": "senseval2",
"lemma": "field",
"pos": "NOUN",
"target": {
"text": "fields",
"token_index": 22
},
"context": {
"preceding_sentences": [
[
"The", "art", "of", "change-ringing", "is", "peculiar", "to",
"the", "English", ",", "and", ",", "like", "most", "English",
"peculiarities", ",", "unintelligible", "to", "the", "rest",
"of", "the", "world", "."
],
["Dorothy", "L.", "Sayers", ",", "``", "The", "Nine", "Tailors", "``"],
["ASLACTON", ",", "England", "--"]
],
"target_sentence": [
"Of", "all", "scenes", "that", "evoke", "rural", "England", ",",
"this", "is", "one", "of", "the", "loveliest", ":", "An", "ancient",
"stone", "church", "stands", "amid", "the", "fields", ",", "the",
"sound", "of", "bells", "cascading", "from", "its", "tower", ",",
"calling", "the", "faithful", "to", "evensong", "."
],
"following_sentences": [
[
"The", "parishioners", "of", "St.", "Michael", "and", "All",
"Angels", "stop", "to", "chat", "at", "the", "church", "door",
",", "as", "members", "here", "always", "have", "."
],
[
"In", "the", "tower", ",", "five", "men", "and", "women", "pull",
"rhythmically", "on", "ropes", "attached", "to", "the", "same",
"five", "bells", "that", "first", "sounded", "here", "in", "1614",
"."
]
]
},
"labels": {
"raganato_original": {
"sense_keys": ["field%1:17:00::"]
},
"maru2022": {
"sense_keys": ["field%1:17:00::"]
},
"lexen_gold": {
"decision": "two_of_three_sense_agreement",
"is_empty": false,
"sense_keys": ["field%1:15:00::"]
}
},
"review": {
"status": "reviewed",
"lexicographer_decision": "two_of_three_sense_agreement",
"release_disposition": "retained",
"suspicion_set": "S1"
},
"glite": {
"labels": {
"raganato_original": {
"concept_ids": ["ct:ct7f56A4ZLBVYeef5CpmNfield"],
"unmapped_sense_keys": []
},
"maru2022": {
"concept_ids": ["ct:ct7f56A4ZLBVYeef5CpmNfield"],
"unmapped_sense_keys": []
},
"lexen_gold": {
"concept_ids": ["ct:ct7f56A4ZLBVYeef5CpmNfield"],
"unmapped_sense_keys": []
}
}
}
}Install dependencies with uv, then run the release scripts:
uv run python scripts/build_selection_source.py --verify
uv run python scripts/build_source_manifest.py
uv run python scripts/build_release.py --release lexen-v1
uv run python scripts/verify_release.py --release lexen-v1
uv run pytestThe release verifier checks:
- source-manifest hashes
- release artifact hashes
- expected counts
- Raganato original label coverage
- Maru2022 source label coverage
lexen_goldpolicy decisions- removed-item exclusion from scoring exports
- Raganato XML and gold-key consistency
- SenseBench export consistency
- embedded review evidence
- Glite label and candidate mappings
- contamination-canary placement
Every lexEN release contains a high-entropy canary string for later training-data contamination
audits. In lexen-v1, it is stored in data/lexen-v1/dataset.json,
repeated in canonical item and review rows, carried in the SenseBench export metadata, and inserted
into the Raganato XML as an XML comment.
The canary is metadata. It should not be included in WSD prompts or used as model input.
lexEN is an English WordNet 3.0 WSD benchmark. It is not a replacement for multilingual WSD, entity-linking, dictionary-definition ranking, or general lexical-semantic evaluation.
Only the 363 model-panel-suspicious Maru2022 items were manually reviewed. The remaining 4,554 items keep the Maru2022 label. This is deliberate: lexEN is a targeted correction and audit layer over Maru2022, not a full reannotation of every Raganato item.
The Glite layer is a separate coarse-sense view. The fine-grained
WordNet scoring label remains labels.lexen_gold.sense_keys.
Start with the canonical dataset in data/lexen-v1/items.jsonl. It is
the richest artifact: each retained item includes original Raganato labels, Maru2022 labels, lexEN
labels, review evidence where available, full context, WordNet candidates, and the Glite coarsening
layer. Use data/lexen-v1/dataset.json for release counts, label
policy, hashes, and provenance pointers.
For standard WSD scoring, use the Raganato export in
exports/raganato/lexen-v1/. It contains the XML input file, the
lexEN gold key, and a removed-items sidecar for audited targets excluded from scoring. For compact
JSONL evaluation pipelines, use the SenseBench export in
exports/sensebench/lexen-v1/items.jsonl.
To understand how the benchmark was built, read docs/provenance.md,
docs/selection.md, and docs/review-protocol.md.
The upstream Raganato and Maru2022 source files are preserved under
sources/raganato/original/ and
sources/maru2022/original/. The committed model-panel predictions
and generated suspicious-item package are in sources/model-panel/ and
sources/selection/.
For audit evidence, use the reviewer brief in
sources/reviews/protocols/marureview-brief-2026-05-26.md,
the raw RF/PW/PH review exports in sources/reviews/, the three-reviewer
agreement report in reports/rf-pw-ph-2026-06-13/, and the
machine-readable source manifest in sources/manifest.json.
The rebuild and verification entry points are in scripts/; the Python package they use
is in src/lexen/, and release-policy tests are in tests/.
If you use lexEN, cite this repository and the upstream Maru et al. 2022 benchmark:
@inproceedings{maru-etal-2022-nibbling,
title = {Nibbling at the Hard Core of Word Sense Disambiguation},
author = {Maru, Marco and Conia, Simone and Bevilacqua, Michele and Navigli, Roberto},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2022},
pages = {4724--4737}
}Repository citation metadata is provided in CITATION.cff.
lexEN separates software licensing from dataset-artifact licensing.
The Python package metadata in pyproject.toml declares Apache-2.0 because the
installable package contains only build, verification, report-generation, and test software. The
software license is LICENSE-CODE.
The dataset citation metadata in CITATION.cff uses CC-BY-NC-4.0 because the
benchmark artifacts include upstream Maru2022 data and are released for research / non-commercial
evaluation under source-package terms recorded in sources/manifest.json,
LICENSE, and NOTICE.
Maru2022 ALLamended is recorded as CC-BY-NC-4.0 in the source manifest. The Maru et al. paper files are recorded as CC-BY-4.0. Source-package hashes and terms are part of the release metadata.
DATASHEET.md: dataset motivation, composition, collection, use, and maintenance.docs/provenance.md: end-to-end dataset lineage.docs/selection.md: suspicious-item selection logic.docs/labeling-process.md: reviewer-facing labeling instructions.docs/review-protocol.md: three-reviewer label and removal policy.docs/reports.md: agreement report contents.docs/schema.md: canonical and export schemas.CHANGELOG.md: release history.CONTRIBUTING.md: issue and contribution process.CODE_OF_CONDUCT.md: community conduct expectations.SECURITY.md: sensitive-reporting process.