lexEN

lexEN is a lexicographer-reviewed English Word Sense Disambiguation (WSD) benchmark derived from Maru et al. 2022 ALL_NEW / ALLamended. It is designed for evaluating systems that assign WordNet 3.0 sense keys to ambiguous English words in context.

The benchmark keeps the standard Raganato-style evaluation format while adding a richer canonical JSONL artifact with source-label lineage, professional review evidence, and Glite coarse-sense mappings.

Current release: lexen-v1.

What This Dataset Is For

Use lexEN to evaluate English all-words WSD systems against a benchmark whose highest-risk labels were manually checked by professional lexicographers.

The primary scoring label is labels.lexen_gold. Each retained item also keeps:

labels.raganato_original: the original Raganato ALL label for the same instance ID.
labels.maru2022: the Maru2022 ALLamended source label.
labels.lexen_gold: the lexEN label after applying the three-reviewer policy.
review: review provenance for audited items.
glite: coarse Glite concept mappings for labels, reviewer selections, and candidate senses, using the public Maru2022/lexEN coarsening subset.

Release Counts

Count	Value
Maru2022 source items	4,917
Retained lexEN benchmark items	4,861
Suspicious items reviewed by lexicographers	363
Unreviewed items kept from Maru2022	4,554
Reviewed retained items	307
Reviewed items removed	56
Retained labels changed from Maru2022	211
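
The counts in the table are mutually consistent, which can be checked with a few lines of arithmetic. This is only a sanity sketch with the values copied by hand from the table above; the variable names are illustrative and are not field names taken from data/lexen-v1/dataset.json.

```python
# Values copied from the Release Counts table (variable names are illustrative).
source_items = 4917
reviewed = 363
unreviewed_kept = 4554
reviewed_retained = 307
reviewed_removed = 56
retained_total = 4861

# Every source item is either reviewed or kept unreviewed.
assert source_items == reviewed + unreviewed_kept
# Every reviewed item is either retained or removed.
assert reviewed == reviewed_retained + reviewed_removed
# The benchmark keeps the unreviewed items plus the reviewed items that survived.
assert retained_total == unreviewed_kept + reviewed_retained
```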

Reviewed items are removed when at least two reviewers mark the item unanswerable, or when no fine-grained sense set receives support from at least two reviewers.

How lexEN Was Built

lexEN starts from Maru et al. 2022 ALL_NEW / ALLamended:

Paper: Nibbling at the Hard Core of Word Sense Disambiguation (local PDF, BibTeX)
Source repository: SapienzaNLP/wsd-hard-benchmark
Source files in this repository: sources/maru2022/original/files/

The original Raganato ALL XML and gold-key files are also preserved under sources/raganato/original/files/ so every lexEN item can carry both the original Raganato label and the Maru2022 label.

The construction process is:

1. Take Maru2022 ALLamended as the base benchmark.
2. Run a WSD prediction panel over all 4,917 Maru2022 items.
3. Use the selection procedure to flag 363 suspicious items where the model panel strongly disagrees with the Maru2022 label.
4. Review those 363 items at marureview.com.
5. Give the same reviewer brief (repository copy) to three professional lexicographers: Robert Farren, Patrick White, and Penny Hands.
6. Apply the three-reviewer policy:
   - keep the Maru2022 label for unreviewed items;
   - replace the label when at least two reviewers choose the same non-empty sense set;
   - remove the reviewed item when at least two reviewers mark it unanswerable;
   - remove the reviewed item when no fine-grained sense set receives two-reviewer support.
7. Build the canonical JSONL artifact, Raganato-compatible export, SenseBench export, and removed-items audit sidecar.
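
The three-reviewer policy above can be sketched as a small decision function. This is a minimal illustration, not the code in src/lexen/: the function name, the use of `None` for an "unanswerable" review, and the assumption that two reviewers "choose the same sense set" only when their sets match exactly are all choices made for this example.

```python
from collections import Counter

def apply_policy(maru_label, reviews):
    """Return (disposition, sense_keys) for one item.

    maru_label: frozenset of Maru2022 sense keys.
    reviews: None for an unreviewed item, otherwise a list with one entry per
             reviewer; each entry is a frozenset of sense keys, or None when
             the reviewer marked the item unanswerable (hypothetical encoding).
    """
    # Unreviewed items keep the Maru2022 label.
    if reviews is None:
        return ("retained", maru_label)
    # Two or more "unanswerable" votes remove the item.
    if sum(r is None for r in reviews) >= 2:
        return ("removed", None)
    # Count identical non-empty sense sets (exact-match assumption).
    votes = Counter(r for r in reviews if r)
    if votes:
        best, count = votes.most_common(1)[0]
        if count >= 2:
            return ("retained", best)
    # No sense set reached two-reviewer support.
    return ("removed", None)

old = frozenset({"field%1:17:00::"})
new = frozenset({"field%1:15:00::"})
print(apply_policy(old, [new, new, old]))
```

With three reviewers at most one sense set can reach two votes, so taking the most common set is unambiguous.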

Suspicious-Item Selection

The model panel spans three families: SANDWiCH; GPT-5.5; and a CatBoost ensemble of WSD models (ConSeC, ESCHER, BEM, MFS, gpt-5-mini). It flagged 363 items for review via an S1-S6 waterfall (S1 55, S2 138, S3 41, S4 75, S5 48, S6 6; not flagged 4,554).

The S1-S6 waterfall classes are recomputed from committed source predictions in sources/model-panel/. The GPT prediction files keep prompts and normalized outputs for reproducibility, but omit transport telemetry such as provider request IDs, token counts, costs, and latency. Operationally, bucket membership is computed from the eight GPT-5.5 variants; SANDWiCH and CatBoost participate in the modal vote, and S1 requires both to agree with the GPT-5.5 consensus.

Selection source:

Generated selection package: sources/selection/lexicographer_review.jsonl.gz
Selection documentation: docs/selection.md
Verification script: scripts/build_selection_source.py

Selection counts:

Suspicion Set	Items
S1	55
S2	138
S3	41
S4	75
S5	48
S6	6
Total reviewed	363
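
A tally like the one above can be recomputed from item records. The sketch below assumes the `review.status` and `review.suspicion_set` fields shown in the example item later in this README, and assumes that unreviewed items either lack a `review` block or carry a different status. Run over data/lexen-v1/items.jsonl, it would presumably cover only the 307 reviewed retained items, because the 56 removed items are not in that file.

```python
from collections import Counter

def suspicion_counts(items):
    """Count reviewed items per suspicion set (S1..S6)."""
    counts = Counter()
    for item in items:
        review = item.get("review") or {}
        if review.get("status") == "reviewed":
            counts[review.get("suspicion_set")] += 1
    return counts

# Tiny hand-made records, for illustration only.
demo_items = [
    {"review": {"status": "reviewed", "suspicion_set": "S1"}},
    {"review": {"status": "reviewed", "suspicion_set": "S2"}},
    {"review": {"status": "reviewed", "suspicion_set": "S1"}},
    {"review": None},
    {},
]
print(suspicion_counts(demo_items))
```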

Lexicographer Review

The review website is https://marureview.com/.

The exact reviewer-facing brief is preserved in sources/reviews/protocols/marureview-brief-2026-05-26.md.

Raw RF/PW/PH review exports are stored in sources/reviews/.

The three-reviewer agreement report is available in:

HTML: reports/rf-pw-ph-2026-06-13/lexicographer_agreement_2026_06_13.html
PDF: reports/rf-pw-ph-2026-06-13/lexicographer_agreement_2026_06_13.pdf
Markdown: reports/rf-pw-ph-2026-06-13/lexicographer_agreement_2026_06_13.md
Metrics JSON: reports/rf-pw-ph-2026-06-13/metrics.json
Per-item agreement JSONL: reports/rf-pw-ph-2026-06-13/per_item_agreement.jsonl

Dataset Artifacts

Primary artifact:

data/lexen-v1/items.jsonl

Each line is one retained benchmark item. The schema is documented in docs/schema.md.

Additional release files:

Path	Purpose
`data/lexen-v1/dataset.json`	Release metadata, counts, source hashes, output hashes, and label policy.
`data/lexen-v1/reviews.jsonl`	Normalized RF/PW/PH review evidence for all 363 reviewed items.
`exports/raganato/lexen-v1/lexen-v1.data.xml`	Raganato-style XML with removed reviewed targets excluded as scoring instances.
`exports/raganato/lexen-v1/lexen-v1.gold.key.txt`	Raganato-style gold key using `lexen_gold`.
`exports/raganato/lexen-v1/lexen-v1.removed.json`	Audit sidecar for reviewed items removed from scoring.
`exports/sensebench/lexen-v1/items.jsonl`	Compact SenseBench-compatible export.
`sources/manifest.json`	Machine-readable source manifest with package-level provenance and hashes.

Loading The Dataset

Read the canonical JSONL artifact:

import json
from pathlib import Path

items_path = Path("data/lexen-v1/items.jsonl")

# Each non-empty line is one JSON object describing a retained benchmark item.
with items_path.open(encoding="utf-8") as handle:
    items = [json.loads(line) for line in handle if line.strip()]

print(len(items))  # number of retained items
print(items[0]["item_id"], items[0]["labels"]["lexen_gold"]["sense_keys"])

Create a simple scoring lookup:

gold_by_id = {
    item["item_id"]: item["labels"]["lexen_gold"]["sense_keys"]
    for item in items
}
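
With such a lookup, a simple scorer might look like the sketch below. It follows the common Raganato-style convention that a prediction counts as correct when it matches any of the gold sense keys for that instance. The function name and the shape of `predictions` (one sense key per item ID) are assumptions for this example, and it is not the scorer shipped with any particular toolkit.

```python
def score(gold_by_id, predictions):
    """Precision, recall, and F1 for single-key predictions.

    gold_by_id: dict mapping item_id -> list of acceptable sense keys.
    predictions: dict mapping item_id -> one predicted sense key; items the
                 system skipped are simply absent.
    """
    # Only predictions for known gold instances are counted.
    attempted = [i for i in predictions if i in gold_by_id]
    correct = sum(1 for i in attempted if predictions[i] in gold_by_id[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold_by_id) if gold_by_id else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hand-made data, for illustration only.
demo_gold = {"a": ["k1", "k2"], "b": ["k3"], "c": ["k4"]}
demo_preds = {"a": "k2", "b": "wrong"}
result = score(demo_gold, demo_preds)
print(result)
```

When a system answers every instance, precision, recall, and F1 coincide.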

Use the Raganato-compatible export with existing WSD evaluation tools:

python your_wsd_system.py \
  --input exports/raganato/lexen-v1/lexen-v1.data.xml \
  --output predictions.key.txt

python your_scorer.py \
  --gold exports/raganato/lexen-v1/lexen-v1.gold.key.txt \
  --predictions predictions.key.txt
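
If no scorer is at hand, the gold key can also be read directly. The sketch below assumes the usual Raganato key layout, one instance per line with the instance ID followed by one or more space-separated sense keys; the second demo line uses a made-up ID and made-up keys.

```python
def parse_key_lines(lines):
    """Map instance_id -> list of sense keys from Raganato-style key lines."""
    keys = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        keys[parts[0]] = parts[1:]
    return keys

demo_lines = [
    "senseval2.d000.s003.t009 field%1:15:00::",
    "",
    "hypothetical.d000.s000.t000 a%1:00:00:: b%1:00:00::",  # made-up entry
]
gold = parse_key_lines(demo_lines)
print(gold["senseval2.d000.s003.t009"])
```

To read the real file, pass an open handle for exports/raganato/lexen-v1/lexen-v1.gold.key.txt instead of `demo_lines`.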

Example Item

This is an abbreviated retained item from data/lexen-v1/items.jsonl. The Maru2022 and original Raganato labels were field%1:17:00::; two reviewers selected field%1:15:00::, so lexEN uses that sense as lexen_gold.

{
  "schema_version": "lexen.item.v2",
  "item_id": "senseval2.d000.s003.t009",
  "source_dataset": "senseval2",
  "lemma": "field",
  "pos": "NOUN",
  "target": {
    "text": "fields",
    "token_index": 22
  },
  "context": {
    "preceding_sentences": [
      [
        "The", "art", "of", "change-ringing", "is", "peculiar", "to",
        "the", "English", ",", "and", ",", "like", "most", "English",
        "peculiarities", ",", "unintelligible", "to", "the", "rest",
        "of", "the", "world", "."
      ],
      ["Dorothy", "L.", "Sayers", ",", "``", "The", "Nine", "Tailors", "``"],
      ["ASLACTON", ",", "England", "--"]
    ],
    "target_sentence": [
      "Of", "all", "scenes", "that", "evoke", "rural", "England", ",",
      "this", "is", "one", "of", "the", "loveliest", ":", "An", "ancient",
      "stone", "church", "stands", "amid", "the", "fields", ",", "the",
      "sound", "of", "bells", "cascading", "from", "its", "tower", ",",
      "calling", "the", "faithful", "to", "evensong", "."
    ],
    "following_sentences": [
      [
        "The", "parishioners", "of", "St.", "Michael", "and", "All",
        "Angels", "stop", "to", "chat", "at", "the", "church", "door",
        ",", "as", "members", "here", "always", "have", "."
      ],
      [
        "In", "the", "tower", ",", "five", "men", "and", "women", "pull",
        "rhythmically", "on", "ropes", "attached", "to", "the", "same",
        "five", "bells", "that", "first", "sounded", "here", "in", "1614",
        "."
      ]
    ]
  },
  "labels": {
    "raganato_original": {
      "sense_keys": ["field%1:17:00::"]
    },
    "maru2022": {
      "sense_keys": ["field%1:17:00::"]
    },
    "lexen_gold": {
      "decision": "two_of_three_sense_agreement",
      "is_empty": false,
      "sense_keys": ["field%1:15:00::"]
    }
  },
  "review": {
    "status": "reviewed",
    "lexicographer_decision": "two_of_three_sense_agreement",
    "release_disposition": "retained",
    "suspicion_set": "S1"
  },
  "glite": {
    "labels": {
      "raganato_original": {
        "concept_ids": ["ct:ct7f56A4ZLBVYeef5CpmNfield"],
        "unmapped_sense_keys": []
      },
      "maru2022": {
        "concept_ids": ["ct:ct7f56A4ZLBVYeef5CpmNfield"],
        "unmapped_sense_keys": []
      },
      "lexen_gold": {
        "concept_ids": ["ct:ct7f56A4ZLBVYeef5CpmNfield"],
        "unmapped_sense_keys": []
      }
    }
  }
}
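
A record like this can be inspected with a small helper. The sketch below assumes that `target.token_index` is a zero-based index into `context.target_sentence`, which matches the example (index 22 is "fields"). The helper name is illustrative, and the demo record is a cut-down copy of the item above.

```python
def describe(item):
    """Return the target sentence with the target bracketed, and whether
    lexen_gold differs from the Maru2022 label."""
    tokens = list(item["context"]["target_sentence"])
    index = item["target"]["token_index"]
    tokens[index] = f"[{tokens[index]}]"
    labels = item["labels"]
    changed = set(labels["lexen_gold"]["sense_keys"]) != set(
        labels["maru2022"]["sense_keys"]
    )
    return " ".join(tokens), changed

demo_item = {
    "target": {"text": "fields", "token_index": 22},
    "context": {
        "target_sentence": [
            "Of", "all", "scenes", "that", "evoke", "rural", "England", ",",
            "this", "is", "one", "of", "the", "loveliest", ":", "An",
            "ancient", "stone", "church", "stands", "amid", "the", "fields",
            ",",
        ]
    },
    "labels": {
        "maru2022": {"sense_keys": ["field%1:17:00::"]},
        "lexen_gold": {"sense_keys": ["field%1:15:00::"]},
    },
}
sentence, changed = describe(demo_item)
print(sentence)
print(changed)
```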

Reproducibility

Install dependencies with uv, then run the release scripts:

uv run python scripts/build_selection_source.py --verify
uv run python scripts/build_source_manifest.py
uv run python scripts/build_release.py --release lexen-v1
uv run python scripts/verify_release.py --release lexen-v1
uv run pytest

The release verifier checks:

source-manifest hashes
release artifact hashes
expected counts
Raganato original label coverage
Maru2022 source label coverage
lexen_gold policy decisions
removed-item exclusion from scoring exports
Raganato XML and gold-key consistency
SenseBench export consistency
embedded review evidence
Glite label and candidate mappings
contamination-canary placement
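
For a manual spot check of a single file hash, something like the following sketch could be used. It assumes the recorded hashes are SHA-256 hex digests, which this README does not state; check sources/manifest.json or data/lexen-v1/dataset.json for the algorithm actually used. The helper name is illustrative.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result can then be compared by hand against the value recorded for that path in the manifest.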

Contamination Canary

Every lexEN release contains a high-entropy canary string for later training-data contamination audits. In lexen-v1, it is stored in data/lexen-v1/dataset.json, repeated in canonical item and review rows, carried in the SenseBench export metadata, and inserted into the Raganato XML as an XML comment.

The canary is metadata. It should not be included in WSD prompts or used as model input.
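
One way to enforce this in an evaluation harness is a guard on every prompt before it is sent to a model. The sketch below is only an illustration: the function name is hypothetical, the placeholder canary is not the real string, and how the real canary is read from data/lexen-v1/dataset.json is left to the caller because its key name is not given here.

```python
def check_prompt(prompt, canary):
    """Return the prompt unchanged, or raise if it contains the canary."""
    if not canary:
        raise ValueError("canary string is empty")
    if canary in prompt:
        raise ValueError("prompt contains the contamination canary")
    return prompt

PLACEHOLDER_CANARY = "EXAMPLE-CANARY-0000"  # placeholder, not the real canary
safe = check_prompt("Which sense of 'fields' is meant here?", PLACEHOLDER_CANARY)
print(safe)
```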

Scope And Limitations

lexEN is an English WordNet 3.0 WSD benchmark. It is not a replacement for multilingual WSD, entity-linking, dictionary-definition ranking, or general lexical-semantic evaluation.

Only the 363 model-panel-suspicious Maru2022 items were manually reviewed. The remaining 4,554 items keep the Maru2022 label. This is deliberate: lexEN is a targeted correction and audit layer over Maru2022, not a full reannotation of every Raganato item.

The Glite layer is a separate coarse-sense view. The fine-grained WordNet scoring label remains labels.lexen_gold.sense_keys.

Artifact Guide

Start with the canonical dataset in data/lexen-v1/items.jsonl. It is the richest artifact: each retained item includes original Raganato labels, Maru2022 labels, lexEN labels, review evidence where available, full context, WordNet candidates, and the Glite coarsening layer. Use data/lexen-v1/dataset.json for release counts, label policy, hashes, and provenance pointers.

For standard WSD scoring, use the Raganato export in exports/raganato/lexen-v1/. It contains the XML input file, the lexEN gold key, and a removed-items sidecar for audited targets excluded from scoring. For compact JSONL evaluation pipelines, use the SenseBench export in exports/sensebench/lexen-v1/items.jsonl.

To understand how the benchmark was built, read docs/provenance.md, docs/selection.md, and docs/review-protocol.md. The upstream Raganato and Maru2022 source files are preserved under sources/raganato/original/ and sources/maru2022/original/. The committed model-panel predictions and generated suspicious-item package are in sources/model-panel/ and sources/selection/.

For audit evidence, use the reviewer brief in sources/reviews/protocols/marureview-brief-2026-05-26.md, the raw RF/PW/PH review exports in sources/reviews/, the three-reviewer agreement report in reports/rf-pw-ph-2026-06-13/, and the machine-readable source manifest in sources/manifest.json.

The rebuild and verification entry points are in scripts/; the Python package they use is in src/lexen/, and release-policy tests are in tests/.

Citation

If you use lexEN, cite this repository and the upstream Maru et al. 2022 benchmark:

@inproceedings{maru-etal-2022-nibbling,
  title = {Nibbling at the Hard Core of Word Sense Disambiguation},
  author = {Maru, Marco and Conia, Simone and Bevilacqua, Michele and Navigli, Roberto},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year = {2022},
  pages = {4724--4737}
}

Repository citation metadata is provided in CITATION.cff.

License

lexEN separates software licensing from dataset-artifact licensing.

The Python package metadata in pyproject.toml declares Apache-2.0 because the installable package contains only build, verification, report-generation, and test software. The software license is LICENSE-CODE.

The dataset citation metadata in CITATION.cff uses CC-BY-NC-4.0 because the benchmark artifacts include upstream Maru2022 data and are released for research / non-commercial evaluation under source-package terms recorded in sources/manifest.json, LICENSE, and NOTICE.

Maru2022 ALLamended is recorded as CC-BY-NC-4.0 in the source manifest. The Maru et al. paper files are recorded as CC-BY-4.0. Source-package hashes and terms are part of the release metadata.

Additional Documentation

DATASHEET.md: dataset motivation, composition, collection, use, and maintenance.
docs/provenance.md: end-to-end dataset lineage.
docs/selection.md: suspicious-item selection logic.
docs/labeling-process.md: reviewer-facing labeling instructions.
docs/review-protocol.md: three-reviewer label and removal policy.
docs/reports.md: agreement report contents.
docs/schema.md: canonical and export schemas.
CHANGELOG.md: release history.
CONTRIBUTING.md: issue and contribution process.
CODE_OF_CONDUCT.md: community conduct expectations.
SECURITY.md: sensitive-reporting process.
