feat: deterministic tokenizer training and config hashing (#17)
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your CodeRabbit review settings.
Walkthrough

Adds a new deterministic Byte-Level BPE tokenizer module with training and config hashing, consolidates SHA-256 hashing into a public utility used by manifest/Merkle logic, and adds tokenizer tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as "User / Test Runner"
    participant Trainer as "train_tokenizer (tokenizers lib)"
    participant FS as "Filesystem (save_path)"
    participant Hasher as "hash_tokenizer_config / utils.compute_sha256"
    User->>Trainer: provide text_file, vocab_size, min_frequency
    Trainer->>Trainer: preprocess input and train ByteLevelBPETokenizer
    Trainer->>FS: write vocab.json and merges.txt
    User->>Hasher: call hash_tokenizer_config(tokenizer_path)
    Hasher->>FS: read vocab.json and merges.txt
    Hasher->>Hasher: compute_sha256(vocab.json), compute_sha256(merges.txt)
    Hasher-->>User: return {tokenizer_vocab_hash, tokenizer_merges_hash, tokenizer_vocab_size}
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Pull request overview
Adds tokenizer reproducibility support to the verification pipeline by introducing deterministic Byte-Level BPE tokenizer training and cryptographic fingerprinting (SHA256) of tokenizer config artifacts.
Changes:
- Added `openverifiablellm.tokenizer` with `train_tokenizer()` and `hash_tokenizer_config()` (hashes `vocab.json`/`merges.txt`, derives vocab size from `vocab.json`).
- Refactored hashing utilities in `openverifiablellm.utils` (introduced a more flexible `compute_sha256`, updated Merkle hashing call sites).
- Added unit tests for tokenizer training determinism and config hash stability/change detection.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/test_tokenizer.py` | New tests covering tokenizer training determinism and tokenizer config hashing behavior. |
| `pyproject.toml` | Adds runtime deps, but currently placed in an invalid section and duplicates existing deps. |
| `openverifiablellm/utils.py` | Refactors SHA256 helper and Merkle root computation; updates manifest generation call sites accordingly. |
| `openverifiablellm/tokenizer.py` | New tokenizer training + config hashing utilities built on `tokenizers` and the shared SHA256 helper. |
openverifiablellm/utils.py (Outdated)

```python
if file_path is None and data is None:
    raise ValueError("Either file_path or data must be provided.")

sha256 = hashlib.sha256()

# If keyword data is used
if data is not None:
    sha256.update(data)
    return sha256.hexdigest()
```
`compute_sha256` is documented as hashing a file OR raw bytes, but it currently allows both `file_path` and `data` to be provided (it silently prefers `data`). This can hide call-site bugs and makes the API ambiguous; consider enforcing mutual exclusivity (raise if both are set), while still supporting the backward-compatible call patterns.
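A minimal sketch of the mutually exclusive API this comment suggests (the signature is assumed from the snippet above; the real helper lives in `openverifiablellm.utils` and may differ):

```python
import hashlib
from pathlib import Path

def compute_sha256(file_path=None, *, data=None, chunk_size=1 << 20):
    """Hash a file OR raw bytes, never both."""
    if (file_path is None) == (data is None):
        raise ValueError("Provide exactly one of file_path or data.")
    sha256 = hashlib.sha256()
    if data is not None:
        sha256.update(data)
    else:
        with Path(file_path).open("rb") as f:
            # Stream in chunks so large files are not loaded into memory at once.
            while chunk := f.read(chunk_size):
                sha256.update(chunk)
    return sha256.hexdigest()
```

Making `data` keyword-only keeps the existing `compute_sha256(path)` and `compute_sha256(data=...)` call patterns working while rejecting the ambiguous combination.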
tests/test_tokenizer.py (Outdated)

```diff
@@ -0,0 +1,95 @@
+import json
+import pytest
+from pathlib import Path
```
`Path` is imported but unused in this test module. If linting (e.g., ruff/pyflakes) is enabled in CI later, this will fail; remove the unused import or use it explicitly.

```diff
-from pathlib import Path
```
tests/test_tokenizer.py (Outdated)

```python
"Wikipedia is a free online encyclopedia.\n"
"It is written collaboratively by volunteers.\n"
"Anyone can edit Wikipedia articles.\n"
"Wikipedia was launched on January 15 2001.\n"
"It is one of the most popular websites in the world.\n" * 100,
```
In `sample_text_file`, the `* 100` only applies to the final string literal due to Python's string-literal concatenation rules. If the intention is to repeat the entire multi-line sample 100x, wrap the whole block in parentheses before multiplying to avoid accidentally skewing the corpus.
```diff
-"Wikipedia is a free online encyclopedia.\n"
-"It is written collaboratively by volunteers.\n"
-"Anyone can edit Wikipedia articles.\n"
-"Wikipedia was launched on January 15 2001.\n"
-"It is one of the most popular websites in the world.\n" * 100,
+(
+    "Wikipedia is a free online encyclopedia.\n"
+    "It is written collaboratively by volunteers.\n"
+    "Anyone can edit Wikipedia articles.\n"
+    "Wikipedia was launched on January 15 2001.\n"
+    "It is one of the most popular websites in the world.\n"
+) * 100,
```
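Precedence claims like this are cheap to verify in a REPL before applying a suggestion; a minimal check of the two spellings, with short strings standing in for the fixture's lines:

```python
# Compare the unparenthesized fixture-style expression with the
# parenthesized form the suggestion proposes.
implicit = "a\n" "b\n" "c\n" * 2      # adjacent literals, then * 2
explicit = ("a\n" "b\n" "c\n") * 2    # explicitly grouped, then * 2
print(implicit == explicit)
```

Either way, the parenthesized form makes the intended repetition scope unambiguous to readers.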
pyproject.toml (Outdated)

```toml
[tool.setuptools.packages.find]
include = ["openverifiablellm*"]

dependencies = [
    "defusedxml",
    "tokenizers>=0.13.0"
]
```
The new `dependencies = [...]` block is under `[tool.setuptools.packages.find]`, which is not a valid location for runtime dependencies in PEP 621 / setuptools. As a result, `tokenizers` won't be installed when users `pip install openverifiablellm`, and importing `openverifiablellm.tokenizer` will fail. Move `tokenizers>=0.13.0` into the existing `[project].dependencies` list and remove this duplicate block (and the duplicated `defusedxml`).
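A consolidated layout along the lines this comment asks for (the `name` key and surrounding metadata are placeholders; only the placement of `dependencies` under `[project]` is the point):

```toml
[project]
name = "openverifiablellm"    # placeholder: keep the project's existing metadata
dependencies = [
    "defusedxml",
    "tokenizers>=0.13.0",
]

[project.optional-dependencies]
dev = ["pytest"]

[tool.setuptools.packages.find]
include = ["openverifiablellm*"]
```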
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 98-102: The code reads vocab_path twice; instead, read the vocab
file once into a bytes buffer, pass that bytes to compute_sha256 via the
data=... parameter to compute vocab_hash, then decode the same bytes to get JSON
and set actual_vocab_size = len(json.loads(decoded)), leaving merges_path
hashing unchanged; update references to compute_sha256, vocab_path and
actual_vocab_size accordingly so the file is not reread.
- Around line 56-64: The docstring's claim that training is "deterministic" is
not supported by the current use of ByteLevelBPETokenizer/tokenizer.train and
tokenizer.save_model because HF tokenizers can vary across runs; update the
function docstring to remove or qualify the absolute determinism claim and list
known sources of nondeterminism (hash-map iteration, library version, platform),
and/or add safeguards: pin the tokenizers package version in project
requirements, add a regression check that computes and stores hashes of the
generated vocab.json and merges.txt after tokenizer.save_model (and fail or warn
on drift), and document these reproducibility conditions near SPECIAL_TOKENS and
the training call so future maintainers know how to reproduce results.
In `@pyproject.toml`:
- Around line 26-29: Remove the duplicate top-level dependencies block and
instead add "tokenizers>=0.13.0" to the existing dependencies array under the
[project] table; locate the standalone dependencies = [ "defusedxml",
"tokenizers>=0.13.0" ] and delete it, then update the [project] dependencies
list to include tokenizers (alongside any existing entries such as defusedxml)
so the tokenizers requirement is declared only within the [project] section.
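The single-read refactor described in the tokenizer.py item above can be sketched as follows (using `hashlib` directly in place of the project's `compute_sha256` helper so the snippet is self-contained; the dict keys are taken from the sequence diagram):

```python
import hashlib
import json
from pathlib import Path

def hash_tokenizer_config(tokenizer_path):
    tokenizer_path = Path(tokenizer_path)
    # Read vocab.json once: hash the bytes and derive the vocab size
    # from the same buffer instead of re-reading the file.
    vocab_bytes = (tokenizer_path / "vocab.json").read_bytes()
    merges_bytes = (tokenizer_path / "merges.txt").read_bytes()
    return {
        "tokenizer_vocab_hash": hashlib.sha256(vocab_bytes).hexdigest(),
        "tokenizer_merges_hash": hashlib.sha256(merges_bytes).hexdigest(),
        "tokenizer_vocab_size": len(json.loads(vocab_bytes)),
    }
```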
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- openverifiablellm/tokenizer.py
- openverifiablellm/utils.py
- pyproject.toml
- tests/test_tokenizer.py
pyproject.toml (Outdated)

```toml
dependencies = [
    "defusedxml",
    "tokenizers>=0.13.0"
]
```
Duplicate `dependencies` declaration outside the `[project]` section will break installation.

The new `dependencies` block is placed at file root level, not within the `[project]` section. This is invalid for `pyproject.toml`; the key will either cause a parse error or be silently ignored, meaning `tokenizers` won't be installed.

Merge the new dependency into the existing `dependencies` list under `[project]` (lines 14-16).
🐛 Proposed fix

Remove lines 26-29 and update lines 14-16:

```diff
 dependencies = [
-    "defusedxml"
+    "defusedxml",
+    "tokenizers>=0.13.0"
 ]
-
-[project.optional-dependencies]
-dev = [
-    "pytest"
-]
-
-[tool.setuptools.packages.find]
-include = ["openverifiablellm*"]
-
-dependencies = [
-    "defusedxml",
-    "tokenizers>=0.13.0"
-]
+
+[project.optional-dependencies]
+dev = [
+    "pytest"
+]
+
+[tool.setuptools.packages.find]
+include = ["openverifiablellm*"]
```
+include = ["openverifiablellm*"]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 26 - 29, Remove the duplicate top-level
dependencies block and instead add "tokenizers>=0.13.0" to the existing
dependencies array under the [project] table; locate the standalone dependencies
= [ "defusedxml", "tokenizers>=0.13.0" ] and delete it, then update the
[project] dependencies list to include tokenizers (alongside any existing
entries such as defusedxml) so the tokenizers requirement is declared only
within the [project] section.
f2a67b8 to b4a3860 (force-pushed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)
141-152: 🧹 Nitpick | 🔵 Trivial: Validate `raw_path` explicitly before hashing.

`processed_path` is prechecked, but `raw_path` is not. Adding a direct existence check gives clearer, symmetric manifest errors.

🔧 Proposed fix

```diff
 def generate_manifest(raw_path, processed_path):
     raw_path = Path(raw_path)
     processed_path = Path(processed_path)
+    if not raw_path.exists():
+        raise FileNotFoundError(
+            f"Raw file not found at {raw_path}. Provide a valid dump file."
+        )
     if not processed_path.exists():
         raise FileNotFoundError(
             f"Processed file not found at {processed_path}. Run preprocessing first."
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 141 - 152, Add an explicit existence check for raw_path before using it to build the manifest: verify raw_path.exists() and raise a FileNotFoundError with a clear message (similar to the processed_path check) if missing, then proceed to call extract_dump_date(raw_path.name), compute_sha256(raw_path) and compute_merkle_root(raw_path) when constructing the manifest; this ensures the functions extract_dump_date, compute_sha256 and compute_merkle_root are only called on an existing raw_path and yields a symmetric error message alongside the existing processed_path check.
♻️ Duplicate comments (1)
pyproject.toml (1)
14-17: ⚠️ Potential issue | 🟠 Major: Pin `tokenizers` to an exact tested version to prevent reproducibility drift.

Line 16 currently uses a floating constraint (`tokenizers>=0.13.0`). For deterministic tokenizer artifacts, this should be pinned to a single tested version; resolver upgrades can otherwise change `vocab.json`/`merges.txt` outputs across environments.

🔧 Proposed fix

```diff
 dependencies = [
     "defusedxml",
-    "tokenizers>=0.13.0"
+    "tokenizers==0.13.0"
 ]
```

```bash
#!/bin/bash
set -euo pipefail

echo "pyproject declaration:"
rg -n 'tokenizers' pyproject.toml || true
echo
echo "lock/constraints occurrences:"
for f in $(fd -HI 'poetry.lock|uv.lock|pdm.lock|requirements.*\.txt|constraints.*\.txt'); do
  echo "== $f =="
  rg -n 'tokenizers' "$f" || true
done
```
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 14 - 17, Replace the floating constraint for the tokenizers dependency in pyproject.toml (currently "tokenizers>=0.13.0") with an exact, tested version (e.g., "tokenizers==X.Y.Z") to prevent resolver-induced changes to tokenizer artifacts; after updating the dependency string, regenerate your lock file(s) (poetry lock / pdm lock / pip-compile as appropriate) and run the provided verification script (the shell snippet that greps for "tokenizers" across pyproject.toml and lock/constraints files) to confirm the exact version is consistently pinned across manifests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 20-22: Add explicit validation at the start of the
tokenizer-building function (the function that takes vocab_size: int =
TOKENIZER_VOCAB_SIZE and min_frequency: int = TOKENIZER_MIN_FREQUENCY and
returns Path) to raise a clear ValueError if vocab_size <= 0 or min_frequency <=
0; perform the same checks in the alternate overload/variant present around
lines 51-56 (the other function signature that accepts the same parameters) so
callers fail fast with descriptive messages referencing the invalid parameter
name.
- Around line 39-46: The code currently creates the output directory
(save_path.mkdir) before checking whether the input text_file exists, which can
leave empty directories on failure; modify the logic in tokenizer.py to first
validate text_file.exists() and raise FileNotFoundError if missing, and only
after that call save_path.mkdir(parents=True, exist_ok=True) to create the
output directory (reference variables: text_file, save_path, and the Path
usage).
In `@openverifiablellm/utils.py`:
- Around line 60-76: In compute_merkle_root validate that chunk_size is a
positive integer (e.g., > 0) at the start of the function and raise a clear
ValueError if not; this prevents a chunk_size of 0 from causing the read loop to
never progress and returning the empty-file root for non-empty files—check the
chunk_size parameter (default MERKLE_CHUNK_SIZE_BYTES) and raise
ValueError("chunk_size must be a positive integer") or similar before opening
the file so subsequent logic (reading chunks and computing leaves via
compute_sha256) behaves correctly.
In `@tests/test_tokenizer.py`:
- Around line 83-99: Add a new test mirroring
test_hash_changes_when_vocab_changes that mutates the merges.txt file under the
trained_tokenizer fixture and asserts hash_tokenizer_config detects the change;
specifically, create test_hash_changes_when_merges_change(trained_tokenizer)
which reads hashes_before = hash_tokenizer_config(trained_tokenizer), then edit
the trained_tokenizer / "merges.txt" (e.g., append a dummy merge line or alter
an existing line), write it back, compute hashes_after =
hash_tokenizer_config(trained_tokenizer), and assert
hashes_before["tokenizer_merges_hash"] != hashes_after["tokenizer_merges_hash"]
so merges.txt drift is covered.
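The `chunk_size` guard requested for `compute_merkle_root` in the utils.py item above can be sketched as follows (the leaf hashing and pairing scheme here is illustrative; the project's actual implementation may pair or encode differently):

```python
import hashlib

def compute_merkle_root(path, chunk_size=1024 * 1024):
    # Reject chunk_size <= 0 up front; f.read(0) returns b"", so a zero
    # chunk size would silently yield the empty-file root for any file.
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    leaves = []
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            leaves.append(hashlib.sha256(chunk).hexdigest())
    if not leaves:  # genuinely empty file
        return hashlib.sha256(b"").hexdigest()
    while len(leaves) > 1:
        if len(leaves) % 2:  # duplicate last leaf on odd levels
            leaves.append(leaves[-1])
        leaves = [
            hashlib.sha256((leaves[i] + leaves[i + 1]).encode()).hexdigest()
            for i in range(0, len(leaves), 2)
        ]
    return leaves[0]
```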
---
Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 141-152: Add an explicit existence check for raw_path before using
it to build the manifest: verify raw_path.exists() and raise a FileNotFoundError
with a clear message (similar to the processed_path check) if missing, then
proceed to call extract_dump_date(raw_path.name), compute_sha256(raw_path) and
compute_merkle_root(raw_path) when constructing the manifest; this ensures the
functions extract_dump_date, compute_sha256 and compute_merkle_root are only
called on an existing raw_path and yields a symmetric error message alongside
the existing processed_path check.
---
Duplicate comments:
In `@pyproject.toml`:
- Around line 14-17: Replace the floating constraint for the tokenizers
dependency in pyproject.toml (currently "tokenizers>=0.13.0") with an exact,
tested version (e.g., "tokenizers==X.Y.Z") to prevent resolver-induced changes
to tokenizer artifacts; after updating the dependency string, regenerate your
lock file(s) (poetry lock / pdm lock / pip-compile as appropriate) and run the
provided verification script (the shell snippet that greps for "tokenizers"
across pyproject.toml and lock/constraints files) to confirm the exact version
is consistently pinned across manifests.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- openverifiablellm/tokenizer.py
- openverifiablellm/utils.py
- pyproject.toml
- tests/test_tokenizer.py
b4a3860 to 34ffcfd (force-pushed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/coderabbit-approval.yml (1)
38-42: 🧹 Nitpick | 🔵 Trivial: Remove unused `shouldRemoveLabel` from the returned object (or wire it to outputs).

`shouldRemoveLabel` duplicates `isCodeRabbit && isApproved` but isn't consumed later, which adds noise without value.

♻️ Minimal cleanup

```diff
 return {
   isCodeRabbit,
-  isApproved,
-  shouldRemoveLabel: isCodeRabbit && isApproved
+  isApproved
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/coderabbit-approval.yml around lines 38 - 42, The returned object includes an unused property shouldRemoveLabel (computed as isCodeRabbit && isApproved); remove shouldRemoveLabel from the returned object (or alternatively wire it to the action outputs if intended to be consumed), leaving only isCodeRabbit and isApproved, and update any callers/outputs to use isCodeRabbit && isApproved directly if needed; ensure no other code references shouldRemoveLabel before deleting it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 44-47: The path validation currently uses Path.exists() and can
mistakenly accept directories; change the checks for text_file, vocab_file
(vocab.json), and merges_file (merges.txt) to use Path.is_file() instead of
exists(), and keep raising FileNotFoundError with the same descriptive message;
update both the check near the text_file validation and the similar block
referenced at lines 79-83 so directories are rejected early and clearer errors
are produced.
In `@openverifiablellm/utils.py`:
- Around line 39-43: Replace existence-only checks that use path.exists() with
checks that ensure the path is a regular file via path.is_file() before
attempting to open it; for the snippet around the variable file_path (where path
= Path(file_path) and with path.open("rb") as f:) change the guard to if not
path.is_file(): raise FileNotFoundError(f"{path} not found or is not a file") so
directories are rejected early. Apply the same change to the other occurrences
noted (around lines 60-64 and 127-135) by locating the code that calls
Path(...), uses path.exists(), and then path.open(), and replace exists() with
is_file() and provide a clear FileNotFoundError message.
In `@pyproject.toml`:
- Around line 12-17: The declared Python range in pyproject.toml
(requires-python) conflicts with the pinned tokenizers version
(tokenizers==0.13.0) which only provides wheels for CPython 3.7–3.10; either
tighten requires-python to ">=3.9,<3.11" to match tokenizers==0.13.0, or update
the pinned tokenizers to a newer release that supports Python 3.11+ (and, if you
upgrade tokenizers, run and re-baseline the determinism tests afterward).
In `@tests/test_tokenizer.py`:
- Around line 10-39: Add negative-path pytest cases in tests/test_tokenizer.py
targeting the public APIs train_tokenizer and hash_tokenizer_config: write tests
that assert train_tokenizer raises for invalid arguments (e.g., vocab_size <= 0
and min_frequency < 1) using pytest.raises and descriptive error types/messages,
assert train_tokenizer raises FileNotFoundError when text_file is missing, and
assert hash_tokenizer_config raises FileNotFoundError or a clear error when
vocab.json or merges.txt are absent; reuse tmp_path fixtures to create
missing/invalid inputs and reference the functions train_tokenizer and
hash_tokenizer_config in the tests so the CI locks in the intended error
behavior.
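The negative-path tests described above could look like the following sketch. Since this page doesn't show the real implementation, a hypothetical stand-in `train_tokenizer` with the requested validation is included so the snippet is self-contained; the real function lives in `openverifiablellm.tokenizer`:

```python
import pytest
from pathlib import Path

# Hypothetical stand-in mirroring the validation the review asks for;
# the real train_tokenizer also trains the Byte-Level BPE model.
def train_tokenizer(text_file, save_path, vocab_size=30000, min_frequency=2):
    if vocab_size <= 0:
        raise ValueError(f"vocab_size must be positive, got {vocab_size}")
    if min_frequency < 1:
        raise ValueError(f"min_frequency must be >= 1, got {min_frequency}")
    text_file = Path(text_file)
    if not text_file.is_file():
        raise FileNotFoundError(f"Training text not found: {text_file}")
    save_path = Path(save_path)
    save_path.mkdir(parents=True, exist_ok=True)  # only after validation
    return save_path

def test_rejects_invalid_params(tmp_path):
    corpus = tmp_path / "corpus.txt"
    corpus.write_text("hello world\n")
    with pytest.raises(ValueError):
        train_tokenizer(corpus, tmp_path / "tok", vocab_size=0)
    with pytest.raises(ValueError):
        train_tokenizer(corpus, tmp_path / "tok", min_frequency=0)

def test_missing_text_file(tmp_path):
    with pytest.raises(FileNotFoundError):
        train_tokenizer(tmp_path / "missing.txt", tmp_path / "tok")
```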
---
Outside diff comments:
In @.github/workflows/coderabbit-approval.yml:
- Around line 38-42: The returned object includes an unused property
shouldRemoveLabel (computed as isCodeRabbit && isApproved); remove
shouldRemoveLabel from the returned object (or alternatively wire it to the
action outputs if intended to be consumed), leaving only isCodeRabbit and
isApproved, and update any callers/outputs to use isCodeRabbit && isApproved
directly if needed; ensure no other code references shouldRemoveLabel before
deleting it.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (18)
- .coderabbit.yaml
- .github/ISSUE_TEMPLATE/good_first_issue.yml
- .github/workflows/coderabbit-approval.yml
- .github/workflows/sync-pr-labels.yml
- .github/workflows/version-release.yml
- .gitignore
- .pre-commit-config.yaml
- CONTRIBUTING.md
- LICENSE
- README.md
- VERSION
- examples/demo_util.py
- examples/sample_wiki.py
- openverifiablellm/tokenizer.py
- openverifiablellm/utils.py
- pyproject.toml
- tests/test_tokenizer.py
- tests/test_util.py
Please make the following changes:
b6b6181 to 34a63dc (force-pushed)
I've addressed all requested changes:

- Removed unrelated changes from .github, examples, and root files

The PR now only contains tokenizer-related changes.
Fixes #15
Summary
This PR introduces deterministic tokenizer training, cryptographic fingerprinting of tokenizer configuration files, and a modular tokenizer architecture to support multiple tokenizer implementations.
The goal is to extend the dataset verification pipeline so that tokenizer behavior becomes reproducible and verifiable across runs.
What This Adds
Deterministic Tokenizer Training
Tokenizer Configuration Hashing
Cryptographic hashing of tokenizer artifacts:
Using SHA-256 utilities already implemented in the dataset verification pipeline.
Vocabulary Integrity Check
Modular Tokenizer Architecture
Refactors tokenizer code to support multiple tokenizer implementations:
This design allows new tokenizer types to be added without modifying the training pipeline.
Tests
New and existing tests cover:
All tests pass locally.
Why This Matters
Currently there is a verification gap between the dataset verification layer and model training.
Verified Dataset (Layer 2)
│
│ (Tokenization not verifiable)
▼
Model Training (Layer 4)
While dataset verification provides cryptographic proof for raw and processed Wikipedia data, the tokenizer step is not verified.
This makes it impossible to detect:
By hashing tokenizer configuration files, tokenization becomes cryptographically tied to a specific tokenizer state.
This closes a key reproducibility gap in the pipeline.
Relation to the Verification Framework
This work aligns with the pipeline described in:
"A Framework for Cryptographic Verifiability of End-to-End AI Pipelines" (IWSPA 2025)
The paper identifies the Extraction and Analysis phase (Stage 2) as a critical verification gap in current ML pipelines.
This PR begins addressing that gap by making tokenizer configuration verifiable.
Reference:
https://arxiv.org/pdf/2503.22573v1
Related Work
These PRs establish dataset verification.
This PR extends the verification chain to include tokenizer configuration.
Verification Chain
Raw Data Merkle Root
↓
Processed Data Merkle Root
↓
Tokenizer Config Hash
↓
Tokenized Data
↓
Model Training
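One illustrative way to commit to a chain like this (not necessarily the project's actual scheme) is to fold the per-stage digests into a single pipeline-level digest, so a change at any stage changes the final commitment:

```python
import hashlib

def chain_commitment(*stage_hashes):
    """Fold per-stage SHA-256 hex digests into one pipeline-level digest.

    Order matters: swapping two stages yields a different commitment.
    """
    h = hashlib.sha256()
    for stage in stage_hashes:
        h.update(bytes.fromhex(stage))
    return h.hexdigest()
```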
Code of Conduct