
feat: deterministic tokenizer training and config hashing #17

Merged
Archit381 merged 1 commit into AOSSIE-Org:main from Varshiniputtabakula:feat-tokenizer-training
Mar 9, 2026

Conversation


Varshiniputtabakula commented Mar 1, 2026

Fixes #15

Summary

This PR introduces deterministic tokenizer training, cryptographic fingerprinting of tokenizer configuration files, and a modular tokenizer architecture to support multiple tokenizer implementations.

The goal is to extend the dataset verification pipeline so that tokenizer behavior becomes reproducible and verifiable across runs.


What This Adds

Deterministic Tokenizer Training

  • Deterministic Byte-Level BPE tokenizer training using HuggingFace Tokenizers
  • Fixed vocabulary size and minimum frequency parameters for reproducibility
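As a rough sketch of how such training can be made reproducible with the HuggingFace tokenizers library (the constant values, parameter names, and validation below are illustrative, not necessarily the PR's exact code):

```python
# Illustrative constants; the PR's actual values may differ.
TOKENIZER_VOCAB_SIZE = 30_000
TOKENIZER_MIN_FREQUENCY = 2
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

def train_tokenizer(text_file, save_path,
                    vocab_size=TOKENIZER_VOCAB_SIZE,
                    min_frequency=TOKENIZER_MIN_FREQUENCY):
    """Train a Byte-Level BPE tokenizer with fixed parameters and save its artifacts."""
    if vocab_size <= 0 or min_frequency <= 0:
        raise ValueError("vocab_size and min_frequency must be positive")
    # Imported lazily so parameter validation does not depend on the package.
    from tokenizers import ByteLevelBPETokenizer
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[str(text_file)],
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=SPECIAL_TOKENS,
    )
    tokenizer.save_model(str(save_path))  # writes vocab.json and merges.txt
```

Fixing the vocabulary size, minimum frequency, and special-token list, together with pinning the tokenizers package version, removes the obvious sources of run-to-run drift.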

Tokenizer Configuration Hashing

Cryptographic hashing of tokenizer artifacts:

  • vocab.json
  • merges.txt

These files are hashed using the SHA-256 utilities already implemented in the dataset verification pipeline.
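For illustration, chunked SHA-256 fingerprinting of an artifact file needs nothing beyond the standard library (the function name here is hypothetical; the repository's helper may differ):

```python
import hashlib
from pathlib import Path

def sha256_file(file_path, chunk_size=8192):
    """Hash a file's bytes in fixed-size chunks so large artifacts never load whole."""
    sha256 = hashlib.sha256()
    with Path(file_path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()
```

Hashing vocab.json and merges.txt this way yields a stable fingerprint: any byte-level change to either artifact produces a different digest.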

Vocabulary Integrity Check

  • Vocabulary size is derived directly from vocab.json
  • Ensures tokenizer artifacts match the expected configuration
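A minimal sketch of that integrity check (function name hypothetical): vocab.json holds a token-to-id mapping with one entry per token, so its length is the actual vocabulary size:

```python
import json

def check_vocab_size(vocab_path, expected_size):
    """Compare the vocab size derived from vocab.json against the configured value."""
    with open(vocab_path, "r", encoding="utf-8") as f:
        vocab = json.load(f)  # token -> id mapping
    actual_size = len(vocab)
    if actual_size != expected_size:
        raise ValueError(
            f"Vocabulary size mismatch: expected {expected_size}, found {actual_size}"
        )
    return actual_size
```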

Modular Tokenizer Architecture

Refactors tokenizer code to support multiple tokenizer implementations:

  • BaseTokenizer abstract interface
  • BPETokenizer implementation
  • SentencePieceTokenizer implementation
  • TokenizerFactory for tokenizer instantiation

This design allows new tokenizer types to be added without modifying the training pipeline.
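A common way to lay out such an architecture is an abstract base class plus a registry-backed factory; the sketch below uses illustrative names and placeholder method bodies rather than the PR's actual interface:

```python
from abc import ABC, abstractmethod

class BaseTokenizer(ABC):
    """Abstract interface each tokenizer implementation must satisfy."""

    @abstractmethod
    def train(self, text_file: str, save_path: str) -> None:
        ...

    @abstractmethod
    def encode(self, text: str) -> list:
        ...

class BPETokenizer(BaseTokenizer):
    def train(self, text_file, save_path):
        pass  # would delegate to Byte-Level BPE training

    def encode(self, text):
        return text.split()  # placeholder; a real impl uses the trained model

class SentencePieceTokenizer(BaseTokenizer):
    def train(self, text_file, save_path):
        pass  # would delegate to SentencePiece training

    def encode(self, text):
        return text.split()  # placeholder

class TokenizerFactory:
    """Registry-backed instantiation: new types register here; the pipeline is untouched."""
    _registry = {"bpe": BPETokenizer, "sentencepiece": SentencePieceTokenizer}

    @classmethod
    def create(cls, tokenizer_type: str) -> BaseTokenizer:
        try:
            return cls._registry[tokenizer_type]()
        except KeyError:
            raise ValueError(f"Unknown tokenizer type: {tokenizer_type!r}")
```

Adding a tokenizer then means registering one new class; the training pipeline only ever depends on BaseTokenizer.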

Tests

New and existing tests cover:

  • Deterministic tokenizer training
  • Artifact generation (vocab.json, merges.txt)
  • SHA256 hash stability
  • Hash change detection when tokenizer artifacts change
  • Vocabulary size validation
  • Tokenizer factory behavior and type selection

All tests pass locally.


Why This Matters

Currently there is a verification gap between the dataset verification layer and model training.

Verified Dataset (Layer 2)

│ (Tokenization not verifiable)

Model Training (Layer 4)

While dataset verification provides cryptographic proof for raw and processed Wikipedia data, the tokenizer step is not verified.

This makes it impossible to detect:

  • Use of a different tokenizer than claimed
  • Different tokenizer configuration than claimed
  • Silent tokenizer drift across runs
  • Tampering with tokenizer artifacts

By hashing tokenizer configuration files, tokenization becomes cryptographically tied to a specific tokenizer state.

This closes a key reproducibility gap in the pipeline.


Relation to the Verification Framework

This work aligns with the pipeline described in:

"A Framework for Cryptographic Verifiability of End-to-End AI Pipelines" (IWSPA 2025)

The paper identifies the Extraction and Analysis phase (Stage 2) as a critical verification gap in current ML pipelines.

This PR begins addressing that gap by making tokenizer configuration verifiable.

Reference:
https://arxiv.org/pdf/2503.22573v1


Related Work

These PRs establish dataset verification.
This PR extends the verification chain to include tokenizer configuration.


Verification Chain

Raw Data Merkle Root → Processed Data Merkle Root → Tokenizer Config Hash → Tokenized Data → Model Training
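One way to picture why this chain holds: fold each stage's digest into a running commitment, so a change at any stage alters every downstream value. This is purely illustrative and not the repository's actual scheme:

```python
import hashlib

def chain_commitment(stage_digests):
    """Fold per-stage digests into one running SHA-256 commitment."""
    running = b""
    for digest in stage_digests:
        running = hashlib.sha256(running + digest.encode("utf-8")).digest()
    return running.hex()

# Placeholder stage values; in practice these are the hex digests listed above.
stages = ["raw_merkle_root", "processed_merkle_root", "tokenizer_config_hash"]
commitment = chain_commitment(stages)
```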


Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates
  • Open to feedback on the approach and implementation


coderabbitai bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new deterministic Byte-Level BPE tokenizer module with training and config-hashing, consolidates SHA‑256 hashing into a public utility used by manifest/Merkle logic, adds tokenizer tests, introduces tokenizers==0.15.2 dependency, and applies minor formatting/CI/doc edits.

Changes

Cohort / File(s) Summary
Tokenizer Implementation
openverifiablellm/tokenizer.py
New module: constants TOKENIZER_VOCAB_SIZE, TOKENIZER_MIN_FREQUENCY, SPECIAL_TOKENS; train_tokenizer(...) to train/save a ByteLevelBPETokenizer; hash_tokenizer_config(...) computes SHA‑256 of vocab.json and merges.txt and returns hashes plus actual vocab size.
Utilities & Manifest
openverifiablellm/utils.py
Added public compute_sha256(file_path=None, *, data=None); refactored compute_merkle_root, manifest generation, and XML extraction to use unified hashing, Path objects, input validation, and empty-data handling.
Tests (tokenizer)
tests/test_tokenizer.py
New tests: tokenizer training, determinism, hash outputs and sensitivity to file changes, reported vocab size, and negative-path checks for invalid params and missing files.
Packaging / Dependencies
pyproject.toml
Added runtime dependency tokenizers==0.15.2 and minor formatting adjustments.
CI / Config / Docs / Misc
.github/..., .pre-commit-config.yaml, .coderabbit.yaml, .gitignore, CONTRIBUTING.md, LICENSE, README.md, VERSION, examples/*, tests/test_util.py
Whitespace/formatting and small non-functional edits across workflows, configs, docs, examples, and tests; no behavioral changes beyond dependency addition.

Sequence Diagram(s)

sequenceDiagram
  participant User as "User / Test Runner"
  participant Trainer as "train_tokenizer (tokenizers lib)"
  participant FS as "Filesystem (save_path)"
  participant Hasher as "hash_tokenizer_config / utils.compute_sha256"

  User->>Trainer: provide text_file, vocab_size, min_frequency
  Trainer->>Trainer: preprocess input and train ByteLevelBPETokenizer
  Trainer->>FS: write `vocab.json` and `merges.txt`
  User->>Hasher: call hash_tokenizer_config(tokenizer_path)
  Hasher->>FS: read `vocab.json` and `merges.txt`
  Hasher->>Hasher: compute_sha256(vocab.json), compute_sha256(merges.txt)
  Hasher-->>User: return {tokenizer_vocab_hash, tokenizer_merges_hash, tokenizer_vocab_size}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Python Lang, Documentation

Suggested reviewers

  • Archit381

Poem

🐰 I nibble bytes and stitch the BPE,
I save the merges for you and me,
I hash the vocab — precise and neat,
Tests hop through fields to prove repeat,
A rabbit's stamp approves the feat.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: deterministic tokenizer training and config hashing' accurately captures the main changes: adding tokenizer training and hashing functionality for configuration files.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.


Copilot AI left a comment


Pull request overview

Adds tokenizer reproducibility support to the verification pipeline by introducing deterministic Byte-Level BPE tokenizer training and cryptographic fingerprinting (SHA256) of tokenizer config artifacts.

Changes:

  • Added openverifiablellm.tokenizer with train_tokenizer() and hash_tokenizer_config() (hashes vocab.json / merges.txt, derives vocab size from vocab.json).
  • Refactored hashing utilities in openverifiablellm.utils (introduced a more flexible compute_sha256, updated Merkle hashing call sites).
  • Added unit tests for tokenizer training determinism and config hash stability/change detection.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
tests/test_tokenizer.py New tests covering tokenizer training determinism and tokenizer config hashing behavior.
pyproject.toml Adds runtime deps, but currently placed in an invalid section and duplicates existing deps.
openverifiablellm/utils.py Refactors SHA256 helper and Merkle root computation; updates manifest generation call sites accordingly.
openverifiablellm/tokenizer.py New tokenizer training + config hashing utilities built on tokenizers and shared SHA256 helper.


Comment on lines +35 to +43
if file_path is None and data is None:
    raise ValueError("Either file_path or data must be provided.")

sha256 = hashlib.sha256()

# If keyword data is used
if data is not None:
    sha256.update(data)
    return sha256.hexdigest()

Copilot AI Mar 1, 2026


compute_sha256 is documented as hashing a file OR raw bytes, but it currently allows both file_path and data to be provided (it silently prefers data). This can hide call-site bugs and makes the API ambiguous; consider enforcing mutual exclusivity (raise if both are set), while still supporting the backward-compatible call patterns.
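The mutual exclusivity being requested could be enforced along these lines (a sketch, not the reviewed helper's current code):

```python
import hashlib
from pathlib import Path

def compute_sha256(file_path=None, *, data=None, chunk_size=8192):
    """Hash exactly one source: a file path or raw bytes (mutually exclusive)."""
    if (file_path is None) == (data is None):
        raise ValueError("Provide exactly one of file_path or data.")
    sha256 = hashlib.sha256()
    if data is not None:
        sha256.update(data)
    else:
        with Path(file_path).open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                sha256.update(chunk)
    return sha256.hexdigest()
```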

@@ -0,0 +1,95 @@
import json
import pytest
from pathlib import Path

Copilot AI Mar 1, 2026


Path is imported but unused in this test module. If linting (e.g., ruff/pyflakes) is enabled in CI later, this will fail; remove the unused import or use it explicitly.

Suggested change
from pathlib import Path

Comment on lines +15 to +19
"Wikipedia is a free online encyclopedia.\n"
"It is written collaboratively by volunteers.\n"
"Anyone can edit Wikipedia articles.\n"
"Wikipedia was launched on January 15 2001.\n"
"It is one of the most popular websites in the world.\n" * 100,

Copilot AI Mar 1, 2026


In sample_text_file, the * 100 actually repeats the entire concatenated block, since Python joins adjacent string literals at parse time before the multiplication applies. Even so, the construct is easy to misread as repeating only the final literal; wrapping the whole block in parentheses before multiplying makes the intended repetition explicit.

Suggested change
"Wikipedia is a free online encyclopedia.\n"
"It is written collaboratively by volunteers.\n"
"Anyone can edit Wikipedia articles.\n"
"Wikipedia was launched on January 15 2001.\n"
"It is one of the most popular websites in the world.\n" * 100,
(
"Wikipedia is a free online encyclopedia.\n"
"It is written collaboratively by volunteers.\n"
"Anyone can edit Wikipedia articles.\n"
"Wikipedia was launched on January 15 2001.\n"
"It is one of the most popular websites in the world.\n"
) * 100,

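For reference, Python resolves adjacent string literals into a single atom at parse time, before any operator applies, so `* N` repeats the whole concatenated block. This can be checked with toy strings:

```python
# Adjacent literals merge first, then `*` applies to the combined string,
# so every line appears three times here, not just the last one.
sample = (
    "first line\n"
    "second line\n" * 3
)
print(sample.count("first line"))  # → 3
```

Parenthesizing the block before multiplying still helps readers, but the semantics are unchanged.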
pyproject.toml Outdated
Comment on lines +23 to +29
[tool.setuptools.packages.find]
include = ["openverifiablellm*"]

dependencies = [
"defusedxml",
"tokenizers>=0.13.0"
]

Copilot AI Mar 1, 2026


The new dependencies = [...] block is under [tool.setuptools.packages.find], which is not a valid location for runtime dependencies in PEP 621 / setuptools. As a result, tokenizers won’t be installed when users pip install openverifiablellm, and importing openverifiablellm.tokenizer will fail. Move tokenizers>=0.13.0 into the existing [project].dependencies list and remove this duplicate block (and the duplicated defusedxml).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 98-102: The code reads vocab_path twice; instead, read the vocab
file once into a bytes buffer, pass that bytes to compute_sha256 via the
data=... parameter to compute vocab_hash, then decode the same bytes to get JSON
and set actual_vocab_size = len(json.loads(decoded)), leaving merges_path
hashing unchanged; update references to compute_sha256, vocab_path and
actual_vocab_size accordingly so the file is not reread.
- Around line 56-64: The docstring's claim that training is "deterministic" is
not supported by the current use of ByteLevelBPETokenizer/tokenizer.train and
tokenizer.save_model because HF tokenizers can vary across runs; update the
function docstring to remove or qualify the absolute determinism claim and list
known sources of nondeterminism (hash-map iteration, library version, platform),
and/or add safeguards: pin the tokenizers package version in project
requirements, add a regression check that computes and stores hashes of the
generated vocab.json and merges.txt after tokenizer.save_model (and fail or warn
on drift), and document these reproducibility conditions near SPECIAL_TOKENS and
the training call so future maintainers know how to reproduce results.

In `@pyproject.toml`:
- Around line 26-29: Remove the duplicate top-level dependencies block and
instead add "tokenizers>=0.13.0" to the existing dependencies array under the
[project] table; locate the standalone dependencies = [ "defusedxml",
"tokenizers>=0.13.0" ] and delete it, then update the [project] dependencies
list to include tokenizers (alongside any existing entries such as defusedxml)
so the tokenizers requirement is declared only within the [project] section.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99283f7 and f2a67b8.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer.py
  • openverifiablellm/utils.py
  • pyproject.toml
  • tests/test_tokenizer.py

pyproject.toml Outdated
Comment on lines +26 to +29
dependencies = [
"defusedxml",
"tokenizers>=0.13.0"
]


⚠️ Potential issue | 🔴 Critical

Duplicate dependencies declaration outside [project] section will break installation.

The new dependencies block is placed at file root level, not within the [project] section. This is invalid TOML for pyproject.toml — the key will either cause a parse error or be silently ignored, meaning tokenizers won't be installed.

Merge the new dependency into the existing dependencies list under [project] (lines 14-16).

🐛 Proposed fix

Remove lines 26-29 and update lines 14-16:

 dependencies= [
-    "defusedxml"
+    "defusedxml",
+    "tokenizers>=0.13.0"
 ]
-
-[project.optional-dependencies]
-dev = [
-    "pytest"
-]
-
-[tool.setuptools.packages.find]
-include = ["openverifiablellm*"]
-
-dependencies = [
-    "defusedxml",
-    "tokenizers>=0.13.0"
-]
+
+[project.optional-dependencies]
+dev = [
+    "pytest"
+]
+
+[tool.setuptools.packages.find]
+include = ["openverifiablellm*"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 26 - 29, Remove the duplicate top-level
dependencies block and instead add "tokenizers>=0.13.0" to the existing
dependencies array under the [project] table; locate the standalone dependencies
= [ "defusedxml", "tokenizers>=0.13.0" ] and delete it, then update the
[project] dependencies list to include tokenizers (alongside any existing
entries such as defusedxml) so the tokenizers requirement is declared only
within the [project] section.

@github-actions github-actions bot added size/L and removed size/L labels Mar 1, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)

141-152: 🧹 Nitpick | 🔵 Trivial

Validate raw_path explicitly before hashing.

processed_path is prechecked, but raw_path is not. Adding a direct existence check gives clearer, symmetric manifest errors.

🔧 Proposed fix
 def generate_manifest(raw_path, processed_path):
     raw_path = Path(raw_path)
     processed_path = Path(processed_path)
 
+    if not raw_path.exists():
+        raise FileNotFoundError(
+            f"Raw file not found at {raw_path}. Provide a valid dump file."
+        )
     if not processed_path.exists():
         raise FileNotFoundError(
             f"Processed file not found at {processed_path}. Run preprocessing first."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 141 - 152, Add an explicit existence
check for raw_path before using it to build the manifest: verify
raw_path.exists() and raise a FileNotFoundError with a clear message (similar to
the processed_path check) if missing, then proceed to call
extract_dump_date(raw_path.name), compute_sha256(raw_path) and
compute_merkle_root(raw_path) when constructing the manifest; this ensures the
functions extract_dump_date, compute_sha256 and compute_merkle_root are only
called on an existing raw_path and yields a symmetric error message alongside
the existing processed_path check.
♻️ Duplicate comments (1)
pyproject.toml (1)

14-17: ⚠️ Potential issue | 🟠 Major

Pin tokenizers to an exact tested version to prevent reproducibility drift.

Line 16 currently uses a floating constraint (tokenizers>=0.13.0). For deterministic tokenizer artifacts, this should be pinned to a single tested version; resolver upgrades can otherwise change vocab.json/merges.txt outputs across environments.

🔧 Proposed fix
 dependencies = [
     "defusedxml",
-    "tokenizers>=0.13.0"
+    "tokenizers==0.13.0"
 ]
#!/bin/bash
set -euo pipefail

echo "pyproject declaration:"
rg -n 'tokenizers' pyproject.toml || true

echo
echo "lock/constraints occurrences:"
for f in $(fd -HI 'poetry.lock|uv.lock|pdm.lock|requirements.*\.txt|constraints.*\.txt'); do
  echo "== $f =="
  rg -n 'tokenizers' "$f" || true
done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 14 - 17, Replace the floating constraint for the
tokenizers dependency in pyproject.toml (currently "tokenizers>=0.13.0") with an
exact, tested version (e.g., "tokenizers==X.Y.Z") to prevent resolver-induced
changes to tokenizer artifacts; after updating the dependency string, regenerate
your lock file(s) (poetry lock / pdm lock / pip-compile as appropriate) and run
the provided verification script (the shell snippet that greps for "tokenizers"
across pyproject.toml and lock/constraints files) to confirm the exact version
is consistently pinned across manifests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 20-22: Add explicit validation at the start of the
tokenizer-building function (the function that takes vocab_size: int =
TOKENIZER_VOCAB_SIZE and min_frequency: int = TOKENIZER_MIN_FREQUENCY and
returns Path) to raise a clear ValueError if vocab_size <= 0 or min_frequency <=
0; perform the same checks in the alternate overload/variant present around
lines 51-56 (the other function signature that accepts the same parameters) so
callers fail fast with descriptive messages referencing the invalid parameter
name.
- Around line 39-46: The code currently creates the output directory
(save_path.mkdir) before checking whether the input text_file exists, which can
leave empty directories on failure; modify the logic in tokenizer.py to first
validate text_file.exists() and raise FileNotFoundError if missing, and only
after that call save_path.mkdir(parents=True, exist_ok=True) to create the
output directory (reference variables: text_file, save_path, and the Path
usage).

In `@openverifiablellm/utils.py`:
- Around line 60-76: In compute_merkle_root validate that chunk_size is a
positive integer (e.g., > 0) at the start of the function and raise a clear
ValueError if not; this prevents a chunk_size of 0 from causing the read loop to
never progress and returning the empty-file root for non-empty files—check the
chunk_size parameter (default MERKLE_CHUNK_SIZE_BYTES) and raise
ValueError("chunk_size must be a positive integer") or similar before opening
the file so subsequent logic (reading chunks and computing leaves via
compute_sha256) behaves correctly.

In `@tests/test_tokenizer.py`:
- Around line 83-99: Add a new test mirroring
test_hash_changes_when_vocab_changes that mutates the merges.txt file under the
trained_tokenizer fixture and asserts hash_tokenizer_config detects the change;
specifically, create test_hash_changes_when_merges_change(trained_tokenizer)
which reads hashes_before = hash_tokenizer_config(trained_tokenizer), then edit
the trained_tokenizer / "merges.txt" (e.g., append a dummy merge line or alter
an existing line), write it back, compute hashes_after =
hash_tokenizer_config(trained_tokenizer), and assert
hashes_before["tokenizer_merges_hash"] != hashes_after["tokenizer_merges_hash"]
so merges.txt drift is covered.

---

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 141-152: Add an explicit existence check for raw_path before using
it to build the manifest: verify raw_path.exists() and raise a FileNotFoundError
with a clear message (similar to the processed_path check) if missing, then
proceed to call extract_dump_date(raw_path.name), compute_sha256(raw_path) and
compute_merkle_root(raw_path) when constructing the manifest; this ensures the
functions extract_dump_date, compute_sha256 and compute_merkle_root are only
called on an existing raw_path and yields a symmetric error message alongside
the existing processed_path check.

---

Duplicate comments:
In `@pyproject.toml`:
- Around line 14-17: Replace the floating constraint for the tokenizers
dependency in pyproject.toml (currently "tokenizers>=0.13.0") with an exact,
tested version (e.g., "tokenizers==X.Y.Z") to prevent resolver-induced changes
to tokenizer artifacts; after updating the dependency string, regenerate your
lock file(s) (poetry lock / pdm lock / pip-compile as appropriate) and run the
provided verification script (the shell snippet that greps for "tokenizers"
across pyproject.toml and lock/constraints files) to confirm the exact version
is consistently pinned across manifests.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2a67b8 and b4a3860.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer.py
  • openverifiablellm/utils.py
  • pyproject.toml
  • tests/test_tokenizer.py

@github-actions github-actions bot added ci-cd documentation Improvements or additions to documentation github-actions size/XL and removed size/L size/XL labels Mar 1, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/coderabbit-approval.yml (1)

38-42: 🧹 Nitpick | 🔵 Trivial

Remove unused shouldRemoveLabel from the returned object (or wire it to outputs).

shouldRemoveLabel duplicates isCodeRabbit && isApproved but isn’t consumed later, which adds noise without value.

♻️ Minimal cleanup
             return {
               isCodeRabbit,
-              isApproved,
-              shouldRemoveLabel: isCodeRabbit && isApproved
+              isApproved
             };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/coderabbit-approval.yml around lines 38 - 42, The returned
object includes an unused property shouldRemoveLabel (computed as isCodeRabbit
&& isApproved); remove shouldRemoveLabel from the returned object (or
alternatively wire it to the action outputs if intended to be consumed), leaving
only isCodeRabbit and isApproved, and update any callers/outputs to use
isCodeRabbit && isApproved directly if needed; ensure no other code references
shouldRemoveLabel before deleting it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 44-47: The path validation currently uses Path.exists() and can
mistakenly accept directories; change the checks for text_file, vocab_file
(vocab.json), and merges_file (merges.txt) to use Path.is_file() instead of
exists(), and keep raising FileNotFoundError with the same descriptive message;
update both the check near the text_file validation and the similar block
referenced at lines 79-83 so directories are rejected early and clearer errors
are produced.

In `@openverifiablellm/utils.py`:
- Around line 39-43: Replace existence-only checks that use path.exists() with
checks that ensure the path is a regular file via path.is_file() before
attempting to open it; for the snippet around the variable file_path (where path
= Path(file_path) and with path.open("rb") as f:) change the guard to if not
path.is_file(): raise FileNotFoundError(f"{path} not found or is not a file") so
directories are rejected early. Apply the same change to the other occurrences
noted (around lines 60-64 and 127-135) by locating the code that calls
Path(...), uses path.exists(), and then path.open(), and replace exists() with
is_file() and provide a clear FileNotFoundError message.

In `@pyproject.toml`:
- Around line 12-17: The declared Python range in pyproject.toml
(requires-python) conflicts with the pinned tokenizers version
(tokenizers==0.13.0) which only provides wheels for CPython 3.7–3.10; either
tighten requires-python to ">=3.9,<3.11" to match tokenizers==0.13.0, or update
the pinned tokenizers to a newer release that supports Python 3.11+ (and, if you
upgrade tokenizers, run and re-baseline the determinism tests afterward).

In `@tests/test_tokenizer.py`:
- Around line 10-39: Add negative-path pytest cases in tests/test_tokenizer.py
targeting the public APIs train_tokenizer and hash_tokenizer_config: write tests
that assert train_tokenizer raises for invalid arguments (e.g., vocab_size <= 0
and min_frequency < 1) using pytest.raises and descriptive error types/messages,
assert train_tokenizer raises FileNotFoundError when text_file is missing, and
assert hash_tokenizer_config raises FileNotFoundError or a clear error when
vocab.json or merges.txt are absent; reuse tmp_path fixtures to create
missing/invalid inputs and reference the functions train_tokenizer and
hash_tokenizer_config in the tests so the CI locks in the intended error
behavior.

---

Outside diff comments:
In @.github/workflows/coderabbit-approval.yml:
- Around line 38-42: The returned object includes an unused property
shouldRemoveLabel (computed as isCodeRabbit && isApproved); remove
shouldRemoveLabel from the returned object (or alternatively wire it to the
action outputs if intended to be consumed), leaving only isCodeRabbit and
isApproved, and update any callers/outputs to use isCodeRabbit && isApproved
directly if needed; ensure no other code references shouldRemoveLabel before
deleting it.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4a3860 and 34ffcfd.

📒 Files selected for processing (18)
  • .coderabbit.yaml
  • .github/ISSUE_TEMPLATE/good_first_issue.yml
  • .github/workflows/coderabbit-approval.yml
  • .github/workflows/sync-pr-labels.yml
  • .github/workflows/version-release.yml
  • .gitignore
  • .pre-commit-config.yaml
  • CONTRIBUTING.md
  • LICENSE
  • README.md
  • VERSION
  • examples/demo_util.py
  • examples/sample_wiki.py
  • openverifiablellm/tokenizer.py
  • openverifiablellm/utils.py
  • pyproject.toml
  • tests/test_tokenizer.py
  • tests/test_util.py

@github-actions github-actions bot removed the size/XL label Mar 1, 2026
@Archit381 Archit381 self-requested a review March 5, 2026 14:50
@github-actions github-actions bot added size/XL size/M enhancement New feature or request and removed size/XL size/M labels Mar 6, 2026
@Archit381
Member

Please make the following changes:

  • Remove any unnecessary changes in the following files:
      • the .github/ dir
      • the examples/ dir
      • root files (except .toml)
  • Don't change existing functions in openverifiablellm/utils.py; use the existing ones
  • Combine all tokenizer-related tests into a single file, tests/test_tokenizer.py
  • Fix any CodeRabbit issues and merge conflicts

Make sure to pull from main for the latest changes.

@github-actions github-actions bot added size/L and removed size/XL labels Mar 7, 2026
@Varshiniputtabakula
Author

I've addressed all requested changes:

• Removed unrelated changes from .github, examples, and root files
• Restored utils.py and used existing functions
• Combined tokenizer tests into tests/test_tokenizer.py
• Resolved CodeRabbit feedback
• Rebased on latest main

The PR now only contains tokenizer-related changes.



Successfully merging this pull request may close these issues.

[FEATURE]: Deterministic Tokenization Pipeline with Cryptographic Verification

3 participants