feat: deterministic tokenizer training and config hashing (#17)
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your CodeRabbit review settings.
Walkthrough

Adds a new deterministic Byte-Level BPE tokenizer module with training and config hashing, consolidates SHA-256 hashing into a public utility used by manifest/Merkle logic, and adds tokenizer tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as "User / Test Runner"
    participant Trainer as "train_tokenizer (tokenizers lib)"
    participant FS as "Filesystem (save_path)"
    participant Hasher as "hash_tokenizer_config / utils.compute_sha256"
    User->>Trainer: provide text_file, vocab_size, min_frequency
    Trainer->>Trainer: preprocess input and train ByteLevelBPETokenizer
    Trainer->>FS: write vocab.json and merges.txt
    User->>Hasher: call hash_tokenizer_config(tokenizer_path)
    Hasher->>FS: read vocab.json and merges.txt
    Hasher->>Hasher: compute_sha256(vocab.json), compute_sha256(merges.txt)
    Hasher-->>User: return {tokenizer_vocab_hash, tokenizer_merges_hash, tokenizer_vocab_size}
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Pull request overview
Adds tokenizer reproducibility support to the verification pipeline by introducing deterministic Byte-Level BPE tokenizer training and cryptographic fingerprinting (SHA256) of tokenizer config artifacts.
Changes:
- Added `openverifiablellm.tokenizer` with `train_tokenizer()` and `hash_tokenizer_config()` (hashes `vocab.json`/`merges.txt`, derives vocab size from `vocab.json`).
- Refactored hashing utilities in `openverifiablellm.utils` (introduced a more flexible `compute_sha256`, updated Merkle hashing call sites).
- Added unit tests for tokenizer training determinism and config hash stability/change detection.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/test_tokenizer.py` | New tests covering tokenizer training determinism and tokenizer config hashing behavior. |
| `pyproject.toml` | Adds runtime deps, but currently placed in an invalid section and duplicates existing deps. |
| `openverifiablellm/utils.py` | Refactors SHA256 helper and Merkle root computation; updates manifest generation call sites accordingly. |
| `openverifiablellm/tokenizer.py` | New tokenizer training + config hashing utilities built on `tokenizers` and the shared SHA256 helper. |
openverifiablellm/utils.py (Outdated)

```python
if file_path is None and data is None:
    raise ValueError("Either file_path or data must be provided.")

sha256 = hashlib.sha256()

# If keyword data is used
if data is not None:
    sha256.update(data)
    return sha256.hexdigest()
```
`compute_sha256` is documented as hashing a file OR raw bytes, but it currently allows both `file_path` and `data` to be provided (it silently prefers `data`). This can hide call-site bugs and makes the API ambiguous; consider enforcing mutual exclusivity (raise if both are set), while still supporting the backward-compatible call patterns.
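A minimal sketch of the mutually exclusive API this comment suggests (the signature is assumed from the snippet above; the real helper lives in `openverifiablellm.utils` and may differ):

```python
import hashlib
from pathlib import Path

def compute_sha256(file_path=None, *, data=None, chunk_size=1 << 20):
    """Hash a file OR raw bytes, never both."""
    if (file_path is None) == (data is None):
        raise ValueError("Provide exactly one of file_path or data.")
    sha256 = hashlib.sha256()
    if data is not None:
        sha256.update(data)
    else:
        with Path(file_path).open("rb") as f:
            # Stream in chunks so large files are not loaded into memory at once.
            while chunk := f.read(chunk_size):
                sha256.update(chunk)
    return sha256.hexdigest()
```

Making `data` keyword-only keeps the existing `compute_sha256(path)` and `compute_sha256(data=...)` call patterns working while rejecting the ambiguous combination.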
tests/test_tokenizer.py (Outdated)

```diff
@@ -0,0 +1,95 @@
+import json
+import pytest
+from pathlib import Path
```
`Path` is imported but unused in this test module. If linting (e.g., ruff/pyflakes) is enabled in CI later, this will fail; remove the unused import or use it explicitly.

```diff
-from pathlib import Path
```
tests/test_tokenizer.py (Outdated)

```python
"Wikipedia is a free online encyclopedia.\n"
"It is written collaboratively by volunteers.\n"
"Anyone can edit Wikipedia articles.\n"
"Wikipedia was launched on January 15 2001.\n"
"It is one of the most popular websites in the world.\n" * 100,
```
In `sample_text_file`, the `* 100` only applies to the final string literal due to Python's string-literal concatenation rules. If the intention is to repeat the entire multi-line sample 100x, wrap the whole block in parentheses before multiplying to avoid accidentally skewing the corpus.
```diff
-"Wikipedia is a free online encyclopedia.\n"
-"It is written collaboratively by volunteers.\n"
-"Anyone can edit Wikipedia articles.\n"
-"Wikipedia was launched on January 15 2001.\n"
-"It is one of the most popular websites in the world.\n" * 100,
+(
+    "Wikipedia is a free online encyclopedia.\n"
+    "It is written collaboratively by volunteers.\n"
+    "Anyone can edit Wikipedia articles.\n"
+    "Wikipedia was launched on January 15 2001.\n"
+    "It is one of the most popular websites in the world.\n"
+) * 100,
```
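Precedence claims like this are cheap to verify in a REPL before applying a suggestion; a minimal check of the two spellings, with short strings standing in for the fixture's lines:

```python
# Compare the unparenthesized fixture-style expression with the
# parenthesized form the suggestion proposes.
implicit = "a\n" "b\n" "c\n" * 2      # adjacent literals, then * 2
explicit = ("a\n" "b\n" "c\n") * 2    # explicitly grouped, then * 2
print(implicit == explicit)
```

Either way, the parenthesized form makes the intended repetition scope unambiguous to readers.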
pyproject.toml (Outdated)

```toml
[tool.setuptools.packages.find]
include = ["openverifiablellm*"]

dependencies = [
    "defusedxml",
    "tokenizers>=0.13.0"
]
```
The new `dependencies = [...]` block is under `[tool.setuptools.packages.find]`, which is not a valid location for runtime dependencies in PEP 621 / setuptools. As a result, `tokenizers` won't be installed when users `pip install openverifiablellm`, and importing `openverifiablellm.tokenizer` will fail. Move `tokenizers>=0.13.0` into the existing `[project].dependencies` list and remove this duplicate block (and the duplicated `defusedxml`).
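A consolidated layout along the lines this comment asks for (the `name` key and surrounding metadata are placeholders; only the placement of `dependencies` under `[project]` is the point):

```toml
[project]
name = "openverifiablellm"    # placeholder: keep the project's existing metadata
dependencies = [
    "defusedxml",
    "tokenizers>=0.13.0",
]

[project.optional-dependencies]
dev = ["pytest"]

[tool.setuptools.packages.find]
include = ["openverifiablellm*"]
```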
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 98-102: The code reads vocab_path twice; instead, read the vocab
file once into a bytes buffer, pass that bytes to compute_sha256 via the
data=... parameter to compute vocab_hash, then decode the same bytes to get JSON
and set actual_vocab_size = len(json.loads(decoded)), leaving merges_path
hashing unchanged; update references to compute_sha256, vocab_path and
actual_vocab_size accordingly so the file is not reread.
- Around line 56-64: The docstring's claim that training is "deterministic" is
not supported by the current use of ByteLevelBPETokenizer/tokenizer.train and
tokenizer.save_model because HF tokenizers can vary across runs; update the
function docstring to remove or qualify the absolute determinism claim and list
known sources of nondeterminism (hash-map iteration, library version, platform),
and/or add safeguards: pin the tokenizers package version in project
requirements, add a regression check that computes and stores hashes of the
generated vocab.json and merges.txt after tokenizer.save_model (and fail or warn
on drift), and document these reproducibility conditions near SPECIAL_TOKENS and
the training call so future maintainers know how to reproduce results.
In `@pyproject.toml`:
- Around line 26-29: Remove the duplicate top-level dependencies block and
instead add "tokenizers>=0.13.0" to the existing dependencies array under the
[project] table; locate the standalone dependencies = [ "defusedxml",
"tokenizers>=0.13.0" ] and delete it, then update the [project] dependencies
list to include tokenizers (alongside any existing entries such as defusedxml)
so the tokenizers requirement is declared only within the [project] section.
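The single-read refactor described in the tokenizer.py item above can be sketched as follows (using `hashlib` directly in place of the project's `compute_sha256` helper so the snippet is self-contained; the dict keys are taken from the sequence diagram):

```python
import hashlib
import json
from pathlib import Path

def hash_tokenizer_config(tokenizer_path):
    tokenizer_path = Path(tokenizer_path)
    # Read vocab.json once: hash the bytes and derive the vocab size
    # from the same buffer instead of re-reading the file.
    vocab_bytes = (tokenizer_path / "vocab.json").read_bytes()
    merges_bytes = (tokenizer_path / "merges.txt").read_bytes()
    return {
        "tokenizer_vocab_hash": hashlib.sha256(vocab_bytes).hexdigest(),
        "tokenizer_merges_hash": hashlib.sha256(merges_bytes).hexdigest(),
        "tokenizer_vocab_size": len(json.loads(vocab_bytes)),
    }
```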
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- openverifiablellm/tokenizer.py
- openverifiablellm/utils.py
- pyproject.toml
- tests/test_tokenizer.py
pyproject.toml (Outdated)

```toml
dependencies = [
    "defusedxml",
    "tokenizers>=0.13.0"
]
```
Duplicate `dependencies` declaration outside the `[project]` section will break installation.

The new `dependencies` block is placed at file root level, not within the `[project]` section. This is invalid for `pyproject.toml`; the key will either cause a parse error or be silently ignored, meaning `tokenizers` won't be installed.

Merge the new dependency into the existing `dependencies` list under `[project]` (lines 14-16).
🐛 Proposed fix

Remove lines 26-29 and update lines 14-16:

```diff
 dependencies = [
-    "defusedxml"
+    "defusedxml",
+    "tokenizers>=0.13.0"
 ]
-
-[project.optional-dependencies]
-dev = [
-    "pytest"
-]
-
-[tool.setuptools.packages.find]
-include = ["openverifiablellm*"]
-
-dependencies = [
-    "defusedxml",
-    "tokenizers>=0.13.0"
-]
+
+[project.optional-dependencies]
+dev = [
+    "pytest"
+]
+
+[tool.setuptools.packages.find]
+include = ["openverifiablellm*"]
```
+include = ["openverifiablellm*"]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 26 - 29, Remove the duplicate top-level
dependencies block and instead add "tokenizers>=0.13.0" to the existing
dependencies array under the [project] table; locate the standalone dependencies
= [ "defusedxml", "tokenizers>=0.13.0" ] and delete it, then update the
[project] dependencies list to include tokenizers (alongside any existing
entries such as defusedxml) so the tokenizers requirement is declared only
within the [project] section.
f2a67b8 to b4a3860 (force-pushed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)
141-152: 🧹 Nitpick | 🔵 Trivial: Validate `raw_path` explicitly before hashing.

`processed_path` is prechecked, but `raw_path` is not. Adding a direct existence check gives clearer, symmetric manifest errors.

🔧 Proposed fix

```diff
 def generate_manifest(raw_path, processed_path):
     raw_path = Path(raw_path)
     processed_path = Path(processed_path)
+    if not raw_path.exists():
+        raise FileNotFoundError(
+            f"Raw file not found at {raw_path}. Provide a valid dump file."
+        )
     if not processed_path.exists():
         raise FileNotFoundError(
             f"Processed file not found at {processed_path}. Run preprocessing first."
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 141 - 152, Add an explicit existence check for raw_path before using it to build the manifest: verify raw_path.exists() and raise a FileNotFoundError with a clear message (similar to the processed_path check) if missing, then proceed to call extract_dump_date(raw_path.name), compute_sha256(raw_path) and compute_merkle_root(raw_path) when constructing the manifest; this ensures the functions extract_dump_date, compute_sha256 and compute_merkle_root are only called on an existing raw_path and yields a symmetric error message alongside the existing processed_path check.
♻️ Duplicate comments (1)
pyproject.toml (1)
14-17: ⚠️ Potential issue | 🟠 Major: Pin `tokenizers` to an exact tested version to prevent reproducibility drift.

Line 16 currently uses a floating constraint (`tokenizers>=0.13.0`). For deterministic tokenizer artifacts, this should be pinned to a single tested version; resolver upgrades can otherwise change `vocab.json`/`merges.txt` outputs across environments.

🔧 Proposed fix

```diff
 dependencies = [
     "defusedxml",
-    "tokenizers>=0.13.0"
+    "tokenizers==0.13.0"
 ]
```

```bash
#!/bin/bash
set -euo pipefail

echo "pyproject declaration:"
rg -n 'tokenizers' pyproject.toml || true
echo
echo "lock/constraints occurrences:"
for f in $(fd -HI 'poetry.lock|uv.lock|pdm.lock|requirements.*\.txt|constraints.*\.txt'); do
  echo "== $f =="
  rg -n 'tokenizers' "$f" || true
done
```
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 14 - 17, Replace the floating constraint for the tokenizers dependency in pyproject.toml (currently "tokenizers>=0.13.0") with an exact, tested version (e.g., "tokenizers==X.Y.Z") to prevent resolver-induced changes to tokenizer artifacts; after updating the dependency string, regenerate your lock file(s) (poetry lock / pdm lock / pip-compile as appropriate) and run the provided verification script (the shell snippet that greps for "tokenizers" across pyproject.toml and lock/constraints files) to confirm the exact version is consistently pinned across manifests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 20-22: Add explicit validation at the start of the
tokenizer-building function (the function that takes vocab_size: int =
TOKENIZER_VOCAB_SIZE and min_frequency: int = TOKENIZER_MIN_FREQUENCY and
returns Path) to raise a clear ValueError if vocab_size <= 0 or min_frequency <=
0; perform the same checks in the alternate overload/variant present around
lines 51-56 (the other function signature that accepts the same parameters) so
callers fail fast with descriptive messages referencing the invalid parameter
name.
- Around line 39-46: The code currently creates the output directory
(save_path.mkdir) before checking whether the input text_file exists, which can
leave empty directories on failure; modify the logic in tokenizer.py to first
validate text_file.exists() and raise FileNotFoundError if missing, and only
after that call save_path.mkdir(parents=True, exist_ok=True) to create the
output directory (reference variables: text_file, save_path, and the Path
usage).
In `@openverifiablellm/utils.py`:
- Around line 60-76: In compute_merkle_root validate that chunk_size is a
positive integer (e.g., > 0) at the start of the function and raise a clear
ValueError if not; this prevents a chunk_size of 0 from causing the read loop to
never progress and returning the empty-file root for non-empty files—check the
chunk_size parameter (default MERKLE_CHUNK_SIZE_BYTES) and raise
ValueError("chunk_size must be a positive integer") or similar before opening
the file so subsequent logic (reading chunks and computing leaves via
compute_sha256) behaves correctly.
In `@tests/test_tokenizer.py`:
- Around line 83-99: Add a new test mirroring
test_hash_changes_when_vocab_changes that mutates the merges.txt file under the
trained_tokenizer fixture and asserts hash_tokenizer_config detects the change;
specifically, create test_hash_changes_when_merges_change(trained_tokenizer)
which reads hashes_before = hash_tokenizer_config(trained_tokenizer), then edit
the trained_tokenizer / "merges.txt" (e.g., append a dummy merge line or alter
an existing line), write it back, compute hashes_after =
hash_tokenizer_config(trained_tokenizer), and assert
hashes_before["tokenizer_merges_hash"] != hashes_after["tokenizer_merges_hash"]
so merges.txt drift is covered.
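The `chunk_size` guard requested for `compute_merkle_root` in the utils.py item above can be sketched as follows (the leaf hashing and pairing scheme here is illustrative; the project's actual implementation may pair or encode differently):

```python
import hashlib

def compute_merkle_root(path, chunk_size=1024 * 1024):
    # Reject chunk_size <= 0 up front; f.read(0) returns b"", so a zero
    # chunk size would silently yield the empty-file root for any file.
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    leaves = []
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            leaves.append(hashlib.sha256(chunk).hexdigest())
    if not leaves:  # genuinely empty file
        return hashlib.sha256(b"").hexdigest()
    while len(leaves) > 1:
        if len(leaves) % 2:  # duplicate last leaf on odd levels
            leaves.append(leaves[-1])
        leaves = [
            hashlib.sha256((leaves[i] + leaves[i + 1]).encode()).hexdigest()
            for i in range(0, len(leaves), 2)
        ]
    return leaves[0]
```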
---
Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 141-152: Add an explicit existence check for raw_path before using
it to build the manifest: verify raw_path.exists() and raise a FileNotFoundError
with a clear message (similar to the processed_path check) if missing, then
proceed to call extract_dump_date(raw_path.name), compute_sha256(raw_path) and
compute_merkle_root(raw_path) when constructing the manifest; this ensures the
functions extract_dump_date, compute_sha256 and compute_merkle_root are only
called on an existing raw_path and yields a symmetric error message alongside
the existing processed_path check.
---
Duplicate comments:
In `@pyproject.toml`:
- Around line 14-17: Replace the floating constraint for the tokenizers
dependency in pyproject.toml (currently "tokenizers>=0.13.0") with an exact,
tested version (e.g., "tokenizers==X.Y.Z") to prevent resolver-induced changes
to tokenizer artifacts; after updating the dependency string, regenerate your
lock file(s) (poetry lock / pdm lock / pip-compile as appropriate) and run the
provided verification script (the shell snippet that greps for "tokenizers"
across pyproject.toml and lock/constraints files) to confirm the exact version
is consistently pinned across manifests.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- openverifiablellm/tokenizer.py
- openverifiablellm/utils.py
- pyproject.toml
- tests/test_tokenizer.py
b4a3860 to 34ffcfd (force-pushed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/coderabbit-approval.yml (1)
38-42: 🧹 Nitpick | 🔵 Trivial: Remove unused `shouldRemoveLabel` from the returned object (or wire it to outputs).

`shouldRemoveLabel` duplicates `isCodeRabbit && isApproved` but isn't consumed later, which adds noise without value.

♻️ Minimal cleanup

```diff
 return {
   isCodeRabbit,
-  isApproved,
-  shouldRemoveLabel: isCodeRabbit && isApproved
+  isApproved
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/coderabbit-approval.yml around lines 38 - 42, The returned object includes an unused property shouldRemoveLabel (computed as isCodeRabbit && isApproved); remove shouldRemoveLabel from the returned object (or alternatively wire it to the action outputs if intended to be consumed), leaving only isCodeRabbit and isApproved, and update any callers/outputs to use isCodeRabbit && isApproved directly if needed; ensure no other code references shouldRemoveLabel before deleting it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer.py`:
- Around line 44-47: The path validation currently uses Path.exists() and can
mistakenly accept directories; change the checks for text_file, vocab_file
(vocab.json), and merges_file (merges.txt) to use Path.is_file() instead of
exists(), and keep raising FileNotFoundError with the same descriptive message;
update both the check near the text_file validation and the similar block
referenced at lines 79-83 so directories are rejected early and clearer errors
are produced.
In `@openverifiablellm/utils.py`:
- Around line 39-43: Replace existence-only checks that use path.exists() with
checks that ensure the path is a regular file via path.is_file() before
attempting to open it; for the snippet around the variable file_path (where path
= Path(file_path) and with path.open("rb") as f:) change the guard to if not
path.is_file(): raise FileNotFoundError(f"{path} not found or is not a file") so
directories are rejected early. Apply the same change to the other occurrences
noted (around lines 60-64 and 127-135) by locating the code that calls
Path(...), uses path.exists(), and then path.open(), and replace exists() with
is_file() and provide a clear FileNotFoundError message.
In `@pyproject.toml`:
- Around line 12-17: The declared Python range in pyproject.toml
(requires-python) conflicts with the pinned tokenizers version
(tokenizers==0.13.0) which only provides wheels for CPython 3.7–3.10; either
tighten requires-python to ">=3.9,<3.11" to match tokenizers==0.13.0, or update
the pinned tokenizers to a newer release that supports Python 3.11+ (and, if you
upgrade tokenizers, run and re-baseline the determinism tests afterward).
In `@tests/test_tokenizer.py`:
- Around line 10-39: Add negative-path pytest cases in tests/test_tokenizer.py
targeting the public APIs train_tokenizer and hash_tokenizer_config: write tests
that assert train_tokenizer raises for invalid arguments (e.g., vocab_size <= 0
and min_frequency < 1) using pytest.raises and descriptive error types/messages,
assert train_tokenizer raises FileNotFoundError when text_file is missing, and
assert hash_tokenizer_config raises FileNotFoundError or a clear error when
vocab.json or merges.txt are absent; reuse tmp_path fixtures to create
missing/invalid inputs and reference the functions train_tokenizer and
hash_tokenizer_config in the tests so the CI locks in the intended error
behavior.
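The negative-path tests described above could look like the following sketch. Since this page doesn't show the real implementation, a hypothetical stand-in `train_tokenizer` with the requested validation is included so the snippet is self-contained; the real function lives in `openverifiablellm.tokenizer`:

```python
import pytest
from pathlib import Path

# Hypothetical stand-in mirroring the validation the review asks for;
# the real train_tokenizer also trains the Byte-Level BPE model.
def train_tokenizer(text_file, save_path, vocab_size=30000, min_frequency=2):
    if vocab_size <= 0:
        raise ValueError(f"vocab_size must be positive, got {vocab_size}")
    if min_frequency < 1:
        raise ValueError(f"min_frequency must be >= 1, got {min_frequency}")
    text_file = Path(text_file)
    if not text_file.is_file():
        raise FileNotFoundError(f"Training text not found: {text_file}")
    save_path = Path(save_path)
    save_path.mkdir(parents=True, exist_ok=True)  # only after validation
    return save_path

def test_rejects_invalid_params(tmp_path):
    corpus = tmp_path / "corpus.txt"
    corpus.write_text("hello world\n")
    with pytest.raises(ValueError):
        train_tokenizer(corpus, tmp_path / "tok", vocab_size=0)
    with pytest.raises(ValueError):
        train_tokenizer(corpus, tmp_path / "tok", min_frequency=0)

def test_missing_text_file(tmp_path):
    with pytest.raises(FileNotFoundError):
        train_tokenizer(tmp_path / "missing.txt", tmp_path / "tok")
```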
---
Outside diff comments:
In @.github/workflows/coderabbit-approval.yml:
- Around line 38-42: The returned object includes an unused property
shouldRemoveLabel (computed as isCodeRabbit && isApproved); remove
shouldRemoveLabel from the returned object (or alternatively wire it to the
action outputs if intended to be consumed), leaving only isCodeRabbit and
isApproved, and update any callers/outputs to use isCodeRabbit && isApproved
directly if needed; ensure no other code references shouldRemoveLabel before
deleting it.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (18)
- .coderabbit.yaml
- .github/ISSUE_TEMPLATE/good_first_issue.yml
- .github/workflows/coderabbit-approval.yml
- .github/workflows/sync-pr-labels.yml
- .github/workflows/version-release.yml
- .gitignore
- .pre-commit-config.yaml
- CONTRIBUTING.md
- LICENSE
- README.md
- VERSION
- examples/demo_util.py
- examples/sample_wiki.py
- openverifiablellm/tokenizer.py
- openverifiablellm/utils.py
- pyproject.toml
- tests/test_tokenizer.py
- tests/test_util.py
Please make the following changes:
b6b6181 to 34a63dc (force-pushed)
I've addressed all requested changes:

- Removed unrelated changes from .github, examples, and root files

The PR now only contains tokenizer-related changes.
Fixes #15
Summary
This PR introduces deterministic tokenizer training, cryptographic fingerprinting of tokenizer configuration files, and a modular tokenizer architecture to support multiple tokenizer implementations.
The goal is to extend the dataset verification pipeline so that tokenizer behavior becomes reproducible and verifiable across runs.
What This Adds
Deterministic Tokenizer Training
Tokenizer Configuration Hashing
Cryptographic hashing of tokenizer artifacts:
Using SHA-256 utilities already implemented in the dataset verification pipeline.
Vocabulary Integrity Check
Modular Tokenizer Architecture
Refactors tokenizer code to support multiple tokenizer implementations:
This design allows new tokenizer types to be added without modifying the training pipeline.
Tests
New and existing tests cover:
All tests pass locally.
Why This Matters
Currently there is a verification gap between the dataset verification layer and model training.
Verified Dataset (Layer 2)
│
│ (Tokenization not verifiable)
▼
Model Training (Layer 4)
While dataset verification provides cryptographic proof for raw and processed Wikipedia data, the tokenizer step is not verified.
This makes it impossible to detect:
By hashing tokenizer configuration files, tokenization becomes cryptographically tied to a specific tokenizer state.
This closes a key reproducibility gap in the pipeline.
Relation to the Verification Framework
This work aligns with the pipeline described in:
"A Framework for Cryptographic Verifiability of End-to-End AI Pipelines" (IWSPA 2025)
The paper identifies the Extraction and Analysis phase (Stage 2) as a critical verification gap in current ML pipelines.
This PR begins addressing that gap by making tokenizer configuration verifiable.
Reference:
https://arxiv.org/pdf/2503.22573v1
Related Work
These PRs establish dataset verification.
This PR extends the verification chain to include tokenizer configuration.
Verification Chain
Raw Data Merkle Root
↓
Processed Data Merkle Root
↓
Tokenizer Config Hash
↓
Tokenized Data
↓
Model Training
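One illustrative way to commit to a chain like this (not necessarily the project's actual scheme) is to fold the per-stage digests into a single pipeline-level digest, so a change at any stage changes the final commitment:

```python
import hashlib

def chain_commitment(*stage_hashes):
    """Fold per-stage SHA-256 hex digests into one pipeline-level digest.

    Order matters: swapping two stages yields a different commitment.
    """
    h = hashlib.sha256()
    for stage in stage_hashes:
        h.update(bytes.fromhex(stage))
    return h.hexdigest()
```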
Code of Conduct