Upgrade vendored darts-clone to v0.32h and harden OCD dictionary loading #1372
Merged
Add the upstream v0.32h header and BSD license under a separate dependency directory without changing the active build configuration.
Use OpenCC's existing size_t id_type for the vendored v0.32h header so the serialized Darts array layout stays compatible with the current OCD format.
Point Bazel, CMake, Windows CLI Zig builds, and npm packaging at the compatible vendored v0.32h dependency. Add local Bazel module metadata so @darts-clone resolves from the repository copy.
Delete the unused darts-clone 0.32 header after all build and package paths were switched to the compatible v0.32h vendor directory.
Apply the darts.h validation hardening from google/sentencepiece@d685ef31 and wire it into OpenCC's OCD loader. Validate serialized Darts arrays for unit alignment, root/offset bounds, and lexicon value bounds with regression coverage for malformed dictionaries.
You can upstream this updated version immediately in bcr.1:
Summary
- Vendor upstream darts-clone v0.32h under deps/darts-clone-0.32h/ and adapt its id_type to size_t so the serialized OCD array layout stays compatible with the existing format.
- Point Bazel, CMake, Windows CLI Zig builds, and npm packaging at the new vendor directory; remove the old vendored header.
- Harden DartsDict loading with three validation checks ported from google/sentencepiece@d685ef31: unit-size alignment of the serialized array, root/offset bounds via doubleArray->validate(), and lexicon value bounds. Malformed .ocd files now throw InvalidFormat instead of silently producing undefined behavior.
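For reviewers who want a feel for the three checks, here is a minimal sketch of how they could be ordered. It is not the code in this PR. The unit layout (32-bit units, bit 31 marking a leaf, offset in the upper bits), the InvalidFormatError type, and the LoadCheckedUnits function are all hypothetical stand-ins chosen for illustration; the real checks live in darts.h and the OCD loader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for OpenCC's InvalidFormat exception.
struct InvalidFormatError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Toy unit layout (an assumption, not the real darts-clone encoding):
// bit 31 set means "leaf", with the value in the low 31 bits;
// otherwise the child offset is stored in bits 10 and above.
using Unit = std::uint32_t;
constexpr Unit kLeafBit = 1u << 31;

inline bool IsLeaf(Unit u) { return (u & kLeafBit) != 0; }
inline std::size_t LeafValue(Unit u) { return u & (kLeafBit - 1); }
inline std::size_t Offset(Unit u) { return u >> 10; }

// Copies the serialized bytes into units, rejecting malformed input.
std::vector<Unit> LoadCheckedUnits(const std::vector<unsigned char>& bytes,
                                   std::size_t lexicon_size) {
  // Check 1: the byte length must be a non-zero multiple of the unit size.
  if (bytes.empty() || bytes.size() % sizeof(Unit) != 0) {
    throw InvalidFormatError("array size is not a multiple of the unit size");
  }
  std::vector<Unit> units(bytes.size() / sizeof(Unit));
  std::memcpy(units.data(), bytes.data(), bytes.size());
  assert(!units.empty());  // guaranteed by check 1

  // Check 2: the root must be an inner unit whose offset stays in bounds.
  const Unit root = units[0];
  if (IsLeaf(root) || Offset(root) >= units.size()) {
    throw InvalidFormatError("root unit is corrupted or out of bounds");
  }

  // Check 3: every leaf value must index into the lexicon.
  for (Unit u : units) {
    if (IsLeaf(u) && LeafValue(u) >= lexicon_size) {
      throw InvalidFormatError("leaf value exceeds lexicon size");
    }
  }
  return units;
}
```

The ordering matters in this sketch: the alignment check runs before any unit is read, so the later checks never touch a partial unit.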
NOTE: The .ocd format has always been platform-dependent — id_type is size_t,
so files generated on 64-bit and 32-bit builds are mutually incompatible. This is a
pre-existing limitation predating this PR; .ocd2 (marisa-trie, the default format) is
unaffected.
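To illustrate why the unit width makes the format platform-dependent, the sketch below simulates a 32-bit and a 64-bit size_t with fixed-width integer types. The helper names (SerializeUnits, IsWholeNumberOfUnits, UnitCount) are hypothetical and exist only for this example; they are not part of OpenCC or darts-clone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Writes the units as raw bytes, the way a size_t-based array would be
// dumped on a platform where size_t has the width of IdType.
template <typename IdType>
std::vector<unsigned char> SerializeUnits(const std::vector<IdType>& units) {
  std::vector<unsigned char> bytes(units.size() * sizeof(IdType));
  assert(bytes.size() / sizeof(IdType) == units.size());
  if (!bytes.empty()) {
    std::memcpy(bytes.data(), units.data(), bytes.size());
  }
  return bytes;
}

// True when the byte length is a whole number of IdType-sized units.
template <typename IdType>
bool IsWholeNumberOfUnits(const std::vector<unsigned char>& bytes) {
  return bytes.size() % sizeof(IdType) == 0;
}

// Number of units a reader with this IdType would believe the file holds.
template <typename IdType>
std::size_t UnitCount(const std::vector<unsigned char>& bytes) {
  return bytes.size() / sizeof(IdType);
}
```

In this model, three units written with a 4-byte IdType occupy 12 bytes, which a reader using an 8-byte IdType sees as a partial unit, so an alignment check would reject it. The reverse direction is subtler: three 8-byte units occupy 24 bytes, which a 4-byte reader accepts as six units, so an alignment check alone would not catch that mismatch.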
Test plan
- DartsDictTest and ConfigTest suites pass.
- DartsDictTest cases cover the added rejection paths:
  - RejectsMisalignedDartsSize: array size not a multiple of unit size
  - RejectsInvalidDartsRoot: corrupted root unit
  - RejectsInvalidDartsValue: out-of-bounds value in a leaf unit
- bazel build //src:opencc_lib //src:opencc
- Build with -DENABLE_DARTS=ON