Split tofu-risk TS character mappings into an extension table #1234

Merged
frankslin merged 1 commit into BYVoid:master from frankslin:upstream-master2 on May 21, 2026
@frankslin
Collaborator

Keep tofu-risk mappings in TSCharacters.txt as non-default candidates by making the source character the first candidate on annotated rows. This removes those rare inferred simplified forms from the default TSCharacters conversion result when the extension table is not loaded.
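
To illustrate the effect of putting the source character first, here is a minimal sketch. It assumes the tab-separated `key<TAB>candidate candidate ...` row layout and that the first candidate is the one used by default. The helper names and the placeholder characters `X` and `x` are hypothetical and are not taken from the actual table.

```python
def parse_row(line):
    """Split a 'key<TAB>cand1 cand2 ...' row into (key, [candidates])."""
    key, _, rest = line.rstrip("\n").partition("\t")
    return key, rest.split(" ")


def default_target(line):
    """Return the candidate assumed to be used by default: the first one."""
    _key, candidates = parse_row(line)
    return candidates[0]


# Hypothetical rows: 'X' stands for a traditional character and 'x' for its
# rare inferred simplified form.
old_row = "X\tx"    # before: the risky form is the default
new_row = "X\tX x"  # after: identity first, risky form kept as a non-default candidate

print(default_target(old_row))  # x
print(default_target(new_row))  # X
```

With the identity candidate in front, the character converts to itself unless another table supplies a different mapping first.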

Update extract_tofu_risk.py so TSCharactersExt.txt is generated from the same annotations while skipping the newly inserted identity candidate. The generated extension table therefore preserves the old risk mappings, allowing configurations that load it before TSCharacters.ocd2 to keep existing conversion behavior.
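
The extraction step might look roughly like the sketch below. It is not the real `extract_tofu_risk.py`: it assumes the row has already been identified as annotated (how the script detects annotations is not shown in this description), and the function name `to_ext_row` is hypothetical.

```python
def to_ext_row(line):
    """Build an extension-table row from an annotated main-table row.

    Assumes 'key<TAB>cand1 cand2 ...' rows. A leading candidate equal to
    the key (the inserted identity candidate) is dropped, so the extension
    row carries only the original risk mappings.
    """
    key, _, rest = line.rstrip("\n").partition("\t")
    candidates = [c for c in rest.split(" ") if c]
    if candidates and candidates[0] == key:
        candidates = candidates[1:]
    if not candidates:
        return None  # nothing left to export for this row
    return key + "\t" + " ".join(candidates)


# Hypothetical annotated row: identity candidate 'X' followed by the risky form 'x'.
print(repr(to_ext_row("X\tX x")))  # 'X\tx'
```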

Wire TSCharactersExt into Bazel, CMake, Node gyp, npm packaging, and the t2s/hk2s/tw2s/tw2sp configs, including the Jieba tw2sp config. The extension dictionary is generated and compiled before TSCharacters.ocd2 in those configurations so current default software behavior remains unchanged.
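
Why the ordering matters can be shown with a toy model. The sketch assumes a first-match-wins lookup across an ordered group of dictionaries, which is what the description implies. The tables are plain Python dicts with hypothetical placeholder entries, not the real `.ocd2` data or the converter's API.

```python
def lookup(char, dict_group):
    """Return the default target from the first dictionary that has the key."""
    for table in dict_group:
        if char in table:
            return table[char][0]  # first candidate is the default
    return char  # unmapped characters pass through unchanged


# Hypothetical contents: 'X' is a tofu-risk row, 'Y' is an ordinary row.
ts_characters_ext = {"X": ["x"]}
ts_characters = {"X": ["X", "x"], "Y": ["y"]}

with_ext = [ts_characters_ext, ts_characters]  # extension table listed first
without_ext = [ts_characters]                  # extension table not loaded

print(lookup("X", with_ext))     # x  (old behavior preserved)
print(lookup("X", without_ext))  # X  (risky form no longer produced)
print(lookup("Y", with_ext))     # y  (ordinary rows unaffected)
```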

Verified with: bazel build //data/dictionary:generate_bin_TSCharactersExt; bazel test //data/config:config_dict_validation_test; bazel test //test:command_line_converter_test; bazel test //data/dictionary:dictionary_test; bazel test //data/config:config_schema_validation_test; cmake --build build/dbg --target Dictionaries; ./node_modules/.bin/node-gyp rebuild; node scripts/prepare-node-prebuild-artifacts.js; python3 -m build --wheel.

Note: npm test was also run, but it still fails existing fruit-drying testcase expectations unrelated to TSCharactersExt packaging.

@frankslin
Collaborator Author

frankslin commented May 20, 2026

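@danny0838 This version is ready for testing. The current behavior stays the same as before. Removing the TSCharactersExt.ocd2 dictionary from the affected JSON configs switches to the new behavior, which no longer produces "tofu" (missing-glyph boxes).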
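
The config change described in this comment could be scripted along these lines. This is a sketch only: the sample JSON is a trimmed, assumed shape for a config file and not a copy of any shipped config, and the helper `drop_dict` is hypothetical.

```python
import json


def drop_dict(node, filename):
    """Recursively remove dictionary entries whose 'file' equals filename."""
    if isinstance(node, dict):
        return {k: drop_dict(v, filename) for k, v in node.items()}
    if isinstance(node, list):
        return [
            drop_dict(item, filename)
            for item in node
            if not (isinstance(item, dict) and item.get("file") == filename)
        ]
    return node


# Assumed, simplified config shape for illustration.
config = json.loads("""
{
  "name": "example",
  "conversion_chain": [{
    "dict": {
      "type": "group",
      "dicts": [
        {"type": "ocd2", "file": "TSPhrases.ocd2"},
        {"type": "ocd2", "file": "TSCharactersExt.ocd2"},
        {"type": "ocd2", "file": "TSCharacters.ocd2"}
      ]
    }
  }]
}
""")

cleaned = drop_dict(config, "TSCharactersExt.ocd2")
print(json.dumps(cleaned, indent=2))
```

The function returns a new structure and leaves the input untouched, so the original config can still be written back if the old behavior is wanted.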

Ref: #217

@frankslin frankslin merged commit 09f530c into BYVoid:master May 21, 2026
32 checks passed
@frankslin frankslin deleted the upstream-master2 branch May 21, 2026 17:04