Add experimental Jieba segmentation plugin and standalone packaging support #1093

Merged
frankslin merged 20 commits into BYVoid:master from frankslin:upstream-master
Apr 13, 2026

Collaborator

@frankslin frankslin commented Apr 11, 2026

Summary

This PR adds the first loadable segmentation plugin for OpenCC and wires in an experimental jieba-based segmenter.

It includes:

  • a plugin host in core OpenCC for non-built-in segmenters
  • a plugins/jieba implementation backed by cppjieba
  • CMake and Bazel build targets for the plugin
  • integration tests and plugin-specific configs for s2twp and tw2sp
  • standalone CMake build support so downstream packagers can ship the plugin separately from core OpenCC

The plugin is dynamically loaded during opencc execution.

What changed

Core plugin host (in earlier commit)

  • add a runtime plugin loading path for non-built-in segmentation.type
  • define and install the plugin API header used by external segmenter plugins
  • keep built-in behavior unchanged for existing configs such as mmseg

Jieba plugin

  • add third-party library cppjieba under plugins/jieba/deps, including required resources (in earlier commit)
  • add opencc-jieba shared library target
  • add plugin configs:
    • s2twp_jieba.json
    • tw2sp_jieba.json

Build and packaging

  • add integrated CMake/Bazel targets under plugins/jieba
  • support building plugins/jieba as a standalone CMake project via find_package(OpenCC)
  • align standalone install paths with the integrated OpenCC plugin/data layout
  • document standalone packaging steps for distro maintainers

Tests

  • add integration coverage that exercises:
    • the built opencc command
    • the built opencc-jieba plugin
    • real plugin configs and resource files
  • add regression-style cases for known segmentation-sensitive conversions such as:
    • 慰藉著
    • 城堡的士兵

Notes

  • jieba remains experimental and is exposed through plugin-specific configs instead of replacing built-in segmentation.
  • This PR is structured so downstream packages can choose to distribute core OpenCC and opencc-jieba separately.

…plugin

- Refactored C++ calls to use the modern cppjieba::Jieba interface.
- Updated CMakeLists and BUILD.bazel to point to local plugin dependencies.
- Added portable_util export to Bazel test framework to support cross-package Jieba plugin integration tests.
…lone project

- Adapted plugins/jieba/CMakeLists.txt to dynamically detect CMAKE_CURRENT_SOURCE_DIR.
- Resolves upstream OpenCC C++ Headers dynamically via find_package(opencc) during standalone builds.
- Guards internal component integrations (like test cases) when OpenCC root tree isn't present.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e0c9996fa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/jieba/CMakeLists.txt Outdated

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 298e2bbc69

Comment thread plugins/jieba/tests/JiebaPluginIntegrationTest.cpp Outdated
Comment thread plugins/jieba/tests/JiebaPluginIntegrationTest.cpp Outdated

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

bool IsReadableFile(const std::string& path) {
  std::ifstream ifs(path.c_str());
  return ifs.is_open();
}

P2 Badge Validate Jieba resource paths with Windows-safe path API

Jieba resource existence checks are routed through IsReadableFile, which uses narrow std::ifstream(path.c_str()); on Windows this fails for UTF-8/non-ASCII paths even when files exist. In those environments the plugin incorrectly returns OPENCC_ERROR_PLUGIN_RESOURCE_MISSING for dict_path/model_path/aux files, breaking conversions for valid installations under localized directory names.
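
One possible way to make this check independent of the narrow-character code page is to route the path through std::filesystem, sketched below. This is only an illustration under stated assumptions, not the fix that landed in this PR: the helper names PathFromUtf8 and IsReadableFileUtf8 are hypothetical, the config is assumed to store paths as UTF-8, and a C++17 (or later) toolchain is assumed.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper: interpret a std::string as UTF-8 and build a path.
// On Windows, std::filesystem::path stores wide characters, so a stream
// opened from it goes through the wide-character file API.
inline std::filesystem::path PathFromUtf8(const std::string& utf8) {
#if defined(__cpp_char8_t)
  // C++20: u8path is deprecated, so construct from a char8_t string instead.
  return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
  // C++17: u8path converts UTF-8 to the native path encoding.
  return std::filesystem::u8path(utf8);
#endif
}

// Hypothetical replacement for the narrow std::ifstream(path.c_str()) check.
inline bool IsReadableFileUtf8(const std::string& utf8Path) {
  std::ifstream ifs(PathFromUtf8(utf8Path), std::ios::binary);
  return ifs.is_open();
}
```

On non-Windows platforms the conversion is effectively a pass-through, so the same helper can be used unconditionally rather than behind an #ifdef.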

Comment thread src/PluginSegmentation.cpp

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

if (!ReadConfigValue(args->config, args->config_size, "dict_path", &dictPath) ||
    dictPath.empty()) {
  SetError(args->error, OPENCC_ERROR_PLUGIN_RESOURCE_MISSING,
           "Required resource missing: dict_path");
  return -1;
}

P2 Badge Clear output handle on create() failure paths

CreateJiebaSegmentation returns failure when required resources are missing, but these early-return paths never reset *args->out to null. The plugin ABI contract in OpenCCPlugin.h says create() failures must leave *out == NULL; otherwise callers that reuse a non-null handle variable can keep a stale pointer and may later destroy or reuse an invalid handle after initialization failed.
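
The pattern being asked for can be sketched as follows: reset the output handle before any validation, so that every early return leaves it null. The types and functions here (SketchHandle, SketchCreateArgs, SketchCreate, SketchDestroy) are hypothetical stand-ins for illustration only; they are not the definitions in OpenCCPlugin.h, and the single dictPath field is an assumed simplification of the real config lookup.

```cpp
#include <string>

// Hypothetical stand-ins for the plugin ABI; the real definitions live in
// OpenCCPlugin.h and are not reproduced here.
struct SketchHandle {
  std::string dictPath;
};

struct SketchCreateArgs {
  const char* dictPath;  // may be null or empty when the config omits it
  SketchHandle** out;    // receives the new handle on success
};

// Returns 0 on success and -1 on failure. *out is reset before any check,
// so every early return below leaves it null.
int SketchCreate(const SketchCreateArgs* args) {
  if (args == nullptr || args->out == nullptr) {
    return -1;
  }
  *args->out = nullptr;  // clear first: failure paths need no extra cleanup
  if (args->dictPath == nullptr || args->dictPath[0] == '\0') {
    return -1;  // required resource missing; *out stays null
  }
  *args->out = new SketchHandle{args->dictPath};
  return 0;
}

// Matching destroy; deleting a null handle is a no-op.
void SketchDestroy(SketchHandle* handle) { delete handle; }
```

Clearing the handle once at the top, rather than in each failure branch, means a newly added early return cannot accidentally skip the reset.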

Comment thread plugins/jieba/CMakeLists.txt

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f6a2c682e

Comment thread plugins/jieba/CMakeLists.txt Outdated

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fa48b5904

Comment thread plugins/jieba/src/JiebaSegmentationPlugin.cpp Outdated

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 450d480944

Comment thread plugins/jieba/BUILD.bazel

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 301c7229c4

Comment thread plugins/jieba/src/JiebaSegmentationPlugin.cpp Outdated

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d42994ebd

Comment on lines +146 to +149
#ifdef _WIN32
"_main/plugins/jieba/opencc-jieba.dll",
"_main/plugins/jieba/opencc-jieba.exe",
#elif defined(__APPLE__)

P2 Badge Include all Windows DLL variants in Bazel plugin lookup

In PluginDirectory() the Bazel/Windows candidate list only checks opencc-jieba.dll and opencc-jieba.exe, but the host loader also accepts libopencc-<type>.dll and msys-opencc-<type>.dll (see GetPluginFileNames in src/PluginSegmentation.cpp). When Bazel emits one of those alternate DLL names (for example with MinGW-flavored toolchains), this test path resolves to "", sets OPENCC_SEGMENTATION_PLUGIN_PATH to empty, and fails with a false plugin-not-found error even though the plugin was built.

@BYVoid
Owner

BYVoid commented Apr 13, 2026

In the PR description please also briefly explain how the plugin is loaded into opencc (statically linked, dynamically linked or dynamically loaded).

Comment thread Makefile

 PREFIX = /usr
-REL_BUILD_DOCUMENTATION ?= ON
+REL_BUILD_DOCUMENTATION ?= OFF
Owner

Unrelated change?

Collaborator Author

This was needed to resolve a CI failure caused by a missing doxygen dependency in the GitHub Actions runtime. I forget how it was triggered, but it did not seem to reproduce on every platform combination. If generating docs is still needed, we can run it once in the release workflow, on a single architecture.

Comment thread plugins/jieba/BUILD.bazel
)

cc_binary(
name = "opencc-jieba",
Owner

Is this binary target a library to be dynamically linked to opencc? If so, I would prefer a more explicit name like libopencc-jieba

Collaborator Author

The "binary" target outputs a platform-dependent dynamic library that follows each platform's naming convention:

  • Linux: libopencc-jieba.so
  • macOS: libopencc-jieba.dylib
  • Windows: opencc-jieba.dll (or msys-opencc-jieba.dll for MinGW/MSYS)

Using a name like "lib..." could imply it's for a cc_library rule (which it isn't). And on Windows, the naming convention is to drop the "lib" prefix.
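
The per-platform names discussed in this thread can be summarized in a small lookup, sketched below. It is not the actual GetPluginFileNames from src/PluginSegmentation.cpp: the function name SketchPluginFileNames, the SketchPlatform enum, and the probing order on Windows are all assumptions made for illustration; only the file names themselves come from the comments above.

```cpp
#include <string>
#include <vector>

// Hypothetical platform tag; a real loader would select this at compile time.
enum class SketchPlatform { kLinux, kMacOS, kWindows };

// Hypothetical helper: lists the file names a loader could probe for a
// segmentation type such as "jieba", using the names quoted in this thread.
std::vector<std::string> SketchPluginFileNames(const std::string& type,
                                               SketchPlatform platform) {
  const std::string base = "opencc-" + type;
  switch (platform) {
    case SketchPlatform::kLinux:
      return {"lib" + base + ".so"};
    case SketchPlatform::kMacOS:
      return {"lib" + base + ".dylib"};
    case SketchPlatform::kWindows:
      // Plain name first, then the lib- and msys- prefixed variants
      // (the order here is an assumption).
      return {base + ".dll", "lib" + base + ".dll", "msys-" + base + ".dll"};
  }
  return {};  // unreachable for valid enum values
}
```

A test helper that resolves the built plugin could iterate over the same list instead of hard-coding a single name, which would cover the alternate Windows DLL names raised in the earlier review comment.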

@frankslin frankslin merged commit e0a3818 into BYVoid:master Apr 13, 2026
27 checks passed
@frankslin frankslin deleted the upstream-master branch April 13, 2026 20:13