
Switch segmentation plugin ABI to codepoint lengths #1111

Merged
frankslin merged 3 commits into BYVoid:master from frankslin:upstream-master
Apr 15, 2026

@frankslin
Collaborator

@frankslin frankslin commented Apr 15, 2026

Update the segmentation plugin ABI entry point to:

  • opencc_get_segmentation_plugin_v2()

Segmentation results are returned as a sequence of segment lengths measured in
Unicode code points, not as copied token strings. The ABI contract is:

  • input text is passed to the plugin as null-terminated UTF-8
  • the plugin returns segment_count plus codepoint_lengths
  • each element is the number of Unicode code points in the next segment
  • lengths must be positive and must cover the full input, in order
  • the host reconstructs segment boundaries from the original UTF-8 input
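
As an illustration of the last point, here is a minimal host-side sketch of rebuilding segment boundaries from the original UTF-8 text and a list of code point lengths. The names `Utf8SequenceLength` and `SplitByCodepointLengths` are hypothetical and are not taken from the OpenCC sources. The sketch assumes the lengths are 32-bit unsigned integers and only inspects lead bytes, so it does not fully validate UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not OpenCC's actual code.
// Returns the byte length of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (e.g. a continuation byte).
inline std::size_t Utf8SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

// Hypothetical helper: splits `text` into views over the original buffer,
// one per entry in `codepoint_lengths`. Returns std::nullopt if a length is
// zero, if the lengths run past the end of the text, or if they do not
// cover the full input.
inline std::optional<std::vector<std::string_view>> SplitByCodepointLengths(
    std::string_view text,
    const std::vector<std::uint32_t>& codepoint_lengths) {
  std::vector<std::string_view> segments;
  segments.reserve(codepoint_lengths.size());
  std::size_t pos = 0;
  for (std::uint32_t length : codepoint_lengths) {
    if (length == 0) return std::nullopt;  // lengths must be positive
    const std::size_t start = pos;
    for (std::uint32_t i = 0; i < length; ++i) {
      if (pos >= text.size()) return std::nullopt;  // lengths overrun input
      const std::size_t n =
          Utf8SequenceLength(static_cast<unsigned char>(text[pos]));
      if (n == 0 || pos + n > text.size()) return std::nullopt;
      pos += n;
    }
    // No copy: the segment is a view into the caller's UTF-8 buffer.
    segments.push_back(text.substr(start, pos - start));
  }
  if (pos != text.size()) return std::nullopt;  // must cover the full input
  return segments;
}
```

Because the segments are views into the host's own buffer, no per-token string needs to cross the plugin boundary in this sketch.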

This keeps the ABI simpler and avoids allocating one string per token across
the plugin boundary.
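
On the plugin side, a segmenter that already holds its tokens as UTF-8 strings only needs to report their code point counts. The following is a rough sketch under that assumption. `CountCodepoints` and `TokensToCodepointLengths` are hypothetical names, the input is assumed to be valid UTF-8, and the way the resulting array is handed back through `opencc_get_segmentation_plugin_v2()` is not shown because this description does not specify it.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: counts code points in valid UTF-8 by counting every
// byte that is not a continuation byte (bit pattern 10xxxxxx).
inline std::uint32_t CountCodepoints(const std::string& utf8) {
  std::uint32_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical helper: converts a plugin's internal token list into the
// sequence of code point lengths described above. Empty tokens are skipped
// because every reported length must be positive.
inline std::vector<std::uint32_t> TokensToCodepointLengths(
    const std::vector<std::string>& tokens) {
  std::vector<std::uint32_t> lengths;
  lengths.reserve(tokens.size());
  for (const std::string& token : tokens) {
    const std::uint32_t n = CountCodepoints(token);
    if (n > 0) lengths.push_back(n);
  }
  return lengths;
}
```

The size of the returned vector corresponds to `segment_count`, and its contents to `codepoint_lengths`.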

@BYVoid
Owner

BYVoid commented Apr 15, 2026

Add description?

@frankslin
Collaborator Author

> Add description?

Done.

@frankslin frankslin merged commit 3cdf22c into BYVoid:master Apr 15, 2026
28 checks passed
@frankslin frankslin deleted the upstream-master branch April 15, 2026 04:54