fix(708): Write consistent 2-byte UTF-16BE encoding for CEA-708 captions #1820
Merged
Previously, the write_utf16_char (C) and write_char (Rust) functions wrote 1 byte for ASCII characters (high byte = 0) and 2 bytes for non-ASCII characters. This created an invalid mix of 8-bit and 16-bit values that iconv/encoding_rs couldn't convert properly when UTF-16BE encoding was specified.

The fix always writes 2 bytes per character, ensuring consistent UTF-16BE encoding. This allows iconv to properly convert the data to UTF-8, fixing garbled output for Japanese and Chinese captions.

Before fix (garbled):
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

After fix (correct):
人々が私を知 ったとき、私は 時間管理につい て書いています

Fixes #1451

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The test was checking for the old (incorrect) behavior where ASCII characters were written as 1 byte. The fix for issue #1451 correctly changed write_char to always write 2 bytes for proper UTF-16BE encoding. Updated the test to match this correct behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
- Change the `write_utf16_char` function in C (`ccx_decoders_708_output.c`) to always write 2 bytes
- Change the `write_char` function in Rust (`decoder/output.rs`) to always write 2 bytes
Problem
When extracting CEA-708 captions with Japanese or Chinese characters using `--service all[UTF-16BE]`, the output was garbled.
The root cause was that both the C and Rust implementations wrote:
- 1 byte for ASCII characters (high byte = 0)
- 2 bytes for non-ASCII characters
This created an invalid mix of 8-bit and 16-bit values that couldn't be properly converted.
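To make the failure concrete, here is a minimal Rust sketch of the old behavior. The names `write_char_mixed`, `mixed_stream`, and `decode_utf16be_lossy` are hypothetical and are not the actual CCExtractor functions; the sketch assumes each caption character arrives as a single 16-bit code unit and uses only the standard library rather than iconv or encoding_rs.

```rust
/// Hypothetical model of the OLD writer: 1 byte when the high byte is 0,
/// otherwise 2 bytes (high byte first).
fn write_char_mixed(code: u16, out: &mut Vec<u8>) {
    let [hi, lo] = code.to_be_bytes();
    if hi == 0 {
        out.push(lo); // ASCII: only the low byte is written
    } else {
        out.push(hi);
        out.push(lo);
    }
}

/// Encode a string with the old, mixed-width writer.
fn mixed_stream(text: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in text.encode_utf16() {
        write_char_mixed(unit, &mut out);
    }
    out
}

/// Decode bytes as UTF-16BE by pairing them up two at a time.
/// Unpaired surrogates and a trailing odd byte become U+FFFD.
fn decode_utf16be_lossy(bytes: &[u8]) -> String {
    let chunks = bytes.chunks_exact(2);
    let has_trailing_byte = !chunks.remainder().is_empty();
    let units: Vec<u16> = chunks
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    let mut decoded = String::from_utf16_lossy(&units);
    if has_trailing_byte {
        decoded.push('\u{FFFD}');
    }
    decoded
}
```

For the input `"A人"` the mixed writer emits three bytes, `41 4E BA`. A UTF-16BE decoder pairs `41 4E` into one unrelated code unit and is left with a stray `BA`, so every character after the first ASCII byte is read at the wrong alignment.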
Solution
Always write 2 bytes per character, ensuring valid UTF-16BE encoding. After the fix, iconv can convert the output to UTF-8 and the Japanese and Chinese captions are no longer garbled.
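The fixed behavior can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the code from the PR: `write_char_utf16be`, `encode_utf16be`, and `decode_utf16be` are hypothetical names, the input is assumed to be one 16-bit code unit per call, and the round trip uses the standard library in place of iconv or encoding_rs.

```rust
/// Hypothetical model of the FIXED writer: always emit both bytes,
/// high byte first, even when the high byte is 0.
fn write_char_utf16be(code: u16, out: &mut Vec<u8>) {
    out.extend_from_slice(&code.to_be_bytes());
}

/// Encode a string as UTF-16BE using the fixed-width writer.
fn encode_utf16be(text: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in text.encode_utf16() {
        write_char_utf16be(unit, &mut out);
    }
    out
}

/// Strictly decode UTF-16BE bytes; returns None for an odd byte count
/// or for invalid surrogate sequences.
fn decode_utf16be(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

With this writer, `"A人"` becomes `00 41 4E BA`: every code unit occupies exactly two bytes, so the byte count is always even and a decoder stays aligned across mixed ASCII and CJK text.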
Test plan
Fixes #1451
🤖 Generated with Claude Code