
fix(708): Write consistent 2-byte UTF-16BE encoding for CEA-708 captions #1820

Merged
cfsmp3 merged 2 commits into master from fix/issue-1451-utf16-encoding on Dec 14, 2025

cfsmp3 commented Dec 14, 2025

Summary

  • Fixed the write_utf16_char function in C (ccx_decoders_708_output.c) to always write 2 bytes
  • Fixed the write_char function in Rust (decoder/output.rs) to always write 2 bytes
  • This ensures consistent UTF-16BE encoding that iconv/encoding_rs can properly convert to UTF-8

Problem

When extracting CEA-708 captions with Japanese or Chinese characters using --service all[UTF-16BE], the output was garbled:

人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

The root cause was that both C and Rust implementations wrote:

  • 1 byte for ASCII characters (high byte = 0)
  • 2 bytes for non-ASCII characters

This created an invalid mix of 8-bit and 16-bit values that couldn't be properly converted.
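The misalignment described above can be reproduced with a minimal sketch. The function names here (`write_char_mixed`, `encode_mixed`, `decode_utf16be_lossy`) are hypothetical and written for illustration only; they are not the actual CCExtractor code, and the sketch assumes text limited to the Basic Multilingual Plane.

```rust
/// Hypothetical re-creation of the old behaviour: the high byte is
/// skipped whenever it is zero, so ASCII becomes a single byte.
fn write_char_mixed(code: u16, out: &mut Vec<u8>) {
    let [hi, lo] = code.to_be_bytes();
    if hi != 0 {
        out.push(hi);
    }
    out.push(lo);
}

/// Encode a string with the mixed-width writer (illustrative helper).
fn encode_mixed(text: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in text.encode_utf16() {
        write_char_mixed(unit, &mut out);
    }
    out
}

/// Interpret bytes as UTF-16BE by pairing them two at a time, the way a
/// converter would. A trailing odd byte is ignored for simplicity.
fn decode_utf16be_lossy(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}
```

With this sketch, encoding "知 った" yields the bytes 77 E5 20 30 63 30 5F. The single byte 20 for the space shifts every later pair by one, so a decoder reads 77E5, 2030, 6330, which is the same "知‰挰" pattern that appears in the garbled output.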

Solution

Always write 2 bytes per character, ensuring valid UTF-16BE encoding. After the fix:

人々が私を知 ったとき、私は 時間管理につい て書いています
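As a rough sketch of the fixed behaviour, the writer below always emits both bytes, high byte first, so the stream stays aligned on 2-byte boundaries. The names `write_char_fixed`, `encode_utf16be`, and `decode_utf16be` are hypothetical and do not reflect the real signatures in `ccx_decoders_708_output.c` or `decoder/output.rs`.

```rust
/// Hypothetical fixed writer: always emit two bytes per code unit,
/// high byte first (big-endian), even when the high byte is zero.
fn write_char_fixed(code: u16, out: &mut Vec<u8>) {
    out.extend_from_slice(&code.to_be_bytes());
}

/// Encode a string as UTF-16BE using the fixed writer.
fn encode_utf16be(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len() * 2);
    for unit in text.encode_utf16() {
        write_char_fixed(unit, &mut out);
    }
    out
}

/// Strict decode: returns None for an odd-length buffer or for
/// invalid UTF-16 (for example, an unpaired surrogate).
fn decode_utf16be(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

Under these assumptions, an ASCII character such as "A" is written as 00 41 rather than 41, and text mixing spaces with Japanese characters round-trips unchanged through `decode_utf16be`.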

Test plan

  • Downloaded and tested with the sample file from issue #1451 ("[BUG] A mix of 8-bit/16-bit chars sent to iconv")
  • Verified Japanese captions in service 2 now display correctly
  • Verified Chinese captions in service 3 now display correctly
  • Verified no encoding errors are reported
  • Verified build succeeds for both C and Rust components

Fixes #1451

🤖 Generated with Claude Code

cfsmp3 and others added 2 commits December 14, 2025 13:05
Previously, the write_utf16_char (C) and write_char (Rust) functions
wrote 1 byte for ASCII characters (high byte = 0) and 2 bytes for
non-ASCII characters. This created an invalid mix of 8-bit and 16-bit
values that iconv/encoding_rs couldn't convert properly when UTF-16BE
encoding was specified.

The fix always writes 2 bytes per character, ensuring consistent
UTF-16BE encoding. This allows iconv to properly convert the data to
UTF-8, fixing garbled output for Japanese and Chinese captions.

Before fix (garbled):
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

After fix (correct):
人々が私を知 ったとき、私は 時間管理につい て書いています

Fixes #1451

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The test was checking for the old (incorrect) behavior where ASCII
characters were written as 1 byte. The fix for issue #1451 correctly
changed write_char to always write 2 bytes for proper UTF-16BE encoding.
Updated the test to match this correct behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cfsmp3 merged commit a0129df into master Dec 14, 2025
28 of 29 checks passed
cfsmp3 deleted the fix/issue-1451-utf16-encoding branch December 14, 2025 15:42