fix(708): Write consistent 2-byte UTF-16BE encoding for CEA-708 captions by cfsmp3 · Pull Request #1820 · CCExtractor/ccextractor

cfsmp3 · 2025-12-14T12:06:03Z

Summary

Fixed the write_utf16_char function in C (ccx_decoders_708_output.c) to always write 2 bytes
Fixed the write_char function in Rust (decoder/output.rs) to always write 2 bytes
This ensures consistent UTF-16BE encoding that iconv/encoding_rs can properly convert to UTF-8

Problem

When extracting CEA-708 captions with Japanese or Chinese characters using --service all[UTF-16BE], the output was garbled:

人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

The root cause was that both C and Rust implementations wrote:

1 byte for ASCII characters (high byte = 0)
2 bytes for non-ASCII characters

This created an invalid mix of 8-bit and 16-bit values that couldn't be properly converted.
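The misalignment described above can be reproduced with a minimal sketch. The function names `write_mixed_width` and `decode_utf16be_lossy` are hypothetical stand-ins, not the actual CCExtractor code, and the sketch assumes each caption symbol is available as a single 16-bit code unit. It shows how dropping the zero high byte of one ASCII space shifts every following byte pair by one position.

```rust
/// Hypothetical stand-in for the pre-fix behavior: the high byte is
/// skipped whenever it is 0, so ASCII symbols occupy only 1 byte.
fn write_mixed_width(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    for &unit in units {
        let [high, low] = unit.to_be_bytes();
        if high != 0 {
            out.push(high);
        }
        out.push(low);
    }
    out
}

/// Decode a byte stream as UTF-16BE the way a strict 2-byte reader would.
/// Invalid sequences become U+FFFD; a trailing odd byte is ignored.
fn decode_utf16be_lossy(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}
```

Feeding the code units for `知`, a space, `っ`, and `た` (U+77E5, U+0020, U+3063, U+305F) through `write_mixed_width` yields 7 bytes instead of 8. Decoding them as UTF-16BE pairs the lone `0x20` with the high byte of the next character, which gives `知‰挰`, the same prefix seen in the garbled sample.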

Solution

Always write 2 bytes per character, ensuring valid UTF-16BE encoding. After the fix:

人々が私を知 ったとき、私は 時間管理につい て書いています
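A minimal sketch of the fixed behavior, under the same assumption that each symbol is one 16-bit code unit. The names `write_utf16be` and `decode_utf16be` are illustrative only and do not mirror the signatures of `write_utf16_char` or `write_char` in the repository.

```rust
/// Always emit both bytes of every code unit, high byte first,
/// including a 0x00 high byte for ASCII.
fn write_utf16be(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len() * 2);
    for &unit in units {
        out.extend_from_slice(&unit.to_be_bytes());
    }
    out
}

/// Strict UTF-16BE decode: returns None on an odd byte count
/// or on an invalid surrogate sequence.
fn decode_utf16be(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

With every symbol occupying exactly 2 bytes, the space is written as `0x00 0x20` and the stream round-trips: encoding `"知 った"` and decoding it again returns the original string.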

Test plan

Downloaded and tested with the sample file from issue #1451 ("[BUG] A mix of 8-bit/16-bit chars sent to iconv")
Verified Japanese captions in service 2 now display correctly
Verified Chinese captions in service 3 now display correctly
Verified no encoding errors are reported
Verified build succeeds for both C and Rust components

Fixes #1451

🤖 Generated with Claude Code

Previously, the write_utf16_char (C) and write_char (Rust) functions wrote 1 byte for ASCII characters (high byte = 0) and 2 bytes for non-ASCII characters. This created an invalid mix of 8-bit and 16-bit values that iconv/encoding_rs couldn't convert properly when UTF-16BE encoding was specified.

The fix always writes 2 bytes per character, ensuring consistent UTF-16BE encoding. This allows iconv to properly convert the data to UTF-8, fixing garbled output for Japanese and Chinese captions.

Before fix (garbled): 人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰
After fix (correct): 人々が私を知ったとき、私は時間管理について書いています

Fixes #1451

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The test was checking for the old (incorrect) behavior where ASCII characters were written as 1 byte. The fix for issue #1451 correctly changed write_char to always write 2 bytes for proper UTF-16BE encoding. Updated the test to match this correct behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cfsmp3 and others added 2 commits December 14, 2025 13:05

cfsmp3 merged commit a0129df into master Dec 14, 2025
28 of 29 checks passed

cfsmp3 deleted the fix/issue-1451-utf16-encoding branch December 14, 2025 15:42
