[FIX] Replace .expect() with .unwrap_or() in Ucs2String case conversion — fixes panic on CEA-708 surrogate input#2239

Open
NexionisJake wants to merge 1 commit into CCExtractor:master from NexionisJake:fix/ucs2-surrogate-panic-case-conversion

@NexionisJake
Contributor

In raising this pull request, I confirm the following:

Reason for this PR:

  • This PR adds new functionality.
  • This PR fixes a bug that I have personally experienced or that a real user has reported and for which a sample exists.
  • This PR is porting code from C to Rust.

Sanity check:

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • If the PR adds new functionality, I've added it to the changelog. If it's just a bug fix, I have NOT added it to the
    changelog.
  • I am NOT adding new C code unless it's to fix an existing, reproducible bug.

Repro instructions:

Process any CEA-708 stream whose subtitle text contains UCS-2 surrogate code units (0xD800–0xDFFF) with a case-conversion
path enabled (e.g. --sentencecap). CCExtractor panics immediately:

thread 'main' panicked at 'Invalid u32 character', src/rust/lib_ccxr/src/util/encoding.rs:245

A minimal Rust reproducer:
use lib_ccxr::util::encoding::Ucs2String;

fn main() {
    let s = Ucs2String::from_vec(vec![0xD800]); // lone high surrogate
    let _ = s.to_lowercase(); // panics before this fix
}


Root Cause

Ucs2String::to_lowercase() and to_uppercase() in src/rust/lib_ccxr/src/util/encoding.rs called:

char::from_u32(c as u32).expect("Invalid u32 character")

UCS-2 surrogate code units (0xD800–0xDFFF) are valid u16 values but are not valid Unicode scalar values. char::from_u32()
returns None for them, and .expect() panics whenever it is called on None. Any real-world CEA-708 broadcast stream carrying
surrogate code units therefore crashed CCExtractor during case conversion, with no recovery path.
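
The behaviour described above can be checked with the standard library alone. The following is a minimal sketch; the helper names `is_scalar_value` and `count_rejected_units` are illustrative and do not exist in lib_ccxr.

```rust
/// Returns true when a UCS-2 code unit is also a valid Unicode scalar value,
/// i.e. when `char::from_u32` is able to convert it.
/// (Hypothetical helper for illustration; not part of lib_ccxr.)
fn is_scalar_value(unit: u16) -> bool {
    char::from_u32(u32::from(unit)).is_some()
}

/// Counts the code units in the whole u16 range that `char::from_u32` rejects.
/// Only the surrogate block 0xD800..=0xDFFF is rejected.
fn count_rejected_units() -> usize {
    (0..=u16::MAX).filter(|&u| !is_scalar_value(u)).count()
}
```

Every u16 outside the surrogate block converts successfully, so the 2048 surrogate code units are the only inputs that could reach the `.expect()` panic.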

Fix

Replaced both .expect("Invalid u32 character") calls with .unwrap_or(UNAVAILABLE_CHAR.into()), consistent with how
ucs2_to_char() already handles this in the same file (line 1027):

Before:
cc_to_lowercase(char::from_u32(c as u32).expect("Invalid u32 character")) as u16
cc_to_uppercase(char::from_u32(c as u32).expect("Invalid u32 character")) as u16

After:
cc_to_lowercase(char::from_u32(c as u32).unwrap_or(UNAVAILABLE_CHAR.into())) as u16
cc_to_uppercase(char::from_u32(c as u32).unwrap_or(UNAVAILABLE_CHAR.into())) as u16

UNAVAILABLE_CHAR is b'?', which is already the established fallback for unrepresentable code points throughout this file.
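
For readers without the repository at hand, the fallback pattern can be sketched in isolation. This is an approximation under stated assumptions: `UNAVAILABLE_CHAR` is redefined locally with the value given above, `cc_to_lowercase_sketch` is a hypothetical ASCII-only stand-in for the real `cc_to_lowercase` (whose body is not shown in this PR), and `lowercase_units` operates on a plain slice rather than on `Ucs2String`.

```rust
/// Local copy of the fallback byte; this PR states that it is b'?'.
const UNAVAILABLE_CHAR: u8 = b'?';

/// Hypothetical stand-in for cc_to_lowercase (ASCII-only, for illustration).
fn cc_to_lowercase_sketch(c: char) -> char {
    c.to_ascii_lowercase()
}

/// Lowercases UCS-2 code units. Units that are not Unicode scalar values
/// (surrogates) are replaced with '?' instead of causing a panic.
fn lowercase_units(units: &[u16]) -> Vec<u16> {
    units
        .iter()
        .map(|&c| {
            // `u8` implements `Into<char>`, so the fallback is inferred as a char.
            let ch = char::from_u32(c as u32).unwrap_or(UNAVAILABLE_CHAR.into());
            cc_to_lowercase_sketch(ch) as u16
        })
        .collect()
}
```

In this sketch, `lowercase_units(&[0x0041, 0xD800])` returns `[0x0061, 0x003F]`: the letter is lowercased and the lone surrogate becomes '?'. The substitution is lossy, so a surrogate pair passing through this path comes out as two '?' units.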

Testing

  • cargo build clean, zero new warnings
  • cargo clippy clean
  • cargo test encoding passes
  • Verified zero remaining .expect() calls on char::from_u32 in the codebase

Fixes #2232

Ucs2String::to_lowercase() and to_uppercase() called
char::from_u32(c as u32).expect("Invalid u32 character") for every code
unit. UCS-2 surrogate values (0xD800–0xDFFF) are valid u16 but are not
valid Unicode scalar values — char::from_u32() returns None for them and
.expect() panics unconditionally.

Any real-world CEA-708 broadcast stream carrying surrogate pairs triggered
this crash during case conversion with no recovery path.

Replace .expect("Invalid u32 character") with .unwrap_or(UNAVAILABLE_CHAR.into())
at both call sites, substituting '?' for unrepresentable code points —
consistent with how ucs2_to_char() already handles this in the same file.

Fixes CCExtractor#2232
@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit d56a6be...:
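| Report Name | Tests Passed |
| --- | --- |
| Broken | 9/13 |
| CEA-708 | 1/14 |
| DVB | 3/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 20/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 72/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 28/34 |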

Your PR breaks these cases:

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit d56a6be...:
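| Report Name | Tests Passed |
| --- | --- |
| Broken | 9/13 |
| CEA-708 | 1/14 |
| DVB | 4/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 22/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 81/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 31/34 |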

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
  • ccextractor --autoprogram --out=srt --latin1 b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --out=spupng c83f765c66..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

Development

Successfully merging this pull request may close these issues.

[BUG] Rust panic in Ucs2String case conversion on UCS-2 surrogate code points (CEA-708 input)