[FIX] Replace .expect() with .unwrap_or() in Ucs2String case conversion (fixes panic on CEA-708 surrogate input) #2239
Open
NexionisJake wants to merge 1 commit into CCExtractor:master from
Ucs2String::to_lowercase() and to_uppercase() called
char::from_u32(c as u32).expect("Invalid u32 character") for every code
unit. UCS-2 surrogate values (0xD800–0xDFFF) are valid u16 values but not
valid Unicode scalar values, so char::from_u32() returns None for them and
.expect() panics on that None.
A CEA-708 stream carrying surrogate code units therefore crashed during
case conversion with no recovery path.
Replace .expect("Invalid u32 character") with .unwrap_or(UNAVAILABLE_CHAR.into())
at both call sites, substituting '?' for unrepresentable code units,
consistent with how ucs2_to_char() already handles this in the same file.
Fixes CCExtractor#2232
Collaborator
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit d56a6be...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
Collaborator
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit d56a6be...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
In raising this pull request, I confirm the following :
Reason for this PR:
Sanity check:
changelog.
Repro instructions:
Process any CEA-708 stream whose subtitle text contains UCS-2 surrogate code units (0xD800–0xDFFF) with a case-conversion path enabled (e.g. --sentencecap). CCExtractor panics immediately:
thread 'main' panicked at 'Invalid u32 character', src/rust/lib_ccxr/src/util/encoding.rs:245
A minimal Rust reproducer:
use lib_ccxr::util::encoding::Ucs2String;
fn main() {
    let s = Ucs2String::from_vec(vec![0xD800]); // lone high surrogate
    let _ = s.to_lowercase(); // panics before this fix
}
Root Cause
Ucs2String::to_lowercase() and to_uppercase() in src/rust/lib_ccxr/src/util/encoding.rs called:
char::from_u32(c as u32).expect("Invalid u32 character")
UCS-2 surrogate code units (0xD800–0xDFFF) are valid u16 values but are not valid Unicode scalar values. char::from_u32()
returns None for them, and .expect() panics on that None. A CEA-708 stream carrying surrogate code units therefore
crashed CCExtractor as soon as it reached case conversion, with no recovery path.
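The behaviour of char::from_u32() described above can be checked in isolation. The following is a minimal sketch using only the standard library; the helper names code_unit_to_char and count_rejected_code_units are hypothetical and do not exist in lib_ccxr.

```rust
/// Hypothetical helper: converts one UCS-2 code unit to a `char`.
/// Returns `None` when the unit is not a Unicode scalar value.
fn code_unit_to_char(unit: u16) -> Option<char> {
    char::from_u32(unit as u32)
}

/// Hypothetical helper: counts how many of the 65,536 possible
/// 16-bit code units `char::from_u32` rejects.
fn count_rejected_code_units() -> usize {
    (0u16..=u16::MAX)
        .filter(|&unit| code_unit_to_char(unit).is_none())
        .count()
}
```

Under this sketch, the only rejected 16-bit values are the 2,048 surrogates in 0xD800–0xDFFF, so calling .expect() on the result can panic for exactly that range and no other u16 input.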
Fix
Replaced both .expect("Invalid u32 character") calls with .unwrap_or(UNAVAILABLE_CHAR.into()), consistent with how
ucs2_to_char() already handles this in the same file (line 1027):
Before:
cc_to_lowercase(char::from_u32(c as u32).expect("Invalid u32 character")) as u16
cc_to_uppercase(char::from_u32(c as u32).expect("Invalid u32 character")) as u16
After:
cc_to_lowercase(char::from_u32(c as u32).unwrap_or(UNAVAILABLE_CHAR.into())) as u16
cc_to_uppercase(char::from_u32(c as u32).unwrap_or(UNAVAILABLE_CHAR.into())) as u16
UNAVAILABLE_CHAR is b'?', which is already the established fallback for unrepresentable code points throughout this file.
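To illustrate the effect of the change, here is a self-contained sketch of the same pattern outside the crate. It assumes UNAVAILABLE_CHAR is a u8 equal to b'?' (as stated above); to_lower_stub is a hypothetical stand-in for cc_to_lowercase that only handles ASCII, and lowercase_units is an illustrative name, not a function in lib_ccxr.

```rust
/// Assumed to mirror the crate's fallback byte, which this PR states is b'?'.
const UNAVAILABLE_CHAR: u8 = b'?';

/// Hypothetical stand-in for `cc_to_lowercase`; ASCII-only for brevity.
fn to_lower_stub(c: char) -> char {
    c.to_ascii_lowercase()
}

/// Lowercases UCS-2 code units without panicking on surrogates.
/// Any unit that is not a Unicode scalar value becomes '?'.
fn lowercase_units(units: &[u16]) -> Vec<u16> {
    units
        .iter()
        .map(|&c| {
            // `u8` converts to `char` via `From`, so `.into()` yields '?'.
            let ch = char::from_u32(c as u32).unwrap_or(UNAVAILABLE_CHAR.into());
            to_lower_stub(ch) as u16
        })
        .collect()
}
```

With this sketch, the input [0x0041, 0xD800, 0x0062] maps to [0x0061, 0x003F, 0x0062]. Because the substitution happens per code unit, a full surrogate pair would come out as two '?' characters rather than one.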
Testing
Fixes #2232