String lone surrogate handling: isWellFormed / toWellFormed #29

@proggeramlug

Split off from #18. The UTF-16 .length and NFC/NFD normalization issues are resolved in v0.5.20, but lone surrogate detection still fails:

"\uD800".isWellFormed()       → true   (should be false)
"\uD800".toWellFormed()       → "\uD800"  (should be "\uFFFD")

test_gap_string_methods shows 2 remaining diff lines for these.
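
For reference, the expected semantics can be sketched over raw UTF-16 code units. This is only an illustration of the behavior the two methods should have, not Perry's runtime code; the function names `is_well_formed` and `to_well_formed` are hypothetical and the only library calls used are from Rust's standard library.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Returns true when `units` contains no unpaired (lone) surrogate.
/// `decode_utf16` yields an `Err` for every unpaired surrogate it meets.
fn is_well_formed(units: &[u16]) -> bool {
    decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

/// Returns a copy of `units` with every lone surrogate replaced by U+FFFD.
/// Valid surrogate pairs are decoded and re-encoded unchanged.
fn to_well_formed(units: &[u16]) -> Vec<u16> {
    let mut out = Vec::with_capacity(units.len());
    let mut buf = [0u16; 2];
    for r in decode_utf16(units.iter().copied()) {
        let c = r.unwrap_or(REPLACEMENT_CHARACTER);
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
    out
}
```

With these definitions, `is_well_formed(&[0xD800])` is false and `to_well_formed(&[0xD800])` is `[0xFFFD]`, matching the "should be" column above. The catch is that the code units have to reach the check intact, which is what currently does not happen.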

Root cause

Perry stores strings as Rust &str, which must be valid UTF-8, and UTF-8 has no encoding for lone surrogates. When the parser encounters "\uD800" in source, it either rejects it or replaces it with U+FFFD. Lone surrogates therefore never survive into the runtime, and the isWellFormed runtime check at crates/perry-runtime/src/string.rs:898 walks UTF-16 looking for invalid sequences that can never be present.
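
The two outcomes (reject or replace) mirror what Rust's standard library offers for a lone surrogate. The snippet below is a standalone illustration using only std calls; the wrapper names `strict`, `lossy`, and `surrogate_as_char` are made up for this example and are not taken from Perry's parser.

```rust
/// Strict conversion: fails if `units` contains a lone surrogate.
fn strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy conversion: every lone surrogate becomes U+FFFD.
fn lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// A surrogate code point is not a valid `char`, so this returns `None`
/// for anything in U+D800..=U+DFFF.
fn surrogate_as_char(unit: u16) -> Option<char> {
    char::from_u32(u32::from(unit))
}
```

For the input `[0xD800]`, `strict` returns `None`, `lossy` returns `"\u{FFFD}"`, and `surrogate_as_char(0xD800)` is `None`. In either case the original code unit is gone before any runtime check could see it.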

Fix requires WTF-8 support

To handle this properly, three things need to change:

  1. Compiler: HIR string literal lowering needs to emit CESU-8/WTF-8 byte sequences for \uXXXX escapes in the surrogate range (U+D800..U+DFFF), instead of letting Rust normalize them away.

  2. Storage: The StringHeader data section needs to tolerate bytes that are not valid UTF-8 (WTF-8-encoded surrogates). string_as_str would need to switch from str::from_utf8_unchecked to a WTF-8 wrapper, or operations would need to work directly on bytes.

  3. APIs: All string ops that currently use &str would need to either operate on raw bytes or detect the WTF-8 marker and behave correctly.
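
As a rough sketch of what the first two points imply at the byte level, the code below encodes a code point with the generalized UTF-8 scheme that WTF-8 uses, so a lone surrogate such as U+D800 becomes the three bytes ED A0 80. The names `encode_wtf8` and `contains_lone_surrogate` are hypothetical, nothing here comes from Perry's codebase, and the scan assumes its input is already well-formed WTF-8 (surrogate pairs stored as a single four-byte sequence).

```rust
/// Appends the generalized-UTF-8 encoding of `cp` to `out`.
/// Unlike `char::encode_utf8`, this accepts surrogates (U+D800..=U+DFFF).
/// Assumes `cp <= 0x10FFFF`.
fn encode_wtf8(cp: u32, out: &mut Vec<u8>) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// In well-formed WTF-8, a lead byte 0xED followed by 0xA0..=0xBF can only
/// be an encoded surrogate, so a two-byte window scan is enough to find one.
fn contains_lone_surrogate(bytes: &[u8]) -> bool {
    bytes
        .windows(2)
        .any(|w| w[0] == 0xED && (0xA0..=0xBF).contains(&w[1]))
}
```

The resulting bytes are rejected by `std::str::from_utf8`, which is why the storage layer and every `&str`-based operation would have to change. An isWellFormed check on such storage could then reduce to `!contains_lone_surrogate(bytes)`.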

Why this is low priority

This is genuinely niche: isWellFormed/toWellFormed are designed for sanitizing untrusted JSON/IPC input that might contain malformed UTF-16. Very few real applications use these methods, and the fix is a meaningful architectural change for a narrow correctness gap.

Labels: parity (Node.js compatibility / parity gaps)
