Split off from #18 — the UTF-16 .length and NFC/NFD normalization issues are resolved in v0.5.20, but lone surrogate detection still fails:
"\uD800".isWellFormed() → true (should be false)
"\uD800".toWellFormed() → "\uD800" (should be "\uFFFD")
test_gap_string_methods shows 2 remaining diff lines for these.
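For reference, the expected semantics can be sketched over raw UTF-16 code units. This is only an illustration of what the two methods should return, not Perry's runtime code; the function names `is_well_formed_utf16` and `to_well_formed_utf16` are hypothetical.

```rust
/// Returns true when `units` contains no unpaired surrogate code unit.
/// Hypothetical helper, not part of Perry.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    // decode_utf16 yields Err for every unpaired surrogate it encounters.
    char::decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

/// Replaces each unpaired surrogate with U+FFFD and re-encodes as UTF-16.
/// Hypothetical helper, not part of Perry.
fn to_well_formed_utf16(units: &[u16]) -> Vec<u16> {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .flat_map(|c| {
            // A char needs at most two UTF-16 code units.
            let mut buf = [0u16; 2];
            c.encode_utf16(&mut buf).to_vec()
        })
        .collect()
}
```

With these definitions, the single unit `0xD800` is reported as not well-formed and is rewritten to `0xFFFD`, while a valid pair such as `0xD83D 0xDE00` passes through unchanged.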
Root cause
Perry uses Rust's &str (valid UTF-8) for string storage. Rust's str and char types cannot represent lone surrogates, because U+D800..U+DFFF are not Unicode scalar values. When the parser encounters "\uD800" in source, it either rejects it or replaces it with U+FFFD. So lone surrogates never survive into the runtime, and the isWellFormed runtime check at crates/perry-runtime/src/string.rs:898 walks UTF-16 looking for invalid sequences that can't exist.
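The standard-library behaviour behind this can be checked in isolation. The sketch below uses only `std` APIs. The wrapper function names are hypothetical, and it does not show where in Perry's parser the rejection or replacement actually happens.

```rust
/// Shows how Rust's standard library treats the single UTF-16 unit 0xD800.
/// Returns (as_char, strict_decode_ok, lossy_decode). Hypothetical helper.
fn lone_surrogate_in_rust() -> (Option<char>, bool, String) {
    let unit: u16 = 0xD800;
    // A surrogate is not a Unicode scalar value, so no char exists for it.
    let as_char = char::from_u32(unit as u32);
    // Strict decoding rejects the unpaired surrogate.
    let strict_ok = String::from_utf16(&[unit]).is_ok();
    // Lossy decoding silently substitutes U+FFFD.
    let lossy = String::from_utf16_lossy(&[unit]);
    (as_char, strict_ok, lossy)
}

/// The three-byte surrogate encoding ED A0 80 is rejected as UTF-8,
/// so it cannot be stored in a &str through safe APIs.
fn surrogate_bytes_are_valid_utf8() -> bool {
    std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_ok()
}
```

Either outcome (rejection or substitution) loses the original code unit before the runtime ever sees it.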
Fix requires WTF-8 support
To handle this properly, three things need to change:
- Compiler: HIR string literal lowering needs to emit WTF-8 byte sequences for \uXXXX escapes in the surrogate range (U+D800..U+DFFF), for example ED A0 80 for U+D800, instead of letting Rust normalize them away.
- Storage: StringHeader data section needs to tolerate bytes that are not valid UTF-8 (WTF-8-encoded surrogates). Building a &str over such bytes with str::from_utf8_unchecked is undefined behavior, so string_as_str would need to switch to a WTF-8 wrapper, or operations would need to work directly on bytes.
- APIs: All string ops that currently use &str would need to either operate on raw bytes or detect WTF-8 surrogate sequences (lead byte 0xED followed by a byte in 0xA0..0xBF) and behave correctly.
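A minimal sketch of what the byte-level approach could look like, assuming strings are stored as WTF-8 in a plain byte buffer. All function names here are hypothetical and none of this is Perry's code. In WTF-8 a surrogate is encoded as 0xED followed by a byte in 0xA0..=0xBF and one more continuation byte, and 0xED can only ever appear as a lead byte, so a two-byte window scan is enough to detect one.

```rust
/// Appends one code point (surrogates allowed) as generalized UTF-8.
fn push_wtf8(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Converts UTF-16 units to WTF-8: valid pairs become one 4-byte sequence,
/// unpaired surrogates are kept as 3-byte sequences instead of being replaced.
fn wtf8_from_utf16(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    for r in char::decode_utf16(units.iter().copied()) {
        match r {
            Ok(c) => push_wtf8(&mut out, c as u32),
            Err(e) => push_wtf8(&mut out, e.unpaired_surrogate() as u32),
        }
    }
    out
}

/// isWellFormed over WTF-8 bytes: false if any surrogate sequence is present.
fn is_well_formed_wtf8(bytes: &[u8]) -> bool {
    !bytes
        .windows(2)
        .any(|w| w[0] == 0xED && (0xA0..=0xBF).contains(&w[1]))
}

/// toWellFormed over WTF-8 bytes: each surrogate sequence becomes U+FFFD.
fn to_well_formed_wtf8(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let is_surrogate = bytes[i] == 0xED
            && i + 2 < bytes.len()
            && (0xA0..=0xBF).contains(&bytes[i + 1]);
        if is_surrogate {
            out.extend_from_slice(&[0xEF, 0xBF, 0xBD]); // U+FFFD in UTF-8
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    out
}
```

Because U+FFFD and a surrogate both occupy three bytes, the replacement preserves the buffer length. The sketch does not cover concatenation, where a trailing lead surrogate followed by a leading trail surrogate would need to be merged into a single 4-byte sequence to keep the buffer valid WTF-8.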
Why this is low priority
This is genuinely niche — isWellFormed/toWellFormed are designed for sanitizing untrusted JSON/IPC input that might contain malformed UTF-16. Very few real applications use these methods. The fix is a meaningful architectural change for a narrow correctness gap.