String lone surrogate handling: isWellFormed / toWellFormed #29

Split off from #18 — the UTF-16 `.length` and NFC/NFD normalization issues are resolved in v0.5.20, but lone surrogate detection still fails:

```
"\uD800".isWellFormed()       → true   (should be false)
"\uD800".toWellFormed()       → "\uD800"  (should be "\uFFFD")
```

`test_gap_string_methods` shows 2 remaining diff lines for these.
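For reference, the expected semantics can be sketched over raw UTF-16 code units. This is only an illustration of what the two methods should return, not Perry's runtime code; the function names `is_well_formed_utf16` and `to_well_formed_utf16` are hypothetical.

```rust
/// Returns true when `units` contains no lone (unpaired) surrogate.
/// Hypothetical helper, not part of perry-runtime.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    // `char::decode_utf16` yields `Err` for every unpaired surrogate code unit.
    char::decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

/// Returns a copy of `units` with every lone surrogate replaced by U+FFFD.
/// Hypothetical helper, not part of perry-runtime.
fn to_well_formed_utf16(units: &[u16]) -> Vec<u16> {
    let mut out = Vec::with_capacity(units.len());
    let mut buf = [0u16; 2];
    for r in char::decode_utf16(units.iter().copied()) {
        // Valid scalar values pass through; unpaired surrogates become U+FFFD.
        let c = r.unwrap_or(char::REPLACEMENT_CHARACTER);
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
    out
}
```

With these definitions, the input `[0xD800]` is reported as not well formed and is rewritten to `[0xFFFD]`, while a proper pair such as `[0xD83D, 0xDE00]` passes through unchanged.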

## Root cause

Perry uses Rust's `&str` (always valid UTF-8) for string storage. UTF-8 has no encoding for lone surrogates, so neither `str` nor `char` can represent one: when the parser encounters `"\uD800"` in source, it either rejects it or replaces it with U+FFFD. Lone surrogates therefore never survive into the runtime, and the `isWellFormed` runtime check at `crates/perry-runtime/src/string.rs:898` walks UTF-16 looking for invalid sequences that can't exist.
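The limitation is visible with the standard library alone. The sketch below uses only `std` calls; the wrapper function `lone_surrogate_in_rust` is a hypothetical name introduced for illustration and does not exist in Perry.

```rust
/// Pushes the lone surrogate 0xD800 through Rust's standard string types
/// and reports what each conversion does with it.
/// Hypothetical demo function, not part of perry-runtime.
fn lone_surrogate_in_rust() -> (Option<char>, bool, String, bool) {
    let units = [0xD800u16];
    (
        // `char` only holds Unicode scalar values, so this is `None`.
        char::from_u32(0xD800),
        // Strict UTF-16 -> String conversion fails on an unpaired surrogate.
        String::from_utf16(&units).is_err(),
        // Lossy conversion silently substitutes U+FFFD.
        String::from_utf16_lossy(&units),
        // The three-byte surrogate encoding is rejected as invalid UTF-8.
        std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_err(),
    )
}
```

Every safe path into `String` or `&str` either fails or substitutes U+FFFD, which is why the surrogate is already gone before any runtime check sees the string.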

## Fix requires WTF-8 support

To handle this properly, three things need to change:

1. **Compiler**: HIR string literal lowering needs to emit the three-byte WTF-8 sequence (the same bytes CESU-8 would use for a single surrogate) for `\uXXXX` escapes in the surrogate range (U+D800..U+DFFF), instead of letting Rust normalize them away.

2. **Storage**: `StringHeader` data section needs to tolerate invalid UTF-8 (the WTF-8 encoding). `string_as_str` would need to switch from `str::from_utf8_unchecked` to a WTF-8 wrapper, or operations would need to work directly on bytes.

3. **APIs**: All string ops that currently use `&str` would need to either operate on raw bytes or detect the WTF-8 marker and behave correctly.
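As a rough sketch of what the byte-level route could look like, the code below encodes a surrogate as WTF-8 and implements both checks directly on `&[u8]`. All three function names are hypothetical, the input is assumed to be valid WTF-8, and nothing here reflects the actual `StringHeader` layout. In WTF-8 a high surrogate followed by a low surrogate must be stored as one four-byte sequence, so any three-byte surrogate sequence found in the data is a lone surrogate.

```rust
/// Encodes one code unit from the surrogate range as the three-byte
/// sequence WTF-8 uses for a lone surrogate. Hypothetical helper.
fn encode_surrogate(unit: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&unit), "not a surrogate");
    [
        0xE0 | (unit >> 12) as u8,         // always 0xED for surrogates
        0x80 | ((unit >> 6) & 0x3F) as u8, // 0xA0..=0xBF for surrogates
        0x80 | (unit & 0x3F) as u8,
    ]
}

/// True when the WTF-8 bytes contain no lone surrogate.
/// 0xED is only ever a lead byte, and a following byte >= 0xA0 means the
/// decoded code point lies in U+D800..=U+DFFF.
fn is_well_formed(wtf8: &[u8]) -> bool {
    !wtf8.windows(2).any(|w| w[0] == 0xED && w[1] >= 0xA0)
}

/// Replaces each lone surrogate with U+FFFD (EF BF BD in UTF-8).
/// Assumes valid WTF-8 input; the lossy conversion at the end is only a
/// safety net for malformed input.
fn to_well_formed(wtf8: &[u8]) -> String {
    let mut out = Vec::with_capacity(wtf8.len());
    let mut i = 0;
    while i < wtf8.len() {
        if wtf8[i] == 0xED && i + 2 < wtf8.len() && wtf8[i + 1] >= 0xA0 {
            out.extend_from_slice("\u{FFFD}".as_bytes());
            i += 3; // skip the whole three-byte surrogate sequence
        } else {
            out.push(wtf8[i]);
            i += 1;
        }
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

Because U+FFFD also takes three bytes in UTF-8, the replacement keeps the byte length unchanged, and both functions run in a single pass without decoding to UTF-16 first.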

## Why this is low priority

This is genuinely niche — `isWellFormed`/`toWellFormed` are designed for sanitizing untrusted JSON/IPC input that might contain malformed UTF-16. Very few real applications use these methods. The fix is a meaningful architectural change for a narrow correctness gap.
