Code Review
This pull request addresses a chat formatting issue involving double-encoded UTF-8 strings by detecting and decoding them. However, the current `_looksDoubleEncoded` detection logic is overly broad, producing false positives for valid strings with accented characters. These false positives force the `decodeString` getter to rely on try-catch for normal operation. The suggested modification provides a more precise detection, improving correctness and performance and reducing reliance on try-catch for non-exceptional cases.
```dart
bool _looksDoubleEncoded() {
  // Common UTF-8 leading byte patterns when misinterpreted as Latin-1:
  // - Ã (0xC3) followed by another character = 2-byte UTF-8 sequence
  // - â (0xE2) often starts 3-byte sequences (em-dash, curly quotes, etc.)
  // These patterns are very unlikely in correctly-encoded text
  for (int i = 0; i < length; i++) {
    final code = codeUnitAt(i);
    // Check for Latin-1 supplement range that looks like UTF-8 leading bytes
    if (code >= 0xC0 && code <= 0xF4) {
      // This could be a UTF-8 leading byte stored as Latin-1
      return true;
    }
  }
  return false;
}
```
The current implementation of `_looksDoubleEncoded` is too broad. The condition `code >= 0xC0 && code <= 0xF4` will incorrectly return true for valid, single-encoded strings that contain common non-ASCII characters like 'é', 'à', or 'ü'. This forces the `decodeString` getter to rely on a `try-catch` block for normal program flow with valid inputs, which is inefficient.
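The false positive is easy to reproduce. The sketch below uses Python as a stand-in for the Dart check (the `looks_double_encoded_broad` name and the test string are illustrative, not from the patch); the behavior is the same because both operate on code-unit values:

```python
# Hypothetical Python mirror of the overly broad Dart check.
def looks_double_encoded_broad(s: str) -> bool:
    # Flags any code unit in 0xC0-0xF4 as a suspected UTF-8 leading byte.
    return any(0xC0 <= ord(ch) <= 0xF4 for ch in s)

# 'é' is U+00E9, which falls inside 0xC0-0xF4, so this perfectly
# valid single-encoded string is flagged as double-encoded.
print(looks_double_encoded_broad("café"))  # True (false positive)
print(looks_double_encoded_broad("cafe"))  # False
```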
A more robust approach is to check for the specific pattern of a double-encoded character: a character that looks like a UTF-8 leading byte followed by one that looks like a UTF-8 continuation byte. This pattern is extremely unlikely to occur in correctly encoded text.
I suggest replacing this method with a more precise check to avoid these false positives.
```dart
bool _looksDoubleEncoded() {
  // A more robust check for double-encoding is to look for a potential UTF-8
  // leading byte (C2-F4) followed by a continuation byte (80-BF).
  // This pattern is very unlikely in correctly-encoded text.
  for (int i = 0; i < length - 1; i++) {
    final c1 = codeUnitAt(i);
    // Check for a potential multi-byte UTF-8 start character (excluding overlong C0/C1).
    if (c1 >= 0xC2 && c1 <= 0xF4) {
      final c2 = codeUnitAt(i + 1);
      // Check if it's followed by a continuation character.
      if (c2 >= 0x80 && c2 <= 0xBF) {
        return true;
      }
    }
  }
  return false;
}
```

References
- The current implementation forces a `try-catch` block for normal program flow, which is inefficient. Improving the detection logic will reduce the need for `try-catch` in non-exceptional cases, aligning with the principle of avoiding unnecessary code complexity for operations with a negligible chance of failure.
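To illustrate the suggested logic end to end, here is a Python sketch (a stand-in for the Dart code; the `looks_double_encoded` helper mirrors the suggested leading-byte-plus-continuation-byte check, and the mojibake string is constructed for the example) showing detection followed by the repair round-trip:

```python
# Python mirror of the suggested detection, plus the repair step.
# Scenario matches the review: UTF-8 bytes misread as Latin-1 and
# stored as individual code units in the string.

def looks_double_encoded(s: str) -> bool:
    """Flag a UTF-8 leading byte (0xC2-0xF4) followed by a continuation
    byte (0x80-0xBF) -- a pattern near-impossible in clean text."""
    for c1, c2 in zip(map(ord, s), map(ord, s[1:])):
        if 0xC2 <= c1 <= 0xF4 and 0x80 <= c2 <= 0xBF:
            return True
    return False

mojibake = "café".encode("utf-8").decode("latin-1")  # 'cafÃ©'
print(looks_double_encoded(mojibake))  # True: flagged for repair
print(looks_double_encoded("café"))    # False: no false positive now
# The repair: undo the Latin-1 misread, then decode as UTF-8.
print(mojibake.encode("latin-1").decode("utf-8"))  # 'café'
```

Note that the same check on a valid accented string returns false because a lone code point like U+00E9 is never followed by a continuation-range code unit in normal text.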