Merged
31 changes: 30 additions & 1 deletion app/lib/widgets/extensions/string.dart
@@ -1,14 +1,43 @@
 import 'dart:convert';

 extension StringExtensions on String {
+  /// Attempts to fix double-encoded UTF-8 strings.
+  /// Only applies decoding if the string appears to be double-encoded
+  /// (UTF-8 bytes incorrectly stored as Latin-1 characters).
   String get decodeString {
+    // Quick check: if no high-byte characters that look like UTF-8 leading bytes,
+    // the string is probably already correctly encoded
+    if (!_looksDoubleEncoded()) {
+      return this;
+    }
     try {
-      return utf8.decode(codeUnits);
+      // Use latin1.encode to get byte values (treats each char as a byte),
+      // then decode those bytes as UTF-8
+      return utf8.decode(latin1.encode(this));
     } on Exception catch (_) {
       return this;
     }
   }
+
+  /// Checks if the string appears to be double-encoded UTF-8.
+  /// Double-encoding happens when UTF-8 bytes are incorrectly interpreted as Latin-1,
+  /// resulting in patterns like "Ã©" instead of "é", or "â€"" instead of "—".
+  bool _looksDoubleEncoded() {
+    // Common UTF-8 leading byte patterns when misinterpreted as Latin-1:
+    // - Ã (0xC3) followed by another character = 2-byte UTF-8 sequence
+    // - â (0xE2) often starts 3-byte sequences (em-dash, curly quotes, etc.)
+    // These patterns are very unlikely in correctly-encoded text
+    for (int i = 0; i < length; i++) {
+      final code = codeUnitAt(i);
+      // Check for Latin-1 supplement range that looks like UTF-8 leading bytes
+      if (code >= 0xC0 && code <= 0xF4) {
+        // This could be a UTF-8 leading byte stored as Latin-1
+        return true;
+      }
+    }
+    return false;
+  }
Comment on lines +25 to +39
Contributor

high

The current implementation of _looksDoubleEncoded is too broad. The condition code >= 0xC0 && code <= 0xF4 returns true for valid, single-encoded strings that contain common non-ASCII characters such as 'é' (U+00E9), 'à' (U+00E0), or 'ñ' (U+00F1), because their code points fall inside that range. For such inputs the decodeString getter has to rely on its try-catch block as part of normal program flow, which is inefficient.
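To make the false positive concrete, here is a minimal sketch. The free function looksDoubleEncodedBroad is a hypothetical stand-in that copies the range test from the diff; it is not part of the PR.

```dart
import 'dart:convert';

// Hypothetical copy of the range test from the diff, as a free function.
bool looksDoubleEncodedBroad(String s) =>
    s.codeUnits.any((c) => c >= 0xC0 && c <= 0xF4);

void main() {
  const valid = 'café'; // 'é' is U+00E9, which lies inside 0xC0..0xF4.

  // The correctly encoded string is still flagged as suspicious.
  print(looksDoubleEncodedBroad(valid));

  try {
    // 0xE9 looks like the lead byte of a 3-byte sequence, but no
    // continuation bytes follow, so the decoder rejects the input.
    utf8.decode(latin1.encode(valid));
  } on FormatException {
    print('decode failed, the original string would be returned');
  }
}
```

The result is still correct, but only because the exception path catches the failed decode.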

A more robust approach is to check for the specific pattern of a double-encoded character: a character that looks like a UTF-8 leading byte followed by one that looks like a UTF-8 continuation byte. This pattern is extremely unlikely to occur in correctly encoded text.

I suggest replacing this method with a more precise check to avoid these false positives.

  bool _looksDoubleEncoded() {
    // A more robust check for double-encoding is to look for a potential UTF-8
    // leading byte (C2-F4) followed by a continuation byte (80-BF).
    // This pattern is very unlikely in correctly-encoded text.
    for (int i = 0; i < length - 1; i++) {
      final c1 = codeUnitAt(i);
      // Check for a potential multi-byte UTF-8 start character (excluding overlong C0/C1).
      if (c1 >= 0xC2 && c1 <= 0xF4) {
        final c2 = codeUnitAt(i + 1);
        // Check if it's followed by a continuation character.
        if (c2 >= 0x80 && c2 <= 0xBF) {
          return true;
        }
      }
    }
    return false;
  }
References
  1. The current implementation forces a try-catch block for normal program flow, which is inefficient. Improving the detection logic will reduce the need for try-catch in non-exceptional cases, aligning with the principle of avoiding unnecessary code complexity for operations with a negligible chance of failure.
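As a rough sketch of how the stricter check could be combined with the decode step, the following uses two hypothetical free functions, looksDoubleEncodedStrict and repairDoubleEncoded, in place of the extension members. It also adds one assumption that is not in the PR: to my understanding, latin1.encode throws an ArgumentError (an Error, not an Exception) for code units above 0xFF, so the sketch returns early in that case instead of relying on the catch clause.

```dart
import 'dart:convert';

/// Hypothetical free-function variant of the suggested pair check:
/// a UTF-8 lead byte (C2-F4) followed by a continuation byte (80-BF).
bool looksDoubleEncodedStrict(String s) {
  for (var i = 0; i < s.length - 1; i++) {
    final c1 = s.codeUnitAt(i);
    final c2 = s.codeUnitAt(i + 1);
    if (c1 >= 0xC2 && c1 <= 0xF4 && c2 >= 0x80 && c2 <= 0xBF) {
      return true;
    }
  }
  return false;
}

/// Hypothetical helper: repairs only strings whose code units all fit
/// in a single byte, and falls back to the input otherwise.
String repairDoubleEncoded(String s) {
  if (!looksDoubleEncodedStrict(s)) return s;
  // Assumption: skip strings that Latin-1 cannot represent.
  if (s.codeUnits.any((c) => c > 0xFF)) return s;
  try {
    return utf8.decode(latin1.encode(s));
  } on FormatException {
    return s;
  }
}

void main() {
  // Simulate double encoding: UTF-8 bytes read back as Latin-1.
  final broken = latin1.decode(utf8.encode('café'));
  print(repairDoubleEncoded(broken) == 'café');
  // A correctly encoded string passes through without a decode attempt.
  print(repairDoubleEncoded('café') == 'café');
}
```

One limitation of this sketch: text that was misread as Windows-1252 rather than Latin-1 can contain characters such as '€' (U+20AC), which the early return leaves unrepaired.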


   String capitalize() {
     return isNotEmpty ? '${this[0].toUpperCase()}${substring(1)}' : '';
   }