Skip to content
This repository has been archived by the owner on Aug 4, 2024. It is now read-only.

URL: Fix URLSearchParams decoding to not throw URIError #1173

Merged
merged 1 commit into from
Mar 12, 2022

Commits on Mar 12, 2022

  1. URL: Fix URLSearchParams decoding to not throw URIError

    == Background ==
    
    The URL spec states that `%` and any characters following it must be
    considered as simple literals unless they are a pair of two hexidecimals
    (0-9, A-F, a-f). [1] This means that `?q=%` or `?x=%zz` are valid and
    simply represent their literal self, they don't decode to something
    else just like `?x=foo` doesn't either.
    
    The URL spec also states that percent decoding happens at the byte
    level after a string has been UTF decoded, and should not be turned
    back into a UTF string until after the percent decoding is done.
    This is important because when going through percent decoding you
    can encounter a multi-byte character of which one byte isn't a valid
    UTF-8 string (or at least not the same string), and simply
    concatenating them does not produce the same result.
    
    This is very different from the decodeURIComponent [2] function,
    which basically assumes its input is maximally encoded by its
    encodeURIComponent counter-part and throws a URIError for anything
    that looks like percent encoding but isn't. This means decoding
    `%` or `%zz` results in a URIError. And, decoding a valid chunk
    that is part of a multi-byte codepoint such as `%7F` from `%7F%C3%BF`,
    `%f0` from `%9f%92%a9` also produces a URIError.
    
    [1] https://url.spec.whatwg.org/#percent-decode
    [2] https://es5.github.io/#x15.1.3.2
    
    == Problem ==
    
    For example, the following query string uses legacy encoding, which
    the backend software supports and corrects automatically. With the
    native URL API you can `new URL(location)` and e.g. read out the
    "foo" parameter. The polyfill can't get pass the constructor due
    to decodeURIComponent throwing URIError.
    
    > https://en.wikipedia.org/?title=Apollo%96Soyuz_Test_Project&foo=bar
    
    This affects any browser where the polyfill is active (including
    latest Firefox or Chromium of forcing the polyfill by e.g. setting
    window.URI to null first).
    
    == Solutions ==
    
    The naive and first solution I tried was to try to follow the spec
    with minimal JS code, e.g. write the `percent_decode` function as a
    for-loop over the `bytes` string, and whenever you encounter a percent
    sign with two hex chars after it, call decodeURIComponent on just that
    chunk. This fixes the problem with exceptions from chunks that
    decodeURIComponent considers incomplete or out of range, such as `%`
    or `%ZZ` , as we'd skip over those.
    
    However, this then started failing tests for multi-byte codepoints
    such as emojis (e.g. 💩, `%f0%9f%92%a9`, U+1F4A9) where
    decodeURIComponent was seeing only one chunk which is valid percent
    encoding, but not a valid UTF string by itself. I tried to bruteforce
    this by using `String.fromCharCode(parseInt(chunk.slice(1), 16));`
    instead of decodeURIComponent, thus forcefully creating the byte
    in question and then concatenating this. This avoids the error but
    creates corrupted output. To do this correctly one would need
    additional mapping and logic to compute the right codepoint and
    create that directly somehow. I couldn't find an easy way to do this
    that didn't require additional and large polyfills (e.g. TextEncoder).
    
    ```js
    function percent_decode(bytes) {
      var output = '';
      for (var i = 0; i < bytes.length; i++) {
       var byte = bytes[i];
       if (byte === '%') {
         var chunk = bytes.substring(i, i + 3);
         if (chunk.length === 3) {
           var code1 = chunk.charCodeAt(1);
           var code2 = chunk.charCodeAt(2);
           // Only decode bytes in range 0-9, A-F, a-f.
           if (
             (
               (code1 >= 0x30 && code1 <= 0x39) ||
               (code1 >= 0x41 && code1 <= 0x46) ||
               (code1 >= 0x61 && code1 <= 0x66)
             ) && (
               (code2 >= 0x30 && code2 <= 0x39) ||
               (code2 >= 0x41 && code2 <= 0x46) ||
               (code2 >= 0x61 && code2 <= 0x66)
             )
           ) {
             // Even for a single valid chunk we can't use decodeURIComponent()
             // as that throws on "%7F" decoded separately from "%7F%C3%BF".
             // FIXME: This avoids throwing but produces incorrect output
             output += String.fromCharCode(parseInt(chunk.slice(1), 16));
             i = i + 2;
             continue;
           }
         }
       }
       output += byte;
      }
      return output;
    }
    ```
    
    A simpler way, since decodeURIComponent obviously has the logic
    we need for multi-byte characters, is to always call it with as many
    sequentual chunks as possible, whilst still skipping the things it
    shouldn't handle.
    
    This would get messy and needlessly slow and verbose to do with a
    for-loop and a buffer, so... I resorted to a regex.
    
    It simply looks for repetitions of "% + Hex pair" and decodes those.
    This repairs the pre-existing test cases I had broken and still fixes
    the new test cases I added. It also brings the code back to how it
    was before basically, and just uses the regex to skip over ranges that
    we know decodeURIComponent would otherwise throw on.
    
    Fixes JakeChampion#4.
    Krinkle authored and JakeChampion committed Mar 12, 2022
    Configuration menu
    Copy the full SHA
    2c550ff View commit details
    Browse the repository at this point in the history