URL: Fix URLSearchParams decoding to not throw URIError #1173
== Background ==

The URL spec states that `%` and any characters following it must be treated as simple literals unless they form a pair of hexadecimal digits (0-9, A-F, a-f). [1] This means that `?q=%` or `?x=%zz` are valid and simply represent their literal selves; they don't decode to something else, just like `?x=foo` doesn't either.

The URL spec also states that percent decoding happens at the byte level, after the string has been UTF-8 encoded to bytes, and that the result should not be turned back into a UTF-8 string until the percent decoding is done. This is important because during percent decoding you can encounter a multi-byte character of which a single byte isn't a valid UTF-8 string on its own (or at least not the same string), and simply concatenating the pieces does not produce the same result.

This is very different from the decodeURIComponent [2] function, which basically assumes its input is maximally encoded by its encodeURIComponent counterpart and throws a URIError for anything that looks like percent encoding but isn't. This means decoding `%` or `%zz` results in a URIError. And decoding a valid chunk that is part of a multi-byte codepoint, such as `%C3` from `%7F%C3%BF` or `%f0` from `%f0%9f%92%a9`, also produces a URIError.

[1] https://url.spec.whatwg.org/#percent-decode
[2] https://es5.github.io/#x15.1.3.2

== Problem ==

For example, the following query string uses a legacy encoding, which the backend software supports and corrects automatically. With the native URL API you can call `new URL(location)` and e.g. read out the "foo" parameter. The polyfill can't get past the constructor because decodeURIComponent throws a URIError.

> https://en.wikipedia.org/?title=Apollo%96Soyuz_Test_Project&foo=bar

This affects any browser where the polyfill is active (including latest Firefox or Chromium when forcing the polyfill by e.g. setting `window.URL` to null first).

== Solutions ==

The naive first solution I tried was to follow the spec with minimal JS code: write the `percent_decode` function as a for-loop over the `bytes` string, and whenever a percent sign with two hex characters after it is encountered, call decodeURIComponent on just that chunk. This fixes the exceptions from chunks that decodeURIComponent considers incomplete or out of range, such as `%` or `%ZZ`, as we'd skip over those.

However, this then started failing tests for multi-byte codepoints such as emojis (e.g. 💩, `%f0%9f%92%a9`, U+1F4A9), where decodeURIComponent was seeing only one chunk, which is valid percent encoding but not a valid UTF-8 string by itself. I tried to brute-force this by using `String.fromCharCode(parseInt(chunk.slice(1), 16));` instead of decodeURIComponent, thus forcefully creating the byte in question and then concatenating it. This avoids the error but creates corrupted output. To do this correctly one would need additional mapping and logic to compute the right codepoint and create that directly somehow. I couldn't find an easy way to do this that didn't require additional and large polyfills (e.g. TextEncoder).

```js
function percent_decode(bytes) {
  var output = '';
  for (var i = 0; i < bytes.length; i++) {
    var byte = bytes[i];
    if (byte === '%') {
      var chunk = bytes.substring(i, i + 3);
      if (chunk.length === 3) {
        var code1 = chunk.charCodeAt(1);
        var code2 = chunk.charCodeAt(2);
        // Only decode bytes in range 0-9, A-F, a-f.
        if (
          (
            (code1 >= 0x30 && code1 <= 0x39) ||
            (code1 >= 0x41 && code1 <= 0x46) ||
            (code1 >= 0x61 && code1 <= 0x66)
          ) &&
          (
            (code2 >= 0x30 && code2 <= 0x39) ||
            (code2 >= 0x41 && code2 <= 0x46) ||
            (code2 >= 0x61 && code2 <= 0x66)
          )
        ) {
          // Even for a single valid chunk we can't use decodeURIComponent()
          // as that throws on "%C3" decoded separately from "%7F%C3%BF".
          // FIXME: This avoids throwing but produces incorrect output.
          output += String.fromCharCode(parseInt(chunk.slice(1), 16));
          i = i + 2;
          continue;
        }
      }
    }
    output += byte;
  }
  return output;
}
```

A simpler way, since decodeURIComponent obviously has the logic we need for multi-byte characters, is to always call it with as many sequential chunks as possible, whilst still skipping the things it shouldn't handle. This would get messy and needlessly slow and verbose to do with a for-loop and a buffer, so... I resorted to a regex. It simply looks for repetitions of "% + hex pair" and decodes those. This repairs the pre-existing test cases I had broken and still fixes the new test cases I added. It also basically brings the code back to how it was before, just using the regex to skip over ranges that we know decodeURIComponent would otherwise throw on.

Fixes JakeChampion#4.
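The regex approach can be sketched roughly as follows. This is a minimal illustration, not the exact code that landed in the patch; in particular, the try/catch fallback for hex pairs that are valid percent encoding but not valid UTF-8 (e.g. `%96`) is a simplification of mine, where a fully spec-compliant decoder would instead emit U+FFFD replacement characters.

```js
// Hand decodeURIComponent() the longest possible run of "%XX" chunks so that
// multi-byte UTF-8 sequences are decoded together, and leave anything it
// would throw a URIError on (a lone "%", or "%zz") untouched as a literal.
function percentDecode(bytes) {
  return bytes.replace(/(%[0-9A-Fa-f]{2})+/g, function (chunk) {
    try {
      return decodeURIComponent(chunk);
    } catch (e) {
      // Valid-looking percent encoding that is not valid UTF-8 (e.g. "%96").
      // Simplification: keep the literal text rather than U+FFFD replacement.
      return chunk;
    }
  });
}

percentDecode('%f0%9f%92%a9');   // whole run decoded together: "💩"
percentDecode('q=%');            // lone "%" is left as-is: "q=%"
percentDecode('Apollo%96Soyuz'); // invalid UTF-8 byte kept literal: "Apollo%96Soyuz"
```

Because `String.prototype.replace` only invokes the callback for matching runs, the surrounding literal text (including stray `%` signs and non-hex pairs) never reaches decodeURIComponent at all.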
(force-pushed from 0c19210 to 2c550ff)
This looks fantastic, thank you for working on this! The test suite for this is running at https://github.com/Financial-Times/polyfill-library/runs/5521367066?check_suite_focus=true
I found another bug I have a fix for, which basically replaces this function again. I don't mind sending it as a follow-up, but I also wouldn't mind holding this back for ~1.5 days (I intend to polish it up and submit it here on Sunday evening UTC).
@Krinkle let's land this and we can have another pull request as a follow-up 👍
Ref JakeChampion#1173. Ref JakeChampion#4. Co-authored-by: David Chan <david@troi.org>
Details in this pull request and commit, and in their source. In short: UTF-8 is well-designed, David Chan is awesome, and a regular expression will solve all our problems. * JakeChampion/polyfill-library@0ece79ce32 * JakeChampion/polyfill-library#1173 Bug: T103379 Bug: T207365 Change-Id: I7c4f9b6449a4317d68f4923fb4f198181bbfe800