IE Polyfill unable to handle invalid URLs #4

Closed
FizzBuzz791 opened this issue Dec 19, 2018 · 1 comment · Fixed by #1173
Labels: help wanted (We'd appreciate some help with this), library (Relates to an Origami library)

Comments

@FizzBuzz791

If I try to access an invalid URL in IE, such as `localhost/?query=123%%%`, the polyfill hits this line:
https://github.com/Financial-Times/polyfill-library/blob/1f5b5a1b72facaa89856de5da9afc3b85aaf7531/polyfills/URL/polyfill.js#L89

The error that comes back is "The URI to be decoded is not a valid encoding".

It seems to work fine in Firefox and Chrome; I haven't delved much further into why it works in those browsers.
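For reference, the failure can be reproduced outside the polyfill: `decodeURIComponent` throws a `URIError` on malformed percent sequences, which is presumably what the linked line runs into (a minimal sketch of the failure mode; the exact call site may differ):

```javascript
// Minimal sketch: decodeURIComponent() throws a URIError on malformed
// percent sequences such as the "123%%%" query value from the report.
function tryDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (e) {
    return '(caught) ' + e.name;
  }
}

console.log(tryDecode('123%25')); // "123%" - a valid escape decodes fine
console.log(tryDecode('123%%%')); // "(caught) URIError"
```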

@JakeChampion JakeChampion added the help wanted label Apr 3, 2019
@chee chee added this to incoming in Origami ✨ Feb 1, 2020
@Krinkle
Contributor

Krinkle commented Mar 4, 2022

I can reproduce this, and it is affecting Wikipedia as a downstream user of this polyfill (T106244).

Our previous homemade URL parser was unable to handle invalid UTF-8 in percent encoding gracefully, whereas the URL spec does. Switching to URL fixed the issue in modern browsers, where we use the native implementation, but it remains broken in browsers where we rely on the polyfill.

Standalone example: https://codepen.io/Krinkle/full/gOXqZNK.

```js
console.log(URL.toString() + '');
var x;

try {
	x = new URL('https://example.org/y');
} catch (e) {
	x = '(caught) ' + e;
	console.error(e);
	console.log(e.stack);
}
console.log('1:', x);

try {
	x = new URL('https://example.org/%96');
} catch (e) {
	x = '(caught) ' + e;
	console.error(e);
	console.log(e.stack);
}
console.log('2:', x);

try {
	x = new URL('https://en.wikipedia.org/wiki/Special:Blankpage?from=en&to=es&campaign=article-recommendation&page=Apollo%96Soyuz_Test_Project');
} catch (e) {
	x = '(caught) ' + e;
	console.error(e);
	console.log(e.stack);
}
console.log('3:', x);
```

In modern browsers:

```
function URL() { [native code] }
1: URL { protocol: "https:", host: "example.org", pathname: "/y", .. }
2: URL { protocol: "https:", host: "example.org", pathname: "/%96", .. }
3: URL { protocol: "https:", host: "en.wikipedia.org", pathname: "/wiki/Special:Blankpage", search: "?from=en&to=es&campaign=article-recommendation&page=Apollo%96Soyuz_Test_Project" }
```

In IE 11 (Windows 10, BrowserStack):

```
function URL(url,base){if(!this instanceof URL)...}
1: https://example.org/y
2: https://example.org/%96
3:
URIError: The URI to be decoded is not a valid encoding
  at Anonymous function
  at urlencoded_parse
  at URLSearchParams
  at URL
```

Krinkle added a commit to Krinkle/polyfill-library that referenced this issue Mar 11, 2022
== Background ==

The URL spec states that `%` and any characters following it must be
treated as simple literals unless they form a pair of two hexadecimal
digits (0-9, A-F, a-f). [1] This means that `?q=%` or `?x=%zz` are
valid and simply represent themselves literally; they don't decode to
anything else, just as `?x=foo` doesn't.

The URL spec also states that percent decoding happens at the byte
level, after the string has been UTF-8 encoded, and that the bytes
should not be turned back into a string until the percent decoding is
done. This is important because percent decoding can encounter a
multi-byte character of which a single byte is not valid UTF-8 on its
own (or at least not the same string), and simply concatenating the
per-chunk results does not produce the correct output.

This is very different from the decodeURIComponent [2] function,
which essentially assumes its input was maximally encoded by its
encodeURIComponent counterpart and throws a URIError for anything
that looks like percent encoding but isn't. This means decoding
`%` or `%zz` results in a URIError. And decoding a valid chunk
that is only part of a multi-byte codepoint, such as `%C3` from
`%C3%BF` or `%f0` from `%f0%9f%92%a9`, also produces a URIError.

[1] https://url.spec.whatwg.org/#percent-decode
[2] https://es5.github.io/#x15.1.3.2
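The difference is easy to observe directly (the multi-byte example is the 💩 emoji used in the tests; `throwsURIError` is an illustrative helper, not part of the patch):

```js
// decodeURIComponent() only accepts complete, well-formed UTF-8 sequences.
function throwsURIError(input) {
  try {
    decodeURIComponent(input);
    return false;
  } catch (e) {
    return e instanceof URIError;
  }
}

console.log(throwsURIError('%'));    // true - lone percent sign
console.log(throwsURIError('%zz'));  // true - "zz" is not a hex pair
console.log(throwsURIError('%f0'));  // true - lead byte of a 4-byte
                                     // sequence, invalid on its own
console.log(decodeURIComponent('%f0%9f%92%a9')); // the full sequence decodes
```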

== Problem ==

For example, the following query string uses a legacy encoding, which
the backend software supports and corrects automatically. With the
native URL API you can call `new URL(location)` and e.g. read out the
"foo" parameter. The polyfill can't get past the constructor because
decodeURIComponent throws a URIError.

> https://en.wikipedia.org/?title=Apollo%96Soyuz_Test_Project&foo=bar

This affects any browser where the polyfill is active (including
the latest Firefox or Chromium when forcing the polyfill, e.g. by
setting window.URL to null first).
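With a spec-compliant implementation the "foo" parameter is readable despite the legacy-encoded `%96` elsewhere in the query (a sketch of the expected native behaviour):

```js
// A compliant URL implementation tolerates the legacy-encoded "%96" in
// the "title" parameter and still exposes the other parameters.
var url = new URL('https://en.wikipedia.org/?title=Apollo%96Soyuz_Test_Project&foo=bar');
console.log(url.searchParams.get('foo')); // "bar"
```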

== Solutions ==

The naive, first solution I tried was to follow the spec with minimal
JS code, e.g. writing the `percent_decode` function as a for-loop over
the `bytes` string and, whenever a percent sign with two hex chars
after it is encountered, calling decodeURIComponent on just that
chunk. This fixes the exceptions from chunks that decodeURIComponent
considers incomplete or out of range, such as `%` or `%zz`, as we'd
skip over those.

However, this then started failing tests for multi-byte codepoints
such as emojis (e.g. 💩, `%f0%9f%92%a9`, U+1F4A9), where
decodeURIComponent saw only one chunk that is valid percent encoding
but not a valid UTF-8 string by itself. I tried to brute-force this by
using `String.fromCharCode(parseInt(chunk.slice(1), 16));` instead of
decodeURIComponent, thus forcefully creating the byte in question and
concatenating it. This avoids the error but creates corrupted output.
To do this correctly one would need additional mapping and logic to
compute the right codepoint and create it directly somehow. I couldn't
find an easy way to do this that didn't require additional, large
polyfills (e.g. TextEncoder).

```js
function percent_decode(bytes) {
  var output = '';
  for (var i = 0; i < bytes.length; i++) {
    var byte = bytes[i];
    if (byte === '%') {
      var chunk = bytes.substring(i, i + 3);
      if (chunk.length === 3) {
        var code1 = chunk.charCodeAt(1);
        var code2 = chunk.charCodeAt(2);
        // Only decode bytes in range 0-9, A-F, a-f.
        if (
          (
            (code1 >= 0x30 && code1 <= 0x39) ||
            (code1 >= 0x41 && code1 <= 0x46) ||
            (code1 >= 0x61 && code1 <= 0x66)
          ) && (
            (code2 >= 0x30 && code2 <= 0x39) ||
            (code2 >= 0x41 && code2 <= 0x46) ||
            (code2 >= 0x61 && code2 <= 0x66)
          )
        ) {
          // Even for a single valid chunk we can't use decodeURIComponent()
          // as that throws on e.g. "%C3" decoded separately from "%C3%BF".
          // FIXME: This avoids throwing but produces incorrect output.
          output += String.fromCharCode(parseInt(chunk.slice(1), 16));
          i = i + 2;
          continue;
        }
      }
    }
    output += byte;
  }
  return output;
}
```

A simpler way, since decodeURIComponent obviously has the logic we
need for multi-byte characters, is to always call it with as many
sequential chunks as possible, whilst still skipping the things it
shouldn't handle.

This would get messy, needlessly slow, and verbose to do with a
for-loop and a buffer, so... I resorted to a regex.

It simply looks for repetitions of "% + hex pair" and decodes those.
This repairs the pre-existing test cases I had broken, while still
fixing the new test cases I added. It also basically brings the code
back to how it was before, just using the regex to skip over ranges
that we know decodeURIComponent would otherwise throw on.
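A sketch of that approach (illustrative only, not the exact patch; the try/catch additionally guards hex-pair runs that still don't form valid UTF-8, such as a lone `%96`):

```js
// Decode maximal runs of "%" followed by a hex pair in a single
// decodeURIComponent() call, so multi-byte sequences stay together.
// Anything decodeURIComponent would throw on is left as a literal.
function percent_decode(input) {
  return input.replace(/(%[0-9A-Fa-f]{2})+/g, function (run) {
    try {
      return decodeURIComponent(run);
    } catch (e) {
      return run; // e.g. "%96" alone is not valid UTF-8
    }
  });
}

console.log(percent_decode('?q=%'));           // "?q=%" - literal percent
console.log(percent_decode('%f0%9f%92%a9'));   // the emoji, decoded whole
console.log(percent_decode('Apollo%96Soyuz')); // unchanged: %96 is invalid
```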

Fixes JakeChampion#4.
JakeChampion pushed a commit to Krinkle/polyfill-library that referenced this issue Mar 12, 2022
Origami ✨ automation moved this from Backlog to Done Mar 12, 2022
JakeChampion pushed a commit that referenced this issue Mar 12, 2022
@github-actions github-actions bot added the library label Mar 12, 2022
Krinkle added a commit to Krinkle/polyfill-library that referenced this issue Mar 15, 2022
Krinkle added a commit to Krinkle/polyfill-library that referenced this issue Mar 15, 2022
JakeChampion pushed a commit that referenced this issue Mar 16, 2022
Ref #1173.
Ref #4.

Co-authored-by: David Chan <david@troi.org>
JakeChampion pushed a commit that referenced this issue Apr 1, 2022
Ref #1173.
Ref #4.

Co-authored-by: David Chan <david@troi.org>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2022
@robertboulton robertboulton removed this from Done in Origami ✨ Jul 26, 2023