Consider using RFC3986 encode form-urlencoded #1778

chanjarster · 2018-12-14T05:50:38Z

3.0.2 spec section Support for x-www-form-urlencoded Request Bodies:

To submit content using form url encoding via RFC1866, ...

And, RFC1866 section 8.2.1 The form-urlencoded Media Type:

The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped as per [URL]

Indicates that using encoding specified in RFC1738 and a special treatment on space character

While according to HTML 5.2 section 4.10.21.6 URL-encoded form data:

See the WHATWG URL specification for details on application/x-www-form-urlencoded. [URL] ...

Which says:

Note: URLs can be used in numerous different manners, in many differing contexts. For the purpose of producing strict URLs one may wish to consider [RFC3986] [RFC3987] ...

So, please consider using RFC3986 instead of RFC1738, RFC1738 is too old(December 1994).

handrews · 2022-05-19T03:10:39Z

I believe the only difference is that /, ?, and ';' are no longer reserved within the query string in RFC 3986, correct? (And I agree, it should reference RFC 3986).

chanjarster · 2022-05-19T05:31:04Z

I believe the only difference is that /, ?, and ';' are no longer reserved within the query string in RFC 3986, correct? (And I agree, it should reference RFC 3986).

Yes

vinniefalco · 2022-09-03T22:25:48Z

Does anyone have a reference to the BNF/grammar for the "key=value" pairs? I wrote this:

query-params    = query-param *( "&" query-param )
query-param     = key [ "=" value ]
key             = *qpchar
value           = *( qpchar / "=" )

but is there an official IETF document (or addendum) that states this?

handrews · 2024-01-27T03:02:42Z

@vinniefalco there is no IETF document. There was a draft, but it was abandoned. WHATWG addresses it in their URL "living standard", although I assume like all of their specs it's defined in terms of browser implementation state machines as opposed to ABNF or anything like that.

vinniefalco · 2024-01-28T15:12:54Z

Thanks! (I find the WHATWG "spec" exceptionally frustrating)

handrews · 2024-01-28T18:13:36Z

It's not clear to me whether we can change an RFC reference in a patch release, so for now I'm putting this in 3.2.

karenetheridge · 2024-04-24T19:12:51Z

FWIW, RFC3986 is what JSON Schema's "format": "uri-reference" uses (which appears in the openapi schema a few times).

handrews · 2024-04-24T20:20:57Z

@karenetheridge yeah I think the chances that anyone else even realizes the difference, much less implements it, is pretty slim. Arguably, since RFC3986 is formally listed as updating RFC 1738 (the URL encoding that RFC1866 references), it should be automatically understood that 3986 supersedes the outdated requirements.

I'm not sure whether we should do anything here- it's really getting into the weeds, and I think we can rely on the IETF's Update/Obsolete linkage. Maybe. I hope?

handrews · 2024-04-25T23:10:11Z

I noticed (with the discussion around urlencoding { and }) that our examples do not urlencode ~, which would be required by RFC 1738 but is not required by RFC 3986. Of course, we weren't urlencoding that example correctly anyway, but... I don't think anyone at any time meant we should urlencode ~ just because RFC 1738 is very indirectly referenced. We reference 3986 normatively elsewhere.

handrews · 2024-04-27T16:31:21Z

@OAI/tsc review request: Is it worth doing anything here? To summarize:

We cite RFC 1886 (HTML 2.0) for application/x-www-form-urlencoded, presumably as it was the last IETF document to even sort-of address it
RFC 1886 cites RFC 1738 (Uniform Resource Locators (URL)), the then-current URI/URL RFC, which has relatively restrictive rules for URL encoding
RFC 1738 has long since been superseded by RFC 3986 (with others in between), which has somewhat more relaxed encoding rules (which we take advantage of in some of our examples, such as by not encoding ~ which RFC 3986 specifically calls out as being allowed but previously forbidden)

Do we need to specifically note that our reference to RFC 1886 should incorporate RFC 3986 rules rather than RFC 1738 rules, or is that to be expected automatically because RFC 3986 formally obsoletes RFC 1738 already?

In terms of practical impact, I need to know for OASComply whether to validate using the RFC 1738 rules or the RFC 3986 rules.

ralfhandl · 2024-04-29T12:50:31Z

The WhatWG URL spec requires ~ to be percent-encoded:

Note

The application/x-www-form-urlencoded percent-encode set contains all code points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_).

mikekistler · 2024-04-29T20:37:34Z

I would be in favor of updating the RFC reference in the 3.2. My meager understanding of this topic is that the RFC switch is not simply a "clarification" so I don't think it could be done in a patch version on earlier versions.

handrews · 2024-04-29T23:08:31Z

@ralfhandl Ugh. The previous line in the WHATWG spec is probably the better one to quote here:

The application/x-www-form-urlencoded percent-encode set is the component percent-encode set and U+0021 (!), U+0027 (') to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (~).

However, that is only relevant when running the WHATWG application/x-www-form-urlencoded serializing algorithm. The WHATWG application/x-www-form-urlencoded parsing algorithm will, AFAICT, happily accept a literal ~, because their percent-decode algorithm copies bytes verbatim unless it sees a %.

My feeling is that if people want to use WHATWG's serialization rules, that's on them. It's not our responsibility to keep up with their "living standard" or sort out the grammatical (in the ABNF sense) implications of their pseudocode. They have, historically, been aggressively hostile to accommodating anyone else's use cases or concerns, for example refusing to support relative URI-references because there aren't enough "browser-related use cases".

As we are an API specification, I feel comfortable ignoring people who have made it clear that they have no interest in supporting APIs as used by non-browsers.

handrews · 2024-04-30T00:05:17Z

@mikekistler if it's really a change, I agree with you. My question is, "is this even a change?" I can't find a clear explanation from the IETF as to whether an RFC that formally obsoletes another RFC automatically takes effect, or whether it requires an updated citation.

We don't even cite RFC 1738 directly, so there's no citation to replace. We do normatively cite RFC 3986, but the way we get to 1738 is:

OAS 3.0 (7/17) and OAS 3.1 (2/21) cite RFC 1866 (HTML 2.0; 11/95), which cites RFC 1738 (URLs; 12/94)
RFC 1738 (12/94) is obsoleted by RFC 2396 (8/98), which moves ~ to unreserved, and is obsoleted by RFC 3986 (1/05)

All the OAS citations are many years after 3986, so it was already clear that those rules were doubly-obsolete when we cited 1866.

For that matter, it's not entirely clear to me that RFC 1866 really requires percent-encoding those extra characters. §8.2.1 The form-urlencoded Media Type states:

The form field names and values are escaped: space
characters are replaced by '+', and then reserved characters
are escaped as per [URL]; that is, non-alphanumeric
characters are replaced by '%HH', a percent sign and two
hexadecimal digits representing the ASCII code of the
character.

Oddly enough, this mentions percent-encoding reserved characters, rather than unsafe characters. There's no errata for this RFC, so this is the text in question.

§2.2 URL Character Encoding Issues includes this text in a paragraph with the sub-heading "Unsafe":

Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "", "^", "~", "[", "]", and "`".

In that same section, under the sub-heading "Reserved":

Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "@", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme.

The BNF in §5 also does not place the characters in question in the reserved production:

safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">


reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex

unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape
xchar          = unreserved | reserved | escape

The term "national" is not used anywhere else in the RFC, nor do any other BNF rules depend on it. Which means the national characters are not part of uchar or xchar (the relevant productions for what characters are legal).

But RFC 1866 only mentions "reserved" characters. Not "unsafe" characters, or "characters outside of the legal set."

So... my feeling, after spending way too much time reading specs for this, is that anyone who is going to be legalistic enough to complain about RFC 1738 being cited by RFC 1866 can be countered by the fact that the most direct reading of RFC 1866 doesn't require encoding the national set of characters in the first place.

MikeRalphson added the review label Jan 4, 2019

handrews added this to the v3.2.0 milestone Jan 28, 2024

handrews added the media and encoding Issues regarding media type support and how to encode data (outside of query/path params) label Jan 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider using RFC3986 encode form-urlencoded #1778

Consider using RFC3986 encode form-urlencoded #1778

chanjarster commented Dec 14, 2018

handrews commented May 19, 2022

chanjarster commented May 19, 2022

vinniefalco commented Sep 3, 2022

handrews commented Jan 27, 2024

vinniefalco commented Jan 28, 2024

handrews commented Jan 28, 2024

karenetheridge commented Apr 24, 2024 •

edited

handrews commented Apr 24, 2024

handrews commented Apr 25, 2024

handrews commented Apr 27, 2024 •

edited

ralfhandl commented Apr 29, 2024 •

edited

mikekistler commented Apr 29, 2024

handrews commented Apr 29, 2024 •

edited

handrews commented Apr 30, 2024

Consider using RFC3986 encode form-urlencoded #1778

Consider using RFC3986 encode form-urlencoded #1778

Comments

chanjarster commented Dec 14, 2018

handrews commented May 19, 2022

chanjarster commented May 19, 2022

vinniefalco commented Sep 3, 2022

handrews commented Jan 27, 2024

vinniefalco commented Jan 28, 2024

handrews commented Jan 28, 2024

karenetheridge commented Apr 24, 2024 • edited

handrews commented Apr 24, 2024

handrews commented Apr 25, 2024

handrews commented Apr 27, 2024 • edited

ralfhandl commented Apr 29, 2024 • edited

mikekistler commented Apr 29, 2024

handrews commented Apr 29, 2024 • edited

handrews commented Apr 30, 2024

karenetheridge commented Apr 24, 2024 •

edited

handrews commented Apr 27, 2024 •

edited

ralfhandl commented Apr 29, 2024 •

edited

handrews commented Apr 29, 2024 •

edited