Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using RFC3986 encode form-urlencoded #1778

Open
chanjarster opened this issue Dec 14, 2018 · 14 comments
Open

Consider using RFC3986 encode form-urlencoded #1778

chanjarster opened this issue Dec 14, 2018 · 14 comments
Labels
media and encoding Issues regarding media type support and how to encode data (outside of query/path params) review
Milestone

Comments

@chanjarster
Copy link

3.0.2 spec section Support for x-www-form-urlencoded Request Bodies:

To submit content using form url encoding via RFC1866, ...

And, RFC1866 section 8.2.1 The form-urlencoded Media Type:

The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped as per [URL]

Indicates that using encoding specified in RFC1738 and a special treatment on space character

While according to HTML 5.2 section 4.10.21.6 URL-encoded form data:

See the WHATWG URL specification for details on application/x-www-form-urlencoded. [URL] ...

Which says:

Note: URLs can be used in numerous different manners, in many differing contexts. For the purpose of producing strict URLs one may wish to consider [RFC3986] [RFC3987] ...

So, please consider using RFC3986 instead of RFC1738, RFC1738 is too old(December 1994).

@handrews
Copy link
Contributor

I believe the only difference is that /, ?, and ';' are no longer reserved within the query string in RFC 3986, correct? (And I agree, it should reference RFC 3986).

@chanjarster
Copy link
Author

I believe the only difference is that /, ?, and ';' are no longer reserved within the query string in RFC 3986, correct? (And I agree, it should reference RFC 3986).

Yes

@vinniefalco
Copy link

Does anyone have a reference to the BNF/grammar for the "key=value" pairs? I wrote this:

query-params    = query-param *( "&" query-param )
query-param     = key [ "=" value ]
key             = *qpchar
value           = *( qpchar / "=" )

but is there an official IETF document (or addendum) that states this?

@handrews
Copy link
Contributor

@vinniefalco there is no IETF document. There was a draft, but it was abandoned. WHATWG addresses it in their URL "living standard", although I assume like all of their specs it's defined in terms of browser implementation state machines as opposed to ABNF or anything like that.

@vinniefalco
Copy link

Thanks! (I find the WHATWG "spec" exceptionally frustrating)

@handrews handrews added this to the v3.2.0 milestone Jan 28, 2024
@handrews
Copy link
Contributor

It's not clear to me whether we can change an RFC reference in a patch release, so for now I'm putting this in 3.2.

@handrews handrews added the media and encoding Issues regarding media type support and how to encode data (outside of query/path params) label Jan 28, 2024
@karenetheridge
Copy link
Contributor

karenetheridge commented Apr 24, 2024

FWIW, RFC3986 is what JSON Schema's "format": "uri-reference" uses (which appears in the openapi schema a few times).

@handrews
Copy link
Contributor

@karenetheridge yeah I think the chances that anyone else even realizes the difference, much less implements it, is pretty slim. Arguably, since RFC3986 is formally listed as updating RFC 1738 (the URL encoding that RFC1866 references), it should be automatically understood that 3986 supersedes the outdated requirements.

I'm not sure whether we should do anything here- it's really getting into the weeds, and I think we can rely on the IETF's Update/Obsolete linkage. Maybe. I hope?

@handrews
Copy link
Contributor

I noticed (with the discussion around urlencoding { and }) that our examples do not urlencode ~, which would be required by RFC 1738 but is not required by RFC 3986. Of course, we weren't urlencoding that example correctly anyway, but... I don't think anyone at any time meant we should urlencode ~ just because RFC 1738 is very indirectly referenced. We reference 3986 normatively elsewhere.

@handrews
Copy link
Contributor

handrews commented Apr 27, 2024

@OAI/tsc review request: Is it worth doing anything here? To summarize:

  • We cite RFC 1886 (HTML 2.0) for application/x-www-form-urlencoded, presumably as it was the last IETF document to even sort-of address it
  • RFC 1886 cites RFC 1738 (Uniform Resource Locators (URL)), the then-current URI/URL RFC, which has relatively restrictive rules for URL encoding
  • RFC 1738 has long since been superseded by RFC 3986 (with others in between), which has somewhat more relaxed encoding rules (which we take advantage of in some of our examples, such as by not encoding ~ which RFC 3986 specifically calls out as being allowed but previously forbidden)

Do we need to specifically note that our reference to RFC 1886 should incorporate RFC 3986 rules rather than RFC 1738 rules, or is that to be expected automatically because RFC 3986 formally obsoletes RFC 1738 already?

In terms of practical impact, I need to know for OASComply whether to validate using the RFC 1738 rules or the RFC 3986 rules.

@ralfhandl
Copy link
Contributor

ralfhandl commented Apr 29, 2024

The WhatWG URL spec requires ~ to be percent-encoded:

Note

The application/x-www-form-urlencoded percent-encode set contains all code points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_).

@mikekistler
Copy link
Contributor

I would be in favor of updating the RFC reference in the 3.2. My meager understanding of this topic is that the RFC switch is not simply a "clarification" so I don't think it could be done in a patch version on earlier versions.

@handrews
Copy link
Contributor

handrews commented Apr 29, 2024

@ralfhandl Ugh. The previous line in the WHATWG spec is probably the better one to quote here:

The application/x-www-form-urlencoded percent-encode set is the component percent-encode set and U+0021 (!), U+0027 (') to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (~).

However, that is only relevant when running the WHATWG application/x-www-form-urlencoded serializing algorithm. The WHATWG application/x-www-form-urlencoded parsing algorithm will, AFAICT, happily accept a literal ~, because their percent-decode algorithm copies bytes verbatim unless it sees a %.

My feeling is that if people want to use WHATWG's serialization rules, that's on them. It's not our responsibility to keep up with their "living standard" or sort out the grammatical (in the ABNF sense) implications of their pseudocode. They have, historically, been aggressively hostile to accommodating anyone else's use cases or concerns, for example refusing to support relative URI-references because there aren't enough "browser-related use cases".

As we are an API specification, I feel comfortable ignoring people who have made it clear that they have no interest in supporting APIs as used by non-browsers.

@handrews
Copy link
Contributor

@mikekistler if it's really a change, I agree with you. My question is, "is this even a change?" I can't find a clear explanation from the IETF as to whether an RFC that formally obsoletes another RFC automatically takes effect, or whether it requires an updated citation.

We don't even cite RFC 1738 directly, so there's no citation to replace. We do normatively cite RFC 3986, but the way we get to 1738 is:

  • OAS 3.0 (7/17) and OAS 3.1 (2/21) cite RFC 1866 (HTML 2.0; 11/95), which cites RFC 1738 (URLs; 12/94)
  • RFC 1738 (12/94) is obsoleted by RFC 2396 (8/98), which moves ~ to unreserved, and is obsoleted by RFC 3986 (1/05)

All the OAS citations are many years after 3986, so it was already clear that those rules were doubly-obsolete when we cited 1866.

For that matter, it's not entirely clear to me that RFC 1866 really requires percent-encoding those extra characters. §8.2.1 The form-urlencoded Media Type states:

  1. The form field names and values are escaped: space
    characters are replaced by '+', and then reserved characters
    are escaped as per [URL]; that is, non-alphanumeric
    characters are replaced by '%HH', a percent sign and two
    hexadecimal digits representing the ASCII code of the
    character.

Oddly enough, this mentions percent-encoding reserved characters, rather than unsafe characters. There's no errata for this RFC, so this is the text in question.

§2.2 URL Character Encoding Issues includes this text in a paragraph with the sub-heading "Unsafe":

Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "", "^", "~", "[", "]", and "`".

In that same section, under the sub-heading "Reserved":

Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "@", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme.

The BNF in §5 also does not place the characters in question in the reserved production:

safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">


reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex

unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape
xchar          = unreserved | reserved | escape

The term "national" is not used anywhere else in the RFC, nor do any other BNF rules depend on it. Which means the national characters are not part of uchar or xchar (the relevant productions for what characters are legal).

But RFC 1866 only mentions "reserved" characters. Not "unsafe" characters, or "characters outside of the legal set."


So... my feeling, after spending way too much time reading specs for this, is that anyone who is going to be legalistic enough to complain about RFC 1738 being cited by RFC 1866 can be countered by the fact that the most direct reading of RFC 1866 doesn't require encoding the national set of characters in the first place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
media and encoding Issues regarding media type support and how to encode data (outside of query/path params) review
Projects
None yet
Development

No branches or pull requests

7 participants