Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Router does not decode URIs #2672

Closed
szmarczak opened this issue May 14, 2024 · 47 comments
Closed

Router does not decode URIs #2672

szmarczak opened this issue May 14, 2024 · 47 comments

Comments

@szmarczak
Copy link
Contributor

What version of Hono are you using?

4.3.6

What runtime/platform is your app running on?

Deno

What steps can reproduce the bug?

Deno/Hono

import { Hono } from 'npm:hono';

const h = new Hono();

h.get('/|', async (c) => c.text('ayy'));

Deno.serve({
	port: 8080,
}, h.fetch);

Node

// ayy
http.get({ host: 'localhost', port: 8080, path: '/|' }, res => res.pipe(process.stdout));'';

// 404 Not Found, expected ayy
http.get({ host: 'localhost', port: 8080, path: '/'+encodeURI('|') }, res => res.pipe(process.stdout));'';

What is the expected behavior?

ayy

What do you see instead?

404 Not Found

Additional information

No response

@szmarczak szmarczak added the bug label May 14, 2024
@mgrithm
Copy link
Contributor

mgrithm commented May 15, 2024

@yusukebe, do you think that it is a bug?

@yusukebe
Copy link
Member

Hi @szmarczak

The router doesn't need to decode the path. Instead, if you want to decode it, you can use the getPath option for new Hono().

import { Hono } from 'hono'

const app = new Hono({
  getPath: (req) => {
    const url = req.url
    const queryIndex = url.indexOf('?', 8)
    const path = url.slice(url.indexOf('/', 8), queryIndex === -1 ? undefined : queryIndex)
    return decodeURI(path)
  },
})

@yusukebe yusukebe removed the bug label May 15, 2024
@szmarczak
Copy link
Contributor Author

I disagree. The params are decoded out-of-the-box (though in Request), I don't see why the path can't be:

Deno/Hono

import { Hono } from 'npm:hono';

const h = new Hono();

h.get('/:foobar', async (c) => c.text(c.req.param('foobar')));

Deno.serve({
	port: 8080,
}, h.fetch);

Node

http.get({ host: 'localhost', port: 8080, path: '/'+encodeURIComponent('/') }, res => res.pipe(process.stdout));'';

Also, the benchmark for find-my-way uses find which does decoding.

@yusukebe
Copy link
Member

@szmarczak

Thanks. Indeed, paths may need to be decoded. But I think it should not be done by the router but by getPath(). We have made the method for that purpose. Regarding the benchmark, we can add a note on the page.

@usualoma What do you think about this?

@szmarczak
Copy link
Contributor Author

should not be done by the router but by getPath()

There's no reason to make decoding opt-in. It's bad practice and bad DX.

Furthermore, the URI RFC says:

Percent-encoded octets must be decoded at some point during the
dereference process. Applications must split the URI into its
components and subcomponents prior to decoding the octets, as
otherwise the decoded octets might be mistaken for delimiters.

so not decoding the components is actually against the spec.

@fzn0x
Copy link
Contributor

fzn0x commented May 16, 2024

I think the paths should be decoded by default, making decoding optional can lead to errors and poor developer experience (DX) because it introduces the possibility of inconsistent handling of URIs, I believe automatically decoding URI components ensures consistency and reduces the likelihood of bugs related to misinterpretation of encoded characters.

I think it should not be done by the router but by getPath()

I think we can add an option in Hono instance to disable decoding?

But default is enabled.

@usualoma
Copy link
Member

I considered it for a while and concluded that we should decode the path component before processing the routing.
As @yusukebe says, the place to do this processing is getPath(). We need an implementation of getPath() that is fast and does the decoding correctly.

@yusukebe
Copy link
Member

@fzn0x @usualoma Thanks.

Let's implement getPath() to decode a URL path correctly and make it the default.

@mgrithm
Copy link
Contributor

mgrithm commented May 16, 2024

@yusukebe, can I implement it?

@szmarczak
Copy link
Contributor Author

the place to do this processing is getPath()

Why? There have been several comments saying that with no rationale behind.

@usualoma
Copy link
Member

@szmarczak Thanks for your comment.

Are you saying that you think the router should do this process?
We believe that modifying getPath() would satisfy the specification. Please be specific about what you are objecting to.

@usualoma
Copy link
Member

@szmarczak
Oh, and to be clear, the "modifying getPath()" we are talking about now refers to the following changes.
We want to implement this change in a less performance-degrading way.

diff --git a/src/utils/url.ts b/src/utils/url.ts
index b37a5539..7109a307 100644
--- a/src/utils/url.ts
+++ b/src/utils/url.ts
@@ -73,7 +73,7 @@ export const getPath = (request: Request): string => {
   // Optimized: indexOf() + slice() is faster than RegExp
   const url = request.url
   const queryIndex = url.indexOf('?', 8)
-  return url.slice(url.indexOf('/', 8), queryIndex === -1 ? undefined : queryIndex)
+  return decodeURI(url.slice(url.indexOf('/', 8), queryIndex === -1 ? undefined : queryIndex))
 }
 
 export const getQueryStrings = (url: string): string => {

@szmarczak
Copy link
Contributor Author

Are you saying that you think the router should do this process?

Correct. For example, testing this behavior would be much easier. Otherwise, you'd have to test the entire Hono app to make sure the behavior is right.

We believe that modifying getPath() would satisfy the specification.

Indeed, however the router can be used separately outside of Hono. Making it a user responsibility would be a bad DX.
For example, if I use a router then I'd expect it to parse paths for me without the need to do anything in that regards from my side.
If this is true when reading path params, then it should be true for the rest of the path.

Please be specific about what you are objecting to.

Having said the above, I don't believe that making it a part of getPath is a good idea.

We want to implement this change in a less performance-degrading way.

path.includes('%') ? decodeURI(path) : path

Also note that decodeURI may throw if it contains non-UTF-8 data so perhaps it's good to wrap it in a safe function that returns unwrapped data instead.

@usualoma
Copy link
Member

@szmarczak Thank you! I think I understand your thoughts. It will be helpful.

I consider the following

What are the responsibilities of Hono routers?

The match() method of Hono's router is responsible for taking a string and returning a result.
https://github.com/honojs/hono/blob/main/src/router.ts#L7-L11

If it is "receive a Request object and return a result", I think it should be decoded at the router, but since it is not, I feel comfortable with the API being "decode and pass it outside the router".

Who do we consider DX for?

the router can be used separately outside of Hono

Yes, there may be such users.

But I would prefer to improve the DX for the users of Hono (or us, the developers) who implement their own router in Hono, rather than improve the DX for the users who "use the router outside of Hono". If I want to write a new router, I would appreciate it if it is decoded outside the router.

Performance

path.includes('%') ? decodeURI(path) : path

I think that snippet will meet the specification, but we need code that performs better. For example, the following code. (Whether it is decodeURI or decodeURIComponent is a separate discussion.)

export const getPath = (request: Request): string => {
  const url = request.url
  const start = url.indexOf('/', 8)
  let i = start
  for (; i < url.length; i++) {
    const charCode = url.charCodeAt(i)
    if (charCode === 37) {
      // '%'
      // If the path contains percent encoding, use `indexOf()` to find '?' and return the result immediately.
      // Although this is a performance disadvantage, it is acceptable since we prefer cases that do not include percent encoding.
      const queryIndex = url.indexOf('?', i)
      return decodeURIComponent(url.slice(start, queryIndex === -1 ? undefined : queryIndex))
    } else if (charCode === 63) {
      // '?'
      break
    }
  }
  return url.slice(start, i)
}

@szmarczak
Copy link
Contributor Author

The match() method of Hono's router is responsible for taking a string

The string consists of URI components.

If it is "receive a Request object and return a result", I think it should be decoded at the router, but since it is not

Whether the URL is extracted from Request or a user supplies its own request.url directly (guaranteed it's not decoded already) is irrelevant.

If I want to write a new router, I would appreciate it if it is decoded outside the router.

Then you can extract the custom decoding URI function into fastDecodeURI and do path = fastDecodeURI(path) in

match(method: string, path: string): Result<T> {

and other routers too. IMO in that way DX for Hono developers is preserved.

I think that snippet will meet the specification, but we need code that performs better. For example, the following code.

I haven't benchmarked this but it may occur that for + charCodeAt is slower than calling always indexOf('?') and includes('%'). (Well, after all it's JS and not C++ so often the code that "looks fast" is not). Feel free to check that, I'll be back in the morning as I'm off to sleep.


I think that both sides (me and Hono) have expressed their opinions thoroughly. Whether it's getPath or router is a just implementation detail of free choice. It doesn't matter for the Hono end user which way it's done. I don't think I have any other points to add. Thanks for consideration!

Regarding the benchmark, we can add a note on the page.

I think the benchmark should contain code that decodes the URI (using Hono's implementation) for routers that don't decode URIs. Otherwise the benchmark is untrue as it doesn't represent how the route is handled in a real life scenario.

@usualoma
Copy link
Member

@szmarczak Thanks. I understand that there are other opinions, but I still think that getPath() should be modified.

@usualoma
Copy link
Member

usualoma commented May 17, 2024

I would like to refer to #2688 for discussion.

If anyone has an opinion, please comment.

Expected Routing Results

Is this the expected routing result?

https://github.com/honojs/hono/pull/2688/files#diff-8ac13809c9886e994d1db33943de82df4d6c5d88b73fd0270c0189804ff565c2R736-R775

There may be disagreement as to whether http://localhost/users/hono%2Fposts should match /users/:id or /users/:id/posts. However, leaving "%2F" here (by decodeURI() ) would complicate the "do not double-decode" process, so I think it is better to decode everything (by decodeURIComponent() ) here. The current v4.3.7, before the change, matches /users/:id, so this is a spec change.

It's a bit of a quibble whether http://localhost/users/%73tatic should match /users/static, but it might slip through and match '/users/:id' and c.req.param ('id') returns "static", then there may be a security problem depending on the application specification, so I think it should match /users/static. This is also a spec change.

Performace

I have benchmarked with benchmarks/utils/src/get-path.ts and I think this implementation will be faster in any runtime.

% bun run src/get-path.ts; deno run --allow-sys src/get-path.ts; npx tsx src/get-path.ts
cpu: Apple M2 Pro
runtime: bun 1.1.2 (arm64-darwin)

benchmark                                time (avg)             (min … max)       p75       p99      p999
--------------------------------------------------------------------------- -----------------------------
noop                                     45 ps/iter       (0 ps … 22.16 ns)     41 ps     81 ps    122 ps !

• getPath
--------------------------------------------------------------------------- -----------------------------
slice + indexOf : w/o decodeURI       30.43 ns/iter      (28.4 ns … 307 ns)  29.97 ns  36.03 ns    212 ns
regexp : w/o decodeURI                40.58 ns/iter     (34.79 ns … 312 ns)  39.67 ns  53.65 ns    285 ns
slice + indexOf                       63.21 ns/iter     (56.68 ns … 343 ns)  63.99 ns  72.98 ns    280 ns
slice + for-loop + flag               23.35 ns/iter     (21.81 ns … 313 ns)  22.66 ns   29.3 ns    211 ns
slice + for-loop + immediate return   23.12 ns/iter     (21.67 ns … 235 ns)  22.66 ns   27.3 ns    188 ns

summary for getPath
  slice + for-loop + immediate return
   1.01x faster than slice + for-loop + flag
   1.32x faster than slice + indexOf : w/o decodeURI
   1.75x faster than regexp : w/o decodeURI
   2.73x faster than slice + indexOf
cpu: Apple M2 Pro
runtime: deno 1.43.1 (aarch64-apple-darwin)

benchmark                                time (avg)             (min … max)       p75       p99      p999
--------------------------------------------------------------------------- -----------------------------
noop                                    698 ps/iter         (0 ps … 977 ns)      0 ps      0 ps      0 ps !

• getPath
--------------------------------------------------------------------------- -----------------------------
slice + indexOf : w/o decodeURI       15.12 ns/iter         (0 ps … 977 ns)      0 ps    977 ns    977 ns !
regexp : w/o decodeURI                49.17 ns/iter         (0 ps … 977 ns)      0 ps    977 ns    977 ns !
slice + indexOf                       45.34 ns/iter         (0 ps … 977 ns)      0 ps    977 ns    977 ns !
slice + for-loop + flag                34.2 ns/iter         (0 ps … 977 ns)      0 ps    977 ns    977 ns !
slice + for-loop + immediate return   31.73 ns/iter         (0 ps … 977 ns)      0 ps    977 ns    977 ns !

summary for getPath
  slice + indexOf : w/o decodeURI
   2.1x faster than slice + for-loop + immediate return
   2.26x faster than slice + for-loop + flag
   3x faster than slice + indexOf
   3.25x faster than regexp : w/o decodeURI
cpu: Apple M2 Pro
runtime: node v20.0.0 (arm64-darwin)

benchmark                                time (avg)             (min … max)       p75       p99      p999
--------------------------------------------------------------------------- -----------------------------
noop                                     71 ps/iter        (20 ps … 234 ns)     82 ps    102 ps    143 ps !

• getPath
--------------------------------------------------------------------------- -----------------------------
slice + indexOf : w/o decodeURI       21.29 ns/iter     (19.96 ns … 177 ns)  21.18 ns  30.03 ns  40.85 ns
regexp : w/o decodeURI                42.16 ns/iter     (37.82 ns … 259 ns)  40.85 ns    105 ns    143 ns
slice + indexOf                       34.05 ns/iter     (31.74 ns … 146 ns)     34 ns  42.07 ns    125 ns
slice + for-loop + flag                27.9 ns/iter     (26.12 ns … 304 ns)  27.73 ns  37.03 ns    140 ns
slice + for-loop + immediate return   27.39 ns/iter     (25.45 ns … 192 ns)  27.02 ns  36.44 ns    153 ns

summary for getPath
  slice + indexOf : w/o decodeURI
   1.29x faster than slice + for-loop + immediate return
   1.31x faster than slice + for-loop + flag
   1.6x faster than slice + indexOf
   1.98x faster than regexp : w/o decodeURI

Handle URIError: URI malformed

Error handling is under consideration.

@szmarczak
Copy link
Contributor Author

szmarczak commented May 17, 2024

@usualoma decodeURIComponent is incorrect. decodeURI is correct. It must not normalize %2F into / (nor %3F otherwise it will mistake it for a query).

localhost/foo%2Fbar/baz/ąę consists of only three segment, not four. If the router accepted an array, it would look like this:

['foo/bar', 'baz', 'ąę']

However, if we're using strings, then to match this we need foo%2Fbar/baz/ąę (note that ą and ę are decoded and not in their percent-encoded form).

Handle URIError: URI malformed

Ideally, we'd like to decode what we can, and malformed percent-encoded characters leave in their percent-encoded form. I'm almost certain that this would be much slower, therefore IMO it's ok to fallback to original not-decoded path.

@usualoma
Copy link
Member

@szmarczak Thanks for the comment!

(nor %3F otherwise it will mistake it for a query).

If you read the code, you will see that we extract the path component and then apply decodeURIComponent, so there is no confusion with the query.

localhost/foo%2Fbar/baz/ąę consists of only three segment

Can you provide a reason to believe that the specification is correct? I am referring to RFC 3986, 2.3, which states that they should be decoded.

https://datatracker.ietf.org/doc/html/rfc3986#section-2.3

For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.

Also, /a%2Fb.html is considered the same as /a/b.html on typical web servers such as nginx.

@szmarczak
Copy link
Contributor Author

szmarczak commented May 17, 2024

Can you provide a reason to believe that the specification is correct?

I don't understand the question. The specification is always correct because it is the definition.

I am referring to RFC 3986, 2.3, which states that they should be decoded.

No, it does not state that. 2F is not in the range of RFC 3986, 2.3 so you're incorrectly adhering to it.

Also, /a%2Fb.html is considered the same as /a/b.html on typical web servers such as nginx.

We're not talking about nginx or any other servers here.

@szmarczak
Copy link
Contributor Author

https://www.rfc-editor.org/rfc/rfc9110.html#name-https-normalization-and-com

Characters other than those in the "reserved" set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [URI]).

https://www.rfc-editor.org/rfc/rfc3986#section-2.2

2.2. Reserved Characters

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

I hope this clarifies. This means you must not decode %3F into ? while normalizing the path segment (normalization happens when we operate on paths like strings, if segments were arrays then decodeURIComponent would be correct, like said in the comment above).

@usualoma
Copy link
Member

@szmarczak
Sorry, you were right from the beginning on the following points you mentioned and my understanding was wrong. If we look at it in terms of "segments", I think it should be divided into three.

localhost/foo%2Fbar/baz/ąę consists of only three segments, not four. If the router accepted an array, it would look like this: > localhost/foo%2Fbar/baz/ąę

['foo/bar', 'baz', 'ąę']

However, "apply decodeURI() -> split into segments -> apply decodeURIComponent() for each segment" doesn't work.

decodeURI("%2525a/b/c").split("/").map(decodeURIComponent)

The expected result should be as follows,

['%25a', 'b', 'c']

In fact, it would look like this.

['%a', 'b', 'c']

If we want to get the correct result on a per-segment basis, we would need to return an array from getPath() as follows

return "%2525a/b/c".split("/").map(decodeURIComponent)

However, the Hono router does not understand the unit of “segment”.

@szmarczak
Copy link
Contributor Author

szmarczak commented May 17, 2024

No worries! I'm glad I could clarify.

However, "apply decodeURI() -> split into segments -> apply decodeURIComponent() for each segment" doesn't work.

[edit: removed because the example is right]

@fzn0x
Copy link
Contributor

fzn0x commented May 17, 2024

I believe implement memoization could improve the speed. Also skips normal path to fast parse

Click me
/**
 * Parse the `req` URL with memoization.
 *
 * @param {ServerRequest} req
 * @return {Object|undefined}
 * @public
 */
function parseurl(req) {
  const url = req.url;

  if (url === undefined) {
    // URL is undefined
    return undefined;
  }

  const parsed = req._parsedUrl;

  if (isFresh(url, parsed)) {
    // Return cached URL parse
    return parsed;
  }

  // Parse the URL
  const newParsed = fastparse(url);
  newParsed._raw = url;

  return (req._parsedUrl = newParsed);
}

/**
 * Parse the `req` original URL with fallback and memoization.
 *
 * @param {ServerRequest} req
 * @return {Object|undefined}
 * @public
 */
function originalurl(req) {
  const url = req.originalUrl;

  if (typeof url !== 'string') {
    // Fallback
    return parseurl(req);
  }

  const parsed = req._parsedOriginalUrl;

  if (isFresh(url, parsed)) {
    // Return cached URL parse
    return parsed;
  }

  // Parse the URL
  const newParsed = fastparse(url);
  newParsed._raw = url;

  return (req._parsedOriginalUrl = newParsed);
}

/**
 * Parse the `str` URL with a fast-path shortcut.
 *
 * @param {string} str
 * @return {Object}
 * @private
 */
function fastparse(str) {
  if (typeof str !== 'string' || str.charCodeAt(0) !== 0x2f /* / */) {
    return decodeURIToUrlObject(str);
  }

  let pathname = str;
  let query = null;
  let search = null;

  // Unroll the regexp for performance optimization
  for (let i = 1; i < str.length; i++) {
    switch (str.charCodeAt(i)) {
      case 0x3f: /* ? */
        if (search === null) {
          pathname = str.substring(0, i);
          query = str.substring(i + 1);
          search = str.substring(i);
        }
        break;
      case 0x09: /* \t */
      case 0x0a: /* \n */
      case 0x0c: /* \f */
      case 0x0d: /* \r */
      case 0x20: /*   */
      case 0x23: /* # */
      case 0xa0: /* non-breaking space */
      case 0xfeff: /* BOM */
        return decodeURIToUrlObject(str);
    }
  }

  const url = {};
  url.path = str;
  url.href = str;
  url.pathname = pathname;

  if (search !== null) {
    url.query = query;
    url.search = search;
  }

  return url;
}

/**
 * Decode the URI and convert it to a URL object.
 *
 * @param {string} uri
 * @return {Object}
 * @private
 */
function decodeURIToUrlObject(uri) {
  const decodedURI = decodeURI(uri);
  const url = new URL(decodedURI);

  return {
    href: decodedURI,
    path: url.pathname + (url.search || ''),
    pathname: url.pathname,
    search: url.search,
    query: url.searchParams.toString(),
  };
}

/**
 * Determine if parsed URL is still fresh.
 *
 * @param {string} url
 * @param {object} parsedUrl
 * @return {boolean}
 * @private
 */
function isFresh(url, parsedUrl) {
  return (
    typeof parsedUrl === 'object' &&
    parsedUrl !== null &&
    parsedUrl._raw === url
  );
}

const req = {
  url: "localhost/foo%2Fbar/baz/ąę"
}
console.log(parseurl(req).path)

Result

{
  href: 'localhost/foo%2Fbar/baz/ąę',
  path: '/foo%2Fbar/baz/%C4%85%C4%99',
  pathname: '/foo%2Fbar/baz/%C4%85%C4%99',
  search: '',
  query: '',
  _raw: 'localhost/foo%2Fbar/baz/ąę'
}

@usualoma
Copy link
Member

@fzn0x Thanks for the suggestion!

First of all, I don't think the cache will speed up the process, since we won't be calling getPath() multiple times for the same request object.

Also, I don't think the code you suggested would perform well if it were not cached.

If you still think it performs better, I would appreciate it if you could suggest it along with a benchmark.

@usualoma
Copy link
Member

FYI

The results of using symbols and multibyte characters in the routing definitions were investigated in the major frameworks. They vary considerably from framework to framework.

I don't think "we should match the major frameworks," but just that "if symbols and multibyte characters are not well supported in the major frameworks, it indicates that no one is having trouble in the production environment.

express

Symbols and multibytes do not appear to be available. paths are treated as "/"-separated segments.

const express = require('express')
const app = express()
const port = 3333

// "|" has a special meaning, so use "^" to validate
app.get('/^hello^', (req, res) => {
  res.send('^hello^')
})

app.get('/hello🔥', (req, res) => {
  res.send('hello🔥')
})

app.get('/users/:id/action', (req, res) => {
  res.send('users action')
})

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`)
})
$ curl 'http://localhost:3333/^hello^'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /%5Ehello%5E</pre>
</body>
</html>
$ curl 'http://localhost:3333/%5Ehello%5E'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /%5Ehello%5E</pre>
</body>
</html>
$ curl 'http://localhost:3333/hello🔥'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /hello%f0%9f%94%a5</pre>
</body>
</html>
$ curl 'http://localhost:3333/hello%f0%9f%94%a5'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /hello%f0%9f%94%a5</pre>
</body>
</html>
$ curl 'http://localhost:3333/users/1%2F2/action' # 200 : id => 1/2
$ curl 'http://localhost:3333/users%2F1%2Faction' # 404

ruby on rails

Only percent-encoded URLs are supported; unencoded URLs will result in a 404. paths are treated as "/"-separated segments.

get '/^hello^'
get '/|hello|'
get '/hello🔥'
get '/users/:id/action'
$ curl 'http://localhost/^hello^'            # 404
$ curl 'http://localhost/%5Ehello%5E' # 200
$ curl 'http://localhost/|hello|'               # 404
$ curl 'http://localhost/%7Chello%7C' # 200
$ curl 'http://localhost/hello🔥'             # 404
$ curl 'http://localhost/hello%F0%9F%94%A5' # 200
$  curl 'http://localhost/users/1%2F2/action' # 200
1/2
$  curl 'http://localhost/users%2F1%2Faction' # 200

dgango

Before routing, all characters are decoded in a decodeURIComponent() equivalent process, showing similar results to #2688.

path('^hello^', views.index, name='test'),
path('|hello|', views.index, name='test'),
path('hello🔥', views.index, name='test'),
path('a/<slug:slug>/c', views.path, name='test'),
$ curl 'http://localhost/^hello^'            # 200
$ curl 'http://localhost/%5Ehello%5E' # 200
$ curl 'http://localhost/|hello|'               # 200
$ curl 'http://localhost/%7Chello%7C' # 200
$ curl 'http://localhost/hello🔥'             # 200
$ curl 'http://localhost/hello%F0%9F%94%A5' # 200
$ curl 'http://localhost:8000/a%2Fb%2Fc' # 200
slug = b
$ curl 'http://localhost:8000/a/a%2Fb%2Fc/c' # 404

@usualoma
Copy link
Member

If what we need is a result similar to Ruby on Rails, then the following changes will accomplish this. (Actually, it's a bit more complicated, but it's "encode when adding a routing.")

diff --git a/src/hono-base.ts b/src/hono-base.ts
index 293d3874..d4cb8071 100644
--- a/src/hono-base.ts
+++ b/src/hono-base.ts
@@ -290,7 +290,7 @@ class Hono<
 
   private addRoute(method: string, path: string, handler: H) {
     method = method.toUpperCase()
-    path = mergePath(this._basePath, path)
+    path = encodeURI(mergePath(this._basePath, path))
     const r: RouterRoute = { path: path, method: method, handler: handler }
     this.router.add(method, path, [handler, r])
     this.routes.push(r)

@fzn0x
Copy link
Contributor

fzn0x commented May 18, 2024

If what we need is a result similar to Ruby on Rails, then the following changes will accomplish this. (Actually, it's a bit more complicated, but it's "encode when adding a routing.")

diff --git a/src/hono-base.ts b/src/hono-base.ts
index 293d3874..d4cb8071 100644
--- a/src/hono-base.ts
+++ b/src/hono-base.ts
@@ -290,7 +290,7 @@ class Hono<
 
   private addRoute(method: string, path: string, handler: H) {
     method = method.toUpperCase()
-    path = mergePath(this._basePath, path)
+    path = encodeURI(mergePath(this._basePath, path))
     const r: RouterRoute = { path: path, method: method, handler: handler }
     this.router.add(method, path, [handler, r])
     this.routes.push(r)

Yes, that's what I mean, sorry for the example, I think every framework does different things to deal with paths, but Django looks good!

@usualoma
Copy link
Member

If the expected behavior is equivalent to Ruby on Rails, I think #2711 can achieve this. I feel this is better than #2688 because it is less burdensome at the request time.

However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".

@usualoma
Copy link
Member

I created #2714 with great support from @szmarczak!

Although #2711 is hard to discard for its performance advantage, at this point I consider this new #2714 to be the best.

The reasons are as follows

Although I am concerned about the following points.

However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".

Also, there is still the issue of how to handle URIError: URI malformed. (I think it should return an error.)

@yusukebe Looking at the discussion so far, what are your thoughts?

@yusukebe
Copy link
Member

@szmarczak @usualoma @fzn0x

Thank you for the discussion, and I'm sorry for the misunderstanding at first.

I like #2714! It achieves ideal handling paths, and the performance is good.

URIError: URI malformed

I also think it should return an error. When should it return the error? Can it return an error during the registration phase?

@szmarczak
Copy link
Contributor Author

szmarczak commented May 18, 2024

@usualoma If someone sends UTF-8 (non-ASCII) path over the wire, it's incorrect as the URI spec permits only ASCII. UTF-8 characters must be percent-encoded. So the Ruby on Rails behavior seems most correct.

Although #2711 is hard to discard for its performance advantage, at this point I consider this new #2714 to be the best.

The performance should be the same. The least performant way is when someone sends %25 which is 1 of dozen characters that have to be percent-encoded, so the probability of this path is very low.

Also, there is still the issue of how to handle URIError: URI malformed. (I think it should return an error.)

Well, (unfortunately!) the URI spec does not define the encoding percent-encoded characters have to use (they are just bytes). It's just that the browser APIs like decodeURI and encodeURI force UTF-8 (using system encoding does not make sense as there are 8 billion people in the world with varying languages, so they had to arbitrarily define the encoding as UTF-8 (like MSL in TCP)).

Whether you return 400 Bad Request or use the URI without decoding in that case (it failed to decode to UTF-8) - any approach would be ok IMO.

@szmarczak
Copy link
Contributor Author

szmarczak commented May 18, 2024

However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".

Having developed a Node.js HTTP client (sorry for bragging!), I can definitely say that people widely send UTF-8 encoded characters (well, some even send non-UTF-8 but that's minority and we were enforcing UTF-8 for years, we had to disable UTF-8 enforcing in order to support legacy servers using other encodings).

@usualoma
Copy link
Member

usualoma commented May 18, 2024

Difference in performance between #2711 and #2714

I think the difference between the two at request time is indicated by the following part of the benchmark: for bun, #2714 is slightly faster; for deno and node, #2711 is faster. (This is a short URL benchmark, not including the %)
But in any case, it's a small difference, so I wouldn't worry too much about it.

CleanShot 2024-05-19 at 06 08 30@2x

Who is affected by this modification?

However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".

Even hono@v4.3.7 can process URLs containing percent encodings with the following routing

app.get('/users/:id', (c) => c.text(`id is ${c.req.param('id')}`))
$ curl 'http://localhost:3000/users/|ok|'
id is |ok|
$ curl 'http://localhost:3000/users/%7Cok%7C'
id is |ok|

What cannot be handled correctly is the case where "the routing definition contains symbols or multibyte characters".

app.get('/|', async (c) => c.text('ayy'));

I would like to merge #2714 because it is more correct for the framework to be able to handle these cases. However, I don't think there are any realistic users who would define such a routing for a web service endpoint that is widely accessed by various clients.

Handle "URIError: URI malformed"

@szmarczak
Sorry for not being clear on the subject. You were primarily targeting "errors that cannot be decoded as UTF-8", while I think I was primarily targeting "incorrect percent encoding errors".

  • decodeURI("%A4%A2") : Japanese "あ" encoded in EUC-JP
  • decodeURI("%a") : Invalid encoding

You are right, regarding the former, I think it would be best if the framework does not make it an error, but in a way that the user can handle it himself.
As for the latter, I think it should return 400 as soon as possible.

Of course, getPath() could be modified to handle these properly, and hono-base could be rewritten, but I don't want to make getPath() too large.

I am considering this.

@szmarczak
Copy link
Contributor Author

szmarczak commented May 18, 2024

However, I don't think there are any realistic users who would define such a routing for a web service endpoint that is widely accessed by various clients.

It depends what you're developing. If you're developing an HTTP client, then you need to have a spec-compliant server to properly test the client. Or you just need static routes for simple mock server.

Also, being a Polish guy, I can say there are websites exposing static routes like /główna which means /home in English. The fix would allow these cases, but more importantly would be spec-compliant.

decodeURI("%a") : Invalid encoding

I wouldn't worry about this. First, it's against the spec. Second, it's impossible to get this via HTML forms (because forms are spec compliant). Proper fetch usage using WHATWG URLs never results in malformed URIs. The only possible situation for this is when someone intentionally types that in a browser or curl. If anybody wants something against the spec, they're better defining their own protocol instead.

@usualoma
Copy link
Member

@szmarczak Thank you!

more importantly would be spec-compliant.

I agree with you on that point.

decodeURI("%a") : Invalid encoding

I wouldn't worry about this.

Yes, of course, a user trying to send a correct request would not send the /%a/ path. However, there may be clients who send it for an attack.

Suppose we have a routing definition like this,

app.get('/content/:path', (c) => c.text('content'))
app.get('*', (c) => fallback()))

And suppose we have the following URL

http://localhost/users/%63ontent/path-to-%a

I think it would be best if a request to this URL would result in an error before entering the router, rather than proceeding to fallback.

Prevent "intentionally slip through static paths using percent encoding" attacks.

As I said in a previous comment, I think #2714 has the goodness to prevent this. If this feature is compromised, I think #2711 would be sufficient.

About #2711

As shown in the following test, I would expect #2711 to return the expected result for processing /| (or /główna), but would you prefer to choose #2714 over #2711?

https://github.com/honojs/hono/pull/2711/files#diff-8ac13809c9886e994d1db33943de82df4d6c5d88b73fd0270c0189804ff565c2R745-R765

@usualoma
Copy link
Member

Oh, by the way, there is another issue that is difficult to handle in #2714 in terms of spec-compliant.

Suppose we have a routing definition like this,

app.get('/główna/:path', (c) => c.text('główna'))

And suppose we have the following URL

http://localhost/g%C5%82%C3%B3wna/%A4%A2 # %A4%A2 is not a UTF-8 character

I think this is a spec-compliant URL, but it is difficult to get it routed in #2714 (while maintaining the spec of accepting non-UTF-8 encodings). #2711 would accomplish this.

@usualoma
Copy link
Member

I've come to the conclusion that with a few adjustments #2714 should be fine. Wait a minute.

@szmarczak
Copy link
Contributor Author

Yes, of course, a user trying to send a correct request would not send the /%a/ path. However, there may be clients who send it for an attack.

That's a very good find! In this case it makes more sense to return 400 Bad Request.

would you prefer to choose #2714 over #2711?

I believe there's no difference in terms of the result. I'd choose what's more performant (or maintainable) and link to the other possible solution if you ever needed to change the algorithm.

I think this is a spec-compliant URL

It is!

it is difficult to get it routed in #2714 (while maintaining the spec of accepting non-UTF-8 encodings)

The main branch is affected as well. The spec tells you that the percent encoded characters must be decoded, but it doesn't specify the encoding (making it a Uint8Array is unusable, if anybody wants to send binary data they're better using POST). Browsers send UTF-8, so I'd force it to be UTF-8 and 400 Bad Request if it isn't (I'd leave decodeURIComponent as is).

@szmarczak
Copy link
Contributor Author

I'm not sure about not decoding non-UTF-8 characters in #2714, but I'm leaning towards being okay.

I'm satisfied with #2714 👍🏼

@usualoma
Copy link
Member

@szmarczak Thanks right away! I was going to comment on this, but I'm late.

Added process for invalid percent-encoding and invalid UTF-8 sequences.
Skip these sequences and apply decodeURI() to the rest as much as possible.

d252d13

This will prevent the following URLs from also slipping through the routing

http://localhost/users/%63ontent/path-to-%a
http://localhost/g%C5%82%C3%B3wna/%A4%A2

(As was the case before this PR.) If an encoding other than UTF-8 is used, a URIError is thrown when trying to retrieve it with c.req.param(). If you want to get it, you can get it by yourself from c.req.url, or write your own getPath().

https://github.com/usualoma/hono/blob/ea8ec95852aca34a1ed06177c8202e5b7e550404/src/hono.test.ts#L766-L809

Invalid percent encodings, such as %a, may be allowed to exit with a 400 error before invoking handlers, but that error checking is beyond the scope of this issue and is not addressed here.

@usualoma
Copy link
Member

usualoma commented May 19, 2024

@szmarczak Thanks again for all the thought-provoking comments so far, and for your thoughtful answers to all my questions.

#2714 or #2711

I think #2711 is often marginally better in terms of performance.

However, #2714 has more room for getPath() adjustments for users who want to do more complex things, and (not mentioned so far) regular expressions such as the following can be applied to decoded strings.

app.get('/:path{home|główna|ホーム}')

I think #2714 is a good choice because of these various good points.

@usualoma
Copy link
Member

@yusukebe How about we make this issue resolved by #2714, including error handling?

@fzn0x @mgrithm
If you have any thoughts, feel free to comment.

@usualoma
Copy link
Member

@yusukebe
Added "Specification Changes" item to descriptioin. #2714

@yusukebe
Copy link
Member

@usualoma

It looks good to me! The implementation of tryDecodeURI() and using app.onError for URIError is good.

Shall we go with it?

@usualoma
Copy link
Member

@yusukebe Thanks for the confirmation.
Let's go with it!

@fzn0x
Copy link
Contributor

fzn0x commented May 20, 2024

Let's go with it. I don't have any suggestions; I trust everyone here. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants