Waitress errors on curl request with non ASCII in URL. #127

Closed
GrahamDumpleton opened this issue Apr 21, 2016 · 11 comments

@GrahamDumpleton

If you issue a request with curl of:

curl http://127.0.0.1:8080/a=тест

Waitress server will die with:

ERROR:waitress:uncaptured python exception, closing channel <waitress.channel.HTTPChannel connected 127.0.0.1:59045 at 0x103ad7b38> (<class 'UnicodeDecodeError'>:'ascii' codec can't decode byte 0xd1 in position 3: ordinal not in range(128) [/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/asyncore.py|read|83] [/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/asyncore.py|handle_read_event|423] [/private/tmp/py35/lib/python3.5/site-packages/waitress/channel.py|handle_read|174] [/private/tmp/py35/lib/python3.5/site-packages/waitress/channel.py|received|191] [/private/tmp/py35/lib/python3.5/site-packages/waitress/parser.py|received|102] [/private/tmp/py35/lib/python3.5/site-packages/waitress/parser.py|parse_header|206] [/private/tmp/py35/lib/python3.5/site-packages/waitress/parser.py|split_uri|254] [/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py|urlsplit|327] [/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py|_coerce_args|114] [/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py|_decode_args|98] [/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py|<genexpr>|98])

This came up in discussion:

You may want to check that and related issues:

to check how Waitress behaves in cases of client sending non ASCII.

Right now Waitress fails. Both wsgiref and Gunicorn appear to get it wrong. But mod_wsgi appears to get the desired result.
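The traceback above ends in `urllib.parse.urlsplit`, which decodes bytes arguments as strict ASCII. A minimal reproduction outside Waitress might look like the sketch below. It assumes the request target reaches the parser as the raw UTF-8 bytes curl sent; the variable names are illustrative and are not Waitress internals.

```python
from urllib.parse import urlsplit

# Raw request target as curl sends it: UTF-8 bytes, not percent-encoded.
raw_target = "/a=тест".encode("utf-8")

try:
    urlsplit(raw_target)
    error = None
except UnicodeDecodeError as exc:
    # urlsplit coerces bytes arguments with a strict ASCII decode.
    error = exc

# Report the exception type and the offset of the first offending byte.
print(type(error).__name__, error.start if error else None)
```

The first non-ASCII byte sits right after `/a=`, which matches the "position 3" reported in the log.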

@tseaver
Member

tseaver commented Apr 21, 2016

Thanks for the report. A minor clarification: the waitress process itself doesn't die: it closes the connection without returning anything.

@digitalresistor
Member

It looks like cURL is not percent-encoding the URL and is instead sending raw UTF-8 to the server. That is not valid under the HTTP specification, which requires latin-1 for requests, so the URL has to be percent-encoded.

@rr-

rr- commented Feb 5, 2017

Why latin-1? How do we encode 漢字 with it?

The standard seems to advocate UTF-8 rather than latin-1:

Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters.

https://tools.ietf.org/html/rfc3986
http://stackoverflow.com/a/913653
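As a rough illustration of the rule quoted above, Python's `urllib.parse.quote` encodes text as UTF-8 by default and then percent-encodes each octet. This is only a sketch of what a client is expected to do before sending; the `safe="/="` argument in the last call is a choice made here to keep the `=` from the original report readable.

```python
from urllib.parse import quote

# quote() encodes str input as UTF-8 by default, then percent-encodes each octet.
heart = quote("/tag/Madoka♥Magika")
kanji = quote("/漢字")
# Keep "/" and "=" literal so only the Cyrillic text is escaped.
cyrillic = quote("/a=тест", safe="/=")

print(heart)     # /tag/Madoka%E2%99%A5Magika
print(kanji)     # /%E6%BC%A2%E5%AD%97
print(cyrillic)  # /a=%D1%82%D0%B5%D1%81%D1%82
```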

Percent-encoded URLs do not currently work either:

Input - GET /tag/Madoka%E2%99%A5Magika HTTP/1.0 (generated by modern web browser accessing /tag/Madoka♥Magika)
Output - /tag/Madokaâ¥Magika

@digitalresistor
Member

percent-encoded to be represented as URI characters.

Percent encoding is latin-1 (ASCII).

Percent-encoded URLs do not currently work either:

I am not sure what you mean here...

On a Pyramid application running locally on my machine (Python 3.5, waitress 1.0.1):

 curl -vvvvv http://10.10.10.205:6543/Madoka%E2%99%A5Magika
*   Trying 10.10.10.205...
* TCP_NODELAY set
* Connected to 10.10.10.205 (10.10.10.205) port 6543 (#0)
> GET /Madoka%E2%99%A5Magika HTTP/1.1
> Host: 10.10.10.205:6543
> User-Agent: curl/7.51.0
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< Content-Length: 921
< Content-Type: text/html; charset=UTF-8
< Date: Mon, 06 Feb 2017 01:20:58 GMT
< Server: waitress
< 
<html>
 <head>
  <title>404 Not Found</title>
 </head>
 <body>
  <h1>404 Not Found</h1>
  The resource could not be found.<br/><br/>
debug_notfound of url http://10.10.10.205:6543/Madoka%E2%99%A5Magika; path_info: &#x27;/Madoka&#9829;Magika&#x27;, context: &lt;myapp.traversal.Root object at 0x109773b00&gt;, view_name: &#x27;Madoka&#9829;Magika&#x27;, subpath: (), traversed: (), root: &lt;myapp.traversal.Root object at 0x109773b00&gt;, vroot: &lt;myapp.traversal.Root object at 0x109773b00&gt;, vroot_path: ()


 <link rel="stylesheet" type="text/css" href="http://10.10.10.205:6543/_debug_toolbar/static/toolbar/toolbar_button.css">

<div id="pDebug">
    <div  id="pDebugToolbarHandle">
        <a title="Show Toolbar" id="pShowToolBarButton"
           href="http://10.10.10.205:6543/_debug_toolbar/34343533343036383136" target="pDebugToolbar">&#171; FIXME: Debug Toolbar</a>
    </div>
</div>
</body>
* Curl_http_done: called premature == 0
* Connection #0 to host 10.10.10.205 left intact

Same with:

alexandra:~ xistence$ curl -vvvvv "http://10.10.10.205:6543/%E6%BC%A2%E5%AD%97"
*   Trying 10.10.10.205...
* TCP_NODELAY set
* Connected to 10.10.10.205 (10.10.10.205) port 6543 (#0)
> GET /%E6%BC%A2%E5%AD%97 HTTP/1.1
> Host: 10.10.10.205:6543
> User-Agent: curl/7.51.0
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< Content-Length: 912
< Content-Type: text/html; charset=UTF-8
< Date: Mon, 06 Feb 2017 01:22:05 GMT
< Server: waitress
< 
<html>
 <head>
  <title>404 Not Found</title>
 </head>
 <body>
  <h1>404 Not Found</h1>
  The resource could not be found.<br/><br/>
debug_notfound of url http://10.10.10.205:6543/%E6%BC%A2%E5%AD%97; path_info: &#x27;/&#28450;&#23383;&#x27;, context: &lt;myapp.traversal.Root object at 0x1096fcb00&gt;, view_name: &#x27;&#28450;&#23383;&#x27;, subpath: (), traversed: (), root: &lt;myapp.traversal.Root object at 0x1096fcb00&gt;, vroot: &lt;myapp.traversal.Root object at 0x1096fcb00&gt;, vroot_path: ()


 <link rel="stylesheet" type="text/css" href="http://10.10.10.205:6543/_debug_toolbar/static/toolbar/toolbar_button.css">

<div id="pDebug">
    <div  id="pDebugToolbarHandle">
        <a title="Show Toolbar" id="pShowToolBarButton"
           href="http://10.10.10.205:6543/_debug_toolbar/34343439333837333532" target="pDebugToolbar">&#171; FIXME: Debug Toolbar</a>
    </div>
</div>
</body>
* Curl_http_done: called premature == 0
* Connection #0 to host 10.10.10.205 left intact

Application output:

2017-02-05 18:19:52,118 DEBUG [myapp:106][waitress] route matched for url http://10.10.10.205:6543/Madoka%E2%99%A5Magika; route_name: 'main', path_info: '/Madoka♥Magika', pattern: '/*traverse', matchdict: {'traverse': ('Madoka♥Magika',)}, predicates: ''
2017-02-05 18:21:50,355 DEBUG [myapp:106][waitress] route matched for url http://10.10.10.205:6543/%E6%BC%A2%E5%AD%97; route_name: 'main', path_info: '/漢字', pattern: '/*traverse', matchdict: {'traverse': ('漢字',)}, predicates: ''

The issue is that cURL by default will NOT send the percent encoded request:

alexandra:~ xistence$ curl -vvv "http://10.10.10.205:6543/Madoka♥Magika"
*   Trying 10.10.10.205...
* TCP_NODELAY set
* Connected to 10.10.10.205 (10.10.10.205) port 6543 (#0)
> GET /Madoka♥Magika HTTP/1.1
> Host: 10.10.10.205:6543
> User-Agent: curl/7.51.0
> Accept: */*
> 
* Curl_http_done: called premature == 0
* Empty reply from server
* Connection #0 to host 10.10.10.205 left intact
curl: (52) Empty reply from server

Which causes waitress to close the connection:

2017-02-05 18:24:44,552 ERROR [waitress:181][MainThread] uncaptured python exception, closing channel <waitress.channel.HTTPChannel connected 10.10.10.205:50329 at 0x109591cf8> (<class 'UnicodeDecodeError'>:'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128) [/Users/xistence/.pyenv/versions/3.5.0/lib/python3.5/asyncore.py|read|83] [/Users/xistence/.pyenv/versions/3.5.0/lib/python3.5/asyncore.py|handle_read_event|423] [/Users/xistence/.ve/myapp/lib/python3.5/site-packages/waitress/channel.py|handle_read|174] [/Users/xistence/.ve/myapp/lib/python3.5/site-packages/waitress/channel.py|received|191] [/Users/xistence/.ve/myapp/lib/python3.5/site-packages/waitress/parser.py|received|102] [/Users/xistence/.ve/myapp/lib/python3.5/site-packages/waitress/parser.py|parse_header|208] [/Users/xistence/.ve/myapp/lib/python3.5/site-packages/waitress/parser.py|split_uri|256] [/Users/xistence/.pyenv/versions/3.5.0/lib/python3.5/urllib/parse.py|urlsplit|327] [/Users/xistence/.pyenv/versions/3.5.0/lib/python3.5/urllib/parse.py|_coerce_args|114] [/Users/xistence/.pyenv/versions/3.5.0/lib/python3.5/urllib/parse.py|_decode_args|98] [/Users/xistence/.pyenv/versions/3.5.0/lib/python3.5/urllib/parse.py|<genexpr>|98])

Waitress's handling of this should be improved, but the request itself is technically contra-spec because the sending entity should have percent-encoded the URL before sending it to the server.

@rr-

rr- commented Feb 6, 2017

I'm not sure how you got the above results, but the problematic behavior is demonstrated in existing unit tests:

https://github.com/Pylons/waitress/blob/1bcdeaec9fb60ba41053fcf9253d2a340af95310/waitress/tests/test_compat.py

b'/a%C5%9B'
assert '/aÅ\x9b'

whereas it "should" (should it?) be

b'/a\xc5\x9b'.decode('utf-8')
'/aś'

This weird encoding ends up being stored in env['PATH_INFO']
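The two results differ only in the codec applied after percent-decoding. The sketch below uses the standard library to show both, and is not Waitress's own code path:

```python
from urllib.parse import unquote_to_bytes

# Percent-decode to raw bytes first; no text codec is involved yet.
raw = unquote_to_bytes(b"/a%C5%9B")

as_latin1 = raw.decode("latin-1")  # the value the unit test asserts
as_utf8 = raw.decode("utf-8")      # the value a reader would expect

# latin-1 maps every byte to one code point, so the step is reversible
# and the UTF-8 text can still be recovered afterwards.
recovered = as_latin1.encode("latin-1").decode("utf-8")

print(repr(raw), repr(as_latin1), as_utf8, recovered)
```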

@rr-

rr- commented Feb 6, 2017

Example:

testapp.py

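def application(env, start_response):
    # A WSGI status string carries a reason phrase as well as the code.
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    a = env['PATH_INFO']  # percent-decoded bytes exposed as a latin-1 str
    b = a.encode('latin-1').decode('utf-8')  # :E
    print(a, b)
    # Return an iterable of bytes.
    return [('%s %s' % (a, b)).encode('utf-8')]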

waitress-serve --port 1234 testapp:application

rr-@tornado:~$ curl 'localhost:1234/%E6%BC%A2%E5%AD%97'
/æ¼¢å­% /漢字

The .encode('latin-1').decode('utf-8') gives the expected result but I totally get a "you're doing it wrong" vibe from it.

Edit: looks like pyramid does just that: https://github.com/Pylons/pyramid/blob/4acd85dc98fb2a43eae54d2116cc4bf383157269/pyramid/request.py#L283

In the test I see a reference to PEP 3333 https://www.python.org/dev/peps/pep-3333/#unicode-issues but the reason for latin-1 is bogus at best, even after reading the whole "unicode issues" section...
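One way to read the PEP's choice is that latin-1 acts as a lossless carrier for arbitrary bytes in a native string, rather than as a claim about the real encoding. A small sketch of that property, offered as an interpretation and not as the PEP's wording:

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# latin-1 maps each byte to exactly one code point, so decoding never fails
# and encoding the result gives back the original bytes.
carried = all_bytes.decode("latin-1")
round_trips = carried.encode("latin-1") == all_bytes

# UTF-8 cannot make the same promise for arbitrary bytes.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_trips, utf8_ok)
```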

@digitalresistor
Member

Actually Pyramid uses WebOb which does the right thing here: https://github.com/Pylons/webob/blob/master/webob/request.py#L321 and https://github.com/Pylons/webob/blob/master/webob/request.py#L167.

Which is similar to what Werkzeug does: https://github.com/pallets/werkzeug/blob/109dad4ac9e0a1690666b2d4f29d07d98a3701d9/werkzeug/wsgi.py#L233

That being said, the encode/decode spiel is indeed correct.

Based upon the comments in the above bug reports linked by @GrahamDumpleton, it is expected that the PATH_INFO contains the percent decoded URL in latin-1. Changing this would be against the WSGI spec.


The only way waitress could fix this issue is to accept the raw UTF-8 bytes, decode them as latin-1, and put the result in PATH_INFO; you would still have to do the encode/decode dance in your application.
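On the application side, the dance can be wrapped in a small helper. The function below is hypothetical (`path_info_text` is not part of Waitress, WebOb, or Werkzeug) and assumes the client meant UTF-8; invalid sequences are replaced rather than raised.

```python
def path_info_text(environ, encoding="utf-8"):
    """Return PATH_INFO as real text (hypothetical helper, not a library API)."""
    # Undo the latin-1 carrier to get back the bytes the client sent.
    raw = environ.get("PATH_INFO", "").encode("latin-1")
    # Decode with the encoding the application assumes the client used.
    return raw.decode(encoding, errors="replace")


# PATH_INFO as a WSGI server would expose the bytes for /漢字.
environ = {"PATH_INFO": b"/\xe6\xbc\xa2\xe5\xad\x97".decode("latin-1")}
print(path_info_text(environ))
```

Using `errors="replace"` is a design choice for the sketch; an application that prefers to reject bad input could decode strictly and return an error response.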

@rr-

rr- commented Feb 6, 2017

Thanks for the confirmation, wish I had known sooner about that encoding gotcha (or at least thought about going to look for it in the WSGI ref.)

Regarding the OP's issue, I think curl is at fault for not encoding the URLs as the RFC linked earlier says to, and trying to parse such URLs seems like asking for trouble. For example, what if the user issues the curl command in a console with a non-Unicode locale?
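To make the locale concern concrete: the same character typed in two terminals can reach the server as different bytes, and the server cannot tell which encoding was used. A sketch, with ISO-8859-2 picked here only as an example of a non-Unicode locale encoding:

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "ś"
utf8_bytes = text.encode("utf-8")         # what a UTF-8 terminal would send
latin2_bytes = text.encode("iso-8859-2")  # what an ISO-8859-2 terminal would send

print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(latin2_bytes, is_valid_utf8(latin2_bytes))
```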

@digitalresistor
Member

I agree with cURL being at fault. Trying UTF-8 and falling back to latin-1 might make sense. The other fix I am thinking about is having it actually return a 400 Bad Request instead of just closing the connection. Slamming the door in someone's face is not my idea of a good web citizen.
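The two options could be sketched roughly as follows. Both functions are hypothetical illustrations of the ideas in this comment; they are not Waitress code and do not necessarily reflect what the eventual fix implemented.

```python
def decode_target(raw):
    """Option 1 (hypothetical): try UTF-8 first, then fall back to latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 accepts any byte sequence, so this branch cannot fail.
        return raw.decode("latin-1")


def status_for_target(raw):
    """Option 2 (hypothetical): answer non-ASCII targets with a 400."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return "400 Bad Request"
    return "200 OK"


print(decode_target("/Madoka♥Magika".encode("utf-8")))
print(status_for_target("/Madoka♥Magika".encode("utf-8")))
print(status_for_target(b"/Madoka%E2%99%A5Magika"))
```

Option 1 guesses at the client's intent, while option 2 keeps the parser strict and simply gives the client a response it can act on.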

@mmerickel
Member

This issue is the same as #64.

@digitalresistor
Member

Fixed by #162
