url quote/unquote with python 3 broken #164

sileht · 2014-10-06T16:47:07Z

Hi,

I have some trouble with the URL handling of webob in Python 3. After some investigation, it seems
that the issue is located in webob.compat.url_unquote.

A simple testcase to reproduce the issue that works in python 2 but not in python 3:

import unittest


class urlquote_Tests(unittest.TestCase):
    def test_url_unquote(self):
        from webob.compat import url_quote
        from webob.compat import url_unquote
        # In Python 3 this literal is a str of three code points, not three bytes
        url = 'http://localhost/foo\xe2\x88\xa7bar/'
        val = url_unquote(url_quote(url))
        self.assertEqual(url, val)

Regards,

quantum-omega · 2015-04-13T21:30:04Z

The problem surfaces in url_unquote when, after successfully parsing the percent-encoded string and returning the right byte sequence, .decode('latin-1') is called on the result. The function does this in order to accept a Python 3 str object and return a str object. However, the \x88 character in the submitted test case is not part of ISO-8859-1 (latin-1).
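A minimal sketch of the mismatch described above, using only the standard library. It assumes url_unquote behaves as described in this comment (percent-decode to bytes, then decode as latin-1) and that quoting follows the urllib.parse.quote default of UTF-8; it is not webob's actual code.

```python
from urllib.parse import quote, unquote_to_bytes

# Same literal as the test case: in Python 3 this is a str of code points
url = 'http://localhost/foo\xe2\x88\xa7bar/'

# quote() encodes each non-ASCII character as UTF-8 before percent-escaping
quoted = quote(url)

# Percent-decoding yields the UTF-8 bytes again
raw = unquote_to_bytes(quoted)

# Decoding those UTF-8 bytes as latin-1 does not restore the input...
latin1_result = raw.decode('latin-1')
# ...while decoding them as UTF-8 does
utf8_result = raw.decode('utf-8')

print(latin1_result == url)  # False
print(utf8_result == url)    # True
```

Under these assumptions the round trip fails because the quoting and unquoting sides disagree on the encoding, not because percent-decoding itself is wrong.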

I didn't investigate RFC 3986 much, but it seems to indicate that non-ASCII characters, when they are percent-encoded, should be translated to their UTF-8 representation. In turn, that means that when we decode a URL, we should assume that it was encoded in UTF-8. However, just changing "latin-1" for "UTF-8" breaks a few tests. I'm inclined to think the assumptions those tests make about the expected strings are wrong, but I'd like a second opinion on that.

digitalresistor · 2015-04-13T23:26:04Z

Here is another report for stuff related to decode: #195

digitalresistor · 2016-07-31T00:20:12Z

Linking this to #161

Natim · 2017-04-25T12:55:53Z

> However, just changing "latin-1" for "UTF-8" breaks a few tests.

@quantum-omega Note that I tried using urllib.parse.unquote and urllib.parse.parse_qsl, as we do for Python 2, instead of the custom code, and it gives exactly the same results as changing the custom code in webob/compat.py to use utf-8 rather than latin-1.
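For reference, a short sketch of how the two standard-library functions mentioned here decode percent-escapes in Python 3. Both default to UTF-8; the path and query string are made-up inputs for illustration.

```python
from urllib.parse import parse_qsl, unquote

# unquote() decodes percent-escapes as UTF-8 by default
path = unquote('/foo%E2%88%A7bar/')

# parse_qsl() also decodes names and values as UTF-8 by default
pairs = parse_qsl('name=%E2%88%A7&x=1')

print(path == '/foo\u2227bar/')                    # True
print(pairs == [('name', '\u2227'), ('x', '1')])   # True
```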

digitalresistor · 2017-04-29T05:55:11Z

The problem is that in Python 3 it is not valid to provide byte values for a str by backslash escaping them:

Here's the output if you provide a b'' with that same escape sequence and then decode as 'UTF-8':

I will take a look and see whether the two functions still fail once the input is treated as a bytestring; I haven't tested that yet. It just occurred to me that this may be the reason for the issue.
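The difference described above can be sketched as follows. This is an illustration of general Python 3 behaviour, not the output originally posted in this comment: the escapes are accepted in a str literal, but they produce code points rather than bytes.

```python
# In a str literal, \xe2 \x88 \xa7 are three separate code points
as_str = 'foo\xe2\x88\xa7bar'

# In a bytes literal, the same escapes are three raw bytes
as_bytes = b'foo\xe2\x88\xa7bar'

# The three bytes form one UTF-8 character, U+2227
decoded = as_bytes.decode('utf-8')

print(len(as_str))                           # 9
print(len(decoded))                          # 7
print(as_str == decoded)                     # False
# latin-1 maps code points 0-255 one-to-one onto byte values
print(as_str.encode('latin-1') == as_bytes)  # True
```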

Natim · 2017-05-09T07:25:00Z

> The problem is that in Python 3 it is not valid to provide byte values for a str by backslash escaping them

If we use bytes the same way in both Python2 and Python3 we don't have the issue:

Python 2.7.13 (default, Jan 19 2017, 14:48:08) 
[GCC 6.3.0 20170118] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> url = 'http://localhost/foo\xe2\x88\xa7bar/'
>>> url.decode('utf-8')
u'http://localhost/foo\u2227bar/'
>>> print(url.decode('utf-8'))
http://localhost/foo∧bar/

Python 3.5.3 (default, Jan 19 2017, 14:11:04) 
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> url = b'http://localhost/foo\xe2\x88\xa7bar/'
>>> url.decode('utf-8')
'http://localhost/foo∧bar/'
>>> url = 'http://localhost/foo\u2227bar/'
>>> print(url)
http://localhost/foo∧bar/

quantum-omega · 2017-05-09T11:33:51Z

I haven't looked into this in quite a while, but that corresponds to what I would normally write to ensure Python 2/3 compatibility, so it makes sense to me that this would be the way to fix it. However, shouldn't URIs/URLs normally contain only ASCII, with the characters outside of that range URL-encoded, and with the encoding of the escaped values left to the interpreting program (usually UTF-8 but not necessarily)? Here, the Unicode byte sequence should not even appear in a valid URL, and instead of it we should have "%5c" or something like that. If we get to a point where we have UTF-8 in a URL, that means some decoding already took place, and probably not in the right spot.
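To illustrate the point about ASCII-only URLs, here is a small sketch using urllib.parse. The path is the one from the test case; everything else is an assumption made for illustration.

```python
from urllib.parse import quote, unquote

# Percent-encode a path containing U+2227; quote() uses UTF-8 by default
encoded = quote('/foo\u2227bar/')

# The encoded form is pure ASCII; this would raise UnicodeEncodeError otherwise
encoded.encode('ascii')

# The interpreting program chooses how to decode the escaped bytes
as_utf8 = unquote(encoded)                       # one character, U+2227
as_latin1 = unquote(encoded, encoding='latin-1')  # three characters

print(encoded)                 # /foo%E2%88%A7bar/
print(len(as_utf8))            # 9
print(len(as_latin1))          # 11
```

Decoding with latin-1 yields exactly the three-code-point string used in the original test case, which is one way to see how the two interpretations diverge.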

Natim mentioned this issue Apr 25, 2017

GET and POST behavior w.r.t. utf-8 decoding errors #161

Open

sileht closed this as completed Nov 14, 2019
