url quote/unquote with python 3 broken #164

Closed
sileht opened this issue Oct 6, 2014 · 7 comments

@sileht

sileht commented Oct 6, 2014

Hi,

I have some trouble with the URL handling of webob in Python 3. After some investigation, it seems
that the issue is located in webob.compat.url_unquote.

A simple test case to reproduce the issue, which works in Python 2 but not in Python 3:

import unittest


class urlquote_Tests(unittest.TestCase):
    def test_url_unquote(self):
        from webob.compat import url_quote
        from webob.compat import url_unquote
        url = 'http://localhost/foo\xe2\x88\xa7bar/'
        val = url_unquote(url_quote(url))
        self.assertEqual(url, val)

Regards,

@quantum-omega
Contributor

The problem surfaces in url_unquote when, after successfully parsing the percent-encoded string and returning the right byte sequence, .decode('latin-1') is called on the result. The function does this in order to accept a Python 3 str object and return a str object. However, the character \x88 in the submitted test case is not part of ISO-8859-1 (latin1).
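The decode step described above can be reproduced with the standard library alone. This is a minimal sketch, not webob's actual code: it assumes that quoting behaves like urllib.parse.quote (which encodes a str as UTF-8 before percent-encoding) and that unquoting recovers the bytes first and decodes them afterwards. Note that Python's latin-1 codec maps every byte to a code point, so the decode in this sketch does not raise; it returns different text than went in.

```python
from urllib.parse import quote, unquote_to_bytes

# In Python 3 this str holds the code points U+00E2, U+0088, U+00A7,
# not the three UTF-8 bytes of a single character.
url = 'http://localhost/foo\xe2\x88\xa7bar/'

quoted = quote(url)            # str is encoded as UTF-8, then percent-encoded
raw = unquote_to_bytes(quoted) # percent-decoding back to the UTF-8 bytes

as_latin1 = raw.decode('latin-1')  # one character per byte: text grows
as_utf8 = raw.decode('utf-8')      # reverses the encoding used by quote()

print(as_latin1 == url)  # False
print(as_utf8 == url)    # True
```

Under these assumptions, the round trip only succeeds when the decode uses the same encoding that the quoting step used.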

I didn't investigate RFC 3986 much, but it seems to indicate that non-ASCII characters, when they are percent-encoded, should be translated to their UTF-8 representation. In turn, that means that when we decode a URL, we should assume that it was encoded in UTF-8. However, just changing "latin-1" for "UTF-8" breaks a few tests. I'm inclined to think the assumptions those tests make about the expected strings are wrong, but I'd like a second opinion on that.
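To make the UTF-8 convention concrete, here is a small sketch that percent-encodes a character by hand and compares the result with urllib.parse.quote. The helper name percent_encode_utf8 is hypothetical and exists only for this illustration.

```python
from urllib.parse import quote


def percent_encode_utf8(text):
    """Percent-encode every UTF-8 byte of text (hypothetical helper)."""
    return ''.join('%{:02X}'.format(byte) for byte in text.encode('utf-8'))


wedge = '\u2227'  # the LOGICAL AND character from the test URL

by_hand = percent_encode_utf8(wedge)
by_stdlib = quote(wedge)

print(by_hand)    # %E2%88%A7
print(by_stdlib)  # %E2%88%A7
```

The helper encodes every byte, including ASCII ones, so it only matches quote() for input made entirely of non-ASCII characters.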

@digitalresistor
Member

Here is another report for stuff related to decode: #195

@digitalresistor
Member

Linking this to #161

@Natim

Natim commented Apr 25, 2017

However, just changing "latin-1" for "UTF-8" breaks a few tests.

@quantum-omega Note that I tried using urllib.parse.unquote and urllib.parse.parse_qsl instead of the custom code, as we do for Python 2, and it gives exactly the same results as changing the custom code in webob/compat.py to use utf-8 rather than latin-1.
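For reference, this is how the two standard-library functions behave on their own. It is only a sketch of the stdlib defaults, with a made-up query string; it does not show what webob/compat.py does.

```python
from urllib.parse import parse_qsl, unquote

# Hypothetical query string carrying the UTF-8 percent-encoding of U+2227.
query = 'path=foo%E2%88%A7bar&page=1'

# Both functions decode as UTF-8 unless told otherwise.
pairs_utf8 = parse_qsl(query)
pairs_latin1 = parse_qsl(query, encoding='latin-1')

single_utf8 = unquote('foo%E2%88%A7bar')
single_latin1 = unquote('foo%E2%88%A7bar', encoding='latin-1')

print(pairs_utf8)   # [('path', 'foo∧bar'), ('page', '1')]
print(single_utf8)  # foo∧bar
```

Passing encoding='latin-1' turns the three encoded bytes into three separate characters instead of one.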

@digitalresistor
Member

The problem is that in Python 3 it is not valid to provide byte values for a str by backslash escaping them:

[Screenshot, 2017-04-28 23:44:58: image not preserved]

Here's the output if you provide a b'' with that same escape sequence and then decode as 'UTF-8':

[Screenshot, 2017-04-28 23:46:10: image not preserved]

I will take a look and see whether the two functions still fail once we account for the fact that it's a bytestring; I haven't tested that yet. It just occurred to me that this may be a reason for the issue.
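The difference between the two literal types can be sketched as follows. This is an illustration of Python 3 semantics, not a transcription of the screenshots above.

```python
# In a str literal, \xNN names the code point U+00NN.
text = 'foo\xe2\x88\xa7bar'
# In a bytes literal, \xNN is the raw byte value NN.
data = b'foo\xe2\x88\xa7bar'

code_points = [hex(ord(ch)) for ch in text[3:6]]
decoded = data.decode('utf-8')

print(code_points)   # ['0xe2', '0x88', '0xa7']
print(len(text))     # 9: three separate characters in the middle
print(len(decoded))  # 7: the three bytes form one character, U+2227

# The str only matches the bytes if it is encoded as latin-1,
# which maps code points 0-255 to the same byte values.
print(text.encode('latin-1') == data)  # True
```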

@Natim

Natim commented May 9, 2017

The problem is that in Python 3 it is not valid to provide byte values for a str by backslash escaping them

If we use bytes the same way in both Python 2 and Python 3, we don't have the issue:

Python 2.7.13 (default, Jan 19 2017, 14:48:08) 
[GCC 6.3.0 20170118] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> url = 'http://localhost/foo\xe2\x88\xa7bar/'
>>> url.decode('utf-8')
u'http://localhost/foo\u2227bar/'
>>> print(url.decode('utf-8'))
http://localhost/foo∧bar/

Python 3.5.3 (default, Jan 19 2017, 14:11:04) 
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> url = b'http://localhost/foo\xe2\x88\xa7bar/'
>>> url.decode('utf-8')
'http://localhost/foo∧bar/'
>>> url = 'http://localhost/foo\u2227bar/'
>>> print(url)
http://localhost/foo∧bar/
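Following the same idea, the original test case could be rewritten so that the input is either a bytes literal or a str containing the real character. This sketch uses urllib.parse.quote and urllib.parse.unquote as stand-ins; whether webob.compat.url_quote and url_unquote behave identically is an assumption that is not verified here.

```python
import unittest
from urllib.parse import quote, unquote


class UrlQuoteRoundTrip(unittest.TestCase):
    def test_bytes_input(self):
        # Raw UTF-8 bytes, written the same way in Python 2 and Python 3.
        url = b'http://localhost/foo\xe2\x88\xa7bar/'
        val = unquote(quote(url))
        self.assertEqual(url.decode('utf-8'), val)

    def test_str_input(self):
        # The actual character U+2227 instead of byte-style escapes.
        url = 'http://localhost/foo\u2227bar/'
        val = unquote(quote(url))
        self.assertEqual(url, val)
```

Both tests can be run with python -m unittest on the file that contains them.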

@quantum-omega
Contributor

quantum-omega commented May 9, 2017 via email

@sileht sileht closed this as completed Nov 14, 2019