better handling of strange url #31

Open
flyingeek opened this issue Jul 22, 2015 · 2 comments

Comments

@flyingeek

Hello,

I am using Django and URLObject, and I get a UnicodeDecodeError when URLObject is given some invalid URLs (coming from search engines).

>>> from urlobject.urlobject import QueryString
>>> qs = QueryString(u's=glaci%E8re')
>>> qs.list
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Eric/Python/Env/lpdc/lib/python2.7/site-packages/urlobject/query_string.py", line 35, in list
    value = qs_decode(value)
  File "/Users/Eric/Python/Env/lpdc/lib/python2.7/site-packages/urlobject/query_string.py", line 138, in _qs_decode_py2
    return urllib.unquote_plus(s).decode('utf-8')
  File "/Users/Eric/Python/Env/lpdc/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 5: invalid continuation byte

A partial solution would be:

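def _qs_decode_py2(s):
    """Unquote unicode or str using query string rules."""
    # Work on a byte string so the percent-escapes unquote to raw bytes.
    if isinstance(s, unicode):
        s = s.encode('utf-8')
    # Replace undecodable bytes with U+FFFD instead of raising UnicodeDecodeError.
    return urllib.unquote_plus(s).decode('utf-8', errors='replace')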

But I don't know how to do the same for Python 3.

For information, Django also uses the 'replace' error handler when handling the query string.
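
For Python 3, a minimal sketch of the same idea could look like the following. It relies on the standard library's `urllib.parse.unquote_plus`, which accepts `encoding` and `errors` arguments (and already defaults to `errors='replace'`). The function name `qs_decode_py3` is hypothetical and not taken from URLObject's code.

```python
from urllib.parse import unquote_plus


def qs_decode_py3(s):
    """Unquote a query string component, replacing undecodable bytes.

    Hypothetical helper name; only the stdlib call is real.
    """
    # Bytes that are not valid UTF-8 become U+FFFD instead of raising.
    return unquote_plus(s, encoding='utf-8', errors='replace')


if __name__ == '__main__':
    print(repr(qs_decode_py3('glaci%E8re')))
    print(repr(qs_decode_py3('glaci%C3%A8re')))
```

With this approach the Latin-1 byte is lost (it becomes the replacement character), but valid UTF-8 input is decoded unchanged and no exception is raised.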

@agriffis
Collaborator

That %E8 byte is a Latin-1 (ISO-8859-1) "è".

This seems related: https://stackoverflow.com/questions/5366007/why-does-the-encodings-of-a-url-and-the-query-string-part-differ

We could do errors='replace' or we could come up with a (possibly optional) way to attempt Latin-1 if UTF-8 decoding fails.
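
The Latin-1 fallback could be sketched roughly as below, here in Python 3. This is only an illustration under assumptions: the function name `qs_decode_with_fallback` and its `fallback` parameter are hypothetical and not part of URLObject. It uses the standard library's `urllib.parse.unquote_to_bytes` to get the raw bytes, tries UTF-8 first, and only then falls back.

```python
from urllib.parse import unquote_to_bytes


def qs_decode_with_fallback(s, fallback='latin-1'):
    """Unquote a query string component, trying UTF-8 then a fallback.

    Hypothetical helper; the name and the `fallback` parameter are
    assumptions made for this sketch.
    """
    # Query string rules: a literal '+' means a space. Do this before
    # unquoting so that an escaped %2B still decodes to '+'.
    raw = unquote_to_bytes(s.replace('+', ' '))
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode(fallback)


if __name__ == '__main__':
    print(qs_decode_with_fallback('glaci%E8re'))
    print(qs_decode_with_fallback('glaci%C3%A8re'))
```

One caveat with this design: a Latin-1 byte sequence that happens to be valid UTF-8 will be decoded as UTF-8, so the fallback is a heuristic rather than a guarantee.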

@flyingeek
Author

This Stack Overflow link is very interesting, thanks for sharing.
