Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't try to encode unicode as latin1 #11

Closed
evgeni opened this issue Nov 21, 2011 · 9 comments
Closed

don't try to encode unicode as latin1 #11

evgeni opened this issue Nov 21, 2011 · 9 comments

Comments

@evgeni
Copy link

evgeni commented Nov 21, 2011

The fix for http://trac.pythonpaste.org/pythonpaste/ticket/251 introduced the following code:

    if isinstance(value, text_type) and not PY3:
        value = value.encode('latin-1')

As unicode >> latin1 (latin1 does not even contain a simple €) and I don't understand why headers should be latin-1 anyways (my understanding is they have to be ascii), the following code breaks badly:

>>> from webob import Response
>>> resp = Response()
>>> resp.etag = u'€'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "webob/descriptors.py", line 138, in fset
    hset(r, val)
  File "webob/descriptors.py", line 112, in fset
    value = value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 1: ordinal not in range(256)

I would suggest value = value.encode('ascii', 'ignore') here (just to be safe, noone should actually pass non-ascii here, but you never know).

@mcdonc
Copy link
Member

mcdonc commented Nov 21, 2011

The particular encoding aside, why should WebOb ignore characters that someone is trying to attach to a header? I'd personally much rather that it raised an exception, like it does now, instead of silently ignoring an error.

@evgeni
Copy link
Author

evgeni commented Nov 21, 2011

Then it should "just" accept ascii, not latin1 (read: your point is valid ;))

@mcdonc
Copy link
Member

mcdonc commented Nov 21, 2011

I think you're right, by my reading of rfcs 2616/822, e.g.

section 4.2 of RFC 2616:

HTTP header fields, which include general-header (section 4.5),
request-header (section 5.3), response-header (section 6.2), and
entity-header (section 7.1) fields, follow the same generic format as
that given in Section 3.1 of RFC 822 [9].

section 3.1.2 of RFC 822:

    Once a field has been unfolded, it may be viewed as being com-
    posed of a field-name followed by a colon (":"), followed by a
    field-body, and  terminated  by  a  carriage-return/line-feed.
    The  field-name must be composed of printable ASCII characters
    (i.e., characters that  have  values  between  33.  and  126.,
    decimal, except colon).  The field-body may be composed of any
    ASCII characters, except CR or LF.  (While CR and/or LF may be
    present  in the actual text, they are removed by the action of
    unfolding the field.)

But I have to plead ignorance a bit here. I wish there was some canonical source of information for this stuff that I felt like I trusted. ;-)

@maluke
Copy link
Contributor

maluke commented Nov 21, 2011

If we encode etags as ('ascii', 'ignore') we'll end up with etag matches when there were none. Also, "Errors should never pass silently". In the end, if the user sets such a header to non-latin-1 unicode they should either do the encoding you described themselves or get the error so that they know that they are doing it and the headers cannot encode such values.

On the other hand the encoding of headers indeed seems to be ASCII (and only a subset of that is allowed anyway), I'm not sure where the assumption of headers being latin-1 comes, I recall it being generally accepted when the WSGI for py3 was being discussed, but I'm not sure.

I'd suggest asking this on the mailing list to see if anyone knows better.

@maluke
Copy link
Contributor

maluke commented Nov 21, 2011

The spec has this bit:

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT may contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14]. http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2

I can't tell if what webob does is a correct interpretation of that rule or not. (We still don't want RFC 2047 anyway, because browsers don't seem to implement it IIRC).

This suggests that current code is correct: http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74

@mcdonc
Copy link
Member

mcdonc commented Nov 22, 2011

I concur with Sergey that the version of RFC2616 referenced above does seem to imply that header values can be latin-1 (iso-8859-1) encoded. It's defensible.

@mcdonc
Copy link
Member

mcdonc commented Nov 23, 2011

I had a discussion with Dan Connolly on IRC (he is a contributor to the spec), and although he regrets naming latin-1 as the TEXT default, he believes we are obligated to decode latin-1-encoded headers sent to us. For this reason, I personally think it's reasonable to also encode headers as latin-1 (as they should/will be decodeable by WebOb). Dan has some cognitive dissonance though, and claims it's likely we should fail when asked to encode anything that falls outside ASCII. I'm apt to just make it simple and always decode from and encode to latin-1, though.

@dckc
Copy link

dckc commented Nov 23, 2011

Fair enough. Just don't fail silently.

@mcdonc
Copy link
Member

mcdonc commented Nov 23, 2011

Closing this issue; no fix required, I think. If you want to pass characters that fall outside latin-1 in headers, you'll have to encode them to ascii (or latin-1) at the application layer, and decode them there too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants