don't try to encode unicode as latin1 #11

evgeni · 2011-11-21T14:04:46Z

The fix for http://trac.pythonpaste.org/pythonpaste/ticket/251 introduced the following code:

    if isinstance(value, text_type) and not PY3:
        value = value.encode('latin-1')

As unicode >> latin1 (latin1 does not even contain a simple €) and I don't understand why headers should be latin-1 anyways (my understanding is they have to be ascii), the following code breaks badly:

>>> from webob import Response
>>> resp = Response()
>>> resp.etag = u'€'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "webob/descriptors.py", line 138, in fset
    hset(r, val)
  File "webob/descriptors.py", line 112, in fset
    value = value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 1: ordinal not in range(256)

I would suggest value = value.encode('ascii', 'ignore') here (just to be safe, noone should actually pass non-ascii here, but you never know).

The text was updated successfully, but these errors were encountered:

mcdonc · 2011-11-21T14:09:28Z

The particular encoding aside, why should WebOb ignore characters that someone is trying to attach to a header? I'd personally much rather that it raised an exception, like it does now, instead of silently ignoring an error.

evgeni · 2011-11-21T14:25:55Z

Then it should "just" accept ascii, not latin1 (read: your point is valid ;))

mcdonc · 2011-11-21T15:06:21Z

I think you're right, by my reading of rfcs 2616/822, e.g.

section 4.2 of RFC 2616:

HTTP header fields, which include general-header (section 4.5),
request-header (section 5.3), response-header (section 6.2), and
entity-header (section 7.1) fields, follow the same generic format as
that given in Section 3.1 of RFC 822 [9].

section 3.1.2 of RFC 822:

    Once a field has been unfolded, it may be viewed as being com-
    posed of a field-name followed by a colon (":"), followed by a
    field-body, and  terminated  by  a  carriage-return/line-feed.
    The  field-name must be composed of printable ASCII characters
    (i.e., characters that  have  values  between  33.  and  126.,
    decimal, except colon).  The field-body may be composed of any
    ASCII characters, except CR or LF.  (While CR and/or LF may be
    present  in the actual text, they are removed by the action of
    unfolding the field.)

But I have to plead ignorance a bit here. I wish there was some canonical source of information for this stuff that I felt like I trusted. ;-)

maluke · 2011-11-21T19:20:00Z

If we encode etags as ('ascii', 'ignore') we'll end up with etag matches when there were none. Also, "Errors should never pass silently". In the end, if the user sets such a header to non-latin-1 unicode they should either do the encoding you described themselves or get the error so that they know that they are doing it and the headers cannot encode such values.

On the other hand the encoding of headers indeed seems to be ASCII (and only a subset of that is allowed anyway), I'm not sure where the assumption of headers being latin-1 comes, I recall it being generally accepted when the WSGI for py3 was being discussed, but I'm not sure.

I'd suggest asking this on the mailing list to see if anyone knows better.

maluke · 2011-11-21T19:27:18Z

The spec has this bit:

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT may contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14]. http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2

I can't tell if what webob does is a correct interpretation of that rule or not. (We still don't want RFC 2047 anyway, because browsers don't seem to implement it IIRC).

This suggests that current code is correct: http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74

mcdonc · 2011-11-22T05:12:08Z

I concur with Sergey that the version of RFC2616 referenced above does seem to imply that header values can be latin-1 (iso-8859-1) encoded. It's defensible.

mcdonc · 2011-11-23T22:19:10Z

I had a discussion with Dan Connolly on IRC (he is a contributor to the spec), and although he regrets naming latin-1 as the TEXT default, he believes we are obligated to decode latin-1-encoded headers sent to us. For this reason, I personally think it's reasonable to also encode headers as latin-1 (as they should/will be decodeable by WebOb). Dan has some cognitive dissonance though, and claims it's likely we should fail when asked to encode anything that falls outside ASCII. I'm apt to just make it simple and always decode from and encode to latin-1, though.

dckc · 2011-11-23T22:20:52Z

Fair enough. Just don't fail silently.

mcdonc · 2011-11-23T22:27:39Z

Closing this issue; no fix required, I think. If you want to pass characters that fall outside latin-1 in headers, you'll have to encode them to ascii (or latin-1) at the application layer, and decode them there too.

lajujula mentioned this issue Nov 21, 2011

Yet Another Encoding Error #12

Closed

mcdonc closed this as completed Nov 23, 2011

nidico mentioned this issue Dec 29, 2012

Creation of proposals with non-ascii characters leads to 500 liqd/adhocracy#169

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

don't try to encode unicode as latin1 #11

don't try to encode unicode as latin1 #11

evgeni commented Nov 21, 2011

mcdonc commented Nov 21, 2011

evgeni commented Nov 21, 2011

mcdonc commented Nov 21, 2011

maluke commented Nov 21, 2011

maluke commented Nov 21, 2011

mcdonc commented Nov 22, 2011

mcdonc commented Nov 23, 2011

dckc commented Nov 23, 2011

mcdonc commented Nov 23, 2011

don't try to encode unicode as latin1 #11

don't try to encode unicode as latin1 #11

Comments

evgeni commented Nov 21, 2011

mcdonc commented Nov 21, 2011

evgeni commented Nov 21, 2011

mcdonc commented Nov 21, 2011

maluke commented Nov 21, 2011

maluke commented Nov 21, 2011

mcdonc commented Nov 22, 2011

mcdonc commented Nov 23, 2011

dckc commented Nov 23, 2011

mcdonc commented Nov 23, 2011