-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
don't try to encode unicode as latin1 #11
Comments
The particular encoding aside, why should WebOb ignore characters that someone is trying to attach to a header? I'd personally much rather that it raised an exception, like it does now, instead of silently ignoring an error. |
Then it should "just" accept ascii, not latin1 (read: your point is valid ;)) |
I think you're right, by my reading of rfcs 2616/822, e.g. section 4.2 of RFC 2616:
section 3.1.2 of RFC 822:
But I have to plead ignorance a bit here. I wish there was some canonical source of information for this stuff that I felt like I trusted. ;-) |
If we encode etags as ('ascii', 'ignore') we'll end up with etag matches when there were none. Also, "Errors should never pass silently". In the end, if the user sets such a header to non-latin-1 unicode they should either do the encoding you described themselves or get the error so that they know that they are doing it and the headers cannot encode such values. On the other hand the encoding of headers indeed seems to be ASCII (and only a subset of that is allowed anyway), I'm not sure where the assumption of headers being latin-1 comes, I recall it being generally accepted when the WSGI for py3 was being discussed, but I'm not sure. I'd suggest asking this on the mailing list to see if anyone knows better. |
The spec has this bit: The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT may contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14]. http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 I can't tell if what webob does is a correct interpretation of that rule or not. (We still don't want RFC 2047 anyway, because browsers don't seem to implement it IIRC). This suggests that current code is correct: http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74 |
I concur with Sergey that the version of RFC2616 referenced above does seem to imply that header values can be latin-1 (iso-8859-1) encoded. It's defensible. |
I had a discussion with Dan Connolly on IRC (he is a contributor to the spec), and although he regrets naming latin-1 as the TEXT default, he believes we are obligated to decode latin-1-encoded headers sent to us. For this reason, I personally think it's reasonable to also encode headers as latin-1 (as they should/will be decodeable by WebOb). Dan has some cognitive dissonance though, and claims it's likely we should fail when asked to encode anything that falls outside ASCII. I'm apt to just make it simple and always decode from and encode to latin-1, though. |
Fair enough. Just don't fail silently. |
Closing this issue; no fix required, I think. If you want to pass characters that fall outside latin-1 in headers, you'll have to encode them to ascii (or latin-1) at the application layer, and decode them there too. |
The fix for http://trac.pythonpaste.org/pythonpaste/ticket/251 introduced the following code:
As unicode >> latin1 (latin1 does not even contain a simple €) and I don't understand why headers should be latin-1 anyways (my understanding is they have to be ascii), the following code breaks badly:
I would suggest
value = value.encode('ascii', 'ignore')
here (just to be safe, noone should actually pass non-ascii here, but you never know).The text was updated successfully, but these errors were encountered: