Incorrect content-type for raw pod #258

Closed
rwstauner opened this Issue · 7 comments

2 participants

@rwstauner
Owner

Looking at a module on metacpan and clicking the raw source link takes you straight to the api. For example: http://api.metacpan.org/source/ARGRATH/Pod-L10N-0.07/lib/Pod/L10N/Format.pod

That doc looks terrible in a browser because it's not utf-8 encoded,
however our headers say it is:

HTTP/1.1 200 OK
Server: nginx/0.7.67
Date: Fri, 29 Mar 2013 05:13:53 GMT
Content-Type: text/plain; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Content-Encoding: gzip

The current behavior is wrong because we say it's UTF-8 when it's not.
So:

  • We could detect =encoding\s+(\S+) and alter the charset header. This would be inconsistent with other docs that come from the api, but user-agents are certainly capable of dealing with a per-response encoding.
  • We could convert it to utf-8 but then it wouldn't really be "raw".
  • We could set the content-type for raw files to application/octet-stream but then the browser wouldn't display it at all.

Any thoughts?

This issue could probably be applied to any raw file.
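For illustration only, the first option above (detecting `=encoding\s+(\S+)` and altering the charset header) could be sketched roughly as follows in Python. This is not the API's actual code: the function name `content_type_for_raw` and the UTF-8 fallback are assumptions, and the declared name is passed through as written, even though POD encoding names are not guaranteed to be valid HTTP charset labels.

```python
import codecs
import re

# POD directive such as "=encoding latin1" at the start of a line.
ENCODING_RE = re.compile(rb"^=encoding[ \t]+(\S+)", re.MULTILINE)


def content_type_for_raw(data: bytes, fallback: str = "UTF-8") -> str:
    """Build a Content-Type value from the first =encoding directive, if any.

    Hypothetical helper: falls back to `fallback` when there is no
    directive or when the declared name is not a codec Python knows.
    """
    charset = fallback
    match = ENCODING_RE.search(data)
    if match:
        try:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # only checks that the name is recognised
            charset = name
        except (LookupError, ValueError):
            pass  # unknown or non-ASCII name: keep the fallback
    return f"text/plain; charset={charset}"


print(content_type_for_raw(b"=pod\n\n=encoding euc-jp\n\n=head1 NAME\n"))
print(content_type_for_raw(b"print 1;\n"))
```

Note that this only looks at the first directive in the file, so it shares the weakness raised later in the thread: it says nothing about the encoding of the code around the pod.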

@monken
Owner

Refs CPAN-API/metacpan-web@502a86f

I don't feel like detecting the encoding of a file by looking at =encoding is correct. Imagine a module that has both use utf8; and =encoding jpn. The source code might contain utf8 while the pod is jpn encoded. No way to make the right decision.
It just feels too magical and I'd rather see a consistent (i.e. utf8) response than something unreliable. I think we have to ask, what the raw response is being used for. And I'd say people are interested in the byte sequence (create diffs, download, etc).
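A small Python sketch of the mixed-encoding problem described above, assuming a made-up file whose code section is UTF-8 and whose pod section is EUC-JP (the comment says `jpn`; EUC-JP is just one concrete stand-in). No single charset label decodes both halves correctly:

```python
# Hypothetical file: UTF-8 source code followed by EUC-JP encoded pod.
code_part = 'use utf8;\nmy $s = "é";\n'.encode("utf-8")
pod_part = "=encoding euc-jp\n\n=head1 日本語\n".encode("euc_jp")
raw = code_part + b"\n" + pod_part

# Labelled as UTF-8, the pod bytes are invalid.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Labelled as EUC-JP, the pod is fine but the string literal is mangled.
as_eucjp = raw.decode("euc_jp", errors="replace")

print("valid as UTF-8:", utf8_ok)
print("pod readable as EUC-JP:", "日本語" in as_eucjp)
print("code intact as EUC-JP:", "é" in as_eucjp)
```

Whichever charset the header names, one part of the file is shown incorrectly, while the raw bytes themselves stay usable for diffs and downloads.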

@monken monken referenced this issue from a commit in CPAN-API/metacpan-web
@rwstauner rwstauner Honor =encoding directive when decoding raw resonses 502a86f
@rwstauner
Owner

Are you in favor of setting the content-type to octet-stream then?

@monken
Owner

what about not providing a charset at all?

@rwstauner
Owner

I considered that too... I'd have to look up how that's supposed to be interpreted

@monken
Owner

I guess it's up to the browser then.
My point is that source code is supposed to be ascii, or utf8 if we talk about perl code. So the /source endpoint should naturally provide an encoding that allows viewing the source, not the documentation. We have the /pod endpoint for displaying documentation in the correct encoding (if provided).

@rwstauner
Owner

Yeah, that makes sense.
That's an even better argument than "the file could be mixed" (which is sufficiently valid).
There is the encoding pragma for writing perl in other encodings but that's been deprecated.
It's a UTF-8 world now.

@rwstauner
Owner

I guess we should just leave it the way it is (since it will be right for most cases).

tl;dr

http://www.w3.org/International/O-HTTP-charset

HTTP 1.1 says that the default charset is ISO-8859-1. But there are too many unlabeled documents in other encodings, so browsers use the reader's preferred encoding when there is no explicit charset parameter.

rfc

http://www.w3.org/Protocols/rfc2068/rfc2068.txt

   The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value.

   Some HTTP/1.0 software has interpreted a Content-Type header without
   charset parameter incorrectly to mean "recipient should guess."
   Senders wishing to defeat this behavior MAY include a charset
   parameter even when the charset is ISO-8859-1 and SHOULD do so when
   it is known that it will not confuse the recipient.

   Unfortunately, some older HTTP/1.0 clients did not deal properly with
   an explicit charset parameter. HTTP/1.1 recipients MUST respect the
   charset label provided by the sender; and those user agents that have
   a provision to "guess" a charset MUST use the charset from the
   content-type field if they support that charset, rather than the
   recipient's preference, when initially displaying a document.
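As a rough illustration of the rule quoted above, here is how a strictly RFC 2068 conforming recipient might pick the charset for a `text/*` response. The helper name `effective_charset` is made up for this sketch, and real browsers do not behave this way when the parameter is missing (they guess, as the W3C note says).

```python
from email.message import Message


def effective_charset(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset a strict RFC 2068 recipient would use.

    An explicit charset parameter wins; otherwise the RFC's default
    for text/* media types applies.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased parameter, or None.
    return msg.get_content_charset() or default


print(effective_charset("text/plain; charset=UTF-8"))
print(effective_charset("text/plain"))
```

So dropping the charset parameter would not make the response "encoding-neutral" on paper: by the letter of the RFC it would claim ISO-8859-1, and in practice it hands the decision to each browser.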
@rwstauner rwstauner closed this