Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content sniffing implementation details #5

Open
Rob--W opened this issue Oct 22, 2017 · 9 comments
Open

Content sniffing implementation details #5

Rob--W opened this issue Oct 22, 2017 · 9 comments

Comments

@Rob--W
Copy link
Owner

Rob--W commented Oct 22, 2017

Last month I spent two weeks on implementing content sniffing, which was behaviorally identical to Firefox's implementation. Unfortunately, I lost the laptop before I pushed the changes, so I will document what's necessary in case anyone (maybe me?) is interested in implementing a content sniffer.

The full implementation (code and comments) consisted of about 3 - 5k lines of JS code (unit tests were written but not included in this count).

The implementation details are as follows (this is a brain dump from my recollection):

Other notes relevant for the implementation:

  • Content sniffing relies on up to 512 bytes of data, but the media sniffer may try to use more if available.
  • At least for text and HTML, Firefox will only display the response after 512 bytes of data have been written (or 1024, I don't remember).
  • For images and media, Firefox will switch to a special image/media document upon detecting the type (typically via magic bytes; for media sniffer more than magic bytes).
  • There is a draft for a specification at https://mimesniff.spec.whatwg.org/. This specification is close to Firefox's content sniffing. It does have any mention of media sniffing for application/octet-stream, and neither mentions the special application/x-unknown-content-type (this MIME is an artefact of Firefox's implementation; internally it represents the default value for a MIME type in a HTTP channel).
  • Character encoding should be respected/supported. For text/plain the UTF-8 and UTF-16 BOM can be used. For text/html, the content can be transcoded via the TextDecoder/TextEncoder APIs (except for UTF-16, which should not be used for HTML anyway).

Bugs in the webRequest.filterResponseData API that I haven't reported upstream (yet?):

  • If the Content-Type is application/x-unknown-content-type and the response is content-encoded, then the filtered response must also be encoded using the same type (e.g. gzipped) (for other types, e.g. text/html, the encoding is transparent, i.e. the value of the Content-Encoding header does not matter). The easiest way around this is to remove the Accept-Encoding request header or the Content-Encoding response header (or set it to "identity"). The more difficult way to get around this is to implement gzipping (and possibly other (obscure) encoding schemes such as deflate/brotli).
  • If a StreamFilter is closed, Firefox will always commit a navigation to a new document, even if no data was written to that StreamFilter, and even if the tab/frame has navigated to a different page. The only work-around that I could think of is to keep the StreamFilter open forever (yuck).
@def00111
Copy link
Contributor

Can you look here?
https://bugzilla.mozilla.org/show_bug.cgi?id=1287264

@Rob--W
Copy link
Owner Author

Rob--W commented Dec 15, 2017

@def00111 I looked (and I filed a new feature request at https://bugzilla.mozilla.org/show_bug.cgi?id=1425479). Why did you want me to look at that bug?

@def00111
Copy link
Contributor

Why did you want me to look at that bug?

I just want to have you look at this bug :)

Maybe, we can also expose nsIChannel.contentDispositionFilename [1]?

[1] https://dxr.mozilla.org/mozilla-central/rev/2386800ec051598ff4dd42da1118abcf05299fc1/netwerk/base/nsIChannel.idl#327

@def00111
Copy link
Contributor

I also have another idea.
Can we add the download [1] attribute value to webRequest.onBeforeRequest details [2]?
To get the filename from download attribute?
Like with Content-Disposition header in webRequest.onHeadersReceived [3]?

Look here please: https://github.com/def00111/always-preview/blob/master/content.js

[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a#attr-download
[2] https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/webRequest/onBeforeRequest#details
[3] https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/webRequest/onHeadersReceived

@Rob--W
Copy link
Owner Author

Rob--W commented Dec 15, 2017

Maybe, we can also expose nsIChannel.contentDispositionFilename [1]?

This extension is a very specialized use case. While having such a property would make the life of me as an extension developer easier, I don't think that that convenience outperforms the maintenance cost of exposing the info through the webRequest extension API. Especially since it can fully be implemented in JavaScript with minimal performance impact - https://github.com/Rob--W/open-in-browser/blob/05b80a3ce151737cfc7735eb1a714dfa84f3e3a5/extension/content-disposition.js

Can we add the download [1] attribute value to webRequest.onBeforeRequest details [2]?

This, on the other hand, could be a good reason to support the API enhancement. But...:
<a download> does not work for cross-origin resources, only same-origin resources. Furthermore, <a download> is more commonly ysed for JS-generated content (blob:/data:-URLs), which is not intercepted by my extension. So the value of an accessor for the value of <a download> is limited.

In the case of <a download> to a same-origin resource without Content-Disposition response header (which I presume is rare), users can just open the link in a new tab to get the dialog if they want to view it inline or trigger an Open in Browser dialog). In the worst case (e.g. if the link is not visible), then they can use the extension menu in the Tools menu to force the dialog to appear anyway.

I appreciate your comments, but I'd like to keep the comments here on-topic.
If you have more to say (unrelated to content sniffing), please open a new issue or continue via e-mail.

@def00111
Copy link
Contributor

is more commonly ysed for JS-generated content (blob:/data:-URLs), which is not intercepted by my extension. So the value of an accessor for the value of is limited.

This page: https://atpscan.global.hornetsecurity.com/safe_download.php?uri=aHR0cHM6Ly93d3cuc3dwLWJlcmxpbi5vcmcvZmlsZWFkbWluL2NvbnRlbnRzL3Byb2R1Y3RzL2FrdHVlbGwvMjAxNUEwM193Z24ucGRm&cd=MjAxNWEwM193Z24ucGRm&type=dat

@def00111
Copy link
Contributor

@def00111
Copy link
Contributor

Is this the same what firefox does?

@Rob--W
Copy link
Owner Author

Rob--W commented Dec 15, 2017

Can i use content-disposition.js [1] in my add-on?

Yes. When you add a commit in your repo, do link back to the original source in the commit description. Then in the future it will be easier for others to check whether the implementation is still up-to-date.

Is this the same what firefox does?

Yes, except for a few cases of malformed response headers (I don't think that you will ever find these in the wild). See the commit description and unit tests from 6f3bbb8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants