-
Notifications
You must be signed in to change notification settings - Fork 9.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core(gather-runner): error on non-HTML #11042
Conversation
// MIME types are case-insenstive | ||
const HTML_MIME_REGEX = /^text\/html$/i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types, "MIME types are case-insensitive but are traditionally written in lowercase". I am unsure if the check needs to be this rigorous/if a constant 'text/html' is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would mainRecord.resourceType === NetworkRequest.RESOURCE_TYPES.Document
work too? perhaps not for xml docs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just double checked and xml docs would indeed pass (no error) if we used that condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the mime type in the network records return Chrome's inferred mime type or just what was set on the header?
we should test it, but just read your comment a few lines below this and that's why "Add single comment" in GitHub reviews is almost always a bad idea 😆 since the failure mode is so drastic, I would probably still prefer to keep the no mimeType case passing (i.e. only reject documents that have an explicit non-HTML mimeType set)
// We want to error when the page is not of MIME type text/html | ||
if (docTypeError) return docTypeError; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed that the order of importance of errors was interstitial -> network -> docType, should this be the case?
|
||
it('fails when the page is not of MIME type text/html', () => { | ||
const url = 'http://the-page.com'; | ||
const mimeType = 'application/xml'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this test & unit testing in general, should I apply a rigorous run through of incorrect mimeTypes, or is just having one incorrect option good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this test & unit testing in general, should I apply a rigorous run through of incorrect mimeTypes, or is just having one incorrect option good enough?
It depends on the situation. The mimeType string part of the check is a relatively simple comparison, it's basically just string equality with no real corner cases, so this seems fine. Since it's going to be case insensitive, maybe add a test for that case?
lighthouse-core/lib/lh-error.js
Outdated
@@ -47,6 +47,8 @@ const UIStrings = { | |||
internalChromeError: 'An internal Chrome error occurred. Please restart Chrome and try re-running Lighthouse.', | |||
/** Error message explaining that fetching the resources of the webpage has taken longer than the maximum time. */ | |||
requestContentTimeout: 'Fetching resource content has exceeded the allotted time', | |||
/** Error message explaining that the webpage is non-HTML, so audits are ill-defined **/ | |||
docTypeInvalid: 'The webpage you have provided appears to be non-HTML', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docTypeInvalid: 'The webpage you have provided appears to be non-HTML', | |
docTypeInvalid: 'The page provided is not HTML', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed this
@@ -1114,26 +1161,30 @@ describe('GatherRunner', function() { | |||
navigationError = /** @type {LH.LighthouseError} */ (new Error('NAVIGATION_ERROR')); | |||
}); | |||
|
|||
it('passes when the page is loaded', () => { | |||
it('passes when the page is loaded and doc type is text/html', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for these tests, I think you can
- leave the test name the same
- prefer
mainRecord.mimeType = 'text/html'
over defining a variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed both of these
mainRecord.url = passContext.url; | ||
mainRecord.mimeType = mimeType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here: inline the mimeType
@@ -246,6 +272,9 @@ class GatherRunner { | |||
// Example: `DNS_FAILURE` is better than `NO_FCP`. | |||
if (networkError) return networkError; | |||
|
|||
// We want to error when the page is not of MIME type text/html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// We want to error when the page is not of MIME type text/html | |
// Error if page is not HTML. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed this
@@ -233,6 +258,7 @@ class GatherRunner { | |||
|
|||
const networkError = GatherRunner.getNetworkError(mainRecord); | |||
const interstitialError = GatherRunner.getInterstitialError(mainRecord, networkRecords); | |||
const docTypeError = GatherRunner.getDocTypeError(mainRecord); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structure of this function is rather strange–zooming out a bit, I'd expect the loadFailureMode
check to happen first; and for each of these errors to be created in order, but only if the former error check didn't hit. Incrementally, what you have here makes sense, I just think this entire function might benefit from a slight refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there is also definitely some redundant checks in my getDocTypeError function that don't need to be there if it was refactored. Should I try to refactor in this PR? It looks like Patrick wrote this function originally, I can also ping them about this if required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@connorjclark is the refactoring we're talking about here moving 1 line a few lines up and 1 more line a few lines down? I think we can manage it here 😉
be my guest! I certainly have no strong attachment to the specific order in which they're declared as long as we don't let them become multi-line statements :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moving 1 line a few lines up and 1 more line a few lines down?
yeah, that sounds like no problem, but I would be hesitant to do much deduplicating of content within the error checks. They work well today and it's easy to reason about them since they're more or less independent of each other, each just checking on their own particular error case.
Maybe pulling out the !mainRecord
check makes sense since they all end up having to check that and the type system will keep track of it already being done (since we can drop all the |undefined
s)? But, again, everything's working today and this function isn't in bad shape or anything, so @lemcardenas you can also feel free to leave it to a follow-up PR or to someone else in the future if it feels like the PR scope is growing too much.
https://google.github.io/eng-practices/review/developer/small-cls.html is one of my favorite docs :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll try to stay safe for this PR and won't change anything about the getPageLoadError function, because as mentioned below there may be more work needed for the original change to work.
// MIME types are case-insenstive | ||
const HTML_MIME_REGEX = /^text\/html$/i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would mainRecord.resourceType === NetworkRequest.RESOURCE_TYPES.Document
work too? perhaps not for xml docs...
* @param {LH.Artifacts.NetworkRequest|undefined} mainRecord | ||
* @return {LH.LighthouseError|undefined} | ||
*/ | ||
static getDocTypeError(mainRecord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably want to avoid the name docType
because of the associations with the related but not quite overlapping doctype
(since we're also going to be rejecting things without doctypes like PDFs, etc).
I can't imagine we're ever going to open up Lighthouse to other types of pages (so it's not really a generalized document type check, it's a specific check that the document is html), so what about something straightforward like getNonHtmlError()
? Kind of terrible sounding and open for bike shedding, but gets the point across :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I wondered why docType/doctype sounded so familiar to me. I can definitely change the name to what you recommended
if (!mainRecord) return undefined; | ||
|
||
// If the main document failed, this error case is undefined, let other cases handle it. | ||
if (mainRecord.failed) return undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably fine to combine these two cases into a single if (!mainRecord || mainRecord.failed) return;
since they aren't super important to this check in particular, but could probably just drop the mainRecord.failed
check completely, since this function isn't so much a check of a valid main document than a check that it's not an invalid media type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out, I ended up just dropping the mainRecord.failed check
@@ -233,6 +258,7 @@ class GatherRunner { | |||
|
|||
const networkError = GatherRunner.getNetworkError(mainRecord); | |||
const interstitialError = GatherRunner.getInterstitialError(mainRecord, networkRecords); | |||
const docTypeError = GatherRunner.getDocTypeError(mainRecord); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moving 1 line a few lines up and 1 more line a few lines down?
yeah, that sounds like no problem, but I would be hesitant to do much deduplicating of content within the error checks. They work well today and it's easy to reason about them since they're more or less independent of each other, each just checking on their own particular error case.
Maybe pulling out the !mainRecord
check makes sense since they all end up having to check that and the type system will keep track of it already being done (since we can drop all the |undefined
s)? But, again, everything's working today and this function isn't in bad shape or anything, so @lemcardenas you can also feel free to leave it to a follow-up PR or to someone else in the future if it feels like the PR scope is growing too much.
https://google.github.io/eng-practices/review/developer/small-cls.html is one of my favorite docs :)
|
||
it('fails when the page is not of MIME type text/html', () => { | ||
const url = 'http://the-page.com'; | ||
const mimeType = 'application/xml'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this test & unit testing in general, should I apply a rigorous run through of incorrect mimeTypes, or is just having one incorrect option good enough?
It depends on the situation. The mimeType string part of the check is a relatively simple comparison, it's basically just string equality with no real corner cases, so this seems fine. Since it's going to be case insensitive, maybe add a test for that case?
Looks like maybe it's just what was on the header, or maybe Chrome does some inference on top of that? Maybe we'll need to combine This is a query of the June 2020 HTTPArchive run for unique main document mimeTypes and an example requested/finalUrl for each. Out of 5.7 million LH runs (results with no network requests or fatal load errors eliminated): (note that any of these URLs could be very NSFW) edit: updated, see below
Old not-quite-right query
|
Actually, some of these are wrong and are the mimeType of the e.g. I can put the better table into the comment above, but it's mostly just a subset of the above, with far more cases correctly under Regardless, looks like this string can be whatever the server wants it to be. |
nice query!
ehhhh. im happy to yell at Because we have the browser sniff, it complicates things. my assumption is that the broad categories we have
given that we're talking about 0.015% of pages in HA.. I think we can afford to just flag everything that's without a .mimeType of |
I can kind of get behind this, but this will be very fatal ( |
lighthouse-core/lib/lh-error.js
Outdated
/* Used when the page is non-HTML. */ | ||
INVALID_DOC_TYPE: { | ||
code: 'INVALID_DOC_TYPE', | ||
message: UIStrings.docTypeInvalid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to also add lhrRuntimeError: true
here because we want the error to bubble all the way to the top and show up in the LHR runtimeError
entry and exit with a non-zero exit code when running the CLI.
@connorjclark's examples in #9245 show that things mostly go wrong but not completely, so we need to be extra loud to get the attention of users doing automated runs of LH that they're accidentally auditing a page they either didn't mean to or have misconfigured.
Since it'll be lhrRuntimeError: true
(and so can also appear in the LHR that PSI serves), it'll also need to be added as an error entry in the proto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll make sure to fix this!
lighthouse-core/lib/lh-error.js
Outdated
@@ -47,6 +47,8 @@ const UIStrings = { | |||
internalChromeError: 'An internal Chrome error occurred. Please restart Chrome and try re-running Lighthouse.', | |||
/** Error message explaining that fetching the resources of the webpage has taken longer than the maximum time. */ | |||
requestContentTimeout: 'Fetching resource content has exceeded the allotted time', | |||
/** Error message explaining that the webpage is non-HTML, so audits are ill-defined **/ | |||
docTypeInvalid: 'The page provided is not HTML', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should parameterize this message, so the user knows what to fix? Something like
docTypeInvalid: 'The page provided is not HTML', | |
docTypeInvalid: 'The page provided is not HTML (served as MIME type {mimeType}).', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea, my initial message had some mimeType information but wasn't as easy to read as this, I'll change it!
Do you think that looking directly at Content-Type headers will help in reducing some of the false negatives / be better overall? When I first asked about choosing between mimeType/Content-Type/extension approaches, Paul linked this code which mentions In the original issue for this PR, Connor mentioned that
Should we instead then have deny/allow lists to have more coverage? |
contenttype is provided in the response headers. since the browser can choose a mimetype even if a contenttype isn't provided, we should just go with mimetype. it's the signal that's more in-tune with how the browser read the content. let's keep this check simple: if the mimeType isn't that's it. :) |
Thanks for the explanation! Sounds good to me :) |
|
||
// mimeType is determined by the browser, we assume Chrome is determining mimeType correctly, | ||
// independently of 'Content-Type' response headers, and always sending mimeType if well-formed. | ||
if (mainRecord.mimeType || mainRecord.mimeType === '') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mainRecord.mimeType
is initialized as an empty string and set from response.mimeType
which apparently is always a string, never null or undefined. So I think this can be refactored to just:
if (!HTML_MIME_REGEX.test(mainRecord.mimeType)) {
return new LHError(LHError.errors.NON_HTML, {mimeType: mainRecord.mimeType});
}
(removing the outer if statement)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for clarifying, I was unsure if the string from the response could be overwritten in the browser with a non-string / null / undefined, I'll make sure to fix this!
lighthouse-core/lib/lh-error.js
Outdated
* @description Error message explaining that the webpage is non-HTML, so audits are ill-defined. | ||
* @example {application/xml} mimeType | ||
* */ | ||
nonHtml: 'The page provided is not HTML (served as MIME type {mimeType}).', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: notHtml
instead? and NOT_HTML
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the nonHtml name scheme from the function but this makes sense since the function isn't really in context here, I'll change this!
* @return {LH.LighthouseError|undefined} | ||
*/ | ||
static getNonHtmlError(mainRecord) { | ||
// MIME types are case-insenstive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's no test for the case insensitivity
does chrome perhaps normalize this anyhow? you can hack the static-server.js
to return any mime type you want and run yarn static-server
to verify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I couldn't find information online as to whether chrome automatically makes mimetypes lowercase regardless of the http response, i'll check with the method you mentioned and let you know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok update so I replaced all instances of 'text/html' with 'TEXT/HTML' in the static-server Content-Type header responses & confirmed in devtools that the Content-Type header showed up as 'TEXT/HTML'. When I ran a lighthouse run against a page hosted on static-server, I got no error and the mimeType accessible from mainRecord was always lowercase (text/html), so it seems that chrome automatically makes its mimeType lowercase!
So I am going to remove the case insensitivity modifier on the RegEx, and I think I'll replace it with a direct string comparison with a const 'text/html' since thats likely more efficient and seems sufficient based on the test above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! yeah a direct string comparison would be nice here. to be clear, the performance was never an issue (it's negligible), just the desire to keep things as complex as necessary (aka as simple as possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a comment that chrome normalizes the mime type would be good too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! i'll make sure to add the comment right now
…tement in getNonHtmlError
Co-authored-by: Connor Clark <cjamcl@google.com>
Summary
This PR adds a function in gather-runner.js, getNonHtmlError, that brings up an error if we attempt to run gatherers on non-HTML webpages.
Related Issues/PRs
#9245