core(gather-runner): error on non-HTML #11042

Merged
merged 18 commits into from
Jul 8, 2020

lemcardenas
Contributor

@lemcardenas lemcardenas commented Jun 30, 2020

Summary

This PR adds a function, getNonHtmlError, to gather-runner.js that returns an error if we attempt to run gatherers on a non-HTML page.

Related Issues/PRs

#9245

Comment on lines 224 to 225
// MIME types are case-insensitive
const HTML_MIME_REGEX = /^text\/html$/i;
Contributor Author

According to https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types, "MIME types are case-insensitive but are traditionally written in lowercase". I am unsure if the check needs to be this rigorous/if a constant 'text/html' is good enough.

Collaborator

would mainRecord.resourceType === NetworkRequest.RESOURCE_TYPES.Document work too? perhaps not for xml docs...

Contributor Author

just double checked and xml docs would indeed pass (no error) if we used that condition

Collaborator

@patrickhulce patrickhulce Jul 1, 2020

does the mime type in the network records return Chrome's inferred mime type or just what was set on the header?

we should test it, but just read your comment a few lines below this and that's why "Add single comment" in GitHub reviews is almost always a bad idea 😆 since the failure mode is so drastic, I would probably still prefer to keep the no mimeType case passing (i.e. only reject documents that have an explicit non-HTML mimeType set)
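
The more conservative variant described here (only reject documents that carry an explicit non-HTML mimeType, and let an empty one pass) could be sketched roughly as follows. This is an illustration, not the merged code: `isExplicitlyNonHtml` is a hypothetical helper, and the records are plain objects standing in for Lighthouse network records.

```javascript
// Hypothetical helper, not part of Lighthouse.
// MIME types are case-insensitive, hence the `i` flag.
const HTML_MIME_REGEX = /^text\/html$/i;

/**
 * @param {{mimeType?: string}|undefined} mainRecord
 * @return {boolean}
 */
function isExplicitlyNonHtml(mainRecord) {
  // No record or no mimeType: stay lenient and let the run continue.
  if (!mainRecord || !mainRecord.mimeType) return false;
  return !HTML_MIME_REGEX.test(mainRecord.mimeType);
}

console.log(isExplicitlyNonHtml({mimeType: 'application/xml'})); // true
console.log(isExplicitlyNonHtml({mimeType: 'TEXT/HTML'})); // false
console.log(isExplicitlyNonHtml({mimeType: ''})); // false
```

The trade-off is that a non-HTML response served without any mimeType would slip through this version of the check.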

Comment on lines 275 to 276
// We want to error when the page is not of MIME type text/html
if (docTypeError) return docTypeError;
Contributor Author

I assumed that the order of importance of errors was interstitial -> network -> docType, should this be the case?
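
To make the question concrete, here is a minimal sketch of "return the highest-priority error that exists", assuming the order interstitial -> network -> docType is the intended one. `firstDefinedError` is a hypothetical helper and the errors are plain `Error` objects, not Lighthouse's error type.

```javascript
// Hypothetical helper illustrating the priority order asked about above.
/**
 * @param {Array<Error|undefined>} errorsByPriority Highest priority first.
 * @return {Error|undefined}
 */
function firstDefinedError(errorsByPriority) {
  return errorsByPriority.find(error => error !== undefined);
}

// Only the doc type check produced an error in this made-up scenario.
const interstitialError = undefined;
const networkError = undefined;
const docTypeError = new Error('INVALID_DOC_TYPE');

const picked = firstDefinedError([interstitialError, networkError, docTypeError]);
console.log(picked && picked.message); // INVALID_DOC_TYPE
```

With this shape, changing the priority is a matter of reordering the array rather than reordering `if` statements.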


it('fails when the page is not of MIME type text/html', () => {
const url = 'http://the-page.com';
const mimeType = 'application/xml';
Contributor Author

For this test & unit testing in general, should I apply a rigorous run through of incorrect mimeTypes, or is just having one incorrect option good enough?

Member

> For this test & unit testing in general, should I apply a rigorous run through of incorrect mimeTypes, or is just having one incorrect option good enough?

It depends on the situation. The mimeType string part of the check is a relatively simple comparison, it's basically just string equality with no real corner cases, so this seems fine. Since it's going to be case insensitive, maybe add a test for that case?

@lemcardenas lemcardenas marked this pull request as ready for review June 30, 2020 22:56
@lemcardenas lemcardenas requested a review from a team as a code owner June 30, 2020 22:56
@lemcardenas lemcardenas requested review from Beytoven and removed request for a team June 30, 2020 22:56
@lemcardenas lemcardenas linked an issue Jun 30, 2020 that may be closed by this pull request
@@ -47,6 +47,8 @@ const UIStrings = {
internalChromeError: 'An internal Chrome error occurred. Please restart Chrome and try re-running Lighthouse.',
/** Error message explaining that fetching the resources of the webpage has taken longer than the maximum time. */
requestContentTimeout: 'Fetching resource content has exceeded the allotted time',
/** Error message explaining that the webpage is non-HTML, so audits are ill-defined **/
docTypeInvalid: 'The webpage you have provided appears to be non-HTML',
Collaborator

Suggested change
- docTypeInvalid: 'The webpage you have provided appears to be non-HTML',
+ docTypeInvalid: 'The page provided is not HTML',

Contributor Author

Thanks, fixed this

@@ -1114,26 +1161,30 @@ describe('GatherRunner', function() {
navigationError = /** @type {LH.LighthouseError} */ (new Error('NAVIGATION_ERROR'));
});

- it('passes when the page is loaded', () => {
+ it('passes when the page is loaded and doc type is text/html', () => {
Collaborator

for these tests, I think you can

  1. leave the test name the same
  2. prefer mainRecord.mimeType = 'text/html' over defining a variable

Contributor Author

Thanks, fixed both of these

mainRecord.url = passContext.url;
mainRecord.mimeType = mimeType;
Collaborator

same here: inline the mimeType

@@ -246,6 +272,9 @@ class GatherRunner {
// Example: `DNS_FAILURE` is better than `NO_FCP`.
if (networkError) return networkError;

// We want to error when the page is not of MIME type text/html
Collaborator

Suggested change
- // We want to error when the page is not of MIME type text/html
+ // Error if page is not HTML.

Contributor Author

Thanks, fixed this

@@ -233,6 +258,7 @@ class GatherRunner {

const networkError = GatherRunner.getNetworkError(mainRecord);
const interstitialError = GatherRunner.getInterstitialError(mainRecord, networkRecords);
const docTypeError = GatherRunner.getDocTypeError(mainRecord);
Collaborator

The structure of this function is rather strange–zooming out a bit, I'd expect the loadFailureMode check to happen first; and for each of these errors to be created in order, but only if the former error check didn't hit. Incrementally, what you have here makes sense, I just think this entire function might benefit from a slight refactor.

Contributor Author

@lemcardenas lemcardenas Jul 1, 2020

Yeah, there are definitely some redundant checks in my getDocTypeError function that wouldn't need to be there if it were refactored. Should I try to refactor in this PR? It looks like Patrick wrote this function originally, so I can also ping them about this if required.

Collaborator

@connorjclark is the refactoring we're talking about here moving 1 line a few lines up and 1 more line a few lines down? I think we can manage it here 😉

be my guest! I certainly have no strong attachment to the specific order in which they're declared as long as we don't let them become multi-line statements :)

Member

> moving 1 line a few lines up and 1 more line a few lines down?

yeah, that sounds like no problem, but I would be hesitant to do much deduplicating of content within the error checks. They work well today and it's easy to reason about them since they're more or less independent of each other, each just checking on their own particular error case.

Maybe pulling out the !mainRecord check makes sense since they all end up having to check that and the type system will keep track of it already being done (since we can drop all the |undefineds)? But, again, everything's working today and this function isn't in bad shape or anything, so @lemcardenas you can also feel free to leave it to a follow-up PR or to someone else in the future if it feels like the PR scope is growing too much.

https://google.github.io/eng-practices/review/developer/small-cls.html is one of my favorite docs :)

Contributor Author

I think I'll try to stay safe for this PR and won't change anything about the getPageLoadError function, because as mentioned below there may be more work needed for the original change to work.

* @param {LH.Artifacts.NetworkRequest|undefined} mainRecord
* @return {LH.LighthouseError|undefined}
*/
static getDocTypeError(mainRecord) {
Member

we probably want to avoid the name docType because of the associations with the related but not quite overlapping doctype (since we're also going to be rejecting things without doctypes like PDFs, etc).

I can't imagine we're ever going to open up Lighthouse to other types of pages (so it's not really a generalized document type check, it's a specific check that the document is html), so what about something straightforward like getNonHtmlError()? Kind of terrible sounding and open for bike shedding, but gets the point across :)

Contributor Author

Sounds good, I wondered why docType/doctype sounded so familiar to me. I can definitely change the name to what you recommended

if (!mainRecord) return undefined;

// If the main document failed, this error case is undefined, let other cases handle it.
if (mainRecord.failed) return undefined;
Member

probably fine to combine these two cases into a single if (!mainRecord || mainRecord.failed) return; since they aren't super important to this check in particular, but could probably just drop the mainRecord.failed check completely, since this function isn't so much a check of a valid main document than a check that it's not an invalid media type.

Contributor Author

Thanks for pointing this out, I ended up just dropping the mainRecord.failed check

@brendankenny
Member

brendankenny commented Jul 1, 2020

> does the mime type in the network records return Chrome's inferred mime type or just what was set on the header?

Looks like maybe it's just what was on the header, or maybe Chrome does some inference on top of that?

Maybe we'll need to combine mimeType with other information? It looks like checking for 'text/html' would catch a large majority of cases but would miss some. Since this will be a fatal error, we'll probably want to be more conservative in throwing. e.g. a server shouldn't be serving html with mime type 'content/html', but if Lighthouse can still run against it, we shouldn't not let them do so.

This is a query of the June 2020 HTTPArchive run for unique main document mimeTypes and an example requested/finalUrl for each. Out of 5.7 million LH runs (results with no network requests or fatal load errors eliminated):

(note that any of these URLs could be very NSFW)

edit: updated, see below

| mimeType | count | requestedUrl | finalUrl |
|---|---|---|---|
| text/html | 5665445 | http://0-1.ru/ | http://0-1.ru/ |
| (empty string) | 605 | http://aa2888.com/ | http://aa2888.com/ |
| application/xhtml+xml | 126 | http://apple-store.in/ | http://apple-store.in/ |
| text/plain | 88 | http://camxclub.fun/ | https://virsx.com/c.php?k=63r1l5p2seqav3mqsdvc&clickid=5eedcc07eb376e00016fa45c&affpid=6084&referrer=&sub1=&sub2=&sub3=&sub4=&sub5=&sub6=trafficback,174,[MOB]%20Ramadan%20Girls%20-%20PPS%20up%20to%20$0.28%20WW%20(Asia,%20Africa,%20Middle%20East)%20-%20Mainstream/Adult%20Dating |
| (null) | 25 | http://aplicacionesbiblioteca.udea.edu.co/ | about:blank |
| text/xml | 7 | http://dl.tele2.ru/ | http://dl.tele2.ru/ |
| application/json | 3 | http://mangareadsonline.com/ | http://mangareadsonline.com/?&__cf_chl_jschl_tk__=eb0e1bc80ad3b24fce55577de2272226c4c57a92-1588663154-0-AQNng6nLfYRY9SGNobuHLBNw6QuAhSnyPtV_tqJInXx49d0dJvYZHO-hZUGfG0FTdWuqH1aUfYxyOYz8J5UV8XZWIA9P8_4l1YxKvkogFQENLVs0ShZuGtMEsVgNy79lNn0Bj_6Hq6s3SzUiKmJfDl_0vF4jjsbFVFB2fgGUC8NbAACaBB6bBQRr_DXtBne_i7N7-2UAaf1C5SAeoY_VItIqdUzUq7ncuyolVNPnv96rxbXwx6_7S6pHeNSV8W-x6o-5QEIw4XTziavs-pw0wq0 |
| text/x-server-parsed-html | 2 | http://act.zisho.jp/ | http://act.zisho.jp/ |
| text/x-python | 2 | http://www.dramonline.org/ | http://www.dramonline.org/ |
| text/uri-list | 1 | https://www.scopus.com/ | https://www.scopus.com/home.uri |
| image/gif | 1 | http://ww3.speetest.net/ | https://partners.etoro.com/blank.gif |
| audio/mpeg | 1 | https://www.radiostella.cz/ | https://ice3.abradio.cz/stella128.mp3 |
| image/jpeg | 1 | https://lima2000.com/ | https://lima2000.com/wp_lima2000/wp-content/uploads/2018/09/MuralesButton.jpg |
| httpd/unix-directory | 1 | https://mdirect.e-lina.co.kr/ | https://mdirect.e-lina.co.kr/ |
| application/x-www-form-urlencoded | 1 | https://club.chateaumercian.com/ | https://club.chateaumercian.com/club |
| text/javascript | 1 | http://olink.tv/ | https://whos.amung.us/pingjs/?k=djmatioka403&t=216.244.91.194__&c=d&y=Y20200614164115&a=0&r=20200614164115 |

Old not-quite-right query

| mimeType | count | example requestedUrl | finalUrl |
|---|---|---|---|
| text/html | 5628537 | http://0-1.ru/ | http://0-1.ru/ |
| (empty string) | 32019 | http://100.v523.com.tw/ | http://100.v523.com.tw/estate/index.do |
| text/plain | 5043 | http://52kartu.com/ | http://52kartu.com/mobile.php;jsessionid=9642EBFAF295C89B0EC9AFC264AA2E63?0 |
| httpd/unix-directory | 164 | http://code.sfu.ca/ | http://code.sfu.ca/index.html |
| application/xhtml+xml | 115 | http://apple-store.in/ | http://apple-store.in/ |
| text/xml | 93 | http://alertas.myweb.vodafone.pt/ | http://alertas.myweb.vodafone.pt/home |
| application/octet-stream | 85 | http://ac.dragonest.com/ | http://ac.dragonest.com/m/en |
| application/json | 80 | http://asistencia.claro.com.co/ | http://asistencia.claro.com.co/soporte/ |
| text/x-perl | 31 | http://azerbaijan.mfa.gov.by/ | http://azerbaijan.mfa.gov.by/ru/ |
| (null) | 25 | http://aplicacionesbiblioteca.udea.edu.co/ | about:blank |
| application/binary | 24 | http://ads.google.com/ | https://ads.google.com/home/ |
| application/x-httpd-cgi | 17 | http://post.japanpost.jp/ | https://www.post.japanpost.jp/index.html |
| application/cgi | 12 | http://filedwon.info/ | http://filedwon.info/?op=login |
| text/x-unknown-content-type | 10 | http://gov.garant.ru/ | http://gov.garant.ru/SESSION/PDA/main.htm |
| application/x-gzip | 8 | http://www.wise.org/ | http://www.wise.org/en_US/index.html |
| application/pdf | 7 | https://www.corelaboratory.abbott/ | https://www.corelaboratory.abbott/us/en/home |
| application/x-msdownload | 6 | http://www.actorstudio.fr/ | http://www.actorstudio.fr/FR/ |
| text/x-python | 5 | http://www.dramonline.org/ | http://www.dramonline.org/ |
| application/xml | 4 | http://www.insan-pratama.com/ | http://www.insan-pratama.com/home.html |
| application/x-httpd-fcgi | 3 | http://automower-fans.les-forums.com/ | http://automower-fans.les-forums.com/forums/ |
| application/x-perl | 3 | https://pedidoscastalia.com/ | https://pedidoscastalia.com/_login |
| text/javascript | 3 | http://olink.tv/ | https://whos.amung.us/pingjs/?k=djmatioka403&t=216.244.91.194__&c=d&y=Y20200614164115&a=0&r=20200614164115 |
| text/vnd.wap.wml | 2 | http://m.n-content.com/ | http://m.n-content.com/MusicMobile/wap/index.jsp?cid=1&ads=0007 |
| application/unknown | 2 | https://cvd.bundesregierung.de/ | https://m.cvd.bundesregierung.de/cvdm-de/login |
| application/javascript | 2 | http://map.baidu.com/ | https://map.baidu.com/mobile/webapp/index/index/ |
| text/x-server-parsed-html | 2 | http://act.zisho.jp/ | http://act.zisho.jp/ |
| "text/html" | 1 | https://www.svobodny-statek.cz/ | https://www.svobodny-statek.cz/bio-dynamicke-zemedelstvi |
| application/x-cgi | 1 | https://www.jointheblue.com/ | https://www.jointheblue.com/michigan/ |
| application/x-javascript | 1 | http://ichinoseki-shinkin.jp/ | http://ichinoseki-shinkin.jp/sp/index.html |
| image/jpeg | 1 | http://www.oeuvrecoeurcroix.fr/ | http://www.oeuvrecoeurcroix.fr/portail/ |
| audio/mpeg | 1 | https://www.radiostella.cz/ | https://ice3.abradio.cz/stella128.mp3 |
| image/gif | 1 | http://ww3.speetest.net/ | https://partners.etoro.com/blank.gif |
| text/html/html | 1 | https://access.sheridaninstitute.ca/ | https://access.sheridaninstitute.ca/http://portal-am.sheridaninstitute.ca/amserver/UI/Login?gw=access.sheridaninstitute.ca&org=o%3Dsheridanc.on.ca&goto=http%3A%2F%2Fportal-p.sheridaninstitute.ca%3A80%2Fportal%2Fdt |
| content/html | 1 | https://subs.nymag.com/ | https://subs.nymag.com/account/ |

@brendankenny
Member

brendankenny commented Jul 1, 2020

Actually, some of these are wrong and are the mimeType of the requestedUrl even if there's a redirect to the finalUrl (it's kind of hard to find the main document because AFAICT there's no URL constructor available in the BigQuery custom JS functions), but I'm leaving it up because they appear to all be real initial mimeTypes

e.g. https://www.svobodny-statek.cz/ does have mimeType "text/html" (in double quotes), even though the URL it's redirected to has the correct mimeType text/html (same with the first request of http://www.oeuvrecoeurcroix.fr/ being served as image/jpeg).

I can put the better table into the comment above, but it's mostly just a subset of the above, with far more cases correctly under text/html (5,665,445 text/html, 605 empty strings, 126 application/xhtml+xml, 88 text/plain, 7 text/xml, and then various other ones all occurring less than 3 times).

Regardless, looks like this string can be whatever the server wants it to be.

@paulirish
Member

nice query!

> Maybe we'll need to combine mimeType with other information? It looks like checking for 'text/html' would catch a large majority of cases but would miss some. Since this will be a fatal error, we'll probably want to be more conservative in throwing. e.g. a server shouldn't be serving html with mime type 'content/html', but if Lighthouse can still run against it, we shouldn't not let them do so.

ehhhh. im happy to yell at content/html even tho technically the page will render.

Because we have the browser sniff, it complicates things. my assumption is that the mimeType from the protocol is post-sniff but i'm not confident.

broad categories we have

  1. page is served with text/html and is text/html. YAY
  2. page is served with non-text/html and isn't HTML. page (probably) renders fine?
    • we throw error.
  3. page is served with non-text/html but is HTML. page renders fine.
    • technically the page is good enough for us, but we should throw this error anyway.
  4. page is not served with a content-type, but is HTML and the browser sniffs it works. YAY
  5. page is not served with a content-type, but isnt HTML and the browser sniffs and who knows.
    • Tough situation. LH should throw on this. I have no idea what the .mimeType is in this scenario, but from the above comment it sounds like it's not text/html.

given that we're talking about 0.015% of pages in HA.. I think we can afford to just flag everything that's without a .mimeType of text/html.
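
The 0.015% figure can be reproduced from the updated HTTPArchive table earlier in this thread. The snippet below is just that arithmetic, with the counts copied from the table; nothing in it is Lighthouse code.

```javascript
// Counts copied from the updated HTTPArchive table earlier in this thread.
const counts = {
  'text/html': 5665445,
  '(empty string)': 605,
  'application/xhtml+xml': 126,
  'text/plain': 88,
  '(null)': 25,
  'text/xml': 7,
  'application/json': 3,
  'text/x-server-parsed-html': 2,
  'text/x-python': 2,
  'text/uri-list': 1,
  'image/gif': 1,
  'audio/mpeg': 1,
  'image/jpeg': 1,
  'httpd/unix-directory': 1,
  'application/x-www-form-urlencoded': 1,
  'text/javascript': 1,
};

const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
// Everything that would be rejected by a strict "must be text/html" check.
const nonHtml = total - counts['text/html'];
const percent = (nonHtml / total) * 100;

console.log(nonHtml, total, percent.toFixed(3)); // 865 5666310 0.015
```

So a strict check would have rejected 865 of the roughly 5.67 million main documents in that run.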

@brendankenny
Member

> page is served with non-text/html but is HTML. page renders fine.
> technically the page is good enough for us, but we should throw this error anyway.

I can kind of get behind this, but this will be very fatal (lhrRuntimeError: true), so some number of currently OK users won't be able to run Lighthouse until they reconfigure their server. It is a very small number, true, though how representative the HTTPArchive mime type numbers are of the general population, I don't know.

/* Used when the page is non-HTML. */
INVALID_DOC_TYPE: {
code: 'INVALID_DOC_TYPE',
message: UIStrings.docTypeInvalid,
Member

need to also add lhrRuntimeError: true here because we want the error to bubble all the way to the top and show up in the LHR runtimeError entry and exit with a non-zero exit code when running the CLI.

@connorjclark's examples in #9245 show that things mostly go wrong but not completely, so we need to be extra loud to get the attention of users doing automated runs of LH that they're accidentally auditing a page they either didn't mean to or have misconfigured.

Since it'll be lhrRuntimeError: true (and so can also appear in the LHR that PSI serves), it'll also need to be added as an error entry in the proto.
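
As a rough illustration of what the flag is meant to achieve, the sketch below marks one error definition as fatal and derives a CLI exit code from it. The `ERRORS` table, the `EXAMPLE_NON_FATAL` entry, and `exitCodeFor` are hypothetical stand-ins; the real plumbing in Lighthouse is not shown in this thread.

```javascript
// Illustrative only: a simplified error table and CLI exit-code decision.
const ERRORS = {
  INVALID_DOC_TYPE: {
    code: 'INVALID_DOC_TYPE',
    message: 'The page provided is not HTML',
    // Marks the error as fatal so it surfaces as the report's runtime error.
    lhrRuntimeError: true,
  },
  // Hypothetical non-fatal entry, included only for contrast.
  EXAMPLE_NON_FATAL: {
    code: 'EXAMPLE_NON_FATAL',
    message: 'Something minor went wrong',
  },
};

/**
 * @param {{lhrRuntimeError?: boolean}|undefined} errorDefinition
 * @return {number} Exit code: non-zero only for fatal errors.
 */
function exitCodeFor(errorDefinition) {
  return errorDefinition && errorDefinition.lhrRuntimeError ? 1 : 0;
}

console.log(exitCodeFor(ERRORS.INVALID_DOC_TYPE)); // 1
console.log(exitCodeFor(ERRORS.EXAMPLE_NON_FATAL)); // 0
console.log(exitCodeFor(undefined)); // 0
```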

Contributor Author

Thanks, I'll make sure to fix this!

@@ -47,6 +47,8 @@ const UIStrings = {
internalChromeError: 'An internal Chrome error occurred. Please restart Chrome and try re-running Lighthouse.',
/** Error message explaining that fetching the resources of the webpage has taken longer than the maximum time. */
requestContentTimeout: 'Fetching resource content has exceeded the allotted time',
/** Error message explaining that the webpage is non-HTML, so audits are ill-defined **/
docTypeInvalid: 'The page provided is not HTML',
Member

maybe we should parameterize this message, so the user knows what to fix? Something like

Suggested change
- docTypeInvalid: 'The page provided is not HTML',
+ docTypeInvalid: 'The page provided is not HTML (served as MIME type {mimeType}).',

Contributor Author

@lemcardenas lemcardenas Jul 1, 2020

I like that idea, my initial message had some mimeType information but wasn't as easy to read as this, I'll change it!
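
To show what the parameterized string would produce, here is a minimal placeholder substitution. `formatMessage` is a hypothetical helper written for this example; Lighthouse's actual i18n formatting is more involved and is not reproduced here.

```javascript
// Hypothetical formatter; Lighthouse's real i18n pipeline is more involved.
/**
 * Replaces `{name}` placeholders with values, leaving unknown ones untouched.
 * @param {string} template
 * @param {Record<string, string>} values
 * @return {string}
 */
function formatMessage(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match);
}

const template = 'The page provided is not HTML (served as MIME type {mimeType}).';
console.log(formatMessage(template, {mimeType: 'application/xml'}));
// The page provided is not HTML (served as MIME type application/xml).
```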

@lemcardenas
Contributor Author

lemcardenas commented Jul 1, 2020

> Regardless, looks like this string can be whatever the server wants it to be.

Do you think that looking directly at Content-Type headers will help in reducing some of the false negatives / be better overall? When I first asked about choosing between mimeType/Content-Type/extension approaches, Paul linked this code which mentions //Use mime type from cached resource in case the one in response is empty. (this originally suggested that mimeType was the safe option to use), when checking some of the calling functions it mentions //MIME type is determined by HTTP Content-Type header. So from these snippets it sounds like Content-Type may be the stronger method.

In the original issue for this PR, Connor mentioned that

> Images, stylesheets, js, and such actually audit just fine (rather, the page Chrome builds to view the resource presents no issues for LH), so we don't have to exclude those.

Should we instead then have deny/allow lists to have more coverage?

@paulirish
Member

> So from these snippets it sounds like Content-Type may be the stronger method.

contenttype is provided in the response headers.
mimeType is provided by the browser and is based on contentType.

since the browser can choose a mimetype even if a contenttype isn't provided, we should just go with mimetype. it's the signal that's more in-tune with how the browser read the content.

let's keep this check simple: if the mimeType isn't text/html we have a problem.

that's it. :)

@lemcardenas
Contributor Author

> let's keep this check simple: if the mimeType isn't text/html we have a problem.
>
> that's it. :)

Thanks for the explanation! Sounds good to me :)


// mimeType is determined by the browser, we assume Chrome is determining mimeType correctly,
// independently of 'Content-Type' response headers, and always sending mimeType if well-formed.
if (mainRecord.mimeType || mainRecord.mimeType === '') {
Collaborator

mainRecord.mimeType is initialized as an empty string and set from response.mimeType which apparently is always a string, never null or undefined. So I think this can be refactored to just:

if (!HTML_MIME_REGEX.test(mainRecord.mimeType)) {
  return new LHError(LHError.errors.NON_HTML, {mimeType: mainRecord.mimeType});
}

(removing the outer if statement)

Contributor Author

thanks for clarifying, I was unsure if the string from the response could be overwritten in the browser with a non-string / null / undefined, I'll make sure to fix this!

* @description Error message explaining that the webpage is non-HTML, so audits are ill-defined.
* @example {application/xml} mimeType
* */
nonHtml: 'The page provided is not HTML (served as MIME type {mimeType}).',
Collaborator

nit: notHtml instead? and NOT_HTML

Contributor Author

I kept the nonHtml name scheme from the function but this makes sense since the function isn't really in context here, I'll change this!

* @return {LH.LighthouseError|undefined}
*/
static getNonHtmlError(mainRecord) {
// MIME types are case-insensitive
Collaborator

there's no test for the case insensitivity

does chrome perhaps normalize this anyhow? you can hack the static-server.js to return any mime type you want and run yarn static-server to verify.

Contributor Author

yeah I couldn't find information online as to whether chrome automatically makes mimetypes lowercase regardless of the http response, i'll check with the method you mentioned and let you know

Contributor Author

@lemcardenas lemcardenas Jul 7, 2020

ok update so I replaced all instances of 'text/html' with 'TEXT/HTML' in the static-server Content-Type header responses & confirmed in devtools that the Content-Type header showed up as 'TEXT/HTML'. When I ran a lighthouse run against a page hosted on static-server, I got no error and the mimeType accessible from mainRecord was always lowercase (text/html), so it seems that chrome automatically makes its mimeType lowercase!

So I am going to remove the case insensitivity modifier on the RegEx, and I think I'll replace it with a direct string comparison with a const 'text/html' since thats likely more efficient and seems sufficient based on the test above

Collaborator

nice! yeah a direct string comparison would be nice here. to be clear, the performance was never an issue (it's negligible), just the desire to keep things as complex as necessary (aka as simple as possible).

Collaborator

a comment that chrome normalizes the mime type would be good too

Contributor Author

sounds good! i'll make sure to add the comment right now
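
Putting the outcome of this thread together, the check could end up looking something like the sketch below: a direct string comparison, with a comment recording why case-insensitivity is not needed. `getNonHtmlErrorSketch` is a hypothetical name, it returns a plain `Error` instead of Lighthouse's error type, and the lowercase normalization is assumed based on the static-server experiment described above.

```javascript
// Chrome appears to normalize the MIME type to lowercase (per the static-server
// experiment above), so a plain string comparison is assumed to be sufficient.
const HTML_MIME_TYPE = 'text/html';

/**
 * @param {{mimeType: string}|undefined} mainRecord
 * @return {Error|undefined}
 */
function getNonHtmlErrorSketch(mainRecord) {
  // Without a main record there is nothing to judge; other checks handle that case.
  if (!mainRecord) return undefined;

  if (mainRecord.mimeType !== HTML_MIME_TYPE) {
    return new Error(`NOT_HTML: served as MIME type ${mainRecord.mimeType}`);
  }
  return undefined;
}

console.log(getNonHtmlErrorSketch({mimeType: 'text/html'})); // undefined
console.log(getNonHtmlErrorSketch({mimeType: 'application/xml'}).message);
// NOT_HTML: served as MIME type application/xml
```

Note that, unlike a lenient variant, this version also rejects an empty mimeType, which matches the "flag everything that isn't text/html" decision reached earlier in the conversation.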

Co-authored-by: Connor Clark <cjamcl@google.com>
Successfully merging this pull request may close these issues.

Warn when trying to audit non-html
8 participants