core(gather-runner): error on non-HTML #11042

lemcardenas · 2020-06-30T22:38:33Z

Summary

This PR adds a function to gather-runner.js, getNonHtmlError, that returns an error if we attempt to run gatherers on a non-HTML page.

Related Issues/PRs

#9245

lemcardenas · 2020-06-30T22:40:49Z

lighthouse-core/gather/gather-runner.js

+    // MIME types are case-insenstive
+    const HTML_MIME_REGEX = /^text\/html$/i;


According to https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types, "MIME types are case-insensitive but are traditionally written in lowercase". I am unsure if the check needs to be this rigorous/if a constant 'text/html' is good enough.

would mainRecord.resourceType === NetworkRequest.RESOURCE_TYPES.Document work too? perhaps not for xml docs...

just double checked and xml docs would indeed pass (no error) if we used that condition

~~does the mime type in the network records return Chrome's inferred mime type or just what was set on the header?~~

~~we should test it, but~~ just read your comment a few lines below this and that's why "Add single comment" in GitHub reviews is almost always a bad idea 😆 since the failure mode is so drastic, I would probably still prefer to keep the no mimeType case passing (i.e. only reject documents that have an explicit non-HTML mimeType set)
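To make the two options in this thread concrete, here is a minimal sketch. Only the regex comes from the diff; the helper names `isRejectedStrict` and `isRejectedLenient` are hypothetical and are not part of gather-runner.js.

```javascript
// Regex taken from the diff above; everything else is illustrative.
const HTML_MIME_REGEX = /^text\/html$/i;

// Strict: anything that is not text/html is rejected, including an empty mimeType.
function isRejectedStrict(mimeType) {
  return !HTML_MIME_REGEX.test(mimeType);
}

// Lenient: only an explicit, non-HTML mimeType is rejected; an empty one passes.
function isRejectedLenient(mimeType) {
  if (!mimeType) return false;
  return !HTML_MIME_REGEX.test(mimeType);
}

console.log(isRejectedStrict(''), isRejectedLenient('')); // true false
console.log(isRejectedStrict('application/xml'), isRejectedLenient('application/xml')); // true true
```

The two variants only differ for the empty-string case, which is the case the last comment suggests keeping as a pass.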

lemcardenas · 2020-06-30T22:45:15Z

lighthouse-core/gather/gather-runner.js

+    // We want to error when the page is not of MIME type text/html
+    if (docTypeError) return docTypeError;


I assumed that the order of importance of errors was interstitial -> network -> docType, should this be the case?
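A simplified sketch of the precedence described in this comment, assuming the order interstitial, then network, then doc type. `pickPageLoadError` is a hypothetical stand-in and leaves out the other checks that the real getPageLoadError performs.

```javascript
// Hypothetical helper: returns the first defined error in priority order.
function pickPageLoadError({interstitialError, networkError, docTypeError}) {
  // Earlier checks win: interstitial, then network, then document type.
  if (interstitialError) return interstitialError;
  if (networkError) return networkError;
  if (docTypeError) return docTypeError;
  return undefined;
}

// A network error hides a doc type error found on the same load.
const picked = pickPageLoadError({
  networkError: new Error('network'),
  docTypeError: new Error('docType'),
});
console.log(picked.message); // network
```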

lemcardenas · 2020-06-30T22:48:21Z

lighthouse-core/test/gather/gather-runner-test.js

+
+    it('fails when the page is not of MIME type text/html', () => {
+      const url = 'http://the-page.com';
+      const mimeType = 'application/xml';


For this test & unit testing in general, should I apply a rigorous run through of incorrect mimeTypes, or is just having one incorrect option good enough?

It depends on the situation. The mimeType string part of the check is a relatively simple comparison, it's basically just string equality with no real corner cases, so this seems fine. Since it's going to be case insensitive, maybe add a test for that case?
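As a rough illustration of the suggestion above, a handful of cases (one correct type, one case variant, one or two incorrect types) can be written as a small table. This sketch runs outside any test framework and reuses the regex from the diff; the case list itself is an assumption, not the PR's actual test data.

```javascript
const HTML_MIME_REGEX = /^text\/html$/i;

// Each entry is [mimeType, expected result of the HTML check].
const cases = [
  ['text/html', true],
  ['TEXT/HTML', true], // case-insensitivity
  ['application/xml', false],
  ['', false],
];

// Collect every case where the check disagrees with the expectation.
const failures = cases.filter(
  ([mimeType, expected]) => HTML_MIME_REGEX.test(mimeType) !== expected
);
console.log(failures.length === 0 ? 'all cases pass' : failures);
```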

connorjclark · 2020-06-30T23:40:40Z

lighthouse-core/lib/lh-error.js

@@ -47,6 +47,8 @@ const UIStrings = {
  internalChromeError: 'An internal Chrome error occurred. Please restart Chrome and try re-running Lighthouse.',
  /** Error message explaining that fetching the resources of the webpage has taken longer than the maximum time. */
  requestContentTimeout: 'Fetching resource content has exceeded the allotted time',
+  /** Error message explaining that the webpage is non-HTML, so audits are ill-defined **/
+  docTypeInvalid: 'The webpage you have provided appears to be non-HTML',


Suggested change

docTypeInvalid: 'The webpage you have provided appears to be non-HTML',

docTypeInvalid: 'The page provided is not HTML',

Thanks, fixed this

connorjclark · 2020-06-30T23:43:15Z

lighthouse-core/test/gather/gather-runner-test.js

@@ -1114,26 +1161,30 @@ describe('GatherRunner', function() {
      navigationError = /** @type {LH.LighthouseError} */ (new Error('NAVIGATION_ERROR'));
    });

-    it('passes when the page is loaded', () => {
+    it('passes when the page is loaded and doc type is text/html', () => {


for these tests, I think you can

leave the test name the same

prefer mainRecord.mimeType = 'text/html' over defining a variable

Thanks, fixed both of these

connorjclark · 2020-06-30T23:43:46Z

lighthouse-core/test/gather/gather-runner-test.js

      mainRecord.url = passContext.url;
+      mainRecord.mimeType = mimeType;


same here: inline the mimeType

connorjclark · 2020-06-30T23:44:32Z

lighthouse-core/gather/gather-runner.js

@@ -246,6 +272,9 @@ class GatherRunner {
    // Example: `DNS_FAILURE` is better than `NO_FCP`.
    if (networkError) return networkError;

+    // We want to error when the page is not of MIME type text/html


Suggested change

// We want to error when the page is not of MIME type text/html

// Error if page is not HTML.

Thanks, fixed this

connorjclark · 2020-06-30T23:51:18Z

lighthouse-core/gather/gather-runner.js

@@ -233,6 +258,7 @@ class GatherRunner {

    const networkError = GatherRunner.getNetworkError(mainRecord);
    const interstitialError = GatherRunner.getInterstitialError(mainRecord, networkRecords);
+    const docTypeError = GatherRunner.getDocTypeError(mainRecord);


The structure of this function is rather strange. Zooming out a bit, I'd expect the loadFailureMode check to happen first, and for each of these errors to be created in order, but only if the previous error check didn't hit. Incrementally, what you have here makes sense; I just think this entire function might benefit from a slight refactor.

Yeah, there are also definitely some redundant checks in my getDocTypeError function that wouldn't need to be there if it were refactored. Should I try to refactor in this PR? It looks like Patrick wrote this function originally; I can also ping them about this if required.

@connorjclark is the refactoring we're talking about here moving 1 line a few lines up and 1 more line a few lines down? I think we can manage it here 😉

be my guest! I certainly have no strong attachment to the specific order in which they're declared as long as we don't let them become multi-line statements :)

> moving 1 line a few lines up and 1 more line a few lines down?

yeah, that sounds like no problem, but I would be hesitant to do much deduplicating of content within the error checks. They work well today and it's easy to reason about them since they're more or less independent of each other, each just checking on their own particular error case.

Maybe pulling out the !mainRecord check makes sense since they all end up having to check that and the type system will keep track of it already being done (since we can drop all the |undefineds)? But, again, everything's working today and this function isn't in bad shape or anything, so @lemcardenas you can also feel free to leave it to a follow-up PR or to someone else in the future if it feels like the PR scope is growing too much.

https://google.github.io/eng-practices/review/developer/small-cls.html is one of my favorite docs :)

I think I'll try to stay safe for this PR and won't change anything about the getPageLoadError function, because as mentioned below there may be more work needed for the original change to work.
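For anyone picking up the follow-up, a minimal sketch of the "pull out the !mainRecord check" idea discussed above. All three functions are simplified, hypothetical versions, and the error strings other than NOT_HTML are placeholders rather than Lighthouse's actual error codes.

```javascript
// Simplified checks; each looks only at its own error case and assumes a defined record.
function getNetworkError(mainRecord) {
  return mainRecord.failed ? new Error('FAILED_DOCUMENT_REQUEST') : undefined;
}

function getNonHtmlError(mainRecord) {
  return mainRecord.mimeType === 'text/html' ? undefined : new Error('NOT_HTML');
}

function getPageLoadError(mainRecord) {
  // Single guard up front, so the checks below no longer need `|undefined` parameters.
  if (!mainRecord) return new Error('NO_DOCUMENT_REQUEST');
  return getNetworkError(mainRecord) || getNonHtmlError(mainRecord);
}

console.log(getPageLoadError({failed: false, mimeType: 'application/xml'}).message); // NOT_HTML
```

The individual checks stay independent of each other, which is the property the comment above wants to preserve.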

connorjclark · 2020-06-30T23:52:58Z

lighthouse-core/gather/gather-runner.js

+    // MIME types are case-insenstive
+    const HTML_MIME_REGEX = /^text\/html$/i;


would mainRecord.resourceType === NetworkRequest.RESOURCE_TYPES.Document work too? perhaps not for xml docs...

brendankenny · 2020-07-01T18:33:04Z

lighthouse-core/gather/gather-runner.js

+   * @param {LH.Artifacts.NetworkRequest|undefined} mainRecord
+   * @return {LH.LighthouseError|undefined}
+   */
+  static getDocTypeError(mainRecord) {


we probably want to avoid the name docType because of the associations with the related but not quite overlapping doctype (since we're also going to be rejecting things without doctypes like PDFs, etc).

I can't imagine we're ever going to open up Lighthouse to other types of pages (so it's not really a generalized document type check, it's a specific check that the document is html), so what about something straightforward like getNonHtmlError()? Kind of terrible sounding and open for bike shedding, but gets the point across :)

Sounds good, I wondered why docType/doctype sounded so familiar to me. I can definitely change the name to what you recommended

brendankenny · 2020-07-01T19:18:46Z

lighthouse-core/gather/gather-runner.js

+    if (!mainRecord) return undefined;
+
+    // If the main document failed, this error case is undefined, let other cases handle it.
+    if (mainRecord.failed) return undefined;


probably fine to combine these two cases into a single if (!mainRecord || mainRecord.failed) return; since they aren't super important to this check in particular, but could probably just drop the mainRecord.failed check completely, since this function isn't so much a check of a valid main document than a check that it's not an invalid media type.

Thanks for pointing this out, I ended up just dropping the mainRecord.failed check

brendankenny · 2020-07-01T19:28:14Z

lighthouse-core/gather/gather-runner.js

@@ -233,6 +258,7 @@ class GatherRunner {

    const networkError = GatherRunner.getNetworkError(mainRecord);
    const interstitialError = GatherRunner.getInterstitialError(mainRecord, networkRecords);
+    const docTypeError = GatherRunner.getDocTypeError(mainRecord);


> moving 1 line a few lines up and 1 more line a few lines down?

yeah, that sounds like no problem, but I would be hesitant to do much deduplicating of content within the error checks. They work well today and it's easy to reason about them since they're more or less independent of each other, each just checking on their own particular error case.

Maybe pulling out the !mainRecord check makes sense since they all end up having to check that and the type system will keep track of it already being done (since we can drop all the |undefineds)? But, again, everything's working today and this function isn't in bad shape or anything, so @lemcardenas you can also feel free to leave it to a follow-up PR or to someone else in the future if it feels like the PR scope is growing too much.

https://google.github.io/eng-practices/review/developer/small-cls.html is one of my favorite docs :)

brendankenny · 2020-07-01T19:39:46Z

lighthouse-core/test/gather/gather-runner-test.js

+
+    it('fails when the page is not of MIME type text/html', () => {
+      const url = 'http://the-page.com';
+      const mimeType = 'application/xml';


> For this test & unit testing in general, should I apply a rigorous run through of incorrect mimeTypes, or is just having one incorrect option good enough?

It depends on the situation. The mimeType string part of the check is a relatively simple comparison, it's basically just string equality with no real corner cases, so this seems fine. Since it's going to be case insensitive, maybe add a test for that case?

brendankenny · 2020-07-01T20:45:03Z

> does the mime type in the network records return Chrome's inferred mime type or just what was set on the header?

Looks like maybe it's just what was on the header, or maybe Chrome does some inference on top of that?

Maybe we'll need to combine mimeType with other information? It looks like checking for 'text/html' would catch a large majority of cases but would miss some. Since this will be a fatal error, we'll probably want to be more conservative in throwing. e.g. a server shouldn't be serving html with mime type 'content/html', but if Lighthouse can still run against it, we shouldn't not let them do so.

This is a query of the June 2020 HTTPArchive run for unique main document mimeTypes and an example requested/finalUrl for each. Out of 5.7 million LH runs (results with no network requests or fatal load errors eliminated):

(note that any of these URLs could be very NSFW)

edit: updated, see below

mimeType	count	requestedUrl	finalUrl
text/html	5665445	`http://0-1.ru/`	`http://0-1.ru/`
(empty string)	605	`http://aa2888.com/`	`http://aa2888.com/`
application/xhtml+xml	126	`http://apple-store.in/`	`http://apple-store.in/`
text/plain	88	`http://camxclub.fun/`	`https://virsx.com/c.php?k=63r1l5p2seqav3mqsdvc&clickid=5eedcc07eb376e00016fa45c&affpid=6084&referrer=&sub1=&sub2=&sub3=&sub4=&sub5=&sub6=trafficback,174,[MOB]%20Ramadan%20Girls%20-%20PPS%20up%20to%20$0.28%20WW%20(Asia,%20Africa,%20Middle%20East)%20-%20Mainstream/Adult%20Dating`
(null)	25	`http://aplicacionesbiblioteca.udea.edu.co/`	about:blank
text/xml	7	`http://dl.tele2.ru/`	`http://dl.tele2.ru/`
application/json	3	`http://mangareadsonline.com/`	`http://mangareadsonline.com/?&__cf_chl_jschl_tk__=eb0e1bc80ad3b24fce55577de2272226c4c57a92-1588663154-0-AQNng6nLfYRY9SGNobuHLBNw6QuAhSnyPtV_tqJInXx49d0dJvYZHO-hZUGfG0FTdWuqH1aUfYxyOYz8J5UV8XZWIA9P8_4l1YxKvkogFQENLVs0ShZuGtMEsVgNy79lNn0Bj_6Hq6s3SzUiKmJfDl_0vF4jjsbFVFB2fgGUC8NbAACaBB6bBQRr_DXtBne_i7N7-2UAaf1C5SAeoY_VItIqdUzUq7ncuyolVNPnv96rxbXwx6_7S6pHeNSV8W-x6o-5QEIw4XTziavs-pw0wq0`
text/x-server-parsed-html	2	`http://act.zisho.jp/`	`http://act.zisho.jp/`
text/x-python	2	`http://www.dramonline.org/`	`http://www.dramonline.org/`
text/uri-list	1	`https://www.scopus.com/`	`https://www.scopus.com/home.uri`
image/gif	1	`http://ww3.speetest.net/`	`https://partners.etoro.com/blank.gif`
audio/mpeg	1	`https://www.radiostella.cz/`	`https://ice3.abradio.cz/stella128.mp3`
image/jpeg	1	`https://lima2000.com/`	`https://lima2000.com/wp_lima2000/wp-content/uploads/2018/09/MuralesButton.jpg`
httpd/unix-directory	1	`https://mdirect.e-lina.co.kr/`	`https://mdirect.e-lina.co.kr/`
application/x-www-form-urlencoded	1	`https://club.chateaumercian.com/`	`https://club.chateaumercian.com/club`
text/javascript	1	`http://olink.tv/`	`https://whos.amung.us/pingjs/?k=djmatioka403&t=216.244.91.194__&c=d&y=Y20200614164115&a=0&r=20200614164115`

Old not-quite-right query

mimeType	count	example requestedUrl	finalUrl
text/html	5628537	`http://0-1.ru/`	`http://0-1.ru/`
(empty string)	32019	`http://100.v523.com.tw/`	`http://100.v523.com.tw/estate/index.do`
text/plain	5043	`http://52kartu.com/`	`http://52kartu.com/mobile.php;jsessionid=9642EBFAF295C89B0EC9AFC264AA2E63?0`
httpd/unix-directory	164	`http://code.sfu.ca/`	`http://code.sfu.ca/index.html`
application/xhtml+xml	115	`http://apple-store.in/`	`http://apple-store.in/`
text/xml	93	`http://alertas.myweb.vodafone.pt/`	`http://alertas.myweb.vodafone.pt/home`
application/octet-stream	85	`http://ac.dragonest.com/`	`http://ac.dragonest.com/m/en`
application/json	80	`http://asistencia.claro.com.co/`	`http://asistencia.claro.com.co/soporte/`
text/x-perl	31	`http://azerbaijan.mfa.gov.by/`	`http://azerbaijan.mfa.gov.by/ru/`
(null)	25	`http://aplicacionesbiblioteca.udea.edu.co/`	about:blank
application/binary	24	`http://ads.google.com/`	`https://ads.google.com/home/`
application/x-httpd-cgi	17	`http://post.japanpost.jp/`	`https://www.post.japanpost.jp/index.html`
application/cgi	12	`http://filedwon.info/`	`http://filedwon.info/?op=login`
text/x-unknown-content-type	10	`http://gov.garant.ru/`	`http://gov.garant.ru/SESSION/PDA/main.htm`
application/x-gzip	8	`http://www.wise.org/`	`http://www.wise.org/en_US/index.html`
application/pdf	7	`https://www.corelaboratory.abbott/`	`https://www.corelaboratory.abbott/us/en/home`
application/x-msdownload	6	`http://www.actorstudio.fr/`	`http://www.actorstudio.fr/FR/`
text/x-python	5	`http://www.dramonline.org/`	`http://www.dramonline.org/`
application/xml	4	`http://www.insan-pratama.com/`	`http://www.insan-pratama.com/home.html`
application/x-httpd-fcgi	3	`http://automower-fans.les-forums.com/`	`http://automower-fans.les-forums.com/forums/`
application/x-perl	3	`https://pedidoscastalia.com/`	`https://pedidoscastalia.com/_login`
text/javascript	3	`http://olink.tv/`	`https://whos.amung.us/pingjs/?k=djmatioka403&t=216.244.91.194__&c=d&y=Y20200614164115&a=0&r=20200614164115`
text/vnd.wap.wml	2	`http://m.n-content.com/`	`http://m.n-content.com/MusicMobile/wap/index.jsp?cid=1&ads=0007`
application/unknown	2	`https://cvd.bundesregierung.de/`	`https://m.cvd.bundesregierung.de/cvdm-de/login`
application/javascript	2	`http://map.baidu.com/`	`https://map.baidu.com/mobile/webapp/index/index/`
text/x-server-parsed-html	2	`http://act.zisho.jp/`	`http://act.zisho.jp/`
"text/html"	1	`https://www.svobodny-statek.cz/`	`https://www.svobodny-statek.cz/bio-dynamicke-zemedelstvi`
application/x-cgi	1	`https://www.jointheblue.com/`	`https://www.jointheblue.com/michigan/`
application/x-javascript	1	`http://ichinoseki-shinkin.jp/`	`http://ichinoseki-shinkin.jp/sp/index.html`
image/jpeg	1	`http://www.oeuvrecoeurcroix.fr/`	`http://www.oeuvrecoeurcroix.fr/portail/`
audio/mpeg	1	`https://www.radiostella.cz/`	`https://ice3.abradio.cz/stella128.mp3`
image/gif	1	`http://ww3.speetest.net/`	`https://partners.etoro.com/blank.gif`
text/html/html	1	`https://access.sheridaninstitute.ca/`	`https://access.sheridaninstitute.ca/http://portal-am.sheridaninstitute.ca/amserver/UI/Login?gw=access.sheridaninstitute.ca&org=o%3Dsheridanc.on.ca&goto=http%3A%2F%2Fportal-p.sheridaninstitute.ca%3A80%2Fportal%2Fdt`
content/html	1	`https://subs.nymag.com/`	`https://subs.nymag.com/account/`

brendankenny · 2020-07-01T21:21:51Z

Actually, some of these are wrong and are the mimeType of the requestedUrl even if there's a redirect to the finalUrl (it's kind of hard to find the main document because AFAICT there's no URL constructor available in the BigQuery custom JS functions), but I'm leaving it up because they appear to all be real initial mimeTypes

e.g. https://www.svobodny-statek.cz/ does have mimeType "text/html" (in double quotes), even though the URL it's redirected to has the correct mimeType text/html (same with the first request of http://www.oeuvrecoeurcroix.fr/ being served as image/jpeg).

I can put the better table into the comment above, but it's mostly just a subset of the above, with far more cases correctly under text/html (5,665,445 text/html, 605 empty strings, 126 application/xhtml+xml, 88 text/plain, 7 text/xml, and then various other ones all occurring less than 3 times).

Regardless, looks like this string can be whatever the server wants it to be.

paulirish · 2020-07-01T21:57:51Z

nice query!

> Maybe we'll need to combine mimeType with other information? It looks like checking for 'text/html' would catch a large majority of cases but would miss some. Since this will be a fatal error, we'll probably want to be more conservative in throwing. e.g. a server shouldn't be serving html with mime type 'content/html', but if Lighthouse can still run against it, we shouldn't not let them do so.

ehhhh. I'm happy to yell at content/html even though technically the page will render.

Because we have the browser sniff, it complicates things. my assumption is that the mimeType from the protocol is post-sniff but I'm not confident.

broad categories we have:

1. page is served with text/html and is text/html. YAY
2. page is served with non-text/html and isn't HTML. page (probably) renders fine?
   - we throw error.
3. page is served with non-text/html but is HTML. page renders fine.
   - technically the page is good enough for us, but we should throw this error anyway.
4. page is not served with a content-type, but is HTML and the browser sniffs it and it works. YAY
5. page is not served with a content-type, but isn't HTML and the browser sniffs and who knows.
   - Tough situation. LH should throw on this. I have no idea what the .mimeType is in this scenario, but from the above comment it sounds like it's not text/html.

given that we're talking about 0.015% of pages in HA, I think we can afford to just flag everything that doesn't have a .mimeType of text/html.
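For reference, the 0.015% figure can be reproduced from the updated table earlier in this thread. The counts below are copied from that table; the variable names are just for this sketch.

```javascript
// Main-document mimeType counts from the updated June 2020 HTTPArchive query.
const mimeTypeCounts = {
  'text/html': 5665445,
  '(empty string)': 605,
  'application/xhtml+xml': 126,
  'text/plain': 88,
  '(null)': 25,
  'text/xml': 7,
  'application/json': 3,
  'text/x-server-parsed-html': 2,
  'text/x-python': 2,
  'text/uri-list': 1,
  'image/gif': 1,
  'audio/mpeg': 1,
  'image/jpeg': 1,
  'httpd/unix-directory': 1,
  'application/x-www-form-urlencoded': 1,
  'text/javascript': 1,
};

const total = Object.values(mimeTypeCounts).reduce((sum, n) => sum + n, 0);
const nonHtml = total - mimeTypeCounts['text/html'];
const sharePercent = (100 * nonHtml / total).toFixed(3);

console.log(`${nonHtml} of ${total} runs (${sharePercent}%)`); // 865 of 5666310 runs (0.015%)
```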

brendankenny · 2020-07-01T22:09:31Z

> page is served with non-text/html but is HTML. page renders fine.
> technically the page is good enough for us, but we should throw this error anyway.

I can kind of get behind this, but this will be very fatal (lhrRuntimeError: true), so some number of currently OK users won't be able to run Lighthouse until they reconfigure their server. It is a very small number, true, though how representative the HTTPArchive mime type numbers are of the general population, I don't know.

brendankenny · 2020-07-01T22:17:36Z

lighthouse-core/lib/lh-error.js

+  /* Used when the page is non-HTML. */
+  INVALID_DOC_TYPE: {
+    code: 'INVALID_DOC_TYPE',
+    message: UIStrings.docTypeInvalid,


need to also add lhrRuntimeError: true here because we want the error to bubble all the way to the top and show up in the LHR runtimeError entry and exit with a non-zero exit code when running the CLI.

@connorjclark's examples in #9245 show that things mostly go wrong but not completely, so we need to be extra loud to get the attention of users doing automated runs of LH that they're accidentally auditing a page they either didn't mean to or have misconfigured.

Since it'll be lhrRuntimeError: true (and so can also appear in the LHR that PSI serves), it'll also need to be added as an error entry in the proto.
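A rough sketch of what the flag is meant to achieve, under stated assumptions: the `ERRORS` table is a simplified stand-in for the definitions in lh-error.js (the real entry references UIStrings), and `exitCodeFor` is a hypothetical helper, not an actual CLI function. The specific exit code 1 is also an assumption; the comment above only asks for a non-zero one.

```javascript
// Simplified stand-in for an error definition with the runtime-error flag set.
const ERRORS = {
  INVALID_DOC_TYPE: {
    code: 'INVALID_DOC_TYPE',
    message: 'The page provided is not HTML',
    lhrRuntimeError: true,
  },
};

// Hypothetical CLI-side helper: runtime errors map to a non-zero exit code.
function exitCodeFor(errorDefinition) {
  return errorDefinition && errorDefinition.lhrRuntimeError ? 1 : 0;
}

console.log(exitCodeFor(ERRORS.INVALID_DOC_TYPE)); // 1
```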

Thanks, I'll make sure to fix this!

brendankenny · 2020-07-01T22:20:08Z

lighthouse-core/lib/lh-error.js

@@ -47,6 +47,8 @@ const UIStrings = {
  internalChromeError: 'An internal Chrome error occurred. Please restart Chrome and try re-running Lighthouse.',
  /** Error message explaining that fetching the resources of the webpage has taken longer than the maximum time. */
  requestContentTimeout: 'Fetching resource content has exceeded the allotted time',
+  /** Error message explaining that the webpage is non-HTML, so audits are ill-defined **/
+  docTypeInvalid: 'The page provided is not HTML',


maybe we should parameterize this message, so the user knows what to fix? Something like

Suggested change

docTypeInvalid: 'The page provided is not HTML',

docTypeInvalid: 'The page provided is not HTML (served as MIME type {mimeType}).',

I like that idea, my initial message had some mimeType information but wasn't as easy to read as this, I'll change it!
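To show what the parameterized string produces, here is a minimal sketch with a hypothetical `formatMessage` helper. Lighthouse's own i18n layer handles the substitution in practice and is more involved than this.

```javascript
// Hypothetical formatter: replaces {name} placeholders with values, leaving unknown ones as-is.
function formatMessage(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match
  );
}

const docTypeInvalid = 'The page provided is not HTML (served as MIME type {mimeType}).';

console.log(formatMessage(docTypeInvalid, {mimeType: 'application/xml'}));
// prints "The page provided is not HTML (served as MIME type application/xml)."
```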

lemcardenas · 2020-07-01T22:20:22Z

> Regardless, looks like this string can be whatever the server wants it to be.

Do you think that looking directly at Content-Type headers would help reduce some of the false negatives, or be better overall? When I first asked about choosing between the mimeType/Content-Type/extension approaches, Paul linked this code, which mentions //Use mime type from cached resource in case the one in response is empty. That originally suggested that mimeType was the safe option to use. However, some of the calling functions mention //MIME type is determined by HTTP Content-Type header. So from these snippets it sounds like Content-Type may be the stronger method.

In the original issue for this PR, Connor mentioned that

> Images, stylesheets, js, and such actually audit just fine (rather, the page Chrome builds to view the resource presents no issues for LH), so we don't have to exclude those.

Should we instead then have deny/allow lists to have more coverage?

paulirish · 2020-07-02T00:23:51Z

> So from these snippets it sounds like Content-Type may be the stronger method.

contenttype is provided in the response headers.
mimeType is provided by the browser and is based on contentType.

since the browser can choose a mimetype even if a contenttype isn't provided, we should just go with mimetype. it's the signal that's more in-tune with how the browser read the content.

let's keep this check simple: if the mimeType isn't text/html we have a problem.

that's it. :)

lemcardenas · 2020-07-02T00:27:15Z

> let's keep this check simple: if the mimeType isn't text/html we have a problem.
>
> that's it. :)

Thanks for the explanation! Sounds good to me :)

connorjclark · 2020-07-07T17:14:05Z

lighthouse-core/gather/gather-runner.js

+
+    // mimeType is determined by the browser, we assume Chrome is determining mimeType correctly,
+    // independently of 'Content-Type' response headers, and always sending mimeType if well-formed.
+    if (mainRecord.mimeType || mainRecord.mimeType === '') {


mainRecord.mimeType is initialized as an empty string and set from response.mimeType which apparently is always a string, never null or undefined. So I think this can be refactored to just:

if (!HTML_MIME_REGEX.test(mainRecord.mimeType)) { return new LHError(LHError.errors.NON_HTML, {mimeType: mainRecord.mimeType}); }

(removing the outer if statement)

thanks for clarifying, I was unsure if the string from the response could be overwritten in the browser with a non-string / null / undefined, I'll make sure to fix this!

connorjclark · 2020-07-07T17:14:50Z

lighthouse-core/lib/lh-error.js

+   * @description Error message explaining that the webpage is non-HTML, so audits are ill-defined.
+   * @example {application/xml} mimeType
+   * */
+  nonHtml: 'The page provided is not HTML (served as MIME type {mimeType}).',


nit: notHtml instead? and NOT_HTML

I kept the nonHtml name scheme from the function but this makes sense since the function isn't really in context here, I'll change this!

connorjclark · 2020-07-07T17:16:30Z

lighthouse-core/gather/gather-runner.js

+   * @return {LH.LighthouseError|undefined}
+   */
+  static getNonHtmlError(mainRecord) {
+    // MIME types are case-insenstive


there's no test for the case insensitivity

does chrome perhaps normalize this anyhow? you can hack the static-server.js to return any mime type you want and run yarn static-server to verify.

yeah I couldn't find information online as to whether chrome automatically makes mimetypes lowercase regardless of the http response, i'll check with the method you mentioned and let you know

ok update so I replaced all instances of 'text/html' with 'TEXT/HTML' in the static-server Content-Type header responses & confirmed in devtools that the Content-Type header showed up as 'TEXT/HTML'. When I ran a lighthouse run against a page hosted on static-server, I got no error and the mimeType accessible from mainRecord was always lowercase (text/html), so it seems that chrome automatically makes its mimeType lowercase!

So I am going to remove the case insensitivity modifier on the RegEx, and I think I'll replace it with a direct string comparison with a const 'text/html', since that's likely more efficient and seems sufficient based on the test above

nice! yeah a direct string comparison would be nice here. to be clear, the performance was never an issue (it's negligible), just the desire to keep things as complex as necessary (aka as simple as possible).

a comment that chrome normalizes the mime type would be good too

sounds good! i'll make sure to add the comment right now
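A minimal sketch of where this thread ends up, assuming the direct comparison and the requested comment. The constant name and the plain-object return value are hypothetical simplifications; the real function returns an LHError.

```javascript
// Chrome was observed (in the static-server experiment above) to report a lowercased
// mimeType even when the header said 'TEXT/HTML', so a plain comparison is enough.
const HTML_MIME_TYPE = 'text/html';

// Simplified shape: returns an error-like object instead of an LHError.
function getNonHtmlError(mainRecord) {
  // If there is no main record, let the other error checks handle it.
  if (!mainRecord) return undefined;

  if (mainRecord.mimeType !== HTML_MIME_TYPE) {
    return {code: 'NOT_HTML', mimeType: mainRecord.mimeType};
  }
  return undefined;
}

console.log(getNonHtmlError({mimeType: 'text/html'})); // undefined
console.log(getNonHtmlError({mimeType: 'application/xml'}).code); // NOT_HTML
```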

lemcardenas added 7 commits June 26, 2020 14:36

Added new docTypeError fxn and LHError

c00a579

starting testing code

09682f4

starting tests

eec825d

finished unit tests for getDocTypeError

a63f813

bugfixing

40bf826

integrated getDocTypeError into getPageLoadError

28cad19

removed console logging from debugging

6ac8d69

googlebot added the cla: yes label Jun 30, 2020

lemcardenas commented Jun 30, 2020

lemcardenas marked this pull request as ready for review June 30, 2020 22:56

lemcardenas requested a review from a team as a code owner June 30, 2020 22:56

lemcardenas requested review from Beytoven and removed request for a team June 30, 2020 22:56

lemcardenas linked an issue Jun 30, 2020 that may be closed by this pull request

Warn when trying to audit non-html #9245

Closed

devtools-bot assigned Beytoven Jun 30, 2020

devtools-bot added the waiting4reviewer label Jun 30, 2020

connorjclark requested changes Jun 30, 2020

lemcardenas added 2 commits July 1, 2020 09:52

changes from code review

fbc6380

more code review changes

ba1dedd

vercel bot deployed to Preview July 1, 2020 16:57

brendankenny reviewed Jul 1, 2020

lemcardenas added 2 commits July 1, 2020 17:35

committed new strings for testing

85ac35a

even more code review changes

4f1b321

vercel bot deployed to Preview July 2, 2020 01:03

Merge branch 'master' into error-nonHTML

6d36004

vercel bot deployed to Preview July 6, 2020 17:32

connorjclark requested changes Jul 7, 2020

lemcardenas added 3 commits July 7, 2020 15:33

changed naming from nonHtml to notHTML, removed an unnecessary if sta…

9dd709d

…tement in getNonHtmlError

added changed i18n strings

cd32b47

Merge branch 'error-nonHTML' of github.com:GoogleChrome/lighthouse in…

387649b

…to error-nonHTML

vercel bot deployed to Preview July 7, 2020 22:39

removed regex and added const str for comparison

8e3b371

vercel bot deployed to Preview July 8, 2020 00:13

included a comment about Chrome MIME type normalization

6f90641

vercel bot deployed to Preview July 8, 2020 00:18

connorjclark reviewed Jul 8, 2020

proto/lighthouse-result.proto (outdated)

connorjclark approved these changes Jul 8, 2020

Update proto/lighthouse-result.proto

8c3bd84

Co-authored-by: Connor Clark <cjamcl@google.com>

vercel bot deployed to Preview July 8, 2020 00:34

connorjclark unassigned Beytoven Jul 8, 2020

connorjclark added the land-when-ci-is-green label Jul 8, 2020

devtools-bot merged commit aed2a64 into master Jul 8, 2020

devtools-bot deleted the error-nonHTML branch July 8, 2020 00:42

devtools-bot removed the land-when-ci-is-green label Jul 8, 2020

brendankenny mentioned this pull request Sep 28, 2020

Work with "application/xhtml+xml" documents #11482

Closed

bradfrosty mentioned this pull request Oct 21, 2020

Incorrect main document used when retrieving page load errors #11585

Closed

bradfrosty mentioned this pull request Nov 2, 2020

core(gather-runner): use final document when reporting non-HTML error #11620

Merged
