new_audit(seo): JSON-LD validation (not included in UI for now) #5446

kdzwinel · 2018-06-07T16:02:48Z

Summary
Fixes #4359

~~Depends on #5377~~ (merged)

Preview:

Results from running validation against the main pages of the top 1500 domains (that had JSON-LD): https://gist.github.com/kdzwinel/9c9e209e3b1bb239e01920f4aec10108

Questions:

how to disclose in the audit name, or description, that we are only checking JSON-LD for now?
Where should "Learn more" link from the description point to?
For easier identification we wanted to expose main object from the JSON-LD, but we can't do that if it's not valid JSON-LD object - how about exposing a snippet (first 50 characters?) instead?

Notes:

this audit introduces two external dependencies (jsonld and jsonlint-mod) and two big json files with metadata
we already have audit with "structured-data" id (the manual one), so I had to go with a different id ("structured-data-automatic"). AFAIK we are not (yet?) removing the manual audit.

TODO:

AymenLoukil · 2018-06-07T17:51:28Z

sd-validation/schema.js

+// @ts-ignore
+const schemaStructure = new Map(require('./assets/schema_google'));
+const TYPE_KEYWORD = '@type';
+const SCHEMA_ORG_URL = 'http://schema.org/';


Both the http and https versions can be used when defining schema types, so cleanName may run into issues later on.

Shall we just normalize all http -> https?

good catch @AymenLoukil !
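One way to address the http/https concern is to strip either scheme when cleaning a type name. This is a minimal sketch, assuming a `cleanName` helper that takes a full schema.org URI and returns the bare name; the regular expression and function body are illustrative and are not the code from this PR.

```javascript
// Matches the schema.org prefix with either scheme.
// Assumption: no "www." variant needs to be handled.
const SCHEMA_ORG_URL_REGEX = /^https?:\/\/schema\.org\//;

/**
 * Strip the schema.org prefix so that http and https URIs compare equal.
 * @param {string} uri
 * @return {string}
 */
function cleanName(uri) {
  return uri.replace(SCHEMA_ORG_URL_REGEX, '');
}

console.log(cleanName('http://schema.org/Article'));
console.log(cleanName('https://schema.org/Article'));
```

Stripping the prefix avoids having to pick one canonical scheme; URIs from other vocabularies pass through unchanged.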

rviscomi · 2018-06-07T18:02:00Z

how to disclose in the audit name, or description, that we are only checking JSON-LD for now?

Maybe just add this sentence to the description: This audit is currently doing basic JSON-LD validation.

Where should "Learn more" link from the description point to?

We can omit the link until we have a page in the LH docs. cc @kaycebasques @ekharvey

For easier identification we wanted to expose main object from the JSON-LD, but we can't do that if it's not valid JSON-LD object - how about exposing a snippet (first 50 characters?) instead?

Sounds good.

Also, see my note in #4359 about removing the SDTT scraping from the audit for now until we can get a more reliable solution.

paulirish · 2018-06-11T19:13:13Z

lighthouse-core/audits/seo/structured-data-automatic.js

+      artifacts.JsonLD.map(async (jsonLD, idx) => {
+        const errors = await validateJsonLD(jsonLD);
+
+        errors.forEach(({message, path}) => {


an idea. wdyt about including all discovered top-level types in the results? I'm trying to debug why some of the errors are being flaky on http://www.lefigaro.fr/ and getting confirmation that it's actually finding the json+ld would be really nice.

👍 sounds like a very good idea, I'll add it

Will keep this in mind for the new result display logic.
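The two identification ideas in this thread (top-level types when the JSON parses, a short snippet when it does not) could be combined roughly as follows. This is a sketch only: `describeJsonLd`, its return shape, and the 50-character limit taken from the question above are assumptions, not the audit's actual implementation.

```javascript
const SNIPPET_LENGTH = 50;

/**
 * Describe a raw JSON-LD string for debugging output.
 * Returns the top-level @type values when the string parses,
 * otherwise the first characters of the raw string.
 * @param {string} rawJson
 * @return {{types: string[], snippet: string|null}}
 */
function describeJsonLd(rawJson) {
  const snippet = rawJson.trim().slice(0, SNIPPET_LENGTH);
  let parsed;
  try {
    parsed = JSON.parse(rawJson);
  } catch (err) {
    return {types: [], snippet};
  }

  // A JSON-LD document may be a single object or an array of objects.
  const items = Array.isArray(parsed) ? parsed : [parsed];
  const types = [];
  for (const item of items) {
    if (item === null || typeof item !== 'object') continue;
    const type = item['@type'];
    if (typeof type === 'string') {
      types.push(type);
    } else if (Array.isArray(type)) {
      types.push(...type.filter(t => typeof t === 'string'));
    }
  }
  // Fall back to the snippet when no type could be found.
  return {types, snippet: types.length > 0 ? null : snippet};
}
```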

paulirish · 2018-06-11T19:23:28Z

sd-validation/expand.js

+
+  jsonld.expand(inputObject, {
+    // custom loader prevents network calls and allows us to return local version of the schema.org document
+    documentLoader: (


extract this fn to a fn declaration? will make this call into .expand a bit easier to parse.
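A sketch of what the extracted loader might look like as a function declaration. Assumptions: the loader returns the "remote document" shape that jsonld.js expects (`contextUrl`, `documentUrl`, `document`), recent jsonld.js versions accept a promise-returning loader (the version used in this PR may have used a callback signature instead), and `localSchemaOrgContext` is a tiny stand-in for the bundled schema.org context file.

```javascript
const SCHEMA_ORG_HOST = 'schema.org';

// Stand-in for the bundled context asset; the real file is much larger.
const localSchemaOrgContext = {'@context': {'@vocab': 'http://schema.org/'}};

/**
 * Resolve a context URL to a local document without touching the network.
 * @param {string} schemaUrl
 * @return {{contextUrl: null, documentUrl: string, document: object}}
 */
function resolveLocalDocument(schemaUrl) {
  const {host} = new URL(schemaUrl);
  if (host !== SCHEMA_ORG_HOST) {
    throw new Error(`Unsupported context: ${schemaUrl}`);
  }
  return {contextUrl: null, documentUrl: schemaUrl, document: localSchemaOrgContext};
}

/**
 * Promise-based wrapper intended for the `documentLoader` option.
 * @param {string} schemaUrl
 */
async function documentLoader(schemaUrl) {
  return resolveLocalDocument(schemaUrl);
}

// Usage (requires the jsonld package, which is not loaded in this sketch):
//   const expanded = await jsonld.expand(inputObject, {documentLoader});
```

With the loader declared separately, the `.expand` call shrinks to a single line and the loader can be unit tested on its own.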

paulirish · 2018-06-11T19:24:53Z

sd-validation/expand.js

+  jsonld.expand(inputObject, {
+    // custom loader prevents network calls and allows us to return local version of the schema.org document
+    documentLoader: (
+        /** @type {string} **/url,


this url is the url of a schema? can we call it schemaUrl

Fixed by Konrad.

paulirish · 2018-06-11T19:26:22Z

sd-validation/jsonld.js

+
+const walkObject = require('./helpers/walkObject');
+
+const CONTEXT = '@context';


oops, good catch, I used it for an additional check that was here before but I ended up removing

Fixed by Konrad.

paulirish · 2018-06-11T19:26:52Z

sd-validation/jsonld.js

+const walkObject = require('./helpers/walkObject');
+
+const CONTEXT = '@context';
+const KEYWORDS = [


add comment for where this list is specified? aka how we maintain it

https://json-ld.org/spec/latest/json-ld/#syntax-tokens-and-keywords ?

Good call, I'll leave a comment and yeah you got the right link 👍

Fixed by Konrad.
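For context, a keyword check along these lines might look like the sketch below. The list is the JSON-LD 1.0 keyword set from the "Syntax Tokens and Keywords" section linked above (later drafts add more); `findUnknownKeywords` and the slash-joined path format are assumptions, while the `validateKey` name and the "Unknown keyword" message appear elsewhere in this thread.

```javascript
// Source: "Syntax Tokens and Keywords" in the JSON-LD spec (1.0 set).
// Update this list when the spec adds keywords.
const KEYWORDS = new Set([
  '@base', '@container', '@context', '@graph', '@id', '@index', '@language',
  '@list', '@reverse', '@set', '@type', '@value', '@vocab',
]);

/**
 * @param {string} name
 * @return {string|null} error message, or null when the key is fine
 */
function validateKey(name) {
  if (name.startsWith('@') && !KEYWORDS.has(name)) {
    return 'Unknown keyword';
  }
  return null;
}

/**
 * Walk a parsed JSON object and collect unknown "@" keys with their paths.
 * @param {*} value
 * @param {string[]} path
 * @return {Array<{path: string, message: string}>}
 */
function findUnknownKeywords(value, path = []) {
  if (value === null || typeof value !== 'object') return [];
  const errors = [];
  for (const [key, child] of Object.entries(value)) {
    const childPath = [...path, key];
    const message = validateKey(key);
    if (message) errors.push({path: childPath.join('/'), message});
    errors.push(...findUnknownKeywords(child, childPath));
  }
  return errors;
}
```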

paulirish · 2018-06-11T19:44:56Z

sd-validation/test/shema-org-validation-test.js

+      "publisher": "Cat Magazine"
+    }`);
+
+    assert.equal(errors.length, 2);


whats the other one?

Fixed by Konrad, only one error now.

paulirish · 2018-06-11T19:45:37Z

sd-validation/test/shema-org-validation-test.js

+      "image": "https://cats.rock/cat.bmp",
+      "publisher": "Cat Magazine",
+      "mainEntityOfPage": "https://cats.rock/magazine.html",
+      "controversial": true


VERY much so. 😡

🤣

paulirish · 2018-06-11T19:46:42Z

sd-validation/test/shema-org-validation-test.js

+    assert.equal(errors[0].message, 'Unexpected property "controversial"');
+  });
+
+  it('passes if non-schema.org context', async () => {


feels like at the audit-level we'd consider this situation to be equivalent to zero json-ld on the page === not-applicable. right?

In this case, only this one validator is not applicable, the whole audit still makes sense because we have json, json-ld and expansion validation.

paulirish · 2018-06-11T19:52:34Z

sd-validation/index.js

+    return errors;
+  }
+
+  // STEP 3: EXPAND


what exactly does this expansion represent? i never really understood that.

As presented in the meeting, it gives us a normalized form of the JSON-LD

https://json-ld.org/spec/latest/json-ld-api/#expansion
https://json-ld.org/playground/
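To make "normalized form" concrete, here is a deliberately tiny illustration of what expansion does to a flat object: the `@context` is dropped, every key and type becomes a full IRI, and values are wrapped in arrays. `toyExpand` is a hypothetical toy that only handles flat objects with string values and a single vocabulary; the real algorithm in the jsonld package covers far more (nesting, type coercion, multiple contexts).

```javascript
/**
 * Toy stand-in for JSON-LD expansion. Not the real algorithm.
 * @param {Object<string, string>} doc flat JSON-LD object
 * @param {string} vocab vocabulary IRI to prefix terms with
 * @return {Array<object>}
 */
function toyExpand(doc, vocab) {
  const expanded = {};
  for (const [key, value] of Object.entries(doc)) {
    if (key === '@context') continue; // expanded form carries no context
    if (key === '@type') {
      expanded['@type'] = [vocab + value];
      continue;
    }
    expanded[vocab + key] = [{'@value': value}];
  }
  return [expanded];
}

const compact = {
  '@context': 'http://schema.org',
  '@type': 'Article',
  'author': 'Jane',
};
console.log(JSON.stringify(toyExpand(compact, 'http://schema.org/'), null, 2));
```

Because every term is spelled out in full, a later validator can look up `http://schema.org/author` directly without having to interpret the context itself.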

paulirish · 2018-06-11T19:53:16Z

sd-validation/index.js

+    expandedObj = await promiseExpand(inputObject);
+  } catch (error) {
+    errors.push({
+      validator: 'json-ld',


should this be the same validator? i'm fine with 'yes', just asking..

it might be worth differentiating between those two, good call (note that this field is not used by LH - I added it because sd-validation aspires to be a separate package, and this info can be useful for someone else)

Done already. 🙂

paulirish

great job on this. this is a big area and it feels quite approachable due to how you structured the problem here. nice work!

paulirish · 2018-06-11T19:56:09Z

lighthouse-core/audits/seo/structured-data-automatic.js

+
+    const headings = [
+      {key: 'idx', itemType: 'text', text: 'Index'},
+      {key: 'path', itemType: 'text', text: 'Line/Path'},


i'm wondering if we should also have a CSS selector to help point to where this was slurped up. i guess it'd be just script[type="application/ld+json" i] with a :nth-child(N).. okay maybe we just communicate that in the description/docs.

yeah, all of those json-ld's are scripts, and all of them are in <head>, so it's hard to provide useful debugging info. IMO your idea with exposing top-level type will be great here (although we can't extract top-level type if json/json-ld/expansion fails :( )

kdzwinel

review comments addressed (thank you Paul, these were great!)
validation of object properties recommended/required by SDTT removed (now we are using only schema.org data)
audit description adjusted

kdzwinel · 2018-06-20T23:16:33Z

sd-validation/jsonld.js

+
+const walkObject = require('./helpers/walkObject');
+
+const CONTEXT = '@context';


oops, good catch, I used it for an additional check that was here before but I ended up removing

kdzwinel · 2018-06-20T23:18:11Z

sd-validation/jsonld.js

+const walkObject = require('./helpers/walkObject');
+
+const CONTEXT = '@context';
+const KEYWORDS = [


Good call, I'll leave a comment and yeah you got the right link 👍

kdzwinel · 2018-06-20T23:21:25Z

sd-validation/jsonld.js

+ * @param {string} name
+ * @returns {string | null} error
+ */
+function validateField(name) {


👍 I changed it to "validateKey"

kdzwinel · 2018-06-20T23:22:43Z

sd-validation/schema.js

+// @ts-ignore
+const schemaStructure = new Map(require('./assets/schema_google'));
+const TYPE_KEYWORD = '@type';
+const SCHEMA_ORG_URL = 'http://schema.org/';


good catch @AymenLoukil !

kdzwinel · 2018-06-20T23:34:14Z

sd-validation/schema.js

+
+  const cleanKeys = keys
+    // skip JSON-LD keywords
+    .filter(key => key.indexOf('@') !== 0)


we could, but then all invalid keys starting with '@' would not be removed here, thus producing an Unexpected property "${key}" error.

Since these keys were already caught by the previous validator (json-ld) and reported ('Unknown keyword') IMO there is no need to report them again.
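The reasoning above can be illustrated with a small sketch of the property check. `KNOWN_PROPERTIES` is a hypothetical stand-in for the generated schema.org data and `findUnexpectedProperties` is an illustrative name; only the `@` filter and the error message formats come from this PR.

```javascript
// Hypothetical, hand-written subset of the schema.org type/property data.
const KNOWN_PROPERTIES = new Map([
  ['Article', new Set(['author', 'image', 'publisher', 'mainEntityOfPage'])],
]);

/**
 * Report keys that are not valid properties of the given type.
 * Keys starting with "@" are skipped on purpose: the JSON-LD keyword
 * validator runs earlier and has already reported the invalid ones.
 * @param {string} type
 * @param {string[]} keys
 * @return {string[]} error messages
 */
function findUnexpectedProperties(type, keys) {
  const known = KNOWN_PROPERTIES.get(type);
  if (!known) return [`Unrecognized schema.org type ${type}`];
  return keys
    // skip JSON-LD keywords
    .filter(key => key.indexOf('@') !== 0)
    .filter(key => !known.has(key))
    .map(key => `Unexpected property "${key}"`);
}
```

With this filter in place, a key such as `@typo` yields exactly one error (from the keyword validator) instead of two.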

kdzwinel · 2018-06-20T23:43:45Z

sd-validation/test/shema-org-validation-test.js

+    assert.equal(errors[0].message, 'Unexpected property "controversial"');
+  });
+
+  it('passes if non-schema.org context', async () => {


In this case, only this one validator is not applicable, the whole audit still makes sense because we have json, json-ld and expansion validation.

kdzwinel · 2018-06-20T23:45:49Z

lighthouse-core/audits/seo/structured-data-automatic.js

+      artifacts.JsonLD.map(async (jsonLD, idx) => {
+        const errors = await validateJsonLD(jsonLD);
+
+        errors.forEach(({message, path}) => {


👍 sounds like a very good idea, I'll add it

kdzwinel · 2018-06-20T23:47:47Z

lighthouse-core/audits/seo/structured-data-automatic.js

+
+    const headings = [
+      {key: 'idx', itemType: 'text', text: 'Index'},
+      {key: 'path', itemType: 'text', text: 'Line/Path'},


yeah, all of those json-ld's are scripts, and all of them are in <head>, so it's hard to provide useful debugging info. IMO your idea with exposing top-level type will be great here (although we can't extract top-level type if json/json-ld/expansion fails :( )

kdzwinel · 2018-06-20T23:49:01Z

sd-validation/index.js

+    return errors;
+  }
+
+  // STEP 3: EXPAND


As presented in the meeting, it gives us a normalized form of the JSON-LD

https://json-ld.org/spec/latest/json-ld-api/#expansion
https://json-ld.org/playground/

kdzwinel · 2018-06-20T23:51:48Z

sd-validation/index.js

+    expandedObj = await promiseExpand(inputObject);
+  } catch (error) {
+    errors.push({
+      validator: 'json-ld',


it might be worth differentiating between those two, good call (note that this field is not used by LH - I added it because sd-validation aspires to be a separate package, and this info can be useful for someone else)

paulirish · 2018-06-22T12:25:30Z

sd-validation/assets/generate-schema-tree.js

+ */
+
+/**
+  * This script can be used to generate schema-tree.json file


should this live in a /scripts/ folder?

How is it used? (maybe something for the readme? though i can't tell..)

Good call, I moved it and changed it to a nodejs script that fetches the schema.org jsonld and writes result to assets/.

paulirish · 2018-06-22T12:28:42Z

sd-validation/test/schema-org-validation-test.js

@@ -31,31 +31,6 @@ describe('schema.org validation', () => {
    assert.equal(errors[0].message, 'Unrecognized schema.org type http://schema.org/Dog');
  });

-  it('reports missing required fields', async () => {


i don't totally grok whats happening here swapping out the google schema to the other one but...

i'm kinda surprised that 'required' fields aren't a thing anymore. I guess schema.org just doesn't have an opinion on what's required?

Yeah schema.org doesn't care, but Google requires some properties to enable rich snippets.

Will bring it up with @rviscomi if we want to add extra validation for this in the future.

paulirish · 2018-06-22T12:30:15Z

sd-validation/expand.js

@@ -11,6 +11,33 @@ const jsonld = require('jsonld');
 const schemaOrgContext = require('./assets/jsonldcontext');
 const SCHEMA_ORG_HOST = 'schema.org';

+/**
+ * Custom loader that prevents network calls and allows us to return local version of the


paulirish · 2018-06-22T12:35:15Z

sd-validation/assets/generate-schema-tree.js

+ */
+'use strict';
+
+// load data from https://github.com/schemaorg/schemaorg/blob/master/data/releases/3.4/schema.jsonld


Via https://schema.org/docs/developers.html#defs it looks like this URL will redirect and work: https://schema.org/version/latest/schema.jsonld

mattzeunert · 2018-11-24T21:08:33Z

This is a bit trickier than I thought because there are three different forms that the JSON-LD takes:

Raw JSON string on the page
parsed JSON object
expanded JSON-LD (regularized form without @context)

There are 4 types of failures:

JSON validation, gives line number in raw JSON string
JSON-LD keyword validation, gives path in parsed JSON object
Expansion errors in the jsonld module, does not provide location information
schema.org validation, gives path in expanded JSON-LD object
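The three forms and four failure types can be pictured as a staged pipeline in which each stage works on a different representation and stops at the first stage that reports errors. The sketch below is synchronous and takes the later stages as injected functions so that it stays self-contained; the function name, the stage names other than `'json-ld'`, and the error shapes are assumptions. Plain `JSON.parse` does not report a line number, which is why the PR pulls in a separate JSON linter for that step.

```javascript
/**
 * Staged validation sketch: raw string -> parsed object -> expanded form.
 * @param {string} rawJson
 * @param {{checkKeywords: Function, expand: Function, checkSchema: Function}} stages
 * @return {Array<object>} errors tagged with the stage that produced them
 */
function validateJsonLdSketch(rawJson, {checkKeywords, expand, checkSchema}) {
  // 1. JSON validation works on the raw string.
  let parsed;
  try {
    parsed = JSON.parse(rawJson);
  } catch (err) {
    return [{validator: 'json', message: err.message, line: null}];
  }

  // 2. Keyword validation reports paths in the parsed object.
  const keywordErrors = checkKeywords(parsed);
  if (keywordErrors.length > 0) {
    return keywordErrors.map(error => ({validator: 'json-ld', ...error}));
  }

  // 3. Expansion errors carry no location information.
  let expanded;
  try {
    expanded = expand(parsed);
  } catch (err) {
    return [{validator: 'json-ld-expand', message: err.message}];
  }

  // 4. schema.org validation reports paths in the expanded object.
  return checkSchema(expanded).map(error => ({validator: 'schema-org', ...error}));
}
```

Since each stage reports locations in a different representation, mapping errors back to lines in what the user wrote needs extra work per stage.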

Current status:

Notes on the errors in the screenshot:

1. JSON-LD keyword validation error, shows the stringified JSON (so it has a different key order from the raw string)
2. JSON parse failure, shows the raw string
3. Expansion error, should not have a highlight because we don't know the line
4. schema.org failure, shows stringified JSON
5. same as 4
6. schema.org failure, but it uses a full resource identifier (http://schema.org/author) instead of relying on the context. I need to work on this some more to sort out the line mapping from the expanded form to the input object here.

Instead of using a JSON parser I overwrite the property value at the given path with a random key, and then use the line number of the random key in the stringified JSON.
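The marker trick described above can be sketched like this. `getLineNumberForPath`, the array-based path format, and the two-space indentation are assumptions; the returned line refers to the stringified JSON, not to the raw string on the page.

```javascript
/**
 * Find the 1-based line of the value at `path` in the pretty-printed JSON.
 * @param {object} obj parsed JSON
 * @param {Array<string|number>} path keys leading to the value
 * @return {number|null} line number, or null if the path does not exist
 */
function getLineNumberForPath(obj, path) {
  if (path.length === 0) return 1; // the root starts on the first line

  // Work on a copy so the caller's object is left untouched.
  const clone = JSON.parse(JSON.stringify(obj));
  let parent = clone;
  for (const key of path.slice(0, -1)) {
    if (parent === null || typeof parent !== 'object' || !(key in parent)) return null;
    parent = parent[key];
  }
  const lastKey = path[path.length - 1];
  if (parent === null || typeof parent !== 'object' || !(lastKey in parent)) return null;

  // Overwrite the value with a random marker, then search for it line by line.
  const marker = `__line_marker_${Math.random().toString(36).slice(2)}__`;
  parent[lastKey] = marker;
  const lines = JSON.stringify(clone, null, 2).split('\n');
  const index = lines.findIndex(line => line.includes(marker));
  return index === -1 ? null : index + 1;
}

const example = {'@context': 'http://schema.org', '@type': 'Article', 'author': {'name': 'Jane'}};
console.log(getLineNumberForPath(example, ['author', 'name']));
```

This avoids needing a position-aware JSON parser, at the cost of one extra stringify per lookup.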

patrickhulce · 2018-11-27T23:37:09Z

Just throwing this out there, what do we think about splitting up this PR and trying to land some pieces of it like the standalone validation folder?

I think Matt's got a decent hold on the usage of everything, and it seems like most of the changes will likely be in core now, is that right @mattzeunert?

mattzeunert · 2018-11-28T09:26:34Z

@patrickhulce There'll still be a few changes in the validator to get it to provide line numbers. Other than that it's just adding new rendering logic.

patrickhulce · 2018-11-28T15:27:27Z

There'll still be a few changes in the validator to get it to provide line numbers.

Ah, ok gotcha. Well maybe once the API solidifies for those we can try to break it up?

mattzeunert · 2018-11-28T15:41:48Z

Sure – is the main purpose to simplify the review?

patrickhulce · 2018-11-28T15:43:45Z

Sure – is the main purpose to simplify the review?

Yeah and ideally land pieces earlier so there's less that needs to keep being rebased, etc. I know most of it is generated files, but 13k LOC is a doozy :)

Easier to tell if things are missing when each PR is focused.

mattzeunert · 2018-11-29T16:27:39Z

@patrickhulce @rviscomi Is there any reason not to merge the PR without the new result rendering logic? Especially since it's already 90% reviewed.

Update: seems more like 98% reviewed 🙂. There are 4 small non-merge commits from me on the branch. I think we should get this merged, and possibly disable the audit for now if we don't want it to get released.

Update2: The plan is to disable the audit for now and merge this PR.

mattzeunert · 2018-11-29T20:35:12Z

lighthouse-core/config/default-config.js

@@ -479,7 +479,7 @@ const defaultConfig = {
        {id: 'canonical', weight: 1, group: 'seo-content'},
        {id: 'font-size', weight: 1, group: 'seo-mobile'},
        {id: 'plugins', weight: 1, group: 'seo-content'},
-        {id: 'structured-data-automatic', weight: 1, group: 'seo-content'},
+        // {id: 'structured-data-automatic', weight: 1, group: 'seo-content'},


Maybe there's a better way to disable the audit for now than what I'm doing in this commit?

mattzeunert · 2018-12-05T18:03:49Z

@patrickhulce Removed the audit from the SEO category like you suggested. Much nicer solution than commenting out all the tests!
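The difference between the two approaches is that an audit can stay registered (and tested) while simply not being referenced by any category. The helper below is a sketch of that idea on a plain object; the `categories` / `auditRefs` config shape and the `removeAuditFromCategory` name are assumptions for illustration, not the change made in this PR.

```javascript
/**
 * Return a copy of the config where `auditId` is no longer referenced
 * by the given category. The audit itself stays in `config.audits`.
 * @param {object} config
 * @param {string} categoryId
 * @param {string} auditId
 * @return {object}
 */
function removeAuditFromCategory(config, categoryId, auditId) {
  const category = config.categories[categoryId];
  return {
    ...config,
    categories: {
      ...config.categories,
      [categoryId]: {
        ...category,
        auditRefs: category.auditRefs.filter(ref => ref.id !== auditId),
      },
    },
  };
}

// Hypothetical, heavily trimmed config.
const config = {
  audits: ['seo/canonical', 'seo/structured-data-automatic'],
  categories: {
    seo: {
      auditRefs: [
        {id: 'canonical', weight: 1, group: 'seo-content'},
        {id: 'structured-data-automatic', weight: 1, group: 'seo-content'},
      ],
    },
  },
};
console.log(removeAuditFromCategory(config, 'seo', 'structured-data-automatic').categories.seo.auditRefs);
```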

I'm guessing AppVeyor is just flaky?

mattzeunert · 2018-12-15T19:43:52Z

Been dabbling around with the rendering logic a bit more.

Can we remove the "Show all" button and just show all when the user clicks on the snippet? We can indicate that it's clickable when the user hovers over it – but maybe that's not discoverable enough?

Should there be some kind of title for each JSON-LD item? Or at least a clearer separation.

@rviscomi Do you have a strong concept of what we're going for? Should we ask a designer for help? Or just keep iterating ourselves?

Different cases we need to handle:

error with no line number
one or more errors on specific lines
maybe: show JSON-LD snippets without any failures, doesn't need to show full JSON but maybe just the top level @type value (@paulirish suggested this so that the user knows that LH picked up the snippet and didn't find anything wrong with it)

rviscomi · 2018-12-17T18:03:35Z

@rviscomi Do you have a strong concept of what we're going for? Should we ask a designer for help? Or just keep iterating ourselves?

It's inspired by the GitHub code diff UI, eg:

@paulirish do you know if there's anything else we need to do to make sure this audit's results render properly on downstream services like web.dev?

Can we remove the "Show all" button and just show all when the user clicks on the snippet? We can indicate that it's clickable when the user hovers over it – but maybe that's not discoverable enough?

Yeah I'm not sure if it'll be obvious that it's clickable.

Should there be some kind of title for each JSON-LD item? Or at least a clearer separation.

Similar to the GitHub example, instead of a file name could we show the DOM address of the script block? When in devtools and clicked it should reveal it in the Elements panel.

Different cases we need to handle:
error with no line number

Let's put any of these kinds of errors on line 1.

one or more errors on specific lines

If there's 1 error, show that error. If there are 2 or more errors on the same line, show "X errors:" then an unordered list with each error.

maybe: show JSON-LD snippets without any failures, doesn't need to show full JSON but maybe just the top level @type value (@paulirish suggested this so that the user knows that LH picked up the snippet and didn't find anything wrong with it)

+1. The type may not always be descriptive (or exist at all) so maybe just show the first ~10 lines with the "Show all" expander.
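The display rules proposed in this comment (errors without a line go on line 1; a single error is shown as-is; several errors on one line get an "X errors:" heading plus a list) could be prototyped as a grouping step before rendering. `groupErrorsByLine` and its output shape are hypothetical.

```javascript
/**
 * Group validation errors by line for rendering.
 * @param {Array<{line?: number|null, message: string}>} errors
 * @return {Array<{line: number, title: string, items: string[]}>}
 */
function groupErrorsByLine(errors) {
  const byLine = new Map();
  for (const {line, message} of errors) {
    // Errors with no known location are attached to the first line.
    const lineNumber = typeof line === 'number' ? line : 1;
    if (!byLine.has(lineNumber)) byLine.set(lineNumber, []);
    byLine.get(lineNumber).push(message);
  }
  return [...byLine.entries()]
    .sort(([a], [b]) => a - b)
    .map(([line, messages]) => ({
      line,
      // One error: show it directly. Several: a count plus a list.
      title: messages.length === 1 ? messages[0] : `${messages.length} errors:`,
      items: messages.length === 1 ? [] : messages,
    }));
}
```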

mattzeunert · 2018-12-20T16:26:42Z

Played around some more:

brendankenny · 2019-04-08T23:12:26Z

Closing in favor of #4359 and upcoming PRs.

We'll keep the branch around in case someone needs it <3

kdzwinel requested review from brendankenny, patrickhulce and paulirish as code owners June 7, 2018 16:02

AymenLoukil reviewed Jun 7, 2018

paulirish reviewed Jun 11, 2018

kdzwinel force-pushed the json-ld branch from 52e8089 to 789d7b8 Compare June 20, 2018 22:54

kdzwinel commented Jun 21, 2018

paulirish reviewed Jun 22, 2018

paulirish mentioned this pull request Jun 23, 2018

FR: Allow external API calls #5547

Closed

patrickhulce and others added 18 commits June 28, 2018 16:28

extension: allow use of ES2018 features

e978813

force acorn

cc4471f

windows compat

9f2883d

update comments and such

70c53a5

First version.

bcfd8dc

First version - missing files.

5f0caef

Clena up sd-validation (license, formatting, jsdoc, etc)

146b2dc

rename, String -> string

e9a6894

Make typescript happy

6839dd0

update jsonld

769f5ba

sd-validation tests

84c2ac1

Adjust JSON parsing validation errors

8a65918

Adjust table columns

06256c9

Fix that bad merge!

e64f7c9

Fix that lock?

aa8ae16

Address PR review comments

b1a06f2

Do not take Google recommendations into account for now

2b1098c

Update description.

68a569d

mattzeunert force-pushed the json-ld branch from aec8ee2 to 7a07a28 Compare November 29, 2018 16:07

Matt Zeunert added 3 commits November 29, 2018 16:38

Merge branch 'master' into HEAD

d82cf98

Fix install rdf-canonize in CI

e46cf44

Fix type check

b42a6a6

mattzeunert force-pushed the json-ld branch from 7068ff1 to b42a6a6 Compare November 29, 2018 16:41

Matt Zeunert added 2 commits November 29, 2018 16:54

Ignore JSONLD gatherer error

c45afc0

Move document loader to function declaration

643f1b9

mattzeunert force-pushed the json-ld branch 3 times, most recently from 14f4b03 to 910da3b Compare November 29, 2018 19:59

mattzeunert reviewed Nov 29, 2018

Matt Zeunert and others added 2 commits December 5, 2018 16:58

Reinstate yarn.lock integrity key stripped by old yarn version

d9ff2bf

Don't include automated structured data audit as part of SEO audits

50bbaf3

mattzeunert force-pushed the json-ld branch from 910da3b to 50bbaf3 Compare December 5, 2018 17:41

mattzeunert changed the title ~~[WIP] JSON-LD validation~~ (new_audit):JSON-LD validation (not included in UI for now) Dec 5, 2018

mattzeunert changed the title ~~(new_audit):JSON-LD validation (not included in UI for now)~~ new_audit(seo): JSON-LD validation (not included in UI for now) Dec 5, 2018

patrickhulce mentioned this pull request Dec 8, 2018

core(jsonld): add structured data validation #6750

Merged

brendankenny closed this Apr 8, 2019

