
new_audit(seo): JSON-LD validation (not included in UI for now) #5446

Closed
wants to merge 35 commits into from

kdzwinel (Collaborator) commented Jun 7, 2018

Summary
Fixes #4359

Depends on #5377 (merged)

Preview:

jsonld-validation

Results from running validation against the main pages of the top 1500 domains (that had JSON-LD): https://gist.github.com/kdzwinel/9c9e209e3b1bb239e01920f4aec10108

Questions:

  • how to disclose in the audit name, or description, that we are only checking JSON-LD for now?
  • Where should "Learn more" link from the description point to?
  • For easier identification we wanted to expose main object from the JSON-LD, but we can't do that if it's not valid JSON-LD object - how about exposing a snippet (first 50 characters?) instead?

Notes:

  • this audit introduces two external dependencies (jsonld and jsonlint-mod) and two big JSON files with metadata
  • we already have an audit with the "structured-data" id (the manual one), so I had to go with a different id ("structured-data-automatic"). AFAIK we are not (yet?) removing the manual audit.

TODO:

  • json validation
  • json-ld validation
  • schema.org object validation
  • required/recommended fields validation
  • unit tests
  • if there are only warnings (e.g. only "recommended" fields are missing from the JSON-LD object) - don't fail the audit, mark it as a warning
  • expose failing node in the first column
  • expose main object, or the snippet of the JSON-LD file, for easier identification
  • smoke test
  • improve report UI (make it easier for users to locate errors in their code) ==> will do in separate PR

// @ts-ignore
const schemaStructure = new Map(require('./assets/schema_google'));
const TYPE_KEYWORD = '@type';
const SCHEMA_ORG_URL = 'http://schema.org/';

Both the http and https versions can be used when defining schema types, so we risk running into an issue later in cleanName.

Member

shall we just normalize all http -> https

Collaborator Author

good catch @AymenLoukil !
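
One way to handle both protocols, as a minimal sketch: strip the schema.org prefix with a protocol-agnostic pattern. The body of cleanName is not shown in this thread, so the implementation and the SCHEMA_ORG_URL_REGEX name below are assumptions made for illustration.

```javascript
// Hypothetical sketch: the regex name and the body of cleanName are
// assumptions, not the code from this PR.
const SCHEMA_ORG_URL_REGEX = /^https?:\/\/schema\.org\//;

/**
 * Strips the schema.org prefix from a type or property URI,
 * whether it was written with http or https.
 * @param {string} uri
 * @return {string}
 */
function cleanName(uri) {
  return uri.replace(SCHEMA_ORG_URL_REGEX, '');
}
```

URIs from other vocabularies are returned unchanged, so a later lookup can still tell them apart from schema.org names.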

rviscomi (Member) commented Jun 7, 2018

> how to disclose in the audit name, or description, that we are only checking JSON-LD for now?

Maybe just add this sentence to the description: "This audit is currently doing basic JSON-LD validation."

> Where should "Learn more" link from the description point to?

We can omit the link until we have a page in the LH docs. cc @kaycebasques @ekharvey

> For easier identification we wanted to expose main object from the JSON-LD, but we can't do that if it's not valid JSON-LD object - how about exposing a snippet (first 50 characters?) instead?

Sounds good.

Also, see my note in #4359 about removing the SDTT scraping from the audit for now until we can get a more reliable solution.

artifacts.JsonLD.map(async (jsonLD, idx) => {
const errors = await validateJsonLD(jsonLD);

errors.forEach(({message, path}) => {
Member

an idea. wdyt about including all discovered top-level types in the results? I'm trying to debug why some of the errors are being flaky on http://www.lefigaro.fr/ and getting confirmation that it's actually finding the json+ld would be really nice.

Collaborator Author

👍 sounds like a very good idea, I'll add it

Collaborator

Will keep this in mind for the new result display logic.
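
A minimal sketch of how the top-level types could be collected from an already parsed JSON-LD value. The function name getTopLevelTypes is hypothetical and this is not the code that ended up in the audit; it only reads @type from the top-level node or nodes.

```javascript
/**
 * Hypothetical helper: collects the @type values of the top-level node(s)
 * of a parsed JSON-LD value. Nested nodes are intentionally ignored.
 * @param {*} jsonLd parsed JSON (object, array of objects, or anything else)
 * @return {string[]}
 */
function getTopLevelTypes(jsonLd) {
  const nodes = Array.isArray(jsonLd) ? jsonLd : [jsonLd];
  const types = [];
  for (const node of nodes) {
    if (node === null || typeof node !== 'object') continue;
    const type = node['@type'];
    if (typeof type === 'string') {
      types.push(type);
    } else if (Array.isArray(type)) {
      // @type may list several types for one node.
      types.push(...type.filter(t => typeof t === 'string'));
    }
  }
  return types;
}
```

Because it works on the parsed object, it cannot report anything when JSON parsing itself fails, which is the limitation discussed further down in this conversation.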


jsonld.expand(inputObject, {
// custom loader prevents network calls and alows us to return local version of the schema.org document
documentLoader: (
Member

extract this fn to a fn declaration? will make this call into .expand a bit easier to parse.

Collaborator

Fixed.

jsonld.expand(inputObject, {
// custom loader prevents network calls and alows us to return local version of the schema.org document
documentLoader: (
/** @type {string} **/url,
Member

this url is the url of a schema? can we call it schemaUrl

Collaborator

Fixed by Konrad.
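
For reference, a sketch of what the extracted loader could look like as a function declaration with the parameter renamed to schemaUrl. The stand-in context object, the error message, and the resolveLocalDocument helper are assumptions; the returned object follows the remote-document shape (contextUrl, document, documentUrl) that jsonld document loaders are expected to produce.

```javascript
const SCHEMA_ORG_HOST = 'schema.org';
// Stand-in for the locally bundled schema.org context (assumption: the real
// asset is a much larger file).
const LOCAL_SCHEMA_ORG_CONTEXT = {'@context': {'@vocab': 'http://schema.org/'}};

/**
 * Hypothetical synchronous helper: resolves a context URL to the local copy,
 * refusing anything that is not hosted on schema.org.
 * @param {string} schemaUrl
 */
function resolveLocalDocument(schemaUrl) {
  const {hostname} = new URL(schemaUrl);
  if (hostname !== SCHEMA_ORG_HOST) {
    // Assumed message; no network request is ever made for other hosts.
    throw new Error(`Unsupported context URL: ${schemaUrl}`);
  }
  return {contextUrl: null, document: LOCAL_SCHEMA_ORG_CONTEXT, documentUrl: schemaUrl};
}

/**
 * Custom loader that prevents network calls and returns the local version
 * of the schema.org context.
 * @param {string} schemaUrl
 */
async function documentLoader(schemaUrl) {
  return resolveLocalDocument(schemaUrl);
}
```

With the loader declared separately, the call site shrinks to something like jsonld.expand(inputObject, {documentLoader}).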


const walkObject = require('./helpers/walkObject');

const CONTEXT = '@context';
Member

why tho?

Collaborator Author

oops, good catch, I used it for an additional check that was here before but I ended up removing

Collaborator

Fixed by Konrad.

const walkObject = require('./helpers/walkObject');

const CONTEXT = '@context';
const KEYWORDS = [
Member

add comment for where this list is specified? aka how we maintain it

https://json-ld.org/spec/latest/json-ld/#syntax-tokens-and-keywords ?

Collaborator Author

Good call, I'll leave a comment and yeah you got the right link 👍

Collaborator

Fixed by Konrad.
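
A sketch of how the documented keyword list and the key check might fit together. The list below is the JSON-LD 1.0 keyword set; treating 1.0 as the target version is an assumption, and the body of validateKey is illustrative (only its name and the 'Unknown keyword' message appear in this conversation).

```javascript
// Keywords defined by JSON-LD 1.0, see
// https://json-ld.org/spec/latest/json-ld/#syntax-tokens-and-keywords
// Assumption: later spec versions add more keywords that are not listed here.
const KEYWORDS = [
  '@base',
  '@container',
  '@context',
  '@graph',
  '@id',
  '@index',
  '@language',
  '@list',
  '@reverse',
  '@set',
  '@type',
  '@value',
  '@vocab',
];

/**
 * @param {string} name key found in the parsed JSON-LD object
 * @return {string | null} error message, or null when the key is acceptable
 */
function validateKey(name) {
  if (name.startsWith('@') && !KEYWORDS.includes(name)) {
    return 'Unknown keyword';
  }
  return null;
}
```

Keys that do not start with '@' are left for the later schema.org validation step.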

"publisher": "Cat Magazine"
}`);

assert.equal(errors.length, 2);
Member

whats the other one?

Collaborator

Fixed by Konrad, only one error now.

"image": "https://cats.rock/cat.bmp",
"publisher": "Cat Magazine",
"mainEntityOfPage": "https://cats.rock/magazine.html",
"controversial": true
Member

VERY much so. 😡

🤣

assert.equal(errors[0].message, 'Unexpected property "controversial"');
});

it('passes if non-schema.org context', async () => {
Member

feels like at the audit-level we'd consider this situation to be equivalent to zero json-ld on the page === not-applicable. right?

Collaborator Author

In this case, only this one validator is not applicable, the whole audit still makes sense because we have json, json-ld and expansion validation.

Member

true!

return errors;
}

// STEP 3: EXPAND
Member

what exactly does this expansion represent? i never really understood that.

Collaborator Author

as presented at the meeting - it gives us a normalized form of the json-ld

https://json-ld.org/spec/latest/json-ld-api/#expansion
https://json-ld.org/playground/
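
To make "normalized form" concrete, here is a toy illustration of what expansion produces for a flat object. It is not the JSON-LD expansion algorithm (the PR relies on jsonld.expand for that); it assumes a single vocabulary prefix, no nested nodes, and plain literal values, and the name toyExpand is made up.

```javascript
/**
 * Toy stand-in for expansion: drops @context, turns every term into a full
 * IRI under one vocabulary, and wraps values in arrays of {'@value': ...}.
 * Assumption: flat input whose property values are plain literals.
 * @param {Object} input compact JSON-LD object
 * @param {string} vocab vocabulary prefix, e.g. 'http://schema.org/'
 * @return {Object[]}
 */
function toyExpand(input, vocab) {
  const expanded = {};
  for (const [key, value] of Object.entries(input)) {
    if (key === '@context') continue; // the context is consumed, not kept
    if (key === '@type') {
      expanded['@type'] = [].concat(value).map(type => vocab + type);
    } else {
      expanded[vocab + key] = [].concat(value).map(v => ({'@value': v}));
    }
  }
  return [expanded];
}

const compact = {'@context': 'http://schema.org/', '@type': 'Article', 'name': 'Cats'};
const expandedExample = toyExpand(compact, 'http://schema.org/');
```

Here expandedExample holds one node whose @type is ['http://schema.org/Article'] and whose http://schema.org/name property is [{'@value': 'Cats'}]. Since every key becomes a full IRI, the schema.org validator no longer has to care how the author wrote the context.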

expandedObj = await promiseExpand(inputObject);
} catch (error) {
errors.push({
validator: 'json-ld',
Member

should this be the same validator? i'm fine with 'yes', just asking..

Collaborator Author

it might be worth differentiating between those two, good call (note that this field is not used by LH - I added it because sd-validation aspires to be a separate package, and this info can be useful for someone else)

Collaborator

Done already. 🙂

paulirish (Member) left a comment

great job on this. this is a big area and it feels quite approachable due to how you structured the problem here. nice work!


const headings = [
{key: 'idx', itemType: 'text', text: 'Index'},
{key: 'path', itemType: 'text', text: 'Line/Path'},
Member

i'm wondering if we should also have a CSS selector to help point to where this was slurped up. i guess it'd be just script[type="application/ld+json" i] with a :nth-child(N).. okay maybe we just communicate that in the description/docs.

Collaborator Author

yeah, all of those json-ld's are scripts, and all of them are in <head>, so it's hard to provide useful debugging info. IMO your idea of exposing the top-level type will be great here (although we can't extract the top-level type if json/json-ld/expansion fails :( )

kdzwinel (Collaborator Author) left a comment

  • review comments addressed (thank you Paul, these were great!)
  • validation of object properties recommended/required by SDTT removed (now we are using only schema.org data)
  • audit description adjusted


* @param {string} name
* @returns {string | null} error
*/
function validateField(name) {
Collaborator Author

👍 I changed it to "validateKey"


const cleanKeys = keys
// skip JSON-LD keywords
.filter(key => key.indexOf('@') !== 0)
kdzwinel (Collaborator Author), Jun 20, 2018

we could, but then all invalid keys starting with '@' would not be removed here, thus producing an Unexpected property "${key}" error.

Since these keys were already caught by the previous validator (json-ld) and reported ('Unknown keyword') IMO there is no need to report them again.
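
The reasoning above can be sketched as follows: anything starting with '@' is skipped, valid keyword or not, so a key already flagged by the json-ld validator is not reported a second time. The function name and the knownProps set are hypothetical, and keys are assumed to be already shortened to bare property names.

```javascript
/**
 * Hypothetical sketch of the schema.org property check.
 * @param {string[]} keys property names found on the object
 * @param {Set<string>} knownProps properties allowed for the object's type
 * @return {string[]} one message per unexpected property
 */
function findUnexpectedProperties(keys, knownProps) {
  return keys
    // skip JSON-LD keywords, including invalid ones that the
    // json-ld validator has already reported as 'Unknown keyword'
    .filter(key => key.indexOf('@') !== 0)
    .filter(key => !knownProps.has(key))
    .map(key => `Unexpected property "${key}"`);
}
```

For example, given the keys '@type', '@bogus', 'name' and 'controversial' and a type that only knows 'name', the only message produced is the one for "controversial".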

*/

/**
* This script can be used to generate schema-tree.json file
Member

should this live in a /scripts/ folder?

How is it used? (maybe something for the readme? though i can't tell..)

Collaborator Author

Good call, I moved it and changed it to a nodejs script that fetches the schema.org jsonld and writes result to assets/.

@@ -31,31 +31,6 @@ describe('schema.org validation', () => {
assert.equal(errors[0].message, 'Unrecognized schema.org type http://schema.org/Dog');
});

it('reports missing required fields', async () => {
Member

i don't totally grok whats happening here swapping out the google schema to the other one but...

i'm kinda surprised that 'required' fields aren't a thing anymore. I guess schema.org just doesn't have an opinion on what's required?

Collaborator

Yeah schema.org doesn't care, but Google requires some properties to enable rich snippets.

Will bring it up with @rviscomi if we want to add extra validation for this in the future.

@@ -11,6 +11,33 @@ const jsonld = require('jsonld');
const schemaOrgContext = require('./assets/jsonldcontext');
const SCHEMA_ORG_HOST = 'schema.org';

/**
* Custom loader that prevents network calls and alows us to return local version of the
Member

allows ;)

*/
'use strict';

// load data from https://github.com/schemaorg/schemaorg/blob/master/data/releases/3.4/schema.jsonld
Member

mattzeunert (Collaborator)

This is a bit trickier than I thought because there are three different forms that the JSON-LD takes:

  1. Raw JSON string on the page
  2. parsed JSON object
  3. expanded JSON-LD (regularized form without @context)

There are 4 types of failures:

  1. JSON validation, gives line number in raw JSON string
  2. JSON-LD keyword validation, gives path in parsed JSON object
  3. Expansion errors in the jsonld module, does not provide location information
  4. schema.org validation, gives path in expanded JSON-LD object

Current status:

screenshot 2018-11-24 at 20 29 42

Notes on the errors in the screenshot:

  1. JSON-LD keyword validation error, shows the stringified JSON (so it has a different key order from the raw string)
  2. JSON parse failure, shows the raw string
  3. Expansion error, should not have a highlight because we don't know the line
  4. schema org failure, shows stringified JSON
  5. same as 4
  6. schema org failure but it uses a full resource identifier (http://schema.org/author) instead of relying on the context. I need to work on this some more to sort out the line mapping from the expanded form to the input object here.

Instead of using a JSON parser I overwrite the property value at the given path with a random key, and then use the line number of the random key in the stringified JSON.
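
A minimal sketch of that marker trick, assuming the path is an array of keys into the parsed object and that the displayed snippet is JSON.stringify output with two-space indentation. The function name and the marker format are made up for illustration.

```javascript
/**
 * Hypothetical helper: finds the 1-based line of the property at `path`
 * in the pretty-printed form of `obj`, without a position-aware parser.
 * @param {Object} obj parsed JSON-LD
 * @param {Array<string|number>} path keys leading to the property
 * @return {number | null} line number, or null if the path does not exist
 */
function getLineNumberForPath(obj, path) {
  if (path.length === 0) return null;
  // Random marker that is very unlikely to occur in real data.
  const marker = `__marker_${Math.random().toString(36).slice(2)}__`;
  const clone = JSON.parse(JSON.stringify(obj)); // never mutate the input

  let target = clone;
  for (const key of path.slice(0, -1)) {
    if (target === null || typeof target !== 'object' || !(key in target)) return null;
    target = target[key];
  }
  const lastKey = path[path.length - 1];
  if (target === null || typeof target !== 'object' || !(lastKey in target)) return null;

  // Overwrite the value, stringify, and look for the line holding the marker.
  target[lastKey] = marker;
  const lines = JSON.stringify(clone, null, 2).split('\n');
  const index = lines.findIndex(line => line.includes(marker));
  return index === -1 ? null : index + 1;
}
```

Replacing the value keeps key order intact, so the line matches the stringified JSON shown to the user, though not necessarily the raw string from the page.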

patrickhulce (Collaborator)

Just throwing this out there, what do we think about splitting up this PR and trying to land some pieces of it like the standalone validation folder?

I think Matt's got a decent hold on the usage of everything, and it seems like most of the changes will likely be in core now. Is that right, @mattzeunert?

mattzeunert (Collaborator)

@patrickhulce There'll still be a few changes in the validator to get it to provide line numbers. Other than that it's just adding new rendering logic.

patrickhulce (Collaborator)

> There'll still be a few changes in the validator to get it to provide line numbers.

Ah, ok gotcha. Well maybe once the API solidifies for those we can try to break it up?

mattzeunert (Collaborator)

Sure – is the main purpose to simplify the review?

patrickhulce (Collaborator)

> Sure – is the main purpose to simplify the review?

Yeah and ideally land pieces earlier so there's less that needs to keep being rebased, etc. I know most of it is generated files, but 13k LOC is a doozy :)

Easier to tell if things are missing when each PR is focused.

mattzeunert (Collaborator) commented Nov 29, 2018

@patrickhulce @rviscomi Is there any reason not to merge the PR without the new result rendering logic? Especially since it's already 90% reviewed.

Update: seems more like 98% reviewed 🙂. There are 4 small non-merge commits from me on the branch. I think we should get this merged, and possibly disable the audit for now if we don't want it to get released.

Update2: The plan is to disable the audit for now and merge this PR.

@@ -479,7 +479,7 @@ const defaultConfig = {
{id: 'canonical', weight: 1, group: 'seo-content'},
{id: 'font-size', weight: 1, group: 'seo-mobile'},
{id: 'plugins', weight: 1, group: 'seo-content'},
-{id: 'structured-data-automatic', weight: 1, group: 'seo-content'},
+// {id: 'structured-data-automatic', weight: 1, group: 'seo-content'},
Collaborator

Maybe there's a better way to disable the audit for now than what I'm doing in this commit?

@mattzeunert mattzeunert changed the title [WIP] JSON-LD validation (new_audit):JSON-LD validation (not included in UI for now) Dec 5, 2018
@mattzeunert mattzeunert changed the title (new_audit):JSON-LD validation (not included in UI for now) new_audit(seo): JSON-LD validation (not included in UI for now) Dec 5, 2018
mattzeunert (Collaborator)

@patrickhulce Removed the audit from the SEO category like you suggested. Much nicer solution than commenting out all the tests!

I'm guessing AppVeyor is just flaky?

mattzeunert (Collaborator)

Been dabbling around with the rendering logic a bit more.

Can we remove the "Show all" button and just show all when the user clicks on the snippet? We can indicate that it's clickable when the user hovers over it – but maybe that's not discoverable enough?

Should there be some kind of title for each JSON-LD item? Or at least a clearer separation.

@rviscomi Do you have a strong concept of what we're going for? Should we ask a designer for help? Or just keep iterating ourselves?

screenshot 2018-12-15 at 19 25 05


Different cases we need to handle:

  • error with no line number
  • one or more errors on specific lines
  • maybe: show JSON-LD snippets without any failures, doesn't need to show full JSON but maybe just the top level @type value (@paulirish suggested this so that the user knows that LH picked up the snippet and didn't find anything wrong with it)

rviscomi (Member)

> @rviscomi Do you have a strong concept of what we're going for? Should we ask a designer for help? Or just keep iterating ourselves?

It's inspired by the GitHub code diff UI, eg:

[image]

@paulirish do you know if there's anything else we need to do to make sure this audit's results render properly on downstream services like web.dev?

> Can we remove the "Show all" button and just show all when the user clicks on the snippet? We can indicate that it's clickable when the user hovers over it – but maybe that's not discoverable enough?

Yeah I'm not sure if it'll be obvious that it's clickable.

> Should there be some kind of title for each JSON-LD item? Or at least a clearer separation.

Similar to the GitHub example, instead of a file name could we show the DOM address of the script block? When in devtools and clicked it should reveal it in the Elements panel.

> Different cases we need to handle:
> error with no line number

Let's put any of these kinds of errors on line 1.

> one or more errors on specific lines

If there's 1 error, show that error. If there are 2 or more errors on the same line, show "X errors:" then an unordered list with each error.

> maybe: show JSON-LD snippets without any failures, doesn't need to show full JSON but maybe just the top level @type value (@paulirish suggested this so that the user knows that LH picked up the snippet and didn't find anything wrong with it)

+1. The type may not always be descriptive (or exist at all) so maybe just show the first ~10 lines with the "Show all" expander.
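
The line rules above could be sketched roughly like this. The error shape ({message, line}) and the function name are assumptions for illustration, and the plain-text list stands in for whatever the report renderer would actually emit.

```javascript
/**
 * Hypothetical sketch: groups validation errors per line for display.
 * Errors without a line number are placed on line 1.
 * @param {Array<{message: string, line?: number}>} errors
 * @return {Array<{line: number, text: string}>} sorted by line
 */
function formatErrorsByLine(errors) {
  const byLine = new Map();
  for (const {message, line} of errors) {
    const lineNumber = Number.isInteger(line) ? line : 1;
    if (!byLine.has(lineNumber)) byLine.set(lineNumber, []);
    byLine.get(lineNumber).push(message);
  }
  return [...byLine.entries()]
    .sort(([a], [b]) => a - b)
    .map(([line, messages]) => ({
      line,
      // One error is shown as-is; several become "X errors:" plus a list.
      text: messages.length === 1
        ? messages[0]
        : `${messages.length} errors:\n${messages.map(m => `- ${m}`).join('\n')}`,
    }));
}
```

Keeping the grouping separate from rendering would let the same data feed both the report UI and downstream consumers.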

mattzeunert (Collaborator)

Played around some more:

screenshot 2018-12-20 at 16 25 03

brendankenny (Member)

Closing in favor of #4359 and upcoming PRs.

We'll keep the branch around in case someone needs it <3

Development

Successfully merging this pull request may close these issues.

[SEO Audits] Structured data is valid
9 participants