[SEO Audits] Structured data is valid #4359

Open · rviscomi opened this issue Jan 25, 2018 · 36 comments

@rviscomi (Member) commented Jan 25, 2018

Audit group: Content Best Practices
Description: JSON-LD structured data syntax is valid
Failure description: JSON-LD structured data syntax is invalid
Help text: Structured data contains rich metadata about a web page. The data may be used to provide more context to users in search results or social sharing. This audit is only checking for JSON-LD structured data syntax errors. For a checkup of Google search eligibility, try the Structured Data Testing Tool.

Success conditions (an illustrative passing snippet follows the list):

  • the JSON can be parsed
  • the keys are not duplicated
  • @ keywords are whitelisted in the JSON-LD spec
  • @context URLs are valid
  • validation using jsonld.js
  • @types are whitelisted on schema.org
  • object properties are whitelisted on schema.org
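
For illustration (my example, not part of the audit spec), a minimal JSON-LD snippet that would satisfy all of these conditions:

```json
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Quick Pancakes",
  "author": { "@type": "Person", "name": "Jane Doe" }
}
```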

Notes:

  • defers Google search eligibility and other more advanced validation logic to the Structured Data Testing Tool (SDTT)
  • implementation logic is built into Lighthouse and not open sourced separately
@rviscomi (Member Author)

@kdzwinel let's start by setting up a new repo on your personal GH account and getting the plumbing to work so that it is registered on npm and installed by LH. You can send me any code reviews outside of the LH repo.

@rviscomi (Member Author)

TODO: Is there anything we can reuse from Chromium's "copyless paste" feature, which does JSON-LD parsing? https://github.com/chromium/chromium/search?utf8=%E2%9C%93&q=json-ld&type=

@paulirish (Member)

What's in scope for the first pass here? The three checkboxes above?

Also, do we know how much of the heavy lifting can be done by existing libraries?

@rviscomi (Member Author)

Some of the lifting can be done with existing projects, e.g. https://github.com/zaach/jsonlint, assuming the license is compatible. @kdzwinel could you set up some benchmarks and see how good it is? Depending on it directly is probably fine for v1, but we'd also like to fold in other structured data schemas beyond JSON-LD, so it may be worth setting up a standalone SD validator repo anyway and getting the plumbing set up.

@paulirish (Member)

I do see quite a few JSON-LD projects already, so I was mostly curious on that side.
Seems like jsonlint would be good for providing better diagnostic data than JSON.parse does; that seems useful if we're finding a lot of invalid JSON.
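
As a quick illustration of that difference (the messages in the comments are indicative; exact wording varies):

```js
const jsonlint = require('jsonlint'); // zaach/jsonlint

const badJson = '{"@type": "Recipe" "name": "Pasta"}'; // missing comma

try {
  JSON.parse(badJson);
} catch (e) {
  console.log(e.message); // e.g. "Unexpected string in JSON at position 19" – a position, nothing more
}

try {
  jsonlint.parse(badJson);
} catch (e) {
  console.log(e.message); // jsonlint reports the line and which tokens it expected instead
}
```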

The three checkboxes above are what you're scoping this MVP to?

@rviscomi (Member Author)

Oh I actually linked to the wrong project. jsonlint does look interesting and may help. I meant to link to https://github.com/digitalbazaar/jsonld.js, which is JSON-LD specific.

As for the MVP, the first two (valid JSON, standard props) are required and the third is a stretch goal.

Paul, on the subject of an external repo, I was curious to hear your thoughts on internal vs external. My hope/expectation is for the structured data validation code to be reusable in other contexts and also maintained by the web community. I think an external repo would be more conducive to those things, e.g. how the axe a11y logic is external to LH. WDYT?

@kdzwinel (Collaborator)

> Paul, on the subject of an external repo, I was curious to hear your thoughts on internal vs external.

We've been talking a lot in the engineering meetings about solutions such as lerna, yarn workspaces, and doing what we're doing with lh-logger (same repo, different package), but I haven't asked: why don't we want a separate repo in the first place?

@paulirish (Member)

Breaking out the separate repo discussion into #4955.

@kdzwinel (Collaborator) commented Apr 18, 2018

Small update.

jsonld.js can do some transforms on JSON-LD structures, but I don't see how we can use it for reporting errors. BTW this package has a non-standard license.

I've built a first prototype that (a rough sketch of the keyword check follows the list):

  • checks if the provided code is valid JSON,
  • traverses the JSON object, making sure that only valid keywords are used, and
  • for some keywords, checks if the values are of the correct type.
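
A hypothetical sketch of that keyword traversal, not the prototype's actual code (the safelist is an illustrative subset of the JSON-LD 1.0 keywords):

```js
// Illustrative subset of the keywords defined in the JSON-LD spec.
const KEYWORDS = new Set([
  '@context', '@id', '@type', '@value', '@language', '@graph',
  '@list', '@set', '@reverse', '@index', '@base', '@vocab',
]);

function findInvalidKeywords(node, path = [], errors = []) {
  if (Array.isArray(node)) {
    node.forEach((item, i) => findInvalidKeywords(item, path.concat(i), errors));
  } else if (node && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      // Keys starting with "@" must come from the JSON-LD keyword safelist.
      if (key.startsWith('@') && !KEYWORDS.has(key)) {
        errors.push(`Unknown keyword "${key}" at ${path.concat(key).join('.')}`);
      }
      findInvalidKeywords(value, path.concat(key), errors);
    }
  }
  return errors;
}

findInvalidKeywords(JSON.parse('{"@tpye": "Recipe"}'));
// → ['Unknown keyword "@tpye" at @tpye']
```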

And I found a first production error 😄 (already pinged Eric about it)
[screenshot]


And a second one 😱 Who would have thought that simple JSON validation would be so useful here.

[screenshot]

@kdzwinel (Collaborator) commented May 4, 2018

I've put together the technical decisions and other notes for future reference.

JSON Validation

JSON.parse:
✅ +0kb
❌ doesn't explain what's wrong (in some cases it does tell you where the error occurs, though, e.g. "SyntaxError: Unexpected token b in JSON at position 11")
❌ doesn't catch duplicated keys

https://github.com/zaach/jsonlint (popular, unmaintained, MIT) :
⚠️+12kb
✅explains what's wrong
❌ doesn't catch duplicated keys (zaach/jsonlint#13)

https://github.com/circlecell/jsonlint-mod (fork of the previous one):
⚠️+12kb
✅explains what's wrong
✅catches duplicated keys

JSON-LD doesn't support duplicated keys, so jsonlint-mod was the best choice. It isn't perfect, though: error messages are hard to customize, and it calls console.error, which litters the console output. We may want to create our own fork in the long run.
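
To make the duplicated-keys gap concrete (jsonlint-mod appears to expose the same parse() API as jsonlint; treat that as an assumption):

```js
// JSON.parse silently keeps the last duplicate – no error is raised:
JSON.parse('{"@type": "Recipe", "@type": "Article"}');
// → { '@type': 'Article' }

// jsonlint-mod (fork of jsonlint) throws on the duplicated key instead:
const jsonlint = require('jsonlint-mod');
jsonlint.parse('{"@type": "Recipe", "@type": "Article"}'); // throws
```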

JSON-LD Validation

There are no JSON-LD linters/validators that I know of (ld-validate, which Paul found, doesn't check the JSON-LD grammar but rather lets you create and validate object constraints), so for the MVP we only do a couple of checks. We make sure that:

  • keys are not duplicated (checked by the JSON parser)
  • all keywords (keys starting with an @) come from the JSON-LD safelist
  • all @context URLs are valid
  • and quite a few additional checks performed by the jsonld.js expansion algorithm (see the sketch below)
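
For illustration, a minimal sketch of how the expansion step can double as validation; jsonld.expand() is the library's real entry point, while the wrapper and sample document are my own:

```js
const jsonld = require('jsonld'); // digitalbazaar/jsonld.js

async function checkExpansion(doc) {
  try {
    // Expansion resolves the @context and normalizes the document;
    // malformed contexts and similar problems throw a JsonLdError.
    return {valid: true, expanded: await jsonld.expand(doc)};
  } catch (err) {
    return {valid: false, message: err.message};
  }
}

checkExpansion({'@context': 'https://schema.org', '@type': 'Recipe', name: 'Pasta'})
  .then(result => console.log(result.valid));
```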

Schema.org Validation

Using raw data from schema.org, we are checking (a rough sketch follows below):

  • if referenced schema.org types actually exist (e.g. "@type": "InvalidType"),
  • and if there are no unrecognized properties on all schema.org objects (e.g. {"@type": "Recipe", "unknownProperty": "error"}).

ℹ️ Objects that are using some other vocabulary than schema.org (e.g. http://rdf.data-vocabulary.org/) are ignored.
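
A hypothetical sketch of those two checks (the tiny hand-written type/property map stands in for the real index built from schema.org's raw data):

```js
// Illustrative only: in the real audit this map would be built from the raw
// schema.org vocabulary data; here it's a tiny hand-written stand-in.
const typeToProperties = new Map([
  ['Recipe', new Set(['name', 'author', 'recipeIngredient'])],
]);

function checkSchemaOrgObject(obj, errors = []) {
  const type = obj['@type'];
  if (!typeToProperties.has(type)) {
    errors.push(`Unrecognized schema.org type "${type}"`); // e.g. "@type": "InvalidType"
    return errors;
  }
  const allowed = typeToProperties.get(type);
  for (const key of Object.keys(obj)) {
    if (!key.startsWith('@') && !allowed.has(key)) {
      errors.push(`Unrecognized property "${key}" on type "${type}"`);
    }
  }
  return errors;
}

checkSchemaOrgObject({'@type': 'Recipe', name: 'Pasta', unknownProperty: 'error'});
// → ['Unrecognized property "unknownProperty" on type "Recipe"']
```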

Google Recommendations Validation

Using data scraped from SDTT we are checking if all required and recommended fields are present on schema.org objects.

This is not a long-term solution. In the future we would like to use something like ShEx to validate not only the required/recommended fields but also their values. We would also like to support recommendations coming from other companies (Bing/Pinterest/Apple/Yandex, etc.).

@kdzwinel (Collaborator) commented May 9, 2018

WIP - if anyone is interested in testing it: https://github.com/kdzwinel/lighthouse/tree/json-ld

[screenshot]

@AymenLoukil

Hello,
Why not consume the Google Structured Data Testing Tool for that? Is there no API available?

@rviscomi (Member Author) commented Jun 5, 2018

@AymenLoukil SDTT is closed source and does not provide an API. We're also hoping to make this audit relevant for all search engines, as they all handle structured data a bit differently. It's kind of the wild west in terms of consistency 🤠

@AymenLoukil

@rviscomi OMG, how is that closed source 🚪? It belongs to Google, no? Why not reuse and enhance it?
Anyway, http://jsonapi.org/format/1.1/ may help in validating the JSON / JSON-LD format.

@danbri commented Jun 6, 2018

> Objects that are using some other vocabulary than schema.org (e.g. http://rdf.data-vocabulary.org/) are ignored.

I would encourage also accepting (or not actively complaining about) markup that links a Schema.org term to a non-Schema.org term. For example, I've been looking at using schema.org/Dataset with a mainEntity property whose value is modeled as a W3C csvw:Table. While it is possible that such markup might be in error, it is also good to accept such possibilities as being potentially sensible.

@rviscomi (Member Author) commented Jun 7, 2018

@kdzwinel for now let's take out the recommendations based on SDTT scraping. That will help ensure that we don't lag behind recommendations from SDTT in case it changes.

@rviscomi (Member Author) commented Jul 3, 2018

@kdzwinel See #4359 (comment) - I've updated the audit title and help text.

@danbri commented Feb 13, 2019

Is this functionality still under development?

@rviscomi (Member Author)

@danbri the core functionality is complete and awaiting review. See #6750.

Beyond that, we're working on a new results renderer to more clearly show users the JSON-LD errors in the context of the code snippet. See #6901 and #6999.

[screenshot]

@mattzeunert (Contributor)

A few observations after running this on a few sites:

1. Case sensitivity

This is a failure because there's no aggregateRating type, just a Thing.aggregateRating property. The type should be AggregateRating. It might not be clear to the user, though, because the schema.org URL exists.

[screenshot]

JSON-LD is case-sensitive, but the Structured Data Testing Tool doesn't complain about it.

2. Empty elements

There are a few JSON-LD elements with no content, or with just {}. SDTT doesn't care, and it probably just means there's no data, so nothing we need to do, I think.

3. Parser errors

I've only seen one example of this, but if you try to escape a single quote in the JSON, our error message isn't super helpful. (JSON only allows a few escape sequences.)

[screenshot]

You get the same message if you have line breaks in your strings or a trailing comma somewhere. Maybe prefixing the message with "Parse error: " would make it clearer what the problem is.
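
For reference, my own minimal reproduction of the escaped-single-quote case:

```js
// JSON allows only \" \\ \/ \b \f \n \r \t and \uXXXX as escapes,
// so a backslash-escaped single quote is a syntax error:
JSON.parse('{"name": "it\\\'s"}'); // the JSON text is {"name": "it\'s"}
// → SyntaxError (the exact message varies by engine and never says "invalid escape")
```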

@mmocny (Member) commented Aug 16, 2019

Hey folks, I poked around with the patch this morning to see how this is progressing (just a quick look so far, and it looks good!)

One thing I found was that the audit didn't detect externalized structured data: e.g. <script type="application/ld+json" src=...> or <link rel="alternate" type="application/ld+json" href=...>

I'm not actually sure if this is meant to be supported. WDYT?

I suspect that this could be fixed by changing this block, which currently only handles scripts with inline content.

EDIT: apparently it's not allowed to use <script src=> for data, per the HTML spec [1].

[1] "When used to include data blocks, the data must be embedded inline, the format of the data must be given using the type attribute, and the contents of the script element must conform to the requirements defined for the format used. The src, charset, async, defer, crossorigin, and nonce attributes must not be specified."

@danbri commented Aug 16, 2019 via email

@mmocny (Member) commented Aug 16, 2019

@danbri Do you know if anyone does? Is it meant to not be supported, or just Google doesn't yet?

@danbri commented Aug 16, 2019 via email

@mmocny (Member) commented Aug 16, 2019

Thanks. I can resort to a <script> which will fetch and document.write (or rather head.append) the SD.

(Seems silly; that's much harder for simple crawlers to deal with.)

@BigBlueHat

@mmocny can you not just add them on the server side? Any <script> tag with a valid MIME media type is considered a "data block" in HTML. They can be added dynamically, but it's best if they arrive with the rest of the initial markup, for simple crawlers (et al.), as you note.

@mmocny (Member) commented Aug 16, 2019

@BigBlueHat In my specific use case I have (potentially) large SD that is not relevant to 99% of client page loads. I was trying to optimize for page size by merely linking to, rather than forcefully including, all SD (thus allowing the user agent to decide if/when it was useful to actually download the data).

I guess if I want to keep this behaviour while resorting to a script that dynamically injects the data, I would need to take control of when to do so conditionally. And if I'm to do that, I could indeed just as well conditionally server-side-include it, as you suggest.

@jvandriel

I'm happy to read about this initiative, though I'm wondering whether microdata and RDFa have been considered as well?

The reason I ask is that even though JSON-LD is the 'preferred' syntax right now, there's still tons of microdata markup out there (and RDFa in lesser quantities) that's actively maintained. It would be a real shame if these fell out of scope.

@danbri commented Aug 21, 2019 via email

@jvandriel

In regard to parsers/linters for validating syntax/vocabulary constraints, maybe @gkellogg knows of some you can use. He's also currently maintaining the Structured Data Linter.

@mmocny (Member) commented Aug 21, 2019

Coincidentally, I was doing a bit of my own digging in this space recently.

First, a summary of what I think Lighthouse currently has (note: I'm not the author or otherwise involved; I'm just piecing this together):

Some of the benefits of doing it this way:

  • It was comparatively simple, in that it required few external libraries (less binary cost)
  • Certain features, e.g. matching syntax errors to specific line numbers, may have been easier to implement

On the other hand, there may be correctness issues with doing it this way. I'm not at all sure the current implementation supports all the different ways you can publish JSON-LD (e.g. @reverse props, type embedding, @context aliasing...).

And so, even just for correctness reasons it may already be desirable to validate at the graph level, instead of trying to validate the parsed JSON-LD object. We could check this by creating more complex test cases, or finding more complex real world examples.

If that is indeed true, then the hard ask of validating at the graph level becomes more of a requirement. After that, I suspect adding more parsers/linters (and even other vocabs) would become near-trivial.

@mmocny (Member) commented Aug 21, 2019

@jvandriel I only took a quick scan, but it looks like @gkellogg's tool is "based on RDF.rb", and:

> The Linter is implemented as an HTML Web Application, with most of the work performed via Ajax requests using a JSON content type

As such, I suspect it isn't applicable. We probably would want something in JS, or at least something that could be compiled to WebAssembly (not sure if Lighthouse is OK with WebAssembly).

@mmocny (Member) commented Aug 21, 2019

Found this, though.

@BigBlueHat

> And so, even just for correctness reasons it may already be desirable to validate at the graph level, instead of trying to validate the parsed JSON-LD object. We could check this by creating more complex test cases, or finding more complex real world examples.

Depending on the kind of validation you have in mind, it might make sense to validate against JSON-LD's expanded form, which jsonld.js's expand() would give you from any valid JSON-LD document. Validators written against the expanded form would avoid having to deal with all the variations possible in a compact, tree-ier form.
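
A small illustration (my example): expansion rewrites terms to full IRIs and normalizes values, so a validator sees one canonical shape regardless of how the @context was written:

```js
const jsonld = require('jsonld');

const doc = {'@context': {name: 'http://schema.org/name'}, name: 'Pasta'};
jsonld.expand(doc).then(expanded => console.log(JSON.stringify(expanded)));
// → [{"http://schema.org/name":[{"@value":"Pasta"}]}]
```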

@mmocny (Member) commented Aug 21, 2019

...and indeed, the patch already does this. I should have checked. (Thanks for the note.)

@gkellogg

> @jvandriel I only took a quick scan, but it looks like @gkellogg's tool is "based on RDF.rb", and:
>
> > The Linter is implemented as an HTML Web Application, with most of the work performed via Ajax requests using a JSON content type
>
> As such, I suspect it isn't applicable. We probably would want something in JS, or at least something that could be compiled to WebAssembly (not sure if Lighthouse is OK with WebAssembly).

The Linter is, indeed, a web application, but is based on the "rdf" CLI installed with the linkeddata Ruby gem. It's been run for years, with minimal updates, to help validate schema.org examples. It principally depends on semantic validation of the rendered RDF graph, which could probably be done using other languages (such as rdfjs) with appropriate code. The validation bits are in the rdf-reasoner gem. It's all public-domain software, so people are free to do with it what they please.

Of course, Ruby can be compiled into WebAssembly too.
