Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

core(jsonld): add structured data validation #6750

Merged
merged 33 commits into from Apr 8, 2019

Conversation

Projects
None yet
10 participants
@patrickhulce
Copy link
Collaborator

patrickhulce commented Dec 8, 2018

Summary
Split out the structured validation library portion from #5446.

Key changes from Konrad's implementation

  1. Ignoring native dependency instead of modifying our build environment Updated to 1.5.0 so we don't need it anymore! 馃帀
  2. Generated files are now a single-line, I'm not 100% sure these even need to be checked in if we had a prepublish/postinstall hook to generate them? I suppose it's safer though considering they're pulled from URLs Reverted through review
  3. Added a monkeypatch for console.error because jsonlint-mod is annoying Updated to 1.7.4 so also not needed! 馃帀

Native Dependency Notes
jsonld depends on rdf-canonize which in versions <0.2.0 had a native C++ dependency. Personally I consider the existence of this native dependency a major PR blocker. I don't think all users that install Lighthouse should now be blocked on having C++11 on their system. We should try and get jsonld published with rdf-canonize@0.3.0 which was the first version that made the native portion optional. Alternatives include just pulling in their UMD library from a CDN instead of using the NPM version.

This has been resolved and they've removed their native dependency!! Hurray!!

Related Issues/PRs
#5446

patrickhulce added some commits Dec 8, 2018

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Dec 11, 2018

Good news: jsonld published a version that uses the new optional native dependency! 馃帀

Bad news: npm and yarn both install optional deps by default, which means everyone depending on LH today would still need to change their build configs to continue installing.

--

What do we do?

  • Pull in jsonld from unpkg instead https://unpkg.com/jsonld@1.2.0/dist/jsonld.min.js
  • Go back and try to thread the dep chain of making it a peer or not a dependency at all, assuming the package owners support this.
  • Document this new dep requirement and consumers have to live with --ignore-optional if they want to use LH and don't have a C++11 compiler
  • Something else I'm missing?
@@ -32,8 +32,9 @@
"unit-cli": "jest --runInBand \"lighthouse-cli/\"",
"unit-cli:ci": "jest --runInBand --coverage --ci \"lighthouse-cli/\"",
"unit-viewer": "mocha --reporter dot \"lighthouse-viewer/test/**/*-test.js\"",
"unit": "yarn unit-core && yarn unit-cli && yarn unit-viewer",
"unit:ci": "yarn unit-core:ci && yarn unit-cli:ci && yarn unit-viewer",
"unit-sd-validation": "mocha --reporter dot \"sd-validation/test/**/*-test.js\"",

This comment was marked as resolved.

Copy link
@paulirish

paulirish Dec 12, 2018

Member

unit-sd ?

This comment was marked as resolved.

Copy link
@patrickhulce

patrickhulce Dec 12, 2018

Author Collaborator

sg, done


jsonld.expand(inputObject, {
documentLoader,
}, (/** @type {string} */e, /** @type {Object} **/expanded) => {

This comment has been minimized.

Copy link
@paulirish

paulirish Dec 12, 2018

Member

seems like we need a few additions to tsconfig?

diff --git a/tsconfig.json b/tsconfig.json
index 6d57b103..dc5e1b14 100644
--- a/tsconfig.json
+++ b/tsconfig.json
@@ -21,11 +21,14 @@
     "lighthouse-core/**/*.js",
     "clients/**/*.js",
     "build/**/*.js",
+    "sd-validation/**/*.js",
     "./types/*.d.ts",
   ],
   "exclude": [
     "lighthouse-cli/test/**/*.js",
     "lighthouse-core/test/**/*.js",
+    "sd-validation/test/**/*.js",
+    "sd-validation/scripts/**/*.js",
     "clients/test/**/*.js",
   ]
 }
/** @type {function(string):void} */
let reject;
const promise = new Promise((res, rej) => {
resolve = res; reject = rej;

This comment has been minimized.

Copy link
@paulirish

paulirish Dec 12, 2018

Member

we dont really follow this pattern much so i tried it using our more typical approach:

module.exports = function expand(inputObject) {
  return new Promise((resolve, reject) => {
    function documentLoader( /** @type {string} **/ schemaUrl, /** @type {function(null, Object):void} **/ callback) {
      try {
        return loadDocument(schemaUrl, callback);
      } catch (e) {
        reject(e.message);
      }
    }

    jsonld.expand(inputObject, {documentLoader}, (/** @type {string} */e, /** @type {Object} **/expanded) => {
      if (e) reject('Expansion error: ' + e.toString());
      else resolve(expanded);
    });
  });
};

but I can't say its a clear win so w/e

This comment has been minimized.

Copy link
@patrickhulce

patrickhulce Dec 12, 2018

Author Collaborator

yeah I'm inclined to leave as-is, I actually prefer removing the layer of nesting :)

try {
return loadDocument(schemaUrl, callback);
} catch (e) {
reject(e.message);

This comment has been minimized.

Copy link
@paulirish

paulirish Dec 12, 2018

Member

return reject IMO

documentLoader,
}, (/** @type {string} */e, /** @type {Object} **/expanded) => {
if (e) {
reject('Expansion error: ' + e.toString());

This comment has been minimized.

Copy link
@paulirish

paulirish Dec 12, 2018

Member

return reject

This comment has been minimized.

Copy link
@patrickhulce

patrickhulce Dec 12, 2018

Author Collaborator

done

This comment has been minimized.

Copy link
@patrickhulce

patrickhulce Dec 12, 2018

Author Collaborator

turns out they have a promise-api, so I'm just switching to that instead

if (e) {
reject('Expansion error: ' + e.toString());
} else {
resolve(expanded);

This comment has been minimized.

Copy link
@paulirish

paulirish Dec 12, 2018

Member

return resolve

This comment has been minimized.

Copy link
@patrickhulce

patrickhulce Dec 12, 2018

Author Collaborator

done

@paulirish

This comment has been minimized.

Copy link
Member

paulirish commented Dec 12, 2018

What do we do?

We discussed offline that peerDep doesn't totally make semantic sense and will spam warnings when installing. The unpkg solution scares me a bit.

I'm favoring this one:

Document this new dep requirement and consumers have to live with --ignore-optional if they want to use LH and don't have a C++11 compiler

@brendankenny
Copy link
Member

brendankenny left a comment

Document this new dep requirement and consumers have to live with --ignore-optional if they want to use LH and don't have a C++11 compiler

The problem is that almost no one is going to read this or do it. A lot of hard work has gone into this, but it still seems backwards to make lighthouse harder to use (and slower by default) to add this audit.

Maybe we should have a manual SEO audit pointing to a new @lighthouse/structured-data plugin?

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Dec 12, 2018

pointing to a new @lighthouse/structured-data plugin?

馃槏 馃槏 馃槏 馃槏 馃槏 馃槏 馃槏

I love this for so many reasons.

@rviscomi

This comment has been minimized.

Copy link
Member

rviscomi commented Dec 12, 2018

@brendankenny is your suggestion to omit the SD audit from running in the default LH SEO category and include a link under "Manual audits" for users to install a new extension which will run the SD audit?

@brendankenny

This comment has been minimized.

Copy link
Member

brendankenny commented Dec 12, 2018

@brendankenny is your suggestion to omit the SD audit from running in the default LH SEO category and include a link under "Manual audits" for users to install a new extension which will run the SD audit?

Yeah.

I realize the issues with that, but in the past we haven't pursued audits that would introduce a new reload pass, or wouldn't work in all clients, or would require calls out to third-party APIs.

Landing this as-is is going to break downstream users who won't be reading the install instructions when they do a simple update and are deploying from a container or whatever, and going to slow down everyone else's install as we compile a native library.

This isn't a make-or-break audit (there are already existing tools for structured data), and the fact that there's a lot of hard work here and that it would be useful for developers doesn't seem sufficient to justify that downside at this point.

That said, plugins aren't really ready yet and we haven't figured out how any will be bundled (if they will be), so maybe there are other solutions we can look at in the meantime.

This is the kind of thing third-party/ can be useful for. What would it take to prebundle jsonld and put it in there? In theory a lot of https://npm.anvaka.com/#/view/2d/jsonld can be pruned since we won't need request.

patrickhulce added some commits Jan 24, 2019

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Jan 24, 2019

Alright now that the dependency issue has been resolved, I think this is ready for another go around! 馃帀

@mattzeunert

This comment has been minimized.

Copy link
Collaborator

mattzeunert commented Feb 4, 2019

Alright now that the dependency issue has been resolved, I think this is ready for another go around! 馃帀

Waiting for this to get merged to move forward with the structured data audit.

@brendankenny @paulirish Could you take another look? 鉂わ笍

try {
// jsonlint-mod always calls console.error when there's an error
// We don't want this behavior, so we stash console.error while it's executing
console.error = () => undefined; // eslint-disable-line no-console

This comment has been minimized.

Copy link
@mattzeunert

mattzeunert Feb 11, 2019

Collaborator

Should no longer be needed 馃檪 circlecell/jsonlint-mod#2

This comment has been minimized.

Copy link
@patrickhulce

patrickhulce Feb 11, 2019

Author Collaborator

hurray that's awesome! :D

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Feb 11, 2019

@danbri

This comment has been minimized.

Copy link

danbri commented Feb 14, 2019

I see some snapshotted schema.org data in https://github.com/GoogleChrome/lighthouse/tree/75b7a51a0eb19fc289338c799bae9074341f466c/sd-validation/assets ... what is the plan for keeping this up to date?

(we've got into trouble before when people have generated artifacts from now-obsolete snapshots, e.g. see https://github.com/google/schemaorg-java )

/cc @vholland @rvguha

@danbri

This comment has been minimized.

Copy link

danbri commented Feb 14, 2019

(to be clear, this is all awesome, any nitpicking is in the spirit of wanting to help make it better, sustainable, and to fit in with our development plans for Schema.org)

@brendankenny
Copy link
Member

brendankenny left a comment

nice, this is looking good!

I see the value in doing this, and so would like to avoid coupling it too much to LH code, but I also recognize that we don't need to be overly optimizing for this hypothetical future case now that there's no clear permanent owner.

yeah, it feels a bit like that's not going to happen at this point :/

I think it could do fine in lib/sd-validation/. It's confusing to have in a top level directory (I think), and really for now it's no more an independent library than manifest-parser, which is just a little longer in loc and was actually also intended to "someday" be an independent module :)

We will have to be vigilant about keeping it componentized, but it has the benefit of being pretty naturally its own thing, so ideally it'll mostly take care of itself.

Show resolved Hide resolved types/structured-data.d.ts Outdated
Show resolved Hide resolved types/structured-data.d.ts Outdated
Show resolved Hide resolved types/isomorphic-fetch.d.ts
Show resolved Hide resolved sd-validation/scripts/generate-schema-tree.js Outdated
Show resolved Hide resolved sd-validation/scripts/generate-schema-tree.js Outdated
Show resolved Hide resolved sd-validation/json.js Outdated
Show resolved Hide resolved sd-validation/json.js Outdated
Show resolved Hide resolved sd-validation/json.js Outdated
* @param {Array<string>} path
*/
module.exports = function walkObject(obj, callback, path = []) {
if (obj === null) {

This comment has been minimized.

Copy link
@brendankenny

brendankenny Mar 28, 2019

Member

do we care about guarding against non-objects passed in? we may not, but the type won't help prevent

This comment has been minimized.

Copy link
@patrickhulce

patrickhulce Mar 28, 2019

Author Collaborator

everything being passed in here is already * or has been filtered out for null already, so I don't think it matters

Show resolved Hide resolved sd-validation/expand.js Outdated

patrickhulce added some commits Mar 28, 2019

const path = require('path');
const fs = require('fs');

const SCHEMA_ORG_URL = 'https://schema.org/docs/jsonldcontext.json';

This comment has been minimized.

Copy link
@mmocny

mmocny Apr 1, 2019

Member

If you fetch with "Accept: application/ld+json" headers, you don't need to use this magic URL. It has been documented before that this URL is for convenience only and should not be treated as an API. Tools will generally fetch the @context value exactly as specified.

curl -H "Accept: application/ld+json" https://schema.org/

This comment has been minimized.

Copy link
@patrickhulce

patrickhulce Apr 2, 2019

Author Collaborator

ah neat thanks!! done 馃憤

@mmocny

This comment has been minimized.

Copy link
Member

mmocny commented Apr 2, 2019

Folks, I'm also interested in seeing LH have support for structured data validation. This patch is really great! I left a few small comments, and I'm happy to help review more in-depth.

Question: this patch is very hard-coded to support only core schema.org vocab. If I was willing to put in the effort to update, would you accept changes to also support hosted/external schema.org extensions? I think it would be better to do this as a follow-up patch.

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Apr 2, 2019

Thanks @mmocny! I'd love to have someone that knows what they're doing invested in this :D (core team currently inherited this without complete context, so expertise would be very much appreciated)

I'd definitely like to land this before further increasing the scope since it's been outstanding for quite some time, but supporting beyond schema.org if it's being actively consumed by search engines/crawlers/others sounds good to me! (so long as the tradeoff between usage and cost to Lighthouse users is sufficient of course 馃槃 )

@mmocny

This comment has been minimized.

Copy link
Member

mmocny commented Apr 2, 2019

I'm not sure I qualify as "knows whats they are doing", but I can try. How can I help get this finished?

(Also, it looks like this patch has some issues with original authors not signing CLA. I don't know the history here...)

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Apr 2, 2019

(Also, it looks like this patch has some issues with original authors not signing CLA. I don't know the history here...)

That terminal CLA state happens when @brendankenny or @paulirish's suggested changes get applied to a non-googler's patches for some reason. We haven't yet been able to figure out why 馃し鈥嶁檪锔

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Apr 2, 2019

How can I help get this finished?

As soon as @brendankenny signs off IMO it's landable. If you'd like to review it for anything that you think would be a showstopper to landing please feel free to help him out :)

@davidlehn

This comment has been minimized.

Copy link

davidlehn commented Apr 2, 2019

That googlebot CLA warning happened immediately after my trivial doc fix suggestion got applied [1]. So perhaps my fault? I think I've done some Google CLA thing in the past but if I need to do it again here for a one-liner just to clear things up, let me know how.

[1] #6750 (comment)

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Apr 2, 2019

Oh thanks @davidlehn just a simple "I consent to those edits being released under the Apache 2.0 license in this repo" is fine :)

@mmocny

This comment has been minimized.

Copy link
Member

mmocny commented Apr 8, 2019

@davidlehn May you please give the OK to unblock this merge?

@davidlehn

This comment has been minimized.

Copy link

davidlehn commented Apr 8, 2019

I consent to my edits being released under the Apache 2.0 license in this repo.

@mmocny

This comment has been minimized.

Copy link
Member

mmocny commented Apr 8, 2019

Thanks!

I also think that @brendankenny comments from last review are addressed. I think this patch has addressed issues that have been brought up.

I'm new to this feature, but anxious to see it land.

Questions:

  • It looks like there were various earlier efforts on this feature, and a couple of earlier PR are still open. Should we audit those to make sure all the work and comments have been addressed in this patch as well? Or, has that already been done? (I wasn't sure how this now relates to the original PR #5446 -- but it looks like #4359 describes this in the most recent comment)
  • The work in this patch adds json-ld and schema.org validation and tests, but is it actually exposed to Lighthouse audit already? Will that have to come next? Should there be a process for running the scripts to update the schema definition files?
  • I'd like to write a manual audit which uses much of the work from this patch, but to help validate a schema.org extension (not core schema). I believe I could do this either by adding support for extensions directly into lib/sd-validation, or just by doing it entirely in my manual audit. I'll file a separate issue once this patch lands to ask if there is support for the former.
@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Apr 8, 2019

It looks like there were various earlier efforts on this feature, and a couple of earlier PR are still open. Should we audit those to make sure all the work and comments have been addressed in this patch as well? Or, has that already been done?

Yeah this was split out from #5446 and addressed anything outstanding at the time.

is it actually exposed to Lighthouse audit already? Will that have to come next?

No it won't be, that will be followup work by @mattzeunert.

Should there be a process for running the scripts to update the schema definition files?

Probably, but we don't have one proposed. If you have any insight to share on how frequently these are likely to become out-of-date we'd love to hear it :)

I'll file a separate issue once this patch lands to ask if there is support for the former.

Sounds great! :)

@brendankenny

This comment has been minimized.

Copy link
Member

brendankenny commented Apr 8, 2019

That googlebot CLA warning happened immediately after my trivial doc fix suggestion got applied [1]. So perhaps my fault? I think I've done some Google CLA thing in the past but if I need to do it again here for a one-liner just to clear things up, let me know how.

[1] #6750 (comment)

can confirm that @davidlehn does indeed have a Google CLA

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Apr 8, 2019

sweet thanks @brendankenny! would you mind confirming your intention to review or not review? :)

@brendankenny
Copy link
Member

brendankenny left a comment

just the index.js filename change, otherwise let's do this!

@@ -0,0 +1,75 @@
/**

This comment has been minimized.

Copy link
@brendankenny

brendankenny Apr 8, 2019

Member

rename sd-validator.js(or whatever) so we don't add another index.js file :)

@patrickhulce

This comment has been minimized.

Copy link
Collaborator Author

patrickhulce commented Apr 8, 2019

let's do this!

馃帀 馃帀 馃帀 馃帀 馃帀 馃帀 馃帀 馃帀 馃帀 馃帀

@rviscomi

This comment has been minimized.

Copy link
Member

rviscomi commented Apr 8, 2019

itshappening.gif

Huge thank you to @patrickhulce @brendankenny and everyone else who helped get this out!

@patrickhulce patrickhulce merged commit 3f15b67 into master Apr 8, 2019

6 of 7 checks passed

cla/google CLAs are signed, but unable to verify author consent
bundlesize Good job! bundle size < maxSize
Details
codecov/patch 98.34% of diff hit (target 90.39%)
Details
codecov/project 90.63% (+0.23%) compared to 9790337
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pr title lint PR title is good to go, boss
Details

@patrickhulce patrickhulce deleted the structued_data_pkg branch Apr 8, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can鈥檛 perform that action at this time.