Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent schema.org duplicates #992

Merged

Conversation

oscarssanchez
Copy link
Contributor

@oscarssanchez oscarssanchez commented Mar 8, 2018

Request For Review

Hi @westonruter,

Could you use this PR, which addresses #962?

Originally, I used json_decode( $script->nodeValue ) to convert the schema.org json data to an array. However, that brought some issues with get_mustache_tag_placeholders(), as it adds some extra markup to the schema.org script tag.

I opted to use preg_match() as I saw the schema.org URL sometimes can either start with http or https.

Fixes #962.

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

@@ -14,7 +14,6 @@ function amp_post_template_init_hooks() {
add_action( 'amp_post_template_head', 'amp_post_template_add_scripts' );
add_action( 'amp_post_template_head', 'amp_post_template_add_fonts' );
add_action( 'amp_post_template_head', 'amp_post_template_add_boilerplate_css' );
add_action( 'amp_post_template_head', 'amp_print_schemaorg_metadata' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be removed. The legacy post templates should remain as-is, I believe. The ensure_required_markup method below does not apply for the legacy post template actions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @westonruter

Thanks for your feedback. I corrected that mistake and restored the line.

* @param boolean $expected The expected result.
*/
public function test_schema_org_present( $script, $expected ) {
$page = '<html><head><script>%s</script></head><body>Test</body></html>';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The <script> here is missing the type="application/ld+json".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good observation, I added the type attribute on the new tests for ensure_required_markup()

public static function schema_org_present( DOMDocument $dom ) {
$head = $dom->getElementsByTagName( 'head' )->item( 0 );
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 1 === preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before looking at the content of the element ($script->nodeValue) I think it should first check 'application/ld+json' === $script->getAttribute( 'type' ).

Also, I think the 1 === should be removed. As long as preg_match() returns a truthy value that's all we really care about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the integer comparison and added your suggestion, indeed it makes it more precise.

* @param DOMDocument $dom Representation of the document.
* @return bool
*/
public static function schema_org_present( DOMDocument $dom ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to is_schema_org_metadata_present. But actually, I think maybe the logic should just be inlined in ensure_required_markup like the other checks are done. True that it may be better design to break up the logic into separate methods, but since the existing logic is not in methods then I don't know if we should split out this one separately.

If you changed ensure_required_markup from protected to be public and then added the logic from schema_org_present to it, along with unit tests for ensure_required_markup, then that would be great. Later we may very well want to break up ensure_required_markup into smaller methods, but I think that would probably be done along with breaking up the AMP_Theme_Support class into separate classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added your suggestions.

I also did the test for ensure_required_markup, however, I only made tests for the schema.org script part. Could you use tests for the entire function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this method needs more tests, so if that's something you want to add that would be great.

@@ -829,6 +828,13 @@ protected static function ensure_required_markup( DOMDocument $dom ) {
) );
$head->insertBefore( $meta_viewport, $meta_charset->nextSibling );
}
if ( ! self::schema_org_present( $dom ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted below, I think this logic should be inlined (for now) as it is done above for $meta_charset and $meta_viewport and $rel_canonical below.

@westonruter westonruter added this to the v0.7 milestone Mar 8, 2018
Check if schema.org script is present.
Add schema.org only if it doesn't exist.
// Prevent schema.org duplicates.
$schema_org_script = null;
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regular expression here could be brittle. There's no guarantee that the @context will be the first key, and there's no guarantee there wouldn't be whitespace between the strings. I think doing a simple false !== strpos( $script->nodeValue, 'schema.org' ) could be sufficient here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may not even need to look at the value. Just checking if the type is application/ld+json may be just fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @westonruter

I opted to search for both the attribute and the 'schema.org' substring as you mentioned it should check for any JSON-LD script in the head which mentions 'schema.org' in any form.

$schema_org_script = null;
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {
$schema_org_script = $script->nodeValue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the value isn't being used, it may be better to just store a flag like $has_schema_org_metadata = true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it is true, I made the corresponding correction.

}
if ( ! $schema_org_script ) {
$script = $dom->createElement( 'script', wp_json_encode( amp_get_schemaorg_metadata() ) );
AMP_DOM_Utils::add_attributes_to_node( $script, array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're using $dom->createElement() here then I think it makes more sense to just do $script->setAttribute( 'type', 'application/ld+json' )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also true. Corrected it.

$dom = new DOMDocument();
$dom->loadHTML( $page );
AMP_Theme_Support::ensure_required_markup( $dom );
$this->assertEquals( substr_count( $dom->saveHTML(), $schema_test_value ), 1 );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should add a test for when JSON is output but the slashes are not escaped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And add a test for when @context is not the first key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made tests for when @ anothercontext is the first key and has schema.org present, I hope I understood correctly.

@westonruter
Copy link
Member

One more higher-level question: when I first raised this issue I was thinking that there could only be one such metadata script in the document. However, apparently it's fine to have multiple: https://stackoverflow.com/questions/30723531/best-json-ld-practices-using-multiple-script-elements

The main concern here for the AMP plugin is to allow for other plugins to take control of the generation of the Schema.org metadata instead of the defaults that the AMP plugin outputs, but then to have the AMP plugin output a good fallback when others don't do so.

I'm feeling like if there is any JSON-LD script in the head (which mentions schema.org) then we can take this as the signal to just skip outputting the default metadata from the plugin.

$schema_org_script = null;
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {
$schema_org_script = $script->nodeValue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't this do json_decode( $script->nodeValue, false ) and check if wp_parse_url( $script['@context'], PHP_URL_HOST ) === 'schema.org?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @westonruter,

Trying to use wp_parse_url with json_decode( $script->nodeValue, false ) breaks the site, but it works fine when $script->nodeValue is converted to an array. The logic however stops working.

I've found that if you comment out these lines in class-amp-dom-utils.php, the logic works again:

	$placeholders = self::get_mustache_tag_placeholders();
	$document     = str_replace(
		array_keys( $placeholders ),
		array_values( $placeholders ),
		$document
	);

I opted to keep your previous suggestions since i don't know how important are those lines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think this makes sense. If I have some JSON that looks like:

{"foo": {"bar":}}

The last }} there is going to get replaced with a placeholder (since this token is returned by get_mustache_tag_placeholders) before parsing the DOM, to then get put back in AMP_DOM_Utils::get_content_from_dom_node(). That means we cannot parse JSON from the parsed DOM currently, which is not ideal. To fix this we'd need to limit the replacements to only happen inside of href and src attributes only, using something like this:

diff --git a/includes/utils/class-amp-dom-utils.php b/includes/utils/class-amp-dom-utils.php
index 5288c04..f20fba1 100644
--- a/includes/utils/class-amp-dom-utils.php
+++ b/includes/utils/class-amp-dom-utils.php
@@ -77,9 +77,15 @@ class AMP_DOM_Utils {
 		 * elements since it is faster and wouldn't change the outcome.
 		 */
 		$placeholders = self::get_mustache_tag_placeholders();
-		$document     = str_replace(
-			array_keys( $placeholders ),
-			array_values( $placeholders ),
+		$document     = preg_replace_callback(
+			'#\s(src|href)=("[^"]+"|\'[^\']+\')#s',
+			function( $matches ) use ( $placeholders ) {
+				return str_replace(
+					array_keys( $placeholders ),
+					array_values( $placeholders ),
+					$matches[0]
+				);
+			},
 			$document
 		);

That would probably be something good to add here to ensure we can read JSON out of the DOM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've merged the fix for this in #1019 via 4a10ee4

@oscarssanchez oscarssanchez force-pushed the add/962-prevent-schema-duplicates branch from 8b96d77 to c66bfc4 Compare March 11, 2018 00:15
@westonruter westonruter merged commit ab0ae8a into ampproject:develop Mar 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants