Skip to content
This repository has been archived by the owner on Sep 14, 2021. It is now read-only.

extension elements in sitemaps #151

Open
pbiron opened this issue Apr 7, 2020 · 14 comments
Open

extension elements in sitemaps #151

pbiron opened this issue Apr 7, 2020 · 14 comments
Labels
Type: Bug Something isn't working

Comments

@pbiron
Copy link
Contributor

pbiron commented Apr 7, 2020

extension elements in sitemaps

Describe the bug

#88 added filters that allow extension elements to be added to sitemaps (e.g., core_sitemaps_posts_url_list).

However, there are a couple of problems with the solution:

  1. Core_Sitemaps_Renderer::get_sitemap_xml() outputs all elements in the sitemaps namespace (i.e., http://www.sitemaps.org/schemas/sitemap/0.9) and the sitemaps XML Schema specifies that any element children of sitemap:url other than sitemap:loc, sitemap:lastmod, sitemap:changefreq and sitemap:priority must be in another namespace (where sitemap:xyz is a QName for an element whose local-name() is xyz and whose namespace-uri() is the sitemaps namespace URI). Therefore, if a plugin hooks into core_sitemaps_posts_url_list and adds a foo property to each URL in the list, the generated sitemap XML will be invalid against the XML Schema. There is currently no way for a plugin to tell the renderer what namespace to use for these extension elements.
  2. The sitemaps XML Schema also specifies that the child elements of sitemap:url must be in the order above, with all elements in a foreign namespace coming at the end. So, if a plugin hooks into core_sitemaps_posts_url_list and does something like foreach ( $url_list as $url ) { $url = array_merge( array( 'priority' => 0.9 ), $url ); } then Core_Sitemaps_Renderer::get_sitemap_xml() will output a sitemap that is invalid against the XML Schema.

I do not know whether sitemap consumers (e.g., Google, Bing, Yandex, etc) would fail to process a sitemap that was invalid against the XML Schema, but do we want to try our best to ensure that generated sitemaps are valid?

@pbiron pbiron added the Type: Bug Something isn't working label Apr 7, 2020
@pbiron
Copy link
Contributor Author

pbiron commented Apr 7, 2020

For example,

add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
   foreach ( $url_list as &$url ) {
      $url['foo'] = 'bar';
   }
   return $url_list;
} );

Will produce something like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://example.com/</loc>
      <lastmod>2020-03-01T18:27:27-07:00</lastmod>
      <foo>bar</loc>
   </url>
</urlset>

which is invalid against the sitemaps XML Schema, since the foo element must be in a namespace other than http://www.sitemaps.org/schemas/sitemap/0.9. The sitemap is well-formed XML, but not valid against the XML Schema.

Similarly,

add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
   foreach ( $url_list as &$url ) {
      $url = array( 'priority' => 0.9 ) + $url;
   }
   return $url_list;
} );

Will produce something like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <priority>0.9</priority>
      <loc>https://example.com/</loc>
      <lastmod>2020-03-01T18:27:27-07:00</lastmod>
   </url>
</urlset>

which, again, is not valid against the XML Schema because the <loc> child element must come before the <priority> child element.

@pbiron
Copy link
Contributor Author

pbiron commented Apr 10, 2020

I do not know whether sitemap consumers (e.g., Google, Bing, Yandex, etc) would fail to process a sitemap that was invalid against the XML Schema, but do we want to try our best to ensure that generated sitemaps are valid?

I had a closer look at the XML Schema while I was working on #153, and I sure hope that none of the major sitemap consumers require that sitemaps be valid against the schema...because they way the schema is written ensures that the average WP plugin author won't be able to produce sitemap instances containing extension elements that validate against the schema!

Why? Because the content model for Q{http://www.sitemaps.org/schemas/sitemap/0.9}url requires strict validation for elements in extension namespaces (which is a bad decision on the part of the schema author(s), IMHO):

<xsd:complexType name="tUrl">
   <xsd:annotation>
      <xsd:documentation>
         Container for the data needed to describe a document to crawl.
      </xsd:documentation>
   </xsd:annotation>
   <xsd:sequence>
      <xsd:element name="loc" type="tLoc"/>
      <xsd:element name="lastmod" type="tLastmod" minOccurs="0"/>
      <xsd:element name="changefreq" type="tChangeFreq" minOccurs="0"/>
      <xsd:element name="priority" type="tPriority" minOccurs="0"/>
      <xsd:any namespace="##other" minOccurs="0" maxOccurs="unbounded" processContents="strict"/>
   </xsd:sequence>
</xsd:complexType>

What that means is that a schema validator needs to be able to find a schema for elements in any extension namespaces. I think the chances are pretty low that the average WP plugin author is going to write a schema document for their extensions and make that schema document available on the web for validators to download.

Even if plugin authors did make schema documents available for their extension namespace(s), we'd have to add a filter for them to hook into to specify the URL for the schema document for their extension namespace and then output the appropriate markup in the sitemap instance so that validators knew where to find the schema for the extension namespace, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="urn:sparrowhawk.com/core-sitemaps
                          https://sparrowhawkcomputing.com/schema/core-sitemaps.xsd"
   >
...
</urlset>

What the schema author(s) should have (IMHO) written is:

<xsd:any namespace="##other" minOccurs="0" maxOccurs="unbounded" processContents="lax"/>

which would allow validators to validate extension elements if they could find a schema definition for them, or silently ignore them for validation purposes if a schema definition for them couldn't be found.

@pbiron
Copy link
Contributor Author

pbiron commented Apr 10, 2020

That said, I still think we should try to find a way for plugin authors to specify a namespace URI for extensions they add via the various core_sitemaps_{$object_type}_url_list filters.

I haven't come up a great way yet, but I'll throw the following good way out just for comment:

  1. add a filter, such as core_site_maps_namespace_bindings, that plugins would use to add prefix to namespace URI bindings (see below for an example)
    a. we'd have to find a way to resolve conflicts if 2 different plugins tried to register the same prefix
    b. document that plugins that hook into the various core_sitemaps_{$object_type}_url_list filters should use PrefixedNames (i.e., Prefix . ':' . LocalName ) when they add keys to the $url_list
  2. modify Core_Sitemap_Renderer::get_sitemap_xml() to:
    a. add namespace declarations for current bindings when creating the SimpleXMLElement (see below)
    b. when iterating over the $url_item keys, check whether the key is a PrefixedName for which there is a namespace binding and call addChild() with the proper arguments for SimpleXML to add the element in the correct namespace

So, a plugin would do:

add_filter( 'core_site_maps_namespace_bindings', function( $namespace_bindings ) {
	$namespace_bindings[ 'pvb' ] => 'urn:sparrowhawkcomputing.com';
} );
add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
	foreach ( $url_list as &$url ) {
		$url['pvb:foo'] = 'bar';
	}
	
	return $url_list;
} );

Then, Core_Sitemap_Renderer::get_sitemap_xml() would become something like:

public function get_sitemap_xml( $url_list ) {
	$namespace_bindings = apply_filters( 'core_site_maps_namespace_bindings', array() );

	$namespace_decls = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"';
	foreach ( $namespace_bindings as $prefix => $namespace_uri ) {
		$namespace_decls .= sprintf( ' xmlns:%s="%s"', $prefix, $namespace_uri );
	}
	$urlset = new SimpleXMLElement( '<?xml version="1.0" encoding="UTF-8" ?>' . $this->stylesheet . '<urlset ' . $namespace_decls . '/>' );
	
	foreach ( $url_list as $url_item ) {
		$url = $urlset->addChild( 'url', null, 'http://www.sitemaps.org/schemas/sitemap/0.9' );

 		foreach ( $url_item as $name => $value ) {
 			if ( ! is_scalar( $value ) ) {
 				// @todo add a _doing_it_wrong() since esc_attr() would generate a run-time PHP error.
 				continue;
 			} elseif ( in_array( $name, array( 'loc', 'lastmod', 'changefreq', 'priority' ) ) ) {
 				$url->addChild( $name, esc_attr( $value ), 'http://www.sitemaps.org/schemas/sitemap/0.9' );
 			} elseif ( self::is_xml_PrefixedName( $name ) ) {
 				list( $prefix, $local_name ) = explode( ':', $name );
 				if ( ! empty( $namespace_bindings[ $prefix ] ) ) {
	 				$url->addChild( $local_name, esc_attr( $value ), $namespace_bindings[ $prefix ] );
 				}
 			}
 			else {
 				// @todo add a _doing_it_wrong():
 				//       a. any other NCName wouldn't be valid against the sitemaps XML Schema
 				//       b. to avoid a bug in `SimpleXMLElement::addChild()`, which doesn't
 				//          check that the first param is a legal XML element name:-(  See below.
				continue;
			}
 		}
 	}

	return $urlset->asXML();
 }

and the new is_xml_PrefixedName() method is:

/**
 * Does a string match the PrefixedName production from the Namespaces in XML spec?
 *
 * @link https://www.w3.org/TR/REC-xml-names/#NT-PrefixedName
 *
 * The regexes used by this method are based on the corresponding
 * productions in the {@link https://www.w3.org/TR/REC-xml/ Extensible Markup Language (XML) 1.0 (Fifth Edition)}
 * and {@link https://www.w3.org/TR/REC-xml-names/ Namespaces in XML} W3C specifications.
 *
 * @param string $str The string to check.
 * @return bool True is `$str` matches the PrefixedName production, false otherwise.
 *
 * @todo Strech goal: create a fillblown core XML API, that added an abstraction layer
 *       around SimpleXMLElement (and/or XMLReader/XMLWriter) that a number of useful
 *       functions/methods like this, e.g., is_xml_NCName(), is_xml_QName(), etc.
 *       Of course, such an API is beyond the scope of this project (which is why
 *       it's a "strech goal" :-), but would be really helpful for feeds, WXR import/export, etc!
 */
static function is_xml_PrefixedName( $str ) {
	$NameStartChar = 'A-Za-z_\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}\x{370}-\x{37D}\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}';
	$NameChar      = "{$NameStartChar}\-\.0-9\x{B7}\x{0300}-\x{036F}\x{203F}-\x{2040}";
	$NCName        = "[{$NameStartChar}][{$NameChar}]*";
	$PrefixedName  = "{$NCName}:{$NCName}";

	return 1 === preg_match( "/^{$PrefixedName}\$/u", $str );
}

To illustrate the SimpleXMLElement bug mentioned in the @todo above, try the following:

add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
	foreach ( $url_list as &$url ) {
		$url['this is not well-formed'] = '';
	}
	
	return $url_list;
} );

and you'll see that what SimpleXMLElement generates is:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://example.com/</loc>
      <this is not well-formed/>
   </url>
</urlset>

Comments?

@pbiron
Copy link
Contributor Author

pbiron commented Apr 10, 2020

And while I'm being pedantic (I'm in one of those moods today...I'll blame it on Covid-ID isolation :-), another bad thing about the sitemaps XML Schema is the type definition for the Q{http://www.sitemaps.org/schemas/sitemap/0.9}loc element is:

<xsd:simpleType name="tLoc">
  <xsd:restriction base="xsd:anyURI">
    <xsd:minLength value="12"/>
    <xsd:maxLength value="2048"/>
  </xsd:restriction>
</xsd:simpleType>

which means that:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://a.tv</loc>
   </url>
</urlset>

is not a valid sitemap (since the URL is shorter than 12 characters).

Note that http://a.tv is a real URL that resolves!

OK, end of pedantic ranting (for today :-)

@pbiron
Copy link
Contributor Author

pbiron commented Apr 11, 2020

So, a plugin would do:

I just realized that I forgot to show the sitemap that would be generated by the suggested mods above:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:pvb="urn:sparrowhawkcomputing.com">
   <url>
      <loc>https://example.com</loc>
      <pvb:foo>bar</pvb:foo>
   </url>
<urlset>

@joemcgill
Copy link
Contributor

Thanks for clearly explaining the name spacing issue with the XML sitemaps spec and illustrating with examples. I wonder if instead of a simple filter approach for adding custom values to the array, we need to provide a registry for URL properties which requires you to add a namespace in order to register the property. Something like this pseudocode:

register_sitemap_property( $namespace, $name, $callback );

We can pass some data to any registered callback and if it returns a truthy value, then a property is returned that looks like <$namespace:$name>$value</$namespace:$name>. Thoughts?

@swissspidy
Copy link
Contributor

@felixarntz @adamsilverstein would be interested to hear your thoughts on this

@adamsilverstein
Copy link
Contributor

I'm not familiar enough with how sitemaps are typically extended to answer. I like the registry suggestion from @joemcgill which might provider a cleaner API than filters.

What do current popular sitemap plugins provide for extensibility?

@swissspidy
Copy link
Contributor

In #184 I propose a rather simple solution for this. It does not cover some edge cases like the order of elements, which I think could be handled in a future release if necessary.

@Rajesh-Techblissonline
Copy link

Rajesh-Techblissonline commented May 24, 2020

This is an interesting issue. I am using the Core Sitemaps module within my Platinum SEO Plugin. And I have addressed this problem with a filter that lets plugin authors to render the sitemap as they want usint their custom stylesheets. This filter will be added to the function render_sitemap($url_list) in class core_sitemaps_renderer . But the function definition need to be changed and it will need an additional argument $object_subtype. So the function definition becomes render_sitemap( $url_list, $object_subtype ).

Then the code within this function is changed to this.

	/**
	 * Renders a sitemap.
	 *
	 * @since 5.5.0
	 *
	 * @param array $url_list A list of URLs for a sitemap.
	 */
	public function render_sitemap( $url_list, $object_subtype ) {
		header( 'Content-type: application/xml; charset=UTF-8' );

		$this->check_for_simple_xml_availability();
		
		$sitemap_xml = apply_filters( "core_sitemaps_get_sitemap_xml", $url_list, $object_subtype );
		
		if (  empty( $sitemap_xml ) ) {

		    $sitemap_xml = $this->get_sitemap_xml( $url_list );
		
		}

		if ( ! empty( $sitemap_xml ) ) {
			// All output is escaped within get_sitemap_xml().
			// phpcs:ignore WordPress.Security.EscapeOutput.OutputNotEscaped
            $dom = new DOMDocument;
            $dom->preserveWhiteSpace = false;
            $dom->formatOutput = true;
            $dom->loadXML($sitemap_xml);
            echo $dom->saveXML();		
		}
	}

Note that a filter core_sitemaps_get_sitemap_xml has been added to the function that will let plugin authors to create their sitemap XML and return it. This filter also passed the object_subtype to the filter function that will help plugin authors to use an appropriate custom stylesheet based on the object_subtype.

This is how my plugin handles this.

add_filter('core_sitemaps_get_sitemap_xml', array($this, 'psp_get_sitemap_xml'), 10, 2);
	/**
	 * Gets XML for a sitemap.
	 *
	 * @since 5.5.0
	 *
	 * @param array $url_list A list of URLs for a sitemap.
	 * @return string|false A well-formed XML string for a sitemap index. False on error.
	 */
	public function psp_get_sitemap_xml( $url_list, $object_subtype ) {
	    
	    $psp_sm_settings = $this->psp_sitemap_settings;
	    
	    $psp_lastmod_sitemaps_enabled = isset($psp_sm_settings['include_lastmod']) ? $psp_sm_settings['include_lastmod'] : '';
	    
		$psp_image_sitemaps_enabled = isset($psp_sm_settings['include_images']) ? $psp_sm_settings['include_images'] : '';
	    
	    if ($object_subtype === 'post') {
	        
	         $psp_stylesheet_url = plugins_url( '/sitemap.xsl', __FILE__ );
                $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>'; 
	        
	        if ($psp_lastmod_sitemaps_enabled) {
	        
    	        $psp_stylesheet_url = plugins_url( '/sitemap-post.xsl', __FILE__ );
                $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>'; 
	        }
	        
            if ($psp_image_sitemaps_enabled) {
                
                $psp_stylesheet_url = plugins_url( '/sitemap-image.xsl', __FILE__ );
                $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>';
                
            }
	    } else {
	        $psp_stylesheet_url = plugins_url( '/sitemap.xsl', __FILE__ );
            $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>'; 
	    }
		$urlset = new SimpleXMLElement(
			sprintf(
				'%1$s%2$s%3$s',
				'<?xml version="1.0" encoding="UTF-8" ?>',
				$this->stylesheet,
				'<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 http://www.google.com/schemas/sitemap-image/1.1/sitemap-image.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" />'
			)
		);

		foreach ( $url_list as $url_item ) {
			$url = $urlset->addChild( 'url' );

			// Add each attribute as a child node to the URL entry.
			foreach ( $url_item as $attr => $value ) {
				if ( 'url' === $attr ) {
					
					$url->addChild( $attr, esc_url( $value ) );
					
				} else if ('image' === $attr) {
				    
					foreach ($value as $imageattr) {
					    $image = $url->addChild('image:image', null, 'http://www.google.com/schemas/sitemap-image/1.1');
					    if(array_key_exists("loc", $imageattr)) $image->addChild('image:loc',esc_url( $imageattr['loc'] ), 'http://www.google.com/schemas/sitemap-image/1.1');
					    if(array_key_exists("title", $imageattr)) $image->addChild('image:title',esc_attr( $imageattr['title'] ), 'http://www.google.com/schemas/sitemap-image/1.1');
					    if(array_key_exists("caption", $imageattr)) $image->addChild('image:caption',esc_attr( $imageattr['caption'] ), 'http://www.google.com/schemas/sitemap-image/1.1');
					}
					
				} else {
				    
					$url->addChild( $attr, esc_attr( $value ) );
				}
			}
		}

		return $urlset->asXML();
	}

Pls. do let me know your feedback on this. @pbiron @swissspidy

@Rajesh-Techblissonline
Copy link

Rajesh-Techblissonline commented May 27, 2020

The URL LIST with all the custom POST attributes added by plugins are rendered by the plugin code as follows. This implementation may vary from one plugin to another plugin.

if ($object_subtype === 'post') {
	        
	         $psp_stylesheet_url = plugins_url( '/sitemap.xsl', __FILE__ );
                $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>'; 
	        
	        if ($psp_lastmod_sitemaps_enabled) {
	        
    	        $psp_stylesheet_url = plugins_url( '/sitemap-post.xsl', __FILE__ );
                $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>'; 
	        }
	        
            if ($psp_image_sitemaps_enabled) {
                
                $psp_stylesheet_url = plugins_url( '/sitemap-image.xsl', __FILE__ );
                $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>';
                
            }
	    } else {
	        $psp_stylesheet_url = plugins_url( '/sitemap.xsl', __FILE__ );
            $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>'; 
	    }

In the above piece of plugin code, the plugin checks whether the attributes lastmod and images are enabled for the POST (object-subtype) and sets an appropriate custom stylesheet for this sitemap-xml (based on the object-subtype and the enabled attributes).

For the rest of the object-subtypes, it just sets a default custom stylesheet as shown below.

} else {
	        $psp_stylesheet_url = plugins_url( '/sitemap.xsl', __FILE__ );
            $this->stylesheet = '<?xml-stylesheet type="text/xsl" href="' . esc_url( $psp_stylesheet_url ) . '" ?>'; 
	    }

The for loop then adds the attributes present in the URL LIST array to a SimpleXMLElement object and returns a sitemap-xml string, expected by the proposed filter core_sitemaps_get_sitemap_xml in the function render_sitemap of class core_sitemaps_renderer

foreach ( $url_list as $url_item ) {
			$url = $urlset->addChild( 'url' );

			// Add each attribute as a child node to the URL entry.
			foreach ( $url_item as $attr => $value ) {
				if ( 'url' === $attr ) {
					
					$url->addChild( $attr, esc_url( $value ) );
					
				} else if ('image' === $attr) {
				    
					foreach ($value as $imageattr) {
					    $image = $url->addChild('image:image', null, 'http://www.google.com/schemas/sitemap-image/1.1');
					    if(array_key_exists("loc", $imageattr)) $image->addChild('image:loc',esc_url( $imageattr['loc'] ), 'http://www.google.com/schemas/sitemap-image/1.1');
					    if(array_key_exists("title", $imageattr)) $image->addChild('image:title',esc_attr( $imageattr['title'] ), 'http://www.google.com/schemas/sitemap-image/1.1');
					    if(array_key_exists("caption", $imageattr)) $image->addChild('image:caption',esc_attr( $imageattr['caption'] ), 'http://www.google.com/schemas/sitemap-image/1.1');
					}
					
				} else {
				    
					$url->addChild( $attr, esc_attr( $value ) );
				}
			}
		}

		return $urlset->asXML();

@Rajesh-Techblissonline
Copy link

In #184 I propose a rather simple solution for this. It does not cover some edge cases like the order of elements, which I think could be handled in a future release if necessary.

Order can be handled in the WP_Query_Args or WP_Term_Args filter. It can be set to either ASC or DESC in the Query Args.

@swissspidy
Copy link
Contributor

Order can be handled in the WP_Query_Args or WP_Term_Args filter. It can be set to either ASC or DESC in the Query Args.

I was referring to order of attributes within a single sitemap entry, not WP_Query order

@Rajesh-Techblissonline
Copy link

Rajesh-Techblissonline commented May 28, 2020

@swissspidy Oh ok! My bad!

The above proposed filter to let the third party plugin developer to build his own sitemap-xml might take care of that too. So it will address

  1. the order of attributes within a sitemap entry,
  2. custom stylesheets for different object_types, and/or object_subtypes. Ideally, the filter should pass both the object_type and object_subtype to the function attached to the filter. In my example above, I have only used object_subtype. But it should pass object_type as well, since either of them may be used to determine custom stylesheets.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants