Skip to content
This repository has been archived by the owner on Sep 14, 2021. It is now read-only.

Enhance the sitemap stylesheet to dynamically generate columns in the table without knowing a priori what elements will appear in the sitemap. #153

Conversation

pbiron
Copy link
Contributor

@pbiron pbiron commented Apr 10, 2020

Issue Number

Fixes #152

Description

Well, I went ahead and took some time to work on this even tho no one has replied to the issue (I'm kind of compulsive that way :-).

This PR modifies the XSLT stylesheet for sitemaps to allow it to render columns in the HTML table output for all distinct children of Q{http://www.sitemaps.org/schemas/sitemap/0.9}url in the sitemap (see below for the meaning of the Q{...}xyz notation), without knowing a priori what children there may be.

The columns are ordered as follows:

  1. elements defined in the sitemaps XML Schema come first, in the order they are specified in the content model for Q{http://www.sitemaps.org/schemas/sitemap/0.9}url, i.e., URL, Last Modified, Change Frequency and Priority.
  2. After that will come "extension" elements in the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, ordered lexically by their local-name() (see extension elements in sitemaps #151 for why that is necessary, even though such sitemaps are invalid according to the XML Schema)
  3. After that, come columns for all elements in extension namespaces (even though it is not yet possible to generate sitemap instances with elements in extension namespaces, again, see extension elements in sitemaps #151). Columns for extension elements are ordered lexically by their namespace URI and then by their local-name().

The new stylesheet is smart enough to deal with cases where different Q{http://www.sitemaps.org/schemas/sitemap/0.9}url elements have different children, e.g.

<urlset xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'>
	<url>
		<loc>https://example.com/</loc>
	</url>
	<url>
		<loc>https://example.com/some-page/</loc>
		<lastmod>2020-03-01T18:27:27-07:00</lastmod>
	</url>
	<url>
		<loc>https://example.com/another-page/</loc>
		<changefreq>weekly</changefreq>
	</url>
	<url>
		<loc>https://example.com/yet-another-page/</loc>
		<foo xmlns='urn:sparrowhawkcomputing.com'>some value</foo>
	</url>
</urlset>

will produce a table like:

<table>
	<thead>
		<tr>
			<th>URL</th>
			<th>Last Modified</th>
			<th>Change Frequency</th>
			<th class='extension urn:sparrowhawkcomputing.com'>Foo</th>
		</tr>
	</thead>
	<tbody>
		<tr>
			<td><a href='https://example.com/'>https://example.com/</td>
			<td></td>
			<td></td>
			<td class='extension urn:sparrowhawkcomputing.com'></td>
		</tr>
		<tr>
			<td><a href='https://example.com/some-page/'>https://example.com/some-page/</td>
			<td>2020-03-01T18:27:27-07:00</td>
			<td></td>
			<td class='extension urn:sparrowhawkcomputing.com'></td>
		</tr>
		<tr>
			<td><a href='https://example.com/another-page/'>https://example.com/another-page/</td>
			<td></td>
			<td>weekly</td>
			<td class='extension urn:sparrowhawkcomputing.com'></td>
		</tr>
		<tr>
			<td><a href='https://example.com/yet-another-page/'>https://example.com/yet-another-page/</td>
			<td></td>
			<td></td>
			<td class='extension urn:sparrowhawkcomputing.com'>some value</td>
		</tr>
	</tbody>
</table>

That example also illustrates that a class attribute is added for columns resulting from extension elements in case the plugins that add them want to style such columns differently (to make it clear to users that they extensions). Of course, they won't be able to use the class shorthand notation in the CSS selectors, they'll have to do th[class~="urn:sparrowhawkcomputing.com"], td[class~="urn:sparrowhawkcomputing.com"] { color: red; } since the namespace URI's will contain characters not legal in the class shorthand selector syntax.

Because no browsers currently support anything other than XSLT/XPath 1.0, the stylesheet is a little more complicated than it would be if they supported XSLT/XPath 2.0/3.0. In particular, it requires using one of 2 extension functions, exsl:node-set() or msxsl:node-set(). The first is supported by Chrome, Firefox, Safari, Opera and other browsers; the later is supported by Edge and IE (11). As a fallback in case neither function is available, then the stylesheet will just render a single URL column.

We could (should?) get even more elaborate by providing a hook so that plugins that add extension elements to a sitemap can provide translatable text to use in the column heading for their extension elements (instead of just upper-casing the first letter of the local-name() as done here...but we can always add that later.

Here's a few miscelaneous notes:

  • I've included inline comments in the new stylesheet to explain what it's doing...hopefully they'll make sense. There is an equivalent of PHPDoc for XSLT called XSLTDoc. I didn't bother to use it, for now, but it something to consider in the long run.
  • the Q{...}xyz notation used here (and in comments in the stylesheet) is a way to refer to namespace qualified elements independent of the prefix used for that namespace in any particular XML instance. See the URIQualifiedName production in the XPath 3.0 spec.

Type of change

Please select the relevant options:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Enhancement (change which improves an existing feature. E.g., performance improvement, docs update, etc.)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Steps to test

I've tested this in Chrome, Firefox, Edge and IE (11) on Windows. Would really appreciate folks testing it on Mac & Linux with as many different browsers as you can!!

To test, you can do something like:

add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
	$changefreq         = array( 'daily', 'weekly', 'monthly', 'yearly' );
	
	// this will result in the Change Frequency and Priority columns
	// appearing (or not) randomly each time you load the sitemap
	// to test that only columns for elements that appear in the sitemap
	// appear in the resulting output.
	$include_changefreq = random_int( 0, 1 );
	$include_priority   = random_int( 0, 1 );

	foreach ( $url_list as $idx => &$url ) {
		if ( $include_changefreq ) {
			$url['changefreq'] = $changefreq[ random_int( 0, count( $changefreq ) - 1 ) ];
		}
		if ( $include_priority ) {
			$url['priority'] = random_int( 0, 10 ) / 10;
		}
		// the random_int() calls below are to simulate a plugin
		// that only adds a key when a post has a specific post meta key (or meta value)
		// and will test that appropriate empty td elements are output so that
		// values end up in the correct columns.
		if ( random_int( 0, 1 ) ) {
			$url['foo'] = "foo:{$idx}";
		}
		if ( random_int( 0, 1 ) ) {
			$url['bar'] = "bar:{$idx}";
		}
	}

	return $url_list;
} );

Acceptance criteria

  • My code follows WordPress coding standards.
  • I have performed a self-review of my own code.
  • If the changes are visual, I have cross browser / device tested.
  • I have commented my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • I have added test instructions that prove my fix is effective or that my feature works.

… table without knowing a priori what elements will appear in the sitemap.
@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added the cla: no Has not signed the Google CLA label Apr 10, 2020
…olumn headings output by the sitemap stylesheet.

Also introduces `esc_xml()` and `esc_xml__()` functions (which are intended to be included in the core merge proposal), that are to equivalent of `esc_html()` and `esc_html__()` but do XML-specific escaping.
* arrays whose keys are local names and
* whose values are column headings.
*/
$column_headings = apply_filters( 'core_sitemaps_stylesheet_column_headings', $column_headings );
Copy link
Contributor Author

@pbiron pbiron Apr 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example how to use this new filter with the current state of this plugin (i.e., when extension elements are in the http://www.sitemaps.org/schemas/sitemap/0.9 namespace):

add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
   foreach ( $url_list as &$url_item ) {
      $url_item['my-extension-element' = __( 'some value', 'my-plugin' );
   }

   return $url_list;
} );

add_filter( 'core_sitemaps_stylesheet_column_headings', function( $column_headings ) {
   $column_headings['http://www.sitemaps.org/schemas/sitemap/0.9']['loc']                  = __( 'Permalink', 'my-plugin' );
   $column_headings['http://www.sitemaps.org/schemas/sitemap/0.9']['my-extension-element'] = __( 'Cool custom element', 'my-plugin' );
} );

And here's how it would be used if/when somethink like the proposal in #151 (comment) is incorporated in this plugin:

add_filter( 'core_site_maps_namespace_bindings', function( $namespace_bindings ) {
	$namespace_bindings[ 'my-plugin' ] = 'urn:my-plugin';
} );

add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
   foreach ( $url_list as &$url_item ) {
      $url_item['my-plugin:extension-element'] = __( 'some value', 'my-plugin' );
   }

   return $url_list;
} );

add_filter( 'core_sitemaps_stylesheet_column_headings', function( $column_headings ) {
   $column_headings['http://www.sitemaps.org/schemas/sitemap/0.9']['loc'] = __( 'Permalink', 'my-plugin' );
   $column_headings['urn:my-plugin]['extension-element']                  = __( 'Cool custom element', 'my-plugin' );
} );

function esc_xml( $text ) {
$safe_text = wp_check_invalid_utf8( $text );
$safe_text = _wp_specialchars( $safe_text, ENT_QUOTES );
$safe_text = html_entity_decode( $safe_text, ENT_HTML5 );
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for these new esc_xml() and esc_xml__() functions is that far too many developers think that since they can use &amp;, &apos;, ", etc in XML then they can **also** use all the named character references they are used to using in HTML, e.g.,  , …`, etc...but they cannot...and doing so will result in a non-well-formed XML instance.

The call to html_entity_decode( $safe_text, ENT_HTML5 ) will replace all of the named character references defined in the HTML spec with their equivalent Unicode code points (e.g. &nbsp; will become \xA0, etc).

It would be nice PHP had a native function that would replace them with character references (e.g., '&nbsp;' would become &#A0;`) but unfortunately it doesn't :-(

Note that all uses of esc_attr() in this plugin (e.g., in Core_Sitemaps_Renderer::get_sitemap_xml()) should be replaced with calls to esc_xml(), but I thought it best to keep this PR strictly related to the stylesheet. Will do another PR for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see what happens when content like &nbsp; is included in an XML instance by doing the following (in v0.2.0 of this plugin...i.e., without the changes in this PR) :

add_filter( 'core_sitemaps_posts_url_list', function( $url_list ) {
   foreach ( $url_list as &$url_item ) {
      $url_item['foo'] = 'This&nbsp;will be a non-well-formed sitemap, and it will fail to render in the browser';
   }

   return $url_list;
} );
  • Chrome will just show a blank screen (and no error message in the console)
  • Firefox will show an error screen (without any error message), but will show XML Parsing Error: undefined entity in the console)
  • Edge and IE will show a screen with the text of content of each element in the sitemap (which is not conformant with the XML spec, but it's Microsoft, so what do you expect), and will display Invalid tag start: "<?". Question marks should not start tags. in the console, which is not actually what the error is, but as we all know when parse errors occur it can be hard to output the correct error message)
  • Not sure what Safari, Opera, etc will do, but I expect one of the above

Would anyone explicitly include &nbsp;, &hellip;, etc in element content in a sitemap? Maybe not, but they very well could include content stored in post meta, which easily could contain HTML named character references (since such post meta was likely stored so that it could be displayed in HTML).

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes Signed the Google CLA and removed cla: no Has not signed the Google CLA labels Apr 21, 2020
@pbiron
Copy link
Contributor Author

pbiron commented Apr 23, 2020

I believe the remaining phpcs errors reported in travis are spurious. They are related to the new esc_xml() and esc_xml__() functions, which are modeled directly on core's esc_html() and esc_html__() which apparently don't generate phpcs errors, so I'm not sure what to do about them.

@pbiron
Copy link
Contributor Author

pbiron commented Apr 26, 2020

CLosing in favor of #163.

@pbiron pbiron closed this Apr 26, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cla: yes Signed the Google CLA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XSLT stylesheet does not display custom elements
2 participants