Creating metadata parsers

zhaofengli edited this page Nov 15, 2014 · 3 revisions

Metadata parsers extract metadata (title, author, publication date, etc.) from web pages. They extend the abstract class Reflinks\MetadataParser, take \DOMDocument objects as input and return Reflinks\Metadata objects. This tutorial describes how to create a new metadata parser.

Types of metadata parsers

There are two types of metadata parsers:

Independent parser

An independent parser implements the parse() method, which takes only a DOMDocument and returns a Metadata object. When the parser is used in a MetadataParserChain, the default chain() method is called, which merges the generated metadata into a central Metadata object. An example of an independent parser is Reflinks\MetadataParsers\TitleMetadataParser.

"Fixer" parser

A "fixer" parser is designed to modify the already-parsed metadata from the previous parsers in a MetadataParserChain. It achieves this by overriding the default chain() method in the abstract class. An example is Reflinks\MetadataParsers\FixerMetadataParser which corrects some common metadata errors.
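The difference between the two types can be sketched with simplified stand-in classes. Everything prefixed with Sketch below is hypothetical and written only for illustration; in particular, the merge logic in chain() is an assumption about what the default implementation roughly does, not the actual Reflinks code.

```php
<?php
// Hypothetical stand-in for Reflinks\Metadata (illustration only).
class SketchMetadata {
	public $title = null;
	public $author = null;
}

// Hypothetical stand-in for Reflinks\MetadataParser (illustration only).
abstract class SketchParser {
	abstract public function parse( \DOMDocument $dom );

	// Assumed default behaviour: parse the page, then copy every field
	// that was actually found into the central metadata object.
	public function chain( \DOMDocument $dom, SketchMetadata $metadata ) {
		$parsed = $this->parse( $dom );
		foreach ( get_object_vars( $parsed ) as $field => $value ) {
			if ( $value !== null ) {
				$metadata->$field = $value;
			}
		}
	}
}

// Independent parser: only implements parse() and inherits chain().
class SketchTitleParser extends SketchParser {
	public function parse( \DOMDocument $dom ) {
		$result = new SketchMetadata();
		$titles = $dom->getElementsByTagName( "title" );
		if ( $titles->length ) {
			$result->title = trim( $titles->item( 0 )->nodeValue );
		}
		return $result;
	}
}

// "Fixer" parser: overrides chain() to edit what earlier parsers produced.
class SketchSuffixFixer extends SketchParser {
	public function parse( \DOMDocument $dom ) {
		return new SketchMetadata(); // nothing to extract on its own
	}

	public function chain( \DOMDocument $dom, SketchMetadata $metadata ) {
		if ( $metadata->title !== null ) {
			// Strip a made-up site suffix from the already-parsed title
			$metadata->title = preg_replace( '/ - Example Site$/', '', $metadata->title );
		}
	}
}

$dom = new \DOMDocument();
$dom->loadXML( '<html><head><title>Hello - Example Site</title></head><body/></html>' );

// Run the parsers in order against one central metadata object
$metadata = new SketchMetadata();
foreach ( array( new SketchTitleParser(), new SketchSuffixFixer() ) as $parser ) {
	$parser->chain( $dom, $metadata );
}
```

The order matters in this sketch: the fixer only has something to correct because the independent parser ran before it.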

Naming

A metadata parser should be named after the type of metadata it is designed to parse, suffixed with MetadataParser (for example, SchemaOrgMetadataParser). It should reside in the Reflinks\MetadataParsers namespace.

Creating an independent parser

Let's begin by creating a test parser which extracts author information from elements with the class author, according to a metadata scheme called "Foo" (this is, of course, not a real metadata specification). Create src/Reflinks/MetadataParsers/FooMetadataParser.php and insert:

<?php
/*
	If you want to contribute the code to upstream, you need to license it under the BSD 2-Clause
	license. See LICENSE for a template (Remember to change "Zhaofeng Li" to your name)
*/

/*
	Foo metadata parser
*/

namespace Reflinks\MetadataParsers;

use Reflinks\MetadataParser;
use Reflinks\Metadata;
use Reflinks\Utils;

class FooMetadataParser extends MetadataParser {
	public function parse( \DOMDocument $dom ) {
		$xpath = Utils::getXpath( $dom );
		$result = new Metadata();
		
		$authornodes = $xpath->query( "//x:*[@class='author']" );
		if ( $authornodes->length ) { // author found
			$result->author = Utils::getFirstNodeValue( $authornodes );
		}
		
		return $result;
	}
}

In this example, the parser gets a DOMXPath object using the utility method Utils::getXpath(), queries the DOM for the wanted elements with it, and gets the value of the first node in the returned DOMNodeList. To query the DOM with the DOMXPath object, you need to prefix element names with the XML namespace prefix x (which is bound to http://www.w3.org/1999/xhtml).
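To see why the x prefix matters, here is a minimal standalone sketch using only PHP's DOM extension. The helper sketchGetXpath() is hypothetical; it shows what Utils::getXpath() presumably does (register the prefix x for the XHTML namespace), which is an assumption rather than the actual Reflinks code. The sample document is loaded as XML with an explicit xmlns so that its elements live in the XHTML namespace.

```php
<?php
// Hypothetical stand-in for Utils::getXpath() (assumed behaviour).
function sketchGetXpath( \DOMDocument $dom ) {
	$xpath = new \DOMXPath( $dom );
	$xpath->registerNamespace( "x", "http://www.w3.org/1999/xhtml" );
	return $xpath;
}

$dom = new \DOMDocument();
$dom->loadXML(
	'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
	. '<span class="author">Jane Doe</span>'
	. '</body></html>'
);

$xpath = sketchGetXpath( $dom );

// With the prefix, the element name is looked up in the XHTML namespace
$prefixed = $xpath->query( "//x:span[@class='author']" );

// Without the prefix, XPath looks for a span in no namespace at all
$unprefixed = $xpath->query( "//span[@class='author']" );

echo "with prefix: " . $prefixed->length . "\n";
echo "without prefix: " . $unprefixed->length . "\n";
```

In this sketch the prefixed query finds the span while the unprefixed one finds nothing, because an unprefixed element name in XPath 1.0 never matches a namespaced element.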

Creating a "fixer" parser

A "fixer" parser is similar to an independent one, except that you override the default chain() method provided by the parent class. In this example, we create a parser that changes the author field to "Foogle" if it's "Google". Create src/Reflinks/MetadataParsers/FoogleFixerMetadataParser.php:

<?php
/*
	If you want to contribute the code to upstream, you need to license it under the BSD 2-Clause
	license. See LICENSE for a template (Remember to change "Zhaofeng Li" to your name)
*/

namespace Reflinks\MetadataParsers;

use Reflinks\MetadataParser;
use Reflinks\Metadata;

class FoogleFixerMetadataParser extends MetadataParser {
	// parse() has to be implemented, but a fixer has nothing to extract on its own
	public function parse( \DOMDocument $dom ) {}
	public function chain( \DOMDocument $dom, Metadata &$metadata ) {
		if ( $metadata->author == "Google" ) {
			$metadata->author = "Foogle";
		}
	}
}

As you can see, you can modify any part of the metadata as you wish (including the url). In this way, you can replace the URLs of archive sites such as archive.org and WebCite with the links to their original sources.
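As a rough illustration of the archive case, the helper below recovers the original address from a Wayback Machine snapshot URL. The function name sketchUnwrapArchiveUrl() and the regular expression are assumptions made for this sketch; it only handles the common web.archive.org/web/<timestamp>/<original URL> form and leaves everything else untouched.

```php
<?php
// Hypothetical helper: returns the archived page's original URL, or the
// input unchanged if it does not look like a Wayback Machine snapshot.
function sketchUnwrapArchiveUrl( $url ) {
	$pattern = '~^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$~i';
	if ( preg_match( $pattern, $url, $matches ) ) {
		return $matches[1];
	}
	return $url;
}

$archived = "https://web.archive.org/web/20141115000000/http://example.com/page";
$plain = "http://example.com/other";

$unwrapped = sketchUnwrapArchiveUrl( $archived );
$untouched = sketchUnwrapArchiveUrl( $plain );
```

Inside a fixer's chain() method, such a helper could be applied to the url field, for example $metadata->url = sketchUnwrapArchiveUrl( $metadata->url ), assuming the field holds a plain string.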

Testing your parsers

A quick-and-dirty test page is provided at misc/parsertest.php for parser testing. Simply enter your HTML for testing (and optionally give it a fake URL), set up your parser chain and get the result.

To enable your parser, add it to $config['parserchain']. Create config/config.php if you haven't already and insert:

<?php
$config['parserchain'][] = "MyMetadataParser";

Of course, you can also override the whole chain.
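Overriding the whole chain might look like the sketch below. It assumes that chain entries are short class names, as in the snippet above, and that parsers run in the listed order; the selection and order of parsers here are purely illustrative.

```php
<?php
// Replace the default chain instead of appending to it (illustrative order)
$config['parserchain'] = array(
	"TitleMetadataParser", // independent parser mentioned earlier
	"FooMetadataParser",   // the example parser from this tutorial
	"FixerMetadataParser", // fixers go last so they can see earlier results
);
```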