Creating metadata parsers
Metadata parsers extract metadata (title, author, publication date, etc.) from web pages. They extend the abstract class Reflinks\MetadataParser, take \DOMDocument objects as input, and return Reflinks\Metadata objects. This tutorial describes how to create a new metadata parser.
There are two types of metadata parsers:
An independent parser implements the parse() method, which takes only a DOMDocument and returns a Metadata object. When the parser is used in a MetadataParserChain, the default chain() method is called, which merges the generated metadata into a central Metadata object. An example of an independent parser is Reflinks\MetadataParsers\TitleMetadataParser.
A "fixer" parser is designed to modify the metadata already produced by earlier parsers in a MetadataParserChain. It does this by overriding the default chain() method of the abstract class. An example is Reflinks\MetadataParsers\FixerMetadataParser, which corrects some common metadata errors.
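The difference between the two types can be sketched with simplified stand-ins. The classes below are hypothetical and are not the real Reflinks classes. In particular, the merge rule used here (a non-null field from the parser overwrites the central value) and the loop that runs the chain are assumptions made for illustration only.

```php
<?php
// Hypothetical stand-ins for illustration; not the real Reflinks classes.
class SketchMetadata {
    public $title = null;
    public $author = null;
}

abstract class SketchParser {
    abstract public function parse( \DOMDocument $dom );

    // Default behaviour: run parse() and merge its result into the central
    // object. The "non-null overwrites" rule is an assumption of this sketch.
    public function chain( \DOMDocument $dom, SketchMetadata $metadata ) {
        $parsed = $this->parse( $dom );
        foreach ( get_object_vars( $parsed ) as $field => $value ) {
            if ( $value !== null ) {
                $metadata->$field = $value;
            }
        }
    }
}

// Independent parser: only implements parse().
class SketchTitleParser extends SketchParser {
    public function parse( \DOMDocument $dom ) {
        $result = new SketchMetadata();
        $titles = $dom->getElementsByTagName( 'title' );
        if ( $titles->length ) {
            $result->title = $titles->item( 0 )->textContent;
        }
        return $result;
    }
}

// "Fixer" parser: overrides chain() and edits the central object directly.
class SketchWhitespaceFixer extends SketchParser {
    public function parse( \DOMDocument $dom ) {
        return new SketchMetadata();
    }

    public function chain( \DOMDocument $dom, SketchMetadata $metadata ) {
        if ( $metadata->title !== null ) {
            // Collapse runs of whitespace left over from the page markup.
            $metadata->title = trim( preg_replace( '/\s+/', ' ', $metadata->title ) );
        }
    }
}

// Run the chain in order: the fixer sees what the title parser produced.
$dom = new \DOMDocument();
$dom->loadHTML( '<html><head><title>Hello   World</title></head><body></body></html>' );
$metadata = new SketchMetadata();
foreach ( array( new SketchTitleParser(), new SketchWhitespaceFixer() ) as $parser ) {
    $parser->chain( $dom, $metadata );
}
echo $metadata->title, "\n";
```

Because a fixer works on whatever is already in the central object, it only makes sense after the parsers whose output it corrects.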
A metadata parser should be named after the type of metadata it is designed to parse, suffixed by MetadataParser, as in SchemaOrgMetadataParser. It should reside in the Reflinks\MetadataParsers namespace.
Let's begin by creating a test parser that extracts author information from elements with the class author, according to a metadata scheme called "Foo" (this is, of course, not a real metadata specification). Create src/Reflinks/MetadataParsers/FooMetadataParser.php and insert:
<?php
/*
    If you want to contribute the code to upstream, you need to license it under the BSD 2-Clause
    license. See LICENSE for a template (Remember to change "Zhaofeng Li" to your name)
*/

/*
    Foo metadata parser
*/

namespace Reflinks\MetadataParsers;

use Reflinks\MetadataParser;
use Reflinks\Metadata;
use Reflinks\Utils;

class FooMetadataParser extends MetadataParser {
    public function parse( \DOMDocument $dom ) {
        $xpath = Utils::getXpath( $dom );
        $result = new Metadata();
        $authornodes = $xpath->query( "//x:*[@class='author']" );
        if ( $authornodes->length ) { // author found
            $result->author = Utils::getFirstNodeValue( $authornodes );
        }
        return $result;
    }
}
In this example, the parser gets a DOMXPath object using the utility method Utils::getXpath(), queries the DOM for the wanted elements with it, and gets the value of the first node in the returned DOMNodeList. To query the DOM with the DOMXPath object, you need to qualify element names with the XML namespace prefix x (which stands for http://www.w3.org/1999/xhtml).
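The namespace requirement can be seen with the plain PHP DOM classes alone. This is a minimal sketch that does not use Reflinks: it assumes the document's elements are in the XHTML namespace and registers the prefix x by hand, which is presumably what Utils::getXpath() arranges for you. The sample markup and variable names are made up.

```php
<?php
// Sample XHTML; every element is in the http://www.w3.org/1999/xhtml namespace.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    . '<span class="author">Jane Doe</span>'
    . '<span class="author">John Roe</span>'
    . '</body></html>';

$dom = new \DOMDocument();
$dom->loadXML( $xhtml );

$xpath = new \DOMXPath( $dom );
// Bind the prefix "x" to the XHTML namespace (done by hand in this sketch).
$xpath->registerNamespace( 'x', 'http://www.w3.org/1999/xhtml' );

// Without a prefix the query matches nothing: XPath 1.0 treats
// unprefixed element names as belonging to no namespace.
$unprefixed = $xpath->query( "//span[@class='author']" );

// With the prefix, the namespaced elements are found.
$prefixed = $xpath->query( "//x:*[@class='author']" );

// Take the value of the first match, if any.
$firstAuthor = $prefixed->length ? $prefixed->item( 0 )->nodeValue : null;
```

Note that the class attribute itself needs no prefix, since unprefixed attributes are not in any namespace.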
A "fixer" parser is similar to an independent one, except that you override the default chain() method provided by the parent class. In this example, we create a parser that changes the author field to "Foogle" if it is "Google". Create src/Reflinks/MetadataParsers/FoogleFixerMetadataParser.php and insert:
<?php
/*
    If you want to contribute the code to upstream, you need to license it under the BSD 2-Clause
    license. See LICENSE for a template (Remember to change "Zhaofeng Li" to your name)
*/

namespace Reflinks\MetadataParsers;

use Reflinks\MetadataParser;
use Reflinks\Metadata;

class FoogleFixerMetadataParser extends MetadataParser {
    // Intentionally empty: a fixer does not extract anything from the DOM itself.
    public function parse( \DOMDocument $dom ) {}

    // Overrides the default chain() to edit the already-parsed metadata in place.
    public function chain( \DOMDocument $dom, Metadata &$metadata ) {
        if ( $metadata->author == "Google" ) {
            $metadata->author = "Foogle";
        }
    }
}
As you can see, you can modify any part of the metadata as you wish (including the url). In this way, you can replace URLs of archive sites such as archive.org and WebCite with their source links.
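As an illustration of rewriting the URL, the helper below recovers the original address from a Wayback Machine snapshot URL. It is a sketch under stated assumptions: the function name is hypothetical, only the common web.archive.org/web/<timestamp>/<original URL> form is handled, and the code is not taken from Reflinks. A fixer's chain() method could assign the return value to $metadata->url.

```php
<?php
// Hypothetical helper, not part of Reflinks.
// Returns the original URL embedded in a Wayback Machine snapshot URL,
// or the input unchanged if it does not look like one.
function sketchUnarchiveUrl( $url ) {
    // The timestamp is a run of digits, optionally followed by a
    // modifier such as "id_".
    $pattern = '#^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$#i';
    if ( preg_match( $pattern, $url, $matches ) ) {
        return $matches[1];
    }
    return $url;
}

$source = sketchUnarchiveUrl(
    'https://web.archive.org/web/20150101000000/http://example.com/page'
);
$untouched = sketchUnarchiveUrl( 'http://example.com/' );
```

Returning the input unchanged for anything that does not match keeps the fixer safe to run on every page in the chain.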
A quick-and-dirty test page for parsers is provided at misc/parsertest.php. Enter the HTML to test (and optionally give it a fake URL), set up your parser chain, and view the result.
To enable your parser, add it to $config['parserchain']. Create config/config.php if you haven't already and insert:
<?php
$config['parserchain'][] = "MyMetadataParser";
Of course, you can also override the whole chain.
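Overriding might look like the fragment below. It is a sketch that assumes the chain is an ordered array of bare class names, as the append example suggests. The selection and order shown are illustrative only and are not the project's defaults; FooMetadataParser is the tutorial parser created earlier.

```php
<?php
// Replace the chain entirely instead of appending to it.
$config['parserchain'] = array(
    "TitleMetadataParser",
    "FooMetadataParser",
    "FixerMetadataParser", // fixers go after the parsers whose output they correct
);
```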