Harvest metadata records from web APIs such as Trove, DigitalNZ, RSS, etc.

APIHarvester

An application for harvesting XML metadata records from Trove, DigitalNZ, and similar "Web APIs". APIHarvester can also extract portions of an XML file stored on a local file system, simply by specifying a file: URI in the request. This makes it useful, for example, for checking the quality of bulk metadata records.
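A local path has to be written as an absolute file: URI before it can be passed as the url parameter. The following is a minimal Python sketch of that conversion. The path is hypothetical, and the snippet assumes a POSIX-style file system.

```python
from pathlib import Path

# Hypothetical location of a bulk metadata file; substitute your own path.
local_file = Path("/data/bulk metadata/records.xml")

# as_uri() requires an absolute path and percent-encodes characters
# such as spaces that are not allowed in a URI.
file_uri = local_file.as_uri()

# The resulting value is what would follow "url=" on the command line.
print(f'url="{file_uri}"')
```

The file does not need to exist for the conversion itself; only the harvest would fail on a missing file.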

APIHarvester is available as an executable Java archive (jar) file.

Running it without any parameters produces the following explanatory output:

APIHarvester is a tool to harvest XML records from a web API. APIHarvester will:

 • download a response from a given URL, if necessary retrying in the event of failure
 • split the response into multiple records which match an XPath expression
 • save each record under a filename specified using another XPath expression
 • continue harvesting from additional URLs, extracted from the response using another XPath expression

APIHarvester is controlled using XPath 1 expressions; see https://www.w3.org/TR/xpath/ for details.
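The four steps listed above can be sketched in Python as a simple loop. This is an illustration of the general technique, not APIHarvester's own (Java) implementation. Python's ElementTree supports only a subset of XPath 1, so the sketch assumes the record id and the next page URL are plain attributes; the names `harvest`, `fetch`, `id_attr` and `next_attr` are hypothetical, and the "pages" are in-memory strings rather than real API responses.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Optional


def harvest(url: str, fetch: Callable[[str], str], records_path: str,
            id_attr: str, next_attr: str, directory: Path) -> int:
    """Harvest every page reachable from url; return the number of records saved."""
    saved = 0
    next_url: Optional[str] = url
    while next_url:
        root = ET.fromstring(fetch(next_url))          # 1. download a response
        for record in root.findall(records_path):      # 2. split it into records
            name = f"{record.get(id_attr)}.xml"        # 3. name each record by its id
            (directory / name).write_bytes(ET.tostring(record))
            saved += 1
        next_url = root.get(next_attr)                 # 4. follow the next page, if any
    return saved


# Two hypothetical pages standing in for API responses.
PAGES = {
    "page1": '<response next="page2"><result id="a"/><result id="b"/></response>',
    "page2": '<response><result id="c"/></response>',
}

with tempfile.TemporaryDirectory() as tmp:
    total = harvest("page1", PAGES.__getitem__, "result", "id", "next", Path(tmp))
    saved_names = sorted(p.name for p in Path(tmp).iterdir())

print(total, saved_names)
```

Passing `fetch` in as a function keeps the loop independent of how a response is obtained, so the same sketch works for HTTP requests, local files, or test data.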

Usage:

java -jar apiharvester.jar [parameter list]

Parameters are specified as [key=value]. Values containing spaces, ampersands, etc. should be enclosed in quotes.
XML namespace prefixes can be bound to namespace URIs using 'xmlns:' parameters.
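As an illustration of the key=value convention, the Python sketch below separates ordinary parameters from 'xmlns:' namespace bindings. It is an assumption about how such a parameter list could be interpreted, not a copy of APIHarvester's parsing code; `parse_parameters` is a hypothetical helper, and `shlex.split` stands in for the quote handling a POSIX shell performs before the program sees its arguments.

```python
import shlex
from typing import Dict, List, Tuple


def parse_parameters(args: List[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split key=value arguments into (parameters, namespace bindings)."""
    params: Dict[str, str] = {}
    namespaces: Dict[str, str] = {}
    for arg in args:
        # Split on the first "=" only, so values such as URLs keep their own "=".
        key, separator, value = arg.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {arg!r}")
        if key.startswith("xmlns:"):
            namespaces[key[len("xmlns:"):]] = value
        else:
            params[key] = value
    return params, namespaces


# The quotes protect "&" from the shell; shlex.split removes them as a shell would.
command_tail = ('retries=4 xmlns:foo="http://example.com/ns/foo" '
                'url="http://example.com/api?foo=bar&page=1"')
params, namespaces = parse_parameters(shlex.split(command_tail))

print(params)
print(namespaces)
```

Without the quotes, a typical shell would treat the "&" in the URL as a control operator and the parameter value would be cut short.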

Parameters:

 • xmlns:foo
      Binds the 'foo' namespace prefix to a namespace URI, for use in the XPath expressions.
 • directory
      Location of output files. If not specified, the current directory is used.
 • url
      Initial URL to harvest from - required.
 • records-xpath
      XPath identifying the individual records within a response. If not specified, the entire response is saved as a single record.
 • id-xpath
      XPath of unique id for each record, evaluated within the context of each record - required.
 • discard-xpath
      XPath of elements or text which should be discarded, evaluated within the context of each record.
 • resume-when-xpath
      XPath determining whether to resume from a harvest page or not - default = "true()"
 • resumption-xpath
      XPath of URL or URLs for subsequent pages of data. If not specified, only the initial URL will be harvested.
 • url-suffix
      Specifies a common suffix for URLs; useful for specifying an 'API key' for some APIs.
 • retries
      Specifies a number of times to retry in the event of any error; default is 3.
 • delay
      Specifies a number of seconds to wait between requests; default is 0.
 • indent
      Specifies whether to indent the XML or not. Valid values are "yes" or "no". If unspecified, the value is "no".
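The retries and delay parameters describe a common retry pattern, sketched below in Python. The sketch makes two assumptions that the parameter descriptions do not confirm: that "retries" counts attempts made after the first failure, and that the delay is also applied before each retry. `fetch_with_retries` and `flaky_fetch` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, List


def fetch_with_retries(fetch: Callable[[str], str], url: str,
                       retries: int = 3, delay: float = 0.0) -> str:
    """Call fetch(url), retrying up to `retries` times after a failure."""
    if retries < 0:
        raise ValueError("retries must not be negative")
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise              # out of retries: report the last error
            time.sleep(delay)      # assumed: wait before trying again


# A stand-in for an unreliable API that fails twice and then succeeds.
calls: List[str] = []


def flaky_fetch(url: str) -> str:
    calls.append(url)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<response/>"


body = fetch_with_retries(flaky_fetch, "http://example.com/api?foo=bar", retries=4)
print(body, len(calls))
```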

Example:

java -jar apiharvester.jar retries=4 xmlns:foo="http://example.com/ns/foo" url="http://example.com/api?foo=bar" records-xpath="/foo:response/foo:result" id-xpath="concat('record-', @id)" discard-xpath="*[not(normalize-space())]" resumption-xpath="concat('/api?foo=bar&page=', /foo:response/@page-number + 1)" url-suffix="&api_key=asdkfjasd" indent=yes delay=10
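In this example the resumption-xpath produces a relative reference ('/api?foo=bar&page=...') rather than a full URL. The Python sketch below shows one plausible way such a value could be turned into the next request URL: resolve it against the current URL, then append the url-suffix. Both steps are assumptions made for illustration, as is the helper name `next_request_url`; the page number 2 assumes the previous response carried page-number="1".

```python
from urllib.parse import urljoin


def next_request_url(current_url: str, resumption: str, url_suffix: str = "") -> str:
    """Resolve a resumption value against the current URL and add the suffix."""
    # urljoin applies the standard rules for resolving relative references,
    # so an absolute resumption URL would be used unchanged.
    return urljoin(current_url, resumption) + url_suffix


next_url = next_request_url(
    "http://example.com/api?foo=bar",
    "/api?foo=bar&page=2",          # assumed result of the resumption-xpath
    "&api_key=asdkfjasd",
)
print(next_url)
```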

See the Wiki for real examples.