
Toolchain: OAI PMH for repositories that identify resource files in a record element

Mark Jordan edited this page Oct 3, 2017 · 17 revisions

Overview

This toolchain allows the creation of Islandora import packages consisting of metadata and content files (PDFs, JPEGs, etc.) retrieved via the OAI-PMH protocol. The resulting Islandora import packages can then be ingested into Islandora using the standard Islandora Batch module.

The toolchain creates valid Islandora import packages for platforms that include the direct URL to the PDF or other resource file in a specific element within the OAI-PMH record (usually within a specific element in the Dublin Core XML). The toolchain is currently limited to retrieving one PDF or other file for each resource described in an OAI-PMH record, so compound objects cannot be migrated with it.

OAI-PMH provides a standardized mechanism for harvesting metadata about items but not for identifying the files that make up an item. However, some OAI-PMH compliant repositories follow a convention where they put the direct link to the PDF or other resource file in a specific field in the OAI Dublin Core. For example, BePress' Digital Commons platform puts it in the second <dc:identifier> element. The RIOXX Application Profile also specifies that the sole <dc:identifier> element must contain a URL to the resource. This toolchain uses an XPath expression included in the .ini file to locate the resource file, and if one is found, it retrieves the file and adds it to the Islandora import package.
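To make the convention concrete, here is a minimal Python sketch of locating the resource URL in the second <dc:identifier> element. It is an illustration only, not MIK's own code (MIK is written in PHP). The sample record, its example.org URLs, and the find_resource_url helper are all hypothetical. Python's ElementTree needs a relative path, so the sketch uses ".//dc:identifier[2]" where the configuration file uses "//dc:identifier[2]".

```python
import xml.etree.ElementTree as ET

# Hypothetical oai_dc record; the URLs are placeholders, not real resources.
RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample article</dc:title>
  <dc:identifier>https://repository.example.org/articles/12</dc:identifier>
  <dc:identifier>https://repository.example.org/articles/12/download.pdf</dc:identifier>
</oai_dc:dc>"""

# Prefix-to-URI mapping so the expression can use the "dc:" prefix.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}


def find_resource_url(record_xml, xpath=".//dc:identifier[2]"):
    """Return the text of the first element matching xpath, or None."""
    root = ET.fromstring(record_xml)
    node = root.find(xpath, NAMESPACES)
    if node is None or not (node.text or "").strip():
        # No matching element, or the element is empty: nothing to download.
        return None
    return node.text.strip()


print(find_resource_url(RECORD))
```

If the expression matches nothing (for example, a record with only one <dc:identifier>), the helper returns None, which mirrors the case where no resource file is found for a record.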

By default, this toolchain writes the Dublin Core metadata records retrieved from the OAI gateway. These can be loaded into Islandora, but if you want to transform the Dublin Core into MODS for loading into Islandora, you can do so by using a post-write hook script as illustrated below in the [WRITER] section.

This toolchain has not been tested with a wide variety of OAI-PMH repositories, so if you want to use it and it doesn't meet your needs, please open an issue.

Preparing the content files

All content added to Islandora import packages by this toolchain comes from the remote repository, so there is no need to prepare content.

Preparing the configuration file

All MIK configuration files are standard INI files which contain the following sections: [SYSTEM], [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.

Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections and the entries within each section do not matter.

The SYSTEM section

This section of the configuration file sets or overrides configuration settings for PHP and the various third-party PHP components used by MIK. It can contain the following entries:

  • date_default_timezone: Optional. Provide a default timezone if date.timezone is null in the PHP INI. You will know if you need to use this setting because Monolog will throw MIK exceptions and halt MIK. Set to one of the valid PHP timezone values listed at http://php.net/manual/en/timezones.php.
  • verify_ca: Optional. OSX's default PHP configuration uses Apple's Secure Transport rather than OpenSSL, causing issues with Certificate Authority verification in Guzzle requests against websites that use HTTPS. This setting allows Guzzle to skip CA verification. You will know if you need to use this setting because Guzzle will write entries in your mik.log complaining about CA verification. Set to false to ignore CA verification.

Note: if you set verify_ca to false, MIK no longer verifies the remote website's HTTPS certificate, so it cannot confirm that it is talking to the intended server. Use at your own risk.

Example

[SYSTEM]
date_default_timezone = 'America/Vancouver'
verify_ca = false

The CONFIG section

Key-value pairs of configuration entries in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they can't be dynamically derived at runtime). Therefore, all entries in this section are optional.

Example

[CONFIG]
config_id = oai-test
last_updated_on = "2016-07-20"
last_update_by = "Mark Jordan"

The FETCHER section

This section of the configuration file must contain the following entries:

  • class: Must be 'Oaipmh'.
  • oai_endpoint: Full URL to the source repository's OAI-PMH endpoint.
  • set_spec: Optional; the set spec that limits the OAI harvest to a specific set.
  • from: Optional; a date in either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format that defines the start date in a selective harvest. Date-based harvests are described in the OAI-PMH spec.
  • until: Optional; a date in either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format that defines the end date in a selective harvest.
  • metadata_prefix: Optional; the metadata prefix to use. Default is 'oai_dc'. Use 'mods' to harvest MODS records from repositories that support it.
  • temp_directory: Full path to the directory where the fetchers write data for use later in the toolchain.
  • use_cache: Optional; set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).

Example

[FETCHER]
class = Oaipmh
oai_endpoint = "http://kora.kpu.ca/do/oai/"
set_spec = collection:ART
metadata_prefix = oai_dc
from = 1990-01-01
until = 2010-12-31
temp_directory = "/tmp/oaitest_temp"
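The harvest-related entries above correspond to the arguments of an OAI-PMH ListRecords request (metadataPrefix, set, from, until). The Python sketch below shows how the example values could be combined into a request URL. It illustrates the protocol rather than how MIK builds its requests internally, and build_list_records_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_list_records_url(endpoint, metadata_prefix="oai_dc",
                           set_spec=None, from_date=None, until_date=None):
    """Build a ListRecords URL; optional arguments are left out when unset."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    # urlencode percent-encodes reserved characters such as ":" in the set spec.
    return endpoint + "?" + urlencode(params)


url = build_list_records_url(
    "http://kora.kpu.ca/do/oai/",
    set_spec="collection:ART",
    from_date="1990-01-01",
    until_date="2010-12-31",
)
print(url)
```

Pasting a URL like this into a browser is a quick way to check that the endpoint, set spec, and date range return records before running a full harvest.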

The METADATA_PARSER section

This section of the toolchain's configuration file contains the following entries:

  • class: Use 'dc\OaiToDc' to harvest Dublin Core, or 'mods\OaiToMods' to harvest MODS from repositories that support it.

Example

[METADATA_PARSER]
class = dc\OaiToDc

The FILE_GETTER section

This section of the toolchain's configuration file contains the following entries:

  • class: Must be 'OaipmhXpath'.
  • xpath_expression: A valid XPath expression to the XML element in the OAI-PMH record that contains the direct URL to the resource file. For example, a value of "//dc:identifier[2]" indicates that the URL can be found in the second <dc:identifier> element. The XPath expression must return a single DOM node.
  • temp_directory: Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.

Example

[FILE_GETTER]
class = OaipmhXpath
xpath_expression = "//dc:identifier[2]"
temp_directory = "/tmp/oaitest_temp"

The WRITER section

This section of the toolchain's configuration file contains the following entries:

  • class: Must be 'Oaipmh'.
  • output_directory: The full path to the directory where output packages are written.
  • postwritehooks[]: Repeated entries for each post-write hook script used in this toolchain. Currently, MIK ships with only one post-write script (shown in the example below), which applies an XSLT stylesheet to the OAI_DC metadata to transform it into MODS. This script overwrites the source Dublin Core XML file but creates a backup before doing so.

Example

[WRITER]
class = Oaipmh
output_directory = "/tmp/oaitest_output"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/oai_dc_to_mods.php"

The MANIPULATORS section

This toolchain can use the SpecificSet and RandomSet fetcher manipulators. If you have a use case for additional manipulators, please file an issue.

Example

[MANIPULATORS]
fetchermanipulators[] = "RandomSet|10"
; fetchermanipulators[] = "SpecificSet|kora_specific_set.txt"

The input file for the SpecificSet manipulator must contain OAI-PMH identifiers in the same form that the OAI provider supplies them. For example, colons may be URL escaped, as in this example:

oai%3Akora.kpu.ca%3Afacultypub-1089
oai%3Akora.kpu.ca%3Ascusc-1043
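If you only have the unescaped identifiers, a short Python sketch like the following can convert between the two forms. Whether a given provider escapes its identifiers is something to check against that provider's actual responses. The helper name escape_identifier is hypothetical.

```python
from urllib.parse import quote, unquote


def escape_identifier(identifier):
    """Percent-encode every reserved character, including colons."""
    # safe="" stops quote() from leaving "/" (its default safe character) as-is.
    return quote(identifier, safe="")


escaped = escape_identifier("oai:kora.kpu.ca:facultypub-1089")
print(escaped)
# unquote() reverses the escaping.
print(unquote(escaped))
```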

The LOGGING section

This section of the toolchain's configuration file contains the following entries:

  • path_to_log: The full path to the standard log generated by MIK.

Example

[LOGGING]
path_to_log = "/tmp/oaitest_output/mik.log"