
Below is a detailed overview of the various components of MIK. You may also want to visit the MIK Cookbook for brief articles on how to accomplish specific tasks using MIK.

How MIK works

MIK (the Move to Islandora Kit) is a command-line tool for converting source content and metadata into packages suitable for importing into Islandora. These import packages are compatible with existing batch import tools such as Islandora Batch, Islandora Book Batch, and Islandora Newspaper Batch. MIK does not import the content itself - it just prepares the content for importing using these tools. MIK's relationship to your content and Islandora can be visualized like this:

[Figure: MIK overview]

The benefits of decoupling the preparation from the loading of content are that you can run MIK as many times as you want without committing to adding it to Islandora, and that you can perform quality checks on the prepared content before loading it. This approach also provides a lot of flexibility in who prepares the content, when it is prepared, and where it is prepared.

Internally, MIK breaks the task of converting the input data into Islandora import packages down into discrete subtasks as illustrated in the diagram below:

[Figure: MIK details]

  • Fetchers query a data source to determine how many objects are to be imported, and perform some additional setup for the subsequent tasks.
  • Fetcher manipulators filter out items from the entire set of data retrieved by the fetcher. For example, you may only want to fetch book objects from a CONTENTdm collection that also contains images.
  • Metadata parsers get the metadata for an object and convert it into a format that Islandora can use, such as MODS XML.
  • File getters retrieve the content files associated with an object to be imported.
  • File getter manipulators provide a way to configure file getters to look in specific locations for files.
  • Writers save the converted content to disk in a directory structure that can be used by the standard Islandora batch import modules. After a writer has written out its package, it can initiate one or more post-write hooks (described below) that perform actions on the content in the packages.
  • File manipulators perform some processing on the files retrieved by file getters.
  • Metadata manipulators can modify or supplement the metadata XML file generated by metadata parsers.

A unique combination of one fetcher, one metadata parser, one file getter, zero or more manipulators, and one writer is an MIK "toolchain." When you run MIK you assemble a set of tools into a toolchain. A toolchain, and the options for how its component tools work, is defined in a single configuration file. Currently, MIK offers a CSV toolchain, and toolchains for a number of CONTENTdm object types. MIK is designed to be extensible, so it is relatively easy to create new toolchains.

Configuring MIK

To convert from a set of source content to a group of Islandora ingest packages, you need to provide MIK with some information. This involves creating 1) a configuration file and 2) a metadata mappings file.

The configuration (a.k.a. '.ini') file

MIK configuration files are standard PHP .ini files that contain a section for each tool in the toolchain. Here is a simple example for the CSV toolchain:

[CONFIG]
; Configuration settings in the CONFIG section help you track your 
; content conversion jobs and get written to the log file if requested.
; Any key/value pairs you add here will be written to the log.
config_id = csvsample
last_updated_on = "2015-07-09"
last_update_by = "mjordan@sfu.ca"

[FETCHER]
class = Csv
input_file = "csv_test_input.csv"
; To read a tab-delimited file, use an actual tab character, not \t.
field_delimiter = ","
record_key = ID

[METADATA_PARSER]
class = mods\CsvToMods
; Path to the csv file that contains the CSV to MODS mappings.
mapping_csv_path = "example_csv_mapping.csv"

[FILE_GETTER]
class = CsvSingleFile
input_directory = "/tmp/mik_csv_input"
file_name_field = File

[WRITER]
class = CsvSingleFile
output_directory = "/tmp/mik_csv_output"

[MANIPULATORS]
; One or more metadatamanipulators classes
metadatamanipulators[] = FilterModsTopic
; One fetchermanipulator class with params
fetchermanipulator = "CsvSingleFile|jpg"

[LOGGING]
; Full paths to mik log and manipulator log files
path_to_log = "/tmp/mik.csv.log"
path_to_manipulator_log = '/tmp/mik_manipulator.log'

Here is an example for a toolchain for converting some multipage PDF objects in CONTENTdm into PDFs for batch loading into Islandora:

[CONFIG]
; Configuration settings in the CONFIG section help you track your 
; content conversion jobs and get written to the log file if requested.
; Any key/value pairs you add here will be written to the log.
config_id = "ecucals_4"
last_updated_on = "2015-10-21"
last_update_by = "mjordan@sfu.ca"

[FETCHER]
class = Cdm
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
record_key = pointer

[METADATA_PARSER]
class = mods\CdmToMods
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
; Path to the csv file that contains the CONTENTdm to MODS mappings.
mapping_csv_path = 'ecu_calendars_mapping.csv'
; Include the migrated-from URI in your generated metadata (e.g., MODS)
include_migrated_from_uri = TRUE

[FILE_GETTER]
class = CdmPhpDocuments
input_directory = "/tmp/mik_input"
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
utils_url = "http://content.lib.sfu.ca/utils/"
temp_directory = "/tmp/test"

[WRITER]
class = CdmPhpDocuments
alias = ecucals
output_directory = "/tmp/mik_output"
metadata_filename = "MODS.xml"

[MANIPULATORS]
; One or more filemanipulators classes.
filemanipulators[] = ThumbnailFromCdm
; One or more metadatamanipulators classes.
; metadatamanipulators[] = FilterModsTopic
; metadatamanipulators[] = MetadatamanipulatorFoo
; One fetchermanipulator class with params
fetchermanipulator = "CdmCompound|Document-PDF"

[LOGGING]
; Full paths to mik log and manipulator log files
path_to_log = "/tmp/mik.csv.log"
path_to_manipulator_log = '/tmp/mik_manipulator.log'

Full documentation for the .ini files used in the following toolchains is available:

The mappings file

Metadata parsers need to know which elements in the source metadata record for an object correspond to which MODS (or Dublin Core, etc.) elements. MIK defines these mappings in a CSV file that is separate from the MIK .ini file but is referenced from it via the mapping_csv_path setting in the [METADATA_PARSER] section.

The mapping file contains two columns. The one on the left identifies the field names in the "source" metadata record, and the one on the right defines the "target" XML snippet to hold the value of the corresponding source field. Two important things about the snippets:

  • They must be well-formed XML (that is, opening and closing tags must match, and attributes must follow XML syntax rules). You can check the well-formedness of your snippets by running the ./mik --config=foo.ini --checkconfig=snippets command. This command does not validate your snippets against a schema.
  • They must include all XML from the first child of the root element down; that is, they are appended directly to the root element of the MODS, DC, etc. XML document.

You can use a text editor to create your mapping file, or a spreadsheet application like Excel or Google Sheets. Note that some applications (Excel and Sheets included) will "escape" parts of a CSV file by adding an extra set of double quotation marks. That is fine; MIK can handle any valid CSV quoting conventions. These extra quotation marks do have one undesirable side effect: they will cause your snippets to fail well-formedness tests. The best way to work around this problem is to check your snippets for well-formedness before saving your spreadsheet as CSV.
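If you want to spot-check a single snippet outside of MIK, you can do so with a few lines of PHP using the standard DOM extension. The sketch below is not part of MIK and is only a rough illustration; the snippet string in it is just an example:

<?php
// Minimal standalone well-formedness check for a mapping snippet.
// This only confirms the snippet parses as XML; it does not validate
// against the MODS (or any other) schema.
$snippet = '<titleInfo><title>%value%</title></titleInfo>'; // example snippet

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Wrap the snippet in a dummy root element so that snippets containing
// more than one top-level element also parse.
if ($dom->loadXML('<root>' . $snippet . '</root>')) {
    echo "Snippet is well formed.\n";
} else {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message) . "\n";
    }
    libxml_clear_errors();
}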

The first row of your mapping file should not contain any column headings.

Snippets can (and usually do) contain the special %value% placeholder. MIK replaces this string with the value of the source metadata field. For example, if your incoming Title is "Photograph of a dog" and Title is mapped to the MODS snippet

<titleInfo><title>%value%</title></titleInfo>

the resulting MODS markup will look like

<titleInfo><title>Photograph of a dog</title></titleInfo>

Sample mappings files

All three sample mapping files below were created in Google Sheets and then exported using the File/Download as/Comma-separated values menu.

Here is a metadata mapping file describing a collection of audio interviews. This first one is fairly simple, since the MODS snippets don't contain any attributes:

Title,<titleInfo><title>%value%</title></titleInfo>
Creator,<name><namePart>%value%</namePart></name>
Subject,<subject><topic>%value%</topic></subject>
Description,<abstract>%value%</abstract>
Date,<originInfo><dateIssued>%value%</dateIssued></originInfo>
Language,<language><languageTerm>%value%</languageTerm></language>
Duration,<physicalDescription><extent>%value%</extent></physicalDescription>
Rights,<accessCondition>%value%</accessCondition>
Type,<genre>Interviews (Sound recordings)</genre>

This example builds on the first one by adding attributes. Notice that the quotation marks around attribute values are escaped with an additional set of double quotation marks, which Google Sheets added as part of the CSV export. You don't need to add these yourself, but they are allowed:

Title,<titleInfo><title>%value%</title></titleInfo>
Creator,"<name><namePart>%value%</namePart><role><roleTermtype=""text"" authority=""marcrelator"">creator</roleTerm><roleTerm type=""code"" authority=""marcrelator"">cre</roleTerm></role></name>"
Subject,<subject><topic>%value%</topic></subject>
Description,<abstract>%value%</abstract>
Date,"<keyDate=""yes""><originInfo><dateIssued encoding=""w3cdtf"" keyDate=""yes"">%value%</dateIssued></originInfo>"
Language,"<language><languageTerm type=""code"" authority=""iso639-2b"">%value%</languageTerm></language>"
Duration,<physicalDescription><extent>%value%</extent></physicalDescription>
Rights,"<accessCondition type=""use and reproduction"">%value%</accessCondition>"
Type,"<genre authority=""lcsh"">Interviews (Sound recordings)</genre>"
Type of resource,<typeOfResource>sound recording-nonmusical</typeOfResource>

This last example is a bit more complex than the first two. It illustrates how to deal with source metadata that doesn't map to MODS elements. You can wrap your non-MODS snippet elements in MODS' <extension> element, as illustrated here:

Calendar name,<titleInfo><title>%value%</title></titleInfo>
School name,"<name type=""corporate""><namePart>%value%</namePart></name>"
Medium,<physicalDescription><form>%value%</form></physicalDescription>
Work Measurements,<physicalDescription><note>%value%</note></physicalDescription>
Publisher,<originInfo><publisher>%value%</publisher></originInfo>
Year,<originInfo><dateIssued>%value%</dateIssued></originInfo>
Format type,<genre>%value%</genre>
President,"<extension><president type=""ECU custom metadata for the ecucals collection"">%value%</president></extension>"
Board members,"<extension><board_members type=""ECU custom metadata for the ecucals collection"">%value%</board_members></extension>"
Administrators,"<extension><administrators type=""ECU custom metadata for the ecucals collection"">%value%</administrators></extension>"
Instructors,"<extension><instructors type=""ECU custom metadata for the ecucals collection"">%value%</instructors></extension>"
"Staff(technicians,support staff)","<extension type=""ECU custom metadata for the ecucals collection""><staff>%value%</staff></extension>"
Degree/Diplomas/Programs,<subject><topic>%value%</topic></subject>
Majors/Concentration,<subject><topic>%value%</topic></subject>
Honorary Degree Recipients,"<extension><honorary_degree_recipients type=""ECU custom metadata for the ecucals collection"">%value%</honorary_degree_recipients></extension>"
Scholarships/Awards Recipients,"<extension><scholarship_award_recipients type=""ECU custom metadata for the ecucals collection"">%value%</scholarship_award_recipients></extension>"
Notes,<note>%value%</note>

Ignoring source metadata elements that you don't want in your MODS, and adding MODS elements that don't exist in your source metadata

If a source metadata field is not represented in your mappings file, MIK ignores it. For example, if any of the following is true:

  • your mappings file contains no XML snippet that corresponds to a source metadata field
  • your mappings file contains a row that has a source field name but an empty (blank) snippet
  • your mappings file contains a row with a misspelled source field name

then MIK doesn't add that metadata to the MODS documents it creates.

However, you can add snippets that don't correspond to a source field name. If you want to add target elements to your Islandora metadata that don't have a corresponding source element, you can do so by using null plus an integer as the source field name placeholder, as illustrated below:

null0,<accessCondition>This resource is in the Public Domain.</accessCondition>
null1,<note type="additional physical form">Also available on microfiche.</note>
null3,<identifier type="uuid"/>

The first two rows in the mappings file will add the <accessCondition> and <note type="additional physical form"> elements to the MODS file included in each ingest package. Some metadata manipulators use the null mappings to define a template used in adding dynamically generated values (like UUIDs) to MODS documents; the third row in the example above illustrates this.

The markup created by mapping from null source elements cannot contain the special %value% placeholder used in other mappings - the markup is the same for every XML file. If you want to modify the markup based on some property of the object being created, you will need to use a metadata manipulator.

Manipulators

Manipulators are like plugins for MIK that let you perform tasks at specific times in the MIK execution lifecycle, or change how fetchers, file getters, and metadata parsers work. All the code for a manipulator is encapsulated in a single PHP class file. Manipulators are registered in the [MANIPULATORS] section of the MIK configuration file, and may take parameters.

Fetcher manipulators

Fetcher manipulators filter the records that you want in a given conversion job. For example, the CdmCompound fetcher manipulator lets you specify whether the MIK job is applied to Document, Document-PDF, Document-EAD, Postcard, Picture Cube, or Monograph compound objects from CONTENTdm. The RandomSet fetcher manipulator will generate a random set of objects to create ingest packages for.

Fetcher manipulators are cumulative. In other words, if your toolchain uses more than one fetcher manipulator, the output of the first one registered in your .ini file is the input of the next one.

In addition to the two fetcher manipulators mentioned above, MIK provides some others:

The Specific Set fetcher manipulator lets you define a list of objects to process, which is very useful during testing, or when you want to reprocess specific objects for some reason. To assist in the latter case, MIK provides a script that will generate a list of source IDs from the log file generated during a job, so you can reprocess the problematic objects easily.
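Registering a fetcher manipulator follows the same pattern shown in the sample .ini files above. The following is an illustration only; the parameter is assumed here to be the size of the random set, so confirm the exact syntax in the RandomSet manipulator's documentation:

[MANIPULATORS]
; Illustrative only: limit the job to a random set of 20 objects.
fetchermanipulator = "RandomSet|20"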

Metadata manipulators

MIK offers "metadata manipulators", which are components of a toolchain that take a snippet from the mappings file and modify it before adding it to the MODS or DC XML file. You register metadata manipulators in the toolchain's .ini file (as illustrated above). One metadata manipulator you will use often is the FilterModsTopic manipulator, which splits repeated subject terms in your source metadata into separate MODS <topic> elements. Another common one is FilterDate, which lets you modify the format that dates are in.

An important design goal of metadata manipulators is that they let you clean up your source metadata before it gets imported into Islandora. For example, if you had a set of dates that were in US-style mm/dd/yy format and you wanted to convert them into yyyy-mm-dd format, you could do that with a metadata manipulator.
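As a rough illustration of the kind of transformation such a manipulator performs, the PHP sketch below converts a single US-style date value. It is not MIK's FilterDate class and does not use MIK's manipulator interface; it only shows the underlying string conversion:

<?php
// Illustration only: convert a US-style mm/dd/yy date to yyyy-mm-dd.
// A real MIK metadata manipulator would wrap logic like this in a
// manipulator class and apply it to the generated metadata snippet.
function normalizeUsDate($value) {
    $date = DateTime::createFromFormat('m/d/y', $value);
    // Leave values that don't parse alone so bad data is easy to spot later.
    return $date ? $date->format('Y-m-d') : $value;
}

echo normalizeUsDate('07/09/15'); // prints 2015-07-09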

Documentation for the following metadata manipulators is available:

File getter manipulators

File getter manipulators let you modify or refine information about where MIK can find certain types of files. For example, in a CONTENTdm toolchain, you might want to tell MIK to get the master TIFF files for a collection of images in a specific directory on a file server. The file getter manipulators currently available include CONTENTdm Single File and CONTENTdm Filter Newspaper Master Paths.

File manipulators

File manipulators work similarly to metadata manipulators, except that they modify (or create) the files that are then imported into Islandora; for example, MIK provides a file manipulator that retrieves an object's thumbnail from CONTENTdm. They can also be used to validate files.

Please note that file manipulators may be deprecated in the 1.0 release of MIK in favour of post-write hooks. If you need to retrieve or modify a source file, or create an additional datastream file for loading into Islandora, we recommend that you implement a post-write hook instead of a file manipulator. In any event, please open an issue so we can discuss your use case, or leave a comment on the deprecation issue.

Post-write hooks

MIK can run some scripts after it has written an import package to disk. These scripts are called "post-write hooks" and are enabled in the .ini file's [WRITER] section like this:

[WRITER]
...
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/validate_mods.php"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/generate_fits.php"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/object_timer.php"
; postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/sample.php"
; postwritehooks[] = "/usr/bin/python extras/scripts/postwritehooks/sample.py"

The five scripts listed above are included in the MIK GitHub repository as examples. The ones named 'sample' illustrate some basic ways of using post-write hooks. The three complete, functional scripts, extras/scripts/postwritehooks/validate_mods.php, extras/scripts/postwritehooks/generate_fits.php, and extras/scripts/postwritehooks/object_timer.php, do useful things, as their names suggest. They also illustrate how to use some of the components included with MIK, such as Monolog and Guzzle.

A good example of how a post-write hook script can be used is to produce FITS output for newspaper page objects. As soon as MIK finishes creating the package, the generate_fits.php script runs FITS against each of the child OBJ datastream files and writes out its output for each one to TECHMD.xml within the page folder. This file is then loaded by the Islandora Newspaper Batch module, ending up as the TECHMD datastream for each page object. Another very useful example is validate_mods.php, which validates each MODS.xml file produced by MIK and writes out the result of the validation to the MIK log file.

Post-write hook scripts are passed three arguments from within the main mik script:

  1. the value of the current item's record_key,
  2. a comma-separated list of child record keys, and
  3. the absolute path to the current MIK .ini file.

Hook scripts can grab these parameters, split out the list of child record keys and load the configuration file. Post-write hook scripts can be written in any language that can parse a standard .ini file.
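For example, a minimal post-write hook skeleton in PHP might start like this. The variable names are ours, and the [WRITER] output_directory lookup is just one example of why a hook might read the .ini file:

<?php
// Skeleton of a post-write hook script. MIK passes three arguments:
// the current item's record key, a comma-separated list of child record
// keys, and the absolute path to the .ini file used for the job.
$record_key = trim($argv[1]);
$child_record_keys = explode(',', trim($argv[2]));
$config_path = trim($argv[3]);

// Load the toolchain configuration so the script can find, for example,
// the directory the writer saved the package to.
$config = parse_ini_file($config_path, true);
$output_dir = $config['WRITER']['output_directory'];

// ... act on the package MIK wrote for $record_key ...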

Every script in the postwritehooks[] list is executed as a background process at the end of the processing loop within MIK so that the scripts don't slow MIK down. Some important implications of this include:

  • The scripts cannot send any data back to the main MIK script.
  • If the scripts write output files or modify files created by other parts of MIK, the location of those files must be determined from data that is passed into the script as arguments or from within the .ini file.
  • A post-write hook script should not assume that another post-write hook script has completed running. In practice this means that it should only use as input the files produced by other parts of MIK, and not files generated by another post-write hook script. This also means that performing actions like zipping up output packages or creating Bags from them in a post-write hook script risks not including all of the output generated by other post-write hook scripts.