Cookbook: Automating content ingestion workflows

Mark Jordan edited this page Apr 24, 2018 · 11 revisions

It is possible to automate ingestion of content into Islandora. The most common type of automation is to use shell scripts to run MIK and the various Islandora Batch modules. This Cookbook entry documents how you would implement an automated content ingestion workflow.

The components

Overview of automating workflows with MIK

  • Your content
    • Images, PDFs, videos, books, newspaper issues, etc. that you want to load into Islandora. Typically, this content would be the output of manual digitization processes, or it could be the output of a content management system. This content will also typically contain both the content files (images, PDFs, etc.) and the metadata describing them.
  • MIK
    • Validates the structure and arrangement of your raw content.
    • Creates Islandora import packages.
  • Islandora Import Package QA Tool
    • Validates the Islandora import packages created by MIK.
  • Islandora Batch (or the Newspaper Batch, Book Batch, or Compound Batch modules)
    • Ingests the validated packages, typically using its drush (command line) interface.

Your content

The input to an automated workflow is your content, arranged in a way that conforms to MIK's toolchains. For example, if you are automating the ingestion of still images, your content needs to be arranged as documented in the CSV Single File toolchain. In other words, MIK expects a CSV file containing one record per image, and the corresponding images:

Identifier,File,Title,Creator,Date taken,Subjects,Note
"image01","IMG_1410.JPG","Small boats in Havana Harbour","Jordan, Mark","2015-03-08","Boats; water","Taken on vacation in Cuba."
"image02","IMG_2549.JPG","Manhattan Island","Jordan, Mark","2015-09-13","Cityscapes","Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy."
"image03","IMG_2940.JPG","Looking across Burrard Inlet","Jordan, Mark","2011-08-01",,"View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance."
"image04","IMG_2958.JPG","Amsterdam waterfront","Jordan, Mark","2013-01-17",,"Amsterdam waterfront on an overcast day."
"image05","IMG_5083.JPG","Alcatraz Island","Jordan, Mark","2014-01-14","Alcatraz Federal Penitentiary; islands","Taken from Fisherman's Wharf, San Francisco."

drop_folder
├── IMG_1410.JPG
├── IMG_2549.JPG
├── IMG_2940.JPG
├── IMG_2958.JPG
└── IMG_5083.JPG

This input data could be the output of an upstream automated or manual workflow, such as a workflow used in a digitization lab or an automated dump from a content management system.
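Before handing the drop folder to MIK, an automated workflow can run a quick pre-flight check that every file named in the CSV is actually present. The following is a minimal sketch, not part of MIK: `check_input_files` is a hypothetical helper, and it assumes that the File column is the second column and that the first two columns contain no embedded commas (as in the sample above). MIK's own input validation remains the authoritative check.

```shell
# Hypothetical pre-flight check (not part of MIK): report files that are
# named in the CSV's File column but missing from the drop folder.
# Assumes File is the second column and that the first two columns
# contain no embedded commas.
check_input_files() {
    local csv_path="$1"
    local drop_folder="$2"
    local missing

    # Skip the header row, take the second field, strip quotes and any
    # carriage returns, then test each filename against the drop folder.
    missing=$(
        tail -n +2 "$csv_path" | cut -d, -f2 | tr -d '"\r' |
        while IFS= read -r filename; do
            if [ -n "$filename" ] && [ ! -f "$drop_folder/$filename" ]; then
                echo "Missing: $filename"
            fi
        done
    )

    if [ -n "$missing" ]; then
        echo "$missing"
        return 1
    fi
    return 0
}
```

Called as `check_input_files /path/to/metadata.csv /path/to/drop_folder` (both paths are placeholders), it returns a non-zero status when any file is missing, so a script running under `set -e` stops before MIK is invoked.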

MIK

MIK converts your content and metadata into packages that can be ingested using Islandora's batch tools. MIK's configuration file contains all the information it needs to do its job, so it is easy to run MIK from within automated processes. However, it is important to understand how MIK validates its input. There are two options, 'strict' and 'realtime'. The differences between these two options are documented in detail elsewhere, but in short: in 'strict' mode, MIK validates all of its input before it starts generating packages; in 'realtime' mode, MIK simply skips generating a package if the input data for that package fails validation.

Islandora Import Package QA Tool

The Islandora Import Package QA Tool (iipqa for short) validates MIK's output to ensure that it will import into Islandora reliably. In an automated context, it is prudent to run it against MIK's output so that any issues it detects terminate the automated script before ingestion. In the script below, the --strict flag in php iipqa --strict does this: it makes iipqa exit with a non-zero status if any packages have errors, and set -e then stops the script.
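If you would rather not rely on `set -e` alone, the same gating can be made explicit. The sketch below is one possible approach: `run_gate` is a hypothetical wrapper (not part of iipqa or MIK) that runs whatever validation command it is given and passes that command's exit status back to the caller, logging a message on failure.

```shell
# Hypothetical wrapper (not part of iipqa): run a validation command and
# return its exit status, logging a message when validation fails.
run_gate() {
    local status
    if "$@"; then
        echo "Validation passed; continuing to ingest."
    else
        # In the else branch, $? still holds the exit status of "$@".
        status=$?
        echo "Validation failed with exit status $status; skipping ingest." >&2
        return "$status"
    fi
}

# Example use, with the iipqa command from the sample script:
#   run_gate php iipqa --strict -m single -l /tmp/sample_iipaq.log /tmp/sample_packages
```

Because `run_gate` returns the validator's own exit status, a script running under `set -e` still stops at this step when iipqa reports errors, but the reason is written to standard error first.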

Islandora Batch

A sample shell script

#!/bin/bash
#######################################################################
# Sample bash script to automate ingestion of content into Islandora. #
#                                                                     #
# Usage: ./sample_scripted_workflow.sh                                #
#######################################################################

# 'set -e' tells the shell script to stop running if any commands
# within it exit with a non-0 value.
set -e

# Change into the MIK directory and run MIK. The .ini file
# tells MIK to write its output to /tmp/sample_packages. Also,
# we run MIK in 'realtime' input validation mode, so it skips
# packages with malformed input.
cd /path/to/mik
php mik -c sample_config.ini

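# Delete log files, or better yet move them somewhere for analysis
# in case something goes wrong. The -f option keeps rm from failing
# (and 'set -e' from stopping the script) if there are no log files.
rm -f /tmp/sample_packages/*.log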

# Change into the Islandora Import Package QA Tool directory and run it.
# We add the --strict option so it exits with 1 if any packages
# have errors. We tell it this so that the next step, running
# drush to ingest the content, does not happen.
cd /path/to/iipqa
php iipqa --strict -m single -l /tmp/sample_iipaq.log /tmp/sample_packages

# Change to the Islandora directory so we can run a batch ingest
# using Drush.
cd /var/www/html/sites/all
drush -v -u 1 islandora_batch_scan_preprocess --content_models=islandora:sp_basic_image --parent=test:collection --parent_relationship_pred=isMemberOfCollection --type=directory --scan_target=/tmp/sample_packages
drush -v -u 1 islandora_batch_ingest 
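The script's log cleanup step notes that moving the log files somewhere for analysis is better than deleting them. A minimal sketch of that alternative follows; `archive_logs` and the archive location are hypothetical, and the sketch assumes a `find` that supports `-maxdepth` (GNU and BSD versions do).

```shell
# Hypothetical helper: move *.log files out of the package directory into
# a timestamped subdirectory of an archive location, then print that
# subdirectory's path.
archive_logs() {
    local package_dir="$1"
    local archive_root="$2"
    local stamp
    stamp=$(date +%Y%m%d-%H%M%S)
    local dest="$archive_root/$stamp"

    mkdir -p "$dest"

    # Using find instead of a shell glob means that having no log files
    # is not an error, so 'set -e' does not stop the script.
    find "$package_dir" -maxdepth 1 -type f -name '*.log' -exec mv {} "$dest/" \;

    echo "$dest"
}

# Example use in place of the rm step (the archive path is a placeholder):
#   archive_logs /tmp/sample_packages /path/to/log_archive
```

Moving the logs out of the package directory before the drush step keeps them away from the batch scan target while still preserving them for troubleshooting.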

Cookbook table of contents
