Adam edited this page Aug 26, 2014 · 5 revisions

This module was designed with extensibility in mind and relies on object-oriented programming.

Background

To explain how to extend the module, some background on how processing works is useful.

We break the ingest process into two steps:

  • Preprocessing
  • Ingest with derivative generation

Preprocessing

Preprocessing consists of scanning the input to be ingested, and building up a queue in the database. The queue structure used consists of three tables:

  • islandora_batch_queue: Describes the objects themselves and indicates a parent object which is also set to be ingested via batch (NULL if none). Currently, we serialize instances of classes subclassing IslandoraBatchObject, which in turn subclasses Tuque's NewFedoraObject (though this will likely change in the near future), and keep track of their associated IDs.
  • islandora_batch_state: Tracks the state of each object, based on its ID. The "state" is just a numeric value.
  • islandora_batch_resources: Tracks resources with associated types for each ID. There is no strict definition of what a "resource" is, though it will probably be something like a filename or a URL associated with the given object. This table is not used inside the core batch code, but it can be very useful for diagnosis and recovery if or when things go wrong, and could be leveraged to avoid ingesting the same file multiple times across different preprocessing runs.
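
To make the three-table layout concrete, here is a minimal sketch using Python's sqlite3 module in place of the Drupal database. The table names come from the list above; the column names, column types, the numeric state values, and the enqueue() helper are assumptions made for illustration, not the module's actual schema or API.

```python
import sqlite3

# In-memory database standing in for the Drupal database.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema: column names and types are assumptions.
conn.executescript("""
CREATE TABLE islandora_batch_queue (
    id     TEXT PRIMARY KEY,  -- object identifier
    parent TEXT,              -- ID of a parent also in the queue, NULL if none
    data   BLOB               -- serialized batch object
);
CREATE TABLE islandora_batch_state (
    id    TEXT PRIMARY KEY,
    state INTEGER NOT NULL    -- numeric state value
);
CREATE TABLE islandora_batch_resources (
    id       TEXT,            -- object the resource belongs to
    type     TEXT,            -- free-form resource type
    resource TEXT             -- e.g. a filename or URL
);
""")


def enqueue(conn, object_id, data, state, parent=None, resources=()):
    """Add one preprocessed object to all three tables in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO islandora_batch_queue (id, parent, data) VALUES (?, ?, ?)",
            (object_id, parent, data),
        )
        conn.execute(
            "INSERT INTO islandora_batch_state (id, state) VALUES (?, ?)",
            (object_id, state),
        )
        conn.executemany(
            "INSERT INTO islandora_batch_resources (id, type, resource) VALUES (?, ?, ?)",
            [(object_id, rtype, res) for rtype, res in resources],
        )


# A parent with no resources of its own, and one child page backed by a file.
# The state numbers 2 and 1 are placeholders, not the module's constants.
enqueue(conn, "book:1", b"serialized book", state=2)
enqueue(conn, "page:1", b"serialized page", state=1, parent="book:1",
        resources=[("file", "/scans/book1/0001.tif")])

children = conn.execute(
    "SELECT id FROM islandora_batch_queue WHERE parent = ?", ("book:1",)
).fetchall()
```

Keeping state and resources in separate tables keyed by ID means the state can be updated, or the resources inspected, without touching the serialized object.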

In the preprocessing phase, one is expected to add all datastreams which contain files to the instance of one's IslandoraBatchObject subclass. This is not CPU intensive, as no derivatives should be created and there should be no communication with Fedora at this point. It is suggested that one avoid adding datastream content as strings (either via a datastream's content property or the setContentFromString method) if the content can be computed later, as doing so makes the serialized value in the database larger.
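
The size concern can be illustrated with a rough analogy. The sketch below uses Python's pickle rather than PHP serialization, and the dictionary layout and file path are hypothetical. It only shows that a serialized value holding a file reference stays small, while one holding inline content grows with that content.

```python
import pickle

# Hypothetical stand-ins for a queued object's datastreams.
# One records where the file lives; the other embeds the bytes themselves.
by_reference = {"id": "demo:1", "OBJ": {"file": "/tmp/scan-0001.tif"}}
inline = {"id": "demo:1", "OBJ": {"content": b"\x00" * 1_000_000}}

# Serialized sizes in bytes, as they would be stored in a queue row.
size_by_reference = len(pickle.dumps(by_reference))
size_inline = len(pickle.dumps(inline))

print(size_by_reference, size_inline)
```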

There is an abstract base class for preprocessors which can help, but if preprocessing for a given source is going to be particularly CPU intensive or otherwise take a long time, it may be better to skip the base class and populate the queue directly. Discussion around this topic likely merits its own page.

Ingest

The ingest process consists of grabbing objects from the queue which are ready to be ingested and calling the abstract batchProcess method on each, which allows whatever computation is necessary to produce a correct object, less anything implemented via derivatives. Current implementations use this method to produce MODS and crosswalk it to DC, as we are outside of the usual ingest context where we would have an ingest form with a DC transform.

After the basics are completed in batchProcess, we then just pass the object to islandora_add_object() to be added to Fedora. This works due to subclassing the IslandoraNewFedoraObject class. Derivatives are then generated normally through implementations of hook_islandora_derivatives().
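
The flow described above can be modelled roughly as follows. This is a conceptual Python sketch, not the module's PHP code. The class names, the numeric state values, the idea that batch_process() returns the new state, and the add_object callable (standing in for islandora_add_object()) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod

# Hypothetical numeric states; the real values are defined by the module.
STATE_NOT_READY = 0
STATE_READY = 1
STATE_DONE = 3


class BatchObject(ABC):
    """Stand-in for an IslandoraBatchObject subclass instance."""

    def __init__(self, object_id, state=STATE_READY):
        self.object_id = object_id
        self.state = state
        self.datastreams = {}

    @abstractmethod
    def batch_process(self):
        """Finish building the object and return its resulting state."""


class ScanObject(BatchObject):
    def batch_process(self):
        # Stand-in for producing MODS and crosswalking it to DC.
        self.datastreams["MODS"] = f"<mods><identifier>{self.object_id}</identifier></mods>"
        self.datastreams["DC"] = f"<dc><identifier>{self.object_id}</identifier></dc>"
        return STATE_DONE


def ingest_ready(queue, add_object):
    """Process every ready object and hand finished ones to add_object."""
    ingested = []
    for obj in queue:
        if obj.state != STATE_READY:
            continue  # not ready yet; leave it in the queue
        obj.state = obj.batch_process()
        if obj.state == STATE_DONE:
            add_object(obj)
            ingested.append(obj.object_id)
    return ingested


# A dictionary plays the role of the repository.
repository = {}
queue = [ScanObject("demo:1"), ScanObject("demo:2", state=STATE_NOT_READY)]
done = ingest_ready(queue, lambda obj: repository.__setitem__(obj.object_id, obj))
```

Only the ready object is processed and added; the other stays queued for a later run.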

Hierarchical ingests

Hierarchical ingests work by indicating the parent of a given object while preprocessing. If desired, the initial state of parent objects can be set such that all children will be ingested first. This feature helps to avoid generating and regenerating derivatives on the parent which aggregate children (for example, a PDF on a newspaper issue or a book might aggregate all of the pages "contained" therein). The core of how hierarchies work is based on implementations of IslandoraBatchObject::getChildren() and on preprocessors; for instance, the parent can be modified by changing the preprocessorParameters['parent_relationship_uri'] member on child objects.
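
The children-first ordering can be sketched as a small state machine. The Python below is a conceptual model only: the state names and values, the dictionary layout, and the ingest_order() helper are hypothetical and do not mirror the module's implementation.

```python
# Hypothetical numeric states; the real values are defined by the module.
STATE_READY = 1
STATE_PENDING_CHILDREN = 2
STATE_DONE = 3


def ingest_order(objects):
    """Return the order in which objects would be ingested.

    objects maps an ID to {"parent": ID or None, "state": int}.
    A parent waiting on its children becomes ready once all are done.
    """
    order = []
    progressed = True
    while progressed:
        progressed = False
        for object_id, obj in objects.items():
            if obj["state"] == STATE_PENDING_CHILDREN:
                children = [c for c in objects.values() if c["parent"] == object_id]
                if all(c["state"] == STATE_DONE for c in children):
                    obj["state"] = STATE_READY
            if obj["state"] == STATE_READY:
                obj["state"] = STATE_DONE  # stand-in for the actual ingest
                order.append(object_id)
                progressed = True
    return order


objects = {
    "book:1": {"parent": None, "state": STATE_PENDING_CHILDREN},
    "page:1": {"parent": "book:1", "state": STATE_READY},
    "page:2": {"parent": "book:1", "state": STATE_READY},
}
order = ingest_order(objects)
```

Because the book is ingested only after both pages, an aggregate derivative on it would be generated once rather than regenerated as each page arrives.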
