Toolchain: CSV compound objects

Brandon Weigel edited this page Apr 18, 2024 · 36 revisions

Overview

This toolchain creates Islandora import packages consisting of compound objects, where the metadata describing a set of objects is in a Comma Separated Value (CSV) file. The toolchain supports metadata describing the compound object as a whole and, optionally, metadata describing individual child objects. The resulting Islandora import packages can then be ingested into Islandora using the Islandora Compound Batch module.

Currently, this toolchain only works with flat compound objects. Compound objects may have any number of children, but every child must be a direct child of the compound object. In other words, nested or hierarchical structures are not supported.

Requirements of the CSV file used as the input for this toolchain are:

  • the first row of the CSV file contains column labels/headings
  • all column headings must be unique, and the heading row cannot contain any empty headings
  • the fields in each record are separated by a single type of field delimiter (defined in the [FETCHER] section's "field_delimiter" configuration setting, as described below)
  • each record in the CSV file corresponds to one Islandora object, either a compound object or a child object
  • one of the fields contains a unique identifier for each row in the file (in the [FETCHER] section's "record_key" configuration setting, as described below)
  • for child objects that have their own metadata, one of the fields contains a unique identifier for each child, expressed as the child's sequence (order) in its parent compound object in the file (in the [FETCHER] section's "child_key" configuration setting, as described below).
    • Note that if you do not have child-specific metadata, you do not need to account for child objects in your CSV file; they will get a minimal MODS datastream as described in the "Creating child-level MODS XML files" section below. However, you still need to include a column in your CSV that is identified in the "child_key" configuration setting. If your CSV does not contain any child-level metadata, this column must be empty. In other words, the "child_key" configuration setting and a corresponding column in your CSV file are always required.
  • one of the fields contains the name of the subdirectory that contains children files for each of the compound objects (in the [FILE_GETTER] section's "compound_directory_field" configuration setting, as described below).

Fields that contain line returns are allowed in the CSV file, as long as those fields are enclosed (wrapped) in double quotation marks or some other valid enclosure characters. Also note that you can comment out problematic parent-level records in CSV input files.
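
The heading-row requirements listed above can be checked before running the toolchain. The following Python sketch is not part of MIK; the function name check_headers and the sample data are hypothetical, and it only tests for empty headings, duplicate headings, and the presence of the columns you name in your configuration.

```python
import csv
import io

def check_headers(csv_text, required, delimiter=","):
    """Return a list of problems found in the heading row (empty list means OK)."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    headings = next(reader, [])
    problems = []
    if any(h.strip() == "" for h in headings):
        problems.append("empty heading")
    if len(set(headings)) != len(headings):
        problems.append("duplicate heading")
    for name in required:
        if name not in headings:
            problems.append(f"missing column: {name}")
    return problems

# Hypothetical sample; the required names mirror record_key, child_key,
# and compound_directory_field from the examples on this page.
sample = "Identifier,Child,Directory,Title\n1,,compounddobject1,Compound 1\n"
print(check_headers(sample, ["Identifier", "Child", "Directory"]))  # prints []
```

An empty list means the heading row passed these three checks; it says nothing about the records themselves.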

Preparing the content files

Within the input directory, each set of files that makes up one compound object must be organized within a subdirectory. Within each compound object subdirectory, child files must be named such that the last segment of their filenames contains a sortable "sequence" string ("01", "02", etc.). The delimiter that separates this sequence string from the rest of the filename can be defined in the .ini file, but it defaults to a hyphen ('-'). This directory diagram shows three compound objects, with two, four, and two child images respectively, under an input directory, and illustrates the use of the sequence strings '01', '02', etc. as the last segment of all of the child filenames:

input_directory/
├── compounddobject1
│   ├── image-01.tif
│   └── image-02.tif
├── compounddobject2
│   ├── image-01.tif
│   ├── image-02.tif
│   ├── image-03.tif
│   └── image-04.tif
└── compounddobject3
    ├── image_this_is_ok-01.tif
    └── this_too-02.tif

The full path to the input directory must be specified in the [FILE_GETTER] section's input_directory value.
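
To illustrate the naming convention, here is a small Python sketch that pulls the sequence string out of a child filename and sorts children by it. It is an illustration only, not MIK's own code; the function name sequence_of is hypothetical, and plain string sorting of the sequence strings is an assumption.

```python
from pathlib import PurePosixPath

def sequence_of(filename, separator="-"):
    """Return the last separator-delimited segment of the filename, without its extension."""
    stem = PurePosixPath(filename).stem
    return stem.rsplit(separator, 1)[-1]

# Filenames taken from the compounddobject3 directory in the diagram above.
children = ["this_too-02.tif", "image_this_is_ok-01.tif"]
ordered = sorted(children, key=sequence_of)
print(ordered)  # prints ['image_this_is_ok-01.tif', 'this_too-02.tif']
```

Because only the last segment is used, the rest of the filename can contain underscores or other text without affecting the order.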

Preparing the CSV metadata file

The CSV input file is similar to the one used by the CSV Single File toolchain, except that it adds a few additional required columns. A very simple example CSV file, corresponding to the input directory example above, that contains rows describing both compound-level objects and child-level objects is:

Identifier,Child,Directory,Title,Date,Description
1,,compounddobject1,Compound 1,2016-08-05,"I am the first compound object."
2,,compounddobject2,Compound 2,2016-08-05,"I am the second compound object."
3,01,compounddobject2,"Compound 2 - First image",2016-08-05,"The first image in Compound 2"
4,02,compounddobject2,"Compound 2 - Second image",2016-08-05,"The second image in Compound 2"
5,,compounddobject3,Compound 3,2016-08-05,"I am the third compound object."

In this example, two of the children of the compound object with the identifier 2 have their own rows in the metadata file (as indicated by the values in the Child column and the parent's input directory in the Directory column), so they will have full MODS.xml files generated for them containing the values in the remaining columns. All other child objects will get minimal MODS.xml files containing only a title, which is derived from a template containing the parent compound object's title and the child's sequence in the parent compound object.
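
The relationship between parent rows and child rows can be sketched in Python as follows. This is an illustration of how the Child and Directory columns tie rows together, not a description of MIK's internals; the variable names are hypothetical and the sample is a shortened copy of the CSV above.

```python
import csv
import io

CSV_TEXT = """Identifier,Child,Directory,Title
1,,compounddobject1,Compound 1
2,,compounddobject2,Compound 2
3,01,compounddobject2,Compound 2 - First image
4,02,compounddobject2,Compound 2 - Second image
5,,compounddobject3,Compound 3
"""

parents = {}   # directory name -> parent row
children = {}  # directory name -> {sequence: child row}
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    if row["Child"] == "":
        # An empty Child value marks a row describing the compound object itself.
        parents[row["Directory"]] = row
    else:
        children.setdefault(row["Directory"], {})[row["Child"]] = row

for directory, parent in parents.items():
    sequences = sorted(children.get(directory, {}))
    print(parent["Title"], sequences)
```

Here only compounddobject2 has child rows (sequences 01 and 02); children of the other two compound objects have no rows of their own.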

Preparing the configuration file

All MIK configuration files are standard INI files which contain the following sections: [SYSTEM], [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.

Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections and the entries within each section do not matter.
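
As a quick sanity check, the sections present in a configuration file can be listed with a short Python sketch. Note the assumptions: Python's configparser is a different parser from the one MIK uses, the function name missing_sections is hypothetical, and the sketch only reports which of the section names listed above are absent; it does not claim that MIK rejects a file lacking any of them.

```python
import configparser

EXPECTED_SECTIONS = ["SYSTEM", "CONFIG", "FETCHER", "METADATA_PARSER",
                     "FILE_GETTER", "WRITER", "MANIPULATORS", "LOGGING"]

def missing_sections(ini_text):
    """Return the names from EXPECTED_SECTIONS that the INI text lacks."""
    # strict=False tolerates repeated keys such as postwritehooks[];
    # interpolation=None leaves % characters (e.g. %parent_title%) alone.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    return [name for name in EXPECTED_SECTIONS if not parser.has_section(name)]

# Hypothetical, deliberately incomplete configuration.
example = """
; a comment line
[FETCHER]
class = Csv
record_key = Identifier
[WRITER]
class = CsvCompound
postwritehooks[] = "/usr/bin/php a.php"
postwritehooks[] = "/usr/bin/php b.php"
"""
print(missing_sections(example))
```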

The SYSTEM section

This section of the configuration file sets or overrides configuration settings for PHP and the various third-party PHP components used by MIK. It can contain the following entries:

  • date_default_timezone: Optional. Provide a default timezone if date.timezone is null in the PHP INI. You will know if you need to use this setting because Monolog will throw MIK exceptions and halt MIK. Set to one of the valid PHP timezone values listed at http://php.net/manual/en/timezones.php.
  • verify_ca: Optional. OSX's default PHP configuration uses Apple's Secure Transport rather than OpenSSL, causing issues with Certificate Authority verification in Guzzle requests against websites that use HTTPS. This setting allows Guzzle to override CA verification. You will know if you need to use this setting because Guzzle will write entries in your mik.log complaining about CA verification. Set to false to ignore CA verification.

Example

[SYSTEM]
date_default_timezone = 'America/Vancouver'

The CONFIG section

Key-value pairs of configuration entries in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they can't be dynamically derived at runtime). Therefore, all entries in this section are optional.

Example

[CONFIG]
config_id = csvexample
last_updated_on = "2016-08-20"
last_update_by = "Mark Jordan"

The FETCHER section

This section of the configuration file contains the following entries:

  • class: Required. Must be 'Csv'.
  • input_file: Required. Full path to the CSV file that contains the data describing the objects you are ingesting into Islandora
  • temp_directory: Required. Full path to the directory where the fetchers write data for use later in the toolchain.
  • field_delimiter: Optional. Default is a comma (,). The string or character used in the CSV file to delimit fields. To read a tab-delimited file, use an actual tab character enclosed in quotation marks, not \t.
  • field_enclosure: Optional. Default is double quotation mark ("). The string or character used in the CSV file to wrap values of fields that contain spaces.
  • escape_character: Optional. Default is backslash (\). The string or character used in the CSV file to escape field delimiters or field enclosure characters within field values.
  • use_cache: Optional. Set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).
  • record_key: Required. The column label identifying the field that contains each record's unique identifier within the CSV file.
  • child_key: Required. The column label identifying the field that contains the sequence number of the child object that the row's metadata applies to. Leave blank in rows that apply to parent compound objects.
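
To see what the field_delimiter and field_enclosure settings mean in practice, the following Python sketch reads a tab-delimited sample in which one enclosed field contains a line return. Python's csv module is used here only as an analogue; MIK's own CSV parsing may differ in details, and the sample data is hypothetical.

```python
import csv
import io

# One heading row and one record; the Title field is wrapped in double
# quotation marks because it contains a line return.
tab_delimited = 'Identifier\tChild\tTitle\n1\t\t"A title\nwith a line return"\n'

reader = csv.reader(io.StringIO(tab_delimited, newline=""),
                    delimiter="\t", quotechar='"')
rows = list(reader)
print(len(rows))  # prints 2
print(rows[1])
```

The enclosed line return stays inside the Title field, so the sample is read as two rows (headings plus one record) rather than three.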

Example

[FETCHER]
class = Csv
input_file = "/tmp/mik_csv_input.csv"
temp_directory = "/tmp/csv_temp"
field_delimiter = ","
record_key = Identifier
child_key = Child

The METADATA_PARSER section

This section of the CSV toolchain's configuration file contains the following entries:

  • class: Required. Must be 'mods\CsvToMods' or 'templated\Templated'. Use the former if simple source field-to-MODS-element mappings are sufficient for your needs, the latter if your source metadata requires complex logic to be converted to MODS.
  • mapping_csv_path: Required. The path, either full or relative to the mik script, where the metadata mappings file is located.
  • repeatable_wrapper_elements: Optional. By default MIK reduces repeated top-level wrapper MODS elements (same element name with the same attributes) down to a single instance of the element. This setting lets you indicate which elements you want to be repeated (i.e., to appear more than once) in your MODS. The most common use for this setting is to allow repeated <extension> elements.

Example

[METADATA_PARSER]
class = mods\CsvToMods
mapping_csv_path = "cartoons_mappings.csv"

The FILE_GETTER section

This section of the CSV toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CsvCompound'.
  • input_directory: Required. The full path to the directory where the content files are located. The files should be named as described in the "Preparing the content files" section above. Giving this option an empty value (e.g., input_directory = ) and specifying MODS as the only datastream (e.g., [WRITER]datastreams[] = MODS) allows testing the generation of MODS without requiring access to the content files.
  • temp_directory: Required. Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.
  • compound_directory_field: Required. The column label identifying the field that contains the name of the subdirectory holding each compound object's child files.
  • validate_input: Optional. Set to false if you do not want MIK to validate the files and directories under input_directory. Defaults to true. See this Cookbook entry for more detail.
  • validate_input_type: Optional. Set to strict if you want MIK to validate the files and directories under input_directory before moving on to generate ingest packages. Defaults to realtime. See this Cookbook entry for more detail.
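
Independently of MIK's own input validation, you can check that every subdirectory named in your CSV exists under input_directory. The Python sketch below is one such check, under the assumption that the directory names have already been read from the column named in compound_directory_field; the function name missing_directories is hypothetical, and a temporary directory stands in for a real input directory.

```python
import os
import tempfile

def missing_directories(input_directory, directory_names):
    """Return the names that are not subdirectories of input_directory."""
    return [name for name in directory_names
            if not os.path.isdir(os.path.join(input_directory, name))]

# Build a throwaway input directory containing only one of the two
# subdirectories named in the (hypothetical) CSV.
with tempfile.TemporaryDirectory() as input_directory:
    os.mkdir(os.path.join(input_directory, "compounddobject1"))
    result = missing_directories(
        input_directory, ["compounddobject1", "compounddobject2"])

print(result)  # prints ['compounddobject2']
```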

Example

[FILE_GETTER]
class = CsvCompound
input_directory = "/tmp/csv_compound_example_tiffs"
temp_directory = "/tmp/csv_example_temp"
compound_directory_field = Directory

The WRITER section

This section of the CSV toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CsvCompound'.
  • output_directory: Required. The full path to the directory where output packages are written.
  • metadata_filename: Required. The name of the metadata file. Must be 'MODS.xml'.
  • postwritehooks: Optional. A multivalued list of post-write hook scripts. Values have two parts, the full path to the PHP, Python, or shell executable, and the full path to the script itself. A compound-specific post-write script is extras/scripts/postwritehooks/generate_compound_structure_file.php, which generates structure.xml files used by the Islandora Compound Batch module.
  • datastreams: Optional. A multivalued list of datastream files that you want MIK to create. If not included, MIK will create all the files that the various file getter, metadata parser, and writer classes used in the toolchain can create. If included, only the indicated datastream files will be generated. Most useful for testing metadata generation, for example datastreams[] = "MODS", which would tell MIK to generate only a MODS.xml file for each object.
  • child_title: Required. A string that is used as the template for child object titles. Applies to child objects that do not have their own rows in the input CSV file. Two tokens are available for use within this template, %parent_title% and %sequence_number%. For example, a value in this configuration field of "%parent_title%, page %sequence_number%" would produce titles in child objects like "View of Burnaby Mountain, page 2". The two tokens are optional; a value of "Page %sequence_number%" would produce child object titles like "Page 5".
  • child_sequence_separator: Optional. The string (usually one character) that is used in child object filenames to delimit the child's sequence number within the compound object. Defaults to a hyphen (-). The sequence number must be the last segment in the child filename. Note: As of commit 420ccbb22c461e3aef7ac4fec67c0982681f2de1 (April 24, 2018), the default for this value changed from an underscore (_) to a hyphen (-) to be consistent with the CSV Books and CSV Newspapers toolchains.
  • generate_child_modsxml: Optional. Set to false if you want to skip generation of MODS.xml for child objects. Useful for testing and troubleshooting. Defaults to true.
  • min_children: Optional. The minimum number of child objects allowed in compound objects. If the number of children in a compound object is less than this number, MIK will skip creating the corresponding ingest package and log that it has been skipped. Defaults to 2.
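
The effect of min_children can be previewed with a short Python sketch that flags compound objects with too few children. This is only an illustration of the rule described above, not MIK's code; the function name packages_to_skip and the child counts are hypothetical.

```python
def packages_to_skip(child_counts, min_children=2):
    """Return directory names whose child count is below min_children."""
    return [name for name, count in child_counts.items() if count < min_children]

# Hypothetical counts of child files per compound object subdirectory.
counts = {"compounddobject1": 2, "compounddobject2": 4, "single_image": 1}
print(packages_to_skip(counts))  # prints ['single_image']
```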

Example

[WRITER]
class = CsvCompound
output_directory = "/tmp/csv_compound_example_output"
metadata_filename = "MODS.xml"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/validate_mods.php"
; Generates `structure.xml` files used by the Islandora Compound Batch module.
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/generate_compound_structure_file.php"
; During testing, we're just interested in MODS
; datastreams[] = OBJ
datastreams[] = MODS

The MANIPULATORS section

This section of the CSV toolchain's configuration file defines which manipulators should be used. Multiple manipulators can be defined for each type (fetchermanipulators, filegettermanipulators, metadatamanipulators) as illustrated below. The value of each entry is the manipulator class name plus any pipe-separated parameters that the manipulator may require. Entries in this section are optional.

Example

[MANIPULATORS]
; fetchermanipulators[] = "RandomSet|50"
fetchermanipulators[] = "SpecificSet|cartoons_set.txt"
metadatamanipulators[] = "FilterModsTopic|subject"
metadatamanipulators[] = "AddUuidToMods"
metadatamanipulators[] = "AddCsvData"
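
The structure of a manipulator entry (class name followed by pipe-separated parameters) can be sketched in Python like this. The function name parse_manipulator is hypothetical and the sketch only splits the string; it does not reflect how MIK loads or runs manipulators.

```python
def parse_manipulator(entry):
    """Split 'ClassName|param1|param2' into (class name, list of parameters)."""
    class_name, *params = entry.split("|")
    return class_name, params

# Entries taken from the example above.
for entry in ["SpecificSet|cartoons_set.txt", "FilterModsTopic|subject", "AddUuidToMods"]:
    print(parse_manipulator(entry))
```

An entry with no pipe character, such as "AddUuidToMods", yields an empty parameter list.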

The LOGGING section

This section of the CSV toolchain's configuration file contains the following entries:

  • path_to_log: Required. The full path to the standard log generated by MIK.
  • path_to_manipulator_log: Required. The full path to the log that the manipulators write status and error messages to.

Example

[LOGGING]
path_to_log = "/tmp/csv_compound_example_output/mik.log"
path_to_manipulator_log = "/tmp/csv_compound_example_output/manipulator.log"

Creating child-level MODS XML files

As described above, if child objects have rows in the CSV file that indicate their sequence within the parent compound object, they will have full MODS XML files generated for them that are identical to the MODS files generated for compound objects. If child objects do not have their own rows in the CSV file, they will get a minimal MODS XML file based on the following template:

<mods xmlns="http://www.loc.gov/mods/v3" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
  <titleInfo>
    <title>{$child_title}</title>
  </titleInfo>
</mods>

This template only produces a simple title for the child object. However, the CSV Compound toolchain provides a way to control what the value of $child_title is for each child object by allowing you to define three aspects of the child title:

  1. whether or not the parent compound object's title is included in the child's title
  2. whether or not the child's sequence within the parent compound object is included in the child's title
  3. some additional text that you want connecting these two bits of information within the child's title.

These three aspects of the child's title are configured within the .ini file in the [WRITER]child_title option. Two tokens are available for use within this template, %parent_title% and %sequence_number%. For example, a value in this configuration field of "%parent_title%, page %sequence_number%" would produce titles in child objects like "View of Burnaby Mountain, page 2". The two tokens are optional; for example, a value of "Page %sequence_number%" would produce child object titles like "Page 5".

Some sample values for the [WRITER]child_title option are:

  • %parent_title%, page %sequence_number%
    • Would produce titles like "View of Burnaby Mountain, page 4", assuming the parent object's title is "View of Burnaby Mountain"
  • %parent_title%, part %sequence_number%
    • Would produce titles like "Video of people at a boat show, part 2", assuming the parent object's title is "Video of people at a boat show"
  • %sequence_number%
    • Would produce titles like "2", because only the child's sequence number is included; the parent title and any connecting text are omitted.

In general, it is a good idea to include both %parent_title% and %sequence_number%, since doing so gives child objects titles that make more sense in search results (for example) than titles like "Part 2" or just "2".
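
The token substitution described in this section can be sketched in Python as a pair of string replacements. This is an illustration of the documented behavior, not MIK's implementation; the function name render_child_title is hypothetical, and the sequence number is inserted exactly as passed in.

```python
def render_child_title(template, parent_title, sequence_number):
    """Fill the %parent_title% and %sequence_number% tokens in a child_title template."""
    return (template
            .replace("%parent_title%", parent_title)
            .replace("%sequence_number%", str(sequence_number)))

print(render_child_title("%parent_title%, page %sequence_number%",
                         "View of Burnaby Mountain", 4))
# prints View of Burnaby Mountain, page 4
print(render_child_title("Page %sequence_number%", "View of Burnaby Mountain", 5))
# prints Page 5
```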
