
Toolchain: CSV newspapers

Mark Jordan edited this page Feb 27, 2019 · 37 revisions

Overview

This toolchain creates Islandora import packages for newspaper issues from issue-level metadata in a CSV file and master (TIFF or JPEG2000) page images. The resulting import packages can then be ingested into Islandora using the Islandora Newspaper Batch module.

Preparing the metadata

Requirements of the CSV file used as the input for this toolchain are that

  • the first row of the CSV file contains column labels/headings
    • all column headings must be unique, and the heading row cannot contain any empty headings
  • the fields in each record are separated by a single type of field delimiter (the delimiter is defined in the [FETCHER] section's "field_delimiter" configuration setting, as described below)
  • each record in the CSV file corresponds to one newspaper issue
  • one of the fields contains a unique identifier for each row in the file (this column is defined in the [FETCHER] section's "record_key" configuration setting, as described below), and
  • one of the fields contains the name of the directory where the issue's page files are located (this column is defined in the [FILE_GETTER] section's "file_name_field" configuration setting, as described below).
  • You can comment out problematic records in CSV input files.

A sample metadata CSV file, with each row describing a single newspaper issue, looks like this:

Identifier,Directory,Issue Title,Date
NL0001,1883-04-05,"New Londoner, April 5, 1883",1883-04-05
NL0002,1883-04-12,"New Londoner, April 12, 1883",1883-04-12
NL0003,1883-04-19,"New Londoner, April 19, 1883",1883-04-19
NL0004,1883-04-26,"New Londoner, April 26, 1883",1883-04-26
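
The requirements listed above can be checked before running MIK. The following is a minimal sketch in Python, not part of MIK itself; the function name check_metadata_csv and its parameters are hypothetical, and it does not handle commented-out records.

```python
import csv
import io


def check_metadata_csv(text, record_key, file_name_field, delimiter=","):
    """Return a list of problems found in issue-level CSV metadata."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return ["file is empty"]
    headings = rows[0]
    problems = []
    if any(heading.strip() == "" for heading in headings):
        problems.append("heading row contains an empty heading")
    if len(set(headings)) != len(headings):
        problems.append("column headings are not unique")
    for name in (record_key, file_name_field):
        if name not in headings:
            problems.append(f"missing column: {name}")
    if problems:
        return problems
    key_index = headings.index(record_key)
    seen = set()
    for number, row in enumerate(rows[1:], start=1):
        if len(row) != len(headings):
            problems.append(f"record {number}: expected {len(headings)} fields, found {len(row)}")
            continue
        key = row[key_index]
        if key in seen:
            problems.append(f"record {number}: duplicate {record_key} value {key!r}")
        seen.add(key)
    return problems


sample = '''Identifier,Directory,Issue Title,Date
NL0001,1883-04-05,"New Londoner, April 5, 1883",1883-04-05
NL0002,1883-04-12,"New Londoner, April 12, 1883",1883-04-12
'''
print(check_metadata_csv(sample, "Identifier", "Directory"))
```

An empty list means none of the checked problems were found.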

The metadata in this CSV file is specific to each newspaper issue, whereas metadata that is the same for all issues should be added using null mappings in the metadata mappings file. Here is a mappings file using column headings from the CSV metadata file above:

Sample mappings file:

"Issue Title","<titleInfo><title>%value%</title></titleInfo>",
"Date","<originInfo><dateIssued encoding='w3cdtf' keyDate='yes'>%value%</dateIssued></originInfo>",
"Publisher","<originInfo><publisher>Heatley & Sons</publisher></originInfo>",
"Identifier","<identifier type='local'>%value%</identifier>",
"null0","<accessCondition type='use and reproduction'>Public domain</accessCondition>",
"null1","<genre authority='lcgft'>newspapers</genre>",
"null2","<typeOfResource>text</typeOfResource>",
"null3","<subject authority='lcsh'><geographic>Toronto (Ont.)--Newspapers</geographic></subject>",

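To illustrate the difference between field mappings and null mappings, here is a rough Python sketch of how such a mappings file could be applied to one CSV record. It is an assumption about the general idea, not MIK's actual implementation: the function expand_mappings is hypothetical, and it simply skips mappings whose field is absent from the record.

```python
from xml.sax.saxutils import escape


def expand_mappings(mappings, record):
    """Return MODS snippets for one record (illustrative only)."""
    snippets = []
    for field, template in mappings:
        if field.startswith("null"):
            # Null mappings carry static values shared by every issue.
            snippets.append(template)
        elif field in record:
            # Field mappings substitute the record's value for %value%.
            snippets.append(template.replace("%value%", escape(record[field])))
    return snippets


mappings = [
    ("Issue Title", "<titleInfo><title>%value%</title></titleInfo>"),
    ("null2", "<typeOfResource>text</typeOfResource>"),
]
record = {"Identifier": "NL0001", "Issue Title": "New Londoner, April 5, 1883"}
for snippet in expand_mappings(mappings, record):
    print(snippet)
```
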
Note that metadata mappings for all toolchains that create newspaper content-model output to be ingested into Islandora must contain a mapping to the MODS <dateIssued> element, which will contain the issue's date of publication in yyyy-mm-dd format:

"Date","<originInfo><dateIssued encoding='w3cdtf' keyDate='yes'>%value%</dateIssued></originInfo>",

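Because the <dateIssued> value must be in yyyy-mm-dd format, it can be worth checking the date column before running MIK. A minimal Python sketch follows; the helper name is_valid_issue_date is hypothetical and not part of MIK.

```python
import re
from datetime import datetime


def is_valid_issue_date(value):
    """True if value is a real calendar date written as yyyy-mm-dd."""
    # Require zero-padded month and day, which strptime alone does not enforce.
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


for candidate in ("1883-04-05", "1883-4-5", "1883-02-30"):
    print(candidate, is_valid_issue_date(candidate))
```
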
Preparing the master page images

In addition to issue-level metadata, you must prepare the corresponding newspaper page images for use with MIK. The way these images are arranged is important because MIK uses information embedded in directory and file names to identify which image corresponds to which newspaper page. Some general requirements for how you prepare the page image files are:

  • The input directory must contain master files for a single newspaper.
  • Allowed extensions for page image files are 'tiff', 'tif', or 'jp2', although others can be used with the allowed_file_extensions_for_OBJ configuration option.

In addition, each issue's page files must be in a directory named with the date of the issue in yyyy-mm-dd format. The page number must be present in the filename as the final hyphen-separated segment of the filename. Leading zeros in page numbers are stripped. Two examples of issue-level directories and files that follow these requirements are:

1910-01-01
  page-01.tif
  page-02.tif

and

1910-01-01
  1910-01-01-01.tif
  1910-01-01-02.tif

Both issue-level directories are valid since:

  • The issue's content is in a directory named using the issue publication date in yyyy-mm-dd format,
  • the segments of the page filenames are separated by hyphens (-) (although this is configurable), and
  • the last segment of each filename contains the page number.
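
The filename rule can be sketched in Python as follows. This only illustrates the convention described above (last separator-delimited segment, leading zeros stripped); the function page_number is hypothetical and is not MIK's code.

```python
from pathlib import Path


def page_number(filename, separator="-"):
    """Return the page number held in the last segment of a page filename."""
    stem = Path(filename).stem  # drop the extension, e.g. ".tif"
    last_segment = stem.split(separator)[-1]
    if not last_segment.isdecimal():
        raise ValueError(f"no page number at the end of {filename!r}")
    return int(last_segment)  # int() strips leading zeros


print(page_number("page-01.tif"))
print(page_number("1910-01-01-02.tif"))
```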

The issue-level directories do not need to be arranged in any particular way, other than they do need to be within a single top-level directory (which is defined in the .ini file in the [FILE_GETTER]input_directory setting). A flat arrangement of issue-level directories is allowed, as is a hierarchical one (e.g., by year, then month, then day), as long as the issue-level directories are named using yyyy-mm-dd and the page image filenames contain the page number as the last part of the filename. Examples of valid issue-level directories (page files are not shown) are:

Hierarchical, with issue-level directories organized by year, then month:

input_directory
├── 1883
│   ├── 1883-04
│   │   ├── 1883-04-05
│   │   ├── 1883-04-12
│   │   ├── 1883-04-19
│   │   └── 1883-04-26
│   └── 1883-05
│       ├── 1883-05-03
│       ├── 1883-05-10
│       └── 1883-05-17
└── 1884
    ├── 1884-01
    │   ├── 1884-01-03
    │   ├── 1884-01-10
    │   ├── 1884-01-17
    │   ├── 1884-01-24
    │   └── 1884-01-31
    └── 1884-02
        ├── 1884-02-07
        ├── 1884-02-14
        ├── 1884-02-21
        └── 1884-02-28

Flat, with issue-level directories all in one directory:

input_directory
├── 1883-04-05
├── 1883-04-12
├── 1883-05-03
├── 1883-05-10
├── 1884-01-31
├── 1884-02-14
├── 1884-02-21
└── 1884-02-28
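
Both arrangements can be handled the same way by walking the top-level directory and keeping every directory whose name matches yyyy-mm-dd. The Python sketch below shows that idea; find_issue_directories is a hypothetical helper, and it checks only the pattern of the name, not whether the date is valid.

```python
import os
import re

ISSUE_DIRECTORY_NAME = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def find_issue_directories(input_directory):
    """Return paths of all directories under input_directory named yyyy-mm-dd."""
    found = []
    for current, subdirectories, _files in os.walk(input_directory):
        for name in subdirectories:
            if ISSUE_DIRECTORY_NAME.fullmatch(name):
                found.append(os.path.join(current, name))
    return sorted(found)
```

Intermediate directories such as 1883 or 1883-04 do not match the pattern, so only the issue-level directories are returned.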

Including page-level OCR files

Optionally, you can add OCR files to input directories, which MIK will then add to the Islandora ingest packages. If these files are present, Islandora will use them as the OCR datastreams on page objects instead of having Islandora generate OCR.

To include page-level OCR in your input, name the OCR files the same as the corresponding page images, but give them extension .txt. Extending examples used above, the files would look like this:

1910-01-01
  page-01.tif
  page-01.txt
  page-02.tif
  page-02.txt

and

1910-01-01
  1910-01-01-01.tif
  1910-01-01-01.txt
  1910-01-01-02.tif
  1910-01-01-02.txt

If you include these files, you may also want to set the [WRITER] log_missing_ocr_files configuration setting to TRUE as documented below. If some of your page images do not have corresponding .txt files, you will get input validation errors. To work around this, add [FILE_GETTER] validate_input = false to your .ini file.
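
To find page images that lack a matching .txt file before running MIK, something like the following Python sketch could be used. The function pages_missing_ocr is hypothetical, and the default extension list mirrors the defaults described above.

```python
from pathlib import Path


def pages_missing_ocr(issue_directory, image_extensions=("tiff", "tif", "jp2")):
    """Return names of page images in issue_directory with no same-named .txt file."""
    missing = []
    for path in sorted(Path(issue_directory).iterdir()):
        is_page_image = path.is_file() and path.suffix.lstrip(".").lower() in image_extensions
        if is_page_image and not path.with_suffix(".txt").exists():
            missing.append(path.name)
    return missing
```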

Preparing the configuration file

All MIK configuration files are standard INI files which contain the following sections: [SYSTEM], [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.

Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections and the entries within each section do not matter.
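
As a quick sanity check, the sections present in a configuration file can be listed with Python's configparser. This is only a sketch: MIK itself is a PHP application, so Python's parser is a stand-in here, and the helper absent_sections is hypothetical. Repeated array-style keys such as postwritehooks[] collapse to their last value in this parser, so it should be used only to check which sections exist.

```python
import configparser

EXPECTED_SECTIONS = [
    "SYSTEM", "CONFIG", "FETCHER", "METADATA_PARSER",
    "FILE_GETTER", "WRITER", "MANIPULATORS", "LOGGING",
]


def absent_sections(ini_text):
    """Return the expected section names that do not appear in ini_text."""
    parser = configparser.ConfigParser(
        strict=False,            # tolerate repeated keys such as datastreams[]
        interpolation=None,      # leave % characters alone
        delimiters=("=",),
        comment_prefixes=(";",),
    )
    parser.optionxform = str     # keep key names case-sensitive
    parser.read_string(ini_text)
    return [name for name in EXPECTED_SECTIONS if not parser.has_section(name)]


example = """
[FETCHER]
class = Csv
; a commented line
[WRITER]
postwritehooks[] = "php a.php"
postwritehooks[] = "php b.php"
"""
print(absent_sections(example))
```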

The SYSTEM section

This section of the configuration file sets or overrides configuration settings for PHP and the various third-party PHP components used by MIK. It can contain the following entries:

  • date_default_timezone: Optional. Provide a default timezone if date.timezone is null in the PHP INI. You will know if you need to use this setting because Monolog will throw MIK exceptions and halt MIK. Set to one of the valid PHP timezone values listed at http://php.net/manual/en/timezones.php.
  • verify_ca: Optional. OSX's default PHP configuration uses Apple's Secure Transport rather than OpenSSL, causing issues with Certificate Authority verification in Guzzle requests against websites that use HTTPS. This setting allows Guzzle to override CA verification. You will know if you need to use this setting because Guzzle will write entries in your mik.log complaining about CA verification. Set to false to ignore CA verification.

Example

[SYSTEM]
date_default_timezone = 'America/Vancouver'

The CONFIG section

Key-value pairs of configuration entries in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they can't be dynamically derived at runtime). Therefore, all entries in this section are optional.

Example

[CONFIG]
config_id = CSVNewspapersTest
last_updated_on = "2016-05-14"
last_update_by = "mj"

The FETCHER section

This section of the configuration file contains the following entries:

  • class: Required. Must be 'Csv'.
  • input_file: Required. Full path to the CSV file that contains the data describing the objects you are ingesting into Islandora.
  • temp_directory: Required. Full path to the directory where the fetchers write data for use later in the toolchain.
  • field_delimiter: Optional. Default is a comma (,). The string or character used in the CSV file to delimit fields. To read a tab-delimited file, use an actual tab character enclosed in quotation marks, not \t.
  • field_enclosure: Optional. Default is double quotation mark ("). The string or character used in the CSV file to wrap values of fields that contain spaces.
  • escape_character: Optional. Default is backslash (\). The string or character used in the CSV file to escape field delimiters or field enclosure characters within field values.
  • use_cache: Optional. Set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).
  • record_key: Required. The column label identifying the field that contains each record's unique identifier within the CSV file.

Example

[FETCHER]
class = Csv
input_file = '/home/mark/Downloads/csv_newspaper_samples.csv'
temp_directory = "/tmp/csv_newspapers_temp"
; The column heading in the CSV file that contains the name of the unique ID for each row.
record_key = Identifier

The METADATA_PARSER section

This section of the CSV newspaper toolchain's configuration file contains the following entries:

  • class: Required. Must be 'mods\CsvToMods' or 'templated\Templated'. Use the former if simple source field-to-MODS-element mappings are sufficient for your needs, the latter if your source metadata requires complex logic to be converted to MODS.
  • mapping_csv_path: Required. The path, either full or relative to the mik script, where the metadata mappings file is located.
  • repeatable_wrapper_elements: Optional. By default MIK reduces repeated top-level wrapper MODS elements (same element name with the same attributes) down to a single instance of the element. This setting lets you indicate which elements may be repeated (i.e., occur more than once) in your MODS. The most common use for this setting is to allow repeated <extension> elements.

Example

[METADATA_PARSER]
class = mods\CsvToMods
mapping_csv_path = "cartoons_mappings.csv"

The FILE_GETTER section

This section of the CSV toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CsvNewspapers'.
  • input_directory: Required. The full path to the directory where the page files are located. The files should be named as described in the "Preparing the master page images" section above. Giving this option an empty value (e.g., input_directory = ) and specifying MODS as the only datastream (e.g., [WRITER]datastreams[] = MODS) allows testing the generation of MODS without requiring access to the content files.
  • temp_directory: Required. Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.
  • file_name_field: Required. The column label identifying the field that contains the name of the directory that corresponds to each issue.
  • validate_input: Optional. Set to false if you do not want MIK to validate the files and directories under input_directory. Defaults to true. See this Cookbook entry for more detail.
  • validate_input_type: Optional. Set to strict if you want MIK to validate the files and directories under input_directory before moving on to generate ingest packages. Defaults to realtime. See this Cookbook entry for more detail.
  • allowed_file_extensions_for_OBJ: Optional. Array option defining what file extensions MIK should allow for page image files. If not specified, valid extensions are 'tiff', 'tif', 'jp2'.

Example

[FILE_GETTER]
class = CsvNewspapers
temp_directory = "/tmp/csv_newspapers_temp"
input_directory = "/home/mark/Downloads/non_cdm_newspaper_samples"
; The column heading in the CSV file that contains the name of the directory where the page files for the issue are.
file_name_field = Directory
allowed_file_extensions_for_OBJ[] = jpg

The WRITER section

This section of the CSV Newspapers toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CsvNewspapers'.
  • output_directory: Required. The full path to the directory where output packages are written.
  • generate_page_modsxml: Optional. Set to false if you do not want MIK to generate page-level MODS.xml files. Additional detail is provided below.
  • metadata_filename: Required. Must be 'MODS.xml'.
  • postwritehooks: Optional. A multivalued list of post-write hook scripts. Values have two parts, the full path to the PHP, Python, or shell executable, and the full path to the script itself.
  • datastreams: Optional. A multivalued list of datastream files that you want MIK to create. If not included, MIK will create a MODS.xml file for each issue (and optionally for each page) plus an OBJ.tiff (or other extension) file for each page. If included, only the indicated datastream files will be generated. Useful for testing metadata generation, for example datastreams[] = "MODS", which would tell MIK to generate only a MODS.xml file for each issue. You can also include 'OCR' in this list if you have provided page-level OCR files in your input directory, although if you leave datastreams[] empty and provide page-level OCR files, they will also be copied into the ingest packages.
  • page_sequence_separator: Optional. Default is a hyphen (-). Character used to separate the segments of page-level filenames so that the last segment can be used to determine the page order/sequence.
  • log_missing_ocr_files: Optional. Default is FALSE. Set to TRUE if you want MIK to log missing OCR files.

Example

[WRITER]
class = CsvNewspapers
output_directory = "/tmp/csv_newspapers_output"
postwritehooks[] = "php extras/scripts/postwritehooks/validate_mods.php"
postwritehooks[] = "php extras/scripts/postwritehooks/object_timer.php"
; Must be MODS.xml
metadata_filename = MODS.xml
; datastreams[] = MODS

The MANIPULATORS section

This section of the CSV toolchain's configuration file defines which manipulators should be used. Multiple manipulators can be defined for each type (fetchermanipulators, filegettermanipulators, metadatamanipulators) as illustrated below. The value of each entry is the manipulator class name plus any pipe-separated parameters that the manipulator may require. Entries in this section are optional.

Example

[MANIPULATORS]
; fetchermanipulators[] = "RandomSet|50"
fetchermanipulators[] = "SpecificSet|newspapers.txt"
metadatamanipulators[] = "FilterModsTopic|subject"
metadatamanipulators[] = "AddUuidToMods"
metadatamanipulators[] = "AddCsvData"
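
The structure of a manipulator entry (class name followed by pipe-separated parameters) can be shown with a short Python sketch. The function parse_manipulator is hypothetical and only illustrates the format; it is not how MIK itself reads these entries.

```python
def parse_manipulator(entry):
    """Split a manipulator entry into its class name and list of parameters."""
    class_name, *parameters = entry.split("|")
    return class_name, parameters


print(parse_manipulator("SpecificSet|newspapers.txt"))
print(parse_manipulator("AddUuidToMods"))
```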

The LOGGING section

This section of the CSV toolchain's configuration file contains the following entries:

  • path_to_log: Required. The full path to the standard log generated by MIK.
  • path_to_manipulator_log: Required. The full path to the log that the manipulators write status and error messages to.

Example

[LOGGING]
path_to_log = "/tmp/csv_newspapers_output/mik.log"
path_to_manipulator_log = "/tmp/csv_newspapers_output/manipulator.log"

Creating page-level MODS.xml files

The CSV metadata file contains rows describing issues, not pages. However, MIK provides an option to generate page-level MODS files, the purpose of which is to provide a descriptive title for each page and also to provide a <dateIssued> value to be used in searches. These page-level MODS files combine metadata from the parent issue with the page's specific page number. Specifically, MIK assumes that the parent (i.e., issue) MODS contains a <titleInfo><title> element and an <originInfo><dateIssued> element, e.g.:

"Title","<titleInfo><title>%value%</title></titleInfo>",
"Date","<originInfo><dateIssued encoding='w3cdtf' keyDate='yes'>%value%</dateIssued></originInfo>",

A sample page-level MODS.xml file generated using this option is:

<?xml version="1.0"?>
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
  <titleInfo>
    <title>New Londoner, February 28, 1884, page 1</title>
  </titleInfo>
  <originInfo>
    <dateIssued encoding="w3cdtf">1884-02-28</dateIssued>
  </originInfo>
</mods>
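
How the issue title, issue date, and page number combine can be sketched in Python. This is an illustration of the output shape shown above, not MIK's implementation; the function page_mods is hypothetical, and the ", page N" title suffix is inferred from the sample.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def page_mods(issue_title, date_issued, page_number):
    """Build a minimal page-level MODS document as a string."""
    ET.register_namespace("", MODS_NS)  # serialize MODS as the default namespace
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    title = ET.SubElement(title_info, f"{{{MODS_NS}}}title")
    title.text = f"{issue_title}, page {page_number}"
    origin_info = ET.SubElement(mods, f"{{{MODS_NS}}}originInfo")
    date = ET.SubElement(origin_info, f"{{{MODS_NS}}}dateIssued", encoding="w3cdtf")
    date.text = date_issued
    return ET.tostring(mods, encoding="unicode")


print(page_mods("New Londoner, February 28, 1884", "1884-02-28", 1))
```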

Generation of page-level MODS.xml files is enabled by default. To disable it, add the following entry to your .ini file:

[WRITER]
generate_page_modsxml = false