Cookbook: Verifying that your Islandora ingest packages contain all expected files

Mark Jordan edited this page Nov 30, 2016 · 3 revisions

It is good practice to make sure that your content is structured properly before ingesting it into Islandora. The Islandora Import Package QA Tool provides a way to do this.

If you are using MIK to generate ingest packages from CONTENTdm, MIK provides a script (extras/scripts/check_files.php) that does what the tool above does and also performs some additional CONTENTdm-specific checks.

The script takes several options:

  • --cmodel : An Islandora content model PID. Required.
  • --dir : The directory containing the files you want to check, without the trailing slash. Required.
  • --files : A comma-separated list of files that need to be present. Required. For content models where the filenames are variable, use an asterisk to indicate the filename (e.g., '*.jpg, *.xml'). In some command shells on Windows you will need to enclose this list with double quotes.
  • --log : The path to the log file containing reports of missing files. Optional (default is ./mik_check_files.log).
  • --issue_level_metadata : The name of the metadata file whose presence is checked at the issue level. Used only with the islandora:newspaperIssueCModel content model (default is MODS.xml).
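To illustrate how a --files value with wildcards can be interpreted, here is a minimal Python sketch. It is not the code used by check_files.php; the function names parse_files_option and missing_patterns are hypothetical, and the matching is assumed to be case-sensitive.

```python
import fnmatch
import os


def parse_files_option(value):
    """Split a --files style value into patterns, trimming stray spaces."""
    return [p.strip() for p in value.split(",") if p.strip()]


def missing_patterns(directory, patterns):
    """Return the patterns that no regular file in `directory` matches."""
    names = [
        n for n in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, n))
    ]
    # fnmatchcase is used so results do not depend on the operating system;
    # this case-sensitivity is an assumption of the sketch.
    return [
        p for p in patterns
        if not any(fnmatch.fnmatchcase(n, p) for n in names)
    ]
```

A literal name such as MODS.xml matches only that file, while a pattern such as *.jpg is satisfied by any file with that extension.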

For newspapers migrated from CONTENTdm, the script also checks that the number of pages in the CONTENTdm version of each issue is the same as the number that have been migrated by MIK, and that all the OCR.txt files have a character encoding of UTF-8. These two checks are performed automatically -- you don't need to specify them when you run the script.
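The UTF-8 check can be pictured with the following Python sketch, which walks a directory tree and reports OCR.txt files whose bytes do not decode as UTF-8. This is an illustration under assumptions, not the script's own implementation; is_valid_utf8 and invalid_ocr_files are hypothetical names.

```python
import os


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def invalid_ocr_files(root, name="OCR.txt"):
    """Return sorted paths of files called `name` under `root` that are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            path = os.path.join(dirpath, name)
            if not is_valid_utf8(path):
                bad.append(path)
    return sorted(bad)
```

A file that decodes without error is not guaranteed to have been authored as UTF-8, which is why a check like this can only say that files appear to be valid.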

Here is an example of how to run the script:

php ./check_files.php --cmodel=islandora:newspaperIssueCModel --dir=m:/test_loads/chinesetimes_1985-1989_fits09test --files="JP2.jp2,JPEG.jpg,MODS.xml,OBJ.tiff,OCR.txt,TN.jpg,TECHMD.xml"
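If you need to run the check over many directories, the command can be assembled programmatically. The Python sketch below builds the argument list from the options described above and runs it without a shell, so the --files value needs no extra quoting. The helper names and the default script path are assumptions for illustration only.

```python
import subprocess


def build_check_command(cmodel, directory, files, log=None,
                        script="./check_files.php"):
    """Return an argument list using the documented check_files.php options."""
    argv = [
        "php", script,
        "--cmodel=" + cmodel,
        "--dir=" + directory.rstrip("/\\"),  # --dir takes no trailing slash
        "--files=" + ",".join(files),
    ]
    if log is not None:
        argv.append("--log=" + log)
    return argv


def run_check(argv):
    """Run the command without a shell and return the summary it prints."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout
```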

After the script runs, it will print a summary of its results, like this:

There are no unexpected files in m:/test_loads/chinesetimes_1985-1989_fits09test.
There are no unexpected files in any issue-level directories.
There are no unexpected files in any newspaper page directories.
All newspaper issues in m:/test_loads/chinesetimes_1985-1989_fits09test have the files JP2.jp2,JPEG.jpg,MODS.xml,OBJ.tiff,OCR.txt,TN.jpg,TECHMD.xml.
All of expected newspaper pages are present.
All OCR.txt files in m:/test_loads/chinesetimes_1985-1989_fits09test appear to be valid UTF-8.
More detail may be available in ./mik_check_files.log.

Details of any problems the script finds, such as missing files, are written to the log file.

Cookbook table of contents
