lindareijnhoudt edited this page Feb 24, 2017 · 13 revisions

The command easy-export-dataset is intended to duplicate EASY datasets between fedora instances for testing purposes.

The export-import cycle

A typical session:

easy-export-dataset easy-dataset:NN sdoNN
easy-ingest sdoNN
easy-update-solr-index -i easy-dataset:MM
easy-update-fs-rdb easy-dataset:MM

Each command has a --help option. The same information and more is available in the readme documents that accompany the code on their GitHub home pages.

  • When importing a dataset from the production environment into the test environment, change the DOI prefix to the test DOI prefix 10.5072.
  • The installed file /opt/easy-export-dataset/cfg/application.properties configures the default source repository. The help info shows how to override these defaults on the command line.
  • Replace NN with the number of the desired dataset. sdoNN is the name of a directory that must not exist yet; choose a name that suits you.
  • The value to use for MM is logged by the ingest command.
  • easy-update-solr-index makes the new dataset appear in search results; easy-update-fs-rdb makes the files visible.
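
Because the export directory must not exist yet, a small guard in front of the export can prevent accidental reuse of an old directory. This is a minimal sketch; the function name require_new_dir is hypothetical and not part of any EASY tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given path does not exist yet.
require_new_dir() {
    if [ -e "$1" ]; then
        echo "refusing to export: $1 already exists" >&2
        return 1
    fi
    return 0
}
```

It could be combined with the export from the session above as `require_new_dir sdoNN && easy-export-dataset easy-dataset:NN sdoNN`.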

User ids

The export command logs warnings with user ids. These ids may not exist in the receiving repository. Either edit these values, replacing them with ids that exist in the receiving repository, or create users with these ids. Some of these user ids, for example those of downloaders and file owners, may scroll off your screen. You can retrieve them from the daily log file with the command below. Note that this log file also includes results from previous sessions on the same day. In that case you might want to drop the filters from the command, so the timestamps remain visible. The filters only work together: sed strips the timestamps, and with timestamps in place sort and uniq make no sense.

grep 'fo.xml contains' /var/log/easy-export-dataset/easy-export-dataset.log \
    | sed 's/.* - //' | sort | uniq

An imaginary example of the filtered log illustrates the roles of the users in the downloaded dataset:

  fo.xml contains depositorId: DEPOSITOR
  fo.xml contains doneById: ARCHIVIST
  fo.xml contains property ownerId:
  fo.xml contains property ownerId: DEPOSITOR
  fo.xml contains requesterId: POTENTIALDOWNLOADER
  fo.xml contains user-id: DOWNLOADER
  fo.xml contains user-id: OTHERDOWNLOADER
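
To get just the distinct user ids, for example as a checklist of users to create in the receiving repository, the filter can be extended. The sketch below assumes the log lines look like the example above, with the id following the last ": " on the line; the function name list_user_ids is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin, print each distinct user id once.
# Assumes the message follows " - " and the id follows the last ": " on the line.
list_user_ids() {
    sed 's/.* - //' \
        | grep -F 'fo.xml contains' \
        | sed -n 's/^.*: \(..*\)$/\1/p' \
        | sort -u
}
```

Lines with an empty value, such as the bare ownerId line in the example, are skipped. Usage: `list_user_ids < /var/log/easy-export-dataset/easy-export-dataset.log`.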

Identifiers

After ingestion the duplicated dataset may still have references to the original dataset. For example in RELS-EXT:

<itemID rdf:parseType="Literal">oai:easy.dans.knaw.nl:easy-dataset:300</itemID>
<hasPid rdf:parseType="Literal">urn:nbn:nl:ui:13-bcm-eps</hasPid>

and in EMD:

<dc:identifier eas:scheme="AIP_ID">twips.dans.knaw.nl-7326384692708777784-1228583223913</dc:identifier>
<dc:identifier>P1804</dc:identifier>
<dc:identifier eas:identification-system="http://www.persistent-identifier.nl" 
               eas:scheme="PID">urn:nbn:nl:ui:13-bcm-eps</dc:identifier>
<dc:identifier eas:scheme="DMO_ID">easy-dataset:300</dc:identifier>
<dc:identifier eas:identification-system="http://dx.doi.org" 
               eas:scheme="DOI">10.5072/dans-xpa-uw47</dc:identifier>
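
One way to locate such leftover references is to export the ingested copy again and search the export directory for the original identifiers. This is a sketch under the assumption that the export consists of plain text and XML files; the function name find_references is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the files below a directory that mention an identifier.
# -F takes the identifier literally; -w keeps easy-dataset:300 from matching
# easy-dataset:3000.
find_references() {
    dir=$1
    id=$2
    grep -rlwF -- "$id" "$dir" | sort
}
```

Usage, with the identifiers from the examples above: `find_references sdoNN easy-dataset:300` or `find_references sdoNN urn:nbn:nl:ui:13-bcm-eps`.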

Roundtrip Test

When you export a dataset, ingest it, export the ingested copy again and compare the two exports, you have to wade through some noise:

  • the names of the folders are different because of new IDs
  • for a small dataset you can rename these folders before executing a diff -r
  • within the cfg.json files you will still see the different IDs
  • many dates are different
  • attributes might be ordered differently
  • we don't care about download histories yet
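
Part of this noise can be masked before comparing two files. The sketch below replaces ids and timestamps with placeholders and then runs diff. It assumes ids of the form easy-<name>:<number> (such as easy-dataset:300) and timestamps like 2015-12-14T12:37:08.022Z; the function names are hypothetical, and reordered attributes are not handled.

```shell
#!/bin/sh
# Hypothetical helpers for comparing one file from each export.

# Print a file with ids and ISO timestamps replaced by fixed placeholders.
normalize() {
    sed -E \
        -e 's/easy-[a-z]+:[0-9]+/easy-ID/g' \
        -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z/TIMESTAMP/g' \
        "$1"
}

# Diff two files after normalization; the exit status is that of diff.
compare_normalized() {
    a=$(mktemp)
    b=$(mktemp)
    normalize "$1" > "$a"
    normalize "$2" > "$b"
    rc=0
    diff "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

Usage: `compare_normalized first-export/some-file second-export/some-file`, where both paths are placeholders for corresponding files in the two exports.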

One significant difference remains: files lack a label in the exported cfg.json.

Cause: the ingest does not fill in the labels in the foXML of files and folders. The label is available as an object property

<foxml:property NAME="info:fedora/fedora-system:def/model#label" 
                VALUE="geinterviewde.jpg"/>

but it is missing from the datastream version in the ingested dataset

<foxml:datastreamVersion ID="EASY_FILE.0" LABEL=""                  
                         CREATED="2015-12-14T12:37:08.022Z" 
                         MIMETYPE="image/jpeg" SIZE="36586">

as it was in the original dataset

<foxml:datastreamVersion ID="EASY_FILE.1" 
                         LABEL="geinterviewde.jpg" 
                         CREATED="2015-08-28T22:34:33.370Z" MIMETYPE="image/jpeg" 
                         SIZE="36586">
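
To see which exported files are affected, you can search an export directory for empty LABEL attributes. This is a rough sketch: it assumes the foXML is stored as text files below the directory, it reports any empty LABEL and not only those of EASY_FILE datastreams, and the function name find_empty_labels is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the files below a directory that contain LABEL="".
find_empty_labels() {
    grep -rlF 'LABEL=""' "$1" | sort
}
```

Usage: `find_empty_labels sdoNN`.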