
Initial concept

ccacciari edited this page Sep 1, 2017 · 6 revisions

Initial concept

Look at the design document.

Initial implementation

It is possible to find the related scripts here.

The first implementation of the manifest document adopts the METS format, according to the METS schema.
The METS model is a container that can accommodate different metadata models. The Library of Congress (LOC) METS web site hosts examples of METS profiles that combine different metadata schemas (PREMIS, MODS, MIX, DC, etc.). EUDAT should implement its own profile, but the manifest schema needs to be implemented and consolidated first. The ECHO Dep Generic METS Profile for Preservation and Digital Repository Interoperability is particularly interesting because it combines the PREMIS and METS models.

A very minimal EUDAT manifest document could look like this:

<?xml version="1.0" encoding="utf-8"?>
<ns1:mets ID="_EUDATMETS_6c9ce44a-92b5-4235-9dba-54c069f31e35" LABEL="EUDAT METS document" xmlns:ns1="http://www.loc.gov/METS/" xmlns:ns2="http://www.w3.org/1999/xlink">
  <ns1:fileSec>
    <ns1:fileGrp>
      <ns1:fileGrp ID="claudio_50c05e40-a2bd-4f74-80a4-59daef0548e6">
        <ns1:fileGrp ID="claudio_50c05e40-a2bd-4f74-80a4-59daef0548e6__files__">
          <ns1:file ID="test2.txt_db1e9746-4ac7-4f9f-be90-42fe68822004">
            <ns1:FLocat LOCTYPE="URL" ns2:href="file://claudio/test2.txt" ns2:type="simple"/>
          </ns1:file>
          <ns1:file ID="test3.txt_bf985828-f3b9-4514-98b5-447795641cf5">
            <ns1:FLocat LOCTYPE="URL" ns2:href="file://claudio/test3.txt" ns2:type="simple"/>
          </ns1:file>
          <ns1:file ID="test.txt_8cd22ae9-63c8-4bdb-8719-6771a5361e33">
            <ns1:FLocat LOCTYPE="URL" ns2:href="file://claudio/test.txt" ns2:type="simple"/>
          </ns1:file>
        </ns1:fileGrp>
        <ns1:fileGrp ID="collection_A_aecf921d-a31b-40ed-8f61-63f44de7e5c1">
          <ns1:fileGrp ID="collection_A_aecf921d-a31b-40ed-8f61-63f44de7e5c1__files__">
            <ns1:file ID="datum_1.dat_4ba59a86-3eb9-4a75-9437-54ce62e8e920">
              <ns1:FLocat LOCTYPE="URL" ns2:href="file://claudio/collection_A/datum_1.dat" ns2:type="simple"/>
            </ns1:file>
            <ns1:file ID="metadatum_1.txt_b223c425-db73-44db-9f1d-8d14e4e3035a">
              <ns1:FLocat LOCTYPE="URL" ns2:href="file://claudio/collection_A/metadatum_1.txt" ns2:type="simple"/>
            </ns1:file>
          </ns1:fileGrp>
        </ns1:fileGrp>
        <ns1:fileGrp ID="collection_C_22dd3e5f-8d6c-4f0a-bd31-498d25629995">
          <ns1:fileGrp ID="collection_C_22dd3e5f-8d6c-4f0a-bd31-498d25629995__files__">
            <ns1:file ID="proofn1.txt_0751977d-e1fa-404c-a6ef-f8531851d2ba">
              <ns1:FLocat LOCTYPE="URL" ns2:href="file://claudio/collection_C/proofn1.txt" ns2:type="simple"/>
            </ns1:file>
          </ns1:fileGrp>
          <ns1:fileGrp ID="collection_C1_03addca3-4ff0-47fa-8f30-22e05090ff36">
            <ns1:fileGrp ID="collection_C1_03addca3-4ff0-47fa-8f30-22e05090ff36__files__">
              <ns1:file ID="proofn2.txt_d83a9d0f-fec5-4f0e-b5d2-ba34607ed22d">
                <ns1:FLocat LOCTYPE="URL" ns2:href="file://claudio/collection_C/collection_C1/proofn2.txt" ns2:type="simple"/>
              </ns1:file>
            </ns1:fileGrp>
          </ns1:fileGrp>
        </ns1:fileGrp>
        <ns1:fileGrp ID="collection_B_88551ac4-f8a1-4327-b55e-287e9193782c">
          <ns1:fileGrp ID="collection_B_88551ac4-f8a1-4327-b55e-287e9193782c__files__"/>
        </ns1:fileGrp>
      </ns1:fileGrp>
    </ns1:fileGrp>
  </ns1:fileSec>
  <ns1:structMap TYPE="Relational">
    <ns1:div LABEL="claudio" TYPE="digital collection">
      <ns1:div LABEL="rel_cb1e9746-4ac7-4f9f-be90-42fe68822007" TYPE="entity relation">
        <ns1:div LABEL="dataless" TYPE="DescriptiveMetaData">
          <ns1:fptr FILEID="test2.txt_db1e9746-4ac7-4f9f-be90-42fe68822004"/>
        </ns1:div>
        <ns1:div LABEL="miniseed" TYPE="Data">
          <ns1:fptr FILEID="test.txt_8cd22ae9-63c8-4bdb-8719-6771a5361e33"/>
        </ns1:div>
        <ns1:div LABEL="seed" TYPE="Composite">
          <ns1:fptr FILEID="test3.txt_bf985828-f3b9-4514-98b5-447795641cf5"/>
        </ns1:div>
        <ns1:div LABEL="miniseed" TYPE="Data">
          <ns1:fptr FILEID="datum_1.dat_4ba59a86-3eb9-4a75-9437-54ce62e8e920"/>
        </ns1:div>
      </ns1:div>
      <ns1:div LABEL="rel_ab1e9746-4ac7-4f9f-be90-42fe68822000" TYPE="entity relation">
        <ns1:div LABEL="miniseed" TYPE="Data">
          <ns1:fptr FILEID="datum_1.dat_4ba59a86-3eb9-4a75-9437-54ce62e8e920"/>
        </ns1:div>
        <ns1:div LABEL="miniseed" TYPE="Data">
          <ns1:fptr FILEID="test.txt_8cd22ae9-63c8-4bdb-8719-6771a5361e33"/>
        </ns1:div>
        <ns1:div LABEL="seed" TYPE="Composite">
          <ns1:fptr FILEID="test3.txt_bf985828-f3b9-4514-98b5-447795641cf5"/>
        </ns1:div>
        <ns1:div LABEL="dataless" TYPE="DescriptiveMetaData">
          <ns1:fptr FILEID="metadatum_1.txt_b223c425-db73-44db-9f1d-8d14e4e3035a"/>
        </ns1:div>
      </ns1:div>
      <ns1:div LABEL="unknown" TYPE="Data">
        <ns1:fptr FILEID="proofn1.txt_0751977d-e1fa-404c-a6ef-f8531851d2ba"/>
      </ns1:div>
      <ns1:div LABEL="unknown" TYPE="Data">
        <ns1:fptr FILEID="proofn2.txt_d83a9d0f-fec5-4f0e-b5d2-ba34607ed22d"/>
      </ns1:div>
    </ns1:div>
  </ns1:structMap>
</ns1:mets>

The above manifest represents the following iRODS collection:

ils -r /cinecaDMPZone2/home/claudio
  manifest.xml
  test2.txt
  test3.txt
  test.txt
  C- /cinecaDMPZone2/home/claudio/collection_A
/cinecaDMPZone2/home/claudio/collection_A:
  datum_1.dat
  metadatum_1.txt
  C- /cinecaDMPZone2/home/claudio/collection_B
/cinecaDMPZone2/home/claudio/collection_B:
  C- /cinecaDMPZone2/home/claudio/collection_C
/cinecaDMPZone2/home/claudio/collection_C:
  proofn1.txt
  C- /cinecaDMPZone2/home/claudio/collection_C/collection_C1
/cinecaDMPZone2/home/claudio/collection_C/collection_C1:
  proofn2.txt

It also encodes the metadata relations described in the following JSON-LD file:

{
  "@context": {
    "@vocab": "./conf/EudatControlledVocabulary.jsonld",
    "path": "@id"
  },
  "Structure": [
    {
      "@type": "Entity",
      "path": "_:claudio/test.txt",
      "type": "Data",
      "format": "miniseed"
    },
    {
      "@type": "Entity",
      "path": "_:claudio/test2.txt",
      "type": "DescriptiveMetaData",
      "format": "dataless",
      "isRelatedTo": [
        { "@id": "_:claudio/test.txt" },
        { "@id": "_:claudio/test3.txt" },
        { "@id": "_:claudio/collection_A/datum_1.dat" }
      ]
    },
    {
      "@type": "Entity",
      "path": "_:claudio/test3.txt",
      "type": "Composite",
      "format": "seed"
    },
    {
      "@type": "Entity",
      "path": "_:claudio/collection_A/datum_1.dat",
      "type": "Data",
      "format": "miniseed",
      "isRelatedTo": [
        { "@id": "_:claudio/test.txt" },
        { "@id": "_:claudio/test3.txt" },
        { "@id": "_:claudio/collection_A/metadatum_1.txt"}
      ]     
    },   
    {
      "@type": "Entity",
      "path": "_:*meta*",
      "type": "DescriptiveMetaData",
      "format": "dataless"
    }
  ]
}

The use of json-ld to describe the metadata relations requires the definition of a controlled vocabulary.
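
Comparing the two examples, each entity that declares "isRelatedTo" appears to correspond to one "entity relation" div in the structural map of the manifest. The following is a minimal sketch of that grouping, under that reading of the example. It treats the document as plain JSON (no JSON-LD library), the function name relation_groups is hypothetical, the input is an abbreviated copy of the "Structure" array, and it is not taken from the actual scripts.

```python
import json

# Abbreviated copy of the "Structure" array from the JSON-LD example.
DOC = json.loads("""
{
  "Structure": [
    {"@type": "Entity", "path": "_:claudio/test.txt",
     "type": "Data", "format": "miniseed"},
    {"@type": "Entity", "path": "_:claudio/test2.txt",
     "type": "DescriptiveMetaData", "format": "dataless",
     "isRelatedTo": [{"@id": "_:claudio/test.txt"},
                     {"@id": "_:claudio/test3.txt"}]},
    {"@type": "Entity", "path": "_:claudio/test3.txt",
     "type": "Composite", "format": "seed"}
  ]
}
""")


def relation_groups(doc):
    """Return one list of paths per entity that declares "isRelatedTo".

    Each list starts with the declaring entity, followed by the entities
    it points to. The "_:" blank-node prefix is stripped from every path.
    (Hypothetical helper, for illustration only.)
    """
    groups = []
    for entity in doc["Structure"]:
        related = entity.get("isRelatedTo")
        if not related:
            continue  # entities without relations form no group
        members = [entity["path"]] + [ref["@id"] for ref in related]
        groups.append([m[2:] if m.startswith("_:") else m for m in members])
    return groups


groups = relation_groups(DOC)
for group in groups:
    print(group)
```

Each returned group lists the file paths that would be referenced together under a single relation.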

The example above shows the approach adopted to ingest metadata into the B2SAFE service. The JSON-LD format and the controlled vocabulary are flexible enough to support a Unix-like pattern syntax for defining data and metadata entities. For example:

all files ending with the character sequence ".hdf5" are data, and all files ending with ".meta" are metadata associated with the hdf5 file of the same name.

{
  "@type": "Entity",
  "path": "_:*.hdf5",
  "type": "Data",
  "format": "HDF5"
},

{
  "@type": "Entity",
  "path": "_:${name}.meta",
  "type": "DescriptiveMetaData",
  "format": "MyFormat",
  "isRelatedTo": [
    { "@id": "_:${name}.hdf5" }
  ]
}
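
How the scripts resolve the "*" wildcard and the "${name}" placeholder is not described here. As a rough illustration of the intended semantics only, the sketch below applies the two rules to a list of paths with the standard fnmatch module. The function name classify and the sample file names are hypothetical.

```python
import fnmatch
import os


def classify(paths):
    """Apply the two example rules to a list of paths.

    Rule 1: "*.hdf5" is Data with format HDF5.
    Rule 2: "<name>.meta" is DescriptiveMetaData, related to "<name>.hdf5"
            when such a file is present in the same list.
    (Hypothetical helper, not the logic of the B2SAFE scripts.)
    """
    entities = {}
    for path in paths:
        if fnmatch.fnmatchcase(path, "*.hdf5"):
            entities[path] = {"type": "Data", "format": "HDF5",
                              "isRelatedTo": []}
    for path in paths:
        if fnmatch.fnmatchcase(path, "*.meta"):
            name, _ext = os.path.splitext(path)  # plays the role of ${name}
            target = name + ".hdf5"
            related = [target] if target in entities else []
            entities[path] = {"type": "DescriptiveMetaData",
                              "format": "MyFormat",
                              "isRelatedTo": related}
    return entities


result = classify(["run1.hdf5", "run1.meta", "orphan.meta", "notes.txt"])
for path, entity in result.items():
    print(path, entity["type"], entity["isRelatedTo"])
```

Files that match neither pattern are simply left out of the result. A metadata file without a matching hdf5 file is kept, with an empty relation list.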

The whole process is implemented through two separate scripts:

  • mets_factory.py: it takes a collection path as input and provides a METS file as output. The input path can be a file system path or an iRODS path. The data-metadata relation document has to comply with the JSON-LD format and the EUDAT controlled vocabulary, while the manifest is a METS formatted document. If the specified collection contains a sub-collection that already has a METS formatted manifest, a link is created from the manifest of the top collection to the manifest of the sub-collection.
$ ./mets_factory.py -h
usage: mets_factory.py [-h] [-dbg] [-d] (-i IRODS | -f FILESYSTEM) confpath
  
METS factory
  
positional arguments:
  confpath              path to the configuration file
  
optional arguments:
  -h, --help            show this help message and exit
  -dbg, --debug         enable debug
  -d, --dryrun          run without performing any real change
  -i IRODS, --irods IRODS
                        irods path
  -f FILESYSTEM, --filesystem FILESYSTEM
                        fs path
  • b2safe_neo4j_client.py: it takes an iRODS path to a manifest file as input, parses it and uploads the content to a neo4j instance. If the neo4j DB already holds a graph for the collection under the specified path, b2safe_neo4j_client compares the "old" and "new" manifests (both are assumed to be under the collection path), extracts the changes and updates the graph accordingly. If the top collection contains a sub-collection that has its own manifest file, and the manifest of the top collection has a link element (mptr) in the structural map pointing to it, then b2safe_neo4j_client creates a graph for the sub-collection if none exists, or updates the existing one, and connects the subgraph to the graph of the top collection with the IS_LINKED_TO relation. Supporting a hierarchy of manifest files reduces the size of each manifest, which makes it easier to parse and allows multiple sub-collections to be processed in parallel.
$ ./b2safe_neo4j_client.py -h
usage: b2safe_neo4j_client.py [-h] [-dbg] [-d] confpath path
B2SAFE graphDB client
positional arguments:
  confpath       path to the configuration file
  path           irods path to the data
optional arguments:
  -h, --help     show this help message and exit
  -dbg, --debug  enable debug
  -d, --dryrun   run without performing any real change

The final result of the process in the graphDB appears like this (figure: neo4j picture).

  • validate_mets_manifest.py: an additional script that validates the METS manifest file. It takes the path to the manifest to validate and analyses it for inconsistencies within the METS document, namely files referenced in the structural map that have no file element defined in fileSec, and checks that all referenced files exist in the iRODS collection. The XML parser checks whether the XML itself is well-formed; if it is not, the exception is caught and logged in the log file. At the end, the script writes the validation results to the log file and the validation status to the iRODS metadata and to the RECORDSTATUS attribute of the metsHdr element inside the manifest.
$ validate_mets_manifest.py -h
usage: validate_mets_manifest.py [-h] [-confpath CONFPATH] [-dbg] [-d]
                                 [-path PATH] [-u USER]

B2SAFE manifest validation

optional arguments:
  -h, --help            show this help message and exit
  -confpath CONFPATH    path to the configuration file
  -dbg, --debug         enable debug
  -d, --dryrun          run without performing any real change
  -path PATH            irods path to the manifest to validate
  -u USER, --user USER  irods user
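
The internal consistency check described for validate_mets_manifest.py can be sketched with the standard xml.etree.ElementTree module. This is an illustration under assumptions, not the script's code: the function name dangling_file_pointers and the trimmed sample manifest are hypothetical, and the iRODS existence check, the logging and the RECORDSTATUS update are left out.

```python
import xml.etree.ElementTree as ET

METS_NS = {"m": "http://www.loc.gov/METS/"}

# Trimmed, hypothetical manifest: one defined file, two references.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/">
  <fileSec>
    <fileGrp>
      <file ID="f1"/>
    </fileGrp>
  </fileSec>
  <structMap TYPE="Relational">
    <div LABEL="c" TYPE="digital collection">
      <div LABEL="x" TYPE="Data"><fptr FILEID="f1"/></div>
      <div LABEL="y" TYPE="Data"><fptr FILEID="missing"/></div>
    </div>
  </structMap>
</mets>"""


def dangling_file_pointers(manifest_xml):
    """Return the FILEID values used in structMap that fileSec does not define.

    Raises xml.etree.ElementTree.ParseError when the XML is not well-formed.
    """
    root = ET.fromstring(manifest_xml)
    defined = {f.get("ID")
               for f in root.iterfind(".//m:fileSec//m:file", METS_NS)}
    referenced = [p.get("FILEID")
                  for p in root.iterfind(".//m:structMap//m:fptr", METS_NS)]
    return [fid for fid in referenced if fid not in defined]


try:
    problems = dangling_file_pointers(SAMPLE)
    print("dangling references:", problems)
except ET.ParseError as err:
    # A real validator would log this instead of printing it.
    print("manifest is not well-formed:", err)
```

An empty result means that every fptr in the structural map points to a file element defined in fileSec.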
