Define shared/standard 7.x DSID -> PCDM Use class mappings #854

Closed
mjordan opened this issue Jun 25, 2018 · 19 comments


@mjordan
Contributor

mjordan commented Jun 25, 2018

I'm sure this issue is floating just under the surface of a lot of what we've already accomplished (e.g., we're already using the term "Preservation Master" for images), but it might be useful to discuss defining a standard mapping between Islandora 7.x DSIDs and PCDM Use classes. For example:

| 7.x | CLAW |
| --- | --- |
| OBJ | http://pcdm.org/use#OriginalFile |
| OCR | http://pcdm.org/use#ExtractedText |
| TN | http://pcdm.org/use#ThumbnailImage |

There will be some gaps here, largely dependent on the idiosyncrasies of 7.x solution-packs. For example, what is the corresponding PCDM Use class for the large image solution pack's JPG and the JP2 datastreams?

Using standard mappings will help us:

  • when we start modelling paged content
  • to promote cross-repo interoperability through consistent use of linked data classes and properties
  • in developing migration strategies and tools
  • in creating derivative-generation and indexing rules (transcript, OCR, etc.)
  • in developing preservation capabilities; for example, even though most 7.x solution packs do not create preservation-friendly derivatives, the PDF solution pack offers the option of generating a PDF/A, which would be a http://pcdm.org/use#PreservationMasterFile resource.

In general, we should ask "if we don't use PCDM Use classes to characterize binary resources, what vocabularies do we use?"

@DiegoPino
Contributor

@mjordan I agree that we need to at least suggest a common vocab to semantically denote the purpose and the origin of our binaries and how they relate to each other (in the RELS-INT fashion). I would say "ontology", but I still have my doubts about how complete an ontology #use was able to describe when it was last updated in 2015. What makes me a little nervous is deciding whether PCDM is really that widely used and maintained, and whether the ontology accepts changes and requests. Everything there is a subclass of pcdm:File, and to be finer-grained we might also need some predicates, like derivedFrom, computedFrom, transformedFrom, and how?

What is Archivematica using, or thinking about using in the future? Or ArchivesSpace? I'm wondering if we could work together with those communities too, to come up with a common ontology that is extensible and also builds on other base ontologies, so we are not isolated (#use is not derived from anything and has, e.g., no SKOS close concepts or anything else that binds it to other ontologies).

In any case, I support the effort of standardising and using more ontologies for binary resources.

E.g.: in my own development I'm simply skipping derivative generation for images, dropping the need for extra Drupal nodes/media. I know it is totally not the Islandora way, but hey, IIIF already defines thumbnails and all the other viewing options pretty well, and its ontologies are derived from other well-known and widely used ones. Generating derivatives makes little sense to me if the caching provided by IIIF servers fulfils its promise and lowers the processing needs (thumbnails are generated in real time and then cached via Cantaloupe). So in that case, IIIF ontologies (for viewing) are good. And the IIIF API 3.0 specs define capabilities for video, etc.

@mjordan
Contributor Author

mjordan commented Jun 25, 2018

@DiegoPino I agree with you that it's worth looking beyond PCDM if it doesn't fit our needs. I take your point that #use has not been updated since 2016, but I don't think there's anything stopping us from proposing updates, including any predicates like the ones you suggest. I mentioned PCDM because it started out as being shared by a wider, allied community, but if that is no longer the case, let's accept that. Maybe there's an opportunity here to grow that broader community.

I don't think Archivematica describes individual files the way we'd need to; if anything, I am guessing it would look to PREMIS, which doesn't have vocabulary similar to PCDM Use (its objectCharacteristicsExtension semantic unit punts on the sort of thing we are talking about). I'm totally in favor of looking at IIIF, as long as it provides a useful vocabulary for objects other than images. Can you point me to documentation on IIIF's use of other ontologies?

What I think it's very important to avoid is making up our own vocabulary without first eliminating others.

@dannylamb
Contributor

@mjordan Here's a mapping by @ruebot from way back when for moving from Fedora 3 to 4. It never got any traction at the time, which I imagine is because it was just too far out for people to see when the software didn't really exist yet. It just moves the DSID to a predicate as a straight-up string, and doesn't attempt to pick something out of an ontology or vocabulary. That doesn't solve this particular problem, but there are a lot of other mappings in there that may still be pertinent as we consider our migration strategy. I think it's definitely still worth a look.

I also agree that it's important not to make our own vocabulary. But I'll take that a step further and say that we also don't need to identify an ontology/vocabulary that perfectly fits our needs, because it probably doesn't exist. Mixing and matching seems to be the way to go so long as you're not semantically violating the original intent of the various sources.

@mjordan
Contributor Author

mjordan commented Jun 25, 2018

@dannylamb thanks for the pointer to that earlier work. I had forgotten about that. Also:

mixing and matching

++

@rangel35

@dannylamb that's the way I was thinking, for our purposes I've been looking at different ontologies to see what "pieces" work for us and making a list of things to use in different situations...still very much a work in progress but there is a lot out there

@dannylamb
Contributor

dannylamb commented Jun 29, 2018

@DiegoPino FYI, if you went and replaced the pcdmuse terms with ones from any other vocab/ontology and then updated a handful of views and contexts in Drupal, you can completely bend the derivative system (and whatever else) to your will. It'd be nice to see someone do that with something non-pcdm, as I know you are not the only one out there interested in other ontologies. What part of IIIF in particular are you looking at?

@rosiel
Member

rosiel commented Aug 21, 2018

I can't find the IIIF ontology that Diego alluded to, but as for PCDM - it does seem to cover 95% of our datastreams. For instance, as Mark asked about Large Image's JP2 and JPG - I think they both qualify as ServiceFile because they're both served as the web-presentation of the object in different contexts (if you have a large image viewer and if you don't, respectively). (i.e. multiple files may have the same PCDM #use but their mimetypes would have them handled differently).

In many cases, our PreservationMasterFile is also the OriginalFile - PDFA is the only case I can think of where they may be different. Can we apply multiple #use classes to the same file? I would think we could.
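To make the "multiple #use classes on one file" idea concrete, here is a minimal sketch that emits one rdf:type statement per class as N-Triples. The file URI and the `type_triples` helper are hypothetical and only for illustration; nothing here reflects how CLAW actually stores types.

```python
# rdf:type predicate URI (standard RDF vocabulary).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

ORIGINAL_FILE = "http://pcdm.org/use#OriginalFile"
PRESERVATION_MASTER = "http://pcdm.org/use#PreservationMasterFile"


def type_triples(file_uri, use_classes):
    """Return one N-Triples line per use class asserted on the file."""
    return [f"<{file_uri}> <{RDF_TYPE}> <{cls}> ." for cls in use_classes]


# Hypothetical file URI: a single binary that is both the original
# upload and the preservation master.
triples = type_triples(
    "http://example.org/files/1",
    [ORIGINAL_FILE, PRESERVATION_MASTER],
)
for line in triples:
    print(line)
```

RDF places no limit on the number of rdf:type statements a resource can carry, so the same subject simply appears once per class.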

I like the idea of applying a framework - the PCDM ontology or any other - because it lets us semantically define what these 'datastreams' are in a way that isn't just an Islandora convention that grew organically. I'm not saying that we shouldn't mix and match from different ontologies if necessary, but that this might be a good opportunity to really examine how the datastreams we have fit together.

A way to describe a derivative's origin (in RELS-INT fashion) would be fantastic, but we don't model that right now. Depending on what ontology we use (and if it has application to the 'DSID' problem), it may warrant a separate discussion/ticket. For which I offer: CRMdig, an ontology about digital provenance. https://www.ics.forth.gr/isl/index_main.php?l=e&c=656

@dannylamb
Contributor

I don't think there's anything prohibiting you from slapping multiple pcdmuse types on a single object. And yeah... we can switch everything over to Original File. I made everything Preservation Master as a best guess.

When you're asking about a derivative's origin, you mean linking back to the Original File from all the derivatives? That'd be nice to have in link headers as well as the RDF.

@dannylamb
Contributor

@rosiel Totally egregious, but there's a "convertedFrom" in the IANA link registry: https://www.iana.org/assignments/link-relations/link-relations.xhtml

It's meant for moving from draft to proposal to release candidate status for specs, but hey, we sure are converting those derivatives from a source....
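As a rough sketch of what a derivative pointing back at its source in a Link header might look like: the URL and the `link_header` helper below are made up for illustration, and whether "convertedFrom" is the right relation is exactly the open question here.

```python
def link_header(target_uri, rel):
    """Format a single HTTP Link header value: <target>; rel="relation"."""
    return f'<{target_uri}>; rel="{rel}"'


# Hypothetical URL for the original file a thumbnail was generated from.
header = link_header("http://example.org/media/1/original", "convertedFrom")
print("Link: " + header)
```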

@ajs6f

ajs6f commented Aug 22, 2018

@dannylamb,

The document linked to was later converted to the document that contains this link relation. For example, an RFC can have a link to the Internet-Draft that became the RFC; in that case, the link relation would be "convertedFrom".

I don't think it's "meant for moving from draft to…", I think that's just an example. But maybe not?

@dannylamb
Contributor

So is an OBJ from 7.x an "Original File" or a "Preservation Master File"? I've been saying Preservation Master, but looking at it now... I'm having second thoughts.

@mjordan
Contributor Author

mjordan commented Oct 3, 2018

IMO it's an "Original File". As far as I know, the only standard 7.x solution pack that creates what we should consider a preservation master is the PDF SP, which optionally creates a PDF/A.

@dannylamb
Contributor

dannylamb commented Oct 11, 2018

I'm playing around with this mapping in migrate_7x_claw and I've come up with

| 7.x | CLAW |
| --- | --- |
| OBJ | http://pcdm.org/use#OriginalFile |
| PDFA | http://pcdm.org/use#PreservationMasterFile |
| OCR | http://pcdm.org/use#ExtractedText |
| TN | http://pcdm.org/use#ThumbnailImage |
| MEDIUM_SIZE | http://pcdm.org/use#ServiceFile |
| JP2 | http://pcdm.org/use#IntermediateFile |
| RELS-EXT | http://islandora.ca/ontology/relsext# |
| DC | http://purl.org/dc/elements/1.1/ |
| MODS | http://www.loc.gov/mods/v3 |
| TECHMD | http://hul.harvard.edu/ois/xml/ns/fits/fits_output |

All the XML datastreams (RELS-EXT, DC, MODS, TECHMD) seem like the sort of thing that should be parsed and applied as fields on the node (RELS-EXT, DC, and MODS) or the Original File (TECHMD). But I don't see any harm in having tags to identify them for now while we sort out all the xpaths.
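For anyone scripting against this, the mapping above can be sketched as a plain lookup table. This is not the migrate_7x_claw configuration itself, just an illustration; the `DSID_TO_URI` name and the `uri_for_dsid` helper are hypothetical.

```python
# Illustrative lookup table built from the mapping proposed above.
DSID_TO_URI = {
    "OBJ": "http://pcdm.org/use#OriginalFile",
    "PDFA": "http://pcdm.org/use#PreservationMasterFile",
    "OCR": "http://pcdm.org/use#ExtractedText",
    "TN": "http://pcdm.org/use#ThumbnailImage",
    "MEDIUM_SIZE": "http://pcdm.org/use#ServiceFile",
    "JP2": "http://pcdm.org/use#IntermediateFile",
    "RELS-EXT": "http://islandora.ca/ontology/relsext#",
    "DC": "http://purl.org/dc/elements/1.1/",
    "MODS": "http://www.loc.gov/mods/v3",
    "TECHMD": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output",
}


def uri_for_dsid(dsid):
    """Return the tag URI for a 7.x DSID, or None if it has no mapping yet."""
    return DSID_TO_URI.get(dsid.upper())


print(uri_for_dsid("OBJ"))
print(uri_for_dsid("AUDIT"))  # not in the table above, so None
```

Returning None for unmapped DSIDs lets a migration log the gaps rather than fail outright.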

@dannylamb
Contributor

FYI totally guessing on those last four. I just threw in the namespaces for their respective ontologies. Open to suggestion on those for sure.

@mjordan
Contributor Author

mjordan commented Oct 11, 2018

Should we include AUDIT in this list? Not sure what URI to suggest at this point.... other than the one used in the Fedora 3.8 FOXML: info:fedora/fedora-system:format/xml.fedora.audit.

@mjordan
Contributor Author

mjordan commented Oct 11, 2018

Sorry, that should be info:fedora/fedora-system:def/audit#.

@dannylamb
Contributor

What does info:fedora resolve to? http://www.fedora.info/definitions/1/0/? Just a guess from looking at a foxml file, but I can't actually find the ontology on the net.

@dannylamb
Contributor

http://fedora.info/definitions/1/0/access/ObjState is the closest thing I can find.

@dannylamb
Contributor

@mjordan I threw up a PR just to get it out there. I'll add AUDIT once we can figure out a url for it.


6 participants