
Metadata: Dataverse project take ownership of documentation for creating metadata blocks #3168

Closed
tdilauro opened this issue Jun 13, 2016 · 15 comments
Labels: Feature: Developer Guide; Feature: Metadata; User Role: Sysadmin

Comments

@tdilauro (Contributor) commented Jun 13, 2016

At JHU we currently have two use cases for creating custom metadata blocks:

  • Migration of custom metadata fields (the instructions for which don't explicitly mention metadata blocks, but they are implied) for production/staging instances; and
  • Experimentation with metadata models to support software citation/archiving/preservation models.

Neither of these is (though the latter may eventually be) suitable for inclusion in common metadata blocks that would be supported by Dataverse developers or the community at large, so we need to be able to create these blocks locally.

The Dataverse Team did not expect that individual instances would create their own metadata blocks, so documentation for them is sparse. Since we needed a better understanding of how to do this, I put together a document that captured my understanding and asked the DV team (thanks @posixeleni, @pdurbin, @zoidy, @bmckinney, @scolapasta, and @bencomp for your contributions) to help fix errors and clarify points.

At this point, the Dataverse 4.x metadata block syntax/semantics are in pretty good shape with regard to defining and loading metadata blocks, so it would be great if the project would take ownership and responsibility for maintaining the document in some (perhaps completely different) form.

NB: More support/documentation for the required Solr schema changes is still needed to provide full custom metadata block support for local instances.
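
For reference, the loading step itself is done through Dataverse's admin API; a minimal sketch, assuming a local 4.x installation (custom-block.tsv is just a placeholder file name):

curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @custom-block.tsv -H "Content-type: text/tab-separated-values"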

@pdurbin (Member) commented Jun 28, 2016

#3180 (comment) is the most recent example of me updating the Solr schema due to a field being added.
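
The change itself is usually just a new field declaration in Solr's schema.xml; a rough sketch for a hypothetical field (the name and type here are only illustrative, and your schema may also need a corresponding copyField for the catch-all search field):

<field name="myCustomField" type="text_en" multiValued="true" stored="true" indexed="true"/>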

@pdurbin (Member) commented Jun 23, 2017

Here's a comment by @edzale at #3506

Hi,
in our local installation, we would like to customize the metadata:

  • remove the Astronomy metadata block because it's not accurate for our scientific domains
  • add and remove certain metadata elements in the other blocks
  • and define controlled vocabularies for certain metadata elements

Questions: is there any available documentation about these kinds of customizations? Is there any way to use a remote controlled vocabulary (accessible through an API, for example)?
Thank you in advance for your help.

@pdurbin (Member) commented Jan 14, 2018

From IRC today:

"I'm a newbie to Dataverse and evaluating it against CKAN for a potential client. I was wondering what the process is to customize the metadata fields for a dataset, and file metadata? I didn't see anything in the documentation but I very well may have missed it."

http://irclog.iq.harvard.edu/dataverse/2018-01-14

It would be nice to add some documentation on this, assuming we want to support custom metadata blocks.

@jggautier (Contributor) commented Apr 18, 2018

I worked more on the first section of the document that explains how the metadata block tsv is put together. It looks like the second section, about the steps for installing metadata blocks, could use more detail from those who've gone through that process. There are also questions about how to edit/reinstall metadata blocks that could be answered here by people who've done it.

Perhaps the doc could be reviewed by a developer to make sure it's clear and accurate, and then we can decide how it should be added to the guides. New issues can be created to add more information about editing/reinstalling blocks, etc.
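
For anyone landing on this issue, the short version of that first section is that a metadata block TSV has three tab-separated parts; a heavily abbreviated sketch (column lists are truncated and the names/values are only examples, not a complete block):

#metadataBlock	name	dataverseAlias	displayName
	customBlock		My Custom Block
#datasetField	name	title	description	watermark	fieldType	...	allowControlledVocabulary	allowmultiples	...	required	parent	metadatablock_id
	myField	My Field	An example field		text	...	TRUE	FALSE	...	FALSE		customBlock
#controlledVocabulary	DatasetField	Value	identifier	displayOrder
	myField	Option A	option_a	1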

@janetm commented May 22, 2018

Hi All
Implications around customising Dataverse metadata blocks

I've recently had a conversation with Danny and Gustavo (which they found unclear) about the Dataverse Metadata Blocks and issues with local customisation and harvesting. I hope this explains better...

Australian Data Archive (ADA) publish mainly social science survey data so there are some DDI elements/Dataverse fields that use a fairly static vocab. An example is Kind of Data [Survey data, Census data, textual data, diaries, aggregate...]; Unit of Analysis; Time Method etc.

I'm not referring to vocab servers, but drop-downs/tick boxes as already implemented using the TSV. At the moment, to create standard metadata we include the vocab lists in our templates as text, or refer archivists to documentation.

Implications of customising Dataverse metadata blocks:

  • unknown whether Dataverse metadata fields may have values hard-coded into the Dataverse application.
  • for harvesting purposes, any custom modification will cause issues for import.
    ***Is there the possibility of selective harvesting so modified fields could be excluded from harvesting?

We are also having ongoing discussions with Julian about the copyright and version DDI elements not included in the Citation Block - which we have to combine as text in the Notes field. I'm not sure where this is at?

These comments may be better in another space, let me know.
Thanks
Janet

@jggautier (Contributor) commented Jun 19, 2018

Hi @janetm. Thanks for pointing out issues about customizing metadata blocks and how it affects harvesting. And apologies for replying so late. I agree that it's appropriate that we try to clarify these questions in the documentation for creating metadata blocks, which I think should include editing metadata blocks.

I hope I can help answer your questions here and in the documentation (and of course I invite developers to yell at me when I'm wrong):

> Australian Data Archive (ADA) publish mainly social science survey data so there are some DDI elements/Dataverse fields that use a fairly static vocab. An example is Kind of Data [Survey data, Census data, textual data, diaries, aggregate...]; Unit of Analysis; Time Method etc.
>
> I'm not referring to vocab servers, but drop-downs/tick boxes as already implemented using the TSV. At the moment, to create standard metadata we include the vocab lists in our templates as text, or refer archivists to documentation.
>
> Implications of customising Dataverse metadata blocks:
>
>   • unknown whether Dataverse metadata fields may have values hard-coded into the Dataverse application.

I can't imagine any technical issues with editing the default tsv files to allow controlled vocabularies for Kind of Data, Unit of Analysis and other fields that I think you have in mind. (We know that a large number of CV terms raises usability issues, but DDI guidelines suggest a small number of terms for the fields you've mentioned, right?)
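
Mechanically, as I understand it, that would just mean setting allowControlledVocabulary to TRUE on the field's #datasetField row and listing the terms under #controlledVocabulary, along these lines (this is only a sketch; the terms and the internal field name kindOfData are examples of what the rows might look like):

#controlledVocabulary	DatasetField	Value	identifier	displayOrder
	kindOfData	Survey data	survey_data	1
	kindOfData	Census/enumeration data	census_enum_data	2
	kindOfData	Aggregate data	aggregate_data	3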

>   • for harvesting purposes, any custom modification will cause issues for import.
>     ***Is there the possibility of selective harvesting so modified fields could be excluded from harvesting?

I think modified fields are already excluded from harvesting: @scolapasta told me that during harvesting Dataverse will try to harvest metadata even when it's a metadata document that isn't composed the way Dataverse expects it to be. I take this to mean that if during harvesting Dataverse expects Kind of Data in the oai_ddi.xml, like this:

...
<sumDscr>
	...
	<timePrd ...></timePrd>
	<collDate ...></collDate>
	<dataKind>KindOfData1</dataKind>
	<geogCover></geogCover>
	...
</sumDscr>
...

But if the element name <dataKind> is changed to <kindOfData>, or its order is changed (e.g. if it switches places with <collDate>), Dataverse will exclude <kindOfData> and harvest the rest. I'd also think that it would fail to harvest metadata that doesn't have the several fields required for dataset publication.

(Since Dataverse creates ddi.xml that won't validate against the schema because some elements are put in the wrong places or misused, I've always wondered if while harvesting valid ddi.xml, Dataverse would ignore elements because it expects to find them in the wrong places.)
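
(If anyone wants to check a specific record, one way is to validate the export against the DDI Codebook schema with xmllint; a rough sketch, with codebook.xsd standing in for wherever the schema lives locally:)

xmllint --noout --schema codebook.xsd oai_ddi.xml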

> We are also having ongoing discussions with Julian about the copyright and version DDI elements not included in the Citation Block - which we have to combine as text in the Notes field. I'm not sure where this is at?

There's a github issue (#4570) about migrating datasets that already have versions. I think it's complicated because Dataverse automatically assigns versions, so we need to think about how migrating >1 versions will work. I don't know how the versioning that Dataverse does now affects harvesting. (I see that on the search results pages, the cards of harvested datasets don't include version numbers, so maybe it's not an issue?)

For the copyright element issue (and any of these issues really), could we email to schedule a time to chat? In an issue about making Dataverse produce valid ddi metadata (#3648), I proposed using the copyright element differently than I think you and Steve would like to, and I'd like to get your thoughts.

Thanks!

@pdurbin (Member) commented Jun 19, 2018

@jggautier to me taking ownership of the documentation means adding a page to the dev guide on this topic. It would mean a pull request. Does that make sense?

The lack of documentation definitely came up during the Dataverse Community Meeting last week. I'd love for this issue to be prioritized. Also, I'd like to point out that #4451 is related.

@jggautier (Contributor) commented Aug 9, 2018

Adding a page (or maybe adding content to an existing page) in the guides sounds good to me. It'll put the content on GitHub and make it versioned. I'd need to talk to someone more familiar with Sphinx about how to move the content in the Google Doc into the Dataverse guides.

It sounds like you think that adding more info to the Google Doc about installing or reinstalling metadata blocks should be considered after the content has been moved to the guides.

@jggautier (Contributor) commented Aug 15, 2018

During estimation, @pameyer suggested saving the Google Doc as a .docx file and using Pandoc (https://pandoc.org) to convert that to .rst, which Sphinx uses.
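
The conversion itself should be a one-liner; roughly (file names are placeholders):

pandoc metadata-customization.docx -f docx -t rst -o metadata-customization.rst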

The team agreed to move to the guides only content we feel is solid right now - the first section that describes the parts of the metadata block tsv - and open other GitHub issues for moving other content to the guides, i.e. instructions and guidelines for editing and installing metadata blocks.

One thing not discussed was where in the guides this should go. Users can use this info to create or edit metadata blocks during installation, or after installation. So I could see this going in either the Installation Guide or the Admin Guide. Currently, the Appendix is the only section with info about metadata blocks.

@pdurbin (Member) commented Aug 15, 2018

I think the Admin Guide would be a good place. Perhaps we could add the question "Am I happy with the metadata fields available out of the box or do I want to create a custom metadata block?" at http://guides.dataverse.org/en/4.9.2/installation/prep.html#decisions-to-make and link to the new content in the Admin Guide.

dlmurphy self-assigned this Aug 24, 2018
dlmurphy added a commit that referenced this issue Aug 24, 2018
Converted the old google doc into a .rst and added it to our guides. Still needs some syntax finessing.
@dlmurphy (Contributor) commented

For future reference: Pete's suggested method worked very well for converting a Google Doc to a properly formatted .rst file for our guides:

  1. Download the google doc as a .docx
  2. Use pandoc to convert the .docx to a .rst
  3. Add the .rst to the proper docs folder and add an entry for it in the index
  4. Do some finessing of the syntax to make sure it renders properly and add a table of contents to the page

@dlmurphy (Contributor) commented

I've added the new page, but when I'm back on Tuesday I'll finish the syntax fine tuning and we'll be good to go.

@mheppler (Contributor) commented

Looks good so far, @dlmurphy. You can preview the guides .rst files in GitHub, and the table formatting is solid from what I can see.

@dlmurphy (Contributor) commented

Cleaned up the syntax in a49c1bd and it's looking much nicer now. Sending to code review for @jggautier to make sure his vision has been realized.

dlmurphy removed their assignment Aug 28, 2018
dlmurphy self-assigned this Aug 28, 2018
dlmurphy added a commit that referenced this issue Aug 28, 2018
Made some edits to both formatting and content based on @jggautier's review
dlmurphy added a commit that referenced this issue Aug 28, 2018
Linked "Appendix" subsection
dlmurphy removed their assignment Aug 28, 2018
@jggautier (Contributor) commented

Awesome. Thanks @dlmurphy. Moving to QA.
