Data Citation Plan #103

theathorn · 2019-08-09T23:55:29Z

This RFC outlines a 3-phase plan for providing Data Citation support for the HCA DCP experimental data and associated metadata.

Status: Oversight Review
Last call for oversight review: 4 Oct 2019

Summary of Review Discussion for Approvers
There has been general acceptance of the 3 proposed implementation phases with Phase 1 consisting of a stable project URL.
Objections to using an external DOI registration agency (such as Zenodo) have been raised. This matter has been referred to UX for further research, following which a recommendation to the Oversight Committee will be made.
There was a lengthy discussion on the possible use of Compact Identifiers within the HCA metadata but there is no decision to adopt these at this time.

Initial draft.

diekhans

Overall, really good and to the point. The only real concern is Phase 1 not preserving version. Given that a project could change a lot over time, knowing exactly what is cited is important to FAIR.

diekhans · 2019-08-11T17:40:50Z

rfcs/text/0000-data-citation-plan.md

+The Data Browser project details page will add a "To cite this project please copy this link" item.
+This "stable URL" will link back to the production site project page using the project's UUID (e.g. https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79).
+The URL refers to the "live" view of the project and is therefore subject to additions and updates (e.g. corrections) of data and metadata. However, these are expected to be infrequent and should not affect existing primary data.
+If an existing project is deleted and re-ingested then the cited project UUID would become invalid. If such re-ingestion is allowed then a means must be provided to redirect the original "stable URL" to the new version of the project.


It would be much better to ban reingestion, even if a project has to be updated with a one-off program.

You meant "ban reingestion"?

I think one could ban re-ingestion for everything except the experiment restructure case, as long as one was prepared for delay in design and implementation. Experiment restructuring is a much more complex case where we don't know if it's realistically possible yet except by re-ingestion. This re-ingestion could preserve the UUID by keeping the existing project metadata, though that may also require AUDR deletion work to delete all the old bundles via ingest before adding new ones.

I think this depends on the definition of "reingestion". I take it as "delete everything and start over as if the original data was never there". While I think the modeling of experiment restructure is best kept basic (e.g. no more than "this pile of data used to be this old pile of data"), I don't think it should completely lose track of the fact that it is related to the old pile of data.

Yes, I think this needs a determination of exactly how new data is related to old data. If it doesn't need to be on a bundle by bundle or file by file basis then the problem could become much simpler.

I think the important functionality is that a user has some way (ideally both programmatic and via a web page) to discover that an old UUID is no longer in use, along with some sort of pointer to where to go to get the same data. An MVP could map the old UUID to the project ID, ideally with some text explaining what sort of change happened.

Updated Phase 1 to include a redirection facility for deleted and re-ingested projects.
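The redirection idea discussed in this thread (an old UUID pointing to the current project, plus a short explanation of what changed) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the DCP implementation: the `REDIRECTS` table, the `Redirect` record, the `resolve_project` function, and the placeholder UUIDs are all hypothetical. Only the URL prefix is taken from the RFC excerpt quoted above.

```python
from dataclasses import dataclass

# URL prefix taken from the example in the RFC excerpt.
PROJECT_URL_PREFIX = "https://data.humancellatlas.org/explore/projects/"


@dataclass(frozen=True)
class Redirect:
    new_uuid: str  # UUID of the project that replaced the old one
    reason: str    # human-readable explanation of the change


# Hypothetical redirect table with placeholder UUIDs (not real projects).
REDIRECTS = {
    "11111111-1111-1111-1111-111111111111": Redirect(
        new_uuid="22222222-2222-2222-2222-222222222222",
        reason="Project was deleted and re-ingested.",
    ),
}


def resolve_project(project_uuid, redirects=REDIRECTS, max_hops=10):
    """Follow redirects from a cited UUID to the UUID currently in use."""
    changes = []
    current = project_uuid
    for _ in range(max_hops):
        redirect = redirects.get(current)
        if redirect is None:
            # No further redirect: `current` is the live project.
            return {
                "requested": project_uuid,
                "current": current,
                "url": PROJECT_URL_PREFIX + current,
                "changes": changes,
            }
        changes.append(redirect.reason)
        current = redirect.new_uuid
    raise ValueError("redirect chain is too long or contains a cycle")
```

The same lookup could back both a web redirect page and a programmatic endpoint, which is the "both programmatic and web page" discovery path suggested above.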

@diekhans If your concerns have been addressed, may I resolve this conversation?

@diekhans If your concerns have been addressed can you now approve this PR? Your status is still "requested changes".

diekhans · 2019-08-11T17:48:33Z

rfcs/text/0000-data-citation-plan.md

+If an existing project is deleted and re-ingested then the cited project UUID would become invalid. If such re-ingestion is allowed then a means must be provided to redirect the original "stable URL" to the new version of the project.
+Note: Scientists are *already* citing such project based URLs in publications.
+
+See the "Unresolved Questions" section as to whether or not a formal DOI is required for Phase 1.


Data is commonly cited by accession and occasionally by URL, so I don't think it is a requirement to use DOIs are all. However, a survey of some relevant journal requirements would help clarify if the URL is sufficient.

Will bring this up to the UX team.

I presume you meant "use DOIs are all" to be "Use DOIs at all".

Ingest does plan to automatically submit data to suitable long-term resources such as BioStudies, BioSamples and ENA. This gives citable accessions, which act at a similar level to DOIs but are more granular.

Is "project.biostudies_accessions" currently being populated or will it be in the future? If populated I believe it will already show up on the Project Detail page.

I removed the requirement for a formal DOI in Phase 1.

We don't currently populate it automatically but certainly do plan to. If it would help with getting data citation sorted, we could talk with @justincc and @morrisonnorman about whether that work can be prioritized.

mckinsel · 2019-08-19T16:48:29Z

Should we do anything for users who don't get their data via the Data Browser?

mckinsel · 2019-08-19T17:26:28Z

rfcs/text/0000-data-citation-plan.md

+It is proposed to split the initial implementation into three phases:
+
+### Phase 1
+This is designed to satisfy the minimal set of requirements for User Stories #1 through #4 by providing only "per-project" citations.


Are you sure a list of links is enough for cc-by attribution?

@gabsie Can you comment on this as you raised the original requirement? Is the intent that a data consumer licenses their published work with CC-BY and includes URLs to the DCP in that publication? We aren't currently licensing DCP content with a CC-BY license, so what do we need to do in the DCP to satisfy this requirement?

https://wiki.creativecommons.org/wiki/best_practices_for_attribution

@lauraclarke Where is this CC-BY attribution going and who is creating it? Is a publication author putting the CC-BY license in their publication and linking to the DCP project page? Or is the DCP going to attach CC-BY licenses automatically to each project or to the site as a whole?

So the cc-by attribution should likely be at the project level the same way as DOIs will be

If you look at figshare (https://figshare.com/articles/Malignant_Cancer_Cell_Nucleus/9751670) and Zenodo (https://zenodo.org/record/3363060#.XWoe8ZNKjOQ), they both put it on the individual study/project pages.

Please note this cc-by license is different from the DCP choosing to license the static content of our browser. That sounds like something we should discuss, but not here.

Updated the Summary. I didn't alter the CC-BY User Story, assuming that scientific authors will include such a license in their publications. Should we be including a CC-BY license on each Project Detail page of the Data Browser? i.e. Are we saying the data files for each DCP project are licensed under CC-BY?

@lauraclarke Can you answer the question in my previous comment? I think I may be misunderstanding something here.

So I think we need to review how we support people understanding that the data is licensed using cc-by

I think the citation widget gives people a way to meet cc-by attribution needs

They will need to be able to add attribution whenever they reuse something; the project level seems a good starting point. Ultimately we might want a way for someone to give us any identifier and get an appropriate attribution text for that identifier in our system.

A paper which might be useful in considering solutions https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213090

And a blog post https://nlmdirector.nlm.nih.gov/2019/06/11/socio-legal-barriers-to-data-reuse/

mweiden · 2019-08-28T19:51:56Z

@mckinsel @theathorn During Phase 1, I guess the assumption is that users will understand how to link the project uuid to the project uuid in the Matrix Service?

theathorn · 2019-08-29T00:35:36Z

> I guess the assumption is that users will understand how to link the project uuid to the project uuid in the Matrix Service?

The citation link (which the Data Browser would supply to the user and may well encode the project_uuid) would get the user to Data Browser project detail page where they could click the "mtx" icon to download the project's matrix.
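As a rough illustration of a citation link that encodes the project UUID, the sketch below builds the stable URL from a UUID and recovers the UUID from a cited URL. The function names `citation_url` and `project_uuid_from_url` are hypothetical and are not part of any DCP API; the URL pattern and example UUID are the ones quoted from the RFC earlier in this thread.

```python
import uuid
from urllib.parse import urlparse

# URL prefix as shown in the RFC's Phase 1 example.
PROJECT_URL_PREFIX = "https://data.humancellatlas.org/explore/projects/"


def citation_url(project_uuid: str) -> str:
    """Build the stable project URL; raises ValueError for a malformed UUID."""
    canonical = str(uuid.UUID(project_uuid))  # normalises case and validates
    return PROJECT_URL_PREFIX + canonical


def project_uuid_from_url(url: str) -> str:
    """Extract the project UUID from the last path segment of a cited URL."""
    path = urlparse(url).path.rstrip("/")
    candidate = path.rsplit("/", 1)[-1]
    return str(uuid.UUID(candidate))


if __name__ == "__main__":
    link = citation_url("cc95ff89-2e68-4a08-a234-480eca21ce79")
    print(link)
    print(project_uuid_from_url(link))
```

A programmatic user holding only the cited URL could use the extracted UUID to look the same project up in other services, assuming those services key on the same project UUID.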

theathorn · 2019-08-29T00:40:10Z

> Should we do anything for users who don't get their data via the Data Browser?

I think you mean programmatic access via the Matrix Service and/or Query Service? That's currently out-of-scope for this RFC but I'm open to suggestions from Tech Arch as to how this might be achieved in the future.

lauraclarke

> Should we do anything for users who don't get their data via the Data Browser?

@mckinsel Given that all the info needed to create the citation is in the metadata, hopefully other users won't find it too difficult to assemble. We might want to consider how easy it would be to add a CLI or API call that provides citation text for a given project UUID.

lauraclarke

This looks like a good plan which I would be happy to see go forward with minor tweaks.

The most important absence from my perspective is clear acceptance criteria for each phase and success metrics which allow us to test if our chosen solution is useful to the community.

lauraclarke · 2019-08-29T13:14:29Z

rfcs/text/0000-data-citation-plan.md

+
+## Summary
+
+Data Contributors and Data Consumers need to be able to cite data in the DCP that is referenced in scientific publications.


This seems like the least important use case. If there is already a scientific publication, most scientists will reference the publication and not link directly to the DCP.

Should I rephrase to state that authors of publications need to provide a citation to data in the DCP?

lauraclarke · 2019-08-29T13:15:30Z

rfcs/text/0000-data-citation-plan.md

+  - Create a link in the DOI repository back to the project details page in the Data Browser.
+
+### Acceptance Criteria [optional]
+


Please add acceptance criteria for the three phases and success metrics that we can use to see if our solutions actually work for the community

Added acceptance criteria for each phase - please review.

@lauraclarke Are the acceptance criteria, err, acceptable?

Thanks for the acceptance criteria, these look good but the phase 3 criteria seem disconnected from the user stories as they are written now. The user stories don't mention a data release

Added Data Release User Story.

lauraclarke · 2019-08-29T13:18:35Z

rfcs/text/0000-data-citation-plan.md

+If an *immutable* view of the cited data is a requirement, is this technically feasible before a full implementation of support for versioned files (i.e. the AUDR RFC)?
+
+For Phase 1 must a data citation provide a DOI or is a "stable URL" sufficient?
+


I think a stable URL is sufficient for MVP. It would be good to understand the cost of DOI assignment so we can work out whether it can be done in a suitable time frame to be part of the first version, or whether we are better leaving it to later phases.

As agreed, Phase 1 has "stable URLs" only and DOIs are introduced in Phase 2.

justincc · 2019-08-29T16:57:06Z

rfcs/text/0000-data-citation-plan.md

+  - a project's expression matrix outputs, generated by the Matrix Service
+  - a project's metadata
+
+If an *immutable* view of the cited data is a requirement, is this technically feasible before a full implementation of support for versioned files (i.e. the AUDR RFC)?


In principle, everything is immutable in the datastore, it's just a matter of finding it. If we had project bundles they could be revved and then point to a completely different set of bundle UUIDs each time. This would require a lot of adaptation by data browser, etc. But then I think you need this even if you have AUDR.


lauraclarke · 2019-09-23T09:00:22Z

rfcs/text/0000-data-citation-plan.md

+Using a DOI that links to an external repository that can store a manifest for each version of a project may be the simplest way to provide access to versions of the data for a project.
+
+The creation/update process would perform the following steps:
+  - Ingest creates a new DOI (new project) or a new version of an existing DOI (updated project) when the submission is deemed “complete”.


The word "complete" here is unclear. What is meant by complete? There are many ways to define that; more specific details here would be useful.

Attempted to clarify - OK now?

lauraclarke · 2019-09-23T09:01:08Z

rfcs/text/0000-data-citation-plan.md

+#### Phase 2 (in addition to Phase 1 criteria)
+  - The Data Operations team is able to create an immutable citation reference for each project in a Data Release
+  - The citation reference consists of a versioned DOI
+  - The Data Operations team can update the version of the project's DOI in a later Data Release


Do you mean project or release here? why would the data operations team need to update an individual project's DOI?

Clarified to explain that a project may be subject to updates between Data Releases.
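To make the "versioned DOI per project" idea concrete, here is a small in-memory sketch in which each publication of a project yields a new version identifier bound to an immutable manifest. Everything in it is an assumption for illustration: the `DoiRegistry` class, the `10.0000/hca-example` prefix, and the identifier layout are hypothetical and do not reflect how any real DOI registration agency mints or versions DOIs.

```python
class DoiRegistry:
    """Toy registry: one immutable manifest per (project, version)."""

    def __init__(self, prefix="10.0000/hca-example"):  # placeholder prefix
        self._prefix = prefix
        self._latest = {}     # project_uuid -> latest version number
        self._manifests = {}  # doi -> tuple of file identifiers

    def publish(self, project_uuid, file_ids):
        """Create version 1 for a new project, or the next version otherwise."""
        version = self._latest.get(project_uuid, 0) + 1
        doi = f"{self._prefix}.{project_uuid}.v{version}"
        # Store a tuple so a published manifest cannot be mutated later.
        self._manifests[doi] = tuple(file_ids)
        self._latest[project_uuid] = version
        return doi

    def manifest(self, doi):
        """Return the file list that was cited under this exact DOI."""
        return self._manifests[doi]


if __name__ == "__main__":
    registry = DoiRegistry()
    first = registry.publish("project-a", ["file-1", "file-2"])
    second = registry.publish("project-a", ["file-1", "file-2", "file-3"])
    print(first, registry.manifest(first))
    print(second, registry.manifest(second))
```

The point of the sketch is that an update between releases never changes what an earlier identifier resolves to; it only adds a new version alongside it.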

lauraclarke · 2019-09-30T09:53:49Z

> > Should we do anything for users who don't get their data via the Data Browser?
>
> I think you mean programmatic access via the Matrix Service and/or Query Service? That's currently out-of-scope for this RFC but I'm open to suggestions from Tech Arch as to how this might be achieved in the future.

@theathorn Maybe a feature request for the HCA CLI would be a "get citation" command that returns the same text as the data portal widget does. That would at least make it easy for programmatic users to get the info without having to browse the website to find what they are looking for.
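A "get citation" helper of the kind suggested here might be sketched as follows. This is a guess at the shape of such a function, not an existing HCA CLI feature: the `format_citation` name, the input dictionary keys (`title`, `contributors`, `project_uuid`), and the citation wording are all hypothetical. Only the project URL pattern comes from the RFC.

```python
from datetime import date

# URL prefix as shown in the RFC's Phase 1 example.
PROJECT_URL_PREFIX = "https://data.humancellatlas.org/explore/projects/"


def format_citation(project: dict, accessed: date) -> str:
    """Assemble citation text from (hypothetical) project metadata fields."""
    # Fall back to a neutral label when no contributors are listed.
    authors = ", ".join(project.get("contributors", [])) or "Unknown contributors"
    url = PROJECT_URL_PREFIX + project["project_uuid"]
    return (
        f'{authors}. "{project["title"]}". '
        f"Human Cell Atlas Data Coordination Platform. "
        f"{url} (accessed {accessed.isoformat()})."
    )


if __name__ == "__main__":
    example = {
        "title": "Example project title",  # placeholder metadata
        "contributors": ["A. Researcher", "B. Scientist"],
        "project_uuid": "cc95ff89-2e68-4a08-a234-480eca21ce79",
    }
    print(format_citation(example, date(2019, 9, 30)))
```

Keeping the formatting in one function would let a CLI, an API endpoint, and the portal widget return identical text.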

jahilton · 2019-09-30T22:41:06Z

rfcs/text/0000-data-citation-plan.md

+2. As a data consumer (e.g. researcher with a keyboard), I want to be able to view and share a unique citation identifier so that a reader of my manuscript can obtain the data needed to reproduce my results. Anyone can use the citation identifier to view and download all the original cited data and metadata files for a project from the DCP.
+3. As a data consumer, I need a simple way to reference a project in the DCP so that I can fulfill the requirements of a Creative Commons attribution license (CC-BY).
+4. As a data contributor or consumer, I need a way to use the citation identifier to access the output produced by the DCP Matrix Service for the data being cited.
+5. As a member of the Data Operations team I want to be able to create citations for each of the projects that I include in a Data Release (which is defined as a point-in-time selection of specific versions of data from the DCP).


I think one should be able to create a citation for each project, regardless of inclusion in a Data Distribution. So maybe this is better...

As a data consumer, I am going to publish research based on HCA data and I want to create a citation that references the specific version of each project that I used.

...and just read User Story 6, so I think that covers this. I change my suggestion to remove 5, fix the typo in 6 ('contributor') and remove 'or data consumer' in 6 (a data consumer updating a project is not on the table)

...reading further still, I see that a 'research project' in 6 actually seems to refer to a 'data consumer-curated collection of HCA data'
I think my suggested user story above covers both cases of DataOps cutting a Data Distribution and a consumer making their own collection. If each version of each project can be referenced, then any collection is just a list of those references, right?
Unless there is going to be a difference between a collection made by someone DCP internal vs external

The implementation mechanism for User Stories 5 and 6 may differ. We may not require generic Data Portal login for the Data Ops team (e.g. by supplying specific internal tools to them) and the level of citation granularity may be restricted (e.g. to whole projects and/or the whole Data Distribution). User Story 6 allows for a wide audience of authorized users to create arbitrary cross-project citations.

Similar to my comment below, I think assumptions about future functionality are introducing unnecessary complications (Data Distributions, DataOps logins, non-DCP user logins, collections). For instance, it has not been decided that all Data Distributions reference data at the 'project' level.
Would a more straightforward (and thus, easier to spec out & implement) proposal be Phase 1: Stable non-versioned citable project URLs & Phase 2: Stable citable project version URLs? It seems to me that a project version should be citable independent of any collection/distribution, so once Phase 2 is complete, those project versions could easily be grouped/listed in whatever packaging is required, whether by DCPers or non-DCPers.

My intention was to state that for Phase 2 the Data Ops team decides when a project version becomes citable and creates the citable reference (DOI) for each version. Phase 3 then allows external users to create their own citations for any arbitrary set of data. I've attempted to remove any dependence on anything but the most basic Data Distribution features.

@theathorn I think that if we automatically make the headline URL for a project citable, then we should also automate citability of distinct project versions when we support access at that level, regardless of why the version was updated. It will make the whole process much simpler and much more FAIR.

@jahilton and the Data Ops team will, of course, end up needing to make decisions about how releases are updated and versioned but I agree with @jahilton that being too specific on mechanisms around that here is a bad idea

Removed all references to Data Distributions and replaced with "discrete project versions".
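The point made earlier in this thread, that once each project version is referenceable any collection is just a list of those references, can be sketched as below. The `ProjectVersionRef` type, the `uuid@version` identifier layout, and the version strings are hypothetical illustrations; the RFC discussion does not define a concrete format.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ProjectVersionRef:
    """A reference to one discrete version of one project."""
    project_uuid: str
    version: str  # hypothetical version label, e.g. a timestamp string

    def identifier(self) -> str:
        # Hypothetical "uuid@version" layout, chosen only for this sketch.
        return f"{self.project_uuid}@{self.version}"


def collection_manifest(refs):
    """Turn any set of project-version references into a stable, sorted list."""
    # De-duplicate and sort so the same collection always yields the same manifest.
    return [ref.identifier() for ref in sorted(set(refs))]


if __name__ == "__main__":
    refs = [
        ProjectVersionRef("project-b", "2019-10-01"),
        ProjectVersionRef("project-a", "2019-09-01"),
        ProjectVersionRef("project-b", "2019-10-01"),  # duplicate is dropped
    ]
    print(collection_manifest(refs))
```

Under this framing the same function serves a collection cut by the Data Operations team and one assembled by an external user, since both are lists of project-version references.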

jahilton · 2019-09-30T22:56:15Z

rfcs/text/0000-data-citation-plan.md

+  - The Data Operations team is able to create an immutable citation reference for each project in a Data Release
+  - The citation reference consists of a versioned DOI
+  - The Data Operations team can update the version of a project's DOI in a later Data Release, for example when updates have been made to that project's data since the previous Data Release.
+  - The Data Browser provides a means of downloading the data associated with a specific version of a project from a Data Release


This seems beyond the scope of Data Citation and into the realm of Data Distribution access. Would you consider removing this criterion from the current proposal? Or am I missing the connection?

Clarified to state that users must be able to download the cited data that makes up a Data Distribution. Isn't it necessary that users can download cited data so they can reproduce experimental results? Just trying to make that explicit here.

Yes, users must be able to download data from a Data Distribution, but it seems to be an unnecessary step when a simpler criterion would be "The Data Browser provides a means of downloading the data associated with a specific version of a project" (no matter where that version is mentioned).
I worry that adding unnecessary complexity is going to make implementation and acceptance more difficult to achieve. For instance, the current criterion relies on some functionality of Data Distribution, which has not been fully laid out yet.

Removed most dependencies on Data Distribution features.

diekhans

All good, really nice work.

* Create 0000-data-citation-plan.md
  Initial draft.
* Tidy up prior to first review
* Rephrased Summary
  Resolve comment on scientific publications.
* Revised Summary
  Adopted Laura's alternative suggestion for scientific publications.
* Updated Phase 1
  Explicitly state "stable non-versioned project URL". Add redirect page for re-ingested projects. Do not require a DOI for Phase 1.
* Update Phase 2
  Specific that specific versions of projects are citable. Add requirement for a DOI.
* Update Phase 3
  Users can create versioned collections of data that are citable via a DOI.
* Update external DOI website
  Add "may DOI resolve to an external website?" to Unresolved Questions.
* Clarify meaning of DOI repositories
* Minor updates to DOI section
* Added Acceptance Criteria
* Stable URLs for projects only
  Stable URLs are for projects only. Separate citation links are not provided for bundles or files within a project.
* Updated Unresolved Questions
* Clarify "Release View" question
  Clarify the desirable functionality of the Data Browser providing a view of a particular Data Release and allowing further faceted searches within that view.
* Minor grammar fixes
* Added Shepherd
* Updates rarely affect primary data
* Clarify decision process for DOI assigning entity
* Add summary titles to the 3 phases
* Clarify use of matrix output files
* Update User Stories and Acceptance Criteria
  Add a Data Operations team User Story. Update the Acceptance Criteria for Stories HumanCellAtlas#5 and HumanCellAtlas#6 accordingly.
* Clarify Ingest DOI update suggestion
* Delete unused optional sections
* Remove unnecessary implementation note
* Change Data Release to Data Distribution
* Define Authorized User
* Clarify "cited data" in Phase 2
* Remove dependencies on Data Distribution features
* Remove all references to Data Dsitribution
* Spelling corrections
* Approved as rfc14

theathorn added 2 commits August 2, 2019 19:16

Create 0000-data-citation-plan.md

6bd82e4

Initial draft.

Tidy up prior to first review

4cb5fe8

theathorn added the Architecture label Aug 9, 2019

theathorn requested review from mweiden, kislyuk, briandoconnor, brianraymor, diekhans, jahilton, morrisonnorman, lauraclarke, gabsie and adriennes August 9, 2019 23:57

theathorn mentioned this pull request Aug 10, 2019

Define requirements for Data Citation HumanCellAtlas/dcp#424

Closed

theathorn requested a review from hannes-ucsc August 10, 2019 00:17

diekhans requested changes Aug 11, 2019

theathorn requested a review from tburdett August 19, 2019 16:16

mckinsel reviewed Aug 19, 2019

lauraclarke reviewed Aug 29, 2019

lauraclarke approved these changes Aug 29, 2019

morrisonnorman requested changes Aug 29, 2019

justincc reviewed Aug 29, 2019

justincc reviewed Aug 29, 2019

theathorn added 3 commits September 9, 2019 16:01

Rephrased Summary

f7d3352

Resolve comment on scientific publications.

Revised Summary

7881310

Adopted Laura's alternative suggestion for scientific publications.

Updated Phase 1

17f7a30

Explicitly state "stable non-versioned project URL". Add redirect page for re-ingested projects. Do not require a DOI for Phase 1.

lauraclarke reviewed Sep 23, 2019

morrisonnorman approved these changes Sep 23, 2019

theathorn added 8 commits September 27, 2019 17:07

Updates rarely affect primary data

b2a0c38

Clarify decision process for DOI assigning entity

f6b9a5d

Add summary titles to the 3 phases

4931804

Clarify use of matrix output files

3b790ea

Update User Stories and Acceptance Criteria

f4fae2a

Add a Data Operations team User Story. Update the Acceptance Criteria for Stories HumanCellAtlas#5 and HumanCellAtlas#6 accordingly.

Clarify Ingest DOI update suggestion

95f7071

Delete unused optional sections

03f9647

Remove unnecessary implementation note

da7512c

theathorn added rfc-oversight-review and removed rfc-community-review labels Sep 28, 2019

jahilton reviewed Sep 30, 2019

jahilton reviewed Sep 30, 2019

theathorn added 3 commits September 30, 2019 17:04

Change Data Release to Data Distribution

f46165f

Define Authorized User

d6419fb

Clarify "cited data" in Phase 2

24fe761

hannes-ucsc approved these changes Oct 1, 2019

theathorn added 2 commits October 1, 2019 17:11

Remove dependencies on Data Distribution features

95162c5

Remove all references to Data Dsitribution

7ac7962

diekhans approved these changes Oct 3, 2019

theathorn added 2 commits October 2, 2019 17:39

Spelling corrections

c304fdb

Approved as rfc14

dc4a19c

theathorn merged commit 389714b into HumanCellAtlas:master Oct 4, 2019


