Data Citation Plan #103

Merged
merged 31 commits into from
Oct 4, 2019

Conversation

@theathorn theathorn commented Aug 9, 2019

This RFC outlines a 3-phase plan for providing Data Citation support for the HCA DCP experimental data and associated metadata.

Status: Oversight Review
Last call for oversight review: 4 Oct 2019

Summary of Review Discussion for Approvers
There has been general acceptance of the 3 proposed implementation phases with Phase 1 consisting of a stable project URL.
Objections to using an external DOI registration agency (such as Zenodo) have been raised. This matter has been referred to UX for further research, following which a recommendation to the Oversight Committee will be made.
There was a lengthy discussion on the possible use of Compact Identifiers within the HCA metadata but there is no decision to adopt these at this time.

Contributor

@diekhans diekhans left a comment

Overall, really good and to the point. The only real concern is Phase 1 not preserving the version. Given that a project could change a lot over time, knowing exactly what is cited is important to FAIR.

rfcs/text/0000-data-citation-plan.md (Outdated)
The Data Browser project details page will add a "To cite this project please copy this link" item.
This "stable URL" will link back to the production site project page using the project's UUID (e.g. https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79).
The URL refers to the "live" view of the project and is therefore subject to additions and updates (e.g. corrections) of data and metadata. However, these are expected to be infrequent and should not affect existing primary data.
If an existing project is deleted and re-ingested then the cited project UUID would become invalid. If such re-ingestion is allowed then a means must be provided to redirect the original "stable URL" to the new version of the project.
Contributor

@diekhans diekhans Aug 11, 2019

It would be much better to ban reingestion, even if a project has to be updated with a one-off program.

Contributor Author

You meant "ban reingestion"?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one could ban re-ingestion for everything except the experiment restructure case, as long as one was prepared for delay in design and implementation. Experiment restructuring is a much more complex case where we don't know if it's realistically possible yet except by re-ingestion. This re-ingestion could preserve the UUID by keeping the existing project metadata, though that may also require AUDR deletion work to delete all the old bundles via ingest before adding new ones.

Contributor

@diekhans diekhans Aug 29, 2019

I think this depends on the definition of "reingestion". I take it as "delete everything and start over as if the original data was never there". While I think the modeling of experiment restructure is best kept basic (e.g. no more than "this pile of data used to be this old pile of data"), I don't think it should completely lose track of the fact that it is related to the old pile of data.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this needs a determination of exactly how new data is related to old data. If it doesn't need to be on a bundle by bundle or file by file basis then the problem could become much simpler.

Member

I think the important functionality is that a user has some way (ideally both programmatic and via a web page) to discover that an old UUID is no longer in use, along with some sort of pointer to where to go to get the same data. An MVP could be a mapping from the old UUID to the project ID, ideally with some text explaining what sort of change happened.

Contributor Author

Updated Phase 1 to include a redirection facility for deleted and re-ingested projects.
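
As a rough illustration of the Phase 1 behaviour discussed in this thread, the sketch below builds the stable citation URL from a project UUID and resolves a retired UUID to its replacement. The URL prefix comes from the example in the RFC text. The `REDIRECTS` table, its placeholder UUID, the `Redirect` record, and the function names are all hypothetical; they are assumptions for illustration and do not describe any existing DCP service.

```python
from typing import Dict, NamedTuple
from uuid import UUID

# URL prefix taken from the example in the RFC text.
PROJECT_URL_PREFIX = "https://data.humancellatlas.org/explore/projects/"


class Redirect(NamedTuple):
    new_uuid: str  # UUID of the re-ingested project
    reason: str    # short human-readable explanation of the change


# Hypothetical redirect table: retired project UUID -> replacement.
# The retired UUID below is a placeholder, not a real project.
REDIRECTS: Dict[str, Redirect] = {
    "00000000-0000-0000-0000-000000000001": Redirect(
        new_uuid="cc95ff89-2e68-4a08-a234-480eca21ce79",
        reason="Project deleted and re-ingested",
    ),
}


def stable_url(project_uuid: str) -> str:
    """Return the non-versioned citation URL for a project UUID."""
    # UUID() raises ValueError for malformed input; str() normalises case.
    return PROJECT_URL_PREFIX + str(UUID(project_uuid))


def resolve(project_uuid: str) -> str:
    """Follow any redirects and return the current stable URL."""
    current = str(UUID(project_uuid))
    seen = set()
    while current in REDIRECTS:
        if current in seen:
            raise ValueError("redirect loop for " + project_uuid)
        seen.add(current)
        current = REDIRECTS[current].new_uuid
    return stable_url(current)
```

Keeping a `reason` alongside each redirect is one way to carry the "text explaining what sort of change happened" suggested above; a redirect page or an API response could surface it to the user.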

Contributor Author

@diekhans If your concerns have been addressed, may I resolve this conversation?

Contributor Author

@diekhans If your concerns have been addressed can you now approve this PR? Your status is still "requested changes".

If an existing project is deleted and re-ingested then the cited project UUID would become invalid. If such re-ingestion is allowed then a means must be provided to redirect the original "stable URL" to the new version of the project.
Note: Scientists are *already* citing such project based URLs in publications.

See the "Unresolved Questions" section as to whether or not a formal DOI is required for Phase 1.
Contributor

Data is commonly cited by accession and occasionally by URL, so I don't think it is a requirement to use DOIs are all. However, a survey of some relevant journal requirements would help clarify whether the URL is sufficient.

Contributor Author

Will bring this up to the UX team.

Contributor Author

@theathorn theathorn Aug 17, 2019

I presume you meant "use DOIs are all" to be "Use DOIs at all".

Member

Ingest does plan to automatically submit data to suitable long-term resources such as BioStudies, BioSamples and ENA. This gives citable accessions, which act at a similar level to DOIs but are more granular.

Contributor Author

Is "project.biostudies_accessions" currently being populated or will it be in the future? If populated I believe it will already show up on the Project Detail page.

I removed the requirement for a formal DOI in Phase 1.

Member

We don't currently populate it automatically but certainly do plan to. If it would help with getting data citation sorted, we could talk with @justincc and @morrisonnorman about whether that work can be prioritized.

@mckinsel

Should we do anything for users who don't get their data via the Data Browser?

It is proposed to split the initial implementation into three phases:

### Phase 1
This is designed to satisfy the minimal set of requirements for User Stories #1 through #4 by providing only "per-project" citations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure a list of links is enough for cc-by attribution?

Contributor Author

@gabsie Can you comment on this as you raised the original requirement? Is the intent that a data consumer licenses their published work with CC-BY and includes URLs to the DCP in that publication? We aren't currently licensing DCP content with a CC-BY license, so what do we need to do in the DCP to satisfy this requirement?

Contributor Author

@lauraclarke Where is this CC-BY attribution going and who is creating it? Is a publication author putting the CC-BY license in their publication and linking to the DCP project page? Or is the DCP going to attach CC-BY licenses automatically to each project or to the site as a whole?

Member

So the cc-by attribution should likely be at the project level the same way as DOIs will be

If you look at figshare (https://figshare.com/articles/Malignant_Cancer_Cell_Nucleus/9751670) and Zenodo (https://zenodo.org/record/3363060#.XWoe8ZNKjOQ), they both put it on the individual study/project pages.

Please note this cc-by license is separate from the question of whether we as the DCP choose to license the static content of our browser. That sounds like something we should discuss, but not here.

Contributor Author

Updated the Summary. I didn't alter the CC-BY User Story, assuming that scientific authors will include such a license in their publications. Should we be including a CC-BY license on each Project Detail page of the Data Browser? i.e. Are we saying the data files for each DCP project are licensed under CC-BY?

Contributor Author

@lauraclarke Can you answer the question in my previous comment? I think I may be misunderstanding something here.

Member

So I think we need to review how we support people understanding that the data is licensed using cc-by

I think the citation widget gives people a way to meet cc-by attribution needs

They will need to be able to add attribution whenever they reuse something; the project level seems a good starting point. Ultimately we might want a way for someone to give us any identifier and get an appropriate attribution text for that identifier in our system.

Member

A paper which might be useful in considering solutions https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213090

@mweiden
Contributor

mweiden commented Aug 28, 2019

@mckinsel @theathorn During Phase 1, I guess the assumption is that users will understand how to link the project uuid to the project uuid in the Matrix Service?

@theathorn
Contributor Author

> I guess the assumption is that users will understand how to link the project uuid to the project uuid in the Matrix Service?

The citation link (which the Data Browser would supply to the user and may well encode the project_uuid) would get the user to the Data Browser project detail page, where they could click the "mtx" icon to download the project's matrix.

@theathorn
Contributor Author

> Should we do anything for users who don't get their data via the Data Browser?

I think you mean programmatic access via the Matrix Service and/or Query Service? That's currently out-of-scope for this RFC but I'm open to suggestions from Tech Arch as to how this might be achieved in the future.

Member

@lauraclarke lauraclarke left a comment

> Should we do anything for users who don't get their data via the Data Browser?

@mckinsel given all the info to create the citation is in the metadata, hopefully other users won't find it too difficult to assemble. We might want to consider how easy it would be to add a CLI or API call that provides citation text for a given project UUID.

Member

@lauraclarke lauraclarke left a comment

This looks like a good plan which I would be happy to see go forward with minor tweaks.

The most important absence from my perspective is clear acceptance criteria for each phase and success metrics which allow us to test if our chosen solution is useful to the community.


## Summary

Data Contributors and Data Consumers need to be able to cite data in the DCP that is referenced in scientific publications.
Member

This seems like the least important use case. If there is already a scientific publication, most scientists will reference the publication and not link directly to the DCP.

Contributor Author

Should I rephrase to state that authors of publications need to provide a citation to data in the DCP?

- Create a link in the DOI repository back to the project details page in the Data Browser.

### Acceptance Criteria [optional]

Member

Please add acceptance criteria for the three phases and success metrics that we can use to see if our solutions actually work for the community

Contributor Author

Added acceptance criteria for each phase - please review.

Contributor Author

@lauraclarke Are the acceptance criteria, err, acceptable?

Member

Thanks for the acceptance criteria. These look good, but the Phase 3 criteria seem disconnected from the user stories as they are written now: the user stories don't mention a data release.

Contributor Author

Added Data Release User Story.

rfcs/text/0000-data-citation-plan.md (Outdated)
If an *immutable* view of the cited data is a requirement, is this technically feasible before a full implementation of support for versioned files (i.e. the AUDR RFC)?

For Phase 1 must a data citation provide a DOI or is a "stable URL" sufficient?

Member

I think a stable URL is sufficient for MVP. It would be good to understand the cost of DOI assignment so we can work out whether it can be done in a suitable time frame to be part of the first version, or whether we are better off leaving it to later phases.

Contributor Author

As agreed, Phase 1 has "stable URLs" only and DOIs are introduced in Phase 2.

- a project's expression matrix outputs, generated by the Matrix Service
- a project's metadata

If an *immutable* view of the cited data is a requirement, is this technically feasible before a full implementation of support for versioned files (i.e. the AUDR RFC)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, everything is immutable in the datastore, it's just a matter of finding it. If we had project bundles they could be revved and then point to a completely different set of bundle UUIDs each time. This would require a lot of adaptation by data browser, etc. But then I think you need this even if you have AUDR.

Using a DOI that links to an external repository that can store a manifest for each version of a project may be the simplest way to provide access to versions of the data for a project.

The creation/update process would perform the following steps:
- Ingest creates a new DOI (new project) or a new version of an existing DOI (updated project) when the submission is deemed “complete”.
Member

The word "complete" here is unclear. What is meant by complete? There are many ways to define that, so more specific details here would be useful.

Contributor Author

Attempted to clarify - OK now?

#### Phase 2 (in addition to Phase 1 criteria)
- The Data Operations team is able to create an immutable citation reference for each project in a Data Release
- The citation reference consists of a versioned DOI
- The Data Operations team can update the version of the project's DOI in a later Data Release
Member

Do you mean project or release here? Why would the Data Operations team need to update an individual project's DOI?

Contributor Author

Clarified to explain that a project may be subject to updates between Data Releases.

@lauraclarke
Member

> > Should we do anything for users who don't get their data via the Data Browser?
>
> I think you mean programmatic access via the Matrix Service and/or Query Service? That's currently out-of-scope for this RFC but I'm open to suggestions from Tech Arch as to how this might be achieved in the future.

@theathorn Maybe a feature request for the HCA CLI would be a "get citation" command that returns the same text as the data portal widget does. That would at least make it easy for programmatic users to get the info without having to browse the website to find what they are looking for.
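
To make the "get citation" idea concrete, here is a minimal sketch of assembling citation text from project metadata for a given project UUID. The metadata field names (`uuid`, `title`, `contributors`), the citation layout, and the `citation_text` function are assumptions made for illustration; they are not the HCA metadata schema, the data portal widget's actual output, or an existing HCA CLI command. Only the URL prefix is taken from the example in the RFC text.

```python
from typing import Any, Mapping

# URL prefix taken from the example in the RFC text.
PROJECT_URL_PREFIX = "https://data.humancellatlas.org/explore/projects/"


def citation_text(project: Mapping[str, Any]) -> str:
    """Assemble a one-line citation from project metadata.

    The keys read here are illustrative placeholders, not real schema fields.
    """
    contributors = project.get("contributors") or []
    authors = ", ".join(contributors) if contributors else "Unknown contributors"
    url = PROJECT_URL_PREFIX + project["uuid"]
    return (
        f"{authors}. {project['title']}. "
        f"Human Cell Atlas Data Coordination Platform. {url}"
    )
```

A CLI wrapper or API endpoint could call a function like this after fetching the project's metadata, so that programmatic users and Data Browser users would see the same text.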

2. As a data consumer (e.g. researcher with a keyboard), I want to be able to view and share a unique citation identifier so that a reader of my manuscript can obtain the data needed to reproduce my results. Anyone can use the citation identifier to view and download all the original cited data and metadata files for a project from the DCP.
3. As a data consumer, I need a simple way to reference a project in the DCP so that I can fulfill the requirements of a Creative Commons attribution license (CC-BY).
4. As a data contributor or consumer, I need a way to use the citation identifier to access the output produced by the DCP Matrix Service for the data being cited.
5. As a member of the Data Operations team I want to be able to create citations for each of the projects that I include in a Data Release (which is defined as a point-in-time selection of specific versions of data from the DCP).
Contributor

I think one should be able to create a citation for each project, regardless of its inclusion in a Data Distribution. So maybe this is better...

As a data consumer, I am going to publish research based on HCA data and I want to create a citation that references the specific version of each project that I used.

Contributor

...and just read User Story 6, so I think that covers this. I change my suggestion to remove 5, fix the typo in 6 ('contributor') and remove 'or data consumer' in 6 (a data consumer updating a project is not on the table)

Contributor

...reading further still, I see that a 'research project' in 6 actually seems to refer to a 'data consumer-curated collection of HCA data'
I think my suggested user story above covers both cases of DataOps cutting a Data Distribution and a consumer making their own collection. If each version of each project can be referenced, then any collection is just a list of those references, right?
Unless there is going to be a difference between a collection made by someone DCP internal vs external
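
The point that a collection is "just a list of those references" can be sketched as a small data structure. Everything named here is hypothetical: the `ProjectVersionRef` class, the `uuid@version` identifier layout, and the use of a date string as the version label are assumptions for illustration, since the RFC does not define how project versions are labelled.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ProjectVersionRef:
    """An immutable reference to one version of one project."""
    project_uuid: str
    version: str  # hypothetical version label, e.g. a release date

    def identifier(self) -> str:
        # The "uuid@version" layout is an assumption, not a DCP convention.
        return f"{self.project_uuid}@{self.version}"


def make_collection(refs: Iterable[ProjectVersionRef]) -> Tuple[ProjectVersionRef, ...]:
    """Return an immutable, de-duplicated, sorted collection of references."""
    return tuple(sorted(set(refs), key=lambda r: (r.project_uuid, r.version)))
```

Under this model, a collection cut by DataOps and one curated by an external data consumer have the same shape; any difference between them would lie in who is allowed to create and publish the collection, not in how it is represented.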

Contributor Author

The implementation mechanism for User Stories 5 and 6 may differ. We may not require generic Data Portal login for the Data Ops team (e.g. by supplying specific internal tools to them) and the level of citation granularity may be restricted (e.g. to whole projects and/or the whole Data Distribution). User Story 6 allows for a wide audience of authorized users to create arbitrary cross-project citations.

Contributor

Similar to my comment below, I think assumptions about future functionality are introducing unnecessary complications (Data Distributions, DataOps logins, non-DCP user logins, collections). For instance, it has not been decided that all Data Distributions reference data at the 'project' level.
Would a more straightforward (and thus, easier to spec out & implement) proposal be Phase 1: Stable non-versioned citable project URLs & Phase 2: Stable citable project version URLs? It seems to me that a project version should be citable independent of any collection/distribution, so once Phase 2 is complete, those project versions could easily be grouped/listed in whatever packaging is required, whether by DCPers or non-DCPers.

Contributor Author

My intention was to state that for Phase 2 the Data Ops team decides when a project version becomes citable and creates the citable reference (DOI) for each version. Phase 3 then allows external users to create their own citations for any arbitrary set of data. I've attempted to remove any dependence on anything but the most basic Data Distribution features.

Member

@theathorn I think that if we automatically make the headline URL for a project citable, then we should also automate citability of distinct project versions when we support access at that level, regardless of why the version was updated. It will make the whole process much simpler and much more FAIR.

@jahilton and the Data Ops team will, of course, end up needing to make decisions about how releases are updated and versioned but I agree with @jahilton that being too specific on mechanisms around that here is a bad idea

Contributor Author

Removed all references to Data Distributions and replaced with "discrete project versions".

- The Data Operations team is able to create an immutable citation reference for each project in a Data Release
- The citation reference consists of a versioned DOI
- The Data Operations team can update the version of a project's DOI in a later Data Release, for example when updates have been made to that project's data since the previous Data Release.
- The Data Browser provides a means of downloading the data associated with a specific version of a project from a Data Release
Contributor

@jahilton jahilton Sep 30, 2019

This seems beyond the scope of Data Citation and into the realm of Data Distribution access. Would you consider removing this criterion from the current proposal? Or am I missing the connection?

Contributor Author

Clarified to state that users must be able to download the cited data that makes up a Data Distribution. Isn't it necessary that users can download cited data so they can reproduce experimental results? Just trying to make that explicit here.

Contributor

Yes, users must be able to download data from a Data Distribution, but it seems to be an unnecessary step when a simpler criterion would be "The Data Browser provides a means of downloading the data associated with a specific version of a project" (no matter where that version is mentioned).
I worry that adding unnecessary complexity is going to make implementation and acceptance more difficult to achieve. For instance, the current criterion relies on some functionality of Data Distribution, which has not been fully laid out yet.

Contributor Author

Removed most dependencies on Data Distribution features.

Contributor

@diekhans diekhans left a comment

All good, really nice work.

@theathorn theathorn merged commit 389714b into HumanCellAtlas:master Oct 4, 2019
diekhans pushed a commit to diekhans/dcp-community that referenced this pull request Oct 31, 2019
* Create 0000-data-citation-plan.md

Initial draft.

* Tidy up prior to first review

* Rephrased Summary

Resolve comment on scientific publications.

* Revised Summary

Adopted Laura's alternative suggestion for scientific publications.

* Updated Phase 1

Explicitly state "stable non-versioned project URL".
Add redirect page for re-ingested projects.
Do not require a DOI for Phase 1.

* Update Phase 2

Specify that specific versions of projects are citable.
Add requirement for a DOI.

* Update Phase 3

Users can create versioned collections of data that are citable via a DOI.

* Update external DOI website

Add "may DOI resolve to an external website?" to Unresolved Questions.

* Clarify meaning of DOI repositories

* Minor updates to DOI section

* Added Acceptance Criteria

* Stable URLs for projects only

Stable URLs are for projects only.
Separate citation links are not provided for bundles or files within a project.

* Updated Unresolved Questions

* Clarify "Release View" question

Clarify the desirable functionality of the Data Browser providing a view of a particular Data Release and allowing further faceted searches within that view.

* Minor grammar fixes

* Added Shepherd

* Updates rarely affect primary data

* Clarify decision process for DOI assigning entity

* Add summary titles to the 3 phases

* Clarify use of matrix output files

* Update User Stories and Acceptance Criteria

Add a Data Operations team User Story.
Update the Acceptance Criteria for Stories #5 and #6 accordingly.

* Clarify Ingest DOI update suggestion

* Delete unused optional sections

* Remove unnecessary implementation note

* Change Data Release to Data Distribution

* Define Authorized User

* Clarify "cited data" in Phase 2

* Remove dependencies on Data Distribution features

* Remove all references to Data Distribution

* Spelling corrections

* Approved as rfc14