add field to reference data usage citations #259

Closed
mbjones opened this issue Mar 12, 2017 · 12 comments

Contributor

@mbjones mbjones commented Mar 12, 2017


Author Name: Matt Jones (Matt Jones)
Original Redmine Issue: 6283, https://projects.ecoinformatics.org/ecoinfo/issues/6283
Original Date: 2013-12-06
Original Assignee: Matt Jones


Consider adding an optional top level field to eml-dataset to provide this, possibly something like:

`/eml/dataset/dataUsageCitation`, which would be of type `CitationType`

See discussion on eml-dev regarding this issue:
http://lists.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/2013-December/002004.html

Contributor Author

@mbjones mbjones commented Mar 12, 2017


Original Redmine Comment
Author Name: Matt Jones (Matt Jones)
Original Date: 2013-12-11T18:41:41Z


Here's a proposed element definition for the /eml/dataset/citation field. I have preliminarily checked this into EML trunk (r2344) for incorporation in the next release. As it is optional and at the end of the dataset fields, it should be fully backward compatible with prior versions of EML.

<xs:element name="citation" type="cit:CitationType" minOccurs="0" maxOccurs="unbounded">
  <xs:annotation>
    <xs:appinfo>
       <doc:tooltip>Data Citation</doc:tooltip>
       <doc:summary>A citation to articles or products in which the
       dataset is used or referenced.</doc:summary>
       <doc:description>A citation to articles or products in which the
       dataset is used or referenced. The citation element contains 
       general information about a literature resource that has used or
       references this dataset resource.
       </doc:description>
    </xs:appinfo>
  </xs:annotation>
</xs:element>

@mbjones mbjones added this to the EML2.2.0 milestone Mar 12, 2017
@mbjones mbjones added this to TODO in EML 2.2.0 Release Mar 12, 2017
@mbjones mbjones added enhancement and removed Status: New labels Mar 12, 2017
@mbjones mbjones moved this from TODO to High priority in EML 2.2.0 Release Apr 22, 2017
@mbjones mbjones mentioned this issue Apr 22, 2017
31 of 31 tasks complete
@mbjones mbjones removed the Category: eml label Jul 24, 2017
Contributor Author

@mbjones mbjones commented Jul 24, 2017

Checking this, it has already been merged into both master and the BRANCH_EML_2_2 branch with SHA 1ad87d3. So it can be closed and reviewed for release.

@mbjones mbjones moved this from High priority to In progress in EML 2.2.0 Release Jul 24, 2017
@mbjones mbjones self-assigned this Jul 24, 2017
@mbjones mbjones closed this Jul 24, 2017
@mbjones mbjones moved this from In progress to Completed in EML 2.2.0 Release Jul 24, 2017
Member

@csjx csjx commented Jul 24, 2017

After re-reading the thread in the eml-dev email list, I think we may need to re-open this issue in order to iron out the definition of this element. Carl (@cboettig) raised the issue that the definition should be a reference to the "canonical" paper associated with the dataset. Margaret (@mobb) and Wade brought up the issue that citing all articles that use this dataset is not realistic in that the information will go stale quickly. Before we write this in stone, let's be sure it's clearly defined. Carl, can you suggest improvements to the definition?

@csjx csjx reopened this Jul 24, 2017
Contributor Author

@mbjones mbjones commented Jul 24, 2017

@cboettig

Member

@cboettig cboettig commented Jul 24, 2017

Thanks for revisiting this.

I still think this is an important issue, but difficult to implement satisfactorily because the whole idea is more of a practical hack than a technically precise idea. We struggled with this recently in codeMeta, and settled on a user's suggestion of referencePublication, codemeta/codemeta#144.

To be clear, I think there's a difference in the mind of many researchers between a publication that is closely connected with the creation of a dataset, and other subsequent papers that also "use" the same data. This is the notion of "canonical" that I'm trying to get at which I think is not reflected in https://projects.ecoinformatics.org/ecoinfo/issues/6283.

I do not think a "canonical" citation link gets stale in the same way as a list of 'publications that use the data' does.

In Dryad, the concept is well-defined and clearly advertised on every data page: "Please cite the following publication as well as the dataset", since all Dryad datasets must be associated with a unique publication. It's unclear how to define this connection in EML.

In general, it seems this concept is a hack, best summarized as "please cite the following paper because the powers that be care a lot more about how many citations my papers get than my data, even though semantically / logically citing the data is more meaningful". The real problem of course is that the notion of "citation" is both semantically vague and fundamentally overloaded as a tool for communicating a provenance relationship and a metric for quality. I see the Dryad policy as essentially trying to split these roles: cite here (a paper) for metrics, cite here (data) for provenance, but it clearly leaves something to be desired.

Such a 'canonical' paper isn't purely a citation bucket of course; it is probably also a description of the collection, quality control, etc. of the data, i.e. it is essentially part of the metadata record, but I think EML already provides a mechanism to indicate that.

p.s. isn't there an unrelated issue here about how to associate two top-level EML objects? (e.g. dataset, software, literature, protocols). Seems like it might be reasonable to want to have both in the same EML document, or at least have a good vocab for expressing how they relate (or maybe ORE / PROV is already the solution there).

Contributor Author

@mbjones mbjones commented Jul 25, 2017

The citation field defined here is not meant to be a canonical citation field, but rather a data usage citation, defined as "A citation to articles or products in which the dataset is used or referenced."

I'm really not too enthusiastic about a canonical citation that is independent of our existing citation fields. Currently, an EML document contains all of the info needed to cite a dataset. A canonical citation that was separate from these core bibliographic fields would be redundant, and therefore would introduce confusion if the fields differed. Typically, a citation built from the existing fields would be in the following format (or equivalent using the same fields):

Creator(s). pubDate. Title. Publisher. packageId.

For a specific example from an EML document in the Arctic Data Center:

Dr. Matt Nolan, Austin S. Post, William Hauer, Alexander Zinck, and Shad O'Neel. 2017. Photogrammetric scans of aerial photographs of North American glaciers, 1975. Roll 2 jpegs. Arctic Data Center. doi:10.18739/A2H21W.
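
As a rough illustration of the Creator(s). pubDate. Title. Publisher. packageId. pattern, here is a minimal Python sketch that assembles such a string from an EML-like record. The record is a hypothetical, heavily simplified fragment (no namespaces, not schema-valid), and `format_citation` and `join_names` are illustrative helper names, not part of any EML tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified EML-like record (no namespaces, not schema-valid).
EML = """<eml packageId="doi:10.0000/EXAMPLE">
  <dataset>
    <title>Example glacier photo scans</title>
    <creator><individualName><givenName>Jane</givenName><surName>Doe</surName></individualName></creator>
    <creator><individualName><givenName>John</givenName><surName>Roe</surName></individualName></creator>
    <pubDate>2017</pubDate>
    <publisher><organizationName>Example Repository</organizationName></publisher>
  </dataset>
</eml>"""


def join_names(names):
    """Join names as 'A', 'A and B', or 'A, B, and C'."""
    if len(names) <= 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + ", and " + names[-1]


def format_citation(root):
    """Build 'Creator(s). pubDate. Title. Publisher. packageId.' from the record."""
    dataset = root.find("dataset")
    names = []
    for creator in dataset.findall("creator"):
        given = creator.findtext("individualName/givenName", default="").strip()
        sur = creator.findtext("individualName/surName", default="").strip()
        names.append(" ".join(part for part in (given, sur) if part))
    parts = [
        join_names(names),
        dataset.findtext("pubDate", default="").strip(),
        dataset.findtext("title", default="").strip(),
        dataset.findtext("publisher/organizationName", default="").strip(),
        root.get("packageId", ""),
    ]
    # Skip any field that is missing so the output has no empty segments.
    return ". ".join(p for p in parts if p) + "."


citation = format_citation(ET.fromstring(EML))
print(citation)
```

Because every piece comes from fields the record already carries, a separately stored copy of the same string could only repeat or contradict them, which is the redundancy concern raised above.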

Maybe we should call the element usageCitation rather than just citation to make the intent of the field clearer.

Member

@cboettig cboettig commented Jul 25, 2017

I agree with all of the previous comments that it seems somewhat backwards for the dataset metadata to provide a record of what papers have cited it. It's hard to imagine this being up-to-date as a metric of who has cited the data, or particularly useful to someone else using the data.

I also agree that the notice Dryad pastes on every dataset:

When using this data, please cite the original publication:

Morales MA, Zink AG (2017) Mechanisms of aggregation in an ant-tended treehopper: Attraction to mutualists is balanced by conspecific competition. PLOS ONE 12(7): e0181429. http://dx.doi.org/10.1371/journal.pone.0181429

Additionally, please cite the Dryad data package:

Morales MA, Zink AG (2017) Data from: Mechanisms of aggregation in an ant-tended treehopper: attraction to mutualists is balanced by conspecific competition. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.6pt0m

is redundant (particularly given the Dryad mechanism for titles for data!), and if someone is using the data it would seem logical to just cite the data. (Or arguably, just cite the paper, which has the citation to the data inside it.)

Yet despite embracing a very minimal metadata model elsewhere, Dryad clearly thought this concept was practically important to the community it wanted to serve, redundant or not. To the extent that this approach creates confusion, that confusion is already here whether or not we can express it in metadata. It would seem for some fraction of the community, "please cite X paper" is a meaningful bit of metadata, just as citing for purposes of attribution rather than provenance (e.g. citing 'the original' paper everyone else has also cited, even if you're only familiar with its contents from other papers you've read) is a recognized norm.

okay </rant>, sorry

Contributor Author

@mbjones mbjones commented Jul 25, 2017

@cboettig I agree that there is no way for the data usage citation list to remain complete, but the ones that are listed would be definitive examples of usage. The request for the data usage citation list is directly from ESA, which wants to require such a list as part of a submission to their revamped journal for data papers, as described in #269. Their idea is that a data set should be demonstrably usable and used before it is published as a data paper, and this field provides the evidence of that. So, I think it's important to include it.

I understand where you are coming from on the "please cite X paper" redundancy. It's not in general where I think the community of practice is heading, despite the Dryad example. But let's see if we can get other people to weigh in on whether an additional canonicalCitation field should be added (this discussion should really be in another ticket so as to not confuse the two requests for citations).

Contributor

@mobb mobb commented Jul 25, 2017

I think I would like to know generally how publishers/repositories/libraries record the relationships (between papers that use data and the data itself) before I comment. I think including the element at a high level (eml/dataset) specifically for ESA's use would be a mistake. A group of us at ESIP this week may have a chance to talk about it.

mbjones added a commit that referenced this issue Sep 9, 2017
Also updated the definition to reflect the intended purpose of the field.
@mbjones mbjones added review and removed review labels Oct 30, 2017
Contributor

@mobb mobb commented Feb 19, 2018

Adding a single element for relating a published paper has one use case (see example 1 below). It works well with datasets where there is a direct correspondence to a research paper. If that is what is intended, best to say so up front.

Further, the documentation should state what type of relationship is expected to be mapped from this node, e.g., I believe the simplest one is isCitedBy:
https://schema.datacite.org/meta/kernel-4.1/doc/DataCite-MetadataKernel_v4.1.pdf (p 26)
The EML documentation cannot be the least bit obtuse; that will invite misuse.

Even with that use case, requiring the EML constructor to build an entire citation tree may be hard to defend, when other biblio formats are easier to create, and the simplest representation (the paper's DOI) is not required by the EML citation schema.

For dataset management strategies that are not tightly coupled with research (i.e., independent pathways for data and research papers, typical of large research groups like LTER sites), this element will not work for most associations (example 2, below). Those are better done externally. Mainly, it’s a chicken-and-egg problem: if metadata are immutable, it is impossible to even add this element post-hoc without generating a new DOI and destroying the original linkage from the paper.

Examples:

  1. Here is a dataset that could have used this element, had it existed:
    https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=100
    The reference to the paper it supports is in the abstract. We waited till the last minute so that we could include the paper’s DOI.

  2. Here is a paper that cites SBC LTER datasets by their DOIs, with the independent pathways for data and research described above. The dataset metadata is immutable, so cannot be augmented with a paper reference.
    http://dx.doi.org/10.1038/ncomms13757
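
For a relationship recorded externally rather than in the EML itself, a DataCite-style record on the dataset side might carry a `relatedIdentifier` with `relationType="IsCitedBy"`. The Python sketch below builds such an element for the paper DOI from example 2. The helper name `related_identifier` is made up for illustration, and a real DataCite record would also need its namespace and the enclosing `relatedIdentifiers` wrapper.

```python
import xml.etree.ElementTree as ET


def related_identifier(doi, relation="IsCitedBy"):
    """Build a bare DataCite-style relatedIdentifier element for a DOI."""
    element = ET.Element(
        "relatedIdentifier",
        {"relatedIdentifierType": "DOI", "relationType": relation},
    )
    element.text = doi
    return element


# The paper from example 2, which cites SBC LTER datasets by their DOIs.
link = related_identifier("10.1038/ncomms13757")
print(ET.tostring(link, encoding="unicode"))
```

Because this link lives in a registry record that can be updated independently, it does not require reissuing the immutable dataset metadata.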

Member

@cboettig cboettig commented Feb 19, 2018

I agree with @mobb that this is a very specific use case that should be clearly identified, and that this is not really useful (or even compatible) with data management workflows that are not tightly coupled with particular research. I also agree that EML makes a somewhat cumbersome bibliographic platform, particularly without support for DOIs (personally I'd love to see EML match DataCite schema.org descriptions; I guess this could be done via the semantics extension, but that's neither here nor there).

I don't think isCitedBy is nearly specific enough to distinguish between a related paper that contains fundamental metadata about how and why the data were collected (i.e. the use case here), and any other of the myriad reasons one might cite the data (which sounds like way too dynamic a notion to be of any use, as basically everyone in this thread has already said). Of course this kind of information could belong in the EML itself (particularly in the appropriate Methods section, abstract, etc), but for many records -- e.g. all of Dryad -- it isn't, and researchers may feel they have good reasons to prefer to put that material in a "paper", just as EML allows them to keep all of the raw "data" files as separate files which it merely describes in a hands-off, neutral manner, without any attempt to insist those files are in some normalized or standard format. Maybe it would be more explicit to introduce such an element as an external resource that is part of methods (which can pretty much already be done in EML, if somewhat opaquely).

Contributor Author

@mbjones mbjones commented Feb 23, 2018

@cboettig wrote:

I don't think isCitedBy is nearly specific enough to distinguish between a related paper that contains fundamental metadata about how and why the data were collected (i.e. the use case here)

That is not the use case we are trying to support here. That idea of a referencePublication is a use case covered in issue #277. Let's please discuss issues around 'related' and 'reference' publications, such as how Dryad lists a reference publication, in that issue, and not here.

This ticket is to discuss usageCitation, which is explicitly intended to allow a non-comprehensive list of citations in which the data were explicitly used. There should be no ambiguity about the semantics of the field, in that only works in which the data were actually used should be included as a usageCitation.

Also, I agree that this list will never be comprehensive. Groups like Make Data Count are working on building services to collate lists of citations to data sets. We will always need external services like that. However, it's also reasonable to allow a dataset author to explicitly indicate one or more usageCitation examples, especially given that they were likely the first to use the data and the papers probably have particular relevance to understanding the data set. The ESA committee focused on data citation thinks this is critical metadata to understand how to use a data set, and so I am strongly inclined to include this element to facilitate that community application of EML.
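
To make the intended shape concrete, here is a minimal Python sketch that reads a non-comprehensive list of usageCitation entries from a dataset. The fragment is hypothetical and simplified (no namespaces, not schema-valid), the child elements shown (`title`, `creator`, `pubDate`) are assumed to follow the usual citation fields, and `usage_citations` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified dataset fragment with two usage citations.
DATASET = """<dataset>
  <title>Hypothetical kelp forest survey data</title>
  <usageCitation>
    <title>A hypothetical paper that analysed the survey data</title>
    <creator><individualName><surName>Doe</surName></individualName></creator>
    <pubDate>2018</pubDate>
  </usageCitation>
  <usageCitation>
    <title>A second hypothetical paper reusing the data</title>
    <creator><individualName><surName>Roe</surName></individualName></creator>
    <pubDate>2019</pubDate>
  </usageCitation>
</dataset>"""


def usage_citations(dataset):
    """Return one small dict per usageCitation child of the dataset element."""
    results = []
    for citation in dataset.findall("usageCitation"):
        results.append({
            "title": citation.findtext("title", default="").strip(),
            "author": citation.findtext(
                "creator/individualName/surName", default="").strip(),
            "year": citation.findtext("pubDate", default="").strip(),
        })
    return results


citations = usage_citations(ET.fromstring(DATASET))
for entry in citations:
    print(f'{entry["author"]} ({entry["year"]}): {entry["title"]}')
```

An empty result says nothing about whether the data have been used, since the list is explicitly non-comprehensive; external services would still be needed for complete citation counts.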

@mbjones mbjones closed this Apr 25, 2018
@mbjones mbjones removed the needs-review label Apr 25, 2018
mbjones added a commit that referenced this issue Apr 25, 2018
These include a new ability to use Bibtex citation format both within
the `citation` element, and within a new `bibtex` element, to create
lists of refs using these in a literatureCited element (#300), as well
as in usageCitation (#259), and referencePublication (#277) elements.
All of this helps support data papers (#269), for which pandoc-style
citation keys can be used to cite these references in the text of
Markdown blocks in the EML document.  Added these features as
demonstrations in the eml-data-paper.xml sample document.
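
A minimal Python sketch of how BibTeX entries and pandoc-style citation keys could be tied together, assuming a `bibtex` child inside `usageCitation`. The XML fragment, the Markdown string, and the helper names are all hypothetical, and the regular expressions cover only simple keys, not the full BibTeX or pandoc grammars.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: one usage citation given as BibTeX.
DATASET = """<dataset>
  <usageCitation>
    <bibtex>
@article{doe2018example,
  author = {Doe, Jane},
  title = {A hypothetical paper that used the data},
  journal = {Journal of Examples},
  year = {2018}
}
    </bibtex>
  </usageCitation>
</dataset>"""

# Hypothetical Markdown text using pandoc-style citation keys.
MARKDOWN = ("The data were first analysed in [@doe2018example] "
            "and later in [@roe2019missing].")

# Entry key: the token between '{' and the first ',' of a BibTeX entry.
KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
# Simplified pandoc key: '@' not preceded by a word character.
CITE_RE = re.compile(r"(?<!\w)@([A-Za-z0-9_][A-Za-z0-9_:-]*)")


def bibtex_keys(dataset):
    """Collect the entry keys of every bibtex element under the dataset."""
    keys = set()
    for element in dataset.iter("bibtex"):
        keys.update(KEY_RE.findall(element.text or ""))
    return keys


def undefined_keys(markdown, defined):
    """Return cited keys, in order of appearance, that have no BibTeX entry."""
    return [key for key in CITE_RE.findall(markdown) if key not in defined]


defined = bibtex_keys(ET.fromstring(DATASET))
missing = undefined_keys(MARKDOWN, defined)
print("defined:", sorted(defined))
print("cited but not defined:", missing)
```

A check like this could flag Markdown citations that have no matching reference before a data paper is submitted.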
@mbjones mbjones added hacktoberfest and removed hacktoberfest labels Jun 29, 2018