add field to reference data usage citations #259
Original Redmine Comment Here's a proposed element definition for the /eml/dataset/citation field. I have preliminarily checked this into EML trunk (r2344) for incorporation in the next release. As it is optional and at the end of the dataset fields, it should be fully backward compatible with prior versions of EML.
|
On checking, this has already been merged into both master and the BRANCH_EML_2_2 branch with SHA 1ad87d3, so it can be closed and reviewed for release. |
After re-reading the thread in the eml-dev email list, I think we may need to re-open this issue in order to iron out the definition of this element. Carl (@cboettig) raised the issue that the definition should be a reference to the "canonical" paper associated with the dataset. Margaret (@mobb) and Wade brought up the issue that citing all articles that use this dataset is not realistic, in that the information will go stale quickly. Before we write this in stone, let's be sure it's clearly defined. Carl, can you suggest improvements to the definition? |
Here's the link to that thread in eml-dev for reference: http://lists.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/2013-December/thread.html#2004 |
Thanks for revisiting this. I still think this is an important issue, but difficult to implement satisfactorily because the whole idea is more of a practical hack than a technically precise idea. We struggled with this recently in codeMeta, and settled on a user's suggestion of To be clear, I think there's a difference in the mind of many researchers between a publication that is closely connected with the creation of a dataset, and other subsequent papers that also "use" the same data. This is the notion of "canonical" that I'm trying to get at, which I think is not reflected in https://projects.ecoinformatics.org/ecoinfo/issues/6283. I do not think a "canonical" citation link gets stale in the same way as a list of 'publications that use the data' does. In Dryad, the concept is well-defined and clearly advertised on every data page: "Please cite the following publication as well as the dataset", since all Dryad datasets must be associated with a unique publication. It's unclear how to define this connection in EML. In general, it seems this concept is a hack, best summarized as "please cite the following paper because the powers-that-be care a lot more about how many citations my papers get than my data, even though semantically / logically citing the data is more meaningful". The real problem of course is that the notion of "citation" is both semantically vague and fundamentally overloaded as a tool for communicating a provenance relationship and a metric for quality. I see the Dryad policy as essentially trying to split these roles: cite here (a paper) for metrics, cite here (data) for provenance, but it clearly leaves something to be desired. Such a 'canonical' paper isn't purely a citation bucket of course; it is probably also a description of the collection, quality control, etc. of the data, i.e. it is essentially part of the metadata record, but I think EML already provides a mechanism to indicate that.
P.S. Isn't there an unrelated issue here about how to associate two top-level EML objects (e.g. dataset, software, literature, protocols)? It seems like it might be reasonable to want to have both in the same EML document, or at least to have a good vocab for expressing how they relate (or maybe ORE / PROV is already the solution there). |
I'm really not too enthusiastic about a canonical citation that is independent of our existing citation fields. Currently, an EML document contains all of the info needed to cite a dataset. A canonical citation that was separate from these core bibliographic fields would be redundant, and therefore would introduce confusion if the fields differed. Typically, this would be in the following format (or equivalent using the same fields):
For a specific example from an EML document in the Arctic Data Center:
Maybe we should call the element |
I agree with all of the previous comments that it seems somewhat backwards for the dataset metadata to provide a record of what papers have cited it. It's hard to imagine this being up-to-date as a metric of who has cited the data, or particularly useful to someone else using the data. I also agree that the notice Dryad pastes on every dataset:
is redundant (particularly given the Dryad mechanism for titles for data!), and if someone is using the data it would seem logical to just cite the data. (Or arguably, just cite the paper, which has the citation to the data inside it.) Yet despite embracing a very minimal metadata model elsewhere, Dryad clearly thought this concept was practically important to the community it wanted to serve, redundant or not. To the extent that this approach creates confusion, that confusion is already here whether or not we can express it in metadata. It would seem that for some fraction of the community, "please cite X paper" is a meaningful bit of metadata, just as citing for purposes of attribution rather than provenance (e.g. citing 'the original' paper everyone else has also cited, even if you're only familiar with its contents from other papers you've read) is a recognized norm. okay |
@cboettig I agree that there is no way for the data usage citation list to remain complete, but the ones that are listed would be definitive examples of usage. The request for the data usage citation list comes directly from ESA, who want to require such a list as part of a submission to their revamped journal for data papers, as described in #269. Their idea is that a data set should be demonstrably usable and used before it is published as a data paper, and this field provides the evidence of that. So, I think it's important to include it. I understand where you are coming from on the "please cite X paper" redundancy. It's not in general where I think the community of practice is heading, despite the Dryad example. But let's see if we can get other people to weigh in on whether an additional |
I think I would like to know generally how publishers/repositories/libraries record the relationships (between papers that use data and the data itself) before I comment. I think including the element |
Also updated the definition to reflect the intended purpose of the field.
Adding a single element for relating a published paper has one use case (see example 1 below). It works well with datasets where there is a direct correspondence to a research paper. If that is what is intended, it is best to say so up front. Further, the documentation should state what type of relationship is expected to be mapped from this node, e.g., I believe the simplest one is isCitedBy. Even with that use case, requiring the EML constructor to build an entire citation tree may be hard to defend when other biblio formats are easier to create, and the simplest representation (the paper's DOI) is not required by the EML citation schema. For dataset management strategies that are not tightly coupled with research (i.e., independent pathways for data and research papers, typical of large research groups like LTER sites), this element will not work for most associations (example 2, below). Those are better done externally. Mainly, it’s a chicken-egg problem: if metadata are immutable, it is impossible to even add this element post-hoc without generating a new DOI and destroying the original linkage from the paper. Examples:
|
I agree with @mobb that this is a very specific use case that should be clearly identified, and that this is not really useful (or even compatible) with data management workflows that are not tightly coupled with particular research. I also agree that EML makes a somewhat cumbersome bibliographic platform, particularly without support for DOIs (personally I'd love to see EML match DataCite schema.org descriptions; I guess this could be done via the semantics extension, but that's neither here nor there). I don't think |
@cboettig wrote:
That is not the use case we are trying to support here. That idea of a This ticket is to discuss Also, I agree that this list will never be comprehensive. Groups like Make Data Count are working on building services to collate lists of citations to data sets. We will always need external services like that. However, it's also reasonable to allow a dataset author to explicitly indicate one or more |
These include a new ability to use the BibTeX citation format both within the `citation` element and within a new `bibtex` element, to create lists of refs using these in a literatureCited element (#300), as well as in usageCitation (#259) and referencePublication (#277) elements. All of this helps support data papers (#269), for which pandoc-style citation keys can be used to cite these references in the text of Markdown blocks in the EML document. Added these features as demonstrations in the eml-data-paper.xml sample document.
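As a rough illustration of how pandoc-style citation keys in a Markdown block might be matched against BibTeX entries, here is a minimal Python sketch. It is not taken from the EML tooling: the sample BibTeX entries, the Markdown text, the helper names `bibtex_keys` and `cited_keys`, and the simplified regular expressions are all assumptions made for this example, and the key pattern covers only the common `@key` form rather than the full pandoc syntax.

```python
import re

# Simplified patterns (assumptions for illustration, not the full grammars).
BIBTEX_ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
PANDOC_KEY = re.compile(r"(?<!\w)@([A-Za-z0-9_][A-Za-z0-9_:.-]*)")

# Hypothetical BibTeX text, such as might be carried by a bibtex element.
bibtex = """
@article{doe2015,
  title = {Placeholder title one},
  author = {Doe, Jane},
  year = {2015}
}
@article{roe2017,
  title = {Placeholder title two},
  author = {Roe, Richard},
  year = {2017}
}
"""

# Hypothetical Markdown block that cites references by key.
markdown = (
    "These data were first used by [@doe2015] "
    "and later by [@roe2017; @missing2020]."
)


def bibtex_keys(text):
    """Return the set of entry keys declared in a BibTeX string."""
    return {match.group(2) for match in BIBTEX_ENTRY.finditer(text)}


def cited_keys(md):
    """Return citation keys in order of appearance in a Markdown string."""
    return [m.group(1).rstrip(".:-") for m in PANDOC_KEY.finditer(md)]


# Keys cited in the text that have no matching BibTeX entry.
declared = bibtex_keys(bibtex)
unresolved = [key for key in cited_keys(markdown) if key not in declared]
print(unresolved)
```

A check like this could flag citation keys in a data paper's text that have no corresponding reference; a real implementation would more likely rely on a proper BibTeX parser and on pandoc itself.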
Author Name: Matt Jones (Matt Jones)
Original Redmine Issue: 6283, https://projects.ecoinformatics.org/ecoinfo/issues/6283
Original Date: 2013-12-06
Original Assignee: Matt Jones
Consider adding an optional top level field to eml-dataset to provide this, possibly something like:
`/eml/dataset/dataUsageCitation`, which would be of type `CitationType`
See discussion on eml-dev regarding this issue:
http://lists.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/2013-December/002004.html
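To make the proposal concrete, the following Python sketch appends an optional `dataUsageCitation` element as the last child of a `dataset` element. Only the element name `dataUsageCitation` and its position at the end of the dataset fields come from this ticket; the helper `add_data_usage_citation`, the child elements shown, and all placeholder values are assumptions for illustration, and the output is not claimed to validate against the EML schema.

```python
import xml.etree.ElementTree as ET


def add_data_usage_citation(dataset, title, surname):
    """Append a dataUsageCitation element as the last child of dataset.

    The child structure (title, creator/individualName/surName) is an
    illustrative guess at a citation-like record, not a schema-checked one.
    """
    citation = ET.SubElement(dataset, "dataUsageCitation")
    ET.SubElement(citation, "title").text = title
    creator = ET.SubElement(citation, "creator")
    name = ET.SubElement(creator, "individualName")
    ET.SubElement(name, "surName").text = surname
    return citation


# A placeholder dataset with one existing field.
dataset = ET.Element("dataset")
ET.SubElement(dataset, "title").text = "Example dataset (placeholder)"

# The new element is optional; documents that omit it are unchanged.
add_data_usage_citation(
    dataset, "A paper that used this dataset (placeholder)", "Doe"
)

print(ET.tostring(dataset, encoding="unicode"))
```

Because the element is appended after the existing children and is never required, documents written without it keep the same structure, which is the basis of the backward-compatibility argument.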