Suggestion to include variable 'subtype' from Crossref #21

Open
bmkramer opened this issue Jul 28, 2021 · 4 comments
@bmkramer

Crossref metadata contain the variable 'subtype' for records with publication type 'posted-content'. Including this variable allows e.g. distinguishing preprints from other types of posted-content in downstream analysis.
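As a rough illustration of the downstream use described above, the sketch below separates preprints from other posted-content and counts records per subtype. The sample records, the DOIs, and the helper names `is_preprint` and `count_posted_content_subtypes` are made up for this example, and the value `preprint` is assumed from Crossref's list of posted-content types.

```python
from collections import Counter


def is_preprint(record):
    """True if the record is posted-content with subtype 'preprint' (assumed value)."""
    return record.get("type") == "posted-content" and record.get("subtype") == "preprint"


def count_posted_content_subtypes(records):
    """Count subtypes among posted-content records; a missing subtype is counted as None."""
    return Counter(
        r.get("subtype") for r in records if r.get("type") == "posted-content"
    )


# Made-up sample records, not real Crossref data.
records = [
    {"DOI": "10.0000/example-1", "type": "posted-content", "subtype": "preprint"},
    {"DOI": "10.0000/example-2", "type": "posted-content", "subtype": "other"},
    {"DOI": "10.0000/example-3", "type": "journal-article"},
    {"DOI": "10.0000/example-4", "type": "posted-content"},
]

preprint_dois = [r["DOI"] for r in records if is_preprint(r)]
subtype_counts = count_posted_content_subtypes(records)
```

Without the 'subtype' field, all three posted-content records in this sample would be indistinguishable.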

cc @cameronneylon

@bmkramer
Author

bmkramer commented Jul 28, 2021

NB In the JSON response of the Crossref REST API, this is a top-level field, see e.g. https://api.crossref.org/works/10.5194/acp-2015-1010 line 214
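To check this for a single DOI, something like the following could be used. It is only a sketch: it assumes the REST API wraps the work record in a `message` object, so "top-level" means top-level within that record. The names `fetch_work` and `get_subtype` are illustrative, not part of any library.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.crossref.org/works/{doi}"


def fetch_work(doi):
    """Fetch the raw JSON response for one DOI (requires network access)."""
    url = API_URL.format(doi=urllib.parse.quote(doi))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def get_subtype(response):
    """Return the 'subtype' of the work record, or None if it is absent.

    Assumes the work record sits under the 'message' key of the response.
    """
    work = response.get("message", {})
    return work.get("subtype")
```

For example, `get_subtype(fetch_work("10.5194/acp-2015-1010"))` would return the subtype for the work linked above, if the API still reports one.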

However, Crossref describes it here as an attribute of posted-content called 'type': https://crossref.org/documentation/content-registration/content-type-markup-guide/posted-content-includes-preprints/#00084, and this is also how it's documented in the schema documentation: https://data.crossref.org/reports/help/schema_doc/4.4.2/schema_4_4_2.html#posted_content

So... my simple thought was that it could be added to
observatory-dags/observatory/dags/database/schema/crossref_metadata_2021-01-01.json as such:
{
    "mode": "NULLABLE",
    "name": "subtype",
    "type": "STRING",
    "description": "Enumeration, one of the type ids from https://data.crossref.org/reports/help/schema_doc/4.4.2/schema_4_4_2.html#posted_content."
},

... but that assumes the telescope workflow is using the REST API json result structure.
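A minimal sketch of that schema change, assuming the schema file is a JSON array of BigQuery-style field definitions, as the snippet above suggests. The helper name `add_field` and the tiny in-memory schema are hypothetical; this is not how the platform manages its schema files.

```python
SUBTYPE_FIELD = {
    "mode": "NULLABLE",
    "name": "subtype",
    "type": "STRING",
    "description": (
        "Enumeration, one of the type ids from "
        "https://data.crossref.org/reports/help/schema_doc/4.4.2/"
        "schema_4_4_2.html#posted_content."
    ),
}


def add_field(schema, field):
    """Return a copy of the schema with `field` appended, unless the name already exists."""
    if any(existing["name"] == field["name"] for existing in schema):
        return list(schema)
    return list(schema) + [field]


# Hypothetical, heavily reduced schema for illustration only.
schema = [{"mode": "NULLABLE", "name": "type", "type": "STRING"}]
updated = add_field(schema, SUBTYPE_FIELD)
```

Because the field is `NULLABLE`, records without a subtype (anything that is not posted-content) would still load.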

See pull request The-Academic-Observatory/observatory-platform#456

@aroelo
Collaborator

aroelo commented Jul 29, 2021

Hi @bmkramer,

Thanks for your pull request!
I can see we currently have the 'type' field in our schema, for which one of the options is 'posted-content'.

I haven't seen the 'subtype' field in our data so far though. The way our data pipeline is set up, it should give an error when there is a field in the data that is not in our schema. The latest snapshot from 2021-05-01 was loaded into BigQuery without any issues.
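The kind of check described here can be sketched as a comparison of record keys against the schema. This is an illustration only, assuming BigQuery-style schema entries where nested records carry a `fields` list; the function name `unknown_fields` is made up and this is not necessarily how the pipeline detects the mismatch.

```python
def unknown_fields(record, schema, prefix=""):
    """List dotted paths of keys in `record` that have no matching schema field."""
    known = {field["name"]: field for field in schema}
    unknown = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        field = known.get(key)
        if field is None:
            unknown.append(path)
            continue
        nested = field.get("fields")
        if nested:
            # Nested records may be a single object or a repeated list of objects.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict):
                    unknown.extend(unknown_fields(item, nested, prefix=path + "."))
    return unknown
```

A record carrying 'subtype' would be reported against a schema that lacks it, which is consistent with the observation that the 2021-05-01 snapshot loaded cleanly only if the field was absent from the data.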

We get our data from the Crossref Metadata Plus snapshots that are available (https://www.crossref.org/documentation/metadata-plus/metadata-plus-snapshots/). The snapshots are in JSON format and are served through the REST API, but via the /snapshot route, so perhaps a different schema is used for the output there?

The 'subtype' field is not listed in the document here: https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md, but perhaps this document is outdated.

Do you or @cameronneylon know who would be the best person to contact, so I could ask some questions about which schema is used for the snapshots and whether we should be getting the 'subtype' field as well?

EDIT: I found the same work from your example (https://api.crossref.org/works/10.5194/acp-2015-1010) in the Metadata Plus snapshot and there is no 'subtype' field there. I suspect the schema for the snapshot route of the API is slightly different, and @cameronneylon is looking further into this.

@aroelo aroelo self-assigned this Jul 29, 2021
@bmkramer
Author

Thanks @aroelo, and yes, we have asked Crossref about this. To be continued!

@rhosking rhosking transferred this issue from The-Academic-Observatory/observatory-platform Aug 31, 2021
@aroelo
Collaborator

aroelo commented Aug 31, 2021

Update:
We contacted Crossref, and they informed us that this field should be included in the new snapshots soon, in either September or October.
The new snapshots should pull data directly from the REST API, so I assume they will then be similar in format to the REST API output.
