Inf 595/update oa schemas #164

kathrynnapier · 2023-04-04T14:43:06Z

Schema descriptions updated for the aggregate and doi json files.

New files created as new fields have been added to both schemas.

Aggregate schema based off the current author table, as it had the most fields (please let me know if this should be changed!).

Formatting improvements to come in future updates. If the descriptions for the most part make sense, it would be good to get them into the bigquery tables sooner rather than later and we can continue to improve over time.

Schemas have been uploaded to coki-scratch-space.Kathryn.test_agg_schema and coki-scratch-space.Kathryn.test_doi_schema as a test and for ease of viewing.

kathrynnapier · 2023-04-04T14:46:22Z

And my sincere apologies for the mess I made by creating the branch off a VERY old version of develop!

codecov · 2023-04-04T15:06:26Z

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (ffa9d4d) 95.18% compared to head (0059e67) 95.22%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #164      +/-   ##
===========================================
+ Coverage    95.18%   95.22%   +0.04%     
===========================================
  Files           20       20              
  Lines         5209     5238      +29     
  Branches       720      727       +7     
===========================================
+ Hits          4958     4988      +30     
  Misses         161      161              
+ Partials        90       89       -1

Files	Coverage Δ
...ic_observatory_workflows/workflows/doi_workflow.py	`94.35% <100.00%> (+0.60%)`	⬆️

... and 1 file with indirect coverage changes


keegansmith21 · 2023-04-05T02:14:57Z

@jdddog do the new schemas necessarily need to have updated dates?

jdddog

Thanks for the PR Kathryn.

The structure of the DOI and aggregate tables was updated after the Refactor DOI table PR #171, so the schemas will need updating.

I think we should generate parts of the DOI table schema from the source datasets' JSON schema files, specifically the crossref, unpaywall, openalex, open_citations and pubmed structs. Otherwise the schema will get out of date and the table will not load. The events struct is an exception, as it does not match what the original Crossref Events data looks like.
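A rough sketch of what generating the struct parts from source schema files could look like. It assumes each source schema is a BigQuery-style JSON file containing a list of field definitions; the function names (`load_schema`, `make_struct`, `build_doi_schema`) and the file layout are hypothetical and not taken from this PR.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_schema(path: Path) -> List[dict]:
    """Read a BigQuery-style JSON schema file (a list of field definitions)."""
    with open(path) as f:
        return json.load(f)


def make_struct(name: str, fields: List[dict], description: str = "") -> dict:
    """Wrap a source table's fields in a nullable RECORD for use inside the DOI table."""
    return {
        "name": name,
        "type": "RECORD",
        "mode": "NULLABLE",
        "description": description,
        "fields": fields,
    }


def build_doi_schema(base_fields: List[dict], source_schema_paths: Dict[str, Path]) -> List[dict]:
    """Append one struct per source dataset, with fields read from that dataset's schema file.

    base_fields holds the hand-written top-level fields (hypothetical here);
    source_schema_paths maps a struct name, e.g. "unpaywall", to its schema file.
    """
    schema = list(base_fields)
    for struct_name, path in source_schema_paths.items():
        schema.append(make_struct(struct_name, load_schema(path)))
    return schema
```

Because the struct fields are read from the source files each time, a change to a source schema would flow into the DOI schema without a manual edit.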

The schemas don't need dates as dated schemas are only used when backfilling older versions of a dataset.
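To illustrate the point about dated schemas, here is a minimal sketch of how a loader might choose between dated and undated schema files. The naming convention (`<table>.json` and `<table>_<YYYY-MM-DD>.json`), the fallback behaviour and the function `pick_schema` are all assumptions made for the example, not the platform's actual lookup code.

```python
from datetime import date
from typing import List, Optional


def pick_schema(filenames: List[str], table_name: str, release_date: Optional[date] = None) -> Optional[str]:
    """Pick a schema file for a table (hypothetical naming convention).

    Without a release date, use the undated schema. With one (e.g. when
    backfilling an older release), use the newest dated schema that is not
    later than the release, falling back to the undated schema.
    """
    undated = f"{table_name}.json"
    if release_date is None:
        return undated if undated in filenames else None

    prefix = f"{table_name}_"
    dated = []
    for name in filenames:
        if name.startswith(prefix) and name.endswith(".json"):
            stamp = name[len(prefix):-len(".json")]
            try:
                dated.append((date.fromisoformat(stamp), name))
            except ValueError:
                continue  # not a dated schema for this table

    candidates = [item for item in dated if item[0] <= release_date]
    if candidates:
        return max(candidates)[1]
    return undated if undated in filenames else None
```

Under this assumption, a table that is only ever loaded at its current version needs just the undated file.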

jdddog · 2023-09-06T02:57:08Z

> @jdddog do the new schemas necessarily need to have updated dates?

Yeah the schemas don't need dates as dated schemas are only used when backfilling older versions of a dataset.


alexmassen-hane · 2023-09-27T08:18:24Z

As requested, I have added a function to create the DOI schema based off definitions instead of pulling it from the doi_.json file. The paths to the schemas are now attached to the SQLQuery object instead of being passed down with the kwargs for each parallel task.
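For anyone skimming, the idea of carrying the schema path on the query object rather than in task kwargs can be sketched roughly as below. `QuerySpec`, `run_parallel_task` and the paths are hypothetical stand-ins for illustration only; they are not the actual SQLQuery class or task code in this PR.

```python
from dataclasses import dataclass


@dataclass
class QuerySpec:
    """Hypothetical stand-in for the workflow's SQLQuery object."""

    name: str
    output_table: str
    schema_path: str  # each query carries the location of its own schema


def run_parallel_task(query: QuerySpec) -> str:
    """Stand-in for a parallel task.

    A real task would run the query and load the result using the schema;
    this sketch only reports which schema the task would use.
    """
    return f"{query.output_table} <- {query.schema_path}"


# Illustrative paths only.
queries = [
    QuerySpec("crossref_events", "events", "schemas/intermediate/events.json"),
    QuerySpec("open_citations", "open_citations", "schemas/intermediate/open_citations.json"),
]
results = [run_parallel_task(q) for q in queries]
```

With this shape, each task only needs the query object, so the task signature does not change when a schema location does.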

For tables such as Unpaywall, Pubmed and OpenAlex, all of the fields from their respective source tables are brought into the DOI table. However, the Crossref Events (events), open_citations, coki and affiliation parts of the DOI table have been separated out into their own schemas and placed in the "intermediate" folder, as they all contain calculated fields produced in either the "intermediate_" task or the "create_doi" stage of the workflow.

The definition of the Crossref metadata part of the schema is messy in comparison as it uses a combination of a few original fields and calculates new ones when creating the intermediate table.

The schemas for the aggregate tables still need to be addressed. I will separate them out into their own schemas soon.

kathrynnapier requested review from jdddog and keegansmith21 April 4, 2023 14:43

keegansmith21 force-pushed the INF-595/update-oa-schemas branch from f9bdf01 to 6c33b03 Compare April 4, 2023 23:42

jdddog mentioned this pull request Sep 6, 2023

INF-648: Add Pubmed to DOI table #182

Merged

jdddog requested changes Sep 6, 2023


kathrynnapier and others added 7 commits September 22, 2023 12:25

INF-595 doi table descriptions, crossref, openalex, mag, open_citatio…

0cc3c14

…ns, events completed

started including coki and unpaywall fields

d912283

INF-595 completion of affiliations

148b251

INF-595 minor modifications to formatting

5355efd

INF-595 schema descriptions for aggregate tables

80bc0bd

INF-595: Added member to aggregate schema

d0b826d

Create doi schema from source schemas

0059e67

alexmassen-hane force-pushed the INF-595/update-oa-schemas branch from 9a0230e to 0059e67 Compare September 27, 2023 07:53

alexmassen-hane marked this pull request as draft September 27, 2023 08:04
