Inf 595/update oa schemas #164

Draft
wants to merge 7 commits into
base: main

Conversation

kathrynnapier
Contributor

Schema descriptions updated for the aggregate and doi json files.

New files created as new fields have been added to both schemas.

The aggregate schema is based on the current author table, as it has the most fields (please let me know if this should be changed!).

Formatting improvements will come in future updates. If the descriptions mostly make sense, it would be good to get them into the BigQuery tables sooner rather than later, and we can continue to improve them over time.

The schemas have been uploaded to coki-scratch-space.Kathryn.test_agg_schema and coki-scratch-space.Kathryn.test_doi_schema as a test and for ease of viewing.

@kathrynnapier
Contributor Author

And my sincere apologies for the mess I made by creating the branch off a VERY old version of develop!

@codecov

codecov bot commented Apr 4, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (ffa9d4d) 95.18% compared to head (0059e67) 95.22%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #164      +/-   ##
===========================================
+ Coverage    95.18%   95.22%   +0.04%     
===========================================
  Files           20       20              
  Lines         5209     5238      +29     
  Branches       720      727       +7     
===========================================
+ Hits          4958     4988      +30     
  Misses         161      161              
+ Partials        90       89       -1     
Files Coverage Δ
...ic_observatory_workflows/workflows/doi_workflow.py 94.35% <100.00%> (+0.60%) ⬆️

... and 1 file with indirect coverage changes


@keegansmith21 keegansmith21 force-pushed the INF-595/update-oa-schemas branch from f9bdf01 to 6c33b03 Compare April 4, 2023 23:42
@keegansmith21
Contributor

@jdddog do the new schemas necessarily need to have updated dates?

Contributor

@jdddog jdddog left a comment

Thanks for the PR Kathryn.

The structure of the DOI and aggregate tables was updated after the Refactor DOI table PR #171, so the schemas will need updating.

I think that we should generate parts of the DOI table schema, specifically the crossref, unpaywall, openalex, open_citations and pubmed structs using their JSON schema files, otherwise the schema will get out of date and the table will not load. The events struct is an exception as this is not actually what the original Crossref Events data looks like.
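A minimal sketch of what generating those structs could look like, assuming each source dataset has a BigQuery-style JSON schema file (a list of field definitions). The function names `load_schema`, `make_struct` and `build_doi_schema` are hypothetical and are not taken from the workflow code:

```python
import json
from pathlib import Path
from typing import Dict, List


def load_schema(path: Path) -> List[dict]:
    """Read a BigQuery-style JSON schema file (a list of field definitions)."""
    with open(path, "r") as f:
        return json.load(f)


def make_struct(name: str, fields: List[dict], description: str = "") -> dict:
    """Wrap a list of fields in a nullable RECORD so it can be nested in a table schema."""
    return {
        "name": name,
        "type": "RECORD",
        "mode": "NULLABLE",
        "description": description,
        "fields": fields,
    }


def build_doi_schema(source_fields: Dict[str, List[dict]], top_level_fields: List[dict]) -> List[dict]:
    """Assemble a DOI table schema: hand-written top-level fields first,
    then one struct per source dataset, built from that dataset's own schema."""
    schema = list(top_level_fields)
    for struct_name, fields in source_fields.items():
        schema.append(make_struct(struct_name, fields))
    return schema
```

In use, each entry of `source_fields` would come from something like `load_schema(Path("unpaywall.json"))` (the file name is a placeholder), so a change to a source schema flows into the DOI schema without a manual edit. The events struct would still need a hand-written definition, for the reason given above.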

The schemas don't need dates as dated schemas are only used when backfilling older versions of a dataset.

@jdddog
Contributor

jdddog commented Sep 6, 2023

@jdddog do the new schemas necessarily need to have updated dates?

Yeah, the schemas don't need dates, as dated schemas are only used when backfilling older versions of a dataset.

@alexmassen-hane alexmassen-hane force-pushed the INF-595/update-oa-schemas branch from 9a0230e to 0059e67 Compare September 27, 2023 07:53
@alexmassen-hane alexmassen-hane marked this pull request as draft September 27, 2023 08:04
@alexmassen-hane
Collaborator

As requested, I have added a function to create the DOI schema based off definitions instead of pulling it from the doi_.json file. The paths to the schemas are now attached to the SQLQuery object instead of being passed down with the kwargs for each parallel task.
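To illustrate the second change, here is a rough sketch of the pattern rather than the actual workflow code. `QuerySpec` is a hypothetical stand-in for the SQLQuery object, and its fields are assumptions. The point is that each parallel task reads the schema path from the object it is handed, so nothing extra has to travel through kwargs:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QuerySpec:
    """Hypothetical stand-in for SQLQuery: the schema path travels with the query."""
    name: str
    sql: str
    schema_path: str


def run_task(query: QuerySpec) -> str:
    # Stand-in for running the query and loading the result with its schema;
    # here it only reports which schema would be used.
    return f"{query.name}: schema={query.schema_path}"


def run_all(queries: List[QuerySpec]) -> List[str]:
    # Each task receives one self-contained object; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_task, queries))
```

Bundling the path with the query means a task cannot be launched with a query and a mismatched schema, which is easy to do when the two are passed separately.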

For tables such as Unpaywall, PubMed and OpenAlex, all of the fields from their respective source tables are brought into the DOI table. However, the Crossref Events (events), open_citations, coki and affiliation parts of the DOI table have been separated out into their own schemas and placed in the "intermediate" folder, as they all contain calculated fields produced in either the "intermediate_" task or the "create_doi" stage of the workflow.

The definition of the Crossref metadata part of the schema is messy in comparison as it uses a combination of a few original fields and calculates new ones when creating the intermediate table.

The schemas for the aggregate tables still need to be addressed. I will separate them out into their own schemas soon.


4 participants