Inf 595/update oa schemas #164
base: main
Conversation
And, my sincere apologies for the mess I created when branching off a VERY old version of develop!
Codecov Report: All modified lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## develop #164 +/- ##
===========================================
+ Coverage 95.18% 95.22% +0.04%
===========================================
Files 20 20
Lines 5209 5238 +29
Branches 720 727 +7
===========================================
+ Hits 4958 4988 +30
Misses 161 161
+ Partials 90 89 -1
☔ View full report in Codecov by Sentry.
Force-pushed from f9bdf01 to 6c33b03
@jdddog do the new schemas necessarily need to have updated dates?
Thanks for the PR Kathryn.
The structure of the DOI and aggregate tables was updated after the Refactor DOI table PR #171, so the schemas will need updating.
I think that we should generate parts of the DOI table schema, specifically the crossref, unpaywall, openalex, open_citations and pubmed structs using their JSON schema files, otherwise the schema will get out of date and the table will not load. The events struct is an exception as this is not actually what the original Crossref Events data looks like.
The schemas don't need dates as dated schemas are only used when backfilling older versions of a dataset.
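The idea of generating the struct parts of the DOI table schema from the sources' own JSON schema files could be sketched roughly as below. This is only an illustrative sketch, not the actual code in this repository: the function names (`load_schema_fields`, `make_struct`, `build_doi_schema`) and the schema-file layout (a plain list of BigQuery field dicts) are assumptions.

```python
import json

# Hypothetical sketch: read a source dataset's BigQuery JSON schema file
# (assumed to be a list of field dicts) and wrap its fields as a nested
# RECORD in the DOI table schema, so the struct can be regenerated
# whenever the source schema changes instead of drifting out of date.

def load_schema_fields(path):
    """Load a BigQuery JSON schema file: a list of field dicts."""
    with open(path) as f:
        return json.load(f)

def make_struct(name, fields, mode="NULLABLE"):
    """Wrap a source table's fields as a nested RECORD field."""
    return {"name": name, "type": "RECORD", "mode": mode, "fields": fields}

def build_doi_schema(source_schemas):
    """source_schemas: mapping of struct name -> list of field dicts,
    e.g. {"crossref": [...], "unpaywall": [...], "openalex": [...]}."""
    return [make_struct(name, fields) for name, fields in source_schemas.items()]
```

Under this approach the events struct would stay hand-written, since (as noted above) it does not match the original Crossref Events data.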
Yeah, the schemas don't need dates as dated schemas are only used when backfilling older versions of a dataset.
…ns, events completed
Force-pushed from 9a0230e to 0059e67
As requested, I have added a function to create the DOI schema from definitions instead of pulling it from the doi_.json file. The paths to the schemas are now attached to the SQLQuery object instead of being passed down with the kwargs for each parallel task.
For tables such as Unpaywall, Pubmed and OpenAlex, all of the fields from their respective source tables are brought into the DOI table. However, the Crossref Events (events), open_citations, coki and affiliation parts of the DOI table have been separated out into their own schemas and placed in the "intermediate" folder, as they all contain calculated fields produced in either the "intermediate_" task or the "create_doi" stage of the workflow. The definition of the Crossref metadata part of the schema is messier in comparison, as it uses a combination of a few original fields and calculates new ones when creating the intermediate table.
The schemas for the aggregate tables still need to be addressed. I will separate them out into their own schema soon.
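Attaching the schema paths to the query object rather than threading them through each task's kwargs could look something like the sketch below. The class and attribute names (`SQLQuery`, `schema_path`) are illustrative assumptions, not the actual definitions in this repository.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sketch: the schema file path travels with the query object
# itself, so parallel tasks read it directly instead of receiving it as
# an extra keyword argument.

@dataclass
class SQLQuery:
    name: str
    sql: str
    schema_path: Optional[str] = None  # path to the table's JSON schema file

def run_queries(queries: List[SQLQuery]):
    for q in queries:
        # each task now reads q.schema_path directly; no kwargs plumbing
        print(f"Running {q.name} with schema {q.schema_path}")
```

The benefit is that adding a new per-query attribute later only touches the dataclass, not the signature of every task function.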
Schema descriptions updated for the aggregate and doi json files.
New files created as new fields have been added to both schemas.
Aggregate schema is based on the current author table, as it had the most fields (please let me know if this should be changed!).
Formatting improvements to come in future updates. If the descriptions for the most part make sense, it would be good to get them into the bigquery tables sooner rather than later and we can continue to improve over time.
Schemas have been uploaded to coki-scratch-space.Kathryn.test_agg_schema and coki-scratch-space.Kathryn.test_doi_schema as a test and for ease of viewing.
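Since the descriptions are still being improved over time, a small check could flag any fields that are missing one before the schemas are pushed to the BigQuery tables. A minimal sketch, assuming the schema files are plain lists of BigQuery field dicts (the helper name is hypothetical):

```python
# Hypothetical helper: walk a BigQuery JSON schema (a list of field dicts)
# and collect dotted paths of every field, including nested RECORD fields,
# that has no description, so gaps are caught before upload.

def missing_descriptions(fields, prefix=""):
    missing = []
    for f in fields:
        path = f"{prefix}{f['name']}"
        if not f.get("description"):
            missing.append(path)
        if f.get("type") == "RECORD":
            missing.extend(missing_descriptions(f.get("fields", []), path + "."))
    return missing
```

Run against the aggregate and DOI schema files, this would give a quick to-do list of undocumented fields for the future formatting passes mentioned above.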