Load topics from OpenAlex + Refactor of OpenAlex work processor #1592
…reate enriched data
…er upload missing concepts + topics
@@ -0,0 +1,146 @@
import logging
The code below is a refactored + optimized version of the code below:
It is 50% existing + 50% new I'd say
from utils.openalex import OpenAlex


class Command(BaseCommand):
Since we are starting a new app with `topic`, it would be good to have some tests for this command.
Yes, I was planning on adding some tests today!
LGTM 👍
The changed code and the newly introduced code could use some tests. :)
# if topic exists, determine if we need to update it
needs_update = False
if topic and has_dates:
Why might a topic not have dates? Are there scenarios where we would want to update topics that don't have dates?
In general, OpenAlex provides two versions of an entity such as `Topic`: one that includes all data, including dates, and a "dehydrated" version of the same object that includes only the most essential data. For example, when fetching works, OpenAlex returns a number of related dehydrated objects for efficiency, such as `dehydratedTopic` and `dehydratedConcept`.
We still want to create these records whenever possible. For example, a new paper may include new topics and concepts that we do not yet have, and we want to add them. We could also fetch the full objects from OpenAlex whenever we encounter dehydrated ones, but that is not necessary and incurs a significant performance cost. For example, one paper may have x concepts and y topics, which would require x + y requests to OpenAlex.
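To make the reasoning concrete, here is a minimal sketch of the upsert rule described above. It is not the code in this PR: `TopicRecord`, `upsert_topic`, and the in-memory `store` are hypothetical stand-ins for the model and database, and the payload keys `id`, `display_name`, `created_date`, and `updated_date` are assumptions about what the full and dehydrated objects contain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class TopicRecord:
    """Hypothetical stand-in for a stored topic row."""
    openalex_id: str
    display_name: str
    updated_date: Optional[datetime] = None


def upsert_topic(store: Dict[str, TopicRecord], payload: dict) -> str:
    """Create or update a topic from a full or dehydrated payload.

    Returns "created", "updated", or "skipped".
    """
    # Dehydrated payloads are assumed to carry no date fields.
    has_dates = "created_date" in payload and "updated_date" in payload
    incoming = datetime.fromisoformat(payload["updated_date"]) if has_dates else None

    topic = store.get(payload["id"])
    if topic is None:
        # Always create missing records, even from dehydrated data,
        # so no extra request per topic is needed.
        store[payload["id"]] = TopicRecord(
            openalex_id=payload["id"],
            display_name=payload["display_name"],
            updated_date=incoming,
        )
        return "created"

    # If the topic exists, only a payload with dates can prove it is newer.
    needs_update = False
    if has_dates:
        needs_update = topic.updated_date is None or incoming > topic.updated_date

    if needs_update:
        topic.display_name = payload["display_name"]
        topic.updated_date = incoming
        return "updated"
    return "skipped"
```

With this rule, a dehydrated payload can create a missing topic but never overwrites an existing one, and a paper with x concepts and y topics needs no additional lookups.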
LGTM
`process_openalex_works`, which is responsible for upserting works and their related entities, by eliminating dead code and removing unnecessary code that significantly slows down paper processing, such as `hydrate_paper_concepts`
`subfield`, `field`, and `domain`
`load_works_from_openalex`, which has two modes: `fetch` or `backfill`. This script will be iterated upon, as we will need to do a bunch of additional operations such as creating authors when fetching papers
`work_type`, `is_retracted`, ....
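As a rough illustration of the two-mode command mentioned above, the sketch below dispatches on a mode argument. It uses plain `argparse` in place of Django's command class so that it runs on its own. The `--mode` flag name, the default, and the `fetch_works` / `backfill_works` helpers are assumptions for illustration, not the script's actual interface.

```python
import argparse
from typing import List


def fetch_works() -> str:
    # Placeholder: the real command would pull works from OpenAlex here.
    return "fetch"


def backfill_works() -> str:
    # Placeholder: the real command would backfill stored works here.
    return "backfill"


# Map each supported mode to its handler (hypothetical names).
HANDLERS = {"fetch": fetch_works, "backfill": backfill_works}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Load works from OpenAlex (sketch)")
    parser.add_argument(
        "--mode",
        choices=sorted(HANDLERS),
        default="fetch",
        help="which operation to run",
    )
    return parser


def run(argv: List[str]) -> str:
    args = build_parser().parse_args(argv)
    return HANDLERS[args.mode]()
```

Keeping the handlers in a dictionary means a later operation, such as creating authors, could be added as another entry without touching the parsing code.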