Load topics from OpenAlex + Refactor of OpenAlex work processor #1592

Merged
merged 22 commits into from
May 15, 2024

Contributor

@yattias yattias commented May 14, 2024

  • Refactored process_openalex_works, which is responsible for upserting works and their related entities, by eliminating dead code and removing unnecessary code that significantly slows down paper processing, such as hydrate_paper_concepts
  • Added a Django script to fetch and create topics from OpenAlex along with their related entities: subfield, field, and domain
  • Added a script, load_works_from_openalex, which has two modes: fetch or backfill. This script will be iterated upon, as we will need additional operations such as creating authors when fetching papers
  • Enriched the paper entity by adding various new fields such as work_type, is_retracted, ....
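
For reviewers unfamiliar with the OpenAlex topic payload, here is a minimal sketch of how one topic record could be flattened into the domain, field, subfield, topic chain that the new command has to create. The function name `flatten_topic_hierarchy` and the sample IDs are hypothetical and are not taken from this PR; the `domain`, `field`, and `subfield` keys follow the OpenAlex Topic object. The actual command may be structured differently.

```python
from typing import Any, Dict, List, Tuple

# Parent-to-child order: each level should exist before the level below it.
HIERARCHY_LEVELS = ("domain", "field", "subfield")


def flatten_topic_hierarchy(topic: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (level, openalex_id, display_name) rows in creation order.

    Hypothetical helper: parents come first so that a caller can
    get-or-create each row before linking the child to it.
    """
    rows = []
    for level in HIERARCHY_LEVELS:
        entity = topic.get(level)
        if not entity:
            # A payload may omit a level; skip it rather than fail.
            continue
        rows.append((level, entity["id"], entity["display_name"]))
    rows.append(("topic", topic["id"], topic["display_name"]))
    return rows


# Made-up sample payload, shaped like an OpenAlex Topic object.
sample_topic = {
    "id": "https://openalex.org/T0001",
    "display_name": "Example Topic",
    "subfield": {"id": "https://openalex.org/subfields/0001", "display_name": "Example Subfield"},
    "field": {"id": "https://openalex.org/fields/01", "display_name": "Example Field"},
    "domain": {"id": "https://openalex.org/domains/1", "display_name": "Example Domain"},
}

for level, openalex_id, name in flatten_topic_hierarchy(sample_topic):
    print(level, openalex_id, name)
```

Emitting parents before children keeps the create step simple: each row can be passed to a get-or-create call and the resulting record used as the foreign key for the next row.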

@yattias yattias requested a review from a team as a code owner May 14, 2024 02:53
@@ -0,0 +1,146 @@
import logging
Contributor Author

It is 50% existing + 50% new I'd say

from utils.openalex import OpenAlex


class Command(BaseCommand):
Member

Since we are starting a new app with topics, it would be good to have some tests for this command.

Contributor Author

Yes, I was planning on adding some tests today!

Member

@gzurowski gzurowski left a comment

LGTM 👍

The changed code and the newly introduced code could use some tests. :)


# if topic exists, determine if we need to update it
needs_update = False
if topic and has_dates:
Contributor

Why might a topic not have dates? Are there scenarios where we would want to update topics that don't have dates?

Contributor Author

In general, OpenAlex exposes two versions of an entity such as Topic: a full version that includes all data, dates among them, and a "dehydrated" version that includes only the most essential data for the object. For example, when fetching works, the API returns the related objects in dehydrated form for efficiency, such as dehydratedTopic and dehydratedConcept.

We still want to create these records whenever possible. For example, a new paper may include topics and concepts we do not have yet, and we want to add them. We could also fetch the full objects from OpenAlex whenever we encounter dehydrated ones, but that is not necessary and has a significant performance cost: a paper with x concepts and y topics would require x + y additional requests to OpenAlex.

https://docs.openalex.org/api-entities/institutions/institution-object#the-dehydratedinstitution-object
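
To make the trade-off above concrete, here is a rough sketch of the decision as I understand it from this thread. It is not the code in this PR: `decide_action`, `has_dates`, and `extra_requests_to_hydrate` are hypothetical names, the stored record is modeled as a plain dict, and comparing `updated_date` values to decide on an update is an assumption on my part.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def has_dates(payload: Dict[str, Any]) -> bool:
    """Dehydrated OpenAlex objects omit created_date and updated_date."""
    return bool(payload.get("created_date")) and bool(payload.get("updated_date"))


def decide_action(existing: Optional[Dict[str, Any]], payload: Dict[str, Any]) -> str:
    """Return "create", "update", or "skip" for one incoming topic payload."""
    if existing is None:
        # New record: create it even from a dehydrated payload.
        return "create"
    if not has_dates(payload):
        # Dehydrated payload for a record we already have: nothing to compare.
        return "skip"
    incoming = datetime.fromisoformat(payload["updated_date"])
    stored = existing.get("updated_date")
    # Assumption: update only when OpenAlex reports a newer updated_date.
    # Both values are assumed to be naive datetimes in this sketch.
    if stored is None or incoming > stored:
        return "update"
    return "skip"


def extra_requests_to_hydrate(work: Dict[str, Any]) -> int:
    """x concepts + y topics = x + y extra requests if every one is hydrated."""
    return len(work.get("concepts", [])) + len(work.get("topics", []))
```

With this shape, a work carrying 3 concepts and 2 topics would cost 5 extra requests to hydrate fully, which is the overhead the dehydrated path avoids.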

Contributor

@koutst koutst left a comment

LGTM

@yattias yattias merged commit 7cc649f into master May 15, 2024
1 check passed
@yattias yattias deleted the oa/doc-topics branch May 15, 2024 02:16
@yattias yattias added this to the OpenAlex Topics, Subfields milestone May 15, 2024