Fix/OpenAlex "abstract_inverted_index" field is sometimes a string, not a dictionary #173

Merged
merged 2 commits into from
Jul 27, 2023

Conversation

alexmassen-hane
Collaborator

@alexmassen-hane alexmassen-hane commented Jul 26, 2023

In the recent 'works' output of the OpenAlex data dump, the "abstract_inverted_index" field is sometimes a JSON string dump rather than a dictionary. This PR decodes the string dump and pulls out the keys and values as needed. The resulting data should be in the same format as before, so the entire workflow will not need to be re-run.

Please see the post on the Google forum for OpenAlex:
https://groups.google.com/g/openalex-users/c/TsVgxd_GEuw
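
To illustrate the two shapes described above, here is a minimal sketch. The sample payloads are made up for illustration; the `InvertedIndex` and `IndexLength` keys are the ones used in the diff under review, and `to_keys_values` is a hypothetical helper name, not a function from the workflow.

```python
import json

# Hypothetical sample: the usual form, a dictionary mapping each word to its positions.
as_dict = {"Hello": [0], "world": [1, 2]}

# Hypothetical sample: the same data as a JSON string dump, wrapped with IndexLength.
as_string = json.dumps({"IndexLength": 3, "InvertedIndex": as_dict})


def to_keys_values(index: dict) -> dict:
    """Flatten an inverted index into parallel lists of words and position strings."""
    keys = list(index.keys())
    # str([1, 2]) is "[1, 2]"; slicing off the brackets leaves "1, 2".
    values = [str(positions)[1:-1] for positions in index.values()]
    return {"keys": keys, "values": values}


# Decoding the string form first yields the same result as the dictionary form.
decoded = json.loads(as_string)["InvertedIndex"]
result_from_string = to_keys_values(decoded)
result_from_dict = to_keys_values(as_dict)
```

Both inputs end up as the same `{"keys": [...], "values": [...]}` structure, which is why previously processed data would not need to be regenerated.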

@codecov

codecov bot commented Jul 26, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.02% ⚠️

Comparison is base (5019bbc) 95.79% compared to head (8777079) 95.78%.
Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #173      +/-   ##
===========================================
- Coverage    95.79%   95.78%   -0.02%     
===========================================
  Files           19       19              
  Lines         4642     4649       +7     
  Branches       622      624       +2     
===========================================
+ Hits          4447     4453       +6     
  Misses         122      122              
- Partials        73       74       +1     
Files Changed Coverage Δ
...ervatory_workflows/workflows/openalex_telescope.py 93.21% <100.00%> (+0.10%) ⬆️

... and 1 file with indirect coverage changes


Contributor

@jdddog jdddog left a comment


Thanks for fixing this, Alex. I just have a small suggestion; other than that, it looks good.

Comment on lines 1019 to 1042
    -if not isinstance(obj.get(field), dict):
    +if not isinstance(obj.get(field), (dict, str)):
         return
    -keys = list(obj[field].keys())
    -values = [str(value)[1:-1] for value in obj[field].values()]
    +else:
    +    # If data is held in a string dump, load json string again.
    +    if isinstance(obj.get(field), str):
    +        obj_part = json.loads(obj[field])
    +        field2 = "InvertedIndex"
    +        if isinstance(obj_part.get(field2), dict):
    +            keys = list(obj_part[field2].keys())
    +            values = [str(value)[1:-1] for value in obj_part[field2].values()]
    +
    +            index_sum = sum(len(value.split(", ")) for value in values)
    +            assert (
    +                index_sum == obj_part["IndexLength"]
    +            ), f"Calculated IndexLength {index_sum} does not match value from file {obj_part['IndexLength']}."
    +
    +            obj[field] = {"keys": keys, "values": values}
    +        else:
    +            raise TypeError(f"obj_part['InvertedIndex'] is not a dictionary: {obj_part}")
    +    else:
    +        keys = list(obj[field].keys())
    +        values = [str(value)[1:-1] for value in obj[field].values()]
    +
    -obj[field] = {"keys": keys, "values": values}
    +        obj[field] = {"keys": keys, "values": values}
Contributor


I don't think we need to check whether the index length matches; a mismatch would be a problem for OpenAlex to fix. It could be a data quality check that we add in the future (if we re-ran the workflow and updated the field).
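
If such a data quality check were added later, one possible shape is a helper that reports a mismatch without failing the transform. This is only a sketch: `index_length_matches` is a hypothetical name, and it counts positions directly from the decoded lists rather than by splitting the formatted strings as the diff does.

```python
import logging


def index_length_matches(payload: dict) -> bool:
    """Return True if the number of word positions equals the reported IndexLength."""
    inverted = payload.get("InvertedIndex", {})
    # Each value is a list of positions; the total should equal IndexLength.
    counted = sum(len(positions) for positions in inverted.values())
    reported = payload.get("IndexLength")
    matches = counted == reported
    if not matches:
        # Log instead of raising, so a mismatch does not stop the workflow.
        logging.warning("IndexLength %s does not match counted positions %s", reported, counted)
    return matches
```

Returning a boolean keeps the decision about what to do with a mismatch (skip, flag, or report upstream) with the caller.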

You could simplify the code a bit by putting the parsing into a function, like this:

    field = "abstract_inverted_index"
    if field in obj:
        def parse_abstract(dict_: dict):
            keys_ = list(dict_.keys())
            values_ = [str(value_)[1:-1] for value_ in dict_.values()]
            return {"keys": keys_, "values": values_}
        if isinstance(obj.get(field), str):
            data = json.loads(obj[field])
            obj[field] = parse_abstract(data["InvertedIndex"])
        elif isinstance(obj.get(field), dict):
            obj[field] = parse_abstract(obj[field])
        else:
            return
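
For reference, the suggestion above can be exercised on its own with a small wrapper. This is a sketch under assumptions: `transform_abstract` and the `FIELD` constant are hypothetical names, and the guard for a decoded value that is not a dictionary is an addition that the suggestion itself does not include.

```python
import json

FIELD = "abstract_inverted_index"


def parse_abstract(index: dict) -> dict:
    """Convert an inverted index into parallel lists of keys and position strings."""
    keys = list(index.keys())
    values = [str(value)[1:-1] for value in index.values()]
    return {"keys": keys, "values": values}


def transform_abstract(obj: dict) -> None:
    """Rewrite obj[FIELD] in place; leave it untouched if absent or of another type."""
    raw = obj.get(FIELD)
    if isinstance(raw, str):
        # String dump: decode it, then take the nested InvertedIndex dictionary.
        decoded = json.loads(raw)
        raw = decoded.get("InvertedIndex") if isinstance(decoded, dict) else None
    if isinstance(raw, dict):
        obj[FIELD] = parse_abstract(raw)


# Hypothetical records covering the dictionary, string, and null cases.
from_dict = {FIELD: {"data": [0, 3], "open": [1]}}
from_string = {FIELD: json.dumps({"IndexLength": 3, "InvertedIndex": {"data": [0, 3], "open": [1]}})}
empty = {FIELD: None}
for record in (from_dict, from_string, empty):
    transform_abstract(record)
```

After the loop, the first two records hold the same keys/values structure and the null record is unchanged.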

Contributor

@jdddog jdddog left a comment


Thanks Alex, it looks good.

@jdddog jdddog merged commit afd3453 into develop Jul 27, 2023
2 of 3 checks passed
@jdddog jdddog deleted the fix/openalex-abstract_inverted_index-json_dump branch July 27, 2023 04:45
alexmassen-hane added a commit that referenced this pull request Jul 28, 2023