Fix/OpenAlex "abstract_inverted_index" field is sometimes a string, not a dictionary #173
Codecov Report

Coverage diff (develop vs. #173):

             develop    #173     +/-
  Coverage    95.79%   95.78%   -0.02%
  Files           19       19
  Lines         4642     4649       +7
  Branches       622      624       +2
  Hits          4447     4453       +6
  Misses         122      122
  Partials        73       74       +1
Thanks for fixing this, Alex. I just have a small suggestion; other than that, it looks good.
-if not isinstance(obj.get(field), dict):
+if not isinstance(obj.get(field), (dict, str)):
     return
-keys = list(obj[field].keys())
-values = [str(value)[1:-1] for value in obj[field].values()]
+else:
+    # If data is held in a string dump, load json string again.
+    if isinstance(obj.get(field), str):
+        obj_part = json.loads(obj[field])
+        field2 = "InvertedIndex"
+        if isinstance(obj_part.get(field2), dict):
+            keys = list(obj_part[field2].keys())
+            values = [str(value)[1:-1] for value in obj_part[field2].values()]
+
+            index_sum = sum(len(value.split(", ")) for value in values)
+            assert (
+                index_sum == obj_part["IndexLength"]
+            ), f"Calculated IndexLength {index_sum} does not match value from file {obj_part['IndexLength']}."
+
+            obj[field] = {"keys": keys, "values": values}
+        else:
+            raise TypeError(f"obj_part['InvertedIndex'] is not a dictionary: {obj_part}")
+    else:
+        keys = list(obj[field].keys())
+        values = [str(value)[1:-1] for value in obj[field].values()]
+
+        obj[field] = {"keys": keys, "values": values}
-obj[field] = {"keys": keys, "values": values}
I don't think we need to check if the index length matches, as it is a problem for OpenAlex to fix if that is the case. It could be a data quality check that we do in the future (if we re-ran the workflow and updated the field).
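If the index length check is ever reintroduced as a data quality check, one possible shape is a helper that logs a mismatch instead of raising. This is only a sketch: the function name `index_length_matches` is hypothetical, and it assumes the string dump has the `IndexLength` and `InvertedIndex` keys used in the diff above, with each `InvertedIndex` value being a list of word positions.

```python
import json
import logging


def index_length_matches(raw: str) -> bool:
    """Return True if the counted positions equal the reported IndexLength.

    Hypothetical helper, not part of the PR. Assumes `raw` is a JSON string
    with "IndexLength" and "InvertedIndex" keys.
    """
    data = json.loads(raw)
    # Count every word position across all entries of the inverted index.
    position_count = sum(
        len(positions) for positions in data["InvertedIndex"].values()
    )
    matches = position_count == data["IndexLength"]
    if not matches:
        # Report the discrepancy without interrupting the workflow.
        logging.warning(
            "IndexLength mismatch: counted %d, file reports %d",
            position_count,
            data["IndexLength"],
        )
    return matches


# Made-up sample record for illustration only.
sample = json.dumps(
    {"IndexLength": 3, "InvertedIndex": {"Hello": [0], "world": [1, 2]}}
)
print(index_length_matches(sample))
```

Logging rather than asserting keeps a bad record from stopping the whole run, which fits the point that a mismatch is an upstream problem.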
You could simplify the code a bit by putting the parsing into a function, like this:
field = "abstract_inverted_index"
if field in obj:
    def parse_abstract(dict_: dict):
        keys_ = list(dict_.keys())
        values_ = [str(value_)[1:-1] for value_ in dict_.values()]
        return {"keys": keys_, "values": values_}

    if isinstance(obj.get(field), str):
        data = json.loads(obj[field])
        obj[field] = parse_abstract(data["InvertedIndex"])
    elif isinstance(obj.get(field), dict):
        obj[field] = parse_abstract(obj[field])
    else:
        return
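For anyone who wants to try the suggestion in isolation, here is a self-contained sketch of the same idea. The wrapper name `transform_abstract_field` and the sample records are made up for illustration; the surrounding function in the real code base is not shown in this thread, so this is not its actual signature.

```python
import json


def parse_abstract(inverted_index: dict) -> dict:
    """Split an inverted index into parallel key and value lists."""
    keys = list(inverted_index.keys())
    # str([1, 3]) is "[1, 3]"; slicing off the brackets leaves "1, 3".
    values = [str(value)[1:-1] for value in inverted_index.values()]
    return {"keys": keys, "values": values}


def transform_abstract_field(obj: dict, field: str = "abstract_inverted_index") -> None:
    """Rewrite obj[field] in place, accepting a dict or a JSON string dump.

    Hypothetical wrapper for illustration only.
    """
    raw = obj.get(field)
    if isinstance(raw, str):
        # String dump: decode it, then read the nested "InvertedIndex".
        obj[field] = parse_abstract(json.loads(raw)["InvertedIndex"])
    elif isinstance(raw, dict):
        obj[field] = parse_abstract(raw)
    # Any other type (including a missing field) is left untouched.


# Made-up records showing the two shapes the field can take.
index = {"Hello": [0], "world": [1, 3]}
as_dict = {"abstract_inverted_index": dict(index)}
as_string = {
    "abstract_inverted_index": json.dumps(
        {"IndexLength": 3, "InvertedIndex": index}
    )
}

transform_abstract_field(as_dict)
transform_abstract_field(as_string)
print(as_dict == as_string)
```

Both input shapes end up in the same `{"keys": [...], "values": [...]}` format, so downstream consumers should not need to know which shape the dump used.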
Thanks Alex, it looks good.
…ot a dictionary (#173)
In the recent 'works' output of the OpenAlex data dump, the "abstract_inverted_index" field is sometimes a JSON string dump rather than a dictionary. This PR adds the ability to decode the string dump and pull out the keys and values as needed. The resulting data should still be in the same format as before, so the entire workflow will not need to be re-run.
Please see the post on the Google forum for OpenAlex:
https://groups.google.com/g/openalex-users/c/TsVgxd_GEuw