Fix/OpenAlex "abstract_inverted_index" field is sometimes a string, not a dictionary #173

alexmassen-hane · 2023-07-26T08:28:46Z

In the recent 'works' output of the OpenAlex data dump, the "abstract_inverted_index" field is sometimes a JSON string dump rather than a dictionary. This PR adds the ability to decode the string dump and pull out the keys and values as needed. The resulting data keeps the same format as before, so the entire workflow does not need to be re-run.

Please see the post on the Google forum for OpenAlex:
https://groups.google.com/g/openalex-users/c/TsVgxd_GEuw
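
For illustration, here is a minimal sketch of the normalisation described above. It assumes the string form is a JSON dump holding the index under an "InvertedIndex" key, as in the diff under review. The function name `normalise_abstract` and the sample records are hypothetical, and unlike the PR this sketch leaves unexpected values untouched instead of raising.

```python
import json

FIELD = "abstract_inverted_index"


def normalise_abstract(obj: dict, field: str = FIELD) -> None:
    """Rewrite obj[field] in place as {"keys": [...], "values": [...]}."""
    value = obj.get(field)

    # String form: a JSON dump wrapping the index under "InvertedIndex".
    if isinstance(value, str):
        decoded = json.loads(value)
        value = decoded.get("InvertedIndex") if isinstance(decoded, dict) else None

    # Anything that is not a dictionary by now (e.g. None) is left untouched.
    if not isinstance(value, dict):
        return

    obj[field] = {
        "keys": list(value.keys()),
        # str([1, 5]) is "[1, 5]"; slicing drops the brackets, leaving "1, 5".
        "values": [str(positions)[1:-1] for positions in value.values()],
    }


# Hypothetical sample records, one per shape.
index = {"Hello": [0], "world": [1, 5]}
as_dict = {FIELD: index}
as_string = {FIELD: json.dumps({"IndexLength": 3, "InvertedIndex": index})}

normalise_abstract(as_dict)
normalise_abstract(as_string)
print(as_dict == as_string)
```

Both records end up with the same "keys"/"values" structure, which is why data already produced from the dictionary form would not need reprocessing.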

codecov · 2023-07-26T08:54:12Z

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.02% ⚠️

Comparison is base (5019bbc) 95.79% compared to head (8777079) 95.78%.
Report is 1 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #173      +/-   ##
===========================================
- Coverage    95.79%   95.78%   -0.02%     
===========================================
  Files           19       19              
  Lines         4642     4649       +7     
  Branches       622      624       +2     
===========================================
+ Hits          4447     4453       +6     
  Misses         122      122              
- Partials        73       74       +1

Files Changed	Coverage Δ
...ervatory_workflows/workflows/openalex_telescope.py	`93.21% <100.00%> (+0.10%)`	⬆️

... and 1 file with indirect coverage changes


jdddog

Thanks for fixing this Alex, I just have a small suggestion, other than that it looks good.

jdddog · 2023-07-26T23:15:05Z

academic_observatory_workflows/workflows/openalex_telescope.py

-        if not isinstance(obj.get(field), dict):
+        if not isinstance(obj.get(field), (dict, str)):
            return
-        keys = list(obj[field].keys())
-        values = [str(value)[1:-1] for value in obj[field].values()]
+        else:
+            # If data is held in a string dump, load json string again.
+            if isinstance(obj.get(field), str):
+                obj_part = json.loads(obj[field])
+                field2 = "InvertedIndex"
+                if isinstance(obj_part.get(field2), dict):
+                    keys = list(obj_part[field2].keys())
+                    values = [str(value)[1:-1] for value in obj_part[field2].values()]
+
+                    index_sum = sum(len(value.split(", ")) for value in values)
+                    assert (
+                        index_sum == obj_part["IndexLength"]
+                    ), f"Calculated IndexLength {index_sum} does not match value from file {obj_part['IndexLength']}."
+
+                    obj[field] = {"keys": keys, "values": values}
+                else:
+                    raise TypeError(f"obj_part['InvertedIndex'] is not a dictionary: {obj_part}")
+            else:
+                keys = list(obj[field].keys())
+                values = [str(value)[1:-1] for value in obj[field].values()]

-        obj[field] = {"keys": keys, "values": values}
+                obj[field] = {"keys": keys, "values": values}


I don't think we need to check whether the index length matches; if it doesn't, that is a problem for OpenAlex to fix. It could be a data quality check that we add in the future (if we re-ran the workflow and updated the field).
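
As a rough sketch of what such a standalone data quality check might look like, assuming the string payload carries "IndexLength" and "InvertedIndex" keys as in the diff above (the function name `index_length_matches` and the sample payloads are hypothetical, and positions are counted from the decoded lists, not from the formatted strings):

```python
import json


def index_length_matches(raw: str) -> bool:
    """Return True if the summed position counts equal the reported IndexLength."""
    payload = json.loads(raw)
    inverted_index = payload.get("InvertedIndex", {})
    # Each value is a list of word positions; their total should equal IndexLength.
    counted = sum(len(positions) for positions in inverted_index.values())
    return counted == payload.get("IndexLength")


# Hypothetical payloads: one consistent, one with a wrong IndexLength.
good = json.dumps({"IndexLength": 3, "InvertedIndex": {"Hello": [0], "world": [1, 5]}})
bad = json.dumps({"IndexLength": 4, "InvertedIndex": {"Hello": [0], "world": [1, 5]}})

print(index_length_matches(good), index_length_matches(bad))
```

Returning a boolean keeps the check out of the transform path, so a mismatch can be logged or reported without failing the workflow.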

You could simplify the code a bit by putting the parsing into a function, like this:

field = "abstract_inverted_index"
if field in obj:

    def parse_abstract(dict_: dict):
        keys_ = list(dict_.keys())
        values_ = [str(value_)[1:-1] for value_ in dict_.values()]
        return {"keys": keys_, "values": values_}

    if isinstance(obj.get(field), str):
        data = json.loads(obj[field])
        obj[field] = parse_abstract(data["InvertedIndex"])
    elif isinstance(obj.get(field), dict):
        obj[field] = parse_abstract(obj[field])
    else:
        return

jdddog

Thanks Alex, it looks good.

…ot a dictionary (#173)

Fix for if abstract_inverted_index is a string

40180d0

alexmassen-hane requested a review from jdddog July 26, 2023 08:30

jdddog requested changes Jul 26, 2023


Made requested changes

8777079

jdddog approved these changes Jul 27, 2023


jdddog merged commit afd3453 into develop Jul 27, 2023
2 of 3 checks passed

jdddog deleted the fix/openalex-abstract_inverted_index-json_dump branch July 27, 2023 04:45

alexmassen-hane added a commit that referenced this pull request Jul 28, 2023

Fix/OpenAlex "abstract_inverted_index" field is sometimes a string, n…

3765261

…ot a dictionary (#173)
