Data Aggregation Improvements #22

rhosking · 2020-10-08T00:28:22Z

A list of useful improvements to the DOI/Entity Aggregation Pipeline. This list also replaces and organises a few issues that have been in the backlog for a while and need addressing. Closing The-Academic-Observatory/observatory-platform#272, The-Academic-Observatory/observatory-platform#146, The-Academic-Observatory/observatory-platform#129, The-Academic-Observatory/observatory-platform#110, The-Academic-Observatory/observatory-platform#70, as they are now covered here.

Cross cutting issues

Grids

Does it make more sense to aggregate grids to their top-level organisation, so that a parent's count includes all publications from every grid that has a child relationship to it?
- Create a grid table, which is a direct grouping by each grid_id. This is how the current institutions table works.
- Rename the current institution list in the affiliations section of the DOIs table to grids.
- Create a new institutions table of aggregated entities (counts of child institutions are included with their parents).
- Build on the extend_grid query: for each gridId, create an array of all parents up the chain. This might involve a step in Python too.
- Create a new institutional affiliation list in the DOIs table. This will contain many more links per DOI than the previous version, as there will be a link for every explicitly linked grid and also a link to each parent grid right up to the top. Because some grids end up nesting into a national government (the USA, for example), a pure approach of always aggregating to the top-level parent of each grid hierarchy hides too much detail. So a publication linked to the NCI will also be linked to the NIH, the Department of Health and Human Services, and the Government of the USA. The final table will allow an end user to pick the level they are interested in.
- For each of these newly established aggregation links, also include the children that were the publishing institutions. This will allow later workflows to break out the relative contributions of the parts.
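The parent-chain step and the "keep the publishing children on each link" idea above could be sketched in Python roughly as follows. This is only an illustration under assumptions: each grid is taken to have at most one parent, the `parent_of` mapping is presumed to come from something like the extend_grid query, and the identifiers below are placeholders, not real GRID IDs.

```python
from typing import Dict, List, Optional


def ancestor_chain(grid_id: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return all ancestors of grid_id, nearest parent first."""
    chain: List[str] = []
    seen = {grid_id}  # guards against accidental cycles in the relationship data
    current = parent_of.get(grid_id)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


def affiliation_links(
    explicit_grids: List[str], parent_of: Dict[str, Optional[str]]
) -> Dict[str, List[str]]:
    """Map every linked grid (explicit or inherited) to the explicitly
    linked children that produced the link."""
    links: Dict[str, List[str]] = {}
    for grid_id in explicit_grids:
        for target in [grid_id] + ancestor_chain(grid_id, parent_of):
            links.setdefault(target, []).append(grid_id)
    return links


# Placeholder identifiers standing in for NCI -> NIH -> HHS -> US Government.
parent_of = {
    "grid.nci": "grid.nih",
    "grid.nih": "grid.hhs",
    "grid.hhs": "grid.usgov",
    "grid.usgov": None,
}

links = affiliation_links(["grid.nci"], parent_of)
```

Here a DOI explicitly linked to the NCI placeholder ends up with a link at every level of the chain, and each link records the NCI as the publishing child, so an end user can pick the level they care about.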

Groups

Similar to the second point for grids: for a group, is it useful to track a minimal set of metrics for each member grid, so that analysis/visualisation of the parent can be broken into its constituent parts to offer a greater level of understanding?

Countries

Is it helpful to have a minimal set of information for each institution within a country, so that later analysis/viz can understand the relative contributions of each of the constituent parts of the whole? Due to collaborations, the sum of the counts over all institutions will be greater than the country total, but perhaps the relative sizes might be useful?

Regions

Same as the above for countries, except that rather than breaking down by institution, a region will be broken down into all the countries it contains.

Publishers, Funders and Journals

Similar to the above, but with a list broken down by all contributing institutions for publishers and journals, and by funded institution for funders. Helpful for downstream analysis/viz to understand the relative impacts of the parts.

Funders

Does it make more sense to aggregate funders to their top-level organisation?
A large proportion of funder references in Crossref do not have an associated Fundref ID. This limits the use of grouping funding by country of origin or by type of funder, as this information comes from the Fundref database.
The aggregation on funder entities currently uses the ID field (which is the fundref_id). This relates to the above point: using the name field instead gives a larger set of results. A conditional Jinja statement might be a workaround for this, but really extra attention needs to go into disambiguating funder references.
There are differences between Fundref, ISO and Geonames country codes (alpha-2 vs alpha-3) to work through to ensure correct joining and colour palettes.
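The fallback from fundref_id to funder name mentioned above could look something like the sketch below. It is written in Python rather than Jinja purely for illustration; `funder_key` is a hypothetical helper, and lower-casing plus whitespace collapsing is only a crude stand-in for real funder disambiguation.

```python
import re
from typing import Optional


def funder_key(fundref_id: Optional[str], name: Optional[str]) -> Optional[str]:
    """Pick an aggregation key for a funder reference.

    Prefer the fundref_id when present; otherwise fall back to a lightly
    normalised name. The prefixes keep the two key spaces separate.
    """
    if fundref_id and fundref_id.strip():
        return "id:" + fundref_id.strip()
    if name and name.strip():
        normalised = re.sub(r"\s+", " ", name.strip()).casefold()
        return "name:" + normalised
    return None  # nothing usable to group on


# Placeholder ID and names, for illustration only.
with_id = funder_key("10.13039/000000000", "Some Funder")
without_id = funder_key(None, "  Some   Other Funder ")
```

References grouped under a `name:` key still lack the country and funder-type information that comes from the Fundref database, so this widens coverage without solving the disambiguation problem.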

Citations

Include the full list of cited and cited-by DOIs as part of the DOIs table (as a repeated nested field). Include published dates.
Create a derived dataset, based on MAG citations, but in the format of OpenCitations.
As part of the final DOI tables schema, bucket and sum the counts of citations from articles over various time periods following publication of the article in focus.
As part of the final DOI tables, do the same for articles that are cited by the focal article, creating buckets for the periods leading up to publication (i.e. how far back in the literature did the authors look).
As part of the final DOI tables schema, create counts for incoming citations bucketed by country of origin, either as a wide sparse sub-table or as a list ignoring countries with no incoming citations. Count each country separately, thus one count for each country with which each author is associated. (The sum of citations over all countries will therefore be greater than the total citations.)
As part of the final DOI tables schema, create counts for outgoing citations bucketed by country of the cited work, either as a wide sparse sub-table or as a list ignoring countries with no outgoing citations.
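A small Python sketch of the per-country citation counts, in the "list ignoring countries with no citations" form. It assumes one reading of the counting rule: each citing work contributes one count to every distinct country among its authors' affiliations. The country codes and input shape are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, List


def citations_by_country(citing_works: Iterable[List[str]]) -> Dict[str, int]:
    """Count incoming citations per country.

    Each element of citing_works is the list of author countries for one
    citing work. A country is counted once per citing work, however many
    authors share it, so countries with no citations never appear.
    """
    counts: Counter = Counter()
    for countries in citing_works:
        counts.update(set(countries))
    return dict(counts)


# Three citing works; two of them are international collaborations.
citing = [["AUS", "USA"], ["USA", "USA"], ["GBR", "AUS"]]
per_country = citations_by_country(citing)
total_citations = len(citing)
```

With this input the per-country counts add up to 5 while there are only 3 citing works, which is the "sum over countries is greater than the total" effect noted above. The same function would apply to outgoing citations if given the countries of each cited work.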

Events

Move beyond just counting events by type to a histogram-based approach, bucketing the various types of events into time slots relative to publication date. This is an extension to the aggregate Crossref events script.
In the same script, keep the full list of events, including the time at which each happened.
Details TBD, but create a dataset that pulls the discipline information from MAG for each DOI and associates each event with these disciplines, to create a time-bucketed intensity score, or alt-metric hotness, by month for each of the discipline categories.
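The histogram idea could be sketched as below. This is a minimal Python illustration, not the existing events script: it assumes monthly buckets measured as calendar-month differences from the publication date (the day of the month is ignored), and the event type names are just examples.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def months_since(published: date, occurred: date) -> int:
    """Calendar-month offset of an event from publication (negative if earlier)."""
    return (occurred.year - published.year) * 12 + (occurred.month - published.month)


def event_histogram(
    published: date, events: Iterable[Tuple[str, date]]
) -> Dict[Tuple[str, int], int]:
    """Count events per (event type, months since publication) bucket."""
    hist: Counter = Counter()
    for event_type, occurred in events:
        hist[(event_type, months_since(published, occurred))] += 1
    return dict(hist)


published = date(2020, 1, 15)
events = [
    ("twitter", date(2020, 1, 20)),
    ("twitter", date(2020, 1, 31)),
    ("twitter", date(2020, 2, 1)),
    ("news", date(2021, 1, 1)),
]
hist = event_histogram(published, events)
```

Keeping the raw `(type, date)` pairs alongside the histogram, as suggested above, means the bucket width can be changed later without re-querying the events.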

Diversity

For institutions, re-include the diversity table join in the new aggregation workflow

bechandcock · 2020-10-27T04:35:47Z

Group Aggregation / Dashboard Sandbox-Dev
The dashboard for Sandbox-Dev may have some data aggregation issues on the Groups tab. For example, take "us_btaa_chicago", which has 3 GRIDs, one with no data:

In the Institutions dashboard, the University of Chicago (grid.170205.1) has a total of all publications of 267,124, while Argonne National Laboratory (grid.187073.a) has 103,490.
While I would expect some co-authorships, in the groups table "us_btaa_chicago" has a total of all publications of 181,070, which is less than the University of Chicago on its own.
I checked the groups table and "us_btaa_chicago" has the correct 3 GRIDs in it.
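The expectation behind this report can be written as a simple bound: if a group's total is the number of distinct publications across its members, it can be no smaller than the largest member and no larger than the sum of all members. A rough Python check, using the figures quoted above (the function names are hypothetical and the DOIs in the second half are made up):

```python
from typing import Dict, List, Set


def group_total_is_plausible(reported_total: int, member_counts: List[int]) -> bool:
    """A distinct-publication total must lie between the largest member
    count and the sum of all member counts."""
    largest = max(member_counts, default=0)
    return largest <= reported_total <= sum(member_counts)


def distinct_group_total(member_dois: Dict[str, Set[str]]) -> int:
    """Group total computed as the size of the union of member DOI sets."""
    all_dois: Set[str] = set()
    for dois in member_dois.values():
        all_dois |= dois
    return len(all_dois)


# Figures from the dashboards: University of Chicago, Argonne, and the GRID with no data.
chicago_ok = group_total_is_plausible(181_070, [267_124, 103_490, 0])

# Made-up DOIs showing how co-authored papers are counted once in the union.
example = {
    "grid.170205.1": {"10.1/a", "10.1/b", "10.1/c"},
    "grid.187073.a": {"10.1/c", "10.1/d"},
}
example_total = distinct_group_total(example)
```

The reported 181,070 fails the lower bound, which suggests the group aggregation is dropping publications somewhere rather than merely de-duplicating co-authored ones.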

bechandcock · 2020-10-27T08:01:55Z

DOI: MAG:
Some of the GRIDs for MAG author affiliations are not being assigned correctly. It is unclear to me whether microsoft_academic_graph.authors.authors.GridId in the DOI table is assigned by us or comes directly from MAG, as it is where the "raw" author affiliation is standardised. For example:

DOI: 10.1080/13658816.2018.1521523

From our database:
microsoft_academic_graph.authors.authors.OriginalAuthor: Taylor Oshan
microsoft_academic_graph.authors.authors.OriginalAffiliation: Center for Geospatial Information Science, Department of Geographical Sciences, University of Maryland, College Park, MD, USA.
microsoft_academic_graph.authors.authors.GridId: null
microsoft_academic_graph.authors.authors.DisplayName: University of Maryland, College Park

bechandcock · 2020-10-27T08:32:56Z

Discipline Aggregation
There is a need for aggregation of disciplines to coarser levels. For example, the 19 Microsoft Academic Graph level-0 Fields of Study could be aggregated as follows:

Aggregated discipline | MAG level-0 Fields of Study
Engineering and Technology | Computer science, Engineering
Health | Medicine, Psychology
Humanities and Arts | Art, History, Philosophy
Science | Biology, Chemistry, Environmental science, Geography, Geology, Materials science, Mathematics, Physics
Social and Political Sciences | Business, Economics, Political science, Sociology
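The grouping above could be expressed as a lookup table and applied to per-field counts, for example as in this Python sketch. The group and field names are taken from the list above; the function name and the example counts are made up.

```python
from typing import Dict, List

# Coarse discipline -> MAG level-0 Fields of Study, as proposed above.
DISCIPLINE_GROUPS: Dict[str, List[str]] = {
    "Engineering and Technology": ["Computer science", "Engineering"],
    "Health": ["Medicine", "Psychology"],
    "Humanities and Arts": ["Art", "History", "Philosophy"],
    "Science": [
        "Biology", "Chemistry", "Environmental science", "Geography",
        "Geology", "Materials science", "Mathematics", "Physics",
    ],
    "Social and Political Sciences": [
        "Business", "Economics", "Political science", "Sociology",
    ],
}

# Inverted lookup: level-0 field -> coarse discipline.
FIELD_TO_GROUP: Dict[str, str] = {
    field: group for group, fields in DISCIPLINE_GROUPS.items() for field in fields
}


def aggregate_disciplines(field_counts: Dict[str, int]) -> Dict[str, int]:
    """Sum per-field counts into the coarser discipline groups.

    Fields missing from the lookup are skipped rather than guessed at.
    """
    totals: Dict[str, int] = {}
    for field, count in field_counts.items():
        group = FIELD_TO_GROUP.get(field)
        if group is not None:
            totals[group] = totals.get(group, 0) + count
    return totals


coarse = aggregate_disciplines({"Biology": 3, "Physics": 2, "Art": 1})
```

Note that summing publication counts this way double counts any publication tagged with more than one field in the same group, so a DOI-level aggregation may be preferable for totals.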

rhosking self-assigned this Oct 8, 2020

rhosking transferred this issue from The-Academic-Observatory/observatory-platform Aug 31, 2021
