Data Aggregation Improvements #22

Open
17 of 41 tasks
rhosking opened this issue Oct 8, 2020 · 3 comments
rhosking commented Oct 8, 2020

A list of useful improvements to the DOI/Entity Aggregation Pipeline. This list also replaces and organises a few issues that have been in the backlog for a while and need addressing. Closing The-Academic-Observatory/observatory-platform#272, The-Academic-Observatory/observatory-platform#146, The-Academic-Observatory/observatory-platform#129, The-Academic-Observatory/observatory-platform#110, The-Academic-Observatory/observatory-platform#70, as they are now covered here.

Cross cutting issues

  • Ensure consistent use of the various Crossref dates. Issued date is more reliable across the various output types
  • Add OA types to the collaboration analysis
  • Potentially extend Collaboration analysis with information on discipline and funding
  • Review additional OA types and any definition issues. Gold-only, for example, is required
  • Remove filtered_list comments
  • Do we want to create monthly aggregations as well as yearly?
    • Create published_year_month in the dois table
    • Modify the aggregate_dois query to have a switchable option between grouping by year and grouping by month
    • Extend the Telescope to enable running in either year or month mode
    • Extend the DAG to ensure both run each week
  • Add citation counts to discipline aggregations
  • Simplify metrics and oa_citations fields
  • Include citation breakdowns from both OpenCitations and MAG for comparison
  • Fix green_in_home repo workflow
  • Remove old commented code, and turn conditional commented code into jinja conditional logic
  • Refactor event aggregation code to reduce storage costs and allow for growth of new event types
  • Ensure duplicate institutions are not found in the affiliation.institutions list
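
A minimal sketch of the year/month switch from the monthly aggregation item above, written in Python rather than in the actual aggregate_dois query. The function names and the "year"/"month" mode strings are assumptions for illustration only; the real change would live in the query and the Telescope.

```python
from datetime import date


def period_key(published: date, mode: str = "year") -> str:
    """Return the grouping key for a publication date in 'year' or 'month' mode."""
    if mode == "year":
        return f"{published.year:04d}"
    if mode == "month":
        return f"{published.year:04d}-{published.month:02d}"
    raise ValueError(f"unknown mode: {mode!r}")


def count_by_period(published_dates, mode="year"):
    """Count publications per period; the same code path serves both modes."""
    counts = {}
    for published in published_dates:
        key = period_key(published, mode)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Keeping the mode as a single parameter means the weekly DAG could call the same aggregation twice, once per mode, instead of maintaining two queries.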

Grids

  • Does it make more sense to aggregate grids to their top level organisation? That is, should the parent's count include all publications of any grid that has a child relationship to that parent grid?
    • Create a grid table, which is a direct grouping by each grid_id, which is how the current institutions table works
    • Rename the current institution list in the affiliations section of the DOIs table to grids.
    • Create a new institutions table. This will hold aggregated entities (counts of child institutions included with their parents).
    • Build on the extend_grid query: for each gridId, create an array of all parents up the chain. This might involve a step in Python too
    • Create a new institutional affiliation list in the DOIs table. This will contain many more links per DOI than the previous version, as there will be a link for every explicitly linked grid and also a link to all the parent grids right up to the top. Because some grids end up nesting into a national government (the USA, for example), a pure approach of always aggregating to the top of each grid hierarchy hides too much detail. So a publication linked to the NCI will also be linked to the NIH, the Department of Health and Human Services, and the Government of the USA. The final table will allow an end user to pick the level they are interested in
    • For each of these newly established aggregation links, also include the children who were the publishing institution. This will allow for later workflows to break out the relative contributions of the parts.
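
As a rough sketch of the Python step mentioned above, the parent chain and the per-ancestor list of publishing children could be built like this. The `parent_of` mapping and the short identifiers in it are placeholders, not real GRID IDs, and the sketch assumes each grid has at most one parent.

```python
def ancestor_chain(grid_id, parent_of):
    """Return [grid_id, parent, grandparent, ...] by following parent links to the top."""
    chain = []
    seen = set()  # guards against accidental cycles in the parent data
    current = grid_id
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


def expand_affiliations(grid_ids, parent_of):
    """Map every linked grid (explicit or ancestor) to the explicit grids that produced the link."""
    links = {}
    for grid_id in grid_ids:
        for ancestor in ancestor_chain(grid_id, parent_of):
            links.setdefault(ancestor, set()).add(grid_id)
    return links


# Placeholder hierarchy mirroring the NCI example (identifiers are hypothetical).
parent_of = {"nci": "nih", "nih": "hhs", "hhs": "usgov"}
```

With this shape, `expand_affiliations(["nci"], parent_of)` links the publication to all four levels, and each level records that "nci" was the publishing institution.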

Groups

  • Similar to the second point for grids: for a group, is it useful to track a minimal set of metrics for each member grid, so that analysis/visualisation of the parent can be broken into its constituent parts to offer a greater level of understanding?

Countries

  • Is it helpful to have a minimal set of information for each institution within that country, so later analysis/viz can understand the relative contributions from each of the constituent parts of the whole? Due to collaborations, the sum of the counts across institutions will exceed the country total, but perhaps the relative sizes might be useful?

Regions

  • Same as the above for countries, except that rather than breaking down by institution, it will be broken down into all the countries contained in that region

Publishers, Funders and Journals

  • Similar to the above: a list broken down by contributing institution for publishers and journals, and by funded institution for funders. Helpful for downstream analysis/viz to understand relative impacts of the parts.

Funders

  • Does it make more sense to aggregate funders to their top level organisation?
  • A large proportion of funder references in Crossref do not have an associated fundref ID. This limits the use of grouping funding by country of origin, or type of funder, as this information comes from the fundref database.
  • The aggregation on funder entities currently uses the ID field (which is the fundref_id). This relates to the above point: using the name field instead gives a larger set of results. A conditional Jinja statement might be a workaround for this, but really extra attention needs to go into disambiguating funder references.
  • Work through the differences between Fundref, ISO and Geonames country codes (alpha-2 vs alpha-3) to ensure correct joining and colour palettes
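
To illustrate the ID-or-name workaround, here is a small sketch of a fallback grouping key. The dictionary keys `id` and `name` are assumed from the field names mentioned above, and the whitespace/case normalisation is only a stand-in: it is not real funder disambiguation.

```python
def funder_key(funder):
    """Prefer the fundref ID; fall back to a normalised name when the ID is missing."""
    funder_id = funder.get("id")
    if funder_id:
        return ("id", funder_id)
    # Collapse whitespace and lowercase so trivially different spellings group together.
    name = " ".join((funder.get("name") or "").lower().split())
    return ("name", name)
```

Tagging the key with its source ("id" or "name") keeps name-based groups visibly separate from ID-based ones, so they can be excluded where the fundref country or funder type is needed.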

Citations

  • Include full list of cited and cited-by dois as part of the DOIs table (as a repeated nested field). Include published dates
  • Create derived dataset, based on MAG citations, but in the format of OpenCitations
  • As part of the final DOI tables schema, bucket and sum the counts of citations from articles over various time periods following publication of the article in focus
  • As part of the final DOI tables, similar to the above, but do the same for articles that are cited by the focal article, creating buckets in the periods leading up to publication (i.e. how far back in the literature did the authors look)
  • As part of the final DOI tables schema, create counts for incoming citations bucketed into country of origin. Either as a wide sparse sub-table, or as a list ignoring countries with no incoming citations. Count each country separately, thus one count for each country with which each author is associated. (Thus the sum of citations across all countries > total citations)
  • As part of the final DOI tables schema, create counts for outgoing citations bucketed into country of cited work. Either as a wide sparse sub-table, or as a list ignoring countries with no outgoing citations
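
A rough sketch of the two kinds of bucketing described above, assuming publication years and per-work country lists are already available. The function names and input shapes are illustrative, not the schema of the DOIs table.

```python
from collections import Counter


def bucket_by_year_offset(focal_year, other_years):
    """Count works by whole years relative to the focal article's publication year.

    Positive offsets suit incoming citations; negative offsets suit the works
    the focal article cites (how far back the authors looked).
    """
    return Counter(year - focal_year for year in other_years)


def citations_by_country(citing_works):
    """Count incoming citations per country.

    Each citing work is a list of author countries and counts once per
    distinct country, so the per-country sum can exceed the total.
    """
    counts = Counter()
    for countries in citing_works:
        counts.update(set(countries))
    return counts
```

Storing the result as a list of (country, count) pairs rather than a wide table corresponds to the "list ignoring countries with no citations" option, since a `Counter` only holds countries that occur.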

Events

  • Move beyond just counting events based on type, to a histogram-based approach bucketing the various types of events in time slots relative to publication date. This is an extension to the aggregate crossref events script
  • In the same script, keep the full list of events, including the time in which it happened.
  • Details TBD, but creating a dataset that pulls the discipline information from MAG for each DOI, associate each event with these disciplines to create a time bucket intensity score, or alt-metric hotness, by month for each of the discipline categories.
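
One possible shape for the histogram approach, sketched with monthly buckets relative to the publication date. The bucket size, the function names and the event type strings used in any example input are assumptions; the existing script may organise this differently.

```python
from collections import Counter
from datetime import date


def months_since(published: date, occurred: date) -> int:
    """Whole calendar months between publication and the event (negative if earlier)."""
    return (occurred.year - published.year) * 12 + (occurred.month - published.month)


def event_histogram(published, events):
    """Build {event_type: Counter({month_offset: count})} from (event_type, date) pairs."""
    histogram = {}
    for event_type, occurred in events:
        offset = months_since(published, occurred)
        histogram.setdefault(event_type, Counter())[offset] += 1
    return histogram
```

Because the buckets are keyed by event type, a new event type only adds a new key rather than a new column, which may help with the storage and growth concerns raised in the cross cutting list.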

Diversity

  • For institutions, re-include the diversity table join in the new aggregation workflow
@rhosking rhosking self-assigned this Oct 8, 2020

bechandcock commented Oct 27, 2020

Group Aggregation / Dashboard Sandbox-Dev
The dashboard for Sandbox-Dev may have some data aggregation issues on the Groups tab. For example, for "us_btaa_chicago", which has 3 GRIDs, one with no data:

  • in the Institutions dashboard the University of Chicago (grid.170205.1) has a total of all publications of 267,124 while Argonne National Laboratory (grid.187073.a) has 103,490.
  • While I would expect some co-authorships, in the groups table "us_btaa_chicago" has a total of all publications of 181,070 which is less than the University of Chicago on its own.
  • I checked the groups table and the "us_btaa_chicago" has the correct 3 GRIDs in it.
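
The expectation behind this report can be written as a simple bounds check: if a group's total is a count of distinct DOIs across its members, it cannot be smaller than its largest member or larger than the sum of its members. This is a sketch of that check, assuming the dashboard totals are distinct-publication counts over the same period; the function names are illustrative.

```python
def group_publication_count(member_dois):
    """member_dois maps grid_id -> set of DOIs; the group total is the size of the union."""
    return len(set().union(*member_dois.values()))


def group_total_is_plausible(group_total, member_totals):
    """A distinct-DOI group total lies between the largest member and the sum of members."""
    return max(member_totals) <= group_total <= sum(member_totals)


# Totals reported above for us_btaa_chicago (the third GRID has no data).
print(group_total_is_plausible(181_070, [267_124, 103_490, 0]))  # False
```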


bechandcock commented Oct 27, 2020

DOI: MAG:
Some of the GRIDs for MAG author affiliations are not being assigned correctly. It is unclear to me if in the DOI table microsoft_academic_graph.authors.authors.GridId is assigned by us or comes direct from MAG, as it is where the "raw" author affiliation is standardised. e.g.

DOI: 10.1080/13658816.2018.1521523

  • From our database:
  • microsoft_academic_graph.authors.authors.OriginalAuthor: Taylor Oshan
  • microsoft_academic_graph.authors.authors.OriginalAffiliation: Center for Geospatial Information Science, Department of Geographical Sciences, University of Maryland, College Park, MD, USA.
  • microsoft_academic_graph.authors.authors.GridId: null
  • microsoft_academic_graph.authors.authors.DisplayName: University of Maryland, College Park


bechandcock commented Oct 27, 2020

Discipline Aggregation
There is a need for aggregation of disciplines to coarser levels, e.g. if the 19 Microsoft Academic Graph level-0 Fields of Study were aggregated as follows:

  • Engineering and Technology: Computer science, Engineering
  • Health: Medicine, Psychology
  • Humanities and Arts: Art, History, Philosophy
  • Science: Biology, Chemistry, Environmental science, Geography, Geology, Materials science, Mathematics, Physics
  • Social and Political Sciences: Business, Economics, Political science, Sociology
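
As a sketch of how the grouping above could be applied, the mapping can be held as a lookup from level-0 field to coarse group. The names `COARSE_DISCIPLINES`, `FIELD_TO_COARSE` and `aggregate_to_coarse` are made up for illustration. Note that summing per-field counts will double count any publication tagged with two fields in the same group, so a real aggregation would need to count distinct DOIs per group.

```python
COARSE_DISCIPLINES = {
    "Engineering and Technology": ["Computer science", "Engineering"],
    "Health": ["Medicine", "Psychology"],
    "Humanities and Arts": ["Art", "History", "Philosophy"],
    "Science": [
        "Biology", "Chemistry", "Environmental science", "Geography",
        "Geology", "Materials science", "Mathematics", "Physics",
    ],
    "Social and Political Sciences": [
        "Business", "Economics", "Political science", "Sociology",
    ],
}

# Invert the grouping so each level-0 field points at its coarse group.
FIELD_TO_COARSE = {
    field: coarse
    for coarse, fields in COARSE_DISCIPLINES.items()
    for field in fields
}


def aggregate_to_coarse(field_counts):
    """Sum per-field counts into coarse groups (see the double-counting caveat above)."""
    totals = {}
    for field, count in field_counts.items():
        coarse = FIELD_TO_COARSE[field]
        totals[coarse] = totals.get(coarse, 0) + count
    return totals
```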

@rhosking rhosking transferred this issue from The-Academic-Observatory/observatory-platform Aug 31, 2021

2 participants