Recommend using Research Organization Repository for organization-based metadata #27

Open
alanorth opened this issue Jul 13, 2020 · 9 comments

Comments

@alanorth
Collaborator

The Research Organization Registry (ROR) is a database of nearly 100,000 organizations, originally seeded from the GRID.ac dataset. Its metadata is updated monthly and includes links to FundRef (CrossRef), GRID.ac, Wikidata, Wikipedia, etc., and it even has multilingual aliases and acronyms. For example, see this API search for one of the precursor institutes to ILRI:

https://api.ror.org/organizations?affiliation=International+Livestock+Centre+for+Africa
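
For anyone who wants to script that lookup, here is a minimal sketch using only the Python standard library. The endpoint and the `affiliation` parameter come from the URL above; the function names are made up for illustration, and the shape of the JSON response is deliberately not assumed here, so the result is returned as-is.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the search URL quoted in this comment.
ROR_API = "https://api.ror.org/organizations"


def affiliation_url(name: str) -> str:
    """Build the affiliation-search URL for a free-text organization name."""
    # urlencode uses quote_plus, so spaces become '+' as in the URL above.
    return f"{ROR_API}?{urllib.parse.urlencode({'affiliation': name})}"


def search_affiliation(name: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON response (hypothetical helper, needs network access)."""
    with urllib.request.urlopen(affiliation_url(name), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `affiliation_url("International Livestock Centre for Africa")` reproduces the URL above; `search_affiliation` with the same argument would fetch the live result.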

I have been investigating using this for our sponsors/investors and institutional affiliations and I am really impressed. They provide an API, a monthly ror.json dump, and an OpenRefine reconciliation service. Also there seems to be a community feedback process where we can suggest new organizations, which I suspect will be very valuable to them with all the metadata we've collected in CGSpace, MELSpace, CLARISA, etc.

BTW I've also written a Python script called ror-lookup.py that will validate a text file of organizations against the ror.json dump (faster than the API of course).
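
To give an idea of the approach (this is a rough sketch, not the actual ror-lookup.py), the check boils down to building a set of known names from the dump and testing each line of the text file against it. It assumes the dump is a JSON array of records with `name`, `aliases`, and `acronyms` fields; the function names are hypothetical, and matching here is exact apart from case and surrounding whitespace.

```python
import json
from pathlib import Path


def known_names(records: list[dict]) -> set[str]:
    """Collect every name, alias, and acronym, case-folded for comparison."""
    names = set()
    for record in records:
        # Field names are an assumption about the dump's schema.
        candidates = [record.get("name")]
        candidates += record.get("aliases") or []
        candidates += record.get("acronyms") or []
        names.update(c.strip().casefold() for c in candidates if c)
    return names


def load_known_names(dump_path: str) -> set[str]:
    """Read a ror.json dump from disk and index its names."""
    records = json.loads(Path(dump_path).read_text(encoding="utf-8"))
    return known_names(records)


def check_organizations(lines, names: set[str]) -> tuple[list[str], list[str]]:
    """Split organization names (one per line) into matched and unmatched."""
    matched, unmatched = [], []
    for line in lines:
        name = line.strip()
        if not name:
            continue  # skip blank lines
        (matched if name.casefold() in names else unmatched).append(name)
    return matched, unmatched
```

Typical use would be `check_organizations(open("organizations.txt", encoding="utf-8"), load_known_names("ror.json"))`, where both file names are placeholders.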

Let me know what you think!

@htobon
Member

htobon commented Jul 27, 2020

Hi Alan, this sounds promising. I am happy to understand this further and seek synergies.

@alanorth
Collaborator Author

Yeah I think it's really promising. Their data is fantastic! I am trying to convince CGSpace and MELSpace to use it. Let's see how far we get... BTW, on CGSpace we have 5866 unique organizations/affiliations/funders, and 1515 (25.8%) of those match with ROR already.

@htobon
Member

htobon commented Jul 31, 2020

Have you already checked the ones we have in CLARISA?
https://clarisa.cgiar.org/swagger/index.html#/Institutions%20Lists/getAllInstitutionsUsingGET

We are, indeed, in the process of finalizing the alignment with MELSpace and will start with the institutions list of the Agresso of the Alliance between Bioversity and CIAT.

Happy to elaborate further.

@alanorth
Collaborator Author

alanorth commented Aug 2, 2020

@htobon yeah I've looked at CLARISA a few times. I raised some concerns about the data in 2019-10:

  • strange Unicode characters like U+00AD and U+200B in a few records
  • unnecessary whitespace in a few records
  • many records mix versions of an organization's name in different languages in a single name value
  • the CLARISA API needs a key (Swagger doesn't count, that's not for automated programmatic access)
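
The first two items are easy to detect and clean up mechanically. A minimal sketch, assuming the names are plain Python strings; the helper names are hypothetical, and only the two code points mentioned above (U+00AD SOFT HYPHEN and U+200B ZERO WIDTH SPACE) are handled:

```python
import re

# Translation table that deletes the two invisible characters named above.
INVISIBLE = {0x00AD: None, 0x200B: None}


def find_problems(name: str) -> list[str]:
    """Report which of the two issues a name value shows."""
    problems = []
    if any(ord(ch) in INVISIBLE for ch in name):
        problems.append("invisible character")
    # Leading/trailing whitespace, or runs of two or more whitespace characters.
    if name != name.strip() or re.search(r"\s{2,}", name):
        problems.append("unnecessary whitespace")
    return problems


def clean_name(name: str) -> str:
    """Drop the invisible characters, then collapse and trim whitespace."""
    without_invisible = name.translate(INVISIBLE)
    return re.sub(r"\s+", " ", without_invisible).strip()
```

The mixed-language names are a different matter, since splitting them needs a human or at least some language-aware logic.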

I looked again last week and the issues are still there.

Not to mention, CLARISA only has around 3,500 entries. ROR has 97,000, and their data is MUCH higher quality: it links to permanent identifiers in many other large public datasets and properly supports multilingual names and acronyms. Their API is also open and they provide monthly data dumps. I would recommend everyone align / map to ROR at this point. Store a "ror_id" field where the value maps to ROR and keep your own where it doesn't...
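
Something like the following is what I mean by the "ror_id" field, as a rough sketch only. The record layout (`id`, `name`), the function name, and the lookup table from case-folded name to ROR ID are all assumptions for illustration, and the ID shown in the usage note is a placeholder, not a real one.

```python
def attach_ror_ids(local_orgs: list[dict], ror_index: dict[str, str]) -> list[dict]:
    """Return copies of local records with a 'ror_id' key added.

    The value is the ROR ID when the name is found in ror_index, else None,
    so unmatched records keep working with their own local identifier.
    """
    annotated = []
    for org in local_orgs:
        key = org["name"].strip().casefold()
        annotated.append({**org, "ror_id": ror_index.get(key)})
    return annotated
```

For example, with `ror_index = {"example university": "https://ror.org/PLACEHOLDER"}`, a local record `{"id": 1, "name": "Example University"}` gains that `ror_id`, while a record whose name is not in the index gets `"ror_id": None` and keeps its own `id`.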

@IanCal

IanCal commented Aug 11, 2020

If it helps, the data in ROR is still just the GRID data, and there's some more metadata available from GRID (all CC0 except the GeoNames-associated data, which is CC BY).

Really happy to hear the data is useful for you and you're finding the data high quality, we've put a lot of work into it.

@alanorth
Collaborator Author

Thanks, @IanCal. ROR is easier to use because of the monthly JSON releases. GRID only releases an RDF file if I'm not mistaken. RDF is much more complex to parse. :P

@IanCal

IanCal commented Aug 20, 2020

@alanorth GRID is available as JSON, CSV and RDF in the bulk releases on figshare, and in a variety of formats (json, ttl, nt, etc.) if you want to access individual records (pages are machine readable, using either content negotiation or changing the URL - https://grid.ac/institutes/grid.5335.0.json) :)

We've got a help page for using the figshare api to access all versions (as the collection has a DOI as well as each individual release).

@alanorth
Collaborator Author

An update on this: as of July 2021, GRID is being retired and ROR will pick up the maintenance and updating of the data set.

https://ror.org/blog/2021-07-12-ror-grid-the-way-forward/

We should amend the CG Core docs to recommend ROR.

@amandafrench

Sorry to jump into your comments -- I'm the new Technical Community Manager for ROR, and I'm happy to answer any questions you might have!
