Merge pull request #19 from Project-TAPIR/develop

* include notebooks for OpenAIRE person-works and work-projects
* include ROR ID in ORCID search
* update docs
Project-TAPIR · Mar 21, 2022 · 3f27250
2 parents 22f05b1 + 13a19b7
commit 3f27250

Showing 8 changed files with 444 additions and 15 deletions.
diff --git a/README.md b/README.md
@@ -3,7 +3,7 @@
 [![DOI](https://zenodo.org/badge/447263093.svg)](https://zenodo.org/badge/latestdoi/447263093)
 [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Project-TAPIR/pidgraph-notebooks/main)
 
-A collection of Jupyter notebooks with examples of querying different PID providers like [ORCID](https://orcid.org/), [ROR](https://ror.readme.io/), [Crossref](https://www.crossref.org/) and PID graphs like the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/) and [OpenAlex](https://openalex.org/about) for connected objects. 
+A collection of Jupyter notebooks with examples of querying different PID providers like [ORCID](https://orcid.org/), [ROR](https://ror.readme.io/), [Crossref](https://www.crossref.org/) and PID graphs like the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/), [OpenAlex](https://openalex.org/about) and [OpenAIRE](https://www.openaire.eu/) for connected objects. 
 
 Currently included connections:
 * organization-organization
@@ -17,7 +17,11 @@ Currently included connections:
 * person-works
   * input: ORCID
   * output: list of works authored/created by the person, each identified by their DOI
-  * data sources: Crossref, FREYA PID Graph, OpenAlex, ORCID
+  * data sources: Crossref, FREYA PID Graph, OpenAlex, ORCID, OpenAIRE
+* work-projects
+  * input: DOI
+  * output: list of projects the work was produced in, each identified by their OpenAIRE project ID
+  * data sources: OpenAIRE
 
 
 Please navigate into the respective folder to see the list of available notebooks. 
@@ -35,3 +39,5 @@ you can use this link to launch the notebooks on Binder where you can execute an
 In the joint project [TAPIR](https://projects.tib.eu/tapir/en/) (Partially Automated Persistent Identifier-based Reporting), partially automated procedures for research reporting are being tested in the context of university and non-university research. To this end, the question is being investigated : 
 
 To what extent can the necessary data aggregation be carried out on the basis of openly available research information using persistent identifiers?
+
+*More information in our blog post "[Project TAPIR: Harvesting the power of PIDs](https://blogs.tib.eu/wp/tib/2022/03/01/project-tapir-harvesting-the-power-of-pids/)"*
diff --git a/organization-people/orcid_get_people_by_organization.ipynb b/organization-people/orcid_get_people_by_organization.ipynb
@@ -8,7 +8,7 @@
    "source": [
     "### Query ORCID for people affiliated with an organization and filter for current employees only\n",
     "\n",
-    "This notebook queries the [ORCID API](https://api.orcid.org/v3.0/) for all [people affiliated with an organization](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/) and additionally narrows down the affiliation to people **currently employed** by the organization. From the resulting list of people we output the ORCID iDs.\n",
+    "This notebook queries the [ORCID Public API](https://api.orcid.org/v3.0/) for all [people affiliated with an organization](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/) and additionally narrows down the affiliation to people **currently employed** by the organization. From the resulting list of people we output the ORCID iDs.\n",
     "\n",
     "*Disclosure:\n",
     "The process of querying the ROR API for additional identifiers and using them to query the ORCID API for affiliated people is the same as used by the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/) and is implemented in [DataCite Application API](https://doi.org/10.5438/8gb0-v673).*"
@@ -105,6 +105,7 @@
     "\n",
     "#--- example execution\n",
     "ror_data=query_ror_api(example_ror)\n",
+    "organization_ror_id=example_ror.replace(\"https://ror.org/\", \"\")\n",
     "# if you want to see the retrieved metadata, uncomment next lines\n",
     "#import pprint\n",
     "#pprint.pprint(ror_data)"
@@ -148,7 +149,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "grid ID: grid.461819.3\n",
+      "Grid ID: grid.461819.3\n",
       "Wikidata ID: Q2399120\n"
      ]
     }
@@ -169,7 +170,7 @@
     "\n",
     "#--- example execution\n",
     "organization_grid_id=extract_grid_from_ror_data(ror_data)\n",
-    "print(\"grid ID: \" + str(organization_grid_id or ''))\n",
+    "print(\"Grid ID: \" + str(organization_grid_id or ''))\n",
     "organization_wikidata_id=extract_wikidata_from_ror_data(ror_data)\n",
     "print(\"Wikidata ID: \" + str(organization_wikidata_id or ''))"
    ]
@@ -261,7 +262,7 @@
    },
    "source": [
     "### Connection organization -> people\n",
-    "The second part of the process is to query for the people affiliated with the organization. For this we use the ORCID API and search for people affiliated with an organization like it is explained in the ORCID tutorial [\"How do I find ORCID record holders at my institution?\"](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/). As parameters for the query we use the Grid ID and Ringgold ID for the organization.\n"
+    "The second part of the process is to query for the people affiliated with the organization. For this we use the ORCID API and search for people affiliated with an organization like it is explained in the ORCID tutorial [\"How do I find ORCID record holders at my institution?\"](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/). As parameters for the query we use the ROR ID, Grid ID and Ringgold ID for the organization.\n"
    ]
   },
   {
@@ -391,11 +392,13 @@
     "# URL for ORCID search API\n",
     "ORCID_SEARCH_API = \"https://pub.orcid.org/v3.0/expanded-search/\"\n",
     "\n",
-    "# query ORCID with an organization's Grid ID and Ringgold\n",
-    "def query_orcid_for_affiliations(grid_id, ringgold_id):\n",
-    "    query = f\"grid-org-id:{grid_id}\" if grid_id else \"\"\n",
-    "    query += \" OR \" if grid_id and ringgold_id else \"\"\n",
-    "    query += f\"ringgold-org-id:{ringgold_id}\" if ringgold_id else \"\"\n",
+    "# query ORCID with an organization's ROR, Grid and Ringgold ID\n",
+    "def query_orcid_for_affiliations(ror_id, grid_id, ringgold_id):\n",
+    "    grid_search = f\"grid-org-id:{grid_id}\" if grid_id else \"\"\n",
+    "    ringgold_search = f\"ringgold-org-id:{ringgold_id}\" if ringgold_id else \"\"\n",
+    "    ror_search = f\"ror-org-id:{ror_id}\" if ror_id else \"\"\n",
+    "    orga_search_ids = [ror_search, grid_search, ringgold_search]\n",
+    "    query = ' OR '.join(filter(None, orga_search_ids))\n",
     "\n",
     "    response = requests.get(url=ORCID_SEARCH_API,\n",
     "                          params={'q': query},\n",
@@ -415,7 +418,7 @@
     "\n",
     "\n",
     "#-- example execution\n",
-    "affiliated_people = query_orcid_for_affiliations(organization_grid_id, organization_ringgold_id)\n",
+    "affiliated_people = query_orcid_for_affiliations(organization_ror_id, organization_grid_id, organization_ringgold_id)\n",
     "affiliated_count = affiliated_people.get('num-found','')\n",
     "print(f\"Number of affiliated people: {affiliated_count}\")\n",
     "\n",

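Note on the hunk above: the new `query_orcid_for_affiliations` builds one search clause per identifier and joins only the non-empty ones. The following is a minimal standalone sketch of that query construction, without the HTTP call. The helper name `build_affiliation_query` and the ROR value `0abcdef12` are hypothetical and used for illustration only; whether ORCID expects the bare ROR ID (as the diff passes it) or the full `https://ror.org/...` URL should be checked against the ORCID search documentation.

```python
def build_affiliation_query(ror_id=None, grid_id=None, ringgold_id=None):
    """Join the available organization identifiers into one ORCID search query.

    Hypothetical helper mirroring the query construction in the diff above.
    """
    clauses = [
        f"ror-org-id:{ror_id}" if ror_id else "",
        f"grid-org-id:{grid_id}" if grid_id else "",
        f"ringgold-org-id:{ringgold_id}" if ringgold_id else "",
    ]
    # filter(None, ...) drops the empty strings, so a missing identifier
    # never leaves a dangling " OR " in the query
    return " OR ".join(filter(None, clauses))


# "0abcdef12" is a made-up placeholder, not a real ROR ID
query = build_affiliation_query(ror_id="0abcdef12", grid_id="grid.461819.3")
print(query)
```

Compared with the removed string concatenation, this approach needs no special casing when a third (or fourth) identifier type is added.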
diff --git a/person-works/README.md b/person-works/README.md
@@ -6,4 +6,5 @@ Currently available PID Graphs:
 * [Crossref](https://www.crossref.org/)
 * [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/)
 * [OpenAlex](https://openalex.org/about)
-* [ORCID](https://orcid.org/)
+* [ORCID](https://orcid.org/)
+* [OpenAIRE](https://www.openaire.eu/)
diff --git a/person-works/openaire_get_publications_by_person.ipynb b/person-works/openaire_get_publications_by_person.ipynb
@@ -0,0 +1,198 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Query OpenAIRE for publications authored by a person\n",
+    "This notebook queries the [OpenAIRE HTTP API](https://graph.openaire.eu/develop/api.html) via its `/publications` endpoint for publications authored by a person. It takes an ORCID iD as input which is used to filter for publications where one of the creators' `orcid` field matches the given ORCID iD. From the resulting list of publications we output all DOIs.\n",
+    "\n",
+    "*Note:\n",
+    "The API has several different endpoints for research outputs: they are divided into publications, research data, software metadata and other research products, so to get a full picture about a person's research output, you would have to query all of these endpoints and union their results.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Prerequisites:\n",
+    "import requests                    # dependency for making HTTP calls\n",
+    "from benedict import benedict      # dependency for dealing with json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "The input for this notebook is an ORCID iD, e.g. '`0000-0003-2499-7741`'."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# input parameter\n",
+    "example_orcid_id=\"0000-0003-2499-7741\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use it to query the OpenAIRE HTTP API for publications that specified the ORCID iD within their metadata in one of the creators `orcid` field. Since the API uses pagination, we need to loop through all pages to get the complete result set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# OpenAIRE endpoint to query for publications\n",
+    "OPENAIRE_API_PUBLICATIONS = \"https://api.openaire.eu/search/publications\"\n",
+    "\n",
+    "# query OpenAIRE for all publications that are connected to orcid\n",
+    "def query_openaire_for_person2publications(orcid_id):\n",
+    "    page = 1\n",
+    "    max_page = 1\n",
+    "\n",
+    "    while page <= max_page:\n",
+    "        params = {'orcid': orcid_id, 'page': page, 'format': \"json\"}\n",
+    "        response = requests.get(url=OPENAIRE_API_PUBLICATIONS,\n",
+    "                                params=params)\n",
+    "        response.raise_for_status()\n",
+    "        result=response.json()\n",
+    "\n",
+    "        # calculate max page number in first loop\n",
+    "        if max_page == 1:\n",
+    "            max_page = determine_max_page(result)\n",
+    "        page = page + 1\n",
+    "        yield result\n",
+    "\n",
+    "# calculate max number of result pages\n",
+    "def determine_max_page(response_data):\n",
+    "    response_dict = benedict.from_json(response_data)\n",
+    "    items_total = response_dict.get('response.header.total.$')\n",
+    "    items_per_page = response_dict.get('response.header.size.$')\n",
+    "    max_page_ceil = items_total // items_per_page + bool(items_total % items_per_page)\n",
+    "    return max_page_ceil\n",
+    "\n",
+    "\n",
+    "# ---- example execution\n",
+    "list_of_pages=query_openaire_for_person2publications(example_orcid_id)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the resulting list of publications we extract and print out each title and DOI. \n",
+    "\n",
+    "*Note: publications that do not have a DOI assigned, will not be printed.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of publications found: 6\n",
+      "\n",
+      "10.15488/11463, Roadmap to FAIR Research Information in Open Infrastructures\n",
+      "10.1515/bd.2006.40.4.466, Informationsvermittlung: Personalisiertes Lernen in der Bibliothek: das Düsseldorfer Online-Tutorial (DOT) Informationskompetenz\n",
+      "10.1080/00048623.2006.10755322, Teaching Information Literacy with the Lerninformationssystem\n",
+      "10.3389/frma.2021.694307, Enhancing Knowledge Graph Extraction and Validation From Scholarly Publications Using Bibliographic Metadata\n",
+      "10.3897/rio.7.e66264, OPTIMETA – Strengthening the Open Access publishing system through open citations and spatiotemporal metadata\n",
+      "10.1016/j.procs.2019.01.074, The Research Core Dataset (KDSF) in the Linked Data context\n"
+     ]
+    }
+   ],
+   "source": [
+    "# from the result pages, extract the data about each publication\n",
+    "def extract_publications_from_page(page):\n",
+    "    return [pub for pub in benedict.from_json(page).get('response.results.result') or []]\n",
+    "\n",
+    "# extract DOI from publication\n",
+    "def extract_doi(pub):\n",
+    "    oaf_result=benedict.from_json(pub).get('metadata.oaf:entity.oaf:result')\n",
+    "\n",
+    "    # unfortunately the json data is inconsistently modeled:\n",
+    "    # if there is one pid/title for a publication, it is a json object\n",
+    "    # if there are multiple pids/titles for a publication, they form a json list\n",
+    "    pids=oaf_result.get('pid') or []\n",
+    "    is_doi = lambda pid: pid.get('@classid')==\"doi\"\n",
+    "    if isinstance(pids, list):\n",
+    "        dois=[pid['$'] for pid in pids if is_doi(pid)]\n",
+    "    else:\n",
+    "        dois= [pids['$']] if is_doi(pids) else []\n",
+    "    doi=dois[0] if dois else None # pick the first one\n",
+    "    \n",
+    "    titles=oaf_result.get('title') or []\n",
+    "    is_main_title = lambda title: title.get('@classid')==\"main title\"\n",
+    "    if isinstance(titles, list):\n",
+    "        main_titles=[title['$'] for title in titles if is_main_title(title)]\n",
+    "    else:\n",
+    "        main_titles=[titles['$']] if is_main_title(titles) else []\n",
+    "    title=main_titles[0] if main_titles else None  # pick the first one\n",
+    "\n",
+    "    return doi, title\n",
+    "\n",
+    "\n",
+    "#--- example execution\n",
+    "for page in list_of_pages or []:\n",
+    "    publications=extract_publications_from_page(page)\n",
+    "    print(f\"Number of publications found: {len(publications)}\\n\")\n",
+    "    for pub in publications:\n",
+    "        doi,title = extract_doi(pub)\n",
+    "        if doi:\n",
+    "            print(f\"{doi}, {title}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
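Note on the notebook above: two steps carry most of the logic, namely the page-count calculation and the handling of fields that are a JSON object when there is one value and a JSON list when there are several. A minimal offline sketch of both follows. The helper names (`as_list`, `first_value_with_classid`, `count_pages`) and the sample records are hypothetical; only the `@classid` and `$` keys are taken from the notebook code.

```python
def as_list(value):
    """Wrap a single JSON object in a list, pass lists through, map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def first_value_with_classid(entries, classid):
    """Return the '$' value of the first entry whose '@classid' matches, else None."""
    for entry in as_list(entries):
        if entry.get('@classid') == classid:
            return entry.get('$')
    return None


def count_pages(items_total, items_per_page):
    """Number of result pages needed, computed by integer ceiling division."""
    return (items_total + items_per_page - 1) // items_per_page


# made-up sample records, shaped like the 'pid' field handled in the notebook
single_pid = {'@classid': 'doi', '$': '10.1234/example'}
many_pids = [
    {'@classid': 'pmid', '$': '000000'},
    {'@classid': 'doi', '$': '10.1234/example'},
]

print(first_value_with_classid(single_pid, 'doi'))
print(first_value_with_classid(many_pids, 'doi'))
print(count_pages(6, 10), count_pages(25, 10))
```

Normalizing to a list first means the DOI and title extraction can share one code path instead of repeating the `isinstance` branch for each field.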
diff --git a/person-works/openalex_get_works_by_person.ipynb b/person-works/openalex_get_works_by_person.ipynb
@@ -44,7 +44,7 @@
     "id": "Rt7GUbFcaNxi"
    },
    "source": [
-    "The input for the query is an ORCID URL, e.g. '`https://orcid.org/0000-0003-2499-7741`'"
+    "The input for this notebook is an ORCID URL, e.g. '`https://orcid.org/0000-0003-2499-7741`'"
    ]
   },
   {

diff --git a/person-works/orcid_get_works_by_person.ipynb b/person-works/orcid_get_works_by_person.ipynb
@@ -8,7 +8,7 @@
    "source": [
     "### Query ORCID for works authored by a person\n",
     "\n",
-    "This notebook queries the [ORCID API](https://pub.orcid.org/v3.0/) to retrieve works listed in a person's ORCID record. It takes an ORCID URL or iD as input to retrieve the ORCID record of a person and the works listed on it. From the resulting list of works we output all DOIs."
+    "This notebook queries the [ORCID Public API](https://pub.orcid.org/v3.0/) to retrieve works listed in a person's ORCID record. It takes an ORCID URL or iD as input to retrieve the ORCID record of a person and the works listed on it. From the resulting list of works we output all DOIs."
    ]
   },
   {

diff --git a/work-projects/README.md b/work-projects/README.md
@@ -0,0 +1,6 @@
+## work-projects
+
+A Jupyter notebook showing an example of using a persistent identifier for a publication (DOI)
+as input for retrieving the project a work was produced in (identified by its OpenAIRE project ID).
+
+* [OpenAIRE](https://www.openaire.eu/)
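
Note on the README above: the work-projects notebook itself is not part of the hunks shown here, so the following is only a sketch of how its DOI input could be prepared for a request. It assumes the OpenAIRE `/publications` endpoint accepts a `doi` query parameter in the same way the person-works notebook uses `orcid`; the helper names `normalize_doi` and `build_doi_request` are hypothetical, and the extraction of project IDs from the response is deliberately left out because its structure is not shown in this diff.

```python
# endpoint URL as used in the person-works notebook of this commit
OPENAIRE_API_PUBLICATIONS = "https://api.openaire.eu/search/publications"


def normalize_doi(doi):
    """Strip a leading resolver prefix so that only the bare DOI remains."""
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            return doi[len(prefix):]
    return doi


def build_doi_request(doi):
    """Return the URL and query parameters for a DOI lookup.

    Assumption: 'doi' is a supported filter parameter of the endpoint.
    """
    params = {'doi': normalize_doi(doi), 'format': 'json'}
    return OPENAIRE_API_PUBLICATIONS, params


url, params = build_doi_request("https://doi.org/10.15488/11463")
print(url, params)
```

The returned pair can be passed to `requests.get(url=url, params=params)`, following the request pattern of the other notebooks.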