diff --git a/README.md b/README.md index e5b6724..5099ebf 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ [![DOI](https://zenodo.org/badge/447263093.svg)](https://zenodo.org/badge/latestdoi/447263093) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Project-TAPIR/pidgraph-notebooks/main) -A collection of Jupyter notebooks with examples of querying different PID providers like [ORCID](https://orcid.org/), [ROR](https://ror.readme.io/), [Crossref](https://www.crossref.org/) and PID graphs like the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/) and [OpenAlex](https://openalex.org/about) for connected objects. +A collection of Jupyter notebooks with examples of querying different PID providers like [ORCID](https://orcid.org/), [ROR](https://ror.readme.io/), [Crossref](https://www.crossref.org/) and PID graphs like the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/), [OpenAlex](https://openalex.org/about) and [OpenAIRE](https://www.openaire.eu/) for connected objects. Currently included connections: * organization-organization @@ -17,7 +17,11 @@ Currently included connections: * person-works * input: ORCID * output: list of works authored/created by the person, each identified by their DOI - * data sources: Crossref, FREYA PID Graph, OpenAlex, ORCID + * data sources: Crossref, FREYA PID Graph, OpenAlex, ORCID, OpenAIRE +* work-projects + * input: DOI + * output: list of projects the work was produced in, each identified by their OpenAIRE project ID + * data sources: OpenAIRE Please navigate into the respective folder to see the list of available notebooks. 
@@ -35,3 +39,5 @@ you can use this link to launch the notebooks on Binder where you can execute an In the joint project [TAPIR](https://projects.tib.eu/tapir/en/) (Partially Automated Persistent Identifier-based Reporting), partially automated procedures for research reporting are being tested in the context of university and non-university research. To this end, the question is being investigated : To what extent can the necessary data aggregation be carried out on the basis of openly available research information using persistent identifiers? + +*More information in our blog post "[Project TAPIR: Harvesting the power of PIDs](https://blogs.tib.eu/wp/tib/2022/03/01/project-tapir-harvesting-the-power-of-pids/)"* diff --git a/organization-people/orcid_get_people_by_organization.ipynb b/organization-people/orcid_get_people_by_organization.ipynb index cf3ca84..34d91a1 100644 --- a/organization-people/orcid_get_people_by_organization.ipynb +++ b/organization-people/orcid_get_people_by_organization.ipynb @@ -8,7 +8,7 @@ "source": [ "### Query ORCID for people affiliated with an organization and filter for current employees only\n", "\n", - "This notebook queries the [ORCID API](https://api.orcid.org/v3.0/) for all [people affiliated with an organization](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/) and additionally narrows down the affiliation to people **currently employed** by the organization. From the resulting list of people we output the ORCID iDs.\n", + "This notebook queries the [ORCID Public API](https://api.orcid.org/v3.0/) for all [people affiliated with an organization](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/) and additionally narrows down the affiliation to people **currently employed** by the organization. 
From the resulting list of people we output the ORCID iDs.\n", "\n", "*Disclosure:\n", "The process of querying the ROR API for additional identifiers and using them to query the ORCID API for affiliated people is the same as used by the [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/) and is implemented in [DataCite Application API](https://doi.org/10.5438/8gb0-v673).*" @@ -105,6 +105,7 @@ "\n", "#--- example execution\n", "ror_data=query_ror_api(example_ror)\n", + "organization_ror_id=example_ror.replace(\"https://ror.org/\", \"\")\n", "# if you want to see the retrieved metadata, uncomment next lines\n", "#import pprint\n", "#pprint.pprint(ror_data)" @@ -148,7 +149,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "grid ID: grid.461819.3\n", + "Grid ID: grid.461819.3\n", "Wikidata ID: Q2399120\n" ] } @@ -169,7 +170,7 @@ "\n", "#--- example execution\n", "organization_grid_id=extract_grid_from_ror_data(ror_data)\n", - "print(\"grid ID: \" + str(organization_grid_id or ''))\n", + "print(\"Grid ID: \" + str(organization_grid_id or ''))\n", "organization_wikidata_id=extract_wikidata_from_ror_data(ror_data)\n", "print(\"Wikidata ID: \" + str(organization_wikidata_id or ''))" ] @@ -261,7 +262,7 @@ }, "source": [ "### Connection organization -> people\n", - "The second part of the process is to query for the people affiliated with the organization. For this we use the ORCID API and search for people affiliated with an organization like it is explained in the ORCID tutorial [\"How do I find ORCID record holders at my institution?\"](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/). As parameters for the query we use the Grid ID and Ringgold ID for the organization.\n" + "The second part of the process is to query for the people affiliated with the organization. 
For this we use the ORCID API and search for people affiliated with an organization, as explained in the ORCID tutorial [\"How do I find ORCID record holders at my institution?\"](https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/). As parameters for the query we use the ROR ID, Grid ID and Ringgold ID for the organization.\n" ] }, { @@ -391,11 +392,13 @@ "# URL for ORCID search API\n", "ORCID_SEARCH_API = \"https://pub.orcid.org/v3.0/expanded-search/\"\n", "\n", - "# query ORCID with an organization's Grid ID and Ringgold\n", - "def query_orcid_for_affiliations(grid_id, ringgold_id):\n", - " query = f\"grid-org-id:{grid_id}\" if grid_id else \"\"\n", - " query += \" OR \" if grid_id and ringgold_id else \"\"\n", - " query += f\"ringgold-org-id:{ringgold_id}\" if ringgold_id else \"\"\n", + "# query ORCID with an organization's ROR, Grid and Ringgold IDs\n", + "def query_orcid_for_affiliations(ror_id, grid_id, ringgold_id):\n", + " grid_search = f\"grid-org-id:{grid_id}\" if grid_id else \"\"\n", + " ringgold_search = f\"ringgold-org-id:{ringgold_id}\" if ringgold_id else \"\"\n", + " ror_search = f\"ror-org-id:{ror_id}\" if ror_id else \"\"\n", + " orga_search_ids = [ror_search, grid_search, ringgold_search]\n", + " query = ' OR '.join(filter(None, orga_search_ids))\n", "\n", " response = requests.get(url=ORCID_SEARCH_API,\n", " params={'q': query},\n", @@ -415,7 +418,7 @@ "\n", "\n", "#-- example execution\n", - "affiliated_people = query_orcid_for_affiliations(organization_grid_id, organization_ringgold_id)\n", + "affiliated_people = query_orcid_for_affiliations(organization_ror_id, organization_grid_id, organization_ringgold_id)\n", "affiliated_count = affiliated_people.get('num-found','')\n", "print(f\"Number of affiliated people: {affiliated_count}\")\n", "\n", diff --git a/person-works/README.md b/person-works/README.md index 20dce81..3e9c108 100644 --- a/person-works/README.md +++ b/person-works/README.md @@ -6,4 +6,5 @@ 
Currently available PID Graphs: * [Crossref](https://www.crossref.org/) * [FREYA PID Graph](https://blog.datacite.org/powering-the-pid-graph/) * [OpenAlex](https://openalex.org/about) -* [ORCID](https://orcid.org/) \ No newline at end of file +* [ORCID](https://orcid.org/) +* [OpenAIRE](https://www.openaire.eu/) \ No newline at end of file diff --git a/person-works/openaire_get_publications_by_person.ipynb b/person-works/openaire_get_publications_by_person.ipynb new file mode 100644 index 0000000..4325fd5 --- /dev/null +++ b/person-works/openaire_get_publications_by_person.ipynb @@ -0,0 +1,198 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Query OpenAIRE for publications authored by a person\n", + "This notebook queries the [OpenAIRE HTTP API](https://graph.openaire.eu/develop/api.html) via its `/publications` endpoint for publications authored by a person. It takes an ORCID iD as input which is used to filter for publications where one of the creators' `orcid` field matches the given ORCID iD. From the resulting list of publications we output all DOIs.\n", + "\n", + "*Note:\n", + "The API has several different endpoints for research outputs: they are divided into publications, research data, software metadata and other research products, so to get a full picture about a person's research output, you would have to query all of these endpoints and union their results.*" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "# Prerequisites:\n", + "import requests # dependency for making HTTP calls\n", + "from benedict import benedict # dependency for dealing with json" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true, + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "The input for this notebook is an ORCID iD, e.g. '`0000-0003-2499-7741`'." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "# input parameter\n", + "example_orcid_id=\"0000-0003-2499-7741\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use it to query the OpenAIRE HTTP API for publications whose metadata specifies the ORCID iD in one of the creators' `orcid` fields. Since the API uses pagination, we need to loop through all pages to get the complete result set." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "# OpenAIRE endpoint to query for publications\n", + "OPENAIRE_API_PUBLICATIONS = \"https://api.openaire.eu/search/publications\"\n", + "\n", + "# query OpenAIRE for all publications that are connected to the given ORCID iD\n", + "def query_openaire_for_person2publications(orcid_id):\n", + " page = 1\n", + " max_page = 1\n", + "\n", + " while page <= max_page:\n", + " params = {'orcid': orcid_id, 'page': page, 'format': \"json\"}\n", + " response = requests.get(url=OPENAIRE_API_PUBLICATIONS,\n", + " params=params)\n", + " response.raise_for_status()\n", + " result=response.json()\n", + "\n", + " # calculate max page number in first loop\n", + " if max_page == 1:\n", + " max_page = determine_max_page(result)\n", + " page = page + 1\n", + " yield result\n", + "\n", + "# calculate max number of result pages\n", + "def determine_max_page(response_data):\n", + " response_dict = benedict.from_json(response_data)\n", + " items_total = response_dict.get('response.header.total.$')\n", + " items_per_page = response_dict.get('response.header.size.$')\n", + " max_page_ceil = items_total // items_per_page + bool(items_total % items_per_page)\n", + " return max_page_ceil\n", + "\n", + "\n", + "# ---- example execution\n", + "list_of_pages=query_openaire_for_person2publications(example_orcid_id)" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "From the resulting list of publications we extract and print out each title and DOI.\n", + "\n", + "*Note: publications that do not have a DOI assigned will not be printed.*" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of publications found: 6\n", + "\n", + "10.15488/11463, Roadmap to FAIR Research Information in Open Infrastructures\n", + "10.1515/bd.2006.40.4.466, Informationsvermittlung: Personalisiertes Lernen in der Bibliothek: das Düsseldorfer Online-Tutorial (DOT) Informationskompetenz\n", + "10.1080/00048623.2006.10755322, Teaching Information Literacy with the Lerninformationssystem\n", + "10.3389/frma.2021.694307, Enhancing Knowledge Graph Extraction and Validation From Scholarly Publications Using Bibliographic Metadata\n", + "10.3897/rio.7.e66264, OPTIMETA – Strengthening the Open Access publishing system through open citations and spatiotemporal metadata\n", + "10.1016/j.procs.2019.01.074, The Research Core Dataset (KDSF) in the Linked Data context\n" + ] + } + ], + "source": [ + "# from the result pages, extract the data about each publication\n", + "def extract_publications_from_page(page):\n", + " return [pub for pub in benedict.from_json(page).get('response.results.result') or []]\n", + "\n", + "# extract DOI and main title from publication\n", + "def extract_doi(pub):\n", + " oaf_result=benedict.from_json(pub).get('metadata.oaf:entity.oaf:result')\n", + "\n", + " # unfortunately the json data is inconsistently modeled:\n", + " # if there is one pid/title for a publication, it is a json object\n", + " # if there are multiple pids/titles for a publication, they form a json list\n", + " pids=oaf_result.get('pid') or []\n", + " is_doi = lambda pid: pid.get('@classid')==\"doi\"\n", + " if isinstance(pids, list):\n", + " dois=[pid['$'] for pid in pids if is_doi(pid)]\n", + " else:\n", + 
" dois= [pids['$']] if is_doi(pids) else []\n", + " doi=dois[0] if dois else None # pick the first one\n", + " \n", + " titles=oaf_result.get('title') or []\n", + " is_main_title = lambda title: title.get('@classid')==\"main title\"\n", + " if isinstance(titles, list):\n", + " main_titles=[title['$'] for title in titles if is_main_title(title)]\n", + " else:\n", + " main_titles=[titles['$']] if is_main_title(titles) else []\n", + " title=main_titles[0] if main_titles else None # pick the first one\n", + "\n", + " return doi, title\n", + "\n", + "\n", + "#--- example execution\n", + "for page in list_of_pages or []:\n", + " publications=extract_publications_from_page(page)\n", + " print(f\"Number of publications found: {len(publications)}\\n\")\n", + " for pub in publications:\n", + " doi,title = extract_doi(pub)\n", + " if doi:\n", + " print(f\"{doi}, {title}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} \ No newline at end of file diff --git a/person-works/openalex_get_works_by_person.ipynb b/person-works/openalex_get_works_by_person.ipynb index 0a3cfd8..4e9e446 100644 --- a/person-works/openalex_get_works_by_person.ipynb +++ b/person-works/openalex_get_works_by_person.ipynb @@ -44,7 +44,7 @@ "id": "Rt7GUbFcaNxi" }, "source": [ - "The input for the query is an ORCID URL, e.g. '`https://orcid.org/0000-0003-2499-7741`'" + "The input for this notebook is an ORCID URL, e.g. 
'`https://orcid.org/0000-0003-2499-7741`'" ] }, { diff --git a/person-works/orcid_get_works_by_person.ipynb b/person-works/orcid_get_works_by_person.ipynb index cc123dd..095c40b 100644 --- a/person-works/orcid_get_works_by_person.ipynb +++ b/person-works/orcid_get_works_by_person.ipynb @@ -8,7 +8,7 @@ "source": [ "### Query ORCID for works authored by a person\n", "\n", - "This notebook queries the [ORCID API](https://pub.orcid.org/v3.0/) to retrieve works listed in a person's ORCID record. It takes an ORCID URL or iD as input to retrieve the ORCID record of a person and the works listed on it. From the resulting list of works we output all DOIs." + "This notebook queries the [ORCID Public API](https://pub.orcid.org/v3.0/) to retrieve works listed in a person's ORCID record. It takes an ORCID URL or iD as input to retrieve the ORCID record of a person and the works listed on it. From the resulting list of works we output all DOIs." ] }, { diff --git a/work-projects/README.md b/work-projects/README.md new file mode 100644 index 0000000..fd1ad91 --- /dev/null +++ b/work-projects/README.md @@ -0,0 +1,6 @@ +## work-projects + +A Jupyter notebook showing an example of using a persistent identifier for a publication (DOI) +as input for retrieving the project a work was produced in (identified by its OpenAIRE project ID). 
+ +* [OpenAIRE](https://www.openaire.eu/) \ No newline at end of file diff --git a/work-projects/openaire_get_projects_by_work.ipynb b/work-projects/openaire_get_projects_by_work.ipynb new file mode 100644 index 0000000..153a74c --- /dev/null +++ b/work-projects/openaire_get_projects_by_work.ipynb @@ -0,0 +1,215 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "collapsed": true, + "pycharm": { + "is_executing": true + } + }, + "source": [ + "### Query OpenAIRE for the project(s) a publication was produced in\n", + "This notebook queries the [OpenAIRE HTTP API](https://graph.openaire.eu/develop/api.html) for the project(s) a publication was produced in. It takes a DOI as input, retrieves the publication's metadata via the API's `/publications` endpoint, and checks whether there is an `'isProducedBy'` relation to a project. If so, each project's ID is used to query the API via its `/projects` endpoint, and the title, code, call identifier and funded amount of the project are printed." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Prerequisites:\n", + "import requests # dependency for making HTTP calls\n", + "from benedict import benedict # dependency for dealing with json" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The input for this notebook is a DOI, e.g. '`10.1007/978-3-030-74296-6_19`'." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# input parameter\n", + "example_doi=\"10.1007/978-3-030-74296-6_19\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use it to query the OpenAIRE HTTP API for the specified publication and its metadata. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# OpenAIRE endpoint to query for publications\n", + "OPENAIRE_API_PUBLICATIONS = \"https://api.openaire.eu/search/publications\"\n", + "\n", + "# query OpenAIRE for a specific publication\n", + "def query_openaire_for_publication(doi):\n", + " params = {'doi': doi, 'format': \"json\"}\n", + " response = requests.get(url=OPENAIRE_API_PUBLICATIONS,\n", + " params=params)\n", + " response.raise_for_status()\n", + " result=response.json()\n", + " return result\n", + "\n", + "\n", + "# ---- example execution\n", + "pub_response=query_openaire_for_publication(example_doi)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the complete response we get from the API, we extract the metadata for the specified publication.\n", + "If the metadata contains a reference to a project within the list of relations (`'rels'`), then extract the project's ID." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['corda__h2020::c6af905285a4bcd97a2fdf7cadc3cf3a']\n" + ] + } + ], + "source": [ + "# extract the metadata about the publication from the response\n", + "path_to_result='response.results.result[0].metadata.oaf:entity.oaf:result'\n", + "oaf_result=benedict.from_json(pub_response).get(path_to_result, {})\n", + "\n", + "# extract the metadata about relations\n", + "# and check for each rel, if it is pointing to a project\n", + "rels=oaf_result.get('rels.rel') or []\n", + "is_rel_to_project = lambda rel: rel['to']['@class']==\"isProducedBy\" and rel['to']['@type']==\"project\"\n", + "\n", + "# unfortunately the json data is inconsistently modeled:\n", + "# if there is one rel for a publication, it is a json object\n", + "# if there are multiple rels for a publication, they form a json list\n", + "if isinstance(rels, list):\n", + " 
project_ids=[rel['to']['$'] for rel in rels if is_rel_to_project(rel)]\n", + "else:\n", + " project_ids= [rels['to']['$']] if is_rel_to_project(rels) else []\n", + "\n", + "print(project_ids)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For each project ID, we query the OpenAIRE HTTP API via its `/projects` endpoint for the project's metadata." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# OpenAIRE endpoint to query for projects\n", + "OPENAIRE_API_PROJECTS = \"https://api.openaire.eu/search/projects\"\n", + "\n", + "# query OpenAIRE for a specific project\n", + "def query_openaire_for_project(openaire_project_id):\n", + " params = {'openaireProjectID': openaire_project_id, 'format': \"json\"}\n", + " response = requests.get(url=OPENAIRE_API_PROJECTS,\n", + " params=params)\n", + " response.raise_for_status()\n", + " result=response.json()\n", + " return result\n", + "\n", + "\n", + "# ---- example execution\n", + "project_responses=[query_openaire_for_project(project_id) for project_id in project_ids]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's extract and print each project's title, code, call identifier and funded amount." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Project data:\n", + " code: 819536\n", + " title: Knowledge Graph based Representation, Augmentation and Exploration of Scholarly Communication\n", + " callidentifier: ERC-2018-COG\n", + " fundedamount:1996250.0 EUR\n", + "\n" + ] + } + ], + "source": [ + "def extract_data_from_project(project_response):\n", + " path_to_project='response.results.result[0].metadata.oaf:entity.oaf:project'\n", + " oaf_project=benedict.from_json(project_response).get(path_to_project, {})\n", + " \n", + " title=oaf_project.get('title.$')\n", + " code=oaf_project.get('code.$')\n", + " callidentifier=oaf_project.get('callidentifier.$')\n", + " fundedamount=oaf_project.get('fundedamount.$')\n", + " currency=oaf_project.get('currency.$')\n", + " return title, code, callidentifier, f\"{fundedamount} {currency}\"\n", + "\n", + "\n", + "# ---- example execution\n", + "if (not project_responses):\n", + " print(\"No projects associated with publication\")\n", + "for project in project_responses:\n", + " title, code, callidentifier, fundedamount = extract_data_from_project(project)\n", + " print(\"Project data:\")\n", + " print(f\" code: {code}\\n title: {title}\\n callidentifier: {callidentifier}\\n fundedamount:{fundedamount}\\n\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} \ No newline at end of file
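
Both OpenAIRE notebooks in this diff note that a field such as `pid`, `title` or `rels.rel` is a JSON object when it holds one value and a JSON list when it holds several, and each handles that with its own `isinstance` branch. A minimal sketch of how that check could be factored into one helper follows. The names `as_list` and `first_value_with_classid` are hypothetical and not part of the notebooks, and the sample records are made up to mimic the two shapes described in the notebook comments; they are not real OpenAIRE responses.

```python
# Hypothetical helpers (not part of the notebooks) for fields that are
# either a single JSON object or a list of JSON objects.

def as_list(value):
    """Wrap a single object in a list; map None to an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def first_value_with_classid(field, classid):
    """Return the '$' value of the first entry whose '@classid' matches, else None."""
    for entry in as_list(field):
        if entry.get('@classid') == classid:
            return entry.get('$')
    return None


# Made-up sample records that mimic the two shapes (assumption, not API output)
single = {'pid': {'@classid': 'doi', '$': '10.1234/example.1'}}
multiple = {'pid': [{'@classid': 'pmid', '$': '000000'},
                    {'@classid': 'doi', '$': '10.1234/example.2'}]}

print(first_value_with_classid(single.get('pid'), 'doi'))
print(first_value_with_classid(multiple.get('pid'), 'doi'))
print(first_value_with_classid(None, 'doi'))
```

With a helper like this, the DOI and main-title lookups in `extract_doi` would share one code path instead of two parallel `if isinstance(...)` branches; whether that is worth the indirection in a teaching notebook is a judgment call.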