**Accessing and Using OpenAlex**
     
    


This guide explains how to access OpenAlex, a large dataset of research works, researchers, venues and institutions.  
As of Aug 1st 2022, there are two ways to access OpenAlex:
- downloading the entire dataset
- using the official API

In this tutorial, we will walk through both approaches.

To download a snapshot of the OpenAlex dataset, you can follow the official tutorial here: https://docs.openalex.org/download-snapshot    
The dataset is stored in Amazon S3 as gzip-compressed JSON Lines files. By following the tutorial, you can reformat the JSON files downloaded from S3 and load the dataset into either a data warehouse (e.g. BigQuery) or a relational database (e.g. PostgreSQL). For this summer school, we created a sample of the dataset under the directory "./datasets/s4/openalex_sample/".
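
To make the "gzip-compressed JSON lines" format concrete, here is a minimal sketch of reading such a file with the Python standard library. The helper name `read_jsonl_gz`, the file name `part_000.gz` and the two sample records are hypothetical; they only mimic the shape of a snapshot file so the example runs without downloading anything.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path, limit=None):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            # Stop early once `limit` lines have been read (None means no cap)
            if limit is not None and line_number >= limit:
                break
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical stand-in records, written to a temporary .gz file
sample_records = [
    {"id": "https://openalex.org/W1", "publication_year": 2016},
    {"id": "https://openalex.org/W2", "publication_year": 2017},
]
sample_path = os.path.join(tempfile.mkdtemp(), "part_000.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in sample_records:
        handle.write(json.dumps(record) + "\n")

loaded = list(read_jsonl_gz(sample_path))
print(len(loaded), loaded[0]["id"])
```

Reading line by line keeps memory use low, which matters because the real snapshot files are far larger than this stand-in.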

In [1]:
# Loading Python Packages
import numpy as np
import pandas as pd

In [2]:
# List all the files under the directory. All the tables are stored as Parquet files,
# which we can load with Pandas and PyArrow.
!ls ./datasets/s4/openalex_sample/

authors_counts_by_year.parquet
authors_ids.parquet
authors.parquet
concepts_ancestors.parquet
concepts_counts_by_year.parquet
concepts_ids.parquet
concepts.parquet
concepts_related_concepts.parquet
institutions_associated_institutions.parquet
institutions_counts_by_year.parquet
institutions_geo.parquet
institutions_ids.parquet
institutions.parquet
venues_counts_by_year.parquet
venues_ids.parquet
venues.parquet
works_alternate_host_venues.parquet
works_authorships.parquet
works_biblio.parquet
works_concepts.parquet
works_ids.parquet
works_mesh.parquet
works_open_access.parquet
works.parquet
works_referenced_works.parquet
works_related_works.parquet


In [3]:
!pip install pyarrow



In [4]:
authors = pd.read_parquet('./datasets/s4/openalex_sample/authors.parquet')
works = pd.read_parquet('./datasets/s4/openalex_sample/works.parquet')

In [5]:
authors.head()

Unnamed: 0,id,orcid,display_name,display_name_alternatives,works_count,cited_by_count,last_known_institution,works_api_url,updated_date
0,https://openalex.org/A102677832,,Fausto Arellano-Carbajal,[],4,207,https://openalex.org/I170203145,https://api.openalex.org/works?filter=author.i...,2022-02-28
1,https://openalex.org/A102677832,,Fausto Arellano-Carbajal,[],4,207,https://openalex.org/I170203145,https://api.openalex.org/works?filter=author.i...,2022-02-28
2,https://openalex.org/A102677832,,Fausto Arellano-Carbajal,[],4,207,https://openalex.org/I170203145,https://api.openalex.org/works?filter=author.i...,2022-02-28
3,https://openalex.org/A104275227,,Benedick A. Fraass,[],280,8629,https://openalex.org/I1282927834,https://api.openalex.org/works?filter=author.i...,2022-03-09
4,https://openalex.org/A10451749,https://orcid.org/0000-0002-6418-2793,Ron O. Dror,[],206,25004,https://openalex.org/I97018004,https://api.openalex.org/works?filter=author.i...,2022-03-09


In [6]:
works.head()

Unnamed: 0,id,doi,title,display_name,publication_year,publication_date,type,cited_by_count,is_retracted,is_paratext,cited_by_api_url,abstract_inverted_index
0,https://openalex.org/W2320228714,https://doi.org/10.18632/oncotarget.8338,Upregulation of long intergenic noncoding RNA ...,Upregulation of long intergenic noncoding RNA ...,2016,2016-03-24,journal-article,52,False,False,https://api.openalex.org/works?filter=cites:W2...,"""{""""//"""": [0]"
1,https://openalex.org/W2320401797,https://doi.org/10.1109/access.2016.2548980,Ubiquitous Biofeedback Serious Game for Stress...,Ubiquitous Biofeedback Serious Game for Stress...,2016,2016-03-31,journal-article,35,False,False,https://api.openalex.org/works?filter=cites:W2...,"""{""""Serious"""": [0]"
2,https://openalex.org/W2320853052,https://doi.org/10.18632/oncotarget.7745,Suppression of miR-204 enables oral squamous c...,Suppression of miR-204 enables oral squamous c...,2016,2016-02-26,journal-article,51,False,False,https://api.openalex.org/works?filter=cites:W2...,"""{""""The"""": [0]"
3,https://openalex.org/W2321209815,https://doi.org/10.18632/oncotarget.8553,Engineered nanoparticles induce cell apoptosis...,Engineered nanoparticles induce cell apoptosis...,2016,2016-04-02,journal-article,53,False,False,https://api.openalex.org/works?filter=cites:W2...,"""{""""Engineered"""": [0]"
4,https://openalex.org/W2322514663,https://doi.org/10.1016/j.procs.2016.02.014,Automated Discovery of JavaScript Code Injecti...,Automated Discovery of JavaScript Code Injecti...,2016,2016-03-01,journal-article,15,False,False,https://api.openalex.org/works?filter=cites:W2...,"""{""""This"""": [0]"
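
The `abstract_inverted_index` column above does not hold plain text: OpenAlex stores each abstract as a mapping from a word to the positions where it occurs. The sketch below rebuilds readable text from such a mapping. It assumes the column holds either a dict or its JSON text (the sample output suggests the latter); the helper name and the example index are hypothetical.

```python
import json


def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if inverted_index is None:
        return None
    # Assumption: the value may arrive as JSON text rather than a dict
    if isinstance(inverted_index, str):
        inverted_index = json.loads(inverted_index)
    positioned_words = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order
    return " ".join(word for _, word in sorted(positioned_words))


# Hypothetical index, not taken from the dataset
example_index = {"the": [0, 3], "cat": [1], "chased": [2], "mouse": [4]}
print(abstract_from_inverted_index(example_index))
```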


More details about data manipulation and processing with Pandas will be introduced in tomorrow's section.

While it is more convenient to download the entire snapshot to your computer, querying the dataset requires a lot of computational power and disk space (about 300 GB). When computational resources are limited, accessing OpenAlex through the official API could be a better choice.

In [7]:
import requests
import json

The API (Application Programming Interface) is a convenient way to get OpenAlex data. To use an API, we send requests to the server hosting the OpenAlex data. One way to do that is with the "requests" package imported above.

To get a single entity from OpenAlex, we need to construct a URL of the form:
```
https://api.openalex.org/<entity_name>/<entity_id>
```    
    
To give an example, let's retrieve a research work using the API:

In [8]:
works = pd.read_parquet('./datasets/s4/openalex_sample/works.parquet')

In [9]:
# Pick the first research work from the table "works"
work_to_retrieve = works.loc[0]

In [10]:
print('OpenAlex ID: ' + work_to_retrieve['id'] + '\n')
print('Title: ' + work_to_retrieve['title'] + '\n')
print('Publication Year: ' + str(work_to_retrieve['publication_year']) + '\n')
print('Cited by: ' + str(work_to_retrieve['cited_by_count']))

OpenAlex ID: https://openalex.org/W2320228714

Title: Upregulation of long intergenic noncoding RNA 00673 promotes tumor proliferation via LSD1 interaction and repression of NCALD in non-small-cell lung cancer

Publication Year: 2016

Cited by: 52


The last part of the OpenAlex ID (W2320228714) is the "entity_id" for the API URL.
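
The mapping from a full OpenAlex ID to an API URL can be sketched as two small helpers. The names `short_id` and `entity_url` are hypothetical; the URL pattern is the `https://api.openalex.org/<entity_name>/<entity_id>` form given earlier.

```python
API_BASE = "https://api.openalex.org"


def short_id(openalex_id):
    """Return the last path segment of an OpenAlex ID, e.g. 'W2320228714'."""
    return openalex_id.rstrip("/").rsplit("/", 1)[-1]


def entity_url(entity_name, openalex_id):
    """Build the API URL for a single entity from its full or short ID."""
    return f"{API_BASE}/{entity_name}/{short_id(openalex_id)}"


print(entity_url("works", "https://openalex.org/W2320228714"))
# prints https://api.openalex.org/works/W2320228714
```

Because `short_id` leaves an already-short ID unchanged, the same helper works for values copied from either the `id` column or the API URL.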

In [11]:
print(works.loc[0, 'id'])

https://openalex.org/W2320228714


By making an API request, we receive a JSON response. We then parse it and store it as a Python dictionary.

In [12]:
work_W2320228714 = requests.get(
    'https://api.openalex.org/works/W2320228714'
).json()

In [13]:
work_W2320228714

{'id': 'https://openalex.org/W2320228714',
 'doi': 'https://doi.org/10.18632/oncotarget.8338',
 'title': 'Upregulation of long intergenic noncoding RNA 00673 promotes tumor proliferation via LSD1 interaction and repression of NCALD in non-small-cell lung cancer',
 'display_name': 'Upregulation of long intergenic noncoding RNA 00673 promotes tumor proliferation via LSD1 interaction and repression of NCALD in non-small-cell lung cancer',
 'publication_year': 2016,
 'publication_date': '2016-03-24',
 'ids': {'openalex': 'https://openalex.org/W2320228714',
  'doi': 'https://doi.org/10.18632/oncotarget.8338',
  'mag': '2320228714',
  'pmid': 'https://pubmed.ncbi.nlm.nih.gov/27027352',
  'pmcid': 'https://www.ncbi.nlm.nih.gov/pmc/articles/5041926'},
 'host_venue': {'id': 'https://openalex.org/V126644158',
  'issn_l': '1949-2553',
  'issn': ['1949-2553'],
  'display_name': 'Oncotarget',
  'publisher': 'Impact Journals, LLC',
  'type': 'publisher',
  'url': 'https://doi.org/10.18632/oncotarget

In [14]:
print('OpenAlex ID: ' + work_W2320228714['id'] + '\n')
print('Title: ' + work_W2320228714['title'] + '\n')
print('Publication Year: ' + str(work_W2320228714['publication_year']) + '\n')
print('First Author: ' + [a['author']['display_name'] for a in work_W2320228714['authorships'] if (a['author_position'] == 'first')][0] + '\n')
print('Cited by: ' + str(work_W2320228714['cited_by_count']))

OpenAlex ID: https://openalex.org/W2320228714

Title: Upregulation of long intergenic noncoding RNA 00673 promotes tumor proliferation via LSD1 interaction and repression of NCALD in non-small-cell lung cancer

Publication Year: 2016

First Author: Xuefei Shi

Cited by: 52
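
The list comprehension used for the first author raises an `IndexError` when a work has no authorship marked as first. A more defensive variant might look like the sketch below. The helper name `first_author_name` is hypothetical, and the example dictionary is a trimmed stand-in for the API response: it keeps only the keys used here, and the second author is invented for illustration.

```python
def first_author_name(work):
    """Return the display name of the first author, or None if there is none."""
    for authorship in work.get("authorships") or []:
        if authorship.get("author_position") == "first":
            return (authorship.get("author") or {}).get("display_name")
    return None


# Trimmed stand-in for a work record; the second author is hypothetical
example_work = {
    "authorships": [
        {"author_position": "first", "author": {"display_name": "Xuefei Shi"}},
        {"author_position": "last", "author": {"display_name": "Hypothetical Coauthor"}},
    ]
}
print(first_author_name(example_work))
print(first_author_name({"authorships": []}))
```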


We could do the same with other entities. For example, we can retrieve a venue, but this time, let's try getting a random venue:

In [15]:
random_venue = requests.get(
    'https://api.openalex.org/venues/random'
).json()

In [16]:
random_venue

{'id': 'https://openalex.org/V2764511345',
 'issn_l': '1450-2267',
 'issn': ['1450-2267'],
 'display_name': 'European journal of social sciences',
 'publisher': None,
 'works_count': 830,
 'cited_by_count': 2024,
 'is_oa': None,
 'is_in_doaj': None,
 'homepage_url': None,
 'ids': {'openalex': 'https://openalex.org/V2764511345',
  'issn_l': '1450-2267',
  'mag': '2764511345',
  'issn': ['1450-2267']},
 'x_concepts': [{'id': 'https://openalex.org/C17744445',
   'wikidata': 'https://www.wikidata.org/wiki/Q36442',
   'display_name': 'Political science',
   'level': 0,
   'score': 68.9},
  {'id': 'https://openalex.org/C199539241',
   'wikidata': 'https://www.wikidata.org/wiki/Q7748',
   'display_name': 'Law',
   'level': 1,
   'score': 63.9},
  {'id': 'https://openalex.org/C162324750',
   'wikidata': 'https://www.wikidata.org/wiki/Q8134',
   'display_name': 'Economics',
   'level': 0,
   'score': 59.4},
  {'id': 'https://openalex.org/C144024400',
   'wikidata': 'https://www.wikidata.org/wik

In many cases, instead of getting only one thing at a time, we want to get a list of things. In the code cell below, we ask for all concepts instead of specifying a concept by its ID. The API returns a meta object with details about the query, along with a long list of concepts.

In [17]:
requests.get(
    'https://api.openalex.org/concepts'
).json()

{'meta': {'count': 65073,
  'db_response_time_ms': 28,
  'page': 1,
  'per_page': 25},
 'results': [{'id': 'https://openalex.org/C41008148',
   'wikidata': 'https://www.wikidata.org/wiki/Q21198',
   'display_name': 'Computer science',
   'level': 0,
   'description': 'theoretical study of the formal foundation enabling the automated processing or computation of information, for example on a computer or over a data transmission network',
   'works_count': 40334190,
   'cited_by_count': 218510853,
   'ids': {'openalex': 'https://openalex.org/C41008148',
    'wikidata': 'https://www.wikidata.org/wiki/Q21198',
    'mag': '41008148',
    'wikipedia': 'https://en.wikipedia.org/wiki/Computer%20science',
    'umls_cui': ['C0599726']},
   'image_url': 'https://upload.wikimedia.org/wikipedia/commons/6/6a/Sorting_quicksort_anim.gif',
   'image_thumbnail_url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/6/6a/Sorting_quicksort_anim.gif/100px-Sorting_quicksort_anim.gif',
   'international'
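
The `meta` object above reports 65073 concepts at 25 per page, so a single request returns only a small slice of the list. Here is a minimal sketch of working out how many pages there are and building their URLs. It assumes the `page` and `per-page` query parameters described in the OpenAlex API documentation, which are not used elsewhere in this notebook; `page_urls` is a hypothetical helper name.

```python
import math
from urllib.parse import urlencode


def page_urls(entity_name, total_count, per_page=25, max_pages=None):
    """Return the list URLs needed to cover `total_count` results."""
    n_pages = math.ceil(total_count / per_page)
    if max_pages is not None:
        n_pages = min(n_pages, max_pages)
    return [
        f"https://api.openalex.org/{entity_name}?"
        + urlencode({"page": page, "per-page": per_page})
        for page in range(1, n_pages + 1)
    ]


# Count and page size taken from the meta object shown above
print(math.ceil(65073 / 25))  # prints 2603
for url in page_urls("concepts", total_count=65073, max_pages=2):
    print(url)
```

Each URL can then be fetched with `requests.get(url).json()` and the `results` lists concatenated.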

To get a meaningful list of entity objects, we need to add parameters to ```filter```, ```search``` and ```sort``` the returned result.

Filter parameters are formatted as: ```filter=attribute:value,attribute2:value2```    
For instance, to get all the level-0 concepts:

In [18]:
level_zero_concepts = requests.get(
    'https://api.openalex.org/concepts?filter=level:0'
).json()

In [19]:
for c in level_zero_concepts['results']:
    print(c['display_name'])

Computer science
Medicine
Chemistry
Psychology
Biology
Political science
Materials science
Art
Business
Geography
Physics
Environmental science
Mathematics
Philosophy
Sociology
History
Geology
Engineering
Economics


Getting venues that have published more than 1000 research works:

In [20]:
big_venues = requests.get(
    'https://api.openalex.org/venues?filter=works_count:>1000'
).json()

In [21]:
for v in big_venues['results']:
    print(v['display_name'])

Social Science Research Network
Research Papers in Economics
ChemInform
Lecture Notes in Computer Science
The Lancet
BMJ
Nature
Science
Notes and Queries
PLOS ONE
JAMA
Bulletin of the American Physical Society
Social Science Research Network
Reactions Weekly
Journal of the American Chemical Society
Journal of Biological Chemistry
Physical Review B
Scientific American
Choice Reviews Online
Chemical & Engineering News
Journal of physics
Blood
Scientific Reports
Journal of the Acoustical Society of America
Proceedings of the National Academy of Sciences of the United States of America


Institutions that are outside of the US:

In [22]:
non_us_institutions = requests.get(
    'https://api.openalex.org/institutions?filter=country_code:!us'
).json()

In [23]:
for i in non_us_institutions['results']:
    print(i['display_name'])

University of Tokyo
University of Toronto
University of Oxford
University of Cambridge
Tsinghua University
University of British Columbia
Kyoto University
University College London
Imperial College London
Tohoku University
Universidade de São Paulo
Zhejiang University
Osaka University
University of Liège
Shanghai Jiao Tong University
McGill University
Sapienza University of Rome
University of Alberta
National University of Singapore
University of Melbourne
Kyushu University
University of Manchester
University of Queensland
University of Sydney
National Autonomous University of Mexico
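
The three filter examples above all follow the `attribute:value` pattern, with operators such as `>` and `!` written as part of the value. Here is a minimal sketch of building such filter strings from a dictionary, which avoids hand-editing URLs. `build_filter` is a hypothetical helper, not part of any OpenAlex client library.

```python
def build_filter(conditions):
    """Join {attribute: value} pairs into 'attribute:value,attribute2:value2'."""
    return ",".join(f"{attribute}:{value}" for attribute, value in conditions.items())


print(build_filter({"level": 0}))
# prints level:0
print(build_filter({"works_count": ">1000"}))
# prints works_count:>1000
print(build_filter({"publication_year": ">2010", "type": "journal-article"}))
# prints publication_year:>2010,type:journal-article
```

The resulting string can be passed to `requests.get` through its `params` argument, for example `requests.get('https://api.openalex.org/concepts', params={'filter': build_filter({'level': 0})})`, which lets `requests` handle the URL encoding.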


We can also search within a field, for example authors whose display name contains "leibniz":

In [24]:
authors_named_leibniz = requests.get('https://api.openalex.org/authors?filter=display_name.search:leibniz').json()

In [25]:
for a in authors_named_leibniz['results']:
    print(a['display_name'])

Gottfried Wilhelm Leibniz
Leibniz, Gottfried Wilhelm, Freiherr von
Leibniz Hang
Leibniz-Informationszentrum Wirtschaft
Leibniz, Gottfried Wilhelm, Freiherr von
Freiherr von Leibniz
Gottfried Wilhelm Leibniz
Leibniz
Gottfried Wilhelm Leibniz
Wilhelm Leibniz
Gottfried Wilhelm Leibniz
Leibniz-Informationszentrum Wirtschaft
Leibniz Universit
Leibniz-Informationszentrum Wirtschaft
Gottfried Wilhelm Leibniz
Otto Leibniz
Gottfried Wilhelm Leibniz
Gesis Leibniz-Institut für Sozialwissenschaften
Leibniz, Gottfried Wilhelm, Freiherr von
Gottfried Wilhelm Leibniz
Leibniz, Gottfried Wilhelm, Freiherr von
Leibniz, Gottfried Wilhelm, Freiherr von
Leibniz
Leibniz-Institut für Länderkunde Leipzig
Gottfried Wilhelm Leibniz

The ```sort``` parameter orders the results, and it can be combined with ```filter```. For example, the most cited works published after 2010:


In [26]:
most_popular_after_2010 = requests.get(
    'https://api.openalex.org/works?sort=cited_by_count:desc&filter=publication_year:>2010'
).json()

In [27]:
for w in most_popular_after_2010['results']:
    print('Work Name: ' + w['display_name'])
    print('Cited Times: ' + str(w['cited_by_count']) + '\n')

Work Name: R: A language and environment for statistical computing.
Cited Times: 129391

Work Name: Diagnostic and Statistical Manual of Mental Disorders
Cited Times: 110641

Work Name: Deep Residual Learning for Image Recognition
Cited Times: 82349

Work Name: Adam: A Method for Stochastic Optimization
Cited Times: 48873

Work Name: ImageNet Classification with Deep Convolutional Neural Networks
Cited Times: 47116

Work Name: CRC Handbook of Chemistry and Physics
Cited Times: 45113

Work Name: Hallmarks of cancer: the next generation.
Cited Times: 40467

Work Name: Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries
Cited Times: 39534

Work Name: Very Deep Convolutional Networks for Large-Scale Image Recognition
Cited Times: 37620

Work Name: Deep learning
Cited Times: 36889

Work Name: LIBSVM: A library for support vector machines
Cited Times: 36682

Work Name: Fitting Linear Mixed-Effects Models Using lme4
Cited Time

In some cases, we want to group the returned results into facets. The ```group_by``` parameter is available for this task.    
For example, suppose we want to know the distribution of open access status in OpenAlex:

In [28]:
works_oa = requests.get('https://api.openalex.org/works?group_by=oa_status').json()

In [29]:
works_oa

{'meta': {'count': 6, 'db_response_time_ms': 981, 'page': 1, 'per_page': 200},
 'results': [],
 'group_by': [{'key': 'unknown',
   'key_display_name': 'unknown',
   'count': 108836134},
  {'key': 'closed', 'key_display_name': 'closed', 'count': 91761584},
  {'key': 'bronze', 'key_display_name': 'bronze', 'count': 13496370},
  {'key': 'gold', 'key_display_name': 'gold', 'count': 12663915},
  {'key': 'green', 'key_display_name': 'green', 'count': 7937031},
  {'key': 'hybrid', 'key_display_name': 'hybrid', 'count': 4212965}]}
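
Raw counts are hard to compare at a glance, so it can help to turn the `group_by` list into shares of the total. The sketch below does this for the response shown above. `group_shares` is a hypothetical helper, and the example dictionary copies only the `key` and `count` fields from that output.

```python
def group_shares(response):
    """Map each group key to its fraction of the total count."""
    groups = response["group_by"]
    total = sum(group["count"] for group in groups)
    return {group["key"]: group["count"] / total for group in groups}


# Counts copied from the works_oa output above
works_oa_example = {
    "group_by": [
        {"key": "unknown", "count": 108836134},
        {"key": "closed", "count": 91761584},
        {"key": "bronze", "count": 13496370},
        {"key": "gold", "count": 12663915},
        {"key": "green", "count": 7937031},
        {"key": "hybrid", "count": 4212965},
    ]
}

shares = group_shares(works_oa_example)
for key, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{key:>8}: {share:.1%}")
```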

We can also combine ```group_by``` with ```filter```. For example, to get the distribution of open access status for journal articles in OpenAlex:

In [30]:
journal_works_oa = requests.get('https://api.openalex.org/works?filter=type:journal-article&group_by=oa_status').json()

In [31]:
journal_works_oa

{'meta': {'count': 6, 'db_response_time_ms': 846, 'page': 1, 'per_page': 200},
 'results': [],
 'group_by': [{'key': 'closed',
   'key_display_name': 'closed',
   'count': 61898016},
  {'key': 'unknown', 'key_display_name': 'unknown', 'count': 24022732},
  {'key': 'gold', 'key_display_name': 'gold', 'count': 12146871},
  {'key': 'bronze', 'key_display_name': 'bronze', 'count': 11830219},
  {'key': 'green', 'key_display_name': 'green', 'count': 5664621},
  {'key': 'hybrid', 'key_display_name': 'hybrid', 'count': 3203087}]}