---
layout: default
title: APCs Tracking
parent: OpenAlex
nav_order: 2
---

# Tracking Article Processing Charges (APCs) for a given institution

In this notebook, we will query the OpenAlex API to answer the following questions:  

1. **How much are researchers at my institution paying in APCs?**
2. **Which publishers are collecting the most APCs from researchers at my institution?**
3. **How much money are my organization’s researchers saving in discounted APC charges from our transformative/read-publish agreements?**

Most organizations do not have an effective way of tracking the APCs that their researchers pay to publish in open access journals.  By estimating how much money is going to APCs each year, and which publishers are collecting the most APCs, libraries can make more informed decisions around the details of the read-publish agreements they have with various publishers.  

## Surveying APCs by publisher

### Steps

1. We get all works whose corresponding authors are affiliated with the institution
2. We get the publisher and APC for each publication
3. We sum the APCs (by publisher)
4. We calculate the APCs paid by the institution, given a list of read-publish agreement discounts

### Input

As input, we need an identifier for the institution; here we use its ROR ID. Looking up McMaster University in the ROR registry gives the ROR ID https://ror.org/02fa3aq29. We are also interested only in certain types of publications from a given year.  

In [1]:
SAVE_CSV = False  # flag to determine whether to save the output as a CSV file 

# input
ror_id = "https://ror.org/02fa3aq29"
publication_year = 2024
publication_types = ["article", "review"]

### Get OpenAlex ID of the given institution

We only want publications whose corresponding authors are affiliated with McMaster University. However, OpenAlex currently does not support filtering corresponding institutions by ROR ID, so we first need to find the OpenAlex ID for McMaster using the [`institutions`](https://docs.openalex.org/api-entities/institutions) entity type.  

Our search criteria are as follows:  
- `ror`: ROR ID of the institution, `ror:https://ror.org/02fa3aq29`

Now we need to build a URL for the query from the following parameters:  
- Starting point is the base URL of the OpenAlex API: `https://api.openalex.org/`
- We append the entity type to it: `https://api.openalex.org/institutions`
- All criteria need to go into the query parameter filter that is added after a question mark: `https://api.openalex.org/institutions?filter=`
- To construct the filter value we take the criteria we specified and concatenate them using commas as separators: `https://api.openalex.org/institutions?filter=ror:https://ror.org/02fa3aq29`

In [2]:
import requests

# construct the url using the provided ror id
url = f"https://api.openalex.org/institutions?filter=ror:{ror_id}"

# send a get request to the constructed url
response = requests.get(url)

# parse the response json data
json_data = response.json()

# extract the institution id from the first result
institution_id = json_data["results"][0]["id"]  # https://openalex.org/I98251732
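
The lookup above assumes that the filter returns at least one institution. As a minimal sketch, the helper below (a hypothetical name, not part of the original notebook) wraps the extraction step and fails with a clear message when the result list is empty, for example after a typo in the ROR ID. The sample response is hand-written to mimic the shape of the API output.

```python
def extract_institution_id(json_data):
    """Return the OpenAlex ID of the first institution in a parsed API response."""
    results = json_data.get("results", [])
    if not results:
        # an empty result list usually means the ROR ID did not match anything
        raise ValueError("No institution matched the given ROR ID")
    return results[0]["id"]


# hand-written response shaped like the API output (not fetched from the API)
sample_response = {"results": [{"id": "https://openalex.org/I98251732"}]}
print(extract_institution_id(sample_response))
```

In the notebook, this could replace the direct indexing: `institution_id = extract_institution_id(json_data)`.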

### Get all works published by researchers at the institution

Our search criteria are as follows:  
- `corresponding_institution_ids`: institution affiliated with the corresponding authors of a work (OpenAlex ID), `corresponding_institution_ids:https://openalex.org/I98251732`
- `publication_year`: the year the work was published, `publication_year:2024`
- [`type`](https://docs.openalex.org/api-entities/works/work-object#type): the type of the work, `type:article|review`

Now we need to build a URL for the query from the following parameters:  
- Starting point is the base URL of the OpenAlex API: `https://api.openalex.org/`
- We append the entity type to it: `https://api.openalex.org/works`
- All criteria need to go into the query parameter filter that is added after a question mark: `https://api.openalex.org/works?filter=`
- To construct the filter value we take the criteria we specified and concatenate them using commas as separators: `https://api.openalex.org/works?filter=corresponding_institution_ids:https://openalex.org/I98251732,publication_year:2024,type:article|review&page=1&per-page=50`
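
The steps above can be sketched as a small helper that joins the criteria with commas and the publication types with `|`. The function name `build_works_url` and its parameter names are illustrative assumptions, not part of the OpenAlex API.

```python
def build_works_url(institution_id, publication_year, publication_types, page=1, per_page=50):
    """Assemble a works query URL from the three search criteria."""
    criteria = [
        f"corresponding_institution_ids:{institution_id}",
        f"publication_year:{publication_year}",
        "type:" + "|".join(publication_types),  # "|" separates alternative values
    ]
    filter_value = ",".join(criteria)  # "," separates criteria
    return f"https://api.openalex.org/works?filter={filter_value}&page={page}&per-page={per_page}"


example_url = build_works_url("https://openalex.org/I98251732", 2024, ["article", "review"])
print(example_url)
```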

In [3]:
import numpy as np
import pandas as pd

def get_works_by_institution(institution_id, publication_year, publication_types, page=1, items_per_page=50):
    # join the publication types with "|" outside the f-string;
    # nesting double quotes inside a double-quoted f-string is a syntax error before Python 3.12
    type_filter = "|".join(publication_types)

    # construct the api url with the given institution id, publication year, publication types, page number, and items per page
    url = (
        "https://api.openalex.org/works"
        f"?filter=corresponding_institution_ids:{institution_id},publication_year:{publication_year},type:{type_filter}"
        f"&page={page}&per-page={items_per_page}"
    )

    # send a GET request to the api and parse the json response
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error
    json_data = response.json()

    # convert the json response to a dataframe
    df_json = pd.DataFrame.from_dict(json_data["results"])

    # an empty page means there are no more results, so stop recursing
    if df_json.empty:
        return df_json

    # otherwise, recursively fetch the next page and append it
    df_json_next_page = get_works_by_institution(institution_id, publication_year, publication_types, page=page+1, items_per_page=items_per_page)
    return pd.concat([df_json, df_json_next_page])

In [4]:
df_works = get_works_by_institution(institution_id, publication_year, publication_types)
if SAVE_CSV:
    df_works.to_csv(f"institution_works_{publication_year}.csv", index=True)
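
The recursive function works for a few dozen pages, but basic `page` paging in OpenAlex is limited to the first 10,000 results. For larger result sets, the OpenAlex documentation describes cursor paging: the first request passes `cursor=*`, and each response carries the next cursor in `meta.next_cursor`. The loop below is a minimal sketch of that pattern, not a drop-in replacement. `collect_pages` and `fetch_page` are hypothetical names, and the demonstration uses two hand-written pages instead of real API calls.

```python
import pandas as pd


def collect_pages(fetch_page):
    """Follow `meta.next_cursor` until no cursor or no results remain.

    `fetch_page` is a hypothetical callable that takes a cursor string
    and returns the parsed JSON of that page.
    """
    frames = []
    cursor = "*"  # "*" requests the first page in cursor paging
    while cursor:
        json_data = fetch_page(cursor)
        results = json_data.get("results", [])
        if not results:
            break
        frames.append(pd.DataFrame(results))
        cursor = json_data.get("meta", {}).get("next_cursor")
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# offline demonstration with two made-up pages
fake_pages = {
    "*": {"meta": {"next_cursor": "abc"}, "results": [{"id": "W1"}, {"id": "W2"}]},
    "abc": {"meta": {"next_cursor": None}, "results": [{"id": "W3"}]},
}
df_pages_demo = collect_pages(lambda cursor: fake_pages[cursor])
print(len(df_pages_demo))
```

With `requests`, `fetch_page` could be a function that appends `&cursor=<cursor>` to the works URL (without the `page` parameter) and returns `response.json()`.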

### Get Publishers and APCs in USD

A `work` entity object contains information about where the work was published (`primary_location`) and the APC listed by the publisher ([`apc_list`](https://docs.openalex.org/api-entities/works/work-object#apc_list)).  

`apc_list` describes the APC price listed by the publisher. At the time of writing this notebook, the only source for APC list price data is [DOAJ](https://doaj.org/). For publications whose APCs are not available in DOAJ, we need to infer an APC price for the calculation.  

In [5]:
# extract 'value_usd' from 'apc_list' if it is a dictionary (i.e. 'apc_list' exists in the work record); otherwise, set to null
df_works["apc_list_usd"] = df_works["apc_list"].apply(lambda apc_list: apc_list["value_usd"] if isinstance(apc_list, dict) else np.nan)

# return a field of 'source' within 'primary_location' if both exist; otherwise, return null
# (the isinstance check avoids a TypeError when 'primary_location' itself is missing)
def get_source_field(location, field):
    if isinstance(location, dict) and location.get("source"):
        return location["source"].get(field, np.nan)
    return np.nan

# extract 'id', 'display_name', 'issn' and 'issn_l' of the source
df_works["source_id"] = df_works["primary_location"].apply(lambda location: get_source_field(location, "id"))
df_works["source_name"] = df_works["primary_location"].apply(lambda location: get_source_field(location, "display_name"))
df_works["source_issn"] = df_works["primary_location"].apply(lambda location: get_source_field(location, "issn"))
df_works["source_issn_l"] = df_works["primary_location"].apply(lambda location: get_source_field(location, "issn_l"))

In [6]:
# calculate the average apc where 'apc_list_usd' is not null
apc_mean = df_works[df_works["apc_list_usd"].notnull()]["apc_list_usd"].mean()

# fill null values in 'apc_list_usd' with the calculated average
df_works["apc_list_usd"] = df_works["apc_list_usd"].fillna(apc_mean)

# fill null values in the source columns with a placeholder
df_works["source_id"] = df_works["source_id"].fillna("unknown source")
df_works["source_name"] = df_works["source_name"].fillna("unknown source")
df_works["source_issn"] = df_works["source_issn"].fillna("unknown source")
df_works["source_issn_l"] = df_works["source_issn_l"].fillna("unknown source")
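
Filling missing APCs with the mean makes the totals an estimate, and after `fillna` the inferred values can no longer be told apart from listed ones. The toy example below, with made-up prices, shows the same mean imputation plus one extra step that is not in the original notebook: a hypothetical `apc_is_estimated` flag recorded before filling.

```python
import numpy as np
import pandas as pd

# made-up APC values; the third work has no listed APC
df_impute_demo = pd.DataFrame({"apc_list_usd": [1000.0, 3000.0, np.nan]})

# remember which rows are inferred before overwriting the nulls
df_impute_demo["apc_is_estimated"] = df_impute_demo["apc_list_usd"].isna()

# mean() skips nulls by default, so this is (1000 + 3000) / 2
demo_mean = df_impute_demo["apc_list_usd"].mean()
df_impute_demo["apc_list_usd"] = df_impute_demo["apc_list_usd"].fillna(demo_mean)

print(df_impute_demo)
```

Summing `apc_list_usd` separately for estimated and listed rows then shows how much of the total rests on inferred prices.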

### Aggregate APCs Data

In [7]:
# group the dataframe by 'source_id' and 'source_issn_l'
# and aggregate 'source_name' by taking the maximum value (in this case the common string name of the source) 
# and 'apc_list_usd' by summing
df_apc = df_works.groupby(["source_id", "source_issn_l"]).agg({"source_name": "max", "apc_list_usd": "sum"})
if SAVE_CSV:
    df_apc.to_csv("apc_usd_by_source.csv", index=True)

df_apc

| source_id | source_issn_l | source_name | apc_list_usd |
|---|---|---|---|
| https://openalex.org/S100014455 | 1756-0500 | BMC Research Notes | 1361.000000 |
| https://openalex.org/S10012645 | 0363-9061 | International Journal for Numerical and Analyt... | 4530.000000 |
| https://openalex.org/S100299040 | 0017-9078 | Health Physics | 2899.458883 |
| https://openalex.org/S100662246 | 1748-2623 | International Journal of Qualitative Studies o... | 1790.000000 |
| https://openalex.org/S100695177 | 0004-6256 | The Astronomical Journal | 4499.000000 |
| ... | ... | ... | ... |
| https://openalex.org/S99498898 | 1567-5394 | Bioelectrochemistry | 3370.000000 |
| https://openalex.org/S99546260 | 1836-9561 | Journal of physiotherapy | 3450.000000 |
| https://openalex.org/S99961174 | 1363-2469 | Journal of Earthquake Engineering | 2899.458883 |
| https://openalex.org/S99985186 | 1360-8592 | Journal of Bodywork and Movement Therapies | 2670.000000 |
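
The aggregation above groups by source (journal), while the second question asks about publishers. In OpenAlex work records, the `source` object inside `primary_location` also carries a `host_organization_name` field, which typically names the publisher. The sketch below groups by that field on made-up records. The helper `get_publisher_name` and the `"unknown publisher"` placeholder are assumptions for illustration.

```python
import pandas as pd

# made-up records shaped like the relevant part of OpenAlex work objects
df_pub_demo = pd.DataFrame({
    "primary_location": [
        {"source": {"host_organization_name": "Publisher A"}},
        {"source": {"host_organization_name": "Publisher A"}},
        {"source": {"host_organization_name": "Publisher B"}},
        {"source": None},  # location without a source
        None,              # work without a primary location
    ],
    "apc_list_usd": [1000.0, 2000.0, 1500.0, 500.0, 250.0],
})


def get_publisher_name(location):
    """Return the publisher name of a location, or a placeholder if missing."""
    if isinstance(location, dict) and location.get("source"):
        return location["source"].get("host_organization_name") or "unknown publisher"
    return "unknown publisher"


df_pub_demo["publisher_name"] = df_pub_demo["primary_location"].apply(get_publisher_name)

# sum the APCs per publisher, largest first
apc_by_publisher = (
    df_pub_demo.groupby("publisher_name")["apc_list_usd"].sum().sort_values(ascending=False)
)
print(apc_by_publisher)
```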


In [8]:
total_apc = df_apc["apc_list_usd"].sum()
print(f"Total APC in {publication_year}: ${round(total_apc, 2)} USD.")

Total APC in 2024: $4288299.69 USD.


### Calculate Discounted APCs

In [9]:
if SAVE_CSV:
    df_apc.to_csv(f"apc_usd_by_source.csv", index=True)

df_apc

| source_id | source_issn_l | source_name | apc_list_usd |
|---|---|---|---|
| https://openalex.org/S100014455 | 1756-0500 | BMC Research Notes | 1361.000000 |
| https://openalex.org/S10012645 | 0363-9061 | International Journal for Numerical and Analyt... | 4530.000000 |
| https://openalex.org/S100299040 | 0017-9078 | Health Physics | 2899.458883 |
| https://openalex.org/S100662246 | 1748-2623 | International Journal of Qualitative Studies o... | 1790.000000 |
| https://openalex.org/S100695177 | 0004-6256 | The Astronomical Journal | 4499.000000 |
| ... | ... | ... | ... |
| https://openalex.org/S99498898 | 1567-5394 | Bioelectrochemistry | 3370.000000 |
| https://openalex.org/S99546260 | 1836-9561 | Journal of physiotherapy | 3450.000000 |
| https://openalex.org/S99961174 | 1363-2469 | Journal of Earthquake Engineering | 2899.458883 |
| https://openalex.org/S99985186 | 1360-8592 | Journal of Bodywork and Movement Therapies | 2670.000000 |
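
Step 4 of the plan, applying read-publish agreement discounts, can be sketched as follows. Everything in this example is hypothetical: the journal and publisher names, the prices, and the `discount_rates` mapping, which would in practice come from the institution's own agreements. A rate of `0.25` means a 25% discount, and `1.0` means the APC is fully waived.

```python
import pandas as pd

# made-up aggregated APCs; real data would come from df_apc plus a publisher column
df_discount_demo = pd.DataFrame({
    "source_name": ["Journal A", "Journal B", "Journal C"],
    "publisher_name": ["Publisher A", "Publisher B", "Publisher A"],
    "apc_list_usd": [2000.0, 3000.0, 1000.0],
})

# hypothetical discount rates per publisher
discount_rates = {"Publisher A": 0.25, "Publisher B": 1.0}

# publishers without an agreement get a discount of 0
df_discount_demo["discount_rate"] = (
    df_discount_demo["publisher_name"].map(discount_rates).fillna(0.0)
)
df_discount_demo["apc_paid_usd"] = df_discount_demo["apc_list_usd"] * (1 - df_discount_demo["discount_rate"])
df_discount_demo["apc_saved_usd"] = df_discount_demo["apc_list_usd"] - df_discount_demo["apc_paid_usd"]

total_paid = df_discount_demo["apc_paid_usd"].sum()
total_saved = df_discount_demo["apc_saved_usd"].sum()
print(f"Paid: ${total_paid:.2f} USD, saved: ${total_saved:.2f} USD")
```

Keeping the paid and saved amounts as separate columns makes it possible to report savings per publisher with the same `groupby` pattern used for the totals.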
