## Extract the datasets used by a paper with the OpenAlex API

Example request for the paper "Exploring Smoothness...":  https://api.openalex.org/works/https://doi.org/10.1007/978-3-031-16443-9_4?select=referenced_works

In [1]:
import requests
import numpy as np
from pprint import pprint
import csv

#Dictionary of papers with names as keys and DOIs as values, loaded from papers.csv
papers = {}
with open('../../data/papers.csv', newline='') as papers_file:
    for p in csv.DictReader(papers_file):
        papers[p["name"]] = p["DOI"]



#Dictionaries with the dataset name as key: one for the DOI, one for the context (segmented organ)
datasets_doi = {}
datasets_context = {}

#Both pieces of information come from datasets.csv
with open('../../data/datasets.csv', newline='') as datasets_file:
    for ds in csv.DictReader(datasets_file):
        datasets_doi[ds["name"]] = ds["DOI"]
        datasets_context[ds["name"]] = ds["context"]

pprint(datasets_doi)

{'ACDC': '10.1109/TMI.2018.2837502',
 'BRATS': '10.1109/tmi.2014.2377694',
 'I2CVB': '10.1016/j.compbiomed.2015.02.009',
 'LA': '10.1016/j.media.2020.101832',
 'M&Ms': '10.1109/tmi.2021.3090082',
 'MSCMRSeg': '10.48550/arxiv.2006.12434',
 'Medical Decathlon': '10.1038/s41467-022-30695-9',
 'PROMISE12': '10.1016/j.media.2013.12.002',
 'Synapse': '10.7303/syn3193805'}


In [2]:
def doi_to_OpenAlexId(doi):
    base_url = f"https://api.openalex.org/works/doi:{doi}"
    r = requests.get(base_url)
    if r.status_code == 200:
        r_json = r.json()
        return r_json["id"]
    else:
        return None

#Dictionary with dataset names as keys and OpenAlex IDs as values. We use the OpenAlex ID because that is what the "referenced_works" field returned by the API contains.
#Matching on the OpenAlex ID directly saves requests: otherwise each referenced ID would need its own request to get the DOI or the name.
datasets_id = {}
#Convert each DOI to an OpenAlex ID (the value stays None when the DOI cannot be resolved)
for ds in datasets_doi:
    openalex_id = doi_to_OpenAlexId(datasets_doi[ds])
    if not openalex_id:
        print(f"Couldn't convert DOI for {ds} into OpenAlex ID")
    datasets_id[ds] = openalex_id

Couldn't convert DOI for Synapse into OpenAlex ID
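
Since the Synapse DOI could not be resolved, `datasets_id` holds a `None` value for it. The following is a minimal sketch, assuming we prefer to set unresolved datasets aside before building any later query. The helper name `keep_resolved` and the two-entry sample dictionary are illustrative and not part of the notebook.

```python
def keep_resolved(ids_by_name):
    """Split a {dataset name: OpenAlex ID or None} mapping.

    Returns the mapping restricted to resolved IDs, plus the list of
    dataset names whose DOI could not be converted.
    """
    resolved = {name: oid for name, oid in ids_by_name.items() if oid is not None}
    unresolved = [name for name, oid in ids_by_name.items() if oid is None]
    return resolved, unresolved


# Illustrative sample: ACDC resolved, Synapse not (as in the output above)
sample_ids = {
    "ACDC": "https://openalex.org/W2804047627",
    "Synapse": None,
}
resolved_ids, unresolved_names = keep_resolved(sample_ids)
print(resolved_ids)
print(unresolved_names)
```

Iterating over the resolved mapping only would avoid sending a `cites:None` filter to the API later on.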


We can now query the OpenAlex API with each paper's DOI and check whether the "referenced_works" field, which contains a list of OpenAlex IDs, includes one of the datasets we are working with.

In [9]:
#Base url for the request
base_url = "https://api.openalex.org/works/https://doi.org/"

#To only get the field of referenced works in the response
query_param = {
    "select":"referenced_works"
}


#Will be filled with datasets found in the "referenced works" response of each paper
datasets_used = {p:[] for p in papers}

for paper_name in papers:
    request_url = base_url + papers[paper_name]
    request = requests.get(request_url, params=query_param)
    json_response = request.json()
    for ds in datasets_id:
        if datasets_id[ds] in json_response['referenced_works']:
            datasets_used[paper_name].append(ds)
pprint(datasets_used)

{'Exploring Smoothness and Class-Separation for Semi-supervised Medical Image Segmentation': ['ACDC',
                                                                                              'LA'],
 'MaxStyle: Adversarial Style Composition for Robust Medical Image Segmentation': ['PROMISE12',
                                                                                   'I2CVB'],
 'Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation': [],
 'TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation': [],
 'nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer': ['ACDC']}
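
The matching step can also be isolated into a pure function, which makes it testable without calling the API. This is only a sketch: `match_datasets` is a hypothetical helper, and the second ID in the sample list is a made-up placeholder.

```python
def match_datasets(referenced_works, dataset_ids):
    """Return the dataset names whose OpenAlex ID appears in referenced_works.

    A missing or empty reference list yields no match, and datasets
    without an OpenAlex ID (None) are ignored.
    """
    referenced = set(referenced_works or [])
    return [
        name
        for name, openalex_id in dataset_ids.items()
        if openalex_id is not None and openalex_id in referenced
    ]


# Illustrative data: the second referenced ID is a made-up placeholder
sample_dataset_ids = {"ACDC": "https://openalex.org/W2804047627", "Synapse": None}
sample_references = [
    "https://openalex.org/W2804047627",
    "https://openalex.org/W0000000001",
]
matched = match_datasets(sample_references, sample_dataset_ids)
print(matched)
```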


## Extract the paper's task from the abstract contained in the API response

Example request for the paper "Exploring Smoothness...":  https://api.openalex.org/works/https://doi.org/10.1007/978-3-031-16443-9_4?select=abstract_inverted_index

Here we search for keywords in each paper's abstract, using the same query as before but selecting the "abstract_inverted_index" field, which maps each word of the abstract to its positions in the text. The goal is not to find the dataset but the task of the paper (brain segmentation, heart segmentation, etc.).
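
To make the shape of "abstract_inverted_index" concrete, here is a small sketch that rebuilds the abstract text from such an index and then searches it for keywords. The helper names and the sample index are invented for illustration. Matching on the rebuilt text rather than on single words is an alternative to the approach used in the notebook, and it would also allow multi-word keywords.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild the abstract text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order
    return " ".join(word for _, word in sorted(positioned))


def find_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


# Made-up index in the same shape as the API field
sample_index = {"Cardiac": [0], "segmentation": [1, 4], "and": [2], "prostate": [3]}
text = abstract_from_inverted_index(sample_index)
found = find_keywords(text, ["Heart", "Cardiac", "Prostate"])
print(text)
print(found)
```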

In [10]:
#Base url for the request
base_url = "https://api.openalex.org/works/https://doi.org/"

#To only get the abstract_inverted_index field in the response
query_param = {
    "select":"abstract_inverted_index"
}

key_words = ["Heart","Cardiac","Prostate","Multi-organ","Brain","Liver","Aorta","Gallbladder","Spleen","Kidney","Pancreas","Stomach","Lung","Breast","Chest","Skin","Eye","Retina"]
paper_theme = {p:[] for p in papers}

for paper_name in papers:
    request_url = base_url + papers[paper_name]
    request = requests.get(request_url, params=query_param)
    json_response = request.json()
    #The inverted index can be null when OpenAlex has no abstract for the paper
    abstract_words = list(json_response["abstract_inverted_index"] or {})
    for kw in key_words:
        if any(kw.upper() in w.upper() for w in abstract_words):
            paper_theme[paper_name].append(kw)

paper_theme

{'Exploring Smoothness and Class-Separation for Semi-supervised Medical Image Segmentation': [],
 'MaxStyle: Adversarial Style Composition for Robust Medical Image Segmentation': ['Cardiac',
  'Prostate'],
 'nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer': [],
 'Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation': ['Cardiac',
  'Multi-organ'],
 'TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation': ['Cardiac',
  'Multi-organ']}

## Merge the context obtained from the abstract with the context obtained from the referenced datasets

Finally, we can combine the two previous pieces of information (referenced datasets and task found in the abstract) to determine the task of a paper more accurately, since the context/task of each dataset is given in the datasets.csv file.

In [11]:
print("==============================Before fusion using only abstract==============================")
pprint(paper_theme)
for paper_name in papers:
    datasets = datasets_used[paper_name]
    for ds in datasets:
        if datasets_context[ds] not in paper_theme[paper_name]:
            paper_theme[paper_name].append(datasets_context[ds])
print("\n==============================After fusion using both abstract and referenced papers==============================")
pprint(paper_theme)

{'Exploring Smoothness and Class-Separation for Semi-supervised Medical Image Segmentation': [],
 'MaxStyle: Adversarial Style Composition for Robust Medical Image Segmentation': ['Cardiac',
                                                                                   'Prostate'],
 'Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation': ['Cardiac',
                                                                          'Multi-organ'],
 'TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation': ['Cardiac',
                                                                                 'Multi-organ'],
 'nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer': []}

{'Exploring Smoothness and Class-Separation for Semi-supervised Medical Image Segmentation': ['Cardiac'],
 'MaxStyle: Adversarial Style Composition for Robust Medical Image Segmentation': ['Cardiac',
                                                                       

## Extract papers that use a dataset with OpenAlex

Example request for ACDC: https://api.openalex.org/works?page=2&filter=cites:W2804047627

Gather the papers that cite one of the datasets we want to monitor.

### How does it work?

1. Convert each dataset DOI to its OpenAlex ID (the datasets_id dictionary built earlier)
2. For each dataset, make a GET request to https://api.openalex.org/works?filter=cites:{openalexID_of_the_dataset}
3. Iterate through the pages of the API response (a page contains a limited number of results) to get the title, DOI and publication year of each paper that cites the dataset
4. Export the result to paper_openalex.csv

Each line contains the title of the paper, the DOI, the publication year and a dataset name, so there is one line per dataset that a paper references.
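
The paging step above can also be sketched with cursor paging, which, as far as I understand the OpenAlex documentation, is the recommended way to walk through large result sets (`cursor=*` on the first request, then the `next_cursor` value from the `meta` field). The function names `fetch_openalex_page`, `iter_citing_works` and `fake_fetch` are hypothetical, and the canned pages use placeholder DOIs so that the example runs offline. The standard library is used here to avoid dependencies; `requests.get(...).json()` would work equally well.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_openalex_page(params):
    """Fetch one page of /works results (performs a real HTTP request)."""
    url = "https://api.openalex.org/works?" + urlencode(params)
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_citing_works(openalex_id, fetch_page=fetch_openalex_page, per_page=200):
    """Yield every work citing openalex_id, following next_cursor until it is empty."""
    cursor = "*"
    while cursor:
        page = fetch_page(
            {"filter": f"cites:{openalex_id}", "per-page": per_page, "cursor": cursor}
        )
        yield from page["results"]
        cursor = page["meta"].get("next_cursor")


def fake_fetch(params):
    """Offline stand-in for the API: two canned pages with placeholder DOIs."""
    pages = {
        "*": {"meta": {"next_cursor": "abc"}, "results": [{"doi": "https://doi.org/10.1000/a"}]},
        "abc": {"meta": {"next_cursor": None}, "results": [{"doi": "https://doi.org/10.1000/b"}]},
    }
    return pages[params["cursor"]]


works = list(iter_citing_works("W2804047627", fetch_page=fake_fetch))
print([work["doi"] for work in works])
```

Dropping the `fetch_page=fake_fetch` argument would send the requests to the real API.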

Query the API

In [7]:
#Dictionary with dataset names as keys and the list of papers citing each dataset as values
paper_using = {ds:[] for ds in datasets_id}


for ds in datasets_id:
    #The list of papers referencing the dataset is split across multiple pages, so we iterate with the "page" query parameter to get them all.
    next_page = True
    page_number = 1
    while next_page:
        #Definition of the request
        base_url = "https://api.openalex.org/works"
        query_param = {
            "filter":f"cites:{datasets_id[ds]}",
            "page":page_number
        }
        request = requests.get(base_url,params=query_param)
        if request.status_code == 200:
            request_json = request.json()
            
            #For each paper referencing the dataset, get the title (removing any "," and "\n" so it fits on one CSV line), the DOI and the publication year
            for res in request_json["results"]:
                title = res["title"] or "" #The title can be missing
                title = title.replace(",","")
                title = title.replace("\n","")
                doi = res["doi"]
                if doi is not None:
                    doi = doi[16:] #Remove the https://doi.org/
                    paper_using[ds].append((title,doi,res["publication_year"]))

            #If the results field is empty, we are past the last page and can move on to the next dataset;
            #otherwise we go to the next page of the current dataset
            if not request_json["results"]:
                next_page = False
            else:
                page_number += 1
        else:
            next_page = False


/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW2804047627&page=1
/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW3093394156&page=1
/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW4312016581&page=1
/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW4226199676&page=1
/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW2106033751&page=1
/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW3172681723&page=1
/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW2049522781&page=1
/works?filter=cites%3Ahttps%3A%2F%2Fopenalex.org%2FW1641498739&page=1
/works?filter=cites%3ANone&page=1


In [4]:
paper_using

{'ACDC': [('nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation',
   '10.1038/s41592-020-01008-z',
   2021),
  ('Automated cardiovascular magnetic resonance image analysis with fully convolutional networks',
   '10.1186/s12968-018-0471-x',
   2018),
  ('Deep Learning for Cardiac Image Segmentation: A Review',
   '10.3389/fcvm.2020.00025',
   2019),
  ('Multi-Scale Self-Guided Attention for Medical Image Segmentation',
   '10.1109/jbhi.2020.2986926',
   2021),
  ('Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography',
   '10.1109/tmi.2019.2900516',
   2019),
  ('State-of-the-Art Deep Learning in Cardiovascular Image Analysis',
   '10.1016/j.jcmg.2019.06.009',
   2019),
  ('Evaluation of algorithms for Multi-Modality Whole Heart Segmentation: An open-access grand challenge',
   '10.1016/j.media.2019.101537',
   2019),
  ('Learning a Probabilistic Model for Diffeomorphic Registration',
   '10.1109/tmi.2019.2897112',
 

Export the result to paper_openalex.csv

In [11]:
with open("../../results/extracted_csv/paper_openalex.csv","w") as file:
    file.write(f"name,DOI,publication_year,dataset_used")
    for ds in datasets_id:
        lst_papers = paper_using[ds]
        for pap in lst_papers:
            file.write(f"\n{pap[0]},{pap[1]},{pap[2]},{ds}")
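
As an alternative to stripping commas from titles, the export could go through `csv.writer`, which quotes fields that contain commas or newlines. This is a sketch under assumptions: `write_citing_papers` is a hypothetical helper, the sample entry is invented, and an in-memory buffer replaces the real output file.

```python
import csv
import io


def write_citing_papers(file_obj, papers_by_dataset):
    """Write one CSV row per (paper, dataset) pair, with a header row."""
    writer = csv.writer(file_obj)
    writer.writerow(["name", "DOI", "publication_year", "dataset_used"])
    for dataset, entries in papers_by_dataset.items():
        for title, doi, year in entries:
            writer.writerow([title, doi, year, dataset])


# Invented sample entry whose title contains a comma
sample_papers = {"ACDC": [("A title, with a comma", "10.1000/example", 2020)]}

buffer = io.StringIO(newline="")
write_citing_papers(buffer, sample_papers)

# Reading the text back shows that the title survives as a single field
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(rows)
```

With a real file, `open(path, "w", newline="")` is the form recommended by the `csv` module documentation.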