# Extraction of papers referencing datasets using the OpenCitations API with the Crossref database

## How does this file work?

1. First, we read the DOI of each dataset we want to query from the datasets.csv file.
2. For each dataset, we send a GET request to https://opencitations.net/index/coci/api/v1/citations/{DOI_of_the_dataset}
3. From the API response, we get the DOI and publication year of each paper that cites the dataset.
4. Finally, we export the results to paper_coci.csv.
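
Step 3 above can be sketched in isolation, without calling the API. This is a minimal illustration, assuming each citation record is a dictionary with a `citing` field (the DOI of the citing paper) and a `creation` field that starts with a four-digit year, which are the two fields this notebook reads. The function name `extract_citations` and the sample records are hypothetical and not real API output.

```python
def extract_citations(records):
    """Return (citing_doi, year) pairs from a list of citation records."""
    pairs = []
    for record in records:
        doi = record["citing"]
        # Assumption: "creation" starts with a four-digit year,
        # e.g. "2020-03-15" or just "2020"
        year = record["creation"][:4]
        pairs.append((doi, year))
    return pairs


# Hypothetical records shaped like the fields the notebook reads
sample = [
    {"citing": "10.1234/example.1", "creation": "2020-03-15"},
    {"citing": "10.1234/example.2", "creation": "2019"},
]
result = extract_citations(sample)
print(result)
```

Slicing the first four characters works whether the date is given as a full date or as a year only, which is why no date parsing is used here.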

Get data from datasets.csv

In [1]:
import requests
from pprint import pprint
import csv

# Dictionary with each dataset's name as key and its DOI as value
datasets_DOI = {}

# Use a context manager so the file is closed once it has been read
with open('./data/datasets.csv') as ds_file:
    ds_reader = csv.DictReader(ds_file)
    for ds in ds_reader:
        datasets_DOI[ds["name"]] = ds["DOI"]

pprint(datasets_DOI)


{'ACDC': '10.1109/TMI.2018.2837502',
 'BRATS': '10.1109/tmi.2014.2377694',
 'I2CVB': '10.1016/j.compbiomed.2015.02.009',
 'LA': '10.1016/j.media.2020.101832',
 'M&Ms': '10.1109/tmi.2021.3090082',
 'MSCMRSeg': '10.48550/arxiv.2006.12434',
 'Medical Decathlon': '10.1038/s41467-022-30695-9',
 'PROMISE12': '10.1016/j.media.2013.12.002',
 'Synapse': '10.7303/syn3193805'}


Request the OpenCitations API

In [4]:
url_base = "https://opencitations.net/index/coci/api/v1/citations/"

# For each dataset, the list of (DOI, year) tuples of the papers citing it
paper_using = {ds: [] for ds in datasets_DOI}

for ds, dataset_doi in datasets_DOI.items():
    req_url = url_base + dataset_doi
    req = requests.get(req_url)
    if req.status_code == 200:
        req_json = req.json()
        for cite in req_json:
            doi = cite["citing"]
            # Keep only the year part of the creation date
            year = cite["creation"][:4]
            paper_using[ds].append((doi, year))
    else:
        print(f"ERROR {req.status_code} for dataset {ds}")


ERROR 500 for dataset ACDC
ERROR 500 for dataset LA
ERROR 500 for dataset PROMISE12
ERROR 500 for dataset MSCMRSeg
ERROR 500 for dataset M&Ms
ERROR 500 for dataset Medical Decathlon
ERROR 500 for dataset I2CVB
ERROR 500 for dataset BRATS
ERROR 500 for dataset Synapse
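
Every request above came back with status 500, which is a server-side error, so no citations were collected in this run. If the failures are transient, one common mitigation is to retry each request with an increasing delay. The sketch below is an assumption about how that could be done and is not part of the original notebook: `fetch_with_retry` and its parameters are hypothetical names, and the `fetch` argument is any callable that returns an object with a `status_code` attribute, such as a wrapper around `requests.get`.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    Returns the last response received, whatever its status code.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # Exponential backoff: delay, 2 * delay, 4 * delay, ...
            sleep(delay_seconds * 2 ** (attempt - 1))
    return response
```

In the loop above, it could be used as `req = fetch_with_retry(requests.get, req_url)`. Retrying only helps with temporary failures; if the endpoint keeps answering 500, the request URL or the service status should be checked instead.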


Export the results to paper_coci.csv

In [3]:
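with open("./extracted_csv/paper_coci.csv", "w") as file:
    # Header row
    file.write("name,DOI,publication_year,dataset_used")
    for ds in datasets_DOI:
        lst_papers = paper_using[ds]
        for pap in lst_papers:
            # The paper's name is not extracted from the API response,
            # so the name column holds the literal string "None"
            file.write(f"\n{None},{pap[0]},{pap[1]},{ds}")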
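
Because the rows above are assembled by hand, a dataset name containing a comma would shift the columns. A possible alternative, shown here only as a sketch, is to let Python's `csv.writer` handle quoting. The helper `write_papers` and the sample data are hypothetical, and unlike the cell above this version leaves the unknown name column empty instead of writing `None`.

```python
import csv
import io


def write_papers(file, paper_using):
    """Write one CSV row per (DOI, year) pair, grouped by dataset name."""
    writer = csv.writer(file, lineterminator="\n")
    writer.writerow(["name", "DOI", "publication_year", "dataset_used"])
    for ds, papers in paper_using.items():
        for doi, year in papers:
            # The paper name is unknown at this stage, so the column is left empty
            writer.writerow(["", doi, year, ds])


# Hypothetical data: the dataset name contains a comma on purpose
buffer = io.StringIO()
write_papers(buffer, {"Demo, set": [("10.1234/example.1", "2020")]})
output = buffer.getvalue()
print(output)
```

When writing to a real file with `csv.writer`, the file is usually opened with `newline=""` so that the module controls line endings itself.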