## Retrieving citation counts

Here we compare citation counts from different Crossref endpoints:

1. JSON REST API /works endpoint
1. XML API query
1. Forward links (members only)
1. Admin tool (GUI; members only)
1. Event Data

### How do citation counts work?

For each DOI, we collect the references made to it. These come from reference lists deposited by members that either already contain the target DOI or have a DOI that was matched by us. 

The set of DOIs that reference a work is delivered via a number of tools that are collectively called Cited-by. Below you can see how to access counts and itemized lists of these citations.


### How to run this notebook

To get the results:

1. Edit the values in 'user input' section.

1. Run all of the cells.

1. In the section 'admin tool' you'll need to visit Crossref's admin tool page and enter the value it gives you.

1. Finally, scroll down to the final summary section. Run these cells again to include the admin tool results.

If you are a Crossref member with credentials to deposit metadata and access the forward links API, you will need to add your role (the login will use my_email@domain.com/role). Also create a file called 'credentials.txt' that contains the corresponding password and save it in the same directory as this file.

Check dependencies. You will need to have the following available: requests, json, pandas, urllib, xmltodict. Of these, json and urllib ship with Python, while requests, pandas and xmltodict are third-party packages. There is a cell in the Functions and Libraries section that can be uncommented to install xmltodict.
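As a rough way to check the dependencies before running anything, the sketch below asks Python whether each package can be found. The helper name `missing_packages` is made up for this example and is not part of the notebook.

```python
from importlib.util import find_spec

# Top-level packages this notebook imports; json and urllib ship with Python.
required = ["requests", "json", "pandas", "urllib", "xmltodict"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(required)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All dependencies found")
```

This only checks that a package can be located, not that it imports cleanly or is a suitable version.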

### Taking it further

In case you want to adapt or rerun a call in a different script, we output each API call as it is made. This should help you understand how to construct the queries you need.

You can find documentation about Cited-by at https://www.crossref.org/documentation/cited-by/, including information about retrieving citations at https://www.crossref.org/documentation/cited-by/retrieve-citations/. 

### A note on language

This section is for the pedants, but it does matter in a small number of cases. References and citations aren't quite the same thing, although the terms are often used interchangeably and we aren't very consistent about their usage.

- *A reference* is an item in the bibliography of a work. A reference is not necessarily mentioned in the text, and it may be mentioned several times.
- *A citation* refers to the specific place that a reference is mentioned in a work, i.e. in the text itself.

At Crossref, we collect reference lists and output their contents through a service called Cited-by, so you can see already that there are issues with linguistic consistency. We don't define whether reference lists should contain only references or only citations; it's left to the interpretation of the members depositing metadata. This means that you may see the same entry several times in a reference list (i.e. once for each citation), or items in the reference list that are not cited in the text.
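To make the distinction concrete, here is a small sketch that counts entries versus unique works in a reference list. The list and its DOIs are invented for illustration and do not come from any real deposit.

```python
from collections import Counter

# Hypothetical reference list: each entry is the DOI of one deposited reference.
reference_list = [
    "10.5555/example.1",
    "10.5555/example.2",
    "10.5555/example.1",  # same work listed again, e.g. once per in-text citation
]

entries_per_doi = Counter(reference_list)
total_entries = len(reference_list)
unique_works = len(entries_per_doi)

print(f"{total_entries} entries, {unique_works} unique works referenced")
```

Whether a count reflects entries or unique works is one reason the numbers from different tools may disagree.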

## User input

Edit the items in this section. They are used further down the notebook.

In [1]:
# which DOI would you like to check Cited-by counts for?
doi = '10.1002/cphc.201700310'

# include your email to make API requests polite and to log in for XML queries
my_email = 'test@crossref.org'

## To be completed by members to access forward links
my_role = 'my_role'

## Functions and libraries

In [None]:
## Uncomment to install xmltodict if needed
# ! pip install xmltodict

In [12]:
import requests
import json
from pandas import DataFrame
import pandas as pd
import urllib.parse
from urllib.request import urlopen

try:
    import xmltodict
except ModuleNotFoundError:
    raise ModuleNotFoundError('''Install xmltodict before continuing! You can do that by uncommenting the previous cell and running it.''')

In [None]:
## functions for handling json APIs


def print_my_query(url: str, params: dict) -> None:
    """print a query for users to take home"""
    if params:
        # work on a copy so the caller's dictionary is not modified
        params = dict(params)
        # hide passwords!
        if "pwd" in params:
            params["pwd"] = "[PASSWORD]"
        # put the parameters into a list
        params_list = [f"{key}={str(params[key])}" for key in params]
        # build the query
        query_url = f"{url}?" + "&".join(params_list)
    else:
        # if there are no parameters, just use the query URL
        query_url = url

    print(query_url)


def filters_to_params(filters: dict) -> str:
    """a hack for the /works endpoint, which uses filters and parameters

    parameters
    ----------
    filters:dict
        e.g. {'from-updated-date': '2023-12-01'}

    returns
    -------
    str:
        e.g. 'from-updated-date:2023-12-01'

    """
    param = ""
    for f in filters:
        if len(param) > 0:
            param += ","
        param += f + ":" + str(filters[f])
    return param


def query_json_api(
    url, params: dict | None = None, filters: dict | None = None
) -> dict:
    """query some API, expecting JSON output"""

    if params is None:
        params = {}
    # handle filters (in particular for the Crossref works endpoint)
    if filters:
        params["filter"] = filters_to_params(filters)

    # make the request
    if params:
        r = requests.get(url, params=params)
    else:
        r = requests.get(url)

    # print the query used
    print_my_query(url, params)

    # get json
    if r.status_code == 200:
        # it all worked as expected, get some json
        js = r.json()
    else:
        # if something went wrong, print the output and return an empty dictionary
        print(r.text)
        js = {}

    return js

In [5]:
# functions specific to XML APIs
def prepared_query(url:str, params=None) -> str:
    '''
    Get the query as a string. Similar to print_my_query but doesn't print or hide passwords.
    Needed for using urllib instead of requests.
    '''
    if params:
        # percent-encode each value and put the parameters into a list
        params_list = [f"{key}={urllib.parse.quote(str(params[key]))}" for key in params]
        suffix = "&".join(params_list)
        # build the full query URL
        query_url = url + '?' + suffix
    else:
        # if there are no parameters, the query is just the URL
        query_url = url

    return query_url


def load_credentials() -> str:
    ''' get a password from a file called credentials.txt '''
    with open('credentials.txt', 'r') as f:
        # strip the trailing newline so it isn't sent as part of the password
        return f.readline().strip()
    

def query_xml_api(url:str, params = None) -> dict:
    ''' Query an API expecting an XML response. 
    Using urllib because requests does weird things with the username and password.    
    '''

    # get a url to query
    query_url = prepared_query(url, params)

    # query using urllib
    with urlopen(query_url) as response:
        body = response.read()

    # print the query used
    print_my_query(url, params)

    # turn the response into a dictionary
    return xmltodict.parse(body)


In [None]:
# other useful functions

def save_to_json(data:dict, fname:str) -> None:
    ''' save a dictionary to a json file'''
    with open(fname, 'w') as f:
        json.dump(data, f, indent=2)

def duplicate_check(fl_list:list) -> tuple[list, int]:
    ''' check a list of forward links and return only the unique values

    inputs
    ------
    fl_list: list
      a list of forward links from the Crossref XML API

    returns
    -------
    unique_fl_dois: list
      the unique citing DOIs
    fl_unique_count: int
      the number of unique citing DOIs
    '''

    # let's check for duplicates
    linked_dois = []
    for work in fl_list:
        # get a key, either 'journal_cite' or 'book_cite', it doesn't matter which
        ls = list(work.keys())
        ls.remove('@doi')
        k = ls[0]

        # get the DOI of the citation
        linked_dois.append(work[k]['doi']['#text'])

    # remove duplicates from the list of citing DOIs
    unique_fl_dois = list(set(linked_dois))
    fl_unique_count = len(unique_fl_dois)
    print(f"{fl_unique_count} unique DOIs found")

    # for duplicates show how many times they occurred
    duplicate_fl_dois = [doi for doi in linked_dois if linked_dois.count(doi) > 1]
    duplicate_fl_dois = list(set(duplicate_fl_dois))
    print("The following DOIs were duplicated in the results:")
    for doi in duplicate_fl_dois:
        print (f"{linked_dois.count(doi)}\t {doi}")

    return unique_fl_dois, fl_unique_count

## REST API JSON

Query the works endpoint of the Crossref API, which returns json, including an is-referenced-by-count field.


In [7]:
# url of the Crossref json REST API
works_url = 'https://api.crossref.org/v1/works'
# query parameters
json_params = {'mailto': my_email}
# make the query
js = query_json_api(works_url + '/' + doi, json_params)

https://api.crossref.org/v1/works/10.1002/cphc.201700310?mailto=test@crossref.org


In [8]:
json_count = js['message']['is-referenced-by-count']
print(f"{json_count} references found in the Crossref json REST API")

31 references found in the Crossref json REST API


In [11]:
from json import dumps
print(dumps(js, indent=2))

{
  "status": "ok",
  "message-type": "work",
  "message-version": "1.0.0",
  "message": {
    "indexed": {
      "date-parts": [
        [
          2024,
          8,
          24
        ]
      ],
      "date-time": "2024-08-24T13:34:10Z",
      "timestamp": 1724506450693
    },
    "reference-count": 43,
    "publisher": "Wiley",
    "issue": "9",
    "license": [
      {
        "start": {
          "date-parts": [
            [
              2017,
              4,
              26
            ]
          ],
          "date-time": "2017-04-26T00:00:00Z",
          "timestamp": 1493164800000
        },
        "content-version": "vor",
        "delay-in-days": 0,
        "URL": "http://onlinelibrary.wiley.com/termsAndConditions#vor"
      }
    ],
    "content-domain": {
      "domain": [
        "chemistry-europe.onlinelibrary.wiley.com"
      ],
      "crossmark-restriction": true
    },
    "short-container-title": [
      "ChemPhysChem"
    ],
    "published-print": {
      "d

In [None]:
df = 

## Forward link query

Get an itemised list of citing items in XML format. This requires member login credentials, although any member can retrieve results for any other member.

We run the query twice: once including posted content and once excluding it.

In [None]:
# forward link query URL
fl_url = 'https://doi.crossref.org/servlet/getForwardLinks'

# define the parameters
fl_params = {
    'usr':f"{my_email}/{my_role}",
    'pwd': load_credentials(),
    'doi': doi,
    'include_postedcontent': 'false'
}
# parameters for including posted content
fl_params_with_posted_content = fl_params.copy()
fl_params_with_posted_content['include_postedcontent'] = 'true'

# get results without posted content
fl_xml = query_xml_api(fl_url, fl_params)

# get results with posted content
fl_xml_with_posted_content = query_xml_api(fl_url, fl_params_with_posted_content)

# delete the password as we don't need it any more
del fl_params['pwd']


In [10]:
# get the list of citations
try:
    fl_list = fl_xml['crossref_result']['query_result']['body']['forward_link']
except TypeError:
    # if there are no entries, 'body' is None, let's handle that
    fl_list = []

# if there's only one entry it comes back as a dict, not a list. Let's fix that
if isinstance(fl_list, dict):
    fl_list = [fl_list]

# count only after normalising to a list, otherwise a single entry is miscounted
fl_count = len(fl_list)

print(f"{fl_count} forward links found")


In [None]:
# get the list of citations (including posted content)
try:
    fl_list_pc = fl_xml_with_posted_content['crossref_result']['query_result']['body']['forward_link']
except TypeError:
    # if there are no entries, 'body' is None, let's handle that
    fl_list_pc = []

# if there's only one entry it comes back as a dict, not a list. Let's fix that
if isinstance(fl_list_pc, dict):
    fl_list_pc = [fl_list_pc]

# count only after normalising to a list, otherwise a single entry is miscounted
fl_count_with_posted_content = len(fl_list_pc)

print(f"{fl_count_with_posted_content} forward links found (including posted content)")

In [None]:
# check for duplicates
print('Without posted content:')
unique_fl_dois, fl_unique_count = duplicate_check(fl_list)

print('\nWith posted content:')
unique_fl_dois_with_pc, fl_with_pc_unique_count = duplicate_check(fl_list_pc)
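For long lists of forward links, counting every DOI with `list.count` gets slow. The sketch below shows one possible alternative using `collections.Counter`. The function name `count_citing_dois` and the sample records are invented for illustration; the record shape is assumed to match what `duplicate_check` expects.

```python
from collections import Counter


def count_citing_dois(fl_list: list) -> Counter:
    """Count how often each citing DOI appears in a list of forward links."""
    citing = []
    for work in fl_list:
        # each entry has '@doi' plus one citation key such as 'journal_cite'
        cite_key = next(k for k in work if k != "@doi")
        citing.append(work[cite_key]["doi"]["#text"])
    return Counter(citing)


# Hand-made sample in the shape the notebook expects from xmltodict
sample = [
    {"@doi": "10.5555/target", "journal_cite": {"doi": {"#text": "10.5555/a"}}},
    {"@doi": "10.5555/target", "book_cite": {"doi": {"#text": "10.5555/b"}}},
    {"@doi": "10.5555/target", "journal_cite": {"doi": {"#text": "10.5555/a"}}},
]
tally = count_citing_dois(sample)
duplicates = {doi: n for doi, n in tally.items() if n > 1}
print(len(tally), "unique DOIs;", duplicates)
```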

In [None]:
# option to save the results
save_to_json(fl_xml, 'forward links.json')

## XML API

Get metadata about the DOI in XML format, which includes a citedby-count value as part of the crm-items.

In [None]:
xml_url = 'https://doi.crossref.org/search/doi'

xml_params = {
    'pid': my_email,
    'format': 'unixsd',
    'doi': doi
}

xml = query_xml_api(xml_url, params = xml_params)

In [None]:
save_to_json(xml, 'xml query results.json')

In [None]:
crm_items = xml['crossref_result']['query_result']['body']['query']['crm-item']
count = [item['#text'] for item in crm_items if item['@name']=='citedby-count']
xml_count = count[0]

print(f"{xml_count} citations found in the Crossref XML API")

## Event Data

Crossref Event Data gathers citations contained in reference lists through a source called 'crossref'. Note that the service is not at production level: data may be missing and there may be timeouts. In addition, references prior to October 2022 are unlikely to be included. 

In [None]:
event_data_url = 'https://api.eventdata.crossref.org/v1/events'

event_data_params = {
    'mailto': my_email,
    'source': 'crossref',
    'obj-id': doi
}

events = query_json_api(event_data_url, event_data_params)
events_count = len(events['message']['events'])

save_to_json(events, 'event data.json')

print(f"{events_count} references found in Event Data")
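The count above is the number of events returned, which is not necessarily the number of distinct citing works. As a sketch, assuming each event carries a `subj_id` field identifying the citing work, the distinct works could be collected as follows. `unique_citing_works` is a hypothetical helper and the sample response is invented.

```python
def unique_citing_works(events_response: dict) -> set:
    """Collect the distinct subj_id values from an Event Data style response."""
    events = events_response.get("message", {}).get("events", [])
    return {event["subj_id"] for event in events if "subj_id" in event}


# Hand-made stand-in for a response; identifiers are invented
sample_events = {
    "message": {
        "events": [
            {"subj_id": "https://doi.org/10.5555/a", "obj_id": "https://doi.org/10.5555/target"},
            {"subj_id": "https://doi.org/10.5555/a", "obj_id": "https://doi.org/10.5555/target"},
            {"subj_id": "https://doi.org/10.5555/b", "obj_id": "https://doi.org/10.5555/target"},
        ]
    }
}
print(len(unique_citing_works(sample_events)))
```

The helper also returns an empty set for an empty response, which is what `query_json_api` gives back when a request fails.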

## Admin tool

Go to https://doi.crossref.org/servlet/submissionAdmin?sf=citedByLinks and put in the DOI (run the next cell to display it again).

Then, two cells below, add the number returned as the value of 'admin_count' and execute the cell.

In [None]:
doi

In [None]:
admin_count = -1

## Summary

Use Pandas to give a nicely formatted summary table.

In [9]:
counts = {
    'json api': json_count,
    'forward links': fl_count,
    'unique forward links': fl_unique_count,
    'unique forward links with posted content': fl_with_pc_unique_count,
    'xml api': xml_count,
    'admin tool': admin_count,
    'events': events_count
}


In [None]:
counts_ps = DataFrame([counts], index=[doi])
counts_ps
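Beyond the table, it can be handy to check programmatically whether the sources agree. The following is a minimal sketch under a few assumptions: counts may arrive as strings (as the XML value does), and a value of -1 means "not filled in", as with the default `admin_count`. `summarise_counts` is a hypothetical helper and the example numbers are made up.

```python
def summarise_counts(counts: dict) -> dict:
    """Report the spread of the available counts, skipping unset (-1) values."""
    available = {k: int(v) for k, v in counts.items() if int(v) >= 0}
    if not available:
        return {"agree": None, "min": None, "max": None}
    low, high = min(available.values()), max(available.values())
    return {"agree": low == high, "min": low, "max": high}


# Made-up numbers, including a string count and an unset admin tool value
example_counts = {"json api": 12, "xml api": "12", "admin tool": -1, "events": 9}
print(summarise_counts(example_counts))
```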