# Collect BibTeX from CrossRef

This script uses the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc) to retrieve DOIs for a set of scholarly works. For each DOI, it calls the API again to retrieve the individual item record, transformed into BibTeX.


In [None]:
import requests
import urllib.request
import ijson
import time

num_results = 5000

x_rate_limit_interval = 1
x_rate_limit_limit = 50
headers = {"User-Agent": "Virginia Tech DLRL (mailto:waingram@vt.edu)"}

First, download up to 5000 records of a given type from the CrossRef API and retrieve the DOI for each entry. Retrieve these 1000 at a time using the `rows` and `offset` parameters.

Do this for each type in ```["book-section", "monograph", "report", "peer-review", "book-track", "journal-article", "book-part", "other", "book", "journal-volume", "book-set", "reference-entry", "proceedings-article", "journal", "component", "book-chapter", "proceedings-series", "report-series", "proceedings", "standard", "reference-book", "posted-content", "journal-issue", "dissertation", "dataset", "book-series", "edited-book", "standard-series"]```.
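
As a rough sketch of the paging described above, the request URLs for one type can be listed up front. The helper name `page_urls` is hypothetical and not part of this notebook; the URL pattern and the `rows`/`offset` parameters are the ones used in the next cell, and the two types in the loop are just examples taken from the list.

```python
def page_urls(ref_type, num_results=5000, rows=1000):
    """List the request URLs needed to page through num_results records.

    Hypothetical helper for illustration only: it builds strings and
    sends no requests.
    """
    base = "https://api.crossref.org/works"
    return [
        f"{base}?filter=type:{ref_type}&rows={rows}&offset={offset}"
        for offset in range(0, num_results - rows + 1, rows)
    ]


# Show how many pages each type needs and what the first request looks like.
for example_type in ["dissertation", "dataset"]:
    urls = page_urls(example_type)
    print(example_type, len(urls), urls[0])
```

With the defaults this yields five pages per type, at offsets 0, 1000, 2000, 3000 and 4000.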

In [None]:
ref_type = "dissertation"

dois_file_name = f"data/crossref-{ref_type}.dois"

rows = 1000
offset = 0
dois = []
while rows + offset <= num_results:
    url = f"https://api.crossref.org/works?filter=type:{ref_type}&rows={rows}&offset={offset}"

    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as res:
        # observe rate limits; fall back to the defaults set above if a header is missing
        x_rate_limit_limit = int(res.headers.get('X-Rate-Limit-Limit', x_rate_limit_limit))
        x_rate_limit_interval = int(str(res.headers.get('X-Rate-Limit-Interval', x_rate_limit_interval)).rstrip('s'))
        delay = float(x_rate_limit_interval) / x_rate_limit_limit

        # stream the DOIs out of the JSON response
        page = [i["DOI"] for i in ijson.items(res, "message.items.item")]

    dois += page
    if not page:
        break  # no more records of this type
    offset += rows
    time.sleep(delay)

with open(dois_file_name, 'w') as d: 
    for doi in dois:
        d.write(doi + '\n')
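
The delay calculation in the loop above can be isolated into a small function, which makes it easy to check without touching the network. This is a minimal sketch: `delay_from_headers` is a hypothetical name, and the fallback values (50 requests per 1 second) are simply the defaults set at the top of this notebook, not guaranteed API behaviour.

```python
def delay_from_headers(headers, default_limit=50, default_interval=1):
    """Seconds to wait between requests, derived from rate-limit headers.

    `headers` is any mapping with a .get method. Missing headers fall
    back to the (assumed) defaults instead of raising an error.
    """
    limit = int(headers.get("X-Rate-Limit-Limit", default_limit))
    # The interval header carries a trailing unit, e.g. "1s".
    interval = str(headers.get("X-Rate-Limit-Interval", default_interval))
    return float(interval.rstrip("s")) / limit


# A plain dict stands in for real response headers here.
print(delay_from_headers({"X-Rate-Limit-Limit": "50", "X-Rate-Limit-Interval": "1s"}))
print(delay_from_headers({}))
```

A plain dict is case-sensitive, unlike the header objects returned by `urllib` and `requests`, so the keys in the example are spelled exactly as in the loop above.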


For each DOI, call the CrossRef API again to get the BibTeX, and save the results to a file.

In [None]:
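from urllib.parse import quote

bibtex_file_name = f"data/crossref-{ref_type}.bibtex"

# read the DOIs saved earlier, skipping blank lines
with open(dois_file_name, 'r') as d:
    dois = [line.strip() for line in d if line.strip()]

with open(bibtex_file_name, 'w', encoding='utf-8') as f:
    for doi in dois:
        # percent-encode characters such as '#', '?' or '<' that some DOIs contain
        url = f"https://api.crossref.org/works/{quote(doi)}/transform/application/x-bibtex"

        r = requests.get(url, headers=headers)

        # observe rate limits; fall back to the defaults set above if a header is missing
        x_rate_limit_limit = int(r.headers.get('X-Rate-Limit-Limit', x_rate_limit_limit))
        x_rate_limit_interval = int(str(r.headers.get('X-Rate-Limit-Interval', x_rate_limit_interval)).rstrip('s'))
        delay = float(x_rate_limit_interval) / x_rate_limit_limit
        time.sleep(delay)

        # skip DOIs for which no BibTeX could be retrieved
        if r.ok:
            f.write(r.text + "\n")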
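
One way to sanity-check the saved file is to count the entries it contains and compare that with the number of DOIs. This is a minimal sketch, assuming each record starts with `@type{` at the beginning of a line (possibly after whitespace). The function name `count_bibtex_entries` and the sample text are made up for illustration and are not actual CrossRef output.

```python
import re

# An entry start looks like "@article{" or "@phdthesis{" at the start of a line.
ENTRY_START = re.compile(r"^\s*@\w+\s*\{", re.MULTILINE)


def count_bibtex_entries(text):
    """Count BibTeX entries in a string (hypothetical helper)."""
    return len(ENTRY_START.findall(text))


# Invented sample data with two entries, for illustration only.
sample = """@phdthesis{Doe_2020, title={An Example}, author={Doe, Jane}, year={2020}}

 @article{Roe_2019,
  title={Another Example},
  year={2019}
}
"""
print(count_bibtex_entries(sample))
```

To apply it to the file written above, pass in the file contents, for example `count_bibtex_entries(open(bibtex_file_name, encoding="utf-8").read())`. A count lower than the number of DOIs suggests that some requests did not return BibTeX.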
