# Data Collection

Here we use the Crossref REST API to generate our random sample. We'll use the habanero library, a Python wrapper for the Crossref API, to make the process easier. More information on the package can be found here: https://github.com/sckott/habanero



In [1]:
from habanero import Crossref, WorksContainer
import pandas as pd
import time
from pathlib import Path
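# Directories for storing the data
data_dir = Path('../data')
input_dir = data_dir / 'input'
# Create the input directory if it does not exist yet, so the later to_csv calls do not fail
input_dir.mkdir(parents=True, exist_ok=True)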
# Set up the Crossref client. To be added to the API's 'polite' pool, supply an email address so that
# Crossref can contact you should any problems arise.
cr = Crossref(mailto='youremail@here.com')
# We are only interested in journal articles, so we filter on that type.
# We also use the 'sample' parameter, which returns random works (limit of 100 per request).
search = cr.works(filter={'type': 'journal-article'}, sample=100)
# The WorksContainer class makes it easy to parse the response, so we can extract the records themselves
# rather than all of the metadata associated with the API call.
x = WorksContainer(search)
# Build a DataFrame from this initial search to check the format and verify the query results.
df = pd.DataFrame(data=x.works)
df.head()

Unnamed: 0,indexed,reference-count,publisher,issue,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,license,alternative-id,update-policy,published-online,assertion,original-title,subtitle,archive,abstract,funder,published-other,article-number
0,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",4,Royal Society of Chemistry (RSC),13,"{'domain': [], 'crossmark-restriction': False}",[CrystEngComm],{'date-parts': [[2011]]},10.1039/c1ce90037d,journal-article,"{'date-parts': [[2011, 6, 3]], 'date-time': '2...",4303,Crossref,2,[Dynamic behaviour in the solid state],10.1039,13,"[{'given': 'Tomislav', 'family': 'Friščić', 's...",292,"[{'key': 'c1ce90037d-(cit1)/*[position()=1]', ...",[CrystEngComm],en,[{'URL': 'http://pubs.rsc.org/en/content/artic...,"{'date-parts': [[2017, 6, 20]], 'date-time': '...",0.0,{'primary': {'URL': 'http://xlink.rsc.org/?DOI...,{'date-parts': [[2011]]},4,"{'issue': '13', 'published-print': {'date-part...",http://dx.doi.org/10.1039/c1ce90037d,[1466-8033],"[{'value': '1466-8033', 'type': 'electronic'}]","[Condensed Matter Physics, General Materials S...",{'date-parts': [[2011]]},,,,,,,,,,,,
1,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Elsevier BV,5,"{'domain': [], 'crossmark-restriction': False}",[International Journal of Engineering Science],"{'date-parts': [[1975, 5]]}",10.1016/0020-7225(75)90022-1,journal-article,"{'date-parts': [[2003, 3, 14]], 'date-time': '...",547,Crossref,0,[Symposium on aircraft crashworthiness: Design...,10.1016,13,,78,,[International Journal of Engineering Science],en,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2019, 3, 26]], 'date-time': '...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[1975, 5]]}",0,"{'issue': '5', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/0020-7225(75)90022-1,[0020-7225],"[{'value': '0020-7225', 'type': 'print'}]","[General Engineering, Mechanics of Materials, ...","{'date-parts': [[1975, 5]]}","[{'start': {'date-parts': [[1975, 5, 1]], 'dat...",[0020722575900221],,,,,,,,,,
2,"{'date-parts': [[2022, 4, 7]], 'date-time': '2...",0,Cambridge University Press (CUP),1,"{'domain': ['journals.cambridge.org'], 'crossm...",[J.R. Asiat. Soc. G.B. Irel.],"{'date-parts': [[1982, 1]]}",10.1017/s0035869x0015854x,journal-article,"{'date-parts': [[2011, 3, 15]], 'date-time': '...",1-2,Crossref,0,[Notes],10.1017,114,,56,,[Journal of the Royal Asiatic Society of Great...,en,[{'URL': 'https://www.cambridge.org/core/servi...,"{'date-parts': [[2019, 5, 25]], 'date-time': '...",0.0,{'primary': {'URL': 'https://www.cambridge.org...,"{'date-parts': [[1982, 1]]}",0,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1017/s0035869x0015854x,"[0035-869X, 2051-2066]","[{'value': '0035-869X', 'type': 'print'}, {'va...","[General Arts and Humanities, Cultural Studies]","{'date-parts': [[1982, 1]]}","[{'start': {'date-parts': [[2011, 3, 15]], 'da...",[S0035869X0015854X],http://dx.doi.org/10.1017/policypage,"{'date-parts': [[2011, 3, 15]]}",[{'value': 'Copyright © The Royal Asiatic Soci...,,,,,,,
3,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",12,Wiley,4,"{'domain': [], 'crossmark-restriction': False}",[phys. stat. sol. (a)],"{'date-parts': [[2008, 4]]}",10.1002/pssa.200777892,journal-article,"{'date-parts': [[2008, 3, 20]], 'date-time': '...",901-904,Crossref,21,[Real time spectroscopic ellipsometry of sputt...,10.1002,205,"[{'given': 'Jian', 'family': 'Li', 'sequence':...",311,"[{'key': '10.1002/pssa.200777892-BIB1', 'doi-a...",[physica status solidi (a)],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 4]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,"{'date-parts': [[2008, 4]]}",12,"{'issue': '4', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/pssa.200777892,"[1862-6300, 1862-6319]","[{'value': '1862-6300', 'type': 'print'}, {'va...","[Materials Chemistry, Electrical and Electroni...","{'date-parts': [[2008, 4]]}","[{'start': {'date-parts': [[2015, 9, 1]], 'dat...",,,,,,,,,,,
4,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,The University of Iowa,5,"{'domain': [], 'crossmark-restriction': False}",[The Annals of Iowa],"{'date-parts': [[1898, 4]]}",10.17077/0003-4827.2333,journal-article,"{'date-parts': [[2018, 6, 27]], 'date-time': '...",424-438,Crossref,0,[Major-General Frederick Steele],10.17077,3,"[{'given': 'John F.', 'family': 'Lacey', 'sequ...",6626,,[The Annals of Iowa],en,,"{'date-parts': [[2021, 9, 23]], 'date-time': '...",0.0,{'primary': {'URL': 'http://pubs.lib.uiowa.edu...,"{'date-parts': [[1898, 4]]}",0,"{'issue': '5', 'published-online': {'date-part...",http://dx.doi.org/10.17077/0003-4827.2333,"[0003-4827, 2473-9006]","[{'value': '0003-4827', 'type': 'print'}, {'va...",[General Materials Science],"{'date-parts': [[1898, 4]]}",,,,"{'date-parts': [[2014, 9, 23]]}",,,,,,,,
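
For readers curious about what `WorksContainer` is unpacking: the Crossref REST API wraps list results in an envelope, with the records themselves under `message['items']`. The sketch below uses a small hand-written stand-in for a response to show roughly what pulling the records out by hand would look like. The two records and the `extract_items` helper are illustrative assumptions, not real API output and not part of habanero.

```python
# Hand-written stand-in for a Crossref works response (illustrative values only).
fake_response = {
    "status": "ok",
    "message-type": "work-list",
    "message": {
        "total-results": 2,
        "items": [
            {"DOI": "10.0000/example.1", "type": "journal-article", "title": ["First example"]},
            {"DOI": "10.0000/example.2", "type": "journal-article", "title": ["Second example"]},
        ],
    },
}


def extract_items(response):
    """Return the list of work records from a works-list style response.

    Hypothetical helper: it mirrors the idea of pulling out the records,
    not habanero's actual implementation.
    """
    return response.get("message", {}).get("items", [])


records = extract_items(fake_response)
dois = [record["DOI"] for record in records]
print(dois)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, which is why nested fields such as `title` and `issued` show up as lists and dicts in the output above.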


## Getting the Full Sample
The initial query looks good, so we'll move on to collecting the full sample of 100,000 unique records. At 100 records per request, that means at least 1,000 requests, so this may take some time; it is best to run it overnight or in the background.

The loop counts the unique DOIs collected so far and stops once that count reaches its threshold (set to 105,000 in the code below, which leaves some headroom above the 100,000 target). Because each request draws a fresh random sample, some records will come back more than once; we'll handle those duplicates in the data cleaning notebook (among other things).

Because this can take a while, we build in a safety net against errors and timeouts: if a request fails, the data collected so far is saved, the script sleeps for 10 seconds, and the loop then tries again.

Once the loop finishes, we'll save the file and move on to cleaning the data.

In [None]:
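# Keep requesting random samples until 105,000 unique DOIs have been collected
# (this threshold sits above the 100,000 target; duplicate rows are kept for now).
while len(set(df['DOI'])) < 105000:
    try:
        search = cr.works(filter={'type': 'journal-article'}, sample=100)
        x = WorksContainer(search)
        # Append each returned record as a new row
        for work in x.works:
            df.loc[len(df)] = work
    except Exception:
        # Catching Exception (not a bare except) keeps Ctrl+C able to stop the loop.
        # On any error or timeout: save what we have, pause, then try again.
        df.to_csv(input_dir / '01_raw_data.csv')
        time.sleep(10)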

In [4]:
df.to_csv(input_dir / '01_raw_data.csv')
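
The collection loop can also be written so that duplicates never enter the store and the retry logic can be tested offline. The sketch below is one possible variant, not the code used to build this sample: `collect_unique`, `fetch_batch`, `on_error`, and `max_failures` are hypothetical names, and the fake fetcher only simulates a flaky API. Records are kept in a dict keyed by DOI, so checking progress is a plain `len(...)`.

```python
import time


def collect_unique(fetch_batch, target, on_error=None, pause=10.0, max_failures=5):
    """Call fetch_batch() until `target` unique DOIs have been collected.

    fetch_batch: zero-argument callable returning a list of record dicts with a 'DOI' key.
    on_error: optional callable that receives the records so far (e.g. to checkpoint them).
    max_failures: give up after this many consecutive failed requests.
    """
    records = {}  # keyed by DOI, so repeated records overwrite instead of piling up
    failures = 0
    while len(records) < target:
        try:
            batch = fetch_batch()
        except Exception:
            failures += 1
            if on_error is not None:
                on_error(list(records.values()))
            if failures >= max_failures:
                raise  # too many consecutive failures: surface the last error
            time.sleep(pause)
            continue
        failures = 0
        for work in batch:
            records[work["DOI"]] = work
    return list(records.values())


# Offline demonstration with a fake fetcher that fails once and repeats some DOIs.
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    if calls["n"] == 2:
        raise TimeoutError("simulated timeout")
    start = (calls["n"] - 1) * 2  # each batch shares one DOI with the next batch
    return [{"DOI": f"10.0000/demo.{i}"} for i in range(start, start + 3)]


checkpoints = []
sample = collect_unique(fake_fetch, target=8, on_error=checkpoints.append, pause=0)
print(len(sample), len(checkpoints))
```

With habanero, `fetch_batch` could be something like `lambda: WorksContainer(cr.works(filter={'type': 'journal-article'}, sample=100)).works`, and `on_error` a function that writes the records to CSV. Capping consecutive failures is a design choice that stops the loop from retrying forever if the API is down for a long stretch.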