***************************************************************************************
Jupyter Notebooks from the Metadata for Everyone project

Code:
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)

Project team: 
* Juan Pablo Alperin (https://orcid.org/0000-0002-9344-7439)
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)
* Mike Nason (https://orcid.org/0000-0001-5527-8489)
* Julie Shi (https://orcid.org/0000-0003-1242-1112)
* Marco Tullney (https://orcid.org/0000-0002-5111-2788)

Last updated: xxx
***************************************************************************************

# Data Collection

Here we will utilize the Crossref REST API to generate our random sample. We'll use the habanero library, a Python wrapper for the Crossref API, to make the process easier. More information on the package can be found here: https://github.com/sckott/habanero
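
For orientation, the kind of request the library sends on our behalf can be sketched with the standard library alone. The snippet below only builds a URL for the Crossref `/works` endpoint with the same filter and sample size used in this notebook, and does not send it. The helper name `build_sample_url` and the placeholder email are illustrative assumptions, not part of habanero.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_sample_url(sample_size=100, work_type="journal-article", mailto=None):
    """Build a Crossref /works URL asking for a random sample (hypothetical helper)."""
    # The notebook notes a limit of 100 works per sample request
    if not 1 <= sample_size <= 100:
        raise ValueError("sample_size must be between 1 and 100")
    params = {"filter": f"type:{work_type}", "sample": sample_size}
    # Supplying an email address places the request in the 'polite' pool
    if mailto:
        params["mailto"] = mailto
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


url = build_sample_url(mailto="youremail@here.com")
print(url)
```

Because each request returns at most 100 works, a large sample has to be assembled from many such requests, which is what the loop later in this notebook does.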



In [None]:
from habanero import Crossref, WorksContainer
import pandas as pd
import time
from pathlib import Path
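# Directories for storing the data
data_dir = Path('../data')
input_dir = data_dir / 'input'
# Create the input directory if it does not exist yet, so that saving the data later does not fail
input_dir.mkdir(parents=True, exist_ok=True)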
# Setting up our queries. To be added to the 'polite' pool for the API, add an email address so that Crossref
# can contact you should any problems arise.
cr = Crossref(mailto='youremail@here.com')
# We are only interested in journal articles, so we filter on that type.
# Additionally, we use the 'sample' feature, which returns random works (limit of 100 per request).
search = cr.works(filter={'type': 'journal-article'}, sample=100)
# The WorksContainer class lets us parse the response easily, so that we can extract the records themselves
# rather than all of the metadata associated with the API call.
x = WorksContainer(search)
# We'll set up a dataframe with this initial search to check the format and verify the query results
df = pd.DataFrame(data=x.works)
df.head()

In [None]:
# Please note: If you take a new sample, the results will also change slightly. 
# To repeat our calculations, use our data sample. 
# To check our results with a new analysis, get a new sample.

## Getting the full Sample
The initial query looks good, so we'll move on to getting the full sample. We're looking for 500,000 unique records. This may take some time to collect, so best to run it overnight or in the background.

To reach that number, we'll set up our loop to count the unique DOIs collected so far and stop once we have 500,000. The raw data will contain duplicates, and we'll handle those in the data cleaning notebook (among other things).
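
As a small illustration of why the loop counts unique DOIs rather than rows, the sketch below compares the two on a handful of made-up DOIs. The values are hypothetical stand-ins for the `DOI` column; the notebook itself performs the equivalent count with `set(df['DOI'])`.

```python
from collections import Counter

# Hypothetical DOIs standing in for the 'DOI' column of the collected sample
dois = ["10.1000/a", "10.1000/b", "10.1000/a", "10.1000/c", "10.1000/b"]

# Random sampling can return the same work more than once,
# so the number of rows overstates the number of distinct records
unique_dois = set(dois)
duplicate_rows = len(dois) - len(unique_dois)

# DOIs that were drawn more than once
counts = Counter(dois)
repeated = sorted(doi for doi, n in counts.items() if n > 1)

print(f"rows: {len(dois)}, unique: {len(unique_dois)}, duplicate rows: {duplicate_rows}")
print("repeated DOIs:", repeated)
```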

Since this can take a while, we'll want to build in a safety net against errors and timeouts. If an error occurs, the data collected so far is saved, the script sleeps for 10 seconds, and then collection resumes.
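
The save, sleep, and retry pattern described here can be sketched independently of the API. In the example below, `collect_until`, `flaky_fetch`, and the checkpoint callback are all hypothetical names: the fetch function is a stub that simulates one timeout, and the checkpoint only records how many rows would have been saved. It is a minimal sketch of the idea, not the notebook's actual collection code.

```python
import time


def collect_until(fetch_batch, target, save_checkpoint, pause=10, sleep=time.sleep):
    """Collect records until `target` unique DOIs have been seen (illustrative sketch)."""
    records = []
    seen = set()
    while len(seen) < target:
        try:
            batch = fetch_batch()
        except Exception:
            # Save progress, wait, then try again
            save_checkpoint(records)
            sleep(pause)
            continue
        for record in batch:
            records.append(record)  # duplicates are kept, as in the raw data
            seen.add(record["DOI"])
    return records


# Stub standing in for an API call: the second call fails, and batches overlap
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] == 2:
        raise TimeoutError("simulated timeout")
    start = calls["n"]
    return [{"DOI": f"10.1000/{i}"} for i in range(start, start + 3)]


checkpoints = []
records = collect_until(
    flaky_fetch,
    target=5,
    save_checkpoint=lambda rows: checkpoints.append(len(rows)),
    sleep=lambda seconds: None,  # skip the real pause in this demo
)
print(len(records), "rows,", len({r["DOI"] for r in records}), "unique DOIs")
```

Keeping the unique DOIs in a running set avoids recomputing the unique count over the whole dataset on every iteration, which may matter as the sample grows.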

Once it has hit 500,000 unique records, we'll save the file and move on to cleaning the data.

In [None]:
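# Keep sampling until the number of unique DOIs reaches the threshold.
# Note that the threshold used here is 530,000.
while len(set(df['DOI'])) < 530000:
    try:
        search = cr.works(filter={'type': 'journal-article'}, sample=100)
        x = WorksContainer(search)
        for work in x.works:
            # Append each record as a new row
            df.loc[len(df)] = work
    except Exception:
        # On an error or timeout, save what we have so far, wait, and then try again.
        # Catching Exception (rather than using a bare except) still lets Ctrl+C stop the loop.
        df.to_csv(input_dir / '01_raw_data.csv', index=False)
        time.sleep(10)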

In [None]:
df.to_csv(input_dir / '01_raw_data.csv', index=False)