***************************************************************************************
Jupyter Notebooks from the Metadata for Everyone project

Code:
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)

Project team: 
* Juan Pablo Alperin (https://orcid.org/0000-0002-9344-7439)
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)
* Mike Nason (https://orcid.org/0000-0001-5527-8489)
* Julie Shi (https://orcid.org/0000-0003-1242-1112)
* Marco Tullney (https://orcid.org/0000-0002-5111-2788)

Last updated: xxx
***************************************************************************************

# Data Collection

Here we will use the Crossref REST API to generate our random sample. We'll use the habanero library, a Python wrapper for the Crossref API, to make the process easier. More information on the package can be found here: https://github.com/sckott/habanero
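
As a rough sketch of what a single sampling request could look like: the Crossref REST API accepts a `sample` parameter (at most 100 records per request), which habanero exposes through `Crossref().works(sample=...)`. The helper names `extract_dois` and `draw_sample` are illustrative assumptions and not necessarily the project's own code.

```python
from typing import Any, Dict, List, Optional


def extract_dois(response: Dict[str, Any]) -> List[str]:
    """Pull the DOI of every item out of a Crossref works response."""
    items = response.get("message", {}).get("items", [])
    return [item["DOI"] for item in items if "DOI" in item]


def draw_sample(n: int = 100, mailto: Optional[str] = None) -> List[str]:
    """Request up to n random works from Crossref (the API caps sample at 100)."""
    # Imported here so the pure helper above works without habanero installed.
    from habanero import Crossref

    cr = Crossref(mailto=mailto) if mailto else Crossref()
    return extract_dois(cr.works(sample=min(n, 100)))
```

Calling `draw_sample(100, mailto=...)` with a contact address returns one batch of DOIs. Because each request is capped at 100 records, a large sample needs many such requests.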



In [None]:
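from habanero import Crossref, WorksContainer
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup  # the 'xml' parser used below requires lxml
from pathlib import Path
from tqdm import tqdm
from typing import Optional, List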
# Directories for storing the data
data_dir = Path('../data')
input_dir = data_dir / 'input'

In [None]:
def fetch_data(doi: str, email: Optional[str] = None):
    """Request function to query Crossref API.

    Args:
        doi (str): The DOI of an item, used for querying Crossref API
        email (str, optional): Contact email, sent to Crossref as the `pid` parameter

    Returns:
        str: with r.status_code == 200, returns the XML response flattened to a single-line string
        None: r.status_code == 404 will return None as the resource was not found;
            any other status code or a request error also returns None

    A response with r.status_code == 504 waits one second and retries the query.
    """
    base_url = 'https://doi.crossref.org/search/doi'
    params = {'format': 'unixsd',
             'doi': doi}
    if email:
        params['pid'] = email
    try:
        r = requests.get(base_url, params=params)
        if r.status_code == 200:
            soup = BeautifulSoup(r.content.decode('utf-8').replace('\n', '').replace('\r', ''), 'xml')
            return  ''.join(str(tag) for tag in soup.find_all()).replace('\n', '').replace('\r', '').replace('\t', '')
        elif r.status_code == 404:
            return None  
        elif r.status_code == 504:
            # Gateway timeout: wait briefly, then retry with the same arguments
            print(r.status_code)
            time.sleep(1)
            return fetch_data(doi, email)
        else:
            return None
    except Exception as e:
        print(f"Error fetching DOI {doi}: {e}")
        return None
    
def get_crossref(id_list: List[str]):
    """Primary function for querying Crossref API and collecting responses

    Args:
        id_list (list): List of all DOIs to be queried.
    """
    chunk_size = 5000
    tmp = []
    
    print(f"Going after: {len(id_list)}.")
    
    file_path = data_dir / 'doi_file.csv'
    if file_path.is_file():
        print(f"The file {file_path} exists.")
        # cut -d',' -f1 doi_file.csv > dois_read.csv
    else:
        pd.DataFrame(columns=['DOI', 'message']).to_csv(file_path, mode='w', index=False)

#     already_read = pd.read_csv(data_dir / 'dois_read.csv')
#     print(f"Already read: {len(already_read)}.")
#     id_list = list(set(id_list).difference(already_read.DOI.str.lower()))
    print(f"Going after: {len(id_list)}.")
        
    # Record the starting time
    start_time = time.time()
    
   
    with tqdm(total=len(id_list)) as pbar:
        for i, doi in enumerate(id_list):
            try:
                result = fetch_data(doi)
                if result is not None:
                    tmp.append({'DOI': doi, 'message': result})
                    
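                # Flush to disk every chunk_size DOIs and after the last DOI
                if i % chunk_size == 0 or (i+1) == len(id_list):
                    if tmp:
                        pd.DataFrame(tmp).to_csv(file_path, mode='a', index=False, header=False)
                    tmp = []
                    # Throttle to roughly 3 requests per second on average:
                    # if we are ahead of that pace, sleep off the difference
                    end_time = time.time()
                    if i/3 > (end_time - start_time):
                        pause = i/3 - (end_time - start_time)
                        print(f"Sleeping: {int(pause)} seconds")
                        time.sleep(pause)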

                pbar.update(1)
            except KeyboardInterrupt:
                # Save whatever is still buffered before stopping
                if tmp:
                    pd.DataFrame(tmp).to_csv(data_dir / 'doi_file.csv', mode='a', index=False, header=False)
                raise
            except Exception as err:                
                print(err)

In [None]:
# Please note: If you take a new sample, the results will also change slightly. 
# To repeat our calculations, use our data sample. 
# To check our results with a new analysis, get a new sample.

## Getting the full Sample
The initial query looks good, so we'll move on to getting the full sample of 500,000 unique records. This takes a long time to collect, so it is best to run it overnight or in the background.

We'll set up our loop to count the number of unique DOIs collected so far and stop once that count reaches 500,000. The raw output will still contain duplicates, and we'll handle those in the data cleaning notebook (among other things).
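
The stopping rule described above can be sketched as follows. This is a minimal illustration, not the project's actual loop: `collect_until_unique`, `draw_batch`, and `max_batches` are hypothetical names, and `draw_batch` stands in for whatever function returns one batch of sampled DOIs. DOIs are lower-cased for counting because they are case-insensitive, while the collected list keeps every draw so that duplicates can be removed later during cleaning.

```python
from typing import Callable, Iterable, List, Set, Tuple


def collect_until_unique(
    draw_batch: Callable[[], Iterable[str]],
    target: int,
    max_batches: int = 1_000_000,
) -> Tuple[List[str], Set[str]]:
    """Draw batches until `target` unique DOIs have been seen."""
    collected: List[str] = []  # every DOI drawn, duplicates included
    unique: Set[str] = set()   # lower-cased DOIs, used only for counting
    for _ in range(max_batches):  # hard cap so the loop always terminates
        if len(unique) >= target:
            break
        for doi in draw_batch():
            collected.append(doi)
            unique.add(doi.lower())
    return collected, unique
```

Because the check happens once per batch, the final count of unique DOIs can slightly exceed the target.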

Since this can take a while, we'll want to build in a safety net against errors and timeouts. If an error occurs, the data collected so far is saved, the script sleeps for a while, and then it begins again.
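
One way such a safety net could be structured is sketched below. All names here (`run_with_safety_net`, `step`, `save`, `pause_seconds`, `max_failures`) are assumptions made for illustration: `step` stands for one unit of collection work that returns `True` once the target is reached, and `save` writes the data gathered so far to disk.

```python
import time
from typing import Callable


def run_with_safety_net(
    step: Callable[[], bool],
    save: Callable[[], None],
    pause_seconds: float = 60.0,
    max_failures: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call step() until it returns True; on error, save, sleep, and resume.

    Returns the number of failures that were recovered from.
    """
    failures = 0
    while True:
        try:
            if step():
                save()  # final save once the target is reached
                return failures
        except Exception as err:
            failures += 1
            save()  # keep what has been collected so far
            if failures >= max_failures:
                raise  # give up rather than retry forever
            print(f"Error: {err}. Sleeping {pause_seconds} seconds before resuming.")
            sleep(pause_seconds)
```

The `sleep` argument is injectable so the retry logic can be tried out without actually waiting. The failure cap is a design choice in this sketch that prevents an endless retry loop when the API is down for a long time.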

Once it has hit 500,000 unique records, we'll save the file and move on to cleaning the data.