# Hathi Download MARC

This script adds the MARC XML blob from the HathiTrust API to a TSV file. It is a preparatory step for other scripts that extract metadata from the MARC data.

Your data must contain either the HathiTrust record ID (a numeric ID) or the HathiTrust volume ID (for example, xzy.123456789).
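
As a rough illustration of how the two ID forms map onto the API, the sketch below builds the request URL following the URL pattern this notebook uses. The helper name `build_hathi_url` is hypothetical, and treating any all-digit value as a record ID is an assumption made for the example.

```python
def build_hathi_url(identifier):
    """Return the API URL for a record ID (numeric) or a volume ID (anything else)."""
    text = str(identifier).strip()
    # Assumption: an all-digit identifier is a record ID, everything else is a volume ID.
    field = "recordnumber" if text.isdigit() else "htid"
    return f"https://catalog.hathitrust.org/api/volumes/full/{field}/{text}.json"


print(build_hathi_url(123456))
print(build_hathi_url("xzy.123456789"))
```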

This script modifies the TSV file in place, in batches. If the script times out or hits another error, you can rerun it and it will pick up where it left off. Always run it on a backup of your original data files.

It creates two new columns in the file: `hathi_marc`, which holds the MARC XML, and `hathi_rights`, which holds the rights code for the title based on all of its volumes. If the volumes have a mix of rights codes, all of them are included, delimited with a pipe character ("|").
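
Downstream scripts will probably need to split the pipe-delimited `hathi_rights` value back into individual codes. The sketch below shows one way to do that. The helper name `split_rights` is hypothetical and the sample values are illustrative only.

```python
def split_rights(value):
    """Turn a pipe-delimited rights string into a list of codes."""
    # Empty cells come back from pandas as NaN (a float), so anything that is
    # not a non-empty string is treated as "no rights codes".
    if not isinstance(value, str) or not value:
        return []
    return value.split("|")


# Illustrative values only: a single code, a mix of codes, and an empty cell.
for cell in ["pd", "ic|pd", float("nan")]:
    print(repr(cell), "->", split_rights(cell))
```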

In [None]:
import pandas as pd
import requests
import time


## Config
Set these variables below based on your setup

`path_to_tsv` - the path to the TSV file you want to run it on

`id_column_name` - the name of the column header that contains the HathiTrust record ID or volume ID; you can use either

`user_agent` - the value sent in the `User-Agent` header on each request; it is good practice to identify your client/project when working with free, open APIs

`pause_between_req` - number of seconds to wait between each API call

In [None]:
path_to_tsv = "/Users/m/Downloads/data-tmp/nyt_hardcover_fiction_bestsellers-hathitrust_metadata.tsv"
id_column_name = "htid"
user_agent = 'YOUR PROJECT NAME HERE'
pause_between_req = 0

In [None]:
def add_hathi(d):
    """Fetch the MARC XML and rights codes for one row and return the updated row."""

    # Numeric IDs are record numbers; anything else is treated as a volume ID (htid).
    # pd.api.types.is_integer matches both Python ints and NumPy integers.
    field = "htid"
    if pd.api.types.is_integer(d[id_column_name]):
        field = 'recordnumber'

    # if there is already a value skip it
    if 'hathi_marc' in d and isinstance(d['hathi_marc'], str):
        print('Skip', d[id_column_name])
        return d

    url = f"https://catalog.hathitrust.org/api/volumes/full/{field}/{d[id_column_name]}.json"
    r = requests.get(url, headers={'Accept': 'application/json', 'User-Agent': user_agent})

    # pause after every request, including ones that fail the checks below
    time.sleep(pause_between_req)

    try:
        data = r.json()
    except ValueError:
        print("JSON decode error with:", d[id_column_name])
        return d

    if 'records' not in data:
        print("No record response found in:", d[id_column_name])
        return d

    # if more than one record comes back, the last one wins
    for recordid in data['records']:
        d['hathi_marc'] = data['records'][recordid]['marc-xml']

    if 'items' not in data:
        print("No items response found in:", d[id_column_name])
        return d

    # de-duplicate the rights codes and sort them so the output is stable between runs
    rights_codes = sorted({item['rightsCode'] for item in data['items']})
    d['hathi_rights'] = "|".join(rights_codes)

    return d
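
To make the parsing steps easier to follow without calling the API, here is a minimal offline sketch. The key names (`records`, `marc-xml`, `items`, `rightsCode`) are the ones `add_hathi` reads. The sample values, the record number, and the helper name `extract_marc_and_rights` are made up for illustration and are not taken from a real response.

```python
# Made-up response with the same shape that add_hathi expects.
sample_response = {
    "records": {
        "000000001": {"marc-xml": "<collection>placeholder</collection>"},
    },
    "items": [
        {"htid": "xzy.1", "rightsCode": "pd"},
        {"htid": "xzy.2", "rightsCode": "ic"},
        {"htid": "xzy.3", "rightsCode": "pd"},
    ],
}


def extract_marc_and_rights(data):
    """Return (marc_xml, pipe-delimited rights codes) from a parsed response."""
    marc = None
    for record in data.get("records", {}).values():
        marc = record.get("marc-xml")  # the last record wins, as in add_hathi
    # A set removes duplicate codes; sorting keeps the joined string stable.
    codes = sorted({item["rightsCode"] for item in data.get("items", [])})
    return marc, "|".join(codes)


marc_xml, rights = extract_marc_and_rights(sample_response)
print(rights)
```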

In [None]:
# load the tsv
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)

# split the dataframe into chunks so progress is saved as we go, without rewriting the entire file after every record
n = 100  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

# loop through each chunk
for idx, df_chunk in enumerate(list_df):

    print("Working on chunk", idx + 1, 'of', len(list_df))
    list_df[idx] = df_chunk.apply(add_hathi, axis=1)

    # write the whole file back after each chunk;
    # index=False stops pandas from adding an extra index column on every save
    reformed_df = pd.concat(list_df)
    reformed_df.to_csv(path_to_tsv, sep='\t', index=False)
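
The chunk-and-resume behaviour can be tried offline with a toy dataframe. In the sketch below, `fake_lookup` is a hypothetical stand-in for the network call and the sample IDs are invented. It only demonstrates that rows which already hold a `hathi_marc` string are skipped on a rerun.

```python
import pandas as pd

fetched_ids = []  # records which rows the stand-in "API call" was made for


def fake_lookup(row):
    """Hypothetical stand-in for add_hathi that never touches the network."""
    if isinstance(row.get("hathi_marc"), str):
        return row  # already populated on an earlier run, so skip it
    row = row.copy()
    fetched_ids.append(row["htid"])
    row["hathi_marc"] = f"<record id='{row['htid']}'/>"
    return row


# Toy data: two rows were filled in by a previous, interrupted run.
df = pd.DataFrame({
    "htid": ["a.1", "a.2", "a.3", "a.4", "a.5"],
    "hathi_marc": ["<record id='a.1'/>", None, None, "<record id='a.4'/>", None],
})

chunk_size = 2
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

for idx, chunk in enumerate(chunks):
    chunks[idx] = chunk.apply(fake_lookup, axis=1)
    # The notebook writes the TSV back to disk at this point.

result = pd.concat(chunks)
print(sorted(set(fetched_ids)))
```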




In [None]:
# This last block does QA on the data to see which rows, if any, were not populated

df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)
missing = df['hathi_marc'].isnull()
print("There are", missing.sum(), "rows with no hathi_marc value; here are their", id_column_name, "values:")
print(df.loc[missing, id_column_name].tolist())


