# Author VIAF Download Via LCCN

This script queries viaf.org to find the VIAF ID that corresponds to an LCCN.
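As a rough illustration of the lookup, the sketch below builds the same kind of request URL the script uses and pulls an ID out of a redirect target. The helper names `build_lookup_url` and `viaf_id_from_location`, the sample LCCN, and the sample redirect URL are all hypothetical. The assumption that the ID is the last path segment of the redirect comes from how the script reads the `Location` header. Unlike the script, this sketch also percent-encodes the LCCN itself.

```python
from urllib.parse import quote


def build_lookup_url(lccn):
    # "LC|<lccn>" is percent-encoded, so "|" becomes "%7C"
    return "https://viaf.org/viaf/sourceID/" + quote(f"LC|{lccn}", safe="")


def viaf_id_from_location(location):
    # take the last non-empty path segment of the redirect target
    return location.rstrip("/").split("/")[-1]


# hypothetical sample values, not real identifiers
print(build_lookup_url("n00000000"))
print(viaf_id_from_location("https://viaf.org/viaf/123456789"))
```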

The TSV file must have a column that contains the LCCN.

This script modifies the TSV file in place, in batches. If the script times out or hits another error, you can rerun it and it will pick up where it left off. Always run it on a backup of your original data files.
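The batch-and-resume idea can be sketched without pandas. This is a simplified stand-in, not the script's own code: `chunked` and `needs_lookup` are hypothetical helpers, the rows are made-up sample data, and empty strings stand in for missing values.

```python
def chunked(rows, size):
    # yield consecutive slices of at most `size` rows
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def needs_lookup(row):
    # a row is pending when it has an LCCN but no VIAF ID yet
    return bool(row.get("author_lccn")) and not row.get("author_viaf")


# made-up sample rows
rows = [
    {"author_lccn": "n00000001", "author_viaf": "111"},  # filled by an earlier run
    {"author_lccn": "n00000002", "author_viaf": ""},     # still pending
    {"author_lccn": "", "author_viaf": ""},              # no LCCN, nothing to look up
]

for chunk in chunked(rows, 2):
    pending = [r["author_lccn"] for r in chunk if needs_lookup(r)]
    print(len(chunk), pending)
```

Because rows that already have a value are skipped, a rerun only spends requests on the rows that are still empty.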

It adds a new column, `author_viaf`, to the file, holding the VIAF ID.

In [41]:
import pandas as pd
import requests
import time
import string
import unicodedata

## Config
Set these variables below based on your setup

`path_to_tsv` - the path to the TSV file you want to run it on

`id_column_name` - the name of the column header that contains the LCCN value

`user_agent` - this value is sent in the headers of each request; it is good practice to identify your client/project when working with free, open APIs

`pause_between_req` - number of seconds to wait between each API call
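The script pauses with a plain `time.sleep(pause_between_req)` after each request. As an alternative sketch, and not what the script does, the hypothetical `Throttle` class below enforces a minimum interval between calls, so time already spent waiting on the network counts toward the pause.

```python
import time


class Throttle:
    """Keep at least `min_interval` seconds between successive wait() calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(0.05)  # hypothetical interval, in seconds
for _ in range(3):
    throttle.wait()  # call this right before each request
```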



In [42]:
path_to_tsv = "/Users/m/Downloads/data-tmp/hathitrust_post45fiction_metadata.tsv"
id_column_name = "author_lccn"
user_agent = 'YOUR PROJECT NAME HERE'
pause_between_req = 0


In [43]:
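def add_viaf(d):
    """Look up the VIAF ID for one row's LCCN and store it in `author_viaf`."""

    if not isinstance(d[id_column_name], str):
        # no LCCN in this row, skip it
        return d

    # if there is already a value, skip it (this is what lets a rerun resume)
    if 'author_viaf' in d and pd.notnull(d['author_viaf']):
        print('Skip', d[id_column_name], d['author_viaf'])
        return d

    headers = {'User-Agent': user_agent}
    url = f"https://viaf.org/viaf/sourceID/LC%7C{d[id_column_name]}"

    # do not follow the redirect; the VIAF ID is read from its Location header
    r = requests.get(url, headers=headers, allow_redirects=False)

    # pause after every request, including ones that find no match
    time.sleep(pause_between_req)

    location = r.headers.get('Location')
    if r.status_code == 404 or location is None:
        # no match, or a response without a redirect; leave the row unchanged
        return d

    d['author_viaf'] = location.rstrip('/').split('/')[-1]

    return d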



In [44]:
# load the tsv
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)

# split the dataframe into chunks so we can save our progress as we go without rewriting the entire file after every record
n = 500  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

# loop through each chunk
for idx, df_chunk in enumerate(list_df):

    print("Working on chunk ", idx, 'of', len(list_df))

    # if you want it to skip X number of chunks uncomment this
    # if idx < 88:
    #     continue

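    list_df[idx] = list_df[idx].apply(add_viaf, axis=1)

    # write the whole file back after each chunk so progress is saved;
    # index=False keeps pandas from adding an extra index column on every save
    reformed_df = pd.concat(list_df)
    reformed_df.to_csv(path_to_tsv, sep='\t', index=False)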




Working on chunk  0 of 760
46340201
115893402
68876905
32243514
7402360
106419530
39754103
297962196
87192126
112738469
29123813
5039782
187006121
38920586
113894534
69205518
294886266


KeyboardInterrupt: 

In [None]:
# This last block just does QA on the data to see which rows, if any, were not populated

# df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)
# print("There are", df['author_viaf'].isnull().sum(), 'rows with no author_viaf value, here are their', id_column_name, 'values:')
# res = df.loc[df['author_viaf'].isnull(), id_column_name].tolist()
# print(res)


