<h1>AMNH Library Catalog Entry Collector</h1>
<h3>Jonah Blumstein, Nov 19-20, 2016</h3>

<p>This script pulls entries from the American Museum of Natural History (AMNH) Library Catalog, which is built on top of Innovative's Sierra API. It was created at the AMNH Hack the Stacks hackathon for the AMNH API Portal team.</p>

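<p>Step 1 is to import the following libraries. requests is a third-party package and must be installed first; the others are part of the Python standard library.</p>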

In [1]:
import base64
import requests
import json
import threading

<p>Next, use a client key and client secret granted by the museum to get an access token to work with the library catalog API. Getting the access token is a multi-step process.</p>
<ol>
<li>Build an auth code by Base64-encoding the string client_key:client_secret.</li>
<li>Make a POST request to the library catalog's token URL, passing the auth code in the Authorization header and the grant type (client_credentials) in the form-encoded body of the request.</li>
<li>Parse the response, which is a JSON object, for the access token.</li>
</ol>
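<p>As a minimal sketch of step 1, the snippet below builds the value of the Authorization header from a key and secret. The function name basic_auth_header and the credentials 'user' and 'pass' are placeholders chosen for illustration; they are not part of the script or of the museum's API.</p>

```python
import base64

def basic_auth_header(client_key, client_secret):
    """Return an HTTP Basic Authorization header value (hypothetical helper)."""
    # Join key and secret with a colon, then Base64-encode the UTF-8 bytes.
    raw = '{}:{}'.format(client_key, client_secret).encode('utf-8')
    # b64encode returns bytes; decode to text so it can be concatenated.
    return 'Basic ' + base64.b64encode(raw).decode('ascii')

# Placeholder credentials, for illustration only.
header = basic_auth_header('user', 'pass')
print(header)  # Basic dXNlcjpwYXNz
```

<p>Decoding the bytes with .decode('ascii') avoids slicing the b'...' wrapper off the string representation of the encoded value.</p>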

In [2]:
#auth stuff
client_key = '######################'
client_secret = '#############'

def get_access_token(client_key, client_secret):
    '''get the oauth access token'''
    response = get_auth(client_key, client_secret)
    # Fail loudly on an HTTP error instead of a confusing KeyError below.
    response.raise_for_status()
    return response.json()['access_token']

def get_auth(client_key,client_secret):
    '''get auth'''
    encoded = get_encoded(client_key, client_secret)
    url = 'https://libcat1.amnh.org/iii/sierra-api/v3/token'
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Authorization': 'Basic ' + encoded.decode('ascii')
        }
    payload={'grant_type': 'client_credentials'}
    response = requests.post(url, data=payload, headers=headers)
    return response

def get_encoded(client_key,client_secret):
    '''Base64-encode "client_key:client_secret" (returned as bytes);
    the result is sent in the Authorization header to request a token'''
    client_key_and_secret = client_key + ':' + client_secret
    b_client_key_and_secret = client_key_and_secret.encode('utf-8')
    encoded = base64.b64encode(b_client_key_and_secret)
    return encoded

In [3]:
access_token = get_access_token(client_key,client_secret)

<p>Now requests can be made to the library catalog. Here, 2000 entries are requested at a time (the maximum number of responses allowed), starting at the lowest catalog BIB ID corresponding to an entry, 1000001, and offset by 2000 each time. That is, the first request returns the entries with BIB IDs 1000001 through 1002000, the second request 1002001 through 1004000, and so on. This process is repeated until BIB ID 1204000 (about the highest ID corresponding to an entry) is reached.</p>
<p>Each entry from each pull is separated and saved in its own JSON file, named after its BIB ID.</p>
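<p>The paging arithmetic described above can be sketched as follows. The helper bib_id_range and the constant names are hypothetical, and the sketch assumes that BIB IDs map one-to-one onto offsets, as the description implies.</p>

```python
PAGE_SIZE = 2000         # entries per request, per the text above
FIRST_BIB_ID = 1000001   # lowest BIB ID, per the text above
LAST_BIB_ID = 1204000    # approximate highest BIB ID, per the text above

def bib_id_range(offset, page_size=PAGE_SIZE, first_id=FIRST_BIB_ID):
    """Return the (first, last) BIB IDs expected for a request at this offset."""
    first = first_id + offset
    return first, first + page_size - 1

# Number of requests needed to cover the whole ID range.
total_pages = (LAST_BIB_ID - FIRST_BIB_ID + 1) // PAGE_SIZE

for page in range(3):
    offset = page * PAGE_SIZE
    print(offset, bib_id_range(offset))

print(total_pages)  # 102
```

<p>The first three pages print (1000001, 1002000), (1002001, 1004000) and (1004001, 1006000), matching the ranges given above.</p>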

In [5]:
# request stuff
def get(url, access_token):
    headers = {'Authorization': 'Bearer ' + access_token}
    return requests.get(url, headers=headers)

# search stuff
def get_results(url, entity, access_token, offset):
    '''get a search result
    response is json'''
    url = "{}/{}/?limit=2000&offset={}".format(url, entity, offset)
    response = get(url, access_token)
    return json.loads(response.text)

def save_results(results, directory):
    for datum in results:
        bib_id = datum['id']  # avoid shadowing the built-in id()
        path = '{}/{}.json'.format(directory, bib_id)
        with open(path, 'w') as f:
            json.dump(datum, f)

def get_and_save_results(base_url, access_token, offset):
    results = get_results(base_url, 'bibs', access_token, offset)
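    # Save into the current directory; an empty string would produce a path
    # like '/1000001.json', which points at the filesystem root.
    save_results(results['entries'], '.')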
    out_message = "saved offset: {}".format(offset)
    print(out_message)

def get_and_save_range(base_url, access_token, offset, pages):
    for i in range(pages):
        get_and_save_results(base_url, access_token, i * 2000 + offset)

<p>Using six threads, querying the database and saving the entries can run up to 6x faster.</p>
<p>In a future version of this script, exception handling needs to be added to deal with dying threads.</p>
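<p>One possible shape for that exception handling is sketched below. It is not part of the original script: run_range_safely and fake_fetch are hypothetical names, and fake_fetch stands in for the real request so the example runs without network access. Each page is retried a few times, and offsets that still fail are recorded so they can be re-run later instead of killing the thread.</p>

```python
import threading

def run_range_safely(fetch_page, offsets, failed, lock, retries=2):
    """Call fetch_page(offset) for each offset; record offsets that keep failing."""
    for offset in offsets:
        for attempt in range(retries + 1):
            try:
                fetch_page(offset)
                break  # success, move on to the next offset
            except Exception as exc:
                # After the last attempt, note the failure and carry on.
                if attempt == retries:
                    with lock:
                        failed.append((offset, repr(exc)))

def fake_fetch(offset):
    """Stand-in for a real page request; fails on one offset on purpose."""
    if offset == 4000:
        raise ConnectionError('simulated dropped connection')

failed = []              # shared list of (offset, error) pairs
lock = threading.Lock()  # guards appends to the shared list
workers = []

for i in range(2):
    offsets = [i * 6000 + page * 2000 for page in range(3)]
    t = threading.Thread(target=run_range_safely,
                         args=(fake_fetch, offsets, failed, lock))
    workers.append(t)
    t.start()

# Wait for every worker before inspecting the failures.
for t in workers:
    t.join()

print([offset for offset, _ in failed])  # [4000]
```

<p>With real requests, catching requests.exceptions.RequestException rather than Exception would likely be a narrower and safer choice.</p>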

In [6]:
base_url = 'https://libcat1.amnh.org/iii/sierra-api/v3'
threads = []

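for i in range(6):
    # Pass the offset through args so each thread gets its own value;
    # a lambda would capture i by reference and could see a later value.
    t = threading.Thread(target=get_and_save_range, args=(base_url, access_token, i * 34000, 17))
    threads.append(t)
    t.start()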

saved offset: 170000
saved offset: 0
saved offset: 34000
saved offset: 172000
saved offset: 102000
saved offset: 68000
saved offset: 136000
saved offset: 174000
saved offset: 2000
saved offset: 104000
saved offset: 36000
saved offset: 70000
saved offset: 176000
saved offset: 138000
saved offset: 106000
saved offset: 38000
saved offset: 4000
saved offset: 72000
saved offset: 178000
saved offset: 108000
saved offset: 140000
saved offset: 40000
saved offset: 6000
saved offset: 110000
saved offset: 42000
saved offset: 112000
saved offset: 44000
saved offset: 142000
saved offset: 8000
saved offset: 114000
saved offset: 46000
saved offset: 144000
saved offset: 10000
saved offset: 116000
saved offset: 48000
saved offset: 146000
saved offset: 50000
saved offset: 118000
saved offset: 12000
saved offset: 52000
saved offset: 120000
saved offset: 148000
saved offset: 14000
saved offset: 54000
saved offset: 122000
saved offset: 56000
saved offset: 150000
saved offset: 124000
saved offset: 16000
sav

Exception in thread Thread-8:
Traceback (most recent call last):
  File "C:\Users\IBM_ADMIN\Anaconda3\lib\site-packages\requests\packages\urllib3\response.py", line 226, in _error_catcher
    yield
  File "C:\Users\IBM_ADMIN\Anaconda3\lib\site-packages\requests\packages\urllib3\response.py", line 489, in read_chunked
    chunk = self._handle_chunk(amt)
  File "C:\Users\IBM_ADMIN\Anaconda3\lib\site-packages\requests\packages\urllib3\response.py", line 459, in _handle_chunk
    self._fp._safe_read(2)  # Toss the CRLF at the end of the chunk.
  File "C:\Users\IBM_ADMIN\Anaconda3\lib\http\client.py", line 699, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "C:\Users\IBM_ADMIN\Anaconda3\lib\socket.py", line 378, in readinto
    return self._sock.recv_into(b)
  File "C:\Users\IBM_ADMIN\Anaconda3\lib\ssl.py", line 748, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Users\IBM_ADMIN\Anaconda3\lib\ssl.py", line 620, in read
    v = self._sslobj.read(len, buffer)