## DPLA API Interactions

Code to interact with the Digital Public Library of America (DPLA) API. It is designed specifically to produce output suited to further manual data entry, for eventual conversion to RDF according to the [Collex standard](http://wiki.collex.org/index.php/Main_Page).

In [5]:
from dpla_api import DplaApi
da = DplaApi()
# Run the search. DPLA caps page_size at 500, but the code iterates through
# as many pages as necessary to gather all results.
da.search("Godwin, William, 1756-1836", page_size=500)
# All retrieved results will be stored in the da.metadata_records object.
da.build_arc_rdf_dataset()
da.create_tsv()

Query: 'Godwin William 1756-1836' returned 570 results
----Accessing results page 2
----Check: 570 records transferred
----Found: 0 existing RDF records
----Saved: 570 new metadata records
Completed writing data/radicalism-dpla-201603.tsv
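
The paging behaviour described in the comments above can be sketched as follows. This is only an illustration, not the code inside `DplaApi`: `pages_needed`, `collect_all`, and the `fetch_page` callable are hypothetical names standing in for whatever the class does internally.

```python
import math


def pages_needed(total_results, page_size=500):
    """Number of result pages required to cover total_results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return int(math.ceil(total_results / float(page_size)))


def collect_all(fetch_page, total_results, page_size=500):
    """Gather every record by requesting each page in turn.

    fetch_page is a hypothetical callable (page, page_size) -> list of
    records, standing in for a single API request.
    """
    records = []
    for page in range(1, pages_needed(total_results, page_size) + 1):
        records.extend(fetch_page(page, page_size))
    return records
```

Under this arithmetic, the 570 results reported above need two pages, which agrees with the single "Accessing results page 2" line in the output.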


The code below runs a specific set of searches drawn from a JSON file.

In [3]:
from dpla_api import DplaApi
import json
da = DplaApi()
search_file = "data/dpla_subjects.json"
with open(search_file) as f:
    search_terms = json.load(f)
# Disciplines, in this case, are an extra data point useful for creating
# the brand of RDF used in the SiRO project.
for search_term, disciplines in search_terms.items():
    # Stop once more than 10,000 records have been gathered.
    if len(da.metadata_records) > 10000:
        break
    da.search(search_term, page_size=500)
    if da.result.count > 0:
        da.build_arc_rdf_dataset(disciplines=disciplines)
# Create a TSV file of output for closer analysis / update of terms.
da.create_tsv(records=da.metadata_records)

Query: 'slave records' returned 1734 results
----Accessing results page 2
----Accessing results page 3
----Accessing results page 4
----Check: 1734 records transferred
----Found: 1 existing RDF records
----Saved: 1733 new metadata records
Query: 'exiles' returned 690 results
----Accessing results page 2
----Check: 690 records transferred
----Found: 1 existing RDF records
----Saved: 689 new metadata records
Query: 'abolitioniststs' returned 0 results
Query: 'asceticism' returned 86 results
----Check: 86 records transferred
----Found: 0 existing RDF records
----Saved: 86 new metadata records
Query: 'protest poetry english' returned 9 results
----Check: 9 records transferred
----Found: 0 existing RDF records
----Saved: 9 new metadata records
Query: 'thirty years war 1618-1648' returned 448 results
----Check: 448 records transferred
----Found: 0 existing RDF records
----Saved: 448 new metadata records
Query: 'subversive activities' returned 300 results
----Check: 300 records transferred
--
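
As a rough illustration of the loop above, the sketch below assumes the JSON file maps each search term to a disciplines string, and replaces the real API call with a stub. `run_searches`, the `search` callable, the discipline values, and the record IDs are all hypothetical.

```python
import json


def run_searches(search_terms, search, max_records=10000):
    """Run each search in turn until more than max_records are gathered.

    search is a hypothetical callable: term -> list of records.
    Returns a list of (record, disciplines) pairs.
    """
    gathered = []
    for term, disciplines in search_terms.items():
        # The cap is checked before each search, not during it.
        if len(gathered) > max_records:
            break
        for record in search(term):
            gathered.append((record, disciplines))
    return gathered


# Hypothetical file contents: search term -> disciplines string.
search_terms = json.loads(
    '{"exiles": "History", "asceticism": "Religious Studies"}'
)
fake_index = {"exiles": ["rec-1", "rec-2"], "asceticism": ["rec-3"]}
pairs = run_searches(search_terms, lambda term: fake_index.get(term, []))
```

Because the cap is only checked between searches, the final total can exceed `max_records` by up to the size of the last result set.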

Once any necessary changes have been made, the TSV file can be used to create RDF records, one for each row.

In [1]:
from buildrdf import BuildRdf
br = BuildRdf()
br.build_rdf_from_tsv("data/radicalism-dpla-201603.tsv")

Processing 10232 records...
Processed 0 records
Processed 500 records
Processed 1000 records
Processed 1500 records
Processed 2000 records
Processed 2500 records
Processed 3000 records
Processed 3500 records
Processed 4000 records
Processed 4500 records
Processed 5000 records
Processed 5500 records
Processed 6000 records
Processed 6500 records
Processed 7000 records
Processed 7500 records
Processed 8000 records
Processed 8500 records
Processed 9000 records
Processed 9500 records
Processed 10000 records
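
The general shape of a one-row-to-one-record conversion might look like the sketch below. It is not the `BuildRdf` implementation and does not follow the Collex schema: the `uri` and `title` column names, the `rows_to_rdf` function, and the use of plain Dublin Core `dc:title` are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def rows_to_rdf(tsv_file, report_every=500):
    """Build one rdf:Description element per TSV row.

    Assumes hypothetical columns named 'uri' and 'title'.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("{%s}RDF" % RDF_NS)
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for count, row in enumerate(reader):
        # Report progress at regular intervals, starting from zero.
        if count % report_every == 0:
            print("Processed %d records" % count)
        desc = ET.SubElement(root, "{%s}Description" % RDF_NS)
        desc.set("{%s}about" % RDF_NS, row["uri"])
        title = ET.SubElement(desc, "{%s}title" % DC_NS)
        title.text = row["title"]
    return root


# Made-up sample rows, used only to exercise the function.
sample = io.StringIO(
    u"uri\ttitle\n"
    u"http://example.org/1\tFirst item\n"
    u"http://example.org/2\tSecond item\n"
)
rdf_root = rows_to_rdf(sample)
```

Calling `ET.tostring(rdf_root)` would serialize the result as RDF/XML.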


To make sure each successive batch of DPLA results doesn't simply reintroduce results from previous batches, all previously encountered results are listed in a JSON file. An on-the-fly version is updated as searches complete; _however_, this cache should be cleared if those search results never make it into RDF records, by using the `reset_matches` parameter below.

In [2]:
from dpla_api import DplaApi
da = DplaApi()
da.update_rdf_registry(reset_matches=True)

Matching records: 0
New records: 7986
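
One way to picture the registry is as a JSON list of record identifiers that each batch is compared against. The sketch below is written under that assumption. `load_seen`, `record_batch`, and the `reset` flag are hypothetical names, and the actual behaviour of `update_rdf_registry` may differ.

```python
import json
import os
import tempfile


def load_seen(path):
    """Return the set of record IDs stored at path (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def record_batch(path, batch_ids, reset=False):
    """Split batch_ids into (matching, new) and persist the updated registry.

    With reset=True the stored registry is discarded first, so every ID
    in the batch counts as new.
    """
    seen = set() if reset else load_seen(path)
    matching = [i for i in batch_ids if i in seen]
    new = [i for i in batch_ids if i not in seen]
    with open(path, "w") as f:
        json.dump(sorted(seen.union(new)), f)
    return matching, new


registry_path = os.path.join(tempfile.mkdtemp(), "registry.json")
first = record_batch(registry_path, ["a", "b"])
second = record_batch(registry_path, ["b", "c"])
after_reset = record_batch(registry_path, ["b", "c"], reset=True)
```

Here the second batch reports `"b"` as matching, while the same batch after a reset reports nothing as matching, which mirrors the "Matching records: 0" line above.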


In [4]:
# Build json file of search terms.
import json
subject_file = "data/estc_subjects.tsv"
subject_dict = {}
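with open(subject_file) as f:
    for line in f:
        # Strip the line ending so it does not end up in the last column.
        values = line.rstrip("\r\n").split("\t")
        # Keep only subjects marked with an "x" in the second column.
        if len(values) > 2 and values[1] == "x":
            subject_dict[values[0]] = values[2]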
with open("data/dpla_subjects.json", "w") as g:
    json.dump(subject_dict, g)

In [3]:
da._marc_record(da.all_returned_items[0])

True

In [None]:
for item in da.all_returned_items:
    # Look up the genre once per item and skip items without one.
    genre = da._get_genre_from_marc(item)
    if genre != "none":
        print(genre)