## Using the DPLA API

This notebook contains code for interacting with the DPLA API. It is designed specifically to produce output suited to further manual data entry, for eventual conversion to RDF according to the [Collex standard](http://wiki.collex.org/index.php/Main_Page).

### Search DPLA and return results as a TSV file

In [2]:
from dpla_api import DplaApi

# Quoted search terms should be put inside double quotes, then single quotes.
# For example: '"civil rights"'
search_term = '"american indian movement"'
fields = []

da = DplaApi()
# Run the search. DPLA caps page_size at 500, but the code iterates through
# as many pages as necessary to gather all results.
da.search(search_term, page_size=500, fields=fields)
# All retrieved results will be stored in the da.metadata_records object.
da.build_arc_rdf_dataset(check_match=False)
da.create_tsv()

Query: '"american indian movement"' returned 529 results
----Accessing results page 2
----Check: 529 records transferred
Completed writing data/radicalism-dpla.tsv
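
The log above shows a second results page being fetched because the 529 matches exceed the 500-record page cap. As a minimal sketch of that arithmetic (the helper name `pages_needed` is hypothetical and not part of `dpla_api`), the number of page requests a query needs can be estimated like this:

```python
def pages_needed(total_results, page_size=500):
    """Return how many result pages are needed to cover total_results records.

    Hypothetical helper for illustration only; not part of dpla_api.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division using integers only.
    return (total_results + page_size - 1) // page_size


print(pages_needed(529))  # 529 results at 500 per page -> 2
```

A quick estimate like this can help judge how long a broad search term will take before running it.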


#### Check values within results (not usually necessary)

In [None]:
errors = 0
# Flag any record whose subject entries lack a "name" key.
for r in da.all_returned_items:
    if "subject" in r["sourceResource"]:
        for s in r["sourceResource"]["subject"]:
            if "name" not in s:
                errors += 1
                print(r["sourceResource"]["subject"])
                print(r)

print(errors)
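
A similar sanity check can be run on the TSV file itself after manual edits, since a stray tab or a deleted cell changes the number of fields in a row. The sketch below is written for Python 3 and assumes the file starts with a header row. The helper name `summarize_tsv` is hypothetical, and no particular column names from `create_tsv` are assumed.

```python
import csv


def summarize_tsv(path):
    """Return (header, row_count, bad_lines) for a tab-separated file.

    bad_lines lists the 1-based line numbers of rows whose field count
    differs from the header's. Assumes no field contains a line break.
    """
    bad_lines = []
    row_count = 0
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader, [])
        for row in reader:
            row_count += 1
            if len(row) != len(header):
                # The header is line 1, so data row n sits on line n + 1.
                bad_lines.append(row_count + 1)
    return header, row_count, bad_lines
```

Calling `summarize_tsv("data/radicalism-dpla.tsv")` and confirming that the third value is an empty list gives some assurance that the edited file still has a consistent shape.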

### Transform TSV records to RDF

Once any necessary changes are complete, the TSV file can be used to create RDF records.

In [1]:
from buildrdf import BuildRdf

input_path = "data/radicalism-dpla.tsv"
output_path = "rdf/testtill.rdf"

br = BuildRdf(archive="dpla")
br.build_rdf_from_tsv(input_path, output_path, records_per_file=500) # first input file, then output file.

Processing 158 records...
Processed 158 records


The cells below implement a workflow for running a specific set of searches drawn from a JSON file.

In [None]:
from dpla_api import DplaApi
import json
da = DplaApi()
with open("mich-results.json") as f:
    data = json.load(f)
da.create_tsv(records=data)

In [None]:
import json
with open("mich-results.json", "w") as f:
    json.dump(da.metadata_records, f)

In [None]:
from dpla_api import DplaApi
import json
da = DplaApi()
all_metadata = []
search_file = "data/dpla_subjects.json"
with open(search_file) as f:
    search_terms = json.load(f)
# Disciplines, in this case, is an extra data point useful for creating
# the brand of RDF used in the SiRO project.
for search_term, disciplines in search_terms.items():
    if len(da.metadata_records) > 10000:
        break
    else:
        da.search(search_term, page_size=500)
        if da.result.count > 0:
            da.build_arc_rdf_dataset(disciplines=disciplines)
            # all_metadata += da.metadata_records
# Create a TSV file of output for closer analysis / update of terms.
da.create_tsv(records=da.metadata_records)

To make sure each successive batch of DPLA results doesn't simply reintroduce the results of previous batches, a listing of all previously encountered results is stored in a JSON file. An on-the-fly version is updated as searches complete. _However_, this cache should be cleared if the search results don't wind up making it into RDF records, by using the `reset_matches` parameter below.

In [None]:
from dpla_api import DplaApi
da = DplaApi()
da.update_rdf_registry(reset_matches=True)
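
The registry idea described above can be sketched as follows. This illustrates the general approach only and is not the `dpla_api` implementation: the class `SeenRegistry`, its method names, and the file layout (a JSON list of record IDs) are all assumptions.

```python
import json
import os


class SeenRegistry(object):
    """Track record IDs already encountered, persisted as a JSON list.

    Hypothetical illustration; the actual dpla_api registry may differ.
    """

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            with open(path) as f:
                self.seen = set(json.load(f))

    def filter_new(self, records):
        """Return only records whose "id" is unseen, and remember those IDs."""
        fresh = []
        for record in records:
            if record["id"] not in self.seen:
                self.seen.add(record["id"])
                fresh.append(record)
        return fresh

    def save(self):
        """Write the current set of IDs to disk."""
        with open(self.path, "w") as f:
            json.dump(sorted(self.seen), f)

    def reset(self):
        """Forget every ID, e.g. when a batch never became RDF records."""
        self.seen = set()
        self.save()
```

In this sketch, `filter_new` plays the role of the on-the-fly update and `reset` plays the role of `reset_matches=True`: it discards IDs from searches whose results were never written out, so that those records can be returned again by a later run.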

In [None]:
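# Build JSON file of search terms.
import json
subject_file = "data/estc_subjects.tsv"
subject_dict = {}
with open(subject_file) as f:
    for line in f:
        # Strip the line ending so the last column carries no trailing newline.
        values = line.rstrip("\r\n").split("\t")
        # Keep only subjects marked with "x" in the second column.
        if len(values) > 2 and values[1] == "x":
            subject_dict[values[0]] = values[2]
with open("data/dpla_subjects.json", "w") as g:
    json.dump(subject_dict, g)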
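
To make the expected file shapes concrete, here is a small sketch of the same selection logic run on in-memory rows. The function name `select_subjects` and the sample rows are placeholders for illustration, not contents of `estc_subjects.tsv`. The column layout (search term, selection mark, discipline) is inferred from the cell above.

```python
import json


def select_subjects(lines):
    """Map each search term marked "x" to its discipline value."""
    selected = {}
    for line in lines:
        values = line.rstrip("\r\n").split("\t")
        # Skip short rows and rows without the "x" mark in column two.
        if len(values) > 2 and values[1] == "x":
            selected[values[0]] = values[2]
    return selected


# Placeholder rows: one marked for searching, one not.
sample_rows = [
    "example subject one\tx\tHistory\n",
    "example subject two\t\tLiterature\n",
]
print(json.dumps(select_subjects(sample_rows)))
```

Only the row marked `x` survives, so the JSON written for the search loop is a flat object mapping each chosen search term to its discipline string.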

In [None]:
da._marc_record(da.all_returned_items[0])

In [None]:
for item in da.all_returned_items:
    # Look up the genre once per item instead of twice.
    genre = da._get_genre_from_marc(item)
    if genre != "none":
        print(genre)

In [8]:
da.all_returned_items[10]

{u'@context': u'http://dp.la/api/items/context',
 u'@id': u'http://dp.la/api/items/a034909614a958902175e02e0a305354',
 u'@type': u'ore:Aggregation',
 u'aggregatedCHO': u'#sourceResource',
 u'dataProvider': u'United States Government Publishing Office (GPO)',
 u'id': u'a034909614a958902175e02e0a305354',
 u'ingestDate': u'2020-02-03T16:51:10.618Z',
 u'ingestType': u'item',
 u'isShownAt': u'http://catalog.gpo.gov/F/?func=direct&doc_number=001109560&format=999',
 u'object': u'http://fdlp.gov/images/gpo-tn.jpg',
 u'originalRecord': {u'stringValue': u'<record \nxmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n  <header>\n    <identifier>oai:catalog.gpo.gov:GPO01-001109560</identifier>\n    <datestamp>2019-11-08T00:33:06Z</datestamp>\n    <setSpec>PURL_ALL</setSpec>\n  </header>\n  <metadata>\n    <marc:record \n    xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" xmlns:marc="http

In [3]:
da.metadata_records[1]

{'archive': '',
 'creator': [u'United States. Federal Bureau of Investigation'],
 'date': u'2004/2006',
 'discipline': '',
 'federation': 'SiRO',
 'genre': 'none',
 'id': u'http://dp.la/api/items/08063b7787e1e8ff5d865795e1367608',
 'language': [],
 'original_query': '"emmett till"',
 'role': '',
 'seeAlso': u'http://vault.fbi.gov/Emmett%20Till%20/',
 'source': u'Digital Library of Georgia',
 'subjects': [u'Governmental investigations--United States',
  u'United States. Federal Bureau of Investigation',
  u'African American youth--Violence against--Mississippi',
  u'African Americans--Violence against--Mississippi--History--20th century',
  u'African Americans--Mississippi',
  u'Hate crimes--Mississippi',
  u'Lynching--Mississippi--History--20th century',
  u'Mississippi--Race relations',
  u'Racism--Mississippi--History--20th century',
  u'Trials (Murder)--Mississippi--Sumner',
  u'Till, Emmett, 1941-1955--Death and burial',
  u'Milam, J. W.--Trials, litigation, etc',
  u'Bryant, Roy, 

In [1]:
pwd


u'/Users/higgi135/Projects/dpla-api'