# Extracting and visualizing spatial data from the loc.gov JSON API

Digital mapping has become an increasingly accessible and valuable asset for humanities research. Working with spatially referenced data offers exciting possibilities for place-based scholarship, outreach, and teaching. It’s also a perfect avenue for interdisciplinary collaboration between, say, humanities researchers new to GIS and spatial scientists who’ve been using it for decades.

Embedded within digital collections available from the Library of Congress website are geographic data, including the locations of items and their local contexts. 

![mapping HAER points](haer_tutorial_screenshot.JPG)

The story of these data would be incomplete, however, without a critical understanding of the history behind their collection and stewardship. In this tutorial, we demonstrate how loc.gov JSON API users can find and store spatial information from Library content with an awareness toward data quality, provenance, and why this broadened scope is important for informing research projects at the Library. 

### Rights and access

Rights and restrictions, including copyright, affect how you can use images, particularly if you want to publish, display, or otherwise distribute them. You can read more about copyright and other restrictions that apply to publication/distribution of images from the Prints & Photographs Division (P&P) at this link: https://www.loc.gov/rr/print/195_copr.html

The records in the case study that follows were created for the U.S. Government and are considered to be in the public domain. It is understood that access to this material rests on the condition that should any of it be used in any form or by any means, the author of such material and the Historic American Engineering Record of the Heritage Conservation and Recreation Service at all times be given proper credit.

### Data quality

The consistency and accuracy of geographic information stored with digital content on the Library of Congress website vary within and across collections. This tutorial was designed as a method of exploring existing spatially referenced data on items. Finding and analyzing meaningful patterns typically requires additional data corrections, attributes, and geocoding frameworks to ensure optimal coverage. In any case, most of the time spent analyzing and interpreting data goes to cleaning and grappling with the data.

## Case study: the Historic American Engineering Record

As a graduate student of applied urban science, I was inspired at the outset of my internship with LC Labs to discover content about the built environment on the Library website. What I found was an expansive dataset of digitized photographs, drawings and reports recognized collectively as HHH: the Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey.

More about HHH: https://www.loc.gov/collections/historic-american-buildings-landscapes-and-engineering-records/about-this-collection/

*Browsing the collection:*

![How to search HHH](hhh_search_screenshot.gif)


*Example of material found in the collection:* 

View of the uptown platform at 79th Street. Photo by David Sagarin for the Historic American Engineering Record, Library of Congress, Prints and Photographs Division, August 1978.

![Interborough subway, NYC](haer_subway_screenshot.jpg)

### Things get HAER-y

I decided to dig a bit deeper into the engineering record. The Historic American Engineering Record, or HAER, was established by the National Park Service, the American Society of Civil Engineers and the Library of Congress in 1969. Documenting over 7,600 historic sites and structures related to engineering and industry, the collection is an ongoing effort with established guidelines for documentation. HAER was created to preserve these structures through rule-based documentation, and those documents have in turn been preserved through time.

Through my research, I encountered examples, as many have over the decades of HAER's existence, that test those rules, require interpretation of a rule's intention, or resulted in the rules being rewritten. What I aim to develop here, as a result of both my own investigation and conversations with Library staff, is an approach to looking at the implications for scholarship of changes to preservation over time.

Geography was actually the original motivating factor behind how these materials were organized by the Parks department. At a time when you had to go into physical folders to find records, you would find your site of interest using state name and record number. Geographic information, initially recorded in the Universal Transverse Mercator (UTM) coordinate system, was arduous to update; consequently, the locations chosen when converting to latitude and longitude were not always consistent. You see this in HAER records that contain `.v` in the title, which stands for 'in the vicinity'. These are typically rural structures that surveyors didn't get to geocode. To this day, many of them show up as points in the center of a nearby city. 

In 1997, the Library carried on this tradition established by NPS as the materials were ingested to provide digital access. During transmission, staff members would pull additional geographic context out of the items and into their titles. When the collection later came to the web, both old and new fields were mapped to the Library's own bibliographic system. NPS reorganized its system at the turn of the century, adding even more information and possibilities to the collection. Today, the Cultural Resources GIS Office at NPS is working with the Library to develop and implement standards for these geographic data.

One of the remarkable aspects of digital collections at the Library of Congress that I've found is the outsized contribution of a handful of prolific contributors. Data transmitted to the Library of Congress by contributor Justine Christianson, a HAER Historian with the National Park Service, appear particularly rich for our purpose of visualizing the spatial distribution of collection items: nearly 1,500 items with accurate latitude and longitude attributes intact.

In speaking with Kit Arrington, a Digital Library Specialist with the P&P Division, I learned that although this particular subset is returned when filtering the collection by contributor, these are not the only items that Justine has helped transmit to the Library, nor was she necessarily the only one working on these items. What the subset represents instead is a snapshot from the preservation timeline: collections management found a metadata solution, and then applied it to everything they were transmitting at that moment in history.

## Tutorial

In [None]:
### The following guide will demonstrate how to map the Historic American Engineering Record (HAER). 
### With minor changes, the same process of spatial data extraction and visualization could be applied
### to other collections containing explicit geographic information. 

The convention recommended in Python's style guide (PEP 8) is to put imports at the top of the file, on separate lines. For this tutorial, we'll be importing three packages into the notebook:

1) To get our data from the digitized HAER collection, we can use the `requests` Python module to access the loc.gov JSON API.

2) Reading in coordinates means our data needs to be re-organized - a task for the popular analysis package, `pandas`. 

3) Finally, we'll do our visualization with `folium` to plot the locations on an interactive Leaflet map. 

In [6]:
import requests
import pandas as pd
import folium

### Identifying items

Getting up to speed with use of the loc.gov JSON API and Python to access the collection was a breeze, thanks to data exploration resources from former LC Labs resident Laura Wrubel, Software development librarian at GWU.

Grab more tips for loc.gov JSON API calls and URL parameterization from Laura's 'Accessing images for analysis' notebook:
https://github.com/lwrubel/data-exploration/blob/master/Accessing%20images%20for%20analysis.ipynb

### Gathering geography



In [2]:
# Many of the prints & photographs in HAER are tagged with geographic coordinates ('latlong')
# Using the requests package we imported, we can easily 'get' data for an item as JSON and parse it for our latlong:

get_any_item = requests.get("https://www.loc.gov/item/al0006/?fo=json")
print('latlong: {}'.format(get_any_item.json()['item']['latlong']))

latlong: 32.45977,-86.47767
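
The `latlong` value comes back as a single comma-separated string. Below is a minimal sketch of turning it into a pair of floats, assuming the `lat,long` layout shown in the output above. `parse_latlong` is a hypothetical helper name used for illustration, not part of the loc.gov API or of `requests`.

```python
def parse_latlong(latlong):
    """Split a 'lat,long' string into (latitude, longitude) floats.

    Hypothetical helper for illustration. Returns None when the value
    is missing, malformed, or outside valid coordinate ranges.
    """
    if not isinstance(latlong, str):
        return None
    parts = latlong.split(",")
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    # latitude must fall in [-90, 90] and longitude in [-180, 180]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


print(parse_latlong("32.45977,-86.47767"))
```

Returning `None` rather than raising keeps a long loop over many items running when a single record is incomplete, at the cost of having to check for `None` afterwards.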


In [4]:
# To retrieve this sort of data point for a set of search results, we can use Laura's get_image_urls function. 
# This will allow us to store the latlong from each item in a list, working through each page of the search.

def get_image_urls(url, items=None):
    '''
    Retrieves the item URLs (ids) for search results that have public URLs available.
    Skips over items that are for the collection as a whole or web pages about the collection.
    Handles pagination.
    '''
    # start a fresh list on each top-level call (avoids a shared mutable default)
    if items is None:
        items = []
    # request pages of 100 results at a time
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    data = call.json()
    results = data['results']
    for result in results:
        # skip the collection-level result and web pages about the collection
        original_format = result.get("original_format") or []
        if "collection" not in original_format and "web page" not in original_format:
            # store the item's URL, found under its 'id' key
            item = result.get("id")
            items.append(item)
    if data["pagination"]["next"] is not None: # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        #print("getting next page: {0}".format(next_url))
        get_image_urls(next_url, items)

    return items
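
The same pagination can be written as a loop, which avoids recursion and makes the logic easy to try out without network access. This is a sketch under assumptions, not Laura's implementation: `collect_item_ids` and its `fetch_page` argument are hypothetical names, the sample pages are made up, and the only response fields it relies on are the `results`, `original_format`, `id`, and `pagination` / `next` keys used in the function above.

```python
def collect_item_ids(first_url, fetch_page):
    """Follow 'next' links page by page and gather item ids.

    fetch_page is any callable that takes a URL and returns the parsed
    JSON for that page as a dict (hypothetical interface for this sketch).
    """
    items = []
    url = first_url
    while url:
        data = fetch_page(url)
        for result in data.get("results", []):
            formats = result.get("original_format") or []
            # skip collection-level records and web pages
            if "collection" in formats or "web page" in formats:
                continue
            if result.get("id"):
                items.append(result["id"])
        # 'next' is None (or absent) on the last page
        url = data.get("pagination", {}).get("next")
    return items


# Stand-in pages so the loop can be exercised offline (made-up data)
fake_pages = {
    "page1": {
        "results": [
            {"id": "item-a", "original_format": ["photo, print, drawing"]},
            {"id": "coll", "original_format": ["collection"]},
        ],
        "pagination": {"next": "page2"},
    },
    "page2": {
        "results": [{"id": "item-b", "original_format": ["photo, print, drawing"]}],
        "pagination": {"next": None},
    },
}

print(collect_item_ids("page1", fake_pages.get))
```

With `requests`, `fetch_page` could be something like `lambda u: requests.get(u, params={"fo": "json", "c": 100, "at": "results,pagination"}).json()`, mirroring the parameters used above.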

To demonstrate with our subset of HAER from Justine Christianson, I'll use a search that targets items from HAER with the contributor 'Justine Christianson'.

In [2]:
url = "https://www.loc.gov/search/?fa=contributor:christianson,+justine&fo=json"

This is the base URL we will use for the API requests we'll be making as we run the function.
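
Search URLs like this one can also be assembled from their parts, which helps when swapping in a different facet value. Here is a small sketch using only the standard library. `build_search_url` is a hypothetical helper, and it assumes the `fa=contributor:...` facet syntax seen in the URL above. Note that `urlencode` percent-encodes the colon and comma, which decode back to the same parameter values.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(facet, value, base="https://www.loc.gov/search/"):
    """Return a loc.gov search URL filtered on one facet, requesting JSON."""
    query = urlencode({"fa": "{}:{}".format(facet, value), "fo": "json"})
    return "{}?{}".format(base, query)


search_url = build_search_url("contributor", "christianson, justine")
print(search_url)

# decode the query string again to confirm the parameters round-trip
print(parse_qs(urlsplit(search_url).query))
```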

Now we can apply Laura's `get_image_urls` function to our search results URL, formatted in JSON, to get a list of item URLs:

In [7]:
# retrieve all image URLs from the search results and store in a variable called 'image_urls'

image_urls = get_image_urls(url, items=[])

# how many URLs did we get?

len(image_urls)

1472

In [12]:
# to save on a little time, let's see what the first 100 look like
img100 = image_urls[0:100]

len(img100)

100

In [31]:
# storing latlongs in a set eliminates any potential duplicates
spatial_set = set()

# the parameters we set for our API calls, taken from the first function
p1 = {"fo" : "json"}

# loop through the item URLs
for img in img100:
    
    # make HTTP request to loc.gov API for each item
    r = requests.get(img, params=p1)
    
    # extract only from items with latlong attribute
    try:
        
        # expose in JSON format
        data = r.json()
        
        # parse for location
        results = data['item']['latlong']
        
        # add it to our running set
        spatial_set.add(results)
        
    # skip anything with missing 'latlong' data
    except (KeyError, TypeError, ValueError):
        
        # on to the next item until we're through
        pass
    
# show us the data!
spatial_set

{'20.030186,-155.818784',
 '20.9175,-156.3258333',
 '30.024702,-94.044589',
 '32.7253908733282,-114.616614456155',
 '33.860011,-118.1856252',
 '34.030577,-116.857538',
 '37.006918,-76.312634',
 '37.274,-118.9684',
 '37.7620889746898,-119.860731072203',
 '37.806455,-122.273202',
 '37.807728,-122.420792',
 '37.9090223042677,-119.256820782623',
 '38.077236,-122.097628',
 '38.291201,-76.814178',
 '38.5731349,-82.8301677',
 '38.578893,-77.17804715345169',
 '38.585328,-77.16830001823983',
 '38.586396,-77.16996233163732',
 '38.587641,-77.16987353941916',
 '38.589819,-77.16421028094226',
 '38.590695,-77.17816273371744',
 '38.591802,-77.17146837438597',
 '38.592078,-77.17745827052356',
 '38.684596,-77.131794',
 '38.931345,-74.919311',
 '39.50589,-80.16813',
 '39.511984,-77.8306878',
 '39.643082,-78.889328',
 '39.855162,-87.336002',
 '39.918427,-75.136655',
 '39.98953,-78.255624',
 '40.13333,-76.39167',
 '40.23848,-74.83119',
 '40.391357,-79.85933180595799',
 '41.29281,-94.14976',
 '41.3259244,-

In [36]:
# How many unique data points were we able to gather?
len(spatial_set)

72

### Data manipulation

We've mined the locations of a digital subset from the HAER collection. But before we can move the set of coordinates we've gathered onto an interactive map, we'll need to restructure it with the popular `pandas` package.

In [47]:
# convert latlong set to list
latlong_list = list(spatial_set)

# convert list to pandas dataframe
df = pd.DataFrame(latlong_list)

# split coordinates into two columns
df = df[0].str.split(',', expand=True)

# rename columns with latitude and longitude
df = df.rename(columns={0:'latitude', 1:'longitude'})

df

| | latitude | longitude |
|---|---|---|
| 0 | 43.6414716 | -70.2408811 |
| 1 | 47.507055 | -101.432257 |
| 2 | 47.6038321 | -122.3300624 |
| 3 | 37.806455 | -122.273202 |
| 4 | 42.630725 | -73.777451 |
| 5 | 38.578893 | -77.17804715345169 |
| 6 | 43.1525 | -79.04167 |
| 7 | 42.9465841 | -74.2095964 |
| 8 | 39.511984 | -77.8306878 |
| 9 | 41.5263888889 | -70.6730555556 |
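
After the split, both columns still hold strings. Below is a short sketch of converting them to numbers and dropping rows that fail to parse. It uses two coordinates from the output above plus one truncated entry to show the failure case. The `sample`, `coords`, and `clean` names are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# two values from the output above, plus one truncated entry
sample = pd.DataFrame(
    ["43.6414716,-70.2408811", "37.806455,-122.273202", "41.3259244,-"]
)

coords = sample[0].str.split(",", expand=True)
coords = coords.rename(columns={0: "latitude", 1: "longitude"})

# errors="coerce" turns anything unparseable into NaN instead of raising
clean = coords.apply(pd.to_numeric, errors="coerce").dropna()

print(clean.dtypes)
print(len(clean))
```

Numeric columns make it possible to sort, filter by bounding box, or compute summary statistics before mapping.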


At this stage you could export your tables of coordinates to combine with existing projects, visualize with other software, etc.

In [None]:
# write df to csv (the index is saved as the first, unnamed column)
df.to_csv('haer_sample.csv')

While we're working in a Jupyter notebook, let's read back in the .CSV file and make a map right here.

In [51]:
# convert spreadsheet to pandas dataframe
latlong_df = pd.read_csv('haer_sample.csv', usecols=[1,2])

### Geovisualization
When managing data from the Library in Python, a natural next step is to visualize them on a Leaflet map. The open-source tool `folium` builds on our earlier data wrangling with `pandas` and the mapping strengths of the Leaflet.js library to create an interactive experience.

In [56]:
# back to a list for folium
latlong_list = latlong_df.values.tolist()

# center map around some coordinate (here, New York City)
COORD = [40.7128, -74.0059]

map_haer = folium.Map(location=COORD, zoom_start=12,
                      tiles='cartodbpositron', width=640, height=480)

# add one small circle marker per [latitude, longitude] pair
for coord in latlong_list:
    folium.CircleMarker(coord, radius=1, color='#0080bb', fill_color='#0080bb').add_to(map_haer)

In [57]:
map_haer
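
Because the points span much of the country, a fixed center and zoom may leave most markers off screen. One option is to compute a bounding box from the coordinates and hand it to the map. The helper below is a sketch with a hypothetical name (`bounding_box`); it uses three coordinate pairs taken from the output earlier in this notebook and does not handle points that straddle the antimeridian.

```python
def bounding_box(points):
    """Return [[south, west], [north, east]] for a list of [lat, lon] pairs."""
    lats = [float(p[0]) for p in points]
    lons = [float(p[1]) for p in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


sample_points = [
    [43.6414716, -70.2408811],
    [47.6038321, -122.3300624],
    [37.806455, -122.273202],
]

bounds = bounding_box(sample_points)
print(bounds)
```

In the notebook, `map_haer.fit_bounds(bounding_box(latlong_list))` would then zoom the folium map to cover every marker.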

## Conclusion

Think about why and how researchers can actually use these data, and why they are confounded (or defeated) at every turn.

Consider, too, the amount of time it takes for a staff member to explain the shortcomings of the data as they are publicly available (forms, extent, restrictions).