DataRescue at Boston College
On 19 April 2017, staff and friends of Boston College participated in Endangered Data Week by holding a data rescue event. The goal of the event was to identify, archive, and secure NEH and IMLS datasets gathered from data.gov.
Approach to this task
There are a few ways to approach this task. The most direct is to manually scrape each dataset's data.gov web page. At the time of the data rescue event, we had identified 8 NEH datasets and 78 IMLS datasets, so manually scraping each dataset page was manageable in theory but not desirable in practice. We instead decided to automate the process, with the hope of developing reusable scripts that take advantage of the publicly available data.gov APIs.
Data.gov and CKAN
Data.gov is built atop CKAN, an open-source data management system. CKAN offers a set of public APIs that allow scripts to interact with the underlying data and metadata. Data.gov exposes these same APIs, but note that it doesn't actually host any datasets itself. Data.gov provides the metadata, which contains the URI of each dataset; the datasets themselves are usually hosted on the publishing organization's own servers.
Scraping data.gov APIs
With a little help from GitHub comments, we were able to figure out the API query structure.
We used the following API queries:
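The exact queries used at the event are not reproduced here. As an illustrative sketch (Python 3), CKAN's package_search action can be queried per publishing organization; the organization slugs neh-gov and imls-gov below are inferred from the script and data filenames and may differ from the ones actually used:

```python
# Illustrative only: build CKAN package_search URLs filtered by organization.
# Slugs "neh-gov" and "imls-gov" are assumptions inferred from the filenames.
from urllib.parse import urlencode

CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"

def build_query(organization, rows=100):
    """Return a package_search URL restricted to one publishing organization."""
    params = urlencode({"q": "organization:" + organization, "rows": rows})
    return CKAN_SEARCH + "?" + params

print(build_query("neh-gov"))
# → https://catalog.data.gov/api/3/action/package_search?q=organization%3Aneh-gov&rows=100
print(build_query("imls-gov"))
```

Each query returns a JSON document whose result.results array holds one metadata record per dataset.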
Parsing API query results
Next, we needed to parse the API query results to get the dataset URIs. Because data.gov doesn't store the datasets themselves, each dataset's URI has to be extracted from the metadata, and because most datasets are published in multiple formats, we wanted to fetch every format.
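The extraction step can be sketched as follows. In a CKAN package_search response, each record in result.results carries a resources list with a url and format per file; the sample response below is a made-up stand-in, and the exact keys our scripts relied on varied by organization:

```python
# Sketch: pull (title, format, url) triples out of a CKAN package_search
# response. The sample mimics CKAN's response shape; real responses carry
# many more fields.
def extract_resources(response):
    """Return a (dataset title, format, url) triple for every resource."""
    triples = []
    for dataset in response["result"]["results"]:
        for resource in dataset.get("resources", []):
            triples.append((dataset.get("title"),
                            resource.get("format"),
                            resource.get("url")))
    return triples

sample = {
    "result": {
        "results": [
            {"title": "Example IMLS dataset",
             "resources": [
                 {"format": "CSV", "url": "https://example.org/data.csv"},
                 {"format": "JSON", "url": "https://example.org/data.json"},
             ]}
        ]
    }
}
print(extract_resources(sample))
```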
In the span of about 24 hours, we put together a few Python scripts using Jupyter Notebook to parse and fetch each dataset. These files are labeled data.gov_neh-gov.ipynb and data.gov_imls-gov.ipynb. Also provided are plain Python scripts (.py) converted from the Jupyter Notebook files. These files have been tested to work with Python 2.7.13.
We developed a separate script for each organization because the JSON key names returned by each organization's API query differed slightly.
It is certainly possible to refactor the scripts into an adaptable framework for scraping data.gov APIs, but given our short development time frame, we put together something that simply worked.
The scripts expect to find a JSON file within the included data/ directory. For instance, data.gov_neh-gov.ipynb expects to find neh-gov.json within data/neh/. From there, the script parses the JSON file to find each dataset URI, then downloads and saves the dataset to a directory generated where it found the JSON file. If a dataset is available in multiple formats, each format is downloaded to the same directory. A log file of each transaction is also generated.
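A minimal sketch of the fetch-and-log step, assuming the metadata has already been reduced to (title, format, url) triples; the actual implementation lives in the .ipynb and .py files, and the output paths and log format here are illustrative only:

```python
# Illustrative sketch, not the event's actual script: download every
# resource format into one directory and log each transaction.
import logging
import os
from urllib.request import urlretrieve

def fetch_all(triples, out_dir):
    """Download each (title, format, url) triple into out_dir and log it."""
    os.makedirs(out_dir, exist_ok=True)
    logging.basicConfig(filename=os.path.join(out_dir, "rescue.log"),
                        level=logging.INFO)
    for title, fmt, url in triples:
        # Derive a local filename from the URL; all formats share out_dir.
        filename = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        try:
            urlretrieve(url, filename)
            logging.info("saved %s (%s) -> %s", title, fmt, filename)
        except Exception as exc:
            logging.error("failed %s: %s", url, exc)
```

Keeping every format beside its source JSON file means a rescued dataset and its provenance metadata travel together, which is what made the archive verifiable after the event.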