
Downloader for FOIAonline #23

Merged
merged 32 commits into master from foiaonline on Aug 28, 2014

Conversation

konklone
Contributor

[screenshot: foiaonline-intro]

This script downloads metadata for every request and record in FOIAonline, and responsive documents wherever available. It also extracts any available text from PDFs among those responsive documents. It deposits metadata and documents as bulk data on disk, in predictably arranged directories.

It can download dozens of GBs of data from FOIAonline, and as this represents quite a bit of bandwidth and server load, we encourage others who wish to use this data to contact us for a bulk data transfer, rather than re-downloading everything from FOIAonline's servers. This scraper is most useful as an ongoing way to stay "in sync" with FOIAonline's output.

If there's interest from others in the community in using this scraper, it'd be relatively easy to move this into its own project. For instance, it may merit a home at https://github.com/unitedstates. It's not known whether this scraper is going to be integrated into production systems here -- but it provides a very useful set of bulk data for search and analysis.

FOIAonline is a somewhat challenging website to scrape, as discovering records and requests means supplying a full-text search parameter from the search form: there is no way to "browse" requests.

[screenshot: foiaonline-search]

Additionally, searching requires first obtaining a session cookie and form parameters with a GET to the search form, then preserving that session across subsequent POSTs to navigate search results. Search results can be ordered by submission date, so it is relatively easy to keep up to date with "recent" results, and to resume interrupted pagination.
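
A minimal sketch of that flow, using requests (the endpoint URL and form field names below are assumptions for illustration, not the script's actual ones):

import requests

SEARCH_URL = "https://foiaonline.regulations.gov/foia/action/public/search"  # assumed endpoint

def search_pages(term, pages=1):
    """Yield raw HTML for successive pages of search results for a term."""
    session = requests.Session()

    # GET the search form first, so the server sets the session cookie
    # (and any hidden form tokens the later POSTs need).
    session.get(SEARCH_URL)

    for page in range(1, pages + 1):
        # Field names here are hypothetical -- the real ones have to be
        # read out of the search form's HTML.
        response = session.post(SEARCH_URL, data={
            "searchTerm": term,
            "pageNumber": page,
            "sortBy": "submittedDate",  # order by submission date
        })
        response.raise_for_status()
        yield response.text

# e.g. loop over agency slugs such as "epa" and "cbp"
for term in ("epa", "cbp"):
    for html in search_pages(term):
        pass  # parse result rows out of `html` here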

Fortunately, it appears that searching for agency "slugs" (e.g. "epa", "cbp") will reliably match on all of that agency's documents. So, a search for all 9 agencies' slugs should be sufficient to discover all documents.

[screenshot: foiaonline-results]

Search results provide only a small amount of metadata, but they do include a unique ID that can be used to construct a permalink to a landing page (example) for individual requests, appeals, referrals, and records. For records, that landing page contains a download link, as well as other metadata about the file and the related request ID. However, that download link is not a permalink - it will expire over time.
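
For records, for example, the landing page permalink can be built directly from the unique ID (this follows the landing_url pattern visible in the metadata shown later; the non-record object types are assumed to follow the same pattern):

def permalink(object_type, object_id):
    # e.g. permalink("record", "090004d280329072")
    # -> https://foiaonline.regulations.gov/foia/action/public/view/record?objectId=090004d280329072
    return ("https://foiaonline.regulations.gov/foia/action/public/view/"
            + object_type + "?objectId=" + object_id)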

[screenshot: foiaonline-record]

So, this scraper is multi-step, and consists of various options to control the flow. To get started, you would run:

./tasks/foiaonline.py --meta --term=epa

This paginates through FOIAonline search results, and saves metadata for each found object. For example, the above search might turn up a result that would save the following JSON into the project's data/ dir:

data/foiaonline/meta/record/EPA/2014/090004d280329072.json
{
  "agency": "EPA",
  "id": "090004d280329072",
  "tracking": "EPA-R9-2014-006943",
  "type": "record",
  "year": "2014"
}
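
That path is derived from the metadata fields themselves -- roughly like this (a sketch, not the script's actual code):

import os

def meta_path(meta, data_dir="data"):
    # data/foiaonline/meta/<type>/<agency>/<year>/<id>.json
    return os.path.join(
        data_dir, "foiaonline", "meta",
        meta["type"], meta["agency"], meta["year"],
        meta["id"] + ".json",
    )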

The unique ID is a nondescript hash generated by FOIAonline's internal database, with no relation to the object's tracking number, but it is permanent and can be used to generate a permalink.

A run with --meta for each of the 9 agency terms in turn will ultimately download around 425,000 metadata files for records, requests, referrals, and appeals (as of Aug 2014). If your filesystem is anything like mine, even though each JSON file is ~120 bytes, each one occupies at least a full 4K IO block on disk, so the metadata alone will weigh around 1.8GB and be annoying for your computer to run disk operations on. Oh well.

Run without --meta to begin downloading landing pages and linked responsive documents, and to extract text from PDFs where possible. Running with --resume is highly encouraged: it checks whether a record's responsive documents have already been downloaded and, if so, skips that record entirely.

./tasks/foiaonline.py --resume
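
The resume check itself amounts to something like the following (a sketch against the directory layout shown below; the real script's internals may differ):

import os

def already_downloaded(meta, data_dir="data"):
    """With --resume, skip records whose responsive document is already on disk."""
    doc_dir = os.path.join(
        data_dir, "foiaonline", "data",
        meta["type"], meta["agency"], meta["year"], meta["id"],
    )
    if not os.path.isdir(doc_dir):
        return False
    # any record.<ext> besides the metadata JSON counts as a downloaded document
    return any(name.startswith("record.") and not name.endswith(".json")
               for name in os.listdir(doc_dir))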

For the above metadata example, this will visit the permalink, extract more metadata, and trigger a download using the landing page's linked file URL. The extended metadata and documents will be saved at:

data/foiaonline/data/record/EPA/2014/090004d280329072/record.json
data/foiaonline/data/record/EPA/2014/090004d280329072/record.pdf
data/foiaonline/data/record/EPA/2014/090004d280329072/record.txt
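
The text extraction step can be done with a tool like pdftotext (shown here as a sketch; the script may use a different extractor):

import subprocess

def extract_text(pdf_path, txt_path):
    """Write record.txt next to record.pdf, if any text can be extracted."""
    try:
        subprocess.check_call(["pdftotext", pdf_path, txt_path])
        return True
    except (OSError, subprocess.CalledProcessError):
        return False  # pdftotext missing, or it failed on this PDF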

The scraper will attempt to guess the file type of the document based on the scraped metadata, which is unreliable. This is an area for improvement -- but regardless, the document will be downloaded, and its file path can be predicted from the data in record.json, which looks like this:

{
  "agency": "EPA",
  "author": null,
  "download_url": "https://foiaonline.regulations.gov/foia/action/getContent;jsessionid=F9F10A3A05BC6DCF83ADB2174AAA5945?objectId=AVa0S2yOxuDk75vynCP9XUruG4WUwOkb",
  "exemptions": null,
  "file_size": "0.0917959213256836",
  "file_type": "pdf",
  "landing_id": "090004d280329072",
  "landing_url": "https://foiaonline.regulations.gov/foia/action/public/view/record?objectId=090004d280329072",
  "released_on": "2014-08-14",
  "released_original": "Thu Aug 14 18:30:45 EDT 2014",
  "request_id": "EPA-R9-2014-006943",
  "retention": "6 year",
  "title": "04-16-14 1651 Pendergast",
  "type": "record",
  "unreleased": false,
  "year": "2014"
}
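
Given that metadata, the predicted paths look like this (a sketch, assuming file_type maps straight onto the extension):

import os

def doc_paths(record, data_dir="data"):
    # data/foiaonline/data/<type>/<agency>/<year>/<landing_id>/record.{json,<ext>,txt}
    base = os.path.join(
        data_dir, "foiaonline", "data",
        record["type"], record["agency"], record["year"], record["landing_id"],
    )
    ext = record.get("file_type") or "bin"  # guessed from metadata, not always right
    return {
        "meta": os.path.join(base, "record.json"),
        "doc": os.path.join(base, "record." + ext),
        "text": os.path.join(base, "record.txt"),
    }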

The script can also be run with --skip_doc to avoid downloading documents and only fetch landing pages and scrape metadata. This also triggers the use of cached HTML landing pages (saved in data/foiaonline/cache) where possible. Cached HTML is not used normally, when documents are being downloaded, because the download link needs to be regenerated by the server.
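
That caching decision boils down to something like this (a sketch; the cache filename scheme and helper names are assumptions):

import os
import requests

def landing_html(meta, session, skip_doc=False, data_dir="data"):
    """Fetch (or reuse cached) landing page HTML for an object.
    `session` is a requests.Session carrying the FOIAonline cookies."""
    cache_path = os.path.join(data_dir, "foiaonline", "cache", meta["id"] + ".html")

    # Cached HTML is only safe to reuse when we won't need a fresh,
    # non-expired download link from the landing page.
    if skip_doc and os.path.exists(cache_path):
        with open(cache_path) as f:
            return f.read()

    landing_url = ("https://foiaonline.regulations.gov/foia/action/public/view/"
                   + meta["type"] + "?objectId=" + meta["id"])
    html = session.get(landing_url).text

    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "w") as f:
        f.write(html)
    return html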

Some still outstanding tasks:

  • Downloading more detailed metadata for non-records: requests, appeals, and referrals. Currently, we only store their unique IDs (which can be used to create permalinks).
  • Smarter file type detection and file naming.
  • Verify that the record IDs found through pagination match the record IDs linked from the requests also found through pagination.
  • Add some options and documentation specifically optimized for a sync strategy.
  • Write a backup script to get our copy of the data backed up into a place like S3.

Our next steps with this are to write a small data loader that will send this bulk data into foia-search, as we've already done for the State Dept data, for which we wrote a separate scraper. Our goal is to harmonize those two datasets when loaded, and make them cross-searchable.

would be annoying to re-download everything, so may end up using this in
a batch task, but in theory this should run on its own, and be saved as
a separate file-headers.json or something, and cached alongside the rest,
to be used in determining file type and extension for the downloaded
binary file.
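
A minimal sketch of that header-caching idea (file-headers.json is named in the note above; everything else here is an assumption):

import json
import os
import requests

def cache_file_headers(download_url, doc_dir, session=None):
    """Save the download response headers alongside the document,
    for later file type / extension detection (e.g. from Content-Type)."""
    # the download link expires, so this has to run right after the landing
    # page is scraped, ideally with the same session
    http = session or requests
    response = http.head(download_url, allow_redirects=True)
    headers = {
        "Content-Type": response.headers.get("Content-Type"),
        "Content-Disposition": response.headers.get("Content-Disposition"),
    }
    with open(os.path.join(doc_dir, "file-headers.json"), "w") as f:
        json.dump(headers, f, indent=2)
    return headers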
konklone added a commit that referenced this pull request Aug 28, 2014
Downloader for FOIAonline
@konklone konklone merged commit 26b9643 into master Aug 28, 2014
@konklone konklone deleted the foiaonline branch August 28, 2014 03:33
khandelwal pushed a commit to khandelwal/foia that referenced this pull request Nov 28, 2014