
Downloader for FOIAonline #23

Merged
merged 32 commits into master from foiaonline on Aug 28, 2014

Conversation

konklone
Contributor

[screenshot: foiaonline-intro]

This script downloads metadata for every request and record in FOIAonline, and responsive documents wherever available. It also extracts any available text from PDFs among those responsive documents. It deposits metadata and documents as bulk data on disk, in predictably arranged directories.

It can download dozens of GBs of data from FOIAonline, and as this represents quite a bit of bandwidth and server load, we encourage others who wish to use this data to contact us for a bulk data transfer, rather than re-downloading everything from FOIAonline's servers. This scraper is most useful as an ongoing way to stay "in sync" with FOIAonline's output.

If there's interest from others in the community in using this scraper, it'd be relatively easy to move this into its own project. For instance, it may merit a home at https://github.com/unitedstates. It's not known whether this scraper is going to be integrated into production systems here -- but it provides a very useful set of bulk data for search and analysis.

FOIAonline is a somewhat challenging website to scrape, as discovering records and requests means supplying a full-text search parameter from the search form: there is no way to "browse" requests.

[screenshot: foiaonline-search]

Additionally, searching requires first obtaining a session cookie and form parameters with a GET to the search form, then preserving that session across subsequent POSTs to navigate search results. Search results can be ordered by submission date, so it is relatively easy to keep up to date with "recent" results, and to resume interrupted pagination.
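
A minimal sketch of that flow, using requests (the endpoint URL and form field names below are assumptions for illustration, not the script's actual ones):

import requests

SEARCH_URL = "https://foiaonline.regulations.gov/foia/action/public/search"  # assumed endpoint

def search_pages(term, pages=1):
    """Yield raw HTML for successive pages of search results for a term."""
    session = requests.Session()

    # GET the search form first, so the server sets the session cookie
    # (and any hidden form tokens the later POSTs need).
    session.get(SEARCH_URL)

    for page in range(1, pages + 1):
        # Field names here are hypothetical -- the real ones have to be
        # read out of the search form's HTML.
        response = session.post(SEARCH_URL, data={
            "searchTerm": term,
            "pageNumber": page,
            "sortBy": "submittedDate",  # order by submission date
        })
        response.raise_for_status()
        yield response.text

# e.g. loop over agency slugs such as "epa" and "cbp"
for term in ("epa", "cbp"):
    for html in search_pages(term):
        pass  # parse result rows out of `html` here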

Fortunately, it appears that searching for agency "slugs" (e.g. "epa", "cbp") will reliably match on all of that agency's documents. So, a search for all 9 agencies' slugs should be sufficient to discover all documents.

[screenshot: foiaonline-results]

Search results provide only a small amount of metadata, but they do include a unique ID that can be used to construct a permalink to a landing page (example) for individual requests, appeals, referrals, and records. For records, that landing page contains a download link, as well as other metadata about the file and the related request ID. However, that download link is not a permalink - it will expire over time.
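
For records, for example, the landing page permalink can be built directly from the unique ID (this follows the landing_url pattern visible in the metadata shown later; the non-record object types are assumed to follow the same pattern):

def permalink(object_type, object_id):
    # e.g. permalink("record", "090004d280329072")
    # -> https://foiaonline.regulations.gov/foia/action/public/view/record?objectId=090004d280329072
    return ("https://foiaonline.regulations.gov/foia/action/public/view/"
            + object_type + "?objectId=" + object_id)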

[screenshot: foiaonline-record]

So, this scraper is multi-step, and consists of various options to control the flow. To get started, you would run:

./tasks/foiaonline.py --meta --term=epa

This paginates through FOIAonline search results, and saves metadata for each found object. For example, the above search might turn up a result that would save the following JSON into the project's data/ dir:

data/foiaonline/meta/record/EPA/2014/090004d280329072.json
{
  "agency": "EPA",
  "id": "090004d280329072",
  "tracking": "EPA-R9-2014-006943",
  "type": "record",
  "year": "2014"
}
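
That path is derived from the metadata fields themselves -- roughly like this (a sketch, not the script's actual code):

import os

def meta_path(meta, data_dir="data"):
    # data/foiaonline/meta/<type>/<agency>/<year>/<id>.json
    return os.path.join(
        data_dir, "foiaonline", "meta",
        meta["type"], meta["agency"], meta["year"],
        meta["id"] + ".json",
    )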

The unique ID is a nondescript hash generated by FOIAonline's internal database, with no relation to the object's tracking number, but it is permanent and can be used to generate a permalink.

A run with --meta for each of the 9 agency terms in turn will ultimately download around 425,000 metadata files for records, requests, referrals, and appeals (as of Aug 2014). If your filesystem is anything like mine, even though each JSON file is ~120 bytes, each one occupies at least a full 4K IO block on disk, so the metadata alone will weigh around 1.8GB and be annoying for your computer to run disk operations on. Oh well.

Run without --meta to begin downloading landing pages and linked responsive documents, and to extract text from PDFs where possible. Running with --resume is highly encouraged: it checks whether a record's responsive documents have already been downloaded and, if so, skips that record entirely.

./tasks/foiaonline.py --resume
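
The resume check itself amounts to something like the following (a sketch against the directory layout shown below; the real script's internals may differ):

import os

def already_downloaded(meta, data_dir="data"):
    """With --resume, skip records whose responsive document is already on disk."""
    doc_dir = os.path.join(
        data_dir, "foiaonline", "data",
        meta["type"], meta["agency"], meta["year"], meta["id"],
    )
    if not os.path.isdir(doc_dir):
        return False
    # any record.<ext> besides the metadata JSON counts as a downloaded document
    return any(name.startswith("record.") and not name.endswith(".json")
               for name in os.listdir(doc_dir))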

For the above metadata example, this will visit the permalink, extract more metadata, and trigger a download using the landing page's linked file URL. The extended metadata and documents will be saved at:

data/foiaonline/data/record/EPA/2014/090004d280329072/record.json
data/foiaonline/data/record/EPA/2014/090004d280329072/record.pdf
data/foiaonline/data/record/EPA/2014/090004d280329072/record.txt
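
The text extraction step can be done with a tool like pdftotext (shown here as a sketch; the script may use a different extractor):

import subprocess

def extract_text(pdf_path, txt_path):
    """Write record.txt next to record.pdf, if any text can be extracted."""
    try:
        subprocess.check_call(["pdftotext", pdf_path, txt_path])
        return True
    except (OSError, subprocess.CalledProcessError):
        return False  # pdftotext missing, or it failed on this PDF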

The scraper will attempt to guess the file type of the document based on the scraped metadata, which is unreliable. This is an area for improvement -- but regardless, the document will be downloaded, and its file path can be predicted from the data in record.json, which looks like this:

{
  "agency": "EPA",
  "author": null,
  "download_url": "https://foiaonline.regulations.gov/foia/action/getContent;jsessionid=F9F10A3A05BC6DCF83ADB2174AAA5945?objectId=AVa0S2yOxuDk75vynCP9XUruG4WUwOkb",
  "exemptions": null,
  "file_size": "0.0917959213256836",
  "file_type": "pdf",
  "landing_id": "090004d280329072",
  "landing_url": "https://foiaonline.regulations.gov/foia/action/public/view/record?objectId=090004d280329072",
  "released_on": "2014-08-14",
  "released_original": "Thu Aug 14 18:30:45 EDT 2014",
  "request_id": "EPA-R9-2014-006943",
  "retention": "6 year",
  "title": "04-16-14 1651 Pendergast",
  "type": "record",
  "unreleased": false,
  "year": "2014"
}
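
Given that metadata, the predicted paths look like this (a sketch, assuming file_type maps straight onto the extension):

import os

def doc_paths(record, data_dir="data"):
    # data/foiaonline/data/<type>/<agency>/<year>/<landing_id>/record.{json,<ext>,txt}
    base = os.path.join(
        data_dir, "foiaonline", "data",
        record["type"], record["agency"], record["year"], record["landing_id"],
    )
    ext = record.get("file_type") or "bin"  # guessed from metadata, not always right
    return {
        "meta": os.path.join(base, "record.json"),
        "doc": os.path.join(base, "record." + ext),
        "text": os.path.join(base, "record.txt"),
    }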

The script can also be run with --skip_doc to avoid downloading documents and only fetch landing pages and scrape metadata. This also triggers the use of cached HTML landing pages (saved in data/foiaonline/cache) where possible. Cached HTML is not used normally, when documents are being downloaded, because the download link needs to be regenerated by the server.
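
That caching decision boils down to something like this (a sketch; the cache filename scheme and helper names are assumptions):

import os
import requests

def landing_html(meta, session, skip_doc=False, data_dir="data"):
    """Fetch (or reuse cached) landing page HTML for an object.
    `session` is a requests.Session carrying the FOIAonline cookies."""
    cache_path = os.path.join(data_dir, "foiaonline", "cache", meta["id"] + ".html")

    # Cached HTML is only safe to reuse when we won't need a fresh,
    # non-expired download link from the landing page.
    if skip_doc and os.path.exists(cache_path):
        with open(cache_path) as f:
            return f.read()

    landing_url = ("https://foiaonline.regulations.gov/foia/action/public/view/"
                   + meta["type"] + "?objectId=" + meta["id"])
    html = session.get(landing_url).text

    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "w") as f:
        f.write(html)
    return html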

Some still outstanding tasks:

  • Downloading more detailed metadata for non-records: requests, appeals, and referrals. Currently, we only store their unique IDs (which can be used to create permalinks).
  • Smarter file type detection and file naming.
  • Verify that the record IDs found through pagination match the record IDs linked from the requests also found through pagination.
  • Add some options and documentation specifically optimized for a sync strategy.
  • Write a backup script to get our copy of the data backed up into a place like S3.

Our next steps with this are to write a small data loader that will send this bulk data into foia-search, as we've already done for the State Dept data, for which we wrote a separate scraper. Our goal is to harmonize those two datasets when loaded, and make them cross-searchable.

would be annoying to re-download everything, so may end up using this in
a batch task, but in theory this should run on its own, and be saved as
a separate file-headers.json or something, and cached alongside the rest,
to be used in determining file type and extension for the downloaded
binary file.
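
A minimal sketch of that header-caching idea (file-headers.json is named in the note above; everything else here is an assumption):

import json
import os
import requests

def cache_file_headers(download_url, doc_dir, session=None):
    """Save the download response headers alongside the document,
    for later file type / extension detection (e.g. from Content-Type)."""
    # the download link expires, so this has to run right after the landing
    # page is scraped, ideally with the same session
    http = session or requests
    response = http.head(download_url, allow_redirects=True)
    headers = {
        "Content-Type": response.headers.get("Content-Type"),
        "Content-Disposition": response.headers.get("Content-Disposition"),
    }
    with open(os.path.join(doc_dir, "file-headers.json"), "w") as f:
        json.dump(headers, f, indent=2)
    return headers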
konklone added a commit that referenced this pull request Aug 28, 2014
Downloader for FOIAonline
@konklone konklone merged commit 26b9643 into master Aug 28, 2014
@konklone konklone deleted the foiaonline branch August 28, 2014 03:33
khandelwal pushed a commit to khandelwal/foia that referenced this pull request Nov 28, 2014