This repository has been archived by the owner on Nov 7, 2018. It is now read-only.
It would be annoying to re-download everything, so this may end up running as a batch task, but in theory it should run on its own, be saved as a separate file-headers.json or something, and be cached alongside the rest, to be used to determine the file type and extension for the downloaded binary file.
khandelwal pushed a commit to khandelwal/foia that referenced this pull request on Nov 28, 2014:
add link to demo closes 18F#23
This script downloads metadata for every request and record in FOIAonline, and responsive documents wherever available. It also extracts any available text from PDFs among those responsive documents. It deposits metadata and documents as bulk data on disk, in predictably arranged directories.
It can download dozens of GBs of data from FOIAonline, and as this represents quite a bit of bandwidth and server load, we encourage others who wish to use this data to contact us for a bulk data transfer, rather than re-downloading everything from FOIAonline's servers. This scraper is most useful as an ongoing way to stay "in sync" with FOIAonline's output.
If there's interest from others in the community in using this scraper, it'd be relatively easy to move this into its own project. For instance, it may merit a home at https://github.com/unitedstates. It's not known whether this scraper is going to be integrated into production systems here -- but it provides a very useful set of bulk data for search and analysis.
FOIAonline is a somewhat challenging website to scrape, as discovering records and requests means supplying a full-text search parameter from the search form: there is no way to "browse" requests.
Additionally, searching requires first obtaining a session cookie and parameters with a GET to the form, then preserving that session across subsequent POSTs to navigate search results. Search results can be ordered by submission date, so it is relatively easy to keep up to date with "recent" results, and to resume interrupted pagination.
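The GET-then-POST session flow described above might be sketched as follows. The form URL and field names here are assumptions for illustration; the real ones must be read from FOIAonline's live search form.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Assumed search form URL -- confirm against the live site.
SEARCH_FORM = "https://foiaonline.regulations.gov/foia/action/public/search"

def make_session():
    # An opener sharing one CookieJar preserves the session cookie
    # across the initial GET and the subsequent paging POSTs.
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))

def build_search_payload(term, page):
    # Field names here are hypothetical placeholders.
    return urllib.parse.urlencode({
        "searchParams.searchTerm": term,
        "pageNum": str(page),
        "sortBy": "submittedDate",  # order by submission date to allow resuming
    }).encode()

def search_page(opener, term, page):
    # GET once to establish the session, then POST to page through results.
    opener.open(SEARCH_FORM)
    return opener.open(SEARCH_FORM, data=build_search_payload(term, page)).read()
```

In practice each agency slug ("epa", "cbp", etc.) would be fed through this loop in turn, parsing result rows out of the returned HTML.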
Fortunately, it appears that searching for agency "slugs" (e.g. "epa", "cbp") will reliably match on all of that agency's documents. So, a search for all 9 agencies' slugs should be sufficient to discover all documents.
Search results provide a very small amount of metadata, but they do include a unique ID that can be used to construct a permalink to a landing page (example) for individual requests, appeals, referrals, and records. For records, that landing page contains a download link, as well as other metadata about the file and the related request ID. However, that download link is not a permalink - it will expire over time.
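Constructing a permalink from the unique ID might look like the sketch below. The `/view` path pattern is an assumption, not confirmed above; verify it against an actual landing-page URL before relying on it.

```python
def permalink(object_type, object_id):
    # Hypothetical permalink pattern built from the unique ID.
    # object_type is one of: "request", "appeal", "referral", "record".
    base = "https://foiaonline.regulations.gov/foia/action/public/view"
    return "%s/%s?objectId=%s" % (base, object_type, object_id)
```

Because the download link on the landing page expires, the scraper re-fetches this permalink whenever it needs a fresh link, rather than storing the download URL.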
So, this scraper runs in multiple steps, with various options to control the flow. To get started, you would run:
This paginates through FOIAonline search results and saves metadata for each found object. For example, the above search might turn up a result that would save the following JSON into the project's `data/` dir:

The unique ID is a nondescript hash generated by FOIAonline's internal database, with no relation to the object's tracking number, but it is permanent and can be used to generate a permalink.
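Saving metadata into predictably arranged directories might be sketched like this. The `data/foiaonline/meta/<type>/<id>.json` layout is a hypothetical example; the scraper's real directory scheme may differ.

```python
import json
import os

def metadata_path(data_dir, object_type, object_id):
    # Hypothetical layout: data/foiaonline/meta/<type>/<id>.json
    return os.path.join(data_dir, "foiaonline", "meta",
                        object_type, object_id + ".json")

def save_metadata(data_dir, object_type, object_id, metadata):
    # Write the object's metadata to its predictable location on disk.
    path = metadata_path(data_dir, object_type, object_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return path
```

The point of a predictable layout is that any later step (or any consumer of the bulk data) can locate an object's file from its type and ID alone, without an index.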
A run with `--meta` for each of the 9 agency terms in turn will ultimately download around ~425,000 metadata files for records, requests, referrals, and appeals (as of Aug 2014). If your filesystem is anything like mine, even though each JSON file is ~120 bytes, they will take up an effective size of 16KB each (4 IO blocks at 4K apiece), and this alone will weigh 1.8GB and be annoying for your computer to run disk operations on. Oh well.

Run without `--meta` to begin downloading landing pages, linked responsive documents, and extracting text from PDFs where possible. It's highly encouraged to run with `--resume`, which will check whether responsive documents have already been downloaded and, if so, skip that record entirely.

For the above metadata example, this will visit the permalink, extract more metadata, and trigger a download using the landing page's linked file URL. The extended metadata and documents will be saved at:
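The `--resume` skip check described above can be sketched as follows; the function names are illustrative, not the scraper's actual API.

```python
import os

def already_downloaded(doc_path):
    # --resume: a record is skipped entirely when its responsive
    # document already exists on disk (and is non-empty).
    return os.path.exists(doc_path) and os.path.getsize(doc_path) > 0

def process_record(doc_path, download):
    # download is a callable that fetches the landing page, regenerates
    # the (expiring) file link, and writes the document to doc_path.
    if already_downloaded(doc_path):
        return "skipped"
    download(doc_path)
    return "downloaded"
```

Checking the document's presence (rather than the landing page's) is what makes interrupted runs cheap to restart: only records whose files never finished downloading get revisited.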
The scraper will attempt to guess the file type of the document based on the scraped metadata, which is not the best. This is an area for improvement -- but regardless, the document will be downloaded, and its file path can be predicted based on the data in `record.json`, which looks like this:

The script can also be run with `--skip_doc`, to avoid downloading documents and only bother fetching landing pages and scraping metadata. This will also trigger the use of cached HTML landing pages (saved in `data/foiaonline/cache`) where possible. This cached HTML will not be used normally, when docs are being downloaded, because the download link needs to be regenerated from the server.

Some still-outstanding tasks:
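The file-type guess mentioned above might be sketched like this, using the stdlib `mimetypes` module. The metadata field names (`file_type`, `title`) are hypothetical stand-ins for whatever the landing page actually yields.

```python
import mimetypes

def guess_extension(metadata):
    # Prefer a MIME type if one was captured from the landing page,
    # else fall back to any extension present in the file's title.
    mime = metadata.get("file_type")
    if mime:
        ext = mimetypes.guess_extension(mime)
        if ext:
            return ext
    title = metadata.get("title", "")
    if "." in title:
        return "." + title.rsplit(".", 1)[1].lower()
    return ".bin"  # unknown; download anyway and sort it out later
```

A sturdier approach (per the comment at the top of this thread) would be to HEAD the download URL and cache the response headers, e.g. in a file-headers.json, and read `Content-Type`/`Content-Disposition` from there.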
Our next steps with this are to write a small data loader that will send this bulk data into `foia-search`, as we've already done for the State Dept data, for which we wrote a separate scraper. Our goal is to harmonize those two datasets when loaded, and make them cross-searchable.