GitHub - OpenTechStrategies/fbo: Download, store, and serve data from the nightly FBO data dumps

The FBO Nightly data file is a data file released via FTP each day. It contains data set off by angle brackets, but this departs from XML in ways that make normal XML tooling useless to process it. This ETL package downloads the nightly dumps and then converts them to json.

Dependencies

This code is tested under Python 3.4 and Python 3.5. It will not run under Python 2, though it doesn't use any 3-only features that we know of. Backporting it would not be terribly difficult if your environment demands it.

Versions reflect what we used in our python environment. They are a known-good combination of versions but there is no reason to think newer versions will break anything. If you change versions, use the test suite.

To use a Python virtual environment, set it up like so:

$ virtualenv --python=python3 --prompt "[fbo]" venv
$ . venv/bin/activate

Then install the python requirements:

$ pip install -r requirements.txt

If you want sqlite3, don't get it from pypi (you can't). Use your distro's packaged version:

$ apt-get install sqlite3

You'll want goose to migrate the db up and down, which requires golang. If you don't have golang, you'll need to install it and set your GOPATH. Here is how you might do that on a Debian box:

$ apt-get install golang
$ mkdir -p {path-to-fbo}/golang
$ export GOPATH={path-to-fbo}/golang

Then, install goose:

$ go get -u github.com/pressly/goose/cmd/goose

(Over a slow Net connection this might take a while, because it has to clone various git repositories locally under {path-to-cavetl}/leie/golang/src/github.com/.)

Now goose is installed in the bin subdir under your specified $GOPATH. Add that path to your $PATH:

$ export PATH=$PATH:$GOPATH/bin

Running

This program downloads the data files. Run it periodically to stay up to date. They update daily, so ideally you run this daily.

Run the etl process. See the DEPENDENCIES section for how to install goose, which is required to do this. Goose will keep your migrations synced up and well-formed (like migrating geese, I guess).

$ cd FBONightly
$ ./etl.py

(Over a slow, or even not fast, Net connection this might take a while the first time, because various large data files are being fetched from upstream sources. They will be cached locally, so future runs of etl.py will be speedier.)

Now data has been put into the development database specified in db/dbconf.yml. Feel free to edit dbconf.yml as you need.

Don't worry about running the ETL twice. The program is pretty good about not redoing work unnecessarily.

Serving requests is not yet supported. It will eventually look like this: To serve requests over the API, run:

$ ./serve.py

Data Retention

Once loaded into the database, old data files can be discarded. If you leave old data files in the data directory, they will be ignored unless you blow away the db. Keeping them will do no harm (they're not large) and they might prove useful if you need to rebuild. For example, if the data format changes and nobody notices that the ETL function is failing, you might need to go back and reconstruct.

Notes About The Data

The data is messy. The file uses angle brackets but isn't XML. Most entities don't have closing tags. Some entities are actually part of the data and should be treated as such. Some entities set up a nested structure, but don't display that structure. There's no documentation. It's what you might expect from a pile of data that predates easy, widespread tooling for XML and JSON.

There is a node-based parser for these files on GitHub, but it hides some structure, so we're doing our own parsing to avoid using it.

Data Sources

Data is taken from ftp://ftp.fbo.gov.

Database choice

This etl program takes the data from files and loads it in to a database. We are mainly doing writes here and no complex queries. For now, we'll aim at SQLite and might add Postgres later. We're not using an ORM for this, mainly because we don't need one yet.

For now, this code manages migrations with Goose. We might move to Liquibase in the future, but Goose is dead simple right now, so we're going with it. The goal is to keep the migrations somewhat language and ORM agnostic. Whatever generates SQL (your ORM, DOA, framework, etc.) for your schema can also generate that SQL for a goose migration target. If you are hand-writing your schema sql, put it in the goose migration files and those files become the canonical reference on correct table forms. If you are generating your SQL, then whatever input you are feeding to your SQL-generator (whether it be hibernate or sqlalchemy or whatever) should be the canonical form.

The advantage of using sqlite here is speed and simplicity. The thing is fast and easy to use. The ETL should result in a read-heavy, single-user database. It will be slow on the write side, fast on the read side, but that's ok. If anything says SQLite, that does.

The advantage of postgres is that you get an actual multi-client concurrent server. If you want to serve data directly to remote clients (as opposed to doing it over a web API), you'll want Postgresql.

Testing

There are no tests yet.

License, Copyright, and Contributing

Released under the terms of the GNU Affero General Public License, Version 3 or later. Contributions welcome at https://github.com/OpenTechStrategies/fbo

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
FBONightly		FBONightly
LICENSE		LICENSE
README.mdwn		README.mdwn
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FBONightly

FBONightly

LICENSE

LICENSE

README.mdwn

README.mdwn

requirements.txt

requirements.txt

Repository files navigation

Dependencies

Running

Data Retention

Notes About The Data

Data Sources

Database choice

Testing

License, Copyright, and Contributing

About

Releases

Packages

Contributors 2

Languages

License

OpenTechStrategies/fbo

Folders and files

Latest commit

History

Repository files navigation

Dependencies

Running

Data Retention

Notes About The Data

Data Sources

Database choice

Testing

License, Copyright, and Contributing

About

Resources

License

Stars

Watchers

Forks

Languages