Skip to content

OpenTechStrategies/fbo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 

Repository files navigation

The FBO Nightly data file is a data file released via FTP each day. It contains data set off by angle brackets, but this departs from XML in ways that make normal XML tooling useless to process it. This ETL package downloads the nightly dumps and then converts them to json.

Dependencies

This code is tested under Python 3.4 and Python 3.5. It will not run under Python 2, though it doesn't use any 3-only features that we know of. Backporting it would not be terribly difficult if your environment demands it.

Versions reflect what we used in our python environment. They are a known-good combination of versions but there is no reason to think newer versions will break anything. If you change versions, use the test suite.

To use a Python virtual environment, set it up like so:

$ virtualenv --python=python3 --prompt "[fbo]" venv
$ . venv/bin/activate

Then install the python requirements:

$ pip install -r requirements.txt

If you want sqlite3, don't get it from pypi (you can't). Use your distro's packaged version:

$ apt-get install sqlite3

You'll want goose to migrate the db up and down, which requires golang. If you don't have golang, you'll need to install it and set your GOPATH. Here is how you might do that on a Debian box:

$ apt-get install golang
$ mkdir -p {path-to-fbo}/golang
$ export GOPATH={path-to-fbo}/golang

Then, install goose:

$ go get -u github.com/pressly/goose/cmd/goose

(Over a slow Net connection this might take a while, because it has to clone various git repositories locally under {path-to-cavetl}/leie/golang/src/github.com/.)

Now goose is installed in the bin subdir under your specified $GOPATH. Add that path to your $PATH:

$ export PATH=$PATH:$GOPATH/bin

Running

This program downloads the data files. Run it periodically to stay up to date. They update daily, so ideally you run this daily.

Run the etl process. See the DEPENDENCIES section for how to install goose, which is required to do this. Goose will keep your migrations synced up and well-formed (like migrating geese, I guess).

$ cd FBONightly
$ ./etl.py

(Over a slow, or even not fast, Net connection this might take a while the first time, because various large data files are being fetched from upstream sources. They will be cached locally, so future runs of etl.py will be speedier.)

Now data has been put into the development database specified in db/dbconf.yml. Feel free to edit dbconf.yml as you need.

Don't worry about running the ETL twice. The program is pretty good about not redoing work unnecessarily.

Serving requests is not yet supported. It will eventually look like this: To serve requests over the API, run:

$ ./serve.py

Data Retention

Once loaded into the database, old data files can be discarded. If you leave old data files in the data directory, they will be ignored unless you blow away the db. Keeping them will do no harm (they're not large) and they might prove useful if you need to rebuild. For example, if the data format changes and nobody notices that the ETL function is failing, you might need to go back and reconstruct.

Notes About The Data

The data is messy. The file uses angle brackets but isn't XML. Most entities don't have closing tags. Some entities are actually part of the data and should be treated as such. Some entities set up a nested structure, but don't display that structure. There's no documentation. It's what you might expect from a pile of data that predates easy, widespread tooling for XML and JSON.

There is a node-based parser for these files on GitHub, but it hides some structure, so we're doing our own parsing to avoid using it.

Data Sources

Data is taken from ftp://ftp.fbo.gov.

Database choice

This etl program takes the data from files and loads it in to a database. We are mainly doing writes here and no complex queries. For now, we'll aim at SQLite and might add Postgres later. We're not using an ORM for this, mainly because we don't need one yet.

For now, this code manages migrations with Goose. We might move to Liquibase in the future, but Goose is dead simple right now, so we're going with it. The goal is to keep the migrations somewhat language and ORM agnostic. Whatever generates SQL (your ORM, DOA, framework, etc.) for your schema can also generate that SQL for a goose migration target. If you are hand-writing your schema sql, put it in the goose migration files and those files become the canonical reference on correct table forms. If you are generating your SQL, then whatever input you are feeding to your SQL-generator (whether it be hibernate or sqlalchemy or whatever) should be the canonical form.

The advantage of using sqlite here is speed and simplicity. The thing is fast and easy to use. The ETL should result in a read-heavy, single-user database. It will be slow on the write side, fast on the read side, but that's ok. If anything says SQLite, that does.

The advantage of postgres is that you get an actual multi-client concurrent server. If you want to serve data directly to remote clients (as opposed to doing it over a web API), you'll want Postgresql.

Testing

There are no tests yet.

License, Copyright, and Contributing

Copyright 2018 James Vasile Copyright 2018 Open Tech Strategies, LLC

Released under the terms of the GNU Affero General Public License, Version 3 or later. Contributions welcome at https://github.com/OpenTechStrategies/fbo

About

Download, store, and serve data from the nightly FBO data dumps

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages