Skip to content

Chicago/lead-etl

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

lead-etl

Introduction

This repository loads a variety of datasets for childhood lead poisoning modeling.

Implementation

The code for each phase of etl is located in the corresponding subdirectory and is executed using a drakefile. The output of each phase is contained in a database schema of the same name.

input: Load raw data, see input folder for more details.

buildings: Analyze the Chicago buildings shapefile to extract all addresses and group them into buildings and complexes.

aux: Process the data to prepare for model building. This includes summarizing and spatially joining datasets.

dedupe:Deduplicate the names of children from the blood tests and the WIC Cornerstone database.

output: Use the above to create final tables used for exploration, analysis and model feature generation.

Deployment

1. Dependencies

A) Program dependencies

Install these programs:

  • drake (tested with version 1.0.3)
  • mdbtools (0.7.1)
  • ogr2ogr (2.1.0) with PostgreSQL driver (requires libmq)
  • shp2pgsql (2.2.2)
  • postgresql-client (9.6.0)

B) Library dependencies

HDF5 can be installed on Debian-based systems using: sudo apt-get install libhdf5-serial-dev, or using conda: conda install hdf5.

Python modules can be installed using:

pip install -r requirements.txt

2. Database Setup:

A) Create and configure a PostgreSQL database:

Create a database on a PostgreSQL server (tested with version 9.5.4). Install the PostGIS (2.2.2) and unaccent extensions (requires admin privileges):

CREATE EXTENSION postgis;
CREATE EXTENSION unaccent;

B) Configure a profile file:

Copy ./lead/example_profile to ./lead/default_profile and set the indicated variables.

3. Test dependencies

Test that the above dependencies were successfull installed by changing by running lead/check_dependencies.sh:

$ lead/check_dependencies.sh
FOUND: drake, version: Drake now is on version: 1.0.3
FOUND: mdb-tools, version: mdbtools v0.7.1
FOUND: ogr2ogr, version: GDAL 2.1.0, released 2016/04/25
FOUND: shp2pgsql, version: RELEASE: 2.3.3 (r15473)
FOUND: PostgreSQL client, version: psql (PostgreSQL) 9.6.0
FOUND: PostgreSQL server, version: 9.6.0
FOUND: PostgreSQL extension postgis, version: 2.3.3
FOUND: PostgreSQL extension unaccent, version: 1.1
FOUND: Python pandas HDF support

4. Load American Community Survey data:

Use the acs2pgsql tool to load ACS 5-year data for Illinois into the database. Note that a subset of this data will be imported into the lead pipeline below, so the ACS data may be stored in a separate database from the lead data.

5. Run the ETL

Simply change to lead/ and run drake. To run steps in parallel add the argument --jobs=N where N is the number of cores to use.

License

MIT LICENSE

Contributors

- Eric Potash (epotash@uchicago.edu)

Releases

No releases published

Packages

No packages published

Languages

  • Python 79.5%
  • Shell 14.0%
  • Batchfile 6.5%