R script to normalize PAD data into discrete address records. Part of the NYC Geosearch Geocoder Project
The NYC Geosearch API is built on Pelias, the open source geocoding engine that powered Mapzen Search. As its address source, Labs uses the authoritative Property Address Directory (PAD) data from the NYC Department of City Planning's Geographic Systems Section. However, because PAD records represent ranges of addresses, the data must be normalized into an "expanded" form that Pelias will understand. This expansion translates each range into discrete address rows, and it involves many factor-specific nuances.
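To make the idea of "expanding" a range concrete, here is a minimal sketch in base R. It is not the logic in munge.R: the column names (street, low, high), the sample rows, and the helper expand_range are all hypothetical, and the parity rule is a simplifying assumption that ignores the nuances mentioned above.

```r
# Hypothetical input: one row per address range. These column names and
# values are illustrative only and are not taken from the PAD file layout.
ranges <- data.frame(
  street = c("EXAMPLE AVENUE", "SAMPLE STREET"),
  low    = c(10L, 5L),
  high   = c(16L, 5L),
  stringsAsFactors = FALSE
)

# Expand one range into discrete house numbers.
# Assumption: when both ends share the same parity, the range covers one
# side of the street, so we step by 2. Otherwise we step by 1.
expand_range <- function(street, low, high) {
  step <- if ((high - low) %% 2 == 0) 2L else 1L
  data.frame(
    housenumber = seq(low, high, by = step),
    street = street,
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every range and stack the results into one table.
expanded <- do.call(
  rbind,
  Map(expand_range, ranges$street, ranges$low, ranges$high)
)
rownames(expanded) <- NULL

print(expanded)
```

Each input range becomes one row per house number, which is the shape a geocoder importer can index directly. Real PAD ranges include cases this sketch does not handle, such as house numbers with suffixes.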
We are treating the normalization of the PAD data as a separate data workflow from the PAD Pelias Importer. This script starts with the published PAD file, and outputs a normalized CSV of discrete addresses, ready to be picked up by the importer.
This script downloads a version of the PAD data from NYC's Bytes of the Big Apple. The Property Address Directory (PAD) contains geographic information about New York City’s approximately one million tax lots (parcels of real property) and the buildings on them. PAD was created and is maintained by the Department of City Planning’s (DCP’s) Geographic Systems Section (GSS). PAD is released under the BYTES of the BIG APPLE product line four times a year, reflecting tax geography changes, new buildings and other property-related changes.
This script will output a file in the /data directory called final.csv. This is the expanded output. To make sure the script is getting the latest version of PAD, check that the source is pointing to the most recent PAD release.
The script is incomplete! Find sample output here. Over the coming weeks, it should be finalized.
To "deploy" data as the source for the geosearch importer, run npm run deploy. You must have s3cmd configured, because the deploy step uses it to upload the output files. To set up s3cmd for DigitalOcean Spaces, see: https://www.digitalocean.com/community/tutorials/how-to-configure-s3cmd-2-x-to-manage-digitalocean-spaces.
For a new version of PAD, two file references need to be updated. In download_data, ensure that the download link points to the latest PAD version (17D, 18A, etc.), and in load_data, make sure the path to the street name dictionary (snd17Dcow.txt, snd18Acow.txt, etc.) reflects the current release.
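One way to keep the two references in sync is to derive them from a single version string. The sketch below only builds the street name dictionary file name, following the snd17Dcow.txt pattern shown above. The names pad_version and snd_filename are hypothetical and do not come from the script, and the version pattern (two digits plus a letter A through D) is an assumption based on the quarterly release cycle.

```r
# Hypothetical helper: build the street name dictionary file name for a
# given PAD release, following the snd<version>cow.txt pattern.
snd_filename <- function(pad_version) {
  # Assumption: versions look like "17D" or "18A" (two-digit year plus a
  # quarterly letter). Lowercase letters are accepted and left unchanged.
  if (!grepl("^[0-9]{2}[A-Da-d]$", pad_version)) {
    stop("Unexpected PAD version: ", pad_version)
  }
  sprintf("snd%scow.txt", pad_version)
}

pad_version <- "18A"
print(snd_filename(pad_version))
```

The download link would need the same treatment, but its format is not shown here, so it is left out of the sketch.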
How to run locally
Make sure R is installed on your machine. If you just want CLI stuff:
$ brew install R
Install necessary packages
$ R
> install.packages(c("tidyverse", "jsonlite", "downloader"))
(Note: this may take a long time. Go get a coffee or something)
Run the R script to normalize the new PAD data:
$ Rscript ./munge.R
Due to the nature of the PAD dataset, it is very likely that some data processing will be incompatible with new versions. At the very least, it is likely that new entries will need to be added to the suffix lookup table data. Do not despair. Use RStudio to step through the munging process one step at a time. You'll get there. You got this!
If you're happy with your data, push it to DigitalOcean using the included shell script:
How to run if you have Docker installed
- Check Bytes of the Big Apple for the latest version of PAD (replace 20d in the run commands below with the latest version)
docker build --tag pad-normalize .
- Once the build is complete, run the container:
docker run -v $(pwd)/data:/usr/local/src/scripts/data pad-normalize 20d
or in detached mode:
docker run -v $(pwd)/data:/usr/local/src/scripts/data -d pad-normalize 20d
How to run in GitHub Actions
GitHub Actions picks up the PAD version from version.env, so please remember to update the PAD version in this file before committing.
GitHub Actions will run on all branches, and only deploy computed files to DigitalOcean when pushing to the