# labs-pad-normalize

An R script to normalize PAD data into discrete address records. Part of the NYC Geosearch Geocoder Project.

## Introduction

The NYC Geosearch API is built on Pelias, the open source geocoding engine that powered Mapzen Search. To feed it, Labs uses the authoritative Property Address Directory (PAD) data from the NYC Department of City Planning's Geographic Systems Section. However, because PAD represents ranges of addresses, the data must be normalized into an "expanded" form that Pelias will understand. This expansion process involves many field-level nuances that translate the ranges into discrete address rows.
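
To make that concrete, here is a minimal sketch of the core expansion idea. The column names (`lhnd`, `hhnd`, `stname`) are illustrative rather than the actual PAD schema: a low/high house-number range becomes one row per discrete address, stepping by 2 so the generated numbers stay on one (odd or even) side of the street.

```r
library(tidyverse)

# Toy address ranges: low house number, high house number, street name
ranges <- tribble(
  ~lhnd, ~hhnd, ~stname,
  2L,    10L,   "BROADWAY",
  1L,     5L,   "BROADWAY"
)

# Expand each range into discrete house numbers, one row per address
expanded <- ranges %>%
  mutate(housenum = map2(lhnd, hhnd, seq, by = 2)) %>%
  unnest(housenum) %>%
  select(housenum, stname)

expanded
#> 2, 4, 6, 8, 10 BROADWAY and 1, 3, 5 BROADWAY, one row each
```

The actual pipeline handles far more edge cases than this; that is where the field-level nuances come in.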

We are treating the normalization of the PAD data as a separate data workflow from the PAD Pelias Importer. This script starts with the published PAD file, and outputs a normalized CSV of discrete addresses, ready to be picked up by the importer.

## Data

This script downloads a version of the PAD data from NYC's Bytes of the Big Apple. The Property Address Directory (PAD) contains geographic information about New York City’s approximately one million tax lots (parcels of real property) and the buildings on them. PAD was created and is maintained by the Department of City Planning’s (DCP’s) Geographic Systems Section (GSS). PAD is released under the BYTES of the BIG APPLE product line four times a year, reflecting tax geography changes, new buildings and other property-related changes.
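
The download itself is simple in principle. Here is a hedged sketch using the `downloader` package from the dependency list; the URL is a placeholder, not the real Bytes of the Big Apple link, which changes with each release.

```r
library(downloader)

# Placeholder URL: look up the current release on Bytes of the Big Apple
pad_url <- "https://example.com/bytes/pad18a.zip"

# Fetch the release archive and unpack it into the data directory
download(pad_url, destfile = "data/pad.zip", mode = "wb")
unzip("data/pad.zip", exdir = "data")
```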

## R Script

This script will output a file in the `/data` directory called `final.csv`; this is the expanded output. To make sure the script is getting the latest version of PAD, check that the source points to the most recent release (see Deploy below).
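
After a run, a quick sanity check on the output might look like the following sketch. It assumes only that `final.csv` is a well-formed CSV; the column names are whatever the script emits.

```r
library(tidyverse)

final <- read_csv("data/final.csv")
nrow(final)     # total number of discrete address records
glimpse(final)  # eyeball the expanded columns
```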

## Status

The script is incomplete! Find sample output here. Over the coming weeks, it should be finalized.

## Deploy

To "deploy" data as the source for the geosearch importer, run `npm run deploy`. You must have `s3cmd` configured, as the deploy step runs it to upload the output files. To set up `s3cmd` for DigitalOcean Spaces, see https://www.digitalocean.com/community/tutorials/how-to-configure-s3cmd-2-x-to-manage-digitalocean-spaces.

For a new version of PAD, two file references need to be updated: in `_download_data.R`, ensure that the download link points to the latest PAD version (17D, 18A, etc.), and in `_load_data.R`, make sure the path to the street name dictionary (snd17Dcow.txt, snd18Acow.txt, etc.) reflects the current release.
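
As a hypothetical sketch of what those two references look like (the actual variable names in `_download_data.R` and `_load_data.R` may differ), the point is that both must move in lockstep with each release:

```r
pad_version <- "18a"

# In _download_data.R: the download link (placeholder URL)
pad_url <- paste0("https://example.com/bytes/pad", pad_version, ".zip")

# In _load_data.R: the street name dictionary for the same release,
# e.g. data/snd18Acow.txt
snd_path <- file.path("data", paste0("snd", toupper(pad_version), "cow.txt"))
```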

## How to run

Make sure R is installed on your machine. If you just want CLI stuff:

```
$ brew install R
```

Install the necessary packages:

```
$ R
> install.packages(c("tidyverse", "jsonlite", "downloader"))
```

(Note: this may take a long time. Go get a coffee or something.)
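
If you rerun setup often, a small optional convenience is to install only the packages that are missing (same list as above):

```r
pkgs <- c("tidyverse", "jsonlite", "downloader")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```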

Run the R script to normalize the new PAD data:

```
$ Rscript ./munge.R
```

Due to the nature of the PAD dataset, it is very likely that some of the data processing will be incompatible with new versions. At the very least, it is likely that new entries will need to be added to the suffix lookup table (`suffix_lookup.csv`). Do not despair. Use RStudio to step through the munging process one step at a time, as sketched below. You'll get there. You got this!
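
Since the pipeline is split across the underscore-prefixed scripts in this repo, one way to debug is to source them individually and inspect the intermediate objects as you go. The order below is a guess; treat `munge.R` as the authoritative sequence.

```r
# Run the pipeline one stage at a time in an interactive session
source("_dependencies.R")
source("_globals.R")
source("_functions.R")
source("_download_data.R")
source("_load_data.R")
source("_clean.R")
source("_filter.R")
source("_classify.R")
source("_sequence.R")
```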

If you're happy with your data, push it to DigitalOcean using the included shell script:

```
$ ./push-to-bucket.sh
```