labs-pad-normalize

R script to normalize PAD data into discrete address records. Part of the NYC Geosearch Geocoder Project.

Introduction

The NYC Geosearch API is built on Pelias, the open source geocoding engine that powered Mapzen Search. For its address data, Labs uses the authoritative Property Address Directory (PAD) from the NYC Department of City Planning's Geographic Systems Section. Because PAD records represent ranges of addresses, they must be normalized into an "expanded" form that Pelias can understand. This expansion translates each range into discrete address rows and involves many case-specific nuances.
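
To make the idea of "expanding" a range concrete, here is a minimal sketch in Python. It is not the repository's R code, and it ignores the nuances mentioned above. The field names (low, high, street) are hypothetical, and stepping by two to stay on one side of the street is an assumption about how a simple range might be handled.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class AddressRange:
    """One range record. Field names are hypothetical, not PAD column names."""
    low: int      # lowest house number in the range
    high: int     # highest house number in the range
    street: str


def expand_range(r: AddressRange) -> Iterator[Tuple[int, str]]:
    """Yield one (house_number, street) row per address in the range.

    Assumption: numbers advance by 2 from `low`, so every row keeps the
    parity (odd or even) of the low house number.
    """
    if r.high < r.low:
        raise ValueError("high must not be less than low")
    for number in range(r.low, r.high + 1, 2):
        yield (number, r.street)


rows = list(expand_range(AddressRange(low=120, high=126, street="BROADWAY")))
for house_number, street in rows:
    print(house_number, street)
```

A single range record for 120 to 126 BROADWAY becomes four discrete rows here. Real PAD records need extra handling (suffixes, for example) that this sketch leaves out.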

We are treating the normalization of the PAD data as a separate data workflow from the PAD Pelias Importer. This script starts with the published PAD file, and outputs a normalized CSV of discrete addresses, ready to be picked up by the importer.

Data

This script downloads a version of the PAD data from NYC's Bytes of the Big Apple. The Property Address Directory (PAD) contains geographic information about New York City’s approximately one million tax lots (parcels of real property) and the buildings on them. PAD was created and is maintained by the Department of City Planning’s (DCP’s) Geographic Systems Section (GSS). PAD is released under the BYTES of the BIG APPLE product line four times a year, reflecting tax geography changes, new buildings and other property-related changes.

R Script

This script outputs the expanded data to a file called final.csv in the /data directory. To make sure the script uses the latest version of PAD, check that the download source points to the most recent release.

Status

The script is incomplete! Find sample output here. Over the coming weeks, it should be finalized.

Deploy

To "deploy" data as the source for the geosearch importer, run npm run deploy. You must have s3cmd configured, because the deploy step uses it to upload the output files. To set up s3cmd for DigitalOcean Spaces, see: https://www.digitalocean.com/community/tutorials/how-to-configure-s3cmd-2-x-to-manage-digitalocean-spaces

For a new version of PAD, two file references need to be updated. In download_data, ensure that the download link points to the latest PAD version (17D, 18A, etc.). In load_data, make sure the path to the street name dictionary (snd17Dcow.txt, snd18Acow.txt, etc.) reflects the current release.
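
The street name dictionary file names above follow a visible pattern, so one way to avoid typos when bumping the release is to derive the name from the version string. The following Python sketch is illustrative only. The helper snd_filename is hypothetical, and it assumes a release is two digits plus a quarter letter A to D, with the letter upper-cased as in the examples above.

```python
import re


def snd_filename(version: str) -> str:
    """Build the street name dictionary file name for a PAD release.

    Assumption: a release looks like '17D' or '18A' (two-digit year plus a
    quarter letter A-D), and the file name uses the upper-case letter.
    """
    if not re.fullmatch(r"\d{2}[A-Da-d]", version):
        raise ValueError(f"unexpected PAD version: {version!r}")
    return f"snd{version.upper()}cow.txt"


print(snd_filename("17D"))
print(snd_filename("18A"))
```

The download link itself is not derived here, because its format is not stated in this README.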

How to run locally

Make sure R is installed on your machine. If you only need the command-line tools, you can install it with Homebrew:

$ brew install R

Install the necessary packages:

$ R
> install.packages(c("tidyverse", "jsonlite", "downloader"))

(Note: this may take a long time. Go get a coffee or something.)

Run the R script to normalize the new PAD data:

$ Rscript ./munge.R

Due to the nature of the PAD dataset, it is very likely that some data processing steps will be incompatible with new versions. At the very least, it is likely that new entries will need to be added to the suffix lookup table data. Do not despair. Use RStudio to step through the munging process one step at a time. You'll get there. You got this!

If you're happy with your data, push it to DigitalOcean using the included shell script:

$ ./push-to-bucket.sh

How to run if you have Docker installed

  1. Check Bytes of the Big Apple for the latest version of PAD, then build the image:
docker build --tag pad-normalize .
  2. Once the build is complete, run the container, replacing 20d with the latest version:
docker run -v $(pwd)/data:/usr/local/src/scripts/data pad-normalize 20d

or in detached mode:

docker run -v $(pwd)/data:/usr/local/src/scripts/data -d pad-normalize 20d

How to run in GitHub Actions

GitHub Actions picks up the PAD version from version.env, so remember to update the version in this file before you commit.

GitHub Actions runs on all branches, but only deploys the computed files to DigitalOcean when you push to the main branch.
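
The layout of version.env is not shown in this README. As a rough illustration, the Python sketch below reads a simple KEY=VALUE file. The key name VERSION and the sample contents are assumptions, not the repository's actual file.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical file contents; the real key name may differ.
sample = "# PAD release to process\nVERSION=20d\n"
settings = parse_env(sample)
print(settings["VERSION"])
```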
