City Record Online Workgroup (CROW) - Parsing
This is the main repository for CROW's parsing work. For Notice Schema development, see https://github.com/CityOfNewYork/CROL-Schema.
Disclaimer: if document versions conflict, treat the documents linked from GitHub as the latest version.
- Gold standard - a human-parsed file that shows the "correct" extraction of the different objects.
- [The Main Schema - a reference file that lists every output field and the source each one can be derived from.](https://docs.google.com/spreadsheets/d/1str6vjjHS5EA_2ww9r4WjHA1t32Z00uLLbviegTc8WI/edit#gid=1430366155)
### Open Standard Links
- [Reference Standards](https://docs.google.com/document/d/1USFMTHfrmBzDvNW08b2f6osyl9I375d7h47uGcvxXjY/edit)
As the City embarks on implementing Intro 363-2014 and unlocking its daily actions, we are working together with the Department of Citywide Administrative Services (DCAS) to publish the City Record as open, clean and structured data. At the same time, we are unlocking decades of historical information and making it accessible to all, at no charge.
Our goal is to optimize the utility of City Record content by making the data accessible and structured: addresses, dates, persons, subjects, agencies, contract types and more are parsed and made available as individual objects. This way, residents, organizations and small and large businesses alike will be able to access, interact with and stay informed about City actions, whether through notifications, visualizations or other easy-to-use community tools.
- City of New York
- Citizens Union
- Dev Bootcamp
- Sunlight Foundation
- Came together to form a CROW parsing and scraping volunteer team
- Set up collaboration framework with DCAS
- Scraped PDFs from 2008 - 2014
- Proposed public notice schema
- Added “addresses” and “time & dates” fields to the City’s input workflow
For a list of current tasks, please see Issues.
Phase 1: Parsers and Schema
- Develop a collaboratively produced open source library of parsers to populate the Public Notice Data Standard schema using the DCAS pipeline
- Work with DCAS to implement the pipeline in the City’s workflow by August 1, and use it as the City’s way of publishing the City Record data
- Publish a Public Notice Data Standard and documentation on an interactive website
Phase 2: PDF Scraping
- Scrape the archival PDFs
- Apply and adapt the parsers so they can parse and structure the data in the PDFs
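
As a rough illustration of what "parsing and structuring" a notice could look like, here is a minimal Python sketch that pulls dates out of a notice's text and returns them as a structured record. Everything in it is an assumption made for illustration: the function names `extract_dates` and `parse_notice`, the record's field names, the date format it recognizes, and the sample sentence are hypothetical and are not taken from the CROW parsers or the Notice Schema.

```python
import re
from datetime import datetime

# Hypothetical pattern: matches dates written like "August 1, 2014".
# Real notices may use other formats that this sketch does not cover.
DATE_PATTERN = re.compile(
    r"\b(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+(\d{1,2}),\s+(\d{4})\b"
)


def extract_dates(text):
    """Return ISO 8601 strings for every 'Month D, YYYY' date found in text."""
    dates = []
    for match in DATE_PATTERN.finditer(text):
        try:
            parsed = datetime.strptime(" ".join(match.groups()), "%B %d %Y")
        except ValueError:
            # Skip impossible dates such as "February 30, 2014".
            continue
        dates.append(parsed.date().isoformat())
    return dates


def parse_notice(text):
    """Build a structured record; the field names are illustrative only."""
    return {"raw_text": text, "dates": extract_dates(text)}


if __name__ == "__main__":
    # Made-up sample sentence, not an actual City Record notice.
    sample = "A public hearing will be held on August 1, 2014 at 10:00 AM."
    print(parse_notice(sample)["dates"])
```

The same shape could be extended with one extractor per field (addresses, agencies, and so on), each tested against the gold standard files.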