Chicago Taxi and Transportation Network Provider Data

Code to download, process, and analyze Chicago's publicly available taxi and Transportation Network Provider (TNP, e.g. Uber/Lyft) data. Raw data comes from the City of Chicago.

Originally written in support of this post: https://toddwschneider.com/posts/chicago-taxi-data/. TNP data was not yet available when that post was written.

This repo is something of a companion to the nyc-taxi-data repo. The repos share some similar code and structure, but do not explicitly depend on each other.

As of Q1 2020, the Chicago taxi dataset contains nearly 200 million rows, while the TNP dataset is around 130 million rows.

Instructions

1. Install PostgreSQL and PostGIS

Both are available via Homebrew on Mac OS X.
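
As a rough sketch of this step, the snippet below shows a Homebrew install command (as a comment) and a small helper for confirming the tools are on the PATH afterwards. The helper name `missing_tools` is made up for illustration and is not part of this repo. It assumes `psql` as the PostgreSQL client and `shp2pgsql` as a command that ships with PostGIS.

```shell
# Install both via Homebrew (run manually; formula names may change over time):
#   brew install postgresql postgis

# missing_tools (hypothetical helper): print each named command that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# psql ships with PostgreSQL; shp2pgsql ships with PostGIS.
# No output means both were found.
missing_tools psql shp2pgsql
```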

2. Download and import Chicago taxi/TNP data

Note: the raw taxi data is a single uncompressed 70GB+ .csv file, so it will take a little while to download!

If you prefer, you can download and process either the taxi or TNP dataset without the other.

./initialize_database.sh
./download_raw_taxi_data.sh && ./download_raw_tnp_data.sh
./import_taxi_trip_data.sh && ./import_raw_tnp_data.sh
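
Because the raw taxi file alone is 70GB+, it may be worth checking free disk space before starting the download. The following is a minimal sketch, not part of this repo: `has_free_gb` is a hypothetical helper, and the 70 GB threshold comes from the note above. The imported Postgres tables will need additional space on top of that.

```shell
# has_free_gb DIR GB (hypothetical helper): succeed if the filesystem
# holding DIR has at least GB gigabytes available.
has_free_gb() {
  # df -P prints one line per filesystem; column 4 is available space in KB.
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# 70 is taken from the size note above; adjust it for your setup.
if has_free_gb . 70; then
  echo "enough free space for the raw taxi CSV"
else
  echo "less than 70 GB free in $(pwd)"
fi
```
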
3. Incremental updates

New taxi data is available monthly; new TNP data is available quarterly. Once you've run the full setup, you can download and process only the latest data by running:

./update_taxi_trips_data.sh
./update_tnp_trips_data.sh

This avoids re-downloading the entire datasets every time you want the latest data.

4. Analysis

The analysis/ subfolder contains prepare_analysis.sql and analysis.R, scripts for doing analysis in Postgres and R.

Some differences between Chicago and NYC taxi data

  • Chicago includes anonymous taxi medallion IDs, NYC does not
  • Chicago includes fare info for TNP trips, NYC's comparable FHV dataset does not
  • Chicago does not include information about which TNP provided which trip, NYC does
  • Chicago does not include precise location coordinates, only census tracts and community areas (and even then, only sometimes)
    • Since July 2016, NYC also does not provide precise coordinates
  • Chicago does not include precise timestamps; instead, it rounds pickups and drop-offs to 15-minute intervals
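
To illustrate what 15-minute rounding does to a timestamp, here is a small sketch that works on Unix epoch seconds. It assumes round-to-nearest, with exact midpoints rounding up. The exact rule used in the published data is not stated here, and `round_to_15min` is a name invented for this example.

```shell
# round_to_15min EPOCH_SECONDS: print the nearest multiple of 900 seconds (15 minutes).
# Assumption: round-to-nearest, with exact midpoints rounding up.
round_to_15min() {
  echo $(( ($1 + 450) / 900 * 900 ))
}

round_to_15min 1577837222   # 2020-01-01 00:07:02 UTC -> 1577836800 (00:00:00)
round_to_15min 1577837400   # 2020-01-01 00:10:00 UTC -> 1577837700 (00:15:00)
```

With this rounding, two trips that started up to 15 minutes apart can carry the same pickup timestamp, so trip durations computed from the rounded values are only approximate.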

Additional data sources included

Questions/issues/contact

todd@toddwschneider.com, or open a GitHub issue
