## Scenario/Stakeholder Based Analysis of NYC taxi rides data
##### Authors: Panini Mokrala, Dmitrii Danilov

# Introduction

Through this project, we investigate the decisions that stakeholders in the taxi ecosystem make and overlay them with the varying weather conditions in New York City to check for correlation. Before getting to the questions, we lay out the following definitions and assumptions:

1.   Stakeholders - We consider the commuters travelling in taxis and the taxi owners/drivers as stakeholders.
2.   Decisions - The decisions differ by stakeholder. For example, taxi drivers may prefer to make fewer trips in winter than in summer, while commuters may prefer to take a taxi more often in winter than in summer.
3.   Time period / other assumptions - We use taxi ride data from 2017 to 2019 to highlight the trends. The entry of Uber/Lyft may introduce interaction effects; due to data reliability issues, we are not able to attribute these effects.

Now that we have a clearer understanding of the problem space, there are two main areas that we would like to address through this project, depending on the stakeholder:

1. If you are a traveler or commuter - You can plan your trip by answering:
    a. How does the average fare per trip vary across the taxi zones of New York over time and with weather conditions?
    b. What is the average trip time between two points in New York? How does it change with the weather over time?
    c. At a given point in time and within a given temperature range, how does taxi availability vary? (Please note: availability is defined as the number of active rides at a given point in time.)
2. If you want to help a taxi driver/owner - You can help them plan their next season by answering:
    a. How does taxi availability vary across the taxi zones? How do weather conditions relate to taxi availability over time?
    b. What is a reasonable fare estimate per trip that the owner can quote between an origin and a destination?

While we have listed the questions on which we would like to provide insights to the stakeholders, we may broaden or alter the scope of the analysis as we go through the data mining exercise. As an outcome of this exercise, we would like to create a framework that helps commuters and taxi drivers/owners plan their trips across time points, locations and weather conditions.
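The availability definition used in the questions above (the number of active rides at a given point in time) can be sketched as follows. This is a minimal illustration, not the method used later in the notebook: the function name `active_rides` and the sample trips are hypothetical, and a ride is assumed to be active from its pickup time up to, but not including, its dropoff time.

```python
from datetime import datetime


def active_rides(trips, at):
    """Count trips in progress at time `at` (pickup <= at < dropoff)."""
    return sum(1 for pickup, dropoff in trips if pickup <= at < dropoff)


# Hypothetical (pickup, dropoff) pairs, for illustration only
trips = [
    (datetime(2018, 3, 1, 8, 0), datetime(2018, 3, 1, 8, 25)),
    (datetime(2018, 3, 1, 8, 10), datetime(2018, 3, 1, 8, 40)),
    (datetime(2018, 3, 1, 9, 0), datetime(2018, 3, 1, 9, 15)),
]

# Two of the three rides are in progress at 08:15
print(active_rides(trips, datetime(2018, 3, 1, 8, 15)))
```

On the full dataset the same condition would more likely be expressed as a filter on the pickup and dropoff timestamp columns rather than as a Python loop.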

# Datasets and Data Sources

## <b>Data Source 1</b> : ##

> NYC Open Data -  New York Yellow & Green Taxi Trip data (Timestamp level)
  1. Size: N/A (As this is API based, we will only get details once the entire data is pulled)
  2. Format: API (JSON)
  3. Access method: Python API calls & Google BigQuery framework


Dataset Name | Link
--- | ---
Yellow Taxi (2019) | https://data.cityofnewyork.us/Transportation/2019-Yellow-Taxi-Trip-Data/2upf-qytp 
Yellow Taxi (2018) | https://data.cityofnewyork.us/Transportation/2018-Yellow-Taxi-Trip-Data/t29m-gskq
Yellow Taxi (2017) | https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs 
Green Taxi (2019) | https://data.cityofnewyork.us/Transportation/2019-Green-Taxi-Trip-Data/q5mz-t52e 
Green Taxi (2018) | https://data.cityofnewyork.us/Transportation/2018-Green-Taxi-Trip-Data/w7fs-fd9i 
Green Taxi (2017) | https://data.cityofnewyork.us/Transportation/2017-Green-Taxi-Trip-Data/5gj9-2kzx 
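Data Source 1 lists API (JSON) access. NYC Open Data is served through the Socrata Open Data API (SODA), so one page of a dataset could be requested roughly as sketched here. The helper names `soda_page_url` and `fetch_page` are hypothetical, the dataset id `2upf-qytp` is taken from the 2019 Yellow Taxi link in the table above, and the `/resource/<id>.json` endpoint with `$limit`/`$offset` paging is assumed to be available for these datasets.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SODA_BASE = "https://data.cityofnewyork.us/resource"


def soda_page_url(dataset_id, limit=1000, offset=0):
    """Build the URL for one page of rows from a SODA JSON endpoint."""
    # safe="$" keeps the "$" of the SoQL parameters unescaped
    params = urlencode({"$limit": limit, "$offset": offset}, safe="$")
    return f"{SODA_BASE}/{dataset_id}.json?{params}"


def fetch_page(dataset_id, limit=1000, offset=0):
    """Download one page of rows as a list of dicts (requires network access)."""
    with urlopen(soda_page_url(dataset_id, limit, offset)) as response:
        return json.load(response)


# Only the URL is built here; call fetch_page("2upf-qytp", limit=5) to download rows
print(soda_page_url("2upf-qytp", limit=5))
```

Paging through a full year this way is slow, which is one reason to prefer aggregated queries over the raw rows.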

## <b>Data Source 2</b> : ##

> NOAA - National Centers for Environmental Information (weather data)
  1. Size: 17MB
  2. Format: CSV
  3. Access method: HTTP
  4. Location : https://www.ncdc.noaa.gov
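To relate trips to a temperature range, as the availability question asks, the weather CSV has to be joined to the trip data by date. The sketch below shows one way to do that with pandas. All values are made up for illustration, and the column names `DATE` and `TMAX`, the Fahrenheit unit and the bin edges are assumptions rather than properties confirmed for the downloaded file.

```python
import pandas as pd

# Hypothetical daily weather rows; DATE and TMAX are assumed column names
weather = pd.DataFrame({
    "DATE": ["2018-01-05", "2018-03-10", "2018-07-04"],
    "TMAX": [19, 48, 88],  # assumed to be degrees Fahrenheit
})

# Hypothetical daily trip counts
trips = pd.DataFrame({
    "date": ["2018-01-05", "2018-03-10", "2018-07-04"],
    "trips": [250000, 300000, 220000],
})

weather["date"] = pd.to_datetime(weather["DATE"])
trips["date"] = pd.to_datetime(trips["date"])

# Bucket the daily maximum temperature into labelled ranges (assumed bin edges)
weather["temp_range"] = pd.cut(
    weather["TMAX"],
    bins=[-float("inf"), 32, 60, 80, float("inf")],
    labels=["freezing", "cold", "mild", "hot"],
)

# Attach the temperature range to each day of trip counts
daily = trips.merge(weather[["date", "temp_range"]], on="date", how="left")
by_range = daily.groupby("temp_range", observed=True)["trips"].mean()
print(by_range)
```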




## <b>Data Source 3</b> : ##

> NYC Open Data -  Taxi Zone information
  1. Size: 1MB
  2. Format: Shapefile
  3. Access method: HTTP
  4. Location : https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  

# Data Exploration

We start the data exploration with the taxi zone information.

Let's first mount the working location and install the required dependencies. In our case, we used Google Drive (through Google Colab) to host this project.

In [None]:
from google.colab import drive
from os.path import join

ROOT = '/content/drive'
PROJ = 'MyDrive/Milestones/milestone-1'

drive.mount(ROOT)
PROJECT_PATH = join(ROOT, PROJ)
%cd "$PROJECT_PATH"
%pwd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Milestones/milestone-1


'/content/drive/MyDrive/Milestones/milestone-1'

In [None]:
!pip install geopandas
!pip install altair_data_server

import io
import json
import pandas as pd
import numpy as np
import geopandas as gpd
# Note: cascaded_union is deprecated in newer Shapely releases in favour of unary_union
from shapely.ops import cascaded_union
from google.cloud import bigquery
from google.oauth2 import service_account
import datetime as dt
import altair as alt

alt.data_transformers.enable('data_server')


DataTransformerRegistry.enable('data_server')

## Understanding the <b> Taxi Zones </b> dataset

In [None]:
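# Load the TLC taxi zone shapefile and reproject it to WGS84 (longitude/latitude)
taxi_zones = gpd.read_file('https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip')
taxi_zones.to_crs(epsg=4326, inplace=True)
# Zone centroids, used later to place the LocationID labels on the map
# (centroids computed in a geographic CRS are approximate)
taxi_zones['centroid_lon'] = taxi_zones['geometry'].centroid.x
taxi_zones['centroid_lat'] = taxi_zones['geometry'].centroid.y

# Merge the zone polygons of each borough and keep the centroid of the merged shape
taxi_zones_b = taxi_zones.groupby('borough')['geometry'].agg(lambda x: cascaded_union(x).centroid).to_frame()
taxi_zones_b.columns = ['geometry']
taxi_zones_b.reset_index(inplace=True)
taxi_zones_b['centroid_lon'] = taxi_zones_b['geometry'].centroid.x
taxi_zones_b['centroid_lat'] = taxi_zones_b['geometry'].centroid.y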





<b> Shape File Description </b>

Column Name | Definition
--- | ---
OBJECTID | Unique ID given to the location
Shape_Leng | Length of the shape boundary (used to draw the locations)
Shape_Area | Area of the shape (used to draw the locations)
zone | Name of the zone within the borough
LocationID | Taxi zone ID, similar to OBJECTID; used to join with the trip data
borough | Name of the borough that contains the zone (there are 6 distinct values in total)
geometry | Coordinates needed to draw the zone on a map
centroid_lon | Longitude of the centroid of the shape
centroid_lat | Latitude of the centroid of the shape
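To make the `centroid_lon` and `centroid_lat` columns concrete, here is a small pure-Python sketch of an area-weighted polygon centroid computed with the shoelace formula. It only illustrates the idea and is not how geopandas computes `.centroid` internally; the function name `polygon_centroid` and the rectangle are hypothetical, and the polygon is assumed to be simple (no holes, no self-intersections).

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    # Pair every vertex with the next one, wrapping around to the first
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Hypothetical 4 x 2 rectangle; its centroid is the midpoint (2.0, 1.0)
rectangle = [(0, 0), (4, 0), (4, 2), (0, 2)]
print(polygon_centroid(rectangle))
```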

Let's first look at how the zones (locations) are distributed across the boroughs.

In [None]:
#Visualize the # of Zones by Borough
val_list = list((taxi_zones.groupby('borough', as_index=False).count()).sort_values('OBJECTID',axis = 0,ascending = False).borough.values)

In [92]:
#Distribution of Zones by area
dist_zones = alt.Chart(taxi_zones).mark_bar().encode(
    y=alt.Y('borough', sort=val_list),  # boroughs ordered by number of zones
    x=alt.X('count(zone):Q', title='Number of Zones'),
)

# Text layer that prints the zone count at the end of each bar
text_zones = alt.Chart(taxi_zones).mark_text(baseline='middle', align='left').encode(
    y=alt.Y('borough', sort=val_list),
    x=alt.X('count(zone):Q', title='Number of Zones'),
    text='count(zone):Q',
)

(dist_zones + text_zones).configure_mark(
    # use a single blue for all marks
    color='#008fd5'
).configure_axis(
    labelColor='grey',
    tickColor='grey'
).configure_view(
    # we don't want a stroke around the chart view
    strokeWidth=0
).properties(
    # set the dimensions of the visualization
    width=500,
    height=180
).properties(
    # add a title
    title={
        "text": ["Distribution of Zones by Borough"],
        "subtitle": ["Manhattan and Queens have the most zones: 69 each"],
        "color": "Black",
        "subtitleColor": "grey",
        "fontSize": 25
    }
).configure_title(
    anchor='start'
)



In [None]:
#Geographical Representation

tz_geo = json.loads(taxi_zones.to_json())['features']
tz_geo_b = json.loads(taxi_zones_b.to_json())['features']

alt.themes.enable('opaque')

base = alt.Chart(alt.Data(values=tz_geo)).mark_geoshape(
        stroke='black',
        strokeWidth=1
    ).encode(
        color=alt.Color('properties.borough:N', legend=None)
    ).properties(
        width=800,
        height=800
    )

labels = alt.Chart(alt.Data(values=tz_geo)).mark_text(
    baseline='top',
     ).properties(
        width=800,
        height=800
     ).encode(
         longitude='properties.centroid_lon:Q',
         latitude='properties.centroid_lat:Q',
         text='properties.LocationID:O',
         size=alt.value(8),
         opacity=alt.value(1)
     )

boroughs = alt.Chart(alt.Data(values=tz_geo_b)).mark_text(
    color='white',
    stroke='black',
    fontWeight='bold',
    strokeWidth=0.7,
    baseline='top'
     ).properties(
        width=800,
        height=800,
        title=alt.TitleParams(text="NYC boroughs and taxi zones", fontSize=22)
     ).encode(
         longitude='properties.centroid_lon:Q',
         latitude='properties.centroid_lat:Q',
         text='properties.borough:N',
         size=alt.value(26),
         opacity=alt.value(1)
     )


base + labels + boroughs


## Understanding the Taxi Trip data

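We use the shapefile together with the taxi trip data to build a few visualizations. The flow from here is as follows: query aggregated trip statistics per taxi zone from BigQuery, merge them with the taxi zone geometries on `LocationID`, and draw the result as a map colored by the aggregated value.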

In [None]:
key_path = 'auth.json'
credentials = service_account.Credentials.from_service_account_file(key_path)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

sql = '''SELECT dropoff_location_id, count(*) as count
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017 
where dropoff_datetime >= '2017-01-01' and dropoff_datetime < '2018-01-01' 
group by dropoff_location_id;'''
dropoff_2017_df = client.query(sql).to_dataframe()
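The pickup and drop-off count queries in this section differ only in the year and the column prefix, so the SQL text could be generated by a small helper. This is a sketch under assumptions: the function name `zone_count_sql` is hypothetical, and it assumes that every yearly table follows the `tlc_yellow_trips_<year>` naming and exposes the `<side>_location_id` and `<side>_datetime` columns used above.

```python
def zone_count_sql(year, side="dropoff"):
    """Build the per-zone trip count query for one year of yellow taxi data."""
    if side not in ("pickup", "dropoff"):
        raise ValueError("side must be 'pickup' or 'dropoff'")
    year = int(year)
    # Half-open date range: from January 1 of `year` up to, not including, the next January 1
    return f"""
SELECT {side}_location_id, COUNT(*) AS count
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_{year}`
WHERE {side}_datetime >= '{year}-01-01' AND {side}_datetime < '{year + 1}-01-01'
GROUP BY {side}_location_id
""".strip()


print(zone_count_sql(2017, "dropoff"))
```

With the `client` created in the notebook, the result would be loaded with `client.query(zone_count_sql(2017)).to_dataframe()`.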

In [91]:
dropoff_2017_df.rename(columns={'dropoff_location_id': 'LocationID'}, inplace=True)
dropoff_2017_df['LocationID'] = dropoff_2017_df['LocationID'].astype('int64')

dropoff_2017 = taxi_zones.merge(dropoff_2017_df, on='LocationID')
dropoff_2017 = json.loads(dropoff_2017.to_json())['features']

base = alt.Chart(alt.Data(values=dropoff_2017)).mark_geoshape(
        stroke='black',
        strokeWidth=1
    ).encode(
        color=alt.Color('properties.count:Q', scale=alt.Scale(type='log'), legend=alt.Legend(title="Drop-off count"))
    ).properties(
        title=alt.TitleParams(text="NYC taxi drop-off zones popularity", fontSize=22),
        width=800,
        height=800
    )

labels = alt.Chart(alt.Data(values=tz_geo)).mark_text(
    baseline='top',
     ).properties(
        width=800,
        height=800
     ).encode(
         longitude='properties.centroid_lon:Q',
         latitude='properties.centroid_lat:Q',
         text='properties.LocationID:O',
         size=alt.value(8),
         opacity=alt.value(1)
     )

base + labels


In [None]:
key_path = 'auth.json'
credentials = service_account.Credentials.from_service_account_file(key_path)

sql = '''SELECT pickup_location_id, count(*) as count
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017 
where pickup_datetime >= '2017-01-01' and pickup_datetime < '2018-01-01' 
group by pickup_location_id;'''
pickup_2017_df = client.query(sql).to_dataframe()

In [None]:
pickup_2017_df.rename(columns={'pickup_location_id': 'LocationID'}, inplace=True)
pickup_2017_df['LocationID'] = pickup_2017_df['LocationID'].astype('int64')

pickup_2017 = taxi_zones.merge(pickup_2017_df, on='LocationID')
pickup_2017 = json.loads(pickup_2017.to_json())['features']

base = alt.Chart(alt.Data(values=pickup_2017)).mark_geoshape(
        stroke='black',
        strokeWidth=1
    ).encode(
        color=alt.Color('properties.count:Q', scale=alt.Scale(type='log'), legend=alt.Legend(title="Pickup count"))
    ).properties(
        title=alt.TitleParams(text="NYC taxi pickup zones popularity", fontSize=22),
        width=800,
        height=800
    )
    
labels = alt.Chart(alt.Data(values=tz_geo)).mark_text(
    baseline='top',
     ).properties(
        width=800,
        height=800
     ).encode(
         longitude='properties.centroid_lon:Q',
         latitude='properties.centroid_lat:Q',
         text='properties.LocationID:O',
         size=alt.value(8),
         opacity=alt.value(1)
     )

base + labels


In [None]:
sql = '''
SELECT 
dropoff_location_id, avg(fare_amount) as avg_fare
FROM 
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018
WHERE dropoff_datetime > '2018-01-01' and dropoff_datetime < '2019-01-01'
AND fare_amount > 0 and fare_amount < 1000
GROUP BY dropoff_location_id;
'''
avg_fare_2018_df = client.query(sql).to_dataframe()

In [None]:
avg_fare_2018_df.rename(columns={'dropoff_location_id': 'LocationID'}, inplace=True)
avg_fare_2018_df['LocationID'] = avg_fare_2018_df['LocationID'].astype('int64')
avg_fare_2018_df['avg_fare'] = avg_fare_2018_df['avg_fare'].astype('float64')

avg_fare_2018 = taxi_zones.merge(avg_fare_2018_df, on='LocationID')
avg_fare_2018 = json.loads(avg_fare_2018.to_json())['features']

base = alt.Chart(alt.Data(values=avg_fare_2018)).mark_geoshape(
        stroke='black',
        strokeWidth=1
    ).encode(
        color=alt.Color('properties.avg_fare:Q', legend=alt.Legend(title="Avg. fare"))
    ).properties(
        title=alt.TitleParams(text="NYC average fare by taxi zone", fontSize=22),
        width=800,
        height=800
    )
    
labels = alt.Chart(alt.Data(values=tz_geo)).mark_text(
    baseline='top',
     ).properties(
        width=800,
        height=800
     ).encode(
         longitude='properties.centroid_lon:Q',
         latitude='properties.centroid_lat:Q',
         text='properties.LocationID:O',
         size=alt.value(8),
         opacity=alt.value(1)
     )

base + labels


In [93]:
#Fare and Trip analysis

sql = '''
SELECT 

pickup_location_id,dropoff_location_id, 

EXTRACT(YEAR FROM dropoff_datetime) AS year,
EXTRACT(MONTH FROM dropoff_datetime) AS month,
avg(fare_amount) as avg_fare,
count(*) as trips,
avg(trip_distance) as avg_trip_distance,

FROM 
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018
WHERE dropoff_datetime > '2018-01-01' and dropoff_datetime < '2019-01-01'
AND fare_amount > 0 and fare_amount < 1000
GROUP BY pickup_location_id,dropoff_location_id,EXTRACT(YEAR FROM dropoff_datetime),EXTRACT(MONTH FROM dropoff_datetime);
'''
fare_analysis_2018 = client.query(sql).to_dataframe()

In [None]:
cols = ['pickup_location_id','dropoff_location_id']
for c in cols:
  fare_analysis_2018[c] = fare_analysis_2018[c].astype('int64')

In [115]:
#fare_analysis_2018
zone_names = taxi_zones[['LocationID', 'borough', 'zone']]
# Attach pickup borough/zone, then drop-off borough/zone (suffix '_dropoff')
fare_analysis = (
    fare_analysis_2018
    .merge(zone_names, left_on='pickup_location_id', right_on='LocationID', how='left')
    .merge(zone_names, left_on='dropoff_location_id', right_on='LocationID',
           suffixes=('', '_dropoff'), how='left')
)


In [129]:
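# Keep inter-borough routes and drop rows where both ends are LocationID 264
# (assumed here to be the TLC "Unknown" zone); show the 50 busiest route-months
not_both_unknown = (fare_analysis['pickup_location_id'] != 264) | (fare_analysis['dropoff_location_id'] != 264)
crosses_borough = fare_analysis['borough'] != fare_analysis['borough_dropoff']
display_cols = ['borough', 'zone', 'borough_dropoff', 'zone_dropoff', 'month', 'year', 'trips', 'avg_fare', 'avg_trip_distance']
fare_analysis[not_both_unknown & crosses_borough][display_cols].sort_values(['trips', 'month', 'year'], ascending=False).head(50)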

index | borough | zone | borough_dropoff | zone_dropoff | month | year | trips | avg_fare | avg_trip_distance
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
65291 | Queens | LaGuardia Airport | Manhattan | Times Sq/Theatre District | 3 | 2018 | 28960 | 36.315296961 | 10.688997238
50835 | Queens | LaGuardia Airport | Manhattan | Midtown East | 3 | 2018 | 21705 | 31.886695232 | 9.939837825
65257 | Queens | LaGuardia Airport | Manhattan | Midtown Center | 3 | 2018 | 21099 | 34.333667946 | 10.298031186
50801 | Manhattan | Times Sq/Theatre District | Queens | LaGuardia Airport | 3 | 2018 | 19829 | 36.524944274 | 11.137568712
67643 | Queens | JFK Airport | Manhattan | Times Sq/Theatre District | 3 | 2018 | 17896 | 52.006814931 | 18.580777827
123997 | Queens | LaGuardia Airport | Manhattan | Times Sq/Theatre District | 5 | 2018 | 17205 | 37.894353967 | 10.570331299
50944 | Manhattan | Midtown Center | Queens | LaGuardia Airport | 3 | 2018 | 15988 | 33.914729797 | 10.433034776
139276 | Queens | LaGuardia Airport | Manhattan | Times Sq/Theatre District | 6 | 2018 | 15780 | 37.472508872 | 10.555271863
79414 | Queens | LaGuardia Airport | Manhattan | Times Sq/Theatre District | 4 | 2018 | 15707 | 36.590101229 | 10.609839562
261759 | Queens | LaGuardia Airport | Manhattan | Times Sq/Theatre District | 10 | 2018 | 15591 | 37.03483933 | 10.500388044