## Scenario/Stakeholder Based Analysis of NYC taxi rides data
##### Authors: Panini Mokrala, Dmitrii Danilov

# Introduction

Through this project, we will investigate the decisions that various stakeholders in a Taxi Ecosystem take and overlay them with the varying weather conditions within New York City to check for correlation. We will take a closer look at the following definitions/assumptions before getting into questions-

1.   Stakeholders - We will consider the Commuters travelling in the Taxis and Taxi owners/drivers as stakeholders
2.   Decisions - Depending on the stakeholders we chose, there can be different decisions - for example, Taxi drivers can prefer fewer trips in a winter season compared to a summer season while commuters would prefer a taxi ride more often in a winter season compared to a summer season
3. Time period/ other assumptions  - We will be utilizing the taxi rides information between 2017 to 2019 to highlight the trends. There can be few interactive effects with the entry of Uber/Lyft. Due to data reliability issues, we will not be able to attribute this interactive effect

Now that we have a clearer understanding of the problem space, there are two main areas that we would like to address through this project depending on the stakeholder -

1. If you are a traveler or commuter - You can plan your trip by answering:
    a. How does the average fare/trip vary across the various taxi zones of New York with changing timeline and weather conditions?
    b. What is the average time/trip between two points in New York? How does that change with the weather variations across the time?
    c. At a given point in time and at a given temperature range, how does the availability of the taxis vary? (Please note: Availability is defined as number of active rides at a given point in time)
2. If you want to help a taxi driver/owner - You can help them plan their next season by answering : 
    a. How do the taxi availability vary across the various taxi zones? How do the weather conditions relate with the taxi availability over time? 
    b. What can be a reasonable fare estimate / trip that the owner can quote between the origin and destination?

While we enlisted the areas/ questions that we would like to provide insights to the stakeholders, we will further broaden/alter the scope of analysis as we go through the data mining exercise. As an outcome of this exercise, we would like to create a framework for the commuters and taxi driver/owners that helps them to plan their trips across various time points, locations and weather conditions

Let's first mount the working location. In our case, we used the Google Drive infrastructure to complete this project and required dependencies

In [None]:
from google.colab import drive
from os.path import join

ROOT = '/content/drive'
PROJ = 'MyDrive/Milestones/milestone-1'

drive.mount(ROOT)
PROJECT_PATH = join(ROOT, PROJ)
%cd "$PROJECT_PATH"
%pwd

Mounted at /content/drive
/content/drive/MyDrive/Milestones/milestone-1


'/content/drive/MyDrive/Milestones/milestone-1'

In [None]:
!pip install geopandas
!pip install altair_data_server
!pip install sodapy

import io
import json
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.ops import cascaded_union
from google.cloud import bigquery
from google.oauth2 import service_account
from sodapy import Socrata
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
import datetime as dt
import altair as alt
import urllib
import seaborn as sns

cm = sns.light_palette("green", as_cmap=True)


key_path = 'auth.json'
credentials = service_account.Credentials.from_service_account_file(key_path)
bq_client = bigquery.Client(credentials=credentials, project=credentials.project_id)
soc_client = Socrata('data.cityofnewyork.us', 'erkBtGgCm1QXwrGaILeRCD1Xw', timeout=500)

Collecting geopandas
[?25l  Downloading https://files.pythonhosted.org/packages/2a/9f/e8a440a993e024c0d3d4e5c7d3346367c50c9a1a3d735caf5ee3bde0aab1/geopandas-0.8.2-py2.py3-none-any.whl (962kB)
[K     |████████████████████████████████| 972kB 9.1MB/s 
[?25hCollecting pyproj>=2.2.0
[?25l  Downloading https://files.pythonhosted.org/packages/e4/ab/280e80a67cfc109d15428c0ec56391fc03a65857b7727cf4e6e6f99a4204/pyproj-3.0.0.post1-cp36-cp36m-manylinux2010_x86_64.whl (6.4MB)
[K     |████████████████████████████████| 6.5MB 25.4MB/s 
[?25hCollecting fiona
[?25l  Downloading https://files.pythonhosted.org/packages/37/94/4910fd55246c1d963727b03885ead6ef1cd3748a465f7b0239ab25dfc9a3/Fiona-1.8.18-cp36-cp36m-manylinux1_x86_64.whl (14.8MB)
[K     |████████████████████████████████| 14.8MB 301kB/s 
Collecting cligj>=0.5
  Downloading https://files.pythonhosted.org/packages/42/1e/947eadf10d6804bf276eb8a038bd5307996dceaaa41cfd21b7a15ec62f5d/cligj-0.7.1-py3-none-any.whl
Collecting click-plugins>=1.0
  D

# Datasets and Data Sources

## <b>Data Source 1</b> : ##

> NYC Open Data -  New York Yellow Taxi Trip data (Timestamp level)
  1. Size: N/A (As this is API based, we will only get details once the entire data is pulled)
  2. Format: API (JSON)
  3. Access method: Python API calls & Google BigQuery framework

  
Dataset Name | Link
--- | ---
Yellow Taxi (2019) | https://data.cityofnewyork.us/Transportation/2019-Yellow-Taxi-Trip-Data/2upf-qytp 
Yellow Taxi (2018) | https://data.cityofnewyork.us/Transportation/2018-Yellow-Taxi-Trip-Data/t29m-gskq
Yellow Taxi (2017) | https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs 


## <b>Data Source 2</b> : ##

> NYC Open Data -  National Centers for Environmental Information
  1. Size: 17MB
  2. Format: CSV
  3. Access method: HTTP
  4. Location : https://www.ncdc.noaa.gov

## <b> Data Source 3 : </b>##

> NYC Open Data -  Taxi Zone information
  1. Size: 1MB
  2. Format: Shapefile
  3. Access method: HTTP
  4. Location : https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  

# Data Exploration

We will first start with data exploration - Starting with Taxi zone information. The dataset is stores in aws server to access. We will Geopandas to access the file. Please be sure to understand about the <a href="https://en.wikipedia.org/wiki/Shapefile">SHAPEFILES</a> before getting further in this section

## Zones information

In [None]:
## Read the file (It will be in a zip format)

taxi_zones = gpd.read_file('https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip')
taxi_zones.to_crs(epsg=4326, inplace=True)

## Get the x-axis and y-axis positions of all the taxi zones

taxi_zones['centroid_lon'] = taxi_zones['geometry'].centroid.x
taxi_zones['centroid_lat'] = taxi_zones['geometry'].centroid.y

## Few aggregations to ensure that we are able to visualize the position of the variables
taxi_zones_b = taxi_zones.groupby('borough')['geometry'].agg(lambda x: cascaded_union(x).centroid).to_frame()
taxi_zones_b.columns = ['geometry']
taxi_zones_b.reset_index(inplace=True)
taxi_zones_b['centroid_lon'] = taxi_zones_b['geometry'].centroid.x
taxi_zones_b['centroid_lat'] = taxi_zones_b['geometry'].centroid.y

tz_geo = json.loads(taxi_zones.to_json())['features']
tz_geo_b = json.loads(taxi_zones_b.to_json())['features']


  

  if __name__ == '__main__':


<b> Shape File Description </b>

Column Name | Definition
--- | ---
Object ID | Unique ID given to the Location
Shape_Leng | Length of the shape (Used to draw the locations)
Shape_Area | Area of the shape considered (Used to draw the locations)
zone | Name of the area with in the Borough
Location ID | Similar to Object ID
borough | Name of the area that is present (There are in total 6)
geometry | Co-ordinates needed to draw the graph
centroid_lon | Longitude of the Shape
centroid_lat | Latitude of the shape

<b>Visualizing the zones - </b> 

In [None]:
## Set up a default theme for the Altair interface

alt.themes.enable('opaque')

## Shape up the geometric objects in the chart

base = alt.Chart(alt.Data(values=tz_geo)).mark_geoshape(
        stroke='black',
        strokeWidth=1
    ).encode(
        color=alt.Color('properties.borough:N', legend=None)
    ).properties(
        width=800,
        height=800
    )

## Assign the chart labels / zone info

labels = alt.Chart(alt.Data(values=tz_geo)).mark_text(
    baseline='top',
     ).properties(
        width=800,
        height=800
     ).encode(
         longitude='properties.centroid_lon:Q',
         latitude='properties.centroid_lat:Q',
         text='properties.LocationID:O',
         size=alt.value(8),
         opacity=alt.value(1)
     )

## Enter the names of the 5 boroughs - The hierarchy of the dataset is as follows
## Zones>>> boroughs

boroughs = alt.Chart(alt.Data(values=tz_geo_b)).mark_text(
    color='white',
    stroke='black',
    fontWeight='bold',
    strokeWidth=0.7,
    baseline='top'
     ).properties(
        width=800,
        height=800,
        title=alt.Text(text="NYC boroughs and taxi zones", fontSize=22)
     ).encode(
         longitude='properties.centroid_lon:Q',
         latitude='properties.centroid_lat:Q',
         text='properties.borough:N',
         size=alt.value(26),
         opacity=alt.value(1)
     )


(base + labels + boroughs).properties(width = 150, height = 100)

Output hidden; open in https://colab.research.google.com to view.

Let's first find out the distribution between Boroughs and Locations

In [None]:
#Visualize the # of Zones by Borough
val_list = list((taxi_zones.groupby('borough', as_index=False).count()).sort_values('OBJECTID',axis = 0,ascending = False).borough.values)

In [None]:
#Distribution of Zones by area
dist_zones =   alt.Chart(taxi_zones).mark_bar().encode(
      y = alt.Y('borough',sort = val_list),  # The order to sort in),
      x = alt.X('count(zone):Q',title = 'Number of Zones'
          ),
  )

text_zones =   alt.Chart(taxi_zones).mark_text(baseline = 'middle',align = 'left').encode(
      y = alt.Y('borough',sort = val_list),  # The order to sort in),
      x = alt.X('count(zone):Q',title = 'Number of Zones'
          ),
      text = 'count(zone):Q'
  )

(dist_zones + text_zones).configure_mark(
    # we don't love the blue
    color='#008fd5'
).configure_axis(
 labelColor = 'grey',
 tickColor = 'grey'

).configure_view(
    # we don't want a stroke around the bars
    strokeWidth=0
).properties(
    # set the dimensions of the visualization
    width=500,
    height=180
).properties(
    # add a title
    title={
      "text": ["Distribution of Zones by Boroughs"], 
      "subtitle": ["Manhattan and Queens have highest Zones : 69"],
      "color": "Black",
      "subtitleColor": "grey",
        "fontSize":25
    }
).configure_title(
    anchor='start'
)


Output hidden; open in https://colab.research.google.com to view.

To summarize the Manhattan, Queens areas have the highest number of Zones - 69. Followed by the Brooklyn. The Manhattan is the business district of the city, while the Queens is the largest borough in the city, adjacent to Brooklyn. So it will be interesting to investigate how the # of trips, costings vary between the 5 boroughs

## Understanding the Taxi Trip data

We will be utilizing the Shapefile and the Taxi trip data in conjunction to bring few visualizations. The flow from here will be as follows

### Dropoffs variation

In [None]:
## Utility functions

## Dataframe creation

def create_dataframe(df,var1 = 'LocationID',var2 = 'Counts'):
  df.rename(columns={'dropoff_location_id': var1}, inplace=True)
  df[var1] = df[var1].astype('int64')
  if var2 =='avg_fare':
    df[var2] = df[var2].astype('float64')
  df_merge = taxi_zones.merge(df,on = var1)
  df_merge_json = json.loads(df_merge.to_json())['features']

  return df_merge,df_merge_json

## Heatmap creation

def base_heatmap(df,title_heatmap,chart_title,feature='properties.count:Q'):
  return alt.Chart(alt.Data(values=df)).mark_geoshape(stroke='black',strokeWidth=1).encode(
      color=alt.Color(feature, scale=alt.Scale(type='log'), legend=alt.Legend(title=title_heatmap))
      ).properties(title=alt.Text(text=chart_title, fontSize=22),width=600,height=600)

## Boxplot creation

def box_plot_creator(df,title,feature = 'count:Q'):
  return alt.Chart(df).mark_boxplot().encode(
      x=alt.X(feature,scale = alt.Scale(type='log')),y=alt.Y('borough:O',sort=val_list)).properties(title = alt.Text(text = title,fontSize = 16))


In [None]:
sql = '''SELECT dropoff_location_id, count(*) as count
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017 
where dropoff_datetime between '2017-01-01' and '2018-01-01' 
group by dropoff_location_id;'''
dropoff_2017_df = bq_client.query(sql).to_dataframe()

In [None]:
dropoff_2017_info,dropoff_2017 = create_dataframe(dropoff_2017_df)

base_heatmap(dropoff_2017,title_heatmap = 'Counts',chart_title = 'Drop off distribution') + boroughs + labels

Output hidden; open in https://colab.research.google.com to view.

In [None]:
chart_1 = box_plot_creator(df = dropoff_2017_info,title = 'Dropoffs distribution by boroughs')
chart_1

Output hidden; open in https://colab.research.google.com to view.

As we see in the Annual Drop offs chart, we see a majority of trips happening in the Business district of Manhattan. Queens is an intersting borough with fewer areas having more than 0.5 million yearly trips. However, Brooklyn has good distribution of Taxi information. Let us see essential follow up questions - 

1) What are the zones with in Queens that has higher number of trips?

2) How does the distribution of pickups vary?

3) Should we see how the price / trip varies in the zones to see the most profitable borough?

In [None]:
dropoff_2017_info[(dropoff_2017_info['count']>500000) & (dropoff_2017_info['borough']=='Queens')][['borough','zone','count']]

Unnamed: 0,borough,zone,count
6,Queens,Astoria,529456
128,Queens,JFK Airport,998009
134,Queens,LaGuardia Airport,1315008


### Pickups variation

Distribution variation for Pickups

In [None]:
sql = '''SELECT pickup_location_id as LocationID, count(*) as count
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017 
where pickup_datetime between '2017-01-01' and '2018-01-01' 
group by pickup_location_id;'''
pickup_2017_df = bq_client.query(sql).to_dataframe()

In [None]:
pickup_2017_info,pickup_2017 = create_dataframe(pickup_2017_df)

base_heatmap(pickup_2017,title_heatmap = 'Counts_Color_Map',chart_title = 'Pickups distribution') + boroughs + labels

Output hidden; open in https://colab.research.google.com to view.

Let's look into the distribution of pick ups

In [None]:
box_plot_creator(df = pickup_2017_info,title = 'Pickups distribution by boroughs')

Output hidden; open in https://colab.research.google.com to view.

There are two areas in Queens that experience higher proportion of Pickups. Most probably, they are the airports in Queens. Let's check once with the code

In [None]:
pickup_2017_info[(pickup_2017_info['count']>500000) & (pickup_2017_info['borough']=='Queens')][['borough','zone','count']]

Unnamed: 0,borough,zone,count
131,Queens,JFK Airport,2726868
137,Queens,LaGuardia Airport,3034479


### Fare variation

Now that, we have seen the variation in trips by destination and originating station let's also understand the average fare / trip in the areas and how it varys

In [None]:
sql = '''
SELECT 
dropoff_location_id as LocationID, avg(fare_amount) as avg_fare
FROM 
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018
WHERE dropoff_datetime > '2018-01-01' and dropoff_datetime < '2019-01-01'
AND fare_amount > 0 and fare_amount < 1000
GROUP BY dropoff_location_id;
'''
avg_fare_2018_df = bq_client.query(sql).to_dataframe()

In [None]:
avg_fare_2018_df.columns

Index(['LocationID', 'avg_fare'], dtype='object')

In [None]:
avg_fare_2018_info,avg_fare_2018 = create_dataframe(avg_fare_2018_df,var2 = 'avg_fare')

base_heatmap(avg_fare_2018,title_heatmap = 'Average Fare',chart_title = 'Fare distribution',feature='properties.avg_fare:Q') + boroughs + labels

Output hidden; open in https://colab.research.google.com to view.

In [None]:
chart_2 = box_plot_creator(avg_fare_2018_info,title='Distribution of Avg.Fare',feature = 'avg_fare:Q')

In [None]:
(chart_1.properties(height = 200)|chart_2.properties(height = 200))

Output hidden; open in https://colab.research.google.com to view.

In [None]:
avg_fare_2018_info[(avg_fare_2018_info['avg_fare']>45) & (avg_fare_2018_info['borough']=='Queens')][['borough','zone','avg_fare']]

Unnamed: 0,borough,zone,avg_fare
26,Queens,Breezy Point/Fort Tilden/Riis Beach,55.876336
113,Queens,Hammels/Arverne,52.694515
128,Queens,JFK Airport,48.194505
197,Queens,Rockaway Park,51.172654


In [None]:
avg_fare_2018_info[(avg_fare_2018_info['avg_fare']>20) & (avg_fare_2018_info['borough']=='Manhattan')][['borough','zone','avg_fare']]

Unnamed: 0,borough,zone,avg_fare
116,Manhattan,Highbridge Park,24.164416
123,Manhattan,Inwood,28.357769
124,Manhattan,Inwood Hill Park,27.70327
149,Manhattan,Marble Hill,31.171338
198,Manhattan,Roosevelt Island,21.043422
239,Manhattan,Washington Heights North,26.494606
240,Manhattan,Washington Heights South,22.065101


Manhattan has higher number of trips. However, the median fare is only 11 dollars. A trip to Staten Islands in general costs 60 dollars - 6 times that of a trip to Manhattan. The median trips to Queens and Brooklyn is 28 dollars and 30 dollars respectively

### Analysis between Fares and Location trips

Write a bug query to get the summarized fares and number of trips originating from a zone

In [None]:
sql = '''
SELECT 
pickup_location_id as LocationID, sum(fare_amount) as total_fare,count(*) as number_trips,avg(fare_amount) as avg_fare
FROM 
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018
WHERE dropoff_datetime > '2018-01-01' and dropoff_datetime < '2019-01-01'
AND fare_amount > 0 and fare_amount < 1000
GROUP BY pickup_location_id;
'''
new_fare_2018_df = bq_client.query(sql).to_dataframe()

In [None]:
new_fare_2018_df

new_fare_2018_df.rename(columns={'pickup_location_id': 'LocationID'}, inplace=True)
new_fare_2018_df['LocationID'] = new_fare_2018_df['LocationID'].astype('int64')
new_fare_2018_df['total_fare'] = new_fare_2018_df['total_fare'].astype('float64')
new_fare_2018_df['number_trips'] = new_fare_2018_df['number_trips'].astype('float64')
new_fare_2018_df['avg_fare'] = new_fare_2018_df['avg_fare'].astype('float64')
df_merge = taxi_zones.merge(new_fare_2018_df,on = 'LocationID')
df_merge_json = json.loads(df_merge.to_json())['features']

In [None]:
df_merge.head()

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,zone,LocationID,borough,geometry,centroid_lon,centroid_lat,total_fare,number_trips,avg_fare
0,1,0.116357,0.000782,Newark Airport,1,EWR,"POLYGON ((-74.18445 40.69500, -74.18449 40.695...",-74.174,40.691831,656702.82,8217.0,79.920022
1,2,0.43347,0.004866,Jamaica Bay,2,Queens,"MULTIPOLYGON (((-73.82338 40.63899, -73.82277 ...",-73.831299,40.616745,2905.5,79.0,36.778481
2,3,0.084341,0.000314,Allerton/Pelham Gardens,3,Bronx,"POLYGON ((-73.84793 40.87134, -73.84725 40.870...",-73.847422,40.864474,34418.23,1318.0,26.113983
3,4,0.043567,0.000112,Alphabet City,4,Manhattan,"POLYGON ((-73.97177 40.72582, -73.97179 40.725...",-73.976968,40.723752,2802877.76,234352.0,11.960119
4,5,0.092146,0.000498,Arden Heights,5,Staten Island,"POLYGON ((-74.17422 40.56257, -74.17349 40.562...",-74.188484,40.552659,9814.77,131.0,74.921908


Prepare a scatter plot with Number of trips on X-axis and average fare on the Y-axis. Divide the area into four zones on the basis of median fare and median trips / zone as a baseline. The annotations will provide details around the each area . We will be utilizing the chart for further analysis using the weather dataset

In [None]:
#df_merge.head()

annotations = [[100000,10, 'High trips with \n less than median earnings'],
               [100000,70 , 'Ideal Zone -  \n Good appreciation for earnings and trips'],
               [10,70,'One time favourites \n - Less in Trips, but will yield high'],
               [10,10,'Not an ideal location'],
               [5124,22,'Ideal Zone']]
a_df = pd.DataFrame(annotations, columns=['Trips','Fare','note'])

main_chart = alt.Chart(df_merge).mark_circle(opacity = 0.5).encode(
  x=alt.X('number_trips:Q',scale=alt.Scale(type='log'),title = 'Trips'),
    y=alt.Y('avg_fare:Q',title = 'Average_Fare'),
    color='borough:N',
    tooltip=['borough:N', 'zone:N','avg_fare:Q']
)

vertline = alt.Chart().mark_rule().encode(
    x='Trips:Q'
)

horzline = alt.Chart().mark_rule().encode(
    y='Average_Fare:Q'
)

layer2 = alt.Chart(a_df).mark_text(color = 'Black',lineBreak = '\n',dx = 6).encode(x = 'Trips:Q',y = 'Fare',text = 'note')

chart3 = alt.layer(
    main_chart, vertline,horzline,layer2,
    data=df_merge
).transform_calculate(
  Trips="5124",Average_Fare = "22"
).properties(title = 'Zones - Avg. Fare vs. Trips')


In [None]:
alt.layer(
    main_chart, vertline,horzline,layer2,
    data=df_merge
).transform_calculate(
  Trips="5124",Average_Fare = "22"
).properties(title = 'Zones - Avg. Fare vs. Trips')

Output hidden; open in https://colab.research.google.com to view.

In [None]:
df_merge['Final_Listing'] = np.where((df_merge['avg_fare']>22)&(df_merge['number_trips']>5124),'0High trips - High Income',
                                     np.where((df_merge['avg_fare']>22)&(df_merge['number_trips']<=5124),'1Less Trips - High Fare',
                                              np.where((df_merge['avg_fare']<=22)&(df_merge['number_trips']<=5124),'3Less Trips - Less Fare','2High Trips - Less Fare')))

df_merge['Final_Listing_Val'] = np.where((df_merge['avg_fare']>22)&(df_merge['number_trips']>5124),4,
                                     np.where((df_merge['avg_fare']>22)&(df_merge['number_trips']<=5124),3,
                                              np.where((df_merge['avg_fare']<=22)&(df_merge['number_trips']<=5124),1,2)))

In [None]:
df_merge_fin = df_merge[['OBJECTID', 'Shape_Leng', 'Shape_Area', 'zone', 'LocationID', 'borough',
       'geometry', 'centroid_lon', 'centroid_lat', 'Final_Listing_Val']]

In [None]:
df_merge_json = json.loads(df_merge_fin.to_json())['features']

In [None]:
cross_summary = df_merge.groupby(['Final_Listing','borough'], ).count()['OBJECTID'].unstack()

In [None]:
pd.concat([cross_summary,df_merge.groupby('Final_Listing').count()['OBJECTID']],axis = 1)

Unnamed: 0_level_0,Bronx,Brooklyn,EWR,Manhattan,Queens,Staten Island,OBJECTID
Final_Listing,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0High trips - High Income,,6.0,1.0,1.0,14.0,,22
1Less Trips - High Fare,29.0,21.0,,1.0,38.0,18.0,107
2High Trips - Less Fare,4.0,28.0,,61.0,14.0,,107
3Less Trips - Less Fare,10.0,6.0,,3.0,3.0,2.0,24


In [None]:

# 0High trips - High Income	- Group 4
# 1Less Trips - High Fare	- Group 3
# 2High Trips - Less Fare	- Group 2
# 3Less Trips - Less Fare	- Group 1

base = alt.Chart(alt.Data(values=df_merge_json)).mark_geoshape(
        stroke='black',
        strokeWidth=1
    ).encode(
      color=alt.Color('properties.Final_Listing_Val:O',scale=alt.Scale(domain=[1,2,3,4], range=['#f37656', '#ffc809','#cccd2a','green']),
                      
)    )

base + boroughs

Output hidden; open in https://colab.research.google.com to view.

As you can see the areas in Queens actually earn a lot compared to the other areas (Green shade). However the entire area of Manhattan is where the actual cab activity happens

## Understanding the weather data

We will first ingest the table into Big Query platform before doing any operation on the dataset

In [None]:
#Ingest the weather dataset and perform few transformations

weather = pd.read_csv('2398761.csv')
weather_h = weather[weather['REPORT_TYPE'].isin(['FM-15', 'SY-MT'])]
weather_hdf = weather_h.filter(items=['HourlyDryBulbTemperature', 'HourlyPrecipitation', 'HourlyPresentWeatherType'])
weather_hdf.rename(columns={'HourlyDryBulbTemperature': 'avg_temp',
                          'HourlyPrecipitation': 'precip_depth',
                          'HourlyPresentWeatherType': 'precip_type'}, inplace=True)
weather_hdf.loc[:, 'datetime'] = weather_h['DATE']
weather_hdf['datetime'] = pd.to_datetime(weather_hdf['datetime'], format="%Y-%m-%dT%H:%M:%S")
weather_hdf['precip_depth'].replace(to_replace='T', value=0.0, inplace=True)
weather_hdf['precip_depth'].replace(to_replace='[a-zA-Z]', value='', regex=True, inplace=True)
weather_hdf['avg_temp'].replace(to_replace='[a-zA-Z]', value='', regex=True, inplace=True)
weather_hdf['precip_depth'] = weather_hdf['precip_depth'].astype('double')
weather_hdf['precip_type'] = weather_hdf['precip_type'].astype('str')
weather_hdf['precip_type'] = ['rain' if 'RA' in x else 'snow' if ('SN' in x or 'SG' in x or 'IC' in x or 'PL' in x) else np.nan for x in weather_hdf['precip_type']]
weather_hdf['avg_temp'] = weather_hdf['avg_temp'].astype('double')


weather_d = weather[weather['REPORT_TYPE'] == 'SOD  ']
weather_ddf = weather_d.filter(items=['DailyAverageDryBulbTemperature', 'DailyPrecipitation', 'DailySnowDepth', 'DailySnowfall'])
weather_ddf.rename(columns={'DailyAverageDryBulbTemperature': 'avg_temp',
                          'DailyPrecipitation': 'precip_depth',
                          'DailySnowDepth': 'snow_depth',
                          'DailySnowfall': 'snow_fall'}, inplace=True)
weather_ddf.loc[:, 'datetime'] = weather_d['DATE']
weather_ddf['datetime'] = pd.to_datetime(weather_ddf['datetime'], format="%Y-%m-%dT%H:%M:%S")
weather_ddf['precip_depth'].replace(to_replace='T', value=0.0, inplace=True)
weather_ddf['snow_depth'].replace(to_replace='T', value=0.0, inplace=True)
weather_ddf['snow_fall'].replace(to_replace='T', value=0.0, inplace=True)
weather_ddf['precip_depth'] = weather_ddf['precip_depth'].astype('double')
weather_ddf['snow_depth'] = weather_ddf['snow_depth'].astype('double')
weather_ddf['snow_fall'] = weather_ddf['snow_fall'].astype('double')


#Create a pipeline to move the dataset into big query environment - We will be creating two tables - One for daily analysis and other for the monthly analysis

def create_and_populate_weather_tables():
  daily_table_id = 'mads-milestone-1.weather.daily'
  hourly_table_id = 'mads-milestone-1.weather.hourly'

  try:
    bq_client.get_table(daily_table_id)
    print('Weather tables already exist.')
    return
  except NotFound:
    print('Weather tables not found, creating...')

  bq_client.create_dataset('mads-milestone-1.weather')

  daily_schema = [
      bigquery.SchemaField("datetime", "TIMESTAMP", mode="REQUIRED"),
      bigquery.SchemaField("avg_temp", "FLOAT"),
      bigquery.SchemaField("precip_depth", "FLOAT"),
      bigquery.SchemaField("snow_depth", "FLOAT"),
      bigquery.SchemaField("snow_fall", "FLOAT")
  ]

  daily_table = bigquery.Table(daily_table_id, schema=daily_schema)
  daily_table = bq_client.create_table(daily_table)

  hourly_schema = [
      bigquery.SchemaField("datetime", "TIMESTAMP", mode="REQUIRED"),
      bigquery.SchemaField("avg_temp", "FLOAT"),
      bigquery.SchemaField("precip_type", "STRING"),
      bigquery.SchemaField("precip_depth", "FLOAT")
  ]

  hourly_table = bigquery.Table(hourly_table_id, schema=hourly_schema)
  hourly_table = bq_client.create_table(hourly_table)

  job_config_d = bigquery.LoadJobConfig(
      schema=daily_schema, source_format=bigquery.SourceFormat.CSV
  )
  job_config_h = bigquery.LoadJobConfig(
      schema=hourly_schema, source_format=bigquery.SourceFormat.CSV
  )

  bg_daily_job = bq_client.load_table_from_dataframe(weather_ddf, daily_table, job_config=job_config_d)
  bg_daily_job.result()

  bg_hourly_job = bq_client.load_table_from_dataframe(weather_hdf, hourly_table, job_config=job_config_h)
  bg_hourly_job.result()

  interactivity=interactivity, compiler=compiler, result=result)


In [1]:
weather_ddf['mnth_yr'] = weather_ddf['datetime'].apply(lambda x: x.strftime('%B-%Y'))

NameError: ignored

In [None]:
box_plots = alt.Chart(weather_ddf).mark_boxplot().encode(
    x='mnth_yr:T',y=alt.Y('avg_temp:Q')).properties(title = alt.Text(text = "Temperature Distribution",fontSize = 22),width = 1000)

line_plots = alt.Chart(weather_ddf).mark_line().encode(
    x='mnth_yr:T',y=alt.Y('mean(avg_temp):Q')).properties(title = alt.Text(text = "Temperature Distribution",fontSize = 22),width = 1000)

box_plots+line_plots

### Create and populate weather data tables

In [None]:
create_and_populate_weather_tables()

Weather tables already exist.


## Prepare SQL queries

In [None]:
CACHE_SQL_RESULTS = True

def get_location_ids(loc, str_output=False):
  location_ids = {
      'JFK': taxi_zones[taxi_zones['zone'] == 'JFK Airport']['LocationID'],
      'LGA': taxi_zones[taxi_zones['zone'] == 'LaGuardia Airport']['LocationID'],
      'EWR': taxi_zones[taxi_zones['zone'] == 'Newark Airport']['LocationID'],
      'Bronx': taxi_zones[taxi_zones['borough'] == 'Bronx']['LocationID'],
      'Brooklyn': taxi_zones[taxi_zones['borough'] == 'Brooklyn']['LocationID'],
      'Queens': taxi_zones[taxi_zones['borough'] == 'Queens']['LocationID'],
      'Staten_Island': taxi_zones[taxi_zones['borough'] == 'Staten Island']['LocationID'],
      'Manhattan': taxi_zones[taxi_zones['borough'] == 'Manhattan']['LocationID']
  }
  if str_output:
    return ','.join([f"'{id}'" for id in location_ids[loc].astype(str).tolist()])
  return ','.join(location_ids[loc].astype(str).tolist())


def build_query_soc(year, source, dest):
  if year == '2017':
    start, end = '2017-01-01', '2018-01-01'
  elif year == '2018':
    start, end = '2018-01-01', '2019-01-01'
  else:
    start, end = '2019-01-01', '2020-01-01'

  base_sql = f'''select date_trunc_ymd(tpep_dropoff_datetime) as day, 
                  avg(fare_amount) as avg_fare,
                  avg(trip_distance) as avg_dist,
                  --stddev_samp(fare_amount) as std_fare,
                  --stddev_samp(trip_distance) as std_dist,
                  count(*) as count
                where tpep_dropoff_datetime > '{start}'
                  and tpep_dropoff_datetime < '{end}'
                  and fare_amount > 5 and fare_amount < 100
                  and trip_distance > 0 and trip_distance < 100
                  and pulocationid in ({get_location_ids(source)})
                  and dolocationid in ({get_location_ids(dest)})
                group by day 
                order by day'''
  return base_sql


def build_query_bq(year, source, dest):
  if year == '2017':
    start, end = '2017-01-01', '2018-01-01'
  elif year == '2018':
    start, end = '2018-01-01', '2019-01-01'
  else:
    start, end = '2019-01-01', '2020-01-01'

  base_sql = f'''SELECT
                  datetime_trunc(dropoff_datetime, day) as day,
                  AVG(fare_amount) as avg_fare,
                  AVG(trip_distance) as avg_dist,
                  --stddev(fare_amount) as std_fare,
                  --stddev(trip_distance) as std_dist,
                  COUNT(*) as count,
                  AVG(datetime_diff(dropoff_datetime, pickup_datetime, minute)) as avg_duration,
                  --stddev(datetime_diff(dropoff_datetime, pickup_datetime, minute)) as std_duration
                FROM
                  bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_{year}
                WHERE
                  dropoff_datetime > '{start}'
                  AND dropoff_datetime < '{end}'
                  AND pickup_location_id in ({get_location_ids(source, str_output=True)})
                  AND dropoff_location_id in ({get_location_ids(dest, str_output=True)})
                GROUP BY day
                ORDER BY day'''
  return base_sql


sql_dict_soc = {
    
}

sql_dict_bq = {
    'dropoff_count_avg_temp_by_hour': '''SELECT datetime_trunc(t.dropoff_datetime, HOUR) as dropoff_hour, count(*) as taxi_avail, avg(w.avg_temp) as avg_temp
                                    FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018 t
                                    LEFT JOIN mads-milestone-1.weather.hourly w
                                    ON CAST(TIMESTAMP_TRUNC(w.datetime, HOUR) as DATETIME) = DATETIME_TRUNC(t.dropoff_datetime, HOUR)
                                    WHERE t.dropoff_datetime > '2018-01-01' and t.dropoff_datetime < '2019-01-01'
                                    GROUP BY dropoff_hour;''',
            
  'dropoff_count_avg_temp_by_day': '''SELECT datetime_trunc(t.dropoff_datetime, DAY) as dropoff_day, count(*) as taxi_avail, avg(w.avg_temp) as avg_temp
                                    FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018 t
                                    LEFT JOIN mads-milestone-1.weather.daily w
                                    ON CAST(TIMESTAMP_TRUNC(w.datetime, DAY) as DATETIME) = DATETIME_TRUNC(t.dropoff_datetime, DAY)
                                    WHERE t.dropoff_datetime > '2018-01-01' and t.dropoff_datetime < '2019-01-01'
                                    GROUP BY dropoff_day;'''
}

def populate_soc_sql_dict():
  years=['2017', '2018', '2019']
  locs=['JFK', 'LGA', 'EWR', 'Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens']

  #sql_dict_soc.clear()

  for year in years:
    for source in locs:
      for dest in locs:
        key = '_'.join([f'fare_dist_dura_avail_soc_{year}', source.replace(' ', '_'), dest.replace(' ', '_')])
        value = build_query_soc(year, source.replace(' ', '_'), dest.replace(' ', '_'))
        sql_dict_soc[key] = value


def populate_bq_sql_dict():
  years=['2017', '2018']
  locs=['JFK', 'LGA', 'EWR', 'Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens']

  #sql_dict_bq.clear()

  for year in years:
    for source in locs:
      for dest in locs:
        key = '_'.join([f'fare_dist_dura_avail_bq_{year}', source.replace(' ', '_'), dest.replace(' ', '_')])
        value = build_query_bq(year, source.replace(' ', '_'), dest.replace(' ', '_'))
        sql_dict_bq[key] = value

populate_soc_sql_dict()
populate_bq_sql_dict()

def run_cached_bq(sql_name):
  if not CACHE_SQL_RESULTS:
    sql_query = sql_dict_bq[sql_name]
    #print('Caching is disabled, querying database...')
    return bq_client.query(sql_query).to_dataframe()
  try:
    #print('Reading dataframe from cache...')
    return pd.read_pickle(''.join(['./cache/', sql_name, '.gz']))
  except FileNotFoundError:
    #print('Dataframe not found in cache, querying database..')
    sql_query = sql_dict_bq[sql_name]
    df = bq_client.query(sql_query).to_dataframe()
    #print('Caching resulting dataframe...')
    df.to_pickle(''.join(['./cache/', sql_name, '.gz']))
    #print('Dataframe saved to cache')
    return df

def run_cached_soc(sql_name, year):
  soc_client_dict = {'2017': 'biws-g3hs', '2018': 't29m-gskq', '2019': '2upf-qytp'}
  if not CACHE_SQL_RESULTS:
    sql_query = sql_dict_soc[sql_name]
    #print('Caching is disabled, querying database...')
    results = soc_client.get(soc_client_dict[year], query=sql_query)
    return pd.DataFrame.from_records(results)
  try:
    #print('Reading dataframe from cache...')
    return pd.read_pickle(''.join(['./cache/', sql_name, '.gz']))
  except FileNotFoundError:
    #print('Dataframe not found in cache, querying database..')
    sql_query = sql_dict_soc[sql_name]
    results = soc_client.get(soc_client_dict[year], query=sql_query)
    #print(f'results: {results}')
    df = pd.DataFrame.from_records(results)
    #print('Caching resulting dataframe...')
    df.to_pickle(''.join(['./cache/', sql_name, '.gz']))
    #print('Dataframe saved to cache')
    return df


#### Availability Analysis

Let's pick up the Ideal Zone and One time favouries zone referred as '0High trips - High Income' & '1Less Trips - High Fare' respectively to see how the taxi availability varies on the basis of the temperature variations. We define the Freezing days as the days where the temperature is <32FH 

In [None]:

def segment_location_ids(loc, str_output=False):
  location_ids = {
      '0High trips - High Income': df_merge[df_merge['Final_Listing'] == '0High trips - High Income']['LocationID'],
      '1Less Trips - High Fare': df_merge[df_merge['Final_Listing'] == '1Less Trips - High Fare']['LocationID'],
      '2High Trips - Less Fare': df_merge[df_merge['Final_Listing'] == '2High Trips - Less Fare']['LocationID'],
      '3Less Trips - Less Fare': df_merge[df_merge['Final_Listing'] == '3Less Trips - Less Fare']['LocationID']
  }
  return '\',\''.join(location_ids[loc].astype(str).tolist())

segment_location_ids('1Less Trips - High Fare')

"2','3','5','8','9','11','15','16','19','20','21','22','23','27','29','30','31','32','38','44','46','51','53','55','56','56','58','59','60','63','64','67','71','72','73','77','78','81','84','86','98','99','101','102','108','109','115','117','118','120','121','122','123','124','126','131','135','136','139','147','149','150','154','155','156','160','165','171','172','174','175','176','177','180','182','183','184','185','187','191','192','200','201','203','204','205','206','208','210','213','214','218','221','222','227','235','240','241','242','245','248','251','252','253','254','258','259"

Query engine for '0High Trips - High Income group

In [None]:
sql1 = '''
SELECT datetime_trunc(t.dropoff_datetime, HOUR) as hour_timestamp, count(*) as taxi_avail, avg(w.avg_temp) as avg_temp,AVG(fare_amount) as avg_fare
                                    FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018 t
                                    LEFT JOIN mads-milestone-1.weather.hourly w
                                    ON CAST(TIMESTAMP_TRUNC(w.datetime, HOUR) as DATETIME) = DATETIME_TRUNC(t.dropoff_datetime, HOUR)
                                    WHERE t.dropoff_datetime > '2018-01-01' and t.dropoff_datetime < '2019-01-01' and
                                    dropoff_location_id in ('1','10','14','28','35','39','70','76','91','92','93','130','132','134','138','157','194','195','197','215','216','219')
                                    GROUP BY hour_timestamp;
'''

sql2 = '''
SELECT datetime_trunc(t.pickup_datetime, HOUR) as hour_timestamp, count(*) as taxi_avail, avg(w.avg_temp) as avg_temp,AVG(fare_amount) as avg_fare
                                    FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018 t
                                    LEFT JOIN mads-milestone-1.weather.hourly w
                                    ON CAST(TIMESTAMP_TRUNC(w.datetime, HOUR) as DATETIME) = DATETIME_TRUNC(t.pickup_datetime, HOUR)
                                    WHERE t.pickup_datetime > '2018-01-01' and t.pickup_datetime < '2019-01-01' and
                                    pickup_location_id in ('1','10','14','28','35','39','70','76','91','92','93','130','132','134','138','157','194','195','197','215','216','219')
                                    GROUP BY hour_timestamp;
'''

availability = bq_client.query(sql1).to_dataframe()
active = bq_client.query(sql2).to_dataframe()

Query engine for less trips - high earnings group

In [None]:
sql3 = '''
SELECT datetime_trunc(t.dropoff_datetime, HOUR) as hour_timestamp, count(*) as taxi_avail, avg(w.avg_temp) as avg_temp,AVG(fare_amount) as avg_fare
                                    FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018 t
                                    LEFT JOIN mads-milestone-1.weather.hourly w
                                    ON CAST(TIMESTAMP_TRUNC(w.datetime, HOUR) as DATETIME) = DATETIME_TRUNC(t.dropoff_datetime, HOUR)
                                    WHERE t.dropoff_datetime > '2018-01-01' and t.dropoff_datetime < '2019-01-01' and
                                    dropoff_location_id in ('2','3','5','8','9','11','15','16','19','20','21','22','23','27','29','30','31','32','38','44','46','51','53','55','56','56','58','59','60','63','64','67','71','72','73','77','78','81','84','86','98','99','101','102','108','109','115','117','118','120','121','122','123','124','126','131','135','136','139','147','149','150','154','155','156','160','165','171','172','174','175','176','177','180','182','183','184','185','187','191','192','200','201','203','204','205','206','208','210','213','214','218','221','222','227','235','240','241','242','245','248','251','252','253','254','258','259')
                                    GROUP BY hour_timestamp;
'''

sql4 = '''
SELECT datetime_trunc(t.pickup_datetime, HOUR) as hour_timestamp, count(*) as taxi_avail, avg(w.avg_temp) as avg_temp,AVG(fare_amount) as avg_fare
                                    FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018 t
                                    LEFT JOIN mads-milestone-1.weather.hourly w
                                    ON CAST(TIMESTAMP_TRUNC(w.datetime, HOUR) as DATETIME) = DATETIME_TRUNC(t.pickup_datetime, HOUR)
                                    WHERE t.pickup_datetime > '2018-01-01' and t.pickup_datetime < '2019-01-01' and
                                    pickup_location_id in ('2','3','5','8','9','11','15','16','19','20','21','22','23','27','29','30','31','32','38','44','46','51','53','55','56','56','58','59','60','63','64','67','71','72','73','77','78','81','84','86','98','99','101','102','108','109','115','117','118','120','121','122','123','124','126','131','135','136','139','147','149','150','154','155','156','160','165','171','172','174','175','176','177','180','182','183','184','185','187','191','192','200','201','203','204','205','206','208','210','213','214','218','221','222','227','235','240','241','242','245','248','251','252','253','254','258','259')
                                    GROUP BY hour_timestamp;
'''

availability_LP = bq_client.query(sql3).to_dataframe()
active_LP = bq_client.query(sql4).to_dataframe()

In [None]:
Universe = pd.merge(availability,active,left_on = 'hour_timestamp',right_on='hour_timestamp',how = 'outer')
Universe_LP = pd.merge(availability_LP,active_LP,left_on = 'hour_timestamp',right_on='hour_timestamp',how = 'outer')

In [None]:
Universe.head()

Unnamed: 0,hour_timestamp,taxi_avail_x,avg_temp_x,avg_fare_x,taxi_avail_y,avg_temp_y,avg_fare_y
0,2018-04-14 06:00:00,314,52.0,38.587770701,217,52.0,45.224976959
1,2018-12-31 11:00:00,229,45.0,35.97371179,428,45.0,34.973831776
2,2018-08-06 15:00:00,548,85.0,37.880565693,998,85.0,37.333927856
3,2018-08-06 20:00:00,213,81.0,36.687793427,927,81.0,37.118662352
4,2018-11-04 02:00:00,46,47.0,30.72826087,21,47.0,28.047619048


In [None]:
Universe.columns = ['hour_timestamp','available','avg_temperature','Avg Fare','active_rides','_','>']
Universe_LP.columns = ['hour_timestamp','available','avg_temperature','Avg Fare','active_rides','_','>']

In [None]:
Universe = Universe[['hour_timestamp','available','avg_temperature','active_rides','Avg Fare']]
Universe['available'] = Universe.apply(lambda r: r['available'] / 2 if r['hour_timestamp'].month == 3 else r['available'], axis=1)
Universe['active_rides'] = Universe.apply(lambda r: r['active_rides'] / 2 if r['hour_timestamp'].month == 3 else r['active_rides'], axis=1)
Universe['Available_Rides_PERC'] = Universe['active_rides']/(Universe['available']+Universe['active_rides'])
Universe['Hour'] =  Universe['hour_timestamp'].apply(lambda x: x.strftime('%H'))
Universe['Freezing'] = np.where(Universe['avg_temperature']>=32,'Freezing','Moderate')

Universe_LP = Universe_LP[['hour_timestamp','available','avg_temperature','active_rides','Avg Fare']]
Universe_LP['available'] = Universe_LP.apply(lambda r: r['available'] / 2 if r['hour_timestamp'].month == 3 else r['available'], axis=1)
Universe_LP['active_rides'] = Universe_LP.apply(lambda r: r['active_rides'] / 2 if r['hour_timestamp'].month == 3 else r['active_rides'], axis=1)
Universe_LP['Available_Rides_PERC'] = Universe_LP['active_rides']/(Universe_LP['available']+Universe_LP['active_rides'])
Universe_LP['Hour'] =  Universe_LP['hour_timestamp'].apply(lambda x: x.strftime('%H'))
Universe_LP['Freezing'] = np.where(Universe_LP['avg_temperature']>=32,'Freezing','Moderate')

In [None]:
ds1 = (Universe.groupby(['Freezing','Hour']).mean()['Available_Rides_PERC']).reset_index()
ds2 = (Universe_LP.groupby(['Freezing','Hour']).mean()['Available_Rides_PERC']).reset_index()

In [None]:
ds1.columns

Index(['Freezing', 'Hour', 'Available_Rides_PERC'], dtype='object')

In [None]:
high = alt.Chart(ds1).mark_line().encode(
    x=alt.X('Hour:N',title = 'Hour of the Day'),
    y='Available_Rides_PERC:Q',
    color='Freezing:O'
)

low = alt.Chart(ds2).mark_line().encode(
    x=alt.X('Hour:N',title = 'Hour of the Day'),
    y='Available_Rides_PERC:Q',
    color=alt.Color('Freezing:O',scale=alt.Scale(domain=['Freezing','Moderate'], range=['#f37656','green'])
))

(high.properties(title = 'High Yield+ High Trips \n Availability between Freezing and Normal temperature')|low.properties(title = 'High Yield + Less Trips \n Availability between Freezing and Normal temperature')).resolve_scale(
    y='shared'
)

The temperature is not a hindrance for activity in areas of “High Yield + High Trips” zones (Ideal zone). They generally have a higher proportion of availability compared to that of One time favourites (“High Yield + Less Trip”) segment. However, during the freezing temperatures, the availability of cabs in “One time favourite” zones is higher - Inferring the need for a cab based transport in these areas


#### Availability over years

Let us also analyze the variation of taxi availability by the days on the basis of the Hourly fluctuations in the avaialbility. We didn't divide the zone information on the basis of the segments we earlier prepared. There is no visible trend between the avaialability between The Freezing day (<32 FH) vs. On a normal day (>32FH)

In [None]:
alt.data_transformers.enable('data_server')

#dropoff_cnt_by_hour = run_cached_bq('Universe')
Universe['available'] = Universe.apply(lambda r: r['available'] / 2 if r['hour_timestamp'].month == 3 else r['available'], axis=1)
Universe['active_rides'] = Universe.apply(lambda r: r['active_rides'] / 2 if r['hour_timestamp'].month == 3 else r['active_rides'], axis=1)
Universe['Available_Rides_PERC'] = Universe['available']/(Universe['available']+Universe['active_rides'])


by_hour = alt.Chart(Universe).transform_calculate(
          freezing = 'datum.avg_temperature < 32'
      ).mark_point(opacity=0.5, size=14, filled=True).encode(
          y = alt.Y('Available_Rides_PERC:Q', title='Available taxi count'),
          x = alt.X('hour_timestamp:T'),
        color=alt.Color('freezing:N')
      ).properties(
        width=1000,
        height=300,
        title = 'Avaiability over days (Hourly basis) vs. Temperature'
      )
  
by_hour

Output hidden; open in https://colab.research.google.com to view.

A granular look on the availability on adaily basis. It just reveals that the availability in general dips during the half mark of the year - during the holiday period of July - September. Apart from that at a high level there is no significant variation / differences in the availability

In [None]:
dropoff_cnt_by_day = run_cached_bq('dropoff_count_avg_temp_by_day')
dropoff_cnt_by_day['taxi_avail'] = dropoff_cnt_by_day.apply(lambda r: r['taxi_avail'] / 2 if r['dropoff_day'].month == 3 else r['taxi_avail'], axis=1)

by_day = alt.Chart(dropoff_cnt_by_day).transform_calculate(
        freezing = 'datum.avg_temp < 32'
      ).mark_point(opacity=0.5, size=20, filled=True).encode(
        y = alt.Y('taxi_avail:Q', title='Available taxi count', scale=alt.Scale(zero=False)),
        x = alt.X('dropoff_day:T'),
      color=alt.Color('freezing:N')
  ).properties(
        width=1000,
        height=300,
        title = 'Avaiability over days (Daily basis) vs. Temperature'
  )

by_day

# Final Application

The following section will give you an application interface for an end user to evaluate and see how the fares change between temperature slots (Freezing / Moderate) by various zones in New York




In [None]:
def draw_chart(df, var):

  color_dict = {'avg_fare': '#57A44C', 'avg_dist': '#26d1b2', 'avg_duration': '#d1c526', 'count': '#a44c71'}
  axis_title_dict = {'avg_fare': 'Avg. Fare ($)', 'avg_dist': 'Avg. Distance (mi)', 'avg_duration': 'Avg. Duration (mins)', 'count': 'Trip Count'}


  base = alt.Chart(df).encode(
    alt.X('day:T', axis=alt.Axis(title=None))
  )
  line1 = base.mark_point(color=f'{color_dict[var]}', opacity=0.6, filled=True).encode(
      alt.Y(f'{var}:Q', axis=alt.Axis(title=f'{axis_title_dict[var]}', titleColor=f'{color_dict[var]}'))
  )
  line2 = base.mark_line(stroke='#5276A7', opacity=0.5, interpolate='monotone').encode(
      alt.Y('avg_temp:Q', axis=alt.Axis(title='Avg. Temperature (°F)', titleColor='#5276A7'))
  )
  chart = alt.layer(line1, line2).resolve_scale(
      y = 'independent'
  ).properties(
      width=1000,
      height=300
  )

  return chart

year = widgets.Dropdown(options=['2017', '2018', '2019'], description='Year: ')
source = widgets.Dropdown(options=['JFK Airport', 'LaGuardia Airport', 'Newark Airport', 'Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens'], description='Origin: ')
dest = widgets.Dropdown(options=['JFK Airport', 'LaGuardia Airport', 'Newark Airport', 'Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens'], description='Destination: ')
tab_0 = widgets.Output()
tab_1 = widgets.Output()
tab_2 = widgets.Output()
tab_3 = widgets.Output()
tab_contents = ['Fare', 'Trip distance', 'Trip duration', 'Trip count']
item_layout = widgets.Layout(margin='20px 0 0 0')
tab = widgets.Tab([tab_0, tab_1, tab_2, tab_3], layout=item_layout)
for i in range(len(tab_contents)):
    tab.set_title(i, tab_contents[i])

@widgets.interact_manual(
  year=year,  
  source=source,
  dest=dest
)
def fill_tab(year, source, dest):

  alt.data_transformers.enable('default')

  for child_tab in tab.children:
    child_tab.clear_output()

  source = 'JFK' if source == 'JFK Airport' else 'LGA' if source == 'LaGuardia Airport' else 'EWR' if source == 'Newark Airport' else source
  dest = 'JFK' if dest == 'JFK Airport' else 'LGA' if dest == 'LaGuardia Airport' else 'EWR' if dest == 'Newark Airport' else dest

  # if (year == '2019') & (tab.selected_index == 2):
  #   children = list(tab.children)
  #   children[2] = widgets.HTML('<h2 style="color:red">Trip duration information not available for 2019</h2>')
  #   tab.children = children
  
  sql_name_soc = '_'.join([f'fare_dist_dura_avail_soc_{year}', source.replace(' ', '_'), dest.replace(' ', '_')])
  sql_name_bq = '_'.join([f'fare_dist_dura_avail_bq_{year}', source.replace(' ', '_'), dest.replace(' ', '_')])

  soc_df = run_cached_soc(sql_name_soc, year)
  
  if year != '2019':
    bq_df = run_cached_bq(sql_name_bq)
    bq_df['day'] = pd.to_datetime(bq_df['day'])

  #print(soc_df.head())
  #print(bq_df.head())

  if soc_df.shape == (0,0):
    soc_df = soc_df.reindex(columns = ['day', 'avg_fare', 'avg_dist', 'count'])

  soc_df['day'] = pd.to_datetime(soc_df['day'])
  soc_df['avg_fare'] = soc_df['avg_fare'].astype('double')
  soc_df['avg_dist'] = soc_df['avg_dist'].astype('double')
  soc_df['count'] = soc_df['count'].astype('double')

  

  weather_ddf['day'] = pd.to_datetime(weather_ddf['datetime'].dt.date)

  combined_df = weather_ddf[weather_ddf['day'].dt.year == int(year)]\
                  .merge(soc_df[['day', 'avg_fare', 'avg_dist', 'count']], how='left', on='day')

  if year != '2019':
    combined_df = combined_df.merge(bq_df[['day', 'avg_duration']], how='left', on='day')

  if year == '2018':
    combined_df['count'] = combined_df.apply(lambda r: r['count'] / 2 if r['day'].month == 3 else r['count'], axis=1)

  with tab_0:
    fare_chart = draw_chart(combined_df, 'avg_fare') 
    display(fare_chart)

  with tab_1:
    dist_chart = draw_chart(combined_df, 'avg_dist') 
    display(dist_chart)

  with tab_2:
    tab_2_content = draw_chart(combined_df, 'avg_duration') if year != '2019' else \
                        widgets.HTML('<span style="color: red">Trip duration information not available for 2019</span>')
    display(tab_2_content)

  with tab_3:
    count_chart = draw_chart(combined_df, 'count') 
    display(count_chart)

  combined_df['Type'] = np.where(combined_df['avg_temp']<=32,'Freezing','Normal')

  #print(combined_df[['Type','avg_fare','avg_dist']].groupby('Type').median().style.format('${0:,.2f}'))
  print(combined_df[['Type','avg_fare','avg_dist']].groupby('Type').median())
  print(' ')

  return alt.Chart(combined_df).mark_boxplot().encode(x = 'avg_fare:Q',y = alt.Y('Type:N')).properties(width = 250,title = 'Fare variation')|alt.Chart(
      combined_df).mark_boxplot().encode(x = 'avg_dist:Q',y = alt.Y('Type:N')).properties(width = 250,title = 'Distance')|alt.Chart(
          combined_df).mark_boxplot().encode(x = 'avg_duration:Q',y = alt.Y('Type:N')).properties(width = 250,title = 'Duration')


display(tab)


interactive(children=(Dropdown(description='Year: ', options=('2017', '2018', '2019'), value='2017'), Dropdown…

Tab(children=(Output(), Output(), Output(), Output()), layout=Layout(margin='20px 0 0 0'), _titles={'0': 'Fare…