<h1>Charts with bokeh assignment</h1>
Download the nyc taxi data for 2016 January (see below) and prepare the following charts:

<ol>
    <li>A bokeh bar chart with day of the week (Monday, Tuesday, ...) on the x-axis and the average duration of rides on the y-axis. Make sure that the hover tool is activated and that it shows the average duration when the cursor hovers over it</li>
    <li>A bokeh interactive chart with a slider containing the hour of the day (0,1,...23) and the average total amount for each hour for each day of the week. I.e., the chart should contain days of the week on the x-axis and the mean total amount on the y-axis for a particular hour of the day. Moving the slider (e.g., from 10 to 11) should replace the chart for 1000 hrs by the chart for 1100 hrs). Don't forget the tooltip</li>
    <ul><li><a href="https://docs.bokeh.org/en/latest/docs/reference/models/widgets/sliders.html">sliders</a></li>
        <li><a href="https://docs.bokeh.org/en/latest/docs/reference/models/glyphs/vbar.html">vbar</a></li>
        <li>note that column names must be strings for converting a data frame into a column data source</li>
    </ul>
    <li>A piechart that shows how much of the total payment comes from each day of the week. The pie should have seven slices, one for each day, and the size of each slice depends on the fraction it contributes to the total. Again, don't forget the tooltip</li>
    
</ol>
<li>For the purposes of this exercise, remove any taxi rides that are less than 5 minute in duration</li>

<h2>NYC taxi data</h2>
<li>NYC taxi trip data is collected and made available (yellow, green, and black cabs)</li>
<li>We'll use data from January 2016</li>
<li><a href="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet">https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet</a></li>
<li>The data is in <a href="https://parquet.apache.org/">parquet</a> format. Parquet is a data interchange format created by the <a href="https://www.apache.org/">Apache Foundation</a> for efficient data storage and retreival. Sort of like JSON but in binary</li>
<li>Use pandas <span style="color:blue">read_parquet</span> function to import the data</li>

<li>You may need to install pyarrow and fastparquet (using pip) - not sure!</li>

In [1]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#Get the data
datasource = "Resources/yellow_tripdata_2022-01.parquet"
df = pd.read_parquet(datasource, engine='pyarrow')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[us]
 2   tpep_dropoff_datetime  datetime64[us]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

<span style="color:blue">Start with a small subset of the data</span>
<br>
<li>After you've completed the assignment with the subset, you can try using all the data</li>

In [3]:
df = df.sample(frac=0.2)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 492786 entries, 1176106 to 2323811
Data columns (total 19 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   VendorID               492786 non-null  int64         
 1   tpep_pickup_datetime   492786 non-null  datetime64[us]
 2   tpep_dropoff_datetime  492786 non-null  datetime64[us]
 3   passenger_count        478587 non-null  float64       
 4   trip_distance          492786 non-null  float64       
 5   RatecodeID             478587 non-null  float64       
 6   store_and_fwd_flag     478587 non-null  object        
 7   PULocationID           492786 non-null  int64         
 8   DOLocationID           492786 non-null  int64         
 9   payment_type           492786 non-null  int64         
 10  fare_amount            492786 non-null  float64       
 11  extra                  492786 non-null  float64       
 12  mta_tax                492786 non-null  fl

<h3>Get the pickup hour (e.g., 11:20 corresponds to 11, 15:30pm corresponds to 15, etc.)</h3>

In [4]:
df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pickup_hour
1176106,1,2022-01-16 20:29:23,2022-01-16 20:32:49,0.0,0.5,1.0,N,79,234,1,4.5,3.0,0.5,2.05,0.0,0.3,10.35,2.5,0.0,20
1797606,1,2022-01-24 13:57:05,2022-01-24 14:03:42,1.0,1.1,1.0,N,68,161,1,6.0,2.5,0.5,1.85,0.0,0.3,11.15,2.5,0.0,13
173504,2,2022-01-03 18:36:21,2022-01-03 18:48:11,1.0,2.42,1.0,N,163,113,1,10.5,1.0,0.5,3.7,0.0,0.3,18.5,2.5,0.0,18
2107116,1,2022-01-27 20:24:11,2022-01-27 20:44:18,1.0,3.8,1.0,N,43,113,1,16.0,3.0,0.5,3.96,0.0,0.3,23.76,2.5,0.0,20
197227,2,2022-01-04 08:48:10,2022-01-04 08:52:30,3.0,1.14,1.0,N,48,143,2,5.5,0.0,0.5,0.0,0.0,0.3,8.8,2.5,0.0,8


<h3>Get the day of week (0-Monday, 1-Tuesday, ...)</h3>

In [5]:
df['day_of_week'] = df['tpep_pickup_datetime'].dt.dayofweek
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pickup_hour,day_of_week
1176106,1,2022-01-16 20:29:23,2022-01-16 20:32:49,0.0,0.5,1.0,N,79,234,1,...,3.0,0.5,2.05,0.0,0.3,10.35,2.5,0.0,20,6
1797606,1,2022-01-24 13:57:05,2022-01-24 14:03:42,1.0,1.1,1.0,N,68,161,1,...,2.5,0.5,1.85,0.0,0.3,11.15,2.5,0.0,13,0
173504,2,2022-01-03 18:36:21,2022-01-03 18:48:11,1.0,2.42,1.0,N,163,113,1,...,1.0,0.5,3.7,0.0,0.3,18.5,2.5,0.0,18,0
2107116,1,2022-01-27 20:24:11,2022-01-27 20:44:18,1.0,3.8,1.0,N,43,113,1,...,3.0,0.5,3.96,0.0,0.3,23.76,2.5,0.0,20,3
197227,2,2022-01-04 08:48:10,2022-01-04 08:52:30,3.0,1.14,1.0,N,48,143,2,...,0.0,0.5,0.0,0.0,0.3,8.8,2.5,0.0,8,1


<h3>Get the taxi ride duration in minutes</h3>
<li>I've done this for you</li>

In [6]:
df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime'])/np.timedelta64(1, 's')/60.0
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pickup_hour,day_of_week,duration
1176106,1,2022-01-16 20:29:23,2022-01-16 20:32:49,0.0,0.5,1.0,N,79,234,1,...,0.5,2.05,0.0,0.3,10.35,2.5,0.0,20,6,3.433333
1797606,1,2022-01-24 13:57:05,2022-01-24 14:03:42,1.0,1.1,1.0,N,68,161,1,...,0.5,1.85,0.0,0.3,11.15,2.5,0.0,13,0,6.616667
173504,2,2022-01-03 18:36:21,2022-01-03 18:48:11,1.0,2.42,1.0,N,163,113,1,...,0.5,3.7,0.0,0.3,18.5,2.5,0.0,18,0,11.833333
2107116,1,2022-01-27 20:24:11,2022-01-27 20:44:18,1.0,3.8,1.0,N,43,113,1,...,0.5,3.96,0.0,0.3,23.76,2.5,0.0,20,3,20.116667
197227,2,2022-01-04 08:48:10,2022-01-04 08:52:30,3.0,1.14,1.0,N,48,143,2,...,0.5,0.0,0.0,0.3,8.8,2.5,0.0,8,1,4.333333


<h3>Remove rides of 5 minutes or less and save in df</h3>

In [7]:
df = df[df['duration'] > 5.0]
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pickup_hour,day_of_week,duration
1797606,1,2022-01-24 13:57:05,2022-01-24 14:03:42,1.0,1.1,1.0,N,68,161,1,...,0.5,1.85,0.0,0.3,11.15,2.5,0.0,13,0,6.616667
173504,2,2022-01-03 18:36:21,2022-01-03 18:48:11,1.0,2.42,1.0,N,163,113,1,...,0.5,3.7,0.0,0.3,18.5,2.5,0.0,18,0,11.833333
2107116,1,2022-01-27 20:24:11,2022-01-27 20:44:18,1.0,3.8,1.0,N,43,113,1,...,0.5,3.96,0.0,0.3,23.76,2.5,0.0,20,3,20.116667
1719651,1,2022-01-23 11:58:39,2022-01-23 12:11:18,2.0,1.8,1.0,N,186,50,2,...,0.5,0.0,0.0,0.3,13.3,2.5,0.0,11,6,12.65
105696,2,2022-01-02 19:02:17,2022-01-02 19:11:40,1.0,1.39,1.0,N,87,144,2,...,0.5,0.0,0.0,0.3,11.3,2.5,0.0,19,6,9.383333


<h1>PROBLEM 1: Average duration by day of week bar chart</h1>

<h3>group the data by day of week</h3>

In [8]:
day_of_week_group = df.groupby('day_of_week')

<h3>Get the mean ride duration for each group</h3>
<li>And make a df out of it</li>
<li>day_of_week_mean has the day of week as the index</li>
<li>the dataframe will have seven rows with indexes 0,1,2,..7</li>
<li>add a new column with values Monday, Tuesday, Wedensday,...,Sunday</li>

In [9]:
day_of_week_mean = day_of_week_group['duration'].mean()
day_of_week_mean_df = day_of_week_mean.to_frame()
day_of_week_mean_df['day_of_week_name'] = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_of_week_mean_df

Unnamed: 0_level_0,duration,day_of_week_name
day_of_week,Unnamed: 1_level_1,Unnamed: 2_level_1
0,15.891727,Monday
1,15.77638,Tuesday
2,15.595777,Wednesday
3,16.499453,Thursday
4,16.810112,Friday
5,16.08827,Saturday
6,16.615818,Sunday


<h3>Make a column data source object from this dataframe</h3>

In [10]:
from bokeh.models import ColumnDataSource
cdata = ColumnDataSource(day_of_week_mean_df)

<h3>Draw the vertical bar chart</h3>
<li>You must include tooltips that show the duration when hovering over a bar</li>


In [11]:
text_labels = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

tooltips = [("duration (minutes)", "@duration")]


p = figure(x_range=text_labels, y_range=(0, 20), width=900, height=400, title="Average Trip Duration by Day of Week", tooltips=tooltips)
p.vbar(x='day_of_week_name', top='duration', width=0.5, source=cdata)

p.xgrid.grid_line_color = None
p.y_range.start = 0    
show(p)

<h1>PROBLEM 2: Interactive chart with slider</h1>
<li>In this second problem, construct an interactive chart that shows the distribution of total fare amount by day of week while varying the pickup_hour</li>
<li>Each chart will have day of the week on the x-axis and the average total fare as the height of the bars for a single pickup_hour</li>
<li>Construct a slider that slides from 0 to 23 with the graph for all 24 pickup_hours</li>

<h3>Group the data by day of week and, within day of week by pickup_hour</h3>

In [12]:
hour_group = df.groupby(['day_of_week', 'pickup_hour'])

<h3>Get the average total amount for each group and unstack so that rows are weekdays (0, 1,...,7) and cols are hours (0,1,...23)</h3>
<li>Then add an additional column (24) as a copy of column 0. Col 24 will be the display column</li>
<li>Finally, convert all column names into str (since pickup_hour is an int and column data source objects need str column names)</li>
<li>amount_df should like like (col names should be strings):</li>
<li>Note that your numbers may be different if you're using a random subset of the data</li>

<pre>
	0	1	2	3	4	5	6	7	8	9	...	16	17	18	19	20	21	22	23	24	dayname
day_of_week																					
0	28.519591	27.871129	21.032270	22.854089	27.553843	27.676799	22.630954	19.790608	18.589532	18.314011	...	19.823463	19.087813	19.056134	19.880450	20.452326	22.545119	23.010316	25.220471	28.519591	Monday
1	26.523835	24.473547	22.464758	25.709178	24.027132	23.652944	21.546370	18.771057	17.414492	17.255911	...	19.631683	19.094055	18.343164	19.008278	19.145718	19.704968	20.285164	21.180154	26.523835	Tuesday
2	22.662570	23.111039	23.067922	19.263433	25.915858	25.043071	19.286858	17.697268	17.354702	16.875423	...	20.199947	18.939048	18.146021	18.688651	18.839771	18.879133	19.636418	19.631235	22.662570	Wednesday
3	20.806747	20.891364	20.104057	21.230155	21.545217	23.838166	19.245900	17.484051	17.593239	17.560638	...	19.601307	19.309099	18.675074	19.065926	18.602721	18.435254	18.848939	18.878703	20.806747	Thursday
4	19.091578	18.271015	19.781767	19.620808	23.030823	25.265687	21.332188	19.119613	18.374634	18.916849	...	19.979930	19.077043	18.743151	18.467401	17.985403	17.955496	17.998007	18.500657	19.091578	Friday
5	18.792271	18.033738	18.594487	19.076232	20.591734	23.261181	27.161993	21.153212	19.545850	17.098222	...	18.881910	18.955416	18.305366	17.865027	18.519547	19.021029	19.285096	18.937453	18.792271	Saturday
6	18.807702	18.348061	18.054653	19.275509	20.891784	28.260720	27.280063	23.220415	21.592732	18.765725	...	20.551586	20.630724	19.809704	20.566784	21.532079	22.483101	24.575294	27.071233	18.807702	Sunday
</pre>

In [13]:
# Get the average total amount for each group and 
# unstack so that rows are weekdays (0, 1,...,7) and cols are hours (0,1,...23)
amount_df = hour_group['duration'].mean().unstack()
# Then add an additional column (24) as a copy of column 0. 
amount_df[24] = amount_df[0]
# Col 24 will be the display column
amount_df['display'] = amount_df[24]
# convert all column names into strings
amount_df.columns = amount_df.columns.astype(str)
amount_df

pickup_hour,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,display
day_of_week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,18.049238,27.3598,15.976736,18.383931,17.487004,17.288159,17.915314,17.858229,17.069418,16.614094,...,16.348081,15.390303,14.068908,13.915666,13.851813,14.485094,16.636067,15.49516,18.049238,18.049238
1,16.124437,18.202364,15.520521,16.023667,17.984394,20.337452,15.685622,18.232362,15.265637,15.012828,...,15.930923,16.214236,14.435918,13.535042,14.989885,14.55488,15.689651,14.154836,16.124437,16.124437
2,21.444849,18.467886,13.756227,15.886749,16.891538,17.439305,14.960484,15.986224,16.684339,14.565289,...,16.2431,15.939672,14.223405,14.618678,13.18671,14.156029,13.71947,14.548195,21.444849,21.444849
3,19.403326,33.078972,20.028233,14.801239,17.602363,17.081398,16.981927,17.638199,15.677472,15.674687,...,18.438052,16.134919,14.675151,14.84305,15.653557,13.479004,14.839508,14.262186,19.403326,19.403326
4,15.760471,17.157421,13.840933,14.435125,15.570018,16.232178,22.48271,17.173988,17.80528,17.734696,...,17.882223,16.659422,15.424635,15.441626,15.38052,14.333005,14.923418,14.46911,15.760471,15.760471
5,16.294874,16.226456,17.434032,18.399291,16.973391,24.972667,17.352157,14.153814,13.370474,14.576004,...,16.672825,15.198327,16.386877,15.299883,15.394676,16.254473,16.987469,15.838998,16.294874,16.294874
6,15.676627,18.488662,17.166321,18.757952,13.739083,17.045711,14.653479,18.469787,18.00646,16.292698,...,18.349429,16.93134,16.700512,17.75061,16.352684,17.253157,17.681631,18.113941,15.676627,15.676627


<h3>Draw the interactive chart by filling in the code below</h3>
<li>Mostly done. You need to fill in the missing parts identified by ??)</li>

In [14]:
from bokeh.models import Slider, CustomJS, CustomJSTickFormatter
from bokeh.layouts import row

source = ColumnDataSource(amount_df)

#Average Total Fare. Note the formatting so that the values
# show up currency formatted
tooltips = [
    ("Average Total Fare", "$@display{0,0.00}"),
]

p = figure(height=400, 
           width=600,
           x_axis_label = "Day of Week",
           y_axis_label = "Average Total Fare",
           title="Chart",
           tooltips=tooltips,
           min_border_left = 50,
           min_border_right = 50)

p.vbar(x='day_of_week', top='display', source=source, width=0.9,
      fill_color='red', line_color='black',fill_alpha = 0.75,
      hover_fill_alpha = 1.0, hover_fill_color = 'navy')

p.xgrid.grid_line_color = None
p.y_range.start = 0    

# Set tick positions 
p.xaxis.ticker = [0,1,2,3,4,5,6]

# Set tick labels
p.xaxis.major_label_overrides = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}

# Set axis labels
p.xaxis.formatter = CustomJSTickFormatter(code="""
    var labels = %s;
    return labels[tick];
""" % text_labels)

slider = Slider(start=0, end=24, value=0, step=1, title="Pickup Hour")


jscallback = CustomJS(args={'source':source,'slider':slider},code="""
        console.log(' changed selected option', slider.value);

        var data = source.data;
        var col = String(slider.value);
        console.log(' changed selected option', col);
        data['display'] = data[col];

        source.change.emit();
""")


slider.js_on_change('value', jscallback)

layout = row(p,slider)
show(layout)

<h1>PROBLEM 3: Piechart</h1>
<li>Use the total_amount column</li>
<li>Use the grouped by day of week data</li>
<li>Sum the total amount for each group and then compute the fractional amount for each day</li>
<li>Using the class notebook piechart as a guide, construct the piechart for distribution of total amount collected by day of week</li>

In [15]:
from bokeh.palettes import Turbo256
from bokeh.transform import cumsum

total_amount_df = day_of_week_group['total_amount'].sum().to_frame()

total_amount_df['day_of_week_name'] = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

total_amount_df['pct'] = ((total_amount_df['total_amount'] / total_amount_df['total_amount'].sum()) * 100.0).round(2)

total_amount_df['angle'] = total_amount_df['pct'] / (total_amount_df['pct'].sum()) * 2 * np.pi

import random 
colors = [Turbo256[random.randint(0,255)] for i in range(len(total_amount_df))]
total_amount_df['color'] = colors

tooltips = [
    ("Day of Week", "@day_of_week_name"),
    ("Total Amount", "$@total_amount{0,0.00}"),
    ("Percent", "@pct{0.00}%")
]

cdata = ColumnDataSource(total_amount_df)

p = figure(height=350, 
           title="Total Amount by Day of Week", 
           tools="hover",
           tooltips=tooltips,
           x_range=(-0.5, 1.0))

p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='day_of_week_name', source=cdata)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)