<a href="https://colab.research.google.com/github/Joyakis/DATA_VIZ/blob/main/BOKEH_TAXIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PROJECT BY JOY AKINYI
## (NYC TAXI DATA SET)

### INTRODUCTION

### Dataset Overview

> This project uses Bokeh to visualise and analyse the NYC yellow taxi trip dataset for January 2022. The aim is to draw insights from it.

### Preliminary Wrangling

In [3]:
#Importing necessary libraries
import bokeh
from bokeh.models import ColumnDataSource
import pandas as pd
import numpy as np
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import NumeralTickFormatter
output_notebook()
from bokeh.palettes import Spectral7
from bokeh.palettes import viridis
from bokeh.transform import linear_cmap
from bokeh.palettes import Spectral5

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:

df=pd.read_parquet("/content/drive/MyDrive/yellow_tripdata_2022-01.parquet")

In [8]:
# Print the shape of the DataFrame
print('Shape:')
print(df.shape)

# Print the info of the DataFrame
print('Info:')
print(df.info())

Shape:
(2463931, 19)
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       


### What is the structure of the dataset?


The dataframe has 2,463,931 rows and 19 columns. Most of the columns have the float64 datatype.

In [9]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0


### What are your main features of interest?

> The tpep_pickup_datetime and tpep_dropoff_datetime columns are definitely at the top of my list, because I can extract the hour of the day and the day of the week from them and so see day-to-day trends. I would also like to know at what time most passengers take taxis. PULocationID and DOLocationID are of interest too, since they show the popular pickup and drop-off locations.
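
The hour and weekday extraction described above can be sketched as follows. This is a minimal illustration on a hypothetical three-row sample, not the full dataset; the column names `pickup_hour` and `pickup_day` are assumptions introduced here for the example.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the taxi DataFrame
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-01 00:35:40",
        "2022-01-03 08:15:00",
        "2022-01-03 17:45:10",
    ])
})

# Hour of day (0-23) and weekday name derived from the pickup timestamp
sample["pickup_hour"] = sample["tpep_pickup_datetime"].dt.hour
sample["pickup_day"] = sample["tpep_pickup_datetime"].dt.day_name()

# Trips per hour, ordered by hour rather than by count
trips_per_hour = sample["pickup_hour"].value_counts().sort_index()
print(trips_per_hour)
```

Sorting by the hour index keeps the result in clock order, which is usually what a time-of-day plot needs.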

### Data Cleaning

Here we check the data for missing and duplicated values and clean it.

In [10]:
df.isnull().sum()

VendorID                     0
tpep_pickup_datetime         0
tpep_dropoff_datetime        0
passenger_count          71503
trip_distance                0
RatecodeID               71503
store_and_fwd_flag       71503
PULocationID                 0
DOLocationID                 0
payment_type                 0
fare_amount                  0
extra                        0
mta_tax                      0
tip_amount                   0
tolls_amount                 0
improvement_surcharge        0
total_amount                 0
congestion_surcharge     71503
airport_fee              71503
dtype: int64

passenger_count, RatecodeID, store_and_fwd_flag, congestion_surcharge and airport_fee each have 71,503 missing values. We need to decide whether to impute them or drop the affected rows.
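
Before choosing between imputing and dropping, it can help to check what share of rows is affected and whether the nulls fall in the same rows. The identical count of 71,503 in all five columns suggests they do, and that count is roughly 2.9% of the 2,463,931 rows. The snippet below is a small sketch on a hypothetical sample; `cols_with_nulls` lists only two columns to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the last row is missing several fields at once
sample = pd.DataFrame({
    "passenger_count": [1.0, 2.0, np.nan],
    "RatecodeID": [1.0, 1.0, np.nan],
    "fare_amount": [8.0, 14.5, 20.0],
})

cols_with_nulls = ["passenger_count", "RatecodeID"]

# Rows with at least one missing value among the listed columns
rows_any_null = sample[cols_with_nulls].isnull().any(axis=1)
# Rows where every listed column is missing
rows_all_null = sample[cols_with_nulls].isnull().all(axis=1)

share_missing = rows_any_null.mean()
same_rows = bool((rows_any_null == rows_all_null).all())
print(f"{share_missing:.1%} of rows affected; nulls co-occur: {same_rows}")
```

If the nulls co-occur and the affected share is small, dropping those rows loses little information.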

In [11]:
df.duplicated().sum()

0

This dataset has no duplicated rows.

In [12]:
#dropping null values
df.dropna(inplace=True)

## UNIVARIATE EXPLORATION

#### 1. Distribution of passenger counts

In [13]:
#Using numpy histogram function to create the histogram data, and then convert it to a pandas DataFrame
hist, edges = np.histogram(df['passenger_count'], bins=10)
df_hist = pd.DataFrame({'count': hist, 'left': edges[:-1], 'right': edges[1:]})

# Define the figure and add axis labels
p = figure(title="Passenger Count Distribution", x_axis_label="Passenger Count", y_axis_label="Frequency")

# Add the histogram
p.quad(source=df_hist, bottom=0, top='count', left='left', right='right', alpha=0.5)
# Format the y-axis tick labels
p.yaxis.formatter = NumeralTickFormatter(format="0.00a")
p.y_range.start = 0

# Show the plot
show(p)



> From the plot above, we can conclude that most taxi rides carry one or two passengers. Price may be one of the factors behind this. The pattern can help taxi companies that are interested in understanding travel patterns or optimizing their services.
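
Passenger counts are whole numbers, so ten equal-width bins may not line up with the integer values. One possible alternative, sketched here on a hypothetical array (`counts_sample` is made up for the example), is to centre one bin on each integer.

```python
import numpy as np

# Hypothetical passenger counts
counts_sample = np.array([1, 1, 2, 1, 0, 3, 2, 1])

# Bin edges at -0.5, 0.5, 1.5, ... so that each bin is centred on one integer
edges = np.arange(counts_sample.min() - 0.5, counts_sample.max() + 1.5, 1.0)
hist, edges = np.histogram(counts_sample, bins=edges)

# hist[i] is the number of rides with exactly i passengers (starting from the minimum)
print(hist)
```

The resulting `hist` and `edges` can be passed to `p.quad` in the same way as in the cell above.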

#### 2. NUMBER OF PICKUPS PER DAY OF THE WEEK

In [14]:
# convert pickup_datetime to pandas datetime format
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])

# extract day of the week from pickup_datetime and convert to string format
df['day_of_week'] = df['tpep_pickup_datetime'].dt.strftime('%A')

# create histogram of number of pickups per day
counts = df['day_of_week'].value_counts()
days = counts.index.tolist()

p = figure(x_range=days, height=350, title="Number of Pickups per Day",
           toolbar_location=None, tools="")

p.vbar(x=days, top=counts.values, width=0.9, color=Spectral7)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1.2
p.xaxis.axis_label = "Day of the Week"
p.yaxis.axis_label = "Number of Pickups"
# Format the y-axis tick labels
p.yaxis.formatter = NumeralTickFormatter(format="0.00a")

show(p)


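From the above, Monday has the highest total number of pickups. Note that January 2022 began on a Saturday, so the month contained five Saturdays, Sundays and Mondays but only four of every other weekday, which inflates the totals for those three days.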

In [15]:
counts = df['day_of_week'].value_counts()
counts

Monday       363818
Saturday     352417
Friday       350891
Thursday     346565
Sunday       331893
Wednesday    330436
Tuesday      316408
Name: day_of_week, dtype: int64
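
Because the weekdays do not all occur the same number of times in a month, an average per calendar date can be a fairer comparison than the raw totals. The following is a rough sketch on a hypothetical sample; the variable names are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical pickups: two Saturdays and one Monday appear in the sample
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-01 09:00", "2022-01-01 10:00",   # Saturday
        "2022-01-08 09:00",                       # Saturday
        "2022-01-03 09:00", "2022-01-03 18:00",   # Monday
    ])
})

pickups = sample["tpep_pickup_datetime"]

# Number of trips on each calendar date (time of day stripped)
trips_per_date = pickups.dt.normalize().value_counts()

# Average trips per date, grouped by the weekday name of each date
weekday_names = trips_per_date.index.day_name()
avg_per_weekday = trips_per_date.groupby(weekday_names).mean()
print(avg_per_weekday)
```

Applied to the full dataset, this divides each weekday's total by the number of dates on which that weekday actually occurs.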

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2392428 entries, 0 to 2392427
Data columns (total 20 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

#### 3. NUMBER OF DROP OFFS PER DAY OF THE WEEK

In [17]:
output_notebook()


# extract day of the week from dropoff_datetime and convert to string format
df['day_of_week_drop'] = df['tpep_dropoff_datetime'].dt.strftime('%A')

# create bar chart of number of drop-offs per day
counts = df['day_of_week_drop'].value_counts()
days = counts.index.tolist()

p = figure(x_range=days, height=350, title="Number of Drop offs per Day",
           toolbar_location=None, tools="")

p.vbar(x=days, top=counts.values, width=0.9, color=Spectral7)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1.2
p.xaxis.axis_label = "Day of the Week"
p.yaxis.axis_label = "Number of Dropoffs"
# Format the y-axis tick labels
p.yaxis.formatter = NumeralTickFormatter(format="0.00a")

show(p)

Taxis seem to be really busy on Mondays! Again, Monday has the highest number of drop-offs.

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2392428 entries, 0 to 2392427
Data columns (total 21 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

In [19]:
df['fare_amount'].describe()

count    2.392428e+06
mean     1.280723e+01
std      2.595991e+02
min     -4.800000e+02
25%      6.500000e+00
50%      9.000000e+00
75%      1.350000e+01
max      4.010923e+05
Name: fare_amount, dtype: float64

In [20]:
r=df['RatecodeID'].unique()
len(r)

7

#### 4. FREQUENCY OF RATECODEID

In [21]:


# Count trips per rate code once so that labels and counts stay aligned
# (unique() and value_counts() return their results in different orders)
rate_counts = df['RatecodeID'].value_counts()
ratecodeid = rate_counts.index.astype(str).tolist()
r_count = rate_counts.values

source = ColumnDataSource(data=dict(ratecodeid=ratecodeid, r_count=r_count,
                                    color=list(Spectral7[:len(ratecodeid)])))
p = figure(y_range=ratecodeid, width=600, height=450, title="Number of trips for each rate code",
           toolbar_location=None, tools="")
p.hbar(y='ratecodeid', right='r_count', height=0.7, color='color', source=source)
# Start the x-axis at zero and let Bokeh fit the upper bound to the data
p.x_range.start = 0
# Format the x-axis tick labels
p.xaxis.formatter = NumeralTickFormatter(format="0.00a")

p.xaxis.axis_label="Number of trips"
p.yaxis.axis_label="Rate Code"

show(p)

> The rate code with the highest number of trips is 1, as seen in the plot above. The TLC data dictionary lists code 1 as the standard rate. We will look for other contributing factors as the analysis continues.

#### 5. DISTRIBUTION OF PAYMENT TYPES

In [22]:
#Creating a dictionary of payment names
payment_names = {1: 'Credit card', 2: 'Cash', 3: 'No charge', 4: 'Dispute', 5: 'Unknown'}
#Creating a new column that maps the payment names
df['payment_name'] = df['payment_type'].map(payment_names)

# Then, you can create a bar chart of the payment names like this:
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

source = ColumnDataSource(df.groupby('payment_name').size().reset_index(name='count'))

p = figure(x_range=source.data['payment_name'],height=350, title="Payment Type Counts")
p.vbar(x='payment_name', top='count', width=0.9, source=source)

p.xgrid.grid_line_color = None
p.y_range.start = 0
# Format the y-axis tick labels
p.yaxis.formatter = NumeralTickFormatter(format="0.00a")
p.xaxis.axis_label="Payment type"
p.yaxis.axis_label="Count"


show(p)


> Most people prefer to pay for their trips by credit card. This is represented as 1 in the payment_type column.
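
To put a number on that preference, the counts can be turned into shares. This is a minimal sketch on a hypothetical four-trip sample, reusing the same `payment_names` mapping as the cell above.

```python
import pandas as pd

payment_names = {1: 'Credit card', 2: 'Cash', 3: 'No charge', 4: 'Dispute', 5: 'Unknown'}

# Hypothetical payment codes for four trips
sample = pd.DataFrame({"payment_type": [1, 1, 2, 1]})

# normalize=True returns fractions of the total instead of raw counts
shares = sample["payment_type"].map(payment_names).value_counts(normalize=True)
print(shares)
```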

In [23]:
df['trip_distance'].min()

0.0

In [24]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,day_of_week,day_of_week_drop,payment_name
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,...,0.5,3.65,0.0,0.3,21.95,2.5,0.0,Saturday,Saturday,Credit card
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,...,0.5,4.0,0.0,0.3,13.3,0.0,0.0,Saturday,Saturday,Credit card
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,...,0.5,1.76,0.0,0.3,10.56,0.0,0.0,Saturday,Saturday,Credit card
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,...,0.5,0.0,0.0,0.3,11.8,2.5,0.0,Saturday,Saturday,Cash
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,...,0.5,3.0,0.0,0.3,30.3,2.5,0.0,Saturday,Saturday,Credit card


In [25]:
df['VendorID'].value_counts().index

Int64Index([2, 1], dtype='int64')

#### 6. VENDOR ID DISTRIBUTION

In [26]:
# Count the number of users for each vendorID
vendor_counts = df['VendorID'].value_counts()

# Create a ColumnDataSource
source = ColumnDataSource(data=dict(
    x=[str(vendor) for vendor in vendor_counts.index.tolist()],
    top=vendor_counts.values,
    fill_color=viridis(len(vendor_counts)),
))

# Create a new plot
p = figure(x_range=source.data['x'], height=350, title='Number of Trips by VendorID')

# Add the bar chart glyph using the source
p.vbar(x='x', top='top', width=0.9, fill_color='fill_color', source=source)

# Set the x-axis label and orientation
p.xaxis.axis_label = 'VendorID'
p.xaxis.major_label_orientation = 1.2

# Set the y-axis label and format the ticks with commas
p.yaxis.axis_label = 'Number of Trips'
p.yaxis.formatter = NumeralTickFormatter(format='0,0')

# Show the plot
show(p)


Vendor ID 2 recorded more trips than vendor ID 1, with over 1.5 million trips, as seen in the plot above.

In [27]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,day_of_week,day_of_week_drop,payment_name
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,...,0.5,3.65,0.0,0.3,21.95,2.5,0.0,Saturday,Saturday,Credit card
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,...,0.5,4.0,0.0,0.3,13.3,0.0,0.0,Saturday,Saturday,Credit card
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,...,0.5,1.76,0.0,0.3,10.56,0.0,0.0,Saturday,Saturday,Credit card
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,...,0.5,0.0,0.0,0.3,11.8,2.5,0.0,Saturday,Saturday,Cash
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,...,0.5,3.0,0.0,0.3,30.3,2.5,0.0,Saturday,Saturday,Credit card


In [28]:
df['extra'].head()

0    3.0
1    0.5
2    0.5
3    0.5
4    0.5
Name: extra, dtype: float64

#### 8. TRIP DISTANCE DISTRIBUTION

In [29]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource
import numpy as np

# extract the 'trip_distance' column from the dataframe as a numpy array of floats
distance_array = np.array(df['trip_distance'], dtype=np.float64)

# create a list of x-coordinates
x = list(range(len(distance_array)))

# create a ColumnDataSource object
source = ColumnDataSource(data=dict(x=x, distance_array=distance_array))

# create a Bokeh figure
p = figure(title="Trip Distance Line Plot", x_axis_label="Index", y_axis_label="Trip Distance")

# add a line glyph to the figure
p.line('x', 'distance_array', source=source, line_width=2)
p.xaxis.formatter = NumeralTickFormatter(format='0,0')

# show the figure
output_notebook()
show(p)


> We can tell that the longest trip distances occur in rows below index 500,000 of the trip_distance column.

#### 9. Pickup Location ID Frequency Distribution

In [None]:
# extract the 'PULocationID' column from the dataframe as a numpy array of integers
pickup_loc_array = np.array(df['PULocationID'], dtype=np.int64)

# create a histogram of pickup location IDs
hist, edges = np.histogram(pickup_loc_array, bins=50)

# create a list of x-coordinates for the bars
x = [(edges[i] + edges[i+1])/2 for i in range(len(edges)-1)]

# create a ColumnDataSource object
source = ColumnDataSource(data=dict(x=x, top=hist))

# create a Bokeh figure
p = figure(title="Pickup Location Histogram", x_axis_label="Pickup Location ID", y_axis_label="Frequency")

# add a vertical bar glyph to the figure
p.vbar(x='x', top='top', width=0.9*(edges[1]-edges[0]), source=source, 
       fill_color=linear_cmap('top', 'Viridis256', 0, max(hist)))
p.yaxis.formatter = NumeralTickFormatter(format='0,0')

# show the figure
output_notebook()
show(p)


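> The highest frequency is at around location ID 250, meaning that more people are picked up in that range of IDs than anywhere else. Each of the 50 bins spans several neighbouring IDs, so the tallest bar points to a group of zones rather than to one specific zone.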
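
Location IDs are labels rather than measurements, so counting each ID exactly can show which individual zones are busiest. The sketch below uses a hypothetical six-trip sample. The IDs in it are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical pickup zone IDs
sample = pd.DataFrame({"PULocationID": [237, 236, 237, 132, 237, 236]})

# Exact trip count per pickup zone, largest first
top_pickups = sample["PULocationID"].value_counts().head(3)
print(top_pickups)
```

The same call on `DOLocationID` would rank the drop-off zones.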

#### 10. Drop off Location ID Frequency Distribution

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.transform import linear_cmap
from bokeh.models import ColumnDataSource
import numpy as np

# extract the 'DOLocationID' column from the dataframe as a numpy array of integers
dropoff_loc_array = np.array(df['DOLocationID'], dtype=np.int64)

# create a histogram of drop-off location IDs
hist, edges = np.histogram(dropoff_loc_array, bins=50)

# create a list of x-coordinates for the bars
x = [(edges[i] + edges[i+1])/2 for i in range(len(edges)-1)]

# create a ColumnDataSource object
source = ColumnDataSource(data=dict(x=x, top=hist))

# create a Bokeh figure
p = figure(title="Drop off Location Histogram", x_axis_label="Drop Location ID", y_axis_label="Frequency")

# add a vertical bar glyph to the figure
p.vbar(x='x', top='top', width=0.9*(edges[1]-edges[0]), source=source, 
       fill_color=linear_cmap('top', 'Magma256', 0, max(hist)))
p.yaxis.formatter = NumeralTickFormatter(format='0,0')

# show the figure
output_notebook()
show(p)

> It looks like most people are also dropped off at around location ID 250, as the height of the bar shows.

## BIVARIATE EXPLORATION

A heatmap lets us quickly identify areas with high concentrations of rides, which is useful for finding popular pickup and drop-off locations.
Higher frequencies are represented by brighter colors, and lower frequencies by darker colors.

#### 11. Pickup and Dropoff Location Heatmap - 2D Histogram

In [None]:
from bokeh.models import ColorBar, LinearColorMapper

# extract the pickup and dropoff location ID columns
pickup_loc_array = np.array(df['PULocationID'], dtype=np.int32)
dropoff_loc_array = np.array(df['DOLocationID'], dtype=np.int32)

# create a 2D histogram using numpy
hist, xedges, yedges = np.histogram2d(pickup_loc_array, dropoff_loc_array, bins=50)

# create a Bokeh figure
p = figure(title="Pickup and Dropoff Locations", x_axis_label="Pickup Location ID", y_axis_label="Dropoff Location ID")

# map ride counts to colors; one shared mapper keeps the image and the color bar in sync
color_mapper = LinearColorMapper(palette="Viridis256", low=0, high=hist.max())

# add an image glyph to display the heatmap
# histogram2d puts pickups on the first axis, but image expects rows to run along y, hence the transpose
p.image(image=[hist.T], x=xedges[0], y=yedges[0], dw=xedges[-1]-xedges[0], dh=yedges[-1]-yedges[0],
        color_mapper=color_mapper)
# add a color bar legend to the figure
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12, location=(0,0), title='Number of Rides')
p.add_layout(color_bar, 'right')

# show the figure
output_notebook()
show(p)


> The figure above is consistent with the earlier histograms: most people are picked up and dropped off at approximately location ID 250.
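
The binned heatmap shows where rides concentrate, but not which exact routes are most common. One way to list them, sketched here on a hypothetical five-trip sample with made-up IDs, is to count each pickup and drop-off pair.

```python
import pandas as pd

# Hypothetical pickup and drop-off zone IDs
sample = pd.DataFrame({
    "PULocationID": [237, 237, 236, 132, 237],
    "DOLocationID": [236, 236, 237, 230, 141],
})

# Count trips for each (pickup, drop-off) pair and keep the most frequent ones
top_pairs = (
    sample.groupby(["PULocationID", "DOLocationID"])
    .size()
    .sort_values(ascending=False)
    .head(3)
)
print(top_pairs)
```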

### Features that strengthened each other in terms of looking at your feature(s) of interest

This is definitely the payment_type column, because I derived the payment_name column from it to give more insight.

### Conclusions

1. Most people have favorite pickup and drop-off locations, so some location IDs are always busy.
2. Vendor ID 2 is far more active than vendor ID 1, with a large gap between them in the number of trips.