#  <center>Bicycle Share Use in the greater San Francisco Bay area</center>
##  <center>Explanatory Data Visualization - by Donia Djerbi</center>





## Investigation Overview


In this investigation, I looked at bicycle hire data to understand why users choose a specific station. The main focus was on trip duration and distance, the start hour, the day of the week, and the user type.


## Dataset Overview

The data consists of the trip features of 176114 bicycle hires. The attributes include the trip duration, the station name, the start hour and the day of the week, as well as additional variables such as the user type and the travelled distance.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_theme()
import plotly.express as px
import statistics

# Import library and module to calculate the distance
import geopy.distance
from geopy.distance import geodesic

%matplotlib inline
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [2]:
# load the dataset into a pandas DataFrame
df = pd.read_csv('201902-fordgobike-tripdata.csv')

In [3]:
# Convert the member birth year to int and fill the NaN values with 1984 (the mean birth year)
df['member_birth_year'] = pd.to_numeric(df['member_birth_year'], errors='coerce').fillna(1984).astype(int)

In [4]:
# Convert start_time to datetime
df['start_time'] = pd.to_datetime(df['start_time'])

In [5]:
# Find the corresponding start hour and day of the week
df['start_hour'] = df.start_time.dt.hour
df['start_dayofweek'] = df.start_time.dt.day_name()
df['Year'] = df.start_time.dt.year

In [6]:
df['Date'] = df.start_time.dt.date
df['Date'] = pd.to_datetime(df['Date'])

In [7]:
# Define the order of days
cat_type = pd.api.types.CategoricalDtype(categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], ordered=True)
df['start_dayofweek']  = df['start_dayofweek'].astype(cat_type)

In [8]:
# Find the age of the users
df['Age'] = 2019 - df['member_birth_year']

In [9]:
# Convert member_gender and user_type to category data type
df['member_gender'] = df['member_gender'].astype('category')
df['user_type'] = df['user_type'].astype('category')
df['bike_share_for_all_trip'] = df['bike_share_for_all_trip'].astype('category')

In [10]:
# Get the ride duration in minutes
df['duration_min'] = df['duration_sec'] / 60

# Get the ride duration in hours
df['duration_hour'] = df['duration_min'] / 60

In [11]:
# Convert bike_id to str type
df['bike_id'] = df['bike_id'].astype(str)

In [None]:
def calc_distance(x):
    start = (x.start_station_latitude, x.start_station_longitude)
    end = (x.end_station_latitude, x.end_station_longitude)
    return geopy.distance.geodesic(start, end ).km
df['distance'] = df.apply(calc_distance, axis = 1)
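
The cell above measures the straight-line distance between the start and end stations, not the route actually ridden. As a rough illustration of what such a distance is, here is a minimal sketch using the haversine formula in plain Python. This is a swapped-in technique: `geodesic` works on an ellipsoid model of the Earth, while haversine assumes a sphere, so the two results differ slightly. The function name, the radius constant and the coordinates are illustrative assumptions, not values from the notebook or the dataset.

```python
import math

# Mean Earth radius in km; the spherical model is an assumption of this sketch
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical start and end coordinates (not taken from the dataset)
start = (37.7765, -122.4175)
end = (37.7766, -122.3953)
print(round(haversine_km(*start, *end), 2), "km")
```

Because the formula only needs elementary operations, it can also be applied to whole columns at once, which may be faster than a row-wise `apply` on a large table.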

In [None]:
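# Data wrangling: remove trips with inconsistent or missing data
# Drop riders aged 80 or older
df.drop(df[(df['Age'] >= 80)].index, inplace=True)
# Drop trips with zero distance (identical start and end coordinates)
df = df[df.distance != 0].copy()
# Average speed in km/h; drop trips at or below 1.931285 km/h
df['speed'] = df['distance'] / df['duration_hour']
df.drop(df[(df['speed'] <= 1.931285)].index, inplace=True)
# Drop one specific row by its index label
df.drop([112038], inplace=True)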
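
To make the distance and speed filters concrete, here is a minimal sketch in plain Python that applies the same two rules to a handful of made-up trips. Only the threshold `1.931285` comes from the cell above; the helper `keep_trip` and the sample trips are hypothetical, and the age filter is left out.

```python
MIN_SPEED_KMH = 1.931285  # threshold used in the wrangling cell

# Hypothetical trips as (distance in km, duration in seconds)
trips = [(1.8, 540), (0.0, 300), (0.05, 3600), (3.2, 900)]

def keep_trip(distance_km, duration_sec):
    """Return True if the trip passes the zero-distance and minimum-speed filters."""
    if distance_km == 0:
        return False
    speed_kmh = distance_km / (duration_sec / 3600)
    return speed_kmh > MIN_SPEED_KMH

kept = [trip for trip in trips if keep_trip(*trip)]
print(kept)
```

The second trip is dropped because it has zero distance, and the third because 0.05 km in one hour is far below the speed threshold.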

In [None]:
# Reorder columns and keep only the necessary ones
df = df[['duration_min', 'distance','Date', 'start_dayofweek','start_hour','start_station_name','user_type','member_gender','bike_share_for_all_trip','Age','bike_id']]

## Distribution of trip duration in minutes

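The mean duration of bike rides is 10 minutes, while the median is 8.4 minutes, so the distribution is skewed to the right. Plotted on a logarithmic x-axis, the histogram looks approximately normal (that is, the durations are roughly log-normal), with a standard deviation of 7.1 minutes.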
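
A mean that sits above the median is the usual sign of a right-skewed distribution: a few long rides pull the mean up. The sketch below illustrates this on a small made-up sample (the values are hypothetical, not taken from the dataset) and also shows the geometric mean, which is the average taken on the log scale and tends to stay closer to the median for data of this shape.

```python
import statistics

# Hypothetical ride durations in minutes, with one long ride
durations_min = [3, 5, 6, 8, 9, 11, 14, 35]

mean = statistics.mean(durations_min)
median = statistics.median(durations_min)
geo_mean = statistics.geometric_mean(durations_min)  # mean computed on the log scale

print(f"Mean = {mean}")
print(f"Median = {median}")
print(f"Geometric mean = {geo_mean:.2f}")
```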

In [None]:
m = statistics.mean(df['duration_min'] )
std = statistics.stdev(df['duration_min'] )
median = statistics.median(df['duration_min'] )
print(f"Mean =  {m}")
print(f"Standard deviation =  {std}")
print(f"Median =  {median}")

In [None]:
# Plot the trip duration on a log scale
log_binsize = 0.1
bins = 10 ** np.arange(np.log10(df['duration_min'].min()), np.log10(df['duration_min'].max())+log_binsize, log_binsize)

plt.figure(figsize=[5, 5])
plt.hist(data = df, x = 'duration_min', bins = bins, edgecolor='k', color ='#15616d')
# Mark the mean (red), the median (green) and one standard deviation either side of the mean (yellow)
plt.axvline(m, color ='r', linestyle='dashed', label='Mean')
plt.axvline(median, color ='g', linestyle='dashed', label='Median')
plt.axvline(m+std, color ='y', linestyle='dashed', label='Mean ± 1 std')
plt.axvline(m-std, color ='y', linestyle='dashed')
plt.xscale('log')
plt.title('Distribution of bike ride duration in minutes')
plt.xticks([1, 10, 100] , ['1', '10', '100'])
plt.xlabel('Trip duration (min)')
plt.legend()
plt.show()
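
The bin edges in the cell above are equally spaced in log10 units, so each bin is wider than the previous one by a constant factor. The sketch below reproduces that idea in plain Python. The helper `log_bin_edges` is a hypothetical name, and it is not guaranteed to return exactly the same number of edges as the `np.arange` expression.

```python
import math

def log_bin_edges(lo, hi, log_binsize=0.1):
    """Bin edges from lo up to at least hi, equally spaced on a log10 scale."""
    start = math.log10(lo)
    span = math.log10(hi) - start
    # round() guards against floating-point noise before taking the ceiling
    n_steps = math.ceil(round(span / log_binsize, 9))
    return [10 ** (start + i * log_binsize) for i in range(n_steps + 1)]

edges = log_bin_edges(1, 100)
print(len(edges), "edges")
print("first bin width:", round(edges[1] - edges[0], 3))
print("last bin width:", round(edges[-1] - edges[-2], 3))
```

Each edge is the previous one multiplied by `10 ** log_binsize`, which is why the bars have equal width once the x-axis is switched to a log scale.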

## Top busiest stations
> The most frequent start station is **Market St at 10th St**, with about 3800 bicycle hires over the month, followed by **San Francisco Caltrain Station 2**, with about 3500.


In [None]:
top_stations = pd.DataFrame(df.groupby(['start_station_name']).count()
                            ['bike_id'].sort_values(ascending = False)
                            .head(10)).reset_index()

In [None]:
Stations = px.histogram(
    top_stations, x="bike_id", y="start_station_name",
    color_discrete_sequence=['#005f69'],
    title="Total number of rides per station", text_auto='.2s',
    labels={"start_station_name": "Start station", "bike_id": "Number of Trips"},
)
Stations.update_layout(
    yaxis={'categoryorder': 'total ascending'},
    xaxis={'visible': False, 'showticklabels': False},
    yaxis_title=None,
)
Stations.update_traces(textfont_size=12, textangle=0,
                       textposition="outside", cliponaxis=False)
Stations.show()

## Busiest day of the week 
> Weekdays have the highest number of bicycle hires. Thursday is the busiest day, with between 8973 and 9498 hires per day.
> In general, one can notice a moderate drop on weekends compared to weekdays.


In [None]:
# observed=True keeps only the (date, weekday) pairs that actually occur
Date = pd.DataFrame(df.groupby(['Date', 'start_dayofweek'], observed=True).count()['bike_id']).reset_index()

In [None]:
Day = px.scatter(Date, x="Date", y="start_dayofweek", color = 'start_dayofweek',
                 color_discrete_sequence=["#00308F", "#72A0C1", '#89CFF0', "#5F9EA0", "#00CED1",'#008E97', "#005f69"],

           labels={"bike_id": "Number of rides"},
           size="bike_id",size_max=30,
)
#Day.update_layout(yaxis={'categoryorder':'total descending',},
                   #xaxis={'visible': True, 'showticklabels': True},
                  #yaxis_title=None)

Day.update_layout(
    title='Total number of bicycle hires per day in February 2019',
    yaxis={'visible': False, 'showticklabels': True},
    yaxis_title=None,
    xaxis_title=None,
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)
Day.show()
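
One detail that makes the weekday comparison fair: February 2019 has 28 days, so every weekday occurs exactly four times and no day gets an extra date. A minimal check using only the standard library:

```python
import calendar
import datetime
from collections import Counter

year, month = 2019, 2
n_days = calendar.monthrange(year, month)[1]  # number of days in the month

# Count how often each weekday name occurs in the month
weekday_counts = Counter(
    calendar.day_name[datetime.date(year, month, day).weekday()]
    for day in range(1, n_days + 1)
)
print(weekday_counts)
```

For a month with 30 or 31 days the counts would differ, and comparing average hires per day would be safer than comparing monthly totals.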

## Top stations by trip duration and distance
The five busiest stations all have a similar average trip duration of about 10 minutes. The average distance differs, however: Berry St at 4th St has the longest average distance, followed by Market St at 10th St.



In [None]:
# Names of the five busiest start stations
Top_stations = df['start_station_name'].value_counts()[:5].index
fig, ax = plt.subplots(nrows = 2, figsize = [12,10])

# Draw each boxplot on its own axes
sb.boxplot(data = df, x='duration_min', y='start_station_name', order=Top_stations, color='#84C318', ax=ax[0])
ax[0].set_title('Trip duration (min) for the busiest stations')
ax[0].set(ylabel = None)
ax[0].set(xlabel = None)

sb.boxplot(data = df, x='distance', y='start_station_name', order=Top_stations, color='#84C318', ax=ax[1])
ax[1].set_title('Trip distance (km) for the busiest stations')
ax[1].set(ylabel = None)
ax[1].set(xlabel = None);



## Distribution of start hours vs. days of the week
> During weekdays, rides tend to start in the morning around 8 am or in the evening around 5 pm, which corresponds to the rush hours. During weekends, however, bicycle hires are spread throughout the day, with a large share in the afternoon.

In [None]:
ax = sb.violinplot(data=df, x='start_dayofweek', y='start_hour', color = '#5F9EA0');
plt.title('Distribution of the start hours of trips per day of the week')
plt.ylabel('Start hour')
ax.set(xlabel = None);

## Distribution of rush hours per day for busiest stations
The figure is divided into days of the week, so we can see for each day which station is the busiest and at what hour. On Mondays, for instance, San Francisco Caltrain Station 2 has the highest number of bicycle hires, especially between 6 and 9 am. Enough bicycles should therefore be available at this station during those hours.

In [None]:
rush_hour_perday_perstation = df.groupby(['start_station_name', 'start_dayofweek','start_hour'])['bike_id'].count()
rush_hour_perday_perstation = rush_hour_perday_perstation.reset_index(name = 'count')
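
The grouped counts can also be reduced to a single peak hour per station and day, which is the quantity the text reads off the chart. Here is a minimal sketch of that reduction in plain Python. The station names and rides are hypothetical, not taken from the dataset.

```python
from collections import Counter

# Hypothetical rides as (station, day, start hour)
rides = [
    ("Station A", "Monday", 8), ("Station A", "Monday", 8),
    ("Station A", "Monday", 17), ("Station B", "Monday", 9),
    ("Station B", "Monday", 9), ("Station B", "Monday", 9),
    ("Station A", "Saturday", 14),
]

# Number of rides per (station, day, hour), like the groupby count above
counts = Counter(rides)

# Keep the hour with the most rides for each (station, day)
peak = {}
for (station, day, hour), n in counts.items():
    best = peak.get((station, day))
    if best is None or n > best[1]:
        peak[(station, day)] = (hour, n)

for key, (hour, n) in peak.items():
    print(key, "peak hour:", hour, "rides:", n)
```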

In [None]:
# Sort the start hours so the hue legend runs from early to late
times = sorted(df.start_hour.unique())
g = sb.FacetGrid(rush_hour_perday_perstation, col="start_dayofweek",
                 hue="start_hour", hue_order=times,
                 aspect=2, col_wrap=2, palette='tab20')
g.map(sb.barplot, 'count', 'start_station_name', order=Top_stations).add_legend()
g.figure.subplots_adjust(top=0.9)
g.figure.suptitle("Distribution of rush hours per day for busiest stations", fontsize=24, fontweight="bold")
g.set(ylabel = None)
g.set(xlabel = None);

In [None]:
!jupyter nbconvert Explanatory_Data_Visualization.ipynb --to slides --post serve --no-input --no-prompt