# Ford GoBike Data System: Data Exploration
<a id='intro'></a>
## Introduction
This document explores a dataset of individual rides made in a bike-sharing system covering the greater San Francisco Bay Area. It covers the first quarter of 2020. Bay Wheels's trip data is available for public use and can be [downloaded here](https://www.lyft.com/bikes/bay-wheels/system-data).
<br/> 


### What is the structure of your dataset?
The dataset, [Ford GoBike Share](https://www.lyft.com/bikes/bay-wheels/system-data), for the first quarter of 2020 has 617,823 trip records with 14 features (start_date, end_date, start_time, end_time, duration_sec, duration_min, start_hour, end_hour, distance_km, month, dayofweek, bike_id, user_type, rental_access_method).<br/>
Most variables are numeric; some are string objects (e.g. month, dayofweek, and the date and time strings). The variables user_type and rental_access_method are categorical data with the following values. <br/>
User type: Subscriber or Customer (“Subscriber” = Member, “Customer” = Casual)<br/>
Rental access method: app or clipper (Clipper is the all-in-one transit card for the Bay Area)<br/>
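
As a rough illustration of how derived features such as `duration_min`, `start_hour`, `month`, and `dayofweek` could be produced from raw timestamps, here is a minimal sketch. The tiny `raw` frame and its column layout are assumptions made for illustration, not the wrangling code actually used for this dataset. Floor division is used for `duration_min` because it is consistent with the summary table further down (for example, a 604-second median next to a 10-minute median).

```python
import pandas as pd

# Hypothetical raw trips; the real file's layout may differ.
raw = pd.DataFrame({
    "start_time": ["2020-01-06 08:05:00", "2020-02-15 17:30:00"],
    "end_time":   ["2020-01-06 08:15:30", "2020-02-15 18:02:00"],
})

start = pd.to_datetime(raw["start_time"])
end = pd.to_datetime(raw["end_time"])

features = pd.DataFrame({
    "duration_sec": (end - start).dt.total_seconds(),
    "start_hour": start.dt.hour,
    "end_hour": end.dt.hour,
    "month": start.dt.month_name(),            # e.g. "January"
    "dayofweek": start.dt.day_name().str[:3],  # e.g. "Mon"
})
# Whole minutes, rounded down.
features["duration_min"] = features["duration_sec"] // 60
print(features)
```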



### What is/are the main feature(s) of interest in your dataset?
* When are most trips taken in terms of time of day, day of the week, or month of the year?
* How long does the average trip take?
* Does the above depend on if a user is a subscriber or customer?



### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that time of day, month, and day of week will have the strongest effect on the number of trips taken: fewer trips on weekdays and more at the weekends. The trip duration will help us understand how long the average trip takes. I also think that user type and rental access method can be used to test whether they have any effect on the above.



## Univariate Exploration

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import calendar
import datetime

%matplotlib inline
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [2]:
df=pd.read_csv('2020-1stQ-baywheels-tripdata.csv')

In [3]:
df.describe()

| | duration_sec | duration_min | start_hour | end_hour | distance_km | bike_id |
|---|---|---|---|---|---|---|
| count | 617823.0 | 617823.0 | 617823.0 | 617823.0 | 617823.0 | 617823.0 |
| mean | 787.964103 | 12.642194 | 13.722759 | 13.886833 | 1.92538 | 501905.16811 |
| std | 1356.964028 | 22.617983 | 4.771084 | 4.798867 | 1.39321 | 243637.842441 |
| min | 60.0 | 1.0 | 0.0 | 0.0 | 0.0 | 100077.0 |
| 25% | 379.0 | 6.0 | 10.0 | 10.0 | 0.92114 | 312833.0 |
| 50% | 604.0 | 10.0 | 14.0 | 15.0 | 1.616548 | 476504.0 |
| 75% | 940.0 | 15.0 | 17.0 | 18.0 | 2.631885 | 672267.0 |
| max | 811077.0 | 13517.0 | 23.0 | 23.0 | 60.61937 | 999960.0 |


### Bike Trip Duration 
The trip duration distribution shows that the most common rentals (about 40,000 trips) last roughly 5-10 minutes. According to the summary statistics, the mean rental time is about 12.6 minutes and the median is 10 minutes.

In [4]:
df.duration_sec.describe(),df.duration_min.describe()

(count    617823.000000
 mean        787.964103
 std        1356.964028
 min          60.000000
 25%         379.000000
 50%         604.000000
 75%         940.000000
 max      811077.000000
 Name: duration_sec, dtype: float64,
 count    617823.000000
 mean         12.642194
 std          22.617983
 min           1.000000
 25%           6.000000
 50%          10.000000
 75%          15.000000
 max       13517.000000
 Name: duration_min, dtype: float64)

In [None]:
binsize = 1
bins = np.arange(0, df.duration_min.max()+binsize, binsize)
plt.hist(data = df, x = 'duration_min', bins=bins);
#plt.ylim(0,100)
plt.xlim(0,10)
plt.title("Trip Duration in Minutes")
plt.xlabel('Duration (Minutes)')
plt.ylabel('Number of Bike Trips');
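
Since the maximum duration in the summary (13,517 minutes) sits far above the 75th percentile (15 minutes), the distribution has a long right tail that a linear axis hides. One possible way to view the whole range is a log-scaled x-axis. The sketch below uses synthetic data; `durations` is a made-up stand-in for `df.duration_min`, and the 0.1 bin step is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed stand-in for df.duration_min (assumption).
rng = np.random.default_rng(0)
durations = np.clip(rng.lognormal(mean=2.3, sigma=0.7, size=5000), 1, None)

# Bin edges evenly spaced in log10 units, from 1 minute up past the maximum.
log_bins = 10 ** np.arange(0, np.log10(durations.max()) + 0.1, 0.1)

plt.hist(durations, bins=log_bins)
plt.xscale('log')
plt.title('Trip Duration in Minutes (log scale)')
plt.xlabel('Duration (Minutes)')
plt.ylabel('Number of Bike Trips');
```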

### Bike rides in each month
It appears that February has the most bike rides, which was expected given the slightly warmer weather and Valentine's Day (couples going out on trips). However, trips decreased drastically in March, which is also expected given the outbreak of COVID-19.

In [None]:
plt.hist(data = df, x = 'month');
plt.xlim(0,2)
plt.title("Trips Taken by Months - 1st Quarter of 2020")
plt.xlabel('Months')
plt.ylabel('Number of Bike Trips');
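
To back the monthly comparison with numbers, the trips can also be counted per month and put in calendar order. This is a minimal sketch on a made-up sample; `months` stands in for `df['month']`, and the counts are illustrative only.

```python
import pandas as pd

# Tiny made-up sample standing in for df['month'] (assumption).
months = pd.Series(['February', 'January', 'March',
                    'February', 'January', 'February'])

month_order = ['January', 'February', 'March']
# value_counts sorts by frequency, so reindex to restore calendar order.
trips_per_month = months.value_counts().reindex(month_order)
print(trips_per_month)
```

The resulting series can be drawn with `trips_per_month.plot.bar()`, which keeps the categories in the order given.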

### Bike trips by day of week
The bar chart shows that Wednesday to Friday have the most trips. Interestingly, the weekend has the fewest! This suggests that people rent bikes mainly on working days (i.e. riding to work, which would explain the typical 10-minute trip).

In [None]:
weekday = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
sb.countplot(data=df, x='dayofweek',color=sb.color_palette()[0] ,order=weekday);
plt.ylabel('Number of Bike Trips')
plt.xlabel('Days of the week')
plt.title('Trips Taken by Day')
plt.show()

### Bike trips by hours
The two busiest hours are 5:00 pm followed by 8:00 am. Moreover, the distribution shows that the hours adjacent to these two also have a large number of bike rides. This also supports the idea that users rent bikes around working hours (i.e. for commuting to and from work).

In [None]:
binsize = 0.25
bins = np.arange(0, df.start_hour.max()+binsize, binsize)
plt.hist(data = df, x = 'start_hour', bins=bins)
plt.title("Trips Taken by Hours - 1st Quarter of 2020")
plt.xlim(0,23)
plt.xticks(range(24))
plt.xlabel('Hours')
plt.ylabel('Number of Bike Trips');

### Bike trip distances (km)
The distribution of trip distance shows that most rides cover about 1 km, followed by rides of less than 1 km. According to the summary statistics, the average trip distance is 1.93 km; 25% of rides cover 0.92 km or less and 75% cover 2.63 km or less.

In [None]:
binsize = 1
bins = np.arange(0, df.distance_km.max()+binsize, binsize)
plt.hist(data = df, x = 'distance_km', bins=bins)
plt.title("Distance Distribution of Bike Trips")
plt.xlim(0,10)
plt.xlabel('Trip Distance (km)')
plt.ylabel('Number of Bike Trips');
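
The quartiles quoted above come from `describe()`, but they can also be requested directly. A small sketch with made-up distances (the `distance_km` series below is illustrative, not the real column):

```python
import pandas as pd

# Made-up distances in km standing in for df.distance_km (assumption).
distance_km = pd.Series([0.4, 0.9, 1.2, 1.6, 2.0, 2.6, 3.5, 6.0])

# 25th, 50th and 75th percentiles (linear interpolation by default).
quartiles = distance_km.quantile([0.25, 0.5, 0.75])

# Share of rides at or below the 75th percentile.
share_below_q3 = (distance_km <= quartiles[0.75]).mean()
print(quartiles)
print(share_below_q3)
```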

### The rental access methods used most of the bike trips
92.33% of bike trips were rented through the app and only 7.67% with a Clipper card (transit card).

In [None]:
common_method=df.rental_access_method.value_counts()/df.shape[0]*100
common_method
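
Dividing by `df.shape[0]` gives shares of all rows, which only sum to 100% when the column has no missing values. As a hedged alternative, `value_counts(normalize=True)` computes shares of the non-missing values. The sketch below uses a made-up series with two missing entries to show the difference; the numbers are illustrative only.

```python
import pandas as pd

# Made-up stand-in for df.rental_access_method with 2 missing values (assumption).
methods = pd.Series(['app'] * 9 + ['clipper'] + [None] * 2)

# Shares of all rows: missing values are left out of the counts but not the total.
share_of_all_rows = methods.value_counts() / len(methods) * 100

# Shares of the known values only: always sums to 100.
share_of_known = methods.value_counts(normalize=True) * 100

print(share_of_all_rows)
print(share_of_known)
```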

In [None]:
labels = ['App 92.33%','Clipper 7.67%']
plt.pie(common_method,labels=common_method.index, autopct='%.2f');
plt.axis('equal')
plt.title('Common Rental Method for Bike Trips')
plt.legend(labels, bbox_to_anchor=(1.3,1.0), loc="upper right")
plt.show()

### User types
86.11% of trips were made by subscribers, while 13.88% were made by customers!

In [None]:
common_users=df.user_type.value_counts()/df.shape[0]*100
common_users

In [None]:
labels = ['Subscriber 86.11%','Customer 13.88%']
plt.pie(common_users,labels=common_users.index, autopct='%.2f');
plt.axis('equal')
plt.title('Common Users Types for Bike Trips')
plt.legend(labels, bbox_to_anchor=(1.3,1.0), loc="upper right")
plt.show()

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Nothing unexpected. At first it seemed unusual that `Bike trip duration in minutes` was concentrated at small values (5-10 mins), but once we investigated other features such as `user_types`, `Trips by month`, `Hours`, and `Distance in km`, it looked reasonable. <br/>


## Bivariate Exploration

### Relationship between user types and time duration
Both plots suggest that customers have longer trip durations than subscribers. Customers take longer trips (some over 30 mins), while subscribers mostly take less than 30.

In [None]:
cond=df[df.duration_min<60]

In [None]:
plt.figure(figsize = [15, 10])
base_color = sb.color_palette()[0]

# left plot: violin plot
plt.subplot(1, 2, 1)
ax1 = sb.violinplot(data = cond, x = 'user_type', y = 'duration_min', color = base_color)
ax1.set(xlabel='User types', ylabel='Duration (min)', title='Relationship between user types and trip duration (min)')

# right plot: box plot
plt.subplot(1, 2, 2)
ax2 = sb.boxplot(data = cond, x = 'user_type', y = 'duration_min', color = base_color)
ax2.set_ylim(ax1.get_ylim()) # set y-axis limits to be same as left plot
ax2.set(xlabel='User types', ylabel='Duration (min)')

plt.show()
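
The visual comparison can be complemented with a per-group summary. The following is a minimal sketch on made-up trips; `trips` and its values are assumptions, and the `< 60` filter mirrors the one used to build `cond`.

```python
import pandas as pd

# Made-up trips standing in for df (assumption).
trips = pd.DataFrame({
    'user_type': ['Subscriber', 'Subscriber', 'Subscriber',
                  'Customer', 'Customer', 'Customer'],
    'duration_min': [6, 9, 12, 14, 22, 95],
})

# Same filter as `cond`: keep trips shorter than 60 minutes.
short = trips[trips.duration_min < 60]

# Count, median and mean duration per user type.
summary = short.groupby('user_type')['duration_min'].agg(['count', 'median', 'mean'])
print(summary)
```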

### Relationship between user types and months 1st quarter of 2020

Both subscribers and customers hit their peak in February, then decreased drastically in March.

In [None]:
ax = sb.countplot(data = df, x = 'month', hue = 'user_type', order=['January','February','March'])
ax.legend(loc = 1, ncol = 1, framealpha = 0.5, title = 'user_type')
ax.set(xlabel='Months',ylabel='Number of trips', title='Relationship between user types and months \n1st quarter of 2020')

plt.show()

### Relationship between user types and days of week
Subscribers take more rides in general since they account for 86% of trips, so they drive the overall distribution of trips by day. Specifically, subscriber usage is highest on the weekdays Wednesday and Friday (for work) and drops by a fraction at the weekends. Customers rent far fewer rides, with a slight increase at the weekends.

In [None]:
weekday = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

In [None]:
ax = sb.countplot(data = df, x = 'dayofweek', hue = 'user_type', order=weekday)
ax.legend(loc = 1, ncol = 1, framealpha = 0.5, title = 'user_type')
ax.set(xlabel='Days of week',ylabel='Number of trips', title='Relationship between user types and days of week')
plt.show()
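
Because subscribers account for most trips, raw counts make the customer pattern hard to read. One option is to compare each group's share of trips per day instead. This is a sketch on made-up data; `trips` is hypothetical and the shares are illustrative only.

```python
import pandas as pd

# Made-up trips standing in for df (assumption).
trips = pd.DataFrame({
    'user_type': ['Subscriber'] * 6 + ['Customer'] * 4,
    'dayofweek': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat',
                  'Fri', 'Sat', 'Sat', 'Sun'],
})
weekday = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# normalize='index' turns each row (user type) into shares that sum to 1.
share = pd.crosstab(trips['user_type'], trips['dayofweek'], normalize='index')
share = share.reindex(columns=weekday, fill_value=0)
print(share.round(2))
```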

### Relationship between user types and time of day
Most subscriber trips are taken at 8:00 am and 5:00 pm (17:00), which are the peak times (people go to and come from work at these hours). Meanwhile, the customer distribution shows the trip count increasing as the day progresses until 5:00 pm (17:00), which is the peak hour for customers too.

In [None]:
plt.figure(figsize = [10, 5])
ax = sb.countplot(data = df, x = 'start_hour', hue = 'user_type')
ax.legend(loc = 1, ncol = 3, framealpha = 0.5, title = 'user_type')
ax.set(xlabel='Hours of the day',ylabel='Number of trips', title='Relationship between user types and time of day')
plt.show()
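
The peak hour for each user type can also be read off programmatically rather than from the chart. A minimal sketch, assuming a made-up `trips` frame with the same `user_type` and `start_hour` columns:

```python
import pandas as pd

# Made-up trips standing in for df (assumption).
trips = pd.DataFrame({
    'user_type': ['Subscriber'] * 6 + ['Customer'] * 5,
    'start_hour': [8, 8, 8, 17, 17, 12, 14, 15, 17, 17, 10],
})

# Rows: user types, columns: start hours, values: number of trips.
hourly = trips.groupby(['user_type', 'start_hour']).size().unstack(fill_value=0)

# Column label (hour) holding the largest count in each row.
peak_hour = hourly.idxmax(axis=1)
print(peak_hour)
```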

### Relationship between user types and rental methods
Both subscribers and customers rented bikes mostly by app; both have a relatively small number of trips rented using Clipper cards.

In [None]:
plt.figure(figsize = [10, 5])
s=sb.countplot(data = df, x = 'rental_access_method', hue = 'user_type')
s.set(xlabel='Rental access method',ylabel='Count', title='Relationship between user types and rental methods')
plt.show()

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
We observed the trip duration according to the user type, as well as the trip counts of each user category by month, day of the week, and hour, and the relationship between the user type and `rental access method`. Nothing was out of expectation.
#### Developed insights
- Both plots suggest that customers have longer trip durations than subscribers. Customers take longer trips (some over 30 mins), while subscribers mostly take less than 30.
- Both subscribers and customers hit their peak in February, then decreased drastically in March.
- Subscriber usage is highest on the weekdays Wednesday and Friday (for work) and drops by a fraction at the weekends. Customers rent far fewer rides, with a slight increase at the weekends.
- Most subscriber trips are taken at 8:00 am and 5:00 pm (17:00), which are the peak times (people go to and come from work at these hours). Meanwhile, the customer distribution shows the trip count increasing as the day progresses until 5:00 pm (17:00), which is the peak hour for customers too.
- Both subscribers and customers rented bikes mostly by app; both have a relatively small number of trips rented using Clipper cards.


### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
All the relationships between the features we tested were strong. One new discovery is that customer trips take longer; in addition, both user types preferred the app as the rental access method.

## Multivariate Exploration
The main thing I want to explore in this part of the analysis is how the three variables, date-time (months, days, hours), user type, and duration (min), play into the relationship with trip count.
Now we can answer the following:
* When are most trips taken in terms of time of day, day of the week, or month of the year?
<br/>
* How long does the average trip take?
<br/>
* Does the above depend on if a user is a subscriber or customer?



### Trips duration by user type and rental method
Both user groups used the app as the rental method far more often than Clipper (transit card). The rental method does not seem to have an effect on the other two variables. The user type, however, does relate to duration: customers tend to take longer trips.

In [None]:
g = sb.FacetGrid(data = cond, col = 'user_type', height=5)
g.map(sb.boxplot, 'rental_access_method', 'duration_min')
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Trips duration by user type and rental method')
plt.show()


### Trips taken in terms of time of day, day of the week, or month of the year 

**The following plotted visuals answer our main question/topic**<br/>

### Trips taken in terms of day of the week and user type
Trip duration for subscribers is consistent throughout the week, with an average duration of 5-10 minutes. Trip duration for customers, however, is longer than for subscribers on all days, especially at weekends. This matches what we saw earlier in the bivariate exploration of trip duration and user type. Customers tend to take longer trips on weekends than on weekdays.

In [None]:
g = sb.FacetGrid(data = cond, col = 'user_type', height=6)
g.map(sb.boxplot, 'dayofweek', 'duration_min',order=weekday)
g.set(xlabel='Day of week', ylabel='Duration (min)')
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Trips taken in terms of day of the week and user type')
plt.show()
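
The weekday versus weekend contrast can be condensed into a small table of medians. This is a sketch on made-up trips; `trips`, the `day_type` helper column, and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up trips standing in for `cond` (assumption).
trips = pd.DataFrame({
    'user_type': ['Subscriber'] * 4 + ['Customer'] * 4,
    'dayofweek': ['Mon', 'Wed', 'Sat', 'Sun', 'Tue', 'Fri', 'Sat', 'Sun'],
    'duration_min': [8, 10, 9, 11, 15, 17, 25, 29],
})

# Hypothetical helper column: label each trip as weekday or weekend.
is_weekend = trips['dayofweek'].isin(['Sat', 'Sun'])
trips['day_type'] = is_weekend.map({True: 'weekend', False: 'weekday'})

# Median duration per user type and day type, one column per day type.
medians = trips.groupby(['user_type', 'day_type'])['duration_min'].median().unstack()
print(medians)
```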

### Trips taken in terms of time of day and user type
Subscriber trips have two peak hours, 8:00 and 17:00, while customer trips peak from 14:00 to 16:00.

In [None]:
g = sb.FacetGrid(data = cond, col = 'user_type', height=8)
g.map(sb.boxplot, 'start_hour', 'duration_min', order=list(range(24)))
g.set(xlabel='Time of day', ylabel='Duration (min)')
# suptitle is a method: call it rather than assigning a string to it
g.fig.suptitle('Trips taken in terms of time of day and user type \n (1st Quarter of 2020)')
plt.subplots_adjust(top=0.7)
plt.show()

In [None]:
# select the numeric column before averaging so string columns are not aggregated
cat_means = cond.groupby(['user_type', 'start_hour'])['duration_min'].mean()
cat_means = cat_means.reset_index(name = 'duration_min')
cat_means = cat_means.pivot(index = 'start_hour', columns = 'user_type',
                            values = 'duration_min')
s=sb.heatmap(cat_means, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'mean(duration_min)'})
# rows are start hours and columns are user types
s.set(xlabel='User type', ylabel='Hour of day', title='Trips taken in terms of time of the day and user type \n (1st Quarter of 2020)')
plt.show()

### Trips taken by months and user type
Customers still have the longer trip durations; however, durations are consistent across the months.

In [None]:
g = sb.FacetGrid(data = cond, col = 'user_type', height=8)
g.map(sb.boxplot, 'month', 'duration_min', order=['January','February','March'])
g.set(xlabel='Months of 1st Q', ylabel='Duration (min)')
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Trips taken by months \n and user type')
plt.show()

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
In this section I extended my investigation of the three variables, date-time (months, days, hours), user type, and trip duration (min), by looking at their impact on the trips, to answer the following:
* When are most trips taken in terms of time of day, day of the week, or month of the year?
* How long does the average trip take?
* Does the above depend on if a user is a subscriber or customer?

The findings were just as we expected.



### Were there any interesting or surprising interactions between features?
Looking back on the plots, the date-time and user-type features have an effect on the trips. However, there was no relation between the rental method and the user type. Overall, the results match our previous observations and expectations.



### Key Insights 
- Bikes are typically rented for about 5-10 minutes. According to the summary statistics, the mean rental time is about 12.6 minutes.
- 92.33% of bike trips were rented through the app and only 7.67% with a Clipper card (transit card).
- 86.11% of trips were made by subscribers, while 13.88% were made by customers!
- February is the top month for bike trips.
- Subscriber trips have two peak hours, 8:00 am and 5:00 pm (17:00), which are commuting times (I'm assuming people go to and come from work at these hours). Meanwhile, the customer distribution shows the trip count increasing as the day progresses from 14:00 until 17:00, which is the peak time for customers too.
- Subscriber trip duration is consistent throughout the week, with an average duration of 5-10 minutes, while customers have longer durations (over 20 minutes) during the weekends.
- Both subscribers and customers rented bikes mostly by app; both have a relatively small number of trips rented using Clipper cards.




# Thank you!
### Shroug Salem

In [None]:
! jupyter nbconvert FordGoBikeSystemData_slide_deck.ipynb --to slides --post serve --template output_toggle

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'FordGoBikeSystemData_slide_deck.ipynb'])