# Case Study: How Does a Bike-Share Navigate Speedy Success?
Author: Nguyen Anh Tuan

Last Update: 29 April 2022

## Introduction

This case study is part of the capstone project for the [Google Data Analytics Certificate](https://www.coursera.org/professional-certificates/google-data-analytics). It is about a fictional company, Cyclistic.
In this analysis, we will look at the company's customer data from April 2020 to December 2021.
The data has been made available by Motivate International Inc. under this [license](https://ride.divvybikes.com/data-license-agreement).

The analysis follows the steps of the data analysis process: ask, prepare, process, analyze, share, and act.



### 1. Ask
Context:

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. The marketing team wants to understand how casual riders and annual members use Cyclistic bikes differently.
From these insights, we will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Stakeholders:

- Director of marketing
- Cyclistic executive team

Objective:

The goal of this analysis is to identify how the two types of customers, annual members and casual riders, behave differently, based on a few parameters that can be calculated or obtained from the existing data.

Deliverables:

1. Insights on how annual members and casual riders use Cyclistic bikes differently
2. Effective visuals and relevant data to support the insights
3. Recommendations, based on the insights, for converting casual riders into members

    

### 2. Prepare
Data Source:

The data consists of 21 .csv files downloaded from Motivate International’s Divvy.
The datasets are openly available and contain no information that could be traced back to individual users, such as credit card details or personal identification.

The combined size of the 21 datasets is close to 1 GB. At that size a spreadsheet is not the best choice for cleaning and manipulating the data, so I use Python for both cleaning and analysis.

Load the required libraries:



In [None]:
import pandas as pd # data manipulation and aggregation
import matplotlib as mpl # default color cycle setup
import matplotlib.pyplot as plt # plotting
import matplotlib.ticker as ticker # axis tick formatting
import numpy as np # numerical helpers

Read the raw data into DataFrames:

In [None]:
trip_202004 = pd.read_csv('../input/bikeshare/202004-divvy-tripdata.csv')
trip_202005 = pd.read_csv('../input/bikeshare/202005-divvy-tripdata.csv')
trip_202006 = pd.read_csv('../input/bikeshare/202006-divvy-tripdata.csv')
trip_202007 = pd.read_csv('../input/bikeshare/202007-divvy-tripdata.csv')
trip_202008 = pd.read_csv('../input/bikeshare/202008-divvy-tripdata.csv')
trip_202009 = pd.read_csv('../input/bikeshare/202009-divvy-tripdata.csv')
trip_202010 = pd.read_csv('../input/bikeshare/202010-divvy-tripdata.csv')
trip_202011 = pd.read_csv('../input/bikeshare/202011-divvy-tripdata.csv')
trip_202012 = pd.read_csv('../input/bikeshare/202012-divvy-tripdata.csv')
trip_202101 = pd.read_csv('../input/bikeshare/202101-divvy-tripdata.csv')
trip_202102 = pd.read_csv('../input/bikeshare/202102-divvy-tripdata.csv')
trip_202103 = pd.read_csv('../input/bikeshare/202103-divvy-tripdata.csv')
trip_202104 = pd.read_csv('../input/bikeshare/202104-divvy-tripdata.csv')
trip_202105 = pd.read_csv('../input/bikeshare/202105-divvy-tripdata.csv')
trip_202106 = pd.read_csv('../input/bikeshare/202106-divvy-tripdata.csv')
trip_202107 = pd.read_csv('../input/bikeshare/202107-divvy-tripdata.csv')
trip_202108 = pd.read_csv('../input/bikeshare/202108-divvy-tripdata.csv')
trip_202109 = pd.read_csv('../input/bikeshare/202109-divvy-tripdata.csv')
trip_202110 = pd.read_csv('../input/bikeshare/202110-divvy-tripdata.csv')
trip_202111 = pd.read_csv('../input/bikeshare/202111-divvy-tripdata.csv')
trip_202112 = pd.read_csv('../input/bikeshare/202112-divvy-tripdata.csv')

Combine all the datasets into one single dataframe:

In [None]:
all_trip = pd.concat([trip_202004,trip_202005,trip_202006,trip_202007,trip_202008,trip_202009,trip_202010,
                     trip_202011,trip_202012,trip_202101,trip_202102,
                     trip_202103,
                     trip_202104,
                     trip_202105,
                     trip_202106,
                     trip_202107,
                     trip_202108,
                     trip_202109,
                     trip_202110,
                     trip_202111,
                     trip_202112],ignore_index= True, sort = False)
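
The twenty-one `read_csv` calls and the `concat` above can be collapsed into a loop. The following is a minimal sketch, assuming every monthly file has the same columns and follows the `*-divvy-tripdata.csv` naming pattern used above. `load_trips` is a hypothetical helper, not part of the original notebook, and the demo writes two tiny stand-in files to a temporary folder instead of reading the real data.

```python
import glob
import os
import tempfile

import pandas as pd


def load_trips(folder):
    """Read every monthly trip file in `folder` into one DataFrame (hypothetical helper)."""
    # File names start with YYYYMM, so sorting the paths keeps the months in order.
    paths = sorted(glob.glob(os.path.join(folder, "*-divvy-tripdata.csv")))
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True, sort=False)


# Demo on two made-up files; the real call would be load_trips('../input/bikeshare').
with tempfile.TemporaryDirectory() as folder:
    for month, ride_id in [("202004", "A1"), ("202005", "B2")]:
        sample = pd.DataFrame({"ride_id": [ride_id], "member_casual": ["member"]})
        sample.to_csv(os.path.join(folder, f"{month}-divvy-tripdata.csv"), index=False)
    demo = load_trips(folder)

print(demo)
```

With this approach, adding another month only requires dropping the new file into the folder.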

In [None]:
all_trip.info()

### 3. Process

Remove rows with null values:

In [None]:
all_trip = all_trip.dropna()
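
`dropna()` removes every row that has a missing value in any column, so it is worth checking how many rows that costs before committing. A small sketch on made-up rows (the values are illustrative, not taken from the Divvy data):

```python
import pandas as pd

# Made-up rows; None marks a missing station name.
sample = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3", "D4"],
    "start_station_name": ["Clark St", None, "State St", None],
    "end_station_name": ["State St", "Clark St", None, None],
})

# Number of missing values in each column.
null_counts = sample.isna().sum()

# Rows that dropna() would discard: those with at least one missing value.
rows_dropped = int(sample.isna().any(axis=1).sum())

print(null_counts)
print(f"dropna() would remove {rows_dropped} of {len(sample)} rows")
```

On the real data, the same two expressions applied to `all_trip` before the `dropna()` call would show which columns drive the loss.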

Convert the 'started_at' and 'ended_at' columns to datetime:

In [None]:
all_trip['started_at'] = pd.to_datetime(all_trip['started_at'])
all_trip['ended_at'] = pd.to_datetime(all_trip['ended_at'])

Remove columns that are not required or are beyond the scope of the project:

In [None]:
all_trip = all_trip.drop(columns = ['start_lat','start_lng','end_lat','end_lng'])

Rename some columns:

In [None]:
all_trip = all_trip.rename(columns={"rideable_type":"ride_type","started_at":"start_time","ended_at":"end_time","member_casual":"customer_type"})

Data is looking good now!

In [None]:
all_trip.head()

Compute the new columns 'day_of_week', 'day_number' and 'month' from 'start_time' for further analysis:

In [None]:
all_trip['day_of_week'] = all_trip['start_time'].dt.day_name()
all_trip['day_number'] = all_trip['start_time'].dt.day_of_week

In [None]:
all_trip['month'] = all_trip['start_time'].dt.strftime('%Y-%m')

Compute the duration of each trip in minutes: end_time - start_time

In [None]:
all_trip['duration'] = all_trip['end_time']- all_trip['start_time']
all_trip['duration'] = all_trip['duration'].dt.total_seconds()/60
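
As a worked example of the duration calculation, here are two made-up trips (the timestamps are illustrative only):

```python
import pandas as pd

trips = pd.DataFrame({
    "start_time": pd.to_datetime(["2021-07-04 10:00:00", "2021-07-05 08:15:00"]),
    "end_time": pd.to_datetime(["2021-07-04 10:45:30", "2021-07-05 08:27:00"]),
})

# Subtracting two datetime columns gives a timedelta column;
# total_seconds() / 60 converts it to minutes as a float.
trips["duration"] = (trips["end_time"] - trips["start_time"]).dt.total_seconds() / 60

print(trips["duration"].tolist())  # [45.5, 12.0]
```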

Let's take a look at the resulting data:

In [None]:
all_trip.describe()

Remove trips with a negative duration:

In [None]:
all_trip = all_trip.drop(all_trip[ all_trip['duration'] < 0 ].index)


Remove test rides, identified by "Test" in the 'start_station_name' column:

In [None]:
all_trip = all_trip.drop(all_trip.loc[all_trip['start_station_name'].str.contains("Test", case =False)].index)
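
The two removal steps can also be written as a single boolean mask, which avoids looking up index labels. A sketch on made-up rows, assuming the same column names as above; `na=False` is an addition here so that a missing station name counts as "not a test row":

```python
import pandas as pd

# Made-up rows: one test station and one negative duration.
trips = pd.DataFrame({
    "start_station_name": ["Clark St", "Test Station", "State St", "Canal St"],
    "duration": [12.0, 3.0, -1.5, 30.0],
})

is_test = trips["start_station_name"].str.contains("test", case=False, na=False)
is_negative = trips["duration"] < 0

# Keep only rows that are neither test rides nor negative-duration rides.
clean = trips[~is_test & ~is_negative]

print(clean)
```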

Ready to go!

In [None]:
all_trip.head()

### 4. Analyze & 5. Share

In [None]:
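# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn' if 'seaborn' in plt.style.available else 'seaborn-v0_8')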
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=["#F8766D", "#00BFC4"]) 

The dataframe is now ready for descriptive analysis.
Let's find some insights on how the casual riders and members use Cyclistic rideshare differently.

First, let's try to get some simple statistics on customer_type:

In [None]:
all_trip.groupby('customer_type').describe()

Let's see the average duration across all trips:

In [None]:
all_trip['duration'].mean()

We can see that: 

The mean trip duration of members is lower than the average across all customers, while the mean trip duration of casual riders is higher. Casual riders tend to take bikes out for longer than members.

Now we will look at how the two customer types use the service over the course of a week.

**Total number of trips by customer type and day of the week**

In [None]:
# Keep the group keys as the index so that resume_1.loc['casual', 'count'] works below
resume_1 = all_trip.groupby(['customer_type','day_number'])['duration'].agg(['count','mean'])

In [None]:
day_number = [day for day,df in all_trip.groupby('day_number')]
day_name = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday']

In [None]:
x = np.arange(len(day_number))

fig, ax = plt.subplots(figsize=(15,5))

type1 = ax.bar(x + 0.2, resume_1.loc['casual','count'],width=0.4,label='casual')
type2 = ax.bar(x - 0.2, resume_1.loc['member','count'],width=0.4,label='member')

# add axis labels, tick labels and a title
ax.set_ylabel('Number of trips')
ax.set_xlabel('Day of week')
ax.set_xticks(x,day_name,rotation = 'vertical', size=10)
ax.set_title('Number of trips by day of week and customer type')
ax.legend()

plt.grid(axis = 'y')
plt.show()

From the graph above:
- Casual riders are busiest on weekends.
- Members are busiest in the later half of the week, extending into the weekend.

Interesting patterns to note:
- Trip numbers among members are consistent, with little spread over the entire week.
- Casual riders don't seem to use the bike-share service much on weekdays.
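
The weekday/weekend contrast can be put into numbers by computing, for each customer type, the share of trips that start on Saturday or Sunday. A sketch on made-up rows, assuming `day_number` follows the pandas convention used above (Monday = 0, Sunday = 6):

```python
import pandas as pd

# Made-up trips: four per customer type.
trips = pd.DataFrame({
    "customer_type": ["casual"] * 4 + ["member"] * 4,
    "day_number": [5, 6, 6, 2, 0, 1, 3, 5],
})

# Saturday (5) and Sunday (6) count as the weekend.
trips["is_weekend"] = trips["day_number"] >= 5

# The mean of a boolean column is the fraction of True values.
weekend_share = trips.groupby("customer_type")["is_weekend"].mean()

print(weekend_share)
```

Since the weekend covers two of seven days, a share well above 2/7 (about 0.29) would indicate a weekend-heavy usage pattern.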



**Total number of trips by customer type and by month**

In [None]:
# Keep the group keys as the index so that resume_2.loc['casual', 'count'] works below
resume_2 = all_trip.groupby(['customer_type','month'])['duration'].agg(['count','mean'])

In [None]:
month = [day for day,df in all_trip.groupby('month')]
x = np.arange(len(month))

fig, ax = plt.subplots(figsize=(15,5))

type1 = ax.bar(x + 0.2, resume_2.loc['casual','count'],width=0.4,label='casual')
type2 = ax.bar(x - 0.2, resume_2.loc['member','count'],width=0.4,label='member')

# add axis labels, tick labels and a title
ax.set_ylabel('Number of trips')
ax.set_xlabel('Months')
ax.set_xticks(x,month,rotation = 'vertical', size=8)
ax.set_title('Number of trips by month and customer type')
ax.legend()

plt.grid(axis = 'y')
plt.show()

- Summer (June to September) is the busiest time of the year for both members and casual riders. Weather (temperature, sunshine, etc.) may explain this.
- The number of bike-share trips is higher in 2021 than in 2020. After the 2020 COVID lockdown, people seem to enjoy being outside and may be trying to avoid public transport.
- Members make more trips than casual riders in every month except July and August 2021. Several factors could explain this exception; the data examined here does not identify them.
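
To list the months in which casual trips outnumber member trips, the counts can be pivoted so that each customer type becomes a column. A sketch on made-up rows (not the real figures):

```python
import pandas as pd

# Made-up trips spread over two months.
trips = pd.DataFrame({
    "month": ["2021-06"] * 3 + ["2021-07"] * 3,
    "customer_type": ["member", "member", "casual", "casual", "casual", "member"],
})

# One row per month, one column per customer type, values are trip counts.
monthly = trips.groupby(["month", "customer_type"]).size().unstack(fill_value=0)

# Months where casual riders made more trips than members.
casual_ahead = monthly[monthly["casual"] > monthly["member"]].index.tolist()

print(monthly)
print(casual_ahead)  # ['2021-07']
```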


**Average trip duration by customer type for each day of the week**

In [None]:
x = np.arange(len(day_number))

fig, ax = plt.subplots(figsize=(15,5))

type1 = ax.bar(x + 0.2, resume_1.loc['casual','mean'],width=0.4,label='casual')
type2 = ax.bar(x - 0.2, resume_1.loc['member','mean'],width=0.4,label='member')

# add axis labels, tick labels and a title
ax.set_ylabel('Average trip duration (minutes)', size = 12)
ax.set_xlabel('Day of week',size = 12)
ax.set_xticks(x,day_name,rotation = 'horizontal', size=10)
ax.set_title('Average duration by day of week and customer type')
ax.legend()

plt.grid(axis = 'y')
plt.show()

- The average trip duration of a casual rider is more than twice that of a member.
- Riders tend to take longer trips on weekends than on weekdays.

**Average trip duration by customer type for each month**

In [None]:
month = [day for day,df in all_trip.groupby('month')]
x = np.arange(len(month))

fig, ax = plt.subplots(figsize=(20,5))

type1 = ax.bar(x + 0.2, resume_2.loc['casual','mean'],width=0.4,label='casual')
type2 = ax.bar(x - 0.2, resume_2.loc['member','mean'],width=0.4,label='member')

# add axis labels, tick labels and a title
ax.set_ylabel('Average trip duration (minutes)',size = 12)
ax.set_xlabel('Months',size = 12)
ax.set_xticks(x,month,rotation = 'horizontal', size=10)
ax.set_title('Average trip duration by month and customer type')
ax.legend()

plt.grid(axis = 'y')
plt.show()

- The average ride of a member is between 10 and 20 minutes throughout the year.
- Casual riders tend to keep the bike longer: their average trip duration is more than 25 minutes.
- Casual riders show an unusually long average trip duration in April 2020.
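
An unusually high monthly mean, such as the one for casual riders in April 2020, can be driven by a handful of very long trips. Comparing the mean with the median is one way to check. This is a suggested diagnostic on made-up durations, not a result from the original analysis:

```python
import pandas as pd

# Made-up durations in minutes; one 24-hour trip in the first month.
trips = pd.DataFrame({
    "month": ["2020-04"] * 4 + ["2020-05"] * 4,
    "duration": [10.0, 20.0, 30.0, 1440.0, 10.0, 20.0, 30.0, 40.0],
})

# The median is far less sensitive to extreme values than the mean.
summary = trips.groupby("month")["duration"].agg(["mean", "median"])

print(summary)
```

In this toy data the single 1440-minute trip lifts the first month's mean to 375 minutes while its median stays at 25, the same as the second month. A large gap between the two statistics points to outliers rather than a general shift in behaviour.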

**Visualization of bike demand over a 24-hour period (one day)**

In [None]:
all_trip.head(5)

In [None]:
all_trip['hour'] = all_trip['start_time'].dt.hour

In [None]:
resume_3 = all_trip.groupby(['customer_type','hour'])['ride_id'].count()

In [None]:
hour = [hour for hour,df in all_trip.groupby('hour')]
x = np.arange(len(hour))

fig, ax = plt.subplots(figsize=(15,5))
ax.set_facecolor('white')

type1 = ax.plot(x, resume_3.loc['casual'],label= 'casual')
type2 = ax.plot(x, resume_3.loc['member'],label= 'member')

# add axis labels, tick labels and a title
ax.set_ylabel('Number of trips',size = 14)
ax.set_xlabel('Hour of the day',size = 12)
ax.set_xticks(x,hour,rotation = 'horizontal', size=10)
ax.set_title('Demand over the 24 hours of a day')
ax.legend()

plt.grid()
plt.show()

- Members: there are two peak demand periods, 7-9 AM and 4-7 PM. A plausible hypothesis is that members use the bikes to commute to and from work.
- Casual riders: demand shows a similar evening peak, at about 7 PM.
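
The busiest hour for each customer type can also be read off programmatically with `idxmax`. A sketch on made-up start hours (the counts are illustrative only):

```python
import pandas as pd

# Made-up start hours for a handful of trips.
trips = pd.DataFrame({
    "customer_type": ["member"] * 5 + ["casual"] * 4,
    "hour": [8, 8, 17, 17, 17, 12, 18, 18, 19],
})

# Rows are customer types, columns are hours, values are trip counts.
hourly = trips.groupby(["customer_type", "hour"]).size().unstack(fill_value=0)

# For each row, the column label (hour) holding the largest count.
peak_hour = hourly.idxmax(axis=1)

print(peak_hour)
```

Note that `idxmax` returns only the first maximum, so a second peak (such as a morning commute) still has to be read from the plot or found separately.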


### 6. Act

#### Key Takeaways

- Casual riders contribute more trips than members (about 60% of total trips).
- Based on trip duration, casual riders tend to ride longer than members (in time, not necessarily in distance).
- Casual riders use the bike-share service more on weekends, while members use it consistently over the entire week.
- During summer (June to September), the number of trips is higher than at other times of the year for both casual riders and members.
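
The trip share quoted in the first takeaway can be recomputed directly with `value_counts(normalize=True)`. A sketch on made-up rows; on the cleaned data the equivalent call would be `all_trip['customer_type'].value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up rows; the proportions here say nothing about the real data.
trips = pd.DataFrame({"customer_type": ["member", "casual", "member", "member"]})

# normalize=True returns fractions of the total instead of raw counts.
share = trips["customer_type"].value_counts(normalize=True)

print(share)
```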



#### Recommendations

- Offer a summer promotion for new subscriptions, or create a new summer travel pass. It might encourage casual riders to take up membership, and it could also help retain existing members.
- Introduce a new weekend-only annual subscription priced slightly below the full annual subscription. This would encourage people to take one of the two options, resulting in more annual members.