# Cyclistic BikeShare Trip Data Case Study
### Date: 20/08/2021


## Introduction


You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. 

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. 

From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

## Business Task

The main business task is to design marketing strategies to convert casual riders into annual members. Three questions guide the analysis:

1. How do annual members and casual riders differ?
2. Why should casual riders buy annual memberships?
3. How can digital media influence casual riders to become members?


### The key stakeholders would be:

- Primary stakeholders:
  1. Director of Marketing (Lily Moreno).
  2. The Cyclistic executive team.
- Secondary stakeholders:
  - The marketing analytics team.

## Prepare

### Data Source

Cyclistic provides historical trip data, found [here](http://divvy-tripdata.s3.amazonaws.com/index.html). (Note: for the purpose of this case study, the datasets have a different name because Cyclistic is a fictional company.) The data has been made available by Motivate International Inc. This analysis uses 13 CSV files, each detailing one month of trip data, covering April 2020 through April 2021.

The datasets are located on the company's cloud server, so we must download them before we can prepare, process, and analyze them. Each dataset consists of 13 columns: ride id, ride type, start time, end time, start station name, start station id, end station name, end station id, starting latitude and longitude, ending latitude and longitude, and member or casual rider.

Because the data is public, we will assume it is reliable. We can verify the data's integrity by ensuring that it is clean, up-to-date, and stored in one centralized location.


### Tools

For this project, I decided to use **R** to merge, clean, and analyze the datasets. The datasets are too large to process in spreadsheets, and R can handle wrangling large data while also offering analysis functions and options for visualization, so I feel it is the best choice.

## Process

I begin by loading the packages and libraries needed to make merging and cleaning the data easier.

In [1]:
#Loading up needed libraries
library(tidyverse)
library(janitor)
library(lubridate)
library(ggplot2)
library(scales)
library(magrittr)

To work with the data in **R**, I set up my working directory and load the CSV files so that I can access them in **R**.

In [2]:
#Load the monthly trip data (13 files, April 2020 through April 2021)

bike001<-read.csv("../input/cyclistic-bike-share/202004-divvy-tripdata.csv")
bike002<-read.csv("../input/cyclistic-bike-share/202005-divvy-tripdata.csv")
bike003<-read.csv("../input/cyclistic-bike-share/202006-divvy-tripdata.csv")
bike004<-read.csv("../input/cyclistic-bike-share/202007-divvy-tripdata.csv")
bike005<-read.csv("../input/cyclistic-bike-share/202008-divvy-tripdata.csv")
bike006<-read.csv("../input/cyclistic-bike-share/202009-divvy-tripdata.csv")
bike007<-read.csv("../input/cyclistic-bike-share/202010-divvy-tripdata.csv")
bike008<-read.csv("../input/cyclistic-bike-share/202011-divvy-tripdata.csv")
bike009<-read.csv("../input/cyclistic-bike-share/202012-divvy-tripdata.csv")
bike010<-read.csv("../input/cyclistic-bike-share/202101-divvy-tripdata.csv")
bike011<-read.csv("../input/cyclistic-bike-share/202102-divvy-tripdata.csv")
bike012<-read.csv("../input/cyclistic-bike-share/202103-divvy-tripdata.csv")
bike013<-read.csv("../input/cyclistic-bike-share/202104-divvy-tripdata.csv")
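
The repeated `read.csv()` calls above can also be written as a loop over file names. The following is a minimal sketch, not the code used in this notebook: it writes two tiny made-up CSV files to a temporary folder so that it can run on its own, and the names `demo_dir`, `csv_files`, `monthly`, and `all_rides` are hypothetical.

```r
# Sketch: read every monthly CSV in a folder and stack them into one data frame.
# The folder and the two tiny files are made up so the example runs on its own.
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)

write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202004-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "casual"),
          file.path(demo_dir, "202005-divvy-tripdata.csv"), row.names = FALSE)

# List the files by pattern, read each one, then bind the rows together
csv_files <- list.files(demo_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly   <- lapply(csv_files, read.csv)
all_rides <- do.call(rbind, monthly)

nrow(all_rides)  # total rows across both files: 3
```

With the real data, `demo_dir` would point at the input folder, and the individual `bike001` ... `bike013` objects would no longer be needed.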



After installing all the necessary packages and loading all the CSV files into R, I needed to get acquainted with the available data. 

My process was:

* Check each column name and its properties, and ensure there are no inconsistencies in data type or spelling (this is important for merging all the tables into one!).



In [3]:
#Check the structure and columns of files to note inconsistencies and errors
colnames(bike001)
colnames(bike002)
colnames(bike003)
colnames(bike004)
colnames(bike005)
colnames(bike006)
colnames(bike007)
colnames(bike008)
colnames(bike009)
colnames(bike010)
colnames(bike011)
colnames(bike012)
colnames(bike013)

str(bike001)
str(bike002)
str(bike003)
str(bike004)
str(bike005)
str(bike006)
str(bike007)
str(bike008)
str(bike009)
str(bike010)
str(bike011)
str(bike012)
str(bike013)
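
Reading thirteen `colnames()` and `str()` outputs by eye is error-prone, so the same check can be done programmatically. This is a hedged sketch in base R using two made-up data frames (`apr` and `dec` are hypothetical, and the station id values are invented); the ids deliberately differ in type so that the check has something to find.

```r
# Sketch: compare column names and classes across several data frames at once.
# The two small frames below are made up; station IDs deliberately differ in type.
apr <- data.frame(ride_id = "A1", start_station_id = 86L, stringsAsFactors = FALSE)
dec <- data.frame(ride_id = "B1", start_station_id = "TA1307000041", stringsAsFactors = FALSE)
frames <- list(apr = apr, dec = dec)

# TRUE when every frame has the same column names in the same order
same_names <- all(sapply(frames, function(df) identical(names(df), names(frames[[1]]))))

# One row per column, one column per frame, holding each column's class
col_classes <- sapply(frames, function(df) sapply(df, function(col) class(col)[1]))

# Columns whose class is not the same in every frame
mismatched <- rownames(col_classes)[apply(col_classes, 1, function(x) length(unique(x)) > 1)]

same_names
col_classes
mismatched
```

The janitor package loaded earlier also provides `compare_df_cols()`, which produces a similar class-by-data-frame table.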

After looking over the structure and column names and working with the data, I noted some things that needed to be addressed.

* Some rides had a duration of zero or less. These included bike inspections, and they were removed.
* The start and end times were stored as character strings and needed to be converted to date-time values.

I also removed rows and columns that were empty, and I combined the thirteen files into one data frame.

In [4]:
#Combine the dataframes into one
bike_rides<-rbind(bike001,bike002,bike003,bike004,bike005,bike006,bike007,bike008,bike009,bike010,bike011,bike012,bike013)

#Clean/remove empty rows and columns
bike_rides<-janitor::remove_empty(bike_rides, which = c("cols"))
bike_rides<-janitor::remove_empty(bike_rides, which = c("rows"))

As I noted earlier, I changed the data type of the start and end times from character to date-time. I also added columns for the start and end hour, date, month, day, year, and day of the week to make it easier to track rides.

In [5]:
#Change data type from char into date
bike_rides$started_at <- lubridate::ymd_hms(bike_rides$started_at)
bike_rides$ended_at <- lubridate::ymd_hms(bike_rides$ended_at)

#Create hour field to calculate duration and busiest hours used
bike_rides$start_hour<-lubridate::hour(bike_rides$started_at)
bike_rides$end_hour<- lubridate::hour(bike_rides$ended_at)

#Add date,month, day, and year of bike rides
bike_rides$date <- as.Date(bike_rides$started_at)
bike_rides$month <- format(as.Date(bike_rides$date), "%m")
bike_rides$day <- format(as.Date(bike_rides$date), "%d")
bike_rides$year <- format(as.Date(bike_rides$date), "%Y")
bike_rides$day_of_week <- format(as.Date(bike_rides$date), "%A")

#Calculate ride duration in hours and in minutes
bike_rides$ride_duration_hrs <- difftime(bike_rides$ended_at,bike_rides$started_at, units=c("hours"))
bike_rides$ride_duration_mins <- difftime(bike_rides$ended_at,bike_rides$started_at, units=c("mins"))
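
To make the conversion step concrete, here is a small worked example on a single ride. It is a sketch in base R rather than lubridate, and the two timestamps are made up for illustration.

```r
# Sketch: parse timestamps and derive duration and calendar fields with base R.
# The two timestamps are made up for illustration.
started_at <- as.POSIXct("2020-04-26 17:45:14", tz = "UTC")
ended_at   <- as.POSIXct("2020-04-26 18:12:03", tz = "UTC")

# difftime() returns a difftime object; as.numeric() strips the unit label
duration_mins <- as.numeric(difftime(ended_at, started_at, units = "mins"))

ride_date  <- as.Date(started_at)
ride_month <- format(ride_date, "%m")               # "04"
start_hour <- as.integer(format(started_at, "%H"))  # 17
weekday_no <- as.POSIXlt(ride_date)$wday            # 0 = Sunday ... 6 = Saturday

round(duration_mins, 2)  # 26.82
```

Setting `units` explicitly matters: without it, `difftime()` chooses the unit automatically, so the same column could end up in seconds, minutes, or hours depending on the data.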



I then calculated the ride duration and cleaned up the data by removing bike inspections and rides with a duration of zero or less.

In [6]:
#Review dataframe
head(bike_rides)
str(bike_rides)

#Calculate ride duration in seconds (explicit units avoid difftime's automatic unit choice)
bike_rides$ride_duration <- difftime(bike_rides$ended_at,bike_rides$started_at, units="secs")

#Convert ride_duration from difftime to numeric and confirm the result
bike_rides$ride_duration <- as.numeric(bike_rides$ride_duration)
is.numeric(bike_rides$ride_duration)

#Remove bike inspections & create new data frame
bike_ridesv2<-bike_rides[!(bike_rides$start_station_name =="HQ QR" | bike_rides$ride_duration<=0),]
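
One caveat with this kind of logical filter: if a station name or duration is `NA`, the comparison returns `NA`, and indexing a data frame with `NA` produces a row filled with `NA`s. The sketch below shows an NA-safe variant on a made-up data frame. The station names other than "HQ QR" are invented, and `rides`, `is_inspection`, and `bad_duration` are hypothetical names.

```r
# Sketch: drop inspection rides and non-positive durations, NA-safely.
# The data frame is made up; "HQ QR" is the inspection station filtered above.
rides <- data.frame(
  ride_id            = c("r1", "r2", "r3", "r4", "r5"),
  start_station_name = c("Clark St", "HQ QR", "State St", NA, "Lake Shore Dr"),
  ride_duration      = c(600, 120, 0, 300, -45),
  stringsAsFactors   = FALSE
)

# %in% returns FALSE (not NA) for a missing station name, so row r4 is kept
is_inspection <- rides$start_station_name %in% "HQ QR"
# Treat a missing duration as invalid rather than letting NA leak into the index
bad_duration  <- is.na(rides$ride_duration) | rides$ride_duration <= 0

rides_clean <- rides[!(is_inspection | bad_duration), ]
rides_clean$ride_id  # "r1" "r4"
```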

## Analyze

I began my analysis by calculating some key metrics about the bike rides: mean, median, maximum, and minimum.

Note: The maximum ride duration returned by the data was 3,356,649 minutes. Since this translates to more than 55,000 hours, this outlier was removed (and is documented here).
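
The cells in this notebook do not show the outlier removal itself, so the following is only a sketch of one way it could be done. The 24-hour cutoff is an assumption chosen for illustration, not a value taken from this analysis, and the durations other than the maximum quoted above are made up.

```r
# Sketch: remove implausibly long rides with an explicit cutoff.
# The 24-hour cutoff and the small durations are assumptions for illustration only.
max_ride_mins <- 24 * 60

ride_duration_mins <- c(12.5, 48.0, 3356649, 7.2)
keep <- ride_duration_mins <= max_ride_mins

ride_duration_trimmed <- ride_duration_mins[keep]
sum(!keep)                  # number of rides dropped: 1
max(ride_duration_trimmed)  # 48
```

Stating the cutoff as a named constant keeps the decision visible and easy to revisit.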

I also looked at which days of the week rides were taken on, to see if there was a difference between annual members and casual users.

I performed descriptive analyses on the cleaned table to give a broad idea of ride length.

In [7]:
#Bike Rides Analysis - looking at average, median, max, and min by minutes
mean(bike_ridesv2$ride_duration_mins)             
median(bike_ridesv2$ride_duration_mins)          
max(bike_ridesv2$ride_duration_mins)
min(bike_ridesv2$ride_duration_mins)

#Compare Usage By User Type
aggregate(bike_ridesv2$ride_duration_mins~bike_ridesv2$member_casual, FUN=mean)
aggregate(bike_ridesv2$ride_duration_mins~bike_ridesv2$member_casual, FUN=median)
aggregate(bike_ridesv2$ride_duration_mins~bike_ridesv2$member_casual, FUN=max)
aggregate(bike_ridesv2$ride_duration_mins~bike_ridesv2$member_casual, FUN=min)

#Breaking down and comparing average ride duration between casual and members by day of the week
aggregate(bike_ridesv2$ride_duration~bike_ridesv2$member_casual + bike_ridesv2$day_of_week,FUN=mean)

#Order this by day of week
bike_ridesv2$day_of_week<-ordered(bike_ridesv2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

#Re-run the rides to compare
aggregate(bike_ridesv2$ride_duration~bike_ridesv2$member_casual + bike_ridesv2$day_of_week,FUN=mean)

head(bike_ridesv2)
colnames(bike_ridesv2)

#More analysis - ohh goody! Breaking it down by type and weekday
bike_ridesv2 %>%
  mutate(weekday=wday(started_at, label =TRUE)) %>%                                            
  group_by(member_casual,weekday) %>%                                                         
  summarise(number_of_rides=n(),average_duration_mins = mean(ride_duration_mins)) %>%
  arrange(member_casual,weekday)            
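
The `aggregate()` calls above use the `bike_ridesv2$column` form, which produces long output column names. Below is a hedged sketch of the same comparison using the `data =` argument, run on six made-up rides (`rides` and `avg_by_day` are hypothetical names).

```r
# Sketch: average ride length by rider type and weekday, with readable column names.
# The six rides below are made up for illustration.
rides <- data.frame(
  member_casual      = c("casual", "casual", "member", "member", "casual", "member"),
  day_of_week        = c("Saturday", "Saturday", "Saturday", "Monday", "Monday", "Monday"),
  ride_duration_mins = c(40, 50, 15, 10, 30, 14)
)

# Order weekdays so the output is not sorted alphabetically
rides$day_of_week <- factor(
  rides$day_of_week,
  levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
)

# Passing data = keeps the output columns named after the variables
avg_by_day <- aggregate(ride_duration_mins ~ member_casual + day_of_week,
                        data = rides, FUN = mean)
avg_by_day
```

In this toy data, casual riders average 45 minutes on Saturday against 15 for members, which mirrors the kind of gap the real comparison is looking for.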

### Key Takeaways

To summarize all the simple analyses done above:

* Casual riders have, on average, longer ride lengths compared to annual members. This is most likely due to the fact that members use bikes for specific routine commutes (going to work, school, etc.), whereas casual riders may intend to use them for longer, leisurely occasions.
* Saturday and Sunday have the longest average ride length for both casual riders and members.
* Saturday is the day with the most rides by casual riders.

## Share

With some analysis done, I now start with some visualizations to easily see any key differences between annual members and casual users. 

In [8]:
#Let's visualize the number of rides by rider type
bike_ridesv2 %>%
  mutate(weekday = wday(started_at, label =TRUE)) %>%
  group_by(member_casual,weekday) %>% 
  summarise(number_of_rides=n(),average_duration_mins = mean(ride_duration_mins)) %>%  
  arrange(member_casual, weekday) %>%
  ggplot(aes(x=weekday,y=number_of_rides, fill= member_casual)) + geom_col(position="dodge") + scale_y_continuous(labels = comma)

#Create visualization for average duration by minutes
bike_ridesv2 %>%
  mutate(weekday=wday(started_at, label =TRUE)) %>% 
  group_by(member_casual,weekday)%>% 
  summarise(number_of_rides=n(), average_duration_mins = mean(ride_duration_mins)) %>% 
  arrange(member_casual, weekday)  %>%      
  ggplot(aes(x=weekday,y=average_duration_mins, fill= member_casual)) + geom_col(position="dodge") + scale_y_continuous(labels = comma)

#Break data down to see which type of bike the different users ride
bike_ridesv2 %>%
  group_by(rideable_type, member_casual) %>%
  summarise(number_of_trips = n()) %>%  
  ggplot(aes(x= rideable_type, y=number_of_trips, fill= member_casual))+
  geom_bar(stat='identity') +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title ="Which Bikes Members and Casual Users Ride") 

## Key Findings

Looking at the data, here are some key takeaways gathered:

* Saturday is the most popular day for casual riders (Graph 1).
* Members rent bikes more consistently throughout the week, with a higher number of rides on weekdays than casual riders (Graph 1).
* On any day of the week, casual riders' average ride length is about twice that of annual members (Graph 2).
* The classic bike is the most popular option compared to docked and electric bikes (Graph 3).
* The number of rides follows a seasonal pattern for both casual riders and members: it drops significantly in the winter months and peaks in the summer months. Casual riders have their peak number of rides in July (Graph 4).
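
The monthly breakdown behind the seasonal finding is not shown in the cells above. As a rough sketch of how such counts could be produced, the example below tallies rides per month and rider type on five made-up rides; `rides`, `year_month`, and `monthly_counts` are hypothetical names, and with the real data the input would be `bike_ridesv2`.

```r
# Sketch: count rides per month and rider type (the input to a seasonal chart).
# The rides below are made up; real data would come from bike_ridesv2.
rides <- data.frame(
  started_at    = as.POSIXct(c("2020-07-04 10:00:00", "2020-07-18 15:30:00",
                               "2020-07-20 08:05:00", "2021-01-09 12:00:00",
                               "2021-01-11 08:10:00"), tz = "UTC"),
  member_casual = c("casual", "casual", "member", "casual", "member")
)

# A year-month key such as "2020-07" keeps the same month of different years apart
rides$year_month <- format(rides$started_at, "%Y-%m")

# Cross-tabulate, then reshape to one row per rider type and month
monthly_counts <- as.data.frame(table(member_casual = rides$member_casual,
                                      year_month    = rides$year_month),
                                responseName = "number_of_rides")
monthly_counts
```

The resulting table could be passed to `ggplot()` with `geom_col(position = "dodge")`, in the same way as the weekday charts.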



## Act

### Recommendations

1. Release the ad campaigns during the peak summer months (June and July) to reach the maximum number of casual riders.
2. Offer a discounted weekend-only membership (Friday-Sunday) to attract casual riders and eventually entice them towards a full membership.
3. Offer a campaign for touring the city, with suggested routes to see each of Chicago's highlights over the course of a year. This incentivises casual riders to pay for a membership to save costs in the long term for sightseeing/touring.