# Google Data Analytics Capstone Project

## About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Lily Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.


For this analysis I will follow the six steps of data analysis: **Ask, Prepare, Process, Analyze, Share** and **Act**.

## Phase1: Ask

### Business Task

Cyclistic, a bike-share company, wants to understand how its annual members (who purchase annual memberships) and casual riders (who purchase single-ride or full-day passes) use the bike-share offering differently. Based on the insights from this analysis, the company may launch marketing strategies aimed at converting casual riders into annual members.

### Key Stakeholders

* Lily Moreno, the director of marketing and my manager
* Other members of Cyclistic marketing analytics team
* Cyclistic executive team

## Phase2: Prepare

The data is good first-party public data and is located at <https://divvy-tripdata.s3.amazonaws.com/index.html>. It has been made available by Motivate International Inc. under this license: <https://ride.divvybikes.com/data-license-agreement>.
The data is reliable, original, comprehensive, current and cited. The files are organized by month, with some organized by quarter and some by year.

For the analysis I will consider the data for the **June 2021 to May 2022** period, which corresponds to the twelve monthly files loaded in this notebook. I will download the data and store it appropriately.



In [1]:
# load all the relevant libraries for my analysis

library(tidyverse)  #helps wrangle data
library(lubridate)  #helps wrangle date attributes
library(ggplot2)  #helps visualize data
library(janitor) #helps cleaning dirty data
library(dplyr)


In [2]:
# Assign all the source files which are stored monthly to their respective variables

jun_2021 <- read.csv("../input/cyclistic-bike-share/202106-divvy-tripdata.csv")
jul_2021 <- read.csv("../input/cyclistic-bike-share/202107-divvy-tripdata.csv")
aug_2021 <- read.csv("../input/cyclistic-bike-share/202108-divvy-tripdata.csv")
sep_2021 <- read.csv("../input/cyclistic-bike-share/202109-divvy-tripdata.csv")
oct_2021 <- read.csv("../input/cyclistic-bike-share/202110-divvy-tripdata.csv")
nov_2021 <- read.csv("../input/cyclistic-bike-share/202111-divvy-tripdata.csv")
dec_2021 <- read.csv("../input/cyclistic-bike-share/202112-divvy-tripdata.csv")
jan_2022 <- read.csv("../input/cyclistic-bike-share/202201-divvy-tripdata.csv")
feb_2022 <- read.csv("../input/cyclistic-bike-share/202202-divvy-tripdata.csv")
mar_2022 <- read.csv("../input/cyclistic-bike-share/202203-divvy-tripdata.csv")
apr_2022 <- read.csv("../input/cyclistic-bike-share/202204-divvy-tripdata.csv")
may_2022 <- read.csv("../input/cyclistic-bike-share/202205-divvy-tripdata.csv")


In [3]:
# Combine all the source files and assign them to one variable

bike_rides <- rbind(jun_2021,jul_2021,aug_2021,sep_2021,oct_2021,nov_2021,dec_2021,jan_2022,feb_2022,mar_2022,apr_2022,may_2022)
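
The twelve explicit `read.csv()` calls could also be written as a loop over the files in the input folder. The following is only a sketch under assumptions: `read_monthly_trips()` is a hypothetical helper, the file-name pattern is inferred from the file names used above, and the demonstration writes two tiny made-up files to a temporary folder rather than touching the real data.

```r
# Hypothetical helper: read every monthly trip file in a folder and stack them.
read_monthly_trips <- function(dir, pattern = "-divvy-tripdata\\.csv$") {
  files <- sort(list.files(dir, pattern = pattern, full.names = TRUE))
  stopifnot(length(files) > 0)
  monthly <- lapply(files, read.csv)
  # rbind() needs identical column names in every file, so check that first
  same_cols <- vapply(
    monthly,
    function(df) identical(names(df), names(monthly[[1]])),
    logical(1)
  )
  stopifnot(all(same_cols))
  do.call(rbind, monthly)
}

# Toy demonstration with two made-up files in a temporary folder
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a", "b"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202106-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "c", member_casual = "member"),
          file.path(demo_dir, "202107-divvy-tripdata.csv"), row.names = FALSE)

demo_rides <- read_monthly_trips(demo_dir)
nrow(demo_rides)  # 3 rows: two from the first file, one from the second
```

Checking the column names before binding makes a schema change in one month fail loudly instead of surfacing later as an `rbind()` error.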


## Phase3: Process

I am using **R** for the analysis. In this phase I will check for possible errors such as NA values in the dataset, remove any empty rows or columns, and do the necessary manipulation and transformation so that I can work with the data effectively and draw meaningful insights.



In [4]:
# let's see the data
str(bike_rides)
summary(bike_rides)
head(bike_rides)
# remove empty rows and empty columns

bike_rides <- janitor::remove_empty(bike_rides,which = c("cols"))
bike_rides <- janitor::remove_empty(bike_rides,which = c("rows"))

In [5]:
# let's convert started_at and ended_at to date-times with ymd_hms(), then split the start date into month, day, year and day of the week as new columns

bike_rides$started_at <- lubridate::ymd_hms(bike_rides$started_at)
bike_rides$ended_at <- lubridate::ymd_hms(bike_rides$ended_at)
bike_rides$month <- format(as.Date(bike_rides$started_at), "%m")
bike_rides$day <- format(as.Date(bike_rides$started_at), "%d")
bike_rides$year <- format(as.Date(bike_rides$started_at), "%Y")
bike_rides$day_of_week <- format(as.Date(bike_rides$started_at), "%A")

In [6]:
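# getting the ride length in seconds (units are fixed explicitly so difftime() cannot pick minutes or hours):

bike_rides$ride_length <- as.numeric(difftime(bike_rides$ended_at, bike_rides$started_at, units = "secs"))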

# Inspect the structure of the columns

str(bike_rides)
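
A note on units: `difftime()` chooses its units automatically unless the `units` argument is given, so a result is not guaranteed to be in seconds. The small sketch below uses two made-up timestamps to show the difference.

```r
# Two made-up timestamps, 15 minutes and 30 seconds apart
start <- as.POSIXct("2021-06-01 08:00:00", tz = "UTC")
end   <- as.POSIXct("2021-06-01 08:15:30", tz = "UTC")

auto_units <- difftime(end, start)                  # units chosen automatically
in_seconds <- difftime(end, start, units = "secs")  # units fixed to seconds

units(auto_units)       # "mins" for this pair
as.numeric(in_seconds)  # 930
```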

In [7]:
# Remove "bad" data
# The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative
# We will create a new version of the dataframe since data is being removed

bike_rides_v2 <- bike_rides[!(bike_rides$start_station_name == "HQ QR" | bike_rides$ride_length<0),]

str(bike_rides_v2)

In [8]:
# creating a new dataframe that keeps only rides with ride_length greater than 0 and drops rows containing NA

bike_rides_new <- bike_rides_v2 %>% filter(ride_length > 0) %>% drop_na()
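
One caveat worth checking: with its default settings, `read.csv()` turns blank fields into `NA` only in numeric and logical columns, while blank text fields are kept as empty strings. `drop_na()` therefore may leave rows with blank station names in place. The sketch below uses a made-up three-row data frame (the station names are placeholders) to show the effect and one possible way to treat empty strings as missing. Whether such rows should be dropped depends on the question being asked.

```r
library(dplyr)
library(tidyr)

# Made-up data: one blank station name, one missing coordinate
toy <- data.frame(
  ride_id = c("a", "b", "c"),
  start_station_name = c("Clark St", "", "State St"),
  end_lat = c(41.9, 41.8, NA),
  stringsAsFactors = FALSE
)

nrow(drop_na(toy))  # 2: only the row with NA in end_lat is dropped

# Convert empty strings to NA in every character column, then drop
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  drop_na()
nrow(toy_clean)     # 1
```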

Now the dataset is clean, so we will move to the next phase.

## Phase4: Analyze

Now let's analyze the cleaned dataset to get meaningful insights.

In [9]:
# Descriptive analysis on ride_length (all figures in seconds) 
# calculate mean,median,max and min of ride_length

mean(bike_rides_new$ride_length) #straight average (total ride length / rides)
median(bike_rides_new$ride_length) #midpoint number in the ascending array of ride lengths
max(bike_rides_new$ride_length) #longest ride
min(bike_rides_new$ride_length) #shortest ride

In [10]:
# Compare members and casual users

aggregate(bike_rides_new$ride_length ~ bike_rides_new$member_casual, FUN = mean)
aggregate(bike_rides_new$ride_length ~ bike_rides_new$member_casual, FUN = median)
aggregate(bike_rides_new$ride_length ~ bike_rides_new$member_casual, FUN = max)
aggregate(bike_rides_new$ride_length ~ bike_rides_new$member_casual, FUN = min)
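
The same comparison can be written with the `data` argument of `aggregate()`, which keeps the output column names short. This is a minimal sketch on a made-up four-row data frame, not the real trip data.

```r
# Made-up ride lengths in seconds
toy_rides <- data.frame(
  member_casual = c("casual", "casual", "member", "member"),
  ride_length = c(1200, 1800, 600, 900)
)

# Mean ride length per rider type; columns are named member_casual and ride_length
toy_means <- aggregate(ride_length ~ member_casual, data = toy_rides, FUN = mean)
toy_means
```

For this toy input the result has one row per rider type, with means of 1500 for casual and 750 for member.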

**Analysis**:
The mean, median and max values of ride_length are higher for casual riders than for annual members.

In [11]:
# See the average ride time by each day for members vs casual users
aggregate(bike_rides_new$ride_length ~ bike_rides_new$member_casual + bike_rides_new$day_of_week, FUN = mean)

# Notice that the days of the week are out of order. Let's fix that.
bike_rides_new$day_of_week <- ordered(bike_rides_new$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))


**Analysis**: On every day of the week, casual riders have a higher mean ride_length than annual members.

In [12]:
# Now, let's run the average ride time by each day for members vs casual users

aggregate(bike_rides_new$ride_length ~ bike_rides_new$member_casual + bike_rides_new$day_of_week, FUN = mean)
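
The reordering works because an ordered factor sorts by its declared levels rather than alphabetically. A minimal sketch with a made-up vector of day names:

```r
days <- c("Monday", "Sunday", "Saturday", "Monday")
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")

sort(unique(days))                        # alphabetical: "Monday" "Saturday" "Sunday"

days_ordered <- ordered(days, levels = week_levels)
as.character(sort(unique(days_ordered)))  # calendar order: "Sunday" "Monday" "Saturday"
```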


In [13]:
# analyze ridership data by type and weekday

bike_rides_new %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%      # creates weekday field using wday()
  group_by(member_casual, weekday) %>%                      # groups by usertype and weekday
  summarize(number_of_rides = n()                           # calculates the number of rides
            ,average_duration = mean(ride_length)) %>%      # calculates the average duration
  arrange(member_casual, weekday)                           # sorts

In [14]:
# Let's visualize the number of rides by rider type
bike_rides_new %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")+
  labs(title = "Number of Rides Vs Weekday by user type")

**Analysis**: On weekends (Saturday and Sunday) casual riders take more rides than annual members. On weekdays (Monday to Friday), annual members take more rides.

In [15]:
# Let's create a visualization for average duration
bike_rides_new %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")+
  labs(title = "Average Duration Vs Weekday by user type")


**Analysis**: Casual riders have a longer average ride duration than annual members on every day of the week.

In [16]:

# let's check the bike type usage by user type:

bike_rides_new %>%
    filter(rideable_type=="classic_bike" | rideable_type=="electric_bike") %>%
    group_by(member_casual,rideable_type) %>%
    summarise(totals=n(), .groups="drop")  %>%

ggplot()+
    geom_col(aes(x=member_casual,y=totals,fill=rideable_type), position = "dodge") + 
    labs(title = "Bike type usage by both users",x="User type",y=NULL, fill="Bike type")

**Analysis**: Classic bikes are used more than electric bikes by both casual riders and annual members.

In [17]:
#And their usage by both users on each day of the week:

bike_rides_new %>%
    filter(rideable_type=="classic_bike" | rideable_type=="electric_bike") %>%
    mutate(weekday = wday(started_at, label = TRUE)) %>% 
    group_by(member_casual,rideable_type,weekday) %>%
    summarise(totals=n(), .groups="drop") %>%

ggplot(aes(x=weekday,y=totals, fill=rideable_type)) +
  geom_col(position = "dodge") + 
  facet_wrap(~member_casual) +
  labs(title = "Bike type usage by both users on different days of the week",x="Day of week",y=NULL)
 

**Analysis**: Classic bikes are widely used by annual members.

## Phase5: Share

Sharing my analysis:


* Casual riders have a longer average ride duration on every day of the week and take more rides on weekends, which suggests that they use bike sharing as a leisure activity or for tourism.

* Annual members use bike-share more on weekdays, mostly on classic bikes, which suggests that they use bike sharing for commuting or other practical purposes.

I would share my analysis with the stakeholders. To convert casual riders into annual members, I would suggest focusing on weekend promotional offers for annual memberships, since weekends are when casual riders ride most.

### Thanks for reading and I hope you liked it!!!