# Case Study: How Does a Bike-Share Navigate Speedy Success?


**Note:** This is a Google data analytics certificate case study.  
**Kaggle link:**
https://www.kaggle.com/code/evgenevgen/bike-share-company-data-analysis-gdac-cs1

## Scenario
According to the scenario, I'm a junior data analyst in the marketing team at Cyclistic, a bike-share company from Chicago.  
Cyclistic offers flexible pricing plans: single-ride passes, full-day passes and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. We also know that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.  
The finance team have concluded that annual members are much more profitable than casual riders. The management believes the company can increase profits by maximizing the number of annual memberships.  
Therefore, I have to understand how casual riders and annual members use Cyclistic bikes differently and how we can convert casual riders into annual members by a new marketing strategy. To do this, I'll analyze the Cyclistic historical bike trip data.

# PART 1 - ASK
### **1. What is the main task?**  
* To identify how casual customers and members use bikes differently.
* To figure out why casual riders would buy Cyclistic annual memberships.
* To suggest recommendations on how to convert casual customers to members.  
  
### **2. Who are the key stakeholders?**
* Lily Moreno: The director of marketing and my manager.
* Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
* Cyclistic marketing analytics team.

# PART 2, 3 - PREPARE, PROCESS  
### **1. Where is the data?**  
* I will use the previous 12 months (**from October 2020 to September 2021**) of Cyclistic trip data from [here](https://divvy-tripdata.s3.amazonaws.com/index.html).
  
  (Note: The datasets have a different name because Cyclistic is a fictional company.)  

### **2. How is the data organized?** 
* 12 .csv files with 13 columns each.  

### **3. Is the data ROCCC?**  
* **R**eliable - yes (without bias).
* **O**riginal - yes. The data has been made available by Motivate International Inc. under this [license](https://www.divvybikes.com/data-license-agreement).
* **C**omprehensive - not entirely (no customer or financial information; some empty and NA values, and duplicates). 
* **C**urrent - yes, updated monthly.
* **C**ited - yes.  

### **4. What tools to choose and why?**
* I'll choose R because it's possible to clean, transform, analyze and visualize large datasets right in RStudio.

In [None]:
# Load all the needed packages
library(tidyverse) 
library(lubridate) # for working with date-time data
library(skimr) # for summary statistics of the data
library(scales) # for adjusting the display of values on plot axes

In [None]:
# Loading the trips data into R
trips_202010 <- read.csv("../input/cyclistic-tripdata-202010-202109/202010-divvy-tripdata.csv")
trips_202011 <- read.csv("../input/cyclistic-tripdata-202010-202109/202011-divvy-tripdata.csv")
trips_202012 <- read.csv("../input/cyclistic-tripdata-202010-202109/202012-divvy-tripdata.csv")
trips_202101 <- read.csv("../input/cyclistic-tripdata-202010-202109/202101-divvy-tripdata.csv")
trips_202102 <- read.csv("../input/cyclistic-tripdata-202010-202109/202102-divvy-tripdata.csv")
trips_202103 <- read.csv("../input/cyclistic-tripdata-202010-202109/202103-divvy-tripdata.csv")
trips_202104 <- read.csv("../input/cyclistic-tripdata-202010-202109/202104-divvy-tripdata.csv")
trips_202105 <- read.csv("../input/cyclistic-tripdata-202010-202109/202105-divvy-tripdata.csv")
trips_202106 <- read.csv("../input/cyclistic-tripdata-202010-202109/202106-divvy-tripdata.csv")
trips_202107 <- read.csv("../input/cyclistic-tripdata-202010-202109/202107-divvy-tripdata.csv")
trips_202108 <- read.csv("../input/cyclistic-tripdata-202010-202109/202108-divvy-tripdata.csv")
trips_202109 <- read.csv("../input/cyclistic-tripdata-202010-202109/202109-divvy-tripdata.csv")


### **5. What problems does the data have?**  
* Let's inspect our data and fix possible errors.

In [None]:
# Quick inspection of the data
print("October, 2020")
glimpse(trips_202010)
print("November, 2020")
glimpse(trips_202011)
print("December, 2020")
glimpse(trips_202012)
print("January, 2021")
glimpse(trips_202101)
print("February, 2021")
glimpse(trips_202102)
print("March, 2021")
glimpse(trips_202103)
print("April, 2021")
glimpse(trips_202104)
print("May, 2021")
glimpse(trips_202105)
print("June, 2021")
glimpse(trips_202106)
print("July, 2021")
glimpse(trips_202107)
print("August, 2021")
glimpse(trips_202108)
print("September, 2021")
glimpse(trips_202109)

**Note:**  
We need to unite our monthly data into one data frame, but the *start_station_id* and *end_station_id* columns in October and November of 2020 are **integer**, while in the other tables they are **character**. Let's fix this so the tables combine correctly.
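
A quick way to confirm this kind of mismatch is to compare the class of the column across tables. The following is a minimal base R sketch on toy data: the data frames `oct` and `dec` and their values are hypothetical stand-ins, not the real monthly files.

```r
# Toy stand-ins for two monthly tables (hypothetical values)
oct <- data.frame(ride_id = c("a1", "a2"),
                  start_station_id = c(13L, 45L),
                  stringsAsFactors = FALSE)
dec <- data.frame(ride_id = c("b1", "b2"),
                  start_station_id = c("TA1307", "KA1503"),
                  stringsAsFactors = FALSE)

monthly <- list(oct = oct, dec = dec)

# Class of the station id column in each table
id_classes <- sapply(monthly, function(df) class(df$start_station_id))
print(id_classes)

# Convert the id column to character in every table, then stack them
monthly_chr <- lapply(monthly, function(df) {
  df$start_station_id <- as.character(df$start_station_id)
  df
})
combined <- do.call(rbind, monthly_chr)
str(combined)
```

The same check can be run on the real monthly tables by putting them in a list.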

In [None]:
# Converting int to chr (assign the result back so the change is kept)
trips_202010 <- trips_202010 %>%
  mutate(start_station_id = as.character(start_station_id), 
         end_station_id = as.character(end_station_id))

trips_202011 <- trips_202011 %>%
  mutate(start_station_id = as.character(start_station_id), 
         end_station_id = as.character(end_station_id))

# Check the converted columns
print("October, 2020")
trips_202010 %>%
  as_tibble() %>% 
  select(start_station_id, end_station_id) %>%
  head(2)

print("November, 2020")
trips_202011 %>%
  as_tibble() %>% 
  select(start_station_id, end_station_id) %>%
  head(2)

In [None]:
# Unite all the df's into one
trips_total_raw <- rbind(trips_202010, trips_202011, trips_202012, trips_202101, trips_202102, 
                     trips_202103, trips_202104, trips_202105, trips_202106, trips_202107, 
                     trips_202108, trips_202109)

In [None]:
# Explore the structure of the combined table
str(trips_total_raw)
summary(trips_total_raw)

**Note:**  
There are some NAs in *end_lat* and *end_lng*. Also, the *started_at* and *ended_at* values are character, though they should be in date-time format. We'll fix this.  
But first, let's check all the columns for empty or NA values and for duplicates.
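
Note that NA values and empty strings are different things: `is.na()` does not flag `""`. Here is a minimal sketch of counting both per column, using a hypothetical toy data frame `toy` rather than the real trip data.

```r
# Toy data: one empty station name and one missing coordinate (hypothetical values)
toy <- data.frame(
  start_station_name = c("Clark St & Elm St", "", "Millennium Park"),
  end_lat = c(41.90, NA, 41.88),
  stringsAsFactors = FALSE
)

# NA values per column
na_counts <- colSums(is.na(toy))

# Empty strings per column (NA comparisons are ignored)
empty_counts <- colSums(toy == "", na.rm = TRUE)

print(na_counts)
print(empty_counts)
```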

#### **Duplicates**

In [None]:
# Duplicates
print("Duplicates in ride_id:")
sum(duplicated(trips_total_raw$ride_id))
# Recheck
length(unique(trips_total_raw$ride_id)) == nrow(trips_total_raw)

In [None]:
# Deleting duplicates
non_duplicated_trips <- trips_total_raw[!duplicated(trips_total_raw$ride_id), ]

In [None]:
# Check for duplicates in a non-duplicated data frame
print("Duplicates in ride_id:")
sum(duplicated(non_duplicated_trips$ride_id))
length(unique(non_duplicated_trips$ride_id)) == nrow(non_duplicated_trips)
glimpse(non_duplicated_trips)

#### **NA values**

In [None]:
# Checking the NA values in the variables
sum(is.na(non_duplicated_trips))
colSums(is.na(non_duplicated_trips))

In [None]:
# Let's remove all NA values
trips_cleaned <- drop_na(non_duplicated_trips)

In [None]:
# Checking the NA values in the variables
sum(is.na(trips_cleaned))
colSums(is.na(trips_cleaned))

In [None]:
# Did we lose too much data?
data_loss_percent <- (nrow(non_duplicated_trips)-nrow(trips_cleaned)) / nrow(non_duplicated_trips) * 100
data_loss_percent # No. Less than 2% of the rows were lost, so we can drop them

#### **Get date-time values**
  * We'll have to work with time data. So let's convert our time columns (started_at and ended_at) from character to date-time.
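
Once both columns are date-times, a ride length is a simple subtraction. One thing to watch is the unit: plain subtraction lets R pick the unit automatically. The sketch below uses base R's `as.POSIXct` in place of `ymd_hms` to stay self-contained; the two timestamps are made up for illustration.

```r
# Two hypothetical timestamps, 32.5 minutes apart
started <- as.POSIXct("2021-07-04 10:15:00", tz = "UTC")
ended   <- as.POSIXct("2021-07-04 10:47:30", tz = "UTC")

# Explicit units: always seconds, returned as a plain number
ride_secs <- as.numeric(difftime(ended, started, units = "secs"))
print(ride_secs)

# Plain subtraction chooses the unit automatically
auto_diff <- ended - started
print(units(auto_diff))
```

Asking for seconds explicitly keeps later comparisons (for example against a threshold in seconds) unambiguous.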

In [None]:
# Use the lubridate package
trips_cleaned$started_at <- ymd_hms(trips_cleaned$started_at)
trips_cleaned$ended_at <- ymd_hms(trips_cleaned$ended_at)

In [None]:
# Check the new time columns class
class(trips_cleaned$started_at) 
class(trips_cleaned$ended_at)
# It's date-time class now

### **6. Adding some valuable columns into our data frame**

In [None]:
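# Computing the ride length in seconds (explicit units, so the value never depends on difftime's automatic unit choice)
trips_cleaned$ride_length <- as.numeric(difftime(trips_cleaned$ended_at, trips_cleaned$started_at, units = "secs"))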

# Let's calculate day of week, month, year that each ride started - to provide additional opportunities to aggregate the data
# Change language of weekday output
# Sys.setlocale("LC_TIME", "English")

trips_cleaned$day_of_week <- wday(trips_cleaned$started_at, label = TRUE, abbr = FALSE) # day of week
trips_cleaned$month <- month(trips_cleaned$started_at, label = TRUE, abbr = FALSE) # month
trips_cleaned$year <- year(trips_cleaned$started_at) # year

head(trips_cleaned)

In [None]:
summary(trips_cleaned)

### **7. Organizing the data**

In [None]:
# First, let's convert our data frame into a tibble to simplify its display
trips <- as_tibble(trips_cleaned)
# trips # oops, it doesn't work in Kaggle
head(trips, 5)

In [None]:
# Let's arrange our data to find some outliers
trips %>%
  arrange(ride_length) %>%
  head(3)

# Count negative values in ride_length column
sum(trips$ride_length < 0)

In [None]:
# There are a lot of negative values. These are errors. I also propose removing rides that are too short: 5 seconds or less.
trips_filtered <- trips %>%
                    filter(ride_length > 5)

# Count wrong values in ride_length column (should be 0)
sum(trips_filtered$ride_length <= 5)

head(trips_filtered, 5)

#### **Simplifying my typing work by renaming the data frame**

In [None]:
t_f <- trips_filtered

#### **Continue organizing the table**  
I've noticed some suspicious station names containing "TESTING". I assume these are test rides by Cyclistic's specialists. We won't include them in our analysis.
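
The idea of the filter can be shown on a small vector first. This is a minimal sketch with illustrative station names; `fixed = TRUE` is my addition and simply makes `grepl` match the literal text instead of treating the pattern as a regular expression.

```r
# Illustrative station names; the second one stands in for a test station
stations <- c("Clark St & Elm St", "WAREHOUSE TESTING DOCK", "", "Millennium Park")

# TRUE where the name contains the literal text "TEST"
is_test <- grepl("TEST", stations, fixed = TRUE)

# Keep everything that is not a test station
kept <- stations[!is_test]
print(kept)
```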

In [None]:
# Finding stations with "TEST" in their names
trips_tested <- t_f %>% 
  filter(grepl("TEST", start_station_name) | grepl("TEST", end_station_name))

glimpse(trips_tested) # we have 288 test trips. Let's remove them from our table

In [None]:
# Remove test trips from our table
t_f_v2 <- t_f %>% 
  filter(!(grepl("TEST", start_station_name) | grepl("TEST", end_station_name)))

# Check for test stations
trips_tested_v2 <- t_f_v2 %>% 
  filter(grepl("TEST", start_station_name) | grepl("TEST", end_station_name))

glimpse(trips_tested_v2) # 0

#### **Removing errors**  
I also propose removing trips that are longer than one day (24 x 60 x 60 = 86,400 seconds), because they're probably errors and not representative.
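
Taken together, the two duration rules (drop rides of 5 seconds or less, drop rides longer than a day) keep only values in the interval (5, 86400]. A minimal sketch on a made-up vector of ride lengths in seconds:

```r
# Hypothetical ride lengths in seconds, including values on both boundaries
ride_length <- c(-120, 3, 5, 6, 759, 86400, 90000)

day_secs <- 24 * 60 * 60

# Keep rides longer than 5 seconds and no longer than one day
keep <- ride_length > 5 & ride_length <= day_secs
valid_rides <- ride_length[keep]
print(valid_rides)
```

A ride of exactly 5 seconds is dropped and a ride of exactly one day is kept, which matches the two filters in this notebook.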

In [None]:
# Deleting too long trips
t_f_v2 <- t_f_v2[!(t_f_v2$ride_length > (24*60*60)), ]

head(sort(t_f_v2$ride_length, decreasing = TRUE), n=50) # check (a day is 86,400 seconds)

# PART 4, 5 - ANALYZE, SHARE  
Now we have a summary file with clean data. My goal is to identify any surprises, trends or relationships in the data and get some valuable insights that will help the stakeholders make decisions. 

#### **Descriptive analysis of ride_length**

In [None]:
print("Average ride duration:")
mean(t_f_v2$ride_length) # straight average (total ride length / rides) - 1220.715 (s)
print("Median of ride duration:")
median(t_f_v2$ride_length) # midpoint number in the ascending array of ride lengths - 759 (s)
print("Shortest ride duration:")
min(t_f_v2$ride_length)  #shortest ride - 6 (s)
print("Longest ride duration:")
max(t_f_v2$ride_length) #longest ride - 86394 (s)

### **Identifying how casual customers and members use bikes differently**

In [None]:
# Setting my plot theme
plot_theme = theme(
    plot.title = element_text(size=20, face = 'bold'), 
    plot.subtitle = element_text(size=10, color = 'gray', face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    strip.text.x = element_text(size=10), 
    strip.text.y = element_text(size=10),
    legend.title = element_text(size=18), 
    legend.text = element_text(size=16)
)

options(repr.plot.width = 12, repr.plot.height = 10)

* **Number of trips by type of customer**

In [None]:
t_f_v2 %>% 
  group_by(member_casual) %>% 
  summarise(number_of_trips = n()) %>% 
  ggplot(aes(x=member_casual, y=number_of_trips, fill=member_casual)) + 
  geom_col(position = "dodge") + 
  labs(title = "Total Trips: Members vs. Casual Riders",
       x = "Type of Rider", y = "Number", fill = "Rider Type", 
       caption = "source: Motivate International Inc.") + 
  scale_y_continuous(label=comma) + 
  geom_text(aes(label=comma(number_of_trips)), position = position_stack(vjust = 0.95), size = 5) + 
  plot_theme

**Analyze:**   
Annual members take more rides in total (about **1.2** times as many as casual riders).

* **Average trip duration by rider type**

In [None]:
t_f_v2 %>% 
  group_by(member_casual) %>% 
  summarise(average_ride_length = mean(ride_length)) %>% 
  ggplot(aes(x=member_casual, y=average_ride_length, fill=member_casual)) + 
  geom_col(position = "dodge") + 
  labs(title = "Average Ride Duration: Members vs. Casual Riders",
       x = "Type of Rider", y = "Average Ride Length (s)", fill = "Rider Type", 
       caption = "source: Motivate International Inc.") +
  scale_y_continuous(label=comma) + 
  geom_text(aes(label=round(average_ride_length,2)), position = position_stack(vjust = 0.95), size = 5) +
  plot_theme

**Analyze:**  
On average, casual riders' rides are about **2** times longer (1676 s vs. 833 s) than members' rides, though members take more rides in total.

* **Total ride duration by type of riders**

In [None]:
t_f_v2 %>% 
  group_by(member_casual) %>% 
  summarise(sum_ride_duration = sum(ride_length)/60/60) %>% 
  ggplot(aes(x = member_casual, y = sum_ride_duration, fill=member_casual)) + 
  geom_col(position = "dodge") +
  scale_y_continuous(labels = comma) + 
  labs(x = "Rider Type", y = "Total Ride Duration (hours)",
       title = "Total Ride Duration: Members vs. Casual Riders", 
       caption = "source: Motivate International Inc.") + 
  geom_text(aes(label=round(sum_ride_duration,2)), position = position_stack(vjust = 0.95), size = 5) +
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  )

**Analyze:**  
We can see that casual riders' total ride duration is much bigger (**1.71** times) than members'.  
This is interesting, considering that annual members are much more profitable than casual riders (according to the finance team).
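
As a sanity check, this ratio should follow from the other figures in this notebook: total duration is the number of rides times the average duration. The sketch below uses the rounded averages from the previous chart (1676 s and 833 s) and the 54% / 46% split of rides reported in the next subsection, so the result is only approximate.

```r
# Rounded figures taken from this notebook's own results
avg_casual <- 1676   # average casual ride, seconds
avg_member <- 833    # average member ride, seconds
share_casual <- 0.46 # share of all rides
share_member <- 0.54

# Total duration is proportional to share of rides times average duration
duration_ratio <- (share_casual * avg_casual) / (share_member * avg_member)
print(round(duration_ratio, 2))

# The same figures should also reproduce the overall mean ride length
overall_mean <- share_casual * avg_casual + share_member * avg_member
print(round(overall_mean, 1))
```

Both values land close to the reported 1.71 ratio and the roughly 1220 s overall mean, so the three charts are consistent with each other.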

* **How are rides split between rider types?**

In [None]:
t_f_v2 %>%
  group_by(member_casual) %>%
  summarize(total_by_type = n()) %>%
  mutate(overall_total = sum(total_by_type)) %>% 
  group_by(member_casual) %>%
  summarize(percent_total = total_by_type/overall_total) %>% 
  ggplot(aes(fill=member_casual, y=percent_total, x="")) + 
  geom_bar(position="fill", stat="identity") +
  geom_text(aes(label = percent(percent_total)), 
            position = position_stack(vjust = 0.5), size = 10) + 
  labs(x = "", y = "Percent",
       title = "Different Customers Distribution", 
       caption = "source: Motivate International Inc.") + 
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  ) + 
  scale_y_continuous(labels = percent)

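**Analyze:**  
Members account for more rides than casual customers. Note that the data contains no customer identifiers, so these shares describe rides, not individual people.  
We can see that **54%** of rides are taken by annual members and **46%** by riders using single-ride or full-day passes.  
Although casual riders account for almost half of all rides, the finance team has concluded that they are less profitable than members.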
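
The shares themselves are just the proportion of rows of each rider type. A minimal sketch on a made-up vector of 100 rides (the vector `rider` is hypothetical and only mirrors the percentages above):

```r
# Hypothetical rider-type column: 54 member rides and 46 casual rides
rider <- c(rep("member", 54), rep("casual", 46))

# The mean of a logical vector is the proportion of TRUE values
ride_share <- c(member = mean(rider == "member"),
                casual = mean(rider == "casual"))
print(ride_share)
```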

### **Analyze by day of week**

In [None]:
# First, let's order our weekdays in a normal order
t_f_v2$day_of_week <- ordered(t_f_v2$day_of_week, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

* **Average ride time and number of rides by client type during a week**

In [None]:
t_f_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(num_of_rides = n(),
            average_ride_time = format(round(mean(ride_length), 4), nsmall = 4), .groups = 'drop') %>% 
  arrange(day_of_week)

* **Number of Rides by Day of Week**

In [None]:
# Visualize the number of rides by day and type of rider
t_f_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(num_of_rides = n(), .groups = 'drop') %>% 
  arrange(day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = num_of_rides, fill = member_casual)) + 
  geom_col(position = "dodge") + 
  scale_y_continuous(labels = comma) + # changing y_axis numbers format
  labs(title = "Number of Rides by Day of Week: Members vs. Casual Riders",
       x = "Day of Week", y = "Number", fill = "Rider Type", 
       caption = "source: Motivate International Inc.") +
  theme(
    plot.title = element_text(size=16, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18), 
    legend.text = element_text(size=16)
) + 
  theme(axis.text.x = element_text(angle=45))

**Analyze:**  
We can see that casual riders use bikes mostly on weekends, which supports the theory that they use the Cyclistic service for leisure or sightseeing trips (or for exercise).  
Members use bikes almost equally throughout the week, with peaks on workdays.
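
To compare weekends with workdays directly, a weekend flag can be derived from the date. Weekday names depend on the system locale (the reason for the `Sys.setlocale` comment earlier), so this minimal base R sketch uses the ISO weekday number instead, where Monday is 1 and Sunday is 7. The dates are arbitrary examples.

```r
# Three example dates: a Saturday, a Sunday and a Monday
dates <- as.Date(c("2021-07-03", "2021-07-04", "2021-07-05"))

# ISO weekday number (1 = Monday, ..., 7 = Sunday), independent of locale
iso_day <- as.integer(format(dates, "%u"))

# Saturday (6) and Sunday (7) count as weekend
is_weekend <- iso_day >= 6
print(data.frame(dates, iso_day, is_weekend))
```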

* **Visualize average ride duration by day and type of rider**

In [None]:
t_f_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(avg_ride_time = (mean(ride_length)/60), .groups = 'drop') %>% 
  arrange(day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = avg_ride_time, fill = member_casual)) + 
  geom_col(position = "dodge") + 
  labs(title = "Average Ride Duration During a Week: Members vs. Casual Riders", 
       x = "Day of Week", y = "Mins", fill = "Rider Type", 
       caption = "source: Motivate International Inc.") + 
  theme(
    plot.title = element_text(size=16, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
    ) +
  theme(axis.text.x = element_text(angle=45))

**Analyze:**  
We can see that the average ride duration of casual riders is much higher than members'. This reinforces the hypothesis that casual riders use bikes mostly for **leisure trips** (or exercise / tourism) and members use bikes mostly for **practical purposes** (e.g. getting to work).

### **Analyze by month. Seasonal trends**

* **Number of rides by month**

In [None]:
t_f_v2 %>% 
  group_by(member_casual, month) %>% 
  summarise(num_of_rides = n(), .groups = 'drop') %>% 
  arrange(month) %>% 
  ggplot(aes(x = month, y = num_of_rides, fill = member_casual)) + 
  geom_col(position = "dodge") + 
  labs(title = "Number of Rides by Months: Members vs. Casual Riders",
       x = "Month", y = "Number", fill = "Rider Type", 
       caption = "source: Motivate International Inc.") +
  scale_y_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle=45)) + 
  theme(
    plot.title = element_text(size=16, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  )

**Analyze:**  
Ride counts are highest in **summer** (the peak is in July). Thus, it's likely the best time to start our marketing campaign.

* **Average ride duration by month and type of rider**

In [None]:
t_f_v2 %>% 
  group_by(member_casual, month) %>% 
  summarise(avg_ride_time = (mean(ride_length)/60), .groups = 'drop') %>% 
  arrange(month) %>% 
  ggplot(aes(x = month, y = avg_ride_time, fill = member_casual)) + 
  geom_col(position = "dodge") + 
  labs(title = "Average Ride Duration by Months: Members vs. Casual Riders",
       x = "Month", y = "Average Ride Duration (mins)", fill = "Rider Type", 
       caption = "source: Motivate International Inc.") +
  theme(axis.text.x = element_text(angle=45)) + 
  theme(
    plot.title = element_text(size=16, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  )

**Analyze:**  
Members' average ride duration stays almost constant throughout the year, with a slight decline in December and January. Interestingly, the peak month by ride duration for members is **February**.  
  
  Casual riders' average duration peaks in **springtime** (longer than 30 mins). One possible explanation is that summer weather is too hot for long trips; the later growth in duration in autumn is consistent with this.
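
To summarise by season rather than by month, each month number can be mapped to a season. This is a minimal sketch assuming meteorological seasons (December to February is winter, and so on); that grouping and the helper name `season_of` are my assumptions, not something defined in the data.

```r
# Hypothetical helper: month number (1-12) to meteorological season
season_of <- function(month_number) {
  lookup <- c("winter", "winter", "spring", "spring", "spring", "summer",
              "summer", "summer", "autumn", "autumn", "autumn", "winter")
  lookup[month_number]
}

# February, July and October
seasons <- season_of(c(2, 7, 10))
print(seasons)
```

With a numeric month column, the result could be used as an extra grouping variable alongside the rider type.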

* **Number of rides by day of week during a year**

In [None]:
options(repr.plot.width = 20, repr.plot.height = 15)

t_f_v2 %>% 
  group_by(month, day_of_week, member_casual) %>% 
  summarise(num_of_rides = n(), .groups = 'drop') %>%
  drop_na() %>% 
  ggplot(aes(x = day_of_week, y = num_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = comma) +
  facet_grid(member_casual~month) +
  labs(x = "Day of Week", y = "Number of Rides", fill = "Member/Casual",
       title = "Number of Rides by Day During a Year: Members vs. Casual Riders", 
       caption = "source: Motivate International Inc.") +
  theme(axis.text.x = element_text(angle = 90)) + 
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16),
    strip.text.x = element_text(size=15),
    strip.text.y = element_text(size=15)
  )

**Analyze:**  
Though the total number of rides by annual members is bigger than that of casual riders, casual riders need more bikes on **weekends** from **May to September**.

### **Analyze by hour**

* **Number of rides by hour**

In [None]:
# Create hour column in our df
t_f_v2$start_hour <- hour(t_f_v2$started_at)
t_f_v2$end_hour <- hour(t_f_v2$ended_at)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 10)

t_f_v2 %>% 
  group_by(member_casual, start_hour) %>% 
  summarise(num_of_rides = n(), .groups = 'drop') %>% 
  arrange(start_hour) %>% 
  ggplot(aes(x=start_hour, y=num_of_rides, fill=member_casual)) + 
  geom_col(position = "dodge") +
  labs(title = "Number of Rides by Hour: Members vs. Casual Riders",
       x = "Day Time (h)", y = "Number of Rides", fill = "Rider Type", 
       caption = "source: Motivate International Inc.") + 
  scale_y_continuous(labels = comma) + 
  scale_x_continuous(breaks = 0:23) +
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  )

**Analyze:**  
Bike usage rises around 06:00-07:00 for members (but not for casual riders). Both groups peak between **15:00 and 18:00**.  
  
  Casual riders' busiest hours (more than **150,000** rides) start after 12:00 and last until 19:00. In the evening and at night, casual riders use bikes more than members.

* **Average ride duration by hour**

In [None]:
t_f_v2 %>% 
  group_by(member_casual, start_hour) %>% 
  summarise(avg_ride_time = (mean(ride_length)/60), .groups = 'drop') %>% 
  arrange(start_hour) %>% 
  ggplot(aes(x=start_hour, y=avg_ride_time, fill=member_casual)) + 
  geom_col(position = "dodge") +
  labs(x = "Day Time (h)", y = "Average Ride Duration (mins)", fill = "Rider Type", 
       title = "Average Ride Duration by Hour: Members vs. Casual Riders", 
       caption = "source: Motivate International Inc.") + 
  scale_x_continuous(breaks = 0:23) +
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  )

**Analyze:**  
Casual riders' average ride duration is higher than members' throughout the day.  
Members' average ride duration stays nearly the same (less than 15 minutes) all day.  
Casual riders' average ride duration peaks (more than **30 minutes**) at around **11:00-15:00**.  
So the best time for ads is probably between 11:00 and 18:00.
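
The hourly grouping relies on reducing each start time to its hour of the day. A minimal base R sketch on three made-up timestamps (in the notebook this is done with lubridate's `hour()`):

```r
# Three hypothetical ride start times
starts <- as.POSIXct(c("2021-07-04 08:05:00",
                       "2021-07-04 17:20:00",
                       "2021-07-04 17:55:00"), tz = "UTC")

# Hour of the day as an integer (0-23)
start_hour <- as.integer(format(starts, "%H"))

# Number of rides starting in each hour
print(table(start_hour))
```

Rides that start at 17:20 and 17:55 both fall into the 17 bucket, so a bar labelled 17 covers 17:00 to 17:59.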

### **Most popular stations**

* **Top most popular stations**

In [None]:
t_f_v2 %>% 
  group_by(start_station_name, member_casual) %>% 
  summarise(num_of_usage = n(), .groups = 'drop') %>%
  filter(start_station_name != "") %>%
  arrange(-num_of_usage) %>% 
  head(n=10)

**Analyze:**  
The top 5 most popular stations are: 1) "Streeter Dr & Grand Ave"; 2) "Millennium Park"; 3) "Michigan Ave & Oak St"; 4) "Clark St & Elm St"; 5) "Lake Shore Dr & Monroe St".  
We can use them for geotargeting our ad campaign.
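
Because the table above is grouped by both station and rider type, the same station can appear twice (once for casual riders and once for members). To rank stations by total rides regardless of rider type, the counts can be taken per station only. A minimal sketch on made-up data, where the station names "A", "B" and "C" are placeholders:

```r
# Hypothetical rides: placeholder station names, one empty name
rides <- data.frame(
  start_station_name = c("A", "A", "B", "A", "B", "C", ""),
  member_casual = c("casual", "member", "casual", "casual",
                    "member", "member", "casual"),
  stringsAsFactors = FALSE
)

# Drop rides without a station name, then count rides per station
named <- rides$start_station_name[rides$start_station_name != ""]
top <- sort(table(named), decreasing = TRUE)
print(head(top, 2))
```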

* **Visualize the most popular stations for different customers**

In [None]:
# Casual users
t_f_v2 %>% 
  group_by(start_station_name, member_casual) %>% 
  summarise(num_of_usage = n(), .groups = 'drop') %>%
  filter(start_station_name != "") %>%
  filter(member_casual == "casual") %>% 
  arrange(-num_of_usage) %>% 
  head(n=10) %>% 
  ggplot() + 
  geom_col(aes(x = reorder(start_station_name, num_of_usage), y = num_of_usage), fill = "purple") + 
  labs(title = "Top 10 Used Stations by Casual Customers", y = "Number of Rides", x = "", 
       caption = "source: Motivate International Inc.") + 
  coord_flip() + 
  scale_y_continuous(labels = comma) + 
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  )

In [None]:
# Members
t_f_v2 %>% 
  group_by(start_station_name, member_casual) %>% 
  summarise(num_of_usage = n(), .groups = 'drop') %>%
  filter(start_station_name != "") %>%
  filter(member_casual == "member") %>% 
  arrange(-num_of_usage) %>% 
  head(n=10) %>% 
  ggplot() + 
  geom_col(aes(x = reorder(start_station_name, num_of_usage), y = num_of_usage), fill = "darkgreen") + 
  labs(title = "Top 10 Used Stations by Members", y = "Number of Rides", x = "", 
       caption = "source: Motivate International Inc.") + 
  coord_flip() + 
  scale_y_continuous(labels = comma) + 
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  )

### **Different types of bike usage**

* **Check customers' bike type preferences**

In [None]:
t_f_v2 %>% 
  group_by(rideable_type, member_casual) %>% 
  summarise(num_of_usages = n(), .groups = 'drop') %>% 
  ggplot(aes(x = member_casual, y = num_of_usages, fill = rideable_type)) + 
  geom_col(position = "dodge") + 
  labs(x = "Rider Type", y = "Number of Usages",
       title = "Bike Type Usage: Members vs. Casual Riders", 
       caption = "source: Motivate International Inc.") + 
  scale_y_continuous(labels = comma) + 
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16)
  ) + 
  geom_text(aes(x = member_casual, y = num_of_usages, label = comma(num_of_usages), group = rideable_type), 
            position = position_dodge(width = 1), vjust = 1, size = 5)

**Analyze:**  
Both types of riders prefer **classic** bikes.

* **Bike type usage throughout a week**

In [None]:
options(repr.plot.width = 20, repr.plot.height = 15)

t_f_v2 %>% 
  group_by(rideable_type, member_casual, day_of_week) %>% 
  summarise(num_of_usages = n(), .groups = 'drop') %>% 
  ggplot(aes(x = day_of_week, y = num_of_usages, fill = rideable_type)) + 
  geom_col(position = "dodge") + 
  facet_wrap(~member_casual) +
  labs(x = "Day of Week", y = "Number of Usages",
       title = "Bike Type Usage during a Week: Members vs. Casual Riders", 
       caption = "source: Motivate International Inc.") + 
  scale_y_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle=45)) + 
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16),
    strip.text.x = element_text(size=15),
    strip.text.y = element_text(size=15)
  )

**Analyze:**  
We can see that casual riders tend to use more electric bikes on **weekends**, which supports our hypothesis that they use Cyclistic for longer leisure trips, while members use it for more practical purposes.  
And again we see that members' usage of all bike types falls on weekends.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 10)

t_f_v2 %>% 
  group_by(rideable_type, member_casual) %>% 
  summarise(avg_time_spent = mean(ride_length)/60, .groups = 'drop') %>% 
  ggplot(aes(x=rideable_type, y=avg_time_spent, fill = rideable_type)) + 
  geom_col(position = "dodge") + 
  facet_wrap(~member_casual) +
  labs(x = "Bike Type", y = "Ride Duration (mins)",
       title = "Ride Duration by type of bike: Members vs. Casual Riders", 
       caption = "source: Motivate International Inc.") + 
  scale_y_continuous(labels = comma) + 
  theme(
    plot.title = element_text(size=18, face = 'bold'), 
    plot.caption = element_text(size=12, color = 'darkgray', face = 'bold'),
    axis.text.x = element_text(size=15),
    axis.text.y = element_text(size=15),
    axis.title.x = element_text(size=18), 
    axis.title.y = element_text(size=18),
    legend.title = element_text(size=18),
    legend.text = element_text(size=16),
    strip.text.x = element_text(size=15),
    strip.text.y = element_text(size=15)
  ) + 
  geom_text(aes(label = round(avg_time_spent,0)), vjust = 2, size = 5)

**Analyze:**  
Though the most popular type of bike is classic, the "champions" by ride duration for casual customers are docked bikes (**45** minutes).  
Interestingly, members ride all bike types for almost the same time (around **14** mins).  
Also, we can notice that, in general, casual users spend more time per ride than members.

# PART 6 - ACT  
Here are my recommendations on how to convert casual customers to members:  
#### **1. A weekend offer.**  
The most "popular days" are Saturday and Sunday. We can introduce a new, less expensive type of membership for those who use bikes only (or mostly) on weekends, or give weekend members some bonuses such as extra minutes or a discount.  
#### **2. A seasonal offer.**  
Extra minutes, a discount or free water (we've seen shorter rides during hot weather) for summer or spring members.  
#### **3. Special hours offer.**  
The peak time for biking is from 3 PM to 6 PM. We can offer members some bonuses at this time, like a free hour (or 30 minutes).  
Because casual users tend to take longer trips, we might suggest some bonuses for trips longer than 30 minutes.  
#### **4. Geotargeting advertisement**  
We can use digital media for geotargeted advertising near the most popular stations. Stations with over 20,000 total rides: "Streeter Dr & Grand Ave", "Millennium Park", "Michigan Ave & Oak St", "Lake Shore Dr & Monroe St", "Shedd Aquarium" and "Theater on the Lake".  
Also, suggest some extra bonuses (like a coupon in a local store) for new members.  
#### **5. Bike types offers**  
Since electric bikes are less popular with casual customers, we can upgrade them with some extra features to emphasize their merits, e.g. a sightseeing guide (via the app, a QR code or built-in headphones) or a route navigator.  
Docked bikes are "champions" by ride duration, so we can offer some bonuses (lower price, discount, free minutes, a free water bottle, etc.) for members who take long trips.  
Classic bikes are already the most popular, but we can consider offering all the options listed above to members who ride classic bikes.

### THANK YOU!