-
Notifications
You must be signed in to change notification settings - Fork 1
/
Cyclistic_geom_polygon-point.rmd
519 lines (432 loc) · 21.8 KB
/
Cyclistic_geom_polygon-point.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
---
title: "Coursera Capstone Project - Cyclistic"
author: "Kevin Chin"
date: "2023-10-03"
output: html_document
---
***
# Introduction
This case study will use all the skills I developed in the Google Data Analytics Professional Certificate Course to complete the business tasks as a data analyst working for the fictional bike-sharing company **Cyclistic**. I will use the data analysis process of **Ask**, **Prepare**, **Process**, **Analyze**, **Share** and **Act** to answer key business questions so that the company can make data-driven decisions.
***
# Scenario
The Director of Marketing, Lily Moreno, believes the Cyclistic's future success depends on maximizing the number of annual memberships. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.
***
# About the Company
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Moreno believes that maximizing the number of annual members will be key to future growth and she believes there is a very good chance to convert casual riders into members.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand the Cyclistic historical bike trip data to identify trends.
***
# 1. Ask
### Business Task:
Analyze Cyclistic's August 2022 through July 2023 trip data to better understand how annual members and casual riders differ, then use insights to assist with marketing strategies aimed at converting casual riders into annual members.
### Stakeholders
**Lily Moreno**: Director of Marketing. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
**Cyclistic marketing analytics team**: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
**Cyclistic executive team**: The notoriously detail-oriented executive team will decide whether to approve the
recommended marketing program.
***
# 2. Prepare
[Trip data](https://divvy-tripdata.s3.amazonaws.com/index.html) stored by Cyclistic goes back many years, but we will use data available for the last 12 months (August 2022 to July 2023). The data is broken up monthly and stored as 12 .zip files:
* [202208-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202208-divvy-tripdata.zip)
* [202209-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202209-divvy-tripdata.zip)
* [202210-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202210-divvy-tripdata.zip)
* [202211-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202211-divvy-tripdata.zip)
* [202212-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202212-divvy-tripdata.zip)
* [202301-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202301-divvy-tripdata.zip)
* [202302-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202302-divvy-tripdata.zip)
* [202303-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202303-divvy-tripdata.zip)
* [202304-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202304-divvy-tripdata.zip)
* [202305-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202305-divvy-tripdata.zip)
* [202306-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202306-divvy-tripdata.zip)
* [202307-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202307-divvy-tripdata.zip)
The data has been made available by Motivate International Inc. under this
[license](https://www.divvybikes.com/data-license-agreement).
The 12 .zip files were downloaded, stored, and extracted into the /Cyclistic/raw_trip_data/ directory.
***
# 3. Process
### Load data analysis packages
```{r}
library(tidyverse)
library(janitor)
library(scales)
library(mapdata)
library(leaflet)
#library(hms)
```
### Set working directory
```{r}
setwd("D:/R/Cyclistic")
```
### Import trip data from August 2022 to July 2023 into their respective dataframes
```{r}
data_202208 <- read_csv("raw_trip_data/202208-divvy-tripdata.csv", show_col_types = FALSE)
data_202209 <- read_csv("raw_trip_data/202209-divvy-publictripdata.csv", show_col_types = FALSE)
data_202210 <- read_csv("raw_trip_data/202210-divvy-tripdata.csv", show_col_types = FALSE)
data_202211 <- read_csv("raw_trip_data/202211-divvy-tripdata.csv", show_col_types = FALSE)
data_202212 <- read_csv("raw_trip_data/202212-divvy-tripdata.csv", show_col_types = FALSE)
data_202301 <- read_csv("raw_trip_data/202301-divvy-tripdata.csv", show_col_types = FALSE)
data_202302 <- read_csv("raw_trip_data/202302-divvy-tripdata.csv", show_col_types = FALSE)
data_202303 <- read_csv("raw_trip_data/202303-divvy-tripdata.csv", show_col_types = FALSE)
data_202304 <- read_csv("raw_trip_data/202304-divvy-tripdata.csv", show_col_types = FALSE)
data_202305 <- read_csv("raw_trip_data/202305-divvy-tripdata.csv", show_col_types = FALSE)
data_202306 <- read_csv("raw_trip_data/202306-divvy-tripdata.csv", show_col_types = FALSE)
data_202307 <- read_csv("raw_trip_data/202307-divvy-tripdata.csv", show_col_types = FALSE)
```
### Compare column names and data types to confirm they are the same in all 12 dataframes
```{r}
compare_df_cols(data_202208, data_202209, data_202210, data_202211, data_202212, data_202301, data_202302, data_202303, data_202304, data_202305, data_202306, data_202307)
```
### Column names and data types match so the 12 dataframes will be merged into 1
```{r}
#data_trip <- rbind(data_202208, data_202209, data_202210, data_202211, data_202212, data_202301, data_202302, data_202303, data_202304, data_202305, data_202306, data_202307)
data_trip <- data_202208
```
### Remove the 12 monthly dataframes to free up memory
```{r}
rm(data_202208, data_202209, data_202210, data_202211, data_202212, data_202301, data_202302, data_202303, data_202304, data_202305, data_202306, data_202307)
```
### Check merged data
```{r}
head(data_trip)
str(data_trip)
```
### Remove start_station_name, start_station_id, end_station_name, end_station_id because values are not consistently filled in. We can use start/end latitude and start/end longitude to determine locations
```{r}
data_trip <- data_trip %>%
select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
```
### Number of missing values in dataframe
```{r}
sum(is.na(data_trip))
```
### Remove the rows of data that have missing values
```{r}
data_trip <- na.omit(data_trip)
```
### Check for duplicate rows
```{r}
#remove comment tag later
#duplicate_rows <- data_trip[duplicated(data_trip), ]
#str(duplicate_rows)
#rm(duplicate_rows)
```
### Create a new column named ride_length (in minutes) by taking the difference between ended_at and started_at
```{r}
data_trip$ride_length <- as.numeric(as.character(round(difftime(data_trip$ended_at, data_trip$started_at, units="mins"), 2)))
#data_trip$ride_length <- as_hms(difftime(data_trip$ended_at, data_trip$started_at))
str(data_trip)
```
### Check for zero or negative ride_length values
```{r}
negative_rows <- data_trip[data_trip$ride_length<=0, ]
str(negative_rows)
rm(negative_rows)
```
### Remove rows with zero or negative value ride_length
```{r}
data_trip <- data_trip[!(data_trip$ride_length<=0), ]
str(data_trip)
```
### Summary to display minimum, median, mean, maximum, 1st quartile, and 3rd quartile for ride_length in minutes
```{r}
summary(time_length(data_trip$ride_length), unit = "mins" )
```
### Sort ride_length by descending order to find incongruent values
```{r}
newdata <- data_trip[order(-data_trip$ride_length), ]
head(newdata$ride_length, n = 20)
```
It is possible the rider did not return the bike because they needed it for that length of time so we will leave those values in place.
```{r}
remove(newdata)
```
### Create new columns named date (YYY-MM-DD), year (YYYY), month(MM), day(DD), day_of_week (Sunday, Monday, etc.) and hour_started (by hour) using started_at; create coords_start and coords_end
```{r}
data_trip$date <- as.Date(data_trip$started_at)
data_trip$year <- format(as.Date(data_trip$date), "%Y")
data_trip$month <- format(as.Date(data_trip$date), "%m")
data_trip$day <- format(as.Date(data_trip$date), "%d")
data_trip$day_of_week <- format(as.Date(data_trip$date), "%A")
data_trip$hour_started <- hour(data_trip$started_at)
data_trip$coords_start <- paste(data_trip$start_lat, data_trip$start_lng, sep=", ")
data_trip$coords_end <- paste(data_trip$end_lat, data_trip$end_lng, sep=", ")
str(data_trip)
```
***
# 4. Analyze
* Top Start and End Stations per rider type
### Total trips by rider type
```{r}
data_trip %>%
group_by(member_casual) %>%
summarize(total_count = n())
```
### Total trips per bike type
```{r}
data_trip %>%
group_by(rideable_type) %>%
summarize(total_count = n())
```
### Total trips per bike type for each type of rider
```{r}
data_trip %>%
group_by(member_casual, rideable_type) %>%
summarize(total_count = n())
```
### Total trips by year and month
```{r}
data_trip %>%
group_by(month, year) %>%
summarize(total_count = n()) %>%
arrange(year)
```
### Order day of the week to start on Sunday and end on Saturday
```{r}
data_trip$day_of_week <- factor(data_trip$day_of_week,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
```
### Total trips by day of the week
```{r}
data_trip %>%
group_by(day_of_week) %>%
summarize(total_count = n())
```
### Total trips by day of the week for each rider type
```{r}
data_trip$day_of_week <- factor(data_trip$day_of_week,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(total_count = n())
```
### Total trips by hour for each rider type
```{r}
data_trip %>%
group_by(member_casual, hour_started) %>%
summarize(total_count = n()) %>%
as_tibble() %>%
print(n=48)
```
### Average ride length for each rider type (in minutes)
```{r}
data_trip %>%
group_by(member_casual) %>%
summarize(avg_ride_length = mean(ride_length))
```
### Average ride length for each bike type (in minutes)
```{r}
data_trip %>%
group_by(rideable_type) %>%
summarize(avg_ride_length = mean(ride_length))
```
### Average ride length for each rider type by bike type (in minutes)
```{r}
data_trip %>%
group_by(member_casual, rideable_type) %>%
summarize(avg_ride_length = mean(ride_length))
```
### Average ride length for each rider type by day of the week (in minutes)
```{r}
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(avg_ride_length = mean(ride_length))
```
### Average ride length for each rider type by year and month (in minutes)
```{r}
data_trip %>%
group_by(member_casual, year, month) %>%
summarize(avg_ride_length = mean(ride_length)) %>%
as_tibble() %>%
print(n=24)
```
### Find mode for ride length for day of the week (rounded in minutes)
```{r}
#define function to calculate mode
find_mode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
#find mode for ride length for day of the week
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(mode_ride_length = find_mode(round(ride_length)))
```
### Find the coordinates for the top 10 most used start point stations
```{r}
data_trip_start <- data_trip %>%
group_by(coords_start, start_lat, start_lng) %>%
summarize(total_count = n()) %>%
arrange(desc(total_count)) %>%
as_tibble() %>%
print(n=10)
data_trip_start_top <- head(data_trip_start, 25)
state_map <- map_data("state")
data_trip_start %>% ggplot() +
geom_polygon(data = state_map, aes(x = long, y = lat, group = group), fill = "white", color = "grey") +
geom_point(data = data_trip_start_top, aes(x = start_lng, y = start_lat, fill = total_count), size = 3, shape = 21) +
scale_fill_gradient(low = "white", high = "red") +
# coord_cartesian(xlim=c(-87.9, -87.5), ylim = c(41.6, 42.1)) +
coord_cartesian(xlim=c(-87.675, -87.6), ylim = c(41.85, 41.97)) +
labs(x = "Longitude", y = "Latitude", title = "Most Used Start Stations")
pal = colorNumeric("RdYlBu", domain = data_trip_start_top$total_count)
leaflet(data = data_trip_start_top) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircles(lng=~start_lng, lat=~start_lat, opacity = 0.9, color = ~pal(data_trip_start_top$total_count),
popup=paste(data_trip_start_top$coords_start,"<br>",data_trip_start_top$total_count)) %>%
# addPolygons(data = lnd, fill = FALSE) %>%
addLegend(pal = pal, values = ~total_count) %>%
setView(lng = -87.63, 41.91, zoom = 12) %>%
addMiniMap()
```
### Find the coordinates for the top 10 most used end point stations
```{r}
data_trip_end <- data_trip %>%
group_by(coords_end) %>%
summarize(total_count = n()) %>%
arrange(desc(total_count)) %>%
as_tibble() %>%
print(n=10)
```
***
# 5. Share
### Create table with percentages for member and casual riders
```{r}
rider_percentage <- data_trip %>%
group_by(member_casual) %>%
summarize(count = n()) %>%
mutate(percentage = count / sum(count)) %>%
ungroup()
```
### Pie chart of percentage for member and casual rides
```{r}
ggplot(rider_percentage, aes(x="", y=percentage, fill=member_casual)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
geom_text(aes(label = paste0(comma(count), " (",scales::percent(percentage),")")), position=position_stack(vjust=0.5), size = 9/.pt) +
labs(title="Percentage of Member and Casual Riders") +
theme_void()
```
### Bar chart for total trips per bike type for each type of rider
```{r}
data_trip %>%
group_by(member_casual, rideable_type) %>%
summarize(total_count = n()) %>%
ggplot(aes(x=rideable_type, y=total_count, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.9) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=comma(total_count), group=member_casual), position=position_dodge(width = 0.9), vjust = -0.25) +
labs(title="Total Trips per Bike Type for Each Type of Rider", x="Type of Bikes", y="Number of Rides") +
theme_classic()
```
### Bar chart for total trips per month for each rider type
```{r}
data_trip %>%
group_by(member_casual, month) %>%
summarize(total_count = n()) %>%
ggplot(aes(x=month, y=total_count, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.9) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=comma(total_count), group=member_casual), position=position_dodge(width = 0.9), vjust = -0.25, size = 6/.pt) +
labs(title="Total Trips per Month for Each Type of Rider", x="Month", y="Number of Rides") +
theme_classic()
```
### Bar chart for total trips by day of the week for each rider type
```{r}
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(total_count = n()) %>%
ggplot(aes(x=day_of_week, y=total_count, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.9) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=comma(total_count), group=member_casual), position=position_dodge(width = 0.9), vjust = -0.25, size = 6/.pt) +
labs(title="Total Trips by Day of the Week for Each Type of Rider", x="Day of the Week", y="Number of Rides") +
theme_classic()
```
### Bar chart for total trips by hour for each rider type
```{r}
data_trip %>%
group_by(member_casual, hour_started) %>%
summarize(total_count = n()) %>%
ggplot(aes(x=hour_started, y=total_count, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.7) +
scale_y_continuous(labels = scales::comma) +
labs(title="Total Trips per hour for Each Type of Rider", x="Hour", y="Number of Rides") +
theme_classic()
```
### Bar chart for average ride length for each rider type (in minutes)
```{r}
data_trip %>%
group_by(member_casual) %>%
summarize(avg_ride_length = mean(ride_length)) %>%
ggplot(aes(x=member_casual, y=avg_ride_length, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.9) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=comma(avg_ride_length), group=member_casual), position=position_dodge(width = 0.9), vjust = -0.25, size = 9/.pt) +
labs(title="Average Ride Length for Each Type of Rider", x="Member Types", y="Average Ride Length in Minutes") +
theme_classic()
```
### Bar chart for average ride length for each rider type by bike type (in minutes)
```{r}
data_trip %>%
group_by(member_casual, rideable_type) %>%
summarize(avg_ride_length = mean(ride_length)) %>%
ggplot(aes(x=rideable_type, y=avg_ride_length, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.9) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=comma(avg_ride_length), group=member_casual), position=position_dodge(width = 0.9), vjust = -0.25, size = 9/.pt) +
labs(title="Average Ride Length for Each Type of Rider by Bike Type", x="Type of Bikes", y="Average Ride Length in Minutes") +
theme_classic()
```
### Bar chart for average ride length for each rider type by day of the week (in minutes)
```{r}
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(avg_ride_length = mean(ride_length)) %>%
ggplot(aes(x=day_of_week, y=avg_ride_length, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.9) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=comma(avg_ride_length), group=member_casual), position=position_dodge(width = 0.9), vjust = -0.25, size = 7/.pt) +
labs(title="Average Ride Length for Each Type of Rider by Day of the Week", x="Day of the Week", y="Average Ride Length in Minutes") +
theme_classic()
```
### Bar chart for average ride length for each rider type by year and month (in minutes)
```{r}
data_trip %>%
group_by(member_casual, year, month) %>%
summarize(avg_ride_length = mean(ride_length)) %>%
ggplot(aes(x=month, y=avg_ride_length, fill=member_casual)) +
geom_bar(position = position_dodge(preserve = "single"), stat="identity", color="black", width=0.9) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=comma(avg_ride_length), group=member_casual), position=position_dodge(width = 0.9), vjust = -0.25, size = 7/.pt) +
labs(title="Average Ride Length for Each Type of Rider by Month", x="Month", y="Average Ride Length in Minutes") +
theme_classic()
```
***
# 6. Act
### Conclusion
Ridership was 38% and 62% for casual and member riders, respectively.
Total trips:
* Member riders had more trips during weekdays than weekends
* Casual riders had more trips during weekends than weekdays
* Member and casual riders had more electric bike trips than classic or docked
+ Casual riders on electric bikes nearly doubled classic bikes
+ Member riders took 0 docked bike trips
* Member and casual rider trips were highest during months with warmer weather
Average ride length:
* Casual riders had longer ride length
* Docked bike had longest ride length, followed by classic then electric bikes
+ Member riders ride length were similar
+ Casual riders ride length nearly doubled the next bike type from electric to classic to docked
* Member and casual riders had longer ride length during weekends
+ Casual riders had larger change throughout the week
+ Member riders had consistent ride lengths
* Member and casual riders had longer ride length during warmer months
+ Casual riders had larger change while member riders length were more consistent
### Recommendations
Based on analysis and the question of how members and casual riders use Cyclistic bikes differently, the following recommendations will help the marketing team convert casual riders into members:
* June to September have the highest bike usage. Offering summer discounts and incentives for annual membership so trips can be more cost effective.
* A campaign to target weekday casual riders to use bikes to commute to work
* Electric bike usage almost doubled classic bike usage by casual riders. Show the convenience of using electric bikes over other modes of transportation
The coordinates for start and end stations were not used this time per guidelines. The data can be analyzed in the future to develop strategic marketing to target hotspots.