![Chicago-Divvy Icon](https://cdn.vox-cdn.com/thumbor/gxpEBcy4ZXpibFtudaVGR7RE2ok=/0x0:3872x2581/1200x800/filters:focal(1627x982:2245x1600)/cdn.vox-cdn.com/uploads/chorus_image/image/60438451/shutterstock_525161920.0.jpg)

<h2 align='center'>How to Improve Chicago-Divvy's Future Success by Maximizing the Number of Annual Memberships</h2>  

## by: Nurudeen Abdulsalaam
## updated: September 10, 2022  

<a id = "table-of-contents"></a>
# Table of Contents
- [Introduction](#intro)
     - [Background](#background)
     - [Divvy Bikes](#divvy)
     - [About the data](#data)
     - [Methodology](#method)
- [Phase One: Ask](#ask)
- [Phase Two: Prepare](#prepare)
- [Phase Three: Process](#process)
- [Phase Four: Analysis](#analyse)
- [Phase Five: Visualization](#viz)
- [Phase Five: Share](#share)
- [Phase Six: Recommendation](#act)
- [Limitations](#limit)

<a id = "intro"></a>
<a id = "background"></a>
# Background

I am a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members.
Read [more](https://d3c33hcgiwev3.cloudfront.net/aacF81H_TsWnBfNR_x7FIg_36299b28fa0c4a5aba836111daad12f1_DAC8-Case-Study-1.pdf?Expires=1662681600&Signature=BpMkr1QA6pHmRZQrvA4hLcaOg7~GEDdV3zNZHpJxQ2G4u0bqCswVvXMaCGEEAxvBrkxwq70R7E2wFXKh-a2cjc4eTBGBuLBsSkOWqIEFahp965JhL1GN3qkVCVot2UabhxOv64Ijb2G7sWqWs2129wG1U9JBY5CReLVtPm8FQ3E_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A)


<a id = "divvy"></a>
# Divvy Bikes

Divvy is the bicycle sharing system in the Chicago metropolitan area, currently serving the cities of Chicago and Evanston. The system is owned by the Chicago Department of Transportation and has been operated by Lyft since 2019. As of Sept 2021, Divvy operated 16,500 bicycles and over 800 stations, covering 190 square miles. [Explore!](https://divvybikes.com/)

<a id = "data"></a>
# About the Data

Divvy makes its historical trip data available for public use. The datasets were downloaded from [link](https://divvy-tripdata.s3.amazonaws.com/index.html), under this [license](https://ride.divvybikes.com/data-license-agreement). Each trip is anonymized and includes the trip start day and time, trip end day and time, trip start station, trip end station, and rider type. For this project, I will be analyzing 12 months of Cyclistic trip data between *August 2021 and July 2022*. Each month's data came in a separate CSV file; the files were loaded individually and later concatenated.


<a id = "method"></a>
# Methodology

In order to answer the key business questions, I will follow the steps of the data analysis process (ask, prepare, process, analyze, share, and act) as taught in the Google Data Analytics Professional Certificate.

Want to understand these steps better? Read [more](https://www.geeksforgeeks.org/six-steps-of-data-analysis-process/)

<a id = "ask"></a>

# **Phase One:** <font color = "blue">Ask</font>
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?

## Business task
- Identify historical trends for casual and annual bike riders

- Determine the factors that influence casual riders into buying annual memberships

- Use insights from historical trends and factors associated with casual riders buying annual memberships to improve the casual rider to annual membership conversion rate via digital media.


<a id = "prepare"></a>
# **Phase Two:** <font color = "blue">Prepare</font>

## Data Source
Motivate International Inc. (“Motivate”) operates the City of Chicago’s (“City”) Divvy bicycle sharing service. Motivate and the City are committed to supporting bicycling as an alternative transportation option. As part of that commitment, the City permits Motivate to make certain Divvy system data owned by the City (“Data”) available to the public, subject to the terms and conditions of this License Agreement.
<https://ride.divvybikes.com/data-license-agreement>

## Data Credibility
The data is collected directly by the company that operates the service and the bikes, so it can be considered reliable. From the data provided, it appears to be ROCCC (reliable, original, comprehensive, current, and cited).


## Data Privacy, Security & Accessibility
The data does not contain personal information about any of the riders. The provider has a clear and detailed license and terms of use for the data, so it is right to acknowledge them and give them full credit, with references, for the data used.

<a id = "process"></a>
# **Phase Three:** <font color = "blue">Process</font>

## Summary of the Wrangling Process
The original dataset has **5,901,463** observations and **13** features. It was checked for consistency in column names and data types, and no discrepancy was detected. After that, rows containing missing values were dropped and outliers were removed, leaving **3,625,932** observations. Some columns were then added, making a total of **21** features, out of which only **11** were selected for analysis.

### Import Libraries

In [None]:
options(warn = -1) # suppress warning messages while loading libraries
library("tidyverse") # for data import and wrangling
library("readr") # for saving out as csv 
library("ggplot2") # for visualization
library("lubridate") # for date functions
library("geosphere") # to calculate distance using lat and lon
library("scales") # to format y-axis labels in ggplot

I'll load the dataset for each month, take a quick look at it, and `understand how the data is organized`.

In [None]:
# Alternative: load and merge every file in one step.
# I will not rely on this method, since I want to examine each file before merging them
files <-  list.files("../input/divvytrips-aug-2021july-2022", full.names=TRUE)
trips_df <- files[! files %in% c('../input/divvytrips-aug-2021july-2022/ride_length.csv')] %>%
                map_df(~read_csv(., show_col_types = FALSE))
dim(trips_df)

In [None]:
## Dataset from August 2021 to December 2021
#path <- "../input/divvytrips-aug-2021july-2022/Divvytrips_202108_202207/Divvytrips_202108_202207"
divvy_2021_08 <- read_csv("../input/divvytrips-aug-2021july-2022/202108-divvy-tripdata.csv")
divvy_2021_09 <- read_csv("../input/divvytrips-aug-2021july-2022/202109-divvy-tripdata.csv")
divvy_2021_10 <- read_csv("../input/divvytrips-aug-2021july-2022/202110-divvy-tripdata.csv")
divvy_2021_11 <- read_csv("../input/divvytrips-aug-2021july-2022/202111-divvy-tripdata.csv")
divvy_2021_12 <- read_csv("../input/divvytrips-aug-2021july-2022/202112-divvy-tripdata.csv")


## Dataset from January 2022 to July 2022
divvy_2022_01 <- read_csv("../input/divvytrips-aug-2021july-2022/202201-divvy-tripdata.csv")
divvy_2022_02 <- read_csv("../input/divvytrips-aug-2021july-2022/202202-divvy-tripdata.csv")
divvy_2022_03 <- read_csv("../input/divvytrips-aug-2021july-2022/202203-divvy-tripdata.csv")
divvy_2022_04 <- read_csv("../input/divvytrips-aug-2021july-2022/202204-divvy-tripdata.csv")
divvy_2022_05 <- read_csv("../input/divvytrips-aug-2021july-2022/202205-divvy-tripdata.csv")
divvy_2022_06 <- read_csv("../input/divvytrips-aug-2021july-2022/202206-divvy-tripdata.csv")
divvy_2022_07 <- read_csv("../input/divvytrips-aug-2021july-2022/202207-divvy-tripdata.csv")


I'll inspect the structure of each dataset using the `str` and `glimpse` functions.

In [None]:
str(divvy_2021_08)
str(divvy_2021_09)
str(divvy_2021_10)
str(divvy_2021_11)
str(divvy_2021_12)

str(divvy_2022_01)
str(divvy_2022_02)
str(divvy_2022_03)
str(divvy_2022_04)
str(divvy_2022_05)
str(divvy_2022_06)
str(divvy_2022_07)


In [None]:
glimpse(divvy_2021_08)
glimpse(divvy_2021_09)
glimpse(divvy_2021_10)
glimpse(divvy_2021_11)
glimpse(divvy_2021_12)

glimpse(divvy_2022_01)
glimpse(divvy_2022_02)
glimpse(divvy_2022_03)
glimpse(divvy_2022_04)
glimpse(divvy_2022_05)
glimpse(divvy_2022_06)
glimpse(divvy_2022_07)


Reading through the output from the cells above is `difficult, messy and error prone`. I will write a few lines of code to do the comparison for me. They check for consistency in column names, taking the first dataset as the standard. If a column name matches the standard, the code prints `TRUE`; otherwise it prints `FALSE`.


In [None]:
## Check for consistency in column naming, taking the first dataset as the standard
# using `colnames` to get the column names and then comparing them element-wise with equal (==)

colnames(divvy_2021_08) == colnames(divvy_2021_09)
colnames(divvy_2021_08) == colnames(divvy_2021_10)
colnames(divvy_2021_08) == colnames(divvy_2021_11)
colnames(divvy_2021_08) == colnames(divvy_2021_12)

colnames(divvy_2021_08) == colnames(divvy_2022_01)
colnames(divvy_2021_08) == colnames(divvy_2022_02)
colnames(divvy_2021_08) == colnames(divvy_2022_03)
colnames(divvy_2021_08) == colnames(divvy_2022_04)
colnames(divvy_2021_08) == colnames(divvy_2022_05)
colnames(divvy_2021_08) == colnames(divvy_2022_06)
colnames(divvy_2021_08) == colnames(divvy_2022_07)

It can be seen here that the `column names are consistent` across the tables. I'll then check for consistency in `data types`.

In [None]:
## Check for consistency in data types, taking the first dataset as the standard.
# Each loop goes through the columns of one table and checks the data type of each column.
# The types are stored in the vector defined before the for loop

dtype2108 <- c()
for(col in colnames(divvy_2021_08)){
    type <- class(divvy_2021_08[[col]])
    dtype2108 <- c(dtype2108,type)
}

dtype2109 <- c()
for(col in colnames(divvy_2021_09)){
    type <- class(divvy_2021_09[[col]])
    dtype2109 <- c(dtype2109,type)
}

dtype2110 <- c()
for(col in colnames(divvy_2021_10)){
    type <- class(divvy_2021_10[[col]])
    dtype2110 <- c(dtype2110,type)
}

dtype2111 <- c()
for(col in colnames(divvy_2021_11)){
    type <- class(divvy_2021_11[[col]])
    dtype2111 <- c(dtype2111,type)
}

dtype2112 <- c()
for(col in colnames(divvy_2021_12)){
    type <- class(divvy_2021_12[[col]])
    dtype2112 <- c(dtype2112,type)
}

dtype2201 <- c()
for(col in colnames(divvy_2022_01)){
    type <- class(divvy_2022_01[[col]])
    dtype2201 <- c(dtype2201,type)
}

dtype2202 <- c()
for(col in colnames(divvy_2022_02)){
    type <- class(divvy_2022_02[[col]])
    dtype2202 <- c(dtype2202,type)
}

dtype2203 <- c()
for(col in colnames(divvy_2022_03)){
    type <- class(divvy_2022_03[[col]])
    dtype2203 <- c(dtype2203,type)
}

dtype2204 <- c()
for(col in colnames(divvy_2022_04)){
    type <- class(divvy_2022_04[[col]])
    dtype2204 <- c(dtype2204,type)
}

dtype2205 <- c()
for(col in colnames(divvy_2022_05)){
    type <- class(divvy_2022_05[[col]])
    dtype2205 <- c(dtype2205,type)
}

dtype2206 <- c()
for(col in colnames(divvy_2022_06)){
    type <- class(divvy_2022_06[[col]])
    dtype2206 <- c(dtype2206,type)
}

dtype2207 <- c()
for(col in colnames(divvy_2022_07)){
    type <- class(divvy_2022_07[[col]])
    dtype2207 <- c(dtype2207,type)
}

# Outputs above were stored as vectors. Let's convert them to a data frame

type_df = data.frame(dtype21_08 = dtype2108, dtype21_09 = dtype2109, dtype21_10 = dtype2110,
                dtype21_11 = dtype2111, dtype21_12 = dtype2112, dtype22_01 = dtype2201,
                dtype22_02 = dtype2202, dtype22_03 = dtype2203, dtype22_04 = dtype2204,
                dtype22_05 = dtype2205, dtype22_06 = dtype2206, dtype22_07 = dtype2207)
# Confirm changes
type_df

`Note:` This table has 15 rows rather than 13 because the `started_at` and `ended_at` columns each have two class values (`POSIXct` and `POSIXt`), which are printed separately. To confirm, run `sapply(divvy_2021_08, class)`

In [None]:
sapply(divvy_2021_08, class)
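
The twelve near-identical loops can also be expressed once over a named list of data frames. The following is a minimal sketch on two toy data frames (`aug` and `sep` are hypothetical stand-ins for the monthly tables, not the real data). It keeps only the first class of each column, so that date-time columns contribute one value instead of two.

```r
# Toy stand-ins for two monthly tables (hypothetical data)
aug <- data.frame(ride_id = c("a", "b"),
                  started_at = as.POSIXct(c("2021-08-01 10:00:00", "2021-08-01 11:00:00"), tz = "UTC"),
                  start_lat = c(41.9, 41.8))
sep <- data.frame(ride_id = c("c", "d"),
                  started_at = as.POSIXct(c("2021-09-01 10:00:00", "2021-09-01 11:00:00"), tz = "UTC"),
                  start_lat = c(41.7, 41.6))
monthly <- list(aug = aug, sep = sep)

# First class of every column; sapply stacks the results into one column per table
first_class <- function(df) vapply(df, function(x) class(x)[1], character(1))
type_matrix <- sapply(monthly, first_class)

# TRUE when the column names / types of a table match those of the first table
names_ok <- sapply(monthly, function(df) identical(colnames(df), colnames(monthly[[1]])))
types_ok <- apply(type_matrix, 2, function(col) all(col == type_matrix[, 1]))

print(type_matrix)
print(names_ok)
print(types_ok)
```

With the real data, the list would hold the twelve `divvy_*` tables, and any `FALSE` in `names_ok` or `types_ok` would point to the month that needs attention.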

From the table above, it can be seen that the data types of the columns across the tables are also uniform. Therefore, I can now combine the tables for proper cleaning.
The combined table is named `all_trips`.

In [None]:
all_trips <- bind_rows(divvy_2021_08, divvy_2021_09, divvy_2021_10, divvy_2021_11, divvy_2021_12, divvy_2022_01,
                       divvy_2022_02, divvy_2022_03, divvy_2022_04, divvy_2022_05, divvy_2022_06, divvy_2022_07)

# Create a copy before data transformation
master_df <- data.frame(all_trips)

# Confirm changes
head(all_trips)

`Note:` The method above creates a plain data-frame copy of `all_trips`. Because R uses copy-on-modify semantics, `master_df <- all_trips` would also be safe: the two names share memory only until one of them is modified, so manipulating one never affects the other. Try the second method and check with `tracemem(master_df) == tracemem(all_trips)` whether they both point to the same location.
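
A small sketch of how assignment and modification interact in R (copy-on-modify). The toy data frames `a` and `b` are hypothetical and stand in for `all_trips` and `master_df`.

```r
# Toy data frame (hypothetical), standing in for `all_trips`
a <- data.frame(x = 1:3)

# Plain assignment: both names refer to the same object until one is modified
b <- a

# Modifying `b` triggers a copy; `a` is left untouched
b$x[1] <- 99L

print(a$x)  # original values
print(b$x)  # modified copy
```

Either way of copying therefore protects the original table. An explicit `data.frame()` call additionally converts a tibble into a base data frame.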

Now, it is time to understand my data

In [None]:
dims <- dim(all_trips) # avoid reusing the name of the `dim` function
sprintf("all_trips has %d observations and %d features", dims[1], dims[2])

In [None]:
print("Glance at the data types and few rows with glimpse")
glimpse(all_trips)

In [None]:
print("Observe its statistical summary")
summary(all_trips)

I can see that my data has a lot of missing values. I'll quickly check for the following
- duplicates in ride_id
- unique values in 
    - rideable_type
    - member_casual

In [None]:
# calculate the total number of duplicates
duplicate <- sum(duplicated(all_trips$ride_id))
sprintf("Total duplicates in ride_id equals: %0.f", duplicate)

In [None]:
uni_list <- c('rideable_type', 'member_casual')
print('Unique values in rideable_type and member_casual respectively:')
for(col in uni_list){
    print(unique(all_trips[[col]]))
}

I'll now check for and remove missing values in my data

In [None]:
# Start from the original copy of the data frame so that the original `all_trips` table is restored
# whenever this cell is re-run

all_trips <- data.frame(master_df)
# Check for missing values
total_trips1 = dim(all_trips)[1]
total_null = sum(is.na(all_trips))
sprintf('Total trips before: %.0f', total_trips1)
sprintf('Total null before: %.0f', total_null)

# Look at the distribution of the missing values
sapply(all_trips, function(x) sum(is.na(x)))

# remove missing values
all_trips <- all_trips %>% drop_na()
       
# Confirm changes
total_trips2 = dim(all_trips)[1]
total_null = sum(is.na(all_trips))
sprintf('Total trips after: %.0f', total_trips2)
sprintf('Total null after: %.0f', total_null)
       
# Calculate the percentage of the total missing values(trips)
null_prop = (total_trips1 - total_trips2)/ total_trips1 *100
sprintf('Percentage of missing trips: %.0f percent', null_prop)

# Create a copy before any data transformation -as done above
clean_df <- data.frame(all_trips)

The total number of missing values before cleaning (`3,572,542`) looks far larger than the `22%` of trips that were removed. This is because `sum(is.na())` counts individual missing cells, while `drop_na` drops whole rows, and a single row can have more than one missing value.
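
The difference between counting missing cells and counting dropped rows can be seen on a tiny table. This is a sketch on hypothetical toy data; it uses base R's `na.omit` as a stand-in for tidyr's `drop_na`, which behaves the same way for this purpose.

```r
# Toy table (hypothetical): row 2 has two missing cells, row 3 has one
toy <- data.frame(start_station = c("A", NA, "C", "D"),
                  end_station   = c("B", NA, NA, "E"),
                  stringsAsFactors = FALSE)

missing_cells <- sum(is.na(toy))                 # counts individual NA cells
rows_dropped  <- nrow(toy) - nrow(na.omit(toy))  # counts rows with at least one NA

cat("Missing cells:", missing_cells, "\n")
cat("Rows dropped: ", rows_dropped, "\n")
```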

Dropping `22%` of the rows leaves `4,629,230` rows out of `5,901,643`. Although that is a lot of rows, it is still acceptable because the dataset itself is so large. If I were to carry out this analysis at a confidence level of `99%` and a margin of error of `1%`, a sample size of about `16,595` would have been enough.
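
A sample-size figure of this kind can be approximated with Cochran's formula, optionally followed by a finite-population correction. The sketch below rests on assumptions that are not stated in the text: a worst-case proportion of `p = 0.5` and the population size quoted above. The exact result depends on how the critical value is rounded and on whether the correction is applied, so it should land close to, but not necessarily exactly on, the figure quoted above.

```r
confidence <- 0.99     # confidence level
margin     <- 0.01     # margin of error
p          <- 0.5      # assumed worst-case proportion (not stated in the text)
N          <- 5901643  # population size as quoted above

z  <- qnorm(1 - (1 - confidence) / 2)   # two-sided critical value
n0 <- z^2 * p * (1 - p) / margin^2      # Cochran's formula, infinite population
n  <- n0 / (1 + (n0 - 1) / N)           # finite-population correction

cat("Without correction:", ceiling(n0), "\n")
cat("With correction:   ", ceiling(n), "\n")
```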

I'll now derive other columns that will be required for analysis (data engineering). Here I'm adding more columns to make the dataset more analysis-ready. These are:
- Total distance (ride_length) of each ride, from the latitude and longitude columns
- Total duration of each ride, from the ended_at and started_at columns
- Date, time, week_day, day, month, and year, from the started_at column

In [None]:
# Calculate distance using the haversine formula.
# Note: geosphere expects each point as c(longitude, latitude). If `ride_length.csv`
# was generated with latitude first, regenerate it with the order used below.
# `sapply(1:nrow(df), function(x) distm(c(lon1, lat1), c(lon2, lat2), fun = distHaversine))`
#ride_length <- sapply(1:nrow(all_trips), function(x) distm(c(all_trips$start_lng[x], all_trips$start_lat[x]),
#                                          c(all_trips$end_lng[x], all_trips$end_lat[x]), fun = distHaversine));

#Convert the sapply output to a data frame
#ride_length <- data.frame(ride_length)
#head(ride_length)
                      
#Save ride_length into the output working directory
#write_csv(ride_length, '/kaggle/working/ride_length.csv')

writeLines("Running me uses a lot of computational power. That is why I have been commented out. My output has been
saved as `ride_length.csv` and you're free to load it in the next cell. 
    If you wish to run me, uncomment me and run. I'll do the calculations for you in the next 30mins.")
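
Much of the running time comes from calling the distance function once per row. As a hedged alternative, the haversine formula can be applied to whole columns at once. The sketch below is a base-R implementation written for illustration (the function name `haversine_m` and the toy coordinates are hypothetical); it uses the same default Earth radius as `geosphere::distHaversine` and takes longitude before latitude.

```r
# Vectorized haversine distance in meters (base R only).
# r defaults to 6378137 m, the default radius used by geosphere::distHaversine.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Toy coordinates (hypothetical), in the same column layout as the trip data
toy <- data.frame(start_lng = c(-87.63, -87.65), start_lat = c(41.88, 41.90),
                  end_lng   = c(-87.63, -87.62), end_lat   = c(41.88, 41.89))

# One call computes the distance for every row
toy$ride_length <- haversine_m(toy$start_lng, toy$start_lat, toy$end_lng, toy$end_lat)
print(toy)
```

Note that this is the straight-line distance between the start and end points, so a ride that returns to its starting station has a length of zero.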

In [None]:
# Restore from the cleaned copy so that this cell can be re-run safely
all_trips <- data.frame(clean_df)

# Load ride_length
ride_lengths <- read_csv('../input/divvytrips-aug-2021july-2022/ride_length.csv',
                         show_col_types = FALSE)
all_trips$ride_length <- ride_lengths$ride_length

# Date, time, week_day, day, month, year
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$time <- format(as.POSIXct(all_trips$started_at), "%H")
all_trips$week_day <- format(as.Date(all_trips$date), "%A")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$year <- format(as.Date(all_trips$date), "%Y")

# Duration : Total duration of each ride in seconds
all_trips$duration <- difftime(all_trips$ended_at,all_trips$started_at, units = 'secs')

# Confirm changes
head(all_trips)

To provide answers to my problem statement for the analysis, I will be selecting the following columns: rideable_type, start_station_name, end_station_name, member_casual, ride_length, duration, time, week_day, day, month, and year and label it as `all_trips_v2`

In [None]:
all_trips_v2 <- all_trips %>%
                select(rideable_type, start_station_name, end_station_name, member_casual,
                       ride_length, duration, time, week_day, day, month, year)
head(all_trips_v2)

In [None]:
print("Glance at the data types and few rows with glimpse")
glimpse(all_trips_v2)

In [None]:
print("Observe its statistical summary")
summary(all_trips_v2)

Key insights from the last two cells that need to be addressed:
- Some ride_length values equal `0`, which should not happen. Maybe the rider unlocked a bike but did not take a ride. There is also the possibility of outliers.
- The duration column has negative values, and its maximum is very far from the mean. This is probably a collection error whereby values were entered in the wrong field.
- Days of the week are not stored as an ordered `data type`.
- duration is stored as difftime.

Steps to take
- Divide the features into numerical and categorical values
- Convert week_day into ordered data
- Convert duration, ride_length, time and month to numeric
- Visualize numerical/continuous variables for outliers using a boxplot
- Remove outrageous values and outliers

In [None]:
# Divide the features into numerical and categorical values
num_col <- c('ride_length', 'duration')
cat_col <- c('rideable_type', 'start_station_name', 'end_station_name', 'member_casual',
             'time', 'week_day', 'day', 'month', 'year')

# Convert the week_day into ordered data
all_trips_v2$week_day <- ordered(all_trips_v2$week_day, 
                                 levels=c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
                                          'Thursday', 'Friday', 'Saturday'))

# Convert duration, ride_length, time and month to numeric
all_trips_v2$duration <- as.numeric(all_trips_v2$duration)
all_trips_v2$ride_length <- as.numeric(all_trips_v2$ride_length)
all_trips_v2$time <- as.numeric(all_trips_v2$time)
all_trips_v2$month <- as.numeric(all_trips_v2$month)

# Visualize outliers using boxplot
print("Visualize outliers using boxplot")
boxplot(all_trips_v2[, c(num_col)])

writeLines("Higher percentiles have more outliers than the lower percentiles.
Retain only data points between the 1st and 90th percentiles")
# calculate the percentiles
duration_qlow <- quantile(all_trips_v2$duration, probs = 0.01)
duration_qhigh <- quantile(all_trips_v2$duration, probs = 0.90)

length_qlow <- quantile(all_trips_v2$ride_length, probs = 0.01)
length_qhigh <- quantile(all_trips_v2$ride_length, probs = 0.90)

all_trips_clean <- all_trips_v2 %>%
            filter((duration > duration_qlow & duration < duration_qhigh),
                   (ride_length > length_qlow & ride_length < length_qhigh))

# Visualize again to confirm changes
boxplot(all_trips_clean[, c(num_col)])
print("Outliers dropped")

# Check for missing values again
nulls <- sum(is.na(all_trips_clean))
dims <- dim(all_trips_clean) # avoid reusing the name of the `dim` function
sprintf("Total missing values after cleaning equals: %d", nulls)
sprintf("Total Observations after cleaning equals: %d", dims[1])
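
To see what the percentile filter does, it helps to run it on a handful of values. The sketch below uses a hypothetical toy vector of durations with one negative and one extreme value, and applies the same strict inequalities as the cell above.

```r
# Toy durations in seconds (hypothetical), including a negative value and an extreme one
duration <- c(-50, 0, 60, 120, 180, 240, 300, 360, 420, 100000)

q_low  <- quantile(duration, probs = 0.01)
q_high <- quantile(duration, probs = 0.90)

# Strict inequalities, as in the cell above
kept <- duration[duration > q_low & duration < q_high]

print(c(q_low, q_high))
print(kept)
```

Percentile trimming always removes a fixed share of the rows, whether or not those rows are actually erroneous, so it is worth comparing the row counts before and after.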

<a id = "analyse"></a>

# **Phase Four:** <font color = "blue">Analysis</font>
## Questions:
   - Which bike is the most preferred bike?
   - Are short rides most common or long rides?
   - What is the correlation between ride length and duration of ride?

In [None]:
#find the summary for duration and ride_length
print('summary statistic for duration')
summary(all_trips_clean$duration)
print('summary statistic for ride_length')
summary(all_trips_clean$ride_length)

In [None]:
print('Most preferred bike')
table(all_trips_clean$rideable_type)

In [None]:
print("Number of total rides by membership")
membership <- table(all_trips_clean$member_casual)
membership

In [None]:
print("Most preferred bike by member_casual")
all_trips_clean %>% 
        group_by(member_casual) %>%
            count(rideable_type, sort = TRUE)

## I'll continue with the rest of the analysis together with visualization

In [None]:
# Set plot size
options(repr.plot.width = 10, repr.plot.height = 8)

# Distribution of Ride Duration
p <- ggplot(all_trips_clean, aes(duration)) + 
        geom_histogram(binwidth = 30, color = 'blue') +
            labs(title = "Distribution of Ride Duration",
            tag = "Fig. 1",
            x = "Duration (secs)", y = "Count") +
            theme(text = element_text(size = 15, face = 'bold'))
print(p)
ggsave('distribution_of_ride_duration.png', p, width = 10, height = 8)


# Distribution of Ride Length
p <- ggplot(all_trips_clean, aes(ride_length)) +
            geom_histogram(binwidth = 50, color = 'blue') +
            labs(title = "Distribution of Ride Length",
            tag = "Fig. 2",
            x = "Ride Length (meters)", y = "Count") +
            theme(text = element_text(size = 15, face = 'bold'))
print(p)
ggsave('distribution_of_ride_length.png', p, width = 10, height = 8)

# Variation in Bike Preference
options(dplyr.summarise.inform = FALSE) # override `summarise()` has grouped output` message
p <- all_trips_clean %>% 
        group_by(member_casual, rideable_type) %>%
            summarize(count = n()) %>%
        ggplot(aes(x = rideable_type, y = count, fill = member_casual))+
        geom_col(position = "dodge") +
        labs(title = "Variation in Bike Preference",
        subtitle = "Member vs. Casual",
        tag = "Fig. 3",
        y = "Count", x = "Bike type") +
        scale_y_continuous(label = comma) +
        theme(text = element_text(size = 15, face = 'bold'), legend.title = element_blank())
print(p)
ggsave('variation_in_bike_preference.png', p, width = 10, height = 8)

## Observations
- For both duration and ride_length, the median is close to the mean and the distributions are positively skewed. This implies that most of the rides are short in distance and time.
- Both variables appear to be positively correlated.
- The classic bike is the most preferred bike among both members and casual riders.
- The docked bike is used only by casual riders.
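
The correlation between ride length and duration can be quantified directly with `cor()`. The sketch below runs on hypothetical toy values, not on the trip data; with the real table the same call would take `all_trips_clean$ride_length` and `all_trips_clean$duration`.

```r
# Toy values (hypothetical): distance in meters and duration in seconds
toy <- data.frame(ride_length = c(500, 1200, 1800, 2500, 3100, 4000),
                  duration    = c(240, 420, 700, 800, 1100, 1300))

# Pearson measures linear association; Spearman is rank-based and less sensitive to skew
pearson  <- cor(toy$ride_length, toy$duration, method = "pearson")
spearman <- cor(toy$ride_length, toy$duration, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Reporting both coefficients is a reasonable choice here because the two variables are skewed.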

## Interpretation
Focusing on maximizing the number of annual memberships will most likely improve Chicago-Divvy's future success, and the company should also consider investing more in classic bikes than in the other two types.

<a id = "viz"></a>

# **Phase Five:** <font color = "blue">Visualization</font>

## Question: Grouping by member_casual and week day,
- What is the average ride length and duration?
- How many rides are recorded by each rider type every day?

In [None]:
trips_by_day <- all_trips_clean %>% 
                group_by(week_day, member_casual) %>% 
                dplyr::summarize(number_of_rides = n(),
                                 avg_duration = mean(duration),
                                 avg_distance = mean(ride_length)) %>% 
                arrange(week_day, member_casual)
trips_by_day

In [None]:
# Set plot size
options(repr.plot.width = 10, repr.plot.height = 8)

# Average Ride Length
p <- trips_by_day %>%
    ggplot(aes(x = week_day, y = avg_distance, fill = member_casual))+
    geom_col(position = "dodge") +
    labs(title = "Average Ride Length",
         subtitle = "Member vs. Casual",
         tag = "Fig. 4",
         y = "Distance (meters)", x = "Week Days") +
    scale_x_discrete(labels = c('Sun', 'Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat')) +
    scale_y_continuous(label = comma) +
    theme(text = element_text(size = 15, face = 'bold'), legend.title = element_blank())
print(p)
ggsave('average_ride_length.png', p, width = 10, height = 8)

# Average Duration of Rides
p <- trips_by_day %>%
    ggplot(aes(x = week_day, y = avg_duration, fill = member_casual))+
    geom_col(position = "dodge") +
    labs(title = "Average Duration of Rides",
         subtitle = "Member vs. Casual",
         tag = "Fig. 5",
         y = "Duration (secs)", x = "Week Days") +
    scale_x_discrete(labels = c('Sun', 'Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat')) +
    scale_y_continuous(label = comma) +
    theme(text = element_text(size = 15, face = 'bold'), legend.title = element_blank())
print(p)
ggsave('average_duration_of_rides.png', p, width = 10, height = 8)

# Distribution of Rides in Week Days
p <- trips_by_day %>%
    ggplot(aes(x = week_day, y = number_of_rides, fill = member_casual))+
    geom_col(position = "dodge") +
    labs(title = "Distribution of Rides in Week Days",
         subtitle = "Member vs. Casual",
         tag = "Fig. 6",
         y = "Number of Rides", x = "Week Days") +
    scale_x_discrete(labels = c('Sun', 'Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat')) +
    scale_y_continuous(label = comma) +
    theme(text = element_text(size = 15, face = 'bold'), legend.title = element_blank())
print(p)
ggsave('distribution_of_rides_wdays.png', p, width = 10, height = 8)

## Observations:
- In total, members take more rides than casual riders.
- For the same ride length, casual riders take longer than members.
- Casual riders ride more on weekends, while members ride more on weekdays.
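
The second observation can be checked more directly by deriving an average speed from the two existing columns. This is a sketch on hypothetical toy rides. Because ride_length is the straight-line distance between the start and end points, the result is best read as a lower bound on the actual riding speed.

```r
# Toy rides (hypothetical): distance in meters, duration in seconds
toy <- data.frame(member_casual = c("member", "member", "casual", "casual"),
                  ride_length   = c(2000, 3000, 2000, 3000),
                  duration      = c(500, 750, 800, 1200))

# Average speed per ride in km/h: (m / s) * 3.6
toy$speed_kmh <- toy$ride_length / toy$duration * 3.6

# Mean speed for each rider type
speed_by_type <- aggregate(speed_kmh ~ member_casual, data = toy, FUN = mean)
print(speed_by_type)
```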

## Interpretations
- It is possible that casual riders are not as good at riding Divvy's bikes, so it takes them longer to cover the same distance covered by a member. This, perhaps, may keep them from riding on work days, as they might get to work late, or
- Casual riders enjoy riding at low speed (for leisure) and prefer to ride on weekends for outings.

## Question: Grouping by member_casual and month,
- What is the average ride length and duration?
- How many rides are recorded by each rider type every month?

In [None]:
trips_by_month <- all_trips_clean %>%
                group_by(month, member_casual) %>% 
                dplyr::summarize(number_of_rides = n(),
                                 avg_duration = mean(duration),
                                 avg_distance = mean(ride_length)) %>% 
                arrange(month, member_casual)
trips_by_month

In [None]:
# Set plot size
options(repr.plot.width = 10, repr.plot.height = 8)

# Monthly Ride Pattern based on number of rides
p <- trips_by_month %>%
    ggplot(aes(x = month, y = number_of_rides, color = member_casual))+
    geom_line(size = 0.8) +
    labs(title = "Total Rides' Monthly Ride Trend",
         subtitle = "Member vs. Casual",
         caption = "Aug 2021 - Jul 2022",
         tag = "Fig. 7",
         y = "Number of Rides", x = "Month") +
    scale_x_continuous(breaks = seq(from = 01, to = 12, by = 1)) +
    scale_y_continuous(label = comma) +
    theme(text = element_text(size = 15, face = 'bold'), legend.title = element_blank())
print(p)
ggsave('monthly_ride_trend.png', p, width = 10, height = 8)

## Observation
Members' annual total rides exceed those of casual riders. There is an upward trend from the beginning of the year until around June for members and August for casual riders, when each group reaches its peak. Both then decrease until December.

## Interpretation:
Chicago experiences hot weather in the middle of the year, usually most intense between May and September. During this same period, precipitation reduces [[1]](https://en.wikipedia.org/wiki/Climate_of_Chicago#Data). Since Divvy's bikes are not covered, it is reasonable to conclude that the product serves customers more during the summer season (June - September).


## Questions: 
- What are the popular ride hours during the week days?

In [None]:
trips_time <- all_trips_clean %>%
                    group_by(time, member_casual) %>%
                        dplyr::summarize(number_of_rides = n())

head(trips_time)

In [None]:
# Set plot size
options(repr.plot.width = 15, repr.plot.height = 10)

p <- trips_time %>%
    ggplot(aes(x = time, y = number_of_rides, fill = member_casual))+
    geom_col(position = "dodge") +
    labs(title = "Hour of the Day Riding Pattern",
         subtitle = "Member vs. Casual",
         caption = "Aug 2021 - Jul 2022",
         tag = "Fig. 8",
         y = "Total Ride", x = "Time (hrs)") +
    scale_x_continuous(breaks = seq(from = 0, to = 23, by = 1),
                      label = c(sprintf('%0.2d',0:23))) +
    scale_y_continuous(label = comma) +
    theme(text = element_text(size = 15, face = 'bold'), legend.title = element_blank())

print(p)
ggsave('ride_pattern_day_hrs.png', p, width = 15, height = 10)

## Observations:
- Rides increase for both groups during the day and decrease at night.
- There is a noticeable increase in rides between 6am and 9am for members only, and between 4pm and 7pm for both members and casual riders.

## Interpretation 
- This trend supports the previous speculation made with *Fig. 5* that members, most likely, use the service for their work commute.
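
The question above asks about week days, while the hourly counts include every day of the week. A possible refinement is to keep only Monday to Friday before counting. The sketch below uses hypothetical toy rides with the same column names as `all_trips_clean`.

```r
# Toy rides (hypothetical) with the same column names as `all_trips_clean`
toy <- data.frame(
  week_day      = c("Monday", "Monday", "Saturday", "Tuesday", "Sunday", "Friday"),
  time          = c(8, 17, 14, 8, 15, 17),
  member_casual = c("member", "casual", "casual", "member", "casual", "member")
)

# Keep Monday to Friday only, then count rides per hour and rider type
weekdays_only <- toy[!toy$week_day %in% c("Saturday", "Sunday"), ]
rides_by_hour <- as.data.frame(table(time = weekdays_only$time,
                                     member_casual = weekdays_only$member_casual))
names(rides_by_hour)[3] <- "number_of_rides"
print(rides_by_hour)
```

Plotting the filtered counts next to the all-days version would show whether the morning peak for members is indeed a weekday pattern.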

In [None]:
unlink("Rplot001.png")
unlink("Rplot002.png")
unlink("Rplot003.png")

<a id = "share"></a>

# **Phase Five:** <font color = "blue">Share</font>

## 1. How do annual members and casual riders use Cyclistic bikes differently?

- For the same ride length, casual riders take longer than members.

- Casual riders ride more on weekends, while members ride more on weekdays.

- There is a noticeable increase in rides between 6am and 9am for members only, and between 4pm and 7pm for both members and casual riders.

- Thus, based on the above findings, it is reasonable to conclude that casual riders use the service mostly for leisure, while members use it for routine commutes.

## 2. Why would casual riders buy Cyclistic annual memberships?

- Considering the lifestyle of casual riders, offering more of the fancier bikes would appeal to them. This is suggested by the fact that only casual riders use docked bikes.

- Reducing the annual membership price relative to the other options would also be attractive to them, as most of them are probably young people.

<a id = "act"></a>

# **Phase Six:** <font color = "blue">Recommendation or Act</font>

## 3. How can Cyclistic use digital media to influence casual riders to become members?

- Targeting social media, club houses and youth programs would be a sound strategy, because casual riders are most likely to be found there.

- The marketing team can offer a free one-month membership trial. Casual riders may want to continue the membership if they find it more convenient and cost-effective than the alternatives.

- Family or group registration should be allowed and encouraged, especially for friends and families. It is assumed that casual riders are social and would enjoy the discount attached to it.

- Awareness campaigns at popular start/end stations, seasonal greetings, and referral programs should be employed for regular outreach.



<a id = "limit"></a>
# Limitations

## Below is a list of additional data that could expand the scope of this analysis if available:

- Age and gender profile: This data could be used to study the category of riders who can be targeted for marketing.

- Occupation of member riders: This data could be used to predict the income of riders.

- Speed of ride by members and casual riders: This data could be used to understand why it takes casual riders longer to reach their destination.

- Pricing details for members and casual riders - This could be used to optimize cost structure for casual riders or provide discounts for members without affecting the profit margin.

<a id = 'last'></a>
[Back to the top!](#table-of-contents)