# Google Capstone Project - Cyclistic BikeShare Analysis

# **Introduction**

This project is the final project in my Google Data Analytics Professional Certificate course on Coursera. In this case study, I analyze a public dataset for a fictional company called Cyclistic, provided by the course. I use the R programming language for this analysis because it supports reproducibility and transparency and offers convenient tools for statistical analysis and data visualization.

This project follows the six steps of the data analysis process:

* Ask
* Prepare
* Process
* Analyze
* Share
* Act

Each step is documented with the following road map:

* Code where applicable
* Key tasks to be undertaken
* Deliverable

# **Scenario**

I am a junior data analyst working in the marketing analytics team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships, in essence converting casual riders into members. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.

# **Characters and Teams**

**Cyclistic**: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. Most riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.

**Lily Moreno**: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

**Cyclistic marketing analytics team**: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic's marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic's mission and business goals as well as how you, as a junior data analyst, can help Cyclistic achieve them.

**Cyclistic executive team**: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

## **ASK**
Three questions will guide the future marketing program:

* How do annual members and casual riders use Cyclistic bikes differently?
* Why would casual riders buy Cyclistic annual memberships?
* How can Cyclistic use digital media to influence casual riders to become members?


Lily Moreno (director of marketing and my manager) has assigned me the first question to answer:
* How do annual members and casual riders use Cyclistic bikes differently?

#### **Guiding questions**
* What is the problem you are trying to solve?
The main objective is to determine a way to build a profile for annual members and the best marketing strategies to turn casual bike riders into annual members.

*  How can your insights drive business decisions?
The insights will help the marketing team to increase annual members.

#### Key tasks

* Identify the business task
The main business objective is to design marketing strategies aimed at converting casual riders into annual members by understanding how they differ.
* Consider key stakeholders
The key stakeholders are the Director of Marketing (Lily Moreno), Marketing Analytics team, and Executive team.


#### Deliverable
 1. A clear statement of the business task
* To find the differences between the casual riders and annual members

## **PREPARE**
I will use Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under this [license](https://www.divvybikes.com/data-license-agreement). Datasets are available [here](https://divvy-tripdata.s3.amazonaws.com/index.html) for download.

#### **Guiding questions**
* Where is your data located?
The data is stored in a Kaggle dataset.

* How is the data organized?
The data is separated by month, each month in its own CSV file.

* Are there issues with bias or credibility in this data? Does your data ROCCC?
Bias isn't a problem, because the population of the dataset is Cyclistic's own clients as bike riders, and the data is fully credible for the same reason. It also meets the ROCCC criteria: it is reliable, original, comprehensive, current and cited.

* How are you addressing licensing, privacy, security, and accessibility?
The company holds its own licence over the dataset. In addition, the dataset doesn't contain any personal information about the riders.

* How did you verify the data’s integrity?
All the files have consistent columns and each column has the correct type of data.

* How does it help you answer your question?
It may provide some key insights about the riders and their riding style.

* Are there any problems with the data?
It would be good to have some updated information about the bike stations. Also more information about the riders could be useful.

#### Key tasks

1. Download data and store it appropriately;
Data has been downloaded and copies have been stored securely on my computer.
2. Identify how it’s organized;
The data is in CSV (comma-separated values) format, and there are a total of 13 columns in the dataset.
3. Sort and filter the data;
For this analysis, I will be using data from Dec 2020 to Nov 2021.
4. Determine the credibility of the data;
For the purposes of this case study, the datasets are appropriate and will enable me to answer the business questions. The data has been made available by Motivate International Inc under this [license](https://www.divvybikes.com/data-license-agreement). This is public data that I can use to explore how different customer types are using Cyclistic bikes. All ride ids are unique.

#### Deliverable

1. A description of all data sources used
* The main data source is 12 months (between Dec 2020 and Nov 2021) of riding data provided by the Cyclistic company.

## **PROCESS**
This step prepares the data for analysis. All the CSV files will be merged into one data frame to improve the workflow.

#### **Guiding questions**
* What tools are you choosing and why?
I'm using R for this project for two main reasons: the dataset is large, and I want to gain experience with the language.

* Have you ensured your data’s integrity?
Yes, the data is consistent throughout the columns.

* What steps have you taken to ensure that your data is clean?
First the duplicated values were removed, then the columns were converted to their appropriate formats.

* How can you verify that your data is clean and ready to analyze?
It can be verified by this notebook.

* Have you documented your cleaning process so you can review and share those results?
Yes, it's all documented in this R notebook.


#### **Key tasks**
* Check the data for errors.
* Choose your tools.
* Transform the data so you can work with it effectively.
* Document the cleaning process.

#### **Deliverable**
Documentation of any cleaning or manipulation of data

**Code**

**Dependencies**
The main dependency for the project is tidyverse.



In [1]:
# Set working directory
setwd("/kaggle/input/coursera-google-capstone")

Install and load necessary packages

In [2]:
# Install necessary packages
install.packages("tidyverse")
install.packages("scales")
install.packages("date")
install.packages("dplyr")
install.packages("pillar")
install.packages("ggplot2")
install.packages("pkgconfig")
install.packages("isoband")
install.packages("janitor")
install.packages('stringi')
install.packages('skimr')


# Load necessary packages
library(date)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(lubridate)
library(tidyr)
library(skimr)
library(janitor)
library(scales)
library(readr)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

“installation of package ‘dplyr’ had non-zero exit status”
“installation of package ‘pillar’ had non-zero exit status”
“installation of package ‘pkgconfig’ had non-zero exit status”

In [3]:
library(ggplot2)

**Concatenating**
* All the CSV files will be concatenated into one data frame.


In [4]:
# Read in the data files
trips_2020_12 <- read.csv("/kaggle/input/coursera-google-capstone/td1.csv")
trips_2021_01 <- read.csv("/kaggle/input/coursera-google-capstone/td2.csv")
trips_2021_02 <- read.csv("/kaggle/input/coursera-google-capstone/td3.csv")
trips_2021_03 <- read.csv("/kaggle/input/coursera-google-capstone/td4.csv")
trips_2021_04 <- read.csv("/kaggle/input/coursera-google-capstone/td5.csv")
trips_2021_05 <- read.csv("/kaggle/input/coursera-google-capstone/td6.csv")
trips_2021_06 <- read.csv("/kaggle/input/coursera-google-capstone/td7.csv")
trips_2021_07 <- read.csv("/kaggle/input/coursera-google-capstone/td8.csv")
trips_2021_08 <- read.csv("/kaggle/input/coursera-google-capstone/td9.csv")
trips_2021_09 <- read.csv("/kaggle/input/coursera-google-capstone/td10.csv")
trips_2021_10 <- read.csv("/kaggle/input/coursera-google-capstone/td11.csv")
trips_2021_11 <- read.csv("/kaggle/input/coursera-google-capstone/td12.csv")
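The twelve `read.csv()` calls above can also be written as a loop over the files in a folder. The following is only a minimal sketch, assuming all monthly files sit in one folder and share the same columns. The folder, the file names and the toy rows are made up so that the example runs on its own.

```r
# Hypothetical setup: write three tiny monthly files to a temporary folder
data_dir <- file.path(tempdir(), "trips_demo")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  toy <- data.frame(
    ride_id = paste0("R", i, "_", 1:2),
    member_casual = c("member", "casual"),
    stringsAsFactors = FALSE
  )
  write.csv(toy, file.path(data_dir, paste0("td", i, ".csv")), row.names = FALSE)
}

# List every CSV file in the folder and read each one into a list element
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
monthly <- lapply(csv_files, read.csv)

# Stack the monthly data frames into one data frame
all_trips <- do.call(rbind, monthly)
nrow(all_trips)
```

`list.files()` returns names in alphabetical order, so a file such as `td10.csv` sorts before `td2.csv`. The row order of the combined data frame follows that order, which does not matter for the aggregations used later.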

Checking the structure of a data frame is important to understand the organization of the data and to ensure that the data is correctly loaded into R. The structure of a data frame includes information about the number of observations, the number of variables, the variable names, and the data types of the variables.

In [5]:
# Check the structure of one of the data frames
str(trips_2021_01)

'data.frame':	96834 obs. of  13 variables:
 $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
 $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.7 -87

In [6]:
# Compare column names of all data frames
compare_df_cols(trips_2021_01, trips_2021_02, trips_2021_03, trips_2021_04, trips_2021_05, trips_2021_06, trips_2021_07, trips_2021_08, trips_2021_09, trips_2021_10, trips_2021_11, trips_2020_12)


column_name,trips_2021_01,trips_2021_02,trips_2021_03,trips_2021_04,trips_2021_05,trips_2021_06,trips_2021_07,trips_2021_08,trips_2021_09,trips_2021_10,trips_2021_11,trips_2020_12
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
end_lat,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric
end_lng,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric
end_station_id,character,character,character,character,character,character,character,character,character,character,character,character
end_station_name,character,character,character,character,character,character,character,character,character,character,character,character
ended_at,character,character,character,character,character,character,character,character,character,character,character,character
member_casual,character,character,character,character,character,character,character,character,character,character,character,character
ride_id,character,character,character,character,character,character,character,character,character,character,character,character
rideable_type,character,character,character,character,character,character,character,character,character,character,character,character
start_lat,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric
start_lng,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric


Comparing the columns of the data frames checks for inconsistencies before they are combined. It shows whether each column has the same name and the same data type in every data frame. This is particularly useful before merging or binding multiple data frames, because it ensures the data is aligned correctly and any discrepancies are resolved before analysis.
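A similar check can be sketched in base R. The example below assumes the data frames are collected in a named list and all have the same columns in the same order. The two toy data frames are hypothetical, and the second one deliberately stores `start_station_id` as a number.

```r
# Two toy data frames; the second stores start_station_id as a number
jan <- data.frame(ride_id = "A1", start_station_id = "17660",
                  stringsAsFactors = FALSE)
feb <- data.frame(ride_id = "B1", start_station_id = 17660,
                  stringsAsFactors = FALSE)
frames <- list(jan = jan, feb = feb)

# One row per column name, one column per data frame, each cell holding the column class
col_classes <- sapply(frames, function(df) {
  sapply(df, function(col) class(col)[1])
})

# Names of the columns whose class differs between the data frames
differs <- apply(col_classes, 1, function(x) length(unique(x)) > 1)
mismatched <- rownames(col_classes)[differs]
mismatched
```

Any column listed in `mismatched` would need to be converted to a common type before the data frames are bound together.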

**Combining all the datasets into one data frame**

In [7]:
# Combine all data frames into one
Trips2021 <- rbind(trips_2021_01, trips_2021_02, trips_2021_03, trips_2021_04, trips_2021_05, trips_2021_06, trips_2021_07, trips_2021_08, trips_2021_09, trips_2021_10, trips_2021_11, trips_2020_12)


In [8]:
# Check the structure of the combined data frame
str(Trips2021)

'data.frame':	5479096 obs. of  13 variables:
 $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
 $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.7 -

**Data Cleaning**
* removing duplicates

In [9]:
# Identify and remove duplicates
duplicates <- duplicated(Trips2021$ride_id)
num_duplicates <- sum(duplicates)
Trips2021 <- Trips2021[!duplicates, ]
print(paste("Number of duplicates removed: ", num_duplicates))

[1] "Number of duplicates removed:  0"


In [10]:
#Replace blanks or empty spaces with NA
is.na(Trips2021) <- Trips2021 == ""


#Removing all NA values from dataframe
Trips2021 <- na.omit(Trips2021)

#Check for missing values
colSums(is.na(Trips2021))
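Because `na.omit()` drops every row that has at least one missing value, it helps to count how many rows are lost. Here is a minimal sketch on made-up data (the three toy rides are not from the dataset):

```r
# Toy data: one ride has a blank end station
trips <- data.frame(
  ride_id = c("A1", "A2", "A3"),
  end_station_name = c("Wood St & Augusta Blvd", "", "Wells St & Elm St"),
  stringsAsFactors = FALSE
)

# Mark empty strings as missing
trips[trips == ""] <- NA

# Count missing values per column before dropping anything
missing_per_column <- colSums(is.na(trips))

# Drop every row with at least one NA and record how many rows were lost
rows_before <- nrow(trips)
trips_clean <- na.omit(trips)
rows_dropped <- rows_before - nrow(trips_clean)
rows_dropped
```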

In [11]:
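# Convert date and time columns to POSIXct format
# Note: the format string stops at minutes, so the seconds in each timestamp are not read
# and every parsed time ends in :00. The same time zone is used for both columns.
Trips2021$started_at <- as.POSIXct(Trips2021$started_at, format = "%Y-%m-%d %H:%M", tz = "Africa/Lagos")
Trips2021$ended_at <- as.POSIXct(Trips2021$ended_at, format = "%Y-%m-%d %H:%M", tz = "Africa/Lagos")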

**Parsing datetime columns**
The `started_at` and `ended_at` columns are stored as character data, so they are converted to POSIXct in order to calculate ride lengths.
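The format string decides how much of each timestamp is read. The sketch below parses one timestamp taken from the `str()` output above with and without `%S`. The `tz = "UTC"` setting is an arbitrary choice for the example.

```r
stamp <- "2021-01-23 16:14:19"

# Format without %S: the seconds in the string are not read
to_minute <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Format with %S: the full timestamp is kept
to_second <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

format(to_minute, "%H:%M:%S")
format(to_second, "%H:%M:%S")

# Difference between the two parses, in seconds
lost_seconds <- as.numeric(difftime(to_second, to_minute, units = "secs"))
lost_seconds
```

Since ride lengths are rounded to whole minutes later on, parsing to the minute changes individual results by at most about a minute.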

**Data Manipulation**
* New columns will help improve calculation time in the future


*Adding a `ride_length` column to the data frame. The length of each ride is calculated by subtracting `started_at` from `ended_at`. The result is expressed in minutes, and only rides longer than 0 minutes are kept.*

In [12]:
# Create/add a new column to the dataframe and calculate ride length in minutes
Trips2021 <- mutate(Trips2021, ride_length = round(difftime(ended_at, started_at, units = "mins"), 0))

# Remove rows with zero or negative ride_length
Trips2021 <- Trips2021 %>%
  filter(ride_length > 0)


# View the summary of ride_length variable
summary(Trips2021$ride_length)

  Length    Class     Mode 
 4491263 difftime  numeric 
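The ride length calculation can be tried on a few made-up rides. This sketch uses base R only and, as an assumption that differs from the cell above, stores the result as a plain number so that `summary()` reports quartiles instead of the class of the column.

```r
# Toy rides; the second one ends before it starts and should be filtered out
rides <- data.frame(
  started_at = as.POSIXct(c("2021-01-23 16:14:00", "2021-01-27 18:43:00",
                            "2021-01-21 22:35:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2021-01-23 16:24:00", "2021-01-27 18:40:00",
                            "2021-01-21 22:37:00"), tz = "UTC")
)

# Ride length in whole minutes, converted from difftime to a plain number
rides$ride_length <- as.numeric(
  round(difftime(rides$ended_at, rides$started_at, units = "mins"), 0)
)

# Keep only rides that lasted longer than zero minutes
rides <- rides[rides$ride_length > 0, ]

summary(rides$ride_length)
```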

Adding "day_of_week" column to the dataframe using the start time variable

In [13]:
# Add day_of_week column using the start time variable
Trips2021 <- Trips2021 %>%
  mutate(day_of_week = wday(started_at, label = TRUE, abbr = TRUE))

In [14]:
# View dataframe structure
glimpse(Trips2021)

Rows: 4,491,263
Columns: 15
$ ride_id            <chr> "B9F73448DFBE0D45", "457C7F4B5D3DA135", "57C750326F…
$ rideable_type      <chr> "classic_bike", "electric_bike", "electric_bike", "…
$ started_at         <dttm> 2021-01-24 19:15:00, 2021-01-23 12:57:00, 2021-01-…
$ ended_at           <dttm> 2021-01-24 19:22:00, 2021-01-23 13:02:00, 2021-01-…
$ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
$ start_station_id   <chr> "17660", "17660", "17660", "17660", "17660", "17660…
$ end_station_name   <chr> "Wood St & Augusta Blvd", "California Ave & North A…
$ end_station_id     <chr> "657", "13258", "657", "657", "657", "KA1504000135"…
$ start_lat          <dbl> 41.90036, 41.90041, 41.90037, 41.90038, 41.90036, 4…
$ start_lng          <dbl> -87.69670, -87.69673, -87.69669, -8

In [15]:
# Add date, month, day and year columns using the start time variable
# format() uses the time zone stored in started_at, so the calendar date matches the parsed local time
Trips2021$date <- as.Date(format(Trips2021$started_at, "%Y-%m-%d"))
Trips2021$month <- format(Trips2021$date, "%B")
Trips2021$day <- format(Trips2021$date, "%d")
Trips2021$year <- format(Trips2021$date, "%Y")

New columns have been added, so I would like to see what the data frame looks like now.

In [16]:
# View summary of the dataframe
summary(Trips2021)

# View the first 50 rows of the dataframe
head(Trips2021,50)

# View the last 20 rows of the dataframe
tail(Trips2021,20)

# View the structure of the dataframe
str(Trips2021)

   ride_id          rideable_type        started_at                 
 Length:4491263     Length:4491263     Min.   :2020-12-01 00:07:00  
 Class :character   Class :character   1st Qu.:2021-05-28 09:28:00  
 Mode  :character   Mode  :character   Median :2021-07-22 13:45:00  
                                       Mean   :2021-07-14 02:56:53  
                                       3rd Qu.:2021-09-11 11:34:00  
                                       Max.   :2021-11-30 23:59:00  
                                                                    
    ended_at                   start_station_name start_station_id  
 Min.   :2020-12-01 00:10:00   Length:4491263     Length:4491263    
 1st Qu.:2021-05-28 09:46:00   Class :character   Class :character  
 Median :2021-07-22 14:11:00   Mode  :character   Mode  :character  
 Mean   :2021-07-14 03:19:00                                        
 3rd Qu.:2021-09-11 11:58:00                                        
 Max.   :2021-12-01 00:15:00      

Unnamed: 0_level_0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,day_of_week,date,month,day,year
Unnamed: 0_level_1,<chr>,<chr>,<dttm>,<dttm>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<drtn>,<ord>,<date>,<chr>,<chr>,<chr>
1,B9F73448DFBE0D45,classic_bike,2021-01-24 19:15:00,2021-01-24 19:22:00,California Ave & Cortez St,17660,Wood St & Augusta Blvd,657,41.90036,-87.6967,41.89918,-87.6722,member,7 mins,Sun,2021-01-24,January,24,2021
2,457C7F4B5D3DA135,electric_bike,2021-01-23 12:57:00,2021-01-23 13:02:00,California Ave & Cortez St,17660,California Ave & North Ave,13258,41.90041,-87.69673,41.91044,-87.69689,member,5 mins,Sat,2021-01-23,January,23,2021
3,57C750326F9FDABE,electric_bike,2021-01-09 15:28:00,2021-01-09 15:37:00,California Ave & Cortez St,17660,Wood St & Augusta Blvd,657,41.90037,-87.69669,41.89918,-87.67218,casual,9 mins,Sat,2021-01-09,January,9,2021
4,4D518C65E338D070,electric_bike,2021-01-09 15:28:00,2021-01-09 15:37:00,California Ave & Cortez St,17660,Wood St & Augusta Blvd,657,41.90038,-87.69672,41.89915,-87.67218,casual,9 mins,Sat,2021-01-09,January,9,2021
5,9D08A3AFF410474D,classic_bike,2021-01-24 15:56:00,2021-01-24 16:07:00,California Ave & Cortez St,17660,Wood St & Augusta Blvd,657,41.90036,-87.6967,41.89918,-87.6722,casual,11 mins,Sun,2021-01-24,January,24,2021
6,49FCE1F8598F12C6,electric_bike,2021-01-22 15:15:00,2021-01-22 15:36:00,California Ave & Cortez St,17660,Wells St & Elm St,KA1504000135,41.90037,-87.69679,41.90327,-87.63446,member,21 mins,Fri,2021-01-22,January,22,2021
7,0FEED5C2C8749A1C,classic_bike,2021-01-05 10:33:00,2021-01-05 10:39:00,California Ave & Cortez St,17660,Sacramento Blvd & Franklin Blvd,KA1504000113,41.90036,-87.6967,41.89047,-87.70261,member,6 mins,Tue,2021-01-05,January,5,2021
8,E276FD43BDED6420,classic_bike,2021-01-30 11:59:00,2021-01-30 12:03:00,California Ave & Cortez St,17660,Western Ave & Walton St,KA1504000103,41.90036,-87.6967,41.89842,-87.6866,member,4 mins,Sat,2021-01-30,January,30,2021
9,88BFCF66C2D585EC,electric_bike,2021-01-27 07:27:00,2021-01-27 07:45:00,California Ave & Cortez St,17660,Damen Ave & Clybourn Ave,13271,41.90031,-87.69679,41.93184,-87.67781,member,18 mins,Wed,2021-01-27,January,27,2021
10,8BD6F6510F5C8BD2,electric_bike,2021-01-15 08:54:00,2021-01-15 09:11:00,California Ave & Cortez St,17660,Damen Ave & Clybourn Ave,13271,41.90036,-87.69663,41.93192,-87.67786,member,17 mins,Fri,2021-01-15,January,15,2021


Unnamed: 0_level_0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,day_of_week,date,month,day,year
Unnamed: 0_level_1,<chr>,<chr>,<dttm>,<dttm>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<drtn>,<ord>,<date>,<chr>,<chr>,<chr>
4491244,6C75BF6F496BF59C,classic_bike,2020-12-29 22:17:00,2020-12-29 22:21:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83621,-87.61353,41.83884,-87.62186,member,4 mins,Tue,2020-12-29,December,29,2020
4491245,4ACAFE42A588AC1A,classic_bike,2020-12-21 06:25:00,2020-12-21 06:29:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83621,-87.61353,41.83884,-87.62186,member,4 mins,Mon,2020-12-21,December,21,2020
4491246,6BF9B8388E6626D3,electric_bike,2020-12-14 06:17:00,2020-12-14 06:20:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83675,-87.61347,41.83868,-87.62184,member,3 mins,Mon,2020-12-14,December,14,2020
4491247,FD0C864E8360310A,classic_bike,2020-12-15 14:23:00,2020-12-15 14:27:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83621,-87.61353,41.83884,-87.62186,member,4 mins,Tue,2020-12-15,December,15,2020
4491248,439614F00C8BB50F,classic_bike,2020-12-22 14:14:00,2020-12-22 14:18:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83621,-87.61353,41.83884,-87.62186,member,4 mins,Tue,2020-12-22,December,22,2020
4491249,1CB7A948228128F5,electric_bike,2020-12-07 06:30:00,2020-12-07 06:33:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83673,-87.61348,41.83869,-87.62185,member,3 mins,Mon,2020-12-07,December,7,2020
4491250,FB351326E3592815,electric_bike,2020-12-01 14:21:00,2020-12-01 14:24:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83669,-87.6134,41.83875,-87.62183,member,3 mins,Tue,2020-12-01,December,1,2020
4491251,598B1E13C9AD9854,classic_bike,2020-12-08 14:08:00,2020-12-08 14:12:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83621,-87.61353,41.83884,-87.62186,member,4 mins,Tue,2020-12-08,December,8,2020
4491252,99741BBCF3E82A6B,classic_bike,2020-12-27 12:14:00,2020-12-27 12:18:00,Rhodes Ave & 32nd St,13215,Indiana Ave & 31st St,TA1308000036,41.83621,-87.61353,41.83884,-87.62186,casual,4 mins,Sun,2020-12-27,December,27,2020
4491253,2AE24B8F461EE351,electric_bike,2020-12-16 17:42:00,2020-12-16 18:04:00,Bissell St & Armitage Ave,13059,California Ave & Altgeld St,15646,41.91851,-87.65218,41.92665,-87.6977,casual,22 mins,Wed,2020-12-16,December,16,2020


'data.frame':	4491263 obs. of  19 variables:
 $ ride_id           : chr  "B9F73448DFBE0D45" "457C7F4B5D3DA135" "57C750326F9FDABE" "4D518C65E338D070" ...
 $ rideable_type     : chr  "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct, format: "2021-01-24 19:15:00" "2021-01-23 12:57:00" ...
 $ ended_at          : POSIXct, format: "2021-01-24 19:22:00" "2021-01-23 13:02:00" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "Wood St & Augusta Blvd" "California Ave & North Ave" "Wood St & Augusta Blvd" "Wood St & Augusta Blvd" ...
 $ end_station_id    : chr  "657" "13258" "657" "657" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.

A more suitable name should be given to the `member_casual` column.

In [17]:
# Rename the column member_casual to users
colnames(Trips2021)[colnames(Trips2021)=="member_casual"] <- "users"

## **ANALYZE**
* Descriptive analysis

In [18]:
# Calculate the mean ride length and print the result
mean_ride_length = round(mean(Trips2021$ride_length),0)
print(mean_ride_length)

# Calculate the maximum ride length and print the result
max_ride_length = max(Trips2021$ride_length)
print(max_ride_length)

# Calculate the minimum ride length and print the result
min_ride_length = min(Trips2021$ride_length)
print(min_ride_length)

# Calculate the median ride length and print the result
median_ride_length = median(Trips2021$ride_length)
print(median_ride_length)

# Define a Mode function that returns the most frequent value of a vector
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Calculate the mode of the day_of_week column
mode_day_of_week <- Mode(Trips2021$day_of_week)

# Print the mode
print(mode_day_of_week)

Time difference of 22 mins
Time difference of 55944 mins
Time difference of 1 mins
Time difference of 12 mins
[1] Sat
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
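To see how the `Mode` helper works, it can be run on a short made-up vector. The comments describe each step, and the second call illustrates what happens when two values are equally frequent.

```r
Mode <- function(x) {
  ux <- unique(x)                   # distinct values, in order of first appearance
  counts <- tabulate(match(x, ux))  # how often each distinct value occurs
  ux[which.max(counts)]             # value with the highest count (the first one on ties)
}

# Toy weekday labels
days <- c("Sat", "Sun", "Sat", "Mon", "Sun", "Sat")
Mode(days)

# On a tie, the value that appears first in the vector is returned
Mode(c("Mon", "Sun", "Sun", "Mon"))
```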


**Comparing members and casual users**

In [19]:
# Compare user type with visuals
# Set the output directory and file name
output_dir <- "/kaggle/working"
output_file <- "plot0.png"

# Open the PNG device and split it into a 2 x 2 grid so all four bar plots fit in one file
png(file = file.path(output_dir, output_file))
par(mfrow = c(2, 2))

# Calculate the mean ride length for each user type
means <- aggregate(ride_length ~ users, data = Trips2021, FUN = mean)

# Extract the ride_length column as a plain numeric vector,
# which barplot() and other base functions can use directly
mean_lengths <- as.numeric(means$ride_length)

# Keep only user types whose mean is finite (drops NA, NaN and Inf)
keep <- is.finite(mean_lengths)
mean_lengths <- mean_lengths[keep]
mean_users <- means$users[keep]

# Create a bar plot of the mean ride length for each user type
# barplot() returns the x positions of the bar midpoints, which are printed below
bar_midpoints <- barplot(height = mean_lengths, names.arg = mean_users, xlab = "User Type", ylab = "Mean Ride Length")
print(bar_midpoints)

# Calculate the median ride length for each user type and plot it
medians <- aggregate(ride_length ~ users, data = Trips2021, FUN = median)
median_lengths <- as.numeric(medians$ride_length)
barplot(height = median_lengths, names.arg = medians$users, xlab = "User Type", ylab = "Median Ride Length")

# Calculate the min ride length for each user type and plot it
mins <- aggregate(ride_length ~ users, data = Trips2021, FUN = min)
min_lengths <- as.numeric(mins$ride_length)
barplot(height = min_lengths, names.arg = mins$users, xlab = "User Type", ylab = "Min Ride Length")

# Calculate the max ride length for each user type and plot it
maxs <- aggregate(ride_length ~ users, data = Trips2021, FUN = max)
max_lengths <- as.numeric(maxs$ride_length)
barplot(height = max_lengths, names.arg = maxs$users, xlab = "User Type", ylab = "Max Ride Length")

# Close the device so the file is written
dev.off()


     [,1]
[1,]  0.7
[2,]  1.9


![Minimum Ride Length.png](attachment:cfecf831-dd2f-4e33-9d68-f89135f773d8.png)![Maximum Ride Length.png](attachment:2f0c210a-6f74-4d80-b62d-33bd050a9492.png)![Mean Ride Length.png](attachment:3d33ea93-f74f-40ba-870f-f09b838b651c.png)![Median Ride Length.png](attachment:204e8e21-bf33-4fad-9025-4893b52fc439.png)

# **Supporting visualizations and key findings**

* Aggregating user type by total number of rides 

In [20]:
# Set the output directory and file name
output_dir <- "/kaggle/working"
output_file <- "plot1.png"

# Open the PNG device with the specified file and directory
png(file = file.path(output_dir, output_file))

# Visualize user type by the number of rides taken (ride count)
Trips2021 %>%
  group_by(users) %>%
  summarise(ride_count = n()) %>%
  arrange(users) %>%
  ggplot(aes(x = users, y = ride_count, fill = users)) +
  geom_col() +
  labs(title = "Total rides taken (ride_count) of Members and Casual riders") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

# Save the plot
dev.off()


![Total Rides Taken of Members & Casual Riders.png](attachment:2071a13b-f3b5-436a-8af2-318d705f7551.png)

From the above graph, we can observe that there are more member rides (2,472,658) than casual rides (2,016,189).
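The two ride counts quoted above can also be expressed as shares of all rides. A minimal sketch, using only those two numbers:

```r
# Ride counts quoted in the text above
ride_counts <- c(member = 2472658, casual = 2016189)

# Share of each user type in percent, rounded to one decimal place
ride_share <- round(100 * ride_counts / sum(ride_counts), 1)
ride_share
```

Members account for a little more than half of the rides in the period, so casual riders make up a substantial part of overall usage.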

* Aggregating user type by total number of rides and month

In [21]:
# Set the output directory and file name
output_dir <- "/kaggle/working"
output_file <- "plot2.png"

# Open the PNG device with the specified file and directory
png(file = file.path(output_dir, output_file))

# This code chunk creates a bar chart showing the number of rides by user type and month.
# The casual bars are drawn on top of the member bars at the same x position (overlaid, not stacked).
Trips2021 %>%
  group_by(month, users) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from = users, values_from = count) %>%
  ggplot(aes(x = month, y = member, fill = "Member")) +
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col() +
  geom_col(aes(y = casual, fill = "Casual")) +
  scale_fill_manual(values = c("Member" = "blue", "Casual" = "orange")) +
  labs(title = "Trips by Users' Type and Month From Dec2020 - Nov2021",
       x = "Month",
       y = "Number of Trips",
       fill = "User Type")

# Save the plot
dev.off()

`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.


![image.png](attachment:a2216acf-11da-4a15-843a-141baeb42bfb.png)

* Aggregating user type by total number of rides vs day of the week

In [22]:
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot3.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot3.png")

# This code chunk creates a bar chart showing the total trips by user type vs. day of the week.
Trips2021 %>%
  group_by(users, day_of_week) %>%
  summarise(number_of_rides = n()) %>%
  arrange(users, day_of_week)  %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = users)) +
  labs(title = "Total trips by User type Vs. Day of the week") +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

# Save the plot
dev.off()

[1m[22m`summarise()` has grouped output by 'users'. You can override using the
`.groups` argument.


![Total trips by User type Vs. Day of the week.png](attachment:98b92b28-d037-4aad-a4a4-1c43d79225cf.png)

* Aggregating user type by total number of rides by hour of the day


In [23]:
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot4.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot4.png")

 #This code chunk creates a scatter plot of ride length vs. hour of the day, with each point representing a single ride.
# Extract the hour of the day from the started_at column
Trips2021$hour <- format(Trips2021$started_at, format = "%H")
# Create a scatter plot of ride length vs. hour of the day
ggplot(Trips2021, aes(x = hour, y = ride_length)) +
  geom_point() +
  labs(title = "Ride Length vs. Hour of the Day",
       x = "Hour of the Day",
       y = "Ride Length")

# Save the plot
dev.off()


[1m[22mDon't know how to automatically pick scale for object of type [34m<difftime>[39m.
Defaulting to continuous.


![Ride Length vs Hour of the Day.png](attachment:ecedff26-91a0-45c0-a1fc-42a277e9989d.png)
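The bullet for this cell mentions the total number of rides by hour, while the plot shows ride length. A minimal sketch of the count itself, in base R, could look like the following. The `trips` data frame is a small hypothetical stand-in for `Trips2021` with made-up values, and it assumes `started_at` is a date-time column.

```r
# Toy stand-in for Trips2021 (hypothetical values, for illustration only)
trips <- data.frame(
  users = c("member", "member", "casual", "casual", "member", "casual"),
  started_at = as.POSIXct(
    c("2021-06-01 08:05:00", "2021-06-01 08:40:00", "2021-06-01 14:10:00",
      "2021-06-02 14:55:00", "2021-06-02 17:20:00", "2021-06-02 17:45:00"),
    tz = "UTC"
  )
)

# Extract the hour of day as a two-digit string, as in the cell above
trips$hour <- format(trips$started_at, format = "%H")

# Cross-tabulate the number of rides by hour and user type
rides_by_hour <- table(hour = trips$hour, users = trips$users)
print(rides_by_hour)

# Long format (one row per hour and user type), convenient for a dodged bar chart
rides_long <- as.data.frame(rides_by_hour, responseName = "number_of_rides")
print(rides_long)
```

Applied to the real data, the long table could be passed to `ggplot()` with `geom_col(position = "dodge")`, in the same way as the day-of-week chart earlier in this section.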

* Aggregating user type by ride lengths by hour of the day


In [24]:
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot5.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot5.png")

 #This code chunk creates a scatter plot of ride length vs. hour of the day against each user type, with each point representing a single ride.
# Create a scatter plot of ride length vs. hour of the day by each user type
ggplot(Trips2021, aes(x = hour, y = ride_length, color = users)) +
  geom_point() +
  labs(title = "Ride Length vs. Hour of the Day by User Type",
       x = "Hour of the Day",
       y = "Ride Length",
       color = "User Type")

# Save the plot
dev.off()

[1m[22mDon't know how to automatically pick scale for object of type [34m<difftime>[39m.
Defaulting to continuous.


![Ride Length vs Hour of the Day by user type.jpeg](attachment:9e3b527c-f3ca-4df2-8f94-2e29d7c54a95.jpeg)
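The `<difftime>` message above appears because `ride_length` is a time-difference object rather than a plain number. One way to avoid it, sketched here with made-up timestamps, is to convert the values to numeric minutes before plotting. The name `ride_length_min` is hypothetical.

```r
# Made-up start and end times for two rides
started_at <- as.POSIXct(c("2021-06-01 08:00:00", "2021-06-01 09:15:00"), tz = "UTC")
ended_at   <- as.POSIXct(c("2021-06-01 08:12:30", "2021-06-01 10:45:00"), tz = "UTC")

# Fix the unit explicitly so it does not depend on the size of the differences
ride_length <- difftime(ended_at, started_at, units = "secs")

# Convert to plain numeric minutes; the units argument makes the conversion explicit
ride_length_min <- as.numeric(ride_length, units = "mins")
print(ride_length_min)
```

Plotting the numeric column also makes it easy to state the unit in the axis label, for example "Ride Length (minutes)".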

* Aggregating ride length by rideable type and user type

In [25]:
 # Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot6.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot6.png")

# Create a bar chart of ride length by rideable type and user type
 ggplot(Trips2021, aes(x = rideable_type, y = ride_length, fill = users)) +
  geom_bar(stat = "identity", position = "dodge") +
   labs(title = "Ride Length by Rideable Type and User Type",
        x = "Rideable Type",
        y = "Ride Length",
        fill = "User Type") +
   theme_minimal()

# Save the plot
dev.off()


[1m[22mDon't know how to automatically pick scale for object of type [34m<difftime>[39m.
Defaulting to continuous.


![rideableType by user.png](attachment:fe12a05f-0f0f-4069-9b9c-9bf1c387ee58.png)

The electric bike appears to be the least-chosen rideable type for both user types (casual and member). The docked bike seems to be the most popular option and is dominated by casual users. Members use the classic and docked bikes at about the same rate.
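Because the chart above plots `ride_length` for every individual ride, the bar heights likely reflect the longest rides in each group rather than how often each bike type is chosen. A sketch of counting rides and averaging ride length per group is shown below. The `trips` data frame and its values are hypothetical, and `ride_length` is assumed to be in seconds.

```r
# Toy stand-in for Trips2021 (hypothetical values, ride_length in seconds)
trips <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "docked_bike", "electric_bike",
                    "classic_bike", "electric_bike", "docked_bike"),
  users = c("member", "member", "casual", "casual", "casual", "member", "casual"),
  ride_length = c(600, 720, 2400, 900, 1500, 480, 3000)
)

# Number of rides per rideable type and user type (how often each type is chosen)
ride_counts <- as.data.frame(
  table(rideable_type = trips$rideable_type, users = trips$users),
  responseName = "number_of_rides"
)
print(ride_counts)

# Average ride length in minutes per rideable type and user type
trips$ride_minutes <- trips$ride_length / 60
avg_length <- aggregate(ride_minutes ~ rideable_type + users, data = trips, FUN = mean)
print(avg_length)
```

The count table answers "which bike type is preferred", while the average table answers "how long is it ridden", so the two are worth reporting separately.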

* Average ride length by hour of the day 

In [26]:
library(RColorBrewer)
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot7.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot7.png")

# Calculate the average ride length by hour of the day
ride_length_hour <- Trips2021 %>%
  mutate(hour = format(started_at, format = "%H")) %>%
  group_by(hour) %>%
  summarise(avg_ride_length = mean(as.numeric(ride_length)))

# Create a heatmap
ggplot(ride_length_hour, aes(x = hour, y = avg_ride_length, fill = avg_ride_length)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Average Ride Length by Hour of the Day Heatmap",
       x = "Hour of the Day",
       y = "Average Ride Length") +
  theme_minimal()

# Save the plot
dev.off()

![Average Ride Length by Hour of the Day Heatmap.png](attachment:c08520d4-fa4f-4e36-b0a1-fff777b8185e.png)

* Aggregating user type by average ride length and day of the week

In [27]:
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot8.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot8.png")

# This code chunk creates a bar chart showing the average ride length by user type vs. day of the week.
Trips2021 %>%
  group_by(users, day_of_week) %>%
  summarise(average_ride_length = mean(as.numeric(ride_length), na.rm = TRUE)) %>%
  ggplot(aes(x = day_of_week, y = average_ride_length/60, fill = users)) + # dividing by 60 to convert seconds to minutes
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Average Ride length by User type Vs. Day of the week")

# Save the plot
dev.off()

[1m[22m`summarise()` has grouped output by 'users'. You can override using the
`.groups` argument.


![Average Ride Length by User Type vs Day of the Week.png](attachment:a70976fc-47f2-4a77-8153-d25d3a704a8e.png)

* Aggregating average ride length for each user type by month.

In [28]:
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot9.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot9.png")

# Visualize the average ride length (duration) by month for each user type
Trips2021 %>% 
  group_by(users,month) %>% 
  summarise(average_ride_length = mean(as.numeric(ride_length))) %>% 
  arrange(month) %>% 
  ggplot(aes(x = month, y = average_ride_length, fill = users)) +
  labs(title = "Average ride length of members and casual riders by month")+
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col(position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

# Save the plot
dev.off()

[1m[22m`summarise()` has grouped output by 'users'. You can override using the
`.groups` argument.


![Average Ride Distance of Members & Casual Riders by Month.png](attachment:3725fbab-ffc9-4e43-9663-88863fb299ae.png)

From the graph above, casual riders tend to ride for longer (by duration, not distance) than members.
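To back the chart with numbers, the averages can also be tabulated directly. The sketch below does this in base R on a small hypothetical data frame (`trips`, with made-up values), assuming `ride_length` is a `difftime` column as in `Trips2021`.

```r
# Toy stand-in for Trips2021 (hypothetical values)
trips <- data.frame(
  users = c("casual", "casual", "casual", "member", "member", "member"),
  month = c("Jun", "Jun", "Jul", "Jun", "Jul", "Jul"),
  ride_length = as.difftime(c(1800, 2400, 3000, 600, 900, 660), units = "secs")
)

# Work in numeric minutes so the averages are easy to read
trips$ride_minutes <- as.numeric(trips$ride_length, units = "mins")

# Average ride length per user type
avg_by_user <- tapply(trips$ride_minutes, trips$users, mean)
print(avg_by_user)

# Average ride length per user type and month
avg_by_user_month <- aggregate(ride_minutes ~ users + month, data = trips, FUN = mean)
print(avg_by_user_month)
```

With the real data, the same two tables would show how large the gap between casual riders and members is, and whether it holds in every month.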



* Aggregating Ride Length and Hour of the Day by User Type

In [29]:
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot10.png"

# Define the desired width and height for the output plot
width <- 17  # Width in inches
height <- 12  # Height in inches

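# Open the PNG device with the specified file, using the width and height defined above
# (res = 150 is an assumed resolution; png() requires one when units are not pixels)
png(file = "/kaggle/working/plot10.png", width = width, height = height, units = "in", res = 150)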

# Convert hour to character
Trips2021$hour <- as.character(Trips2021$hour)

# Create the plot with organized x-axis labels
ggplot(Trips2021, aes(x = hour, y = ride_length, color = users)) +
  geom_point() +
  labs(title = "Ride Length and Hour of the Day by User Type",
       x = "Hour of the Day",
       y = "Ride Length",
       color = "User Type") +
  facet_wrap(~ day, ncol = 3) +
  theme_minimal() +
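  # hour holds two-digit strings ("00" to "23"), so the breaks must be strings as well
  scale_x_discrete(breaks = sprintf("%02d", seq(0, 22, by = 2)), labels = c("12AM", "2AM", "4AM", "6AM", "8AM", "10AM", "12PM", "2PM", "4PM", "6PM", "8PM", "10PM"))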

# Save the plot
dev.off()

[1m[22mDon't know how to automatically pick scale for object of type [34m<difftime>[39m.
Defaulting to continuous.
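Since `format(started_at, "%H")` yields two-digit strings such as "00" and "14", axis breaks for the `hour` column need to be strings too, and the 12-hour labels can be generated rather than typed by hand. The helper below is a hypothetical sketch (`hour_label` is my own name, not part of the notebook).

```r
# Convert two-digit 24-hour strings ("00" to "23") to labels such as "12AM" or "2PM"
hour_label <- function(hour_chr) {
  h <- as.integer(hour_chr)
  suffix <- ifelse(h < 12, "AM", "PM")
  h12 <- h %% 12
  h12[h12 == 0] <- 12  # midnight and noon are shown as 12
  paste0(h12, suffix)
}

# Every second hour, as two-digit strings that match the values in the hour column
hour_breaks <- sprintf("%02d", seq(0, 22, by = 2))
hour_labels <- hour_label(hour_breaks)

print(data.frame(hour_breaks, hour_labels))
```

These two vectors could then be supplied as `scale_x_discrete(breaks = hour_breaks, labels = hour_labels)`, which keeps the breaks and labels in step if the spacing is changed later.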


* Aggregating Ride Length by Day of the Week, Hour of the Day, and User Type

In [30]:
# Set the output directory and file name
output_dir <- "/kaggle/working/"
output_file <- "plot11.png"

# Open the PNG device with the specified file and directory
png(file = "/kaggle/working/plot11.png")

# Convert the hour column to an ordered factor; format(..., "%H") yields "00" to "23"
Trips2021$hour <- factor(Trips2021$hour, levels = sprintf("%02d", 0:23), ordered = TRUE)

# Create the plot, with one facet row per value of the day column.
# The x-axis shows hours, so it keeps its default hour labels
# (weekday names do not belong on this axis).
ggplot(Trips2021, aes(x = hour, y = ride_length, color = users)) +
  geom_point() +
  labs(title = "Ride Length by Day of the Week, Hour of the Day, and User Type",
       x = "Hour of the Day",
       y = "Ride Length",
       color = "User Type") +
  facet_grid(day ~ .) +
  theme_minimal()
                   
# Save the plot
dev.off()

[1m[22mDon't know how to automatically pick scale for object of type [34m<difftime>[39m.
Defaulting to continuous.


# **SHARE**
The share phase is usually done by building a presentation.
Let's go through the main findings and try to arrive at a conclusion.

What we know about the dataset:

* **Members** account for the largest share of the dataset, about **23%** more rides than **casuals**.
* There are more data points in the third quarter of the year.
* The month with the highest count of data points was July.
* There are more member rides than casual rides in every month except June, July and August.
* The difference in proportion between members and casuals is larger in the second half of 2021.
* Time of day also influences the volume of rides across the days of the week.
* There is a larger volume of riders in the afternoon, mostly around 1pm to 6pm.
* The most active time for riders is between 12 noon and 9pm.
* The largest volume of data is on the weekend (Saturday and Sunday).

The remaining question is: **Why are there more member rides than casual rides?** One plausible answer is that members have a greater need for the bikes than casuals, as seen in how member rides outnumber casual rides in most months.

Besides that, there are more bike rides on the weekends, perhaps because the bikes are used more for recreation on those days. This is even more plausible given that there is a larger volume of riders in the afternoon.

Now for how members differ from casuals:

* Members have the larger volume of data, except on Saturday and Sunday. On those days, casuals have the most data points.
* Weekends have the biggest volume of casuals.
* There are more members during the morning, mainly between 5am and 10am, and more casuals between 11pm and 4am.
* There is a big increase in data points midweek between 6am and 8am for members, after which the volume falls a bit. Another big increase occurs from 7pm to 9pm.
* During the weekend, there is a bigger flow of casuals between 10am and 6pm.
* Members have a stronger preference for classic bikes.
* Casuals have more riding time than members.
* Riding time for members stays roughly unchanged midweek and increases during weekends.
* Casuals follow a more curved distribution, peaking on Sundays and dipping on Thursday/Friday.

What we can take from this information is that members have a more fixed use for the bikes than casuals. They use them for routine activities, like:

* Going to work.
* Exercising.

This is supported by the fact that there are more members between 5am and 10am and from 5pm to 6pm. Members may also have set routes when using the bikes, as suggested by their riding time staying unchanged midweek and increasing during weekends. The bikes are also heavily used for recreation on the weekends, when riding time increases and casuals take over.

Members also have a stronger preference for classic bikes, so they can exercise when going to work.
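The time-of-day and weekend patterns listed above can be checked by bucketing rides and cross-tabulating them by user type. The sketch below uses a small hypothetical `trips` data frame with made-up values. The bucket boundaries (night before 5am, morning until noon, afternoon until 6pm, evening after that) are my own assumption, not something defined in the analysis.

```r
# Toy stand-in for Trips2021 (hypothetical values)
trips <- data.frame(
  users = c("member", "member", "casual", "casual", "member", "casual"),
  started_at = as.POSIXct(
    c("2021-06-07 07:30:00",   # Monday
      "2021-06-08 17:10:00",   # Tuesday
      "2021-06-12 13:00:00",   # Saturday
      "2021-06-13 15:45:00",   # Sunday
      "2021-06-09 08:15:00",   # Wednesday
      "2021-06-12 11:20:00"),  # Saturday
    tz = "UTC"
  )
)

# Weekday number is locale-independent: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(trips$started_at)$wday
trips$day_type <- ifelse(wday %in% c(0, 6), "weekend", "weekday")

# Bucket the hour of day (assumed boundaries, intervals closed on the left)
hour <- as.integer(format(trips$started_at, "%H"))
trips$time_of_day <- cut(
  hour,
  breaks = c(0, 5, 12, 18, 24),
  labels = c("night", "morning", "afternoon", "evening"),
  right = FALSE
)

# Rides by weekday/weekend and by time of day, split by user type
day_table <- table(trips$day_type, trips$users)
time_table <- table(trips$time_of_day, trips$users)
print(day_table)
print(time_table)
```

Run against `Trips2021`, tables like these would give the counts behind statements such as "more members in the morning" or "more casuals on the weekend".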

#### **Concluding that;**

1. Members use the bikes for fixed activities, one of which is going to work.
2. Bikes are used for recreation on the weekends.
3. Rides are influenced by purposes as well as time of day.

#### **Guiding questions**
Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
Yes. The data points to several differences between casuals and members.

What story does your data tell?
The main story the data tells is that members have set schedules, as seen in the charts above. Those timestamps indicate that members use the bikes for routine activities, like going to work. They also indicate that members have less riding time, because they have a set route to take.

How do your findings relate to your original question?
The findings build a profile for members, which relates to "the key differences between casual and member riders". Knowing why they use the bikes also helps to answer "how digital media could influence them".

Who is your audience? What is the best way to communicate with them?
The main target audience is the Cyclistic marketing analytics team and Lily Moreno. The best way to communicate is through a slide presentation of the findings.

Can data visualization help you share your findings?
Yes, the core of the findings is communicated through data visualization.

Is your presentation accessible to your audience?
Yes, the plots were made using vibrant colors and corresponding labels.


Key tasks

* [x] Determine the best way to share your findings.
* [x] Create effective data visualizations.
* [x] Present your findings.
* [x] Ensure your work is accessible.

Deliverable

* [x] Supporting visualizations and key findings

# **ACT**
The act phase would be carried out by the company's marketing team. The main takeaway will be the top three recommendations for the marketing team.


#### **Guiding questions**
What is your final conclusion based on your analysis?
Members and casuals have different habits when using the bikes. The conclusion is stated in more detail in the share phase.

How could your team and business apply your insights?
The insights could be applied when preparing a marketing campaign to convert casuals into members. The marketing could focus on workers, presenting the bikes as a green way to get to work.

What next steps would you or your stakeholders take based on your findings?
Further analysis could be done to improve the findings. Besides that, the marketing team can use the main findings to build a marketing campaign.


#### Deliverable
* Your top three recommendations based on your analysis.
1. Build a marketing campaign focused on showing how bikes help people get to work. Since the bikes are also used for recreation on the weekends, ad campaigns could also show people using the bikes for exercise during the week. The ads could focus on how practical and consistent the bikes can be, while keeping the planet green and avoiding traffic. The ads could be shown on professional social networks.

2. Host fun biking competitions with prizes at intervals for casual riders during the periods when they ride most. This would also encourage them to get a membership.

3. Increase benefits for riding during certain months. Encourage casual riders to ride throughout the year through advertisements, flyers, and coupons, so as to persuade them to become members.

#### **Conclusion**
The Google Data Analytics Professional Certificate taught me a lot, and the R language is really useful for analysing data (although it's my first time using it). This took me more time than I expected, but it was fun.

Thank you for going through my work. I'm open to new ideas and corrections. Happy Reading!!
