# **Divvy bike-share - Google Data Analytics (Case Study)**
# Author: *Behnam Ebrahimi*
# Start date: *7/2/2021*
![Divvy](https://d21xlh2maitm24.cloudfront.net/chi/Divvy-Bike_new_0119_v3.png?mtime=20190820123644)

# How Does a Bike-Share Navigate Speedy Success?

## Cyclistic

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.  
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analysing the Cyclistic historical bike trip data to identify trends.

## Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the marketing team wants to understand how casual riders and annual members use Cyclistic bikes differently.  

From these insights, the marketing team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.  

## Characters and teams
* Cyclistic:

A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

* Lily Moreno:

The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

* Marketing Analytics Team:

A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

* Executive Team:

The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.



## Data analysis process:

### 1 - Ask

##### A brief list of steps

* Ask effective questions
* Define the scope of the analysis
* Define what success looks like

##### Three questions will guide the future marketing program: 

* 1. How do annual members and casual riders use Cyclistic bikes differently?
* 2. Why would casual riders buy Cyclistic annual memberships?
* 3. How can Cyclistic use digital media to influence casual riders to become members?

#### Identify the business task.
The key business task in this case is to discover how casual riders and Cyclistic members use their rental bikes differently. Both the Director of Marketing and the finance analysts have concluded that annual members are more profitable.

Therefore, the results of this analysis will be used to design a new marketing strategy to convert casual riders to annual members.

#### Consider key stakeholders.

Key stakeholders include: 
   * Cyclistic executive team
   * Director of Marketing
   * Marketing Analytics team.

### 2 - Prepare

##### A brief list of steps
* Verify data’s integrity
* Check data credibility and reliability
* Check data types
* Merge datasets


Data has been downloaded from Motivate International Inc <https://divvy-tripdata.s3.amazonaws.com/index.html>. 
Local copies on Kaggle were carefully archived.
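
The "Merge datasets" step is not shown in this notebook, which loads an already merged file. As a minimal sketch of how monthly trip files could be stacked, assuming every file has the same columns (the file names and toy rows here are hypothetical):

```r
# Hypothetical example: two tiny "monthly" files written to a temporary folder
tmp_dir <- tempfile("divvy_months_")
dir.create(tmp_dir)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("casual", "member")),
          file.path(tmp_dir, "202006-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(tmp_dir, "202007-tripdata.csv"), row.names = FALSE)

# Read every CSV in the folder into a list of data frames
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
monthly <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)

# rbind() stops with an error if column names differ, which doubles as a basic integrity check
merged <- do.call(rbind, monthly)

nrow(merged)
```

Stacking with `rbind()` is a deliberate choice here: a mismatch in column names between months surfaces immediately instead of silently producing extra columns.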

#### Identify the organisation of data.

The data is in comma-delimited (.CSV) format with 15 columns.

* ride ID
* ride type
* start & end time 
* ride length
* day of the week
* starting point (code, name, and latitude/longitude)
* ending point (code, name, and latitude/longitude)
* member types

#### Determination of data credibility

Because this is a case study that uses public data, we will presume that the data are reliable.

### 3 - Process

##### A brief list of steps
* Clean, Remove and Transform data
* Document cleaning processes and results

* First, the file name should follow a consistent naming convention:
  + **Divvy_Trip_Data_202x_xx_Vxx**
  
* Now we need to load the CSV file into R for processing 
  + **Installing packages and loading them:**
    - tidyverse (helps wrangle data)
    - dplyr (helps wrangle data)
    - lubridate (helps wrangle date and date-time attributes)
    - ggplot2 (helps visualize data)
    - Hmisc (helps with descriptive statistics) 
  
  
![img](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/05/ggplot_hive.jpg)  

### Process framework
![tidyverse](https://rviews.rstudio.com/post/2017-06-09-What-is-the-tidyverse_files/tidyverse1.png)

![Divvy](https://d21xlh2maitm24.cloudfront.net/chi/Divvy_Explore_test_190820_164722.jpg?mtime=20190820164722)

# Importing the CSV file

In [1]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages
library(dplyr)
library(Hmisc)
library(ggplot2)
library(lubridate)

# Import fastDummies
library('fastDummies')


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input//divvy-bike-share-google-data-analyticscase-study")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.4     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Loading required package: lattice

Loading required package: survival

Loading required package: Formula


Attaching package: ‘Hmisc’


The following objects are masked from ‘package:dplyr’:

    src, summarize


The following objects are masked from ‘package:base’:

    format.pval, units



Attaching package: ‘lubridate’


The following objects are mas

In [2]:
df_an <- read.csv(file = '../input//divvy-bike-share-google-data-analyticscase-study/Divvy_Trip_Data_2020_June_2021_May_V04.csv')


## CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
### Inspect the table that has been loaded

### List of column names



In [3]:
colnames(df_an)

### How many rows are in data frame?
 

In [4]:
nrow(df_an)

### Dimensions of the data frame?


In [5]:
dim(df_an)

### See the first and last 6 rows of the data frame.

In [6]:
head(df_an, n = 6)
tail(df_an, n = 6)

Unnamed: 0_level_0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,WEEKDAY
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>
1,8CD5DE2C2B6C4CFC,docked_bike,13-06-20 23:24,13-06-20 23:36,Wilton Ave & Belmont Ave,117,Damen Ave & Clybourn Ave,163,41.94018,-87.65304,41.93193,-87.67786,casual,00:12:00,7
2,9A191EB2C751D85D,docked_bike,26-06-20 07:26,26-06-20 07:31,Federal St & Polk St,41,Daley Center Plaza,81,41.87208,-87.62954,41.88424,-87.62963,member,00:05:00,6
3,F37D14B0B5659BCF,docked_bike,23-06-20 17:12,23-06-20 17:21,Daley Center Plaza,81,State St & Harrison St,5,41.88424,-87.62963,41.87405,-87.62772,member,00:09:00,3
4,C41237B506E85FA1,docked_bike,20-06-20 01:09,20-06-20 01:28,Broadway & Cornelia Ave,303,Broadway & Berwyn Ave,294,41.94553,-87.64644,41.97835,-87.65975,casual,00:19:00,7
5,4B51B3B0BDA7787C,docked_bike,25-06-20 16:59,25-06-20 17:08,Sheffield Ave & Webster Ave,327,Wilton Ave & Belmont Ave,117,41.92154,-87.65382,41.94018,-87.65304,casual,00:09:00,5
6,D50DF288196B53BE,docked_bike,17-06-20 18:07,17-06-20 18:18,Sheffield Ave & Webster Ave,327,Wilton Ave & Belmont Ave,117,41.92154,-87.65382,41.94018,-87.65304,casual,00:11:00,4


Unnamed: 0_level_0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,WEEKDAY
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>
4073556,D0B8E59E2B3C406D,electric_bike,02-05-21 17:48,02-05-21 17:52,Blackstone Ave & Hyde Park Blvd,13398,,,41.80259,-87.59031,41.8,-87.6,member,00:04:02,1
4073557,EF56D7D1D612AC11,electric_bike,20-05-21 16:32,20-05-21 16:35,Blackstone Ave & Hyde Park Blvd,13398,,,41.80258,-87.59023,41.8,-87.6,member,00:03:25,5
4073558,745191CB9F21DE3C,classic_bike,29-05-21 16:40,29-05-21 17:22,Sheridan Rd & Montrose Ave,TA1307000107,Michigan Ave & Oak St,13042,41.96167,-87.65464,41.90096,-87.62378,casual,00:42:00,7
4073559,428575BAA5356BFF,electric_bike,31-05-21 14:24,31-05-21 14:31,Sheridan Rd & Montrose Ave,TA1307000107,,,41.96152,-87.65465,41.95,-87.65,member,00:06:44,2
4073560,FC8A4A7AB7249662,electric_bike,25-05-21 16:01,25-05-21 16:07,Sheridan Rd & Montrose Ave,TA1307000107,,,41.96165,-87.65472,41.98,-87.66,member,00:06:04,3
4073561,E873B8AA3EE84678,docked_bike,12-05-21 12:22,12-05-21 12:30,Sheridan Rd & Montrose Ave,TA1307000107,Clark St & Grace St,TA1307000127,41.96167,-87.65464,41.95078,-87.65917,casual,00:08:13,4


In [7]:
df_Divvy_C_V01 <- df_an %>% 
  select(-c(ride_id,start_station_id,end_station_id))

In [8]:
head(df_Divvy_C_V01)
write_csv(df_Divvy_C_V01,'df_Divvy_C_V01.csv')

Unnamed: 0_level_0,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,WEEKDAY
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>
1,docked_bike,13-06-20 23:24,13-06-20 23:36,Wilton Ave & Belmont Ave,Damen Ave & Clybourn Ave,41.94018,-87.65304,41.93193,-87.67786,casual,00:12:00,7
2,docked_bike,26-06-20 07:26,26-06-20 07:31,Federal St & Polk St,Daley Center Plaza,41.87208,-87.62954,41.88424,-87.62963,member,00:05:00,6
3,docked_bike,23-06-20 17:12,23-06-20 17:21,Daley Center Plaza,State St & Harrison St,41.88424,-87.62963,41.87405,-87.62772,member,00:09:00,3
4,docked_bike,20-06-20 01:09,20-06-20 01:28,Broadway & Cornelia Ave,Broadway & Berwyn Ave,41.94553,-87.64644,41.97835,-87.65975,casual,00:19:00,7
5,docked_bike,25-06-20 16:59,25-06-20 17:08,Sheffield Ave & Webster Ave,Wilton Ave & Belmont Ave,41.92154,-87.65382,41.94018,-87.65304,casual,00:09:00,5
6,docked_bike,17-06-20 18:07,17-06-20 18:18,Sheffield Ave & Webster Ave,Wilton Ave & Belmont Ave,41.92154,-87.65382,41.94018,-87.65304,casual,00:11:00,4


In [9]:
unique(df_Divvy_C_V01$rideable_type)
unique(df_Divvy_C_V01$member_casual)

## Generating Dummy variable

In [10]:
df_Divvy_C_V01$rideable_type_Dummy <- ifelse(df_Divvy_C_V01$rideable_type == 'docked_bike',0,ifelse(df_Divvy_C_V01$rideable_type == 'electric_bike',1,2))
df_Divvy_C_V01$member_type_Dummy <- ifelse(df_Divvy_C_V01$member_casual == 'member',0,1)

* **rideable_type**
    - docked_bike -> 0
    - electric_bike -> 1
    - classic_bike -> 2
* **member_casual**
    - member -> 0
    - casual -> 1
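
The same mapping can also be written as a lookup table. This is only a sketch on toy vectors (the values below are hypothetical stand-ins for the real columns): unlike the nested `ifelse()`, which sends any label other than `docked_bike` or `electric_bike` to code 2, a named lookup returns `NA` for a label it does not know.

```r
# Toy vectors standing in for the real columns (hypothetical values)
rideable_type <- c("docked_bike", "electric_bike", "classic_bike", "docked_bike")
member_casual <- c("casual", "member", "member", "casual")

# Named lookup vectors make the mapping explicit; an unexpected label becomes NA
bike_codes   <- c(docked_bike = 0, electric_bike = 1, classic_bike = 2)
member_codes <- c(member = 0, casual = 1)

# Index the lookup by label, then drop the names to get plain numeric codes
rideable_type_dummy <- unname(bike_codes[rideable_type])
member_type_dummy   <- unname(member_codes[member_casual])

rideable_type_dummy
member_type_dummy
```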

In [11]:
df_Divvy_C_V02 <- df_Divvy_C_V01 %>% 
  select(-c(rideable_type,member_casual))

In [12]:
str(df_Divvy_C_V02)

'data.frame':	4073561 obs. of  12 variables:
 $ started_at         : chr  "13-06-20 23:24" "26-06-20 07:26" "23-06-20 17:12" "20-06-20 01:09" ...
 $ ended_at           : chr  "13-06-20 23:36" "26-06-20 07:31" "23-06-20 17:21" "20-06-20 01:28" ...
 $ start_station_name : chr  "Wilton Ave & Belmont Ave" "Federal St & Polk St" "Daley Center Plaza" "Broadway & Cornelia Ave" ...
 $ end_station_name   : chr  "Damen Ave & Clybourn Ave" "Daley Center Plaza" "State St & Harrison St" "Broadway & Berwyn Ave" ...
 $ start_lat          : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng          : num  -87.7 -87.6 -87.6 -87.6 -87.7 ...
 $ end_lat            : num  41.9 41.9 41.9 42 41.9 ...
 $ end_lng            : num  -87.7 -87.6 -87.6 -87.7 -87.7 ...
 $ ride_length        : chr  "00:12:00" "00:05:00" "00:09:00" "00:19:00" ...
 $ WEEKDAY            : int  7 6 3 7 5 4 5 6 3 1 ...
 $ rideable_type_Dummy: num  0 0 0 0 0 0 0 0 0 0 ...
 $ member_type_Dummy  : num  1 0 0 1 1 1 0 1 0 0 ...
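
The `str()` output shows that `started_at` and `ended_at` are still character strings in day-month-year order. A small sketch of how they could be parsed, using two values copied from the output above; the `UTC` time zone is an assumption chosen only to keep the wall-clock times unchanged:

```r
# Two timestamps copied from the output above
started_at <- c("13-06-20 23:24", "26-06-20 07:26")

# Day, month, two-digit year, then hours and minutes
started <- as.POSIXct(started_at, format = "%d-%m-%y %H:%M", tz = "UTC")

# POSIXlt counts weekdays from 0 (Sunday), so add 1 to match the 1-7 WEEKDAY column
weekday <- as.POSIXlt(started)$wday + 1

format(started, "%Y-%m-%d")
weekday
```

With an explicit `format`, the first value is read as 13 June 2020 rather than as a date in year 13, and the derived weekday numbers agree with the `WEEKDAY` column for these two rows.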


In [13]:
write_csv(df_Divvy_C_V02,'df_Divvy_C_V02.csv')

In [14]:
df_Divvy_C_V02[!complete.cases(df_Divvy_C_V02),]

Unnamed: 0_level_0,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,ride_length,WEEKDAY,rideable_type_Dummy,member_type_Dummy
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<int>,<dbl>,<dbl>
801,04-06-20 07:24,04-06-20 07:58,Broadway & Cornelia Ave,,41.94553,-87.64644,,,00:34:00,5,0,0
1267,28-06-20 13:58,28-06-20 18:23,Michigan Ave & Lake St,,41.88602,-87.62412,,,04:25:00,1,0,1
2355,21-06-20 19:08,21-06-20 20:28,Michigan Ave & Lake St,,41.88602,-87.62412,,,01:20:00,1,0,0
2493,04-06-20 08:46,04-06-20 10:09,Indiana Ave & Roosevelt Rd,,41.86789,-87.62304,,,01:23:00,5,0,1
2958,17-06-20 09:51,17-06-20 10:18,Clarendon Ave & Junior Ter,,41.96100,-87.64960,,,00:27:00,4,0,0
2988,16-06-20 14:14,16-06-20 14:41,Clarendon Ave & Junior Ter,,41.96100,-87.64960,,,00:27:00,3,0,0
3022,28-06-20 13:39,28-06-20 16:08,Lakeview Ave & Fullerton Pkwy,,41.92586,-87.63897,,,02:29:00,1,0,1
3594,23-06-20 16:48,23-06-20 17:17,LaSalle St & Washington St,,41.88266,-87.63253,,,00:29:00,3,0,0
3599,26-06-20 00:27,26-06-20 00:53,Clark St & Schiller St,,41.90799,-87.63150,,,00:26:00,6,0,0
3856,10-06-20 10:30,10-06-20 10:46,Wells St & Hubbard St,,41.88991,-87.63427,,,00:16:00,4,0,1
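
In the rows above, the end coordinates are `NA` while the end station name is an empty string. A minimal sketch on toy rows (hypothetical values that mirror this pattern) shows why the two need to be counted separately: `complete.cases()` and `is.na()` only see `NA`, not blanks.

```r
# Toy rows mirroring the pattern above: missing end coordinates and a blank station name
rides <- data.frame(
  end_station_name = c("Daley Center Plaza", "", ""),
  end_lat = c(41.88424, NA, NA),
  end_lng = c(-87.62963, NA, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_per_column <- colSums(is.na(rides))

# Blank strings are not NA, so count them separately
blank_per_column <- sapply(rides, function(col) sum(col == "", na.rm = TRUE))

na_per_column
blank_per_column
```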


In [15]:
df_Divvy_nocoor_C_V02 <- df_Divvy_C_V02 %>% 
  select(-c(start_lat,start_lng,end_lat,end_lng))

In [16]:
df_Divvy_nocoor_C_V02[!complete.cases(df_Divvy_nocoor_C_V02),]

started_at,ended_at,start_station_name,end_station_name,ride_length,WEEKDAY,rideable_type_Dummy,member_type_Dummy
<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>,<dbl>


In [17]:
unique(df_Divvy_C_V02[c("end_station_name")])

Unnamed: 0_level_0,end_station_name
Unnamed: 0_level_1,<chr>
1,Damen Ave & Clybourn Ave
2,Daley Center Plaza
3,State St & Harrison St
4,Broadway & Berwyn Ave
5,Wilton Ave & Belmont Ave
8,Broadway & Cornelia Ave
9,Franklin St & Lake St
10,Wells St & Huron St
11,Michigan Ave & 14th St
12,Campbell Ave & North Ave


In [18]:
colnames(df_Divvy_C_V02)

In [19]:
class(df_Divvy_C_V02["started_at"])
class(df_Divvy_C_V02[["started_at"]])
class(df_Divvy_C_V02[c("started_at")])
class(df_Divvy_C_V02[[c("started_at")]])
class(df_Divvy_C_V02$started_at)
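
The five expressions above differ only in whether they keep the data frame wrapper. A short sketch on a toy data frame (the name `df` is hypothetical) shows the distinction:

```r
df <- data.frame(started_at = c("13-06-20 23:24", "26-06-20 07:26"),
                 stringsAsFactors = FALSE)

class(df["started_at"])    # single brackets keep the data frame wrapper
class(df[["started_at"]])  # double brackets extract the column vector
class(df$started_at)       # `$` also extracts the column vector
```

This matters for functions such as `dplyr::count()`, which expect a data frame and therefore only work with the single-bracket form.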

In [20]:
count(unique(df_Divvy_C_V02["start_station_name"]))
#no applicable method for 'count' applied to an object of class "character"
#count(unique(df_Divvy_C_V02[["start_station_name"]]))
count(unique(df_Divvy_C_V02[c("start_station_name")]))
#no applicable method for 'count' applied to an object of class "character"
#count(unique(df_Divvy_C_V02[[c("start_station_name")]]))
#no applicable method for 'count' applied to an object of class "character"
#count(unique(df_Divvy_C_V02$start_station_name))

n
<int>
716


n
<int>
716
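
`count()` needs a data frame, which is why the `[[ ]]` and `$` variants fail. For a plain character vector, a base R sketch like the following does the same job (toy values, hypothetical); note that a blank name, if present, is counted as one more distinct value:

```r
start_station_name <- c("Daley Center Plaza", "Wells St & Huron St", "Daley Center Plaza", "")

# length(unique()) works directly on a character vector
n_stations <- length(unique(start_station_name))

# Excluding blank names gives the number of named stations
n_named <- length(unique(start_station_name[start_station_name != ""]))

n_stations
n_named
```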


In [21]:
count(unique(df_Divvy_C_V02[c("start_lat","start_lng")]))
unique(df_Divvy_C_V02[c("start_lat","start_lng")])

n
<int>
275937


Unnamed: 0_level_0,start_lat,start_lng
Unnamed: 0_level_1,<dbl>,<dbl>
1,41.94018,-87.65304
2,41.87208,-87.62954
3,41.88424,-87.62963
4,41.94553,-87.64644
5,41.92154,-87.65382
8,41.93627,-87.65266
9,41.85761,-87.61941
10,41.89158,-87.64838
11,41.86915,-87.67105
12,41.91994,-87.64883


In [22]:
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))

In [23]:
df_Divvy_C_V03 <- df_Divvy_C_V02
df_Divvy_C_V03[["start_lat"]] <- specify_decimal(df_Divvy_C_V02[["start_lat"]],3)
df_Divvy_C_V03[["start_lng"]] <- specify_decimal(df_Divvy_C_V02[["start_lng"]],3)
df_Divvy_C_V03[["end_lat"]] <- specify_decimal(df_Divvy_C_V02[["end_lat"]],3)
df_Divvy_C_V03[["end_lng"]] <- specify_decimal(df_Divvy_C_V02[["end_lng"]],3)

In [24]:
count(unique(df_Divvy_C_V03[c("start_lat","start_lng")]))
unique(df_Divvy_C_V03[c("start_lat","start_lng")])

n
<int>
2190


Unnamed: 0_level_0,start_lat,start_lng
Unnamed: 0_level_1,<chr>,<chr>
1,41.940,-87.653
2,41.872,-87.630
3,41.884,-87.630
4,41.946,-87.646
5,41.922,-87.654
8,41.936,-87.653
9,41.858,-87.619
10,41.892,-87.648
11,41.869,-87.671
12,41.920,-87.649
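
Because `specify_decimal()` goes through `format()`, the rounded coordinates come back as text, as the `<chr>` column types show. If numeric values are needed later (for plotting or distance calculations), a sketch using `round()` alone keeps them numeric. The three coordinate pairs are copied from the earlier output; 0.001 degrees of latitude corresponds to roughly 111 m on the ground.

```r
start_lat <- c(41.94018, 41.87208, 41.94553)
start_lng <- c(-87.65304, -87.62954, -87.64644)

# round() keeps the values numeric, so they can still be plotted or used in formulas
coords <- data.frame(start_lat = round(start_lat, 3), start_lng = round(start_lng, 3))

# Number of distinct rounded coordinate pairs
n_pairs <- nrow(unique(coords))

str(coords)
n_pairs
```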


In [25]:
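# Note: each call selects the same column twice, so the count below is the number of
# distinct rounded end latitudes and the table lists distinct rounded end longitudes.
# To work with end coordinate pairs, select c("end_lat","end_lng") instead.
count(unique(df_Divvy_C_V03[c("end_lat","end_lat")]))
unique(df_Divvy_C_V03[c("end_lng","end_lng")])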

n
<int>
378


Unnamed: 0_level_0,end_lng,end_lng.1
Unnamed: 0_level_1,<chr>,<chr>
1,-87.678,-87.678
2,-87.630,-87.630
3,-87.628,-87.628
4,-87.660,-87.660
5,-87.653,-87.653
8,-87.646,-87.646
9,-87.635,-87.635
10,-87.634,-87.634
11,-87.624,-87.624
12,-87.690,-87.690


### Drop all NA values

In [26]:
library(geosphere) # provides distGeo(); it is not among the packages loaded at the top

df_an_NA <- drop_na(df_an)

# started_at is stored as "dd-mm-yy HH:MM", so the date format must be given explicitly
df_an_NA$date <- as.Date(df_an_NA$started_at, format = "%d-%m-%y")
df_an_NA$month <- format(df_an_NA$date, "%m")
df_an_NA$day <- format(df_an_NA$date, "%d")
df_an_NA$year <- format(df_an_NA$date, "%Y")
df_an_NA$day_of_week <- format(df_an_NA$date, "%A")

# Geodesic distance between start and end points, converted from metres to km
df_an_NA$ride_distance <- distGeo(cbind(df_an_NA$start_lng, df_an_NA$start_lat), cbind(df_an_NA$end_lng, df_an_NA$end_lat)) / 1000

# ride_length is a character "HH:MM:SS" column; convert it to hours before dividing
df_an_NA$ride_hours <- period_to_seconds(lubridate::hms(df_an_NA$ride_length)) / 3600
# At last the speed in km/h
df_an_NA$ride_speed <- df_an_NA$ride_distance / df_an_NA$ride_hours

# The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative.
# filter() also drops rows whose ride length could not be parsed (NA):
df_an_NA <- df_an_NA %>% filter(!(start_station_name == "HQ QR" | ride_hours < 0))
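
The distance and speed columns in this step rest on two conversions: coordinates to kilometres, and an `HH:MM:SS` string to hours. The sketch below illustrates both in base R on the first ride from the `head()` output. It swaps in the haversine (great-circle) formula with an assumed Earth radius of 6371 km rather than the ellipsoid model behind `geosphere::distGeo()`, so the figures will differ slightly; `haversine_km()` and `hms_to_hours()` are hypothetical helper names.

```r
# Great-circle (haversine) distance in km; inputs are in decimal degrees
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Convert "HH:MM:SS" strings to hours
hms_to_hours <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(1, 1 / 60, 1 / 3600)), numeric(1))
}

# First ride from the head() output: Wilton Ave & Belmont Ave to Damen Ave & Clybourn Ave
distance_km <- haversine_km(-87.65304, 41.94018, -87.67786, 41.93193)
hours <- hms_to_hours("00:12:00")
speed_kmh <- distance_km / hours

distance_km
speed_kmh
```

Note that this is a straight-line distance between stations, so the resulting speed is a lower bound on how fast the rider actually travelled.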


In [None]:
str(df_an_NA)
glimpse(df_an_NA)
summary(df_an_NA)

In [None]:
str(df_an)

In [None]:
glimpse(df_an)

In [None]:
summary(df_an)

### Creating a new data frame for the coordinates

In [None]:
df_coordinate <- df_an %>% 
  select(-c(ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,member_casual,ride_length,WEEKDAY))

In [None]:
glimpse(df_coordinate)

### 4 - Analyse

##### A brief list of steps

* Identify patterns
* Draw conclusions
* Make predictions

In [None]:
colnames(df_an)
df_an_m1 <- df_an %>% 
  select(-c(ride_id,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng))
colnames(df_an_m1)

In [None]:
df_an_m1 %>% lapply(min)
df_an_m1 %>% lapply(max)
df_an_m1 %>% sapply(mode)
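
One caveat on the last line: in R, `mode()` reports how a vector is stored (for example `"numeric"` or `"character"`), not its most frequent value. If the statistical mode is what is wanted, a small helper is needed. The sketch below uses a hypothetical `stat_mode()` function on toy values:

```r
# Most frequent value of a vector (ties resolve to the first value in sorted order)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

weekday <- c(7, 6, 3, 7, 5, 4)
member_casual <- c("casual", "member", "member", "casual", "casual", "casual")

mode(weekday)             # "numeric": the storage mode, not a statistic
stat_mode(weekday)        # "7", returned as a character label
stat_mode(member_casual)  # "casual"
```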


In [None]:
 glimpse(df_an_m1)

### 5 - Share

##### A brief list of steps
* Create effective visuals
* Create a story for data
* Share insights to stakeholders


Now that the analysis is done and has produced some insights into the data, visualizations are needed to share the findings. Moreno has reminded the team that they should be sophisticated and polished in order to communicate effectively with the executive team.

Use the following Case Study Roadmap as a guide:

Guiding questions:

* Were we able to answer the question of **how annual members and casual riders use Cyclistic bikes differently?**
* What story does the data tell?
* How do the findings relate to the original question? 
* Who is the audience? 
* What is the best way to communicate with them?
* Can data visualization help share findings?
* Is the presentation accessible to the audience?


## Visualisation

In [None]:
df_an_m1 %>%
ggplot(mapping = aes(x = ride_length , y = WEEKDAY)) + geom_count()
ggsave("Count.png")
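
Since the guiding question is how members and casual riders differ, a numeric summary per rider type is often a useful companion to a plot. The following is only a sketch on toy data: the `ride_minutes` column and all values are hypothetical and are not results from the Divvy data.

```r
# Toy data standing in for df_an_m1 (values are illustrative only)
rides <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member"),
  ride_minutes  = c(30, 50, 10, 12, 8)
)

# Number of rides and mean ride length for each rider type
n_rides <- aggregate(ride_minutes ~ member_casual, data = rides, FUN = length)
mean_length <- aggregate(ride_minutes ~ member_casual, data = rides, FUN = mean)

n_rides
mean_length
```

On the real data, `ride_length` would first have to be converted from its `HH:MM:SS` text form to a number before it can be averaged.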

## Provide unique coordinates


In [None]:
df_coordinate_unique <- unique(df_coordinate[c('start_lat','start_lng','end_lat','end_lng')])

In [None]:
df_coordinate_unique2 <- unique(df_Divvy_C_V03[c('start_lat','start_lng','end_lat','end_lng')])

In [None]:
write_csv(df_coordinate_unique2,"coordination2.csv")

In [None]:
glimpse(df_coordinate_unique2)

### Scatter plot


In [None]:

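# The rounded coordinates are stored as text, so convert them back to numbers;
# longitude goes on the x axis and latitude on the y axis
df_coordinate_unique2 %>%
ggplot() + geom_point(mapping = aes(x = as.numeric(start_lng), y = as.numeric(start_lat)))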


In [None]:
library(ggmap)
library(RColorBrewer)
library(patchwork)
library(here)

 @Article{,
    author = {David Kahle and Hadley Wickham},
    title = {ggmap: Spatial Visualization with ggplot2},
    journal = {The R Journal},
    year = {2013},
    volume = {5},
    number = {1},
    pages = {144--161},
    url = {https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf},
   }

In [None]:
here()

In [None]:
citation("ggmap")

In [None]:
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())
chi_bb <- c(
  left = -88.656921,
  bottom = 41.408746,
  right = -86.801605,
  top = 42.206142
)

dots <- 55
# Plot only the first `dots` unique coordinate rows
df_dots <- df_coordinate_unique[1:dots, ]

# Note: Stamen tiles have since moved to Stadia Maps; recent ggmap versions may need
# get_stadiamap() with an API key instead of get_stamenmap()
# Constant size, colour and alpha sit outside aes() so they are applied literally
get_stamenmap(
  bbox = chi_bb,
  zoom = 13
) %>%
  ggmap() +
  geom_point(data = df_dots, mapping = aes(x = start_lng, y = start_lat), size = .25, color = "red", alpha = .27) +
  geom_point(data = df_dots, mapping = aes(x = end_lng, y = end_lat), size = .25, color = "blue", alpha = .21) +
  geom_density2d(data = df_dots, mapping = aes(x = start_lng, y = start_lat))

# Save at 4000 x 3000 px (600 dpi); the original png() call had an empty file name, so this name is a placeholder
ggsave("Coordinates_map.png", width = 4000 / 600, height = 3000 / 600, dpi = 600)


### Map visualisation

#### Inspired by [Julen Aranguren](https://www.kaggle.com/julenaranguren/cyclistic-bike-share-a-case-study)

##### Let's now check the coordinate data of the rides to see whether there is any interesting pattern:

##### First we create a table of only the most popular routes (the code below keeps routes with more than 100 rides)


In [None]:
library("tidyverse")
library("ggplot2")
library("lubridate")
library("geosphere")
library("gridExtra") 
library("ggmap")

In [None]:
df_coordination <- read_csv(file = "../input/df-coordination/coordination.csv")

In [None]:
summary(df_coordination)

In [None]:
# List the available columns before picking one
colnames(df_coordination)

In [None]:
#First we create a table only for the most popular routes (more than 100 rides, matching the filter below)

coordinates_table <- df_an %>% 
filter(start_lng != end_lng & start_lat != end_lat) %>%
group_by(start_lng, start_lat, end_lng, end_lat, member_casual, rideable_type) %>%
summarise(total = n(),.groups="drop") %>%
filter(total > 100)
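
One detail of the first `filter()` is worth checking. Combining the two comparisons with `&` keeps a trip only when both its longitude and its latitude change, so a trip that runs due north or due east is dropped along with the round trips. Whether that matters for this data is not something the notebook shows; the sketch below just demonstrates the difference on three toy trips (hypothetical coordinates):

```r
trips <- data.frame(
  start_lng = c(-87.653, -87.653, -87.630),
  start_lat = c(41.940, 41.940, 41.872),
  end_lng   = c(-87.653, -87.653, -87.630),
  end_lat   = c(41.940, 41.950, 41.884)
)
# Row 1 is a round trip; rows 2 and 3 move north along the same longitude

moved_and <- with(trips, start_lng != end_lng & start_lat != end_lat)
moved_or  <- with(trips, start_lng != end_lng | start_lat != end_lat)

sum(moved_and)  # 0: trips sharing one coordinate are dropped too
sum(moved_or)   # 2: only the round trip is dropped
```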

In [None]:
#Then we create two sub tables for each user type
casual <- coordinates_table %>% filter(member_casual == "casual")
member <- coordinates_table %>% filter(member_casual == "member")

In [None]:
# Lets store bounding box coordinates for ggmap:
chi_bb <- c(
  left = -87.700424,
  bottom = 41.790769,
  right = -87.554855,
  top = 41.990119
)
# Here we store the stamen map of Chicago
chicago_stamen <- get_stamenmap(
  bbox = chi_bb,
  zoom = 15,
  maptype = "toner"
)

#Then we plot the data on the map
ggmap(chicago_stamen,darken = c(0.8, "white")) +
   geom_curve(casual, mapping = aes(x = start_lng, y = start_lat, xend = end_lng, yend = end_lat, alpha= total, color=rideable_type), size = 0.5, curvature = .2,arrow = arrow(length=unit(0.2,"cm"), ends="first", type = "closed")) +
    coord_cartesian() +
    labs(title = "Most popular routes by casual users",x=NULL,y=NULL, color="User type", caption = "Data by Motivate International Inc") +
    theme(legend.position="none")
ggsave("Casual_popular_route.png", width = 20, height = 20, units = "cm")

ggmap(chicago_stamen,darken = c(0.8, "white")) +
    geom_curve(member, mapping = aes(x = start_lng, y = start_lat, xend = end_lng, yend = end_lat, alpha= total, color=rideable_type), size = 0.5, curvature = .2,arrow = arrow(length=unit(0.2,"cm"), ends="first", type = "closed")) +  
    coord_cartesian() +
    labs(title = "Most popular routes by annual members",x=NULL,y=NULL, caption = "Data by Motivate International Inc") +
    theme(legend.position="none")
ggsave("Member_popular_route.png", width = 20, height = 20, units = "cm")


### 6 - Act

##### A brief list of steps
* Give recommendations based on insights
* Solve problems
* Create something new



Now that the visualizations are finished, we need to act on the findings. Prepare the deliverables Moreno asked for, including:
* The top three recommendations based on the analysis.

Use the following Case Study Roadmap as a guide:
#### Guiding questions 
* What is the final conclusion based on the analysis?
* How could the marketing team and the business apply the insights? 
* What next steps would be taken based on the findings?
* Is there additional data that could be used to expand on the findings?