# Google Data Analytics Capstone Project
## Cyclistic Dataset

**Case Study: How Does a Bike-Share Navigate Speedy Success?**

**Introduction**

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

**Scenario**

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

**Characters and teams**



* Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

* Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

* Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

* Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

**About the company**

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

This project follows the six stages of data analysis:

* **Ask**: Identify the key business questions and the stakeholders.
* **Prepare**: Collect the data, identify how it is organized, and determine its credibility.
* **Process**: Select the tool for data cleaning, check for errors, and document the cleaning process.
* **Analyze**: Organize and format the data, aggregate it so that it is useful, perform calculations, and identify trends and relationships.
* **Share**: Use design thinking principles and a data-driven storytelling approach to present the findings with effective visualizations, and ensure the analysis answers the business task.
* **Act**: Share the final conclusions and recommendations.

### STAGE 1

**ASK**

Task:
Identify the rider types, analyze how each of them uses the service, and prepare a marketing strategy to convert casual riders into annual members.

Stakeholders:

* Lily Moreno: Director of marketing and manager.
* Cyclistic executive team: The team that will decide whether to approve the recommended marketing program.
* Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data.

### STAGE 2

**Prepare**

For this project, I will use Cyclistic’s public historical trip data to analyze and identify trends. The data has been made available by [Motivate International Inc.](https://divvy-tripdata.s3.amazonaws.com/index.html) under this [license](https://ride.divvybikes.com/data-license-agreement).

I downloaded the zip files for January 2023 through December 2023; the same dataset is available in the input section of this notebook.

**Data overview**

* **ride_id**: It is a distinct identifier assigned to each individual ride.
* **rideable_type**: This column indicates the type of bikes used for each ride. 
* **started_at**: This column denotes the timestamp when a particular ride began.
* **ended_at**: This column represents the timestamp when a specific ride concluded.
* **start_station_name**: This column contains the name of the station where the bike ride originated.
* **start_station_id**: This column represents the unique identifier for the station where the bike ride originated.
* **end_station_name**: This column contains the name of the station where the bike ride concluded.
* **end_station_id**: This column represents the unique identifier for the station where the bike ride concluded. 
* **start_lat**: This column denotes the latitude coordinate of the starting point of the bike ride.
* **start_lng**: This column denotes the longitude coordinate of the starting point of the bike ride.
* **end_lat**: This column denotes the latitude coordinate of the ending point of the bike ride.
* **end_lng**: This column denotes the longitude coordinate of the ending point of the bike ride.
* **member_casual**: This column indicates whether the rider is a member or a casual user.

I used Microsoft Excel to get a first look at the data. The dataset is split into one file per month.

### STAGE 3 & STAGE 4

**Process** **& Analyze**

I will use a Kaggle notebook to present the analysis and to document the EDA, visualizations, and insights. Python is the data processing language throughout this notebook, starting with importing the necessary libraries.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

pd.options.mode.chained_assignment = None  # Suppress warnings
import seaborn as sns  # visualization Library
import matplotlib.pyplot as plt  # visualization Library
import plotly.graph_objs as go   # visualization Library
import seaborn.objects as so
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# import the DataFrames
df1 = pd.read_csv('../input/cyclistic-dataset/202301-divvy-tripdata.csv')
df2 = pd.read_csv('../input/cyclistic-dataset/202302-divvy-tripdata.csv')
df3 = pd.read_csv('../input/cyclistic-dataset/202303-divvy-tripdata.csv')
df4 = pd.read_csv('../input/cyclistic-dataset/202304-divvy-tripdata.csv')
df5 = pd.read_csv('../input/cyclistic-dataset/202305-divvy-tripdata.csv')
df6 = pd.read_csv('../input/cyclistic-dataset/202306-divvy-tripdata.csv')
df7 = pd.read_csv('../input/cyclistic-dataset/202307-divvy-tripdata.csv')
df8 = pd.read_csv('../input/cyclistic-dataset/202308-divvy-tripdata.csv')
df9 = pd.read_csv('../input/cyclistic-dataset/202309-divvy-tripdata.csv')
df10 = pd.read_csv('../input/cyclistic-dataset/202310-divvy-tripdata.csv')
df11 = pd.read_csv('../input/cyclistic-dataset/202311-divvy-tripdata.csv')
df12 = pd.read_csv('../input/cyclistic-dataset/202312-divvy-tripdata.csv')

In [None]:
# Create a list of DataFrames
dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9 , df10, df11, df12]

# Concatenate the DataFrames along the rows axis
df = pd.concat(dfs, ignore_index=True)
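
The twelve `read_csv` calls and the concatenation above could also be written as one loop over the files. The following is a minimal sketch, not the method used in this notebook: the helper name `load_monthly_trips` and the default file pattern are assumptions made for illustration, based on the file names shown above.

```python
from pathlib import Path

import pandas as pd


def load_monthly_trips(data_dir, pattern="*-divvy-tripdata.csv"):
    """Read every monthly CSV in data_dir and stack them into one DataFrame.

    Hypothetical helper: the name and the default pattern are illustrative.
    """
    # File names start with YYYYMM, so sorting the paths gives calendar order.
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir}")
    # Read each month and concatenate along the rows with a fresh index.
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Under these assumptions, `df = load_monthly_trips('../input/cyclistic-dataset')` would produce the same combined DataFrame, and adding another month would not require a new line of code.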


In [None]:
# an overview over our data
df.info()

In [None]:
#drop the NAN
df_1 = df.dropna()

In [None]:
df_1.info()
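
Because `dropna()` removes every row that has at least one missing value, it can help to check how much data is lost before dropping. The sketch below uses a small made-up DataFrame (the values are not from the Cyclistic files) to show one way of counting missing values per column and the share of rows that would be removed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trip data; the values are made up for illustration.
toy = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3", "r4"],
    "start_station_name": ["A", np.nan, "B", "C"],
    "end_station_name": ["B", "C", np.nan, "A"],
    "member_casual": ["member", "casual", "casual", "member"],
})

# Missing values per column, before anything is dropped.
missing_per_column = toy.isna().sum()

# Share of rows that dropna() would remove (rows with at least one NaN).
rows_dropped_pct = toy.isna().any(axis=1).mean() * 100

print(missing_per_column)
print(f"dropna() would remove {rows_dropped_pct:.1f}% of rows")
```

Running the same two expressions on `df` would show which columns drive the difference between the two `info()` outputs.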

In [None]:
#convert the data types to datetime
df_1['started_at'] = pd.to_datetime(df_1['started_at'],format = '%d-%m-%Y %H:%M')
df_1['ended_at'] = pd.to_datetime(df_1['ended_at'],format = '%d-%m-%Y %H:%M')
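
The `format` string has to match the way the timestamps are stored in the CSV files, and that can change if a file was opened and saved again in a spreadsheet tool. As a hedged sketch on made-up values, passing `errors="coerce"` makes it possible to count entries that do not match the expected layout instead of stopping at the first one.

```python
import pandas as pd

# Made-up timestamps in the day-month-year layout used by the format string above.
raw = pd.Series(["01-01-2023 08:15", "15-06-2023 17:40", "not a date"])

# errors="coerce" turns entries that do not match the format into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M", errors="coerce")

# Count the entries that could not be parsed.
n_failed = parsed.isna().sum()
print(parsed)
print(f"{n_failed} value(s) could not be parsed")
```

A non-zero count on the real columns would suggest that the format string and the file contents disagree.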

In [None]:
# Add a new column with the ride length in minutes
df_1['ride_length'] = (df_1['ended_at'] - df_1['started_at']).dt.total_seconds()/60
# Sort the data by ride length to inspect the longest rides
df_1.sort_values(by='ride_length', ascending=False).head(2)
# Keep only rides with a positive ride length (zero and negative durations are dropped)
dff = df_1[(df_1['ride_length'] > 0)].copy()
dff.head(3)

In [None]:
# Create columns with days, months and hours
dff['day_of_week'] = dff['started_at'].dt.day_name()
dff['months'] = dff['started_at'].dt.strftime('%B')
dff['hours'] = dff['started_at'].dt.strftime('%I:%M %p')

In [None]:
# Analyse the distribution of rides between members and casual riders
mem_status = dff['member_casual'].value_counts()
fig, ax = plt.subplots(figsize=(9, 2), layout='constrained')
fig.suptitle('Fig 1.1', fontsize='small')
ax.set_title('Distribution of Riders', loc='left', fontstyle='oblique', fontsize='medium')
# Each row is one ride, so the bars count rides rather than unique users
mem_status.plot(kind='barh', color=['#b04238', '#df8879'], ax=ax)
ax.set_xlabel('Number of Rides', loc='center', fontstyle='oblique', fontsize='medium')
ax.set_ylabel('Membership Status', loc='center', fontstyle='oblique', fontsize='medium')

In [None]:
# Top 10 Start Stations
start_station = dff['start_station_name'].value_counts().head(10)

# Create subplots with 1 row and 2 columns
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6), constrained_layout=True, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None)
fig.suptitle('Top 10 Stations with Highest Frequency',fontsize='small' )

# Plot the top 10 start stations
sns.countplot(y='start_station_name', hue='member_casual', data=dff, order=start_station.index, ax=axes[0], palette={'member': '#b04238', 'casual': '#df8879'})
axes[0].set_title('Start Stations',fontsize='medium')
axes[0].set_xlabel('Frequency',fontsize='medium')
axes[0].set_ylabel('Station Name',fontsize='small')

# Top 10 End Stations
end_station = dff['end_station_name'].value_counts().head(10)

# Plot the top 10 end stations
sns.countplot(y='end_station_name', hue='member_casual', data=dff, order=end_station.index, ax=axes[1], palette={'member': '#b04238', 'casual': '#df8879'})
axes[1].set_title('End Stations',fontsize='medium' )
axes[1].set_xlabel('Frequency',fontsize='medium')
axes[1].set_ylabel('Station Name',fontsize='small')

plt.show()


In [None]:
# calculate summary statistics
summary_statistics = dff.groupby('member_casual').agg(
    avg_time=pd.NamedAgg(column='ride_length', aggfunc='mean'),
    median_time=pd.NamedAgg(column='ride_length', aggfunc='median'),
    max_time=pd.NamedAgg(column='ride_length', aggfunc='max'),
    min_time=pd.NamedAgg(column='ride_length', aggfunc='min')
).reset_index()

print(summary_statistics)

In [None]:
# Set the order of weekdays
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

dff['day_of_week'] = pd.Categorical(dff.day_of_week, categories=weekdays)
dff = dff.sort_values('day_of_week')

# Set the order of months
months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

dff['months'] = pd.Categorical(dff.months, categories=months_order)
dff = dff.sort_values('months')

# Distribution of rides by month; a dict palette keeps the colors tied to each rider type
sns.histplot(dff, x="months", hue="member_casual", multiple="dodge", shrink=.8, palette={'member': '#b04238', 'casual': '#df8879'}, edgecolor='none')
plt.title('Distribution of rides by month')
plt.xlabel('Months')
plt.ylabel('Total number of rides')
plt.tight_layout()

# Distribution of rides by day of the week
fig, ax = plt.subplots(figsize=(7, 3))
sns.histplot(dff, x="day_of_week", hue="member_casual", multiple="dodge", shrink=.8, palette={'member': '#b04238', 'casual': '#df8879'}, edgecolor='none')
plt.title('Distribution of rides by day of the week')
plt.xlabel('Days of week')
plt.ylabel('Total number of rides')
plt.tight_layout()


In [None]:
fig, ax = plt.subplots(figsize=(5, 3))

# Label each ride as falling on a weekend or a weekday
dff['day_category'] = np.where(dff['day_of_week'].isin(['Saturday', 'Sunday']), 'weekend', 'weekday')

dff['day_category'].value_counts()
# Count rides per day category, split by rider type
sns.countplot(dff, x="day_category", hue="member_casual", saturation=1, palette={'member': '#b04238', 'casual': '#df8879'}, ax=ax)

In [None]:
fig,ax = plt.subplots(figsize=(5,3))
sns.countplot(dff, x="day_category", hue="member_casual", saturation=1, palette=['#df8879', '#b04238'])
fig, ax = plt.subplots(figsize=(5, 4), layout='constrained')
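# Total ride time in minutes per rider type; groupby orders the index as casual, member,
# so the colors are listed in that order to match the other charts
total = dff.groupby('member_casual')['ride_length'].sum()
total.plot(kind='bar', color=['#df8879', '#b04238'], ax=ax)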

plt.ticklabel_format(style='plain', axis='y')
plt.xticks(rotation=0)

In [None]:
dff['hour'] = dff['started_at'].dt.hour

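# Draw on a dedicated Axes so the tick settings below apply to this plot
fig, ax = plt.subplots(figsize=(9, 4))
# discrete=True centers one bar on each integer hour
sns.histplot(x='hour', data=dff, discrete=True, ax=ax)

ax.set_xticks(range(24))
ax.set_xticklabels([f'{hour}' for hour in range(24)], rotation=45)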

plt.xlabel('Hour of the Day')
plt.ylabel('total no of rides')
plt.title('total rides per hour')
plt.show()
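
The chart above shows all rides together. To compare members and casual riders hour by hour, the counts could be split by rider type. The following is a minimal sketch on made-up rows (not the real trip data) that uses `pd.crosstab` for this.

```python
import pandas as pd

# Toy rides; the timestamps and rider types are made up for illustration.
toy = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2023-07-01 08:05", "2023-07-01 08:40", "2023-07-01 17:10",
        "2023-07-01 17:30", "2023-07-02 17:55",
    ]),
    "member_casual": ["member", "member", "casual", "member", "casual"],
})

# Extract the start hour, then count rides per hour and rider type.
toy["hour"] = toy["started_at"].dt.hour
hourly = pd.crosstab(toy["hour"], toy["member_casual"])
print(hourly)
```

Applied to `dff`, a table like `hourly` could be plotted with `hourly.plot(kind='bar')` to show the two rider types side by side for each hour.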

In [None]:
# Total ride time across all rides (minutes)
overall_time_spent = dff['ride_length'].sum()

# Share of the total ride time by rider type (percent)
total_time_by_type = dff.groupby('member_casual')['ride_length'].sum()
percentage_of_overall = (total_time_by_type / overall_time_spent) * 100

# Total ride time by day of the week
total_time_by_day = dff.groupby('day_of_week')['ride_length'].sum()

# Median and mean ride length by rider type, then across all rides
median_time_spent = dff.groupby('member_casual')['ride_length'].median()
avg = dff.groupby('member_casual')['ride_length'].mean()
avg_ = dff['ride_length'].mean()
median_ = dff['ride_length'].median()

# Mean ride length by rider type and day of the week
table = dff.pivot_table(index='member_casual', columns='day_of_week', values='ride_length', aggfunc='mean')
table

### STAGE 5

**Share**

* Members account for a higher share of rides than casual riders.

* Average ride durations vary across the days of the week for both casual riders and members. Casual riders have longer average ride durations than members.

* The top 10 stations are broadly similar for casual riders and members.
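
As a rough illustration of how the first two findings could be checked, the sketch below computes the share of rides and the mean ride length per rider type. The toy values are made up and do not reflect the real dataset; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy rides; ride_length is in minutes and the values are made up.
toy = pd.DataFrame({
    "member_casual": ["member", "member", "member", "casual", "casual"],
    "ride_length": [10.0, 12.0, 8.0, 25.0, 35.0],
})

# Share of rides per rider type, in percent.
ride_share_pct = toy["member_casual"].value_counts(normalize=True) * 100

# Mean ride length per rider type, in minutes.
mean_ride_length = toy.groupby("member_casual")["ride_length"].mean()

print(ride_share_pct)
print(mean_ride_length)
```

Running the same two expressions on `dff` would give the percentages and averages behind the statements above.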


### STAGE 6

**ACT**

Three recommendations based on the analysis:

* Implement targeted marketing campaigns around Top Stations to further increase engagement. Implement station-specific enhancements during the evenings, such as additional amenities, events, or promotions.

* Create targeted marketing messages that address the longer ride durations experienced by casual riders and showcase how an annual membership can offer cost-effective and convenient solutions for their longer rides.

* Highlight exclusive benefits such as member-only events, priority access, or special discounts to encourage casual riders to convert to annual memberships, and offer loyalty programs or rewards for members who use the service frequently on weekdays.
