# Introduction

As a "junior data analyst" working in Cyclistics marking analytics team, I am tasked to understand how casual riders and annual members use Cyclistic bikes differently. I will approach this task through the Ask, Prepare, Process, Analyze, Share, and Act phases.

# Ask

The ask phase requires us to answer related questions that will aid the business task, which in this case is to determine the differences in behaviours between casual members and annual members so that design marketing strategies can be implemented to aid in the conversion of casual members into annual members. This is because, the marketing director and the marketing analysis team have came to the conclusion that an annual membership is much more profitable to the company in comparison to a casual membership.

The company has provided the data they collected for the past 12 months to help with this business task. I will be using the language Python to prepare, process, and analyse this data. I will then use Power BI to share the results to the stakeholders.

# Prepare
In the prepare phase, we have to first determine if the data is usable by checking its credibility. I did this by going through the "ROCCC" check (Reliable, Original, Comprehensive, Current, and Cited). After verifying that the 12 months of data provided by Motivate International Inc is credible, I then move on to setting up my studio to process the data. As mentioned before, I will be using Python for this phase.

I observed that the data is stored in a ".csv" format (comma seperated values) and that each data set has 13 variables.

# Process
In this phase, I clean the data by removing all null values, renaming or recoding variables to get more useful data, and format variables to have symmetrical and readable data types.

# Business Task:
Understanding the nature of bike usage between annual members & casual members and to introduce strategy to increase the number of annual memberships.

# Data Preparation
### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Import DataSet

In [2]:
jan = pd.read_csv('Cyclistic_data/202101-divvy-tripdata.csv')
feb = pd.read_csv('Cyclistic_data/202102-divvy-tripdata.csv')
mar = pd.read_csv('Cyclistic_data/202103-divvy-tripdata.csv')
apr = pd.read_csv('Cyclistic_data/202104-divvy-tripdata.csv')
may = pd.read_csv('Cyclistic_data/202105-divvy-tripdata.csv')
jun = pd.read_csv('Cyclistic_data/202106-divvy-tripdata.csv')
jul = pd.read_csv('Cyclistic_data/202107-divvy-tripdata.csv')
aug = pd.read_csv('Cyclistic_data/202108-divvy-tripdata.csv')
sep = pd.read_csv('Cyclistic_data/202109-divvy-tripdata.csv')
oct = pd.read_csv('Cyclistic_data/202110-divvy-tripdata.csv')
nov = pd.read_csv('Cyclistic_data/202111-divvy-tripdata.csv')
dec = pd.read_csv('Cyclistic_data/202112-divvy-tripdata.csv')

In [3]:
jan.head(2)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.900341,-87.696743,41.89,-87.72,member
1,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.900333,-87.696707,41.9,-87.69,member


In [4]:
dec.head(2)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,46F8167220E4431F,electric_bike,2021-12-07 15:06:07,2021-12-07 15:13:42,Laflin St & Cullerton St,13307,Morgan St & Polk St,TA1307000130,41.854833,-87.66366,41.871969,-87.650965,member
1,73A77762838B32FD,electric_bike,2021-12-11 03:43:29,2021-12-11 04:10:23,LaSalle Dr & Huron St,KP1705001026,Clarendon Ave & Leland Ave,TA1307000119,41.894405,-87.632331,41.967968,-87.650001,casual


In [5]:
jan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96834 entries, 0 to 96833
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ride_id             96834 non-null  object 
 1   rideable_type       96834 non-null  object 
 2   started_at          96834 non-null  object 
 3   ended_at            96834 non-null  object 
 4   start_station_name  88209 non-null  object 
 5   start_station_id    88209 non-null  object 
 6   end_station_name    86557 non-null  object 
 7   end_station_id      86557 non-null  object 
 8   start_lat           96834 non-null  float64
 9   start_lng           96834 non-null  float64
 10  end_lat             96731 non-null  float64
 11  end_lng             96731 non-null  float64
 12  member_casual       96834 non-null  object 
dtypes: float64(4), object(9)
memory usage: 9.6+ MB


In [6]:
df = pd.concat([jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec])

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5595063 entries, 0 to 247539
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 597.6+ MB


### The Data type of 'start_station_id' and 'end_station_id' is not 'object'. we need to convert it to object

In [8]:
df['start_station_id'] = df['start_station_id'].apply(str)
df['end_station_id'] = df['start_station_id'].apply(str)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5595063 entries, 0 to 247539
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 597.6+ MB


In [10]:
bike_share = pd.concat([df])

In [11]:
bike_share.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,17660,41.900341,-87.696743,41.89,-87.72,member
1,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,17660,41.900333,-87.696707,41.9,-87.69,member
2,EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,17660,41.900313,-87.696643,41.9,-87.7,member
3,4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,17660,41.900399,-87.696662,41.92,-87.69,member
4,BE5E8EB4E7263A0B,electric_bike,2021-01-23 02:24:02,2021-01-23 02:24:45,California Ave & Cortez St,17660,,17660,41.900326,-87.696697,41.9,-87.7,casual


In [12]:
bike_share.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5595063 entries, 0 to 247539
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 597.6+ MB


# Data Processing
### We will look for the null values and duplicates in the dataset to remove them.

In [13]:
bike_share.isnull().sum()

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    690809
start_station_id           0
end_station_name      739170
end_station_id             0
start_lat                  0
start_lng                  0
end_lat                 4771
end_lng                 4771
member_casual              0
dtype: int64

In [14]:
bike_share.shape

(5595063, 13)

In [15]:
(bike_share.isna().sum()/5595063) * 100

ride_id                0.000000
rideable_type          0.000000
started_at             0.000000
ended_at               0.000000
start_station_name    12.346760
start_station_id       0.000000
end_station_name      13.211111
end_station_id         0.000000
start_lat              0.000000
start_lng              0.000000
end_lat                0.085272
end_lng                0.085272
member_casual          0.000000
dtype: float64

### The missing values are less than 15%. We can remove them and continue with our analysis.

In [None]:
bike_share.dropna(axis=0, inplace=True)

In [None]:
bike_share.isnull().sum()

In [None]:
bike_share[bike_share.duplicated()]

### Let's convert columns 'started_at' and 'ended_at' to 'datetime' Datatype.

In [None]:
bike_share['started_at'] = pd.to_datetime(bike_share['started_at'], dayfirst=True)

In [None]:
bike_share['ended_at'] = pd.to_datetime(bike_share['ended_at'], dayfirst=True)

In [None]:
bike_share.info()

### Now, we will start creating new columns for 'Hour', 'Day' and 'Month'.

In [None]:
bike_share['Hour'] = bike_share.started_at.apply(lambda x: x.hour)
bike_share['Day'] = bike_share.started_at.apply(lambda x: x.day_name())
bike_share['Month'] = bike_share.started_at.apply(lambda x: x.month)

In [None]:
bike_share.tail()

### Let's calculate 'Total_Ride_Time' in minutes.

In [None]:
import datetime as datetime
from datetime import timedelta

In [None]:
# Total_Ride_Time in minutes
bike_share['Total_Ride_Time'] = (bike_share['ended_at'] - bike_share['started_at'])

In [None]:
bike_share['Total_Ride_Time'] = (bike_share['Total_Ride_Time'])/timedelta(minutes=1)

In [None]:
bike_share['Total_Ride_Time'] = bike_share['Total_Ride_Time'].round(decimals = 1)

In [None]:
bike_share.head()

### Let's calculate the ride distance in Kms from given coordinates.

In [None]:
bike_share['Lat'] = (bike_share['end_lat'] - bike_share['start_lat'])
bike_share['Lng'] = (bike_share['end_lng'] - bike_share['start_lng'])

In [None]:
import math

In [None]:
bike_share['Distance'] = np.sqrt((bike_share['Lat']** 2) + (bike_share['Lng'] ** 2))

In [None]:
bike_share['Distance'] = bike_share['Distance'] * 111

In [None]:
bike_share['Distance'].head()

In [None]:
bike_share.head()

In [None]:
month = {1:'January', 2:'February', 3:'March', 4:'April', 5:'May', 6:'June', 7:'July', 8:'August', 9:'September', 10:'October', 11:'November', 12:'December'}

In [None]:
bike_share['Month_Name'] = bike_share['Month'].map(month)

# Share 
# Data Analysis and Visualization:
In the share phase, data analysis and visualization are used to provide insightful and actionable suggestions or communicate the story within the data to stakeholders.n this case study, I created detailed visualizations that will help the stakeholders better understand the suggestions I am trying to present.Each visualization is accompanied by a concise conclusion, enabling stakeholders to grasp the key findings. By combining thorough analysis with compelling visuals, the data's story is effectively shared to inform decision-making.

In [None]:
bike_share.head(3)

In [None]:
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import scipy.stats as stats
from IPython.display import display, HTML

In [None]:
# Set the figure size and style
plt.figure(figsize=(8, 6))
sns.set_style("whitegrid")

# Create the bar plot
sns.barplot(x='member_casual', y='Distance', data=bike_share, palette='viridis')

# Set the title and axis labels
plt.title("Distance by Member Type", fontsize=16)
plt.xlabel("Member Type", fontsize=12)
plt.ylabel("Distance", fontsize=12)

# Customize the tick labels
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Remove the spines
sns.despine()

# Show the plot
plt.show()

In [None]:

# Set the figure size and style
plt.figure(figsize=(8, 6))
sns.set_style("whitegrid")

# Create the bar plot
sns.barplot(x='member_casual', y='Total_Ride_Time', data=bike_share, palette='viridis')

# Set the title and axis labels
plt.title("Distance by Member Type", fontsize=16)
plt.xlabel("Member Type", fontsize=12)
plt.ylabel("Total_Ride_Time", fontsize=12)

# Customize the tick labels
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Remove the spines
sns.despine()

# Show the plot
plt.show()

* In the first plot, we observe that the casual riders have travelled longer distance than the member riders. However the second plot for 'Total Ride Time' shows casual bikers have more ride time than the member bikers.

* We can conclude from the above observations that member riders have short journeys compared to casual ones. Their travel frequency is higher but travel time is lower.

**conclusion:**

* **Longer Ride Times for Casual Riders**: The bar plot reveals that casual riders tend to have longer total ride times compared to members. This suggests that casual riders may use the bike-sharing service for leisure activities or longer recreational rides, whereas members may primarily use it for shorter commuting purposes. The difference in ride times highlights the distinct usage patterns and preferences between these two groups.

* **Potential Marketing Opportunities**: The plot indicates an opportunity for targeted marketing efforts. The longer ride times of casual riders could be leveraged to promote special offers or tailored services for this group. By focusing on attracting and engaging casual riders, bike-sharing companies can potentially increase overall ride time and revenue. Understanding the distinct needs and preferences of each member type can help optimize marketing strategies and enhance the overall user experience.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.set(style='whitegrid', font_scale=1.2)

ax = sns.countplot(x='Hour', hue='member_casual', data=bike_share, palette='Set1')

ax.set_title('Hourly Usage by Member Type', fontsize=16)
ax.set_xlabel('Hour', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.legend(title='Member Type', title_fontsize=12, fontsize=12)
ax.tick_params(axis='both', labelsize=12)

plt.tight_layout()
plt.show()

**conclusion:**
* **Peak Hour Disparity**: There is a noticeable difference in peak usage hours between members and casual riders. Members tend to have a more concentrated peak during commuting hours, particularly in the morning around 8-9 AM and in the evening around 5-6 PM. On the other hand, casual riders show a more evenly distributed pattern throughout the day, with relatively high usage during mid-morning and afternoon hours.
* **Member Preference for Commuting**: The plot suggests that members predominantly use the bike-sharing service for their daily commute, as indicated by the concentrated peaks during typical commuting hours. This insight indicates that the service is primarily catering to the transportation needs of regular commuters who rely on the bikes as a convenient and efficient mode of transport for their work or study commutes.

In [None]:
plt.figure(figsize=(10, 6))
sns.set_style("whitegrid")

# Define the order of the days of the week
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Create the countplot with the specified order and color palette
ax = sns.countplot(x='Day', hue='member_casual', data=bike_share, palette='Paired', order=order)

# Set the title and axis labels
plt.title("Daily Usage by Member Type", fontsize=16)
plt.xlabel("Day", fontsize=12)
plt.ylabel("Count", fontsize=12)

# Customize the tick labels
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Customize the legend
ax.legend(title='Member Type', title_fontsize=12, fontsize=10)

# Remove the spines
sns.despine()

# Show the plot
plt.tight_layout()
plt.show()

* Casual riders are enthuasiatic on weekends as they have the highest bike usage on Saturday and Sunday.
* Member riders have consistent use of bikes on weekdays.

**conclusion:**
* **Weekend Usage**: The countplot reveals that both members and casual riders show higher usage on weekends (Saturday and Sunday) compared to weekdays. This indicates that weekends are popular for recreational or leisurely bike rides, attracting riders from both member types.

* **Weekday Usage Patterns**: On weekdays, members tend to have higher usage compared to casual riders. This suggests that members rely on the bike-sharing service for their daily commuting needs, while casual riders may have other transportation options or use the service less frequently during weekdays.

* **Member Dominance**: The countplot also shows that members consistently outnumber casual riders on all days of the week. This highlights the significant role of members in driving the overall usage of the bike-sharing service. Understanding the preferences and usage patterns of members can help tailor services and promotions to cater to their needs and retain their loyalty.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")

# Define the order of the months
order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Create the countplot with the specified order and color palette
ax = sns.countplot(x='Month_Name', hue='member_casual', data=bike_share, palette='ocean', order=order)

# Set the title and axis labels
plt.title("Monthly Usage by Member Type", fontsize=16)
plt.xlabel("Month", fontsize=12)
plt.ylabel("Count", fontsize=12)

# Customize the tick labels
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Customize the legend
ax.legend(title='Member Type', title_fontsize=12, fontsize=10)

# Remove the spines
sns.despine()

# Show the plot
plt.tight_layout()
plt.show()

The summer months show the highest usage of bikes. This can be a starting point for preparing business strategy.

**conclusion:**

**Potential Marketing Opportunities**: The countplot can help identify periods of high and low demand throughout the year. This information can be utilized to plan targeted marketing campaigns and promotions to attract more casual riders during the off-peak months. By incentivizing casual riders with discounts, special offers, or seasonal promotions, bike-sharing companies can potentially increase overall usage and revenue during the less busy periods.

## INSIGHTS:
The insights extracted from the data analysis and visualizations offer valuable knowledge and understanding of the underlying patterns, trends, and relationships within the data, empowering stakeholders to make informed decisions and take appropriate actions.

1. The bike usage trend **highlightes the purpose** for which the bikes are used.

1. **Member riders** have annual memberships because their f**requency of bike usage is higher.** They use it for **daily commute of shorter distance.**

1. **Casual riders** more often use bikes **for leisure or personal activities. Their usage is higher on weekends.**

1. **Summer months are more popular** and business can focus on this period to maximise its profit.

1. **Special 'Summer Membership' can be introduced** specifically for casual riders who're hesitant to go for annual membership.

1. **Coupons, Discounts schemes can be introduced** for casual riders to **increase their bike usage on weekdays or small distance journeys.**

1. It is important for business **to develop the idea of using bikes regulary than just for leisure activities in casual riders**.

# ACT:

1. **Introduce a "Summer Membership"**: To attract more casual riders during the peak summer months, Cyclistics can offer a special "Summer Membership" tailored for their needs. This membership can provide discounted rates, flexible usage options, and additional benefits to encourage casual riders to use the bikes regularly for leisure or personal activities.

1. **Implement Promotional Campaigns**: To increase bike usage by casual riders on weekdays and for shorter distance journeys, Cyclistics can launch targeted promotional campaigns. Coupons, discounts, or incentives can be offered specifically for casual riders during these periods, motivating them to utilize the bikes for daily commuting or other purposes.

1. **Enhance Member Engagement**: To retain existing annual members and encourage casual riders to consider annual memberships, Cyclistics should focus on enhancing member engagement. This can be achieved by introducing loyalty programs, personalized offers, and exclusive benefits for annual members, highlighting the convenience, cost-effectiveness, and value of having a long-term membership.

1. **Shift Perception of Casual Riding**: Cyclistics should work towards changing the perception of casual riding from solely leisure activities to regular use for daily commuting or practical purposes. This can be done through targeted marketing campaigns that emphasize the benefits of using bikes regularly, such as improved fitness, cost savings, reduced carbon footprint, and convenience.

By implementing these actions, Cyclistics can optimize its services, attract more casual riders, increase member retention, and ultimately drive business growth by aligning offerings with the identified usage patterns and preferences of different rider segments.