# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Mansi Jangir**


# **Project Summary -**

This project focuses on the exploratory analysis of the Ford GoBike System dataset, which contains information about bike-share trips made in the San Francisco Bay Area. The main objective of the analysis is to uncover insights about customer behavior, trip patterns, and how factors such as user type, gender, and time of day affect bike usage.

The dataset includes variables such as trip duration, start and end times, start and end stations, user type (subscriber or customer), gender, and birth year. The project begins with an initial assessment of the dataset’s structure, followed by data cleaning steps to ensure accuracy. During cleaning, missing values are handled, data types are corrected (e.g., dates are converted to datetime objects), and new features such as trip duration in minutes, day of the week, and hour of the day are created to aid deeper analysis.

The exploration reveals several key findings. First, subscribers dominate the user base, accounting for the majority of the trips compared to customers (casual users). In terms of gender, male riders make more trips than female riders and riders of unknown gender. Age distribution analysis shows that most riders are between 25 and 40 years old, indicating that the service is popular among working-age adults.

Time-based analysis highlights clear usage patterns: bike usage peaks during the morning (around 8 AM) and evening (around 5 PM), which aligns with typical commuting hours. This trend is more pronounced among subscribers, reinforcing the idea that many users rely on Ford GoBike for their daily commute. On the other hand, casual customers tend to use the bikes more during weekends and non-peak hours, likely for recreational purposes.

The duration of trips also varies among different user groups. Customers tend to have longer trip durations compared to subscribers, which again suggests recreational usage versus quick commuting. Weekday versus weekend patterns show that weekdays have higher trip counts, especially among subscribers, while weekends show a slight increase in customer rides.

Furthermore, station analysis identifies the most popular start and end stations. These hotspots are mainly located in densely populated or business-heavy areas, confirming that station location significantly influences trip frequency.

Throughout the project, visualizations such as histograms, bar charts, and violin plots are employed effectively to represent findings clearly. These visuals help identify and confirm the observed trends, providing a more intuitive understanding of the data.

In conclusion, the analysis successfully uncovers valuable insights about Ford GoBike users. It shows that the service is primarily used by young to middle-aged working professionals for weekday commuting, while casual users engage more on weekends for leisure. Understanding these patterns can help stakeholders like urban planners and the bike-share company optimize station placement, adjust bike availability during peak times, and tailor marketing strategies for different user groups.

The project demonstrates a strong grasp of data wrangling, feature engineering, and data visualization techniques to tell a coherent story from raw data. It emphasizes the importance of thorough exploratory data analysis (EDA) in extracting actionable insights from real-world datasets.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

The primary business objective of this project is to understand and optimize user engagement and operational efficiency for the Ford GoBike bike-sharing service. By analyzing user demographics, trip patterns, and usage behaviors, the goal is to:

* Identify the key user segments (e.g., subscribers vs. customers, gender, age groups) and understand their usage habits.

* Discover peak usage times and popular stations to better allocate bikes and resources.

* Highlight differences in usage behavior between weekday commuting and weekend recreational activities.

* Support marketing strategies by tailoring promotions and services to different user types.

* Enhance customer satisfaction and retention by improving service availability during high-demand periods.

Ultimately, the insights from this analysis aim to help Ford GoBike increase ridership, improve service reliability, and make informed decisions about station expansions, pricing strategies, and targeted marketing campaigns.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("fordgobike.csv")
df.columns = df.columns.str.strip()


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# duplicated() returns one boolean per row; sum() turns it into a count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_counts=df.isnull().sum()
missing_counts

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
missing_counts.plot(kind='bar', color='coral')
plt.title("Missing Values Count per Column")
plt.ylabel("Number of Missing Values")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### What did you know about your dataset?

Most columns (such as duration_sec, start_time, and end_time) have no missing values, which is why no bars are visible for them in the chart.

The columns member_birth_year and member_gender have a significant amount of missing data: around 7,800 missing values each.

Other columns such as user_type, bike_id, start_station_id, and start_station_name appear to be complete.

In short:
* Only 2 columns have major missing values (birth year and gender).
* Other columns are clean.
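
The raw counts can also be expressed as a share of rows, which makes it easier to judge whether dropping the affected rows is acceptable. The following is a minimal sketch on a made-up toy frame (only the column names mirror the dataset; the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset
toy = pd.DataFrame({
    "duration_sec": [300, 620, 450, 900],
    "member_birth_year": [1985.0, np.nan, 1992.0, np.nan],
    "member_gender": ["Male", None, "Female", "Other"],
})

# Share of missing entries per column, as a percentage
missing_pct = (toy.isnull().mean() * 100).round(1)

# Keep only columns that actually have gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Applied to `df`, the same two lines would report the missing share for member_birth_year and member_gender directly.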

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T

### Variables Description

* duration_sec - Duration of the trip in seconds (ranges from 61 sec to 85,546 sec).
* start_station_id - ID of the station where the trip started.
* start_station_latitude - Latitude of the starting station.
* start_station_longitude - Longitude of the starting station.
* end_station_id - ID of the station where the trip ended.
* end_station_latitude - Latitude of the ending station.
* end_station_longitude - Longitude of the ending station.
* bike_id - Unique ID for each bike used in the trips.
* member_birth_year - Birth year of the member (user), indicating the user's age (ranges from 1975 to 2000).

**Notice:**
* member_birth_year has fewer counts (86,963 vs. 94,802), which matches the missing values found earlier.
* Other fields (duration_sec, start_station_id, etc.) have no missing values.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values=df.nunique().sort_values(ascending=False)
unique_values

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Step 1: Convert time columns to datetime format
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])


In [None]:
# Step 2: Create derived columns
df['trip_duration_minutes'] = df['duration_sec'] / 60
df['start_hour'] = df['start_time'].dt.hour
df['start_day'] = df['start_time'].dt.day_name()
df['start_month'] = df['start_time'].dt.month_name()


In [None]:
# Step 3: Convert columns to categorical
categorical_columns = ['user_type', 'member_gender', 'bike_share_for_all_trip', 'start_day', 'start_month']
for col in categorical_columns:
    df[col] = df[col].astype('category')


In [None]:
# Step 4: Drop rows with missing values in critical user info fields
df_cleaned = df.dropna(subset=['member_birth_year', 'member_gender'])


In [None]:
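# Step 5: Reset index
df_cleaned = df_cleaned.reset_index(drop=True)

# Optional: Save the cleaned dataset under a separate name so the raw
# "fordgobike.csv" is not overwritten (the output filename is illustrative)
df_cleaned.to_csv("fordgobike_cleaned.csv", index=False)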

df_cleaned.head()

In [None]:
df_cleaned.shape

### What all manipulations have you done and insights you found?

*Insights After Cleaning*
    
*Missing Values:*
* Rows with a missing member_birth_year or member_gender (around 8% of the data) were removed, reducing the dataset from 94,802 rows to 86,963 rows.

*Datetime Conversion Successful:*
* Time fields (start_time and end_time) were successfully converted to datetime format, enabling the extraction of additional time-based features.

*New Variables Available:*
* With derived columns like trip_duration_minutes, start_hour, and start_month, we can now explore trends such as:

  * Most popular times of the day for bike rides

  * Seasonal trends (popular months)

  * Daily riding patterns

*Categorical Fields Optimized:*
* Categorical conversion helps in grouping and summarizing the data efficiently, especially for fields like user_type (Subscriber vs Customer) and member_gender (Male, Female, Other).
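
The wrangling steps above can be bundled into one function so that the notebook runs in a single pass and fails early if a column is missing. This is only a sketch: `prepare_trips` is a hypothetical helper name, and the toy rows and dates are invented for illustration.

```python
import pandas as pd

def prepare_trips(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper bundling the wrangling steps into one call."""
    required = {"duration_sec", "start_time", "member_birth_year", "member_gender"}
    missing = required - set(raw.columns)
    if missing:
        # Fail early with a readable message instead of a late KeyError
        raise KeyError(f"missing expected columns: {sorted(missing)}")

    out = raw.copy()  # never mutate the caller's frame
    out["start_time"] = pd.to_datetime(out["start_time"])
    out["trip_duration_minutes"] = out["duration_sec"] / 60
    out["start_hour"] = out["start_time"].dt.hour
    out["start_day"] = out["start_time"].dt.day_name()
    # Drop rows where either demographic field is missing
    out = out.dropna(subset=["member_birth_year", "member_gender"])
    return out.reset_index(drop=True)

# Invented rows, only to exercise the function
toy = pd.DataFrame({
    "duration_sec": [600, 1200, 300],
    "start_time": ["2018-01-31 08:15:00", "2018-01-27 14:05:00", "2018-01-29 17:40:00"],
    "member_birth_year": [1988.0, None, 1975.0],
    "member_gender": ["Female", None, "Male"],
})
clean = prepare_trips(toy)
print(clean[["trip_duration_minutes", "start_hour", "start_day"]])
```

Returning a new frame (rather than editing in place) also avoids the SettingWithCopyWarning that later cells work around with `.copy()`.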

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Set plot style
sns.set(style="whitegrid")

# 1. How long does the average trip take?
average_trip_duration = df_cleaned['trip_duration_minutes'].mean()

# 2. Is the trip duration affected by weather (months/seasons)?
avg_duration_by_month = df_cleaned.groupby('start_month', observed=True)['trip_duration_minutes'].mean().reindex([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
])

# 3. Does the above depend on if a user is a subscriber or customer?
avg_duration_by_month_user_type = df_cleaned.groupby(['start_month', 'user_type'], observed=True)['trip_duration_minutes'].mean().unstack().reindex([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
])


# Plotting average trip duration by month
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_duration_by_month.index, y=avg_duration_by_month.values, color='skyblue')
plt.xticks(rotation=45)
plt.title("Average Trip Duration by Month")
plt.ylabel("Trip Duration (minutes)")
plt.xlabel("Month")
plt.tight_layout()
plt.show()



In [None]:
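# Plotting average trip duration by month and user type
# DataFrame.plot creates its own figure, so the size is passed via figsize
# (a separate plt.figure() call would only leave an empty figure behind)
avg_duration_by_month_user_type.plot(kind='bar', stacked=False, colormap='coolwarm', figsize=(10, 6))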
plt.title("Average Trip Duration by Month and User Type")
plt.ylabel("Trip Duration (minutes)")
plt.xlabel("Month")
plt.xticks(rotation=45)
plt.legend(title='User Type')
plt.tight_layout()
plt.show()

average_trip_duration

##### 1. Why did you pick the specific chart?

Chart 1: Average Trip Duration by Month
*Chart Type: Vertical Bar Chart*

Reason for Choosing:

* A bar chart clearly shows comparisons across months.
* Easy to spot which month has longer average trip durations.

Chart 2: Average Trip Duration by Month and User Type
*Chart Type: Grouped Bar Chart*

Reason for Choosing:

* Shows comparison between two groups (Customer vs Subscriber) for each month.
* Helps uncover behavioral differences between customer segments.

##### 2. What is/are the insight(s) found from the chart?

From Chart 1:
*Insight*
* In January, the average trip duration is about 12 minutes.
* January is the only month with a bar, so months cannot be compared with each other (likely because the dataset includes only January rides).

From Chart 2:
*Insight*
* Customers had a much longer average trip duration (27 minutes) compared to Subscribers (11 minutes) in January.
* Customers use bikes for longer trips while subscribers tend to use them for shorter, possibly more routine rides (like commuting).
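
Because a few very long rides can pull an average upward, it may be worth checking the median next to the mean before reading too much into the gap between user types. A minimal sketch with invented durations (not values from the dataset):

```python
import pandas as pd

# Invented durations: one very long customer ride skews the customer mean
toy = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber"],
    "trip_duration_minutes": [15.0, 20.0, 240.0, 8.0, 10.0, 12.0],
})

# Mean, median, and group size side by side for each user type
summary = toy.groupby("user_type")["trip_duration_minutes"].agg(["mean", "median", "count"])
print(summary)
```

If the mean and median differ widely for one group, the median is usually the safer number to quote for a "typical" trip.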

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*Positive Business Impact:*

* Targeting Customers: Since customers take longer rides, offering flexible, pay-per-minute or hourly plans can increase revenue.

* Subscriber Retention: As subscribers use bikes for shorter rides, introducing loyalty rewards for frequent short trips can increase retention.

* Marketing Strategies: With only January available, the data cannot yet show which months have longer trips, so month-specific promotions and discounts should be timed once more months of data are available.

*Negative Growth Concerns:*

* Limited Seasonal Data: The dataset is mainly from January. Hence, drawing full-year conclusions is risky.

* Customer Longer Usage: While longer rides mean higher revenue, they may also mean higher maintenance costs for the bikes if not managed carefully.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Make a proper copy of the filtered DataFrame to avoid SettingWithCopyWarning
# (copy df_cleaned, not df, so the rows dropped during cleaning stay dropped)
df_cleaned = df_cleaned.copy()

# Ensure datetime format for time-based analysis
df_cleaned['start_time'] = pd.to_datetime(df_cleaned['start_time'])

# Extract day of week, hour of day, and month
df_cleaned['day_of_week'] = df_cleaned['start_time'].dt.day_name()
df_cleaned['hour_of_day'] = df_cleaned['start_time'].dt.hour
df_cleaned['month'] = df_cleaned['start_time'].dt.month_name()

# Plot: Bike usage by hour of day
plt.figure(figsize=(12, 6))
sns.countplot(data=df_cleaned, x='hour_of_day', order=range(24), color='skyblue')
plt.title('Bike Usage by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Rides')
plt.tight_layout()
plt.show()




##### 1. Why did you pick the specific chart?

*Chart Type: Vertical Bar Chart (Histogram-like)*

Reason for Choosing:
* It effectively shows the distribution of bike usage across different hours in a day.
* Very easy to spot the peak and low usage hours.
* Bar charts are ideal for highlighting hourly patterns, helping in decision-making for operations and marketing.

##### 2. What is/are the insight(s) found from the chart?

Morning Peak:
* The highest bike usage is around 8 AM and 9 AM — classic morning commute hours.

Evening Peak:
* Another spike occurs around 5 PM and 6 PM, which corresponds to the evening commute time.

Low Usage:
* Very low bike usage between 12 AM and 5 AM (midnight to early morning).

Mid-Day Moderate Activity:
* Usage is steady but moderate between 10 AM to 3 PM, possibly for leisure rides or flexible work schedules.
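
To check whether the two peaks are driven by commuters, the hourly counts can be split by user type. The sketch below uses `pd.crosstab` on invented rows; the column names follow the derived columns from the wrangling step, and the numbers are purely illustrative.

```python
import pandas as pd

# Invented rides: subscribers cluster at 8 and 17, customers around midday
toy = pd.DataFrame({
    "start_hour": [8, 8, 8, 8, 17, 17, 13, 13, 14, 17],
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Subscriber",
                  "Subscriber", "Subscriber", "Customer", "Customer",
                  "Customer", "Customer"],
})

# Rows = hour of day, columns = user type, cells = number of rides
hourly = pd.crosstab(toy["start_hour"], toy["user_type"])

# Busiest hour for each user type
busiest_hour = hourly.idxmax()
print(hourly)
print(busiest_hour)
```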



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, absolutely!

Optimized Bike Availability:
* Knowing the peak hours allows better rebalancing of bikes at stations — ensuring enough bikes are available during morning and evening rush hours.

Maintenance Scheduling:
* Maintenance tasks can be scheduled during off-peak hours (e.g., midnight to early morning) to avoid impacting customer experience.

Targeted Promotions:
* Launching promotions like "Late Ride Discounts" during off-peak hours could increase ridership when usage is otherwise low.

Dynamic Pricing Opportunities:
* Surge pricing could be introduced during peak hours, optimizing revenue.

***Insights that lead to negative Growth:***

Potential Concern:

Overloading during Peak Hours:
* If enough bikes are not available during peak times (8 AM, 5 PM), it may frustrate users, leading to customer churn.

Underutilization during Off-peak Hours:
* Many bikes stay idle late at night, which is an operational cost without generating revenue.

Justification:
* Inability to meet demand during high-traffic hours could directly hurt the brand's reliability and user trust.
* Conversely, bikes standing idle for long periods still incur maintenance and storage costs without generating revenue.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
order_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sns.countplot(data=df_cleaned, x='day_of_week', hue='day_of_week', order=order_days, palette='muted', legend=False)
plt.title('Bike Usage by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Rides')
plt.tight_layout()
plt.show()




##### 1. Why did you pick the specific chart?

Chart Type: Vertical Bar Chart

Reason for Choosing:
* It clearly compares the number of rides across the days of the week.
* The bar chart allows quick identification of the most and least popular days for bike usage.
* It’s ideal for visualizing trends and patterns across categorical data like weekdays.

##### 2. What is/are the insight(s) found from the chart?

Highest Usage:
* Tuesday is the day with the maximum number of rides, followed closely by Wednesday.

Moderate Usage:
* Monday, Thursday, and Friday have a moderate but noticeable drop compared to Tuesday and Wednesday.

Lowest Usage:
* Saturday and Sunday have the lowest usage, with Sunday being the absolute lowest.

Weekday vs Weekend:
* There is a clear trend: bike usage is much higher during weekdays compared to weekends.
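
Day names sort alphabetically by default, so a table of rides per day is easier to read once the weekday order is made explicit. One way to do this, sketched here on invented values, is an ordered categorical dtype:

```python
import pandas as pd

order_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
              'Friday', 'Saturday', 'Sunday']

# Invented day names, one entry per ride
days = pd.Series(["Tuesday", "Monday", "Sunday", "Tuesday", "Wednesday"])

# An ordered categorical keeps Monday..Sunday order in tables and plots
weekday_dtype = pd.CategoricalDtype(categories=order_days, ordered=True)
rides_per_day = days.astype(weekday_dtype).value_counts().sort_index()
print(rides_per_day)
```

Days without any rides still appear with a count of 0, which makes gaps in the data visible.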

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, absolutely!

Operational Efficiency:
* More bikes can be distributed and serviced during the weekdays to meet demand.

Strategic Marketing:
* Targeted promotions could be focused on weekends to boost the low usage days (e.g., weekend family ride discounts).

Flexible Staffing:
* Staff shifts (for customer service or bike maintenance) can be optimized — more support on weekdays, lighter on weekends.

Revenue Optimization:
* Dynamic pricing models could offer lower rates during weekends to attract more customers.

***Insights that lead to negative growth:***

Potential Risk Area:

Low Weekend Usage:
* The sharp drop during Saturdays and Sundays could mean the company is missing out on a large potential leisure market.

Justification:
* Weekends are typically leisure times, and if bike usage remains low, it could indicate missed marketing or service opportunities.
* Without action, weekends continue to operate at under-capacity, leading to operational inefficiencies and lost revenue.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Make sure to avoid SettingWithCopyWarning
df_cleaned = df_cleaned.copy()

# Add new column for Weekday/Weekend classification
df_cleaned['week_part'] = df_cleaned['day_of_week'].apply(
    lambda x: 'Weekend' if x in ['Saturday', 'Sunday'] else 'Weekday')

# Plot
plt.figure(figsize=(8, 5))
# Color by week_part so the chart shows one bar per part of the week
sns.countplot(data=df_cleaned, x='week_part', hue='week_part', palette='Set2', legend=False)
plt.title('Bike Usage: Weekdays vs Weekends')
plt.xlabel('Part of the Week')
plt.ylabel('Number of Rides')
plt.tight_layout()
plt.show()




##### 1. Why did you pick the specific chart?

* This bar chart was picked because it clearly compares the number of bike rides during weekdays and weekends.
* It visually shows the difference between usage patterns across two parts of the week in a simple and easy-to-understand manner.
* Bar charts are great for comparison, which is the goal here.

##### 2. What is/are the insight(s) found from the chart?

* Bike usage is significantly higher on weekdays compared to weekends.
* There is a noticeable drop in the number of rides on weekends.
* The weekday total covers five days and the weekend total only two, so part of the gap reflects the number of days in each group rather than lower demand per day.
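
Since weekdays span five days and weekends only two, a per-day average can be a fairer comparison than raw totals. A minimal sketch with invented timestamps (the dates are made up and only chosen to fall on known weekdays):

```python
import pandas as pd

# Invented rides: 6 on three weekdays, 3 on a single Saturday
toy = pd.DataFrame({"start_time": pd.to_datetime([
    "2018-01-29 08:00", "2018-01-29 17:30",                      # Monday
    "2018-01-30 08:10", "2018-01-30 09:00",                      # Tuesday
    "2018-01-31 08:05", "2018-01-31 18:00",                      # Wednesday
    "2018-01-27 11:00", "2018-01-27 13:00", "2018-01-27 15:00",  # Saturday
])})

toy["date"] = toy["start_time"].dt.date
# dayofweek: Monday=0 ... Sunday=6
toy["week_part"] = toy["start_time"].dt.dayofweek.map(
    lambda d: "Weekend" if d >= 5 else "Weekday")

# Total rides and number of distinct calendar days per part of the week
per_part = toy.groupby("week_part").agg(
    rides=("date", "size"), days=("date", "nunique"))
per_part["rides_per_day"] = per_part["rides"] / per_part["days"]
print(per_part)
```

In this toy example the weekday total is higher while the per-day figure is lower, which shows why both numbers are worth checking.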

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, definitely.

* Knowing that usage peaks during weekdays can help bike rental companies, city planners, or service providers optimize the availability of bikes and maintenance schedules.
* They can run special promotions or events on weekends to boost lower usage days.
* Staffing and operational decisions (e.g., bike repairs, customer support) can be better aligned with the demand pattern.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Copy original dataframe for clean transformations
df_demo = df.copy()

# Calculate rider age
current_year = df_demo['start_time'].dt.year.max()
df_demo['age'] = current_year - df_demo['member_birth_year']

# Remove extreme outliers (e.g., age > 90 or < 15) to get meaningful analysis
df_demo = df_demo[(df_demo['age'] >= 15) & (df_demo['age'] <= 90)]

# 1. Ride duration variation by user type
plt.figure(figsize=(8, 4))
sns.boxplot(data=df_demo, x='user_type', y='trip_duration_minutes',hue="user_type", palette='pastel')
plt.ylim(0, 60)  # Focus on trips under 1 hour to reduce distortion from outliers
plt.title('Ride Duration by User Type')
plt.ylabel('Trip Duration (minutes)')
plt.xlabel('User Type')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* A box plot was picked because it clearly shows the distribution, median, spread, and outliers of ride durations for two different user types — Customers and Subscribers.
* Box plots are ideal for comparing groups and understanding variability, which is perfect when analyzing how ride patterns differ between user categories.

##### 2. What is/are the insight(s) found from the chart?

* Customers generally have longer trip durations compared to Subscribers.
* Subscribers have a tighter, more consistent trip duration (less spread) compared to Customers.
* Customers show more variability (a wider box).
* Both groups have outliers (very long trips); because the y-axis is capped at 60 minutes, the longest trips fall outside the visible range.
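
Box plots mark points beyond 1.5 × IQR above the upper quartile as outliers, so a group with a tight box can show many outliers even when its long trips are not unusual in absolute terms. The sketch below quantifies this per user type; `iqr_outlier_share` is a hypothetical helper and the durations are invented.

```python
import pandas as pd

def iqr_outlier_share(durations: pd.Series) -> float:
    """Share of values above Q3 + 1.5 * IQR (the upper box-plot fence)."""
    q1, q3 = durations.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return float((durations > upper_fence).mean())

# Invented durations: both groups contain a 60-minute trip
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 6 + ["Customer"] * 6,
    "trip_duration_minutes": [8, 9, 10, 11, 12, 60,
                              10, 20, 30, 40, 50, 60],
})

shares = toy.groupby("user_type")["trip_duration_minutes"].apply(iqr_outlier_share)
print(shares)
```

Here the same 60-minute trip counts as an outlier for the tightly clustered group but not for the widely spread one.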

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, very much.

* Knowing that Customers take longer trips suggests marketing strategies can be tailored for them — such as offering "extended ride packages" or "leisure ride promotions."
* For Subscribers, since rides are shorter and more consistent, companies can focus on commuter-focused plans (e.g., discounted monthly passes, fast service support).
* Bike maintenance and redistribution strategies can be better planned based on typical ride duration patterns.

***Insights that lead to negative growth:***

Potentially, yes:

* The high variability in Customer ride durations might cause issues if bikes are less available for other users, leading to customer dissatisfaction during peak times.
* Also, outliers (very long rides) could result in unexpected wear and tear on bikes, increasing maintenance costs.

If not managed properly, these could impact service quality and operational efficiency.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# 2. Ride duration by gender
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_demo, x='member_gender', y='trip_duration_minutes', hue="member_gender", palette='cool')
plt.ylim(0, 60)
plt.title('Ride Duration by Gender')
plt.ylabel('Trip Duration (minutes)')
plt.xlabel('Gender')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* A box plot was chosen because it effectively visualizes the spread, median, and outliers of ride durations across different gender categories (Female, Male, and Other).
* Box plots are great for comparing distributions between multiple groups, making it easier to spot patterns and differences at a glance.

##### 2. What is/are the insight(s) found from the chart?

* Median ride durations are quite similar across all gender groups, but female and 'Other' riders tend to have slightly longer median ride times than males.
* The 'Other' gender group shows a slightly wider range of trip durations.
* Outliers are present in all groups and are especially numerous for the Female and Other categories, indicating some extremely long rides.
* Overall, males seem to have slightly shorter and more consistent trip durations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, absolutely.

* Understanding ride patterns by gender can help design more inclusive services and marketing campaigns.
* For example, longer ride times for certain groups might suggest promoting leisure or scenic ride packages to those demographics.
* Service improvements such as safety features, better bike ergonomics, or targeted loyalty programs could be planned to encourage higher engagement across all gender groups.

***Insights that lead to negative growth:***

Potentially, yes:

* The presence of extreme outliers (very long rides) could stress fleet management — causing bike shortages for other users if not planned for properly.
* If bike availability is skewed due to unpredictable long rides by some user groups, it could lead to lower customer satisfaction, especially during peak times.
* Addressing these risks early with better tracking, reservation systems, or pricing policies (like charging extra for very long rides) can prevent negative impacts.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# 3. Categorize age into groups
bins = [15, 24, 34, 44, 54, 64, 74, 90]
labels = ['15-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75-90']
# include_lowest=True keeps riders aged exactly 15 in the first bin
df_demo['age_group'] = pd.cut(df_demo['age'], bins=bins, labels=labels, right=True, include_lowest=True)

# Most common age group
common_age_group = df_demo['age_group'].value_counts().idxmax()

# Plot ride duration by age group
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_demo, x='age_group', y='trip_duration_minutes',hue='age_group' , palette='viridis')
plt.ylim(0, 60)
plt.title('Ride Duration by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Trip Duration (minutes)')
plt.tight_layout()
plt.show()

common_age_group

##### 1. Why did you pick the specific chart?

* The box plot was chosen because it clearly shows the distribution, median, spread, and outliers of trip durations across different age groups.
* It’s a powerful visual tool to compare multiple groups side-by-side and easily identify which age groups have longer or shorter ride durations.

##### 2. What is/are the insight(s) found from the chart?

* Younger age groups (like 15–24) have slightly longer median ride durations than middle-aged groups (25–54).
* Older age groups (65–74 and 75–90) show an increase again in ride duration compared to the middle-aged groups.
* Outliers (very long trips) are present across all age groups but are more prominent in younger and older groups.
* Trip durations for ages 25–54 seem to be more consistent and clustered around shorter ride times.
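
When reading the age groups, it helps to know how `pd.cut` treats bin edges: with `right=True` each bin includes its right edge but not its left one, so the smallest edge is excluded unless `include_lowest=True` is passed. A small sketch with a few illustrative ages:

```python
import pandas as pd

bins = [15, 24, 34, 44, 54, 64, 74, 90]
labels = ['15-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75-90']

# Illustrative ages sitting exactly on bin edges
ages = pd.Series([15, 24, 25, 90])

# Default behaviour: the first bin is (15, 24], so age 15 gets no group
default_groups = pd.cut(ages, bins=bins, labels=labels, right=True)

# include_lowest=True closes the first bin on the left as well
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, right=True,
                          include_lowest=True)

print(pd.DataFrame({"age": ages,
                    "default": default_groups,
                    "include_lowest": inclusive_groups}))
```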

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, definitely.

* Targeted promotions can be created for younger riders (15–24) and senior groups (65+) encouraging leisure trips or membership deals focused on longer rides.
* It could help optimize fleet management knowing that middle-aged riders take shorter trips — meaning bikes in business hubs can turn around faster.
* Tailoring service features like comfortable bikes for older adults or fun experiences for younger adults can increase usage and loyalty.

***Insights that lead to negative growth***

Possibly yes:

* Longer trips by older adults might slow down bike turnover, especially if there are not enough bikes available at key stations — which can frustrate frequent (short-trip) users.
* Younger riders’ variability and longer trips may lead to increased bike wear and tear, impacting maintenance costs if not monitored carefully.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Top 10 most popular start stations
top_start_stations = df_cleaned['start_station_name'].value_counts().head(10)

# Create a new DataFrame to plot
top_start_df = top_start_stations.reset_index()
top_start_df.columns = ['start_station_name', 'count']

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=top_start_df, x='count', y='start_station_name', palette='Blues_r', hue='start_station_name', dodge=False, legend=False)
plt.title('Top 10 Most Popular Start Stations')
plt.xlabel('Number of Rides')
plt.ylabel('Start Station')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* The horizontal bar chart was picked because it is perfect for ranking categories like start stations.
* It clearly shows the comparison between different stations in terms of the number of rides, making it very easy to identify the most popular starting points.

##### 2. What is/are the insight(s) found from the chart?

* San Francisco Caltrain (Townsend St at 4th St) is the most popular start station.
* San Francisco Ferry Building and Berry St at 4th St are also extremely popular.
* Start stations near transit hubs (Caltrain, Ferry Building, BART stations) dominate the top 10.
* Bike usage is heavily concentrated around key transport interchanges and major streets.
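
How concentrated the demand is can be expressed as the cumulative share of rides covered by the busiest stations. A minimal sketch with placeholder station names and invented counts:

```python
import pandas as pd

# Placeholder station names, one entry per ride (invented counts)
starts = pd.Series(["Station A"] * 5 + ["Station B"] * 3 + ["Station C"] * 2,
                   name="start_station_name")

# Share of all rides per station, busiest first
share = starts.value_counts(normalize=True)

# Running total: how much of the demand the top-N stations cover
cumulative_share = share.cumsum()
top2_share = cumulative_share.iloc[1]
print(cumulative_share)
```

Applied to `df_cleaned['start_station_name']`, `cumulative_share.iloc[9]` would give the share of rides that start at the top 10 stations.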

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely!

* Business can optimize bike distribution by increasing the number of bikes at these high-traffic start stations to meet demand.
* Marketing promotions (like discounts or loyalty rewards) can be strategically placed at these locations to maximize visibility.
* Expansion planning can prioritize nearby areas to capture spillover demand from these hot spots.

***Insights that lead to negative growth***

Potentially yes:

* Overcrowding risk at the top stations could lead to bike shortages during peak times, causing customer dissatisfaction.
* If too much focus is placed on popular stations, smaller stations might get neglected, leading to uneven service quality and missed opportunities for growth in other areas.

A smart solution would be to dynamically redistribute bikes and predict peak usage times using data analytics to avoid these problems.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Create 'route' column
df_cleaned['route'] = df_cleaned['start_station_name'] + " → " + df_cleaned['end_station_name']
top_routes = df_cleaned['route'].value_counts().head(10)

# Prepare data as a DataFrame for plotting with 'hue'
top_routes_df = top_routes.reset_index()
top_routes_df.columns = ['route', 'count']

# Bar plot with 'hue' and legend disabled
plt.figure(figsize=(14, 6))
sns.barplot(data=top_routes_df, x='count', y='route', hue='route', palette='magma', legend=False)
plt.title('Top 10 Most Common Bike Routes')
plt.xlabel('Number of Rides')
plt.ylabel('Route')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

* The horizontal bar chart was chosen because it is ideal for comparing different routes easily.
* It makes the differences very clear between each route based on the number of rides, helping quickly identify the most common bike paths.

##### 2. What is/are the insight(s) found from the chart?

* The San Francisco Ferry Building → The Embarcadero at Sansome St route is the most popular by a large margin.
* Several short, central city routes dominate the list, suggesting that bikes are used mainly for short urban trips rather than long commutes.
* Many common routes start or end at major transport hubs (Ferry Building, BART stations), confirming that people use bikes for first-mile/last-mile connectivity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely!

* Bike availability can be improved on these popular routes, especially during peak hours.
* Bike maintenance and rebalancing efforts can focus more on these key corridors to reduce downtime and improve customer satisfaction.
* New marketing opportunities (like corporate passes or ads along these busy routes) can be identified.

***Insights that lead to negative growth:***

Yes, a possible risk:

* Overdependence on a few popular routes means that any issue affecting these areas (construction, traffic restrictions, station closure) can disrupt a large portion of users.
* Other areas/routes may get underutilized, meaning potential growth opportunities elsewhere are missed.

Hence, while optimizing for these top routes, it’s important to also encourage exploration and use of less crowded routes by offering incentives like dynamic pricing or gamification rewards.
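To put a number on the overdependence risk described above, one option is to measure what share of all trips the top routes account for. The sketch below is a minimal example on synthetic data; the helper name `top_route_share` is hypothetical, and the route string is built the same way as in the chart code.

```python
import pandas as pd

# Synthetic sample standing in for df_cleaned (illustrative values only)
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "C"],
    "end_station_name":   ["B", "B", "B", "A", "A"],
})

def top_route_share(df: pd.DataFrame, n: int = 10) -> float:
    """Fraction of all trips that fall on the n most common routes."""
    route = df["start_station_name"] + " → " + df["end_station_name"]
    counts = route.value_counts()  # sorted from most to least common
    return counts.head(n).sum() / counts.sum()

share = top_route_share(trips, n=1)
print(f"Top route covers {share:.0%} of trips")
```

A low share on the real data would mean demand is spread across many routes, which weakens the overdependence concern.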


#### Chart - 10

In [None]:
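# Chart - 10 visualization code
# Top 10 end stations by number of rides (computed here so this cell runs on its own)
top_end_stations = df_cleaned['end_station_name'].value_counts().head(10)

# Bar plot for top 10 end stations, with the redundant 'hue' legend disabled
plt.figure(figsize=(12, 6))
sns.barplot(x=top_end_stations.values, y=top_end_stations.index, hue=top_end_stations.index, palette='Greens_r', legend=False)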
plt.title('Top 10 Most Popular End Stations')
plt.xlabel('Number of Rides')
plt.ylabel('End Station')
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

* The horizontal bar chart is used because it clearly ranks the end stations based on the number of rides ending at each station.
* It allows for a quick, visual comparison, making it easy to identify which stations are the busiest end points.

##### 2. What is/are the insight(s) found from the chart?

* San Francisco Caltrain (Townsend St at 4th St) is the most popular end station by a significant margin.
* Stations like San Francisco Ferry Building and The Embarcadero at Sansome St are also heavily used as destinations.
* Major transportation hubs (Caltrain, BART, Ferry Building) dominate the list, indicating commuters and travelers heavily use bikes to reach these locations.
* Start and End station trends are quite aligned with earlier charts — popular start points are also popular end points.
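
The alignment between popular start and end stations noted above can be checked directly with a set intersection. This is a minimal sketch on synthetic data; the helper name `top_station_overlap` is hypothetical, and the column names follow the ones used in the chart code.

```python
import pandas as pd

# Synthetic sample standing in for df_cleaned (illustrative values only)
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "C", "A", "B"],
    "end_station_name":   ["B", "B", "A", "A", "D", "A"],
})

def top_station_overlap(df: pd.DataFrame, n: int = 10) -> list:
    """Stations that appear in both the top-n start and top-n end lists."""
    top_start = df["start_station_name"].value_counts().head(n).index
    top_end = df["end_station_name"].value_counts().head(n).index
    return sorted(set(top_start) & set(top_end))

shared = top_station_overlap(trips, n=2)
print(shared)
```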

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely!

* Rebalancing efforts (moving bikes back to starting points) can be optimized because the chart shows where most bikes end up.
* Additional docks can be installed, or capacity expanded, at busy end stations to prevent full docks (which frustrate users).
* Knowing the high-traffic end stations can help in strategic partnerships (e.g., advertising, events with Caltrain, BART, etc.).

***Insights that lead to negative growth:***

Potentially yes:

* If the busiest end stations (like Caltrain or the Ferry Building) fill up during peak hours, riders cannot find an empty dock to return their bike, which frustrates users and can discourage repeat use.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

# Most popular start and end stations
top_start_stations = df_cleaned['start_station_name'].value_counts().head(10)
top_end_stations = df_cleaned['end_station_name'].value_counts().head(10)

# Most common routes
df_cleaned['route'] = df_cleaned['start_station_name'] + " → " + df_cleaned['end_station_name']
top_routes = df_cleaned['route'].value_counts().head(10)

# Inflow and outflow
station_inflow = df_cleaned['end_station_name'].value_counts()
station_outflow = df_cleaned['start_station_name'].value_counts()

# Combine inflow/outflow into a single DataFrame
station_flow = pd.DataFrame({
    'Inflow': station_inflow,
    'Outflow': station_outflow
}).fillna(0).astype(int)

station_flow['Net_Flow'] = station_flow['Inflow'] - station_flow['Outflow']

# Top stations
top_inflow = station_flow.sort_values(by='Inflow', ascending=False).head(10)
top_outflow = station_flow.sort_values(by='Outflow', ascending=False).head(10)
top_net_positive = station_flow.sort_values(by='Net_Flow', ascending=False).head(5)
top_net_negative = station_flow.sort_values(by='Net_Flow').head(5)

# Print results
print("Top Start Stations:\n", top_start_stations)
print("\nTop End Stations:\n", top_end_stations)
print("\nTop Routes:\n", top_routes)
print("\nTop Inflow Stations:\n", top_inflow)
print("\nTop Outflow Stations:\n", top_outflow)
print("\nStations with Highest Net Inflow:\n", top_net_positive)
print("\nStations with Highest Net Outflow:\n", top_net_negative)


In [None]:
# Set style
sns.set_style("whitegrid")

# Inflow vs outflow for the union of the top start and top end stations
top_stations = station_flow.loc[top_start_stations.index.union(top_end_stations.index)]

top_stations[['Inflow', 'Outflow']].plot(kind='barh', stacked=False, figsize=(12,8), color=['green', 'blue'])
plt.title('Inflow and Outflow of Top Stations', fontsize=16)
plt.xlabel('Number of Rides', fontsize=12)
plt.ylabel('Station', fontsize=12)
plt.legend(title='Flow Type')
plt.grid(axis='x')
plt.tight_layout()

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

* A grouped bar chart is chosen because it is perfect for comparing two related metrics (inflow and outflow) side-by-side for each station.
* It visually highlights the balance (or imbalance) between the number of bikes entering and leaving a station, making it easy to spot trends or issues at a glance.

##### 2. What is/are the insight(s) found from the chart?

* San Francisco Caltrain (Townsend St at 4th St) and its nearby station consistently show high inflow and outflow, but inflow exceeds outflow.
* Howard St at Beale St has noticeably higher outflow than inflow, suggesting it is more of a starting point than a destination.
* Berry St at 4th St is almost balanced between inflow and outflow.
* The Embarcadero at Sansome St and San Francisco Ferry Building are slightly more favored as destinations (higher inflow).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely!

* Stations with higher inflow need more docks available for parking bikes.
* Stations with higher outflow might need more bikes stocked to meet demand, especially during peak times.
* It helps optimize bike rebalancing operations, reducing customer frustration and ensuring bikes and docks are available where and when needed.
* Could also inform future expansion strategies (e.g., adding satellite stations near high-imbalance zones).

***Insights that lead to negative growth:***

Yes, potential risks are visible:

* Stations with heavy outflow but low inflow (like Howard St at Beale St) could run out of bikes quickly, leading to missed revenue opportunities and poor customer experience.
* If popular destination stations (like Caltrain or Ferry Building) frequently get full (no empty docks), users might stop using the service.

Thus, imbalances that are not properly managed (for example, when bikes are not rebalanced in time) can lead to operational inefficiencies and reduced customer satisfaction.
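
Raw net flow favors busy stations, so a relative measure can help rank rebalancing priorities. The sketch below extends the inflow/outflow table with an `Imbalance` column, defined here (as an assumption, not something from the original analysis) as net flow divided by total traffic. It ranges from -1 (only departures) to +1 (only arrivals).

```python
import pandas as pd

# Synthetic sample standing in for df_cleaned (illustrative values only)
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B"],
    "end_station_name":   ["B", "B", "C", "A"],
})

def station_imbalance(df: pd.DataFrame) -> pd.DataFrame:
    """Inflow, outflow, net flow and relative imbalance per station."""
    inflow = df["end_station_name"].value_counts()
    outflow = df["start_station_name"].value_counts()
    flow = pd.DataFrame({"Inflow": inflow, "Outflow": outflow}).fillna(0).astype(int)
    flow["Net_Flow"] = flow["Inflow"] - flow["Outflow"]
    # Every station has at least one arrival or departure, so the total is never zero
    flow["Imbalance"] = flow["Net_Flow"] / (flow["Inflow"] + flow["Outflow"])
    return flow.sort_values("Imbalance")

flow = station_imbalance(trips)
print(flow)  # most bike-draining stations first
```

Stations at the top of this table lose bikes fastest relative to their traffic, while those at the bottom are most likely to run out of empty docks.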

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
df_numeric = df_cleaned.select_dtypes(include=['number'])

# Compute the correlation matrix for numeric columns only
correlation_matrix = df_numeric.corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 6))

# Create a heatmap with annotations
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Title for the plot
plt.title('Correlation Matrix (Numeric Variables)')

# Display the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

* The correlation matrix (heatmap) was chosen because it is the best visual tool to quickly identify relationships (positive or negative) between multiple numeric variables at once.
* It simplifies complex data into a visual pattern where strong, weak, or no relationships can be spotted instantly using color coding (red for positive, blue for negative).

##### 2. What is/are the insight(s) found from the chart?

* Start and end station latitudes are highly positively correlated (~0.99), meaning a trip that starts at a given latitude almost always ends at a similar latitude. This indicates that trips are short relative to the overall extent of the system, not that riders mainly move north-south.
* Start and end station longitudes are also highly positively correlated (~0.99). Together with the latitude result, this shows that trips are geographically localized.
* Trip duration seconds and trip duration minutes have a perfect correlation (1.0), which is expected because one is just a unit conversion of the other.
* Start hour and hour of day have a perfect correlation (1.0) too, since they represent the same thing.
* Station IDs have moderate correlations (~0.59–0.63) with station coordinates, which suggests that IDs were assigned roughly by area.
* Member birth year has almost no linear correlation with trip duration or timing, indicating that age is not a strong linear predictor of trip length or trip start time.
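
Perfectly correlated columns, such as the two duration columns noted above, carry no extra information and can be dropped before plotting. The sketch below lists such pairs from the upper triangle of the correlation matrix. The helper name `redundant_pairs`, the 0.999 threshold, and the sample column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic sample standing in for df_cleaned (illustrative values only)
sample = pd.DataFrame({
    "duration_sec": [300, 600, 900, 1500],
    "duration_min": [5.0, 10.0, 15.0, 25.0],
    "member_birth_year": [1990, 1985, 1992, 1979],
})

def redundant_pairs(df: pd.DataFrame, threshold: float = 0.999) -> list:
    """Return column pairs whose absolute correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any remain) compare as False and are filtered out here
    return pairs[pairs >= threshold].index.tolist()

print(redundant_pairs(sample))
```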

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Optimize Bike Distribution During Peak Hours
* Since the analysis shows high usage during morning and evening commute times (around 8 AM and 5 PM), ensure that more bikes are available at popular starting stations during these periods to meet the commuter demand.

Expand Stations in High-Demand Areas
* Stations near transit hubs (Caltrain, Ferry Building, BART) and major streets are used the most. Adding more stations and increasing bike inventory in these high-demand areas can boost customer satisfaction and reduce wait times.

Launch Targeted Marketing Campaigns

* Focus marketing efforts on the 25–40 age group since they form the core user base.
* Encourage casual users (customers) to become subscribers through promotional offers, especially targeting weekend riders who tend to use the service for leisure.

Create Special Weekend Promotions
* Since customers (casual riders) are more active on weekends, launching weekend passes or group ride discounts could help convert more recreational users into regular riders or subscribers.
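
The weekend pattern behind this recommendation can be summarized as the share of each user type's trips that fall on a Saturday or Sunday. This is a minimal sketch on synthetic data; it assumes `user_type` and `start_time` columns, and the helper name `weekend_share` is hypothetical.

```python
import pandas as pd

# Synthetic sample standing in for df_cleaned (illustrative values only)
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 4,
    "start_time": pd.to_datetime([
        "2019-02-04 08:10", "2019-02-05 17:05", "2019-02-02 11:00", "2019-02-11 08:30",
        "2019-02-02 13:00", "2019-02-03 14:30", "2019-02-06 12:15", "2019-02-10 15:45",
    ]),
})

def weekend_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of each user type's trips that start on a weekend."""
    is_weekend = df["start_time"].dt.dayofweek >= 5  # Monday=0, so Saturday=5, Sunday=6
    return is_weekend.groupby(df["user_type"]).mean()

share = weekend_share(trips)
print(share)
```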

Personalize Communication by User Type
* Use different messaging for subscribers (emphasizing convenience, time savings) versus customers (highlighting fun, recreation, and leisure).

Improve Service for Female Riders
* Given the lower participation of female riders, Ford GoBike could investigate and address any barriers (like safety concerns or station placement) and run campaigns encouraging more women to use the service.

Introduce Loyalty and Referral Programs
* Reward frequent riders (especially subscribers) with loyalty points or perks. Offer referral bonuses to tap into users' social networks and grow the customer base.

Data-Driven Expansion Planning
* Continue analyzing trip and station data regularly to identify emerging high-demand areas and optimize future expansion or reallocation of bikes.

Enhance Mobile App Features
* Real-time bike availability updates.
* Easy route suggestions during peak times.
* Push notifications for promotions and service updates based on user riding patterns.

Educate Users on System Usage
* Offering tutorials or tips on how to best use the system (especially to new customers) can improve the user experience and encourage repeat usage.

# **Conclusion**

The analysis of the Ford GoBike System data provided insights into user behavior, trip patterns, and service utilization. Subscribers, most of them working-age adults aged 25–40, dominate the user base and ride mainly during weekday commuting hours. Customers, on the other hand, are more active on weekends and tend to take longer trips, suggesting more recreational use.

Key findings highlighted the importance of time-based demand, station popularity, and demographic factors like age and gender in influencing bike usage patterns. This suggests that operational strategies, such as optimizing bike distribution during peak hours and expanding stations in high-demand locations, could significantly enhance service efficiency and customer satisfaction.

Additionally, the project uncovered opportunities for growth through targeted marketing, special weekend promotions, loyalty programs, and improved service offerings tailored to different user groups. Addressing barriers for underrepresented groups, like female riders, could further expand the user base.

Overall, this analysis demonstrates how data-driven insights can support Ford GoBike in improving operational performance, increasing customer engagement, and growing ridership, ultimately contributing to a more sustainable and user-friendly urban transportation system.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***