# **Project Name**    - Ford GoBike Share - User Behavior Analysis





##### **Project Type**    - EDA

##### **Contribution**    - Individual

# **Project Summary -**


The Ford GoBike sharing program, a popular urban transportation system in the Bay Area, generates rich data about user rides, station utilization, trip durations, and more. The purpose of this project is to perform an in-depth **Exploratory Data Analysis (EDA)** to uncover hidden patterns, behaviors, and trends from the trip data collected in January 2018.

This analysis aims to study users' travel habits, identify peak usage hours, explore differences between casual riders and subscribers, and understand how trip characteristics vary by user type, gender, or day of the week. Insights gained will support strategic improvements to services, such as optimizing bike station placements, targeted marketing, and enhancing customer satisfaction.

Starting with rigorous data cleaning, the analysis will involve feature engineering, such as extracting the day of the week, trip duration, and time of day. Various visualizations like histograms, bar plots, and heatmaps will be utilized to explore the data. Moreover, basic classification modeling could optionally predict user types based on trip features, offering deeper strategic insights.

The final deliverables will include actionable recommendations for operational improvements, marketing strategies, and service optimization for Ford GoBike. This project demonstrates the power of data-driven decision-making by transforming raw trip logs into valuable business intelligence.


# **GitHub Link -**

https://github.com/Runal21/Ford-GoBike-Share-User-Behavior-Analysis/

# **Problem Statement**



The Ford GoBike company wants to better understand **user behavior patterns** and **optimize bike-sharing operations** based on available trip data. However, the raw data is large, noisy, and complex, making it challenging to extract meaningful insights manually.

**Primary Problem:**  
How can Ford GoBike use historical trip data to:
- Identify key user trends and behaviors?
- Improve operational decisions (e.g., bike availability and station management)?
- Tailor marketing strategies toward different customer types?

#### **Define Your Business Objective?**


The main business objectives for this project are:
- Perform detailed **EDA** to uncover patterns in user behavior.
- **Segment users** based on their trip habits (e.g., subscribers vs customers).
- **Identify peak usage times** and **predict service demand**.
- Recommend **strategies for improving service operations**, customer experience, and marketing campaigns.
- Optionally, build a **classification model** to predict user type based on trip features to help with **personalized promotions**.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset

df = pd.read_excel('201801-fordgobike-tripdata.xlsx')
df
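
The guidelines above ask for exception handling. A minimal sketch of a more defensive loader follows, assuming the trip file may arrive as either Excel or CSV. `load_trips` is a hypothetical helper name and is not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_trips(path):
    """Load a trip file into a DataFrame, choosing the reader by file extension.

    Hypothetical helper for illustration; the notebook itself calls pd.read_excel directly.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Trip data file not found: {path}")

    suffix = path.suffix.lower()
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path)
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

With this helper, the load step could be written as `df = load_trips('201801-fordgobike-tripdata.xlsx')`, which fails early with a clear message if the file is missing.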

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print(f"Dataset contains {df.shape[0]} Rows and {df.shape[1]} Columns")

### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicates = df.duplicated().sum()
print(f"Total Duplicate Rows in dataset: {duplicates}")

# Dropping duplicates if any
if duplicates > 0:
    df = df.drop_duplicates()
    print("Duplicates Dropped Successfully")
else:
    print("No Duplicates Found")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

missing_values = df.isnull().sum()
missing_values[missing_values > 0]
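
Raw counts are easier to judge next to percentages. The sketch below shows one way to report both; `missing_summary` is a hypothetical helper, and the small `demo` frame is synthetic data used only for illustration.

```python
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Tiny synthetic frame for illustration only (not the real trip data)
demo = pd.DataFrame({
    "user_type": ["Subscriber", "Customer", "Subscriber", "Customer"],
    "member_gender": ["Male", None, "Female", None],
    "member_birth_year": [1985.0, None, 1990.0, 1978.0],
})
print(missing_summary(demo))
```

On the real data the call would be `missing_summary(df)`.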

In [None]:
# Visualizing the missing values

# Heatmap of missing values
plt.figure(figsize=(12,6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Dataset')
plt.show()

### What did you know about your dataset?

The dataset provides a detailed view of user rides, including time, location, and demographic information.

It includes a moderate number of missing values, mainly in user-related fields.

Trip start and end times will be essential for feature engineering.

User Type classification ('Subscriber' or 'Customer') will be crucial for segmenting behaviors.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print("Dataset Columns:")
print(df.columns.tolist())

In [None]:
# Dataset Describe

df.describe(include='all')

### Variables Description



| Variable Name | Description |
|:--------------|:------------|
| **duration_sec** | Trip duration in seconds |
| **start_time** | Start time of the trip |
| **end_time** | End time of the trip |
| **start_station_id** | ID of start station |
| **start_station_name** | Name of start station |
| **start_station_latitude** | Latitude coordinate of start station |
| **start_station_longitude** | Longitude coordinate of start station |
| **end_station_id** | ID of end station |
| **end_station_name** | Name of end station |
| **end_station_latitude** | Latitude coordinate of end station |
| **end_station_longitude** | Longitude coordinate of end station |
| **bike_id** | ID of the bike used |
| **user_type** | Type of user (Subscriber or Customer) |
| **member_birth_year** | Birth year of the user |
| **member_gender** | Gender of the user |
| **bike_share_for_all_trip** | Whether bike share program applied (yes/no) |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for col in df.columns:
    unique_vals = df[col].nunique()
    print(f"Column '{col}' has {unique_vals} unique value(s).")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# 1. Convert start_time and end_time to datetime
try:
    df['start_time'] = pd.to_datetime(df['start_time'])
    df['end_time'] = pd.to_datetime(df['end_time'])
    print("Datetime Conversion Successful ✅")
except Exception as e:
    print(f"Error in datetime conversion: {e}")
    raise  # The steps below need datetime columns, so stop here instead of continuing

# 2. Create new features
df['trip_duration_min'] = df['duration_sec'] / 60
df['start_hour'] = df['start_time'].dt.hour
df['start_day'] = df['start_time'].dt.day_name()
df['start_month'] = df['start_time'].dt.month

# 3. Handle missing values:
# For simplicity, dropping rows with missing gender or birth year
# (.copy() guards against a SettingWithCopyWarning when new columns are added below)
df = df.dropna(subset=['member_birth_year', 'member_gender']).copy()

# 4. Convert birth year to Age
df['member_age'] = 2018 - df['member_birth_year']

print("Feature Engineering Successful ✅")


### What all manipulations have you done and insights you found?


**Data Manipulations:**
- Converted 'start_time' and 'end_time' columns to proper datetime format.
- Created new columns: 'trip_duration_min', 'start_hour', 'start_day', 'start_month' for better analysis.
- Calculated 'member_age' from 'member_birth_year'.
- Dropped rows with missing 'gender' and 'birth year' to ensure complete demographic analysis.
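
For the datetime conversion step, a slightly more defensive variant is to coerce unparseable values to `NaT` and report how many there were. This is only a sketch under that assumption; `parse_datetimes` is a hypothetical helper and the `demo` rows are synthetic.

```python
import pandas as pd


def parse_datetimes(frame, columns=("start_time", "end_time")):
    """Convert the given columns to datetime; unparseable values become NaT."""
    out = frame.copy()
    bad_counts = {}
    for col in columns:
        was_present = out[col].notna()
        out[col] = pd.to_datetime(out[col], errors="coerce")
        # Values present before parsing but NaT afterwards failed to parse
        bad_counts[col] = int((was_present & out[col].isna()).sum())
    return out, bad_counts


# Synthetic rows for illustration only
demo = pd.DataFrame({
    "start_time": ["2018-01-31 22:52:35", "not a date"],
    "end_time": ["2018-02-01 19:47:19", "2018-01-31 23:10:00"],
})
parsed, bad = parse_datetimes(demo)
print(bad)  # {'start_time': 1, 'end_time': 0}
```

A non-zero count is a prompt to inspect those rows before deriving `start_hour`, `start_day`, or `start_month` from them.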

**Initial Insights Found:**
- Most users are **young adults** (~25–35 years).
- Trips are **mostly short**, with a median trip time between **8 and 15 minutes**.
- The majority of trips are made by **subscribers** rather than casual customers.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(8,6))
sns.countplot(data=df, x='user_type', palette='viridis')
plt.title('Distribution of User Types', fontsize=16)
plt.xlabel('User Type', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a Bar Chart because 'user_type' is a categorical variable. Bar charts are best suited to visualize counts or frequencies of different categories, making it easy to compare the number of Subscribers and Customers.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Subscribers significantly outnumber Customers.
This suggests that long-term membership is more popular than one-time casual usage among Ford GoBike users in January 2018.
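
To put numbers on "significantly outnumber", the shares can be computed directly. One caveat worth checking: the wrangling step dropped rows with missing gender or birth year, so if those gaps are concentrated in one user type, the shares after cleaning will differ from the raw file. The sketch below illustrates the effect on synthetic rows; `category_share` is a hypothetical helper, and whether the effect is material for the real data is not verified here.

```python
import pandas as pd


def category_share(frame, column):
    """Percentage of rows in each category of `column`, rounded to one decimal."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)


# Synthetic rows for illustration only; here most Customer rows lack demographics
raw = pd.DataFrame({
    "user_type": ["Subscriber"] * 6 + ["Customer"] * 4,
    "member_gender": ["Male"] * 6 + [None, None, None, "Female"],
})
cleaned = raw.dropna(subset=["member_gender"])

print(category_share(raw, "user_type"))      # shares before dropping rows
print(category_share(cleaned, "user_type"))  # shares after dropping rows
```

In this toy example the Subscriber share rises from 60.0% to 85.7% purely because of the row filter, which is why comparing both views can be useful.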

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

- **Focus on Subscribers**:  
  Since **Subscribers** form the majority, Ford GoBike can:
  - Introduce **loyalty programs**, **subscription renewal offers**, and **referral discounts** to further **retain and grow** this strong base.
  - Plan station and service optimizations tailored around **subscriber behavior** (peak hours, station usage, etc.).

- **Stable Revenue Stream**:  
  Subscribers usually bring in **recurring monthly/yearly revenue** which ensures more **predictable cash flow** for the business.

Negative Business Impact:

- **Neglect of Casual Users**:  
  Very few **Customers** (casual riders) suggests Ford GoBike might be **missing out on spontaneous, tourist-driven or event-driven revenue**.
  
- **Risk** if subscriber satisfaction drops:  
  Since the business heavily **depends** on subscribers, **any dissatisfaction among subscribers (e.g., bike unavailability, service issues)** could lead to major **losses** if they churn or don't renew.

- **Market Limitation**:  
  A heavy subscriber bias may mean the business **isn't expanding into the casual, one-time user market**, missing opportunities for growth, especially from tourists, visitors, or event-goers.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(8,6))
df['member_gender'].value_counts().plot.pie(autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Gender Distribution of Users', fontsize=16)
plt.ylabel('')  # Hide y-label for pie chart
plt.show()


##### 1. Why did you pick the specific chart?


I picked a **Pie Chart** because 'member_gender' is a **categorical variable with limited categories** (Male, Female, Other).  
Pie charts effectively show **percentage distribution** among categories, which makes it easier to **compare proportions visually**.

##### 2. What is/are the insight(s) found from the chart?


The chart shows that:
- **Male riders** dominate the Ford GoBike service (~74–76%).
- **Female riders** make up a much smaller portion (~20–22%).
- **Other/non-binary** users make up the small remaining share.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


✅ **Positive Business Impact:**
- Because **male riders account for most trips**, Ford GoBike can optimize services around their typical commuting patterns (e.g., work hours).
- Gender-specific marketing could be tailored to **reinforce loyalty among existing male users**.

⚠️ **Negative Business Impact:**
- A very **low female user base** indicates Ford GoBike may **not be appealing enough** to women.
- **Negative growth risk**: Ignoring female riders means **missing a huge untapped market**.
- Business should consider marketing campaigns, safety improvements, and women-focused promotions to **attract and grow** this segment.


#### Chart - 3

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(10,6))
sns.histplot(df['member_age'], bins=30, kde=True, color='skyblue')
plt.title('Age Distribution of Users', fontsize=16)
plt.xlabel('Age', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)  # each row in the dataset is one trip
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?


I picked a **Histogram with a KDE (Kernel Density Estimate)** because 'member_age' is a **continuous numerical variable**.  
Histograms are best for visualizing the **distribution of continuous data**, and KDE adds a **smooth curve** to show the overall trend better.


##### 2. What is/are the insight(s) found from the chart?


The chart reveals that:
- The **majority of riders are between 25 and 40 years old**.
- There's a **peak usage** around the ages of **30–35**.
- Very few users are below 20 or above 60 years old.
  
This indicates that Ford GoBike is most popular among **working-age young adults**.
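
The age ranges read off the histogram can be cross-checked by binning ages explicitly. This is a minimal sketch; `age_band_share` and its default bin edges are assumptions chosen to mirror the ranges discussed above, and `demo_ages` is synthetic.

```python
import pandas as pd


def age_band_share(ages, bins=(0, 20, 25, 40, 60, 120)):
    """Percentage of values in each age band.

    Bins are right-inclusive, so (25, 40] covers ages 26 through 40.
    """
    bands = pd.cut(ages, bins=list(bins))
    shares = bands.value_counts(normalize=True).sort_index()
    return (shares * 100).round(1)


# Synthetic ages for illustration only
demo_ages = pd.Series([19, 24, 27, 31, 33, 35, 38, 45, 52, 67])
print(age_band_share(demo_ages))
```

On the real data the call would be `age_band_share(df['member_age'])`. Because every row is a trip, the result is the share of trips per age band rather than the share of distinct riders.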

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


✅ **Positive Business Impact:**
- Since the 25–40 age group is the dominant user base, Ford GoBike can design **marketing campaigns**, **membership plans**, and **service optimizations** that are **aligned with the needs of young working professionals** (e.g., office commutes, flexible plans).

⚠️ **Negative Business Impact:**
- Very **low usage among teenagers (<20)** and **seniors (>60)** shows **potential market segments being missed**.
- Safety concerns, pricing issues, or lack of awareness might be barriers for these age groups. Targeted campaigns (e.g., student discounts or senior-friendly services) could expand the user base.


#### Chart - 4

In [None]:
# Chart - 4 visualization code

# To avoid extreme outliers skewing the plot, we'll limit trip duration to 60 minutes or less
plt.figure(figsize=(10,6))
sns.histplot(df[df['trip_duration_min'] <= 60]['trip_duration_min'], bins=30, kde=True, color='salmon')
plt.title('Trip Duration Distribution (Up to 60 Minutes)', fontsize=16)
plt.xlabel('Trip Duration (Minutes)', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?


I selected a **Histogram with a KDE** because 'trip_duration_min' is a **continuous numerical variable**.  
Histograms are ideal for understanding the **distribution of trip lengths**, and the KDE curve shows the **general pattern** more smoothly.  
I also restricted the view to trips of 60 minutes or less to avoid distortion from extreme outliers.


##### 2. What is/are the insight(s) found from the chart?


The chart shows that:
- Most trips are **very short**, commonly between **5 to 15 minutes**.
- There's a sharp peak around **8–10 minutes** trip length.
- Very long trips (>30 minutes) are extremely rare.

This suggests Ford GoBike is mainly used for **short, quick trips**, likely for **commuting or quick errands**.
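
Since the histogram hides everything above the 60-minute cut-off, it helps to state how much data that filter removes alongside the quartiles. The following sketch does both; `duration_profile` is a hypothetical helper and `demo_minutes` is synthetic.

```python
import pandas as pd


def duration_profile(minutes, cutoff=60):
    """Quartiles of trip duration plus the share of trips above the plotting cut-off."""
    return {
        "q25": minutes.quantile(0.25),
        "median": minutes.median(),
        "q75": minutes.quantile(0.75),
        "pct_above_cutoff": round((minutes > cutoff).mean() * 100, 1),
    }


# Synthetic durations (in minutes) for illustration only
demo_minutes = pd.Series([4, 6, 8, 9, 10, 12, 15, 22, 75, 300])
print(duration_profile(demo_minutes))
```

Calling `duration_profile(df['trip_duration_min'])` on the real data would give the figures needed to back up statements such as "most trips last 5 to 15 minutes".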


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


✅ **Positive Business Impact:**
- Service can be **optimized for short trips**, for example with quick check-out/check-in features, rapid availability, and bike placement suited to short-distance commuters.
- **Pricing strategies** (e.g., cheaper short rides, or free first 10 minutes offers) can be designed to **encourage more frequent use**.

⚠️ **Negative Business Impact:**
- Heavy focus on short trips may **limit revenue per user** if longer trips are rare.
- Need to **design incentives** (like "ride longer, pay less per minute") to **encourage longer rides** and increase per-ride revenue.

📝 **Summary for Chart 4:**

| Type of Impact | Insights |
|:--------------|:---------|
| Positive | Trip durations are short. Focus services on rapid check-in/out, optimize for commuter needs, design pricing for short rides. |
| Negative | Limited long-trip usage; need strategies to encourage longer trips and increase ride value. |


#### Chart - 5

In [None]:
# Chart - 5 visualization code

plt.figure(figsize=(8,6))
sns.countplot(data=df, x='bike_share_for_all_trip', palette='pastel')
plt.title('Bike Share for All Trip Usage', fontsize=16)
plt.xlabel('Bike Share Program (Yes/No)', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)  # each row in the dataset is one trip
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?


I picked a **Bar Chart** because 'bike_share_for_all_trip' is a **binary categorical variable** (Yes/No).  
Bar charts are perfect for comparing the **count between two categories** and seeing how many trips were taken under the bike-share-for-all program.


##### 2. What is/are the insight(s) found from the chart?


The chart shows that:
- A **vast majority** of trips were **not taken** under the bike-share-for-all program.
- Only a **small fraction** of trips were made by riders enrolled in the initiative.

This suggests that either **awareness is low**, **program benefits are unclear**, or users **don't find it attractive**.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



✅ **Positive Business Impact:**
- Opportunity exists to **increase participation** through **better marketing**, **added incentives**, or **clearer communication** of program benefits.
- Increasing participation could **build stronger community loyalty** and **long-term engagement**.

⚠️ **Negative Business Impact:**
- If the bike-share program **remains underused**, it could lead to **wasted investments** in infrastructure or subsidies.
- Indicates **poor program design or communication** — needs revisiting or improvement.


#### Chart - 6

In [None]:
# Chart - 6 visualization code


# Top 10 most popular start stations
top_start_stations = df['start_station_name'].value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_start_stations.values, y=top_start_stations.index, palette='deep')
plt.title('Top 10 Start Stations by Number of Trips', fontsize=16)
plt.xlabel('Number of Trips', fontsize=14)
plt.ylabel('Start Station', fontsize=14)
plt.grid(axis='x')
plt.show()


##### 1. Why did you pick the specific chart?


I picked a **Horizontal Bar Chart** because 'start_station_name' is a **categorical variable with many unique values**.  
Horizontal bar charts are ideal for displaying **top N categories** clearly, especially when category names are long.


##### 2. What is/are the insight(s) found from the chart?


The chart shows that:
- A **few stations** (especially near **downtown** or **business hubs**) dominate the start of trips.
- Some stations have **very high demand** compared to others.

This indicates the **geographical hot-spots** where bike usage starts frequently.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



✅ **Positive Business Impact:**
- **Resource Allocation**:  
  Ford GoBike can **stock more bikes** at these top stations during peak hours.
- **Operational Efficiency**:  
  Better station management can **reduce wait times** and **improve user experience**.
- **Marketing**:  
  Promotions and advertising boards can be placed at these high-traffic stations.

⚠️ **Negative Business Impact:**
- **Overcrowding Risk**:  
  If too many users gather at the same few stations without enough bikes, it can cause **negative customer experiences**.
- Need **station balancing** strategies (e.g., bike redistribution trucks, incentives for ending trips elsewhere).


#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Top 10 most popular end stations
top_end_stations = df['end_station_name'].value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_end_stations.values, y=top_end_stations.index, palette='muted')
plt.title('Top 10 End Stations by Number of Trips', fontsize=16)
plt.xlabel('Number of Trips', fontsize=14)
plt.ylabel('End Station', fontsize=14)
plt.grid(axis='x')
plt.show()



##### 1. Why did you pick the specific chart?


I chose a **Horizontal Bar Chart** again because 'end_station_name' is a **categorical variable with many values**.  
Horizontal bar charts help easily read station names and **quickly compare** the top end destinations for trips.



##### 2. What is/are the insight(s) found from the chart?


The chart reveals:
- **Certain stations are highly popular destinations**, particularly **business districts** or **transit hubs**.
- End station popularity slightly differs from start station popularity, suggesting **different commuter patterns** (e.g., riding toward office areas or train stations).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


✅ **Positive Business Impact:**
- **Plan Better Bike Balancing**:  
  Knowing the top end stations allows Ford GoBike to **predict where bikes will accumulate** and plan **redistribution trucks** accordingly.
- **Improve Station Design**:  
  Install **larger docks** at busy end stations to avoid “no parking space available” problems.

⚠️ **Negative Business Impact:**
- **Mismatch Risk**:  
  If end stations get too full and users can't dock bikes, it leads to **frustration**.
- **Dynamic station management** is needed: monitor real-time dock availability and redistribute bikes if necessary.
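
The start-station and end-station rankings can be combined into a single net-flow figure per station, which speaks directly to the rebalancing question. This is a sketch only: `station_net_flow` is a hypothetical helper, the station names in `demo` are made up, and the calculation ignores any bikes moved by operations staff.

```python
import pandas as pd


def station_net_flow(frame):
    """Arrivals minus departures per station; positive values mean bikes accumulate."""
    departures = frame["start_station_name"].value_counts()
    arrivals = frame["end_station_name"].value_counts()
    # Align both counts on the union of station names; absent stations count as 0
    flow = pd.DataFrame({"departures": departures, "arrivals": arrivals})
    flow = flow.fillna(0).astype(int)
    flow["net_flow"] = flow["arrivals"] - flow["departures"]
    return flow.sort_values("net_flow", ascending=False)


# Synthetic trips between made-up stations, for illustration only
demo = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "C", "A"],
    "end_station_name":   ["B", "C", "C", "C", "B"],
})
print(station_net_flow(demo))
```

Stations at the top of `station_net_flow(df)` would be candidates for bike pick-up, and those at the bottom for restocking.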


#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Ordering days properly
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

plt.figure(figsize=(10,6))
sns.countplot(data=df, x='start_day', order=day_order, palette='coolwarm')
plt.title('Number of Trips by Day of the Week', fontsize=16)
plt.xlabel('Day of the Week', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?


I selected a **Bar Chart** because 'start_day' is a **categorical variable** with 7 fixed categories (days of the week).  
A bar chart makes it **easy to compare** the number of trips across different days visually and recognize usage patterns.


##### 2. What is/are the insight(s) found from the chart?


The chart shows that:
- **Weekdays (Monday to Friday)** see **higher trip counts**.
- **Tuesday, Wednesday, and Thursday** seem to have the **highest usage**.
- **Weekends (Saturday and Sunday)** have relatively **lower trip counts**.

This pattern suggests that Ford GoBike is **mainly used for weekday commuting** rather than for leisure weekend rides.
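
One caution when comparing raw weekday totals within a single month: the weekdays do not all occur equally often. January 2018 began on a Monday, so Monday, Tuesday, and Wednesday each occur five times and the other days four times. Averaging trips per calendar day removes that imbalance. The sketch below shows the idea; `mean_trips_per_weekday` is a hypothetical helper and `demo_times` is synthetic.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def mean_trips_per_weekday(start_times):
    """Average number of trips per calendar date, grouped by day name.

    Dates with no trips at all do not appear in the data and are not counted.
    """
    times = pd.Series(pd.to_datetime(start_times))
    per_date = (
        times.dt.normalize()          # strip the time of day, keep the date
        .value_counts()               # trips on each calendar date
        .rename_axis("date")
        .reset_index(name="trips")
    )
    per_date["day"] = per_date["date"].dt.day_name()
    return per_date.groupby("day")["trips"].mean().reindex(DAY_ORDER)


# Synthetic start times for illustration only
demo_times = ["2018-01-01 08:00", "2018-01-01 09:00",
              "2018-01-08 08:00", "2018-01-06 10:00"]
profile = mean_trips_per_weekday(demo_times)
print(profile)
```

On the real data, `mean_trips_per_weekday(df['start_time'])` gives a fairer weekday-versus-weekend comparison than the raw counts.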


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



✅ **Positive Business Impact:**
- **Focus on Commuter Services**:  
  Ford GoBike can enhance services during weekdays, especially during **rush hours** (morning and evening).
- **Marketing Promotions**:  
  Can offer **weekday commuter passes** or special corporate plans targeted at office workers.

⚠️ **Negative Business Impact:**
- **Underutilization on Weekends**:  
  Lower weekend usage shows an **untapped potential**.  
  Need **leisure-based promotions** (e.g., weekend ride discounts, family plans) to boost weekend ridership.

#### Chart - 9

In [None]:
# Chart - 9 visualization code


plt.figure(figsize=(10,6))
sns.countplot(data=df, x='start_hour', palette='mako')
plt.title('Number of Trips by Hour of the Day', fontsize=16)
plt.xlabel('Hour of the Day (0-23)', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?


I picked a **Bar Chart** because 'start_hour' is a **discrete numerical variable** (hours 0 to 23) that can be treated as a set of categories.  
A bar chart helps us **easily spot peak hours** where trip volumes are highest and compare across different times of day.


##### 2. What is/are the insight(s) found from the chart?


The chart clearly shows:
- Two major **peak periods**:
  - **Morning Rush**: around **8 AM**.
  - **Evening Rush**: around **5–6 PM**.
- Very few trips happen during **late-night/early-morning hours** (midnight to 5 AM).

This indicates the system is **mainly used for work/school commuting**.
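
The commuting interpretation can be probed by splitting the hourly counts into weekdays and weekends: commute peaks should appear mainly in the weekday column. This is a sketch of that check; `hourly_profile` is a hypothetical helper and `demo_times` is synthetic.

```python
import pandas as pd


def hourly_profile(start_times):
    """Trip counts per start hour, split into Weekday and Weekend columns."""
    times = pd.Series(pd.to_datetime(start_times))
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
    day_type = times.dt.dayofweek.map(lambda d: "Weekend" if d >= 5 else "Weekday")
    return pd.crosstab(times.dt.hour, day_type,
                       rownames=["start_hour"], colnames=["day_type"])


# Synthetic start times for illustration only
demo_times = ["2018-01-01 08:10", "2018-01-01 08:40", "2018-01-01 17:05",
              "2018-01-06 11:00", "2018-01-06 13:00"]
table = hourly_profile(demo_times)
print(table)
```

With the real data, `hourly_profile(df['start_time'])` can be plotted as two lines or bars to compare the weekday and weekend shapes.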


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


✅ **Positive Business Impact:**
- **Bike Stock Planning**:  
  Ford GoBike can **increase bike availability** around 7–9 AM and 4–7 PM to match commuter demand.
- **Dynamic Pricing Opportunity**:  
  Consider **surge pricing** during peak hours to **maximize revenue**.

⚠️ **Negative Business Impact:**
- **Low usage off-peak hours**:  
  Bikes remain underutilized outside peak periods (e.g., late night), leading to **inventory inefficiency**.
- Could introduce **discounts or promotions during off-peak times** to balance load.


#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Note: In this dataset, the month is January only, but writing general code for full-year datasets

plt.figure(figsize=(10,6))
sns.countplot(data=df, x='start_month', palette='crest')
plt.title('Number of Trips by Month', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?


I selected a **Bar Chart** because 'start_month' is a **discrete numerical variable** (1–12 representing months) that can be treated as a set of categories.  
A bar chart is ideal to **visually compare** the number of trips across months to detect any **seasonal trends**.


##### 2. What is/are the insight(s) found from the chart?


Since this dataset only includes **January 2018**, the chart shows trips only for **Month 1**.  
However, if a full-year dataset were available, this chart would reveal **seasonal peaks** (e.g., summer vs. winter ridership trends).

**Current Insight:**  
- All data corresponds to **January**; thus, no month-to-month variability is observed in this sample.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


✅ **Positive Business Impact:**
- If using a full-year dataset: could help **forecast high-demand months** and **optimize bike availability** accordingly.
- January data alone helps understand **winter ridership patterns**, especially for off-season adjustments.

⚠️ **Negative Business Impact:**
- Having only one month limits our ability to **predict seasonal trends** properly from this dataset.
- Without complete seasonal data, planning for peak summer or fall rides could be challenging.


#### Chart - 11

In [None]:
# Chart - 11 visualization code

# To avoid extreme outliers, limit trip durations up to 60 min
plt.figure(figsize=(10,6))
sns.boxplot(data=df[df['trip_duration_min'] <= 60], x='user_type', y='trip_duration_min', palette='Set2')
plt.title('Trip Duration (minutes) by User Type', fontsize=16)
plt.xlabel('User Type', fontsize=14)
plt.ylabel('Trip Duration (Minutes)', fontsize=14)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?


I picked a **Box Plot** because we are comparing **distribution of a continuous numerical variable** (trip_duration_min) across a **categorical variable** (user_type).  
Box plots help show the **median**, **quartiles**, and **outliers**, making it perfect for spotting differences between Subscribers and Customers.


##### 2. What is/are the insight(s) found from the chart?


The boxplot reveals:
- **Customers** typically take **longer trips** than **Subscribers**, as shown by their higher median.
- **Subscribers** have a **more consistent, shorter trip duration** centered around 8–12 minutes.
- **Customers** show greater **variability** with some **very long trips**.

This indicates that **Subscribers are regular short-trip commuters**, while **Customers may be tourists or leisure riders**.
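
The box plot comparison can be backed by a small table of medians, means, and interquartile ranges per user type. The sketch below computes these on unfiltered durations, so long trips hidden by the 60-minute plot limit still influence the mean; `duration_by_user_type` is a hypothetical helper and `demo` is synthetic.

```python
import pandas as pd


def duration_by_user_type(frame):
    """Count, median, mean, and interquartile range of trip duration per user type."""
    grouped = frame.groupby("user_type")["trip_duration_min"]
    stats = grouped.agg(["count", "median", "mean"])
    stats["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return stats


# Synthetic trips for illustration only
demo = pd.DataFrame({
    "user_type": ["Subscriber"] * 3 + ["Customer"] * 3,
    "trip_duration_min": [8, 10, 12, 10, 20, 60],
})
stats = duration_by_user_type(demo)
print(stats)
```

A gap between the mean and the median in one group is a sign that a few long trips are pulling the average up.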


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



✅ **Positive Business Impact:**
- **Custom Pricing Plans**:  
  Ford GoBike could offer **shorter-ride-focused subscriptions** for Subscribers and **long-ride packages** for casual Customers.
- **Service Tailoring**:  
  Subscriber-focused services should focus on **quick pickup/dropoff** efficiency, while Customer services can promote **scenic routes or leisure packages**.

⚠️ **Negative Business Impact:**
- **Price Mismatch Risk**:  
  Offering only one type of plan could **dissatisfy either customer group** (short riders vs long riders).
- Need to **design flexible, segmented offerings** to cater to both groups properly.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# Focus on trips up to 60 min and ages up to 80 to avoid extreme outliers
plt.figure(figsize=(10,6))
sns.scatterplot(data=df[(df['trip_duration_min'] <= 60) & (df['member_age'] <= 80)],
                x='member_age', y='trip_duration_min', alpha=0.5, color='purple')
plt.title('Age vs Trip Duration', fontsize=16)
plt.xlabel('Age', fontsize=14)
plt.ylabel('Trip Duration (Minutes)', fontsize=14)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?


I chose a **Scatter Plot** because we are comparing **two continuous numerical variables** (member_age and trip_duration_min).  
Scatter plots are ideal for identifying **relationships, patterns, and trends** between continuous variables.


##### 2. What is/are the insight(s) found from the chart?


The scatter plot shows:
- **Younger users (ages 20–40)** tend to have **shorter trip durations** clustered around 8–15 minutes.
- **Older users (ages > 50)** occasionally have **slightly longer trips**, but are less frequent overall.
- No strong correlation overall between age and trip duration.

Thus, **age does not dramatically affect trip length** for most users.
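
A scatter plot with many overlapping points makes weak relationships hard to judge by eye, so a correlation coefficient is a useful companion. The sketch below reports Pearson (linear) and Spearman (rank-based) coefficients; Spearman is computed as the Pearson correlation of the ranks to avoid extra dependencies. `age_duration_correlation` is a hypothetical helper and `demo` is synthetic.

```python
import pandas as pd


def age_duration_correlation(frame):
    """Pearson and Spearman correlation between rider age and trip duration."""
    pair = frame[["member_age", "trip_duration_min"]].dropna()
    pearson = pair["member_age"].corr(pair["trip_duration_min"])
    # Spearman's rho equals the Pearson correlation of the ranked values
    spearman = pair["member_age"].rank().corr(pair["trip_duration_min"].rank())
    return {"pearson": pearson, "spearman": spearman}


# Synthetic rows for illustration only: durations rise with age, but not linearly
demo = pd.DataFrame({
    "member_age": [20, 30, 40, 50],
    "trip_duration_min": [5, 10, 15, 100],
})
result = age_duration_correlation(demo)
print(result)
```

Values near zero for both coefficients on the real data would support the "no strong correlation" reading above.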


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ **Positive Business Impact:**
- Marketing and service offerings **do not need strict age-based segmentation** in terms of trip duration.
- Can **design generalized trip plans** that work across most age groups.

⚠️ **Negative Business Impact:**
- **Older demographic underrepresentation**: very few riders above age 60 — a **potential missed opportunity**.
- Business could create **senior-focused marketing campaigns or ergonomic bikes** to **attract older users**.


#### Chart - 13

In [None]:
# Chart - 13 visualization code

plt.figure(figsize=(8,6))
sns.countplot(data=df, x='member_gender', palette='Set3')
plt.title('Trip Count by Gender', fontsize=16)
plt.xlabel('Gender', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?


I picked a **Bar Chart** because 'member_gender' is a **categorical variable**.  
Bar charts are the best way to **compare trip counts across different gender categories** visually and quickly.


##### 2. What is/are the insight(s) found from the chart?


The bar chart reveals:
- **Male users** take significantly more trips compared to **Female** and **Other** gender categories.
- **Female riders** account for far fewer trips.
- **Other/non-binary users** have extremely low trip counts.

This shows a **strong gender imbalance** in the user base.
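
To put a number on the imbalance, the counts behind the bars can be converted to percentage shares. This is a minimal sketch assuming the `member_gender` column from the chart above. The `sample` frame is made-up stand-in data, so the printed shares are illustrative and not results from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; replace `sample` with the notebook's `df`.
sample = pd.DataFrame({
    'member_gender': ['Male'] * 7 + ['Female'] * 2 + ['Other'],
})

# normalize=True returns fractions of all non-missing rows; scale to percent.
gender_share = (
    sample['member_gender']
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(gender_share)
```

Note that `value_counts` drops missing values by default, so the shares describe only trips where gender was recorded.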


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.


✅ **Positive Business Impact:**
- Ford GoBike can continue to **strengthen services** that appeal to male users who are already the core group.
- Opportunity to **create male-focused campaigns** around work commuting or fitness.

⚠️ **Negative Business Impact:**
- **Huge opportunity being missed** by not attracting more female riders.
- Need **targeted strategies** to improve female participation, such as:
  - Ensuring **better safety**, **station location security**,
  - **Women-specific promotions** (e.g., women-only rides, promotional discounts).


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Selecting only numerical columns for correlation
numerical_features = df[['duration_sec', 'trip_duration_min', 'start_hour', 'start_month', 'member_birth_year', 'member_age']]

plt.figure(figsize=(12,8))
sns.heatmap(numerical_features.corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features', fontsize=18)
plt.show()


##### 1. Why did you pick the specific chart?


I picked a **Heatmap** because we are analyzing **correlation relationships between multiple continuous variables**.  
Heatmaps **visually display the strength and direction** of relationships through colors and numerical values, making patterns immediately clear.



##### 2. What is/are the insight(s) found from the chart?


The heatmap reveals:
- **duration_sec** and **trip_duration_min** are **perfectly positively correlated** (1.00), as expected, because trip_duration_min is just duration_sec divided by 60.
- **member_birth_year** and **member_age** are **perfectly negatively correlated** (-1.00), which is also expected, because age is computed from birth year: a later birth year always means a lower age.
- **start_hour** has only **weak or no linear correlation** with the other features, including trip duration.

Thus, apart from the pairs that are mathematically related, the variables show **little linear relationship** with one another. A correlation near zero rules out only a linear trend, not every form of dependence.
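
Pairs that are mathematically related can also be listed programmatically, which helps when deciding which redundant column to drop before modeling. The sketch below is one way to do it. The helper name `redundant_pairs`, the 0.95 threshold, and the reference year 2018 used to derive age are assumptions for illustration, and `sample` is made-up data.

```python
import numpy as np
import pandas as pd

def redundant_pairs(frame: pd.DataFrame, threshold: float = 0.95) -> pd.Series:
    """Return column pairs whose absolute correlation is at or above threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold]

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'duration_sec': [300, 600, 900, 1200, 480],
    'start_hour': [8, 17, 12, 9, 18],
    'member_birth_year': [1990, 1985, 1975, 1995, 1960],
})
sample['trip_duration_min'] = sample['duration_sec'] / 60
sample['member_age'] = 2018 - sample['member_birth_year']  # assumed reference year

found = redundant_pairs(sample)
print(found)
```

On this stand-in data the function reports exactly the two derived pairs (duration in seconds vs minutes, birth year vs age), mirroring what the heatmap shows.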


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# Using a subset of important numerical features for clarity
selected_features = df[['trip_duration_min', 'start_hour', 'member_age']]

# Pairplot
sns.pairplot(selected_features, diag_kind='kde', plot_kws={'alpha':0.5})
plt.suptitle('Pair Plot of Trip Duration, Start Hour, and Member Age', y=1.02, fontsize=18)
plt.show()


##### 1. Why did you pick the specific chart?


I picked a **Pair Plot** because it allows us to **visualize pairwise relationships between multiple numerical variables** at once, along with their **distributions**.  
It’s an efficient way to detect **linear/non-linear relationships, clusters, and patterns** between multiple variables simultaneously.


##### 2. What is/are the insight(s) found from the chart?


The pairplot shows:
- **Trip Duration vs Start Hour**:  
  No strong relationship — trip length doesn’t depend heavily on time of day.
- **Trip Duration vs Member Age**:  
  A slight trend where **older users sometimes take longer trips**, but overall weak.
- **Start Hour vs Member Age**:  
  No clear pattern; people of all ages use bikes throughout the day.

Thus, trip duration and start hour show **little dependence on member age**, the only demographic variable in this plot.


#### Chart - 16

In [None]:
# Chart - 16 visualization code

plt.figure(figsize=(12,6))
sns.countplot(data=df, x='start_hour', hue='user_type', palette='Spectral')
plt.title('Start Hour vs User Type', fontsize=16)
plt.xlabel('Hour of the Day (0-23)', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)
plt.legend(title='User Type')
plt.grid(axis='y')
plt.show()



##### 1. Why did you pick the specific chart?



I picked a **Count Plot with hue** because it shows trip counts split by two grouping variables at once: 'start_hour' (a numeric hour treated as 24 discrete categories) and 'user_type'.  
It makes it easy to see how **the mix of user types changes across the hours of the day**.


##### 2. What is/are the insight(s) found from the chart?



The chart reveals:
- **Subscribers** dominate during **peak commuting hours** (around 8 AM and 5–6 PM).
- **Customers** are **more evenly distributed** across daytime hours, especially between **10 AM to 5 PM**.
- **Subscribers** clearly use bikes for **work commutes**, while **Customers** (casual riders) use them during more relaxed daytime hours.
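
Because Subscribers far outnumber Customers, raw counts can hide the shape of the Customer curve. One way to compare the two profiles is to normalize within each user type. The sketch below assumes the `start_hour` and `user_type` columns from the chart above, and `sample` is made-up data.

```python
import pandas as pd

# Made-up rows for illustration only; replace `sample` with the notebook's `df`.
sample = pd.DataFrame({
    'start_hour': [8, 8, 8, 17, 17, 12, 12, 13, 13, 14],
    'user_type': ['Subscriber'] * 6 + ['Customer'] * 4,
})

# normalize='columns' makes each user type's column sum to 1, so the
# values are "share of that user type's trips starting in this hour".
hourly_share = pd.crosstab(
    sample['start_hour'], sample['user_type'], normalize='columns'
)

# Hour with the largest share for each user type.
peak_hour = hourly_share.idxmax()
print(hourly_share.round(2))
print(peak_hour)
```

Plotting `hourly_share` as two lines would put both user types on the same scale, which makes a commute-shaped profile and a midday-shaped profile directly comparable.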



##### 3. Will the gained insights help create a positive business impact?

✅ **Positive Business Impact:**
- Ford GoBike can **optimize bike availability** differently for Subscribers vs Customers:
  - More bikes available for **Subscribers** early morning and late afternoon.
  - Ensure bike availability for **Customers** during midday.
- **Dynamic Pricing Opportunities**:  
  Offer discounts to Customers during non-commute hours to balance bike usage.

⚠️ **Negative Business Impact:**
- Failure to differentiate could cause **bike shortages** during peak commuter hours, leading to **customer dissatisfaction**.
- Need **dynamic inventory balancing** between early morning/evening peaks and midday casual use.


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.



**Suggested Solutions:**

To help **Ford GoBike** achieve its business objectives (better understanding user behavior, optimizing operations, and improving marketing strategies), I suggest the following:

- **User Segmentation and Personalization:**  
  Use the EDA findings to classify users (e.g., subscribers vs. casual customers) and develop personalized marketing strategies. Subscribers can be rewarded with loyalty programs, while casual riders can be targeted with special offers to encourage membership upgrades.

- **Peak Time Resource Optimization:**  
  Based on the analysis of peak usage hours and days, Ford GoBike should reallocate bike availability dynamically. More bikes should be deployed at high-demand stations during morning and evening rush hours, especially on weekdays.

- **Station Management:**  
  Identify high-traffic stations and ensure they are well-stocked. Use predictive models to forecast station-level demand and relocate bikes proactively.

- **Trip Duration Insights:**  
  Analyze and promote short-duration trips among casual users to maximize turnover and bike availability. Long-duration trips should be discouraged unless part of premium plans.

- **Gender and Demographic Targeted Campaigns:**  
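  Since trip counts differ sharply by gender and riders over 60 are underrepresented, marketing campaigns should target the underserved groups (e.g., safety-focused outreach and promotions for women, senior-focused offers) alongside broader offers such as student discounts and promotions for corporate subscriptions.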

- **Predictive Analytics Implementation:**  
  Deploy a classification model to predict user types and likely behaviors. This can help pre-empt service needs and provide customized recommendations or offers via the app.
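
Before deploying any classifier for user type, it helps to know what simple baselines already achieve, since a model is only useful if it beats them. The sketch below compares a majority-class guess with a rule based on the commute pattern seen in Chart 16. The commute window in `COMMUTE_HOURS`, the helper name `baseline_accuracies`, and the `sample` rows are all assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# Assumed commute window, loosely based on the 8 AM and 5-6 PM peaks.
COMMUTE_HOURS = {7, 8, 9, 16, 17, 18}

def baseline_accuracies(frame: pd.DataFrame) -> dict:
    """Accuracy of two naive user-type predictors."""
    actual = frame['user_type']
    # Baseline 1: always predict the most common user type.
    majority_label = actual.mode().iloc[0]
    majority_acc = (actual == majority_label).mean()
    # Baseline 2: commute-hour trips are Subscribers, all others Customers.
    rule_pred = frame['start_hour'].map(
        lambda hour: 'Subscriber' if hour in COMMUTE_HOURS else 'Customer'
    )
    rule_acc = (rule_pred == actual).mean()
    return {'majority': float(majority_acc), 'commute_rule': float(rule_acc)}

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'start_hour': [8, 8, 17, 18, 9, 12, 13, 11, 14, 8],
    'user_type': ['Subscriber'] * 6 + ['Customer'] * 4,
})
acc = baseline_accuracies(sample)
print(acc)
```

On real data the majority baseline is likely to be hard to beat on accuracy alone because Subscribers dominate, so class-aware metrics such as per-class recall would be worth reporting alongside it.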


# **Conclusion**


In conclusion, by leveraging detailed exploratory data analysis and predictive modeling, Ford GoBike can gain valuable insights into user behavior and operational challenges. These insights should be translated into actionable strategies — optimizing bike availability, personalizing customer engagement, improving marketing efforts, and refining station management. Doing so will not only enhance customer satisfaction but also improve operational efficiency and drive business growth. Data-driven decision-making will be critical for Ford GoBike's continued success in the competitive urban mobility market.
