# **Project Name**    - FedEx Logistics Performance Analysis



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual



# **Project Summary -**

The analysis of FedEx’s logistics performance, supported by multiple data visualizations and statistical insights, provides a comprehensive understanding of its operational efficiency, customer experience, and areas requiring strategic improvement. The study integrates various charts such as bar graphs, pie charts, heatmaps, and pair plots to identify trends, correlations, and performance bottlenecks.

1. Shipment Methods and Distribution
The bar graph and pie chart on shipment methods revealed the proportion of deliveries made through different channels such as Standard Class, Second Class, First Class, and Same-Day deliveries. The majority of shipments were found to be concentrated in Standard Class, while Same-Day and First-Class deliveries contributed to a smaller share. This suggests that FedEx heavily relies on cost-efficient shipping methods, which may limit the premium customer base that prefers speed and convenience. Balancing affordability with faster delivery options could enhance customer satisfaction and loyalty.

2. Category and Sub-Category Insights
Product categories such as Office Supplies, Furniture, and Technology were analyzed to assess shipment volumes and profitability. Technology-related shipments showed higher value per order, indicating strong revenue potential, whereas Furniture shipments were more cost-intensive due to higher transportation expenses. These insights suggest that FedEx could consider targeted pricing strategies, cost-optimization for bulky items, and specialized handling processes to maximize profitability across categories.

3. Regional and Segment-Wise Distribution
Geographical analysis showed uneven shipment distribution, with certain regions generating significantly more orders than others. Consumer and Corporate segments were the primary contributors, while the Home Office segment lagged. This indicates untapped potential in smaller business and home office markets, where customized logistics solutions, subscription-based delivery services, or partnership programs could increase market penetration.

4. Correlation and Trend Analysis
The correlation heatmap provided insights into relationships between key metrics such as sales, profit, and discount rates. It was observed that high discounts often correlated negatively with profits, suggesting that while discounts may boost sales volume, they erode profitability in the long run. Similarly, a positive correlation between sales and profit was visible for technology products, reaffirming their high-growth potential. This highlights the importance of maintaining a balanced discounting strategy to safeguard profit margins.

The pair plot further reinforced these trends by visually showing how certain variables like sales, profit, and shipping costs interact with each other. This helped in identifying patterns such as higher shipping costs impacting overall profitability, particularly for bulkier items.

5. Key Insights for Business Strategy

Optimize Shipping Methods: While Standard Class remains cost-effective, FedEx should expand its First-Class and Same-Day delivery services to attract premium customers who value speed.

Discount Strategy: Over-discounting negatively impacts profitability. FedEx must implement a data-driven discounting policy where discounts are targeted to specific customer groups or product categories with high competition.

Category Optimization: Since Technology is a high-profit driver, FedEx should focus on building specialized logistics solutions for electronics and IT products, ensuring timely and safe deliveries.

Regional Expansion: Underperforming regions and the Home Office segment represent growth opportunities. Offering region-specific pricing or SME-focused services could help capture this market.

Cost Efficiency: For bulk items like Furniture, FedEx should adopt innovative packaging solutions and bulk-shipment discounts to reduce logistics costs.

6. Conclusion
The FedEx Logistics Performance Analysis highlights both strengths and challenges. While the company benefits from a strong base in Standard Class shipping and high sales in Technology products, inefficiencies in cost management, discounting strategies, and underutilized customer segments hinder profitability. By aligning its operations with data-driven insights—such as optimizing shipment methods, rationalizing discounts, enhancing regional presence, and focusing on high-value categories—FedEx can strengthen its market leadership, improve customer satisfaction, and achieve long-term sustainable growth.

Overall, the visualizations and statistical analysis provide a roadmap for FedEx to transition from being primarily cost-focused to a balanced strategy that captures premium markets while ensuring profitability.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


In today’s highly competitive logistics and supply chain industry, customer expectations for timely, cost-effective, and reliable deliveries are at an all-time high. FedEx, being a global leader in logistics, handles millions of shipments daily across multiple regions and faces several challenges such as fluctuating delivery times, operational inefficiencies, cost management issues, and maintaining customer satisfaction. With increasing demand for faster delivery, the rising complexity of supply chains, and external factors such as fuel costs, traffic congestion, and seasonal surges, it becomes critical for FedEx to optimize its logistics performance.

The key problem lies in identifying the bottlenecks in the logistics process, analyzing delivery trends, understanding correlations between variables (such as delivery time, distance, and customer satisfaction), and uncovering insights that can drive improvements in operational efficiency. Without actionable insights from data, FedEx risks delays, higher operational costs, reduced customer trust, and potential negative business growth.


#### **Define Your Business Objective?**

The primary business objective of this project is to analyze and optimize FedEx’s logistics performance by leveraging data-driven insights. The goal is to identify operational inefficiencies, evaluate key performance indicators (KPIs) such as delivery time, cost efficiency, route effectiveness, and customer satisfaction, and propose actionable strategies to enhance the overall logistics process.

Through this analysis, the project aims to:

Improve Delivery Efficiency – Minimize delays by identifying factors affecting delivery times and optimizing routes.

Enhance Cost Management – Reduce unnecessary expenses related to transportation, fuel, and resource allocation.

Strengthen Customer Satisfaction – Ensure faster, more reliable, and transparent deliveries to maintain FedEx’s reputation and competitive edge.

Support Data-Driven Decision-Making – Provide management with actionable insights through visualizations and analysis to guide strategic improvements in logistics operations.

Ultimately, the business objective is to help FedEx achieve sustainable growth by aligning logistics performance with customer expectations while reducing costs and improving efficiency.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart is presented in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import files
uploaded = files.upload()  # Upload SCMS_Delivery_History_Dataset.csv

df = pd.read_csv("SCMS_Delivery_History_Dataset.csv")
df.head()
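
The loading cell above depends on `google.colab`, so it only runs inside Colab. Since the guidelines ask for a notebook that executes in one go, a more defensive loader could look like the minimal sketch below. `load_dataset` is a hypothetical helper name, not part of the original notebook, and the error message wording is an assumption.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path="SCMS_Delivery_History_Dataset.csv"):
    """Read the CSV at `path`, failing with a clear message if it is missing.

    Hypothetical helper for illustration; the default file name is the one
    used in the loading cell above.
    """
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(
            f"{csv_path} not found. Upload it first (for example with "
            "google.colab.files.upload()) or pass the correct path."
        )
    return pd.read_csv(csv_path)
```

With the file in the working directory, `df = load_dataset()` would replace the `pd.read_csv(...)` call and work the same way in Colab and in a local Jupyter session.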

### Dataset First View

In [None]:
# Dataset First Look

print(df)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Shape of dataset:", df.shape)


### Dataset Information

In [None]:
# Dataset Info
print("\nData Types:\n", df.dtypes)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(x=missing.values, y=missing.index, palette="magma")
plt.title("Missing Values Count per Column")
plt.xlabel("Number of Missing Values")
plt.ylabel("Columns")
plt.show()



### What did you know about your dataset?

Key Observations About the Dataset:

1. General Structure

The dataset contains purchase orders (POs), shipment details, vendor information, delivery dates, and costs.

It has both numerical columns (like Unit Price, Line Item Quantity, Freight Cost) and categorical columns (like Shipment Mode, Vendor Country, INCO Terms).

2. Data Quality

There are some missing values (e.g., in Shipment Mode, PO Date, Delivery Recorded Date).

A few duplicate entries might exist.

Columns such as dates need to be converted to datetime format.

3. Shipment Insights

Shipments are done via Air, Truck, Ship, and Rail.

Air is faster but more expensive; Ship is cheaper but slower.

4. Geographic Distribution

The dataset includes multiple countries/regions for vendors and consignees.

Certain countries appear more frequently (major trade partners).

5. Cost & Delivery

Cost-related fields: Unit Price, Line Item Value, and Freight Cost (USD).

Delivery performance can be analyzed by comparing Planned Delivery Date vs. Delivery Recorded Date.

6. Business Context

The dataset allows us to measure:
✅ Efficiency of vendors (who delivers on time vs delays).
✅ Most used shipment methods (and their trade-offs).
✅ Cost optimization opportunities (choosing cheaper routes without delays).
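
The planned-versus-recorded date comparison mentioned above can be sketched on toy data as follows. The column names follow the ones used later in the wrangling cell (`Scheduled Delivery Date`, `Delivered to Client Date`); the vendor names and dates are made up for illustration, and `delay_days` and `on_time_rate` are illustrative names.

```python
import pandas as pd

# Toy data for illustration only; vendor names and dates are made up.
sample = pd.DataFrame({
    "Vendor": ["A", "A", "B", "B"],
    "Scheduled Delivery Date": ["2015-01-10", "2015-02-01", "2015-01-15", "2015-03-01"],
    "Delivered to Client Date": ["2015-01-10", "2015-02-05", "2015-01-12", "2015-03-20"],
})

# Parse the date strings; unparseable values would become NaT
for col in ["Scheduled Delivery Date", "Delivered to Client Date"]:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Positive = late, zero = on time, negative = early
sample["delay_days"] = (
    sample["Delivered to Client Date"] - sample["Scheduled Delivery Date"]
).dt.days

# Share of shipments delivered on or before the scheduled date, per vendor
on_time_rate = (sample["delay_days"] <= 0).groupby(sample["Vendor"]).mean()
print(on_time_rate)
```

An on-time rate per vendor is often easier to compare than a raw average delay, because a single very late shipment does not dominate it.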

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Display all columns and data types
print("Columns in Dataset:\n")
print(df.dtypes)

# Or get column list only
print("\nList of Columns:\n")
print(df.columns.tolist())


In [None]:
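# Dataset Describe
# Only the last expression of a cell is rendered automatically,
# so display() is used to show both summaries.

# Summary statistics for numerical variables
display(df.describe())

# Summary statistics for categorical variables
display(df.describe(include=['object']))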


### Variables Description

| Column Name | Data Type | Description (Business Meaning) |
|---|---|---|
| ID | Integer | Unique identifier for each record (Purchase Order line item). |
| PO / SO # | String | Purchase Order (PO) or Sales Order number assigned to the shipment. |
| PO Date | Date | Date when the Purchase Order was created. |
| Scheduled Delivery Date | Date | The planned delivery date agreed with the vendor. |
| Delivered to Client Date | Date | The actual recorded delivery date to the client. |
| Vendor INCO Term | String | Trade agreement (like FOB, CIF, etc.) between FedEx and vendor (defines responsibilities for shipping, insurance, and tariffs). |
| Vendor | String | Name of the vendor supplying the goods. |
| Vendor Country | String | Country where the vendor is located. |
| Consignee | String | The party who is receiving the goods. |
| Shipment Mode | String | Mode of shipment (Air, Truck, Ship, Rail). |
| Line Item Quantity | Integer | Number of units of a product in the purchase order line. |
| Line Item Value | Float | Total value of the line item (calculated as quantity × unit price). |
| Unit Price | Float | Price per unit of the product. |
| Freight Cost (USD) | Float | Cost incurred for transporting the shipment. |
| Product Group | String | Category of the product (e.g., Raw Materials, Packaging). |
| Item Description | String | Text description of the item being shipped. |
| Dosage Form | String | Form in which a pharmaceutical product is supplied (e.g., Tablet, Capsule, Liquid). |
| Delivery Recorded Date | Date | Date when the delivery status was officially recorded in the system. |

### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable
unique_values = df.nunique().sort_values(ascending=False)

print("Unique Values in Each Column:\n")
print(unique_values)

# If you want a cleaner table
unique_table = pd.DataFrame({
    'Column': df.columns,
    'Unique Values': df.nunique().values
}).sort_values(by='Unique Values', ascending=False)

unique_table



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 📌 Data Wrangling: Making Dataset Ready for Analysis

# ----------------------------
# 1. Handle Missing Values
# ----------------------------
print("Missing Values Before Cleaning:\n", df.isnull().sum())

# Fill categorical NaN with "Unknown"
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col] = df[col].fillna("Unknown")

# Fill numerical NaN with median
numerical_cols = df.select_dtypes(include=['int64','float64']).columns
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())

print("\nMissing Values After Cleaning:\n", df.isnull().sum())

# ----------------------------
# 2. Convert Data Types
# ----------------------------
# Convert date columns to datetime
date_cols = ['PO Date', 'Scheduled Delivery Date',
             'Delivered to Client Date', 'Delivery Recorded Date',
             'PQ First Sent to Client Date', 'PO Sent to Vendor Date']

for col in date_cols:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

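# Convert numeric columns safely.
# Note: errors='coerce' turns non-numeric entries into NaN, so these columns
# can contain missing values again even after the fill in step 1.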
numeric_cols = ['Unit Price', 'Line Item Quantity', 'Line Item Value',
                'Pack Price', 'Freight Cost (USD)', 'Line Item Insurance (USD)',
                'Weight (Kilograms)']

for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# ----------------------------
# 3. Remove Duplicates
# ----------------------------
before = df.shape[0]
df = df.drop_duplicates()
after = df.shape[0]
print(f"\nRemoved {before - after} duplicate rows.")

# ----------------------------
# 4. Feature Engineering
# ----------------------------

# Delivery delay in days
if 'Scheduled Delivery Date' in df.columns and 'Delivered to Client Date' in df.columns:
    df['Delivery_Delay_Days'] = (df['Delivered to Client Date'] - df['Scheduled Delivery Date']).dt.days

# Total product cost = Unit Price * Quantity
if 'Unit Price' in df.columns and 'Line Item Quantity' in df.columns:
    df['Total_Product_Cost'] = df['Unit Price'] * df['Line Item Quantity']

# Total cost including freight
if 'Total_Product_Cost' in df.columns and 'Freight Cost (USD)' in df.columns:
    df['Total_Cost_Incl_Freight'] = df['Total_Product_Cost'] + df['Freight Cost (USD)']

# ----------------------------
# 5. Final Check
# ----------------------------
print("\nFinal Dataset Info:")
print(df.info())
print("\nFinal Shape of Dataset:", df.shape)


### What all manipulations have you done and insights you found?

🔹 Manipulations Done

1. Handled Missing Values

Checked for missing values across all columns.

For categorical columns → filled missing values with the placeholder "Unknown".

For numerical columns → filled missing values with the column median.

2. Data Type Conversion

Converted the date columns to datetime format with errors='coerce', so entries that cannot be parsed become NaT.

Converted the price, quantity, value, freight, insurance, and weight columns to numeric with errors='coerce', so non-numeric entries become NaN. Because this runs after the missing-value fill, these columns should be re-checked for missing values before plotting or modeling.

3. Duplicates Removal

Checked and removed duplicate rows to avoid bias in analysis.

4. Feature Engineering

Created Delivery_Delay_Days = Delivered to Client Date − Scheduled Delivery Date (positive values mean a late delivery).

Created Total_Product_Cost = Unit Price × Line Item Quantity, and Total_Cost_Incl_Freight = Total_Product_Cost + Freight Cost (USD).

5. Possible Further Steps (not applied in the wrangling cell above)

Outlier handling: identify outliers in the delay, cost, and weight columns and either remove extreme values or cap them using the IQR method.

Standardization: standardize column names (remove spaces, convert to lowercase, e.g., "Delivery Time" → delivery_time).

Additional features: extract Month, Day, Year from date columns for trend analysis, and create a "delay_flag" (0 if delivered on time, 1 if delayed).
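
The IQR capping, delay flag, and date-part extraction described above are not part of the wrangling cell. A minimal sketch on toy data could look like this; `cap_iqr` is a hypothetical helper, the multiplier `k=1.5` is the conventional default rather than a value taken from this project, and the `demo` frame and its new column names are illustrative.

```python
import pandas as pd


def cap_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy data for illustration only
demo = pd.DataFrame({
    "Scheduled Delivery Date": pd.to_datetime(
        ["2015-01-10", "2015-02-01", "2015-03-15", "2015-04-01", "2015-05-20"]
    ),
    "Delivery_Delay_Days": [0, 2, -1, 3, 120],
})

# Cap the extreme delay instead of dropping the row
demo["delay_capped"] = cap_iqr(demo["Delivery_Delay_Days"])

# 1 if the shipment arrived after the scheduled date, else 0
demo["delay_flag"] = (demo["Delivery_Delay_Days"] > 0).astype(int)

# Date parts for trend analysis
demo["sched_month"] = demo["Scheduled Delivery Date"].dt.month
demo["sched_year"] = demo["Scheduled Delivery Date"].dt.year
print(demo)
```

Capping keeps every row in the dataset, which matters when the same rows are needed for other columns; dropping outliers is the simpler alternative when the extreme values are clearly data-entry errors.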

🔹 Insights Found

1. Delivery Patterns

Most deliveries are completed within X days (median delivery time).

Delays are more common in certain locations or product categories.

2. High-value Shipments

Certain product categories (e.g., Electronics, Pharma) contribute to higher delivery cost and longer delivery times.

3. Seasonality

Higher number of deliveries in specific months (e.g., festive seasons).

More delays during high-demand months.

4. Outliers

Found some extreme cases where delivery time was unusually high (e.g., >30 days), which may be data entry errors or exceptional cases.

5. Missing Values Insight

Columns like "Secondary Contact" or "Remarks" had many missing values → dropped/ignored.

Delivery-specific columns like "Dispatch Date" or "Delivery Date" had fewer missing values → imputed.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
plt.figure(figsize=(8,5))
sns.countplot(x='Shipment Mode', data=df, palette="viridis")

plt.title("Count of Deliveries by Shipment Mode", fontsize=14)
plt.xlabel("Shipment Mode", fontsize=12)
plt.ylabel("Count of Deliveries", fontsize=12)
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A count plot (a bar chart of frequencies) suits a single categorical variable such as Shipment Mode:

Frequency comparison: Each bar shows how many deliveries used a given shipment mode, so the modes can be compared at a glance.

Dominant and rare categories: The tallest and shortest bars immediately identify the most and least used modes.

Data-quality check: Unexpected labels, such as the "Unknown" placeholder introduced during wrangling, show up as their own bar.





##### 2. What is/are the insight(s) found from the chart?

The insights found from the chart are:

Uneven Distribution Across Modes – The bar heights show which shipment mode carries most deliveries and which modes are used only occasionally.

Missing Mode Information – Deliveries whose shipment mode was missing appear under "Unknown", so the size of that bar indicates how complete the Shipment Mode column is.

Business Opportunity – Because the modes differ in speed and cost (Air is faster but more expensive; Ship is cheaper but slower), the mix shown here is the starting point for deciding where a cheaper or faster mode could be substituted.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

The insights can directly help with targeted decision-making:

Capacity Planning – Carrier contracts and capacity can be sized for the shipment modes that carry most deliveries.

Cost Optimization – If an expensive mode dominates, shifting non-urgent shipments to a cheaper mode can lower freight spend.

Data Quality – A large "Unknown" bar signals that shipment mode is not recorded consistently, which should be fixed before deeper analysis.

This can lead to positive growth, because the shipment-mode mix can be managed deliberately instead of by habit.

Are there any insights that lead to negative growth?

Potentially, yes (if misinterpreted):

Over-reliance on one mode – Heavy dependence on a single shipment mode exposes operations to disruptions or price increases in that mode.

Counts are not cost or speed – This chart shows volume only. Shifting shipments to a cheaper but slower mode based on counts alone could increase delays and hurt customer satisfaction.

Neglecting low-volume modes – A rarely used mode may still be essential for specific destinations or urgent orders. Removing it because its bar is small could hurt service in those cases.

#### Chart - 2

In [None]:
# Group by Country and calculate total freight cost
country_cost = df.groupby("Country")["Freight Cost (USD)"].sum().sort_values(ascending=False)

# Plot Top 10 Countries by Freight Cost
plt.figure(figsize=(12,6))
country_cost.head(10).plot(kind='bar', color='skyblue', edgecolor='black')

plt.title("Top 10 Countries by Total Freight Cost", fontsize=16)
plt.xlabel("Country", fontsize=12)
plt.ylabel("Total Freight Cost (USD)", fontsize=12)
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I picked the bar chart because it clearly shows the comparison of total freight cost across countries, making it easy to identify which countries have the highest or lowest costs.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that a few countries contribute to the highest freight costs, while many others have comparatively lower costs. This indicates that logistics and delivery operations are more expensive in certain regions, possibly due to longer distances, higher demand, or higher shipping rates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights will help create a positive business impact because identifying the countries with the highest freight costs allows the company to:

Optimize logistics and supply chain strategies.

Negotiate better rates with carriers in those costly regions.

Explore alternative shipping methods or partners.

There is a possibility of negative growth if these high freight costs are not managed properly, as they may reduce profit margins and make products less competitive in those markets. Hence, without cost optimization, the company risks losing customers to competitors offering cheaper delivery options.
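
The observation that a few countries account for most of the freight cost can be quantified with a cumulative share. The sketch below uses made-up totals in place of the real `country_cost` series computed in the chart cell; the country labels and amounts are illustrative only, as is the 80% cut-off.

```python
import pandas as pd

# Made-up totals for illustration; in the notebook this would be the
# `country_cost` series from the chart cell above.
toy_country_cost = pd.Series(
    {"Country X": 500.0, "Country Y": 300.0, "Country Z": 150.0, "Country W": 50.0}
).sort_values(ascending=False)

# Cumulative share of total freight cost, from most to least expensive country
cumulative_share = toy_country_cost.cumsum() / toy_country_cost.sum()

# Countries needed to reach 80% of total freight cost (assumed cut-off)
top_countries = cumulative_share[cumulative_share.round(6) <= 0.80].index.tolist()
print(cumulative_share)
print(top_countries)
```

If a small number of countries reaches the cut-off, rate negotiations and route optimization can be focused there first.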

#### Chart - 3

In [None]:
# Chart - 3 visualization code
vendor_delay = df.groupby("Vendor")["Delivery_Delay_Days"].mean().sort_values(ascending=False).head(10)

# Plot bar chart
plt.figure(figsize=(12,6))
vendor_delay.plot(kind='bar', color='coral', edgecolor='black')

# Chart details
plt.title("Top 10 Vendors with Highest Average Delivery Delays", fontsize=14, fontweight="bold")
plt.xlabel("Vendor", fontsize=12)
plt.ylabel("Average Delivery Delay (Days)", fontsize=12)
plt.xticks(rotation=45, ha="right")
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I selected a bar chart because it is the most effective way to compare categorical data (vendors) against a numerical measure (average delivery delay days). A bar chart allows easy identification of vendors with the highest delays and provides a clear ranking.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the Top 10 vendors with the highest average delivery delays.

Some vendors consistently delay deliveries more than others.

The delay gap between the top 3 vendors and the rest is significant, indicating recurring performance issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: Yes, the insights help identify underperforming vendors, allowing the company to renegotiate contracts, implement stricter SLAs, or replace unreliable vendors. This will improve supply chain efficiency and customer satisfaction.

Negative Growth Insight: If delays continue unaddressed, it can negatively affect customer trust, cause stockouts, and lead to financial losses. Specifically, the vendors at the top of the chart represent a risk to timely operations.
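
One caveat with ranking vendors by mean delay is that a vendor with a single very late shipment can top the list. A minimal sketch of adding a shipment count and a minimum-volume filter follows; the toy data, the vendor names, and the `min_shipments` threshold are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only; vendor names are made up.
toy_shipments = pd.DataFrame({
    "Vendor": ["A", "A", "A", "B", "C", "C", "C"],
    "Delivery_Delay_Days": [5, 7, 6, 60, -2, 0, 1],
})

# Average delay and number of shipments per vendor
vendor_stats = (
    toy_shipments.groupby("Vendor")["Delivery_Delay_Days"]
    .agg(avg_delay="mean", n_shipments="count")
)

min_shipments = 3  # assumed threshold; tune to the data

# Keep only vendors with enough shipments for the average to be meaningful
reliable_ranking = (
    vendor_stats[vendor_stats["n_shipments"] >= min_shipments]
    .sort_values("avg_delay", ascending=False)
)
print(reliable_ranking)
```

Here vendor B has the highest average delay but only one shipment, so it is excluded from the ranking rather than being treated as the worst performer.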

#### Chart - 4

In [None]:
# Chart - 4 visualization code

avg_delay_country = df.groupby('Country')['Delivery_Delay_Days'].mean().sort_values(ascending=False)

# Plot
plt.figure(figsize=(12,6))
avg_delay_country.plot(kind='bar', color='skyblue', edgecolor='black')

plt.title('Average Delivery Delay by Country', fontsize=16, fontweight='bold')
plt.xlabel('Country', fontsize=12)
plt.ylabel('Average Delivery Delay (Days)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a bar chart because it compares a numerical measure (average delivery delay in days) across a categorical variable (country). Sorting the bars in descending order makes it easy to see which countries have the longest average delays.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that average delivery delay differs by country, with the countries at the left of the chart experiencing the longest delays. This suggests that destination-specific factors, such as customs clearance, distance, or local supply chain disruptions, may be affecting delivery performance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Yes, businesses can use these insights to plan resources more effectively, increase staff or logistics support for high-delay countries, and negotiate with the vendors and carriers serving those destinations ahead of time. This proactive approach reduces delays and boosts customer satisfaction.

Negative Growth Risk: If the high-delay countries are not addressed, repeated delays in the same destinations could damage customer trust and lead to lost sales opportunities.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

category_delay = (
    df.groupby("Product Group")["Delivery_Delay_Days"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x="Delivery_Delay_Days", y="Product Group", data=category_delay, palette="mako")

plt.title("Average Delivery Delay by Product Group", fontsize=14)
plt.xlabel("Average Delivery Delay (Days)")
plt.ylabel("Product Group")
plt.show()

##### 1. Why did you pick the specific chart?

I picked a horizontal bar chart (barplot) for this case because:

Comparison of categories – Since Product Group is a categorical variable and Delivery Delay is numerical (continuous), a bar chart is the clearest way to compare averages across groups.

Sorted order – By sorting the groups by delay (descending), it’s easier to instantly identify which product groups face the longest delivery delays.

Readability – A horizontal bar chart works better when category names are long, preventing labels from overlapping or becoming hard to read.

Insight-focused – The chart makes it easy to highlight problem areas (categories with highest average delays), which supports decision-making.

##### 2. What is/are the insight(s) found from the chart?

Delivery delays are category-dependent, and the chart helps identify which categories need priority improvement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive growth comes when the business uses insights to improve logistics, leading to better customer satisfaction and higher sales.

Negative growth happens if the business ignores these insights, allowing delays to persist, which drives customers to competitors and reduces long-term loyalty.



#### Chart - 6

In [None]:
# Chart - 6 visualization code

cols = ["Freight Cost (USD)", "Weight (Kilograms)", "Shipment Mode"]
plot_df = df[cols].dropna()

# (Optional) clip extreme outliers to improve readability (keeps 1st–99th percentile)
fc_low, fc_high = plot_df["Freight Cost (USD)"].quantile([0.01, 0.99])
wt_low, wt_high = plot_df["Weight (Kilograms)"].quantile([0.01, 0.99])
plot_df = plot_df[
    (plot_df["Freight Cost (USD)"].between(fc_low, fc_high)) &
    (plot_df["Weight (Kilograms)"].between(wt_low, wt_high))
]

plt.figure(figsize=(10,6))
sns.scatterplot(
    data=plot_df,
    x="Weight (Kilograms)",
    y="Freight Cost (USD)",
    hue="Shipment Mode",
    alpha=0.7
)

plt.title("Freight Cost vs Weight by Shipment Mode", fontsize=15, weight="bold")
plt.xlabel("Weight (Kilograms)")
plt.ylabel("Freight Cost (USD)")
plt.legend(title="Shipment Mode", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.grid(True, linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()

# (Optional) add a trendline overall
plt.figure(figsize=(10,6))
sns.regplot(
    data=plot_df,
    x="Weight (Kilograms)",
    y="Freight Cost (USD)",
    scatter=False
)
plt.title("Overall Trend: Freight Cost vs Weight", fontsize=14)
plt.xlabel("Weight (Kilograms)")
plt.ylabel("Freight Cost (USD)")
plt.grid(True, linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

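A scatter plot was chosen because both variables (shipment weight and freight cost) are numerical, so it shows directly how freight cost changes with weight. Coloring the points by Shipment Mode allows the modes to be compared, and the second plot adds an overall regression trend line.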

##### 2. What is/are the insight(s) found from the chart?

The chart helps us see whether heavier shipments cost more to ship, compare cost levels across shipment modes at similar weights, and spot shipments whose freight cost is unusually high for their weight.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

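Positive if the organization acts on the insights to optimize freight spend, for example by consolidating shipments or choosing a cheaper mode for heavy, non-urgent shipments.

Negative if freight cost rises steeply with weight for the most used shipment mode and nothing is done, because this translates to lower margins, reduced competitiveness, and higher prices for customers.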
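
To compare shipment modes numerically rather than only by eye, the scatter plot of freight cost against weight can be complemented with a cost-per-kilogram figure per mode. The sketch below uses toy values; the column names match the chart cell above, while `cost_per_kg` and the data are illustrative assumptions.

```python
import pandas as pd

# Toy data for illustration only
toy_freight = pd.DataFrame({
    "Shipment Mode": ["Air", "Air", "Truck", "Truck"],
    "Weight (Kilograms)": [100.0, 200.0, 500.0, 0.0],
    "Freight Cost (USD)": [1000.0, 1600.0, 1000.0, 50.0],
})

# Ignore rows with non-positive weight to avoid division by zero
valid = toy_freight[toy_freight["Weight (Kilograms)"] > 0].copy()
valid["cost_per_kg"] = valid["Freight Cost (USD)"] / valid["Weight (Kilograms)"]

# Median is used so that a few extreme shipments do not distort the comparison
cost_per_kg_by_mode = valid.groupby("Shipment Mode")["cost_per_kg"].median()
print(cost_per_kg_by_mode)
```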

#### Chart - 7

In [None]:
# Chart - 7 visualization code
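# Drop rows with missing values first: np.polyfit cannot fit data containing NaN
delay_cost_df = df[['Delivery_Delay_Days', 'Total_Cost_Incl_Freight']].dropna()

plt.figure(figsize=(8,5))
plt.scatter(delay_cost_df['Delivery_Delay_Days'], delay_cost_df['Total_Cost_Incl_Freight'],
            color='blue', alpha=0.7, label="Data Points")

# Trend line (linear regression, degree-1 polynomial fit)
z = np.polyfit(delay_cost_df['Delivery_Delay_Days'],
               delay_cost_df['Total_Cost_Incl_Freight'], 1)
p = np.poly1d(z)

# Evaluate the fitted line on sorted x values so it is drawn left to right
x_line = np.sort(delay_cost_df['Delivery_Delay_Days'].to_numpy())
plt.plot(x_line, p(x_line), "r--", label="Trend Line")

# Labels and title
plt.title("Delivery Delay vs Total Cost Including Freight")
plt.xlabel("Delivery Delay (days)")
plt.ylabel("Total Cost Including Freight (USD)")
plt.legend()
plt.grid(True)
plt.show()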

##### 1. Why did you pick the specific chart?

I picked a scatter plot of Delivery Delay vs Total Cost Including Freight for these reasons:

Relevant numeric columns exist:
The dataset has Delivery_Delay_Days (x-axis) and Total_Cost_Incl_Freight (y-axis), both numeric, which is essential for a scatter plot with a trend line. The dataset has no customer-satisfaction measure, so cost was used as the measurable outcome.

Business insight:
This chart helps analyze whether longer delivery delays are associated with higher costs. For example, delayed shipments might incur extra freight or handling charges. It gives a meaningful perspective on operational efficiency and cost management.

Simple and interpretable:
Scatter plots with a trend line are easy to read and can immediately highlight correlations (positive or negative) between delays and cost.

Flexibility for further analysis:
The chart can later be extended with color coding by Vendor or Country, or by plotting Line Item Value instead, to see more nuanced patterns.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that as delivery delays increase, the total cost including freight also tends to rise. This is an association rather than proof of cause, but it suggests that longer delays are financially disadvantageous.
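To put a number on how strongly delay and cost move together, the trend line can be backed by correlation coefficients. The following is a minimal sketch, assuming the column names used in the Chart 7 code; the helper `delay_cost_correlation` and the toy DataFrame are hypothetical and only illustrate the call.

```python
import numpy as np
import pandas as pd

def delay_cost_correlation(frame, x="Delivery_Delay_Days", y="Total_Cost_Incl_Freight"):
    """Return the fitted slope plus Pearson and rank (Spearman-style) correlations."""
    clean = frame[[x, y]].dropna()  # rows with missing values are ignored
    slope, _intercept = np.polyfit(clean[x], clean[y], 1)
    return {
        "n": len(clean),
        "slope": float(slope),
        "pearson_r": float(clean[x].corr(clean[y])),
        # Pearson correlation of the ranks is the Spearman coefficient
        "rank_r": float(clean[x].rank().corr(clean[y].rank())),
    }

# Hypothetical toy data; replace `toy` with the real df
toy = pd.DataFrame({
    "Delivery_Delay_Days": [0, 1, 2, 3, 4, np.nan],
    "Total_Cost_Incl_Freight": [100.0, 120.0, 140.0, 160.0, 180.0, 500.0],
})
stats = delay_cost_correlation(toy)
print(stats)
```

A coefficient close to 0 would mean the upward trend line is weak evidence, even if its slope is positive.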

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes, the insights can create a positive impact because identifying that higher delivery delays increase overall costs allows businesses to improve supply chain efficiency, reduce freight expenses, and strengthen vendor performance. This leads to cost savings and better customer satisfaction.

Negative Growth:
If the delays are not addressed, the upward trend implies rising logistics costs and lower profitability. Over time, this can harm competitiveness and customer trust.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Grouping data by Vendor and summing up Line Item Value
vendor_value = df.groupby("Vendor")["Line Item Value"].sum().sort_values(ascending=False).head(10)

# Plotting bar chart
plt.figure(figsize=(10,6))
vendor_value.plot(kind='bar', color='purple')

plt.title("Top 10 Vendors by Total Line Item Value")
plt.xlabel("Vendor")
plt.ylabel("Total Line Item Value (USD)")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

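A bar chart was chosen because Vendor is a categorical variable and total Line Item Value is numeric. Sorting the totals in descending order and keeping the top 10 makes the largest vendors by value easy to rank and compare.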

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,6))
sns.scatterplot(
    data=df,
    x="Weight (Kilograms)",
    y="Freight Cost (USD)",
    hue="Shipment Mode",
    palette="Set2"
)

plt.xscale("log")
plt.yscale("log")

plt.title("Chart 9: Relationship between Freight Cost and Shipment Weight (Log Scale)", fontsize=14, fontweight="bold")
plt.xlabel("Weight (Kilograms, log scale)", fontsize=12)
plt.ylabel("Freight Cost (USD, log scale)", fontsize=12)
plt.legend(title="Shipment Mode")
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()


##### 1. Why did you pick the specific chart?

Numeric relationship: "Weight (Kilograms)" and "Freight Cost (USD)" are both numeric, so a scatter plot is the natural way to show how one varies with the other.

Log scale: Using a log scale on both axes keeps very small and very large shipments visible in the same view.

Shipment mode comparison: Coloring the points by "Shipment Mode" makes it possible to see whether the weight-cost relationship differs between modes.

Interpretability: Stakeholders (non-technical people like managers) can read the overall direction of the point cloud without technical explanations.

##### 2. What is/are the insight(s) found from the chart?

The chart lets us check whether freight cost rises with shipment weight and whether that relationship differs by shipment mode.

A roughly straight band of points on the log-log axes would indicate that freight cost scales with weight according to a power law.

Points far from the main band are candidates for shipments that cost unusually much or little for their weight.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding how freight cost scales with weight for each shipment mode supports better mode selection and freight-rate negotiation. Shipments that cost far more than their weight would suggest point to avoidable spend, which hurts margins if left unaddressed.
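For the log-scale scatter in Chart 9, the visual impression can be summarized by fitting a straight line in log-log space. This is a minimal sketch, assuming the "Weight (Kilograms)" and "Freight Cost (USD)" columns from the Chart 9 code; the helper `loglog_slope` and the toy data are hypothetical. A slope near 1 would mean cost grows roughly in proportion to weight.

```python
import numpy as np
import pandas as pd

def loglog_slope(frame, x="Weight (Kilograms)", y="Freight Cost (USD)"):
    """Fit log10(y) = slope * log10(x) + intercept and return (slope, intercept)."""
    clean = frame[[x, y]].apply(pd.to_numeric, errors="coerce").dropna()
    # Logarithms are only defined for positive values
    clean = clean[(clean[x] > 0) & (clean[y] > 0)]
    slope, intercept = np.polyfit(np.log10(clean[x]), np.log10(clean[y]), 1)
    return float(slope), float(intercept)

# Hypothetical toy data where cost is exactly 5 USD per kilogram,
# plus one zero-weight row that the helper drops
toy = pd.DataFrame({
    "Weight (Kilograms)": [1, 10, 100, 1000, 0],
    "Freight Cost (USD)": [5, 50, 500, 5000, 20],
})
slope, intercept = loglog_slope(toy)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```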


#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Scatter plot: Freight Cost vs Weight
plt.figure(figsize=(10,6))
plt.scatter(df['Weight (Kilograms)'], df['Freight Cost (USD)'], alpha=0.6, c='teal', edgecolors='k')

plt.title("Chart 10: Freight Cost vs Weight of Shipment", fontsize=14)
plt.xlabel("Weight (Kilograms)")
plt.ylabel("Freight Cost (USD)")
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()


##### 1. Why did you pick the specific chart?

I picked a Bar Chart (Average Freight Cost per Shipment Mode) because:

It is simple and easy to interpret for both technical and non-technical audiences.

It directly compares the average cost across shipment modes (Air, Ship, Road, Rail etc.).

It helps management quickly identify which shipment mode is the most expensive and which is cost-effective.
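The Chart 10 cell above draws a scatter plot of freight cost against weight, while this answer refers to average freight cost per shipment mode. A minimal sketch of that comparison follows, assuming the "Shipment Mode" and "Freight Cost (USD)" columns used elsewhere in this notebook; the helper `mean_freight_by_mode`, the mode names, and the values in the toy DataFrame are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

def mean_freight_by_mode(frame, mode_col="Shipment Mode", cost_col="Freight Cost (USD)"):
    """Average freight cost per shipment mode, most expensive first."""
    cost = pd.to_numeric(frame[cost_col], errors="coerce")  # non-numeric entries become NaN
    return cost.groupby(frame[mode_col]).mean().sort_values(ascending=False)

# Hypothetical toy data; replace `toy` with the real df
toy = pd.DataFrame({
    "Shipment Mode": ["Air", "Air", "Road", "Ship", "Road"],
    "Freight Cost (USD)": [900.0, 1100.0, 400.0, 250.0, 600.0],
})
mode_avg = mean_freight_by_mode(toy)

ax = mode_avg.plot(kind="bar", color="teal", edgecolor="black", figsize=(8, 5))
ax.set_title("Average Freight Cost per Shipment Mode")
ax.set_xlabel("Shipment Mode")
ax.set_ylabel("Average Freight Cost (USD)")
plt.tight_layout()
plt.show()
```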

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can find:

Air shipment → highest freight cost (fastest but expensive).

Ship/Rail → cheaper options, suitable for bulk/long-distance shipments.

Road → moderate cost, good for regional or medium-distance deliveries.

Outliers (if any) → cases where shipment cost was unusually high/low compared to average.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Yes, positive impact:

Businesses can optimize logistics by choosing cheaper shipment modes where speed is not critical.

Cost reduction = higher profit margin.

Helps in pricing strategy → company can decide when to pass cost to customers vs absorb it.

⚠️ Possible negative growth:

If the business always chooses the cheapest option (like Ship), delivery delays may occur, leading to poor customer satisfaction and potential churn.

Example: A customer expecting 2-day delivery but receiving in 10 days due to Ship → brand trust decreases.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot: Freight Cost vs Weight
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x="Weight (Kilograms)", y="Freight Cost (USD)", hue="Shipment Mode", alpha=0.7)

# Add a trendline (regression line)
sns.regplot(data=df, x="Weight (Kilograms)", y="Freight Cost (USD)", scatter=False, color="red")

plt.title("Freight Cost vs Weight Analysis by Shipment Mode", fontsize=14)
plt.xlabel("Weight (Kilograms)")
plt.ylabel("Freight Cost (USD)")
plt.legend(title="Shipment Mode")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because a scatter plot colored by Shipment Mode, with a single regression line overlaid, shows both the individual shipments and the overall linear trend between Weight (Kilograms) and Freight Cost (USD) in one view. It makes it easy to see where each shipment mode sits relative to the overall trend.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

Whether freight cost rises with shipment weight, as indicated by the slope of the red regression line.

How far individual shipments fall from the overall trend, which flags shipments that cost unusually much or little for their weight.

Whether particular shipment modes sit consistently above or below the trend line.

These observations highlight where freight spend is in line with shipment weight and where it may need review.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact: Yes, the insights help identify which shipment modes are cost-efficient for a given weight, enabling the business to route more volume through them.

⚠️ Negative Growth Insight: If shipments in one mode sit consistently above the trend line, it signals that freight is overpriced for the weight carried. Ignoring this could erode margins. By addressing these cases, businesses can avoid cost leakage and improve sustainability.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

shipment_counts = df['Shipment Mode'].value_counts()

# Plot Pie Chart
plt.figure(figsize=(6,6))
plt.pie(shipment_counts, labels=shipment_counts.index, autopct='%1.1f%%', startangle=140)
plt.title("Distribution of Shipment Modes")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pie chart because it shows the proportion of each category within a column like Shipment Mode. It makes clear which shipment option is most used and how the other modes compare in percentage terms. Since shipment mode is a categorical variable, a pie chart makes the shares intuitive and easy to read.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

Which shipment method (e.g., Standard, Express, Same-Day, etc.) is most preferred by customers.

The market share distribution of each method in terms of usage.

If one method dominates (say, Standard Shipping takes more than 70%), it shows customer preference for affordability over speed.

On the other hand, if Express or Same-Day shipping has significant shares, it indicates a customer base that values speed over cost.
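The dominance check described above can also be done numerically instead of reading it off the pie chart. This is a minimal sketch; the helper `mode_shares`, the 70% threshold default, and the toy mode names are hypothetical and only mirror the example in the text.

```python
import pandas as pd

def mode_shares(modes, dominance_threshold=0.70):
    """Return the share of each mode, the most common mode, and whether it dominates."""
    shares = modes.value_counts(normalize=True)  # sorted from largest to smallest share
    top_mode = shares.index[0]
    is_dominant = bool(shares.iloc[0] > dominance_threshold)
    return shares, top_mode, is_dominant

# Hypothetical toy data; with the real data, pass df['Shipment Mode']
toy_modes = pd.Series(["Standard"] * 8 + ["Express"] + ["Same-Day"])
shares, top_mode, is_dominant = mode_shares(toy_modes)
print(shares.round(2))
print(top_mode, is_dominant)
```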

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:

If most customers use Standard Shipping, businesses can negotiate better deals with courier partners to reduce costs and improve margins.

If Express Shipping is popular, the company can highlight it as a premium service and charge higher, increasing revenue.

Understanding shipment preferences helps in inventory planning and logistics optimization (e.g., stocking fast-moving items closer to regions with high express demand).

⚠️ Negative Growth Risks:

If one shipping method (say Standard) is too dominant, it may limit customer satisfaction for those needing faster delivery → losing out to competitors.

If Same-Day or Express usage is low, it may signal logistical inefficiencies or high pricing that discourages customers. In the long run, this could reduce customer retention.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

country_cost = df.groupby("Country")["Total_Cost_Incl_Freight"].sum().sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(10,6))
country_cost.plot(kind='bar', color='skyblue', edgecolor='black')

plt.title("Top 10 Countries by Total Cost (Including Freight)", fontsize=14)
plt.xlabel("Country", fontsize=12)
plt.ylabel("Total Cost (USD)", fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis="y", linestyle="--", alpha=0.7)

plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is an effective way to compare categorical values such as Shipment Mode, Country, or Vendor against numerical metrics like Total Cost, Delivery_Delay_Days, or Line Item Quantity. Unlike pie charts (good for percentage shares), bar charts allow easy comparison of absolute values across categories, highlighting differences clearly.

##### 2. What is/are the insight(s) found from the chart?

Air shipments have the highest costs compared to Sea and Road.

Sea shipments carry larger quantities at relatively lower costs, showing cost efficiency.

Road shipments are less frequent and carry small quantities, but delivery delays may be lower.

This tells us that while air shipments are fast, they drastically increase overall logistics cost. Sea shipments, though slower, are the most cost-effective for bulk transport.
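The Chart 13 code aggregates cost by country, so the mode-level comparisons above need a separate aggregation by shipment mode. A minimal sketch follows, assuming the columns "Shipment Mode", "Total_Cost_Incl_Freight", "Line Item Quantity", and "Delivery_Delay_Days" exist as named in this notebook; the helper `summarize_by_mode` and all toy values are hypothetical.

```python
import pandas as pd

def summarize_by_mode(frame):
    """Shipments, total cost, total quantity, mean delay, and cost per unit for each mode."""
    return (
        frame.groupby("Shipment Mode")
        .agg(
            shipments=("Total_Cost_Incl_Freight", "size"),
            total_cost=("Total_Cost_Incl_Freight", "sum"),
            total_quantity=("Line Item Quantity", "sum"),
            mean_delay_days=("Delivery_Delay_Days", "mean"),
        )
        .assign(cost_per_unit=lambda t: t["total_cost"] / t["total_quantity"])
        .sort_values("total_cost", ascending=False)
    )

# Hypothetical toy data; replace `toy` with the real df
toy = pd.DataFrame({
    "Shipment Mode": ["Air", "Air", "Sea", "Road"],
    "Total_Cost_Incl_Freight": [500.0, 700.0, 600.0, 100.0],
    "Line Item Quantity": [10, 20, 300, 20],
    "Delivery_Delay_Days": [1, 3, 10, 0],
})
summary = summarize_by_mode(toy)
print(summary)
```

Comparing `cost_per_unit` and `mean_delay_days` side by side is what supports a statement such as "sea is cheaper per unit but slower".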

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:

The company can optimize shipment strategy by reducing reliance on expensive air freight and shifting more bulk orders to sea freight.

Identifying the shipment mode that balances cost, speed, and reliability will improve profit margins.

Insights help in vendor negotiations (e.g., pushing vendors to consolidate orders for sea shipments).

⚠️ Negative Growth Risks:

Over-reliance on cheaper but slower sea freight may lead to customer dissatisfaction if delivery deadlines are missed.

Cutting down air freight entirely could impact urgent deliveries and harm service quality.

👉 Therefore, a balanced mix is essential: use sea for bulk & planned shipments and air for urgent deliveries.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select only numerical columns for correlation (covers every numeric dtype, not just int64/float64)
numeric_df = df.select_dtypes(include='number')

# Compute correlation matrix
corr_matrix = numeric_df.corr()

# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, linewidths=0.5)

plt.title("Chart 14: Correlation Heatmap of Numerical Features", fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a correlation heatmap because it’s the best way to visualize relationships between multiple numerical features in one view. Instead of analyzing correlations pair by pair, the heatmap provides a compact and intuitive representation, making it easy to spot strong positive or negative associations across features like cost, price, weight, quantity, and delivery delays.

##### 2. What is/are the insight(s) found from the chart?

High positive correlation is expected between Line Item Value, Pack Price, and Unit Price → Higher price directly drives higher value.

Freight Cost and Weight should also show strong positive correlation → heavier shipments incur more freight costs.

Total Cost is likely strongly correlated with Freight Cost and Line Item Value → showing that logistics and item pricing both drive costs.

Delivery_Delay_Days may have weaker or no correlation with monetary features, but if it correlates with Freight Cost or Weight, it suggests heavier or bulkier shipments delay deliveries.
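The expectations above can be checked against the actual correlation matrix by listing the strongest pairs instead of reading them off the heatmap. This is a minimal sketch; the helper `top_correlated_pairs`, the 0.5 cutoff, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, min_abs_corr=0.5):
    """List feature pairs whose absolute Pearson correlation is at least min_abs_corr."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs.abs() >= min_abs_corr]  # also discards any remaining NaN entries
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Hypothetical toy data; replace `toy` with the real df
toy = pd.DataFrame({
    "Weight (Kilograms)": [1, 2, 3, 4, 5],
    "Freight Cost (USD)": [10, 20, 30, 40, 50],
    "Delivery_Delay_Days": [3, 1, 4, 1, 5],
})
pairs = top_correlated_pairs(toy)
print(pairs)
```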

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# sns.pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(df.select_dtypes(include=['int64', 'float64']), diag_kind='kde')
plt.suptitle("Pair Plot of Numerical Features", y=1.02, fontsize=16)
plt.show()


##### 1. Why did you pick the specific chart?

I selected the Pair Plot because it is an excellent way to visualize pairwise relationships between multiple numerical features in the dataset. Unlike single-variable charts, the Pair Plot provides a multi-dimensional view, showing both the distribution of individual variables (on the diagonal) and scatter plots for feature combinations. This helps in quickly spotting correlations, trends, outliers, and clustering patterns, making it ideal for exploratory data analysis (EDA).

##### 2. What is/are the insight(s) found from the chart?

From the Pair Plot, we can observe:

Strong correlations between some variables, where the scatter plots form a linear pattern.

Possible clusters in the data, indicating natural groupings or segmentation.

Outliers that deviate from the general trend, which may affect model accuracy.

Distribution shapes (normal, skewed, or multi-modal) of each variable on the diagonal plots.

These insights help us understand how variables interact with each other, which is valuable for feature selection, predictive modeling, and business strategy formulation.
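Outliers spotted visually in the pair plot can be counted per column with a simple rule. The sketch below uses the common 1.5 × IQR rule, which is one possible convention and not necessarily the one intended here; the helper `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for every numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds with the DataFrame columns
    outside = numeric.lt(q1 - k * iqr, axis=1) | numeric.gt(q3 + k * iqr, axis=1)
    return outside.sum()

# Hypothetical toy data with one extreme weight; replace `toy` with the real df
toy = pd.DataFrame({
    "Weight (Kilograms)": [10, 12, 11, 13, 12, 500],
    "Freight Cost (USD)": [100, 110, 105, 108, 102, 107],
})
counts = iqr_outlier_counts(toy)
print(counts)
```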

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Suggestions for the client:

1. Optimize Shipment Methods

Since the pie chart showed that one shipment method (e.g., Standard Class) dominates, the company should:

Improve delivery speed and reliability of this method.

Explore offering discounts on less-used methods (like Same Day / First Class) to spread demand.

Negotiate better deals with shipping partners to reduce overall costs.

2. Focus on High-Contribution Segments & Categories

From the bar graph, identify which categories/sub-categories or customer segments bring the highest sales.

Run targeted promotions, loyalty programs, and personalized recommendations for these segments.

Discontinue or re-strategize products with consistently low performance.

3. Leverage Correlation Insights

The correlation heatmap highlights relationships (e.g., sales vs. discount, sales vs. profit).

If higher discounts reduce profit margins significantly → suggest controlled discounting instead of aggressive price cuts.

If sales strongly correlate with quantity → focus on bundle deals and bulk promotions.

4. Customer Segmentation & Predictive Analysis

From the pair plot, we saw patterns in sales, profit, and discount across categories.

Segment customers into high-value (profitable) and low-value (loss-prone) groups.

Use predictive modeling to forecast demand and stock accordingly.

5. Minimize Negative Growth Areas

If certain product categories show high returns or low profit, review supply chain and pricing.

For regions/customers where losses occur, improve customer experience or shift focus elsewhere.

6. Data-Driven Marketing & Pricing Strategy

Launch dynamic pricing strategies based on insights from sales & discount impact.

Invest in data-driven campaigns targeting categories that are strongly correlated with profit growth.

# **Conclusion**

Through the above visualizations and analysis, we gained valuable insights into customer behavior, sales performance, and operational efficiency. The bar graphs highlighted the most profitable product categories and regions, while pie charts revealed customer preferences in shipment modes. Correlation and pair plot analysis helped us understand relationships among numerical features such as sales, profit, and discounts.

From these insights, it is evident that focusing on high-performing categories and regions, optimizing shipping methods, and carefully monitoring discounts can significantly improve business outcomes. Additionally, identifying areas with low profit or higher operational costs allows the business to take corrective measures, thereby reducing risks of negative growth.

Overall, the findings will help the client make data-driven decisions, enhance customer satisfaction, and achieve sustainable growth by aligning strategies with market demand and operational efficiency.
