# 📊 Exploratory Data Analysis (EDA): Supply Chain Logistics Dataset

This notebook performs **Exploratory Data Analysis (EDA)** on a real-world supply chain logistics dataset. EDA is the foundation of any data-driven investigation. It helps us understand the **structure**, **patterns**, and **quality** of the data before applying any models, algorithms, or drawing conclusions.

---

### 🎯 Objectives of This Notebook

- Load and preview the dataset
- Clean and prepare data (without modifying original files)
- Explore patterns through visualizations
- Understand distributions and correlations
- Surface operational, temporal, and spatial trends
- Explore how different features relate to delivery delay and risk classification

We strictly avoid modeling, statistical testing, or altering the source dataset at this stage.

---

### 📁 Dataset Source

This notebook uses a dataset located in the `0_datasets/` directory. Per project guidelines:

- Do not modify the original data
- Ensure all analysis is reproducible
- Save outputs and visuals to this exploration folder if needed

---

### 📚 Reference Guide

This analysis follows principles from:

- **Chapter 4 - Exploratory Data Analysis**  
  From *The Art of Data Science* by Roger D. Peng and Elizabeth Matsui

> "Exploratory analysis is not about testing a hypothesis. It's about allowing the data to reveal its structure, patterns, and surprises."

---

> **This is not modeling.**  
> No inferential statistics or machine learning is performed here.  
> This work is intended to prepare for those stages by grounding us in what the data is really saying.

---



## Step 1: Initial Glimpse and Summary
### ✅ Data Preview

- **Temporal**: timestamp
- **Geospatial**: vehicle_gps_latitude, vehicle_gps_longitude
- **Operational & Predictive**: fuel, delay, risk, congestion, equipment, etc.
- **Target/Label**: risk_classification, delivery_time_deviation
---
### 📊 Descriptive Summary

- Most variables are numerical, which suits correlation and trend analysis.
- `timestamp` must be converted to a datetime type for time-based exploration.
- Two columns are non-numeric:
  - `timestamp` (loaded as text, to be parsed)
  - `risk_classification` (categorical)
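
The timestamp conversion noted above can be sketched as follows. This is a minimal illustration on a small made-up frame (the values and risk labels are hypothetical, not taken from the dataset). Passing `errors="coerce"` turns unparseable strings into `NaT`, so they can be counted instead of raising an error.

```python
import pandas as pd

# Hypothetical sample; the real column comes from the loaded dataset.
sample = pd.DataFrame({
    "timestamp": ["2021-01-01 00:00:00", "2021-01-01 01:00:00", "not a date"],
    "risk_classification": ["Low Risk", "High Risk", "Moderate Risk"],
})

# Parse on a copy so the originally loaded frame is left untouched.
parsed = sample.copy()
parsed["timestamp"] = pd.to_datetime(parsed["timestamp"], errors="coerce")

# Count values that could not be parsed (they become NaT).
n_unparsed = int(parsed["timestamp"].isna().sum())
print(parsed.dtypes)
print(f"Unparseable timestamps: {n_unparsed}")
```

If the count is zero on the real data, the plain `pd.to_datetime(df['timestamp'])` call used in Step 2 is sufficient.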
---
### ⚠️ Data Quality

- **Missing values**: None (every column has 32,065 entries).
- **Duplicate rows**: `0` duplicates found – ✅ clean!
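
The two quality checks above could be reproduced with a small helper like the one below. This is a sketch, not the notebook's original code: `quality_report` is a hypothetical helper name, and the toy frame is made up so that it contains one missing value and one duplicated row.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Return row count, per-column missing counts, and duplicate-row count."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Hypothetical toy frame with one missing value and one duplicated row.
toy = pd.DataFrame({
    "fuel_consumption_rate": [5.1, 5.1, None, 7.3],
    "risk_classification": ["High Risk", "High Risk", "Low Risk", "Low Risk"],
})

report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the loaded dataset should report zero missing values and zero duplicates if the summary above holds.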


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset using a raw string literal and a fallback encoding.
# NOTE: this absolute path is machine-specific; adjust it to your local copy of the dataset.
df = pd.read_csv(r"C:\Users\alema\OneDrive\Desktop\DS_GroupProject\dynamic_supply_chain_logistics_dataset.csv", encoding='latin1')

# Preview the first five rows
df.head()

In [None]:

# Dimensions: (rows, columns)
df.shape

In [None]:
# Column dtypes and non-null counts
df.info()

## Step 2: Data Cleaning & Type Adjustments

In [None]:
# Convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Missing values heatmap
plt.figure(figsize=(12, 1))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap="BuPu")
plt.title('Missing Data Overview')
plt.show()

## Step 3: Univariate Analysis

In [None]:
# Histograms of numerical features (integer and float columns)
for col in df.select_dtypes(include='number').columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[col], bins=30, kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

# Categorical variable
sns.countplot(x='risk_classification', data=df)
plt.title('Distribution of Risk Classification')
plt.show()

## Step 4: Multivariate Analysis

In [None]:
# Correlation heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(df.corr(numeric_only=True), annot=False, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()

# Boxplots by risk classification
features = ['fuel_consumption_rate', 'eta_variation_hours', 'shipping_costs', 'delay_probability']
for col in features:
    sns.boxplot(data=df, x='risk_classification', y=col)
    plt.title(f'{col} by Risk Classification')
    plt.show()

### 🔗 Multivariate Analysis Summary
---
#### 1. 📉 Correlation Heatmap

**Strong Positive Correlations:**
- delay_probability and delivery_time_deviation
- shipping_costs and warehouse_inventory_level

**Strong Negative Correlations:**
- driver_behavior_score and delay_probability
- fatigue_monitoring_score and risk-related variables

**These insights suggest operational behavior and fatigue are inversely tied to delay and disruption.**
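
Reading strong pairs off a large heatmap by eye is error-prone, so one option is to rank them numerically. The sketch below assumes Pearson correlation (the default of `DataFrame.corr`). `top_correlated_pairs` is a hypothetical helper, and the demo data is synthetic rather than drawn from the dataset.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears once
    # and the self-correlations on the diagonal are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic data (not from the dataset): b rises with a, c falls with a, d is noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": -a + rng.normal(scale=0.5, size=500),
    "d": rng.normal(size=500),
})

result = top_correlated_pairs(demo, n=3)
print(result)
```

Sorting by absolute value keeps strong negative pairs in the ranking alongside positive ones.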

#### 2. 🔍 Pairwise Scatterplots

**Observed Patterns:**
- High delay_probability → High delivery_time_deviation
- eta_variation_hours and fuel_consumption_rate vary widely

⚠️ Some relationships appear non-linear, so **feature engineering** may improve future models.

#### 3. 📦 Boxplots by `risk_classification`

**Findings:**
- **High Risk** trips have greater delay_probability and delivery_time_deviation
- **Shipping Costs** and **ETA Variation** are elevated in higher risk classes

✅ These differences are consistent with `risk_classification` reflecting meaningful risk segmentation, though no formal test is applied at this stage.
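
The boxplot comparison can be backed by a per-class numeric summary. The sketch below computes the median and interquartile range of each feature within each risk class. `summarize_by_risk` is a hypothetical helper and the toy values are invented for illustration.

```python
import pandas as pd

def summarize_by_risk(frame: pd.DataFrame, features: list[str],
                      group_col: str = "risk_classification") -> pd.DataFrame:
    """Median and interquartile range of each feature within each risk class."""
    grouped = frame.groupby(group_col)[features]
    median = grouped.median().add_suffix("_median")
    iqr = (grouped.quantile(0.75) - grouped.quantile(0.25)).add_suffix("_iqr")
    return pd.concat([median, iqr], axis=1)

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "risk_classification": ["High Risk"] * 4 + ["Low Risk"] * 4,
    "delay_probability": [0.8, 0.9, 0.7, 1.0, 0.1, 0.2, 0.3, 0.2],
    "shipping_costs": [500, 520, 480, 540, 300, 310, 290, 320],
})

summary = summarize_by_risk(toy, ["delay_probability", "shipping_costs"])
print(summary)
```

This stays descriptive: it reports the same quantities a boxplot draws, without any hypothesis test.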


## Step 5: Time Series Trends

In [None]:
# Step 5: Time Series Exploration

# Create time-based features
df['hour'] = df['timestamp'].dt.hour
df['day'] = df['timestamp'].dt.day
df['month'] = df['timestamp'].dt.month
df['day_of_week'] = df['timestamp'].dt.dayofweek

# Time series trends: average delivery deviation and delay probability by month
monthly_trends = df.groupby('month')[['delivery_time_deviation', 'delay_probability']].mean()

# Plot monthly trends
monthly_trends.plot(marker='o', figsize=(10, 5))
plt.title('Monthly Averages: Delivery Deviation & Delay Probability')
plt.xlabel('Month')
plt.ylabel('Average Value')
plt.grid(True)
plt.show()

# Hourly trend: Average traffic congestion
hourly_congestion = df.groupby('hour')['traffic_congestion_level'].mean()

# Plot hourly congestion trend
plt.figure(figsize=(10, 4))
hourly_congestion.plot(marker='o')
plt.title('Average Traffic Congestion by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Traffic Congestion Level')
plt.grid(True)
plt.show()

# Daily variation: Average fuel consumption
daily_fuel = df.groupby('day')['fuel_consumption_rate'].mean()

plt.figure(figsize=(10, 4))
daily_fuel.plot(marker='o')
plt.title('Average Fuel Consumption by Day of Month')
plt.xlabel('Day')
plt.ylabel('Fuel Consumption Rate')
plt.grid(True)
plt.show()


### 📆 Monthly Trends

- **Delivery Time Deviation** and **Delay Probability** tend to rise mid-year (peaking around summer months).
- Possible **seasonality effects** (weather, demand surges, or vacation periods) may be influencing logistics delays.
---
### ⏰ Hourly Trends

- **Traffic Congestion Level** is higher during typical rush hours (**8–10 AM** and **4–6 PM**).
- Suggests congestion follows expected urban patterns and could inform **time-window planning**.
---
### 📅 Daily Trends

- **Fuel Consumption Rate** shows mild fluctuation day-to-day.
- Could reflect varying **routes, cargo weights, or operational conditions**.
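
The `day_of_week` feature created in Step 5 is not summarized above. A weekday aggregation might look like the sketch below, which runs on made-up daily records. It reports the observation count next to each mean, so it is visible whether the averages rest on comparable amounts of data.

```python
import pandas as pd

# Hypothetical daily records spanning two weeks, starting on a Monday (values are made up).
ts = pd.date_range("2023-01-02", periods=14, freq="D")
demo = pd.DataFrame({
    "timestamp": ts,
    # Toy signal: congestion is higher on weekdays than on weekends.
    "traffic_congestion_level": [7.0 if t.dayofweek < 5 else 3.0 for t in ts],
})

# Mean plus observation count per weekday (0 = Monday, 6 = Sunday).
by_weekday = (
    demo.groupby(demo["timestamp"].dt.dayofweek)["traffic_congestion_level"]
        .agg(["mean", "count"])
        .rename_axis("day_of_week")
)
print(by_weekday)
```

The same `agg(["mean", "count"])` pattern applies to the monthly, hourly, and day-of-month groupings used in this step.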


## Step 6: Spatial Exploration

In [None]:
# GPS Coordinates
sns.scatterplot(x='vehicle_gps_longitude', y='vehicle_gps_latitude', hue='risk_classification', data=df, alpha=0.5)
plt.title('Vehicle Locations by Risk Classification')
plt.show()

## Step 7: Target Variable Analysis

In [None]:
# Boxplot by risk class for delivery_time_deviation
sns.boxplot(x='risk_classification', y='delivery_time_deviation', data=df)
plt.title('Delivery Time Deviation by Risk Classification')
plt.show()

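# Correlations with delivery_time_deviation, excluding its self-correlation of 1.0
corrs = df.corr(numeric_only=True)['delivery_time_deviation'].drop('delivery_time_deviation')
# Rank by absolute strength so strong negative correlations are kept
top_corrs = corrs.sort_values(key=lambda s: s.abs(), ascending=False).head(10)
top_corrs.plot(kind='barh')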
plt.title('Top Correlations with Delivery Time Deviation')
plt.gca().invert_yaxis()
plt.grid(True)
plt.show()

## 📌 Conclusion
- No missing data or duplicates found.
- Several strong correlations and temporal/spatial patterns were observed.
- `risk_classification` and `delivery_time_deviation` show meaningful relationships with operational variables.
- This EDA sets a strong foundation for modeling or optimization work.
