# Hypothesis-Driven Exploratory Data Analysis (EDA)

## Project: E-commerce Operations & Customer Experience Analytics  
**Company Context:** Olist (Multi-seller e-commerce marketplace)

**Objective:**  
To validate predefined business hypotheses related to delivery performance, customer satisfaction, and revenue risk using structured exploratory data analysis.


## 1. Business Objective

Despite stable order volumes, customer satisfaction metrics have declined.
The objective of this analysis is to identify operational drivers, particularly delivery performance,
that negatively impact customer experience and pose potential revenue risk.

This EDA is conducted after:
- Problem Statement definition
- Project Charter approval
- Hypothesis-to-Metric mapping
- Data Validation & Quality checks


## 2. Hypotheses Being Tested

The analysis focuses on validating the following hypotheses:

- **H1:** Delivery delays negatively impact customer ratings  
- **H2:** A small subset of sellers accounts for the majority of delivery delays  
- **H3:** Delivery delays are concentrated in specific geographic regions  
- **H4:** Delayed orders represent a significant portion of total revenue (revenue at risk)

All exploratory analysis is explicitly aligned to these hypotheses.


## 3. Dataset Overview

The dataset represents transactional data from a multi-seller e-commerce marketplace and includes:

- Orders and delivery timelines
- Order items and revenue information
- Seller identifiers
- Customer review scores
- Geographic information (city/state)

### Initial checks performed:
- Total number of orders
- Date range coverage
- Number of sellers
- Percentage of orders with customer reviews
- Percentage of delayed deliveries
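
The checks listed above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the project's actual code: the table layout and the column names `order_id`, `seller_id`, `purchase_date`, and `review_score` are assumptions, and "delayed" is assumed to mean the actual delivery date falls after the estimated one.

```python
import pandas as pd

# Toy stand-in for a merged order-level table.
# Column names and values are assumptions for illustration only.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "seller_id": ["s1", "s1", "s2", "s3"],
    "purchase_date": pd.to_datetime(
        ["2018-01-02", "2018-01-05", "2018-02-10", "2018-03-01"]),
    "estimated_delivery_date": pd.to_datetime(
        ["2018-01-10", "2018-01-12", "2018-02-20", "2018-03-10"]),
    "actual_delivery_date": pd.to_datetime(
        ["2018-01-09", "2018-01-15", "2018-02-20", "2018-03-14"]),
    "review_score": [5, 2, None, 1],
})


def initial_checks(df):
    """Return the headline dataset checks as a plain dictionary."""
    # Assumption: an order is delayed when it arrives after the estimate.
    delayed = df["actual_delivery_date"] > df["estimated_delivery_date"]
    return {
        "total_orders": df["order_id"].nunique(),
        "date_min": df["purchase_date"].min(),
        "date_max": df["purchase_date"].max(),
        "n_sellers": df["seller_id"].nunique(),
        "pct_with_review": df["review_score"].notna().mean() * 100,
        "pct_delayed": delayed.mean() * 100,
    }


checks = initial_checks(orders)
for name, value in checks.items():
    print(f"{name}: {value}")
```

Returning a dictionary keeps the checks easy to log or compare between data refreshes.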


## 4. Data Preparation Summary

Prior to analysis, the following data preparation steps were completed:

- Removed records with invalid or missing delivery dates
- Excluded cancelled orders from revenue calculations
- Excluded orders without reviews from satisfaction analysis
- Validated delivery timelines for logical consistency

These steps ensure that insights are based on accurate and reliable data.
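
A hedged sketch of how these preparation steps might look in pandas is shown below. The column names (`order_status`, `purchase_date`, `review_score`), the status value `"canceled"`, and the consistency rule (delivery must not precede purchase) are assumptions chosen for illustration; the real pipeline may apply different rules.

```python
import pandas as pd

# Toy data; every column name and value here is an assumption.
raw = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "order_status": ["delivered", "delivered", "canceled",
                     "delivered", "delivered"],
    "purchase_date": pd.to_datetime(
        ["2018-01-02", "2018-01-05", "2018-01-07",
         "2018-02-10", "2018-03-01"]),
    "actual_delivery_date": pd.to_datetime(
        ["2018-01-09", None, "2018-01-15", "2018-02-08", "2018-03-14"]),
    "review_score": [5, 4, 1, 3, None],
})

# Steps 1 and 4: drop missing delivery dates and timelines that are
# logically impossible (assumed rule: delivery before purchase).
valid_dates = (
    raw["actual_delivery_date"].notna()
    & (raw["actual_delivery_date"] >= raw["purchase_date"])
)
clean = raw[valid_dates]

# Step 2: cancelled orders are excluded only from revenue calculations.
revenue_base = clean[clean["order_status"] != "canceled"]

# Step 3: orders without a review are excluded only from satisfaction analysis.
satisfaction_base = clean[clean["review_score"].notna()]

print(len(raw), len(clean), len(revenue_base), len(satisfaction_base))
```

Keeping separate `revenue_base` and `satisfaction_base` subsets, rather than one globally filtered table, means each hypothesis only loses the rows that are actually unusable for it.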


## 5. Feature Engineering

The following derived features were created to support hypothesis testing:

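- **delivery_delay_days** = actual_delivery_date − estimated_delivery_date (in days; negative when the order arrives early)
- **delivery_status** = On-time (delivery_delay_days ≤ 0) / Delayed (delivery_delay_days > 0)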
- **delay_bucket**:
  - 0 days
  - 1–2 days
  - 3–5 days
  - >5 days
- **order_revenue** = price × quantity

These features enable standardized comparison across orders, sellers, and regions.
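
The derived features above could be computed roughly as follows. This is a sketch on toy data under two stated assumptions: early deliveries (negative delay) are placed in the "0 days" bucket, and an order counts as delayed only when the delay is strictly positive. The bucket labels use plain hyphens for convenience.

```python
import pandas as pd

# Toy data using the field names from the feature definitions above.
items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "estimated_delivery_date": pd.to_datetime(["2018-01-10"] * 5),
    "actual_delivery_date": pd.to_datetime(
        ["2018-01-08", "2018-01-10", "2018-01-12",
         "2018-01-14", "2018-01-20"]),
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "quantity": [1, 2, 1, 3, 1],
})

# delivery_delay_days: positive means late, negative means early.
items["delivery_delay_days"] = (
    items["actual_delivery_date"] - items["estimated_delivery_date"]
).dt.days

# delivery_status: assumed rule, delayed only if strictly later than estimated.
items["delivery_status"] = (
    items["delivery_delay_days"].gt(0).map({True: "Delayed", False: "On-time"})
)

# delay_bucket: right-closed intervals (-inf, 0], (0, 2], (2, 5], (5, inf).
# Assumption: early deliveries fall into the "0 days" bucket.
items["delay_bucket"] = pd.cut(
    items["delivery_delay_days"],
    bins=[float("-inf"), 0, 2, 5, float("inf")],
    labels=["0 days", "1-2 days", "3-5 days", ">5 days"],
)

# order_revenue = price x quantity.
items["order_revenue"] = items["price"] * items["quantity"]

print(items[["order_id", "delivery_delay_days", "delivery_status",
             "delay_bucket", "order_revenue"]])
```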


## 6. Hypothesis 1: Impact of Delivery Delays on Customer Ratings

**Business Question:**  
Do delayed deliveries lead to lower customer satisfaction?


### Analysis Plan

To test this hypothesis, we will:
- Compare average customer ratings for on-time vs delayed orders
- Analyze ratings across delivery delay buckets
- Measure the percentage of low ratings (<3) for each delay bucket
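
The three comparisons in this plan can be sketched with a pair of group-by aggregations. The data below is invented purely to make the example runnable, and the column name `review_score` is an assumption; only the low-rating threshold (<3) comes from the plan itself.

```python
import pandas as pd

# Invented toy data; it does not reflect actual project results.
reviews = pd.DataFrame({
    "delivery_status": ["On-time", "On-time", "Delayed",
                        "Delayed", "Delayed", "Delayed"],
    "delay_bucket": ["0 days", "0 days", "1-2 days",
                     "3-5 days", ">5 days", ">5 days"],
    "review_score": [5, 4, 4, 3, 1, 2],
})
bucket_order = ["0 days", "1-2 days", "3-5 days", ">5 days"]

# 1. Average rating for on-time vs delayed orders.
avg_by_status = reviews.groupby("delivery_status")["review_score"].mean()

# 2 and 3. Average rating and share of low ratings (<3) per delay bucket.
by_bucket = (
    reviews.assign(low_rating=reviews["review_score"] < 3)
    .groupby("delay_bucket")
    .agg(
        n_orders=("review_score", "size"),
        avg_rating=("review_score", "mean"),
        pct_low=("low_rating", "mean"),
    )
    .reindex(bucket_order)  # keep buckets in severity order
)
by_bucket["pct_low"] *= 100

print(avg_by_status)
print(by_bucket)
```

Reporting `n_orders` next to each bucket helps readers judge whether a difference rests on enough orders to be meaningful.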


### Insight

Orders delivered more than a few days late show a markedly higher proportion
of low customer ratings (<3) than on-time orders. This indicates that delivery
performance is a critical driver of customer satisfaction.


## 7. Hypothesis 2: Seller-Level Concentration of Delivery Delays

**Business Question:**  
Are delivery delays driven by a small subset of sellers?


### Analysis Plan

- Calculate seller-wise delivery delay metrics
- Rank sellers by percentage of delayed orders
- Perform concentration analysis to identify high-impact sellers
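
One possible way to carry out these steps is a Pareto-style cumulative share of delayed orders by seller, sketched below on invented data. The `is_delayed` flag, the 80% coverage target, and the "top 25% of sellers" cut-off are illustrative assumptions, not thresholds defined by the project.

```python
import pandas as pd

# Invented toy data: one row per order with an assumed boolean delay flag.
orders = pd.DataFrame({
    "seller_id": ["s1"] * 4 + ["s2"] * 3 + ["s3"] * 2 + ["s4"],
    "is_delayed": [True, True, True, False,
                   True, False, False,
                   False, False,
                   False],
})

# Seller-wise delay metrics.
seller_stats = orders.groupby("seller_id").agg(
    n_orders=("is_delayed", "size"),
    n_delayed=("is_delayed", "sum"),
)
seller_stats["pct_delayed"] = (
    seller_stats["n_delayed"] / seller_stats["n_orders"] * 100
)

# Ranking by percentage of delayed orders.
ranked_by_rate = seller_stats.sort_values("pct_delayed", ascending=False)

# Concentration: cumulative share of all delayed orders, worst sellers first.
ranked = seller_stats.sort_values("n_delayed", ascending=False)
ranked["cum_share_of_delays"] = (
    ranked["n_delayed"].cumsum() / ranked["n_delayed"].sum()
)

# Share of delays attributable to the top 25% of sellers (illustrative cut-off).
n_top = max(1, int(len(ranked) * 0.25))
top_share = ranked["n_delayed"].head(n_top).sum() / ranked["n_delayed"].sum()

# Number of sellers needed to cover 80% of delays (illustrative target).
sellers_for_80 = int((ranked["cum_share_of_delays"] < 0.8).sum() + 1)

print(ranked)
print(top_share, sellers_for_80)
```

Ranking by delay rate and ranking by delay count answer different questions: the rate highlights unreliable sellers, while the count shows where most delayed orders actually originate. Sellers with very few orders can dominate the rate ranking, so a minimum order volume filter may be worth adding.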


### Insight

A relatively small group of sellers contributes disproportionately to delayed deliveries,
suggesting that targeted seller-level interventions could yield significant operational improvements.


## 8. Hypothesis 3: Geographic Concentration of Delivery Delays

**Business Question:**  
Are delivery delays concentrated in specific cities or states?


### Insight

Certain geographic regions consistently exhibit higher average delivery delays,
indicating potential regional logistics bottlenecks rather than isolated incidents.
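
A regional comparison of this kind could be produced with a group-by on state, as in the sketch below. The column name `customer_state`, the toy values, and the minimum-volume filter `MIN_ORDERS` are assumptions for illustration; whether region should be taken from the customer or the seller side is a design choice the analysis would need to settle.

```python
import pandas as pd

# Invented toy data; state codes and delays are illustrative only.
orders = pd.DataFrame({
    "customer_state": ["SP", "SP", "SP", "RJ", "RJ", "RJ", "AM"],
    "delivery_delay_days": [-1, 0, 1, 4, 6, -2, 12],
})

# Hypothetical threshold: ignore regions with too few orders to compare fairly.
MIN_ORDERS = 3

state_stats = (
    orders.assign(is_delayed=orders["delivery_delay_days"] > 0)
    .groupby("customer_state")
    .agg(
        n_orders=("delivery_delay_days", "size"),
        avg_delay_days=("delivery_delay_days", "mean"),
        pct_delayed=("is_delayed", "mean"),
    )
)
state_stats["pct_delayed"] *= 100

# Keep only regions with enough volume, worst delay rate first.
reliable = (
    state_stats[state_stats["n_orders"] >= MIN_ORDERS]
    .sort_values("pct_delayed", ascending=False)
)

print(reliable)
```

The volume filter guards against a single very late order making a low-volume region look like a systematic bottleneck.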


## 9. Hypothesis 4: Revenue at Risk Due to Delivery Delays

**Business Question:**  
How much revenue is associated with delayed deliveries?


### Insight

A meaningful portion of total revenue is generated from delayed orders,
highlighting a direct link between operational inefficiencies and financial risk.
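
Revenue at risk can be expressed as the share of total revenue that comes from delayed orders. The sketch below uses invented figures; the `order_status` column and the `"canceled"` status value are assumptions, included so that cancelled orders are left out of the revenue base as described in the data preparation summary.

```python
import pandas as pd

# Invented toy data; amounts are illustrative, not project results.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "order_status": ["delivered", "delivered", "delivered",
                     "delivered", "canceled"],
    "delivery_status": ["On-time", "Delayed", "Delayed",
                        "On-time", "On-time"],
    "order_revenue": [100.0, 50.0, 150.0, 200.0, 500.0],
})

# Cancelled orders are excluded from revenue calculations.
valid = orders[orders["order_status"] != "canceled"]

total_revenue = valid["order_revenue"].sum()
revenue_at_risk = valid.loc[
    valid["delivery_status"] == "Delayed", "order_revenue"
].sum()
pct_revenue_at_risk = revenue_at_risk / total_revenue * 100

print(f"Revenue at risk: {revenue_at_risk:.2f} "
      f"({pct_revenue_at_risk:.1f}% of {total_revenue:.2f})")
```

This figure measures revenue associated with delayed orders, not revenue actually lost; estimating losses would require additional data such as refunds or repeat-purchase behaviour.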


## 10. EDA Summary & Hypothesis Validation

| Hypothesis | Status | Summary |
|-----------|-------|--------|
| H1 | Supported | Ratings decline as delivery delays increase |
| H2 | Supported | Delays are concentrated among few sellers |
| H3 | Partially Supported | Regional delay patterns observed |
| H4 | Supported | Delayed orders represent revenue risk |

This EDA validates that delivery performance is a key lever for improving
customer satisfaction and protecting revenue.
