# Cleaning Orders Table

This notebook cleans and prepares the orders table for analysis.
Focus:
- Order lifecycle
- Delivery performance
- Valid timestamps

In [18]:
import pandas as pd
import numpy as np
orders = pd.read_csv("../data/raw/olist_orders_dataset.csv")
orders.head()

,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00


## Step 1: Initial Data Inspection

Check the size of the dataset to understand its scale
before any cleaning.

In [19]:
orders.shape

(99441, 8)

## Step 2: Validate Primary Key

Ensure that each order_id is unique.
Duplicate order IDs would indicate a serious data integrity issue.

In [20]:
orders["order_id"].duplicated().sum()

np.int64(0)
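
If the count were non-zero, the offending rows could be inspected before deciding how to deduplicate. A minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset:

```python
import pandas as pd

# Hypothetical toy data with one repeated order_id (not from the real dataset)
toy = pd.DataFrame({
    "order_id": ["a1", "b2", "a1", "c3"],
    "order_status": ["delivered", "shipped", "delivered", "delivered"],
})

# duplicated() with the default keep="first" counts only the repeats
n_dupes = int(toy["order_id"].duplicated().sum())

# keep=False marks every occurrence of a duplicated key, so all copies are visible
dupes = toy[toy["order_id"].duplicated(keep=False)]

print(n_dupes)
print(dupes)
```

Using `keep=False` for inspection shows all conflicting rows side by side, which helps when choosing which copy to keep.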

## Step 3: Parse Date Columns

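Convert the date columns needed for delivery analysis to datetime format
to enable time-based calculations.
With errors="coerce", values that cannot be parsed become NaT instead of raising an error.
The order_approved_at and order_delivered_carrier_date columns are not used below and are left as strings.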

In [None]:
date_cols = [
    "order_purchase_timestamp",
    "order_delivered_customer_date",
    "order_estimated_delivery_date"
]
for col in date_cols:
    orders[col] = pd.to_datetime(orders[col], errors="coerce")
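
Because `errors="coerce"` silently turns unparseable values into NaT, it can be worth counting how many values were lost in the conversion. A rough sketch on a made-up series; `raw` and its contents are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy column: one valid timestamp, one malformed string, one missing value
raw = pd.Series(["2017-10-02 10:56:33", "not a date", None])

parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the raw data but became NaT after parsing
n_coerced = int((raw.notna() & parsed.isna()).sum())

print(n_coerced)
```

In the notebook, the same comparison would need a copy of each column taken before the conversion loop overwrites it.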


## Step 4: Calculate Delivery Time

Delivery time is calculated as the number of whole days
between the purchase timestamp and the actual delivery date.
Orders without a delivery date get a missing value.
Negative values indicate invalid records.

In [None]:
orders["delivery_time_days"] = (
    orders["order_delivered_customer_date"]
    - orders["order_purchase_timestamp"]
).dt.days
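
Note that `.dt.days` keeps only the whole-day component of the difference. The sketch below contrasts it with a fractional alternative, using the purchase and delivery timestamps from the first row of the `head()` output above; the variable names are illustrative:

```python
import pandas as pd

# Timestamps of the first order shown in the head() output
purchase = pd.Series(pd.to_datetime(["2017-10-02 10:56:33"]))
delivered = pd.Series(pd.to_datetime(["2017-10-10 21:25:13"]))

delta = delivered - purchase

# Whole-day component only: 8 days, the remaining hours are dropped
whole_days = delta.dt.days

# Fractional days, if sub-day precision matters for the analysis
fractional_days = delta.dt.total_seconds() / 86400

print(whole_days.iloc[0])
print(round(fractional_days.iloc[0], 2))
```

Whole days are usually enough for summary statistics; the fractional version is an option if same-day deliveries need to be distinguished.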

## Step 5: Validate Delivery Time Consistency

The summary statistics show a minimum delivery time of 0 days, so no orders have a negative delivery duration.
This confirms logical consistency between purchase and delivery timestamps.

In [23]:
# Summary statistics; a negative minimum would reveal invalid delivery times
orders["delivery_time_days"].describe()

count    96476.000000
mean        12.094086
std          9.551746
min          0.000000
25%          6.000000
50%         10.000000
75%         15.000000
max        209.000000
Name: delivery_time_days, dtype: float64

The data is clean: the minimum is 0 days, so there are no negative delivery times.
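
A more direct check than reading the minimum off `describe()` is to count the negative values explicitly. A minimal sketch, with a made-up series standing in for `orders["delivery_time_days"]`:

```python
import pandas as pd

# Hypothetical toy values standing in for orders["delivery_time_days"]
delivery_time_days = pd.Series([8.0, 13.0, -1.0, None])

# Missing values compare as False, so they are not counted as negative
negative_mask = delivery_time_days < 0
n_negative = int(negative_mask.sum())

print(n_negative)
```

The same mask could be used to pull out the invalid rows for inspection if the count were ever non-zero.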

## Step 6: Create Delivery Delay Flag

Create a boolean flag to indicate whether an order
was delivered later than the estimated delivery date.
Orders without a delivery date are flagged False.
This will be used later to analyze customer satisfaction.

In [30]:
orders["is_delayed"] = (
    orders["order_delivered_customer_date"].notna() &
    (orders["order_delivered_customer_date"] > orders["order_estimated_delivery_date"])
)
orders["is_delayed"].value_counts(dropna=False)

is_delayed
False    91614
True      7827
Name: count, dtype: int64
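
Since the flag is also False for orders that have no delivery date, the share of delayed orders depends on which rows form the denominator. A small sketch of the difference on made-up data; the `toy` frame and the `rate_*` names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data: one on-time order, one late order, one never delivered
toy = pd.DataFrame({
    "order_delivered_customer_date": pd.to_datetime(
        ["2018-01-10", "2018-01-20", None]
    ),
    "order_estimated_delivery_date": pd.to_datetime(
        ["2018-01-15", "2018-01-15", "2018-01-15"]
    ),
})

has_delivery = toy["order_delivered_customer_date"].notna()
toy["is_delayed"] = has_delivery & (
    toy["order_delivered_customer_date"] > toy["order_estimated_delivery_date"]
)

# Delay rate over all orders vs. only orders that have a delivery date
rate_all = toy["is_delayed"].mean()
rate_delivered = toy.loc[has_delivery, "is_delayed"].mean()

print(rate_all, rate_delivered)
```

Restricting the denominator to orders with a delivery date avoids diluting the delay rate with orders that never arrived.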

## Step 7: Filter Delivered Orders Only

For delivery and customer satisfaction analysis, 
only orders that were successfully delivered are relevant.

Canceled and unavailable orders are excluded, 
as they do not represent completed customer experiences.

This filtered dataset will be used as the clean orders table
for all downstream analysis.

In [25]:
orders_clean = orders[
    orders["order_status"] == "delivered"
].copy()
orders_clean["order_status"].value_counts()

order_status
delivered    96478
Name: count, dtype: int64
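
A "delivered" status does not by itself guarantee that a delivery date is recorded, so it may be worth checking the filtered table for missing delivery dates. A minimal sketch on made-up rows; `toy` and `toy_clean` are hypothetical stand-ins for `orders` and `orders_clean`:

```python
import pandas as pd

# Hypothetical toy data: the second delivered order has no delivery date
toy = pd.DataFrame({
    "order_status": ["delivered", "delivered", "canceled"],
    "order_delivered_customer_date": pd.to_datetime(["2018-01-10", None, None]),
})

toy_clean = toy[toy["order_status"] == "delivered"].copy()

# Delivered orders that still lack a delivery timestamp
n_missing = int(toy_clean["order_delivered_customer_date"].isna().sum())

print(n_missing)
```

Such rows would carry a missing `delivery_time_days` and a False `is_delayed`, so downstream analysis may want to handle them explicitly.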

## Step 8: Save Clean Orders Table

Save the cleaned and filtered orders dataset.
This table represents the final, reusable version
of orders data for revenue and customer satisfaction projects.

In [26]:
orders_clean.to_csv("../data/processed/orders_clean.csv", index=False)
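
CSV does not store column types, so the parsed date columns come back as plain text unless they are parsed again on load. A rough sketch of the round trip, using an in-memory buffer and a made-up one-row frame instead of the real file:

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for orders_clean
toy = pd.DataFrame({
    "order_id": ["a1"],
    "order_purchase_timestamp": pd.to_datetime(["2017-10-02 10:56:33"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the timestamp column is loaded as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load
buffer.seek(0)
reparsed = pd.read_csv(buffer, parse_dates=["order_purchase_timestamp"])

print(plain.dtypes)
print(reparsed.dtypes)
```

Downstream notebooks that read `orders_clean.csv` could pass the same `parse_dates` list to avoid repeating the conversion step.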