# Cleaning Order Items Table

This notebook cleans and prepares the `order_items` table for analysis.

Focus:
- Item-level revenue
- Pricing and freight validation
- Preparing data for a master fact table

The cleaned output of this notebook will be used to analyze:
- Revenue sources
- Top-performing product categories

## Step 1: Load and Inspect the Dataset

Load the raw `order_items` dataset and perform an initial inspection
to understand its structure and scale.

In [37]:
import pandas as pd
import numpy as np

order_items = pd.read_csv("../data/raw/olist_order_items_dataset.csv")
order_items.head()

,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14


## Step 2: Dataset Size Check

Check the number of rows and columns before cleaning.
This helps assess the impact of later filtering steps.

In [38]:
order_items.shape

(112650, 7)
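
To make the impact of each later step visible, row counts can be recorded as the table moves through the notebook. The sketch below uses a hypothetical helper, `log_shape`, and toy data; neither is part of the original notebook.

```python
import pandas as pd


def log_shape(df, step, history):
    """Append (step, n_rows, n_cols) to history and return df unchanged.

    Hypothetical helper for illustration only.
    """
    history.append((step, df.shape[0], df.shape[1]))
    return df


history = []

# Toy data (illustrative): one row has a non-positive price.
toy = pd.DataFrame({"price": [10.0, -1.0, 5.0]})
toy = log_shape(toy, "loaded", history)
toy = log_shape(toy[toy["price"] > 0], "price filter", history)

# Rows lost between the first and the last recorded step.
rows_dropped = history[0][1] - history[-1][1]
print(history, rows_dropped)
```

Recording the shape after every reassignment of the table, including merges, shows which step is responsible for a change in row count.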

## Step 3: Validate Logical Primary Key

The `order_items` table has no single-column primary key.
Instead, the logical key is the combination of:
- `order_id`
- `order_item_id`

This combination must be unique.

In [39]:
order_items.duplicated(
    subset=["order_id", "order_item_id"]
).sum()

np.int64(0)
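
The count above is zero, so nothing further is needed here. If a future extract did contain duplicate keys, a sketch like the following could list the offending rows and drop the extras. The toy data and the keep-first policy are assumptions for illustration, not something the dataset requires.

```python
import pandas as pd

# Toy data (illustrative, not from the Olist dataset): the key ("A", 1) appears twice.
toy_items = pd.DataFrame({
    "order_id": ["A", "A", "A", "B"],
    "order_item_id": [1, 1, 2, 1],
    "price": [10.0, 10.0, 5.0, 7.5],
})

key = ["order_id", "order_item_id"]

# keep=False marks every row that shares its key with another row,
# so all copies are listed rather than only the later ones.
dup_rows = toy_items[toy_items.duplicated(subset=key, keep=False)]

# Assumed policy: keep the first occurrence of each key.
deduped = toy_items.drop_duplicates(subset=key, keep="first")

print(dup_rows)
print(deduped)
```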

## Step 4: Validate Pricing and Freight Values

Prices must be strictly positive.
Freight values must not be negative.

The cell below counts the records that violate either rule.
None are found, so no rows need to be removed.

In [40]:
order_items[
    (order_items["price"] <= 0) |
    (order_items["freight_value"] < 0)
].shape

(0, 7)
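
Had invalid rows been present, they could be dropped with the negated mask. The sketch below shows this on toy values (illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy data (illustrative): row 1 has a zero price, row 2 a negative freight value.
toy_items = pd.DataFrame({
    "price": [58.9, 0.0, 199.0, 12.99],
    "freight_value": [13.29, 5.0, -1.0, 0.0],
})

# Same rules as above: price must be > 0, freight must be >= 0.
invalid_mask = (toy_items["price"] <= 0) | (toy_items["freight_value"] < 0)
n_invalid = int(invalid_mask.sum())

# Keep only the rows that pass both rules.
valid_items = toy_items.loc[~invalid_mask].copy()

print(n_invalid)
print(valid_items)
```

Note that comparisons against `NaN` evaluate to `False`, so a missing price or freight value is not flagged by this mask. Missing values need a separate `isna()` check.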

## Step 5: Parse Shipping Limit Date

Convert `shipping_limit_date` to datetime format
to enable time-based validation.

In [41]:
order_items["shipping_limit_date"] = pd.to_datetime(
    order_items["shipping_limit_date"],
    errors="coerce"
)
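
With `errors="coerce"`, values that cannot be parsed become `NaT` without raising an error. A minimal sketch for counting such failures, using made-up strings rather than real data:

```python
import pandas as pd

# Toy values (illustrative): one valid timestamp, one unparseable string, one missing value.
raw_dates = pd.Series(["2017-09-19 09:45:35", "not a date", None])

parsed = pd.to_datetime(raw_dates, errors="coerce")

# A value that was present before parsing but is NaT afterwards failed to parse.
failed = raw_dates.notna() & parsed.isna()
n_failed = int(failed.sum())

print(n_failed)
```

Comparing the raw and parsed columns this way separates parse failures from values that were already missing in the source file.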

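## Step 6a: Merge with Clean Orders to Bring In the Purchase Timestamp

`order_items` does not include `order_purchase_timestamp`.
To validate the shipping timeline, we merge with `orders_clean`
to bring the purchase timestamp into the dataset.

Because this is an inner join, items whose `order_id` is not present in
`orders_clean` are dropped: the row count falls from 112,650 (Step 2)
to 110,197 (the starting count in Step 6c).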

In [42]:
orders_clean = pd.read_csv("../data/processed/orders_clean.csv")

order_items = order_items.merge(
    orders_clean[["order_id", "order_purchase_timestamp"]],
    on="order_id",
    how="inner"
)

order_items[["order_id", "shipping_limit_date", "order_purchase_timestamp"]].head()

,order_id,shipping_limit_date,order_purchase_timestamp
0,00010242fe8c5a6d1ba2dd792cb16214,2017-09-19 09:45:35,2017-09-13 08:59:02
1,00018f77f2f0320c557190d7a144bdd3,2017-05-03 11:05:13,2017-04-26 10:53:06
2,000229ec398224ef6ca0657da4fc703e,2018-01-18 14:48:30,2018-01-14 14:33:31
3,00024acbcdf0a6daa1e931b038114c75,2018-08-15 10:10:18,2018-08-08 10:00:35
4,00042b26cf59d7ce69dfabb4e55b4fd9,2017-02-13 13:57:51,2017-02-04 13:57:51
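
An inner join drops unmatched rows without reporting them. One way to make that loss visible is to run a left join with `indicator=True` first and then filter. The sketch below uses toy frames (illustrative); `validate="many_to_one"` additionally raises an error if `order_id` is duplicated on the orders side.

```python
import pandas as pd

# Toy data (illustrative): order "C" has an item but no row in the orders table.
toy_items = pd.DataFrame({
    "order_id": ["A", "A", "B", "C"],
    "order_item_id": [1, 2, 1, 1],
})
toy_orders = pd.DataFrame({
    "order_id": ["A", "B"],
    "order_purchase_timestamp": ["2017-09-13 08:59:02", "2017-04-26 10:53:06"],
})

# The left join keeps every item row; the _merge column records whether a match was found.
checked = toy_items.merge(
    toy_orders,
    on="order_id",
    how="left",
    indicator=True,
    validate="many_to_one",
)

n_unmatched = int((checked["_merge"] == "left_only").sum())

# Dropping the unmatched rows gives the same rows as an inner join.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(n_unmatched)
print(matched)
```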


## Step 6b: Ensure Datetime Types Before Comparison

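Convert `order_purchase_timestamp` to datetime so that both date columns
have the same type (`shipping_limit_date` was already converted in Step 5).
This avoids type-mismatch errors in the comparison.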

In [43]:
order_items["order_purchase_timestamp"] = pd.to_datetime(
    order_items["order_purchase_timestamp"],
    errors="coerce"
)

## Step 6c: Validate Shipping Timeline

Shipping limit date should not be earlier than the purchase timestamp.
We remove records that violate this rule, while also excluding rows with missing dates.

In [44]:
before_shape = order_items.shape

order_items = order_items[
    order_items["shipping_limit_date"].notna() &
    order_items["order_purchase_timestamp"].notna() &
    (order_items["shipping_limit_date"] >= order_items["order_purchase_timestamp"])
]

after_shape = order_items.shape
before_shape, after_shape

((110197, 8), (110197, 8))
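
The combined filter reports only the total number of rows removed. To see why rows would be removed, the two conditions can be counted separately, as in this sketch on toy dates (illustrative values, not from the dataset).

```python
import pandas as pd

# Toy data (illustrative): row 1 has the limit before the purchase,
# rows 2 and 3 each have one missing date.
toy = pd.DataFrame({
    "shipping_limit_date": pd.to_datetime(
        ["2017-09-19", "2017-05-01", None, "2018-01-18"]
    ),
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-09-13", "2017-05-03", "2018-01-01", None]
    ),
})

missing_date = (
    toy["shipping_limit_date"].isna() | toy["order_purchase_timestamp"].isna()
)

# Comparisons involving NaT evaluate to False, so this flags only rows
# where both dates are present and the limit precedes the purchase.
limit_before_purchase = toy["shipping_limit_date"] < toy["order_purchase_timestamp"]

removal_counts = {
    "missing_date": int(missing_date.sum()),
    "limit_before_purchase": int(limit_before_purchase.sum()),
}

kept = toy[~missing_date & ~limit_before_purchase]

print(removal_counts)
print(len(kept))
```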

## Step 6c Result: Shipping Timeline Validation Outcome

The timeline validation removed no records: the table has 110,197 rows both before and after the filter.
This indicates that neither date is missing and that `shipping_limit_date` is always
on or after the purchase timestamp for all delivered orders included in this dataset.

## Step 7: Save Clean Order Items Table

Save the cleaned dataset for use in the master fact table.

Note:
`order_purchase_timestamp` was added via merge with `orders_clean`
to support timeline validation and will be kept for downstream analysis.

In [45]:
order_items_clean = order_items.copy()

order_items_clean.to_csv(
    "../data/processed/order_items_clean.csv",
    index=False
)
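
CSV files do not store column types, so the datetime columns are read back as strings unless they are parsed again. A minimal sketch of a round trip, using an in-memory buffer and a toy row in place of the real file:

```python
import io

import pandas as pd

# Toy row (illustrative) with the two datetime columns from this notebook.
toy_clean = pd.DataFrame({
    "order_id": ["A"],
    "shipping_limit_date": pd.to_datetime(["2017-09-19 09:45:35"]),
    "order_purchase_timestamp": pd.to_datetime(["2017-09-13 08:59:02"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_clean.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when the file is read back.
date_cols = ["shipping_limit_date", "order_purchase_timestamp"]
reloaded = pd.read_csv(buffer, parse_dates=date_cols)

print(reloaded.dtypes)
```

Downstream notebooks that read `order_items_clean.csv` could pass the same `parse_dates` list to `pd.read_csv`.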