# Master Dataset Construction  
### Project: E-commerce Product & Customer Analysis (Olist Dataset)
### Author: Sayeh Nelson

This notebook merges all cleaned datasets into a single master analytical table.  
The output will serve as the foundation for exploratory data analysis, customer behavior insights, funnel analysis, and product-level performance metrics.

## Objectives
- Import all cleaned tables
- Validate row counts and primary keys
- Merge tables step-by-step with clear logic
- Engineer key features (delivery time, review delays, revenue, etc.)
- Export final master dataset for use in EDA and Tableau
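
The delivery features named above (`delivery_time_days`, `delivery_delays_days`, `approved_lag_hours`) arrive precomputed in `orders_clean.csv`, and this notebook does not show how they were derived. The sketch below is one derivation that reproduces the first order in the previews further down; treating it as the upstream method is an assumption, and `orders_demo` is a hypothetical one-row frame.

```python
import pandas as pd

# One order copied from the previews in this notebook (hypothetical demo frame).
orders_demo = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33"],
    "order_approved_at": ["2017-10-02 11:07:15"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13"],
    "order_estimated_delivery_date": ["2017-10-18"],
})

# Parse every timestamp column so that subtraction yields timedeltas.
for col in list(orders_demo.columns):
    orders_demo[col] = pd.to_datetime(orders_demo[col])

purchased = orders_demo["order_purchase_timestamp"]
delivered = orders_demo["order_delivered_customer_date"]

# Whole days between purchase and delivery (floored).
orders_demo["delivery_time_days"] = (delivered - purchased).dt.days

# Negative values mean the order arrived before the estimated date.
orders_demo["delivery_delays_days"] = (
    delivered - orders_demo["order_estimated_delivery_date"]
).dt.days

# Hours between purchase and payment approval.
orders_demo["approved_lag_hours"] = (
    orders_demo["order_approved_at"] - purchased
).dt.total_seconds() / 3600

print(orders_demo[["delivery_time_days", "delivery_delays_days", "approved_lag_hours"]])
```

Because `.dt.days` floors toward negative infinity, an order delivered about 7.1 days early is reported as -8, which matches the `-8.0` shown for this order in the previews.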


In [1]:
import pandas as pd
import numpy as np

# Load cleaned tables
customers = pd.read_csv("../data/clean/customers_clean.csv")
orders = pd.read_csv("../data/clean/orders_clean.csv")
order_items = pd.read_csv("../data/clean/order_items_clean.csv")
products = pd.read_csv("../data/clean/products_clean.csv")
payments = pd.read_csv("../data/clean/payments_clean.csv")
reviews_clean = pd.read_csv("../data/clean/reviews_clean.csv")
sellers = pd.read_csv("../data/clean/sellers_clean.csv")
geo_clean = pd.read_csv("../data/clean/geolocation_clean.csv")


In [2]:
# Confirm that every table loaded and inspect its shape and columns

dfs = {
    "customers": customers,
    "orders": orders,
    "order_items": order_items,
    "products": products,
    "payments": payments,
    "reviews": reviews_clean,
    "sellers": sellers,
    "geo_clean": geo_clean}

for name, df in dfs.items():
    print(f"{name}: {df.shape} rows, {df.columns.tolist()}")


customers: (99441, 5) rows, ['customer_id', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state']
orders: (99441, 11) rows, ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date', 'delivery_time_days', 'approved_lag_hours', 'delivery_delays_days']
order_items: (112650, 8) rows, ['order_id', 'order_item_id', 'product_id', 'seller_id', 'shipping_limit_date', 'price', 'freight_value', 'total_price']
products: (32951, 11) rows, ['product_id', 'product_category_name', 'product_name_length', 'product_description_length', 'product_photos_qty', 'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm', 'product_category_name_english', 'volume_cm3']
payments: (103886, 5) rows, ['order_id', 'payment_sequential', 'payment_type', 'payment_installments', 'payment_value']
reviews: (98410, 7) rows, ['review_id', 'o

Looks good  :D
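
The cell above prints shapes and column names but does not test key uniqueness directly. A minimal sketch of such a check follows; `key_report` and `items_demo` are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def key_report(df, key_cols):
    """Summarise whether key_cols uniquely identify the rows of df."""
    return {
        "rows": len(df),
        # Rows whose key repeats an earlier row's key.
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
        # Rows with a missing value in any key column.
        "null_keys": int(df[key_cols].isna().any(axis=1).sum()),
    }

# Toy stand-in for order_items: order "a" has two items.
items_demo = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "order_item_id": [1, 2, 1],
})

single = key_report(items_demo, ["order_id"])
composite = key_report(items_demo, ["order_id", "order_item_id"])
print(single)
print(composite)
```

On the real tables, the same call could be run with `["order_id"]` for `orders` and with `["order_id", "order_item_id"]` for `order_items`, where `order_id` alone is expected to repeat.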

## Summary Table

| Merge Step | Key Column | Relationship | Why This Key? | Result |
|------------|------------|--------------|----------------|--------|
| customers → orders | customer_id | 1:M | Customers place orders | Adds customer info to each order |
| orders → items | order_id | 1:M | Orders contain items | Expands to product-level rows |
| items → products | product_id | M:1 | Items reference products | Adds product attributes |
| items → sellers | seller_id | M:1 | Each item has a seller | Adds seller details |
| customers → geo | zip_code_prefix | M:1 | Zip determines location | Adds lat/lng + city/state |
| orders → payments | order_id | 1:M | Orders may have multiple payments | Adds financial info |
| orders → reviews | order_id | 1:1 intended (the merge below still adds 360 rows, so some orders carry more than one review) | Each order gets a final review | Adds satisfaction score |
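
The relationships in the table can be enforced at merge time rather than checked afterwards. The sketch below uses the `validate` argument of `DataFrame.merge` on hypothetical toy frames (`items_demo`, `products_ok`, `products_dup`); it is an illustration, not the approach taken in the cells that follow.

```python
import pandas as pd

items_demo = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "product_id": ["p1", "p2", "p1"],
})
products_ok = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "category": ["auto", "toys"],
})
# Same lookup table, but with product p1 accidentally duplicated.
products_dup = pd.DataFrame({
    "product_id": ["p1", "p1", "p2"],
    "category": ["auto", "auto", "toys"],
})

# M:1 merge: passes because product_id is unique on the right side.
merged = items_demo.merge(
    products_ok, on="product_id", how="left", validate="many_to_one"
)

# The duplicated lookup violates M:1, so pandas raises MergeError.
try:
    items_demo.merge(
        products_dup, on="product_id", how="left", validate="many_to_one"
    )
    caught = False
except pd.errors.MergeError:
    caught = True

print(len(merged), caught)
```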


_____________

In [5]:
orders['order_id'].nunique()


99441

In [4]:
order_items['order_id'].nunique()

98666
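
The two counts above differ by 775 (99,441 - 98,666), which is the number of orders with no rows in `order_items`. One way to list such orders is a left merge with `indicator=True`; the frames below are hypothetical toy data.

```python
import pandas as pd

orders_demo = pd.DataFrame({"order_id": ["a", "b", "c"]})
items_demo = pd.DataFrame({
    "order_id": ["a", "a", "c"],
    "order_item_id": [1, 2, 1],
})

# Join against the distinct order_ids that have at least one item.
flagged = orders_demo.merge(
    items_demo[["order_id"]].drop_duplicates(),
    on="order_id",
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

orders_without_items = flagged.loc[
    flagged["_merge"] == "left_only", "order_id"
].tolist()
print(orders_without_items)
```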

In [7]:
# Merge: orders + customers (orders is the left table)

master = orders.merge(customers, on='customer_id', how='left')

In [8]:
print("Rows:", master.shape[0])
print("Columns:", master.shape[1])
print("Unique orders:", master['order_id'].nunique())
print("Unique customers:", master['customer_id'].nunique())
print(master.isna().sum().sort_values(ascending=False).head(10))
master.head()


Rows: 99441
Columns: 15
Unique orders: 99441
Unique customers: 99441
order_delivered_customer_date    2965
delivery_time_days               2965
delivery_delays_days             2965
order_delivered_carrier_date     1783
order_approved_at                 160
approved_lag_hours                160
order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
dtype: int64


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delivery_time_days,approved_lag_hours,delivery_delays_days,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,-8.0,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,13.0,30.713889,-6.0,af07308b275d755c9edb36a90c618231,47813,barreiras,BA
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,9.0,0.276111,-18.0,3a653a41f6f9fc3d2a113cf8398680e8,75265,vianopolis,GO
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15,13.0,0.298056,-13.0,7c142cf63193a1473d2e66489a9ae977,59296,sao goncalo do amarante,RN
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26,2.0,1.030556,-10.0,72632f0f9dd73dfee390c9b22eb56dd6,9195,santo andre,SP


In [9]:
# Merge: master + order_items

master = master.merge(order_items, on='order_id', how='left')

In [10]:
print("Rows:", master.shape[0])
print("Columns:", master.shape[1])

print("\nUnique orders:", master['order_id'].nunique())
print("Unique customers:", master['customer_id'].nunique())

print("\nMissing values:")
print(master.isna().sum().sort_values(ascending=False).head(10))

master.head()


Rows: 113425
Columns: 22

Unique orders: 99441
Unique customers: 99441

Missing values:
order_delivered_customer_date    3229
delivery_time_days               3229
delivery_delays_days             3229
order_delivered_carrier_date     1968
total_price                       775
freight_value                     775
price                             775
shipping_limit_date               775
seller_id                         775
product_id                        775
dtype: int64


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delivery_time_days,approved_lag_hours,...,customer_zip_code_prefix,customer_city,customer_state,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,total_price
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,3149,sao paulo,SP,1.0,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,38.71
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,13.0,30.713889,...,47813,barreiras,BA,1.0,595fac2a385ac33a80bd5114aec74eb8,289cdb325fb7e7f891c38608bf9e0962,2018-07-30 03:24:27,118.7,22.76,141.46
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,9.0,0.276111,...,75265,vianopolis,GO,1.0,aa4383b373c6aca5d8797843e5594415,4869f7a5dfa277a7dca6462dcf3b52b2,2018-08-13 08:55:23,159.9,19.22,179.12
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15,13.0,0.298056,...,59296,sao goncalo do amarante,RN,1.0,d0b61bfb1de832b15ba9d266ca96e5b0,66922902710d126a0e7d26b0e3805106,2017-11-23 19:45:59,45.0,27.2,72.2
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26,2.0,1.030556,...,9195,santo andre,SP,1.0,65266b2da20d04dbe00c5c2d3bb7859e,2c9e548be18521d1c43cde1c582c6de8,2018-02-19 20:31:37,19.9,8.72,28.62


In [12]:
# Merge: master + products (on product_id)

master = master.merge(products, on='product_id', how='left')

In [13]:
print("Rows:", master.shape[0])
print("Columns:", master.shape[1])

print("\nUnique orders:", master['order_id'].nunique())
print("Unique customers:", master['customer_id'].nunique())
print("Unique products:", master['product_id'].nunique())

print("\nMissing values:")
print(master.isna().sum().sort_values(ascending=False).head(10))

master.head()


Rows: 113425
Columns: 32

Unique orders: 99441
Unique customers: 99441
Unique products: 32951

Missing values:
delivery_delays_days             3229
order_delivered_customer_date    3229
delivery_time_days               3229
product_category_name_english    2378
product_category_name            2378
product_name_length              2378
product_description_length       2378
product_photos_qty               2378
order_delivered_carrier_date     1968
product_weight_g                  801
dtype: int64


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delivery_time_days,approved_lag_hours,...,product_category_name,product_name_length,product_description_length,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_category_name_english,volume_cm3
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,utilidades_domesticas,40.0,268.0,4.0,500.0,19.0,8.0,13.0,housewares,1976.0
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,13.0,30.713889,...,perfumaria,29.0,178.0,1.0,400.0,19.0,13.0,19.0,perfumery,4693.0
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,9.0,0.276111,...,automotivo,46.0,232.0,1.0,420.0,24.0,19.0,21.0,auto,9576.0
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15,13.0,0.298056,...,pet_shop,59.0,468.0,3.0,450.0,30.0,10.0,20.0,pet_shop,6000.0
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26,2.0,1.030556,...,papelaria,38.0,316.0,4.0,250.0,51.0,15.0,15.0,stationery,11475.0


In [14]:
# Merge: master + sellers (on seller_id)

master = master.merge(sellers, on='seller_id', how='left')

In [15]:
print("Rows:", master.shape[0])
print("Columns:", master.shape[1])

print("\nUnique orders:", master['order_id'].nunique())
print("Unique customers:", master['customer_id'].nunique())
print("Unique sellers:", master['seller_id'].nunique())

print("\nMissing values:")
print(master.isna().sum().sort_values(ascending=False).head(10))

master.head()


Rows: 113425
Columns: 35

Unique orders: 99441
Unique customers: 99441
Unique sellers: 3095

Missing values:
order_delivered_customer_date    3229
delivery_time_days               3229
delivery_delays_days             3229
product_photos_qty               2378
product_category_name            2378
product_name_length              2378
product_category_name_english    2378
product_description_length       2378
order_delivered_carrier_date     1968
product_weight_g                  801
dtype: int64


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delivery_time_days,approved_lag_hours,...,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_category_name_english,volume_cm3,seller_zip_code_prefix,seller_city,seller_state
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,4.0,500.0,19.0,8.0,13.0,housewares,1976.0,9350.0,maua,SP
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,13.0,30.713889,...,1.0,400.0,19.0,13.0,19.0,perfumery,4693.0,31570.0,belo horizonte,SP
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,9.0,0.276111,...,1.0,420.0,24.0,19.0,21.0,auto,9576.0,14840.0,guariba,SP
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15,13.0,0.298056,...,3.0,450.0,30.0,10.0,20.0,pet_shop,6000.0,31842.0,belo horizonte,MG
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26,2.0,1.030556,...,4.0,250.0,51.0,15.0,15.0,stationery,11475.0,8752.0,mogi das cruzes,SP


In [16]:
# Merge: master + geo_clean

master = master.merge(geo_clean,
                      left_on='customer_zip_code_prefix',
                      right_on='geolocation_zip_code_prefix',
                      how='left')

In [17]:
print("Rows:", master.shape[0])
print("Columns:", master.shape[1])

print("\nUnique orders:", master['order_id'].nunique())
print("Unique customers:", master['customer_id'].nunique())

print("\nMissing values:")
print(master.isna().sum().sort_values(ascending=False).head(15))

master.head()


Rows: 113425
Columns: 40

Unique orders: 99441
Unique customers: 99441

Missing values:
order_delivered_customer_date    3229
delivery_time_days               3229
delivery_delays_days             3229
product_category_name_english    2378
product_photos_qty               2378
product_description_length       2378
product_name_length              2378
product_category_name            2378
order_delivered_carrier_date     1968
product_weight_g                  801
volume_cm3                        793
product_width_cm                  793
product_height_cm                 793
product_length_cm                 793
seller_zip_code_prefix            775
dtype: int64


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delivery_time_days,approved_lag_hours,...,product_category_name_english,volume_cm3,seller_zip_code_prefix,seller_city,seller_state,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,housewares,1976.0,9350.0,maua,SP,3149.0,-23.57617,-46.587276,sao paulo,SP
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,13.0,30.713889,...,perfumery,4693.0,31570.0,belo horizonte,SP,47813.0,-12.126651,-45.008162,barreiras,BA
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,9.0,0.276111,...,auto,9576.0,14840.0,guariba,SP,75265.0,-16.744472,-48.514624,vianopolis,GO
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15,13.0,0.298056,...,pet_shop,6000.0,31842.0,belo horizonte,MG,59296.0,-5.774611,-35.273916,sao goncalo do amarante,RN
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26,2.0,1.030556,...,stationery,11475.0,8752.0,mogi das cruzes,SP,9195.0,-23.675316,-46.515116,santo andre,SP
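
The row count stays at 113,425 after this merge, which is what a lookup with one row per zip prefix produces. How `geo_clean` was reduced to that shape is not shown here; the sketch below is one possible reduction on hypothetical raw rows, averaging coordinates per prefix.

```python
import pandas as pd

# Hypothetical raw geolocation rows: prefix 3149 appears twice.
geo_raw_demo = pd.DataFrame({
    "geolocation_zip_code_prefix": [3149, 3149, 47813],
    "geolocation_lat": [-23.57, -23.59, -12.12],
    "geolocation_lng": [-46.58, -46.60, -45.00],
    "geolocation_city": ["sao paulo", "sao paulo", "barreiras"],
    "geolocation_state": ["SP", "SP", "BA"],
})

# Collapse to one row per prefix: mean coordinates, first city/state seen.
geo_one_per_prefix = (
    geo_raw_demo.groupby("geolocation_zip_code_prefix", as_index=False)
    .agg(
        geolocation_lat=("geolocation_lat", "mean"),
        geolocation_lng=("geolocation_lng", "mean"),
        geolocation_city=("geolocation_city", "first"),
        geolocation_state=("geolocation_state", "first"),
    )
)
print(geo_one_per_prefix)
```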


In [18]:
# Merge: master + payments

master = master.merge(payments, on='order_id', how='left')

In [19]:
print("Rows:", master.shape[0])
print("Columns:", master.shape[1])

print("\nUnique orders:", master['order_id'].nunique())
print("Unique customers:", master['customer_id'].nunique())
print("Unique payment types:", master['payment_type'].nunique())

print("\nMissing values:")
print(master.isna().sum().sort_values(ascending=False).head(12))

master.head()


Rows: 118434
Columns: 44

Unique orders: 99441
Unique customers: 99441
Unique payment types: 5

Missing values:
order_delivered_customer_date    3397
delivery_time_days               3397
delivery_delays_days             3397
product_category_name            2528
product_name_length              2528
product_description_length       2528
product_category_name_english    2528
product_photos_qty               2528
order_delivered_carrier_date     2074
product_weight_g                  858
product_length_cm                 850
product_height_cm                 850
dtype: int64


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delivery_time_days,approved_lag_hours,...,seller_state,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state,payment_sequential,payment_type,payment_installments,payment_value
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,SP,3149.0,-23.57617,-46.587276,sao paulo,SP,1.0,credit_card,1.0,18.12
1,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,SP,3149.0,-23.57617,-46.587276,sao paulo,SP,3.0,voucher,1.0,2.0
2,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,SP,3149.0,-23.57617,-46.587276,sao paulo,SP,2.0,voucher,1.0,18.59
3,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,13.0,30.713889,...,SP,47813.0,-12.126651,-45.008162,barreiras,BA,1.0,boleto,1.0,141.46
4,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,9.0,0.276111,...,SP,75265.0,-16.744472,-48.514624,vianopolis,GO,1.0,credit_card,3.0,179.12
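
Merging payments at row level grows the table from 113,425 to 118,434 rows, because every item row is repeated once per payment record (the first order above now appears three times). An alternative, sketched below on toy data taken from the preview, is to aggregate payments to one row per order before merging; the column names `payment_total`, `payment_count`, and `max_installments` are hypothetical.

```python
import pandas as pd

# Payment rows for the first two orders in the preview (ids shortened).
payments_demo = pd.DataFrame({
    "order_id": ["e481", "e481", "e481", "53cd"],
    "payment_sequential": [1, 3, 2, 1],
    "payment_type": ["credit_card", "voucher", "voucher", "boleto"],
    "payment_installments": [1, 1, 1, 1],
    "payment_value": [18.12, 2.00, 18.59, 141.46],
})

# One row per order: total paid, number of payment records, max installments.
payments_per_order = (
    payments_demo.groupby("order_id", as_index=False)
    .agg(
        payment_total=("payment_value", "sum"),
        payment_count=("payment_sequential", "count"),
        max_installments=("payment_installments", "max"),
    )
)
print(payments_per_order)
```

For the first order the three payments sum to 38.71, which equals its `total_price` in the item preview; an order-level payments table merges M:1 and leaves the row count unchanged.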


In [20]:
# Final Merge: master + reviews_clean

master = master.merge(reviews_clean, on='order_id', how='left')

In [21]:
print("Rows:", master.shape[0])
print("Columns:", master.shape[1])

print("\nUnique orders:", master['order_id'].nunique())
print("Unique customers:", master['customer_id'].nunique())
print("Unique reviews:", master['review_id'].nunique())

print("\nMissing values:")
print(master.isna().sum().sort_values(ascending=False).head(15))

master.head()


Rows: 118794
Columns: 50

Unique orders: 99441
Unique customers: 99441
Unique reviews: 98410

Missing values:
review_comment_title             104858
review_comment_message            68924
order_delivered_customer_date      3413
delivery_time_days                 3413
delivery_delays_days               3413
product_photos_qty                 2536
product_category_name              2536
product_category_name_english      2536
product_description_length         2536
product_name_length                2536
order_delivered_carrier_date       2081
review_creation_date               1625
review_score                       1625
review_id                          1625
review_answer_timestamp            1625
dtype: int64


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delivery_time_days,approved_lag_hours,...,payment_sequential,payment_type,payment_installments,payment_value,review_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,1.0,credit_card,1.0,18.12,a54f0611adc9ed256b57ede6b6eb5114,4.0,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48
1,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,3.0,voucher,1.0,2.0,a54f0611adc9ed256b57ede6b6eb5114,4.0,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48
2,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8.0,0.178333,...,2.0,voucher,1.0,18.59,a54f0611adc9ed256b57ede6b6eb5114,4.0,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48
3,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,13.0,30.713889,...,1.0,boleto,1.0,141.46,8d5266042046a06655c8db133d120ba5,4.0,Muito boa a loja,Muito bom o produto.,2018-08-08,2018-08-08 18:37:50
4,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,9.0,0.276111,...,1.0,credit_card,3.0,179.12,e73b67b67587f7644d5bd1a52deb1b01,5.0,,,2018-08-18,2018-08-22 19:07:58
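
The reviews merge adds 360 rows (118,434 to 118,794), which suggests that some `order_id` values appear more than once in `reviews_clean`. If one review per order is wanted, a possible sketch is to keep the most recently answered review; the tie-breaking rule and the toy frame are assumptions.

```python
import pandas as pd

# Hypothetical reviews: order "a" was reviewed twice.
reviews_demo = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["a", "a", "b"],
    "review_score": [2, 5, 4],
    "review_answer_timestamp": [
        "2018-01-01 10:00:00",
        "2018-01-05 09:00:00",
        "2018-02-01 12:00:00",
    ],
})
reviews_demo["review_answer_timestamp"] = pd.to_datetime(
    reviews_demo["review_answer_timestamp"]
)

# Sort oldest to newest, then keep the last (newest) review per order.
latest_review = (
    reviews_demo.sort_values("review_answer_timestamp")
    .drop_duplicates(subset="order_id", keep="last")
    .reset_index(drop=True)
)
print(latest_review)
```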




In [23]:
master['product_id'].isna().sum()

831

**Note:** Drop the 831 rows with a missing `product_id`; they belong to orders that have no items.

In [24]:
master = master[~master['product_id'].isna()]

In [25]:
master.shape

(117963, 50)
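
The 831 dropped rows are not 831 orders: 775 orders have no items, and the payments and reviews merges repeated some of them. A small sketch that reports both counts before filtering, using a hypothetical toy frame:

```python
import pandas as pd

# Order "b" has no items but appears twice (for example, two payments).
master_demo = pd.DataFrame({
    "order_id": ["a", "a", "b", "b", "c"],
    "product_id": ["p1", "p2", None, None, "p3"],
})

missing = master_demo["product_id"].isna()
rows_dropped = int(missing.sum())
orders_dropped = master_demo.loc[missing, "order_id"].nunique()

# Equivalent to master[~master['product_id'].isna()].
master_kept = master_demo.dropna(subset=["product_id"])
print(rows_dropped, orders_dropped, master_kept.shape)
```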

In [26]:
master.columns

Index(['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp',
       'order_approved_at', 'order_delivered_carrier_date',
       'order_delivered_customer_date', 'order_estimated_delivery_date',
       'delivery_time_days', 'approved_lag_hours', 'delivery_delays_days',
       'customer_unique_id', 'customer_zip_code_prefix', 'customer_city',
       'customer_state', 'order_item_id', 'product_id', 'seller_id',
       'shipping_limit_date', 'price', 'freight_value', 'total_price',
       'product_category_name', 'product_name_length',
       'product_description_length', 'product_photos_qty', 'product_weight_g',
       'product_length_cm', 'product_height_cm', 'product_width_cm',
       'product_category_name_english', 'volume_cm3', 'seller_zip_code_prefix',
       'seller_city', 'seller_state', 'geolocation_zip_code_prefix',
       'geolocation_lat', 'geolocation_lng', 'geolocation_city',
       'geolocation_state', 'payment_sequential', 'payment_type',
       'payment_i

In [28]:
# Save cleaned master dataset
master.to_csv('../data/processed/master_olist_dataset.csv', index=False)

print("Saved to: ../data/processed/master_olist_dataset.csv")


Saved to: ../data/processed/master_olist_dataset.csv
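
A CSV file does not store dtypes, so timestamp columns come back as plain text unless they are parsed on load. The sketch below shows the difference with `parse_dates`, using an in-memory CSV as a stand-in for the exported file; the two rows are hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for the exported file; the second order is undelivered.
csv_text = (
    "order_id,order_purchase_timestamp,order_delivered_customer_date\n"
    "e481,2017-10-02 10:56:33,2017-10-10 21:25:13\n"
    "x999,2018-01-01 08:00:00,\n"
)
date_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)

print(plain.dtypes)
print(parsed.dtypes)
```

Missing delivery dates load as `NaT` in the parsed frame, so the intentionally retained gaps remain visible to downstream analysis.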


## Master Dataset Summary (Post-Merge & Final Cleaning)
### Overview:
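After merging all eight cleaned Olist tables and removing rows with no associated product, the final master dataset contains:

- Rows: 117,963
- Columns: 50
- Unique Orders: 98,666 (99,441 before the 775 orders without items were dropped)
- Unique Customers: 98,666 (`customer_id` maps one-to-one to orders)
- Unique Reviews: at most 98,410 (counted before the drop, not recomputed afterwards)
- Unique Sellers: 3,095
- Unique Products: 32,951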

This dataset now represents a fully unified view connecting customers, sellers, products, orders, geolocation, payments, and reviews.

### Key Cleaning Actions Performed

1. Merged tables on their join keys
- customers → orders (customer_id)
- order_items → master (order_id)
- products → master (product_id)
- sellers → master (seller_id)
- geolocation → master (customer_zip_code_prefix = geolocation_zip_code_prefix)
- payments → master (order_id)
- reviews → master (order_id)

2. Dropped 831 rows (775 orders) with no associated product_id
- These rows represent incomplete or inconsistent transactions that cannot be used in product-level or behavior analysis.

3. Retained missing values intentionally
- Missing timestamps for delivery -> meaningful signals for delays/non-delivery.
- Missing review fields -> many customers do not leave comments.
- Missing dimensions/weights -> product-level analysis can be segmented by completeness.

4. Key counts stayed consistent across merges
- Unique order_id and customer_id counts stayed at 99,441 after every merge step. The master table itself is not unique on order_id: rows repeat per item, per payment, and per review.
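
Because payment and review rows repeat each item row, summing `total_price` directly over the master table counts some items more than once. A minimal sketch of an item-level revenue total that first deduplicates on `(order_id, order_item_id)`; the toy frame is hypothetical.

```python
import pandas as pd

# Order "a" has one item but two payment rows, so its item row is repeated.
master_demo = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "order_item_id": [1, 1, 1],
    "total_price": [38.71, 38.71, 141.46],
    "payment_value": [18.12, 20.59, 141.46],
})

# Naive sum counts order a's item twice.
naive_revenue = master_demo["total_price"].sum()

# Keep one row per item before summing.
item_level = master_demo.drop_duplicates(subset=["order_id", "order_item_id"])
revenue = item_level["total_price"].sum()

print(round(naive_revenue, 2), round(revenue, 2))
```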

### Resulting Dataset is Ready For:
#### Customer Behavior
- First purchase, repeat purchase, lifetime value
- Cohort analysis
- Regional behavior trends
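
In this notebook `customer_id` has exactly as many unique values as `order_id`, so repeat purchases cannot be measured with it. A sketch of a repeat-purchase rate based on `customer_unique_id` instead, on a hypothetical toy frame:

```python
import pandas as pd

# Customer u1 placed two orders (a and b); order a spans two item rows.
master_demo = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u1", "u2"],
    "order_id": ["a", "a", "b", "c"],
})

# Count distinct orders per person, ignoring repeated item/payment rows.
orders_per_customer = (
    master_demo.groupby("customer_unique_id")["order_id"].nunique()
)

# Share of customers with more than one order.
repeat_rate = (orders_per_customer > 1).mean()
print(orders_per_customer.to_dict(), repeat_rate)
```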

#### Product Analysis
- Top categories
- Price distributions
- Review sentiment (optional)
- Delivery performance by product type

#### Funnel Analysis
- Purchase timestamps
- Fulfillment timestamps
- Order status
- Payment data