# 1.0 Basic Data Exploration Summary

This section provides the initial quantitative snapshot of the merged Olist dataset.

### A. Dataset Scope

| Metric | Value |
|---|---|
| **Total Transaction Rows** | 118,434 |
| **Unique Customers** | 96,096 |
| **Unique Products** | 32,951 |
| **Date Range Start** | 2016-09-04 |
| **Date Range End** | 2018-10-17 |

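*Observation: The merged table has more rows (118,434) than unique customers (96,096). Much of this gap comes from the joins themselves: an order with several items or several payment records appears on several rows (the first three rows of the merged table share one `order_id`). Repeat purchases, the basis for our LTV calculation, therefore need to be confirmed by counting distinct orders per customer, not inferred from the row count alone.*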
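
Whether customers actually placed multiple orders can be checked directly by counting distinct orders per `customer_unique_id`. The following is a minimal sketch on a toy frame; the toy data and the variable names `toy`, `orders_per_customer`, `repeat_customers`, and `repeat_share` are illustrative assumptions, while the column names are those of `df_combined`.

```python
import pandas as pd

# Toy stand-in for df_combined: orders can appear on several rows after the joins.
toy = pd.DataFrame({
    "customer_unique_id": ["a", "a", "a", "b", "c", "c"],
    "order_id":           ["o1", "o1", "o2", "o3", "o4", "o4"],
})

# Count DISTINCT orders per customer so duplicated rows do not inflate the count.
orders_per_customer = toy.groupby("customer_unique_id")["order_id"].nunique()

# Customers with more than one distinct order are repeat customers.
repeat_customers = int((orders_per_customer > 1).sum())
repeat_share = repeat_customers / orders_per_customer.size

print(orders_per_customer.to_dict())
print(repeat_customers, round(repeat_share, 3))
```

Applied to the real data, replacing `toy` with `df_combined` would give the repeat-customer share that the LTV work depends on.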

### B. Financial Summary (Payment Value in R$)

| Statistic | Payment Value (R$) | Observation |
|---|---|---|
| **Count** | 118,431 | Three of the 118,434 rows have no payment value and need investigation. |
| **Mean (Average)** | 172.85 | The average payment value per merged row; it is pulled upward by large payments. |
| **Median (50%)** | 108.20 | The median is well below the mean, indicating a **positive skew** caused by high-value outliers. |
| **Max** | 13,664.08 | The maximum is roughly five times the 99.9th percentile. |
| **99.9th Percentile** | 2,732.06 | Rows above this value (0.1% of the data) are extreme outliers that must be segmented or handled during LTV modeling. |


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os 

In [2]:

DATA_PATH = '/Users/manarmohsin/ecommerce-cac-optimization/data/raw/archive'

In [3]:
# Load all individual CSV files into a Python dictionary called 'dataframes'
dataframes = {
    'customers': pd.read_csv(os.path.join(DATA_PATH, 'olist_customers_dataset.csv')),
    'orders': pd.read_csv(os.path.join(DATA_PATH, 'olist_orders_dataset.csv')),
    'items': pd.read_csv(os.path.join(DATA_PATH, 'olist_order_items_dataset.csv')),
    'payments': pd.read_csv(os.path.join(DATA_PATH, 'olist_order_payments_dataset.csv')),
    'products': pd.read_csv(os.path.join(DATA_PATH, 'olist_products_dataset.csv')),
    'reviews': pd.read_csv(os.path.join(DATA_PATH, 'olist_order_reviews_dataset.csv')),
    'sellers': pd.read_csv(os.path.join(DATA_PATH, 'olist_sellers_dataset.csv')),
    'category_translation': pd.read_csv(os.path.join(DATA_PATH, 'product_category_name_translation.csv'))
}


print("All necessary files loaded successfully!")

All necessary files loaded successfully!


In [17]:
# Data merging block (sequential left joins)

# Stage 1: Start with the core 'orders' data and link to 'customers'.
df_combined = dataframes['orders'].merge(dataframes['customers'], on= 'customer_id', how='left')

# Stage 2: Merge with Order Items.
df_combined = df_combined.merge(dataframes['items'], on ='order_id', how='left')

# Stage 3: Merge with Payments.
df_combined = df_combined.merge(dataframes['payments'], on='order_id', how='left')

# Stage 4: Merge with Product Details.
df_combined = df_combined.merge(dataframes['products'], on='product_id', how='left')

# Stage 5: Merge with Category Translation.
df_combined = df_combined.merge(dataframes['category_translation'], on='product_category_name', how='left')

# Stage 6: Merge with Sellers (the reviews table is loaded but not merged here).
df_combined = df_combined.merge(dataframes['sellers'], on='seller_id', how='left')
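
Because both the items and payments tables can hold several rows per `order_id`, chaining left joins on `order_id` can multiply rows within an order, and sums over `payment_value` in the merged frame can then overstate revenue. The sketch below illustrates the effect on made-up data and shows one possible mitigation: aggregating payments to one row per order before joining. The toy frames and the names `merged`, `naive_total`, `payments_per_order`, and `order_level` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy tables: order o1 has two items and two payment records.
orders = pd.DataFrame({"order_id": ["o1", "o2"]})
items = pd.DataFrame({"order_id": ["o1", "o1", "o2"], "price": [10.0, 20.0, 5.0]})
payments = pd.DataFrame({"order_id": ["o1", "o1", "o2"], "payment_value": [25.0, 5.0, 5.0]})

# Sequential left joins: o1 expands to 2 items x 2 payments = 4 rows.
merged = (
    orders.merge(items, on="order_id", how="left")
          .merge(payments, on="order_id", how="left")
)
naive_total = merged["payment_value"].sum()   # payments counted once per item row
true_total = payments["payment_value"].sum()  # each payment counted once

# One mitigation: collapse payments to a single row per order before joining.
payments_per_order = payments.groupby("order_id", as_index=False)["payment_value"].sum()
order_level = orders.merge(
    payments_per_order, on="order_id", how="left", validate="one_to_one"
)

print(len(merged), naive_total, true_total, order_level["payment_value"].sum())
```

The `validate="one_to_one"` argument makes pandas raise an error if either side has duplicate keys, which guards against the fan-out silently returning.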


In [5]:
print("\nFinal Merged DataFrame (df_combined) created.")
print("---")
print(f"Shape: {df_combined.shape}")
df_combined.head()


Final Merged DataFrame (df_combined) created.
---
Shape: (118434, 34)


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,...,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_category_name_english,seller_zip_code_prefix,seller_city,seller_state
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,7c396fd4830fd04220f754e42b4e5bff,3149,...,268.0,4.0,500.0,19.0,8.0,13.0,housewares,9350.0,maua,SP
1,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,7c396fd4830fd04220f754e42b4e5bff,3149,...,268.0,4.0,500.0,19.0,8.0,13.0,housewares,9350.0,maua,SP
2,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,7c396fd4830fd04220f754e42b4e5bff,3149,...,268.0,4.0,500.0,19.0,8.0,13.0,housewares,9350.0,maua,SP
3,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00,af07308b275d755c9edb36a90c618231,47813,...,178.0,1.0,400.0,19.0,13.0,19.0,perfumery,31570.0,belo horizonte,SP
4,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00,3a653a41f6f9fc3d2a113cf8398680e8,75265,...,232.0,1.0,420.0,24.0,19.0,21.0,auto,14840.0,guariba,SP


In [6]:
# Get column names
print("Column Names:")
print(list(df_combined.columns))


Column Names:
['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state', 'order_item_id', 'product_id', 'seller_id', 'shipping_limit_date', 'price', 'freight_value', 'payment_sequential', 'payment_type', 'payment_installments', 'payment_value', 'product_category_name', 'product_name_lenght', 'product_description_lenght', 'product_photos_qty', 'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm', 'product_category_name_english', 'seller_zip_code_prefix', 'seller_city', 'seller_state']


In [7]:
# Get data types
print("\nData Types:")
print(df_combined.dtypes)


Data Types:
order_id                          object
customer_id                       object
order_status                      object
order_purchase_timestamp          object
order_approved_at                 object
order_delivered_carrier_date      object
order_delivered_customer_date     object
order_estimated_delivery_date     object
customer_unique_id                object
customer_zip_code_prefix           int64
customer_city                     object
customer_state                    object
order_item_id                    float64
product_id                        object
seller_id                         object
shipping_limit_date               object
price                            float64
freight_value                    float64
payment_sequential               float64
payment_type                      object
payment_installments             float64
payment_value                    float64
product_category_name             object
product_name_lenght              float64
pro

In [8]:
# concise summary (including column names and data types)
print("\nDataFrame Info:")
df_combined.info()


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118434 entries, 0 to 118433
Data columns (total 34 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       118434 non-null  object 
 1   customer_id                    118434 non-null  object 
 2   order_status                   118434 non-null  object 
 3   order_purchase_timestamp       118434 non-null  object 
 4   order_approved_at              118258 non-null  object 
 5   order_delivered_carrier_date   116360 non-null  object 
 6   order_delivered_customer_date  115037 non-null  object 
 7   order_estimated_delivery_date  118434 non-null  object 
 8   customer_unique_id             118434 non-null  object 
 9   customer_zip_code_prefix       118434 non-null  int64  
 10  customer_city                  118434 non-null  object 
 11  customer_state                 118434 non-null  object 
 12  order_item_id

In [9]:
#The first 10 rows
df_combined.head(10)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,...,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_category_name_english,seller_zip_code_prefix,seller_city,seller_state
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,7c396fd4830fd04220f754e42b4e5bff,3149,...,268.0,4.0,500.0,19.0,8.0,13.0,housewares,9350.0,maua,SP
1,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,7c396fd4830fd04220f754e42b4e5bff,3149,...,268.0,4.0,500.0,19.0,8.0,13.0,housewares,9350.0,maua,SP
2,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,7c396fd4830fd04220f754e42b4e5bff,3149,...,268.0,4.0,500.0,19.0,8.0,13.0,housewares,9350.0,maua,SP
3,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00,af07308b275d755c9edb36a90c618231,47813,...,178.0,1.0,400.0,19.0,13.0,19.0,perfumery,31570.0,belo horizonte,SP
4,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00,3a653a41f6f9fc3d2a113cf8398680e8,75265,...,232.0,1.0,420.0,24.0,19.0,21.0,auto,14840.0,guariba,SP
5,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00,7c142cf63193a1473d2e66489a9ae977,59296,...,468.0,3.0,450.0,30.0,10.0,20.0,pet_shop,31842.0,belo horizonte,MG
6,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00,72632f0f9dd73dfee390c9b22eb56dd6,9195,...,316.0,4.0,250.0,51.0,15.0,15.0,stationery,8752.0,mogi das cruzes,SP
7,a4591c265e18cb1dcee52889e2d8acc3,503740e9ca751ccdda7ba28e9ab8f608,delivered,2017-07-09 21:57:05,2017-07-09 22:10:13,2017-07-11 14:58:04,2017-07-26 10:57:55,2017-08-01 00:00:00,80bb27c7c16e8f973207a5086ab329e2,86320,...,608.0,1.0,7150.0,65.0,10.0,65.0,auto,7112.0,guarulhos,SP
8,136cce7faa42fdb2cefd53fdc79a6098,ed0271e0b7da060a393796590e7b737a,invoiced,2017-04-11 12:22:08,2017-04-13 13:25:17,,,2017-05-09 00:00:00,36edbb3fb164b1f16485364b6fb04c73,98900,...,,,600.0,35.0,35.0,15.0,,5455.0,sao paulo,SP
9,6514b8ad8028c9f2cc2374ded245783f,9bdf08b4b3b52b5526ff42d37d47f222,delivered,2017-05-16 13:10:30,2017-05-16 13:22:11,2017-05-22 10:07:46,2017-05-26 12:55:51,2017-06-07 00:00:00,932afa1e708222e5821dac9cd5db4cae,26525,...,956.0,1.0,50.0,16.0,16.0,17.0,auto,12940.0,atibaia,SP


In [10]:
#Last 10 rows
df_combined.tail(10)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,...,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_category_name_english,seller_zip_code_prefix,seller_city,seller_state
118424,9115830be804184b91f5c00f6f49f92d,da2124f134f5dfbce9d06f29bdb6c308,delivered,2017-10-04 19:57:37,2017-10-04 20:07:14,2017-10-05 16:52:52,2017-10-20 20:25:45,2017-11-07 00:00:00,c716cf2b5b86fb24257cffe9e7969df8,78048,...,180.0,3.0,750.0,26.0,15.0,26.0,toys,26020.0,nova iguacu,RJ
118425,aa04ef5214580b06b10e2a378300db44,f01a6bfcc730456317e4081fe0c9940e,delivered,2017-01-27 00:30:03,2017-01-27 01:05:25,2017-01-30 11:40:16,2017-02-07 13:15:25,2017-03-17 00:00:00,e03dbdf5e56c96b106d8115ac336f47f,35502,...,657.0,1.0,750.0,38.0,12.0,25.0,health_beauty,80310.0,curitiba,PR
118426,aa04ef5214580b06b10e2a378300db44,f01a6bfcc730456317e4081fe0c9940e,delivered,2017-01-27 00:30:03,2017-01-27 01:05:25,2017-01-30 11:40:16,2017-02-07 13:15:25,2017-03-17 00:00:00,e03dbdf5e56c96b106d8115ac336f47f,35502,...,657.0,1.0,750.0,38.0,12.0,25.0,health_beauty,80310.0,curitiba,PR
118427,880675dff2150932f1601e1c07eadeeb,47cd45a6ac7b9fb16537df2ccffeb5ac,delivered,2017-02-23 09:05:12,2017-02-23 09:15:11,2017-03-01 10:22:52,2017-03-06 11:08:08,2017-03-22 00:00:00,831ce3f1bacbd424fc4e38fbd4d66d29,5127,...,254.0,2.0,2500.0,49.0,13.0,41.0,furniture_decor,14940.0,ibitinga,SP
118428,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00,6359f309b166b0196dbf7ad2ac62bb5a,12209,...,1517.0,1.0,1175.0,22.0,13.0,18.0,health_beauty,12913.0,braganca paulista,SP
118429,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00,da62f9e57a76d978d02ab5362c509660,11722,...,828.0,4.0,4950.0,40.0,10.0,40.0,baby,17602.0,tupa,SP
118430,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00,737520a9aad80b3fbbdad19b66b37b30,45920,...,500.0,2.0,13300.0,32.0,90.0,22.0,home_appliances_2,8290.0,sao paulo,SP
118431,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00,5097a5312c8b157bb7be58ae360ef43c,28685,...,1893.0,1.0,6550.0,20.0,20.0,20.0,computers_accessories,37175.0,ilicinea,MG
118432,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00,5097a5312c8b157bb7be58ae360ef43c,28685,...,1893.0,1.0,6550.0,20.0,20.0,20.0,computers_accessories,37175.0,ilicinea,MG
118433,66dea50a8b16d9b4dee7af250b4be1a5,edb027a75a1449115f6b43211ae02a24,delivered,2018-03-08 20:57:30,2018-03-09 11:20:28,2018-03-09 22:11:59,2018-03-16 13:08:30,2018-04-03 00:00:00,60350aa974b26ff12caad89e55993bd6,83750,...,569.0,1.0,150.0,16.0,7.0,15.0,health_beauty,14407.0,franca,SP


In [11]:
df_combined['order_purchase_timestamp'] = pd.to_datetime(df_combined['order_purchase_timestamp'])
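
The dtype listing shows the other date columns are also stored as `object`, so the same conversion could be applied to them in one pass. A minimal sketch on a toy frame follows; the toy values and the names `toy`, `date_cols`, and `delivery_days` are illustrative assumptions, and `errors="coerce"` is one possible choice that turns unparseable or missing entries into `NaT`.

```python
import pandas as pd

# Toy stand-in with two of the timestamp columns; the second order is undelivered.
toy = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2017-04-11 12:22:08"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", None],
})

date_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]

# Convert each column; missing or malformed values become NaT instead of raising.
for col in date_cols:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# With real datetimes, durations can be computed directly (whole days here).
delivery_days = (
    toy["order_delivered_customer_date"] - toy["order_purchase_timestamp"]
).dt.days

print(toy.dtypes)
print(delivery_days.tolist())
```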

In [12]:
earliest_purchase = df_combined['order_purchase_timestamp'].min()
latest_purchase = df_combined['order_purchase_timestamp'].max()
print(earliest_purchase)
print(latest_purchase)

2016-09-04 21:15:19
2018-10-17 17:30:18


In [13]:
unique_customers = df_combined['customer_unique_id'].nunique()
print(unique_customers)

96096


In [14]:
unique_products = df_combined['product_id'].nunique()
print(unique_products)

32951


In [15]:
df_combined['payment_value'].describe()

count    118431.000000
mean        172.849395
std         268.259831
min           0.000000
25%          60.860000
50%         108.200000
75%         189.245000
max       13664.080000
Name: payment_value, dtype: float64
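
The count of 118,431 against 118,434 rows means three rows have no `payment_value`, and the minimum of 0.00 shows that zero-value payments exist as well. One way to inspect both cases is sketched below on a toy frame; the toy rows and the names `toy`, `missing_mask`, `missing_rows`, `n_missing`, and `n_zero` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df_combined with one missing and one zero payment value.
toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_status": ["delivered", "delivered", "canceled", "delivered"],
    "payment_type": ["credit_card", None, "boleto", "voucher"],
    "payment_value": [108.20, None, 60.86, 0.0],
})

# Rows where the payment join found no match (or the value is absent).
missing_mask = toy["payment_value"].isna()
missing_rows = toy.loc[missing_mask, ["order_id", "order_status", "payment_type"]]

n_missing = int(missing_mask.sum())
n_zero = int((toy["payment_value"] == 0).sum())

print(missing_rows)
print(n_missing, n_zero)
```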

In [16]:
df_combined['payment_value'].quantile([0.99, 0.999])

0.990    1223.70
0.999    2732.06
Name: payment_value, dtype: float64
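
The 99.9th percentile can serve as a cutoff for flagging extreme payments. The sketch below shows two possible treatments on synthetic values: a boolean flag for segmentation and a capped copy of the column. Capping is only one option and is an assumption here, not a method the notebook commits to; the names `threshold`, `is_outlier`, and `capped` are likewise illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic payment values 1.0 .. 1000.0 standing in for df_combined['payment_value'].
payment_value = pd.Series(np.arange(1, 1001, dtype=float), name="payment_value")

# Cutoff at the 99.9th percentile (pandas uses linear interpolation by default).
threshold = payment_value.quantile(0.999)

# Option 1: flag rows above the cutoff so they can be segmented separately.
is_outlier = payment_value > threshold

# Option 2: cap values at the cutoff, keeping every row in the data.
capped = payment_value.clip(upper=threshold)

print(threshold, int(is_outlier.sum()), capped.max())
```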