
## EDA: Customer Order Analysis

This notebook focuses on understanding the **customer and order-level structure** of the dataset, with particular emphasis on customer behavior patterns such as **first-time vs repeat purchases**.

## Analogy ##

<p align="center">
  <img src="CustomerExample.png" width="500"/>
</p>


### 1. Schema and Null Value Analysis

* Examined the schema of both the **customers** and **orders** tables to understand available attributes and data types.
* Performed a null value analysis to identify missing or incomplete fields that could impact downstream analysis or feature engineering.
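
The null value analysis described above can be summarized per column. The following is a minimal sketch, not the notebook's own code: `null_summary` is a hypothetical helper and `toy_orders` is an illustrative stand-in for the real orders table.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, most-missing first."""
    summary = pd.DataFrame({
        "null_count": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)


# Toy stand-in for the orders table (illustrative values only).
toy_orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_approved_at": ["2018-01-01", None, "2018-01-03", "2018-01-04"],
    "order_delivered_customer_date": [None, None, "2018-01-09", "2018-01-10"],
})

summary = null_summary(toy_orders)
print(summary)
```

Reporting a percentage next to the raw count makes it easier to judge whether a gap is negligible or large enough to affect downstream features.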

### 2. Table Grain Analysis

* Identified the **grain** of each table to establish how records are uniquely represented.
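* For the **customers** table, the grain is **customer_id**: no customer_id repeats, and each one maps to exactly one **order_id** in the **orders** table, so a customer_id effectively identifies a customer-order pair.
* Observed that multiple customer_id records map to the same **customer_unique_id**, which represents the actual individual customer (3,345 of the 99,441 rows repeat an earlier customer_unique_id).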
* This duplication is a key characteristic of the dataset and directly enables analysis of **repeat buyers vs first-time buyers**, making customer retention a core analytical dimension.
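
A grain claim can be tested by checking whether a candidate set of key columns uniquely identifies every row. This is a small sketch under assumptions: `is_grain` is a hypothetical helper and `toy_customers` is made-up data shaped like the customers table.

```python
import pandas as pd


def is_grain(df: pd.DataFrame, key_cols: list[str]) -> bool:
    """True when key_cols uniquely identify every row of df."""
    return not df.duplicated(subset=key_cols).any()


# Toy stand-in for the customers table (illustrative values only).
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})

print(is_grain(toy_customers, ["customer_id"]))         # unique per row
print(is_grain(toy_customers, ["customer_unique_id"]))  # u1 appears twice
```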

### 3. First-Time vs Repeat Customer Analysis

* Analyzed repeat customers by tracking multiple orders associated with the same **customer_unique_id**.
* Studied the **purchase timelines** of selected customers to understand ordering frequency and temporal gaps between purchases.
* Separately analyzed **first-time buyers** to quantify customers who placed only a single order on the platform.
* These analyses establish a foundational understanding of customer conversion and repeat purchase behavior, which informs later modeling and mart design.
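
One way to quantify the two groups is to label each **customer_unique_id** by its order count. The sketch below assumes, as the grain analysis indicates, that each customer_id row corresponds to one order; the segment labels and the toy data are illustrative.

```python
import pandas as pd

# Toy stand-in for the customers table (illustrative values only).
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})

# One row per customer_id means one row per order, so the group size
# is the number of orders placed by each individual customer.
orders_per_customer = toy_customers.groupby("customer_unique_id").size()

# Label each customer, then count customers per segment.
is_repeat = orders_per_customer.gt(1)
segment = is_repeat.map({True: "repeat", False: "first_time"})
segment_counts = segment.value_counts()

# Share of customers with more than one order.
repeat_rate = is_repeat.mean()

print(segment_counts)
print(f"repeat rate: {repeat_rate:.2%}")
```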

### 4. Customer Geography and Demographics Analysis

* Analyzed whether a single **customer_unique_id** is associated with multiple geographic locations.
* Identified cases where customers appeared across different geographies, potentially due to address changes, relocations, or orders placed on behalf of others.
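
The same check can be run at several geographic levels at once, which helps separate a move within a city from a move across states. A minimal sketch on made-up rows that reuse the customers table's column names:

```python
import pandas as pd

# Toy stand-in for the customers table (illustrative values only).
toy_customers = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u2", "u2", "u3"],
    "customer_zip_code_prefix": [57055, 57035, 1000, 2000, 3000],
    "customer_city": ["maceio", "maceio", "sao paulo", "rio de janeiro", "cajamar"],
    "customer_state": ["AL", "AL", "SP", "RJ", "SP"],
})

geo_cols = ["customer_zip_code_prefix", "customer_city", "customer_state"]

# Number of distinct values per customer for each geographic level.
distinct_per_customer = (
    toy_customers.groupby("customer_unique_id")[geo_cols].nunique()
)

# Customers with more than one distinct value at each level.
multi_location_counts = (distinct_per_customer > 1).sum()
print(multi_location_counts)
```

In this toy data, `u1` changes zip prefix within the same city, while `u2` changes zip prefix, city, and state.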




In [16]:
import pandas as pd

customers = pd.read_csv("../Source Data/olist_customers_dataset.csv")
orders = pd.read_csv("../Source Data/olist_orders_dataset.csv")
cart_items = pd.read_csv("../Source Data/olist_order_items_dataset.csv")


Checking repeat vs. non-repeat customers, starting with the schema and null counts of each table

In [17]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


In [18]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB
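
The `info()` output shows every timestamp column stored as `object` (strings), and three of them have fewer than 99441 non-null values. Before any timeline arithmetic, these columns would likely need converting to datetimes. The sketch below uses made-up rows and assumes the `YYYY-MM-DD HH:MM:SS` format seen in the data; `delivery_days` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for the orders table (illustrative values only).
toy_orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_purchase_timestamp": ["2017-05-15 23:30:03", "2017-06-18 22:56:48"],
    "order_delivered_customer_date": [None, "2017-06-23 12:55:50"],
})

date_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]
for col in date_cols:
    # Missing values become NaT rather than raising an error.
    toy_orders[col] = pd.to_datetime(toy_orders[col], format="%Y-%m-%d %H:%M:%S")

# Whole days from purchase to delivery; NaN where delivery is missing.
delivery_days = (
    toy_orders["order_delivered_customer_date"]
    - toy_orders["order_purchase_timestamp"]
).dt.days

print(toy_orders.dtypes)
print(delivery_days)
```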


### TABLE GRAIN ANALYSIS ###

In [19]:
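# Rows that repeat an earlier customer_unique_id (same person, another order)
print(customers.duplicated("customer_unique_id").sum())
# Rows that repeat an earlier customer_id (0 means customer_id is unique)
print(customers.duplicated("customer_id").sum())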

3345
0


In [20]:
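# 0 duplicates on order_id: one row per order
print(orders.duplicated("order_id").sum())
# 0 duplicates on customer_id: each customer_id places exactly one order
print(orders.duplicated("customer_id").sum())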

0
0
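
The duplicate counts above show that `customer_id` is unique in both tables. A merge with `validate="one_to_one"` can assert that relationship directly, and `indicator=True` flags keys that exist on only one side. This is a sketch on made-up rows, not a cell from the notebook.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative values only).
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})
toy_orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})

# validate raises pandas.errors.MergeError if customer_id repeats on either side.
joined = toy_orders.merge(
    toy_customers,
    on="customer_id",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Rows whose customer_id is present in only one of the two tables.
unmatched = (joined["_merge"] != "both").sum()
print(len(joined), unmatched)
```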


### Repeat Customer Analysis ###

In [21]:
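# customers has one row per customer_id, and each customer_id has exactly one order,
# so the row count per customer_unique_id equals that person's order count
orders_per_customer = (
    customers.groupby("customer_unique_id")
    .size()
)
orders_per_customer.sort_values(ascending=False).head(10)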


customer_unique_id
8d50f5eadf50201ccdcedfb9e2ac8455    17
3e43e6105506432c953e165fb2acf44c     9
6469f99c1f9dfae7733b25662e7f1782     7
ca77025e7201e3b30c44b472ff346268     7
1b6c7548a2a1f9037c1fd3ddfed95f33     7
12f5d6e1cbf93dafd9dcc19095df0b3d     6
de34b16117594161a6a89c50b289d35a     6
63cfc61cee11cbe306bff5857d00bfe4     6
f0e310a6839dce9de1638e0fe5ab282a     6
47c1a3033b8b77b3ab6e109eb4d5fdf3     6
dtype: int64

In [22]:
repeat_customers_ids = customers.loc[
    customers["customer_unique_id"] == "8d50f5eadf50201ccdcedfb9e2ac8455",
    "customer_id"
].tolist()

repeat_customers_ids


['1bd3585471932167ab72a84955ebefea',
 'a8fabc805e9a10a3c93ae5bff642b86b',
 '897b7f72042714efaa64ac306ba0cafc',
 'b2b13de0770e06de50080fea77c459e6',
 '42dbc1ad9d560637c9c4c1533746f86d',
 'dfb941d6f7b02f57a44c3b7c3fefb44b',
 '65f9db9dd07a4e79b625effa4c868fcb',
 '1c62b48fb34ee043310dcb233caabd2e',
 'a682769c4bc10fc6ef2101337a6c83c9',
 '6289b75219d757a56c0cce8d9e427900',
 '3414a9c813e3ca02504b8be8b2deb27f',
 '0e4fdc084a6b9329ed55d62dcd653ccf',
 'f5188d99e9281e214a4a7d1b139a8229',
 '89be66634d68fa73a95499b6352e085d',
 '0bf8bf19944a7f8b40ba86fef778ca7c',
 '9a1afef458843a022e431f4cb304dfe9',
 '31dd055624c66f291578297a551a6cdf']

In [23]:
repeat_orders = orders[orders["customer_id"].isin(repeat_customers_ids)].copy()
# Timestamps are fixed-width ISO-style strings, so a string sort is chronological
timeline = repeat_orders.sort_values("order_purchase_timestamp")
timeline.head(10)


| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 68664 | 5d848f3d93a493c1c8955e018240e7ca | 0e4fdc084a6b9329ed55d62dcd653ccf | shipped | 2017-05-15 23:30:03 | 2017-05-15 23:42:34 | 2017-05-17 10:42:20 | | 2017-05-26 00:00:00 |
| 77205 | 369634708db140c5d2c4e365882c443a | b2b13de0770e06de50080fea77c459e6 | delivered | 2017-06-18 22:56:48 | 2017-06-18 23:10:19 | 2017-06-19 20:12:26 | 2017-06-23 12:55:50 | 2017-07-07 00:00:00 |
| 16231 | 5837a2c844decae8a778657425f6d664 | 31dd055624c66f291578297a551a6cdf | unavailable | 2017-07-17 22:11:13 | 2017-07-17 22:23:46 | | | 2017-08-17 00:00:00 |
| 33703 | 4f62d593acae92cea3c5662c76122478 | dfb941d6f7b02f57a44c3b7c3fefb44b | delivered | 2017-07-18 23:10:58 | 2017-07-18 23:23:26 | 2017-07-20 19:00:02 | 2017-07-21 16:19:40 | 2017-07-31 00:00:00 |
| 19127 | bf92c69b7cc70f7fc2c37de43e366173 | 42dbc1ad9d560637c9c4c1533746f86d | delivered | 2017-07-24 22:11:50 | 2017-07-24 22:25:14 | 2017-07-26 01:42:03 | 2017-07-31 16:59:58 | 2017-08-15 00:00:00 |
| 58811 | 519203404f6116d406a970763ee75799 | 1c62b48fb34ee043310dcb233caabd2e | delivered | 2017-08-05 08:59:43 | 2017-08-05 09:10:13 | 2017-08-07 18:50:00 | 2017-08-09 15:22:28 | 2017-08-25 00:00:00 |
| 5167 | e3071b7624445af6e4f3a1b23718667d | 0bf8bf19944a7f8b40ba86fef778ca7c | delivered | 2017-09-05 22:14:52 | 2017-09-05 22:30:56 | 2017-09-06 15:26:12 | 2017-09-11 13:27:49 | 2017-09-22 00:00:00 |
| 97240 | cd4b336a02aacabd0ef22f6db711f95e | 89be66634d68fa73a95499b6352e085d | delivered | 2017-10-18 23:25:04 | 2017-10-19 00:36:08 | 2017-10-20 17:11:50 | 2017-10-23 18:33:01 | 2017-10-30 00:00:00 |
| 84977 | 89d9b111d2b990deb5f5f9769f92800b | 9a1afef458843a022e431f4cb304dfe9 | delivered | 2017-10-29 16:58:02 | 2017-10-29 17:10:09 | 2017-10-30 15:58:52 | 2017-10-31 15:33:47 | 2017-11-10 00:00:00 |
| 39449 | b850a16d8faf65a74c51287ef34379ce | 1bd3585471932167ab72a84955ebefea | delivered | 2017-11-22 20:01:53 | 2017-11-22 20:12:32 | 2017-11-24 16:07:56 | 2017-11-27 18:49:13 | 2017-12-04 00:00:00 |
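
The temporal gaps between purchases can be made explicit by differencing consecutive purchase timestamps. The sketch below reuses the first three purchase timestamps from the output above; the order IDs are placeholders and `days_since_prev` is an illustrative column name.

```python
import pandas as pd

# First three purchase timestamps from the timeline above; order IDs are placeholders.
toy_timeline = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_purchase_timestamp": [
        "2017-05-15 23:30:03",
        "2017-06-18 22:56:48",
        "2017-07-17 22:11:13",
    ],
})

toy_timeline["order_purchase_timestamp"] = pd.to_datetime(
    toy_timeline["order_purchase_timestamp"]
)
toy_timeline = toy_timeline.sort_values("order_purchase_timestamp")

# Whole days elapsed since the previous purchase; NaN for the first order.
toy_timeline["days_since_prev"] = (
    toy_timeline["order_purchase_timestamp"].diff().dt.days
)
print(toy_timeline)
```

On the full dataset, the same `diff()` would need to run within each `customer_unique_id` group (for example via `groupby(...)["order_purchase_timestamp"].diff()`) so that gaps are never computed across different customers.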


### First Time Customers ###

In [24]:
orders_per_customer = (
    customers.groupby("customer_unique_id")
    .size()
)

orders_per_customer.sort_values(ascending=True).head(10)


customer_unique_id
0000366f3b9a7992bf8c76cfdf3221e2    1
a926cfc9bc7b082335de50450f48eec9    1
a926aae38267e7f54e67de9b5775d0a5    1
a92605ec492805540520d3a73aaeeb6e    1
a925c3e5df82fdc6082f1383d2834998    1
a924d89c1062f9149e55552f0359f894    1
a924d6ff7212a2815d130666e69cf915    1
a9230d79e397fb71fa526975917bf449    1
a9216bc4083909ed942b4f6c497c7f27    1
a9208203978f82d89b79822d0a8a6f6c    1
dtype: int64

In [25]:
customers.loc[customers["customer_unique_id"] == "0000366f3b9a7992bf8c76cfdf3221e2"]

| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|
| 64012 | fadbb3709178fc513abc1b2670aa1ad2 | 0000366f3b9a7992bf8c76cfdf3221e2 | 7787 | cajamar | SP |


In [26]:
orders.loc[orders["customer_id"] == "fadbb3709178fc513abc1b2670aa1ad2"]

| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 52798 | e22acc9c116caa3f2b7121bbb380d08e | fadbb3709178fc513abc1b2670aa1ad2 | delivered | 2018-05-10 10:56:27 | 2018-05-10 11:11:18 | 2018-05-12 08:18:00 | 2018-05-16 20:48:37 | 2018-05-21 00:00:00 |


In [27]:
orders["order_status"].unique()

array(['delivered', 'invoiced', 'shipped', 'processing', 'unavailable',
       'canceled', 'created', 'approved'], dtype=object)

In [28]:
total_orders = len(orders)

cancel_unavail = orders[orders["order_status"].isin(["canceled", "unavailable"])]

count = len(cancel_unavail)
pct = count / total_orders

print("count:", count)
print("total:", total_orders)
print("pct:", pct)

count: 1234
total: 99441
pct: 0.012409368369183738
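
The share computed above combines two statuses into one figure. A per-status breakdown can be obtained with `value_counts(normalize=True)`, as in this sketch on made-up statuses; `not_fulfilled_share` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for orders["order_status"] (illustrative values only).
toy_orders = pd.DataFrame({
    "order_status": ["delivered"] * 7 + ["canceled", "unavailable", "shipped"],
})

# Fraction of orders in each status.
status_share = toy_orders["order_status"].value_counts(normalize=True)

# Combined share of canceled and unavailable orders; reindex guards
# against a status that happens to be absent from the data.
not_fulfilled_share = (
    status_share.reindex(["canceled", "unavailable"], fill_value=0).sum()
)

print(status_share)
print(f"canceled or unavailable: {not_fulfilled_share:.2%}")
```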


### GEO CONTEXT ###

In [29]:
zip_counts = (
    customers.groupby("customer_unique_id")["customer_zip_code_prefix"]
    .nunique()
    .reset_index(name="zip_count")
)

multi_zip = zip_counts[zip_counts["zip_count"] > 1]
multi_zip.head()


| | customer_unique_id | zip_count |
|---|---|---|
| 124 | 004b45ec5c64187465168251cd1c9c2f | 2 |
| 144 | 0058f300f57d7b93c477a131a59b36c3 | 2 |
| 438 | 012452d40dafae4df401bced74cdb490 | 2 |
| 558 | 0178b244a5c281fb2ade54038dd4b161 | 2 |
| 598 | 018b5a7502c30eb5f230f1b4eb23a156 | 2 |


In [30]:
customers.loc[customers["customer_unique_id"] == "004b45ec5c64187465168251cd1c9c2f"]

| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|
| 72451 | 49cf243e0d353cd418ca77868e24a670 | 004b45ec5c64187465168251cd1c9c2f | 57055 | maceio | AL |
| 87012 | d95f60d70d9ea9a7fe37c53c931940bb | 004b45ec5c64187465168251cd1c9c2f | 57035 | maceio | AL |
