
## Customers Mart – Implementation

<p align="center">
  <img src="../EDA/CustomerExample.png" width="500"/>
</p>

Based on insights from the EDA, the following transformations were implemented in the **Customers Mart**:

* Identified `customer_unique_id` as the true customer identifier and standardized all customer-level features at this grain.
* Consolidated multiple `customer_id` values mapping to the same `customer_unique_id`, accounting for customers with multiple orders.
* Derived behavioral indicators to distinguish **first-time buyers** from **repeat customers** based on historical order counts.
* Computed **temporal features** such as order frequency and gaps between purchases to capture customer engagement over time.
* Analyzed and incorporated **geographic attributes**, allowing for cases where a single customer appears across multiple locations due to address changes or proxy ordering.
* Produced a single, customer-level record suitable for joining with order- and review-level marts in the final master dataset.
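
The consolidation to the `customer_unique_id` grain described above can be sketched on toy data. This is only an illustration: the `toy_orders` frame, its `order_id` and `order_purchase_timestamp` columns, and the derived column names are assumptions made for the example, not the notebook's actual code.

```python
import pandas as pd

# Toy data (made up): customer "A" placed two orders under two customer_id values.
toy_customers = pd.DataFrame(
    {"customer_id": ["a1", "a2", "b1"], "customer_unique_id": ["A", "A", "B"]}
)
toy_orders = pd.DataFrame(
    {
        "order_id": ["o1", "o2", "o3"],
        "customer_id": ["a1", "a2", "b1"],
        "order_purchase_timestamp": pd.to_datetime(
            ["2018-01-01", "2018-01-31", "2018-02-10"]
        ),
    }
)

# Attach the stable identifier to every order, then aggregate at that grain.
orders = toy_orders.merge(toy_customers, on="customer_id", how="left")
customer_level = (
    orders.groupby("customer_unique_id")
    .agg(
        order_count=("order_id", "nunique"),
        first_order=("order_purchase_timestamp", "min"),
        last_order=("order_purchase_timestamp", "max"),
    )
    .reset_index()
)

# Behavioral and temporal indicators derived from the aggregates.
customer_level["is_repeat_customer"] = (customer_level["order_count"] > 1).astype(int)
customer_level["days_first_to_last"] = (
    customer_level["last_order"] - customer_level["first_order"]
).dt.days
```

The result has one row per `customer_unique_id`, which is the shape needed for joining with order- and review-level marts.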





## Customers Mart – Geographic Enrichment

<p align="center">
  <img src="../Data Modelling/CustomerGeoEnrichment.png" width="1000"/>
</p>


* Observed that **zip code prefixes** repeat within both the customer and seller datasets, so each table was reduced to one row per prefix before joining.

* Computed **zip-level density features**: the number of unique customers and the number of unique sellers per zip code prefix.

* Normalized the geolocation table to one row per zip code prefix by taking the **mode latitude and mode longitude** (each computed independently), giving a single representative geo point.

* Used the normalized geo dimension to join customer density, seller density, and geographic coordinates into the Customers Mart through many-to-one merges, so no customer records are duplicated.




In [1]:
import pandas as pd

# Raw Olist source tables
customers = pd.read_csv("../Source Data/olist_customers_dataset.csv")
geo = pd.read_csv("../Source Data/olist_geolocation_dataset.csv")
sellers = pd.read_csv("../Source Data/olist_sellers_dataset.csv")

## Aggregating Total Unique Customers in a Zip ##

In [2]:
df_cust_int=customers.copy()

In [3]:
df_cust_int["tot_customers_in_zip"] = (
    df_cust_int.groupby("customer_zip_code_prefix")["customer_unique_id"]
    .transform("nunique")
)


In [4]:
df_cust_int.head(3)

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,tot_customers_in_zip
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP,17
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP,19
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP,9


In [5]:
zip_customers = (
    df_cust_int.groupby("customer_zip_code_prefix")["customer_unique_id"]
    .nunique()
    .reset_index(name="customers_in_zip")
)

zip_customers

Unnamed: 0,customer_zip_code_prefix,customers_in_zip
0,1003,1
1,1004,2
2,1005,5
3,1006,2
4,1007,4
...,...,...
14989,99960,2
14990,99965,2
14991,99970,1
14992,99980,2
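
The two cells above compute the same count in two shapes. A minimal sketch on made-up data shows the difference: `transform("nunique")` broadcasts the count back onto every row, while `nunique()` returns one row per zip. The `toy_customers` frame is an assumption for illustration only.

```python
import pandas as pd

# Toy stand-in for the customers table (values are made up).
toy_customers = pd.DataFrame(
    {
        "customer_id": ["a1", "a2", "b1", "c1"],
        "customer_unique_id": ["A", "A", "B", "C"],
        "customer_zip_code_prefix": [1001, 1001, 1001, 2002],
    }
)

by_zip = toy_customers.groupby("customer_zip_code_prefix")["customer_unique_id"]

# Row-level: same length as the input, so it can be assigned as a new column.
row_level = by_zip.transform("nunique")

# Zip-level: one row per zip, usable as a lookup table.
zip_level = by_zip.nunique().reset_index(name="customers_in_zip")
```

Zip 1001 holds three `customer_id` rows but only two unique customers, which is why the count is taken over `customer_unique_id`.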


## Aggregating Total Unique Sellers in a Zip ##

In [6]:
df_sellers_int=sellers.copy()

In [7]:
df_sellers_int["tot_sellers_in_zip"] = (
    df_sellers_int.groupby("seller_zip_code_prefix")["seller_id"]
    .transform("nunique")
)


In [8]:
df_sellers_int.head(3)

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state,tot_sellers_in_zip
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP,2
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP,1
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ,2


## BRIDGE GEO TABLE ##

In [9]:
cust_counts_zip = df_cust_int[["customer_zip_code_prefix", "tot_customers_in_zip"]].drop_duplicates()
seller_counts_zip = df_sellers_int[["seller_zip_code_prefix", "tot_sellers_in_zip"]].drop_duplicates()


In [10]:
zip_counts = cust_counts_zip.merge(
    seller_counts_zip,
    left_on="customer_zip_code_prefix",
    right_on="seller_zip_code_prefix",
    how="outer"
)

zip_counts.head()


Unnamed: 0,customer_zip_code_prefix,tot_customers_in_zip,seller_zip_code_prefix,tot_sellers_in_zip
0,14409.0,17.0,,
1,9790.0,19.0,,
2,1151.0,9.0,,
3,8775.0,13.0,,
4,13056.0,42.0,13056.0,1.0


In [11]:
zip_counts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15078 entries, 0 to 15077
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_zip_code_prefix  14994 non-null  float64
 1   tot_customers_in_zip      14994 non-null  float64
 2   seller_zip_code_prefix    2246 non-null   float64
 3   tot_sellers_in_zip        2246 non-null   float64
dtypes: float64(4)
memory usage: 471.3 KB
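
The non-null counts above show that some zips appear on only one side of the outer merge. As a hedged sketch, the `indicator=True` option of `DataFrame.merge` makes that split explicit. The toy frames below are assumptions, not the real counts.

```python
import pandas as pd

# Toy per-zip counts (made up): 1001 has only customers, 3003 only sellers.
toy_cust = pd.DataFrame(
    {"customer_zip_code_prefix": [1001, 2002], "tot_customers_in_zip": [2, 1]}
)
toy_sell = pd.DataFrame(
    {"seller_zip_code_prefix": [2002, 3003], "tot_sellers_in_zip": [1, 4]}
)

toy_counts = toy_cust.merge(
    toy_sell,
    left_on="customer_zip_code_prefix",
    right_on="seller_zip_code_prefix",
    how="outer",
    indicator=True,  # adds a "_merge" column: left_only / right_only / both
)

# Number of zips per category, as a plain dict.
coverage = toy_counts["_merge"].astype(str).value_counts().to_dict()
```

Unmatched rows hold NaN in the other side's columns, which is also why the integer keys and counts are upcast to `float64` in the output above.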


In [12]:
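# Coalesce the two key columns into one: take the customer-side zip and
# fall back to the seller-side zip where the customer side is NaN.
zip_counts["zip_code_prefix"] = zip_counts["customer_zip_code_prefix"].combine_first(
    zip_counts["seller_zip_code_prefix"]
)

# The original key columns are no longer needed.
zip_counts = zip_counts.drop(columns=["customer_zip_code_prefix", "seller_zip_code_prefix"])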


In [13]:
zip_counts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15078 entries, 0 to 15077
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   tot_customers_in_zip  14994 non-null  float64
 1   tot_sellers_in_zip    2246 non-null   float64
 2   zip_code_prefix       15078 non-null  float64
dtypes: float64(3)
memory usage: 353.5 KB
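
The coalesced key is fully populated but still `float64`. One optional follow-up, sketched here on toy data and not part of the original notebook, is to cast it back to an integer dtype so it matches `customer_zip_code_prefix`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the key columns after an outer merge (values are made up).
toy = pd.DataFrame(
    {
        "customer_zip_code_prefix": [1001.0, 2002.0, np.nan],
        "seller_zip_code_prefix": [np.nan, 2002.0, 3003.0],
    }
)

# Coalesce, then cast: safe only because no NaN remains after combine_first.
toy["zip_code_prefix"] = (
    toy["customer_zip_code_prefix"]
    .combine_first(toy["seller_zip_code_prefix"])
    .astype("int64")
)
```

The cast would raise if any key were still missing, so it doubles as a check that every row received a zip.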


In [14]:
zip_counts["zip_code_prefix"].duplicated().sum()

0

In [15]:
zip_counts.head()

Unnamed: 0,tot_customers_in_zip,tot_sellers_in_zip,zip_code_prefix
0,17.0,,14409.0
1,19.0,,9790.0
2,9.0,,1151.0
3,13.0,,8775.0
4,42.0,1.0,13056.0


In [16]:
zip_counts[["tot_customers_in_zip", "tot_sellers_in_zip"]] = (
    zip_counts[["tot_customers_in_zip", "tot_sellers_in_zip"]].fillna(0)
)


## MERGE INTO CUSTOMERS (MANY-TO-ONE ON ZIP): MANY CUSTOMER ROWS PER ZIP -> ONE ROW PER ZIP IN THE ZIP COUNT BRIDGE TABLE ##

In [17]:
zip_counts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15078 entries, 0 to 15077
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   tot_customers_in_zip  15078 non-null  float64
 1   tot_sellers_in_zip    15078 non-null  float64
 2   zip_code_prefix       15078 non-null  float64
dtypes: float64(3)
memory usage: 353.5 KB


In [18]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


In [19]:
customers.head(3)

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP


In [20]:
customers = customers.merge(
    zip_counts[["zip_code_prefix", "tot_customers_in_zip", "tot_sellers_in_zip"]],
    left_on="customer_zip_code_prefix",
    right_on="zip_code_prefix",
    how="left"
).drop(columns=["zip_code_prefix"])


In [21]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               99441 non-null  object 
 1   customer_unique_id        99441 non-null  object 
 2   customer_zip_code_prefix  99441 non-null  int64  
 3   customer_city             99441 non-null  object 
 4   customer_state            99441 non-null  object 
 5   tot_customers_in_zip      99441 non-null  float64
 6   tot_sellers_in_zip        99441 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 5.3+ MB
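
The row count stays at 99441 because the bridge table has one row per zip. A minimal sketch, assuming toy frames, of how `merge(..., validate="many_to_one")` can enforce that property instead of checking it by eye:

```python
import pandas as pd

# Toy frames (made up) mirroring the shapes used in the merge above.
toy_customers = pd.DataFrame(
    {"customer_id": ["a1", "a2", "b1"], "customer_zip_code_prefix": [1001, 1001, 2002]}
)
toy_zip_counts = pd.DataFrame(
    {
        "zip_code_prefix": [1001, 2002, 3003],
        "tot_customers_in_zip": [2.0, 1.0, 0.0],
        "tot_sellers_in_zip": [0.0, 1.0, 4.0],
    }
)

merge_args = dict(
    left_on="customer_zip_code_prefix",
    right_on="zip_code_prefix",
    how="left",
    validate="many_to_one",  # raises MergeError if the right keys are not unique
)

enriched = toy_customers.merge(toy_zip_counts, **merge_args).drop(
    columns=["zip_code_prefix"]
)

# A bridge table with a repeated zip is rejected rather than silently fanning out rows.
dup_zip_counts = pd.concat([toy_zip_counts, toy_zip_counts.iloc[[0]]])
try:
    toy_customers.merge(dup_zip_counts, **merge_args)
    caught = False
except pd.errors.MergeError:
    caught = True
```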


In [22]:
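# Sellers per customer in the same zip. Both counts are already non-null,
# so fillna(0) is only a safeguard; the +1 keeps the denominator strictly positive.
customers["seller_customer_ratio"] = (
    customers["tot_sellers_in_zip"].fillna(0) /
    (customers["tot_customers_in_zip"].fillna(0) + 1)
)
customers.head()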

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,tot_customers_in_zip,tot_sellers_in_zip,seller_customer_ratio
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP,17.0,0.0,0.0
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP,19.0,0.0,0.0
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP,9.0,0.0,0.0
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP,13.0,0.0,0.0
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP,42.0,1.0,0.023256
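
As a quick worked check of the ratio, the sketch below reproduces the value for zip 13056 (42 customers, 1 seller) and adds one made-up row with zero customers to show what the `+ 1` in the denominator does.

```python
import pandas as pd

# First two rows use counts shown in the output above; the third row is hypothetical.
toy = pd.DataFrame(
    {
        "tot_customers_in_zip": [42.0, 17.0, 0.0],
        "tot_sellers_in_zip": [1.0, 0.0, 3.0],
    }
)

toy["seller_customer_ratio"] = toy["tot_sellers_in_zip"] / (
    toy["tot_customers_in_zip"] + 1
)
```

For the first row this gives 1 / 43 ≈ 0.023256, matching the notebook output. The hypothetical zero-customer row yields a finite 3.0 instead of a division by zero.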


## ADDING GEO FEATURES ##

In [23]:
geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   geolocation_zip_code_prefix  1000163 non-null  int64  
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object 
 4   geolocation_state            1000163 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB


In [24]:
geo.head()

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


In [25]:
# All geolocation rows whose zip prefix occurs more than once, grouped together by zip.
geo_dupes = (
    geo[geo["geolocation_zip_code_prefix"].duplicated(keep=False)]
    .sort_values("geolocation_zip_code_prefix")
)

geo_dupes.head()


Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
864,1001,-23.549825,-46.63397,sao paulo,SP
583,1001,-23.551337,-46.634027,sao paulo,SP
596,1001,-23.550498,-46.634338,sao paulo,SP
1246,1001,-23.549292,-46.633559,sao paulo,SP
1062,1001,-23.550498,-46.634338,sao paulo,SP


In [26]:
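# Collapse to one row per zip prefix using the most frequent latitude and longitude.
# The two modes are computed independently; on ties, mode() returns sorted values,
# so iloc[0] picks the smallest.
geo_zip = (
    geo.groupby("geolocation_zip_code_prefix")
    .agg(
        geolocation_lat=("geolocation_lat", lambda s: s.mode().iloc[0]),
        geolocation_lng=("geolocation_lng", lambda s: s.mode().iloc[0])
    )
    .reset_index()
)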


In [27]:
geo_zip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19015 entries, 0 to 19014
Data columns (total 3 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   geolocation_zip_code_prefix  19015 non-null  int64  
 1   geolocation_lat              19015 non-null  float64
 2   geolocation_lng              19015 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 445.8 KB


In [28]:
geo_zip['geolocation_zip_code_prefix'].duplicated().sum()

0
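
Because latitude and longitude modes are taken separately, the resulting pair is not guaranteed to be a point that actually occurs in the data. The sketch below, on made-up coordinates, contrasts that approach with one possible alternative: picking the most frequent (lat, lng) pair. This alternative is an illustration, not what the notebook does.

```python
import pandas as pd

# Toy geolocation rows for a single zip (coordinates are made up).
toy_geo = pd.DataFrame(
    {
        "geolocation_zip_code_prefix": [1001] * 6,
        "geolocation_lat": [-23.550, -23.550, -23.550, -23.551, -23.549, -23.552],
        "geolocation_lng": [-46.633, -46.633, -46.635, -46.634, -46.634, -46.634],
    }
)

# Approach used in the notebook: independent mode per coordinate.
independent = (
    toy_geo.groupby("geolocation_zip_code_prefix")
    .agg(
        geolocation_lat=("geolocation_lat", lambda s: s.mode().iloc[0]),
        geolocation_lng=("geolocation_lng", lambda s: s.mode().iloc[0]),
    )
    .reset_index()
)

# Alternative sketch: most frequent (lat, lng) pair per zip (ties broken arbitrarily).
pair_mode = (
    toy_geo.groupby(
        ["geolocation_zip_code_prefix", "geolocation_lat", "geolocation_lng"]
    )
    .size()
    .reset_index(name="n")
    .sort_values(["geolocation_zip_code_prefix", "n"], ascending=[True, False])
    .drop_duplicates("geolocation_zip_code_prefix")
    .drop(columns="n")
    .reset_index(drop=True)
)

observed = set(zip(toy_geo["geolocation_lat"], toy_geo["geolocation_lng"]))
independent_pair = (
    independent.loc[0, "geolocation_lat"],
    independent.loc[0, "geolocation_lng"],
)
```

In this toy case the independent modes combine into a point that never appears in `toy_geo`, while the pair-based result is an observed row. For zip-level features the difference is usually a few hundred meters at most, so either choice may be acceptable.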

In [29]:
customers = customers.merge(
    geo_zip[["geolocation_zip_code_prefix", "geolocation_lat", "geolocation_lng"]],
    left_on="customer_zip_code_prefix",
    right_on="geolocation_zip_code_prefix",
    how="left"
).drop(columns=["geolocation_zip_code_prefix"])


In [30]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               99441 non-null  object 
 1   customer_unique_id        99441 non-null  object 
 2   customer_zip_code_prefix  99441 non-null  int64  
 3   customer_city             99441 non-null  object 
 4   customer_state            99441 non-null  object 
 5   tot_customers_in_zip      99441 non-null  float64
 6   tot_sellers_in_zip        99441 non-null  float64
 7   seller_customer_ratio     99441 non-null  float64
 8   geolocation_lat           99163 non-null  float64
 9   geolocation_lng           99163 non-null  float64
dtypes: float64(5), int64(1), object(4)
memory usage: 7.6+ MB


In [31]:
customers[customers["geolocation_lat"].isna() | customers["geolocation_lng"].isna()][
    "customer_zip_code_prefix"
].unique()


array([72300, 11547, 64605, 72465,  7729, 72904, 35408, 78554, 73369,
        8980, 29949, 65137, 28655, 73255, 28388,  6930, 71676, 64047,
       61906, 83210, 71919, 36956, 35242, 72005, 29718, 41347, 70324,
       70686, 72341, 12332, 70716, 71905, 75784, 73082, 71884, 71574,
       72238, 71996, 76968, 71975, 72595, 72017, 72596, 67105, 25840,
       72002, 72821, 85118, 25919, 95853, 72583, 68511, 70701, 71591,
       72535, 95572, 73090, 72242, 86135, 70316, 73091, 41098, 58734,
       73310, 71810, 72280,  7430, 73081, 70333, 72268, 35104, 72455,
       72237, 17390, 76897, 84623, 70702, 72760, 73088, 29196, 36596,
       57254, 71995, 73093, 75257, 48504, 83843, 62625, 37005, 73401,
       49870, 13307, 28617, 73402, 56327, 71976, 72587, 85958, 19740,
       77404, 44135, 28120, 72863, 87323, 87511, 72440, 72243, 65830,
       71261, 28575,  2140, 71551, 72023, 28160, 55027, 43870, 94370,
       38710, 42716, 36248, 71593, 71953, 72549, 72457, 56485, 71590,
       93602,  7412,

In [32]:
geo_zip.loc[geo_zip['geolocation_zip_code_prefix'].isin([72300, 7430, 71976])]

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng
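
The empty result confirms that the spot-checked zips are absent from the geolocation table. A hedged sketch of checking every customer zip at once, using made-up toy frames, is an anti-join with `isin`:

```python
import pandas as pd

# Toy frames (made up): zips 72300 and 7430 have no geolocation entry.
toy_customers = pd.DataFrame({"customer_zip_code_prefix": [1001, 72300, 7430, 1001]})
toy_geo_zip = pd.DataFrame({"geolocation_zip_code_prefix": [1001, 2002]})

# Customer rows whose zip has no match in the geo dimension.
unmatched = ~toy_customers["customer_zip_code_prefix"].isin(
    toy_geo_zip["geolocation_zip_code_prefix"]
)
missing_zips = sorted(
    toy_customers.loc[unmatched, "customer_zip_code_prefix"].unique().tolist()
)
n_rows_affected = int(unmatched.sum())
```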


In [33]:
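# Flag rows whose zip had no match in the geo dimension before filling them.
customers["is_latlng_missing"] = (
    customers["geolocation_lat"].isna() | customers["geolocation_lng"].isna()
).astype(int)

# Fill with 0 as a sentinel. (0, 0) is a valid coordinate, so downstream code
# should rely on is_latlng_missing rather than on the filled values.
customers[["geolocation_lat", "geolocation_lng"]] = (
    customers[["geolocation_lat", "geolocation_lng"]].fillna(0)
)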


In [34]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               99441 non-null  object 
 1   customer_unique_id        99441 non-null  object 
 2   customer_zip_code_prefix  99441 non-null  int64  
 3   customer_city             99441 non-null  object 
 4   customer_state            99441 non-null  object 
 5   tot_customers_in_zip      99441 non-null  float64
 6   tot_sellers_in_zip        99441 non-null  float64
 7   seller_customer_ratio     99441 non-null  float64
 8   geolocation_lat           99441 non-null  float64
 9   geolocation_lng           99441 non-null  float64
 10  is_latlng_missing         99441 non-null  int64  
dtypes: float64(5), int64(2), object(4)
memory usage: 8.3+ MB
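
Since the filled coordinates are placeholders, any statistic over latitude or longitude should exclude the flagged rows. A minimal sketch on toy data (the frame is an assumption) shows the effect on a simple mean:

```python
import numpy as np
import pandas as pd

# Toy frame (made up): the second row has no geolocation match.
toy = pd.DataFrame(
    {"geolocation_lat": [-23.5, np.nan], "geolocation_lng": [-46.6, np.nan]}
)

toy["is_latlng_missing"] = (
    toy["geolocation_lat"].isna() | toy["geolocation_lng"].isna()
).astype(int)
toy[["geolocation_lat", "geolocation_lng"]] = toy[
    ["geolocation_lat", "geolocation_lng"]
].fillna(0)

# Naive mean is pulled toward the 0 placeholder; the masked mean is not.
naive_mean_lat = toy["geolocation_lat"].mean()
masked_mean_lat = toy.loc[toy["is_latlng_missing"] == 0, "geolocation_lat"].mean()
```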


## VALIDATION ##

In [35]:
customers.head()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,tot_customers_in_zip,tot_sellers_in_zip,seller_customer_ratio,geolocation_lat,geolocation_lng,is_latlng_missing
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP,17.0,0.0,0.0,-20.513713,-47.396644,0
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP,19.0,0.0,0.0,-23.724495,-46.548297,0
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP,9.0,0.0,0.0,-23.531294,-46.656404,0
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP,13.0,0.0,0.0,-23.49693,-46.185352,0
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP,42.0,1.0,0.023256,-22.972931,-47.140439,0


In [36]:
customers.to_csv("../Processed Data/prd_customers.csv", index=False)
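
The validation above is a visual check of `head()` and `info()`. As an optional complement, the same expectations can be written as explicit checks. The helper `check_customers_mart` below is hypothetical and only sketches the idea on toy frames.

```python
import pandas as pd


def check_customers_mart(df, expected_rows):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["customer_id"].duplicated().any():
        problems.append("duplicate customer_id values")
    if df.isna().any().any():
        problems.append("null values present")
    return problems


# Toy frames (made up) to exercise the helper.
good = pd.DataFrame({"customer_id": ["a", "b"], "geolocation_lat": [1.0, 2.0]})
bad = pd.DataFrame({"customer_id": ["a", "a"], "geolocation_lat": [1.0, None]})

good_problems = check_customers_mart(good, expected_rows=2)
bad_problems = check_customers_mart(bad, expected_rows=3)
```

For the real frame, `expected_rows` would be the 99441 rows of the source customers table, since every merge in this notebook is a many-to-one left join.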