# Brazilian E-Commerce Public Dataset by Olist (Ongoing)

##### This project explores and cleans the Olist Brazilian E-Commerce dataset, a large real-world dataset containing customer, order, payment, product, and review information.
##### The goal is to prepare a clean, analysis-ready dataset to study key business questions such as:
##### - Customer behavior and satisfaction
##### - Payment methods and spending patterns
##### - Delivery performance and logistics
##### - Seller and product insights. 

## Steps:
### 1. Installing Kaggle and importing the data set
### 2. Import Libraries
### 3. Download all related data sets
### 4. Define cleaning functions
### 5. Clean and explore all data sets before merging
#### - 5.1. Olist Customer Dataset
#### - 5.2. Olist Geolocation Dataset
#### - 5.3. Olist Order Items Dataset
#### - 5.4. Olist Order Payments Dataset
#### - 5.5. Olist Order Reviews Dataset
#### - 5.6. Olist Orders Dataset
#### - 5.7. Olist Products
#### - 5.8. Olist Sellers
#### - 5.9. Olist Product Category


## 1. Installing Kaggle and importing the data set

In [4]:
# Install the Kaggle API client
!pip install kaggle



In [5]:
# Verify Kaggle is installed
import kaggle
print("Kaggle API is installed")

Kaggle API is installed


In [6]:
# list the datasets
!kaggle datasets list

ref                                                              title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
neurocipher/heartdisease                                         Heart Disease                                         3KB  2025-12-11 15:29:14           2114        264  1.0              
suvidyasonawane/student-academic-placement-performance-dataset   Student Academic Placement Performance Dataset       92KB  2026-01-11 02:02:47              0         23  1.0              
kundanbedmutha/exam-score-prediction-dataset                     Exam Score Prediction Dataset                       318KB  2025-11-28 07:29:01           5863        296  1.0              
neurocipher/student-performance                        

In [7]:
!kaggle datasets list -s "olist"

ref                                                               title                                               size  lastUpdated          downloadCount  voteCount  usabilityRating  
----------------------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
olistbr/brazilian-ecommerce                                       Brazilian E-Commerce Public Dataset by Olist        43MB  2021-10-01 19:08:27         420735       3869  1.0              
olistbr/marketing-funnel-olist                                    Marketing Funnel by Olist                          278KB  2018-11-16 14:00:20          17747        319  1.0              
terencicp/e-commerce-dataset-by-olist-as-an-sqlite-database       E-commerce dataset by Olist (SQLite)                49MB  2024-04-28 14:56:35           9422         95  1.0              
erak1006/brazilian-e-commerce-company-olist            

In [8]:
# download the dataset
!kaggle datasets download -d olistbr/brazilian-ecommerce --unzip

Dataset URL: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
License(s): CC-BY-NC-SA-4.0
Downloading brazilian-ecommerce.zip to /Users/fatemehshahvirdi
100%|██████████████████████████████████████| 42.6M/42.6M [00:08<00:00, 4.07MB/s]
100%|██████████████████████████████████████| 42.6M/42.6M [00:08<00:00, 5.15MB/s]


## 2. Import libraries

In [10]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [11]:
# Checking the present files in the directory
print(os.listdir())

['.config', 'Music', '.kaggle', '.condarc', 'olist_sellers_dataset.csv', 'Untitled1.ipynb', '.DS_Store', '.dbvis', 'product_category_name_translation.csv', '.CFUserTextEncoding', 'Udemy-backup.ipynb', '.xonshrc', 'anaconda_projects', 'Full Disk Access', '.zshrc', 'olist_orders_dataset.csv', '.psql_history', 'Pictures', 'Udemy.ipynb', 'Entertainment', 'olist_order_items_dataset.csv', '.zsh_history', 'Untitled2.ipynb', '.ipython', 'Desktop', 'Library', '.matplotlib', '.pgadmin', 'Public', '.tcshrc', 'olist_customers_dataset.csv', '.virtual_documents', '.anaconda', 'Movies', 'Applications', 'udemy_courses.csv', 'olist_geolocation_dataset.csv', '.Trash', 'olist_order_payments_dataset.csv', '.ipynb_checkpoints', '.jupyter', 'Documents', '.vscode', '.bash_profile', 'Photos', 'Work-Related', 'Brazilian E-Commerce Public Dataset by Olist.ipynb', 'Downloads', '.empty', '.continuum', 'Brazilian E-Commerce Public Dataset by Olist-2.ipynb', '.zsh_sessions', 'olist_order_reviews_dataset.csv', '.con

## 3. Download all related data sets

In [13]:
# Read the CSV files
df_olist_customers = pd.read_csv("olist_customers_dataset.csv")
df_olist_geolocation = pd.read_csv("olist_geolocation_dataset.csv")
df_olist_order_items = pd.read_csv("olist_order_items_dataset.csv")
df_olist_order_payments = pd.read_csv("olist_order_payments_dataset.csv")
df_olist_order_reviews = pd.read_csv("olist_order_reviews_dataset.csv")
df_olist_orders = pd.read_csv("olist_orders_dataset.csv")
df_olist_products = pd.read_csv("olist_products_dataset.csv")
df_olist_sellers = pd.read_csv("olist_sellers_dataset.csv")
df_olist_product_category_name = pd.read_csv("product_category_name_translation.csv")

## 4. Define cleaning functions

##### Instead of repeating the same basic cleaning steps for every dataset, I define one function and reuse it later.

In [16]:
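def clean_basics(df, text_cols=None, date_cols=None):
    """Drop exact duplicate rows, normalize text columns, and parse date columns."""
    # 1. Drop exact duplicates; .copy() gives an independent frame so the
    #    column assignments below do not trigger a SettingWithCopyWarning
    df = df.drop_duplicates().copy()

    # 2. Clean text columns (trim + lowercase)
    if text_cols:
        for col in text_cols:
            df[col] = df[col].str.strip().str.lower()

    # 3. Convert date columns to datetime (unparseable values become NaT)
    if date_cols:
        for col in date_cols:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    return df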
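
A quick way to sanity-check the helper is to run it on a tiny frame. The sketch below repeats the same steps as the function above so that it runs on its own; the `toy` data and its column names are invented for illustration and are not part of the Olist files.

```python
import pandas as pd

def clean_basics(df, text_cols=None, date_cols=None):
    # Same steps as the helper above, with a defensive .copy()
    df = df.drop_duplicates().copy()
    for col in text_cols or []:
        df[col] = df[col].str.strip().str.lower()
    for col in date_cols or []:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df

# Hypothetical toy data: one exact duplicate row, messy text, one unparseable date
toy = pd.DataFrame({
    "city": ["  Sao Paulo", "  Sao Paulo", "CAMPINAS "],
    "created": ["2018-01-18 00:00:00", "2018-01-18 00:00:00", "not a date"],
})

cleaned = clean_basics(toy, text_cols=["city"], date_cols=["created"])
print(cleaned)
```

Note that duplicates are dropped before the text is normalized, so two rows that differ only in case or surrounding whitespace would both survive this helper.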

## 5. Clean and explore all data sets before merging

### 5.1. Olist Customer Dataset

In [19]:
df_olist_customers.head()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP


In [20]:
df_olist_customers.shape

(99441, 5)

In [21]:
df_olist_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


In [22]:
# Using the function:
df_olist_customers = clean_basics(
    df_olist_customers,
    text_cols=['customer_city', 'customer_state'],
    date_cols=None
)

In [23]:
df_olist_customers.shape

(99441, 5)

In [24]:
# Validate ID lengths (all IDs are expected to be 32-character, hex-style strings)
# Note: the notebook displays only the result of the last expression
df_olist_customers['customer_id'].str.len().value_counts().head()
df_olist_customers['customer_unique_id'].str.len().value_counts().head()

customer_unique_id
32    99441
Name: count, dtype: int64
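
The length check above confirms that every ID has 32 characters, but it does not confirm that the characters are hexadecimal. A minimal sketch of a stricter check, assuming the IDs are lowercase hex as the samples in the `head()` output suggest; the `ids` Series here is a made-up sample, not the real column.

```python
import pandas as pd

# Hypothetical sample: two well-formed IDs and one malformed value
ids = pd.Series([
    "06b8999e2fba1a1fbc88172c00ba8bc7",
    "18955e83d337fd6b2def6b18a428ac77",
    "not-a-valid-id",
])

# True where the whole string is exactly 32 lowercase hex characters
is_hex32 = ids.str.fullmatch(r"[0-9a-f]{32}")
print(is_hex32.tolist())   # [True, True, False]
print((~is_hex32).sum())   # 1 malformed ID
```

On the real data the same pattern could be applied to `customer_id` and `customer_unique_id`.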

In [25]:
# Check postal codes for unrealistic values (Brazilian ZIP prefixes usually range roughly between 01000 and 99999)
df_olist_customers['customer_zip_code_prefix'].describe()

count    99441.000000
mean     35137.474583
std      29797.938996
min       1003.000000
25%      11347.000000
50%      24416.000000
75%      58900.000000
max      99990.000000
Name: customer_zip_code_prefix, dtype: float64
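
The minimum of 1003 has only four digits because the column was read as `int64`, which drops leading zeros. If the prefix is later displayed or compared with string prefixes, one option is to convert it to a zero-padded string. This is a sketch that assumes prefixes are five digits wide; the values are taken from the `head()` output above.

```python
import pandas as pd

# Prefixes from the head() output; stored as integers, so leading zeros are gone
prefixes = pd.Series([14409, 9790, 1151, 8775, 13056])

# Assumption: prefixes are five digits wide, so pad on the left with zeros
padded = prefixes.astype(str).str.zfill(5)
print(padded.tolist())  # ['14409', '09790', '01151', '08775', '13056']
```

Whichever representation is chosen, both sides of a later merge need the same dtype for the key.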

In [26]:
# Strip leading/trailing whitespace from every text value (stray spaces might cause merging issues later)
# DataFrame.map replaces the deprecated applymap in pandas >= 2.1; use applymap on older versions
df_olist_customers = df_olist_customers.map(
    lambda x: x.strip() if isinstance(x, str) else x
)


### 5.2. Olist Geolocation Dataset

In [28]:
df_olist_geolocation.head()

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


In [29]:
df_olist_geolocation.shape

(1000163, 5)

In [30]:
# Using the function:
df_olist_geolocation = clean_basics(
    df_olist_geolocation,
    text_cols=['geolocation_city', 'geolocation_state'],
    date_cols=None
)



In [31]:
df_olist_geolocation.shape

(738332, 5)

In [67]:
# validate latitude and longitude ranges
df_olist_geolocation[['geolocation_lat','geolocation_lng']].describe()

Unnamed: 0,geolocation_lat,geolocation_lng
count,738332.0,738332.0
mean,-20.998353,-46.461098
std,5.892315,4.393705
min,-36.605374,-101.466766
25%,-23.603061,-48.867822
50%,-22.873588,-46.647278
75%,-19.923336,-43.836974
max,45.065933,121.105394


##### Brazil’s approximate bounding box:
##### Latitude: −35 ≤ lat ≤ +5
##### Longitude: −75 ≤ lng ≤ −30
##### Anything outside that box is invalid for Brazil. The min and max values above show that there are clearly a few outliers, and I am curious to find out how many.

In [75]:
invalid_coords = df_olist_geolocation.query(
    "geolocation_lat < -35 or geolocation_lat > 5 or geolocation_lng < -75 or geolocation_lng > -30"
)
print("invalid_coordinates: ", len(invalid_coords))
print("Percentage of invalid rows: ", round(len(invalid_coords) / len(df_olist_geolocation) * 100, 4), "%")

invalid_coordinates:  25
Percentage of invalid rows:  0.0034 %
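
The same bounding-box rule can also be written as a boolean mask, which is convenient when the check is needed both for counting and for filtering. This is a minimal sketch: the helper name `inside_brazil_bbox` is hypothetical, the bounds are the approximate ones stated above, and the two sample rows are copied from outputs in this notebook.

```python
import pandas as pd

def inside_brazil_bbox(df, lat_col="geolocation_lat", lng_col="geolocation_lng"):
    """Return a boolean mask for rows inside the approximate bounding box."""
    # Series.between is inclusive on both ends by default
    return df[lat_col].between(-35, 5) & df[lng_col].between(-75, -30)

# One row in Sao Paulo and one row that lies far outside Brazil
sample = pd.DataFrame({
    "geolocation_lat": [-23.545621, 41.614052],
    "geolocation_lng": [-46.639292, -8.411675],
})

mask = inside_brazil_bbox(sample)
print(mask.tolist())   # [True, False]
print((~mask).sum())   # 1 row outside the box
```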


##### I decide to remove them, but first I want to see where they are, because I am curious :D

In [78]:
invalid_coords[['geolocation_zip_code_prefix','geolocation_lat' , 'geolocation_lng' ,'geolocation_city' ,'geolocation_state']].head(10)

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
387565,18243,28.008978,-15.536867,bom retiro da esperanca,sp
513631,28165,41.614052,-8.411675,vila nova de campos,rj
513754,28155,42.439286,13.820214,santa maria,rj
514429,28333,38.381672,-6.3282,raposo,rj
516682,28595,43.684961,-7.41108,portela,rj
538512,29654,29.409252,-98.484121,santo antônio do canaã,es
538557,29654,21.657547,-101.466766,santo antonio do canaa,es
585242,35179,25.995203,-98.078544,santana do paraíso,mg
585260,35179,25.995245,-98.078533,santana do paraiso,mg
695377,45936,38.323939,-6.775035,itabatan,ba


##### This inspection of invalid rows revealed latitude and longitude values located far outside Brazil (e.g. Europe, North America, and Asia) despite Brazilian city names.
##### These were likely geocoding mismatches or input errors, so I remove them.

In [95]:
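n_before = len(df_olist_geolocation)
df_olist_geolocation = df_olist_geolocation.query(
    "-35 <= geolocation_lat <= 5 and -75 <= geolocation_lng <= -30"
)
# Report the actual number of dropped rows instead of a hard-coded value
print(n_before - len(df_olist_geolocation), "invalid coordinates removed")
print("New Shape: ", df_olist_geolocation.shape)

25 invalid coordinates removed
New Shape:  (738307, 5)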
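
Even after this filter, the geolocation table still holds many rows per ZIP prefix (prefix 1046 already appears twice in the `head()` output), so merging it directly onto customers or sellers would multiply rows. One possible way to handle this, which is an assumption on my side and not something the notebook does yet, is to reduce the table to one representative point per prefix:

```python
import pandas as pd

# Rows from the head() output above: prefix 1046 appears twice
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1046, 1046],
    "geolocation_lat": [-23.545621, -23.546081, -23.546129],
    "geolocation_lng": [-46.639292, -46.644820, -46.642951],
})

# Assumption: the median coordinate is a reasonable representative per prefix
geo_per_zip = (
    geo.groupby("geolocation_zip_code_prefix", as_index=False)
       [["geolocation_lat", "geolocation_lng"]]
       .median()
)
print(geo_per_zip)
```

The median is less sensitive to any remaining stray points than the mean, but either choice is a simplification.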


### 5.3. Olist Order Items Dataset

In [98]:
df_olist_order_items.head()

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14


In [100]:
df_olist_order_items.shape

(112650, 7)

In [104]:
# Using the function:
df_olist_order_items = clean_basics(
    df_olist_order_items,
    date_cols=['shipping_limit_date'],
    text_cols=None
)

In [114]:
df_olist_order_items.info()
df_olist_order_items.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   order_id             112650 non-null  object        
 1   order_item_id        112650 non-null  int64         
 2   product_id           112650 non-null  object        
 3   seller_id            112650 non-null  object        
 4   shipping_limit_date  112650 non-null  datetime64[ns]
 5   price                112650 non-null  float64       
 6   freight_value        112650 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 6.0+ MB


order_id               0
order_item_id          0
product_id             0
seller_id              0
shipping_limit_date    0
price                  0
freight_value          0
dtype: int64

In [116]:
# Check that each ('order_id', 'order_item_id') pair is unique
df_olist_order_items.duplicated(
    subset = ['order_id', 'order_item_id']
).sum()

0

In [122]:
# Check for negative values in price and freight_value
(df_olist_order_items[['price', 'freight_value']] < 0).sum()

price            0
freight_value    0
dtype: int64

### 5.4. Olist Order Payments Dataset

In [127]:
df_olist_order_payments.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45


In [133]:
df_olist_order_payments.shape

(103886, 5)

In [135]:
df_olist_order_payments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


In [137]:
df_olist_order_payments = clean_basics(
    df_olist_order_payments
)

In [139]:
# Ensure each payment sequence is unique per order
df_olist_order_payments.duplicated(
    subset=['order_id', 'payment_sequential']
).sum()

0
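
Because one order can have several payment rows (that is what `payment_sequential` counts), joining payments onto orders as-is would repeat each order once per payment. A hedged sketch of one way to get a single row per order before merging; the data and the output column names `total_paid` and `n_payments` are made up for illustration.

```python
import pandas as pd

# Hypothetical payments: order "a" was paid with two payment methods
payments = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "payment_sequential": [1, 2, 1],
    "payment_type": ["credit_card", "voucher", "credit_card"],
    "payment_value": [80.0, 20.0, 24.39],
})

# Collapse to one row per order: total amount and number of payment rows
per_order = (
    payments.groupby("order_id", as_index=False)
    .agg(
        total_paid=("payment_value", "sum"),
        n_payments=("payment_sequential", "count"),
    )
)
print(per_order)
```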

### 5.5. Olist Order Reviews Dataset

In [144]:
df_olist_order_reviews.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53


In [146]:
df_olist_order_reviews.shape

(99224, 7)

In [150]:
df_olist_order_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                99224 non-null  object
 1   order_id                 99224 non-null  object
 2   review_score             99224 non-null  int64 
 3   review_comment_title     11568 non-null  object
 4   review_comment_message   40977 non-null  object
 5   review_creation_date     99224 non-null  object
 6   review_answer_timestamp  99224 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB


In [152]:
df_olist_order_reviews = clean_basics(
    df_olist_order_reviews,
    date_cols=['review_creation_date', 'review_answer_timestamp'],
    text_cols=None
)

In [154]:
# Review scores should only be between 1 and 5
df_olist_order_reviews['review_score'].value_counts().sort_index()

review_score
1    11424
2     3151
3     8179
4    19142
5    57328
Name: count, dtype: int64

In [160]:
# Ensure one review per order. If an order has multiple reviews, joins will duplicate rows
# Each order is expected to have a single review, so any count above 0 needs handling
df_olist_order_reviews.duplicated(
    subset=['order_id']
).sum()

551

#### Those 551 extra review rows would duplicate order rows after a merge.
#### Metrics like revenue, counts, and averages would then be wrong, so I decide to keep only the latest review per order.

In [163]:
# Keep the latest review per order based on review_creation_date
df_olist_order_reviews = (
    df_olist_order_reviews
    .sort_values('review_creation_date')
    .drop_duplicates(subset=['order_id'], keep='last')
)

In [165]:
df_olist_order_reviews.duplicated(subset=['order_id']).sum()

0
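
To make the "keep the latest review" rule concrete, here is the same sort-then-drop pattern on a tiny example. The three reviews are invented; only the column names match the real table.

```python
import pandas as pd

# Hypothetical reviews: order "o1" was reviewed twice
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["o1", "o2", "o1"],
    "review_score": [2, 5, 4],
    "review_creation_date": pd.to_datetime(
        ["2018-01-10", "2018-02-01", "2018-03-05"]
    ),
})

# Sort oldest to newest, then keep the last (newest) row for each order
latest = (
    reviews
    .sort_values("review_creation_date")
    .drop_duplicates(subset=["order_id"], keep="last")
)
print(latest[["order_id", "review_id", "review_score"]])
```

For order `o1` the later review `r3` (score 4) is kept and `r1` is dropped. If two reviews of one order share the same creation date, adding `review_answer_timestamp` as a second sort key may be a reasonable tie-breaker.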

### 5.6. Olist Orders Dataset

In [169]:
df_olist_orders.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00


In [171]:
df_olist_orders.shape

(99441, 8)

In [173]:
df_olist_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


In [175]:
df_olist_orders = clean_basics(
    df_olist_orders,
    date_cols=[
        'order_purchase_timestamp',
        'order_approved_at',
        'order_delivered_carrier_date',
        'order_delivered_customer_date',
        'order_estimated_delivery_date'
    ],
    text_cols=None
)


In [177]:
# Ensure one row per order
df_olist_orders.duplicated(subset=['order_id']).sum()

0

In [179]:
# Orders should not be delivered before they were purchased (comparisons with missing dates evaluate to False)
(
    df_olist_orders['order_delivered_customer_date']
    < df_olist_orders['order_purchase_timestamp']
).sum()

0
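
With the date columns parsed, delivery performance can be derived directly from them. The sketch below uses the first two rows of the orders `head()` output; the derived column names `delivery_days` and `is_late` are my own and only illustrate the idea.

```python
import pandas as pd

# First two rows of the orders head() output above
orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-10-02 10:56:33", "2018-07-24 20:41:37"]
    ),
    "order_delivered_customer_date": pd.to_datetime(
        ["2017-10-10 21:25:13", "2018-08-07 15:27:45"]
    ),
    "order_estimated_delivery_date": pd.to_datetime(
        ["2017-10-18", "2018-08-13"]
    ),
})

# Whole days between purchase and delivery
orders["delivery_days"] = (
    orders["order_delivered_customer_date"] - orders["order_purchase_timestamp"]
).dt.days

# Hypothetical flag: delivered after the estimated delivery date
orders["is_late"] = (
    orders["order_delivered_customer_date"] > orders["order_estimated_delivery_date"]
)
print(orders[["delivery_days", "is_late"]])
```

Both sample orders arrived before the estimate, after 8 and 13 whole days. On the full table, undelivered orders have a missing delivery date, so their `is_late` value comes out as `False` and they should be filtered by `order_status` first.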

### 5.7. Olist Products

In [187]:
df_olist_products.head()

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0


In [193]:
df_olist_products.shape

(32951, 9)

In [191]:
df_olist_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category_name       32341 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32949 non-null  float64
 6   product_length_cm           32949 non-null  float64
 7   product_height_cm           32949 non-null  float64
 8   product_width_cm            32949 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB


In [197]:
df_olist_products = clean_basics(df_olist_products)

In [199]:
# Ensure one row per product
df_olist_products.duplicated(subset=['product_id']).sum()

0

In [201]:
# Physical attributes should not be negative
(df_olist_products[
    ['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']
] < 0).sum()

product_weight_g     0
product_length_cm    0
product_height_cm    0
product_width_cm     0
dtype: int64

### 5.8. Olist Sellers

In [207]:
df_olist_sellers.head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


In [209]:
df_olist_sellers.shape

(3095, 4)

In [211]:
df_olist_sellers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


In [213]:
df_olist_sellers = clean_basics(df_olist_sellers)

In [215]:
# Ensure one row per seller (prevents row duplication when joining)
df_olist_sellers.duplicated(subset=['seller_id']).sum()

0

In [217]:
# Seller state should be 2-letter codes (like SP, RJ)
df_olist_sellers['seller_state'].str.len().value_counts()

seller_state
2    3095
Name: count, dtype: int64

### 5.9. Olist Product Category

In [220]:
df_olist_product_category_name.head()

Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor


In [224]:
df_olist_product_category_name.shape

(71, 2)

In [226]:
df_olist_product_category_name.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     object
 1   product_category_name_english  71 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


In [228]:
df_olist_product_category_name = clean_basics(
    df_olist_product_category_name
)

In [230]:
# Ensure one row per product category
df_olist_product_category_name.duplicated(
    subset=['product_category_name']
).sum()

0
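
Since the category names are unique, the translation table can be attached to products with a many-to-one left merge. The sketch below uses small invented frames (the English labels are placeholders, and the missing category mimics the null values reported by `info()` for the products table) to show how `validate` and `indicator` can guard the merge.

```python
import pandas as pd

# Hypothetical products: "p3" has no category name
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", "artes", None],
})

# Hypothetical slice of the translation table
translation = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes"],
    "product_category_name_english": ["perfumery", "art"],
})

merged = products.merge(
    translation,
    on="product_category_name",
    how="left",              # keep every product, translated or not
    validate="many_to_one",  # raise if a category appears twice in translation
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are products whose category is missing or has no translation, which makes them easy to inspect before deciding how to label them.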