# Data Cleaning

In this notebook, I will clean the 9 raw Olist datasets to prepare them for analysis.  
The main tasks include:
- Handling missing values
- Fixing inconsistent column names
- Dropping duplicates where necessary
- Converting data types (e.g., dates)
- Standardizing categorical values

The datasets being cleaned are:
1. Orders  
2. Customers  
3. Order Items  
4. Payments  
5. Products  
6. Sellers  
7. Reviews  
8. Geolocation  
9. Categories


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# Paths
DATA_RAW = Path("../data/raw")
DATA_CLEANED = Path("../data/cleaned")
DATA_CLEANED.mkdir(parents=True, exist_ok=True)

# Load datasets
orders = pd.read_csv(DATA_RAW / "olist_orders_dataset.csv")
customers = pd.read_csv(DATA_RAW / "olist_customers_dataset.csv")
order_items = pd.read_csv(DATA_RAW / "olist_order_items_dataset.csv")
payments = pd.read_csv(DATA_RAW / "olist_order_payments_dataset.csv")
products = pd.read_csv(DATA_RAW / "olist_products_dataset.csv")
sellers = pd.read_csv(DATA_RAW / "olist_sellers_dataset.csv")
reviews = pd.read_csv(DATA_RAW / "olist_order_reviews_dataset.csv")
geolocation = pd.read_csv(DATA_RAW / "olist_geolocation_dataset.csv")
categories = pd.read_csv(DATA_RAW / "product_category_name_translation.csv")

print("✅ Raw data loaded")


✅ Raw data loaded


## Orders
The orders dataset contains purchase and delivery information.  
Cleaning tasks:
- Ensure date columns are in datetime format.
- Keep all rows (a missing delivery date usually means the order was canceled or has not been delivered yet, not that the record is invalid).
- Drop duplicates if any.
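
The date conversions can also be written as a loop over the column names. Below is a minimal sketch on a toy frame; the rows are made up and are not taken from the Olist data. Using `errors="coerce"` is my assumption about how unparseable strings should be treated: they become `NaT` instead of raising an error. The last step counts missing delivery dates per `order_status`, which is one way to check which statuses lack a delivery date.

```python
import pandas as pd

# Toy stand-in for the orders table (values are illustrative only).
toy_orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "order_status": ["delivered", "canceled", "shipped"],
    "order_purchase_timestamp": [
        "2018-01-01 10:00:00", "2018-01-02 11:30:00", "2018-01-03 09:15:00",
    ],
    "order_delivered_customer_date": ["2018-01-05 14:00:00", None, "not a date"],
})

date_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]
for col in date_cols:
    # Unparseable strings become NaT instead of raising an error.
    toy_orders[col] = pd.to_datetime(toy_orders[col], errors="coerce")

# Count missing delivery dates per order status.
missing_by_status = (
    toy_orders["order_delivered_customer_date"].isna()
    .groupby(toy_orders["order_status"])
    .sum()
)
print(missing_by_status)
```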


In [20]:
# Orders cleaning
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])
orders['order_approved_at'] = pd.to_datetime(orders['order_approved_at'])
orders['order_delivered_carrier_date'] = pd.to_datetime(orders['order_delivered_carrier_date'])
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'])
orders['order_estimated_delivery_date'] = pd.to_datetime(orders['order_estimated_delivery_date'])

# Drop duplicates
orders.drop_duplicates(inplace=True)

orders.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       99441 non-null  object        
 1   customer_id                    99441 non-null  object        
 2   order_status                   99441 non-null  object        
 3   order_purchase_timestamp       99441 non-null  datetime64[ns]
 4   order_approved_at              99281 non-null  datetime64[ns]
 5   order_delivered_carrier_date   97658 non-null  datetime64[ns]
 6   order_delivered_customer_date  96476 non-null  datetime64[ns]
 7   order_estimated_delivery_date  99441 non-null  datetime64[ns]
dtypes: datetime64[ns](5), object(3)
memory usage: 6.1+ MB


## Customers
The customers dataset links unique customers to orders.  
Cleaning tasks:
- Remove duplicates.
- Handle missing values if any (should be rare).
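
`drop_duplicates()` only removes rows that are identical in every column, so it is worth checking the key column separately. The sketch below uses a made-up toy frame, not the real customers file. Note that a repeated `customer_unique_id` is not a problem in itself, since one customer can be linked to several orders.

```python
import pandas as pd

# Toy stand-in for the customers table (made-up values).
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c2", "c3"],
    "customer_unique_id": ["u1", "u2", "u2", "u1"],
    "customer_city": ["sao paulo", "campinas", "campinas", "sao paulo"],
})

# Rows that are identical to an earlier row in every column.
full_row_dupes = int(toy_customers.duplicated().sum())

deduped = toy_customers.drop_duplicates()

# After dropping exact duplicates, the key column should be unique.
key_is_unique = deduped["customer_id"].is_unique

# Missing values per column.
missing_per_column = deduped.isna().sum()

print(f"exact duplicate rows: {full_row_dupes}, customer_id unique: {key_is_unique}")
print(missing_per_column)
```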


In [21]:
# Customers cleaning
customers.drop_duplicates(inplace=True)

customers.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


## Order Items
The order items dataset links products to orders.  
Cleaning tasks:
- Drop duplicates.
- Ensure `price` and `freight_value` are non-negative.
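
A `>= 0` filter also drops rows where the value is missing, because comparisons with `NaN` evaluate to `False`. The sketch below, on made-up data, counts negative rows and dropped rows separately so the difference is visible before anything is removed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the order items table (made-up values).
toy_items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "price": [10.0, -5.0, np.nan, 20.0],
    "freight_value": [2.5, 1.0, 3.0, 0.0],
})

# Rows that pass the non-negativity check on both columns.
valid = (toy_items["price"] >= 0) & (toy_items["freight_value"] >= 0)

# Rows with an actual negative value.
n_negative = int(((toy_items["price"] < 0) | (toy_items["freight_value"] < 0)).sum())

# Rows the filter would remove; this also includes the NaN price in o3.
n_dropped = int((~valid).sum())

cleaned_items = toy_items[valid].reset_index(drop=True)
print(f"negative rows: {n_negative}, rows dropped by the filter: {n_dropped}")
```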


In [22]:
# Order items cleaning
order_items.drop_duplicates(inplace=True)

# Ensure no negative values
order_items = order_items[(order_items['price'] >= 0) & (order_items['freight_value'] >= 0)]

order_items.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB


## Payments
The payments dataset contains payment details.  
Cleaning tasks:
- Drop duplicates.
- Remove rows where payment values are negative (if any).



In [23]:
# Payments cleaning
payments.drop_duplicates(inplace=True)
payments = payments[payments['payment_value'] >= 0]

payments.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


## Products
The products dataset has metadata about products.  
Cleaning tasks:
- Fix column name typos (`lenght` → `length`).
- Fill missing category names with `"unknown"`.
- Fill missing numeric values with the median.
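
The three steps can be sketched on a toy frame as follows. The values are made up; only the misspelled column name mirrors the raw file. Selecting numeric columns with `select_dtypes` instead of listing them by hand is my own variation, not necessarily what the cell below does.

```python
import numpy as np
import pandas as pd

# Toy stand-in using a misspelled column name (values are made up).
toy_products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", None, "artes"],
    "product_name_lenght": [40.0, np.nan, 50.0],
    "product_weight_g": [100.0, 300.0, np.nan],
})

# Replace the "lenght" typo wherever it appears in a column name.
toy_products = toy_products.rename(columns=lambda c: c.replace("lenght", "length"))

# Missing categories get an explicit placeholder label.
toy_products["product_category_name"] = (
    toy_products["product_category_name"].fillna("unknown")
)

# Fill each numeric column with its own median.
num_cols = toy_products.select_dtypes(include="number").columns
toy_products[num_cols] = toy_products[num_cols].fillna(toy_products[num_cols].median())
print(toy_products)
```

The median is less sensitive to extreme weights or dimensions than the mean, which is why it is a common default for this kind of fill.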


In [36]:
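# Fix the "lenght" typo in the raw column names (no-op if already renamed)
products = products.rename(columns={
    "product_name_lenght": "product_name_length",
    "product_description_lenght": "product_description_length",
})

# Fill missing category names with "unknown"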
products["product_category_name"] = products["product_category_name"].fillna("unknown")

# List of numerical columns to clean
num_cols = [
    "product_name_length",
    "product_description_length",
    "product_photos_qty",
    "product_weight_g",
    "product_length_cm",
    "product_height_cm",
    "product_width_cm"
]

# Fill missing numerical values with the median of each column
for col in num_cols:
    products[col] = products[col].fillna(products[col].median())

products.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category_name       32951 non-null  object 
 2   product_name_length         32951 non-null  float64
 3   product_description_length  32951 non-null  float64
 4   product_photos_qty          32951 non-null  float64
 5   product_weight_g            32951 non-null  float64
 6   product_length_cm           32951 non-null  float64
 7   product_height_cm           32951 non-null  float64
 8   product_width_cm            32951 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB


## Sellers
The sellers dataset contains seller information.  
Cleaning tasks:
- Drop duplicates.


In [25]:
# Sellers cleaning
sellers.drop_duplicates(inplace=True)

sellers.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


## Reviews
The reviews dataset contains customer reviews and ratings.  
Cleaning tasks:
- Convert date columns to datetime.
- Drop duplicates.


In [26]:
# Reviews cleaning
reviews['review_creation_date'] = pd.to_datetime(reviews['review_creation_date'])
reviews['review_answer_timestamp'] = pd.to_datetime(reviews['review_answer_timestamp'])
reviews.drop_duplicates(inplace=True)

reviews.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   review_id                99224 non-null  object        
 1   order_id                 99224 non-null  object        
 2   review_score             99224 non-null  int64         
 3   review_comment_title     11568 non-null  object        
 4   review_comment_message   40977 non-null  object        
 5   review_creation_date     99224 non-null  datetime64[ns]
 6   review_answer_timestamp  99224 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 5.3+ MB


## Geolocation
The geolocation dataset contains postal codes and coordinates.  
Cleaning tasks:
- Drop duplicates.
- Standardize postal codes (remove spaces, keep only numbers).
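
One detail to watch: when the prefix column is read as an integer, any leading zero is lost (`01037` becomes `1037`). The sketch below keeps digits only and pads back to a fixed width. The five-digit width is my assumption about the prefix format, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in: prefixes as they might look after loading (made-up values).
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, " 4571 ", "13-023", 88010],
})

zip_clean = (
    toy_geo["geolocation_zip_code_prefix"]
    .astype(str)
    .str.replace(r"\D", "", regex=True)  # keep digits only (also removes spaces)
    .str.zfill(5)                        # assumption: prefixes are 5 digits wide
)
toy_geo["geolocation_zip_code_prefix"] = zip_clean
print(toy_geo["geolocation_zip_code_prefix"].tolist())
```

If the prefixes are later joined against the customers or sellers tables, both sides need the same type and padding, otherwise the join silently finds no matches.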


In [27]:
# Geolocation cleaning
geolocation.drop_duplicates(inplace=True)
# Convert to string and keep digits only (this also removes any spaces)
geolocation['geolocation_zip_code_prefix'] = (
    geolocation['geolocation_zip_code_prefix'].astype(str).str.replace(r"\D", "", regex=True)
)

geolocation.info()


<class 'pandas.core.frame.DataFrame'>
Index: 738332 entries, 0 to 1000161
Data columns (total 5 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   geolocation_zip_code_prefix  738332 non-null  object 
 1   geolocation_lat              738332 non-null  float64
 2   geolocation_lng              738332 non-null  float64
 3   geolocation_city             738332 non-null  object 
 4   geolocation_state            738332 non-null  object 
dtypes: float64(2), object(3)
memory usage: 33.8+ MB


## Categories
The categories dataset maps Portuguese category names to English.  
Cleaning tasks:
- Drop duplicates.
- Fill missing translations with `"unknown"`.
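
Missing translations tend to show up when the table is joined to products, for categories that have no entry in the mapping. The sketch below uses made-up rows, and `categoria_nova` is a hypothetical category name invented for the example.

```python
import pandas as pd

# Toy products, one with a category that is absent from the mapping.
toy_products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", "artes", "categoria_nova"],
})

# Toy translation table containing one exact duplicate row.
toy_categories = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes", "artes"],
    "product_category_name_english": ["perfumery", "art", "art"],
})

# Duplicate mappings would multiply rows in the join, so drop them first.
toy_categories = toy_categories.drop_duplicates()

# Left join keeps every product; unmatched categories get NaN, then "unknown".
merged = toy_products.merge(toy_categories, on="product_category_name", how="left")
merged["product_category_name_english"] = (
    merged["product_category_name_english"].fillna("unknown")
)
print(merged)
```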


In [29]:
# Categories cleaning
categories.drop_duplicates(inplace=True)
categories['product_category_name_english'] = categories['product_category_name_english'].fillna("unknown")

categories.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     object
 1   product_category_name_english  71 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


## Cleaning Summary

To sanity-check the cleaning process, I print the number of rows and columns for each dataset after cleaning.  
Comparing these shapes with the raw files helps confirm:
- No accidental data loss.
- Row counts changed only where duplicates were removed.
- Missing values were filled or kept, not dropped.
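
Shapes alone do not show whether duplicates or missing values remain. A small helper along these lines could report them next to the row counts. The function name `summarize` and the toy frames are my own, chosen for illustration.

```python
import pandas as pd

def summarize(frames):
    """Return one row per dataset: shape, exact-duplicate rows, missing cells."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "duplicate_rows": int(df.duplicated().sum()),
            "missing_cells": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Toy frames standing in for the cleaned datasets (made-up values).
toy_frames = {
    "A": pd.DataFrame({"x": [1, 2, 2], "y": ["a", "b", "b"]}),
    "B": pd.DataFrame({"x": [1.0, None], "y": ["a", None]}),
}
summary = summarize(toy_frames)
print(summary)
```

Passing the `datasets` dictionary from the cell below to the same helper would give the equivalent table for the real data.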


In [37]:
# Create a dictionary of datasets
datasets = {
    "Orders": orders,
    "Customers": customers,
    "Order Items": order_items,
    "Payments": payments,
    "Products": products,
    "Sellers": sellers,
    "Reviews": reviews,
    "Geolocation": geolocation,
    "Categories": categories
}

# Print a summary of cleaned datasets
print("📊 Dataset Summary After Cleaning:\n")
for name, df in datasets.items():
    print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")


📊 Dataset Summary After Cleaning:

Orders: 99441 rows, 8 columns
Customers: 99441 rows, 5 columns
Order Items: 112650 rows, 7 columns
Payments: 103886 rows, 5 columns
Products: 32951 rows, 9 columns
Sellers: 3095 rows, 4 columns
Reviews: 99224 rows, 7 columns
Geolocation: 738332 rows, 5 columns
Categories: 71 rows, 2 columns


## Export Cleaned Datasets

After cleaning, I will export each dataset into the `data/processed/` folder.  
This ensures:
- The raw data is preserved in `data/raw/`
- The cleaned versions are stored separately
- Future analysis notebooks can simply load the cleaned data instead of repeating the cleaning steps
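
One caveat when reloading: CSV files do not store dtypes, so datetime columns come back as plain strings unless they are parsed again. The sketch below shows the round trip on a made-up frame, writing to an in-memory buffer instead of a file so it stays self-contained.

```python
import io
import pandas as pd

# Toy stand-in for a cleaned table with a datetime column (made-up values).
toy_orders = pd.DataFrame({
    "order_id": ["a1", "a2"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01 10:00:00", "2018-01-02 11:30:00"]
    ),
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy_orders.to_csv(buffer, index=False)

# Plain read: the timestamp column is not restored as datetime.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with parse_dates: the column is converted back to datetime.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["order_purchase_timestamp"])

print(plain["order_purchase_timestamp"].dtype)
print(parsed["order_purchase_timestamp"].dtype)
```

The same applies to identifier-like columns stored as strings, which `read_csv` may infer as integers unless a `dtype` is given.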


In [38]:
# Define output folder and create it if it does not exist yet
output_path = "../data/processed/"
Path(output_path).mkdir(parents=True, exist_ok=True)

# Save each cleaned dataset to CSV
orders.to_csv(output_path + "orders_cleaned.csv", index=False)
customers.to_csv(output_path + "customers_cleaned.csv", index=False)
order_items.to_csv(output_path + "order_items_cleaned.csv", index=False)
payments.to_csv(output_path + "payments_cleaned.csv", index=False)
products.to_csv(output_path + "products_cleaned.csv", index=False)
sellers.to_csv(output_path + "sellers_cleaned.csv", index=False)
reviews.to_csv(output_path + "reviews_cleaned.csv", index=False)
geolocation.to_csv(output_path + "geolocation_cleaned.csv", index=False)
categories.to_csv(output_path + "categories_cleaned.csv", index=False)

print("✅ All cleaned datasets have been exported to /data/processed/")


✅ All cleaned datasets have been exported to /data/processed/
