# Exploratory Data Analysis - Brazilian E-Commerce Dataset

This notebook performs exploratory data analysis (EDA) on the Brazilian E-Commerce Public Dataset using ydata_profiling.

## Dataset Overview
- **9 CSV files** covering roughly 100k orders placed between 2016 and 2018
- **Key tables**: Orders, Customers, Order Items, Products, Sellers, Reviews, Payments, Geolocation, plus a product category translation table

## Objectives
1. Load and explore each dataset
2. Generate comprehensive profiling reports
3. Identify data quality issues
4. Understand relationships between tables

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
import os
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Define Data Paths and Dataset Files

In [2]:
# Define paths
RAW_DATA_PATH = '../data/raw/'
REPORTS_PATH = '../reports/'

# Ensure reports directory exists
os.makedirs(REPORTS_PATH, exist_ok=True)

# Define dataset files
datasets = {
    'orders': 'olist_orders_dataset.csv',
    'customers': 'olist_customers_dataset.csv',
    'order_items': 'olist_order_items_dataset.csv',
    'order_payments': 'olist_order_payments_dataset.csv',
    'order_reviews': 'olist_order_reviews_dataset.csv',
    'products': 'olist_products_dataset.csv',
    'sellers': 'olist_sellers_dataset.csv',
    'geolocation': 'olist_geolocation_dataset.csv',
    'category_translation': 'product_category_name_translation.csv'
}

print(f"Data path: {RAW_DATA_PATH}")
print(f"Reports path: {REPORTS_PATH}")
print(f"\nDatasets to analyze: {len(datasets)}")

Data path: ../data/raw/
Reports path: ../reports/

Datasets to analyze: 9


## 2. Load All Datasets

In [3]:
# Load all datasets
data = {}

for name, filename in datasets.items():
    filepath = os.path.join(RAW_DATA_PATH, filename)
    try:
        df = pd.read_csv(filepath)
        data[name] = df
        print(f"✓ Loaded {name}: {df.shape[0]:,} rows × {df.shape[1]} columns")
    except Exception as e:
        print(f"✗ Error loading {name}: {e}")

print(f"\nTotal datasets loaded: {len(data)}")

✓ Loaded orders: 99,441 rows × 8 columns
✓ Loaded customers: 99,441 rows × 5 columns
✓ Loaded order_items: 112,650 rows × 7 columns
✓ Loaded order_payments: 103,886 rows × 5 columns
✓ Loaded order_reviews: 99,224 rows × 7 columns
✓ Loaded products: 32,951 rows × 9 columns
✓ Loaded sellers: 3,095 rows × 4 columns
✓ Loaded geolocation: 1,000,163 rows × 5 columns
✓ Loaded category_translation: 71 rows × 2 columns

Total datasets loaded: 9


## 3. Quick Overview of Each Dataset

In [4]:
# Display basic information for each dataset
for name, df in data.items():
    print(f"\n{'='*60}")
    print(f"Dataset: {name.upper()}")
    print(f"{'='*60}")
    print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nFirst 3 rows:")
    display(df.head(3))
    print(f"\nData types:")
    print(df.dtypes)
    print(f"\nMissing values:")
    print(df.isnull().sum())


Dataset: ORDERS
Shape: 99,441 rows × 8 columns

Columns: ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date']

First 3 rows:


| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |



Data types:
order_id                         object
customer_id                      object
order_status                     object
order_purchase_timestamp         object
order_approved_at                object
order_delivered_carrier_date     object
order_delivered_customer_date    object
order_estimated_delivery_date    object
dtype: object

Missing values:
order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
dtype: int64

Dataset: CUSTOMERS
Shape: 99,441 rows × 5 columns

Columns: ['customer_id', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state']

First 3 rows:


| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |



Data types:
customer_id                 object
customer_unique_id          object
customer_zip_code_prefix     int64
customer_city               object
customer_state              object
dtype: object

Missing values:
customer_id                 0
customer_unique_id          0
customer_zip_code_prefix    0
customer_city               0
customer_state              0
dtype: int64

Dataset: ORDER_ITEMS
Shape: 112,650 rows × 7 columns

Columns: ['order_id', 'order_item_id', 'product_id', 'seller_id', 'shipping_limit_date', 'price', 'freight_value']

First 3 rows:


| | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.9 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.9 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.0 | 17.87 |



Data types:
order_id                object
order_item_id            int64
product_id              object
seller_id               object
shipping_limit_date     object
price                  float64
freight_value          float64
dtype: object

Missing values:
order_id               0
order_item_id          0
product_id             0
seller_id              0
shipping_limit_date    0
price                  0
freight_value          0
dtype: int64

Dataset: ORDER_PAYMENTS
Shape: 103,886 rows × 5 columns

Columns: ['order_id', 'payment_sequential', 'payment_type', 'payment_installments', 'payment_value']

First 3 rows:


| | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |



Data types:
order_id                 object
payment_sequential        int64
payment_type             object
payment_installments      int64
payment_value           float64
dtype: object

Missing values:
order_id                0
payment_sequential      0
payment_type            0
payment_installments    0
payment_value           0
dtype: int64

Dataset: ORDER_REVIEWS
Shape: 99,224 rows × 7 columns

Columns: ['review_id', 'order_id', 'review_score', 'review_comment_title', 'review_comment_message', 'review_creation_date', 'review_answer_timestamp']

First 3 rows:


| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |



Data types:
review_id                  object
order_id                   object
review_score                int64
review_comment_title       object
review_comment_message     object
review_creation_date       object
review_answer_timestamp    object
dtype: object

Missing values:
review_id                      0
order_id                       0
review_score                   0
review_comment_title       87656
review_comment_message     58247
review_creation_date           0
review_answer_timestamp        0
dtype: int64

Dataset: PRODUCTS
Shape: 32,951 rows × 9 columns

Columns: ['product_id', 'product_category_name', 'product_name_lenght', 'product_description_lenght', 'product_photos_qty', 'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']

First 3 rows:


| | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |



Data types:
product_id                     object
product_category_name          object
product_name_lenght           float64
product_description_lenght    float64
product_photos_qty            float64
product_weight_g              float64
product_length_cm             float64
product_height_cm             float64
product_width_cm              float64
dtype: object

Missing values:
product_id                      0
product_category_name         610
product_name_lenght           610
product_description_lenght    610
product_photos_qty            610
product_weight_g                2
product_length_cm               2
product_height_cm               2
product_width_cm                2
dtype: int64

Dataset: SELLERS
Shape: 3,095 rows × 4 columns

Columns: ['seller_id', 'seller_zip_code_prefix', 'seller_city', 'seller_state']

First 3 rows:


| | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |



Data types:
seller_id                 object
seller_zip_code_prefix     int64
seller_city               object
seller_state              object
dtype: object

Missing values:
seller_id                 0
seller_zip_code_prefix    0
seller_city               0
seller_state              0
dtype: int64

Dataset: GEOLOCATION
Shape: 1,000,163 rows × 5 columns

Columns: ['geolocation_zip_code_prefix', 'geolocation_lat', 'geolocation_lng', 'geolocation_city', 'geolocation_state']

First 3 rows:


| | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state |
|---|---|---|---|---|---|
| 0 | 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1 | 1046 | -23.546081 | -46.64482 | sao paulo | SP |
| 2 | 1046 | -23.546129 | -46.642951 | sao paulo | SP |



Data types:
geolocation_zip_code_prefix      int64
geolocation_lat                float64
geolocation_lng                float64
geolocation_city                object
geolocation_state               object
dtype: object

Missing values:
geolocation_zip_code_prefix    0
geolocation_lat                0
geolocation_lng                0
geolocation_city               0
geolocation_state              0
dtype: int64

Dataset: CATEGORY_TRANSLATION
Shape: 71 rows × 2 columns

Columns: ['product_category_name', 'product_category_name_english']

First 3 rows:


| | product_category_name | product_category_name_english |
|---|---|---|
| 0 | beleza_saude | health_beauty |
| 1 | informatica_acessorios | computers_accessories |
| 2 | automotivo | auto |



Data types:
product_category_name            object
product_category_name_english    object
dtype: object

Missing values:
product_category_name            0
product_category_name_english    0
dtype: int64
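
The dtype listings above show every timestamp column in `orders` (and the date columns in `order_items` and `order_reviews`) loaded as `object`, so date arithmetic is not possible until they are parsed. The following is a minimal sketch of that conversion, assuming the `YYYY-MM-DD HH:MM:SS` layout seen in the previews holds for the whole file. The two-row `orders` frame is a stand-in built from the preview rows (with one delivery date blanked out for illustration), and `delivery_days` is a hypothetical derived column, not something the notebook computes.

```python
import pandas as pd

# Stand-in for data['orders']: two rows from the preview, one delivery date blanked out.
orders = pd.DataFrame({
    "order_id": ["e481f51cbdc54678b7cc49136f2d6af7", "53cdb2fc8bc7dce0b6741e2150273451"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", None],
})

date_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]
for col in date_cols:
    # errors="coerce" turns unparseable or missing values into NaT instead of raising.
    orders[col] = pd.to_datetime(orders[col], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Hypothetical derived column: whole days between purchase and delivery.
orders["delivery_days"] = (
    orders["order_delivered_customer_date"] - orders["order_purchase_timestamp"]
).dt.days

print(orders.dtypes)
print(orders[["order_id", "delivery_days"]])
```

An alternative is to pass `parse_dates=[...]` to `pd.read_csv` at load time; converting afterwards keeps the generic loading loop unchanged.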


## 4. Generate ydata_profiling Reports

This section generates a profiling report for each dataset. Each report is saved as an HTML file named `<dataset>_profile.html` under `REPORTS_PATH` (`../reports/`).

In [5]:
# Function to generate profiling report
def generate_profile_report(df, name, minimal=False):
    """
    Generate ydata_profiling report for a dataset.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The dataset to analyze
    name : str
        Name of the dataset (used for report title)
    minimal : bool, optional
        If True, generates a minimal report (faster for large datasets)
    """
    print(f"\nGenerating profile report for {name}...")
    
    try:
        # Configure report settings
        profile = ProfileReport(
            df,
            title=f"{name.replace('_', ' ').title()} - Data Profile",
            minimal=minimal,
            explorative=True,
            samples={'head': 5, 'tail': 5},
            correlations={
                'pearson': {'calculate': True},
                'spearman': {'calculate': True},
                'kendall': {'calculate': False},
                'phi_k': {'calculate': True},
                'cramers': {'calculate': True}
            }
        )
        
        # Save report
        report_path = os.path.join(REPORTS_PATH, f"{name}_profile.html")
        profile.to_file(report_path)
        print(f"✓ Report saved: {report_path}")
        
        return profile
        
    except Exception as e:
        print(f"✗ Error generating report for {name}: {e}")
        return None

### 4.1 Orders Dataset (Central Table)

In [6]:
# Generate report for orders (central table)
if 'orders' in data:
    orders_profile = generate_profile_report(data['orders'], 'orders', minimal=False)
    if orders_profile:
        # Display summary statistics
        print("\nOrders Dataset Summary:")
        print(f"Total orders: {data['orders'].shape[0]:,}")
        print(f"\nOrder status distribution:")
        print(data['orders']['order_status'].value_counts())


Generating profile report for orders...



✓ Report saved: ../reports/orders_profile.html

Orders Dataset Summary:
Total orders: 99,441

Order status distribution:
order_status
delivered      96478
shipped         1107
canceled         625
unavailable      609
invoiced         314
processing       301
created            5
approved           2
Name: count, dtype: int64
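
Section 3 reported 2,965 missing `order_delivered_customer_date` values, which is close to the 2,963 orders whose status is anything other than `delivered` (99,441 minus 96,478). That suggests most of the gaps are structural rather than data errors, but the two counts do not match exactly, so it is worth checking status by status. A minimal sketch of that check, using a small made-up frame in place of `data['orders']` (the rows and the `summary` name are illustrative assumptions):

```python
import pandas as pd

# Made-up stand-in for data['orders']; only the two columns needed for the check.
orders = pd.DataFrame({
    "order_status": ["delivered", "delivered", "shipped", "canceled", "delivered"],
    "order_delivered_customer_date": [
        "2017-10-10 21:25:13", None, None, None, "2018-08-07 15:27:45",
    ],
})

# Boolean flag per row, then count missing and total rows within each status.
missing = orders["order_delivered_customer_date"].isna()
summary = missing.groupby(orders["order_status"]).agg(missing="sum", total="count")
summary["missing_pct"] = (summary["missing"] / summary["total"] * 100).round(2)

print(summary)
```

Rows with status `delivered` but no delivery date are the ones that deserve a closer look during cleaning.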


### 4.2 Customers Dataset

In [7]:
# Generate report for customers
if 'customers' in data:
    customers_profile = generate_profile_report(data['customers'], 'customers', minimal=False)
    if customers_profile:
        print("\nCustomers Dataset Summary:")
        print(f"Total customers: {data['customers'].shape[0]:,}")
        print(f"\nTop 5 states by customer count:")
        print(data['customers']['customer_state'].value_counts().head())


Generating profile report for customers...



✓ Report saved: ../reports/customers_profile.html

Customers Dataset Summary:
Total customers: 99,441

Top 5 states by customer count:
customer_state
SP    41746
RJ    12852
MG    11635
RS     5466
PR     5045
Name: count, dtype: int64
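
One data-type issue is visible in the customers preview: `customer_zip_code_prefix` is read as `int64`, so prefixes print as `9790` and `1151` with any leading zero dropped. The sketch below shows two ways to keep them as fixed-width strings, assuming prefixes are meant to be five digits and that the raw CSV stores the leading zeros (both assumptions should be checked against the actual file). The inline `csv_text` is a made-up three-row sample, not the real data.

```python
import io
import pandas as pd

# Made-up sample in the shape of olist_customers_dataset.csv (assumed to keep leading zeros).
csv_text = """customer_id,customer_zip_code_prefix,customer_state
06b8999e2fba1a1fbc88172c00ba8bc7,14409,SP
18955e83d337fd6b2def6b18a428ac77,09790,SP
4e7b3e00288586ebd08712fdd0374a03,01151,SP
"""

# Default parsing infers int64 and silently drops the leading zeros.
as_int = pd.read_csv(io.StringIO(csv_text))

# Option 1: declare the column as a string at load time.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"customer_zip_code_prefix": str})

# Option 2: repair an already-loaded integer column by left-padding to five characters.
repaired = as_int["customer_zip_code_prefix"].astype(str).str.zfill(5)

print(as_int["customer_zip_code_prefix"].tolist())
print(as_str["customer_zip_code_prefix"].tolist())
print(repaired.tolist())
```

The same treatment would apply to `seller_zip_code_prefix` and `geolocation_zip_code_prefix`, so that joins on the prefix compare like with like.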


### 4.3 Order Items Dataset

In [8]:
# Generate report for order items
if 'order_items' in data:
    order_items_profile = generate_profile_report(data['order_items'], 'order_items', minimal=False)
    if order_items_profile:
        print("\nOrder Items Dataset Summary:")
        print(f"Total order items: {data['order_items'].shape[0]:,}")
        print(f"\nPrice statistics:")
        print(data['order_items']['price'].describe())


Generating profile report for order_items...



✓ Report saved: ../reports/order_items_profile.html

Order Items Dataset Summary:
Total order items: 112,650

Price statistics:
count    112650.000000
mean        120.653739
std         183.633928
min           0.850000
25%          39.900000
50%          74.990000
75%         134.900000
max        6735.000000
Name: price, dtype: float64


### 4.4 Order Payments Dataset

In [9]:
# Generate report for order payments
if 'order_payments' in data:
    order_payments_profile = generate_profile_report(data['order_payments'], 'order_payments', minimal=False)
    if order_payments_profile:
        print("\nOrder Payments Dataset Summary:")
        print(f"Total payment records: {data['order_payments'].shape[0]:,}")
        print(f"\nPayment type distribution:")
        print(data['order_payments']['payment_type'].value_counts())


Generating profile report for order_payments...



✓ Report saved: ../reports/order_payments_profile.html

Order Payments Dataset Summary:
Total payment records: 103,886

Payment type distribution:
payment_type
credit_card    76795
boleto         19784
voucher         5775
debit_card      1529
not_defined        3
Name: count, dtype: int64
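
There are 103,886 payment records for 99,441 orders, so some orders must carry more than one payment row (the `payment_sequential` column points the same way). Before joining payments to orders, it can help to collapse them to one row per order. A minimal sketch, with made-up order IDs and the hypothetical output names `n_payments` and `total_paid`:

```python
import pandas as pd

# Made-up stand-in for data['order_payments']; order "o1" is paid in two parts.
payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "payment_sequential": [1, 2, 1],
    "payment_type": ["credit_card", "voucher", "boleto"],
    "payment_value": [99.33, 24.39, 65.71],
})

# One row per order: how many payment rows it has and the total amount paid.
per_order = payments.groupby("order_id", as_index=False).agg(
    n_payments=("payment_sequential", "count"),
    total_paid=("payment_value", "sum"),
)

print(per_order)
```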


### 4.5 Order Reviews Dataset

In [10]:
# Generate report for order reviews
if 'order_reviews' in data:
    order_reviews_profile = generate_profile_report(data['order_reviews'], 'order_reviews', minimal=False)
    if order_reviews_profile:
        print("\nOrder Reviews Dataset Summary:")
        print(f"Total reviews: {data['order_reviews'].shape[0]:,}")
        print(f"\nReview score distribution:")
        print(data['order_reviews']['review_score'].value_counts().sort_index())


Generating profile report for order_reviews...



✓ Report saved: ../reports/order_reviews_profile.html

Order Reviews Dataset Summary:
Total reviews: 99,224

Review score distribution:
review_score
1    11424
2     3151
3     8179
4    19142
5    57328
Name: count, dtype: int64


### 4.6 Products Dataset

In [11]:
# Generate report for products
if 'products' in data:
    products_profile = generate_profile_report(data['products'], 'products', minimal=False)
    if products_profile:
        print("\nProducts Dataset Summary:")
        print(f"Total products: {data['products'].shape[0]:,}")
        print(f"\nProduct categories: {data['products']['product_category_name'].nunique()}")


Generating profile report for products...



✓ Report saved: ../reports/products_profile.html

Products Dataset Summary:
Total products: 32,951

Product categories: 73
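
The products table has 73 distinct category names, while the translation table loaded earlier has only 71 rows, so at least two categories have no English name. The sketch below shows one way to list them, using tiny stand-in frames: the translations are taken from the sample output in this notebook, and `hypothetical_category` is an invented placeholder, not a real category.

```python
import pandas as pd

# Stand-ins for data['products'] and data['category_translation'].
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3", "p4"],
    "product_category_name": ["perfumaria", "esporte_lazer", "hypothetical_category", None],
})
translation = pd.DataFrame({
    "product_category_name": ["perfumaria", "esporte_lazer", "beleza_saude"],
    "product_category_name_english": ["perfumery", "sports_leisure", "health_beauty"],
})

# Categories that appear in products but not in the translation table.
product_categories = set(products["product_category_name"].dropna())
untranslated = sorted(product_categories - set(translation["product_category_name"]))

# A left join keeps every product; untranslated or missing categories end up as NaN.
merged = products.merge(translation, on="product_category_name", how="left")

print(untranslated)
print(merged)
```

After the join, missing English names come from two sources (no category at all, or a category without a translation), which is why the set difference is computed on the non-null names only.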


### 4.7 Sellers Dataset

In [12]:
# Generate report for sellers
if 'sellers' in data:
    sellers_profile = generate_profile_report(data['sellers'], 'sellers', minimal=False)
    if sellers_profile:
        print("\nSellers Dataset Summary:")
        print(f"Total sellers: {data['sellers'].shape[0]:,}")
        print(f"\nTop 5 states by seller count:")
        print(data['sellers']['seller_state'].value_counts().head())


Generating profile report for sellers...



✓ Report saved: ../reports/sellers_profile.html

Sellers Dataset Summary:
Total sellers: 3,095

Top 5 states by seller count:
seller_state
SP    1849
PR     349
MG     244
SC     190
RJ     171
Name: count, dtype: int64


### 4.8 Geolocation Dataset (Large Dataset - Minimal Report)

In [13]:
# Generate minimal report for geolocation (large dataset)
if 'geolocation' in data:
    geolocation_profile = generate_profile_report(data['geolocation'], 'geolocation', minimal=True)
    if geolocation_profile:
        print("\nGeolocation Dataset Summary:")
        print(f"Total geolocation records: {data['geolocation'].shape[0]:,}")
        print(f"\nUnique zip codes: {data['geolocation']['geolocation_zip_code_prefix'].nunique():,}")


Generating profile report for geolocation...



✓ Report saved: ../reports/geolocation_profile.html

Geolocation Dataset Summary:
Total geolocation records: 1,000,163

Unique zip codes: 19,015
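
With 1,000,163 rows for only 19,015 unique prefixes, each zip code prefix maps to many coordinate pairs, so joining this table directly to customers or sellers would multiply rows. One possible preparation step, sketched here under the assumption that a single representative point per prefix is good enough, is to drop exact duplicates and then take the median latitude and longitude per prefix. The four-row `geo` frame reuses the preview rows plus one repeated row; `centroids`, `lat`, and `lng` are illustrative names.

```python
import pandas as pd

# Stand-in for data['geolocation']: preview rows plus one exact duplicate.
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1046, 1046, 1046],
    "geolocation_lat": [-23.545621, -23.546081, -23.546129, -23.546129],
    "geolocation_lng": [-46.639292, -46.644820, -46.642951, -46.642951],
    "geolocation_state": ["SP", "SP", "SP", "SP"],
})

# Step 1: remove rows that are identical in every column.
deduped = geo.drop_duplicates()

# Step 2: one row per prefix, using the median as a representative point.
centroids = deduped.groupby("geolocation_zip_code_prefix", as_index=False).agg(
    lat=("geolocation_lat", "median"),
    lng=("geolocation_lng", "median"),
)

print(f"{len(geo)} rows -> {len(deduped)} after dedup -> {len(centroids)} prefixes")
print(centroids)
```

The median is less sensitive than the mean to stray coordinates, which matters if some points fall far from the rest of their prefix.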


### 4.9 Category Translation Dataset

In [14]:
# Generate report for category translation
if 'category_translation' in data:
    category_translation_profile = generate_profile_report(data['category_translation'], 'category_translation', minimal=False)
    if category_translation_profile:
        print("\nCategory Translation Dataset Summary:")
        print(f"Total categories: {data['category_translation'].shape[0]:,}")
        print(f"\nSample translations:")
        print(data['category_translation'].head(10))


Generating profile report for category_translation...



✓ Report saved: ../reports/category_translation_profile.html

Category Translation Dataset Summary:
Total categories: 71

Sample translations:
    product_category_name product_category_name_english
0            beleza_saude                 health_beauty
1  informatica_acessorios         computers_accessories
2              automotivo                          auto
3         cama_mesa_banho                bed_bath_table
4        moveis_decoracao               furniture_decor
5           esporte_lazer                sports_leisure
6              perfumaria                     perfumery
7   utilidades_domesticas                    housewares
8               telefonia                     telephony
9      relogios_presentes                 watches_gifts


## 5. Data Quality Summary

In [15]:
# Create a summary of data quality metrics
quality_summary = []

for name, df in data.items():
    total_cells = df.shape[0] * df.shape[1]
    missing_cells = df.isnull().sum().sum()
    missing_percentage = (missing_cells / total_cells) * 100
    duplicate_rows = df.duplicated().sum()
    
    quality_summary.append({
        'Dataset': name,
        'Rows': df.shape[0],
        'Columns': df.shape[1],
        'Missing Values': missing_cells,
        'Missing %': round(missing_percentage, 2),
        'Duplicate Rows': duplicate_rows
    })

# Create summary DataFrame
quality_df = pd.DataFrame(quality_summary)
quality_df = quality_df.sort_values('Missing %', ascending=False)

print("Data Quality Summary:")
display(quality_df)

# Save summary to CSV
quality_df.to_csv(os.path.join(REPORTS_PATH, 'data_quality_summary.csv'), index=False)
print(f"\nQuality summary saved to: {os.path.join(REPORTS_PATH, 'data_quality_summary.csv')}")

Data Quality Summary:


| | Dataset | Rows | Columns | Missing Values | Missing % | Duplicate Rows |
|---|---|---|---|---|---|---|
| 4 | order_reviews | 99224 | 7 | 145903 | 21.01 | 0 |
| 5 | products | 32951 | 9 | 2448 | 0.83 | 0 |
| 0 | orders | 99441 | 8 | 4908 | 0.62 | 0 |
| 1 | customers | 99441 | 5 | 0 | 0.0 | 0 |
| 2 | order_items | 112650 | 7 | 0 | 0.0 | 0 |
| 3 | order_payments | 103886 | 5 | 0 | 0.0 | 0 |
| 6 | sellers | 3095 | 4 | 0 | 0.0 | 0 |
| 7 | geolocation | 1000163 | 5 | 0 | 0.0 | 261831 |
| 8 | category_translation | 71 | 2 | 0 | 0.0 | 0 |



Quality summary saved to: ../reports/data_quality_summary.csv
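
The summary above treats each table in isolation. For the fourth objective (understanding relationships between tables), one option is to test the join keys directly. The matching row counts of `orders` and `customers` (99,441 each) hint at a one-to-one link on `customer_id`, and `order_items` presumably holds several rows per `order_id`; both are assumptions to verify rather than documented facts. A minimal sketch on made-up frames, using the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Made-up stand-ins; "o9" is an item with no matching order, "o3" an order with no items.
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"], "customer_id": ["c1", "c2", "c3"]})
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"], "customer_state": ["SP", "RJ", "MG"]})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o9"],
    "order_item_id": [1, 2, 1, 1],
    "price": [58.9, 239.9, 199.0, 10.0],
})

# validate= raises pandas.errors.MergeError if the stated cardinality does not hold.
orders_customers = orders.merge(
    customers, on="customer_id", how="left", validate="one_to_one", indicator=True
)
unmatched_customers = int((orders_customers["_merge"] != "both").sum())

# indicator=True adds a _merge column showing where each row found a match.
items_orders = order_items.merge(
    orders, on="order_id", how="left", validate="many_to_one", indicator=True
)
orphan_items = items_orders.loc[items_orders["_merge"] == "left_only", "order_id"].tolist()

# Orders that never appear in order_items.
orders_without_items = sorted(set(orders["order_id"]) - set(order_items["order_id"]))

print(unmatched_customers, orphan_items, orders_without_items)
```

Running the same checks on the real tables would either confirm the assumed cardinalities or fail loudly with a `MergeError`, which is the point of passing `validate`.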


## 6. Key Insights and Next Steps

In [16]:
print("="*60)
print("EXPLORATORY DATA ANALYSIS COMPLETE")
print("="*60)

print("\n📊 Generated Reports:")
for name in datasets.keys():
    report_file = os.path.join(REPORTS_PATH, f"{name}_profile.html")
    if os.path.exists(report_file):
        print(f"  ✓ {name}_profile.html")

print("\n📋 Next Steps:")
print("  1. Review the HTML reports in the reports/ directory")
print("  2. Identify data quality issues and missing values")
print("  3. Explore correlations between variables")
print("  4. Perform data cleaning and preprocessing")
print("  5. Create visualizations and advanced analyses")

print("\n💡 Tips:")
print("  - Open HTML reports in a web browser for interactive exploration")
print("  - Pay attention to missing values and outliers")
print("  - Review correlation matrices for relationships")
print("  - Check data types and convert if necessary")

EXPLORATORY DATA ANALYSIS COMPLETE

📊 Generated Reports:
  ✓ orders_profile.html
  ✓ customers_profile.html
  ✓ order_items_profile.html
  ✓ order_payments_profile.html
  ✓ order_reviews_profile.html
  ✓ products_profile.html
  ✓ sellers_profile.html
  ✓ geolocation_profile.html
  ✓ category_translation_profile.html

📋 Next Steps:
  1. Review the HTML reports in the reports/ directory
  2. Identify data quality issues and missing values
  3. Explore correlations between variables
  4. Perform data cleaning and preprocessing
  5. Create visualizations and advanced analyses

💡 Tips:
  - Open HTML reports in a web browser for interactive exploration
  - Pay attention to missing values and outliers
  - Review correlation matrices for relationships
  - Check data types and convert if necessary
