# Preliminary Data Analysis

This notebook takes a first look at the CSV files in the Olist Brazilian e-commerce dataset.
We examine the structure, size, and basic characteristics of each table to get a clearer picture of the dataset.

In [1]:
import os

import numpy as np
import pandas as pd
from IPython.display import display  # explicit import so the cells also run outside Jupyter

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

DATASET_DIR = os.path.abspath(os.path.join(os.getcwd(), '..', 'dataset'))

## Available files

In [2]:
csv_files = [f for f in os.listdir(DATASET_DIR) if f.endswith('.csv')]

file_info = []
for file in csv_files:
    file_path = os.path.join(DATASET_DIR, file)
    size_mb = os.path.getsize(file_path) / (1024 * 1024)
    file_info.append({
        'Arquivo': file,
        'Tamanho (MB)': round(size_mb, 2)
    })

pd.DataFrame(file_info).sort_values(by='Tamanho (MB)', ascending=False)

Unnamed: 0,Arquivo,Tamanho (MB)
1,olist_geolocation_dataset.csv,58.44
2,olist_orders_dataset.csv,16.84
3,olist_order_items_dataset.csv,14.72
5,olist_order_reviews_dataset.csv,13.78
0,olist_customers_dataset.csv,8.62
4,olist_order_payments_dataset.csv,5.51
6,olist_products_dataset.csv,2.27
7,olist_sellers_dataset.csv,0.17
8,product_category_name_translation.csv,0.0


## Structure of each table

Next, we look at the structure and some basic statistics of each CSV file.

In [3]:
def analyze_table(file_name):
    """Analyze a CSV table and display summary information about it."""
    file_path = os.path.join(DATASET_DIR, file_name)
    
    print(f"\n{'='*50}")
    print(f"Análise da tabela: {file_name}")
    print(f"{'='*50}\n")
    
    df = pd.read_csv(file_path)
    
    print(f"Dimensões: {df.shape[0]} linhas × {df.shape[1]} colunas")
    print("\nPrimeiras 5 linhas:")
    display(df.head())
    
    print("\nTipos de dados:")
    display(df.dtypes)
    
    print("\nEstatísticas descritivas:")
    display(df.describe(include='all').T)
    
    null_counts = df.isnull().sum()
    if null_counts.sum() > 0:
        print("\nValores nulos por coluna:")
        null_percent = np.round(null_counts.values / len(df) * 100, 2)
        display(pd.DataFrame({
            'Coluna': null_counts.index,
            'Valores Nulos': null_counts.values,
            'Porcentagem (%)': null_percent
        }).query('`Valores Nulos` > 0').sort_values('Valores Nulos', ascending=False).reset_index(drop=True))
    else:
        print("\nNão há valores nulos nesta tabela.")
    
    return df

### 1. Customers (olist_customers_dataset.csv)

In [4]:
customers_df = analyze_table('olist_customers_dataset.csv')


Análise da tabela: olist_customers_dataset.csv

Dimensões: 99441 linhas × 5 colunas

Primeiras 5 linhas:


Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP



Tipos de dados:


customer_id                 object
customer_unique_id          object
customer_zip_code_prefix     int64
customer_city               object
customer_state              object
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
customer_id,99441.0,99441.0,274fa6071e5e17fe303b9748641082c8,1.0,,,,,,,
customer_unique_id,99441.0,96096.0,8d50f5eadf50201ccdcedfb9e2ac8455,17.0,,,,,,,
customer_zip_code_prefix,99441.0,,,,35137.474583,29797.938996,1003.0,11347.0,24416.0,58900.0,99990.0
customer_city,99441.0,4119.0,sao paulo,15540.0,,,,,,,
customer_state,99441.0,27.0,SP,41746.0,,,,,,,



Não há valores nulos nesta tabela.
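
One detail in the output above is worth noting: `customer_zip_code_prefix` is read as `int64`, so leading zeros are lost. The preview shows `9790` and `1151`, which as five-digit prefixes would be `09790` and `01151`. The sketch below shows one way to keep the prefix as text. It assumes the prefix is meant to be five digits wide and uses a small inline sample in place of the real CSV.

```python
import io

import pandas as pd

# Inline stand-in for olist_customers_dataset.csv; values taken from the preview above.
# Whether the raw file stores the zeros (e.g. "09790") is not checked here, so the
# sample mixes both forms on purpose.
sample_csv = io.StringIO(
    "customer_id,customer_zip_code_prefix,customer_city,customer_state\n"
    "06b8999e2fba1a1fbc88172c00ba8bc7,14409,franca,SP\n"
    "18955e83d337fd6b2def6b18a428ac77,09790,sao bernardo do campo,SP\n"
    "4e7b3e00288586ebd08712fdd0374a03,1151,sao paulo,SP\n"
)

# dtype=str keeps the column as text, so zeros present in the file survive.
customers_sample = pd.read_csv(sample_csv, dtype={"customer_zip_code_prefix": str})

# zfill pads shorter values on the left; the width of 5 is an assumption.
customers_sample["customer_zip_code_prefix"] = (
    customers_sample["customer_zip_code_prefix"].str.zfill(5)
)

zip_prefixes = customers_sample["customer_zip_code_prefix"].tolist()
print(zip_prefixes)
```

The same `dtype` argument could be passed for `seller_zip_code_prefix` and `geolocation_zip_code_prefix`, which would keep the three columns comparable when joining on them.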


### 2. Geolocation (olist_geolocation_dataset.csv)

In [5]:
file_path = os.path.join(DATASET_DIR, 'olist_geolocation_dataset.csv')
geo_df = pd.read_csv(file_path, nrows=10000)  # Load only the first 10,000 rows

print(f"\n{'='*50}")
print(f"Análise da tabela: olist_geolocation_dataset.csv (amostra de 10.000 linhas)")
print(f"{'='*50}\n")
    
print(f"Dimensões da amostra: {geo_df.shape[0]} linhas × {geo_df.shape[1]} colunas")
print("\nPrimeiras 5 linhas:")
display(geo_df.head())

print("\nTipos de dados:")
display(geo_df.dtypes)

print("\nEstatísticas descritivas:")
display(geo_df.describe(include='all').T)

null_counts = geo_df.isnull().sum()
if null_counts.sum() > 0:
    print("\nValores nulos por coluna (na amostra):")
    display(pd.DataFrame({
        'Coluna': null_counts.index,
        'Valores Nulos': null_counts.values,
        'Porcentagem (%)': np.round(null_counts.values / len(geo_df) * 100, 2)
    }).query('`Valores Nulos` > 0').sort_values('Valores Nulos', ascending=False).reset_index(drop=True))


Análise da tabela: olist_geolocation_dataset.csv (amostra de 10.000 linhas)

Dimensões da amostra: 10000 linhas × 5 colunas

Primeiras 5 linhas:


Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP



Tipos de dados:


geolocation_zip_code_prefix      int64
geolocation_lat                float64
geolocation_lng                float64
geolocation_city                object
geolocation_state               object
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
geolocation_zip_code_prefix,10000.0,,,,1203.1158,92.912944,1001.0,1139.0,1226.0,1258.0,1333.0
geolocation_lat,10000.0,,,,-23.543068,0.012137,-23.696567,-23.550124,-23.54296,-23.535707,-23.485404
geolocation_lng,10000.0,,,,-46.650588,0.01491,-46.690379,-46.657604,-46.650364,-46.642628,-46.457361
geolocation_city,10000.0,2.0,sao paulo,8502.0,,,,,,,
geolocation_state,10000.0,1.0,SP,10000.0,,,,,,,
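
Because `nrows=10000` takes the first rows of the file, the sample above covers a single state (SP) and zip prefixes between 1001 and 1333, so these statistics describe that slice rather than the whole table. A possible alternative is to read the file in chunks and accumulate simple counts. The sketch below illustrates the idea on a tiny inline sample. The helper name `count_states_in_chunks` and the chunk size are illustrative choices, not part of the original notebook.

```python
import io
from collections import Counter

import pandas as pd


def count_states_in_chunks(source, chunksize):
    """Return (total rows, rows per state), reading `source` chunk by chunk."""
    total_rows = 0
    state_counts = Counter()
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total_rows += len(chunk)
        # value_counts() summarizes one chunk; Counter.update adds it to the running totals.
        state_counts.update(chunk["geolocation_state"].value_counts().to_dict())
    return total_rows, state_counts


# Inline stand-in for the real file. The first three rows come from the preview
# above; the RJ row is made up so that the sample has more than one state.
sample_csv = io.StringIO(
    "geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state\n"
    "1037,-23.545621,-46.639292,sao paulo,SP\n"
    "1046,-23.546081,-46.64482,sao paulo,SP\n"
    "1046,-23.546129,-46.642951,sao paulo,SP\n"
    "20031,-22.9,-43.2,rio de janeiro,RJ\n"
)

total_rows, state_counts = count_states_in_chunks(sample_csv, chunksize=2)
print(total_rows, dict(state_counts))
```

With the real file, `source` would be the path built from `DATASET_DIR`, and a much larger `chunksize` would be a reasonable starting point.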


### 3. Order Items (olist_order_items_dataset.csv)

In [6]:
order_items_df = analyze_table('olist_order_items_dataset.csv')


Análise da tabela: olist_order_items_dataset.csv

Dimensões: 112650 linhas × 7 colunas

Primeiras 5 linhas:


Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14



Tipos de dados:


order_id                object
order_item_id            int64
product_id              object
seller_id               object
shipping_limit_date     object
price                  float64
freight_value          float64
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
order_id,112650.0,98666.0,8272b63d03f5f79c56e9e4120aec44ef,21.0,,,,,,,
order_item_id,112650.0,,,,1.197834,0.705124,1.0,1.0,1.0,1.0,21.0
product_id,112650.0,32951.0,aca2eb7d00ea1a7b8ebd4e68314663af,527.0,,,,,,,
seller_id,112650.0,3095.0,6560211a19b47992c3666cc44a7e94c0,2033.0,,,,,,,
shipping_limit_date,112650.0,93318.0,2017-07-21 18:25:23,21.0,,,,,,,
price,112650.0,,,,120.653739,183.633928,0.85,39.9,74.99,134.9,6735.0
freight_value,112650.0,,,,19.99032,15.806405,0.0,13.08,16.26,21.15,409.68



Não há valores nulos nesta tabela.
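
The table has 112,650 rows but only 98,666 distinct `order_id` values, and `order_item_id` goes up to 21, so an order can span several rows. Any order-level metric therefore needs an aggregation first. The sketch below shows one way to do it on an inline sample. The output column names (`n_items`, `items_value`, `order_value`) are hypothetical.

```python
import pandas as pd

# First two rows come from the preview above; the third is a made-up second item
# for the first order, added so the aggregation has something to combine.
items_sample = pd.DataFrame(
    {
        "order_id": [
            "00010242fe8c5a6d1ba2dd792cb16214",
            "00018f77f2f0320c557190d7a144bdd3",
            "00010242fe8c5a6d1ba2dd792cb16214",
        ],
        "order_item_id": [1, 1, 2],
        "price": [58.9, 239.9, 58.9],
        "freight_value": [13.29, 19.93, 13.29],
    }
)

# One row per order: number of items, summed item prices, summed freight.
order_totals = items_sample.groupby("order_id").agg(
    n_items=("order_item_id", "count"),
    items_value=("price", "sum"),
    freight_value=("freight_value", "sum"),
)
order_totals["order_value"] = order_totals["items_value"] + order_totals["freight_value"]
print(order_totals)
```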


### 4. Order Payments (olist_order_payments_dataset.csv)

In [7]:
order_payments_df = analyze_table('olist_order_payments_dataset.csv')


Análise da tabela: olist_order_payments_dataset.csv

Dimensões: 103886 linhas × 5 colunas

Primeiras 5 linhas:


Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45



Tipos de dados:


order_id                 object
payment_sequential        int64
payment_type             object
payment_installments      int64
payment_value           float64
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
order_id,103886.0,99440.0,fa65dad1b0e818e3ccc5cb0e39231352,29.0,,,,,,,
payment_sequential,103886.0,,,,1.092679,0.706584,1.0,1.0,1.0,1.0,29.0
payment_type,103886.0,5.0,credit_card,76795.0,,,,,,,
payment_installments,103886.0,,,,2.853349,2.687051,0.0,1.0,1.0,4.0,24.0
payment_value,103886.0,,,,154.10038,217.494064,0.0,56.79,100.0,171.8375,13664.08



Não há valores nulos nesta tabela.
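
Although there are no nulls, the `min` column reports 0 for both `payment_installments` and `payment_value`. Whether those rows are valid is a business question, but they are easy to isolate for review. The sketch below shows this on an inline sample. The last two rows are made up to reproduce the two cases.

```python
import pandas as pd

# First two rows come from the preview above; the last two are hypothetical and
# exist only to illustrate the zero values reported by describe().
payments_sample = pd.DataFrame(
    {
        "order_id": [
            "b81ef226f3fe1789b1e8b2acac839d17",
            "a9810da82917af2d9aefd1278f1dcfa0",
            "hypothetical_order_1",
            "hypothetical_order_2",
        ],
        "payment_type": ["credit_card"] * 4,
        "payment_installments": [8, 1, 0, 1],
        "payment_value": [99.33, 24.39, 50.0, 0.0],
    }
)

# Rows where either field is zero are kept aside for manual inspection.
zero_mask = (payments_sample["payment_installments"] == 0) | (
    payments_sample["payment_value"] == 0
)
rows_to_review = payments_sample[zero_mask]
print(rows_to_review)
```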


### 5. Order Reviews (olist_order_reviews_dataset.csv)

In [8]:
order_reviews_df = analyze_table('olist_order_reviews_dataset.csv')


Análise da tabela: olist_order_reviews_dataset.csv

Dimensões: 99224 linhas × 7 colunas

Primeiras 5 linhas:


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53



Tipos de dados:


review_id                  object
order_id                   object
review_score                int64
review_comment_title       object
review_comment_message     object
review_creation_date       object
review_answer_timestamp    object
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
review_id,99224.0,98410.0,08528f70f579f0c830189efc523d2182,3.0,,,,,,,
order_id,99224.0,98673.0,df56136b8031ecd28e200bb18e6ddb2e,3.0,,,,,,,
review_score,99224.0,,,,4.086421,1.347579,1.0,4.0,5.0,5.0,5.0
review_comment_title,11568.0,4527.0,Recomendo,423.0,,,,,,,
review_comment_message,40977.0,36159.0,Muito bom,230.0,,,,,,,
review_creation_date,99224.0,636.0,2017-12-19 00:00:00,463.0,,,,,,,
review_answer_timestamp,99224.0,98248.0,2017-06-15 23:21:05,4.0,,,,,,,



Valores nulos por coluna:


Unnamed: 0,Coluna,Valores Nulos,Porcentagem (%)
0,review_comment_title,87656,88.34
1,review_comment_message,58247,58.7


### 6. Orders (olist_orders_dataset.csv)

In [9]:
orders_df = analyze_table('olist_orders_dataset.csv')


Análise da tabela: olist_orders_dataset.csv

Dimensões: 99441 linhas × 8 colunas

Primeiras 5 linhas:


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00



Tipos de dados:


order_id                         object
customer_id                      object
order_status                     object
order_purchase_timestamp         object
order_approved_at                object
order_delivered_carrier_date     object
order_delivered_customer_date    object
order_estimated_delivery_date    object
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq
order_id,99441,99441,66dea50a8b16d9b4dee7af250b4be1a5,1
customer_id,99441,99441,edb027a75a1449115f6b43211ae02a24,1
order_status,99441,8,delivered,96478
order_purchase_timestamp,99441,98875,2018-08-02 12:05:26,3
order_approved_at,99281,90733,2018-02-27 04:31:10,9
order_delivered_carrier_date,97658,81018,2018-05-09 15:48:00,47
order_delivered_customer_date,96476,95664,2018-05-08 19:36:48,3
order_estimated_delivery_date,99441,459,2017-12-20 00:00:00,522



Valores nulos por coluna:


Unnamed: 0,Coluna,Valores Nulos,Porcentagem (%)
0,order_delivered_customer_date,2965,2.98
1,order_delivered_carrier_date,1783,1.79
2,order_approved_at,160,0.16


### 7. Products (olist_products_dataset.csv)

In [10]:
products_df = analyze_table('olist_products_dataset.csv')


Análise da tabela: olist_products_dataset.csv

Dimensões: 32951 linhas × 9 colunas

Primeiras 5 linhas:


Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0



Tipos de dados:


product_id                     object
product_category_name          object
product_name_lenght           float64
product_description_lenght    float64
product_photos_qty            float64
product_weight_g              float64
product_length_cm             float64
product_height_cm             float64
product_width_cm              float64
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
product_id,32951.0,32951.0,106392145fca363410d287a815be6de4,1.0,,,,,,,
product_category_name,32341.0,73.0,cama_mesa_banho,3029.0,,,,,,,
product_name_lenght,32341.0,,,,48.476949,10.245741,5.0,42.0,51.0,57.0,76.0
product_description_lenght,32341.0,,,,771.495285,635.115225,4.0,339.0,595.0,972.0,3992.0
product_photos_qty,32341.0,,,,2.188986,1.736766,1.0,1.0,1.0,3.0,20.0
product_weight_g,32949.0,,,,2276.472488,4282.038731,0.0,300.0,700.0,1900.0,40425.0
product_length_cm,32949.0,,,,30.815078,16.914458,7.0,18.0,25.0,38.0,105.0
product_height_cm,32949.0,,,,16.937661,13.637554,2.0,8.0,13.0,21.0,105.0
product_width_cm,32949.0,,,,23.196728,12.079047,6.0,15.0,20.0,30.0,118.0



Valores nulos por coluna:


Unnamed: 0,Coluna,Valores Nulos,Porcentagem (%)
0,product_category_name,610,1.85
1,product_name_lenght,610,1.85
2,product_description_lenght,610,1.85
3,product_photos_qty,610,1.85
4,product_weight_g,2,0.01
5,product_length_cm,2,0.01
6,product_height_cm,2,0.01
7,product_width_cm,2,0.01
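
The null counts fall into two groups: 610 for the four catalog columns and 2 for the four physical measurements. That suggests each group is missing on the same rows, but the counts alone do not prove it. The sketch below shows a quick way to check this on an inline sample with placeholder product ids. It is an illustration, not a result computed on the real table.

```python
import pandas as pd

# Placeholder rows: p2 lacks all catalog fields, p3 lacks only the weight.
products_sample = pd.DataFrame(
    {
        "product_id": ["p1", "p2", "p3", "p4"],
        "product_category_name": ["perfumaria", None, "artes", "bebes"],
        "product_name_lenght": [40.0, None, 44.0, 27.0],
        "product_description_lenght": [287.0, None, 276.0, 261.0],
        "product_photos_qty": [1.0, None, 1.0, 1.0],
        "product_weight_g": [225.0, 500.0, None, 371.0],
    }
)

catalog_columns = [
    "product_category_name",
    "product_name_lenght",
    "product_description_lenght",
    "product_photos_qty",
]

missing = products_sample[catalog_columns].isna()
rows_missing_any = int(missing.any(axis=1).sum())
rows_missing_all = int(missing.all(axis=1).sum())

# If both counts match, the catalog columns are always missing together.
nulls_co_occur = rows_missing_any == rows_missing_all
print(rows_missing_any, rows_missing_all, nulls_co_occur)
```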


### 8. Sellers (olist_sellers_dataset.csv)

In [11]:
sellers_df = analyze_table('olist_sellers_dataset.csv')


Análise da tabela: olist_sellers_dataset.csv

Dimensões: 3095 linhas × 4 colunas

Primeiras 5 linhas:


Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP



Tipos de dados:


seller_id                 object
seller_zip_code_prefix     int64
seller_city               object
seller_state              object
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
seller_id,3095.0,3095.0,9e25199f6ef7e7c347120ff175652c3b,1.0,,,,,,,
seller_zip_code_prefix,3095.0,,,,32291.059451,32713.45383,1001.0,7093.5,14940.0,64552.5,99730.0
seller_city,3095.0,611.0,sao paulo,694.0,,,,,,,
seller_state,3095.0,23.0,SP,1849.0,,,,,,,



Não há valores nulos nesta tabela.


### 9. Product Category Name Translation (product_category_name_translation.csv)

In [12]:
category_translation_df = analyze_table('product_category_name_translation.csv')


Análise da tabela: product_category_name_translation.csv

Dimensões: 71 linhas × 2 colunas

Primeiras 5 linhas:


Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor



Tipos de dados:


product_category_name            object
product_category_name_english    object
dtype: object


Estatísticas descritivas:


Unnamed: 0,count,unique,top,freq
product_category_name,71,71,beleza_saude,1
product_category_name_english,71,71,health_beauty,1



Não há valores nulos nesta tabela.
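
The translation table lists 71 categories, while the products table reports 73 distinct values in `product_category_name`, so some categories appear to have no English name. The sketch below finds them with a left join and `indicator=True`. The two translations come from the preview above, and `categoria_hipotetica` is a made-up category used only for illustration.

```python
import pandas as pd

products_sample = pd.DataFrame(
    {
        "product_id": ["p1", "p2", "p3", "p4"],
        "product_category_name": [
            "beleza_saude",
            "cama_mesa_banho",
            "categoria_hipotetica",  # made-up category with no translation
            None,  # products without a category also exist in the real table
        ],
    }
)

translation_sample = pd.DataFrame(
    {
        "product_category_name": ["beleza_saude", "cama_mesa_banho"],
        "product_category_name_english": ["health_beauty", "bed_bath_table"],
    }
)

# indicator=True adds a `_merge` column saying where each row found a match.
merged = products_sample.merge(
    translation_sample, on="product_category_name", how="left", indicator=True
)

# Keep unmatched rows that do have a category; null categories are a separate issue.
unmatched = (merged["_merge"] == "left_only") & merged["product_category_name"].notna()
untranslated = merged.loc[unmatched, "product_category_name"].unique().tolist()
print(untranslated)
```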


## Summary and Next Steps

Based on this preliminary analysis, the data pipeline will follow the Medallion architecture:

1. **Landing Zone**: The original CSV files are already available in the `dataset/` directory

2. **Bronze**: Ingestion of the raw data with basic validations
   - Identification of data quality problems (null values, inconsistent formats)
   - Addition of ingestion metadata

3. **Silver**: Data cleaning and transformation
   - Handling of null and duplicate values
   - Data type conversion (dates, numbers)
   - Text normalization (upper/lower case, accents)
   - Establishing the relationships between tables

4. **Gold**: Aggregations and business metrics
   - Sales analysis by period, region, and category
   - Customer satisfaction metrics
   - Delivery time analysis
   - Customer segmentation

Next step: start ingesting the data into the Bronze layer in the notebook `1_landing_to_bronze.ipynb`.
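
To give a rough idea of what the ingestion metadata in the Bronze step could look like, the sketch below appends a source file name and a load timestamp to a DataFrame. The helper `add_ingestion_metadata` and the column names `_source_file` and `_ingested_at` are assumptions made for this example. The actual Bronze notebook may use different ones.

```python
import pandas as pd


def add_ingestion_metadata(df, source_file):
    """Return a copy of `df` with two hypothetical metadata columns appended."""
    out = df.copy()
    out["_source_file"] = source_file
    # One timestamp for the whole batch, stored in UTC to avoid timezone ambiguity.
    out["_ingested_at"] = pd.Timestamp.now(tz="UTC")
    return out


# Two rows taken from the sellers preview above.
sellers_sample = pd.DataFrame(
    {
        "seller_id": [
            "3442f8959a84dea7ee197c632cb2df15",
            "d1b65fc7debc3361ea86b5f14c68d2e2",
        ],
        "seller_state": ["SP", "SP"],
    }
)

bronze_sample = add_ingestion_metadata(sellers_sample, "olist_sellers_dataset.csv")
print(bronze_sample.columns.tolist())
```

Working on a copy leaves the input untouched, which keeps the raw DataFrame reusable for further checks.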