# 00. Data Setup: Loading and Initial Exploration


---

## 1. Introduction

This notebook is the first step in the analysis pipeline. Its main goals are:
1.  To load all raw `.csv` files into memory.
2.  To perform an initial exploration of each individual dataset to understand its structure, columns, and content.
3.  To merge the relevant datasets into a single master DataFrame and save it for the subsequent notebooks.


---
## 2. Loading Libraries

In this section, we will load the necessary libraries for our analysis.

In [1]:
# --- 2. Setup and Imports ---

# Importing standard data analysis libraries
import pandas as pd
import numpy as np
import os
import sys

# Adding the project's root directory to the Python path
# This allows us to import our custom modules from the 'src' folder
sys.path.append('..') 

# Importing our custom data handling functions
from src.data_utils import load_raw, save_processed, load_processed 

# Configuring pandas for better display
pd.set_option('display.max_columns', 80)
pd.set_option('display.max_rows', 80)
pd.set_option('display.width', 80)

print("Setup complete. Libraries and custom functions imported successfully.")

Setup complete. Libraries and custom functions imported successfully.
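
The helpers imported from `src.data_utils` are not shown in this notebook. As a rough illustration only, a loader in the same spirit could take the data directory explicitly instead of depending on the current working directory. Everything below (the name `load_raw_sketch`, the `raw_dir` parameter, the error message) is hypothetical and is not the project's actual implementation.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_raw_sketch(filename: str, raw_dir: Path) -> pd.DataFrame:
    """Hypothetical stand-in for load_raw: read one CSV file from raw_dir."""
    path = Path(raw_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Raw file not found: {path}")
    print(f"Loading data from: {path}")
    return pd.read_csv(path)


# Tiny demonstration using a throwaway directory instead of data/raw
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("order_id,price\na1,10.5\nb2,20.0\n")
    demo_df = load_raw_sketch("demo.csv", raw_dir=Path(tmp))

print(demo_df.shape)
```

Passing the directory as an argument keeps the function testable and avoids surprises when the notebook is launched from a different folder.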


---
## 3. Loading Raw Data

In this section, we will load each individual `.csv` file from the `data/raw` directory into a dictionary of pandas DataFrames. This approach allows us to keep the original datasets separate for initial inspection before merging them.

In [2]:
# List of the raw data files we need to load
raw_files_to_load = [
    'olist_customers_dataset.csv',
    'olist_geolocation_dataset.csv',
    'olist_orders_dataset.csv',
    'olist_order_items_dataset.csv',
    'olist_order_payments_dataset.csv',
    'olist_order_reviews_dataset.csv',
    'olist_products_dataset.csv',
    'olist_sellers_dataset.csv',
    'product_category_name_translation.csv'
]

# Using a dictionary comprehension to load all files into a dictionary named 'dataframes'
# The key for each dataframe will be a clean name (e.g., 'customers', 'orders')
dataframes = {
    file.replace('olist_', '').replace('_dataset.csv', '').replace('.csv', ''): load_raw(file)
    for file in raw_files_to_load
}

# --- Verification ---
print("\n--- Data Loading Verification ---")
print(f"Successfully loaded {len(dataframes)} dataframes.")
print("Available dataframes are:", list(dataframes.keys()))

Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_customers_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_geolocation_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_orders_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_order_items_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_order_payments_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_order_reviews_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_products_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/olist_sellers_dataset.csv
Loading data from: /home/lucas/olist-data-analysis-project/notebooks/../data/raw/product_category_name_translation.csv
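
The dictionary keys produced by the name-cleaning expression can be checked without loading any data. The sketch below reproduces them and compares the chained `.replace()` calls with `str.removeprefix` / `str.removesuffix`, which only touch the start and end of the string. It assumes Python 3.9 or later; `clean_key` is an illustrative helper name, not part of the project.

```python
raw_files = [
    'olist_customers_dataset.csv',
    'olist_geolocation_dataset.csv',
    'olist_orders_dataset.csv',
    'olist_order_items_dataset.csv',
    'olist_order_payments_dataset.csv',
    'olist_order_reviews_dataset.csv',
    'olist_products_dataset.csv',
    'olist_sellers_dataset.csv',
    'product_category_name_translation.csv',
]


def clean_key(filename: str) -> str:
    """Strip the 'olist_' prefix and the '_dataset.csv' or '.csv' suffix."""
    name = filename.removeprefix('olist_')
    name = name.removesuffix('_dataset.csv')
    return name.removesuffix('.csv')


keys = [clean_key(f) for f in raw_files]

# The chained .replace() calls used in the notebook give the same keys here
legacy_keys = [
    f.replace('olist_', '').replace('_dataset.csv', '').replace('.csv', '')
    for f in raw_files
]

print(keys)
print(keys == legacy_keys)
```

For these nine file names both approaches agree; the prefix/suffix version is simply stricter if a file name ever contained `olist_` or `.csv` in the middle.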


---
## 4. Initial Exploration of Raw Datasets

Before merging, we will inspect each dataframe individually. This helps us understand the content, identify potential keys for merging, and spot any immediate data quality issues like missing values or incorrect data types.
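
As a compact complement to `df.info()`, the checks described above can be sketched on a toy frame. The values below are made up, but they mimic two dtype issues that the real tables show: timestamps are read as `object` (strings), and zip code prefixes are read as `int64`, which drops leading zeros (for example `9790`). Padding to five digits is an assumption about the prefix format.

```python
import pandas as pd

# Toy frame with invented rows; column names follow the real tables
orders_demo = pd.DataFrame({
    "order_id": ["a1", "b2", "c3"],
    "order_purchase_timestamp": [
        "2017-10-02 10:56:33", "2018-07-24 20:41:37", "2018-08-08 08:38:49",
    ],
    "order_delivered_customer_date": [
        "2017-10-10 21:25:13", None, "2018-08-17 18:06:29",
    ],
    "customer_zip_code_prefix": [14409, 9790, 1151],
})

# 1. Missing values per column, as counts and as percentages
null_summary = pd.DataFrame({
    "n_missing": orders_demo.isna().sum(),
    "pct_missing": (orders_demo.isna().mean() * 100).round(2),
})
print(null_summary)

# 2. Parse the timestamp strings into datetime64 columns (missing -> NaT)
timestamp_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]
for col in timestamp_cols:
    orders_demo[col] = pd.to_datetime(orders_demo[col], format="%Y-%m-%d %H:%M:%S")

# 3. Restore leading zeros (assumes prefixes are meant to be five digits)
orders_demo["customer_zip_code_prefix"] = (
    orders_demo["customer_zip_code_prefix"].astype(str).str.zfill(5)
)

print(orders_demo.dtypes)
```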

In [3]:
# Loop through each dataframe in our dictionary to get a quick overview

for name, df in dataframes.items():
    print(f"--- EXPLORING DATAFRAME: '{name}' ---")
    print(f"Shape: {df.shape}")
    
    # Print info to check data types and null values
    print("\n[INFO]")
    df.info()
    
    # Display the first few rows to see the actual data
    print("\n[HEAD]")
    display(df.head())
    
    # Print a separator for better readability
    print("\n" + "="*80 + "\n\n")

--- EXPLORING DATAFRAME: 'customers' ---
Shape: (99441, 5)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB

[HEAD]


|   | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP |





--- EXPLORING DATAFRAME: 'geolocation' ---
Shape: (1000163, 5)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   geolocation_zip_code_prefix  1000163 non-null  int64  
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object 
 4   geolocation_state            1000163 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB

[HEAD]


|   | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state |
|---|---|---|---|---|---|
| 0 | 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1 | 1046 | -23.546081 | -46.64482 | sao paulo | SP |
| 2 | 1046 | -23.546129 | -46.642951 | sao paulo | SP |
| 3 | 1041 | -23.544392 | -46.639499 | sao paulo | SP |
| 4 | 1035 | -23.541578 | -46.641607 | sao paulo | SP |





--- EXPLORING DATAFRAME: 'orders' ---
Shape: (99441, 8)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB

[HEAD]


|   | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |





--- EXPLORING DATAFRAME: 'order_items' ---
Shape: (112650, 7)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB

[HEAD]


|   | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.9 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.9 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.0 | 17.87 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.9 | 18.14 |





--- EXPLORING DATAFRAME: 'order_payments' ---
Shape: (103886, 5)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB

[HEAD]


|   | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |





--- EXPLORING DATAFRAME: 'order_reviews' ---
Shape: (99224, 7)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                99224 non-null  object
 1   order_id                 99224 non-null  object
 2   review_score             99224 non-null  int64 
 3   review_comment_title     11568 non-null  object
 4   review_comment_message   40977 non-null  object
 5   review_creation_date     99224 non-null  object
 6   review_answer_timestamp  99224 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB

[HEAD]


|   | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 |  |  | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 |  |  | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 |  |  | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 |  | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 |  | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |





--- EXPLORING DATAFRAME: 'products' ---
Shape: (32951, 9)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category_name       32341 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32949 non-null  float64
 6   product_length_cm           32949 non-null  float64
 7   product_height_cm           32949 non-null  float64
 8   product_width_cm            32949 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB

[HEAD]


|   | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |
| 3 | cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 |
| 4 | 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 |





--- EXPLORING DATAFRAME: 'sellers' ---
Shape: (3095, 4)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB

[HEAD]


|   | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |





--- EXPLORING DATAFRAME: 'product_category_name_translation' ---
Shape: (71, 2)

[INFO]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     object
 1   product_category_name_english  71 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB

[HEAD]


|   | product_category_name | product_category_name_english |
|---|---|---|
| 0 | beleza_saude | health_beauty |
| 1 | informatica_acessorios | computers_accessories |
| 2 | automotivo | auto |
| 3 | cama_mesa_banho | bed_bath_table |
| 4 | moveis_decoracao | furniture_decor |







---
## 5. Key Findings from Exploration & Merging Strategy

Based on the initial exploration of the individual datasets, several key insights and strategic decisions have been made before proceeding with the merge.

### Key Findings:
1.  **Row-Count Differences:** Row counts range from 71 (`product_category_name_translation`) to 1,000,163 (`geolocation`), which is expected because the tables describe different entities. This highlights the importance of choosing the correct merge strategy (i.e., `left` vs. `inner` join) to avoid unintentional data loss.
2.  **Identifier Fields:** The distinction between `customer_id` (transaction-level) and `customer_unique_id` (customer-level) is crucial. This structure allows for two distinct types of analysis: one focused on individual orders/products and another focused on customer behavior over time.
3.  **Product Categories:** The `products` dataset contains category names in Portuguese. The `product_category_name_translation` dataset must be used to translate these into English for consistency.
4.  **Review Coverage:** Some orders may not have a corresponding review: `order_reviews` has 99,224 rows, fewer than the 99,441 rows in `orders`. A left join from `orders` to `order_reviews` is therefore needed to keep every order, including those without reviews. We should quantify this coverage.
5.  **Geolocation Data:** The `geolocation` dataset is extremely large and contains granular, zip-code-level data. For the scope of this project, the state and city information already present in the `customers` and `sellers` tables is sufficient for high-level geographic analysis. The `geolocation` table will be excluded from the main merge to maintain performance and focus.
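
Findings 1 and 2 both come down to key cardinality: whether a key identifies exactly one row in a table, and how `customer_id` relates to `customer_unique_id`. A minimal sketch of how this could be checked is shown below, using invented toy rows rather than the real data.

```python
import pandas as pd

# Toy data (invented): two customer_id values belong to the same person u1
customers_demo = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})
order_items_demo = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "order_item_id": [1, 2, 1],
})

# A unique key means a merge on it cannot add rows; a repeated key can
customers_key_unique = customers_demo["customer_id"].is_unique
items_key_unique = order_items_demo["order_id"].is_unique
print(customers_key_unique, items_key_unique)

# Number of distinct customer_id values per real customer
ids_per_customer = customers_demo.groupby("customer_unique_id")["customer_id"].nunique()
repeat_customers = ids_per_customer[ids_per_customer > 1]
print(repeat_customers)
```

On the real tables, the same two checks would be run against `dataframes['customers']` and each table that is about to be merged.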

### Merging Strategy:
1.  **Central DataFrame:** The `orders` dataset will serve as the central table for our merges.
2.  **Join Type:** We will exclusively use left joins (`how='left'`), starting from the `orders` table. This ensures that no order is dropped, even if corresponding details are missing in other tables (such as reviews or payment information). Because several tables hold multiple rows per order (items, payments), the merged DataFrame will contain more rows than `orders`.
3.  **Join Keys:** The join keys will be `order_id`, `customer_id`, `product_id`, and `seller_id`, plus `product_category_name` for the translation table.
4.  **Order of Operations:**
    * First, we will perform a quick calculation to determine the percentage of orders that have reviews.
    * Then, we will create a single, comprehensive master DataFrame by merging all relevant tables.
    * The `geolocation` table will not be included in this merge.
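
The effect of the join type chosen above can be sketched on toy data (invented rows). The `indicator=True` option of `pd.merge` adds a `_merge` column that shows which rows found no match on the right-hand side.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "delivered", "canceled"],
})
reviews_demo = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "review_score": [5, 4],
})

# Left join keeps every order; inner join drops orders without a review
left = pd.merge(orders_demo, reviews_demo, on="order_id", how="left", indicator=True)
inner = pd.merge(orders_demo, reviews_demo, on="order_id", how="inner")

print(len(left), len(inner))

# Orders that exist only on the left side, i.e. without a review
unmatched = left[left["_merge"] == "left_only"]
print(unmatched[["order_id", "order_status"]])

# The unmatched row gets NaN, so the integer column becomes float
print(left["review_score"].dtype)
```

The last line explains why integer columns such as `order_item_id` show up as `1.0` after a left join: pandas converts the column to float to hold the missing values.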

In [4]:
# --- 5.1. Quantifying Review Coverage ---

# Get the number of unique orders from the orders and reviews tables
num_orders = dataframes['orders']['order_id'].nunique()
num_reviews = dataframes['order_reviews']['order_id'].nunique()

# Calculate the percentage of orders that have at least one review
review_coverage_percentage = (num_reviews / num_orders) * 100

print(f"Total unique orders: {num_orders}")
print(f"Orders with at least one review: {num_reviews}")
print(f"Review coverage percentage: {review_coverage_percentage:.2f}%")

Total unique orders: 99441
Orders with at least one review: 98673
Review coverage percentage: 99.23%
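
The ratio above equals the true coverage only if every `order_id` in the reviews table also exists in `orders`. Under that assumption, 99441 - 98673 = 768 orders have no review. A membership-based check avoids the assumption; the sketch below uses invented toy rows, including one review that points at an unknown order, to show how the two calculations can differ.

```python
import pandas as pd

orders_demo = pd.DataFrame({"order_id": ["o1", "o2", "o3", "o4"]})
reviews_demo = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o9"],  # o9 does not exist in orders_demo
    "review_score": [5, 4, 3, 1],
})

# Membership test: which orders appear at least once in the reviews table?
has_review = orders_demo["order_id"].isin(reviews_demo["order_id"])
coverage_pct = has_review.mean() * 100
orders_without_review = orders_demo.loc[~has_review, "order_id"].tolist()

# Ratio of unique counts, as in the cell above
naive_pct = (
    reviews_demo["order_id"].nunique() / orders_demo["order_id"].nunique() * 100
)

print(f"Membership-based coverage: {coverage_pct:.2f}%")
print(f"Ratio of unique counts:    {naive_pct:.2f}%")
print("Orders without a review:", orders_without_review)
```

Whether the real reviews table contains such orphaned `order_id` values is not established here; the membership version simply gives the right answer either way.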


---
## 6. Merging Datasets into a Single DataFrame

Now we will execute the merging strategy defined above. We will start with the `orders` dataframe and sequentially left-join the other relevant dataframes to create a single, comprehensive dataset. The `geolocation` table will be excluded.
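
One caveat of chaining left joins on `order_id`: when two tables each hold several rows per order (items and payments), every item row is paired with every payment row. This is why the row count grows from 99441 to 119143 in the output of this section, and why the first order appears three times with the same item. The toy example below (invented values) shows the effect and one possible way around it, aggregating payments per order before merging. The column names `total_payment` and `n_payments` are illustrative.

```python
import pandas as pd

orders_demo = pd.DataFrame({"order_id": ["o1"]})
items_demo = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "order_item_id": [1, 2],
    "price": [10.0, 20.0],
})
payments_demo = pd.DataFrame({
    "order_id": ["o1", "o1", "o1"],
    "payment_sequential": [1, 2, 3],
    "payment_value": [15.0, 10.0, 5.0],
})

# Chained left joins: 2 items x 3 payments = 6 rows for a single order
merged = (
    orders_demo
    .merge(items_demo, on="order_id", how="left")
    .merge(payments_demo, on="order_id", how="left")
)
print(len(merged))            # 6
print(merged["price"].sum())  # 90.0, although the items cost 30.0 in total

# Alternative: collapse payments to one row per order first
payments_per_order = payments_demo.groupby("order_id", as_index=False).agg(
    total_payment=("payment_value", "sum"),
    n_payments=("payment_sequential", "count"),
)
safe = (
    orders_demo
    .merge(items_demo, on="order_id", how="left")
    # validate raises MergeError if the right-hand key is not unique
    .merge(payments_per_order, on="order_id", how="left", validate="many_to_one")
)
print(len(safe))              # 2
print(safe["price"].sum())    # 30.0
```

With the row-per-item-per-payment layout, sums of `price` or `payment_value` over the master DataFrame count some values more than once, so downstream aggregations need to deduplicate first.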

In [5]:
# --- 6.1. Executing the Master Merge ---

# Start with the main 'orders' dataframe
df_master = dataframes['orders'].copy()

print(f"Starting with 'orders' dataframe. Shape: {df_master.shape}")

# Dataframes to merge
# Format: 'dataframe_name': 'join_key'
tables_to_merge = {
    'order_items': 'order_id',
    'order_payments': 'order_id',
    'order_reviews': 'order_id',
    'customers': 'customer_id',
    'products': 'product_id',
    'sellers': 'seller_id',
    'product_category_name_translation': 'product_category_name'
}

# Loop through the dictionary and perform left merges
for name, key in tables_to_merge.items():
    df_master = pd.merge(df_master, dataframes[name], on=key, how='left')
    print(f"  -> Merged with '{name}'. New shape: {df_master.shape}")

# --- Verification ---
print("\n--- Master DataFrame Verification ---")
print(f"Final merged dataframe shape: {df_master.shape}")
print("Columns in the final dataframe:", list(df_master.columns))
print("\nFirst 5 rows of the master dataframe:")
display(df_master.head())

Starting with 'orders' dataframe. Shape: (99441, 8)
  -> Merged with 'order_items'. New shape: (113425, 14)
  -> Merged with 'order_payments'. New shape: (118434, 18)
  -> Merged with 'order_reviews'. New shape: (119143, 24)
  -> Merged with 'customers'. New shape: (119143, 28)
  -> Merged with 'products'. New shape: (119143, 36)
  -> Merged with 'sellers'. New shape: (119143, 39)
  -> Merged with 'product_category_name_translation'. New shape: (119143, 40)

--- Master DataFrame Verification ---
Final merged dataframe shape: (119143, 40)
Columns in the final dataframe: ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date', 'order_item_id', 'product_id', 'seller_id', 'shipping_limit_date', 'price', 'freight_value', 'payment_sequential', 'payment_type', 'payment_installments', 'payment_value', 'review_id', 'review_score', 'review_comment_title', 'review_

|   | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | seller_zip_code_prefix | seller_city | seller_state | product_category_name_english |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 | 1.0 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 2017-10-06 11:07:15 | 29.99 | 8.72 | 1.0 | credit_card | 1.0 | 18.12 | a54f0611adc9ed256b57ede6b6eb5114 | 4.0 |  | Não testei o produto ainda, mas ele veio corre... | 2017-10-11 00:00:00 | 2017-10-12 03:43:48 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | utilidades_domesticas | 40.0 | 268.0 | 4.0 | 500.0 | 19.0 | 8.0 | 13.0 | 9350.0 | maua | SP | housewares |
| 1 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 | 1.0 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 2017-10-06 11:07:15 | 29.99 | 8.72 | 3.0 | voucher | 1.0 | 2.0 | a54f0611adc9ed256b57ede6b6eb5114 | 4.0 |  | Não testei o produto ainda, mas ele veio corre... | 2017-10-11 00:00:00 | 2017-10-12 03:43:48 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | utilidades_domesticas | 40.0 | 268.0 | 4.0 | 500.0 | 19.0 | 8.0 | 13.0 | 9350.0 | maua | SP | housewares |
| 2 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 | 1.0 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 2017-10-06 11:07:15 | 29.99 | 8.72 | 2.0 | voucher | 1.0 | 18.59 | a54f0611adc9ed256b57ede6b6eb5114 | 4.0 |  | Não testei o produto ainda, mas ele veio corre... | 2017-10-11 00:00:00 | 2017-10-12 03:43:48 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | utilidades_domesticas | 40.0 | 268.0 | 4.0 | 500.0 | 19.0 | 8.0 | 13.0 | 9350.0 | maua | SP | housewares |
| 3 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 | 1.0 | 595fac2a385ac33a80bd5114aec74eb8 | 289cdb325fb7e7f891c38608bf9e0962 | 2018-07-30 03:24:27 | 118.7 | 22.76 | 1.0 | boleto | 1.0 | 141.46 | 8d5266042046a06655c8db133d120ba5 | 4.0 | Muito boa a loja | Muito bom o produto. | 2018-08-08 00:00:00 | 2018-08-08 18:37:50 | af07308b275d755c9edb36a90c618231 | 47813 | barreiras | BA | perfumaria | 29.0 | 178.0 | 1.0 | 400.0 | 19.0 | 13.0 | 19.0 | 31570.0 | belo horizonte | SP | perfumery |
| 4 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 | 1.0 | aa4383b373c6aca5d8797843e5594415 | 4869f7a5dfa277a7dca6462dcf3b52b2 | 2018-08-13 08:55:23 | 159.9 | 19.22 | 1.0 | credit_card | 3.0 | 179.12 | e73b67b67587f7644d5bd1a52deb1b01 | 5.0 |  |  | 2018-08-18 00:00:00 | 2018-08-22 19:07:58 | 3a653a41f6f9fc3d2a113cf8398680e8 | 75265 | vianopolis | GO | automotivo | 46.0 | 232.0 | 1.0 | 420.0 | 24.0 | 19.0 | 21.0 | 14840.0 | guariba | SP | auto |


---
## 7. Saving the Processed Data

Finally, we will save the consolidated master DataFrame and a 5% random sample of it to the `data/processed` directory. Both files are saved in the efficient `.parquet` format and will serve as the starting point for all subsequent analysis notebooks.

In [6]:
# --- 7.1. Saving the Main and Sample DataFrames ---

# 1. Save the complete master dataframe
# The function will save it as 'main_data.parquet'
save_processed(df_master, 'main_data')

# 2. Create a 5% random sample for quick explorations
sample_fraction = 0.05
df_sample = df_master.sample(frac=sample_fraction, random_state=42)

# 3. Save the sample dataframe
# The function will save it as 'sample_data.parquet'
save_processed(df_sample, 'sample_data')


# --- Verification ---
print("\n--- Data Saving Verification ---")
# Check that each expected file exists on disk.
# os.path.exists returns False for a missing file rather than raising,
# so no try/except is needed here.
for label, file_name in [("Main", "main_data"), ("Sample", "sample_data")]:
    file_path = f"../data/processed/{file_name}.parquet"
    if os.path.exists(file_path):
        print(f"✅ Success: {label} file saved correctly at '{file_path}'")
    else:
        print(f"❌ Error: {label} file not found at '{file_path}'.")

Data saved to: /home/lucas/olist-data-analysis-project/notebooks/../data/processed/main_data.parquet
Data saved to: /home/lucas/olist-data-analysis-project/notebooks/../data/processed/sample_data.parquet

--- Data Saving Verification ---
✅ Success: Main file saved correctly at '../data/processed/main_data.parquet'
✅ Success: Sample file saved correctly at '../data/processed/sample_data.parquet'
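
A note on the sampling step: `DataFrame.sample(frac=0.05, random_state=42)` draws individual rows, so for the master DataFrame it returns roughly 5,957 of the 119143 rows and may keep only some of the rows belonging to a given order. If complete orders are needed in the sample, one option is to sample `order_id` values instead. The sketch below illustrates both variants on synthetic data; it is an alternative for consideration, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the master DataFrame: 1,000 rows over up to 400 orders
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "order_id": rng.integers(0, 400, size=1000),
    "value": rng.normal(size=1000),
})

# Row-level sampling, as in the notebook: reproducible thanks to random_state
row_sample = df_demo.sample(frac=0.05, random_state=42)
row_sample_again = df_demo.sample(frac=0.05, random_state=42)
print(len(row_sample))                      # 50
print(row_sample.equals(row_sample_again))  # True

# Order-level sampling: pick 5% of the order ids, then keep all their rows
order_ids = pd.Series(df_demo["order_id"].unique())
sampled_ids = order_ids.sample(frac=0.05, random_state=42)
order_sample = df_demo[df_demo["order_id"].isin(sampled_ids)]
print(order_sample["order_id"].nunique() == len(sampled_ids))
```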
