# Data Integration & Preprocessing for Customer Satisfaction Analysis

This notebook is part of a larger project exploring customer satisfaction in Brazilian e-commerce using the [Olist dataset](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce).  
It builds on the cleaned datasets prepared in the previous notebook by joining them into a single analytical table, performing additional data cleaning, and engineering features to support the upcoming analysis.


**Goals of this notebook:**

- Join the cleaned tables into a consolidated dataset at the order level  
- Clean and standardize the merged dataset (e.g., data types, missing values)  
- Create new features to capture relevant aspects of orders, products, payments, etc.
- Export the final dataset for analysis

**This notebook is preceded and followed by:**

- [Data Cleaning Notebook](./01_data_cleaning.ipynb): loads and prepares the individual raw datasets for integration  
- [Exploratory Analysis Notebook](./03_customer-satisfaction-analysis.ipynb): investigates customer satisfaction patterns and key influencing factors

---

## **Structure of the Notebook**

> _Note: Section links and “Back to top” links work best in Jupyter environments (e.g., Jupyter Lab or VS Code). They may not work as expected when clicked directly on GitHub._

- [Data Integration](#data-integration)
  - [Load cleaned datasets](#load-cleaned-csv-files-into-duckdb-tables)
  - [Join tables into order-level dataset](#join-tables-into-order-level-dataset)

In [None]:
import duckdb

## **Data Integration**

### Load Cleaned CSV Files into DuckDB Tables
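
Before the tables are created, it can help to confirm that every expected CSV file is present, since a missing file would otherwise surface as an error partway through loading. The following is a minimal sketch, assuming the seven file names used in this notebook; the helper name `find_missing_csvs` is hypothetical and not part of the original project.

```python
from pathlib import Path

# Table names used in this notebook; each is expected as <name>.csv
EXPECTED_TABLES = [
    "customers", "orders", "items", "payments",
    "reviews", "products", "sellers",
]


def find_missing_csvs(folder, table_names=EXPECTED_TABLES):
    """Return the table names whose CSV file is not present in `folder`."""
    folder = Path(folder)
    return [name for name in table_names if not (folder / f"{name}.csv").is_file()]


# Hypothetical pre-flight check against the folder used in this notebook
missing = find_missing_csvs("../data/cleaned")
if missing:
    print(f"Missing cleaned CSV files for: {', '.join(missing)}")
else:
    print("All expected cleaned CSV files are present.")
```

If any names are reported, rerunning the data cleaning notebook should regenerate the files.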

In [None]:
# Load Cleaned CSV Files into DuckDB Tables

# Path to the folder containing cleaned CSV files
data_path = "../data/cleaned"

# Connect to an in-memory DuckDB database
con = duckdb.connect(database=":memory:")

# Load each cleaned dataset into its own DuckDB table (table name = CSV file name)
table_names = ["customers", "orders", "items", "payments", "reviews", "products", "sellers"]

for table_name in table_names:
    con.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT * FROM read_csv_auto('{data_path}/{table_name}.csv');
    """)

[🠉 Back to top](#structure-of-the-notebook)

### Join Tables into Order-Level Dataset
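
The query in this section reduces item-level rows to one row per order. For attributes that can differ between items of the same order (category, seller, seller state, product), it keeps the value when all items agree and otherwise substitutes a marker such as `'multiple_categories'`. The sketch below reproduces that rule on made-up rows. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies; the table name `item_demo` and the sample data are assumptions for illustration only.

```python
import sqlite3

# Toy item rows (made up): order "A" has one category, order "B" has two
rows = [
    ("A", "toys"),
    ("A", "toys"),
    ("B", "toys"),
    ("B", "garden"),
]

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE item_demo (order_id TEXT, product_category_name TEXT)")
con_demo.executemany("INSERT INTO item_demo VALUES (?, ?)", rows)

# Same CASE expression as in the category_per_order CTE
collapsed = con_demo.execute("""
    SELECT
      order_id,
      CASE
        WHEN COUNT(DISTINCT product_category_name) = 1
          THEN MAX(product_category_name)
        ELSE 'multiple_categories'
      END AS product_category_name
    FROM item_demo
    GROUP BY order_id
    ORDER BY order_id
""").fetchall()
con_demo.close()

print(collapsed)  # [('A', 'toys'), ('B', 'multiple_categories')]
```

Because `COUNT(DISTINCT ...)` ignores NULLs, the `COALESCE(..., 'unknown')` applied in `item_quantities` matters: an order that mixes a known category with a missing one counts two distinct values and is labelled `'multiple_categories'`.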

In [None]:
# Build Query to Create Order-Level Analytical Dataset
q_orders = """
WITH item_quantities AS (
  -- Aggregate product and seller info per product in each order
  SELECT
    i.order_id,
    i.product_id,
    COALESCE(p.product_category_name_english, 'unknown') AS product_category_name,
    COALESCE(s.seller_state, 'unknown') AS seller_state,
    COALESCE(s.seller_id, 'unknown') AS seller_id,
    COUNT(*) AS product_quantity,
    SUM(i.price) AS product_price,
    SUM(i.freight_value) AS product_freight,
    MAX(p.product_weight_g) AS product_weight_g,
    MAX(i.shipping_limit_date) AS shipping_limit_date,
    AVG(p.product_name_lenght) AS product_name_length,
    AVG(p.product_description_lenght) AS product_description_length,
    AVG(p.product_photos_qty) AS product_photos_qty,
    AVG(p.product_length_cm) AS product_length_cm,
    AVG(p.product_height_cm) AS product_height_cm,
    AVG(p.product_width_cm) AS product_width_cm
  FROM items AS i
  LEFT JOIN products AS p ON i.product_id = p.product_id
  LEFT JOIN sellers AS s ON i.seller_id = s.seller_id
  GROUP BY i.order_id, i.product_id, s.seller_state, s.seller_id, p.product_category_name_english
),

-- Collapse product category into a single value per order
category_per_order AS (
  SELECT
    order_id,
    CASE 
      WHEN COUNT(DISTINCT product_category_name) = 1 
        THEN MAX(product_category_name)
      ELSE 'multiple_categories'
    END AS product_category_name
  FROM item_quantities
  GROUP BY order_id
),

-- Collapse seller state into a single value per order
seller_state_per_order AS (
  SELECT
    order_id,
    CASE 
      WHEN COUNT(DISTINCT seller_state) = 1 
        THEN MAX(seller_state)
      ELSE 'multiple_states'
    END AS seller_state
  FROM item_quantities
  GROUP BY order_id
),

-- Collapse seller ID into a single value per order
seller_id_per_order AS (
  SELECT
    order_id,
    CASE 
      WHEN COUNT(DISTINCT seller_id) = 1 
        THEN MAX(seller_id)
      ELSE 'multiple_sellers'
    END AS seller_id
  FROM item_quantities
  GROUP BY order_id
),

-- Collapse product ID into a single value per order
product_id_per_order AS (
  SELECT
    order_id,
    CASE 
      WHEN COUNT(DISTINCT product_id) = 1 
        THEN MAX(product_id)
      ELSE 'multiple_products'
    END AS product_id
  FROM item_quantities
  GROUP BY order_id
),

-- Average product-level features across items in the order
product_features_per_order AS (
  SELECT
    order_id,
    ROUND(AVG(product_name_length)) AS product_name_length,
    ROUND(AVG(product_description_length)) AS product_description_length,
    ROUND(AVG(product_photos_qty)) AS product_photos_qty,
    ROUND(AVG(product_length_cm)) AS product_length_cm,
    ROUND(AVG(product_height_cm)) AS product_height_cm,
    ROUND(AVG(product_width_cm)) AS product_width_cm
  FROM item_quantities
  GROUP BY order_id
),

-- Aggregate item-level purchase details per order
orders_items AS (
  SELECT
    order_id,
    COUNT(DISTINCT product_id) AS num_unique_products,
    ROUND(SUM(product_quantity)) AS num_items,
    SUM(product_price) AS total_price,
    SUM(product_freight) AS total_freight,
    SUM(product_price + product_freight) AS total_amount,
    SUM(product_quantity * product_weight_g) AS total_order_weight,
    MAX(shipping_limit_date) AS shipping_limit_date
  FROM item_quantities
  GROUP BY order_id
),

-- Aggregate payment details per order
payments_agg AS (
  SELECT
    order_id,
    MAX(payment_installments) AS max_payment_installments,
    COUNT(payment_sequential) AS n_payment_records, 
    STRING_AGG(DISTINCT payment_type, ', ' ORDER BY payment_type) AS payment_types,
    COUNT(DISTINCT payment_type) AS n_payment_types
  FROM payments
  GROUP BY order_id
)

------ Final Joined Order-Level Table ------
SELECT
  oi.order_id,
  r.review_score,
  pid.product_id,
  cat.product_category_name,
  oi.num_unique_products,
  oi.num_items,
  oi.total_price,
  oi.total_freight,
  oi.total_amount,
  oi.total_order_weight,
  ROUND(oi.total_freight / NULLIF(oi.total_amount, 0), 4) AS freight_share,
  ROUND(oi.total_freight / NULLIF(oi.total_price, 0), 4) AS freight_to_price_ratio,
  pf.product_name_length,
  pf.product_description_length,
  pf.product_photos_qty,
  pf.product_length_cm,
  pf.product_height_cm,
  pf.product_width_cm,
  pa.payment_types,
  pa.n_payment_types,
  pa.max_payment_installments,
  pa.n_payment_records,
  oi.shipping_limit_date,
  o.order_delivered_carrier_date,

  -- Shipping delay: days from the shipping limit date to the carrier handover (positive = shipped late)
  CASE 
    WHEN o.order_delivered_carrier_date IS NOT NULL AND oi.shipping_limit_date IS NOT NULL 
    THEN DATE_PART('day', o.order_delivered_carrier_date - oi.shipping_limit_date)
    ELSE NULL 
  END AS shipping_delay_days,
  o.order_delivered_customer_date,
  o.order_estimated_delivery_date,

  -- Delivery delay: days from the estimated delivery date to the actual delivery (positive = delivered late)
  CASE 
    WHEN o.order_delivered_customer_date IS NOT NULL AND o.order_estimated_delivery_date IS NOT NULL 
    THEN DATE_PART('day', o.order_delivered_customer_date - o.order_estimated_delivery_date) 
    ELSE NULL 
  END AS delivery_delay_days,
  r.review_comment_message,
  r.review_creation_date,
  r.review_answer_timestamp,

  -- Days from review_creation_date to review_answer_timestamp (survey sent vs. survey answered, per the Olist data description)
  CASE 
    WHEN r.review_answer_timestamp IS NOT NULL AND r.review_creation_date IS NOT NULL 
    THEN DATE_PART('day', r.review_answer_timestamp - r.review_creation_date) 
    ELSE NULL 
  END AS review_processing_delay_days,
  
  c.customer_unique_id,
  c.customer_state,
  ssoid.seller_id,
  sso.seller_state

FROM orders_items AS oi
LEFT JOIN orders AS o USING(order_id)
LEFT JOIN customers AS c USING(customer_id)
LEFT JOIN payments_agg AS pa USING(order_id)
LEFT JOIN reviews AS r USING(order_id)
LEFT JOIN category_per_order AS cat USING(order_id)
LEFT JOIN seller_state_per_order AS sso USING(order_id)
LEFT JOIN seller_id_per_order AS ssoid USING(order_id)
LEFT JOIN product_features_per_order AS pf USING(order_id)
LEFT JOIN product_id_per_order AS pid USING(order_id)

-- Filter: only delivered orders with valid review scores
WHERE o.order_status = 'delivered' 
  AND r.review_score IS NOT NULL

ORDER BY oi.order_id;
"""

In [None]:
# Execute the final SQL query and convert the result to a pandas DataFrame
df_orders = con.execute(q_orders).df()

# Show shape of the resulting dataset
print(f"Order-level joined table has {df_orders.shape[0]} rows and {df_orders.shape[1]} columns.")

# Preview the first few rows
df_orders.head()
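
Since the result is meant to hold one row per order, a quick uniqueness check on `order_id` is a useful sanity test. Duplicates could appear, for example, if the cleaned `reviews` table still contains more than one review for the same order, because each extra review would repeat the order row in the join. The following is a minimal sketch; the helper name `find_duplicate_ids` and the sample IDs are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(order_ids):
    """Return {order_id: count} for every ID that occurs more than once."""
    counts = Counter(order_ids)
    return {order_id: n for order_id, n in counts.items() if n > 1}


# Made-up IDs for illustration
sample_ids = ["o1", "o2", "o2", "o3"]
print(find_duplicate_ids(sample_ids))  # {'o2': 2}
```

In this notebook the same helper could be called as `find_duplicate_ids(df_orders["order_id"])`; an empty dictionary would indicate that the table is at order level as intended.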

In [None]:
# Close the DuckDB connection
con.close()
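
The joined table contains two related freight features: `freight_share` divides freight by the total amount (price plus freight), while `freight_to_price_ratio` divides freight by the price alone. In the query, `NULLIF(..., 0)` turns a zero denominator into NULL so that the division yields NULL instead of failing. The plain-Python sketch below mirrors that logic with made-up values; the helper `safe_ratio` is hypothetical and not part of the original notebook.

```python
def safe_ratio(numerator, denominator, ndigits=4):
    """Return numerator / denominator rounded to `ndigits`.

    Returns None when the numerator is missing or the denominator is
    missing or zero, mirroring the NULLIF(denominator, 0) guard in the query.
    """
    if numerator is None or not denominator:
        return None
    return round(numerator / denominator, ndigits)


# Made-up order totals for illustration
total_price, total_freight = 100.0, 25.0
total_amount = total_price + total_freight

freight_share = safe_ratio(total_freight, total_amount)           # 25 / 125
freight_to_price_ratio = safe_ratio(total_freight, total_price)   # 25 / 100

print(freight_share, freight_to_price_ratio)  # 0.2 0.25
print(safe_ratio(5.0, 0))                     # None
```

Python's `round` and SQL `ROUND` may treat exact ties differently, so values that sit precisely on a rounding boundary can differ in the last digit.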

[🠉 Back to top](#structure-of-the-notebook)