# Day 2 Exercise: The Olist Marketplace Analysis

**Your Name:** [Student name here]  
**Date:** Day 2, Block A  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data

---

## 📧 Monday Morning at Olist

**Your Role:** Junior Data Analyst at Olist  
**Your Location:** São Paulo, Brazil  
**Date:** December 2018

You're a newly hired Data Analyst at **Olist**, Brazil's largest department store marketplace. Olist connects thousands of small businesses (sellers) with customers across Brazil through a unified e-commerce platform. Think of it as Brazil's answer to Amazon Marketplace or Etsy at scale.

You arrive Monday morning to find this email from **Paula Costa**, VP of Marketplace Operations:

---

**From:** Paula Costa <paula.costa@olist.com>  
**To:** You (Data Analytics Team)  
**Subject:** URGENT: Q4 Board Meeting Data - Due Wednesday

> Team,
>
> Our Q4 board meeting is Friday morning. I need insights on seller performance, customer behavior, and product trends to present to the executive team. The board is particularly interested in:
> 
> 1. **Revenue drivers** - which product categories are generating the most sales?
> 2. **Seller performance** - who are our top-performing sellers by state?
> 3. **Customer feedback gaps** - what percentage of orders lack reviews, and why does this matter?
> 
> I need **data-driven answers** by Wednesday EOD. Please analyze our marketplace data (2016-2018) and provide:
> - Clear SQL queries that I can verify
> - Business insights I can present to non-technical executives
> - Recommendations for action
> 
> **Context:** We're evaluating whether to expand seller recruitment in certain states, invest in review incentive programs, and potentially restructure our product category strategy. Your analysis will inform million-dollar decisions.
> 
> I'm counting on you. Let's show the board what data analytics can do.
> 
> — Paula

---

**Your Mission:** Use SQL joins, aggregations, and data analysis to answer Paula's questions and deliver actionable business insights.

---

## Setup: Load Data and Connect to DuckDB

In [None]:
# Import libraries
import duckdb
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Connect to in-memory database
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

In [None]:
# Load all necessary tables
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM '../../data/day2/block_a/olist_orders_dataset.csv'
""")

con.execute("""
    CREATE TABLE customers AS
    SELECT * FROM '../../data/day2/block_a/olist_customers_dataset.csv'
""")

con.execute("""
    CREATE TABLE order_items AS
    SELECT * FROM '../../data/day2/block_a/olist_order_items_dataset.csv'
""")

con.execute("""
    CREATE TABLE products AS
    SELECT * FROM '../../data/day2/block_a/olist_products_dataset.csv'
""")

con.execute("""
    CREATE TABLE sellers AS
    SELECT * FROM '../../data/day2/block_a/olist_sellers_dataset.csv'
""")

con.execute("""
    CREATE TABLE reviews AS
    SELECT * FROM '../../data/day2/block_a/olist_order_reviews_dataset.csv'
""")

con.execute("""
    CREATE TABLE categories AS
    SELECT * FROM '../../data/day2/block_a/product_category_name_translation.csv'
""")

print("✅ All tables loaded successfully!")

---

## Part 1: In-Class Queries (25 minutes)

Complete these 3 queries during class time. Each query is scaffolded with TODO comments to guide you.

---

### Query 1: Revenue by Product Category (⏱️ 7 minutes)

**Paula's Question:**
> "Which product categories generate the most revenue? I need this for Q1 budget allocation decisions. Show me the top 10 categories by total revenue."

**What you need to do:**
- Join `order_items` → `products` → `categories` (to get English category names)
- Calculate total revenue per category: `SUM(price)`
- Count distinct orders per category
- Sort by revenue descending, show top 10

**Tables needed:**
- `order_items` (has `price` and `product_id`)
- `products` (has `product_category_name`)
- `categories` (translates Portuguese → English)

**Join keys:**
- `order_items.product_id = products.product_id`
- `products.product_category_name = categories.product_category_name`
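
The join-then-aggregate pattern above can be tried on a toy example first. This is a minimal sketch, not the exercise solution: the `demo` connection and the `toy_items` / `toy_products` tables are hypothetical, and it uses Python's built-in `sqlite3` so it runs without the course data. The SQL is plain standard SQL and should behave the same way when passed to `con.execute(...)` in DuckDB. It also shows why the exercise asks for `COUNT(DISTINCT ...)`: one order can contain several items, so counting rows overstates the number of orders.

```python
import sqlite3

# Hypothetical toy data: order o1 contains two items from the same category.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE toy_items (order_id TEXT, product_id TEXT, price REAL);
    CREATE TABLE toy_products (product_id TEXT, category TEXT);
    INSERT INTO toy_items VALUES
        ('o1', 'p1', 10.0),
        ('o1', 'p2', 20.0),
        ('o2', 'p1', 10.0),
        ('o3', 'p3', 5.0);
    INSERT INTO toy_products VALUES
        ('p1', 'toys'), ('p2', 'toys'), ('p3', 'books');
""")

rows = demo.execute("""
    SELECT
        p.category,
        COUNT(*)                   AS item_rows,      -- one per item line
        COUNT(DISTINCT i.order_id) AS order_count,    -- one per order
        SUM(i.price)               AS total_revenue
    FROM toy_items i
    INNER JOIN toy_products p ON i.product_id = p.product_id
    GROUP BY p.category
    ORDER BY total_revenue DESC
""").fetchall()

for row in rows:
    print(row)
# ('toys', 3, 2, 40.0)   <- 3 item rows, but only 2 distinct orders
# ('books', 1, 1, 5.0)
```

Giving each aggregate an explicit alias (`AS total_revenue`) also makes the resulting column names predictable, which matters when later cells check for them.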

In [None]:
# Query 1: Revenue by Category
# TODO: Complete this query

result_q1 = con.execute("""
    SELECT 
        -- TODO: Select category name in English (from categories table)
        -- TODO: Count DISTINCT orders (COUNT(DISTINCT oi.order_id) AS order_count)
        -- TODO: Sum price for total revenue (SUM(oi.price) AS total_revenue)
    FROM order_items oi
    -- TODO: INNER JOIN to products table (ON oi.product_id = p.product_id)
    -- TODO: INNER JOIN to categories table (ON p.product_category_name = cat.product_category_name)
    -- TODO: GROUP BY category
    -- TODO: ORDER BY total revenue DESC
    LIMIT 10
""").df()

result_q1

In [None]:
# Validation: Check your results
assert result_q1.shape[0] == 10, "Should return 10 categories"
assert any('revenue' in col.lower() for col in result_q1.columns), "Should have a revenue column (e.g., total_revenue)"
print("✅ Query 1 validation passed!")

**Your interpretation for Paula (2-3 sentences):**

[Write your business insight here: What are the top revenue-generating categories? What does this tell Paula about where to focus budget?]

---

### Query 2: Unreviewed Orders Investigation (⏱️ 7 minutes)

**Paula's Question:**
> "Some orders don't have customer reviews. This concerns me - we use reviews for quality control and seller ratings. Find all orders that lack reviews and tell me: how many unreviewed orders are there by order status? Which statuses have the most unreviewed orders?"

**What you need to do:**
- LEFT JOIN `orders` → `reviews` (keep all orders, even without reviews)
- Filter to orders WHERE review is NULL
- Group by order status
- Count unreviewed orders by status

**Why LEFT JOIN?** We want ALL orders. INNER JOIN would only show orders WITH reviews (the opposite of what Paula needs!).
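
To see the difference concretely, here is a small sketch on made-up data. The `demo` connection and the `toy_orders` / `toy_reviews` tables are hypothetical, and the example uses Python's built-in `sqlite3` so it runs independently of the notebook's `con`. The same SQL should work in DuckDB.

```python
import sqlite3

# Hypothetical toy data: three orders, only o1 has a review.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE toy_orders (order_id TEXT, status TEXT);
    CREATE TABLE toy_reviews (review_id TEXT, order_id TEXT);
    INSERT INTO toy_orders VALUES
        ('o1', 'delivered'), ('o2', 'delivered'), ('o3', 'canceled');
    INSERT INTO toy_reviews VALUES ('r1', 'o1');
""")

# INNER JOIN keeps only orders that have a matching review.
inner_rows = demo.execute("""
    SELECT o.order_id, r.review_id
    FROM toy_orders o
    INNER JOIN toy_reviews r ON o.order_id = r.order_id
    ORDER BY o.order_id
""").fetchall()

# LEFT JOIN keeps every order; missing reviews show up as NULL (None in Python).
left_rows = demo.execute("""
    SELECT o.order_id, r.review_id
    FROM toy_orders o
    LEFT JOIN toy_reviews r ON o.order_id = r.order_id
    ORDER BY o.order_id
""").fetchall()

# Filtering the LEFT JOIN on NULL isolates the orders with no match.
unmatched = demo.execute("""
    SELECT o.order_id
    FROM toy_orders o
    LEFT JOIN toy_reviews r ON o.order_id = r.order_id
    WHERE r.review_id IS NULL
    ORDER BY o.order_id
""").fetchall()

print(inner_rows)  # [('o1', 'r1')]
print(left_rows)   # [('o1', 'r1'), ('o2', None), ('o3', None)]
print(unmatched)   # [('o2',), ('o3',)]
```

The NULL check should be on a column from the right-hand table that is never NULL in a real row (here `review_id`), so that NULL can only mean "no match".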

**Tables needed:**
- `orders` (has `order_id` and `order_status`)
- `reviews` (has `review_id` and links to `order_id`)

**Join key:**
- `orders.order_id = reviews.order_id`

In [None]:
# Query 2: Unreviewed Orders
# TODO: Complete this query

result_q2 = con.execute("""
    SELECT 
        -- TODO: Select order_status from orders table
        -- TODO: Count unreviewed orders (COUNT(*) AS unreviewed_count; the ORDER BY below expects this alias)
    FROM orders o
    -- TODO: LEFT JOIN to reviews table (ON o.order_id = r.order_id)
    WHERE r.review_id IS NULL
    -- TODO: GROUP BY order_status
    ORDER BY unreviewed_count DESC
""").df()

result_q2

In [None]:
# Validation: Check your results
assert result_q2.shape[0] > 0, "Should find some unreviewed orders"
assert 'order_status' in [col.lower() for col in result_q2.columns], "Should have order_status column"
assert any('count' in col.lower() for col in result_q2.columns), "Should have a count column (e.g., unreviewed_count, order_count)"
print("✅ Query 2 validation passed!")

**Your interpretation for Paula (2-3 sentences):**

[Write your business insight here: How many total unreviewed orders are there? Which order statuses have the most unreviewed orders? Should Paula be concerned about review gaps in certain order statuses?]

---

### Query 3: Top Sellers by State (⏱️ 7 minutes)

**Paula's Question:**
> "I need to recognize our top-performing sellers in each state for our quarterly awards program. Show me the top 3 sellers in each state by total revenue. This will also help us identify which states have strong seller ecosystems."

**What you need to do:**
- Join `order_items` → `sellers`
- Calculate total revenue per seller
- Rank sellers within each state (use the ROW_NUMBER() window function from Day 1!)
- Filter to top 3 per state

**Hint:** Use a CTE for clean structure:
1. First CTE: Calculate seller revenue
2. Second CTE: Add ranking with ROW_NUMBER() OVER (PARTITION BY seller_state ...) AS rank_in_state
3. Main query: Filter WHERE rank_in_state <= 3

**Note:** The query ends with `LIMIT 30` to keep the output manageable for review. Because the results are ordered by `seller_state`, this shows roughly the first 10 states alphabetically (up to 3 sellers each), not the 10 highest-revenue states. In a real analysis, you'd remove the LIMIT to see all states.

**Tables needed:**
- `order_items` (has `price` and `seller_id`)
- `sellers` (has `seller_state`)

**Join key:**
- `order_items.seller_id = sellers.seller_id`
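
The three-step CTE pattern from the hint (aggregate, rank, filter) can be sketched on toy data as follows. Everything here is hypothetical: the `demo` connection, the `toy_sales` table, and the column names are made up for illustration, and the cut-off is top 2 rather than top 3. The example uses Python's built-in `sqlite3`, which needs a bundled SQLite of 3.25 or newer for window functions. The same SQL should run in DuckDB.

```python
import sqlite3

# Hypothetical toy data: seller s1 has two sales that must be summed first.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE toy_sales (seller TEXT, region TEXT, amount REAL);
    INSERT INTO toy_sales VALUES
        ('s1', 'north', 100.0), ('s1', 'north', 50.0),
        ('s2', 'north', 120.0),
        ('s3', 'north', 30.0),
        ('s4', 'south', 80.0),
        ('s5', 'south', 90.0);
""")

top_two = demo.execute("""
    WITH seller_totals AS (
        -- Step 1: one row per seller with its total
        SELECT seller, region, SUM(amount) AS total_amount
        FROM toy_sales
        GROUP BY seller, region
    ),
    ranked AS (
        -- Step 2: number sellers within each region, best first
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY region
                ORDER BY total_amount DESC
            ) AS rank_in_region
        FROM seller_totals
    )
    -- Step 3: keep the top 2 per region
    SELECT region, seller, total_amount, rank_in_region
    FROM ranked
    WHERE rank_in_region <= 2
    ORDER BY region, rank_in_region
""").fetchall()

for row in top_two:
    print(row)
# ('north', 's1', 150.0, 1)
# ('north', 's2', 120.0, 2)
# ('south', 's5', 90.0, 1)
# ('south', 's4', 80.0, 2)
```

The ranking needs its own CTE because a window function's result cannot be filtered in the `WHERE` clause of the same `SELECT` that computes it. Note that `ROW_NUMBER()` assigns distinct numbers even when totals tie, so which tied seller ranks higher is arbitrary unless the `ORDER BY` includes a tie-breaker.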

In [None]:
# Query 3: Top Sellers by State
# TODO: Complete this query

result_q3 = con.execute("""
    WITH seller_revenue AS (
        -- TODO: Calculate total revenue per seller
        SELECT 
            -- TODO: Select seller_id, seller_state
            -- TODO: Count distinct orders
            -- TODO: Sum price as total_revenue
        FROM order_items oi
        -- TODO: INNER JOIN to sellers table
        -- TODO: GROUP BY seller_id, seller_state
    ),
    ranked_sellers AS (
        -- TODO: Add ranking within each state
        SELECT 
            *,
            -- TODO: ROW_NUMBER() OVER (PARTITION BY seller_state ORDER BY total_revenue DESC) AS rank_in_state
        FROM seller_revenue
    )
    -- TODO: Select from ranked_sellers WHERE rank_in_state <= 3
    -- TODO: ORDER BY seller_state, rank_in_state
    LIMIT 30
""").df()

result_q3

In [None]:
# Validation: Check your results
assert result_q3.shape[0] > 0, "Should find top sellers"
assert result_q3.shape[0] <= 30, "Limited to 30 rows (about 10 states × up to 3 sellers)"
print("✅ Query 3 validation passed!")

**Your interpretation for Paula (2-3 sentences):**

[Write your business insight here: Which states have the highest-revenue sellers? Are revenues concentrated among a few top sellers or distributed evenly?]

---

## In-Class Reflection (3-4 sentences)

**Based on your three queries above, write a brief summary for Paula:**

1. What are the key insights from the data?
2. What surprised you?
3. What should Paula focus on first?

[Write your reflection here]

---

**🎉 Great work! You've completed the in-class portion. The queries below are homework.**

---

## Part 2: Homework Queries (Complete after class)

These queries build on what you learned in class. They are less scaffolded - you'll need to figure out the full query structure yourself.

---

### Query 4: Customer Geography Analysis

**Paula's Question:**
> "Which states have customers with the highest average order value? Calculate the average revenue per order by customer state. This will inform our regional marketing budget allocation."

**Hint:** 
- Join `orders` → `customers` → `order_items`
- Calculate total order value: SUM(price) per order_id
- Then average by customer_state
- Consider using a CTE to get order-level revenue first

**Expected result:** One row per state, with average order value
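
The reason for computing order-level revenue first can be seen on a toy example. This is a sketch under stated assumptions, not the homework solution: the `demo` connection, the `toy_orders` / `toy_items` tables, and the `region` column are hypothetical, and it uses Python's built-in `sqlite3` so it runs without the course data. Averaging item prices directly gives the average item price, not the average order value.

```python
import sqlite3

# Hypothetical toy data: order o1 has two items, the others have one.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE toy_orders (order_id TEXT, region TEXT);
    CREATE TABLE toy_items (order_id TEXT, price REAL);
    INSERT INTO toy_orders VALUES ('o1', 'A'), ('o2', 'A'), ('o3', 'B');
    INSERT INTO toy_items VALUES
        ('o1', 10.0), ('o1', 30.0),
        ('o2', 20.0),
        ('o3', 50.0);
""")

# Two-stage aggregation: total per order first, then average per region.
avg_order_value = demo.execute("""
    WITH order_totals AS (
        SELECT o.order_id, o.region, SUM(i.price) AS order_value
        FROM toy_orders o
        INNER JOIN toy_items i ON o.order_id = i.order_id
        GROUP BY o.order_id, o.region
    )
    SELECT region, AVG(order_value) AS avg_order_value
    FROM order_totals
    GROUP BY region
    ORDER BY region
""").fetchall()

# Single-stage shortcut: this averages item prices, which is a different metric.
avg_item_price = demo.execute("""
    SELECT o.region, AVG(i.price) AS avg_item_price
    FROM toy_orders o
    INNER JOIN toy_items i ON o.order_id = i.order_id
    GROUP BY o.region
    ORDER BY o.region
""").fetchall()

print(avg_order_value)  # [('A', 30.0), ('B', 50.0)]
print(avg_item_price)   # [('A', 20.0), ('B', 50.0)]
```

Region A has orders worth 40 and 20, so its average order value is 30. The shortcut reports 20 because it divides by three items instead of two orders.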

In [None]:
# Query 4: Customer Geography (Homework)
# Write your query here

result_q4 = con.execute("""
    -- Your query here
    SELECT 1 as todo  -- Replace this entire query
""").df()

result_q4

**Your interpretation for Paula:**

[Write your analysis here]

---

### Query 5: Product Quality Issues

**Paula's Question:**
> "Identify products with average review scores below 2.5 stars AND at least 5 reviews. These products need immediate seller intervention. Show me product category, average rating, and review count."

**Hint:**
- Join `products` → `order_items` → `orders` → `reviews`
- Group by product_id and category
- Calculate `AVG(review_score)` and the number of reviews (e.g., `COUNT(review_id)`)
- Use HAVING clause to filter: AVG < 2.5 AND COUNT >= 5

**Expected result:** Products with poor ratings (at least 5 reviews)
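
As a reminder of how `HAVING` differs from `WHERE`, here is a minimal sketch on made-up data. The `demo` connection and the `toy_ratings` table are hypothetical, and the example uses Python's built-in `sqlite3` so it runs without the course data. `WHERE` filters individual rows before grouping, while `HAVING` filters whole groups after the aggregates are computed, which is why conditions on `AVG(...)` and `COUNT(...)` belong in `HAVING`.

```python
import sqlite3

# Hypothetical toy data:
#   x: five low scores      -> low average AND enough ratings
#   y: two low scores       -> low average but too few ratings
#   z: five high scores     -> enough ratings but high average
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE toy_ratings (item TEXT, score INTEGER);
    INSERT INTO toy_ratings VALUES
        ('x', 1), ('x', 2), ('x', 2), ('x', 2), ('x', 3),
        ('y', 1), ('y', 1),
        ('z', 5), ('z', 4), ('z', 5), ('z', 4), ('z', 5);
""")

flagged = demo.execute("""
    SELECT
        item,
        AVG(score) AS avg_score,
        COUNT(*)   AS n_ratings
    FROM toy_ratings
    GROUP BY item
    HAVING AVG(score) < 2.5   -- group-level condition on the average
       AND COUNT(*) >= 5      -- group-level condition on the sample size
    ORDER BY avg_score
""").fetchall()

print(flagged)  # [('x', 2.0, 5)]
```

The minimum-count condition is what keeps item `y` out of the result: two bad ratings are too small a sample to act on.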

In [None]:
# Query 5: Product Quality Issues (Homework)
# Write your query here

result_q5 = con.execute("""
    -- Your query here
    SELECT 1 as todo  -- Replace this entire query
""").df()

result_q5

**Your interpretation for Paula:**

[Write your analysis here]

---

### Query 6 (BONUS): Seller Performance Gaps

**Paula's Question:**
> "Which sellers have made sales but NEVER received a review? This could indicate a data quality issue or problematic seller behavior. Find these sellers and calculate their total revenue."

**Hint:**
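- This is a complex multi-table join
- Chain: `sellers` → `order_items` → `orders` → `reviews`
- Use a LEFT JOIN into `reviews` so orders without reviews are kept
- Group by seller, then keep only sellers with zero reviews: `HAVING COUNT(review_id) = 0`
- Careful: `WHERE review_id IS NULL` on its own also returns sellers who have a mix of reviewed and unreviewed orders, which is not "NEVER received a review"
- Sum `price` per seller to get total revenue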

**Expected result:** Sellers with sales but zero reviews
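
The difference between "has at least one unreviewed order" and "has never been reviewed" is easy to miss, so here is a sketch on toy data. The `demo` connection and the `toy_items` / `toy_reviews` tables are hypothetical, the example collapses the join chain to two tables for brevity, and it uses Python's built-in `sqlite3` so it runs without the course data.

```python
import sqlite3

# Hypothetical toy data:
#   seller a: o1 is reviewed, o2 is not  -> has received a review
#   seller b: o3 and o4 are not reviewed -> has never received a review
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE toy_items (order_id TEXT, seller TEXT, price REAL);
    CREATE TABLE toy_reviews (review_id TEXT, order_id TEXT);
    INSERT INTO toy_items VALUES
        ('o1', 'a', 10.0), ('o2', 'a', 20.0),
        ('o3', 'b', 30.0), ('o4', 'b', 40.0);
    INSERT INTO toy_reviews VALUES ('r1', 'o1');
""")

# Row-level filter: sellers with at least one unreviewed order (includes a).
some_unreviewed = demo.execute("""
    SELECT i.seller, SUM(i.price) AS unreviewed_revenue
    FROM toy_items i
    LEFT JOIN toy_reviews r ON i.order_id = r.order_id
    WHERE r.review_id IS NULL
    GROUP BY i.seller
    ORDER BY i.seller
""").fetchall()

# Group-level filter: sellers whose orders have no reviews at all (only b).
never_reviewed = demo.execute("""
    SELECT i.seller, SUM(i.price) AS total_revenue
    FROM toy_items i
    LEFT JOIN toy_reviews r ON i.order_id = r.order_id
    GROUP BY i.seller
    HAVING COUNT(r.review_id) = 0
    ORDER BY i.seller
""").fetchall()

print(some_unreviewed)  # [('a', 20.0), ('b', 70.0)]
print(never_reviewed)   # [('b', 70.0)]
```

`COUNT(r.review_id)` counts only non-NULL values, so it is zero exactly when none of the seller's rows matched a review. Because those sellers have no matching review rows, the join cannot duplicate their item rows, and `SUM(price)` is their true total.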

In [None]:
# Query 6: Seller Performance Gaps (BONUS - Homework)
# Write your query here

result_q6 = con.execute("""
    -- Your query here (this one is challenging!)
    SELECT 1 as todo  -- Replace this entire query
""").df()

result_q6

**Your interpretation for Paula:**

[Write your analysis here]

---

## Executive Summary for Paula (Homework)

Write an 8-10 sentence summary for Paula Costa to present at the board meeting. Remember:
- Paula is non-technical (avoid SQL jargon)
- Focus on business impact, not query mechanics
- Include specific numbers from your analysis
- Make clear recommendations

**Structure:**
1. **Opening:** What did you analyze and why?
2. **Key findings:** 3-4 main insights from your queries
3. **Business impact:** What do these findings mean for Olist?
4. **Recommendations:** 2-3 specific actions Paula should take

---

### Executive Summary

[Write your 8-10 sentence executive summary here. Start with: "Paula, I analyzed our marketplace data from 2016-2018 to understand revenue drivers, seller performance, and customer engagement. Here's what I found..."]

---

## Submission Checklist

**Before submitting, verify:**

- [ ] All 3 in-class queries (Q1-Q3) complete and working
- [ ] All validation cells pass (no assertion errors)
- [ ] Interpretations written for each query (2-3 sentences)
- [ ] In-class reflection completed (3-4 sentences)
- [ ] Homework queries (Q4-Q6) attempted
- [ ] Executive summary written (8-10 sentences)
- [ ] Notebook runs end-to-end: **"Restart & Run All"** succeeds
- [ ] All outputs visible (don't clear them!)
- [ ] File renamed to: `day2_exercise_joins_[your_name].ipynb`

**Upload to Moodle by: Start of next class**

---

**Great work! 🎉 You've applied SQL joins to solve real business problems!**