# Lab 06 Solution — GroupBy & Joins

**Focus Area:** Turning messy raw data into consistent, joined datasets by using `groupby` aggregations and join patterns (inner/left/outer)

---

## Outcomes

By the end of this lab, you will be able to:

1. Use `groupby().agg(...)` to compute per‑key metrics (mean, sum, count, nunique) with **named aggregations**.
2. Choose the correct **join** (inner/left/outer) for a question, and verify cardinality with `validate=`.
3. Diagnose join pitfalls: duplicated keys (fan‑out), missing keys (anti‑join), and column suffix collisions.
4. Build tidy aggregates for downstream LLM prompts/features and persist results to Parquet.

## Prerequisites & Setup

Import required libraries and create sample datasets.

In [None]:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa
from pathlib import Path

print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")

### Create Sample Data

If you don't have artifacts from previous labs, we'll synthesize mini tables:

In [None]:
# Create sample orders dataset
orders = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 5, 6],
    'CustomerID': ['ALFKI', 'ANATR', 'ANTON', 'ALFKI', 'BERGS', 'CHOPS'],
    'ShipCountry': ['USA', 'DE', 'USA', 'USA', 'SE', 'SG'],
    'Freight': [32.1, 12.0, 5.0, 50.0, 80.0, 22.0]
})

# Create sample customers dataset
customers = pd.DataFrame({
    'CustomerID': ['ALFKI', 'ANATR', 'ANTON', 'BONAP', 'BERGS'],
    'CompanyName': ['Alfreds', 'Ana Trujillo', 'Antonio Moreno', 'Bon app', 'Berglunds'],
    'Country': ['Germany', 'Mexico', 'Mexico', 'France', 'Sweden']
})

print("Orders Dataset:")
display(orders.head())
print("\nCustomers Dataset:")
display(customers.head())

---

## Part A — GroupBy Fundamentals & Named Aggregations

### A1. Basic aggregates

Per ShipCountry: count orders, mean freight, and total freight

In [None]:
# Per ShipCountry: orders, mean freight, total freight
agg_country = (
    orders
    .groupby('ShipCountry', as_index=False)
    .agg(
        orders=('OrderID', 'count'),
        freight_mean=('Freight', 'mean'),
        freight_sum=('Freight', 'sum')
    )
    .sort_values('orders', ascending=False)
)

print("Aggregates by Ship Country:")
display(agg_country)

### A2. Multiple keys & nunique

Per (ShipCountry, CustomerID): count orders and distinct customers

In [None]:
# Per (ShipCountry, CustomerID): order count & distinct customers
agg_cc = (
    orders
    .groupby(['ShipCountry', 'CustomerID'], as_index=False)
    .agg(
        n_orders=('OrderID', 'count'),
        n_cust=('CustomerID', 'nunique')
    )
)

print("Aggregates by Ship Country and Customer ID:")
display(agg_cc.head())
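
Because `CustomerID` is one of the grouping keys in the cell above, its `nunique` within each group is 1 by construction. The sketch below shows a grouping where the distinct count carries information: grouping by `ShipCountry` alone. It uses `orders_demo`, a stand-in table with the same columns as `orders`; the name is introduced here for illustration only.

```python
import pandas as pd

# Stand-in for the lab's `orders` table (illustrative values).
orders_demo = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 5, 6],
    'CustomerID': ['ALFKI', 'ANATR', 'ANTON', 'ALFKI', 'BERGS', 'CHOPS'],
    'ShipCountry': ['USA', 'DE', 'USA', 'USA', 'SE', 'SG'],
})

# Group by country only, so customers can vary within each group.
by_country = (
    orders_demo
    .groupby('ShipCountry', as_index=False)
    .agg(
        n_orders=('OrderID', 'count'),
        n_cust=('CustomerID', 'nunique'),
    )
)
print(by_country)
```

Here USA has three orders but only two distinct customers, because ALFKI ordered twice.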

### A3. `size` vs `count` and missing values

Understanding the difference: `size` counts rows, `count` ignores NaN in the column

In [None]:
# size counts rows; count ignores NaN in the column
orders_with_nulls = orders.assign(Maybe=None)

print("Using .size() - counts all rows including NaN:")
size_result = orders_with_nulls.groupby('ShipCountry').size()
display(size_result)

print("\nUsing .count() - ignores NaN values:")
count_result = orders_with_nulls.groupby('ShipCountry')['Maybe'].count()
display(count_result)

**Checkpoint:** When would you choose `size` vs `count`?

- Use `size()` to count every row in each group, whether or not any values are missing
- Use `count()` to count only the non-null values in the specified column(s)
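
Both counts can also be computed side by side in one named aggregation, which makes the number of missing values per group explicit. This is a minimal sketch on a made-up frame; `demo`, `summary`, and the output column names are illustrative, not part of the lab data.

```python
import numpy as np
import pandas as pd

# Group A has one missing value; group B has none.
demo = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [1.0, np.nan, 2.0, 3.0],
})

summary = demo.groupby('group', as_index=False).agg(
    n_rows=('value', 'size'),      # every row in the group
    n_present=('value', 'count'),  # non-null values only
)
# The gap between the two is the number of missing values per group.
summary['n_missing'] = summary['n_rows'] - summary['n_present']
print(summary)
```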

---

## Part B — Join Patterns & Cardinality Checks

### B1. Inner vs Left vs Outer (visual & code)

Compare different join types and their effects on the resulting dataset

In [None]:
# Different join types
inner = orders.merge(customers, on='CustomerID', how='inner', validate='many_to_one')
left = orders.merge(customers, on='CustomerID', how='left', validate='many_to_one')
outer = orders.merge(customers, on='CustomerID', how='outer')

print(f"Original orders: {len(orders)} rows")
print(f"Inner join: {len(inner)} rows")
print(f"Left join: {len(left)} rows")
print(f"Outer join: {len(outer)} rows")

print("\nInner join result (only matching keys):")
display(inner)

print("\nLeft join result (all orders, even without matching customer):")
display(left)

print("\nOuter join result (all keys from both sides):")
display(outer)

**Join Type Explanation:**

- **Inner:** Keep only matching `CustomerID` on both sides (typical for orders↔customers when analyzing realized orders)
- **Left:** Keep all orders even if customer row is missing (good for data quality checks / anti‑join)
- **Outer:** Keep all keys from both sides (useful for audits, rare in production metrics)
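
The row counts of the three join types are related, and `indicator=True` makes that relationship visible. The sketch below uses two small stand-in tables (`orders_demo` and `customers_demo`, hypothetical names with illustrative values) and derives the inner and left row counts from a single outer join.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'OrderID': [1, 2, 3],
    'CustomerID': ['ALFKI', 'ANATR', 'CHOPS'],  # CHOPS has no customer row
})
customers_demo = pd.DataFrame({
    'CustomerID': ['ALFKI', 'ANATR', 'BONAP'],  # BONAP has no orders
    'CompanyName': ['Alfreds', 'Ana Trujillo', 'Bon app'],
})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'.
outer = orders_demo.merge(customers_demo, on='CustomerID', how='outer', indicator=True)
counts = outer['_merge'].astype(str).value_counts().to_dict()
print(counts)

inner_rows = counts.get('both', 0)                        # matched on both sides
left_rows = inner_rows + counts.get('left_only', 0)       # plus unmatched orders
outer_rows = left_rows + counts.get('right_only', 0)      # plus unmatched customers
print(inner_rows, left_rows, outer_rows)
```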

### B2. Anti‑join & keys that didn't match

Find orders without a matching customer (data quality check)

In [None]:
# orders without a matching customer (left rows with NaN on right)
anti = left[left['CompanyName'].isna()][['OrderID', 'CustomerID']]

print("Orders without matching customers (anti-join):")
display(anti)

if len(anti) > 0:
    print(f"\nFound {len(anti)} order(s) without matching customer records!")
else:
    print("\nAll orders have matching customer records.")

**Checkpoint:** What business question would a left join answer that an inner join wouldn't?

A left join helps answer:
- "Are there orders in our system from customers that don't exist in our customer table?"
- "What is the data quality issue rate (orphaned orders)?"
- "Do we have referential integrity problems?"

Inner join would silently drop these problematic records, hiding data quality issues.
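
Testing a right-hand column for NaN works only if that column is never null in the customer table itself. A more robust variant, sketched here on stand-in data (`orders_demo` and `customers_demo` are hypothetical), uses `indicator=True` so the check depends on whether the key matched, not on the column's contents.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'OrderID': [1, 2, 3],
    'CustomerID': ['ALFKI', 'ANTON', 'CHOPS'],
})
# ANTON exists but has a missing CompanyName; CHOPS is absent entirely.
customers_demo = pd.DataFrame({
    'CustomerID': ['ALFKI', 'ANTON'],
    'CompanyName': ['Alfreds', None],
})

merged = orders_demo.merge(customers_demo, on='CustomerID', how='left', indicator=True)

# Null test: also flags ANTON, whose customer row does exist.
by_null = merged.loc[merged['CompanyName'].isna(), 'CustomerID'].tolist()
# Indicator test: flags only keys with no match on the right.
by_indicator = merged.loc[merged['_merge'] == 'left_only', 'CustomerID'].tolist()

print(by_null)       # ['ANTON', 'CHOPS']
print(by_indicator)  # ['CHOPS']
```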

### B3. Guard against fan‑out (duplicated keys)

Use `validate=` to catch unexpected cardinality issues

In [None]:
# Introduce a duplicate key to illustrate
cust_dupe = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)

print("Customer data with duplicated key:")
display(cust_dupe)

print("\nAttempting merge with validate='many_to_one':")
try:
    result = orders.merge(cust_dupe, on='CustomerID', how='inner', validate='many_to_one')
    display(result)
except pd.errors.MergeError as e:
    print(f"✓ Validation caught the issue: {e}")
    print("\nThis prevents accidental fan-out joins that would duplicate rows!")

**Validation Options:**
- `validate='one_to_one'` - Both keys must be unique
- `validate='one_to_many'` - Left key unique, right can have duplicates
- `validate='many_to_one'` - Right key unique, left can have duplicates
- `validate='many_to_many'` - Both sides can have duplicates (use with caution!)
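
To see what `validate=` protects against, the sketch below runs the same merge without validation and measures the damage. It uses made-up tables (`orders_demo`, `customers_dupe`), and checking keys with `duplicated` before merging is one possible pre-check, not the only one.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'OrderID': [1, 2, 3],
    'CustomerID': ['ALFKI', 'ALFKI', 'ANATR'],
    'Freight': [10.0, 20.0, 5.0],
})
# ALFKI appears twice on the side that should be unique.
customers_dupe = pd.DataFrame({
    'CustomerID': ['ALFKI', 'ALFKI', 'ANATR'],
    'CompanyName': ['Alfreds', 'Alfreds', 'Ana Trujillo'],
})

# Without validate=, each ALFKI order matches both ALFKI customer rows.
fanned = orders_demo.merge(customers_dupe, on='CustomerID', how='inner')
print(len(orders_demo), '->', len(fanned), 'rows')
print(orders_demo['Freight'].sum(), '->', fanned['Freight'].sum(), 'total freight')

# Pre-check: list the keys that break the uniqueness assumption.
is_dupe = customers_dupe['CustomerID'].duplicated(keep=False)
dupe_keys = customers_dupe.loc[is_dupe, 'CustomerID'].unique().tolist()
print('Duplicated keys:', dupe_keys)

# With validate=, the same merge fails fast instead of inflating metrics.
try:
    orders_demo.merge(customers_dupe, on='CustomerID', validate='many_to_one')
    caught = False
except pd.errors.MergeError:
    caught = True
print('MergeError raised:', caught)
```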

### B4. Column name collisions & suffixes

Handle cases where both sides share column names

In [None]:
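# When both sides share a non-key column name, rename one side or pass suffixes=.
# Here the names already differ (ShipCountry vs Country); the rename just makes the origin explicit.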
# Option 1: Rename before merge
joined = orders.merge(
    customers.rename(columns={'Country': 'CustCountry'}),
    on='CustomerID',
    how='inner'
)

print("Join with renamed column to avoid collision:")
display(joined.filter(items=['CustomerID', 'ShipCountry', 'CustCountry']).head())

# Option 2: Use the suffixes parameter
# Note: suffixes only take effect when non-key column names actually collide
print("\nAlternatively, pass suffixes=(...) to merge to rename colliding columns automatically.")
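
A short sketch of an actual collision follows. It assumes a hypothetical variant of the tables in which both sides have a column named `Country`. The `_ship` and `_cust` suffixes are illustrative choices.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'OrderID': [1, 2],
    'CustomerID': ['ALFKI', 'ANATR'],
    'Country': ['USA', 'DE'],          # hypothetical: ship-to country
})
customers_demo = pd.DataFrame({
    'CustomerID': ['ALFKI', 'ANATR'],
    'Country': ['Germany', 'Mexico'],  # customer's home country
})

# Default behaviour: pandas appends _x (left) and _y (right).
default = orders_demo.merge(customers_demo, on='CustomerID')
print(list(default.columns))

# Explicit suffixes make the origin of each column readable.
named = orders_demo.merge(customers_demo, on='CustomerID', suffixes=('_ship', '_cust'))
print(list(named.columns))
```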

---

## Part C — End‑to‑End Mini Task: Customer Segments & Country Rollups

> Goal: Build per‑customer metrics and join to customer attributes for LLM‑ready features.

### C1. Per‑customer aggregates

Calculate metrics for each customer

In [None]:
# Per-customer aggregates
per_cust = (
    orders
    .groupby('CustomerID', as_index=False)
    .agg(
        n_orders=('OrderID', 'count'),
        freight_mean=('Freight', 'mean'),
        freight_sum=('Freight', 'sum')
    )
)

print("Per-customer aggregates:")
display(per_cust)

### C2. Join with customers (inner vs left) and create segments

Enrich customer metrics with customer attributes and create spend segments

In [None]:
# Join with customers using different strategies
per_cust_inner = per_cust.merge(customers, on='CustomerID', how='inner', validate='one_to_one')
per_cust_left = per_cust.merge(customers, on='CustomerID', how='left', validate='one_to_one')

print(f"Inner join: {len(per_cust_inner)} customers")
print(f"Left join: {len(per_cust_left)} customers")

# Create spend segments
bins = [0, 20, 50, np.inf]
labels = ['low', 'mid', 'high']
seg = pd.cut(per_cust_inner['freight_sum'], bins=bins, labels=labels, right=False)
per_cust_inner = per_cust_inner.assign(spend_segment=seg)

print("\nPer-customer data with spend segments:")
display(per_cust_inner)
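
With `right=False`, each bin includes its left edge and excludes its right edge, so the segments are [0, 20), [20, 50), and [50, inf). The sketch below puts a few made-up totals on and near the boundaries to show how the assignment differs from the default `right=True`.

```python
import numpy as np
import pandas as pd

bins = [0, 20, 50, np.inf]
labels = ['low', 'mid', 'high']
totals = pd.Series([5.0, 20.0, 49.9, 50.0])  # illustrative values on the edges

left_closed = pd.cut(totals, bins=bins, labels=labels, right=False)
right_closed = pd.cut(totals, bins=bins, labels=labels, right=True)

print(pd.DataFrame({
    'freight_sum': totals,
    'right=False': left_closed,   # low, mid, mid, high
    'right=True': right_closed,   # low, low, mid, mid
}))
```

A total of exactly 20 or 50 moves up a segment under `right=False`, so the choice matters when thresholds are round numbers.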

### C3. Country rollup for reporting

Aggregate metrics at the country level

In [None]:
# Country rollup for reporting
country_rollup = (
    per_cust_inner
    .groupby('Country', as_index=False)
    .agg(
        customers=('CustomerID', 'count'),
        orders=('n_orders', 'sum'),
        freight_sum=('freight_sum', 'sum')
    )
    .sort_values('orders', ascending=False)
)

print("Country rollup:")
display(country_rollup)

**Checkpoint:** Which join choice (inner/left) makes more sense for this segment report and why?

For a **segment report**, **inner join** is preferred because:
- We want to analyze only customers who have complete information
- Segments require customer attributes (like Country) to be meaningful
- Incomplete customer records would create misleading segments
- This is about analyzing realized, valid business relationships

**Left join** would be better if:
- We wanted to identify data quality issues
- We needed to report on ALL orders regardless of customer data completeness
- We were doing an audit or diagnostic report
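
The trade-off can be made concrete by comparing totals. In this sketch (stand-in tables, with `'Unknown'` as an arbitrary placeholder label), the inner join drops the unmatched customer's freight from the rollup, while a left join followed by `fillna` keeps it in a separate bucket.

```python
import pandas as pd

per_cust_demo = pd.DataFrame({
    'CustomerID': ['ALFKI', 'BERGS', 'CHOPS'],
    'freight_sum': [82.1, 80.0, 22.0],
})
# CHOPS has no customer record.
customers_demo = pd.DataFrame({
    'CustomerID': ['ALFKI', 'BERGS'],
    'Country': ['Germany', 'Sweden'],
})

inner_total = per_cust_demo.merge(
    customers_demo, on='CustomerID', how='inner', validate='one_to_one'
)['freight_sum'].sum()

left = per_cust_demo.merge(
    customers_demo, on='CustomerID', how='left', validate='one_to_one'
)
# groupby drops NaN keys by default, so label the unmatched rows first.
rollup = (
    left.assign(Country=left['Country'].fillna('Unknown'))
    .groupby('Country', as_index=False)
    .agg(customers=('CustomerID', 'count'), freight_sum=('freight_sum', 'sum'))
)
left_total = rollup['freight_sum'].sum()

print(rollup)
print(f"inner total: {inner_total:.1f}, left total: {left_total:.1f}")
```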

---

## Part D — Bonus (Optional) — Use partitioned `orders` Parquet

If you have artifacts from previous labs, load and process them

In [None]:
# Check for partitioned parquet files from previous labs
p = Path('artifacts/parquet/orders')

if p.exists():
    print(f"Found partitioned parquet directory: {p}")
    files = sorted(p.glob('shipcountry=*.parquet'))
    print(f"Found {len(files)} partition files")
    
    if files:
        # Load all partitions
        df = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
        print(f"\nLoaded {len(df)} rows from partitioned data")
        
        # Repeat per-customer aggregates on larger dataset
        per_cust_bonus = (
            df
            .groupby('CustomerID', as_index=False)
            .agg(
                n_orders=('OrderID', 'count'),
                freight_mean=('Freight', 'mean'),
                freight_sum=('Freight', 'sum')
            )
        )
        
        print("\nPer-customer aggregates from partitioned data:")
        display(per_cust_bonus.head(3))
    else:
        print("No partition files found in the directory")
else:
    print(f"Partitioned parquet directory not found: {p}")
    print("Skipping bonus section - using sample data from earlier sections")

---

## Part E — Wrap‑Up

### Summary & Key Learnings

#### 1. When to use Inner vs Left joins

**Inner join is preferred when:**
- Analyzing realized business transactions (e.g., orders with valid customers)
- Building reports/metrics that require complete information from both sides
- Creating customer segments where all attributes must be present
- Example: "Calculate total revenue by customer country" - requires valid customer records

**Left join is required when:**
- Performing data quality checks (finding orphaned records)
- Preserving every record from the primary (left) table so that unmatched rows are not silently dropped
- Anti-join patterns to find missing relationships
- Example: "Find all orders that don't have matching customer records" - reveals data integrity issues

#### 2. Using validate= to catch fan-out

The `validate=` parameter prevents accidental data duplication:

In [None]:
# Example: Catch fan-out before it pollutes metrics
print("Safe merge with cardinality validation:")
print("""\n# This will catch if customers table has duplicate CustomerIDs
safe_join = orders.merge(
    customers,
    on='CustomerID',
    how='inner',
    validate='many_to_one'  # Orders: many, Customers: one (unique)
)\n""")

print("Without validation, a duplicate in customers would:")
print("  - Silently duplicate order rows")
print("  - Inflate revenue/freight metrics")
print("  - Create incorrect aggregate counts")
print("\nWith validation, you get an immediate error message!")

#### 3. Export final datasets to Parquet

Save our processed datasets for downstream labs

In [None]:
# Create output directory
out = Path('artifacts/clean')
out.mkdir(parents=True, exist_ok=True)

# Export per-customer data
per_cust_path = out / 'per_customer.parquet'
pq.write_table(
    pa.Table.from_pandas(per_cust_inner, preserve_index=False),
    per_cust_path
)
print(f"✓ Exported per-customer data to: {per_cust_path}")

# Export country rollup
country_path = out / 'country_rollup.parquet'
pq.write_table(
    pa.Table.from_pandas(country_rollup, preserve_index=False),
    country_path
)
print(f"✓ Exported country rollup to: {country_path}")

# Verify files were created
print("\nVerification:")
print(f"  per_customer.parquet: {per_cust_path.stat().st_size} bytes")
print(f"  country_rollup.parquet: {country_path.stat().st_size} bytes")

---

## Common Pitfalls & How to Avoid Them

### 1. Using `count` vs `size` incorrectly

- `count()` skips NaN in the counted column
- `size()` counts all rows regardless of NaN values
- Choose based on whether you want to include/exclude missing values

In [None]:
# Demonstration
demo_df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [1, None, 2, 3]
})

print("Demo data with NaN:")
display(demo_df)

print("\n.size() - counts all rows:")
display(demo_df.groupby('group').size())

print("\n.count() - ignores NaN:")
display(demo_df.groupby('group')['value'].count())

### 2. Forgetting `validate=` and accidentally creating a fan‑out join

Always use `validate=` when you know the expected cardinality:

In [None]:
print("Best practices for validate parameter:")
print("""\n1. Orders → Customers: validate='many_to_one'
   (many orders per customer, one customer record)

2. Customer → Orders: validate='one_to_many'
   (one customer, many possible orders)

3. Order → OrderDetails: validate='one_to_many'
   (one order, many line items)

4. User → Profile: validate='one_to_one'
   (one user, one profile)\n""")

### 3. Relying on outer joins for metrics

- **Outer joins** add a NaN-filled row for every unmatched key on either side, so row-based counts can include entities with no transactions
- Prefer **inner joins** for realized facts (actual transactions)
- Use **left joins** for QA/auditing purposes
- Reserve **outer joins** for specific reconciliation tasks
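
The inflation is easy to reproduce. In this sketch (stand-in tables with illustrative values), a customer who never ordered still contributes one row to the outer join, so a row-based count reports one "order" for them while a non-null count reports zero.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'OrderID': [1, 2, 3],
    'CustomerID': ['ALFKI', 'ALFKI', 'ANATR'],
})
customers_demo = pd.DataFrame({
    'CustomerID': ['ALFKI', 'ANATR', 'BONAP'],  # BONAP never ordered
})

outer = orders_demo.merge(customers_demo, on='CustomerID', how='outer')
print(len(orders_demo), 'orders ->', len(outer), 'rows after outer join')

per_cust_outer = outer.groupby('CustomerID').agg(
    rows=('OrderID', 'size'),     # counts BONAP's NaN placeholder row too
    orders=('OrderID', 'count'),  # counts real orders only
)
print(per_cust_outer)
```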

### 4. Column name collisions

Two strategies to handle conflicting column names:

In [None]:
print("Strategy 1: Rename columns before merge")
print("""\ncustomers.rename(columns={'Country': 'CustCountry'})\n""")

print("Strategy 2: Use suffixes parameter")
print("""\ndf.merge(other, on='key', suffixes=('_left', '_right'))\n""")

print("Recommendation: Rename before merge for clarity and explicit control")

---

## Solution Reference

### Named aggregations pattern:
```python
orders.groupby('CustomerID', as_index=False).agg(
    n_orders=('OrderID', 'count'),
    freight_mean=('Freight', 'mean')
)
```

### Cardinality checks:
```python
orders.merge(
    customers,
    on='CustomerID',
    how='inner',
    validate='many_to_one'
)
```

### Anti‑join pattern:
```python
orders.merge(
    customers,
    on='CustomerID',
    how='left',
    indicator=True
).query("_merge == 'left_only'")
```

---

## Lab Complete! 🎉

You have successfully:
- ✓ Mastered groupby with named aggregations
- ✓ Understood different join types and their use cases
- ✓ Implemented cardinality validation to prevent data quality issues
- ✓ Built customer segments and country rollups
- ✓ Exported clean datasets for downstream use

The exported Parquet files in `artifacts/clean/` are ready for use in subsequent labs!