# FIT5202 – Big Data  
## Week 4a & 4b – Parallel Sort & Parallel GroupBy

---

## 0) Notation & Variable Glossary (used throughout formulas)

- **N** — number of processors.
- **P** — page size in bytes.
- **R** — size of relation R in bytes.
- **|R|** — number of records in relation R.
- **Ri** — fragment of R at processor *i* (bytes).
- **|Ri|** — number of records in Ri.
- **IO** — time to read/write one page from/to disk.
- **tr** — CPU time to read one record from memory into a tuple buffer.
- **tw** — CPU time to write one record to memory/output.
- **mp** — per-page network message protocol cost.
- **ml** — per-page network message latency cost.
- **ts** — CPU time to sort one record in memory.
- **B** — number of available memory buffers per processor (pages).
- **G** — number of groups after grouping (GroupBy result cardinality).
- **σg** — group selectivity ratio (fraction of records producing distinct groups).
- **πR** — projectivity ratio for R after grouping.
- **M** — main memory available for sort or aggregation (bytes).

---

## 1) Parallel Sort – Overview

### Goal
Sort a large relation R across **multiple processors** to reduce elapsed time.

### High-level steps (common pattern)
1. **Data partitioning** – give each processor a slice of the data that it is responsible for sorting.
2. **Local sort** – each processor sorts its data in memory (with external sort if needed).
3. **Merge** – combine sorted partitions into final global order.

---

## 2) Partitioning Methods in Parallel Sort

### 2.1 Range Partitioning
- Decide split points in key space.
- Each processor gets records in its range.
- **Problem:** *Data skew* if key distribution is uneven.
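
The range assignment and its skew problem can be illustrated with a minimal Python sketch. The function name `range_partition`, the split points, and the sample keys are all hypothetical and chosen only to make the skew visible; this is not taken from the unit material.

```python
from bisect import bisect_right

def range_partition(keys, split_points):
    """Assign each key to a partition using sorted split points.

    Partition i holds keys k with split_points[i-1] <= k < split_points[i].
    """
    partitions = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        # bisect_right counts how many split points are <= key,
        # which is exactly the index of the target partition.
        partitions[bisect_right(split_points, key)].append(key)
    return partitions

# Hypothetical skewed data: most keys fall in the lowest range.
keys = [1, 2, 2, 3, 3, 3, 4, 50, 99]
parts = range_partition(keys, split_points=[33, 66])
sizes = [len(p) for p in parts]
print(sizes)  # [7, 1, 1]
```

With equal-width ranges over this key set, the first processor receives seven of the nine records, which is the skew described above.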

### 2.2 Hash Partitioning
- Hash the key to assign to processors.
- Produces balanced load if hash is uniform.
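- **Problem:** Partitions do not correspond to key ranges, so concatenating the locally sorted partitions does not give a globally sorted result; a final merge is required.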
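
A small sketch, assuming integer keys and Python's built-in `hash`, shows why hash partitioning needs a merge step. The names `hash_partition` and `concatenated` are illustrative only.

```python
import heapq

def hash_partition(keys, n_processors):
    """Assign each key to a processor by hashing it (integer keys assumed)."""
    partitions = [[] for _ in range(n_processors)]
    for key in keys:
        partitions[hash(key) % n_processors].append(key)
    return partitions

keys = [8, 3, 5, 1, 4, 7, 2, 6, 9]
# Each processor sorts its own partition locally.
parts = [sorted(p) for p in hash_partition(keys, 3)]

# Concatenating the partitions is NOT globally sorted ...
concatenated = [k for p in parts for k in p]
# ... so a merge over the sorted partitions is still needed.
merged = list(heapq.merge(*parts))
```

Here the load is balanced (three keys per processor), but every partition spans the whole key range, so only the merged list is in global order.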

### 2.3 Sampling for Range Partition Boundaries
- Randomly sample keys → sort sample → pick split points so partitions are balanced.
- Reduces the skew that naive (for example, equal-width) ranges would produce.
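
One way to sketch the sample, sort, and pick procedure is shown below. The function `sample_split_points`, the sample size, and the fixed seed are assumptions made for illustration; a real system would choose the sample size based on the data volume.

```python
import random

def sample_split_points(keys, n_processors, sample_size, seed=0):
    """Pick n_processors - 1 split points from a sorted random sample."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Take evenly spaced quantiles of the sample as range boundaries.
    return [sample[(i * len(sample)) // n_processors]
            for i in range(1, n_processors)]

# Hypothetical skewed keys: 90 small values and 10 large ones.
keys = list(range(90)) + list(range(1000, 1010))
splits = sample_split_points(keys, n_processors=4, sample_size=20)
```

Because the boundaries follow the sampled distribution rather than the key domain, most split points land among the small keys, where most of the data is.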

---

## 3) Parallel Sort Cost Model (Range-partition-based)

### Phase 1 – Local Sort

**Scan cost**
$$
\frac{R_i}{P} \times IO
$$  
*Meaning:* Read the local fragment from disk into memory.  

**Select cost**
$$
|R_i| \times (t_r + t_w)
$$  
*Meaning:* CPU read/write of all records in memory buffers.  

**Sort cost**
$$
|R_i| \times t_s
$$  
*Meaning:* CPU cost to sort all records locally.
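
As a worked example, the three Phase 1 terms can be evaluated with a small helper. The function name and all parameter values below are hypothetical and use abstract time units; they are chosen only to make the arithmetic easy to follow.

```python
def local_sort_cost(ri_bytes, ri_records, page_bytes, io, t_r, t_w, t_s):
    """Evaluate the Phase 1 terms for one processor."""
    scan = (ri_bytes / page_bytes) * io    # (R_i / P) x IO
    select = ri_records * (t_r + t_w)      # |R_i| x (t_r + t_w)
    sort = ri_records * t_s                # |R_i| x t_s
    return {"scan": scan, "select": select, "sort": sort,
            "total": scan + select + sort}

# Assumed values: a 100-page fragment of 4096-byte pages holding 1000 records.
cost = local_sort_cost(ri_bytes=100 * 4096, ri_records=1000, page_bytes=4096,
                       io=10, t_r=1, t_w=1, t_s=2)
# scan = 100 * 10 = 1000, select = 1000 * 2 = 2000, sort = 1000 * 2 = 2000
```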

---

### Phase 2 – Redistribution (send sorted partitions to correct processors)

**Transfer cost**
$$
\frac{R_i}{P} \times (N-1) \times (m_p + m_l)
$$  
*Meaning:* Each processor sends part of its data to all other processors (N-1 destinations).

**Receive cost**
$$
\left( \frac{R}{P} - \frac{R_i}{P} \right) \times m_p
$$  
*Meaning:* Receive all pages from others, excluding your own.

**Slide question:** *Why `(mp)` only for receive cost?*  
**Answer:** Latency `ml` is charged at sender side; receiver only pays protocol handling per page.  
**Reasoning:** Avoid double-counting network latency.
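
The asymmetry between the two formulas can be checked numerically. The helper below is a sketch with assumed parameter values in abstract time units; note that `m_l` appears only in the transfer term.

```python
def redistribution_cost(r_bytes, ri_bytes, page_bytes, n, m_p, m_l):
    """Evaluate the Phase 2 transfer and receive terms for one processor."""
    # Sender pays protocol and latency cost per page: (R_i/P) x (N-1) x (m_p + m_l)
    transfer = (ri_bytes / page_bytes) * (n - 1) * (m_p + m_l)
    # Receiver pays protocol cost only: (R/P - R_i/P) x m_p
    receive = (r_bytes / page_bytes - ri_bytes / page_bytes) * m_p
    return transfer, receive

# Assumed values: R = 400 pages, R_i = 100 pages, N = 4 processors.
transfer, receive = redistribution_cost(r_bytes=400 * 4096, ri_bytes=100 * 4096,
                                        page_bytes=4096, n=4, m_p=2, m_l=3)
# transfer = 100 * 3 * 5 = 1500, receive = (400 - 100) * 2 = 600
```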

---

### Phase 3 – Merge locally received partitions

**Scan cost**
$$
\frac{R}{P} \times IO
$$  
*Meaning:* Read all your received data pages from disk.

**Merge CPU cost**
$$
|R| \times (t_r + t_w)
$$  
*Meaning:* Read & write all records while merging.
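
The merge itself can be sketched as a k-way merge over already-sorted runs. This example uses Python's standard `heapq.merge`; the run contents are made up, and a real system would stream pages from disk rather than hold lists in memory.

```python
import heapq

def merge_sorted_runs(runs):
    """Merge several individually sorted runs into one sorted list."""
    # heapq.merge reads each record once and emits it once,
    # which matches the per-record (t_r + t_w) term in the merge cost.
    return list(heapq.merge(*runs))

# Hypothetical sorted runs received from three processors.
runs = [[3, 14, 27], [1, 9, 30], [5, 6, 40]]
merged = merge_sorted_runs(runs)
print(merged)  # [1, 3, 5, 6, 9, 14, 27, 30, 40]
```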

---

## 4) Parallel Sort Optimizations

- **Main memory merge**: Keep incoming partitions in memory to skip merge I/O if M is large enough.
- **Balanced partitioning**: Use sampling to avoid skew.

---

## 5) Parallel GroupBy – Overview

### Goal
Aggregate data into groups in parallel.

### High-level steps
1. **Partition by group key** – hash or range.
2. **Local aggregation** – aggregate per processor.
3. **Merge partial aggregates** – final group results.
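
The three steps above can be sketched for a `SUM` aggregate as follows. This is a single-process simulation under stated assumptions: the function name `parallel_group_sum` and the sample records are hypothetical, and `zlib.crc32` stands in for whatever hash function a real system would use.

```python
import zlib
from collections import defaultdict

def parallel_group_sum(records, n_processors):
    """Simulate partition -> local aggregation -> merge for SUM."""
    # Step 1: partition by group key, so each key lives on exactly one processor.
    partitions = [[] for _ in range(n_processors)]
    for key, value in records:
        target = zlib.crc32(key.encode()) % n_processors
        partitions[target].append((key, value))

    # Step 2: each processor aggregates its own partition.
    local_results = []
    for part in partitions:
        sums = defaultdict(int)
        for key, value in part:
            sums[key] += value
        local_results.append(dict(sums))

    # Step 3: merge. Key sets are disjoint across processors, so this is a union.
    final = {}
    for result in local_results:
        final.update(result)
    return final

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
totals = parallel_group_sum(records, n_processors=3)
```

Because partitioning is done on the group key, no group is split across processors and the final step needs no further combining.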

---

## 6) Parallel GroupBy Cost Model (Hash-based)

### Phase 1 – Data Loading

**Scan cost**
$$
\frac{R_i}{P} \times IO
$$  
*Meaning:* Read local fragment from disk.

**Select cost**
$$
|R_i| \times (t_r + t_w)
$$  
*Meaning:* CPU read/write from memory buffers.

---

### Phase 2 – Redistribution by Group Key

**Transfer cost**
$$
\frac{R_i}{P} \times (N-1) \times (m_p + m_l)
$$  
*Meaning:* Send group-key partitions to all other processors.

**Receive cost**
$$
\left( \frac{R}{P} - \frac{R_i}{P} \right) \times m_p
$$  
*Meaning:* Receive other processors’ partitions.  

**Slide question:** *Why might we only send group key and aggregate value during redistribution?*  
**Answer:** To reduce data volume.  
**Reasoning:** Full tuples are not needed for final aggregation; only key and partial aggregate suffice.

---

### Phase 3 – Local Aggregation of Received Data

**CPU aggregation cost**
$$
|R| \times (t_r + t_w)
$$  
*Meaning:* Read and combine partial aggregates.

**Output size**
$$
G \times \text{(size per group)}
$$  
*Meaning:* Final output depends on number of groups G.

---

## 7) Parallel GroupBy Optimizations

- **Pre-aggregation** before send: Reduce the number of tuples sent over the network by combining records that share a group key locally.
- **Two-phase aggregation**: Local → redistribute → final aggregation.
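
A minimal simulation of two-phase aggregation for `COUNT` illustrates how pre-aggregation shrinks the redistributed volume. The names `owner` and `two_phase_count` and the sample fragments are hypothetical; `sent` counts every partial aggregate that is routed, including those whose owner happens to be the local processor.

```python
import zlib
from collections import Counter

def owner(key, n_processors):
    """Processor responsible for the final aggregate of this key."""
    return zlib.crc32(key.encode()) % n_processors

def two_phase_count(fragments):
    """Local aggregation -> redistribute partials -> final aggregation."""
    n = len(fragments)
    # Phase 1: each processor pre-aggregates its own fragment.
    partials = [Counter(fragment) for fragment in fragments]

    # Redistribution: route (key, partial count) pairs instead of raw records.
    inbox = [[] for _ in range(n)]
    sent = 0
    for partial in partials:
        for key, count in partial.items():
            inbox[owner(key, n)].append((key, count))
            sent += 1

    # Phase 2: each owner combines the partial counts it received.
    final = Counter()
    for messages in inbox:
        for key, count in messages:
            final[key] += count
    return dict(final), sent

fragments = [["x", "x", "y", "x"], ["y", "y", "x", "z"]]
result, sent = two_phase_count(fragments)
raw_records = sum(len(f) for f in fragments)
```

In this example 8 raw records shrink to 5 partial aggregates before redistribution; the saving grows as the number of records per group increases.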

---

## 8) Slide Questions with Answers

- **Q:** Why does range partitioning risk skew?  
  **A:** If key distribution is uneven, some processors get more data.  
  **Reasoning:** Many records may have keys that fall into a single range, so the processor owning that range is overloaded.

- **Q:** Why use sampling before setting range split points?  
  **A:** To estimate key distribution and choose balanced split points.  
  **Reasoning:** Reduces data skew without scanning all the data.

- **Q:** In GroupBy, why might final aggregation be faster than initial local aggregation?  
  **A:** Fewer records to process (only partial aggregates).  
  **Reasoning:** Local phase deals with all input; final phase deals with already reduced data.

---

## 9) Summary

- **Parallel Sort**:
  - Partition → Local sort → Merge.
  - Watch for skew in range partitioning.
  - Sampling helps balance.
  - Cost model includes scan, select, sort, transfer, receive, merge.

- **Parallel GroupBy**:
  - Partition → Local aggregate → Merge aggregates.
  - Pre-aggregation saves network bandwidth.
  - Cost model includes scan, select, transfer, receive, and aggregation, mirroring the structure of the parallel join cost model.

---
