# FIT5202 – Big Data  
## Week 3a & 3b – Parallel Join & Parallel Outer Join

---

## 1. Revision
- **Inter-query parallelism**: Multiple different queries run at the same time.
- **Intra-query parallelism**: Single query runs in parallel across multiple processors.
- **Hash data partitioning + discrete range search** → each value in the range is hashed, so only the processors holding those values are used.

---

## 2. Join Operations
- Links two tables based on matching attributes.
- **Types**:
  - **Inner Join** – only matching records.
  - **Outer Join** – matching + unmatched records from one/both tables.

---

## 3. Serial Join Algorithms
1. **Nested-Loop Join**  
   For each record in R, scan all records in S.
2. **Sort-Merge Join**  
   Sort both tables by join attribute(s), then merge.
3. **Hash-Based Join**  
   Hash R and S into buckets using same hash function, then join matching buckets.
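
The first two algorithms can be sketched in a few lines of Python. This is a minimal in-memory illustration, assuming each table is a list of dictionaries; the function names and the `key_r`/`key_s` parameters are made up for the example.

```python
def nested_loop_join(r, s, key_r, key_s):
    """For each record in R, scan all records in S."""
    return [(a, b) for a in r for b in s if a[key_r] == b[key_s]]


def sort_merge_join(r, s, key_r, key_s):
    """Sort both inputs on the join attribute, then merge them."""
    r = sorted(r, key=lambda rec: rec[key_r])
    s = sorted(s, key=lambda rec: rec[key_s])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        a, b = r[i][key_r], s[j][key_s]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Locate the run of equal keys in S, then pair it with every
            # R record carrying the same key (this handles duplicates).
            j_end = j
            while j_end < len(s) and s[j_end][key_s] == a:
                j_end += 1
            while i < len(r) and r[i][key_r] == a:
                out.extend((r[i], s[k]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out
```

Nested-loop compares every pair of records, while sort-merge pays for the sort once and then walks each table a single time.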

---

## 4. Parallel Join Algorithms
- Achieved via **data parallelism**.
- Two stages:
  1. **Data Partitioning**
  2. **Local Join** (using any serial join algorithm)

### 4.1 Data Partitioning Methods
1. **Divide & Broadcast**
   - Divide one table into disjoint partitions (equal partitioning).
   - Broadcast the other table to all processors.
   - Best to **broadcast smaller table**.
   - No load imbalance, but heavy broadcast cost.
   - In shared-memory: no table replication, but may run into memory limits.

2. **Disjoint Partitioning**
   - Partition both tables by **range** or **hash**.
   - Each processor gets only relevant fragments.
   - Potential load imbalance due to **data skew**.
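
The two partitioning methods can be compared with a small simulation. This is a sketch only: "processors" are plain list slots processed one after another, the local join is a nested loop, and all function names are hypothetical.

```python
def divide_and_broadcast(r, s, n):
    """Divide R round-robin into n fragments; every processor gets all of S."""
    return [(r[i::n], list(s)) for i in range(n)]


def disjoint_hash_partition(r, s, key_r, key_s, n):
    """Send each record of R and S to the processor chosen by hashing its join key."""
    fragments = [([], []) for _ in range(n)]
    for rec in r:
        fragments[hash(rec[key_r]) % n][0].append(rec)
    for rec in s:
        fragments[hash(rec[key_s]) % n][1].append(rec)
    return fragments


def local_joins(fragments, key_r, key_s):
    """Stage 2: run a local join per processor and collect the results."""
    out = []
    for r_i, s_i in fragments:
        out.extend((a, b) for a in r_i for b in s_i if a[key_r] == b[key_s])
    return out


R = [{"id": i} for i in range(6)]                       # larger table, divided
S = [{"id": i, "v": i * 10} for i in (1, 3, 5, 7)]      # smaller table, broadcast
via_broadcast = local_joins(divide_and_broadcast(R, S, 3), "id", "id")
via_disjoint = local_joins(disjoint_hash_partition(R, S, "id", "id", 3), "id", "id")
```

Both strategies return the same join result. They differ in how much data each processor holds: the full copy of `S` under broadcast, versus only the matching hash bucket under disjoint partitioning.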

---

## 5. Cost Models for Parallel Join

### 5.1 Divide & Broadcast
**Phases**:
1. **Data Loading**  
   - **Scan cost**: `(Si / P) × IO`
   - **Select cost**: `|Si| × (tr + tw)`
2. **Data Broadcasting**  
   - Transfer cost: `(Si / P) × (N - 1) × (mp + ml)`
   - Receive cost: `(S/P - Si/P) × mp`
3. **Data Storing**  
   - Store cost: `(S/P - Si/P) × IO`
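
The formulas above can be evaluated directly. The notes do not define the symbols, so this sketch assumes the common reading: `Si`/`S` are sizes in bytes, `|Si|` is a record count, `P` is the page size, `N` is the number of processors, `IO` is the cost per page, `tr`/`tw` are per-record read/write costs, and `mp`/`ml` are per-page message protocol and latency costs. The parameter names and the sample numbers are illustrative, not measured.

```python
def divide_broadcast_cost(S, S_i, n_S_i, P, N, IO, tr, tw, mp, ml):
    """Per-processor cost of the divide-and-broadcast phases (assumed symbol meanings)."""
    cost = {
        "scan": (S_i / P) * IO,                      # load the local fragment
        "select": n_S_i * (tr + tw),                 # read and write each record
        "transfer": (S_i / P) * (N - 1) * (mp + ml), # send fragment to N - 1 others
        "receive": (S / P - S_i / P) * mp,           # receive the rest of S
        "store": (S / P - S_i / P) * IO,             # write received pages to disk
    }
    cost["total"] = sum(cost.values())
    return cost


# Made-up numbers: a 4000-byte table split over 4 processors, 100-byte pages.
example = divide_broadcast_cost(S=4000, S_i=1000, n_S_i=100, P=100, N=4,
                                IO=10, tr=1, tw=1, mp=2, ml=3)
```

With these inputs the store and select terms dominate, which matches the note that broadcasting is the expensive part of this method.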

### 5.2 Local Join (Hash-Based Example)
- **Phase 1: Loading**  
  - Scan cost: `((Ri / P) + (S / P)) × IO`  
  - Select cost: `(|Ri| + |S|) × (tr + tw)`
- **Phase 2: Join Processing**  
  - Join cost: `(|Ri| × (tr + th) + |S| × (tr + th + tj))`
  - Overflow buckets cost:  
    `(1 - min(H/|Ri|, 1)) × (Ri / P) × 2 × IO`
- **Phase 3: Storing Results**  
  - Generate result cost: `|Ri| × σj × |S| × tw`
  - Disk store cost: `(πR × |Ri| × σj × πS × |S| / P) × IO`
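
The local hash-join formulas can be coded the same way. This sketch assumes `H` is the number of records the in-memory hash table can hold, `th`/`tj` are per-record hash and compare costs, `σj` is the join selectivity, and `πR`/`πS` are projectivity ratios; none of these are defined in the notes, so treat them as assumptions.

```python
def local_hash_join_cost(R_i, S, n_R_i, n_S, P, IO, tr, tw, th, tj,
                         H, sigma_j, pi_R, pi_S):
    """Per-processor cost of a local hash join (assumed symbol meanings)."""
    # Fraction of Ri that does not fit in the hash table and overflows to disk.
    overflow_fraction = 1 - min(H / n_R_i, 1)
    return {
        "scan": (R_i / P + S / P) * IO,
        "select": (n_R_i + n_S) * (tr + tw),
        "join": n_R_i * (tr + th) + n_S * (tr + th + tj),
        "overflow": overflow_fraction * (R_i / P) * 2 * IO,  # write out, read back
        "generate": n_R_i * sigma_j * n_S * tw,
        "store": (pi_R * n_R_i * sigma_j * pi_S * n_S / P) * IO,
    }
```

The overflow term is zero whenever `H >= |Ri|`, which is why the smaller table should be the one that is hashed.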

---

## 6. Parallel Join Optimization
- **Main memory optimization**: avoid writing to disk between distribution & local join if possible.
- **Load balancing**: more fragments than processors → rearrange to avoid skew.
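
One way to rearrange fragments is a greedy largest-first assignment: always hand the next-biggest fragment to the least-loaded processor. The notes do not name a specific strategy, so this heuristic and the `assign_fragments` name are assumptions for illustration.

```python
import heapq


def assign_fragments(fragment_sizes, n_processors):
    """Greedily assign fragments (largest first) to the least-loaded processor."""
    heap = [(0, p) for p in range(n_processors)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_processors)}
    by_size = sorted(enumerate(fragment_sizes), key=lambda t: -t[1])
    for frag_id, size in by_size:
        load, p = heapq.heappop(heap)
        assignment[p].append(frag_id)
        heapq.heappush(heap, (load + size, p))
    return assignment


sizes = [9, 7, 6, 5, 4, 2]           # six skewed fragments, three processors
plan = assign_fragments(sizes, 3)
loads = {p: sum(sizes[f] for f in frags) for p, frags in plan.items()}
```

Having more fragments than processors is what gives the scheduler room to even out the load; with one fragment per processor there is nothing to rearrange.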

---

## 7. Parallel Outer Join Algorithms

### 7.1 ROJA (Redistribution Outer Join Algorithm)
- **Step 1**: Redistribute both R and S on join attribute.
- **Step 2**: Local outer join.
- **Pros**: Simple, fast (2 steps).
- **Cons**: Data skew, high communication cost.
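
The two ROJA steps can be sketched for a left outer join. This is a single-process simulation, assuming each table arrives as a list of per-processor fragments; the function name is hypothetical and unmatched rows are padded with `None`.

```python
def roja_left_outer_join(r_parts, s_parts, key_r, key_s, n):
    """ROJA sketch: redistribute R and S on the join attribute, then outer join locally."""
    # Step 1: redistribute both tables by hashing the join attribute.
    r_new = [[] for _ in range(n)]
    s_new = [[] for _ in range(n)]
    for part in r_parts:
        for rec in part:
            r_new[hash(rec[key_r]) % n].append(rec)
    for part in s_parts:
        for rec in part:
            s_new[hash(rec[key_s]) % n].append(rec)

    # Step 2: local left outer join on every processor.
    out = []
    for r_i, s_i in zip(r_new, s_new):
        for a in r_i:
            matches = [b for b in s_i if b[key_s] == a[key_r]]
            if matches:
                out.extend((a, b) for b in matches)
            else:
                out.append((a, None))  # dangling R record
    return out
```

Because matching keys always land on the same processor, a record with no local match has no match anywhere, so the local outer join is already globally correct.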

### 7.2 DOJA (Duplication Outer Join Algorithm)
- **Step 1**: Replicate small table.
- **Step 2**: Local inner join.
- **Step 3**: Redistribute inner join results by attribute X.
- **Step 4**: Local outer join.
- **Cons**: Expensive if "small" table is still large.

### 7.3 DER (Duplication & Efficient Redistribution)
- **Step 1**: Broadcast left table.
- **Step 2**: Local inner join.
- **Step 3**: Identify ROW IDs with no matches.
- **Step 4**: Redistribute only ROW IDs (not full rows).
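- **Step 5**: Keep only the ROW IDs that every processor reported as unmatched (these rows are dangling).
- **Step 6**: Inner join the surviving ROW IDs with the replicated left table to produce the NULL-padded rows.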
- **Pros**: Reduces data transfer compared to DOJA.
- **Cons**: Still has replication costs for large tables.
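
A rough simulation of DER for a left outer join is shown below. It rests on one reading of the later steps that the notes do not spell out: a left row is treated as dangling only when every processor reports its ROW ID as unmatched. The function name is hypothetical, list positions stand in for ROW IDs, and the redistribution is simulated with a shared counter.

```python
from collections import Counter


def der_left_outer_join(left, right_parts, key_l, key_r):
    """DER sketch: broadcast the left table, ship only unmatched row IDs."""
    n = len(right_parts)
    results = []
    unmatched = Counter()  # stands in for the redistributed row IDs

    for right_i in right_parts:          # Step 1: each processor sees all of `left`
        index = {}
        for b in right_i:
            index.setdefault(b[key_r], []).append(b)
        for row_id, a in enumerate(left):
            matches = index.get(a[key_l], [])
            results.extend((a, b) for b in matches)  # Step 2: local inner join
            if not matches:
                unmatched[row_id] += 1   # Steps 3-4: report row ID only, not the row

    # Step 5 (assumed): a row is dangling if all n processors reported it.
    dangling = [row_id for row_id, count in unmatched.items() if count == n]
    # Step 6: join the dangling row IDs back to the left table, padded with None.
    results.extend((left[row_id], None) for row_id in dangling)
    return results
```

Only small integers cross the "network" after the first step, which is where the saving over DOJA comes from.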

---

## 8. Load Balancing in Outer Joins
- **OJSO (Outer Join Skew Optimization)**:
  - Avoid redistributing dangling records.
  - Process dangling records locally when possible.
  - Reduces skew & network overhead.

---

## 9. Summary
- **Parallel Join**:
  - Data partitioning: Divide & Broadcast or Disjoint Partitioning.
  - Local join: Nested-loop, Sort-merge, or Hash-based.
- **Parallel Outer Join**:
  - Algorithms: ROJA, DOJA, DER.
  - Skew handling: OJSO.
- **Optimization Goals**:
  - Reduce disk I/O.
  - Avoid network bottlenecks.
  - Balance load among processors.

---


# Parallel Join
Join operations  
- link two tables based on nominated attributes, one from each table  

## Serial Join Algorithms  
- Nested-loop join algorithm  
    - For each record of table R, it goes through all records of table S  
- Sort-merge join algorithm  
    - Both tables must be sorted on join attribute  
    - Then merge both sorted tables  
- Hash-based join algorithm (Hash tables <-> Python dictionaries)  
    - The records of files R and S are both hashed to the same hash file, using the 
    same hashing function on the join attributes A of R and B of S as hash keys  
    - A single pass through the file with fewer records (say, R) hashes its records to 
    the hash file buckets  
    - A single pass through the other file (S) then hashes each of its records to the 
    appropriate bucket, where the record is combined with all matching records 
    from R  
    - Hash the smaller table, as it needs to fit into memory.
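
Since hash tables map onto Python dictionaries, the build and probe passes can be sketched directly. This is an in-memory illustration, assuming the first argument is the smaller table; `hash_join` and its parameter names are made up for the example.

```python
from collections import defaultdict


def hash_join(small, large, key_small, key_large):
    """Build a hash table on the smaller table, then probe it with the larger one."""
    # Build pass: one scan of the smaller table fills the buckets.
    buckets = defaultdict(list)
    for rec in small:
        buckets[rec[key_small]].append(rec)

    # Probe pass: one scan of the larger table looks up matching buckets.
    out = []
    for rec in large:
        for match in buckets.get(rec[key_large], []):
            out.append((match, rec))
    return out


R = [{"a": 1, "name": "x"}, {"a": 2, "name": "y"}]
S = [{"b": 2, "qty": 5}, {"b": 2, "qty": 7}, {"b": 9, "qty": 1}]
joined = hash_join(R, S, "a", "b")
```

Each table is read once, and only the smaller table has to fit in memory as the dictionary.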
    
## Parallel Join Algorithms  
Analogy:
- Divide-and-broadcast: You split one pile of documents evenly among the team and photocopy the other (smaller) pile so that everyone holds a full copy of it. (Fast to explain, expensive in paper/ink)
- Disjoint partitioning: You sort all documents by topic and give each topic to the right person. (Cheaper to copy, but requires sorting effort up front)

Stage 1: Data Partitioning
- Divide and Broadcast
    - Two stages: data partitioning using the divide and broadcast method, and a local join  
    - Divide and Broadcast method: Divide one table into multiple disjoint partitions, 
    where each partition is allocated a processor, and broadcast the other table to 
    all available processors  
    - Dividing one table can simply use equal division  
    - Broadcast means replicate the table to all processors  
    - Hence, choose the smaller table to broadcast and the larger table to divide  

    - No load imbalance problem, but the broadcasting method is inefficient
    - Workload imbalance can still occur if the divided table is already partitioned 
    unevenly (random-unequal partitioning)
    - If shared-memory is used, then there is no replication of the broadcast table. 
    Each processor will access the entire table S and a portion of table R. But if 
    each processor does not have enough working space, then the local join might 
    not be able to use a hash-based join

- Disjoint data partitioning