# Chapter 37: Concurrency and Parallel Algorithms

> *"Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once."* — Rob Pike

---

## 37.1 Introduction

As hardware evolves toward multi‑core processors and distributed systems become ubiquitous, writing correct and efficient concurrent and parallel programs is essential. This chapter explores data structures and algorithms designed to work in such environments. We will cover:

- **Lock‑free data structures** that use atomic operations to avoid the pitfalls of locking.
- **Concurrent hash maps** that allow multiple threads to read and write simultaneously.
- **Parallel sorting algorithms** that divide work across cores.
- The **MapReduce** programming model for large‑scale data processing.

### 37.1.1 Why Concurrency and Parallelism Matter

```
┌─────────────────────────────────────────────────────────────────────┐
│                    IMPORTANCE OF CONCURRENCY & PARALLELISM           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. PERFORMANCE: Utilise multiple cores to speed up computation.    │
│  2. SCALABILITY: Handle increasing workloads by adding resources.   │
│  3. RESPONSIVENESS: Keep UI responsive while background work runs.  │
│  4. THROUGHPUT: Process many requests simultaneously (servers).     │
│  5. FAULT TOLERANCE: Distribute computation across machines.        │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

**Key challenge:** Coordinating access to shared data without introducing bugs like race conditions, deadlocks, or livelocks.

---

## 37.2 Lock‑Free Data Structures

Lock‑free data structures guarantee that at least one thread makes progress even if others are delayed, and they avoid problems like deadlock and priority inversion. They rely on **atomic operations** provided by hardware (e.g., compare‑and‑swap, fetch‑and‑add).

### 37.2.1 Compare‑and‑Swap (CAS)

CAS is a CPU instruction that atomically compares the contents of a memory location to a given value and, if they are the same, modifies the memory location to a new value. It returns whether the swap succeeded.

```cpp
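// Semantics only: hardware performs this whole body as one indivisible step.
// Written as ordinary C++ like this, it would NOT be thread-safe.
bool compare_and_swap(int* p, int oldval, int newval) {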
    if (*p == oldval) {
        *p = newval;
        return true;
    }
    return false;
}
```

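In C++11, it is `std::atomic<T>::compare_exchange_strong` (or `compare_exchange_weak`, which may fail spuriously and is intended for retry loops). In Java, it is `AtomicInteger.compareAndSet`. Python's standard library exposes no CAS primitive; Python code typically coordinates through locks (`threading.Lock`) or higher‑level libraries like `multiprocessing` or `concurrent.futures`.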
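To make the retry pattern concrete, here is a minimal Python sketch. `AtomicInt` and `atomic_increment` are hypothetical names introduced for illustration, and the lock inside `AtomicInt` only emulates the all-or-nothing behaviour of a hardware CAS; it is not how a real atomic is implemented.

```python
import threading

class AtomicInt:
    """Hypothetical helper: emulates an atomic integer with a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # Write `new` only if the current value still equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_increment(counter):
    # Classic CAS loop: read, compute, try to publish, retry on conflict.
    while True:
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return old + 1

counter = AtomicInt(0)

def worker():
    for _ in range(1000):
        atomic_increment(counter)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())  # 4000
```

No increment is lost because a thread publishes its result only if nobody else changed the value in between; otherwise it rereads and tries again.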

### 37.2.2 The ABA Problem

Consider a thread that reads a shared variable `*p` and sees the value A, then later performs a CAS expecting it still to be A. Between the read and the CAS, another thread might change the value to B and then back to A. The CAS succeeds because the value matches, even though the structure may have changed in the meantime (e.g., the node was freed and its address reused for a new node). This is the **ABA problem**.

**Solutions:**
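- Pair the pointer with a version counter (a tagged pointer) and update both together, e.g., with a double‑width CAS, so that an A → B → A sequence still changes the compared value.
- Use hazard pointers or epoch‑based reclamation, so that a node is not freed and reused while another thread may still hold a pointer to it.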
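The version-counter idea can be sketched in a few lines of Python. `StampedRef` is a hypothetical helper written for this example; its internal lock stands in for a double-width CAS that updates value and stamp together.

```python
import threading

class StampedRef:
    """Hypothetical helper pairing a value with a version stamp."""

    def __init__(self, value):
        self._value = value
        self._stamp = 0
        self._lock = threading.Lock()  # emulates a double-width CAS

    def get(self):
        with self._lock:
            return self._value, self._stamp

    def compare_and_set(self, expected_value, expected_stamp, new_value):
        # Succeeds only if BOTH the value and the stamp are unchanged.
        with self._lock:
            if self._value == expected_value and self._stamp == expected_stamp:
                self._value = new_value
                self._stamp += 1
                return True
            return False

ref = StampedRef("A")
value, stamp = ref.get()            # "thread 1" reads ("A", 0)

# Meanwhile "thread 2" changes A -> B -> A.
ref.compare_and_set("A", 0, "B")
ref.compare_and_set("B", 1, "A")

# The value is "A" again, but the stamp moved on, so the stale CAS fails.
stale_ok = ref.compare_and_set(value, stamp, "C")
print(stale_ok)   # False
print(ref.get())  # ('A', 2)
```

A plain CAS on the value alone would have succeeded here; the stamp is what exposes the intervening updates.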

### 37.2.3 Lock‑Free Stack (Treiber Stack)

The Treiber stack is a classic lock‑free stack using a linked list and CAS.

```cpp
#include <atomic>

template<typename T>
class TreiberStack {
private:
    struct Node {
        T data;
        Node* next;
        Node(const T& d) : data(d), next(nullptr) {}
    };
    std::atomic<Node*> head;

public:
    TreiberStack() : head(nullptr) {}

    void push(const T& data) {
        Node* new_node = new Node(data);
        new_node->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(new_node->next, new_node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // retry if head changed
        }
    }

    bool pop(T& result) {
        Node* old_head = head.load(std::memory_order_relaxed);
        while (old_head &&
               !head.compare_exchange_weak(old_head, old_head->next,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
            // retry
        }
        if (old_head) {
            result = old_head->data;
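            // UNSAFE under concurrency: another thread may still be reading
            // old_head (e.g. old_head->next in its own pop). Production code
            // defers this delete via hazard pointers or epochs.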
            delete old_head;
            return true;
        }
        return false;
    }
};
```

**Note:** Memory reclamation is non‑trivial; simply deleting a node that might still be accessed by other threads is unsafe. Production implementations use hazard pointers, epoch‑based reclamation, or rely on garbage collection.
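For contrast, here is a sketch of the same stack in a garbage-collected setting, written in Python. `AtomicRef` and `GcTreiberStack` are hypothetical names, and `AtomicRef` emulates an atomic pointer with a lock purely for illustration. Because a popped node stays alive for as long as any thread still references it, no explicit `delete` is needed.

```python
import threading

class AtomicRef:
    """Hypothetical lock-based stand-in for an atomic pointer."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:   # identity comparison, like a pointer
                self._value = new
                return True
            return False

class Node:
    __slots__ = ("data", "next")

    def __init__(self, data, next_node):
        self.data = data
        self.next = next_node

class GcTreiberStack:
    def __init__(self):
        self._head = AtomicRef(None)

    def push(self, data):
        while True:
            old_head = self._head.load()
            # Link the new node in front of the current head, then publish it.
            if self._head.compare_and_swap(old_head, Node(data, old_head)):
                return

    def pop(self):
        while True:
            old_head = self._head.load()
            if old_head is None:
                return None               # empty stack
            if self._head.compare_and_swap(old_head, old_head.next):
                return old_head.data      # node is collected once unreferenced

stack = GcTreiberStack()
for x in (1, 2, 3):
    stack.push(x)
print([stack.pop() for _ in range(4)])  # [3, 2, 1, None]
```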

### 37.2.4 Lock‑Free Queue (Michael‑Scott Queue)

The Michael‑Scott queue is a lock‑free FIFO queue using a linked list with a dummy node and CAS on both head and tail.

**Simplified pseudocode (C++‑like):**

```cpp
#include <atomic>

template<typename T>
class MSQueue {
    struct Node {
        T data;
        std::atomic<Node*> next;
        Node(const T& d) : data(d), next(nullptr) {}
    };
    std::atomic<Node*> head;
    std::atomic<Node*> tail;

public:
    MSQueue() {
        Node* dummy = new Node(T());
        head.store(dummy, std::memory_order_relaxed);
        tail.store(dummy, std::memory_order_relaxed);
    }

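    void enqueue(const T& data) {
        Node* new_node = new Node(data);
        while (true) {
            Node* last = tail.load(std::memory_order_acquire);
            Node* next = last->next.load(std::memory_order_acquire);
            if (next != nullptr) {
                // tail is lagging: help advance it, then retry
                tail.compare_exchange_weak(last, next);
                continue;
            }
            // try to link the new node after the current last node
            if (last->next.compare_exchange_weak(next, new_node)) {
                // swing tail forward; failure means another thread helped
                tail.compare_exchange_strong(last, new_node);
                return;
            }
        }
    }

    bool dequeue(T& result) {
        while (true) {
            Node* first = head.load(std::memory_order_acquire);
            Node* last = tail.load(std::memory_order_acquire);
            Node* next = first->next.load(std::memory_order_acquire);
            if (next == nullptr) return false; // empty
            if (first == last) {
                // tail still points at the dummy: help advance it, then retry
                tail.compare_exchange_weak(last, next);
                continue;
            }
            T value = next->data; // read before the CAS publishes the new head
            if (head.compare_exchange_weak(first, next)) {
                result = value;
                // UNSAFE without a reclamation scheme: another thread may
                // still be reading the old dummy node
                delete first;
                return true;
            }
            // CAS failed: another thread dequeued first, so retry
        }
    }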
};
```

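As with the Treiber stack, the `delete` in `dequeue` is unsafe if another thread still holds a pointer to the old dummy node; real implementations pair the queue with hazard pointers, epoch‑based reclamation, or garbage collection.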

---

## 37.3 Concurrent Hash Maps

Concurrent hash maps allow multiple threads to insert, delete, and look up without blocking each other unnecessarily. Several strategies exist:

### 37.3.1 Coarse‑Grained Locking

Wrap the entire hash map with a single lock. Simple but poor scalability.
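A minimal sketch of this approach in Python follows. `CoarseLockedMap` and its `increment` method are hypothetical names chosen for the example; the point is that every operation, including compound read-modify-write updates, goes through the same single lock.

```python
import threading

class CoarseLockedMap:
    """Hypothetical example: one lock guards the whole dictionary."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def increment(self, key, delta=1):
        # The read and the write happen under one lock acquisition,
        # so concurrent increments cannot overwrite each other.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta
            return self._data[key]

m = CoarseLockedMap()

def worker():
    for _ in range(1000):
        m.increment("hits")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(m.get("hits"))  # 4000
```

Correctness is easy to reason about, but every thread queues on the same lock even when the threads touch unrelated keys.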

### 37.3.2 Fine‑Grained Locking (Striping)

Partition the hash table into segments (groups of buckets), each with its own lock. Threads lock only the segment they need, so operations on different segments proceed in parallel. This is how `ConcurrentHashMap` in Java (pre‑8) and `tbb::concurrent_hash_map` work.

**Example (simplified):**

```python
import threading

class ConcurrentHashMap:
    def __init__(self, num_buckets=16):
        self.num_buckets = num_buckets
        self.buckets = [{} for _ in range(num_buckets)]
        self.locks = [threading.Lock() for _ in range(num_buckets)]

    def _bucket(self, key):
        return hash(key) % self.num_buckets

    def put(self, key, value):
        b = self._bucket(key)
        with self.locks[b]:
            self.buckets[b][key] = value

    def get(self, key):
        b = self._bucket(key)
        with self.locks[b]:
            return self.buckets[b].get(key)
```

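**Scalability:** Good if the hash function distributes keys evenly, because threads contend only when their keys map to the same bucket. In standard CPython builds the global interpreter lock lets only one thread execute Python bytecode at a time, so this example illustrates the locking pattern rather than a real speedup.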

### 37.3.3 Lock‑Free Hash Maps

Fully lock‑free hash maps are more complex but can scale further under contention. They typically use atomic compare‑and‑swap on the pointers stored in the bucket array. Java's `ConcurrentHashMap` (Java 8 and later) is a hybrid rather than a fully lock‑free design: it uses CAS to insert into an empty bin and locks the bin's head node for other updates.

**Idea:**
- Maintain an array of atomic pointers to nodes (linked lists per bucket).
- Use CAS to insert a node at the head of a bucket.
- For resizing, use a multi‑step process where threads help with the resize.
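The first two points can be sketched as follows. This is an illustrative, insert-only map with a fixed number of buckets and no resizing; `AtomicRef`, `Entry`, and `CasBucketMap` are hypothetical names, and `AtomicRef` emulates an atomic pointer with a lock.

```python
import threading

class AtomicRef:
    """Hypothetical lock-based stand-in for an atomic pointer."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class Entry:
    __slots__ = ("key", "value", "next")

    def __init__(self, key, value, next_entry):
        self.key, self.value, self.next = key, value, next_entry

class CasBucketMap:
    """Insert-only sketch: fixed bucket count, no removal, no resize."""

    def __init__(self, num_buckets=8):
        self._buckets = [AtomicRef(None) for _ in range(num_buckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def get(self, key, default=None):
        entry = self._bucket(key).load()
        while entry is not None:
            if entry.key == key:
                return entry.value
            entry = entry.next
        return default

    def put_if_absent(self, key, value):
        bucket = self._bucket(key)
        while True:
            head = bucket.load()
            entry = head
            while entry is not None:          # scan the current chain
                if entry.key == key:
                    return False              # key already present
                entry = entry.next
            # Publish a new head only if the chain did not change meanwhile.
            if bucket.compare_and_swap(head, Entry(key, value, head)):
                return True

m = CasBucketMap()
print(m.put_if_absent("a", 1))  # True
print(m.put_if_absent("a", 2))  # False
print(m.get("a"))               # 1
```

Because nodes are only ever prepended, a successful CAS on the bucket head proves that no other thread inserted the same key after the scan. Supporting removal and resizing is where the real complexity lies.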

Implementation is beyond the scope of this handbook, but libraries like `libcds` provide them.

---

## 37.4 Parallel Sorting

Parallel sorting algorithms divide the array among threads, sort subarrays in parallel, and then merge.

### 37.4.1 Parallel Merge Sort

Split the array into chunks, sort each chunk in parallel, then merge the results. The merging step can also be parallelized (e.g., using parallel merge of two sorted arrays).

**Simple implementation using Python's `multiprocessing` (conceptual):**

```python
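import heapq
from multiprocessing import Pool

def merge(left, right):
    # Merge two sorted lists into one sorted list.
    return list(heapq.merge(left, right))

# On platforms that spawn worker processes, call this from under
# `if __name__ == "__main__":`.
def parallel_merge_sort(arr, num_workers=4):
    if len(arr) <= 1:
        return arr
    # Ceiling division keeps chunk_size >= 1 and yields at most num_workers chunks.
    chunk_size = -(-len(arr) // num_workers)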
    chunks = [arr[i:i+chunk_size] for i in range(0, len(arr), chunk_size)]
    with Pool(num_workers) as pool:
        sorted_chunks = pool.map(sorted, chunks)  # simple sort per chunk
    # merge sorted_chunks (merge two at a time)
    while len(sorted_chunks) > 1:
        new_chunks = []
        for i in range(0, len(sorted_chunks), 2):
            if i+1 < len(sorted_chunks):
                new_chunks.append(merge(sorted_chunks[i], sorted_chunks[i+1]))
            else:
                new_chunks.append(sorted_chunks[i])
        sorted_chunks = new_chunks
    return sorted_chunks[0]
```

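**Efficiency:** Sorting each chunk takes O((n/p) log (n/p)) time, and the p chunk sorts run in parallel. The pairwise merge shown here runs sequentially in about log p rounds of O(n) each, i.e., O(n log p), which dominates for large p unless the merges are parallelized too. Total work remains O(n log n).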
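One way to parallelize the merge itself is divide and conquer: pick the median of the longer list as a pivot, locate it in the other list by binary search, and merge the two halves independently. The sketch below runs sequentially; `dc_merge` and the cutoff of 8 are assumptions made for illustration, and a parallel runtime could execute the two recursive calls as separate tasks.

```python
import bisect

def dc_merge(a, b):
    """Merge sorted lists a and b by divide and conquer (sequential sketch)."""
    if len(a) < len(b):
        a, b = b, a                  # keep `a` as the longer list
    if not a:
        return []
    if len(a) + len(b) <= 8:         # small inputs: fall back to a direct sort
        return sorted(a + b)
    mid = len(a) // 2
    pivot = a[mid]
    # Everything in b[:split] is < pivot; everything in b[split:] is >= pivot.
    split = bisect.bisect_left(b, pivot)
    # The two sub-merges share no input or output, so they are independent.
    left = dc_merge(a[:mid], b[:split])
    right = dc_merge(a[mid + 1:], b[split:])
    return left + [pivot] + right

a = list(range(0, 40, 2))
b = list(range(1, 40, 3))
print(dc_merge(a, b) == sorted(a + b))  # True
```

Each level halves the longer list, so the recursion depth is logarithmic and the sub-merges at each level can proceed concurrently.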

### 37.4.2 Bitonic Sort

Bitonic sort is a sorting network that can be parallelized efficiently on parallel architectures (e.g., GPUs). It has O(log² n) stages, each performing n/2 independent compare‑exchange operations that can all run in parallel. The classic formulation requires n to be a power of two.

**Idea:** A bitonic sequence is one that first increases then decreases (or vice‑versa). Bitonic sort recursively builds bitonic sequences and then sorts them with bitonic merge.
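The recursive structure can be sketched in Python as follows. This is a sequential illustration for inputs whose length is a power of two; the function names are chosen for this example. The compare-exchange operations inside one `bitonic_merge` call are independent of each other, which is what a parallel implementation exploits.

```python
def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    lo, hi = [], []
    for x, y in zip(seq[:half], seq[half:]):
        # Compare-exchange element i with element i + half.
        if (x > y) == ascending:
            x, y = y, x
        lo.append(x)
        hi.append(y)
    # Both halves are again bitonic and can be merged independently.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

def bitonic_sort(seq, ascending=True):
    n = len(seq)
    if n <= 1:
        return list(seq)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    # Ascending half followed by descending half forms a bitonic sequence.
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([7, 3, 8, 1, 6, 2, 5, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```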

**C++ pseudocode (using OpenMP):**

```cpp
#include <utility>  // std::swap

// Sorts arr in ascending order; n must be a power of two.
void bitonic_sort(int* arr, int n) {
    for (int k = 2; k <= n; k *= 2) {
        for (int j = k/2; j > 0; j /= 2) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                int ixj = i ^ j;
                if (ixj > i) {
                    if (( (i & k) == 0 && arr[i] > arr[ixj] ) ||
                        ( (i & k) != 0 && arr[i] < arr[ixj] )) {
                        std::swap(arr[i], arr[ixj]);
                    }
                }
            }
        }
    }
}
```

**Complexity:** O(n log² n) operations but with good parallelism.

### 37.4.3 Sample Sort

Sample sort is a parallel version of quicksort that selects a set of splitters from a random sample of elements, partitions the array into buckets, sorts each bucket in parallel, and concatenates.

**Algorithm:**
1. Choose p‑1 splitters by sorting a random sample of size p * s (s is oversampling factor).
2. Partition the array into p buckets according to these splitters (using binary search to find bucket index).
3. Sort each bucket in parallel (e.g., using quicksort or sample sort recursively).
4. Concatenate buckets.

**Advantages:** Good load balancing, works well on distributed systems.

**Python‑like pseudocode (ignoring actual parallelism):**

```python
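import bisect
import random

def sample_sort(arr, p):
    if len(arr) <= 1:
        return arr
    # choose splitters from a random sample (oversampling factor 16),
    # capped at len(arr) because random.sample cannot draw more than that
    sample = random.sample(arr, min(len(arr), p * 16))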
    sample.sort()
    splitters = [sample[i * len(sample) // p] for i in range(1, p)]
    # create buckets
    buckets = [[] for _ in range(p)]
    for x in arr:
        idx = bisect.bisect_right(splitters, x)  # find bucket
        buckets[idx].append(x)
    # sort buckets in parallel (here sequentially)
    sorted_buckets = [sorted(b) for b in buckets]
    # concatenate
    result = []
    for b in sorted_buckets:
        result.extend(b)
    return result
```

---

## 37.5 MapReduce Paradigm

MapReduce is a programming model for processing large datasets in a distributed cluster. It was popularized by Google and is implemented in frameworks like Hadoop.

### 37.5.1 Model

The computation is expressed as two user‑defined phases, **Map** and **Reduce**, connected by a **Shuffle** step that the framework performs.

- **Map:** The master node splits the input into independent chunks and assigns them to worker nodes. Each worker applies the user‑defined `map` function to each record, producing intermediate key/value pairs.
- **Shuffle:** The system sorts and groups intermediate pairs by key.
- **Reduce:** Workers apply the user‑defined `reduce` function to each group, producing final output.

### 37.5.2 Word Count Example

```python
# Pseudocode: `emit` is supplied by the MapReduce framework.
def map(document):
    # document is a string
    for word in document.split():
        emit(word, 1)

def reduce(key, values):
    # key: a word, values: list of counts
    emit(key, sum(values))
```
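To see the three steps end to end, the word count can be run through a tiny in-memory driver. This is a single-process sketch, not a distributed implementation: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and generators and return values replace the framework's `emit`.

```python
from collections import defaultdict

def map_fn(document):
    # Produce one (word, 1) pair per word in the document.
    for word in document.split():
        yield word, 1

def reduce_fn(key, values):
    # Sum all counts collected for one word.
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map: apply map_fn to every input record.
    intermediate = []
    for record in inputs:
        intermediate.extend(map_fn(record))
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: apply reduce_fn to each group, in sorted key order.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In a real cluster the map calls run on different workers and the shuffle moves data across the network, but the data flow is the same.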

### 37.5.3 Data Structures and Algorithms in MapReduce

- **Map phase:** Typically stateless; can be done in parallel without communication.
- **Shuffle:** Requires distributed sorting and partitioning (often using a hash of the key) to send all values for a key to the same reducer. This is similar to external sorting.
- **Reduce:** Aggregates values; may need to handle large groups by streaming or combining.

**Combiners** are optional “mini‑reducers” that run on the map side to reduce network traffic.
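The effect of a combiner on word count can be sketched as below. The function names are hypothetical; the combiner here simply pre-aggregates counts within one document before anything is sent to the reducers, which is valid because addition is associative and commutative.

```python
from collections import Counter

def map_without_combiner(document):
    # One (word, 1) pair per occurrence.
    return [(word, 1) for word in document.split()]

def map_with_combiner(document):
    # Pre-aggregate on the map side: one (word, count) pair per distinct word.
    return list(Counter(document.split()).items())

doc = "to be or not to be"
plain = map_without_combiner(doc)
combined = map_with_combiner(doc)
print(len(plain), len(combined))  # 6 4
print(combined)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```

Fewer intermediate pairs leave the mapper, while the reducer's final sums are unchanged.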

### 37.5.4 Distributed Consistency

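MapReduce handles worker failures by re‑executing the affected tasks, which is safe when the `map` and `reduce` functions are deterministic and free of side effects. It assumes a shared‑nothing architecture: workers communicate only through files, with map output kept on disk until reducers fetch it and final output written to a distributed file system.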
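The re-execution idea can be illustrated with a small retry wrapper. This is a single-process sketch under simplifying assumptions: `run_with_retries` and `flaky_square` are hypothetical, and a raised exception stands in for a crashed worker.

```python
def run_with_retries(task, args, max_attempts=3):
    """Re-run a deterministic task until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt

attempts = {"count": 0}

def flaky_square(x):
    # Simulated worker: crashes on the first attempt, succeeds afterwards.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated worker crash")
    return x * x

print(run_with_retries(flaky_square, (7,)))  # 49
print(attempts["count"])                     # 2
```

Because the task is deterministic, the retried run produces the same result the failed run would have, so the rest of the job cannot tell the difference.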

---

## 37.6 Summary

```
┌─────────────────────────────────────────────────────────────────────┐
│                    CONCURRENCY & PARALLEL ALGORITHMS SUMMARY         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Lock‑Free Data Structures:                                         │
│    • Use CAS to coordinate without locks.                           │
│    • Treiber stack, Michael‑Scott queue.                            │
│    • Must handle ABA problem and memory reclamation.                │
│                                                                      │
│  Concurrent Hash Maps:                                              │
│    • Fine‑grained locking (bucket locks).                           │
│    • Lock‑free variants use CAS on pointers.                        │
│                                                                      │
│  Parallel Sorting:                                                  │
│    • Parallel merge sort, bitonic sort, sample sort.                │
│    • Aim to distribute work evenly across cores.                    │
│                                                                      │
│  MapReduce:                                                         │
│    • Distributed programming model: Map, Shuffle, Reduce.           │
│    • Handles large‑scale data processing on clusters.               │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 37.7 Practice Problems

### Lock‑Free Structures
1. Implement a lock‑free stack using CAS in your language of choice.
2. Explain the ABA problem and propose a solution.
3. Implement a lock‑free queue (Michael‑Scott) with hazard pointers (conceptually).

### Concurrent Hash Maps
4. Design a concurrent hash map with fine‑grained locking and dynamic resizing.
5. Compare the performance of coarse‑grained vs. fine‑grained locking under high contention.

### Parallel Sorting
6. Write a parallel merge sort using Python's `multiprocessing`.
7. Implement bitonic sort in OpenMP or C++ threads.
8. Simulate sample sort on a large array and analyze load balancing.

### MapReduce
9. Design a MapReduce algorithm to compute the average of values per key.
10. Explain how you would handle stragglers in a MapReduce job.

### General
11. Describe the differences between concurrency and parallelism.
12. What is the role of atomic operations in lock‑free programming?
13. How would you measure the scalability of a concurrent data structure?

---

## 37.8 Further Reading

1. **"The Art of Multiprocessor Programming"** by Herlihy & Shavit – The definitive text.
2. **"Java Concurrency in Practice"** by Goetz et al. – Practical guide, includes concurrent collections.
3. **"C++ Concurrency in Action"** by Anthony Williams – For lock‑free programming in C++.
4. **"MapReduce: Simplified Data Processing on Large Clusters"** – Original Google paper by Dean & Ghemawat.
5. **"Introduction to Algorithms" (CLRS)** – Chapter on parallel algorithms (sorting).
6. **Online resources:** Intel Threading Building Blocks (TBB) documentation, OpenMP tutorials.

---

> **Coming in Chapter 38**: **Problem‑Solving Framework** – We'll cover the UMPIRE method, time management, communication strategies, and edge case identification.

---

**End of Chapter 37**