# 🚀 Lecture 24 — Advanced: Modern 2025 Sorting Techniques
Extension for Instructor Notebook — Parallel, GPU, External-Memory & Distributed Sorting
---
This section introduces modern approaches used in 2025 production systems: parallel CPU sorting, GPU-accelerated sorting, external-memory (out-of-core) sorting, and distributed sorting frameworks.


## 🔍 Overview & When to use which

- **In-memory sorts (Θ(n log n))**: Use for data that fits in RAM; e.g., `sorted()`, `numpy.sort()`.
- **Parallel CPU sorting**: Use multi-core machines to sort large arrays faster by dividing work across cores.
- **GPU sorting**: Use for very large numeric arrays that fit in GPU memory, where GPU throughput outweighs the cost of copying data between host and device (RAPIDS / cuDF / Thrust / CUDA radix sort).
- **External-memory (out-of-core) sorting**: For datasets larger than RAM; use chunking + external merge.
- **Distributed sorting (Spark/Hadoop)**: For petabyte-scale data across many machines; use distributed shuffle and sort-by-key.


## 1) Parallel CPU Sorting (chunk, sort, merge)

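Idea: split the array into `k` chunks, sort each chunk in parallel, then merge the sorted chunks with a k-way merge. In CPython, use processes rather than threads, because the GIL prevents threads from sorting Python lists concurrently.
Complexity: total work remains Θ(n log n). With p workers the chunk-sort phase takes Θ((n/p) log(n/p)) wall-clock time, but a sequential heap-based merge of p chunks adds Θ(n log p), so the overall speedup is less than p.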
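The wall-clock estimate above can be checked with a rough cost model. The sketch below counts comparisons only, assuming one chunk per worker and a sequential heap-based merge; the function name `predicted_speedup` is illustrative, and the model ignores process start-up and data-transfer overhead.

```python
import math

def predicted_speedup(n, p):
    """Rough model: serial sort cost divided by parallel sort cost.

    Assumptions (not measurements): costs are comparison counts with
    constant factors dropped, each of the p workers sorts n/p items,
    and the final p-way merge runs on a single core.
    """
    if p <= 1:
        return 1.0
    serial = n * math.log2(n)
    chunk_sort = (n / p) * math.log2(n / p)   # done in parallel
    merge = n * math.log2(p)                  # done sequentially
    return serial / (chunk_sort + merge)

for p in (1, 2, 4, 8, 16):
    print(f"p={p:>2}: predicted speedup {predicted_speedup(2**20, p):.2f}x")
```

Under these assumptions the predicted speedup grows much more slowly than p, because the merge term n log p is not parallelized.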


In [None]:

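# Example: Parallel sort using concurrent.futures (works in Colab)
# Note: where workers are spawned rather than forked (Windows, macOS), sort_chunk must live in an importable module.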
from concurrent.futures import ProcessPoolExecutor, as_completed
import heapq
import math
import random

def sort_chunk(chunk):
    return sorted(chunk)

def parallel_sort(arr, n_workers=4):
    # split into roughly equal chunks
    n = len(arr)
    if n_workers <= 1 or n < 2_000:
        return sorted(arr)
    chunk_size = math.ceil(n / n_workers)
    chunks = [arr[i:i+chunk_size] for i in range(0, n, chunk_size)]
    sorted_chunks = []
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        futures = [ex.submit(sort_chunk, c) for c in chunks]
        for fut in as_completed(futures):
            sorted_chunks.append(fut.result())
    # k-way merge using heapq.merge (lazy generator)
    return list(heapq.merge(*sorted_chunks))

# demo with random data
arr = [random.random() for _ in range(10000)]
res = parallel_sort(arr, n_workers=4)
print(len(res), res[:5])


## 2) GPU-Accelerated Sorting (RAPIDS / cuDF / CuPy)

GPUs can sort large numeric arrays much faster using massively parallel primitives. In Python, RAPIDS (`cudf`) and `cupy` mirror `pandas`/`numpy` APIs.

- Typical approach: move data to GPU memory, call GPU sort (radix or merge-based), retrieve results.
- Complexity: Θ(n log n) work for comparison-based sorts, or Θ(n) for radix sorts on fixed-width integer keys. The speedup comes from parallel throughput, so host-to-device transfer time can dominate for small arrays.

Colab note: GPU runtimes may not have RAPIDS preinstalled; the examples below are guarded and optional.


In [None]:

# Optional GPU example (will run only if cupy is installed and GPU is available)
try:
    import cupy as cp
    import numpy as np
    a = cp.asarray(np.random.rand(1_000_000))  # 1M float64 values (about 8 MB)
    # cupy.sort runs on the GPU; CUDA events time the asynchronous work
    t0 = cp.cuda.Event(); t1 = cp.cuda.Event()
    t0.record()
    a_sorted = cp.sort(a)
    t1.record(); t1.synchronize()
    print("GPU sort done. Time (approx ms):", cp.cuda.get_elapsed_time(t0, t1))
    # bring back a small slice
    print(cp.asnumpy(a_sorted[:5]))
except Exception as e:
    print("GPU sort example skipped (cupy not available or GPU not present):", e)


## 3) External-Memory (Out-of-Core) Sorting

When data doesn't fit in RAM, use external merge sort:
1. Read chunks that fit in memory, sort each chunk, write sorted chunk to disk.
2. Perform a k-way merge of the chunk files (use priority queue) to produce final sorted output.

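Complexity: CPU work is Θ(n log n), but disk I/O dominates. With memory size M and block size B (both in items), external merge sort needs Θ((n/B) log_{M/B}(n/B)) block transfers. Key metrics: number of passes, read/write volume, sequential I/O bandwidth.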
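To make the "number of passes" metric concrete, here is a rough sketch that counts passes for the two-step procedure above. The function name and the parameters `mem_items` and `fan_in` are illustrative assumptions, and the model ignores buffering details.

```python
import math

def external_sort_passes(n_items, mem_items, fan_in):
    """Count passes over the data for an external merge sort.

    Assumptions: run formation produces ceil(n_items / mem_items) sorted
    runs in one pass, and every merge pass combines up to fan_in runs.
    """
    runs = math.ceil(n_items / mem_items)
    merge_passes = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        merge_passes += 1
    return 1 + merge_passes  # 1 pass for run formation

n = 1_000_000_000
for fan_in in (2, 10, 100):
    passes = external_sort_passes(n, mem_items=10_000_000, fan_in=fan_in)
    # every pass reads and writes each item once
    print(f"fan_in={fan_in:>3}: {passes} passes, {2 * passes * n:,} item transfers")
```

A larger merge fan-in reduces the number of passes, which is why the I/O bound has a logarithm with base M/B.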


In [None]:

# External merge sort (sketch). Sorted runs go to temp files, then are merged lazily.
import heapq, os, random, tempfile
from contextlib import ExitStack

def external_sort_sim(data, chunk_size=1000):
    # Step 1: sort each chunk in memory and write it to its own temp file (a "run")
    run_paths = []
    for i in range(0, len(data), chunk_size):
        chunk = sorted(data[i:i+chunk_size])
        with tempfile.NamedTemporaryFile(delete=False, mode='w+t') as tf:
            tf.writelines(f"{x}\n" for x in chunk)
            run_paths.append(tf.name)
    # Step 2: k-way merge; heapq.merge keeps only one value per run in memory
    try:
        with ExitStack() as stack, open('external_sorted_output.txt', 'w') as out:
            runs = [stack.enter_context(open(p)) for p in run_paths]
            for val in heapq.merge(*(map(float, f) for f in runs)):
                out.write(f"{val}\n")
    finally:
        # cleanup: remove the run files even if the merge fails
        for p in run_paths:
            try:
                os.unlink(p)
            except OSError:
                pass
    return 'external_sorted_output.txt'

# demo
data = [random.random() for _ in range(5000)]
out_file = external_sort_sim(data, chunk_size=500)
print('Wrote sorted output to', out_file)


## 4) Distributed Sorting (Spark / Dask / Hadoop)

For huge datasets across machines, distributed frameworks perform a shuffle and sort-by-key. Example tools:
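- Apache Spark: `rdd.sortByKey()` or `df.sort()`, which range-partition the data and then sort each partition.
- Dask: parallelizes pandas-like operations across a cluster, e.g., `DataFrame.sort_values()`.
- Hadoop MapReduce: the map phase emits (key, value) pairs, the shuffle sorts them by key, and each reducer receives its keys in sorted order.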

Complexity: network and disk I/O dominate; wall-clock time depends on cluster size, data skew, and shuffle costs.
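The shuffle-then-sort pattern can be imitated on one machine. The sketch below is a simplified single-process illustration of sampled range partitioning, not the code any framework actually runs; `sample_sort` and its parameters are hypothetical names.

```python
import bisect
import random

def sample_sort(records, n_partitions=4, sample_size=100, seed=0):
    """Range-partition records using sampled splitters, then sort each partition."""
    if not records:
        return [[] for _ in range(n_partitions)]
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # n_partitions - 1 splitters taken at evenly spaced sample quantiles
    splitters = [sample[(i * len(sample)) // n_partitions]
                 for i in range(1, n_partitions)]
    partitions = [[] for _ in range(n_partitions)]
    for r in records:  # the "shuffle": route each record to its key range
        partitions[bisect.bisect_right(splitters, r)].append(r)
    for part in partitions:  # local sorts; independent, so parallelizable
        part.sort()
    return partitions

data = [random.random() for _ in range(10_000)]
parts = sample_sort(data)
print([len(p) for p in parts])  # partition sizes show how well the sample balanced them
```

Because the partitions cover disjoint, ordered key ranges, concatenating them in order gives the fully sorted result without a final merge. Uneven partition sizes are the single-machine analogue of data skew.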


### Spark example (pseudocode)

```python
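from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.textFile('hdfs://...')
# key_fn is a user-supplied function that extracts the sort key from a line
pairs = rdd.map(lambda line: (key_fn(line), line))
sorted_pairs = pairs.sortByKey()  # distributed shuffle + local sorts
sorted_pairs.saveAsTextFile('hdfs://.../sorted')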
```


## 5) Specialized / Practical Techniques

- **Radix sort** for integers: Θ(d·(n + b)) time for d digits in base b, which is Θ(n) for fixed-width keys; excellent in practice for such keys.
- **Parallel k-way merge algorithms** to merge many sorted runs efficiently.
- **Out-of-core tools**: GNU `sort`, `pandas.read_csv(..., chunksize=...)`, `dask.dataframe`, `vaex`.
- **Avoid full sorts when possible**: use `heapq.nlargest`, `heapq.nsmallest`, or streaming top-k with a heap (Θ(n log k)).
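As an illustration of the radix sort bullet, here is a minimal LSD (least-significant-digit) radix sort. It is a teaching sketch that assumes non-negative integers below `2**key_bits`; the parameter names are illustrative, and in pure Python it will not outperform the built-in `sorted()`.

```python
import random

def radix_sort(values, key_bits=32, radix_bits=8):
    """LSD radix sort for integers in the range [0, 2**key_bits)."""
    mask = (1 << radix_bits) - 1
    out = list(values)
    for shift in range(0, key_bits, radix_bits):
        # histogram of the current digit
        counts = [0] * (mask + 1)
        for v in out:
            counts[(v >> shift) & mask] += 1
        # exclusive prefix sum turns counts into starting offsets
        total = 0
        for d in range(len(counts)):
            counts[d], total = total, total + counts[d]
        # stable scatter into the output buffer
        buf = [0] * len(out)
        for v in out:
            d = (v >> shift) & mask
            buf[counts[d]] = v
            counts[d] += 1
        out = buf
    return out

nums = [random.randrange(2**32) for _ in range(10)]
print(radix_sort(nums) == sorted(nums))
```

With 32-bit keys and 8-bit digits the loop makes four passes regardless of n, which is where the linear bound comes from. Each pass must be stable for the final order to be correct.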


In [None]:

# Example: top-k streaming using heapq (useful in log processing / ML candidate selection)
import heapq, random

def top_k_stream(iterable, k=10):
    h = []
    for x in iterable:
        if len(h) < k:
            heapq.heappush(h, x)
        else:
            if x > h[0]:
                heapq.heapreplace(h, x)
    return sorted(h, reverse=True)

data = [random.random() for _ in range(10000)]
print('Top 5:', top_k_stream(data, k=5))


## 6) Practical Notes & Tradeoffs

- **Memory vs Time**: Merge sort needs Θ(n) auxiliary memory; in-place sorts such as heapsort need only O(1) extra space.
- **IO bottlenecks**: For external sorts, sequential disk throughput matters more than CPU.
- **Data skew**: In distributed sorts, a few overloaded partitions become stragglers; mitigate with sampled range partitioning or a custom partitioner.
- **Stability**: Important when sorting by multiple columns; choose stable algorithms when needed.
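The stability point can be seen with Python's built-in `sorted()`, which is stable. The records below are made-up sample data; the sketch sorts by the secondary key first and the primary key second.

```python
from operator import itemgetter

# (name, age) records; made-up sample data
records = [("carol", 30), ("alice", 30), ("dave", 25), ("bob", 25)]

# Sort by the secondary key (name) first, then by the primary key (age).
# Stability keeps names in order within each age group.
by_name = sorted(records, key=itemgetter(0))
by_age_then_name = sorted(by_name, key=itemgetter(1))
print(by_age_then_name)
# [('bob', 25), ('dave', 25), ('alice', 30), ('carol', 30)]

# Equivalent single pass with a composite key
single_pass = sorted(records, key=itemgetter(1, 0))
print(single_pass == by_age_then_name)
```

With an unstable algorithm, the second sort could reorder records that share an age, so the two-pass approach would no longer be reliable.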


## ✅ Hands-on Tasks (Instructor)
1. Run the parallel sort example with varying `n_workers` and input sizes.
2. If GPU is available, try the CUDA/CuPy example; compare speedups.
3. Simulate external sort on a large synthetic dataset and measure I/O time.
4. Demonstrate Spark `sortByKey()` on a small cluster or cloud notebook (pseudocode ok).
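For the timing tasks, a small harness along these lines may help. It is a sketch, not a rigorous benchmark: `time_sort` is a hypothetical helper that reports the best of a few repeats for any function taking a list and returning a sorted list.

```python
import random
import time

def time_sort(sort_fn, sizes=(1_000, 10_000, 100_000), repeats=3, seed=0):
    """Return {n: best wall-clock seconds} for sort_fn on random floats."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        data = [rng.random() for _ in range(n)]
        expected = sorted(data)
        best = float("inf")
        for _ in range(repeats):
            fresh = list(data)  # copy outside the timed region
            start = time.perf_counter()
            out = sort_fn(fresh)
            best = min(best, time.perf_counter() - start)
            assert out == expected, "sort_fn returned an incorrect result"
        results[n] = best
    return results

for n, secs in time_sort(sorted).items():
    print(f"n={n:>7}: {secs * 1e3:.2f} ms")
```

Passing a different function in place of `sorted`, for example a wrapper that fixes `n_workers`, allows the same inputs to be timed across implementations.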
