# CS 267: Introduction to Data Parallelism Lecture 6

Kathy Yelick

https://sites.google.com/lbl.gov/cs267-spr2019/

#### **Lessons from Today's Lecture**

- Data parallelism is beautiful!
- Automatically mapping it to (today's) hardware is hard
- Surprising things you can do with scans
- Many parallel programming models use some data parallel features
  - GPUs
  - MPI collectives
  - Cloud MapReduce
- Useful in designing (nontrivial) parallel algorithms

"Every nontrivial parallel algorithm uses a prefix scan"

#### Parallel Machines and Programming



These are the natural "abstract" machine models

cost

CS267 Lecture 6 3 02/07/2019

#### **Data Parallel Programming: Unary Operators**

Unary operations applied to all elements of an array

```
A = array
B = array
f = square (any unary function, i.e., 1 argument)
B = f(A)
```



#### **Data Parallel Programming: Binary Operators**

Binary operations applied to all pairs of elements

```
A = array
B = array
C = array
- or any other binary operator
C = A - B
                         3
                                                             3
                      2
                                  2
                                        2
                                              3
                            0
       A:
                                    - applied to each pair
        B:
                                                             5
                                        -2
                                              3
```

#### **Data Parallel Programming: Broadcast**

Broadcast fill a value into all elements of an array



Useful for a\*X+Y called axpy, saxpy, daxpy



#### **Data Parallel Programming: Reduce**

Reduce an array to a value with + or any associative op



- Associative so we can perform op in any order
- Useful for dot products (ddot, sdot, etc.)

$$b = X^{T}Y = \Sigma_{j} X[j] * Y[j]$$

$$b = dot(X, Y) = sum(X .* Y)$$

$$X: 1 1 1 3 3 2 1$$

$$Y: 1 2 0 2 1 3 1 intermediate products$$

$$Y: dot product$$

$$b: 19$$

$$CS267 Lecture 6$$

7

#### **Data Parallel Programming: Scans**

- Fill array with partial reductions any associative op
- Sum scan:



Max scan:



#### **Inclusive and Exclusive Scans**

Two variations of a scan, given an input vector  $[x_0, x_1, ..., x_{n-1}]$ :

inclusive scan includes input x<sub>i</sub> when computing output y<sub>i</sub>

$$[a_0, (a_0 \otimes a_1), ..., (a_0 \otimes a_1 ... \otimes a_{n-1})]$$
  
e.g., add\_scan\_inclusive([1, 0, 3, 0, 2])  $\rightarrow$  [1, 1, 4, 4, 6]

exclusive scan does not x<sub>i</sub> when computing output y<sub>i</sub>

 $[I, a_0, (a_0 \odot a_1), ..., (a_0 \odot a_1 ... \odot a_{n-2})]$  where I is the identity for  $\odot$ 

e.g., add\_scan\_exclusive([1, 0, 3, 0, 2]) 
$$\rightarrow$$
 [0, 1, 1, 4, 4]

You can get the inclusive version from the exclusive by applying the operation across vectors: X<sub>©</sub>Y. You can convert both directions using vector shifts left or right.

#### **Data Parallel Programming: Masks**

Can apply operations under a "mask"



Related: Segmented scans to be presented later

# Idealized Hardware and Performance Model

#### SIMD Systems Implemented Data Parallelism

- A large number of (usually) tiny processors.
  - A single "control processor" issues each instruction.
  - Each processor executes the same instruction.
  - Some processors may be turned off on some instructions.
- Originally machines were specialized to scientific computing, few made (CM2, Maspar)



#### **Ideal Cost Model for Data Parallelism**

- Machine
  - An unbounded number of processors (p)
  - Control overhead is free
  - Communication is free
- Shows the inherent parallelism (inherent serialization)
- Called the algorithm's "span"
- Defines a lower bound on real machines



CS267 Lecture 6

#### Cost on Ideal Machine (Span)

Span for unary or binary operations (pleasingly parallel)

$$C = A + B$$



- Even if arrays are not aligned, communication is "free" here
- Reductions and broadcasts



#### **Broadcast and reduction use processor trees**

Broadcast of 1 value to p processors with log n span



Reduction of n values to 1 with log n span

02/07/2019

Takes advantage of associativity in +, \*, min, max, etc.



15

# Can reductions go faster: No, log n lower bound and on any function of n variables!

- Given a function f (x1,...xn) of n input variables and 1 output variable, how fast can we evaluate it in parallel?
- Assume we only have binary operations, one per time step
- After 1 time step, an output can only depend on two inputs
- Use induction to show that after k time units, an output can only depend on 2<sup>k</sup> inputs
  - After log<sub>2</sub> n time units, output depends on at most n inputs
- A binary tree performs such a computation



#### Multiplying n-by-n matrices in O(log n) time

- Use n<sup>3</sup> processors
- Step 1: For all  $(1 \le i,j,k \le n)$  P(i,j,k) = A(i,k) \* B(k,j)
  - cost = 1 time unit, using  $n^3$  processors
- Step 2:" For all  $(1 \le i,j \le n)$   $C(i,j) = \sum_{k=1}^{n} P(i,j,k)$  cost = O(log n) time, using n<sup>2</sup> trees, n<sup>3</sup> / 2 processors each



Put a processor at every point in this cube

#### Related to Communication-Optimal "2.5D" MatMul



"A face"

Processors execute internal sub-cubes

#### What about Scan (aka Parallel Prefix)?

 Recall: the scan operation takes a binary associative operator 

, and an array of n elements

[
$$a_0$$
,  $a_1$ ,  $a_2$ , ...  $a_{n-1}$ ]  
and produces the array  
[ $a_0$ , ( $a_0 \odot a_1$ ), ... ( $a_0 \odot a_1$  ...  $\odot a_{n-1}$ )]

Example: add scan of

```
[1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14]
```

- Other operators
  - Reals: +, \*, min, max
  - · Booleans: and, or
  - Matrices: mat mul

#### Can we parallelize a scan?

It looks like this:

```
y(0) = 0;
for i = 1:n
y(i) = y(i-1) + x(i);
```

- Takes n-1 operations (adds) to do in serial
- The i<sup>th</sup> iteration of the loop depends completely on the (i-1)<sup>st</sup> iteration.

Impossible to parallelize, right?

#### A clue

Is there any value in adding, say, 5+6+7+8?

If we separately have 1+2+3+4, what can we do?

Suppose we added 1+2, 3+4, etc. pairwise -- what could we do?

#### Prefix sum in parallel

Algorithm: 1. Pairwise sum 2. Recursive prefix 3. Pairwise sum



CS267 Lecture 6

#### Parallel prefix cost



Pairwise sum

**Recursive prefix** 

Pairwise sum (update odds)

Time for this algorithm on one processor (work)

- $T_1(n) = n/2 + n/2 + T_1(n/2) = n + T_1(n/2) = 2n 1$ Time on unbounded number of processors (span)
- $T_{\infty}(n) = 2 \log n$

Parallelism at the cost of more work (2x)!

#### Non-recursive view of parallel prefix scan



### **Real Hardware (Today)**

#### Vectors use Data Parallelism (at smaller scale)

- Vector instructions operate on a vector of elements
  - These are specified as operations on vector registers



- Old supercomputer vector register: ~32-64 elts
  - The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
- The hardware performs a full vector operation in
  - #elements-per-vector-register / #pipes
  - E.g., 64 elements in register, but only 8 fp adders to use
  - "Virtualizes" the amount of hardware, which is n



26

#### Cray X1: Parallel Vector Architecture

#### Cray combined several technologies in the X1

- 12.8 Gflop/s Vector processors (MSP)
- Shared caches (unusual on earlier vector machines)
- 4 processor nodes sharing up to 64 GB of memory
- Single System Image to 4096 Processors
- Remote put/get between nodes (faster than MPI)





Expensive to design and build, market too small

#### Vectors use Data Parallelism (at smaller scale)

- SIMD instructions on microprocessors are vectors
  - Shorter than old vector supercomputers (e.g., 256 bits)
  - They don't virtualize the hardware (arithmetic units), so each processor version may require code rewrites
- Reductions and broadcasts are in register
  - Require inside-register data movement
- Assuming vector length (or SIMD width) are small constants → no theoretical speedup
  - But in practice this can make a big different (2-16x...)
  - And algorithms may still be useful
- Revisit these ideas with GPUs

#### **Data parallelism on Distributed Memory**



#### **Distributed Memory**

Processors execute own instruction stream

Communicate by sending messages

Message time depends on size, but not location

- Today's parallel machines
  - Powerful processors
  - Distributed memory (at scale)
  - Clusters or MPPs (Massively Parallel Processors)
- Need to map n-way parallelism to p-way
  - Attempts to do this automatically
- High Performance Fortran
  - Large effort in the 90s
  - Semi-automatic: Data layout hints were necessary
  - And it was still hard
- But still useful manually

#### **Mapping Data Parallelism to Clusters**

Binary and unary operations on MPPs



- If arrays are not "aligned" then communication required
- Reductions and broadcasts



#### Parallel prefix cost on p processors



Compute local prefix sums in n/p steps

Updates across processors in log p steps

Time for this algorithm in parallel:

•  $T_p(n) = O(n/p + log p)$ 

serial time on each processor

communication and computation up and down the processor tree

#### The myth of log n

 The log<sub>2</sub> n span is not the main reason for the usefulness of parallel prefix.

• Say n = 1,000,000p (1,000,000 elements per proc)

• Cost = 
$$(2000000 \text{ adds})$$
 +  $(\log_2 P \text{ message passings})$ 

fast & embarrassingly parallel

(2000000 local adds are serial for each processor, of course)

Key to implementing data parallel algorithms on clusters, SMPs, MPPs, i.e., modern supercomputers

#### Data Parallelism is an Elegant Programming Model

- Strict data parallelism has serial semantics:
  - E.g., no difference from executing A+B one element at a time or in parallel
- Reductions also preserve serial semantics for truly associative operations:
  - + \* min, etc. on integers and more;
  - some differences for floating point due to order of evaluation (but can be deterministic, i.e., the same result every time)
- Easy to understand and reason about
- "In spirit" in MPI collectives, CUDA, MapReduce...

#### Limitations:

- Some algorithms (e.g., adaptive) don't fit easily
- Non-trivial to implement on some hardware

#### Scans are useful for many things (partial list here)

- Reduction and broadcast in O(log n) time
- Parallel prefix (scan) in O(log n) time
- Adding two n-bit integers in O(log n) time
- Multiplying n-by-n matrices in O(log n) time
- Inverting n-by-n triangular matrices in O(log<sup>2</sup> n) time
- Inverting n-by-n dense matrices in O(log<sup>2</sup> n) time
- Evaluating arbitrary expressions in O(log n) time
- Evaluating recurrences in O(log n) time
- "2D parallel prefix", for image segmentation (Catanzaro & Keutzer)
- Sparse-Matrix-Vector-Multiply (SpMV) using Segmented Scan
- Parallel page layout in a browser (Leo Meyerovich, Ras Bodik)
- Solving n-by-n tridiagonal matrices in O(log n) time
- Traversing linked lists
- Computing minimal spanning trees
- Computing convex hulls of point sets...

#### **Application: Stream Compression**

Given an array of 0/1 flags

and an array (stream) of values

compress into

Step 1: Compute an exclusive add scan of flags:

• Step 2: "Scatter" values into result at index, masked by flags



CS267 Lecture 6

#### **Application: Radix Sort**



input

odds = last bit of each element

evens = complement of odds (last bit = 0)

epos = exclusive sum scans of evens

totalEvens = broadcast last element

indx = constant array of 0..n

oddpos = indx - epos + totalEvens

pos = if evens then esum else oddpos

Using two masked assignments

Scatter input using pos as index

Repeat with next bit to left until done

#### **Application: Radix Sort**

Sort on least significant bit

XX0 < XX1 (evens before odds)

$$Bit_2=0$$
  $Bit_2=1$ 

Sort on next bit X0X < X1X

$$Bit_1=0$$
  $Bit_1=1$ 

Sort on next bit

$$Bit_0=0$$
  $Bit_0=1$ 

Each step maintains the ordering unless, they have to switch based on the current bit

#### Application: Adding n-bit integers in O(log n) time

- Computing sum s of two n-bit binary numbers, a and b
  - a = a[n-1] a[n-2]...a[0] and b = b[n-1] b[n-2]...b[0]
  - s = a+b = s[n] s[n-1]...s[0] (use carry-bit array c = c[n-1]...c[0] c[-1])
- Formula

Example

Challenge: compute all c[i] in O(log n) time via parallel prefix

#### Application: Adding n-bit integers in O(log n) time

Recall carry bit calculation

```
 c[-1] = 0 \qquad ... \ rightmost \ carry \ bit \\ for \ i = 0 \ to \ n-1 \\ c[i] = ( \ (a[i] \ xor \ b[i]) \ and \ c[i-1] ) \ or \ ( \ a[i] \ and \ b[i] ) ... \ next \ carry \ bit
```

Compute all c[i] in O(log n) time via parallel prefix

```
for all (0 <= i <= n-1) p[i] = a[i] xor b[i] ... propagate bit for all (0 <= i <= n-1) g[i] = a[i] and b[i] ... generate bit  \begin{bmatrix} c[i] \end{bmatrix} = \begin{bmatrix} (p[i] \text{ and } c[i-1]) \text{ or } g[i] \end{bmatrix} = \begin{bmatrix} p[i] \\ 0 \end{bmatrix} \begin{bmatrix} g[i] \end{bmatrix} * \begin{bmatrix} c[i-1] \end{bmatrix} = M[i] * \begin{bmatrix} c[i-1] \end{bmatrix} \end{bmatrix} 
= M[i] * M[i-1] * ... M[0] * \begin{bmatrix} 0 \\ 1 \end{bmatrix} 
... evaluate M[i] * M[i-1] * ... * M[0] by parallel prefix ... 2-by-2 Boolean matrix multiplication is associative
```

Used in all computers to -- Carry look-ahead addition

#### This idea is used in all hardware



Even going back to Babbage

#### **Segmented Scans**

Inputs = value array, flag array, associative operator ⊕

#### Inclusive segmented sum scan



## Flags are sometimes done with Boolean and switch points



#### Result

| 1 | 3 | 3 | 7 | 12 | 6 | 13 | 8 |
|---|---|---|---|----|---|----|---|
|---|---|---|---|----|---|----|---|

#### SpMV in Compressed Sparse Row (CSR) Format

SpMV: y = y + A\*x

Sparse matrices: only store, do arithmetic, on nonzero entries CSR format is simplest one of many possible data structures for A



Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j) × x(j)

```
for each row i
  for k=ptr[i] to ptr[i+1]-1 do
    y[i] = y[i] + val[k]*x[ind[k]]
```

#### **SPMV** (Segmented Suffix Scan)

Sparse Matrix-Vector Multiplication (SPMV)

Y = SUMS(PTR)

Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors . Guy E. **Blelloch**, Michael A. **Heroux**, and Marco Zagha. CMU-CS-93-173

#### **Application: Fibonacci via Matrix Multiply Prefix**

$$\mathbf{F_{n+1}} = \mathbf{F_n} + \mathbf{F_{n-1}}$$

#### Can compute all F<sub>n</sub> by matmul prefix on

#### then select the upper left entry

Slide source: Alan Edelman

#### Lexical analysis (tokenizing, scanning)

#### Given a language of:

- Identifiers: string of chars
- Strings: in double quotes
- Ops: +,-,\*,=,<,>,<=, >=

TABLE I. A Finite-State Automaton for Recognizing Tokens

| Old   |   | Character Read |  |   |   |   |   |   |   |   |   |   |       |             |
|-------|---|----------------|--|---|---|---|---|---|---|---|---|---|-------|-------------|
| State | A | В              |  | Υ | z | + |   | * | < | > | = | " | Space | New<br>line |
| N     | Α | Α              |  | Α | Α | * | * | * | < | < | * | Q | N     | N           |
| Α     | Z | Z              |  | Z | Z | * | * | * | < | < | * | Q | N     | N           |
| Z     | Z | Z              |  | Z | Z | * | * | * | < | < | * | Q | N     | N           |
| *     | Α | Α              |  | Α | Α | * | * | * | < | < | * | Q | N     | N           |
| <     | Α | Α              |  | Α | Α | * | * | * | < | < | = | Q | N     | N           |
| =     | Α | Α              |  | Α | Α | * | * | * | < | < | * | Q | N     | N           |
| Q     | S | S              |  | S | S | S | S | S | S | S | S | Ε | S     | S           |
| S     | S | S              |  | s | S | S | S | S | S | S | S | E | S     | S           |
| Ε     | Ε | Ε              |  | Ε | Ε | * | * | * | < | < | * | S | N     | N           |

#### Lexical analysis

- Replace every character in the string with the array representation of its state-to-state function (column).
- Perform a parallel-prefix operation with ⊕ as the array composition. Each character becomes an array representing the state-to-state function for that prefix.
- Use initial state (row 1) to index into these arrays.

CS267 Lecture 6 47

#### Inverting Dense n-by-n matrices in O(log<sup>2</sup> n) time

- Lemma 1: Cayley-Hamilton Theorem
  - expression for A<sup>-1</sup> via characteristic polynomial in A
- Lemma 2: Newton's Identities
  - Triangular system of equations for coefficients of characteristic polynomial, where matrix entries =  $s_k$
- Lemma 3:  $s_k = trace(A^k) = \sum_{i=1}^{n} A^i_i$  [i,i]
- Csanky's Algorithm (1976)
  - 1) Compute the powers A<sup>2</sup>, A<sup>3</sup>, ...,A<sup>n-1</sup> by parallel prefix cost = O(log<sup>2</sup> n)
  - 2) Compute the traces  $s_k = \text{trace}(A^k)$ cost = O(log n)
  - 3) Solve Newton identities for coefficients of characteristic polynomial cost = O(log² n)
  - 4) Evaluate A<sup>-1</sup> using Cayley-Hamilton Theorem cost = O(log n)

#### Completely numerically unstable

### Lessons from Data Parallel Languages

- Sequential semantics (or nearly) is very nice
  - Debugging is much easier without non-determinism
  - Correctness easier to reason about
- Cost model is independent of number of processors
  - How much inherent parallelism
- Need to "throttle" parallelism
  - n >> p can be hard to map, especially with nesting
  - Memory use is a problem

See: Blelloch "NESL Revisited", Intel Workshop 2006