# Design of Digital Circuits (252-0028-00L), Spring 2021 Optional HW 6: Vector Processors and GPUs

Instructor: Prof. Onur Mutlu

Juan Gomez-Luna, Jisung Park, Hasan Hassan, Mohammed Alser, Lois Orosa, Minesh Patel, Jawad Haj-Yahya, Haiyu Mao, Behzad Salami, Jeremie Kim, Giray Yaglikci, Can Firtina, Geraldo De Oliveira Junior, Rahul Bera, Konstantinos Kanellopoulos, Nika Mansouri, Gagandeep Singh

Released: Monday, May 31, 2021

# 1 Vector Processing I

Consider the following piece of code:

for (i = 0; i < 100; i ++)  

$$A[i] = ((B[i] * C[i]) + D[i])/2;$$

(a) Translate this code into assembly language using the following instructions in the ISA (note the number of cycles each instruction takes is shown next to each instruction):

| Opcode       | Operands      | Number of cycles | Description                                                          |
|--------------|---------------|------------------|----------------------------------------------------------------------|
| LEA          | Rd, X         | 1                | $\mathtt{Rd} \leftarrow \mathrm{address} \ \mathrm{of} \ \mathtt{X}$ |
| LD           | Rd, Rs, Rt    | 11               | $\mathtt{Rd} \leftarrow \mathtt{MEM} [\mathtt{Rs} + \mathtt{Rt}]$    |
| ST           | Rs, Rt, Ru    | 11               | $\texttt{MEM}[\texttt{Rt} + \texttt{Ru}] \leftarrow \texttt{Rs}$     |
| MOVI         | Rd, imm       | 1                | $\mathtt{Rd} \leftarrow \mathtt{imm}$                                |
| MUL          | Rd, Rs, Rt    | 6                | $\mathtt{Rd} \leftarrow \mathtt{Rs} \times \mathtt{Rt}$              |
| ADD          | Rd, Rs, Rt    | 4                | $\mathtt{Rd} \leftarrow \mathtt{Rs} + \mathtt{Rt}$                   |
| ADD          | Rd, Rs, imm   | 4                | $\mathtt{Rd} \leftarrow \mathtt{Rs} + \mathtt{imm}$                  |
| RSHFA        | Rd, Rs, shamt | 1                | $\mathtt{Rd} \leftarrow \mathtt{Rs} >>> \mathtt{shamt}$              |
| BR <cc></cc> | Rs, X         | 1                | Branch to X if Rs satisfies condition code CC                        |

Assume one memory location is required to store each element of the array. Also assume that there are eight register, R0 to R7.

Condition codes are set after the execution of an arithmetic instruction. You can assume typically available condition codes such as zero (EQZ), positive (GTZ), negative (LTZ), non-negative (GEZ), and non-positive (LEZ).

| How many cycle | How many cycles does it take to execute the program? |  |  |  |  |  |  |
|----------------|------------------------------------------------------|--|--|--|--|--|--|
|                |                                                      |  |  |  |  |  |  |
|                |                                                      |  |  |  |  |  |  |
|                |                                                      |  |  |  |  |  |  |
|                |                                                      |  |  |  |  |  |  |
|                |                                                      |  |  |  |  |  |  |
|                |                                                      |  |  |  |  |  |  |

(b) Now write Cray-like vector assembly code with the minimum number of instructions. Assume that there are eight vector registers and the length of each vector register is 64. Use the following instructions in the vector ISA:

| Opcode | Operands      | Number of cycles | Description                                                                                                          |
|--------|---------------|------------------|----------------------------------------------------------------------------------------------------------------------|
| SET    | Vst, imm      | 1                | $\mathtt{Vst} \leftarrow \mathtt{imm} \ (\mathtt{Vst} \colon \mathrm{Vector} \ \mathrm{Stride} \ \mathrm{Register})$ |
| SET    | Vln, imm      | 1                | Vln ← imm (Vln: Vector Length Register)                                                                              |
| VLD    | Vd, X         | 11, pipelined    | $Vd \leftarrow \texttt{MEM}[\texttt{address of X}]$                                                                  |
| VST    | Vs, X         | 11, pipelined    | $\texttt{MEM}[\text{address of X}] \leftarrow \texttt{Vs}$                                                           |
| VADD   | Vd, Vs, Vt    | 4, pipelined     | $\texttt{Vd} \leftarrow \texttt{Vs} + \texttt{Vt}$                                                                   |
| VMUL   | Vd, Vs, Vt    | 6, pipelined     | $\mathtt{Vd}_i \leftarrow \mathtt{Vs}_i \times \mathtt{Vt}_i$                                                        |
| VRSHFA | Vd, Vs, shamt | 1                | $\mathtt{Vd}_i \leftarrow \mathtt{Vs}_i >>> \mathtt{shamt}$                                                          |

| _   |  |  |  |
|-----|--|--|--|
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
| - 1 |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
| - 1 |  |  |  |
| - 1 |  |  |  |
| -1  |  |  |  |
| -1  |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
| -1  |  |  |  |
| - 1 |  |  |  |
|     |  |  |  |
| -1  |  |  |  |
|     |  |  |  |
|     |  |  |  |
| -1  |  |  |  |
|     |  |  |  |
| -1  |  |  |  |
| - 1 |  |  |  |
|     |  |  |  |
|     |  |  |  |
|     |  |  |  |
| _   |  |  |  |
|     |  |  |  |

| is 16-way interle                                                                | eaved.               |                |   | fors? Assume that memor |  |  |
|----------------------------------------------------------------------------------|----------------------|----------------|---|-------------------------|--|--|
| Vector processor without chaining, 1 port to memory (1 load or store per cycle): |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
| Vector proces                                                                    | sor with chaining, 1 | port to memory | : |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |
|                                                                                  |                      |                |   |                         |  |  |

| tor processor with chaining, 2 read ports and 1 write port to memory: |  |  |  |  |  |
|-----------------------------------------------------------------------|--|--|--|--|--|
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |
|                                                                       |  |  |  |  |  |

# 2 Vector Processing II

You are studying a program that runs on a vector computer with the following latencies for various instructions:

- VLD and VST: 50 cycles for each vector element; fully interleaved and pipelined.
- VADD: 4 cycles for each vector element (fully pipelined).
- VMUL: 16 cycles for each vector element (fully pipelined).
- VDIV: 32 cycles for each vector element (fully pipelined).
- VRSHFA: 1 cycle for each vector element (fully pipelined).

#### Assume that:

- The machine has an in-order pipeline.
- The machine supports chaining between vector functional units.
- In order to support 1-cycle memory access after the first element in a vector, the machine interleaves vector elements across memory banks. All vectors are stored in memory with the first element mapped to bank 0, the second element mapped to bank 1, and so on.
- Each memory bank has an 8 KB row buffer.
- Vector elements are 64 bits in size.
- Each memory bank has two ports (so that two loads/stores can be active simultaneously), and there are two load/store functional units available.

(a) What is the minimum power-of-two number of banks required in order for memory accesses to never

| stall? (Assume a vector stride of 1.) |  |  |  |  |  |  |  |
|---------------------------------------|--|--|--|--|--|--|--|
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |
|                                       |  |  |  |  |  |  |  |

| (b) | The ma                               |     |                 |       |                |             | ks as ;             | you fo            | ound i | n part   | a) ex  | ecute   | s the f | ollowii | ng pro  | gram   | (assum   | ne that |
|-----|--------------------------------------|-----|-----------------|-------|----------------|-------------|---------------------|-------------------|--------|----------|--------|---------|---------|---------|---------|--------|----------|---------|
|     | VLD<br>VLD<br>VADD<br>VMUL<br>VRSHFA | V4, | B<br>V1,<br>V3, | V1    | //<br>//<br>// | $V3$ $V4_i$ | ← B<br>← V1<br>← V3 | $_{i}$ $\times$ 1 | $V1_i$ |          |        |         |         |         |         |        |          |         |
|     | It takes<br>in a vec                 |     |                 | es to | execu          | ite th      | is pro              | gram              | . Wha  | at is th | ne vec | ctor le | ength . | L (i.e. | , the r | umbe   | r of ele | ements  |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     | If the n                             |     |                 |       |                |             |                     |                   |        |          |        | oeline  | indep   | endent  | oper    | ations | ), how   | many    |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |
|     |                                      |     |                 |       |                |             |                     |                   |        |          |        |         |         |         |         |        |          |         |

| pending loa | ads and stores nads from the old | night stall due to be<br>est instruction are | 2 from the number of<br>bank contention, an<br>serviced first. How<br>t memory system (b | of banks you found arbiter is added to many cycles does to | each bank so tha    |
|-------------|----------------------------------|----------------------------------------------|------------------------------------------------------------------------------------------|------------------------------------------------------------|---------------------|
|             |                                  |                                              |                                                                                          |                                                            |                     |
|             |                                  |                                              |                                                                                          |                                                            |                     |
|             |                                  |                                              |                                                                                          |                                                            |                     |
|             |                                  |                                              |                                                                                          |                                                            |                     |
|             |                                  |                                              |                                                                                          |                                                            |                     |
|             |                                  |                                              |                                                                                          |                                                            |                     |
|             |                                  |                                              |                                                                                          |                                                            |                     |
|             |                                  |                                              |                                                                                          |                                                            |                     |
| Now, the a  |                                  |                                              | ducing the number of many banks are in                                                   |                                                            | to a lower power of |
| 2). The pro | ogram executes i                 |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |
| 2). The pro | ogram executes                   |                                              |                                                                                          |                                                            |                     |

| (d) | Another architect is now designing the second generation of this vector computer. He wants to build a multicore machine in which 4 vector processors share the same memory system. He scales up the number of banks by 4 in order to match the memory system bandwidth to the new demand. However, when he simulates this new machine design with a separate vector program running on every core, he finds |
|-----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     | that the average execution time is longer than if each individual program ran on the original single-core system with 1/4 the banks. Why could this be? Provide concrete reason(s).                                                                                                                                                                                                                         |
|     |                                                                                                                                                                                                                                                                                                                                                                                                             |
|     |                                                                                                                                                                                                                                                                                                                                                                                                             |
|     |                                                                                                                                                                                                                                                                                                                                                                                                             |
|     | What change could this architect make to the system in order to alleviate this problem (in less than 20 words), while <i>only</i> changing the shared memory hierarchy?                                                                                                                                                                                                                                     |
|     |                                                                                                                                                                                                                                                                                                                                                                                                             |
|     |                                                                                                                                                                                                                                                                                                                                                                                                             |
|     |                                                                                                                                                                                                                                                                                                                                                                                                             |

# 3 Vector Processing III

Assume a vector processor that implements the following ISA:

| Opcode | Operands   | Number of cycles | Description                                                                      |
|--------|------------|------------------|----------------------------------------------------------------------------------|
| SET    | Vst, imm   | 1                | $\texttt{Vst} \leftarrow \texttt{imm} \; (\texttt{Vst: Vector Stride Register})$ |
| SET    | Vln, imm   | 1                | $Vln \leftarrow imm (Vln: Vector Length Register)$                               |
| VLD    | Vd, addr   | 100, pipelined   | $\texttt{Vd} \leftarrow \texttt{MEM[addr]}$                                      |
| VST    | Vs, addr   | 100, pipelined   | $\texttt{MEM[addr]} \leftarrow \texttt{Vs}$                                      |
| VADD   | Vd, Vs, Vt | 5, pipelined     | $\texttt{Vd} \leftarrow \texttt{Vs} + \texttt{Vt}$                               |
| VMUL   | Vd, Vs, Vt | 10, pipelined    | $\mathtt{Vd}_i \leftarrow \mathtt{Vs}_i \times \mathtt{Vt}_i$                    |
| VDIV   | Vd, Vs, Vt | 20, pipelined    | $\mathtt{Vd}_i \leftarrow \mathtt{Vs}_i \; / \; \mathtt{Vt}_i$                   |

Assume the following:

- The processor has an in-order pipeline.
- The size of a vector element is 4 bytes.
- Vst and Vln are 10-bit registers.
- The processor does not support chaining between vector functional units.
- $\bullet$  The main memory has N banks.
- Vector elements stored in consecutive memory addresses are interleaved between the memory banks. For example, if a vector element at address A maps to bank B, a vector element at address A+4 maps to bank (B+1)%N, where % is the modulo operator and N is the number of banks. N is not necessarily a power of two.
- The memory is byte addressable and the address space is represented using 32 bits.
- Vector elements are stored in memory in 4-byte-aligned manner.
- Each memory bank has a 4 KB row buffer.
- Each memory bank has a single read and a single write port so that a load and a store operation can be performed simultaneously.
- There are separate functional units for executing VLD and VST instructions.

| (a) | What should the minimum valuassuming a vector stride of 1? Ex | avoid stalls whi | le executing a V | LD or VST | instruction, |
|-----|---------------------------------------------------------------|------------------|------------------|-----------|--------------|
|     |                                                               |                  |                  |           |              |
|     |                                                               |                  |                  |           |              |
|     |                                                               |                  |                  |           |              |
|     |                                                               |                  |                  |           |              |
|     |                                                               |                  |                  |           |              |
|     |                                                               |                  |                  |           |              |
|     |                                                               |                  |                  |           |              |

| suming a v | ector stride o | of 2? Explain | 1. |  |  |  |
|------------|----------------|---------------|----|--|--|--|
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |
|            |                |               |    |  |  |  |

### (c) Assume:

- A machine that has a memory with as many banks as you found is part (a).
- $\bullet\,$  The vector stride is set to 1.
- The value of the vector length is set to M (but we do not know M)

The machine executes the following program:

It takes 4,306 cycles to execute the above program. What is M? Explain.

| If we modifthe same pr                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | y the vector program in par  | processor to $s$ t (c)? Explain | support chair | ning, how man         | ny cycles wo | ould be requ | ired to exe |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------|---------------------------------|---------------|-----------------------|--------------|--------------|-------------|
| If we modifthe same pr                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | y the vector rogram in par   | processor to $s$ t (c)? Explain | support chair | <i>ting</i> , how man | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same property                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | by the vector program in par | processor to $s$ t (c)? Explain | support chair | aing, how man         | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same property                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | by the vector program in par | processor to $s$ t (c)? Explain | support chair | aing, how man         | ny cycles wo | uld be requ  | ired to exc |
| If we modified the same proof                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | y the vector program in par  | processor to $s$ t (c)? Explain | support chair | ning, how man         | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same processing the same pr | y the vector gram in par     | processor to $s$ t (c)? Explain | support chair | ing, how man          | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same processing the same pr | by the vector program in par | processor to $s$ t (c)? Explain | support chair | <i>ting</i> , how man | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same process.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | by the vector gram in par    | processor to $s$ t (c)? Explain | support chair | aing, how man         | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same property of the same proper | y the vector cogram in par   | processor to $s$ t (c)? Explain | support chair | aing, how man         | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same processing the same pr | y the vector cogram in par   | processor to $s$ t (c)? Explain | support chair | aing, how man         | ny cycles wo | uld be requ  | ired to exc |
| If we modified the same processing the same pr | y the vector cogram in par   | processor to s t (c)? Explain   | support chair | <i>ting</i> , how man | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same process of the same pr | fy the vector cogram in par  | processor to s t (c)? Explain   | support chair | ning, how man         | ny cycles wo | uld be requ  | ired to exe |
| If we modified the same process.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | fy the vector regram in par  | processor to $s$ t (c)? Explain | support chair | aing, how man         | ny cycles wo | uld be requ  | ired to exc |
| If we modified the same processing the same pr | fy the vector regram in par  | processor to s t (c)? Explain   | support chair | ving, how man         | ny cycles wo | uld be requ  | ired to exc |
| If we modified the same process of the same pr | fy the vector cogram in par  | processor to s t (c)? Explain   | support chair | ing, how man          | ny cycles wo | uld be requ  | ired to exc |
| If we modified the same processing the same pr | fy the vector regram in par  | processor to s t (c)? Explain   | support chair | aing, how man         | ny cycles wo | uld be requ  | ired to exc |

# 4 Vector Processing IV

A vector processor includes the following instructions in its ISA:

| Opcode | Operands         | Latency (cycles)                    | Description                   |
|--------|------------------|-------------------------------------|-------------------------------|
| VLD    | $V_i$ , #Address | 60, fully interleaved and pipelined | $V_i \leftarrow Mem[Address]$ |
| VST    | $V_i$ , #Address | 60, fully interleaved and pipelined | $Mem[Address] \leftarrow V_i$ |
| VADD   | $V_i, V_j, V_k$  | 8, fully pipelined                  | $V_i \leftarrow V_j + V_k$    |
| VMUL   | $V_i, V_j, V_k$  | 16, fully pipelined                 | $V_i \leftarrow V_j * V_k$    |

Assume that:

- The machine has an in-order pipeline.
- Each vector element is 64 bits in size.
- In order to support 1-element per cycle memory throughput for vector elements, after the first element in a vector, the machine interleaves vector elements across memory banks. All vectors are stored in memory with the first element mapped to bank 0, the second element mapped to bank 1, etc.
- Memory accesses within a vectorized memory request must be issued in order.
- Each memory bank has two ports (so that two loads/stores can be active simultaneously), and there are two load/store functional units available.

Answer the following questions about this vector computer.

| (a) | The number of memory banks in this vector processor is a power of two. What should the minimum        |
|-----|-------------------------------------------------------------------------------------------------------|
|     | number of banks be to avoid stalls while executing a VLD or VST instruction, assuming a vector stride |
|     | of 1? Explain.                                                                                        |

(b) Translate the following piece of code, called SAXPY, into vector code that executes in the minimum possible number of cycles on this vector processor. Constant alpha is stored in vector register V1. (Note: Assume vector length register and vector stride register are already introduced to appropriate values.)

```
for (i = 0; i < VLEN; i++){
    temp = alpha * X[i];
    Z[i] = temp + Y[i];
}</pre>
```

| Sho | etor functional units<br>ow your work.    | s. If the machine |  | not support chaining am, what is the vec |            |
|-----|-------------------------------------------|-------------------|--|------------------------------------------|------------|
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     | ne architect of the nevious part, what is |                   |  | or the VLEN you for Show your work.      | ound in th |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |
|     |                                           |                   |  |                                          |            |

| (e) | Now the architect decides that she needs to cut costs in the machine's memory system by reducing the number of banks. Because loads and stores might stall due to bank contention, an arbiter is added to each bank so that pending loads from the oldest instruction are serviced first. She observes the program now takes 481 cycles (with chaining). What is the new number of banks? Show your work. (Note: |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     | Recall the number of banks is a power of two.)                                                                                                                                                                                                                                                                                                                                                                   |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |
|     |                                                                                                                                                                                                                                                                                                                                                                                                                  |

# 5 SIMD Processing

Suppose we want to design a SIMD engine that can support a vector length of 16. We have two options: a traditional vector processor and a traditional array processor.

| The traditional vector processor  The traditional array processor  Nei Justify your answer.  Assuming the latency of an addition operation is five cycles in both processors, how long will a WADD (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is the same for both processors)?  For a vector length of 1:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  For a vector length of 16:  The traditional vector processor:  For a vector length of 16:  The traditional vector processor: | ) | Which one is more costly in terms of chip area (circle one)?                                              |       |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|-----------------------------------------------------------------------------------------------------------|-------|
| Assuming the latency of an addition operation is five cycles in both processors, how long will a VADD (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is the same for both processors?  For a vector length of 1:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional vector processor:  For a vector length of 16:  The traditional vector processor:                                                                                                                           |   | The traditional vector processor                                                                          | Neith |
| (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is the same for both processors)?  For a vector length of 1:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional array processor:  For a vector length of 16:  The traditional vector processor:                                                                                                                                |   | Justify your answer.                                                                                      |       |
| (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is the same for both processors)?  For a vector length of 1:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional array processor:  For a vector length of 16:  The traditional vector processor:                                                                                                                                |   |                                                                                                           |       |
| (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is the same for both processors)?  For a vector length of 1:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional array processor:  For a vector length of 16:  The traditional vector processor:                                                                                                                                |   |                                                                                                           |       |
| (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is the same for both processors)?  For a vector length of 1:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional vector processor:  For a vector length of 4:  The traditional vector processor:  The traditional array processor:  For a vector length of 16:  The traditional vector processor:                                                                                                                                | ı |                                                                                                           | l     |
| The traditional vector processor:  The traditional array processor:  For a vector length of 4: The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                         |   | (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and |       |
| The traditional array processor:  For a vector length of 4: The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                            |   | For a vector length of 1:                                                                                 |       |
| For a vector length of 4: The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                              | 1 | The traditional vector processor:                                                                         | i     |
| For a vector length of 4: The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                              |   |                                                                                                           |       |
| The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |   | The traditional array processor:                                                                          |       |
| The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |   |                                                                                                           | Í     |
| The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |   |                                                                                                           |       |
| The traditional vector processor:  The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | ١ |                                                                                                           |       |
| The traditional array processor:  For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |   | For a vector length of 4:                                                                                 |       |
| For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |   | The traditional vector processor:                                                                         |       |
| For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |   |                                                                                                           |       |
| For a vector length of 16: The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |   |                                                                                                           |       |
| The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |   | The traditional array processor:                                                                          |       |
| The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |   |                                                                                                           |       |
| The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |   |                                                                                                           |       |
| The traditional vector processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |   | For a reaction largeth of 16.                                                                             |       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |   |                                                                                                           |       |
| The traditional array processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | [ | The traditional vector processor:                                                                         | ĺ     |
| The traditional array processor:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |   |                                                                                                           |       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | l | The traditional array processor:                                                                          |       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |   |                                                                                                           |       |

# 6 GPUs and SIMD I

We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of a program. The following code segment is run on a GPU. Each thread executes a single iteration of the shown loop. Assume that the data values of the arrays A and B are already in vector registers so there are no loads and stores in this program. (Hint: Notice that there are 2 instructions in each thread.) A warp in the GPU consists of 32 threads, there are 32 SIMD lanes in the GPU. Assume that each instruction takes the same amount of time to execute.

| for | (i = 0; i < N; i++) { if (A[i] % 3 == 0) { // Instruction 1                                                                                                                                                                                           |
|-----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| }   | <pre>A[i] = A[i] * B[i]; // Instruction 2 }</pre>                                                                                                                                                                                                     |
| (a) | How many warps does it take to execute this program? Please leave the answer in terms of $N$ .                                                                                                                                                        |
|     |                                                                                                                                                                                                                                                       |
| (b) | Assume integer arrays A have a repetitive pattern which have 24 ones followed by 8 zeros repetitively and integer arrays B have a different repetitive pattern which have 48 zeros followed by 64 ones. What is the SIMD utilization of this program? |
|     |                                                                                                                                                                                                                                                       |
| (c) | Is it possible for this program to yield a SIMD utilization of 100%? Circle one.                                                                                                                                                                      |
|     | YES NO                                                                                                                                                                                                                                                |
|     | If YES, what should be true about array A for the SIMD utilization to be 100%?                                                                                                                                                                        |
|     |                                                                                                                                                                                                                                                       |
|     | What should be true about array B?                                                                                                                                                                                                                    |
|     |                                                                                                                                                                                                                                                       |
|     | If NO, explain why not.                                                                                                                                                                                                                               |
|     |                                                                                                                                                                                                                                                       |

|                                       | YES                        | NO                             |  |
|---------------------------------------|----------------------------|--------------------------------|--|
| If YES, what should be true about     | out array A for the        | SIMD utilization to be 56.25%? |  |
|                                       |                            |                                |  |
|                                       |                            |                                |  |
|                                       |                            |                                |  |
| What should be true about arr         | ay B?                      |                                |  |
|                                       |                            |                                |  |
|                                       |                            |                                |  |
|                                       |                            |                                |  |
| If NO, explain why not.               |                            |                                |  |
| ii wu, expiain why not.               |                            |                                |  |
|                                       |                            |                                |  |
|                                       |                            |                                |  |
|                                       |                            |                                |  |
| Is it possible for this program t     | to yield a SIMD u          | ilization of 50%? Circle one.  |  |
| Is it possible for this program t     | to yield a SIMD ut<br>YES  | ilization of 50%? Circle one.  |  |
| Is it possible for this program t     | YES                        | NO                             |  |
|                                       | YES                        | NO                             |  |
|                                       | YES                        | NO                             |  |
|                                       | YES                        | NO                             |  |
| If YES, what should be true about     | YES<br>out array A for the | NO                             |  |
|                                       | YES<br>out array A for the | NO                             |  |
| If YES, what should be true about     | YES<br>out array A for the | NO                             |  |
| If YES, what should be true about     | YES<br>out array A for the | NO                             |  |
| If YES, what should be true about     | YES<br>out array A for the | NO                             |  |
| If YES, what should be true about arr | YES<br>out array A for the | NO                             |  |
| If YES, what should be true about     | YES<br>out array A for the | NO                             |  |
| If YES, what should be true about arr | YES<br>out array A for the | NO                             |  |

Now, we will look at the technique we learned in class that tries to improve SIMD utilization by merging divergent branches together. The key idea of the *dynamic warp formation* is that threads in one warp can be swapped with threads in another warp as long as the swapped threads have access to the associated registers (i.e., they are on the same SIMD lane).

Consider the following example of a program that consists of 3 warps X, Y and Z that are executing the same code segment specified at the top of this question. Assume that the vector below specifies the direction of the branch of each thread within the warp. 1 means the branch in Instruction 1 is resolved to taken and 0 means the branch in Instruction 1 is resolved to not taken.

|     | <pre>X = {100000000000000000000000000000000000</pre>                                                                                                                                         |
|-----|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| (f) | Given the example above. Suppose that you perform dynamic warp formation on these three warps What is the resulting outcome of each branch for the newly formed warps $X'$ , $Y'$ and $Z'$ . |
|     |                                                                                                                                                                                              |
|     |                                                                                                                                                                                              |
|     |                                                                                                                                                                                              |
| (g) | Given the specification for arrays ${\tt A}$ and ${\tt B}$ , is it possible for this program to yield a better SIMD utilization if dynamic warp formation is used? Explain your reasoning.   |
|     |                                                                                                                                                                                              |
|     |                                                                                                                                                                                              |
|     |                                                                                                                                                                                              |

### 7 GPUs and SIMD II

We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of a program. As we saw in lecture and practice exercises, the SIMD utilization of a program is computed across the complete run of the program.

The following code segment is run on a GPU. Each thread executes **a single iteration** of the shown loop. Assume that the data values of the arrays A, B, and C are already in vector registers so there are no loads and stores in this program. (Hint: Notice that there are 6 instructions in each thread.) A warp in the GPU consists of 64 threads, and there are 64 SIMD lanes in the GPU. Please assume that all values in array B have magnitudes less than 10 (i.e., |B[i]| < 10, for all i).

```
for (i = 0; i < 1024; i++) {
    A[i] = B[i] * B[i];
    if (A[i] > 0) {
        C[i] = A[i] * B[i];
        if (C[i] < 0) {
              A[i] = A[i] + 1;
        }
        A[i] = A[i] - 2;
    }
}</pre>
```

(a) How many warps does it take to execute this program?



(b) What is the maximum possible SIMD utilization of this program?

| (c) | Please describe what needs to be true about array B to reach the maximum possible SIMD utilization asked in part (b). (Please cover all cases in your answer) |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
| (d) | What is the minimum possible SIMD utilization of this program?                                                                                                |
| (u) | what is the minimum possible ShviD utilization of this program:                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     | Please describe what needs to be true about array B to reach the minimum possible SIMD utilization asked in part (d). (Please cover all cases in your answer) |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |
|     |                                                                                                                                                               |

### 8 GPUs and SIMD III

We define the SIMD utilization of a program that runs on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of the program. As we saw in lecture and practice exercises, the SIMD utilization of a program is computed across the complete run of the program.

The following code segment is run on a GPU. Each thread executes **a single iteration** of the shown loop. Assume that the data values of the arrays A and B are already in vector registers, so there are no loads and stores in this program. (Hint: Notice that there are 3 instructions in each iteration.) A warp in the GPU consists of 32 threads, and there are 32 SIMD lanes in the GPU.

Please answer the following six questions.

| ue instructions v | when no threads | s are active). |  |  |
|-------------------|-----------------|----------------|--|--|
|                   |                 |                |  |  |
|                   |                 |                |  |  |
|                   |                 |                |  |  |
|                   |                 |                |  |  |
|                   |                 |                |  |  |
|                   |                 |                |  |  |
|                   |                 |                |  |  |
|                   |                 |                |  |  |
|                   |                 |                |  |  |

| (c) | Please describe what needs to be true about array A to reach the maximum possible SIMD utilization asked in part (b). (Please cover all cases in your answer.)            |  |  |  |  |  |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
| >   |                                                                                                                                                                           |  |  |  |  |  |
| (d) | What is the <i>minimum</i> possible SIMD utilization of this program?                                                                                                     |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
| (e) | Please describe what needs to be true about array $\mathbb{A}$ to reach the minimum possible SIMD utilization asked in part (d). (Please cover all cases in your answer.) |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |
|     |                                                                                                                                                                           |  |  |  |  |  |

### 9 GPUs and SIMD IV

We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of a program.

The following code segment is run on a GPU. Each thread executes **a single iteration** of the shown loop. Assume that the data values of the arrays A, B, and C are already in vector registers so there are no loads and stores in this program. (Hint: Notice that there are 6 instructions in each thread.) A warp in the GPU consists of 64 threads, and there are 64 SIMD lanes in the GPU.

```
for (i = 0; i < 4096; i++) {
    if (B[i] < 8888) {
        A[i] = A[i] * C[i];
        A[i] = A[i] + B[i];
        C[i] = B[i] + 1;
    if (B[i] > 8888) {
        A[i] = A[i] * B[i];
    }
}
(a) How many warps does it take to execute this program?
(b) When we measure the SIMD utilization for this program with one input set, we find that it is 134/320.
    What can you say about arrays A, B, and C? Be precise (Hint: Look at the "if" branch).
   A:
   В:
   C:
```

| (c) | Is it possible for this program to yield a SIMD utilization of $100\%$ (circle one)?                       |               |
|-----|------------------------------------------------------------------------------------------------------------|---------------|
|     | YES NO                                                                                                     |               |
|     | If YES, what should be true about arrays A, B, C for the SIMD utilization to be 100%? NO, explain why not. | Be precise. I |
|     |                                                                                                            |               |
|     |                                                                                                            |               |
|     |                                                                                                            |               |
| (d) | What is the lowest SIMD utilization that this program can yield? Explain.                                  |               |
|     |                                                                                                            |               |
|     |                                                                                                            |               |
|     |                                                                                                            |               |

### 10 GPUs and SIMD V

We define the SIMD utilization of a program that runs on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of a program. As we saw in lecture and practice exercises, the SIMD utilization of a program is computed across the complete run of the program.

The following code segment is run on a GPU. A warp in the GPU consists of 32 threads, and there are 32 SIMD lanes in the GPU. Each thread executes **a single iteration** of the shown loop. Assume that the data values of the arrays A and B are already in vector registers so there are no loads and stores in this program. Both A and B are arrays of integers. (Hint: Notice that there are 5 instructions in each iteration.)

Please answer the following questions.

| scheduler | does not issue | instructions | s where no t | hreads are ac | tive). |  |  |
|-----------|----------------|--------------|--------------|---------------|--------|--|--|
|           |                |              |              |               |        |  |  |
|           |                |              |              |               |        |  |  |
|           |                |              |              |               |        |  |  |
|           |                |              |              |               |        |  |  |
|           |                |              |              |               |        |  |  |
|           |                |              |              |               |        |  |  |
|           |                |              |              |               |        |  |  |
|           |                |              |              |               |        |  |  |

| (c) What is the minimum possible SIMD utilization of this program? |                                                                                                                                                |  |  |  |  |  |  |
|--------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
| (d)                                                                | What needs to be true about array B to achieve the minimum possible SIMD utilization? Show your work. (Please cover all cases in your answer.) |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
| (e)                                                                | If $B[0] = 0$ , what is the maximum possible SIMD utilization of this program?                                                                 |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |
|                                                                    |                                                                                                                                                |  |  |  |  |  |  |

| (f) | What needs to be true about array B to achieve the maximum possible SIMD utilization, if $B[0] = Show$ your work. (Please cover all cases in your answer.) | 0 |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------|---|
|     |                                                                                                                                                            |   |
|     |                                                                                                                                                            |   |
|     |                                                                                                                                                            |   |
|     |                                                                                                                                                            |   |
|     |                                                                                                                                                            |   |
|     |                                                                                                                                                            |   |
|     |                                                                                                                                                            |   |
|     |                                                                                                                                                            |   |