## Computer Architecture and Engineering - 3rd Year - ETSINF - UPV

Exercises and solutions of Unit 2. "Pipelined computers"

# **Basic pipelining**

# MIPS data path

**Ejercicio 2.1.** A MIPS64-compatible processor integrates a pipeline of 5 stages:

**IF:** Instructions fetch.

**ID:** Instructions decoding and reading of register operands.

**EX:** ALU operation and PC updating in branch instructions.

M: Memory access in load and store instructions.

WB: Writing results back to registers

The machine solves control hazards using the delayed branching. It also integrates two separated caches, one for instructions and another for data, and a register file providing two read and one write ports.

n iterations of the following loop are executed on the processor:

```
L: ld $t1,X($t2)
dadd $t1,$t1,$t3
sd $t1,X($t2)
daddi $t2,$t2,8
daddi $t4,$t4,-1
bnez $t4,L
nop
nop
```

- 1. Draw a diagram indicating, for each instruction the clock cycle and the stage it has under completion. Consider only the first loop iteration. Compute the CPI and the execution time obtained according to the number of iterations n. Consider the following cases:
  - a) Data hazards are solved using stalls.
  - b) Data hazards are solved using short circuits.

#### Solución:

1. Since the PC is written in the third cycle, the *branch delay slot* is of two instructions:

```
d: bnez $t4,L IF ID EX M WB d+4: ds1 IF ID EX M WB d+12/L: IF ID EX M WB
```

a) Original code with stalls.

For one iteration, 8 instructions are executed in 14 clock cycles. As a result:

$$CPI = \frac{14}{8} = 1,75 \text{ ciclos}$$

And the execution time for n iterations is  $14 \cdot n$  cycles:

$$T_{ex} = I \cdot CPI \cdot T = 8n \cdot 1,75 \cdot T = 14 \cdot n \cdot T \text{ seg}$$

b) Original code with forwarding.

For one iteration, 8 instructions are executed in 9 cycles. So:

$$CPI = \frac{9}{8} = 1{,}125$$

And the execution time for n iterations is  $9 \cdot n$  cycles:

$$T_{ex} = I \times CPI \times T = 8n \times 1{,}125 \times T = 9 \cdot n \cdot T \text{ seg}$$

# Ejercicio 2.2. The following high-level code is available:

The type int and pointers take 32 bits. Constant NULL is represented with a 0. The compiler produces a loop as the following one:

```
; cont is located in R2
; p is located in R1
...
eti: add r2, r2, #1
   lw r1, 40(r1)
   bnez r1, eti
   nop   ; as many nops as necessary
```

This code executes on different versions of a processor pipelined in the following 5 stages:

IF: Instruction fetch.

**ID:** Instruction decoding and reading of register operands (during the  $2^{nd}$  part of the clock cycle).

**EX:** ALU operation.

**ME:** Memory access in loads and stores.

**WB:** Result is written back to the destination register (during the  $1^{st}$  part of the clock cycle).

The processor works at 100 MHz. It solves data hazards using the short-circuit technique and control hazards using delayed branching.

Compute, for each one of the following cases, the *branch delay slot* (i.e. how many instructions are affected by the delayed branch) and the CPI, MIPS and execution time of the processor. Assume in your computation that the loop is executed 10.000 times:

- 1. The computation of the effective address and the branch condition, and the update of the PC are performed during the ID stage. A *Harvard* architecture is available (i.e. data and instructions cache are separated).
- 2. The computation of the effective address and the branch condition, and the update of the PC are performed during the ID stage, but a shared instruction and data cache is used.
- 3. The computation of the effective address and the branch condition is performed during the EX stage, and the update of the PC is carried out during the MEM stage. In this case, a *Harvard* architecture is considered.

## Solución:

1. PC updating in ID; *Harvard* architecture.

The *branch delay slot* is of 1 instruction, so 4 instructions are executed in each loop iteration. The data hazard between *lw* and *bnez* requires two stalls. So, the total number on cycle per iteration is 6. As a result:

$$\overline{\text{CPI}} = \frac{6}{4} = 1,5 \text{ cycles}$$
 
$$\text{MIPS} = \frac{I}{T_{ej} \cdot 10^6} = \frac{I}{I \cdot CPI \cdot T \cdot 10^6} = \frac{f(\text{en MHz})}{\text{CPI}} = \frac{100}{1,5} = 66,7 \text{ MIPS}$$

In order to process 10.000 iterations 60.000 clock cycles will be required. Since the clock cycle is of 10 ns, this makes a total of 600.000 ns = 0.6 ms. The same resulta can be obtain if the equation of the execution time is applied with  $I = 4 \cdot 10^4$  instructions:

$$T(10000 \text{ iterations}) = (4 \cdot 10^4) \cdot 1.5 \cdot \frac{1}{100 \cdot 10^6} = 6 \cdot 10^{-4} \text{ s.}$$

2. PC updating in ID; common instructions and data cache.

The *branch delay slot* is also of 1 instruction: 4 instructions being executed in each iterations. The structural hazard between *lw* and *nop* overlaps the data hazard between *load* and *bnez*. So, the loop consumes 6 cycles per iteration and the resulting parameters are the same computed in the previous section.

3. PC updating in MEM; *Harvard* architecture.

The *branch delay slot* is of 3 instructions: 6 instructions being executed per iteration. The data hazard between *lw* and *bnez* requires only one stall. The loop consumes 7 cycles per iteration. The following parameters are obtained:

$$\overline{\text{CPI}} = \frac{7}{6} = 1,\!17 \text{ cycles}$$
 
$$\text{MIPS} = \frac{100}{1,\!17} = 85,\!7 \text{ MIPS}$$

In order to process 10.000 iterations 70.000 cycles will be required. Since the clock period is of 10 ns, the execution time is 700.000 ns = 0.7 ms

**Ejercicio 2.3.** A 2000\$ computer incorporates a MIPS/LC processor. This processor has a MIPS-like pipelined instruction unit with 5 stages (IF: instruction fetch, ID: instruction decode, EX: execution, MEM: memory access, and WB: write into the register file). It incorporates a single cache for data and instructions, its *branch slot* is of 1 instruction, it works at 80MHz and it solves data hazards using the short-circuiting technique. A public domain compiler is installed in the considered computer. It compiles the following code:

```
do {
  if (v[i] != 0) {
    temp = v[i];
    v[i] = w[i];
    w[i] = temp;
  }
  i = i-1;
} while (i != 0);
```

and generates the following one:

The loop is typically applied on vectors containing a 80 % of components whose value is 0.

- 1. Compute the average CPI for handling vectors with big sizes (with n elements).
- 2. Assume that the original loop represents a normal workload of this computer. In order to improve the performance of the system, two options are considered:

- Replace the existing processor with the new version MIPS/ST, which incorporates separate data and instruction caches. The resulting computer increases its price in 200\$.
- Buy a new commercial compiler, whose cost is of 200\$, able to optimize the generated code by reducing in 3 clock cycles each loop iteration in which  $v[i] \neq 0$  and in 2 cycles each iteration in which v[i] = 0. The number of instructions is not modified.

Is any one of these options interesting? If it is, which one should be applied? Reason your answers from a cost/performance perspective, and assume that only 200\$ are available for investment.

### Solución:

- 1. When  $v(r1) \neq 0$ , 9 instructions are executed in the loop. In order to solve data hazards, 6 stalls are inserted:
  - Two cycles are required by the beqz r2, eti2 before its ID stage receive the value of r2, which is generated by ld r2, v(r1). Remember that the *branch delay slot* is of 1 instructions, the PC is updated in the ID stage, so the condition must be computed in the same stage.
  - 3 cycles are generated by the rest of load and store instructions, since an structural hazard is generated by the share of instructions and data cache.
  - 1 cycle is required by bnez r1, etil before its ID stage receive the value of r1, which is modified by daddi r1, r1, -8.

The total number of cycles required is 9+2+3+1=15.

If v(r1) = 0, then 6 instructions are executed. Stalls are those corresponding to instructions begz and bnez, because the rest of loads and stores are not executed. So the total number of cycles is then 6+2+1=9.

In order to compute the average CPI the zeros in the vector v must be taken into account:

$$\overline{\textit{CPI}} = \frac{\text{Number of cycles}}{\text{Number of instructions}} = \frac{15 \cdot 0.2 + 9 \cdot 0.8}{9 \cdot 0.2 + 6 \cdot 0.8} = 1.55$$

- 2. Improvements in the average execution time of the loop, expressed in cycles in one both cases:
  - new processor, in which 3 structural hazards are eliminated only when  $v(r1) \neq 0$ :

$$\Delta t = \frac{15 \cdot 0.2 + 9 \cdot 0.8}{12 \cdot 0.2 + 9 \cdot 0.8} = 1.06$$

• new compiler:

$$\Delta t = \frac{15 \cdot 0, 2 + 9 \cdot 0, 8}{12 \cdot 0, 2 + 7 \cdot 0, 8} = 1,28$$

For the same increment of cost (10%), it is clear that the appropriate alternative is to invest the new compiler. On the other hand, the improvement resulting from the processor replacement is lower than the increment of cost, so from the viewpoint of performance and cost, such replacement is not interesting.

**Ejercicio 2.4.** The instruction cycle of an non-pipelined *load/store* processor is defined in terms of the following phases, whose duration is provided within a parenthesis:

- LI (10 ns): instruction fetch.
- DI (5 ns): instruction decoding and source register read.
- EXE (10 ns): computation of effective addresses in L/S instructions, operations in ALU instructions, condition and new PC value in branch instructions.
- EPC (5 ns): PC update in a branch instructions.
- MEM (10 ns): Memory access in L/S instructions
- ER (5 ns): Write the value of destination registers in store and ALU instructions.

The control circuitry manages the aforementioned phases according to the operation code of each instruction. The clock works at 200 MHz, so certain phases executes in 1 cycle and others in 2 cycles. All instruction cycles start with phases LI and DI but depending on the type of instruction, the other phases are (the frequency of each type of instruction is indicated in parenthesis):

- Load instructions (20 %): EXE, MEM and ER
- Store instructions (10 %): EXE and MEM
- ALU instructions (50 %): EXE and ER
- Branch instructions (20%): EXE and EPC

This processor is expected to be pipelined, by using registers with 2 ns of delay and a clock with a null skew. The WPC phase disappears and all the branch-related logic is moved to the DI stage, whose duration is increased to 10 ns. So, the pipelined version of the processor integrates 5 stages LI, DI, EXE, MEM y ERB. Real measures are obtained from this new processor and it is observed that (i) 5 % of load instructions generate 1 stall due to a data hazard and, (ii) the compiler fills 10 % of the *branch delay slot* cases with a nop instruction.

Taking this information into account, compute:

- 1. CPI of the non-pipelined processor.
- 2. Clock frequency of the pipelined processor.
- 3. Number of instructions executed by the pipelined processor with respect to the non-pipelined one.
- 4. CPI of the pipelined processor.
- 5. Speed-up obtained through pipelining.

### Solución:

1. CPI of the non-pipelined processor:

Let us calculate first the value of the CPI for each type of instruction:

| %  | type   | CPI |
|----|--------|-----|
| 20 | load   | 8   |
| 10 | store  | 7   |
| 50 | ALU    | 6   |
| 20 | branch | 6   |

The weighten average is:

$$\overline{CPI}_{NS} = 0.20 \cdot 8 + 0.10 \cdot 7 + 0.50 \cdot 6 + 0.2 \cdot 6 = 6.5$$
 cycles

2. Frequency of the pipelined processor.

The maximum stage delay is 10 ns, and the one related to a register is 2 ns. As a result, the clock period is  $t_S = 10 + 2 = 12$  ns and the frequency:

$$f = \frac{1}{t_S} = \frac{1}{12 * 10^{-9}} = 83.3 * 10^6 \text{ Hz} = 83.3 \text{ MHz}$$

3. Relation between the number of executed instructions.

The program of the pipelined processor will have the same number of instructions than the one of the non-pipelined processor, plus the *nop* instructions that in 10% of the cases follow to the 20% of branches. As the branch logic is located in the ID stage, the *branch delay slot* is of only 1 instruction. As a result,

$$I_S = I_{NS}(1+0.1*0.2) = 1.02*I_{NS}$$
 cycles

## 4. CPI of the pipelined processor

Each fetched instructions contributes with 1 cycle to the execution time. The cycles lost due to hazards must be also considered in 5 % of the 20 % of loads. It is worth noting that the number of instructions increases in this case due to *nop* instructions:

| Quantity           | Type   | CPI                    |
|--------------------|--------|------------------------|
| 20                 | load   | $1+(0,05\cdot 1)=1,05$ |
| 10                 | store  | 1                      |
| 50                 | ALU    | 1                      |
| 20                 | branch | 1                      |
| $2(0, 1 \cdot 20)$ | nop    | 1                      |
| 102                | total  |                        |

The weighted average is:

$$\overline{CPI}_S = \frac{20 \cdot 1,05 + 10 \cdot 1 + 50 \cdot 1 + 20 \cdot 1 + 2 \cdot 1}{102} = 1,01 \text{ cycles}$$

Another way of obtaining the same result is to consider that, when the processor is pipelined, the CPI will be 1 plus the average number of stalls originated by each instruction. In our case, only 5 % of loads (which are 20 out of each 102) will insert 1 stall. So:

$$\overline{CPI}_S = 1 + \frac{0.2 * 0.05}{1.02} = 1.01 \text{ cycles}$$

5. The speed-up resulting from pipelining.

By dividing the execution time of the non-pipelined and the pipelined processor, we obtain:

$$S = \frac{T_{NS}}{T_S} = \frac{I_{NS} * CPI_{NS} * t_{NS}}{I_S * CPI_S * t_S} = \frac{I_{NS} * 6.5 * 5}{1,02 * I_{NS} * 1,01 * 12} = 2,63$$

# Ejercicio 2.5.

A 5-stage pipelined processor is available (IF: instruction fetch; ID: instruction decoding and register file read; EX: ALU operation; MEM: memory access and WB: write to the register file). The processor integrates the MIPS instruction set and it has a separate instruction and data cache. Its clock frequency is of 200 MHz. Data and control hazards are solved using respectively the *forwarding* technique and inserting 2 stalls each time a branch instruction appears.

Programs execute, in average, 18 % of branch, 39 % of load/store and 43 % of arithmetic instructions. Loads are 2 times more frequent than stores. *bytes* and *halfwords* accesses are 20 % of all memory accesses. The frequency of data hazards between a LOAD instruction and another following one consuming the data loaded from memory is the following one:

| Frequency: 25 %         | Frequency: 15 %            |  |  |  |  |  |
|-------------------------|----------------------------|--|--|--|--|--|
| LOAD R1,                | LOAD R1,                   |  |  |  |  |  |
| Instructions reading R1 | Instruction not reading R1 |  |  |  |  |  |
| •••                     | Instruction reading R1     |  |  |  |  |  |

In order to improve the performance, the following modifications are proposed:

• Remove from the instruction set those instructions providing access to *bytes* and *halfwords*. As a result, programs requiring such functionality must use other processor instructions:

• Since the processor design has been simplified, increase its clock frequency.

Answer the following questions:

- 1. CPI of the original processor.
- 2. Number of instructions of the modified processor.
- 3. CPI of the modified processor.
- 4. The minimum clock frequency that should be attained in order to justify the inclusion of the proposed modifications.

#### Solución:

1. CPI of the original processor.

Since the processor is pipeline, its CPI will be 1 plus the average number of stalls per instruction. In this case, stalls are inserted in the case of branches (18 %, 2 stalls) and when data hazards are provoked by a load and an instruction following that one consuming the loaded data (25 %, 1 stalls). It is known that there are 39 % of load/store and loads are twice more frequent than stores. If c and a are load and store percentages, respectively, one obtains that:

$$\begin{array}{rcl}
c+a & = & 39 \\
c & = & 2a
\end{array}$$

Solving, c = 26 % y s = 13 %.

As a result, stalls are:

$$18\% \cdot 2(Branches) + 26\% \cdot 0.25 \cdot 1(Loads) = 0.425$$

y CPI = 1.43 cycles.

2. Number of instructions of the modified processor.

The improved processor executes the same originnal instructions, minus those replaced, plus those equivalent to the replaced ones.

Since accesses to *bytes and halfwords* are 20 % of all memory access, the frequency of occurrence of LB/LH and SB/SH is:

Frequency LB/LH =  $0.26 \cdot 0.20 = 0.052$ 

Frequency SB/SH =  $0.13 \cdot 0.20 = 0.026$ 

Each LB/LH instruction is replaced by 3 new instructions, and each SB/SH instruction is replaced by 8 new instructions. As a result, the new number of instructions is:

$$I' = I - 0.052 \cdot 1 + 0.052 \cdot 3 - 0.026 \cdot 1 + 0.026 \cdot 8 = 1,286I$$

3. CPI of the modified processor.

The number of stalls is modified due the fact that the replacement of LB/LH always introduces 1 stall, despite it provoked before an stall or not. The total number of instructions is also changed: Number of stalls:  $\frac{0.18}{1.29} *2(Branches) + \frac{0.26}{1.29} *0.8*0.25*1(LW) + \frac{0.26}{1.29} *0.2*1(LB/LH) = 0,359$  CPI'=1,36 cycles.

4. The minimum clock frequency that should be attained in order to justify the inclusion of the proposed modifications.

Comparing the execution time of the proposed improved wrt the original configuration.

Original:  $Tex = I \times CPI \times T = I \cdot 1,43 \cdot 5 = 7,15I$  ns

Improvement:  $Tex' = I' \times CPI' \times T' = 1,286I \cdot 1,359 \cdot T' = 1,75IT'$  ns

As a result:

T' = 4.08 ns. So, the frequency should increase until 245 MHz in order to make the proposal of improvement worthwhile.

**Ejercicio 2.6.** A computer with memory-register architecture has a 6-stage pipelined instruction cycle:

- F: Instruction fetch and PC increment.
- RF: Decoding and register file read  $(2^{nd} \text{ part of the clock cycle})$ .
- ALU1: Computation of the effective address in memory accesses and branches.
- MEM: Memory access.
- ALU2: Arithmetic operation, evaluation of branch conditions and new PC updating.
- WB: Write to the register file  $(1^{st}$  part of the clock cycle)

All instructions are executed in 6 stages. There are two types of arithmetic instructions:

Type R: ALUop Rd, Rs, Rt

# Type M: ALUop Rd,Rs,desp(Rt)

- 1. In order to avoid structural hazards, how many adders are (at least) required for pipelining purposes? Determine the minimum number of read/write ports in the file register and in the memory, in order to avoid structural hazards.
- 2. If the machine uses the delayed branch technique, which is the value of its delay- slot?
- 3. Is it interesting to apply a *predict-taken* strategy towards positions in the code preceding the branch? If it is, how many clock cycles of penalty are induced by correctly predicted branches?

### Solución:

- 1. Resources necessary in order to avoid structural hazards.
  - Number of adders. 3 different stages may require an adder:

IF: in order to increment the PC

ALU1: in order to compute the effective address of any memory reference

ALU2: in order to carry out an ALU operation

■ Register file: 2 register are read in RF, and 1 register is written in WB. So, 1 write and 2 read ports are, at least, required.

On the other hand, since the register file has a shared cycle, an optimization can be to have **a single** port combined for reading/writing accesses plus another one only for reading accesses.

- Memory: Instructions are fetched in IF and data is loaded or stored in MEM. So, two ports are required: one for reading and another one for reading/writing. An alternative may consist in implementing a Harvard architecture, thus separating instructions cache from data cache.
- 2. Size of the delay slot.

Since the PC is updated in ALU2, it is when a branch executes WB when instructions can be fetched at the branch destination address. So, up to 4 instructions are fetched and start their execution between the branch and the instruction at the branch destination address.

| Branch      | IF | RF | A1 | ME | <b>A2</b> | WB |    |
|-------------|----|----|----|----|-----------|----|----|
| instr. 1    |    | IF | RF | A1 | ME        | A2 | WB |
| instr. 2    |    |    | IF | RF | A1        | ME | A2 |
| instr. 3    |    |    |    | IF | RF        | A1 | ME |
| instr. 4    |    |    |    |    | IF        | RF | A1 |
| Destination |    |    |    |    |           | IF | RF |

3. Interest of the *predict-taken* strategy.

Yes, since:

- The destination address is known before the condition.
- Conditional branches use to be taken in a high percentage of the case.

The penalty when the condition is met is of two cycles:

| Branch      | IF | RF | A1 | ME | A2 | WB |    |
|-------------|----|----|----|----|----|----|----|
| instr. 1    |    | IF | RF |    |    |    |    |
| instr. 2    |    |    | IF |    |    |    |    |
| Destination |    |    |    | IF | RF | A1 | ME |

## Ejercicio 2.7.

The following instruction-time diagrams correspond to the execution of various code fragments from various computers. Determine for each of them, which technique is used for solving data hazards (stall insertion or short-circuit) and control hazards (stall insertion, *predict-not-taken* or delayed branch), and in which stage is the PC updated.

```
1. L
       LW r2, a(r1)
                      IF ID EX M
                                  WB
  L+4
        ADD r3, r2, r3
                         IF ID ID EX M
                     IF IF ID EX M WB
       ADD r3, r4, r3
  L+8
  L+12 SUB r1, r1, #4
                                  IF ID EX M WB
  L+16 BNEZ rl, L
                                      IF ID EX M
                                                  WB
  L+20
       SW z(r0), r3
                                         IF ID
  L+24 ADD r3,r0,r0
                                            ΙF
        LW r2, a(r1)
                                               IF ID EX M WB
2. T.
        LW r2, a(r1)
                      IF ID EX M WB
  L+4
        ADD r3, r2, r3
                         IF ID ID EX M WB
                            IF IF ID EX M WB
  L+8
        ADD r3, r4, r3
  L+12 SUB r1, r1, #4
                                  IF ID EX M WB
  L+16 BNEZ r1, L
                                      IF ID EX M
                                                  WB
 L+20
        SW z(r0), r3
                                         IF IF IF
        LW r2, a(r1)
                                                  IF ID EX M WB
3. L
        LW r2, a(r1)
                      IF ID EX M WB
                       IF ID EX M WB
  L+4
        SUB r1, r1, #4
        ADD r3, r2, r3
  L+8
                            IF ID ID EX M WB
  L+12 BNEZ r1, L
                               IF IF ID EX M WB
  L+16 ADD r3, r4, r3
                                      IF ID ID EX M
        LW r2, a(r1)
                                         IF IF ID EX M WB
```

### Solución:

- 1. Data hazards: short-circuit; control hazard: predict-not-taken, PC updating in EX.
- 2. Data hazards: short-circuit; control hazard: stall insertion, PC updating in M.
- 3. Data hazards: stall insertion; control hazard: delayed branch, PC update in ID.

# **Multicycle operators**

**Ejercicio 2.8.** A MIPS processor is available. It supports the following multi-cycle instructions:

- Pipelined Adder / Subtracter (Tev=2, IR=1)
- Pipelined Multiplier (Tev=3, IR=1)
- Conventional Divider (Tev=4, IR=1/4)

In this processor, structural and data hazards are detected during the ID stage, inserting stall cycles when necessary. Short-circuits are also used.

Show the execution diagram of the following fragment of code:

```
L.D F1, 0(R1)
DIV.D F4, F0, F1
ADD.D F2, F3, F4
L.D F4, 4(R1)
MULT.D F3, F4, F2
L.D F5, 8(R1)
```

The supplied diagram must represent the stages crossed by each instruction using the following notation: IF (fetching stage), ID (decoding stage), EX (monocyle execution stage), A1, A2 (adder/subtracter execution stages), M1, M2, M3 (multiplier execution stages), D1, D2, D3, D4 (divider execution stages), ME (memory access stage) and WB (register writing stage).

#### Solución:

## Ejercicio 2.9.

A processor executes the following loop computing  $\vec{z} = A\vec{x} + B\vec{y}$ :

```
loop: 1.d F0,x(r10)
    1.d F1,y(r11)
    mult.d F4,F2,F0; F2 contains A.
    mult.d F5,F3,F1; F3 contains B.
    add.d F6,F4,F5
    daddi r14,r14,-1
    daddi r10,r10,8
    daddi r11,r11,8
    s.d F6,z(r12)
    daddi r12,r12,8
    bnez r14,loop
    nop
```

The processor provides two register files to store integer and floating point data. In addition, the following multi-cycle operators are available for floating point operations:

- A pipelined multiplier with  $T_{ev} = 5$  and IR = 1, with stages M1, M2, etc.
- A non-pipelined adder with  $T_{ev} = 3$  and IR = 1/3, with stages A1, A2, etc.

Other instructions are executed using a classic pipeline of 5 stages (IF,ID,EX,ME,WB). Data hazards are solved using short-circuits and inserting stalls whenever necessary. Conditional branches use the *predict-not-taken* technique. Branch condition and destination address are computed along ID stage. If the branch is taken, then the PC is updated with the new destination address at the end of such stage.

It is required:

1. The instructions-time diagram of the first loop integration and the first instruction of second loop iteration.

- 2. Assuming the equality of all loop iterations, compute the average loop CPI for n iterations.
- 3. In order to speed up the loop execution, two alternatives are under study: a) replace the non-pipelined adder by a pipelined one with Tev=3 and IR=1, of b) replace the pipelined multiplier by a non-pipelined one with Tev=2 and IR=1/2. Which option is the most suitable one for reducing the loop execution time? Justify your answer.

### Solución:

1. Instruction-time diagram of the first loop iteration, including the first instruction of the second loop iteration:

```
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
1.d F0, x(r10)
                  IF ID EX ME WB
1.d F1, y(r11)
                     IF ID EX ME WB
mult.d F4,F2,F0
                        IF ID M1 M2 M3 M4 M5 WB
mult.d F5, F3, F1
                           IF ID M1 M2 M3 M4 M5 WB
add.d F6, F4, F5
                              IF ID ID ID ID A1 A2 A3 WB
daddi r14, r14, -1
                                  IF IF IF IF ID EX ME WB
daddi r10, r10, 8
                                                  IF ID EX ME WB
daddi r11, r11, 8
                                                     IF ID EX ME WB
s.d F6, z(r12)
                                                        IF ID EX ME WB
daddi r12, r12, 8
                                                           IF ID EX ME WB
bnez r14, loop
                                                              IF ID EX ME WB
                                                                 IF X
nop
                                                                     IF ID EX ME WB
1.d F0, x(r10)
```

2. Average CPI for n iterations:

The second iteration starts at cycle 17, so one iteration takes 16 cycles. Since each iteration executes 11 instructions, the average CPI is CPI = 16/11 = 1,45

3. Execution time improvement:

Option a) will not take any impact on the execution time since the adder is not introducing any structural hazard in the loop and the latency of the alternative adder remains the same.

Option b):

```
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
1.d F0, x(r10)
                  IF ID EX ME WB
1.d F1, y(r11)
                     IF ID EX ME WB
mult.d F4,F2,F0
                        IF ID M1 M2 WB
mult.d F5,F3,F1
                           IF ID ID M1 M2 WB
add.d F6,F4,F5
                               IF IF ID ID A1 A2 A3 WB
daddi r14, r14, -1
                                     IF IF ID EX ME WB
daddi r10, r10, 8
                                            IF ID EX ME WB
daddi r11, r11, 8
                                               IF ID EX ME WB
s.d F6, z(r12)
                                                  IF ID EX ME WB
daddi r12, r12, 8
                                                     IF ID EX ME WB
bnez r14, loop
                                                        IF ID EX ME WB
nop
                                                            IF X
1.d F0, x(r10)
                                                               IF ID EX ME WB
```

Option b) improves the execution time since the alternative multiplier has a lower latency and enables the addition to be issued to the operator in cycle 9, which reduces the number of stalls in 2.

# Static instruction scheduling

**Ejercicio 2.10.** A computer providing an MIPS-compatible instruction ser is available. The following sequence of code is executed in this computer:

| i1  |    | L.D F0,X(R1)      |
|-----|----|-------------------|
| i2  |    | MULT.D F0,F0,F4   |
| i3  |    | L.D F2,Y(R1)      |
| i4  |    | ADD.D F0,F0,F2    |
| i5  |    | S.D F0, Y(R1)     |
| i6  |    | DSUB R1,R1,#8     |
| i7  |    | BNEZ R1,L         |
| i8  |    | L.D F0,X(R1)      |
| i9  |    | MULT.D F0,F0,F4   |
| i10 |    | L.D F2,Y(R1)      |
| i11 | L: | DADD R1, R0, #dir |

Identify at least two dependencies of each type in the above fragment of code.

## Solución:

The following figure shows some of the existing dependences in the proposed code.

| Dependencias de datos | Antidependencias  | Dependencias de salida | Dependencias de control |
|-----------------------|-------------------|------------------------|-------------------------|
|                       |                   |                        |                         |
| LD (F0)X(R1)          | LD F0,X(R1)       | LD F0 X (R1)           | LD F0,X(R1)             |
| MULTD FO,F4           | MULTD FO, F0, F4  | MULTD F0,F0,F4         | MULTD F0,F0,F4          |
| LD (F2)Y(R1)          | LD F2,Y(R1)       | LD F2 (R1)             | LD F2,Y(R1)             |
| ADDD FO, F2           | ADDD (F0, F0, F2) | ADDD F0,F0,F2          | ADDD F0,F0,F2           |
| SD Y(R1), F0          | SD Y(R1), F0      | SD Y(R1),F0            | SD Y(R1),F0             |
| SUB (R1) R1,#8        | SUB (R1) R1 ,#8   | SUB R1,#8              | SUB R1,R1,#8            |
| BNEZ R1 L             | BNEZ RI,L         | BNEZ R1,L              | BNEZ R1,L               |
| LD FOX (R1)           | LD (F0, X/(R1)    | LD FOX(R1)             | LD F0,X(R1)             |
| MULTD F0, F0, F4      | MULTD/F0,F0,F4    | MULTO FO,F0,F4         | MULTD F0,F0,F4          |
| LD F2,Y(R1)           | LD F2,Y (R1)      | LD (F2) Y (R1)         | LD F2,Y(R1)             |
| ADD R1,R0,#dir        | ADD R1 R0,#dir    | ADD (R1,)R0,#dir       | ADD R1,R0,#dir          |

**Ejercicio 2.11.** Consider the following MIPs assembler code:

```
L.D F0, 0(R1)
MULT.D F0, F0, F10
ADD.D F0, F0, F11
S.D F0, 0(R1)
DADD R1, R1, #8
BNE R1, R3, loop
```

This code is executed on a MIPS where integer instructions traverse the following pipeline: IF (instruction fetch), ID (instruction decoding, source register reading and hazard detection) EX (execution), M (memory access) y WB (writeback), while floating point instructions follow these stages: IF, ID, En (execution in the corresponding multicycle operator) and WB. Data hazards can be solved using short-circuits, inserting the necessary stalls in ID, and control hazards through *predict-not taken*, updating the PC in ID.

The processor works at 4 GHz and the CPI for ALU instructions is 1.

Multi-cycle functional units have the following features:

| Operator       | Number | Latency  | Type      |
|----------------|--------|----------|-----------|
| Multiplication | 1      | 3 cycles | Pipelined |
| Add/Sub        | 1      | 3 cycles | Pipelined |

- 1. Identify a data dependency, an antidepency, and output dependency and a control dependency in the original code.
- 2. Provide the instructions—time diagram for the first loop iteration. Compute the execution time for n iterations.
- 3. Show the code resulting from the use of the *loop-unrolling* technique. Without providing the instructions—time diagram, compute the execution time of the resulting code and the speed-up (if any) wrt the original one.

### Solución:

- 1. Identified data dependencies
  - Data dependency. Between L.D **F0**, 0(R1) and MULT.D F0, **F0**, F10.
  - Antidependency. MULT.D F0, F0, F10 and ADD.D F0, F0, F11
  - Output dependency. MULT.D **F0**, F0, F10 and ADD.D **F0**, F0, F11
  - Control dependency. BNE R1, R3, loop and any other instruction
- 2. Instructions—time diagram for the first loop iteration in the MIPS processor:

It must be noted that if the branch latency is of 1 instructions, the data dependency DADD/BNE (via R1) provokes an stall in order to apply the short-circuit  $M\rightarrow ID$  during cycle 12; and the prediction miss leads to the cancellation of the instruction following the branch (which appears as follow in the diagram).

Each iteration spends 12 cycles, so the execution time is T(n) = 12n

3. Code after applying the loop-unrolling technique Since 2 stalls are inserted in order to solve data dependencies between MULT.D and ADD.D the loop is unrolled 3 times. So the resulting loop is:

```
L.D F0, 0(R1)
L.D F1, 8(R1)
L.D F2, 16(R1)
MULT.D F0, F0, F10
MULT.D F1, F1, F10
MULT.D F2, F2, F10
ADD.D F0, F0, F11
ADD.D F1, F1, F11
ADD.D F2, F2, F11
S.D F0, 0(R1)
S.D F1, 8(R1)
S.D F2, 16(R1)
DADD R1, R1, #24
BNE R1, R3, loop
```

Now there is no need for stall insertion. Since there are 15 instruction per iteration, and considering that only one third of the original iterations are carried out, the execution time will be:

$$T(n) = \frac{15n}{3}$$

and the improvement over the original code is

$$S = \frac{12n}{15n/3} = 2.4$$

# **Dynamic branch prediction**

## Ejercicio 2.12.

A processor has a dynamic branch predictor of type BTB (*Branch Target Buffer*) that delivers its prediction during the instruction fetch stage. The target address and branch condition are computed at the 3rd stage of the instruction cycle. The probability that a branch is found in the table is 80% and the predictor accuracy is 90%. The branches are effective in 60% of the cases. Questions:

- 1. Show the instructions that will be fetched after the branch and their i–t diagram for the following options:
  - There is no entry in the prediction table and the branch is **not taken**.
  - There is no entry in the prediction table and the branch is **taken**.
  - The predictor predicts that the branch is **not taken** and the branch is finally **not taken**.
  - The predictor predicts that the branch is **not taken** and the branch is finally **taken**.
  - The predictor predicts that the branch is **taken** and the branch is finally **not taken**.
  - The predictor predicts that the branch is **taken** and the branch is finally **taken**.

The instructions should be represented as **Branch**, the branch instruction, **PC+i** (i=1, 2, ...), the subsequent instructions after the branch, **Dest**, the branch target instruction, and **Dest+i** (i=1, 2, ...), the subsequent instructions after the branch target instruction. The instruction stages should be represented as **F1**, **F2**, etc.

- 2. Compute the average CPI for branch instructions.
- 3. Suppose we can modify the BTB design in two ways. The first one increases the number of table entries, so that it stores 90 % of the executed branches. The second one uses a 2-bit predictor so that the prediction accuracy is raised to 95 %. Which modification is able to reduce the CPI of branch instructions?

### Solución:

- 1. Show the instructions that will be fetched after the branch and their i–t diagram for the following options:
  - There is no entry in the prediction table and the branch is **not taken**.

|        | p  |    | d,c |    |    |    |    |    |    |                                 |
|--------|----|----|-----|----|----|----|----|----|----|---------------------------------|
| Branch | F1 | F2 | F3  | F4 | F5 | F6 |    |    |    |                                 |
| PC + 1 |    | F1 | F2  | F3 | F4 | F5 | F6 |    |    | $\rightarrow 0$ penalty cycles. |
| PC + 2 |    |    | F1  | F2 | F3 | F4 | F5 | F6 |    |                                 |
| PC + 3 |    |    |     | F1 | F2 | F3 | F4 | F5 | F6 |                                 |

• There is no entry in the prediction table and the branch is **taken**.

|    |       | p  |    | d,c |    |    |    |    |    |    |                                 |
|----|-------|----|----|-----|----|----|----|----|----|----|---------------------------------|
| Br | anch  | F1 | F2 | F3  | F4 | F5 | F6 |    |    |    |                                 |
| PC | C + 1 |    | F1 | F2  | X  |    |    |    |    |    | $\rightarrow$ 2 penalty cycles. |
| PC | 2 + 2 |    |    | F1  | X  |    |    |    |    |    |                                 |
| De | est   |    |    |     | F1 | F2 | F3 | F4 | F5 | F6 |                                 |

■ The predictor predicts that the branch is **not taken** and the branch is finally **not taken**.

|        | p  |    | d,c |    |    |    |    |    |    |                                 |
|--------|----|----|-----|----|----|----|----|----|----|---------------------------------|
| Branch | F1 | F2 | F3  | F4 | F5 | F6 |    |    |    |                                 |
| PC + 1 |    | F1 | F2  | F3 | F4 | F5 | F6 |    |    | $\rightarrow 0$ penalty cycles. |
| PC + 2 |    |    | F1  | F2 | F3 | F4 | F5 | F6 |    |                                 |
| PC + 3 |    |    |     | F1 | F2 | F3 | F4 | F5 | F6 |                                 |

• The predictor predicts that the branch is **not taken** and the branch is finally **taken**.

|        | p  |    | d,c |    |    |    |    |    |    |                                 |
|--------|----|----|-----|----|----|----|----|----|----|---------------------------------|
| Branch | F1 | F2 | F3  | F4 | F5 | F6 |    |    |    |                                 |
| PC + 1 |    | F1 | F2  | X  |    |    |    |    |    | $\rightarrow$ 2 penalty cycles. |
| PC + 2 |    |    | F1  | X  |    |    |    |    |    |                                 |
| Dest   |    |    |     | F1 | F2 | F3 | F4 | F5 | F6 |                                 |

• The predictor predicts that the branch is **taken** and the branch is finally **not taken**.

|          |    |    |     |    |    |    |    |    |    | _                               |
|----------|----|----|-----|----|----|----|----|----|----|---------------------------------|
|          | p  |    | d,c |    |    |    |    |    |    |                                 |
| Branch   | F1 | F2 | F3  | F4 | F5 | F6 |    |    |    |                                 |
| Dest     |    | F1 | F2  | X  |    |    |    |    |    | $\rightarrow$ 2 penalty cycles. |
| Dest + 1 |    |    | F1  | X  |    |    |    |    |    |                                 |
| PC + 1   |    |    |     | F1 | F2 | F3 | F4 | F5 | F6 |                                 |

• The predictor predicts that the branch is **taken** and the branch is finally **taken**.

|          | p  |    | d,c |    |    |    |    |    |    |                                 |
|----------|----|----|-----|----|----|----|----|----|----|---------------------------------|
| Branch   | F1 | F2 | F3  | F4 | F5 | F6 |    |    |    |                                 |
| Dest     |    | F1 | F2  | F3 | F4 | F5 | F6 |    |    | $\rightarrow 0$ penalty cycles. |
| Dest + 1 |    |    | F1  | F2 | F3 | F4 | F5 | F6 |    |                                 |
| Dest + 2 |    |    |     | F1 | F2 | F3 | F4 | F5 | F6 |                                 |

2. Compute the average CPI for branch instructions.

The following table shows the possible cases, as well as their probability and CPI (1 cycle for the branch instruction plus any stall cycles):

| Case | Probability | CPI |
|------|-------------|-----|
| a    | 0.2*0.4     | 1   |
| b    | 0.2*0.6     | 3   |
| С    | 0.8*0.4*0.9 | 1   |
| d    | 0.8*0.6*0.1 | 3   |
| e    | 0.8*0.4*0.1 | 3   |
| f    | 0.8*0.6*0.9 | 1   |
| CPI  |             | 1.4 |

### Cases:

- a) There is no entry in the prediction table and the branch is **not taken**.
- b) There is no entry in the prediction table and the branch is **taken**.
- c) The predictor predicts that the branch is **not taken** and the branch is finally **not taken**.
- d) The predictor predicts that the branch is **not taken** and the branch is finally **taken**.
- e) The predictor predicts that the branch is **taken** and the branch is finally **not taken**.
- f) The predictor predicts that the branch is **taken** and the branch is finally **taken**.

The average CPI is computed by obtaining the weighted arithmetic mean.

3. Suppose we can modify the BTB design in two ways. The first one increases the number of table entries, so that it stores 90 % of the executed branches. The second one uses a 2-bit predictor so that the prediction accuracy is raised to 95 %. Which modification is able to reduce the CPI of branch instructions?

The following tables show the new values of the probability of each event for each of the options:

■ The size of the prediction table is increased.

| Case                    | Probability | CPI |
|-------------------------|-------------|-----|
| a                       | 0.1*0.4     | 1   |
| b                       | 0.1*0.6     | 3   |
| С                       | 0.9*0.4*0.9 | 1   |
| d                       | 0.9*0.6*0.1 | 3   |
| e                       | 0.9*0.4*0.1 | 3   |
| f                       | 0.9*0.6*0.9 | 1   |
| $\overline{\text{CPI}}$ |             | 1.3 |

■ The accuracy of the prediction is increased.

| Case | Probability  | CPI  |
|------|--------------|------|
| a    | 0.2*0.4      | 1    |
| b    | 0.2*0.6      | 3    |
| С    | 0.8*0.4*0.95 | 1    |
| d    | 0.8*0.6*0.05 | 3    |
| e    | 0.8*0.4*0.05 | 3    |
| f    | 0.8*0.6*0.95 | 1    |
| CPI  |              | 1.32 |

The first option is slightly better.

Consider a processor with an instruction set similar to MIPS and a pipelined unit consisting of the following stages:

- **IF** Instruction fetch.
- **ID** Instruction decode and register read.
- **ALU** Computation of the branch target and memory access addresses.
- **MEM** Memory access.
- **EX1** First execution stage and computation of branch condition.
- EX2 Second execution stage.
- **WB** Register write.

Two branch prediction schemes are considered for implementation in the processor. The predictors that have to be evaluated are: A *Branch Prediction Buffer* and a *Branch Target Buffer*, both of them delivering their prediction at the end of the ID stage. Both mechanisms are implemented with 4 entries in the *buffer* and use a 2-bit predictor whose operation is illustrated in the figure:



For evaluating the prediction mechanisms a test program is used, from which a fragment is shown below:

| Addres | SS      | Instructions    | Address | Instructions   |
|--------|---------|-----------------|---------|----------------|
| 0x03   |         | add r1, r0, r0  | • • •   |                |
|        | lfor:   |                 | 0x10    | add r2, r2, #1 |
|        |         |                 | 0x11    | slt r4, r2, #3 |
| 0x05   |         | beqz r1, lendif | 0x12    | bnez r4, ldo   |
|        |         |                 | lbreak: |                |
|        | lendif: |                 |         |                |
|        |         |                 | 0x15    | add r1, r1, #1 |
|        | ldo:    |                 | 0x16    | slt r6, r1, #2 |
| 0x09   |         | sub r8, r8, r2  | 0x17    | bnez r6, lfor  |
| 0x0A   |         | slt r3, r2, #2  | 0x18    | sw z(r0), r8   |
| 0x0B   |         | seq r4, r1, r0  |         |                |
| 0x0C   |         | and r5, r3, r4  |         |                |
| 0x0D   |         | beqz r5, lbreak |         |                |

Initially the *Branch Prediction Buffer* contains all entries in state "D", and all the *Branch Target Buffer* entries are empty. When a new entry is added to the *Branch Target Buffer*, its status will be "A" if the branch has been taken, and "D" if the branch has not been taken.

The final statistics of the execution of the test program are: 15% of the instructions are conditional branches, 60% of the branches are taken, the *Branch Prediction Buffer* succeeds in predicting 75% of the cases, and the *Branch Target Buffer* succeeds in predicting 90%, including cases in which there is no entry in the table (which are predicted as "not taken").

It is requested:

1. Obtain a trace of the execution of the code fragment until the "sw z(r0), r8" instruction is completely executed. It should show the contents of the entries of both predictors after each branch ("beqz r1, lendif", "beqz r5, lbreak", "bnez r4, ldo" and "bnez r6, lfor"). Labels should be used to record the target addresses.

Branch begz r1, lendif. (r1 = 0)

BPB

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

#### BTB

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch begz r5, lbreak. (r5 = 1)

**BPB** 

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

**BTB** 

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch bnez r4, ldo. (r4 = 1)

### **BPB**

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

#### **BTB**

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch begz r5, lbreak. (r5 = 1)

BPB

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

втв

|   | Index | Target address | State |
|---|-------|----------------|-------|
|   |       |                |       |
|   |       |                |       |
| ĺ |       |                |       |
|   |       |                |       |

Branch bnez r4, ldo. (r4 = 1)

**BPB** 

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

**BTB** 

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch begz r5, lbreak. (r5 = 0)

**BPB** 

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

BTB

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch bnez r6, lfor. (r6 = 1)

**BPB** 

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

BTB

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch begz r1, lendif. (r1 = 1)

**BPB** 

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

BTB

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch begz r5, lbreak. (r5 = 0)

**BPB** 

| JI D  |       |
|-------|-------|
| Index | State |
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

**BTB** 

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

Branch bnez r6, lfor. (r6 = 0)

BPB

| Index | State |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

BTB

| Index | Target address | State |
|-------|----------------|-------|
|       |                |       |
|       |                |       |
|       |                |       |
|       |                |       |

2. Analyze the behavior of the BTB predictor when it mispredicts, indicating which instructions are executed after the branch. The number of lost execution cycles should be indicated. Canceled instructions should be represented as **X** in the corresponding cycle. The instructions following the branch will be represented as **pc+1**, **pc+2**, ... and the branch target instructions as **dest**, **dest+1**, etc.

3. Compute the average number of cycles per instruction (CPI) for the test program, using the BTB prediction scheme. Assume that instructions that are not branches are executed with CPI=1.

## Solución:

1. Execution trace.

Branch 0x05 (00000101) begz r1, lendif. (r1 = 0) Taken BTB

### **BPB**

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | С     |
| 10    | D     |
| 11    | D     |

| Index | Target address | State |  |
|-------|----------------|-------|--|
| 0x05  | lendif         | A     |  |
|       |                |       |  |
|       |                |       |  |
|       |                |       |  |

Branch 0x0D (00001101) begz r5, lbreak. (r5 = 1) Not taken **BTB** 

### **BPB**

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | D     |
| 10    | D     |
| 11    | D     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | A     |
| 0x0D  | lbreak         | D     |
|       |                |       |
|       |                |       |

Branch 0x12 (00010010) bnez r4, ldo. (r4 = 1) **Taken BTB** 

# **BPB**

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | D     |
| 10    | С     |
| 11    | D     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | A     |
| 0x0D  | lbreak         | D     |
| 0x12  | ldo            | A     |
|       |                |       |

Branch 0x0D (00001101) beqz r5, lbreak. (r5 = 1) **Not taken BTB** 

## **BPB**

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | D     |
| 10    | С     |
| 11    | D     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | A     |
| 0x0D  | lbreak         | D     |
| 0x12  | ldo            | A     |
|       |                |       |

Branch 0x12 (00010010) bnez r4, ldo. (r4 = 1) Taken **BTB** 

## **BPB**

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | D     |
| 10    | A     |
| 11    | D     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | A     |
| 0x0D  | lbreak         | D     |
| 0x12  | ldo            | A     |
|       |                |       |

Branch 0x0D (00001101) beqz r5, lbreak. (r5 = 0) Taken

**BPB** 

| I | <b>3</b> T | B |
|---|------------|---|
| 1 | т          | 1 |

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | С     |
| 10    | A     |
| 11    | D     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | A     |
| 0x0D  | lbreak         | С     |
| 0x12  | ldo            | A     |
|       |                |       |

Branch 0x17 (00010111) bnez r6, lfor.(r6 = 1) Taken

**BPB** 

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | С     |
| 10    | A     |
| 11    | С     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | A     |
| 0x0D  | lbreak         | С     |
| 0x12  | ldo            | A     |
| 0x17  | lfor           | A     |

Branch  $0 \times 05$  (00000101) begz r1, lendif. (r1 = 1) Not taken

**BPB** 

| $\mathbf{R}'$ | ${f r}{f p}$ |
|---------------|--------------|
|               |              |

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | D     |
| 10    | A     |
| 11    | С     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | В     |
| 0x0D  | lbreak         | С     |
| 0x12  | ldo            | A     |
| 0x17  | lfor           | A     |

Branch 0x0D (00001101) beqz r5, lbreak. (r5 = 0) Taken

**BPB** 

| KTK          | - | _ | -   | _ |
|--------------|---|---|-----|---|
| $\mathbf{n}$ | v | , | ויו | v |
|              | n |   |     | n |

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | С     |
| 10    | A     |
| 11    | С     |

| Index | Target address | State |
|-------|----------------|-------|
| 0x05  | lendif         | В     |
| 0x0D  | lbreak         | A     |
| 0x12  | ldo            | A     |
| 0x17  | lfor           | A     |

Branch 0x17 (00010111) bnez r6, lfor. (r6 = 0) Not taken

**BPB** 

| _ |   | _ |
|---|---|---|
|   | 1 |   |
| к |   | к |
|   |   |   |

| Index | State |
|-------|-------|
| 00    | D     |
| 01    | С     |
| 10    | A     |
| 11    | D     |

|   | Index | Target address | State |
|---|-------|----------------|-------|
|   | 0x05  | lendif         | В     |
|   | 0x0D  | lbreak         | A     |
| Ì | 0x12  | ldo            | A     |
|   | 0x17  | lfor           | В     |

## 2. Predictor behavior in case of misprediction.

### ■ For BTB:

The predictor predicts that the branch is **not taken** and the branch is **taken**.

|        |    | p  | d   |     | c   |     |    |     |    |     |     |    |
|--------|----|----|-----|-----|-----|-----|----|-----|----|-----|-----|----|
| Branch | IF | ID | ALU | ME  | EX1 | EX2 | WB |     |    |     |     |    |
| PC+1   |    | IF | ID  | ALU | ME  | X   |    |     |    |     |     |    |
| PC+2   |    |    | IF  | ID  | ALU | X   |    |     |    |     |     |    |
| PC+3   |    |    |     | IF  | ID  | X   |    |     |    |     |     |    |
| PC+4   |    |    |     |     | IF  | X   |    |     |    |     |     |    |
| DEST   |    |    |     |     |     | IF  | ID | ALU | ME | EX1 | EX2 | WB |

Four stall cycles are required.

The predictor predicts that the branch is **taken** and the branch is **not taken**.

|        |    | p  | d   |    | С   |     |    |     |    |     |     |    |
|--------|----|----|-----|----|-----|-----|----|-----|----|-----|-----|----|
| Branch | IF | ID | ALU | ME | EX1 | EX2 | WB |     |    |     |     |    |
| PC+1   |    | IF | X   |    |     |     |    |     |    |     |     |    |
| DEST   |    |    | IF  | ID | ALU | X   |    |     |    |     |     |    |
| DEST+1 |    |    |     | IF | ID  | X   |    |     |    |     |     |    |
| DEST+2 |    |    |     |    | IF  | X   |    |     |    |     |     |    |
| PC+1   |    |    |     |    |     | IF  | ID | ALU | ME | EX1 | EX2 | WB |

Four stall cycles are required.

# 3. Average CPI.

It is necessary to obtain the penalty in case the prediction is correct. When the branch is not taken, the predictor has no penalty. Otherwise, the processor using BTB fetches the instruction after accessing the predictor (ID stage), introducing only one stall cycle.

## 4. For BTB:

| Prediction | Condition | Probability | Penalty |
|------------|-----------|-------------|---------|
| No         | No        | 0.9*0.4     | 0       |
| No         | Yes       | 0.1*0.6     | 4       |
| Yes        | No        | 0.1*0.4     | 4       |
| Yes        | Yes       | 0.9*0.6     | 1       |

The average number of stall cycles is: 0.94 cycles, and the average CPI is:  $1+0.15 \cdot 0.94 = 1.14$  cycles.

## Ejercicio 2.14.

Assume the code of function ones, which returns (in \$v0) the number of bits equal to 1 in argument register (\$a0). The way to proceed is to obtain the LSB of the argument (with andi \$t1,\$a0,1), to increase the count in case that LSB==1 (with daddi \$v0,\$v0,1), and then to shift right 1 bit (using dsrl \$a0,\$a0,1). These operations are repeated 64 times to complete the job.

```
next: dsrl $a0,$a0,1  # Shift right $a0 1 bit
    daddi $t0,$t0,-1 # Pending iterations
    bgtz $t0,loop  # Next iteration
    jr $ra  # Exit of the function
```

Note that the code contains two conditional branches and one jump instruction:

```
beqz $t1, next taken each time that LSB==0
bgtz $t0,loop closes the loop
jr $ra unconditional jump.
```

The processor pipeline consists of the typical 5 stages, and implements a branch predictor. PC is written at stage EX (i.e., the branch latency is 2 cycles long). Note that no stall cycle appears due to structural or data hazards.

Obtain the number of executed instructions, the number of stall cycles, and the execution time (in processor cycles) of the function in the following cases:

- 1. Assuming a *predict-not taken* predictor and a0=-1 (0xFFF...FFF).
- 2. The processor implements a BTB for (conditional) branches with a 1-bit predictor, which predicts the target address at *IF* stage and applies *predict-not taken* for return of function jr \$ra. Function ones is called the first time with \$a0=-1 (0xFFF...FFF)
- 3. The previous BTB uses a 2-bit hysteresis branch predictor. Function ones is called the first time with \$a0=0x8080 8080 8080 8080

### Solución:

A call to ones execute  $3+5\times 64+n=323+n$  instructions, being n the number of 1's in the argument.

1. *Predict-not taken*, a0=-1

The branch begz is not taken. Then,

I = 323 + 64 = 387 executed instructions

P = 64 taken branches  $\times$  2 penalty cycles = 128 stall cycles

T = 387 + 128 = 515 total cycles

2. 1 bit BTB predictor, a0=-1

For begz the prediction is always *not taken* (in the first iteration because there is not previous information, and after because the branch is always *not taken*). No stall cycles are produced. For bgtz, the first prediction is *not taken*, but after is always *take*. The prediction will fail in the first and last iteractions. The branch jr \$ra, is not in the BTB, then get always a penalty. Then,

I = 323 + 64 = 387 executed instructions

P = 3 failed predictions  $\times$  2 penalty cycles = 6 stall cycles

T = 387 + 6 = 393 total cycles

3. 2 bits BTB predictor, \$a0=0x8080... The 2 bits predictor work as follows:



With begz, the prediction fails in  $b_0$  because there is not previous information, and the state is set to (A) taken. In  $b_1$  the prediction will be correct and the state will continue in (A) taken. Only when  $b_i = 1$  ( $i = 3, 7, 11, \ldots$ ), 8 times, the predictor will fail and will move to (B) taken. But with the next iteration, it will be back to (A) taken. Then we will get 1 + 8 = 9 penalty cycles. With bgtz, the prediction fails in  $b_0$  because there is not previous info. In the next iterations the BTB predicts correctly until last iteration, where it fails again. We get then 2 penalty cycles. Considering the jr penalty, we get a total of 12 cycles.

I = 323 + 8 = 331 executed instructions

P = 12 failed predictions  $\times$  2 penalty cycles = 24 stall cycles

T = 331 + 24 = 355 total cycles

# Dynamic instruction scheduling and speculation

# Hardware speculation

## Exercise 2.15.

A MIPS processor applies dynamic instruction scheduling with speculation to all instructions. Thus, all instructions traverse the following stages:

- **IF** Instruction fetch.
- I Instruction decoding and issue following the Tomasulo algorithm with speculation.
- Ei Execution in the corresponding operator.
- **WB** Transfer of results through the internal bus.
- C Commit stage. Results are written to the corresponding destination register.

All stages take 1 cycle, except the **Ei** stage whose duration varies from one operator to another. The processor has a BTB predictor whose prediction is obtained during the **IF** stage.

The main characteristics of the functional units of the processor are:

| Type                 | Units            | Latency (cycles) | Additional info                   |  |  |
|----------------------|------------------|------------------|-----------------------------------|--|--|
| Integer arithmetic 1 |                  | 1                | 4 reservation stations            |  |  |
| FP Multiplication    | ltiplication 1 3 |                  | Pipelined, 4 reservation stations |  |  |
| FP Add/Sub           | 1                | 2                | Pipelined, 4 reservation stations |  |  |
| Load/Store           | 1                | 2                | Pipelined                         |  |  |
|                      |                  |                  | 2 load buffers, 2 write buffers   |  |  |

The following code is under execution on the considered processor:

```
l.d f1,d(r0)
l.d f3,n(r0)
loop: sub.d f4,f2,f1
    mul.d f1,f1,f4
    mul.d f3,f3,f4
    c.lt.d f1,f0
    bclt loop    ; Jumps if f1 < f0
    s.d f3,q(r0)</pre>
```

Instruction c.lt.d f1, f0 is evaluated in the add/sub unit, and writes its result in an internal register (floating point state register). Instruction bclt loop jumps if the floating point state register is *true*, thus computing the branch target address and evaluating the condition in the integer unit.

When the code starts its execution, the *reorder buffer* and the operators are empty. The register file and the memory contain the following values:

| F0  | F1  | F2  | F3  | F4  | d(r0) | n(r0) |
|-----|-----|-----|-----|-----|-------|-------|
| 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1     | 0.25  |

Take into account that, for obtaining the considered results, each instruction in the loop must be executed only once. However, assume the predictor **incorrectly** predicts that the branch instruction bclt loop is **taken**.

- 1. Draw the instructions-time diagram from the cycle when instruction 1.d f1,d(r0) executes the **IF** stage until the cycle when instruction s.d f3, q(r0) finishes completely.
- 2. Show the state of the *reorder buffer*, the reservation stations, the read and write buffers, and the register file at the end of the cycle when the instruction sub.d f4, f2, f1 of the first loop iteration is committed (it executes its C stage).

**Note:** In order to depict the various stages in the instructions-time diagram use the following notation:  $M1, M2, \ldots$  for the multiplication/division stages,  $A1, A2, \ldots$  for the add/sub stages,  $AC, L1, L2, \ldots$  for the load/store unit, and  $E1, E2, \ldots$  for the integer unit.

# Solution:

# **Instructions-time diagram**

```
PC Instruc.
              1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 l.d f3,n(r0)
              IF I AC L1 L2 WB C
3 mul.d f1, f1, f4
                     IF I - - - - M1 M2 M3 WB C
4 mul.d f3,f3,f4
                    IF I - - - M1 M2 M3 WB C
 5 c.lt.d f1,f0
                         IF I - - - - - A1 A2 WB C
                            6 bclt loop
 2 sub.d f4, f2, f1
                              IF I - - - - A1 A2 WB -
3 mul.d f1, f1, f4
                              IF I - - - -
                                                   - M1 X
4 mul.d f3, f3, f4
                                   IF I - - - - - -
                                                       Χ
5 c.lt.d f1,f0
                                   IF I - -
6 bc1t loop
                                       IF I - - - -
2 sub.d f4, f2, f1
                                       IF I - - - -
                                                       Х
3 mul.d f1, f1, f4
                                            IF I - - x
4 mul.d f3,f3,f4
                                              IF I - - x
 5 c.lt.d f1,f0
                                                 IF I - x
                                                   IF I x
6 bc1t loop
 2 sub.d f4, f2, f1
                                                     IF X
3 mul.d f1, f1, f4
 7 \text{ s.d } f3,16(r0)
                                                          IF I AC C L1 L2
```

# State in cycle 10:

### Reorder buffer:

|    | busy | instr          | state | dest | value | pred  | PC |
|----|------|----------------|-------|------|-------|-------|----|
| 0  | NO   | l.df1,d(r0)    | WB    | F1   | 1.00  |       | 0  |
| 1  | NO   | l.d f3,n(r0)   | WB    | F3   | 0.25  |       | 1  |
| 2  | NO   | sub.d f4,f2,f1 | WB    | F4   | 1.00  |       | 2  |
| 3  | YES  | mul.d f1,f1,f4 |       | F1   |       |       | 3  |
| 4  | YES  | mul.d f3,f3,f4 |       | F3   |       |       | 4  |
| 5  | YES  | c.lt.d f1,f0   |       | fpsr |       |       | 5  |
| 6  | YES  | bc1t loop      |       | 2    |       | Taken | 6  |
| 7  | YES  | sub.d f4,f2,f1 |       | F4   |       |       | 2  |
| 8  | YES  | mul.d f1,f1,f4 |       | F1   |       |       | 3  |
| 9  | NO   |                |       |      |       |       |    |
| 10 | NO   |                |       |      |       |       |    |
| 11 | NO   |                |       |      |       |       |    |
| 12 | NO   |                |       |      |       |       |    |
| 13 | NO   |                |       |      |       |       |    |
| 14 | NO   |                |       |      |       |       |    |
| 15 | NO   |                |       |      |       |       |    |

### Reservation stations:

|    | busy | Op | Qj | Vj   | Qk | Vk   | rob | result |
|----|------|----|----|------|----|------|-----|--------|
| e1 | YES  | В  | #5 |      |    | 0    | #6  |        |
| e2 | NO   |    |    |      |    |      |     |        |
| e3 | NO   |    |    |      |    |      |     |        |
| e4 | NO   |    |    |      |    |      |     |        |
| a1 | YES  | -  |    | 2.00 | #3 |      | #7  |        |
| a2 | YES  | <  | #3 |      |    | 1.00 | #5  |        |
| a3 | NO   |    |    |      |    |      |     |        |
| a4 | NO   |    |    |      |    |      |     |        |
| m1 | YES  | *  |    | 1.00 |    | 1.00 | #3  |        |
| m2 | YES  | *  |    | 0.25 |    | 1.00 | #4  |        |
| m3 | YES  | *  | #3 |      | #7 |      | #8  |        |
| m4 | NO   |    |    |      |    |      |     |        |

### Read Buffers:

|    | busy | Qj | Vj | disp | addr | rob | result |
|----|------|----|----|------|------|-----|--------|
| 11 | NO   |    | 0  | d    | d    | #0  | 1.00   |
| 12 | NO   |    | 0  | n    | n    | #1  | 0.25   |

## Write Buffers:

|    | busy | Qj | Vj | disp | addr | rob | Qk | Vk | confirm |
|----|------|----|----|------|------|-----|----|----|---------|
| s1 | NO   |    |    |      |      |     |    |    |         |
| s2 | NO   |    |    |      |      |     |    |    |         |

# Register File:

|       | F0   | F1   | F2   | F3   | F4   | F5 | F6 | F7   |
|-------|------|------|------|------|------|----|----|------|
| rob   |      | #8   |      | #4   | #7   |    |    |      |
| value | 1.00 | 1.00 | 2.00 | 0.25 | 1.00 |    |    |      |
|       | DΛ   | R1   | R2   | R3   | R4   | R5 | R6 | R7   |
|       | R0   | IX I | 1\2  | K3   | 174  | KJ | NO | IX / |
| rob   | RU   | Kı   | K2   | K3   | 174  | KS | KU | IX / |

**Exercise 2.16.** A MIPS64 binary compatible processor applies dynamic instruction scheduling with *hardware* speculation. Instructions traverse the following pipeline stages:

- **IF** Instruction fetch.
- I Instruction decoding and Instruction issue following the Tomasulo algorithm with speculation.
- **Ei** Execution in the corresponding operator.
- **WB** Transfer of results through the internal bus (the transferred result is not available until the next cycle).
- C Commit stage. Results are written to the destination register.

All stages take one cycle, except the **Ei** stage whose duration varies from one operator to another. The processor has a perfect branch predictor that obtains the prediction and the branch target address during the **IF** stage.

The main functional units of the processor have the following characteristics:

| Type               | Units | Latency (cycles) | Additional info                          |
|--------------------|-------|------------------|------------------------------------------|
| Integer arithmetic | 1     | 1                | 3 reservation stations (e1e3)            |
| FP add/sub         | 1     | 2                | Pipelined, 3 reservation stations (a1a3) |
| FP multiplication  | 1     | 4                | Pipelined, 2 reservation stations (m1m2) |
| Load/Store         | 1     | 2                | Non Pipelined                            |
|                    |       |                  | 3 read buffers (1113),                   |
|                    |       |                  | 3 write buffers (s1s3)                   |

On such processor it is considered the execution of the following fragment of code:

When the code starts its execution, the *reorder buffer* and the operators are empty. The initial value of register R1 is 80. The values accessed from vector X and Y are represented by x1, x2, ... and y1, y2, ... respectively. The value that is located at the address Mem[r0 + A] is a.

Show the instructions-time diagram for the first two iterations of the loop, and the state of the various data structures of the processor at the end of the cycle where the branch "bnez r1, loop" is fetched for the second time. Compute the MFLOPS that the execution of the considered piece of code may reach if it was executed on a 150 MHz processor.

**Note:** Represent the execution stages  $\mathbf{Ei}$  following this notation: EX for integer operations, Ai for floating point additions and substractions, Mi for floating point multiplications, and AC, Li for load and stores, where the subindex indicates the number of cycles in execution (1, 2, ...).

### Solution:

## Instructions-time diagram:

```
PC Instruc.
                1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
               IF I AC L1 L2 WB C
0 \text{ l.d } f0, A(r0)
 1 l.d f2,X(r1)
                  IF I AC - L1 L2 WB C
2 mul.d f4, f2, f0
                  IF I - - - - M1 M2 M3 M4 WB C
 3 l.d f6, Y(r1)
                        IF I AC - L1 L2 WB - - - C
                           IF I - - - - - - A1 A2 WB C
4 sub.d f4, f4, f6
 5 \text{ s.d } f4,Z(r1)
                             IF I AC - - - - - - - -
                                                               L1 L2
                                IF I E1 - WB - - - - - C
6 dsubi r1, r1, 8
 7 bnez r1, loop
                                   IF I - - E1 - WB - - - - C
                                     IF I - AC L1 L2 WB - - - - C
1 1.d f2, X(r1)
 2 mul.d f4, f2, f0
                                        IF I - - - - M1 M2 M3 M4 WB - C
                                           IF I AC - L1 L2 WB - - - - C
3 1.d f6, Y(r1)
                                              IF I - - - - - - A1 A2 WB C
 4 sub.d f4, f4, f6
                                                IF I AC - - - - - - - C L1 L2
5 \text{ s.d } f4,Z(r1)
                                                   IF I E1 - WB - - - -
 6 dsubi r1, r1, 8
                                                     IF I - - E1 - WB - - -
7 bnez r1, loop
```

# State of the unit in the 15th cycle:

# ROB

|    | busy | instr          | state | dest | value     | pred  | PC |
|----|------|----------------|-------|------|-----------|-------|----|
| 0  | NO   | l.d f0,A(r0)   | WB    | F0   | а         |       | 0  |
| 1  | NO   | l.df2,X(r1)    | WB    | F2   | x1        |       | 1  |
| 2  | NO   | mul.d f4,f2,f0 | WB    | F4   | a*x1      |       | 2  |
| 3  | NO   | l.d f6, Y(r1)  | WB    | F6   | <i>y1</i> |       | 3  |
| 4  | YES  | sub.d f4,f4,f6 |       | F4   |           |       | 4  |
| 5  | YES  | s.d f4,Z(r1)   |       | s1   |           |       | 5  |
| 6  | YES  | dsubi r1,r1,8  | WB    | R1   | 72        |       | 6  |
| 7  | YES  | bnez r1,loop   | WB    | 1    | Taken     | Taken | 7  |
| 8  | YES  | 1.d f2,X(r1)   | WB    | F2   | x2        |       | 1  |
| 9  | YES  | mul.d f4,f2,f0 |       | F4   |           |       | 2  |
| 10 | YES  | 1.d f6,Y(r1)   |       | F6   |           |       | 3  |
| 11 | YES  | sub.d f4,f4,f6 |       | F4   |           |       | 4  |
| 12 | YES  | s.d f4,Z(r1)   |       | s2   |           |       | 5  |
| 13 | YES  | dsubi r1,r1,8  |       | R1   |           |       | 6  |
| 14 | NO   |                |       |      |           |       |    |
| 15 | NO   |                |       |      |           |       |    |

# Reservation stations

|    | busy | Op | Qj | Vj | Qk  | Vk | rob | result  |
|----|------|----|----|----|-----|----|-----|---------|
| e1 | YES  | -  |    | 72 |     | 8  | #13 |         |
| e2 | NO   | В  |    | 72 |     | 0  | #7  | Taken   |
| e3 | NO   |    |    |    |     |    |     |         |
| a1 | YES  | -  |    | a  |     | y1 | #4  | a*x1-y1 |
| a2 | YES  | -  | #9 |    | #10 |    | #11 |         |
| a3 | NO   |    |    |    |     |    |     |         |
| m1 | NO   | *  |    | x1 |     | а  | #2  | a*x1    |
| m2 | YES  | *  |    | x2 |     | a  | #9  |         |

# Read Buffers

|    | busy | Qj | Vj | disp | addr | rob | result     |
|----|------|----|----|------|------|-----|------------|
| 11 | NO   |    | 72 | X    | 72+X | #8  | <i>x</i> 2 |
| 12 | YES  |    | 72 | Y    | 72+Y | #10 |            |
| 13 | NO   |    | 80 | Y    | 80+Y | #3  | <i>y1</i>  |

# Write Buffers

|    | busy | Qj | Vj | disp | addr | rob | Qk  | Vk | confirm |
|----|------|----|----|------|------|-----|-----|----|---------|
| s1 | YES  |    | 80 | Z    | 80+Z | #5  | #4  |    | NO      |
| s2 | YES  |    | 72 | Z    | 72+Z | #12 | #11 |    | NO      |
| s3 | NO   |    |    |      |      |     |     |    |         |

# Register File

|       | F0   | F1        | F2 | F3 | F4   | F5 | F6  | F7 |
|-------|------|-----------|----|----|------|----|-----|----|
| rob   |      |           | #8 |    | #11  |    | #10 |    |
| value | 5.00 |           | x1 |    | a*x1 |    | y1  |    |
|       |      |           |    |    |      |    |     |    |
|       | R0   | R1        | R2 | R3 | R4   | R5 | R6  | R7 |
| rob   | R0   | R1<br>#13 | R2 | R3 | R4   | R5 | R6  | R7 |

Without considering the first instruction, which does not belong to the main loop, the considered processor issues one instruction per cycle, so it takes 7 cycles in issuing all instructions belonging to the first loop iteration. In each iteration, 2 **floating point** operation are executed (multid and subd). Thus, the MFLOPS that can be reached by the processor for the considered fragment of code are:

$$R_{\infty} = \frac{2}{7} op./cycle \cdot 150 \cdot 10^6 \ cycles/sec. = \frac{300}{7} \cdot 10^6 \ op./sec. = 42,86 \ MFLOPS$$

### Exercise 2.17.

In order to evaluate the performance of a certain computer, the following loop code is used  $\vec{y} = a \cdot \vec{x}$ . This computer has a processor similar to MIPS but implements hardware speculation for all the instructions. The instructions execute the following stages:

- **IF** Instruction fetch.
- I Instruction decode and issue following Tomasulo's algorithm with speculation.
- **Ei** Execution at the corresponding operator.
- **WB** Result transfer through the common data bus.
- C Commit stage. Result writing into the destination register. Prediction checking and instruction canceling, if necessary.

All stages last one clock cycle, except stage **Ei** whose duration depends on the operator. The processor has a **two**-bit branch predictor of type *branch target buffer* that delivers the prediction at the end of stage **IF**.

The characteristics of the functional units of the processor are:

| Type               | Units | Latency (cycles) | Additional info                         |
|--------------------|-------|------------------|-----------------------------------------|
| Integer arithmetic | 1     | 1                | 3 reservation stations                  |
| FP multiplication  | 1     | 3                | Pipelined, 3 reservation stations       |
| Load/Store         | 1     | 2                | Non pipelined, 3 load + 3 store buffers |

The code generated by the compiler for the loop is:

The state of the registers and memory before the last iteration, is as follows:

| Reg.  | R1 | F0 | F2 | Mem   | X   | x+8 | у  | y+8 |
|-------|----|----|----|-------|-----|-----|----|-----|
| Value | 8  | 3  | 2  | Value | 100 | 102 | 10 | 12  |

1. Draw an instructions-time diagram showing the execution of the instructions that are issued in the last loop iteration, until the cycle in which the trap 0 instruction is fetched. Note that the predictor will incorrectly predict the bnez r1, loop instruction (predict taken and finally not taken). In order to represent the execution stages in the diagram, please use: M1, M2 and M3 for the multiplication stages, AC, L1 and L2 for the load/store unit, and EX for the integer unit. For simplicity, assume that the ROB and the reservation stations are empty before starting the last iteration, and that there is no instruction in the pipeline (in other words, assume that all the instructions in previous iterations already committed).

- 2. Assuming that the ROB and the reservation stations are empty, and that there is no instruction in the pipeline (i.e., the *reorder buffer* and the *rob* fields in the register file are empty) before executing the last iteration, show the state of the *reorder buffer*, the register file, and memory at the end of the clock cycle in which the s.d f2, y(r1) instruction executes the **Commit** stage.
- 3. Assume now that the bnez r1, loop branch is not stored in the table when the loop execution begins, and that the initial value of R1 is 160. Indicate, justifying the response, the loop execution time in cycles. Consider that the loop ends when the branch instruction in the last iteration reaches the *Commit* stage.

## Solution:

1. Instructions-time diagram

| РC | Instruc.         | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10            | 11 | 12 | 13 | 14 | 15 |
|----|------------------|----|----|----|----|----|----|----|----|----|---------------|----|----|----|----|----|
| 0  | 1.d $f2, x(r1)$  | IF | I  | AC | L1 | L2 | WB | С  |    |    |               |    |    |    |    |    |
| 1  | mul.d f2, f2, f0 |    | IF | I  | _  | _  | _  | M1 | M2 | МЗ | $\mathtt{WB}$ | С  |    |    |    |    |
| 2  | s.d f2, y(r1)    |    |    | IF | I  | AC | -  | -  | -  | -  | -             | -  | С  | L1 | L2 |    |
| 3  | dsubi r1,r1,8    |    |    |    | IF | I  | E1 | WB | -  | _  | _             | _  | _  | С  |    |    |
| 4  | bnez r1,loop     |    |    |    |    | IF | Ι  | -  | Ε1 | WB | -             | -  | -  | -  | С  |    |
| 0  | 1.d f2, x(r1)    |    |    |    |    |    | IF | Ι  | AC | L1 | L2            | WB | -  | _  | Х  |    |
| 1  | mul.d f2, f2, f0 |    |    |    |    |    |    | IF | Ι  | _  | -             | _  | M1 | M2 | Χ  |    |
| 2  | s.d f2, y(r1)    |    |    |    |    |    |    |    | IF | I  | AC            | _  | -  | _  | Х  |    |
| 3  | dsubi r1,r1,8    |    |    |    |    |    |    |    |    | IF | Ι             | Ε1 | WB | -  | Х  |    |
| 4  | bnez r1,loop     |    |    |    |    |    |    |    |    |    | IF            | I  | -  | E1 | Χ  |    |
| 0  | 1.d $f2, x(r1)$  |    |    |    |    |    |    |    |    |    |               | IF | I  | AC | Χ  |    |
| 1  | mul.d f2, f2, f0 |    |    |    |    |    |    |    |    |    |               |    | IF | I  | Х  |    |
| 2  | s.d f2, y(r1)    |    |    |    |    |    |    |    |    |    |               |    |    | IF | Χ  |    |
| 3  | dsubi r1,r1,8    |    |    |    |    |    |    |    |    |    |               |    |    |    | Χ  |    |
| 5  | trap 0           |    |    |    |    |    |    |    |    |    |               |    |    |    |    | IF |

2. State at the end of cycle 12.

# **ROB**

|    | busy | instr          | state | dest | value     | pred  | PC |
|----|------|----------------|-------|------|-----------|-------|----|
| 0  | NO   | l.df2,x(r1)    | WB    | F2   | 102.00    |       | 0  |
| 1  | NO   | mul.d f2,f2,f0 | WB    | F2   | 306.00    |       | 1  |
| 2  | NO   | s.df2,y(r1)    |       | s1   |           |       | 2  |
| 3  | YES  | dsubi r1,r1,8  | WB    | R1   | 0         |       | 3  |
| 4  | YES  | bnez r1,loop   | WB    | 0    | Not taken | Taken | 4  |
| 5  | YES  | 1.d f2,x(r1)   | WB    | F2   | 100.00    |       | 0  |
| 6  | YES  | mul.d f2,f2,f0 |       | F2   |           |       | 1  |
| 7  | YES  | s.d f2,y(r1)   |       | s2   |           |       | 2  |
| 8  | YES  | dsubi r1,r1,8  | WB    | R1   | -8        |       | 3  |
| 9  | YES  | bnez r1,loop   |       | 0    |           | Taken | 4  |
| 10 | YES  | 1.d f2,x(r1)   |       | F2   |           |       | 0  |

# Registers

|       | F0   | F1       | F2     | F3 | F4 | F5 | F6 | F7 |
|-------|------|----------|--------|----|----|----|----|----|
| rob   |      |          | #10    |    |    |    |    |    |
| value | 3.00 |          | 306.00 |    |    |    |    |    |
|       |      |          |        |    |    |    |    |    |
|       | R0   | R1       | R2     | R3 | R4 | R5 | R6 | R7 |
| rob   | R0   | R1<br>#8 | R2     | R3 | R4 | R5 | R6 | R7 |

Memory

| Adr | Data   |
|-----|--------|
| X   | 100.00 |
| x+8 | 102.00 |
| ••• | •••    |
| y   | 10.00  |
| y+8 | 12.00  |

3. Since the value in R1 is 160, the loop will perform 20 iterations. When the predictor hits, the time to execute one iteration is 5 clock cycles. Whenever the branch is mispredicted or the branch is not found in the predictor table (the first iteration) 9 cycles are lost. Therefore:

$$T_{execution} = 20 * 5 + 2 * 9 = 100 + 18 = 118$$
 cycles

# Multiple instruction issue

### Exercise 2.18.

The code of the loop  $\vec{y} = a\vec{x} + b\vec{y} + c$  is used to evaluate the performance of a given computer. Such computer has a **2-way** superscalar processor applying dynamic instruction scheduling with speculation to all instructions. The instruction pipeline has the following stages:

- **IF** Instruction Fetch.
- I Instruction decoding and issuing.
- **Ei** Execution in the corresponding operator.
- **WB** Result transfer through the internal bus.
- C Commit stage. Results are written to the destination register.

All stages take one clock cycle, except stage **Ei** whose duration depends on the operator. The processor has a *branch target buffer* predictor obtaining the prediction at the end of stage **IF**. Each transfer through the common data buses takes 1 clock cycle.

The main characteristics of the processor functional units are:

| Type               | Units | Latency (cycles) | Additional info                   |
|--------------------|-------|------------------|-----------------------------------|
| Integer arithmetic | 2     | 1                | 6 reservation stations            |
| FP Mult            | 1     | 4                | Pipelined, 4 reservation stations |
| FP Add/Sub         | 1     | 2                | Pipelined, 4 reservation stations |
| Load/Store         | 2     | 2                | 2 read buffers, 2 write buffers   |

The code generated by the compiler for the considered loop is the following one:

```
loop: 1.d f2,x(r1)
1.d f4,y(r2)
mult.d f2,f2,f0
mult.d f4,f4,f1
add.d f6,f3,f2
add.d f6,f6,f4
s.d f6,y(r2)
dsub r1,r1,#8
dsub r2,r2,#8
bnez r1,loop
trap #0
```

The state of registers and memory is the following one:

| R1 | R2 | F0  | F1  | F2  | F3  | F4  | F5  | F6  | x+32 | x+24 | y+40 | y+32 |
|----|----|-----|-----|-----|-----|-----|-----|-----|------|------|------|------|
| 32 | 40 | 3.0 | 0.5 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 100  | 102  | 10   | 12   |

- 1. Draw the instructions-time diagram for the first two loop iterations. Assume the predictor predicts **correctly** that instruction bnez r1, loop is **taken**. Detail in which stage is each instruction under execution in each cycle. Represent execution stages in the diagram using the following notation: M1, M2, M3 and M4 for multiplication stages, A1, A2 for add/sub stages, AC, L1, L2 for the load/store unit and EX for the integer unit.
- 2. Show the state of the *reorder buffer*, the reservation stations and buffers, the register file and the memory at the end of the cycle when the instruction bnez r1, loop of the second iteration executes stage **IF**.

# Solution:

```
PC Instruc.
          1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
IF I AC - - - - -
                                     - - - C L1 L2
6 s.d f6, y(r2)
                IF I E1 WB - - - - - - - - -
7 dsubi r1, r1, 8
8 dsubi r2, r2, 8
                IF I E1 WB - - - - - -
9 bnez r1, loop
                  IF I - E1 WB - - - - - - -
0 \text{ l.d } f2,x(r1)
                   IF I AC L1 L2 WB - - -
1 l.d f4,y(r2)
                    IF I - AC L1 L2 WB - - -
2 mul.d f2, f2, f0
                     IF I - - - M1 M2 M3 M4 WB - -
                      IF I - - - M1 M2 M3 M4 WB - - -
3 mul.d f4, f4, f1
                        IF I - - - - - - A1 A2 WB - - C
4 add.d f6,f3,f2
5 add.d f6, f6, f4
                        IF I - - - - - - - - A1 A2 WB C
                          IF I AC - - - - - - -
6 \text{ s.d } f6, y(r2)
7 dsubi r1, r1, 8
                          IF I E1 - WB -
                            IF I E1 WB - - - -
8 dsubi r2, r2, 8
                            IF I - - E1 WB - - - - - -
9 bnez r1, loop
                                                        С
                            IF I - AC L1 L2 - WB - - -
0 l.d f2, x(r1)
```

# The state of the unit at the 10th cycle is:

# ROB

|    | busy | instr          | state | dest | value  | pred  | PC |
|----|------|----------------|-------|------|--------|-------|----|
| 0  | NO   | l.df2,x(r1)    | WB    | F2   | 100.00 |       | 0  |
| 1  | NO   | l.df4,y(r2)    | WB    | F4   | 10.00  |       | 1  |
| 2  | YES  | mul.d f2,f2,f0 |       | F2   |        |       | 2  |
| 3  | YES  | mul.d f4,f4,f1 |       | F4   |        |       | 3  |
| 4  | YES  | add.d f6,f3,f2 |       | F6   |        |       | 4  |
| 5  | YES  | add.d f6,f6,f4 |       | F6   |        |       | 5  |
| 6  | YES  | s.d f6,y(r2)   |       | s1   |        |       | 6  |
| 7  | YES  | dsubi r1,r1,8  | WB    | R1   | 24     |       | 7  |
| 8  | YES  | dsubi r2,r2,8  | WB    | R2   | 32     |       | 8  |
| 9  | YES  | bnez r1,loop   | WB    | 0    | Taken  | Taken | 9  |
| 10 | YES  | 1.d f2,x(r1)   |       | F2   |        |       | 0  |
| 11 | YES  | 1.d f4,y(r2)   |       | F4   |        |       | 1  |
| 12 | YES  | mul.d f2,f2,f0 |       | F2   |        |       | 2  |
| 13 | YES  | mul.d f4,f4,f1 |       | F4   |        |       | 3  |
| 14 | YES  | add.d f6,f3,f2 |       | F6   |        |       | 4  |
| 15 | YES  | add.d f6,f6,f4 |       | F6   |        |       | 5  |
| 16 | YES  | s.d f6,y(r2)   |       | s2   |        |       | 6  |
| 17 | YES  | dsubi r1,r1,8  |       | R1   |        |       | 7  |

# Reservation stations

|    | busy | Op | Qj  | Vj     | Qk  | Vk   | rob | result |
|----|------|----|-----|--------|-----|------|-----|--------|
| e1 | YES  | -  |     | 24     |     | 8    | #17 |        |
| e2 | NO   | -  |     | 40     |     | 8    | #8  | 32     |
| e3 | NO   | В  |     | 24     |     | 0    | #9  | Taken  |
| e4 | NO   |    |     |        |     |      |     |        |
| e5 | NO   |    |     |        |     |      |     |        |
| e6 | NO   |    |     |        |     |      |     |        |
| a1 | YES  | +  |     | 2.00   | #2  |      | #4  |        |
| a2 | YES  | +  | #4  |        | #3  |      | #5  |        |
| a3 | YES  | +  |     | 2.00   | #12 |      | #14 |        |
| a4 | YES  | +  | #14 |        | #13 |      | #15 |        |
| m1 | YES  | *  |     | 100.00 |     | 3.00 | #2  | 300.00 |
| m2 | YES  | *  |     | 10.00  |     | 0.50 | #3  |        |
| m3 | YES  | *  | #10 |        |     | 3.00 | #12 |        |
| m4 | YES  | *  | #11 |        |     | 0.50 | #13 |        |

# Read buffers

|    | busy | Qj | Vj | disp | addr | rob | result |
|----|------|----|----|------|------|-----|--------|
| 11 | YES  |    | 24 | X    | x+24 | #10 | 102.00 |
| 12 | YES  |    | 32 | у    | y+72 | #11 |        |

# Write buffers

|    | busy | Qj | Vj | disp | addr | rob | Qk  | Vk | confirm |
|----|------|----|----|------|------|-----|-----|----|---------|
| s1 | YES  |    | 40 | y    | y+40 | #6  | #5  |    | NO      |
| s2 | YES  |    | 32 | у    |      | #16 | #15 |    | NO      |

# Registers

|       | F0   | F1   | F2     | F3   | F4    | F5 | F6  | F7 |
|-------|------|------|--------|------|-------|----|-----|----|
| rob   |      |      | #12    |      | #13   |    | #15 |    |
| value | 3.00 | 0.50 | 100.00 | 2.00 | 10.00 |    |     |    |
|       | R0   | R1   | R2     | R3   | R4    | R5 | R6  | R7 |
| rob   |      | #17  | #8     |      |       |    |     |    |
| value |      | 32   | 40     |      |       |    |     |    |

## Memory

| Adr  | Data   |
|------|--------|
|      | •••    |
| x+24 | 102.00 |
| x+32 | 100.00 |
|      | •••    |
| y+32 | 12.00  |
| y+40 | 10.00  |

## Ejercicio 2.19.

Consider a superscalar processor that is binary-compatible with MIPS, able to fetch and issue up to two instructions per clock cycle. Clock frequency is 900 MHz, it has integer and floating point instructions, and it applies dynamic instruction scheduling with speculation for all the instructions. Instruction execution is split into the following stages:

**IF** Instruction fetch.

I Instruction decode and issue.

**En** Execution at the corresponding operator.

**WB** Result transfer through the shared data bus (the transferred result will not be available until the next cycle).

C Instruction commit stage. Result writing into the destination register. Prediction checking and instruction canceling, if necessary. Issue of memory write operations.

All stages last one clock cycle, except stage **En** whose duration depends on the operator. The execution unit has the following independent operators:

| Type             | Units | Latency (cycles) | Additional info                            |
|------------------|-------|------------------|--------------------------------------------|
| Integer/branches | 2     | 1                | 6 reservation stations                     |
| Load/Store       | 2     | 2                | Pipelined (linear), 6 buffers              |
| Floating point   | 2     | 3                | Pipelined (linear), 6 reservation stations |

The processor solves the control hazards by using a perfect predictor with zero clock cycles penalty. Each register file has four read ports and two write ports. One clock cycle is needed to transfer the data through the shared data buses in the execution unit.

This processor executes the code obtained from the compilation of  $\vec{B}(i) := x + \vec{A}(i)$ :

Draw a diagram that indicates, for each instruction and clock cycle, what stage of the instruction is being executed. Consider only the first and second loop iteration. For representing multicycle operation stages  $\mathbf{E}_n$  in the timing diagram, please use: EX for integer operations, Ai for floating point, and AC, Li for loads and stores; where i indicates the number of the cycle under execution (1, 2, ...).

Assuming that the second iteration is representative of what happens in the rest of iterations, compute the average CPI and the MFLOPS achieved by the computer running this code fragment.

### Solución:

| РC | Instruc.       | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11                     | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|----|----------------|----|----|----|----|----|----|----|----|----|----|------------------------|----|----|----|----|----|----|----|
| 0  | dsubi r1,r1,8  | IF | I  | E1 | WB | С  |    |    |    |    |    |                        |    |    |    |    |    |    |    |
| 1  | 1.d f0,A(r1)   | ΙF | I  | _  | _  | AC | L1 | L2 | WB | С  |    |                        |    |    |    |    |    |    |    |
| 2  | add.d f4,f0,f2 |    | ΙF | Ι  | _  | -  | -  | -  | -  | A1 | Α2 | АЗ                     | WB | С  |    |    |    |    |    |
| 3  | s.d f4,B(r1)   |    | ΙF | I  | _  | AC | _  | _  | _  | _  | _  | _                      | _  | С  | L1 | L2 |    |    |    |
| 4  | bnez r1,loop   |    |    | ΙF | Ι  | Ε1 | WB | -  | -  | -  | -  | -                      | -  | _  | С  |    |    |    |    |
| 5  | trap #0        |    |    | ΙF | Χ  |    |    |    |    |    |    |                        |    |    |    |    |    |    |    |
| 0  | dsubi r1,r1,8  |    |    |    | ΙF | Ι  | E1 | WB | -  | -  | -  | -                      | -  | -  | С  |    |    |    |    |
| 1  | 1.d f0,A(r1)   |    |    |    | IF | I  | -  | -  | AC | L1 | L2 | $\mathbb{W}\mathbb{B}$ | -  | _  | _  | С  |    |    |    |
| 2  | add.d f4,f0,f2 |    |    |    |    | IF | Ι  | -  | -  | -  | -  | -                      | A1 | A2 | A3 | WB | С  |    |    |
| 3  | s.d f4,B(r1)   |    |    |    |    | IF | I  | -  | AC | -  | -  | -                      | -  | _  | _  | -  | С  | L1 | L2 |
| 4  | bnez r1,loop   |    |    |    |    |    | ΙF | Ι  | Ε1 | WB | -  | -                      | -  | -  | -  | -  | -  | С  |    |
| 5  | trap #0        |    |    |    |    |    | ΙF | Χ  |    |    |    |                        |    |    |    |    |    |    |    |
| 0  | dsubi r1,r1,8  |    |    |    |    |    |    | IF | Ι  | Ε1 | WB | -                      | -  | -  | -  | -  | -  | С  |    |
| 1  | 1.d f0, A(r1)  |    |    |    |    |    |    | IF | I  | _  | _  | AC                     | L1 | L2 | WB | _  | _  | _  | С  |

Each iteration executes 5 instructions, needs 3 clock cycles, and performs a floating point operation:

$$CPI = \frac{3}{5} = 0.6.$$

 $v = \frac{1}{3} = 0.33$  op/cycle @ 900 MHz = 297 MFLOPS.

### Ejercicio 2.20.

A given program spends most of its execution time performing the following vector operations:

$$\vec{Y} = \vec{X} * \vec{Y} + b$$
$$\vec{X} = a * \vec{X}$$

Such program is executed on a MIPS/S processor working at 3GHz. It applies a dynamic instruction scheduling technique based on the use of the Tomasulo algorithm with *hardware* speculation. The stages in the processor pipeline are: IF (instruction fetch), I (*issue*), En (execution), WB (1-cycle transfer through the shared data buses), and C (*Commit*). The processor is a 2-way superscalar processor and its instruction cache (stage IF) delivers two instructions aligned on an even address to stage I. Whenever only one of the two instructions is required, the other one is cancelled at stage I and it is not decoded.

The MIPS/S integrates the following multicycle pipelined operators:

| MIPS/S                |                                  |
|-----------------------|----------------------------------|
| Pipelined             | 2 operators, 2 cycles,           |
| load/store            | 6 load buffers                   |
| units                 | 6 store buffers                  |
| Pipelined adders      | 2 operators, 2 cycles, 4 buffers |
| Pipelined multipliers | 2 operators, 3 cycles, 4 buffers |
| Integer/branch units  | 2 operators, 1 cycle, 6 buffers  |

and it uses a 1-bit BTB predictor that obtains its prediction during the IF stage.

Assume that the compiler has allocated the constants in registers f0 and f1 and it has generated the following code using a MIPS-like instruction set:

```
.text 0x400000
loop: l.d f2,X(r1)
    l.d f3,Y(r1)
    mul.d f3,f3,f2
    add.d f3,f3,f1
    s.d f3,Y(r1)
    mul.d f4,f0,f2
    s.d f4,X(r1)
    dadd r1,r1,#8
    bne r1,r4,loop
    trap #0; End of program
```

Before executing the last iteration, the ROB is empty and the state of registers and memory is the following one:

| Reg.  | r1 | r4 | f0 | f1 | f2  | f3 | Mem   | x+16 | x+24 | x+32 | y+16 | y+24 | y+32 |
|-------|----|----|----|----|-----|----|-------|------|------|------|------|------|------|
| Value | 24 | 32 | 10 | 20 | 0.5 | -1 | Value | 88   | 96   | 104  | 188  | 196  | 204  |

Answer the following questions and justify your answers:

- 1. Draw the instructions-time diagram of the **last loop iteration** for the MIPS/S. Show only those instructions reaching the *Commit* stage.
- 2. Indicate the number of cycles spent for running an iteration when the predictor hits and when it misses.
- 3. Detail in which clock cycle, from the beginning of the iteration, it can be considered that the variable X has been updated in memory. One of the instructions trap #0 is not finally cancelled, in which cycle is it fetched?
- 4. Consider the clock cycle when instruction 1.d f2, X(R1) is *committed* and show:
  - a) The state of register f3 at the end of that cycle.
  - b) The state of multiply operators at the end of that cycle.
  - c) The content of the ROB entry corresponding to instruction bnez r1, r4, loop at the end of that cycle.
- 5. If the loop executes 500 times, compute the execution time (in cycles), the resulting CPI, and the execution speed in MFLOPS. Assume that, initially, the instruction at the end of the loop has never been executed before.

### Solución:

1. Draw the instructions-time diagram of the **last loop iteration** for the MIPS/S. Show only those instructions reaching the *Commit* stage.

```
PC Instruc.
            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 l.d f2,X(r1) IF I AC L1 L2 WB C
1 l.d f3,Y(r1) IF I AC L1 L2 WB C
4 s.d f3,Y(r1) IF I AC - - - - - C L1 L2
6 s.d f4,X(r1)
                 IF I AC - - - - - - C L1 L2
IF I - E1 WB - - - - C
8 bne r1,r4,loop
9 trap #0
0 1.d f2, X(r1)
                        IF I AC L1 L2 WB - - - x
1 l.d f3, Y(r1)
                        IF I AC L1 L2 WB - - - x
2 mul.d f3,f3,f2
                          IF I - - M1 M2 M3 WB x
3 add.d f3, f3, f1
                            IF I AC - - - x
4 \text{ s.d } f3,Y(r1)
5 mul.d f4, f0, f2
                             IF I - - M1 M2 M3 WB x
                              IF I AC - - - x
6 s.d f4,X(r1)
7 daddi r1, r1, 8
                               IF I E1 WB - - x
8 bne r1, r4, loop
                                IF I - E1 WB - x
9 trap #0
0 1.d f2, X(r1)
                                   IF I AC L1 L2 X
1 1.d f3, Y(r1)
                                    IF I AC L1 L2 X
                                    IF I - - x
2 mul.d f3, f3, f2
3 add.d f3, f3, f1
                                      IF I - - x
4 \text{ s.d } f3,Y(r1)
                                        IF I AC x
 5 mul.d f4, f0, f2
                                        IF I -
6 \text{ s.d } f4,X(r1)
                                        IF I x
7 daddi r1, r1, 8
                                          IF I X
8 bne r1, r4, loop
                                           IF X
9 trap #0
                                            IF X
0 \text{ l.d } f2,X(r1)
                                             X
1 l.d f3, Y(r1)
                                               Χ
8 bne r1, r4, loop
                                                 IF X
 9 trap #0
                                                 IF I
```

2. Indicate the number of cycles spent for running an iteration when the predictor hits and when it misses.

If the predictor hits: 5 cycles

If the predictor misses: 16 cycles

3. Detail in which clock cycle, from the beginning of the iteration, it can be considered that the variable x has been updated in memory. One of the instructions trap #0 is not finally cancelled, in which cycle is it fetched?

The variable x is updated in memory at the end of cycle 17

The trap #0 instruction will be fetched at cycle 17

- 4. Consider the clock cycle when instruction 1.d f2, X(R1) is *committed* and show:
  - a) The state of register f3 at the end of that cycle. Registers

| register |       |       |       |        |    |    |    |    |
|----------|-------|-------|-------|--------|----|----|----|----|
|          | F0    | F1    | F2    | F3     | F4 | F5 | F6 | F7 |
| rob      |       |       | #9    | #10    | #5 |    |    |    |
| value    | 10.00 | 20.00 | 96.00 | 196.00 |    |    |    |    |
|          | R0    | R1    | R2    | R3     | R4 | R5 | R6 | R7 |
| rob      |       | #7    |       |        |    |    |    |    |
| value    |       | 24    |       |        | 32 |    |    |    |

b) The state of multiplier operators at the end of that cycle.

| Reser | vation s | station | IS |        |    |       |     |        |
|-------|----------|---------|----|--------|----|-------|-----|--------|
|       | busy     | Op      | Qj | Vj     | Qk | Vk    | rob | result |
| e1    | YES      | +       |    | 24     |    | 8     | #7  | 32     |
| e2    | YES      | В       | #7 |        |    | 32    | #8  |        |
| e3    | NO       |         |    |        |    |       |     |        |
| e4    | NO       |         |    |        |    |       |     |        |
| e5    | NO       |         |    |        |    |       |     |        |
| e6    | NO       |         |    |        |    |       |     |        |
| a1    | YES      | +       | #2 |        |    | 20.00 | #3  |        |
| a2    | NO       |         |    |        |    |       |     |        |
| a3    | NO       |         |    |        |    |       |     |        |
| a4    | NO       |         |    |        |    |       |     |        |
| m1    | YES      | *       |    | 196.00 |    | 96.00 | #2  |        |
| m2    | YES      | *       |    | 10.00  |    | 96.00 | #5  |        |
| m3    | NO       |         |    |        |    |       |     |        |
| m4    | NO       |         |    |        |    |       |     |        |

c) The content of the ROB entry corresponding to instruction bnez r1, r4, loop at the end of that cycle.

| <b>ROB</b> |      |                |       |      |        |       |    |
|------------|------|----------------|-------|------|--------|-------|----|
|            | busy | instr          | state | dest | value  | pred  | PC |
| 0          | NO   | l.df2,X(r1)    | WB    | F2   | 96.00  |       | 0  |
| 1          | NO   | l.df3, Y(r1)   | WB    | F3   | 196.00 |       | 1  |
| 2          | YES  | mul.d f3,f3,f2 |       | F3   |        |       | 2  |
| 3          | YES  | add.d f3,f3,f1 |       | F3   |        |       | 3  |
| 4          | YES  | s.d f3,Y(r1)   |       | s1   |        |       | 4  |
| 5          | YES  | mul.d f4,f0,f2 |       | F4   |        |       | 5  |
| 6          | YES  | s.d f4,X(r1)   |       | s2   |        |       | 6  |
| 7          | YES  | daddi r1,r1,8  | WB    | R1   | 32     |       | 7  |
| 8          | YES  | bne r1,r4,loop |       | 0    |        | Taken | 8  |
| 9          | YES  | 1.d f2,X(r1)   |       | F2   |        |       | 0  |
| 10         | YES  | 1.d f3,Y(r1)   |       | F3   |        |       | 1  |

5. If the loop executes 500 times, compute the execution time (in cycles), the resulting CPI, and the execution speed in MFLOPS.

The branch predictor misses once at the beginning of the loop and once again at the end. At the beginning of the loop, the branch is not in the table, so it is predicted as "Not taken". Since it is the first iteration, the branch is effective and the predictor misses. At the end of the loop, the branch is in the table and the prediction is "Taken". Since it is the last iteration, the branch is not effective and the predictor misses. Therefore:

$$T(500) = 498 * 5 + 2 * 16 = 2522$$
 cycles

Each iteration executes 9 instructions. Therefore:

$$CPI = \frac{2522}{9*500} = 0,56 \text{ cycles/instr.}$$

Each iteration executes 3 floating point operations. Therefore:

$$MFLOPS = \frac{3*500}{2522} = 0,595 \text{ op./cycle } @ 3 \text{ GHz} = 1786 \text{ MFLOPS}.$$

**Ejercicio 2.21.** The following assembly code for a MIPS processor implements the operation  $\vec{Z} = a\vec{X} + b\vec{Y}$  conditional to the contents of the mask vector  $\vec{M}$ .

The code is executed on a **2-way superscalar** processor that applies dynamic instruction scheduling with hardware speculation. Instruction execution is split into the following stages:

- IF Instruction fetch.
- I Instruction decode and issue.
- En Execution at the corresponding operator.
- WB Result transfer through the common data buses.
- C Instruction commit stage.

All stages last one clock cycle, except stage Ei, whose duration depends on the operator. The processor solves the control hazards by using a 1-bit dynamic branch predictor of type *Branch Target Buffer*. The predictor obtains its prediction at the end of the IF stage and **updates its prediction at the Commit stage**. The register files have 4 read ports and 2 write ports. The transfer time through the common data buses is 1 cycle.

The characteristics of the functional units of the processor are:

| Type               | Operators | Latency | Characteristics                     |
|--------------------|-----------|---------|-------------------------------------|
| Integer arithmetic | 2         | 1       | 8 reservation stations              |
| Load/Store         | 1         | 2       | Pipelined, 4 read + 4 write buffers |
| FP Add/Subtract    | 1         | 2       | Pipelined, 4 reservation stations   |
| FP Multiplication  | 1         | 4       | Pipelined, 4 reservation stations   |

Assume that the initial content of the vector  $\vec{M}$  is 1,0,1,0,... Thus, in the first iteration, the branch BEQZ R3, endif will not be taken. The initial predictor state is:

| Branch           | Prediction |
|------------------|------------|
| BEQZ R3, endif   | Taken      |
| BNE R1, R2, loop | Taken      |

**Questions:** 

1. Draw the instructions-time diagram for the first loop iteration (until the cycle in which the branch BNE R1, R2, loop executes the Commit stage). Show only those instructions that fit on the answer sheet. In order to represent the execution stages (En) in the diagram, please use M1, M2, M3 and M4 for the multiplication stages, A1 and A2 for the add/subtract stages, AC, L1 and L2 for the load/store stages, and E for the execution stage of the integer unit.

2. Show the state of the F2, F3, and F4 registers at the end of the Commit stage for the instruction L.D F2, X(R1) from the first iteration. The initial value of those registers is 2, 3, and 4, respectively. Show the contents of the value and rob fields. The contents of vector  $\vec{X}$  is  $100, 108, \ldots$  and the one for  $\vec{Y}$  is  $10, 11, \ldots$ 

### Solución:

1. Instructions-time diagram for the first loop iteration.

| PC | Instruc.         | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
|----|------------------|----|----|----|----|----|----|----|----|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0  | ld r3,M(r1)      | IF | Ι  | AC | L1 | L2 | WB | С  |    |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 1  | beqz r3,endif    | ΙF | I  | _  | _  | -  | _  | E1 | WB | С |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 8  | daddi r1, r1, 8  |    | IF | Ι  | E1 | WB | _  | _  | _  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 9  | bne r1, r2, loop |    | IF | I  | -  | -  | E1 | WB | -  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 0  | ld r3,M(r1)      |    |    | IF | I  | -  | AC | L1 | L2 | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 1  | beqz r3,endif    |    |    | IF | I  | _  | _  | _  | _  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 8  | daddi r1,r1,8    |    |    |    | IF | Ι  | Ε1 | WB | _  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 9  | bne r1, r2, loop |    |    |    | IF | I  | _  | _  | E1 | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 0  | ld r3,M(r1)      |    |    |    |    | ΙF | Ι  | -  | AC | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 1  | beqz r3,endif    |    |    |    |    | ΙF | I  | _  | -  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 8  | daddi r1,r1,8    |    |    |    |    |    | IF | Ι  | Ε1 | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 9  | bne r1, r2, loop |    |    |    |    |    | IF | I  | _  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 0  | ld r3,M(r1)      |    |    |    |    |    |    | IF | I  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 1  | beqz r3,endif    |    |    |    |    |    |    | IF | Ι  | Х |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 8  | daddi r1,r1,8    |    |    |    |    |    |    |    | IF | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
|    | bne r1, r2, loop |    |    |    |    |    |    |    | IF | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 0  | ld r3,M(r1)      |    |    |    |    |    |    |    |    | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
|    | beqz r3,endif    |    |    |    |    |    |    |    |    | Χ |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
|    | 1.d f2,X(r1)     |    |    |    |    |    |    |    |    |   |    | _  | AC |    |    |    | -  |    |    |    |    |    |    |    |    |    |    |    |
|    | 1.d f3,Y(r1)     |    |    |    |    |    |    |    |    |   | IF | I  | -  | AC | L1 | L2 | WB | С  |    |    |    |    |    |    |    |    |    |    |
|    | mul.d f3,f1,f3   |    |    |    |    |    |    |    |    |   |    | ΙF | Ι  | -  | -  | -  |    |    | M2 |    |    |    | С  |    |    |    |    |    |
| 5  | mul.d f2,f0,f2   |    |    |    |    |    |    |    |    |   |    | IF | Ι  | _  | _  | _  | M1 | M2 | МЗ | M4 | WB | _  | С  |    |    |    |    |    |
|    | add.d f4,f2,f3   |    |    |    |    |    |    |    |    |   |    |    | IF | -  | -  | -  | -  | -  | -  | -  | -  | -  | A1 | A2 | WB | С  |    |    |
|    | s.d f4,Z(r1)     |    |    |    |    |    |    |    |    |   |    |    | IF | Ι  |    |    | -  | -  | -  | -  | -  | -  | -  | -  | -  | С  | L1 | L2 |
|    | daddi r1,r1,8    |    |    |    |    |    |    |    |    |   |    |    |    | IF | Ι  | Ε1 | WB |    | -  | -  | -  | -  | -  | -  | -  | -  | С  |    |
| 9  | bne r1, r2, loop |    |    |    |    |    |    |    |    |   |    |    |    | IF | Ι  | -  | _  | Ε1 | WB | -  | -  | _  | _  | _  | _  | -  | С  |    |

2. State of the registers at the end of the Commit stage for the instruction L.D F2, X(R1) (cycle 16).

| Register | F2  | F3 | F4 |
|----------|-----|----|----|
| Value    | 100 | 3  | 4  |
| Rob      | #3  | #2 | #4 |

### Exercise 2.22.

The following figure shows the simplified instructions—time diagrams for the execution of two applications A and B on a 3-way superscalar processor. It is considered that a high latency stall happens whenever instruction issue is interrupted during two or more cycles.

|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-----|---|---|---|---|---|---|---|---|---|
|     |   |   | A |   |   |   |   |   |   |
| ľ   | Α |   | Α |   |   |   |   | A | A |
| - 1 |   |   |   |   |   |   |   |   |   |

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
|   |   |   |   | В |   |   |   |   |
|   | В |   |   | В | В | В |   |   |
| В | В |   |   | В | В | В | В | В |

In order to improve the processor resource utilization, the execution of multiple *threads* is under consideration.

1. From the viewpoint of how processor resources are shared, explain which are the main differences between a fine-grain, coarse-grain and simultaneous (*SMT*) multithreading.

2. Show a possible instructions—time diagram, similar to those already shown, corresponding to the execution of applications A and B on fine-grain, coarse-grain and simultaneous multithreading processors.

### Solution:

- Fine-grain multithreading
  - Thread switching is done every clock cycle, often in a round-robin fashion, skipping threads that have no ready instructions at that time.
  - Requires very fast thread switching.
  - Hides the throughput losses that arise from both short and long latency stalls, but may delay the execution of some threads.

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|----|----|----|----|
|   |   |   |   | A |   | В |   |   |   |    |    |    |    |
| A |   |   | В | A |   | В | В |   | В | A  |    | A  |    |
| A | В | Α | В | Α | Α | В | В | Α | В | A  | В  | Α  | В  |

- Coarse-grained multithreading
  - Thread switching is only done on long-latency stalls.
  - It does not require fast thread switching.
  - Limited ability to overcome throughput losses, specially from short stalls.

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|----|----|----|----|
|   |   | Α |   |   |   |   |   |   | В |    |    |    |    |
| A |   | Α |   |   | В |   | A | Α | В | В  | В  |    |    |
| A | Α | Α | Α | В | В | Α | Α | Α | В | В  | В  | В  | В  |

### ■ *SMT*

- Instructions from different threads can be issued at the same time, trying to use all the available resources.
- The number of threads under execution varies attending to the ILP of each thread and the set of available resources.

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| В | В | Α |   | В |   | В | В | В |
| A | В | Α |   | В | В | В | A | A |
| A | A | A | A | В | В | A | A | A |