# Homework 03

黄家睿 202283890036 IOT

## Exercise 4.1 Consider the following instruction:

Instruction: and rd, rsl, rs2

Interpretation: Reg[rd] = Reg[rs1] AND Reg[rs2]

- 1. What are the values of control signals generated by the control in Figure 4.10 for this instruction?
- 2. Which resources (blocks) perform a useful function for this instruction?
- 3. Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used?



Figure 4.10 The datapath for the memory instructions and the R-type instructions.

#### Solution: 1:

| RegWrite | ALUSrc | ALU operation | MemWrite | MemRead | MemtoReg |
|----------|--------|---------------|----------|---------|----------|
| True     | 0000   | And           | False    | False   | 0        |

- 2: Register, the ALUSrc Mux, ALU, MemtoReg
- 3: All blocks produce some output, but data memory is not used.

Exercise 4.3 Consider the following instruction mix:

| R-type | I-type (non-lw) | Load | Store | Branch | Jump |
|--------|-----------------|------|-------|--------|------|
| 24%    | 28%             | 25%  | 10%   | 11%    | 2%   |

- 1. What fraction of all instructions use data memory?
- 2. What fraction of all instructions use instruction memory?
- 3. What fraction of all instructions use the sign extend?
- 4. What is the sign extend doing during cycles in which its output is not needed?

# Solution

- 1. Only Load and Store use data memory. Fraction=25%+10%=35%
- 2. All kinds of instruction must be fetched and then decoded from instruction

memory before its execution, fraction=100%

- 3. All kinds of instruction except r-type one does not need sign extend, fraction=100%-24%=76%
  - 4. If output does not need sign extend, it will be ignored.

**Exercise 4.7** Problems in this exercise assume that the logic blocks used to implement a processor's datapath have the following latencies:

| I-Mem /<br>D-Mem | Register<br>File |       | ALU   |       |     |       | Register<br>Setup |                  | Control          |
|------------------|------------------|-------|-------|-------|-----|-------|-------------------|------------------|------------------|
| 250 ps           | 150 ps           | 25 ps | 200ps | 150ps | 5ps | 30 ps | 20ps              | $50 \mathrm{ps}$ | $50 \mathrm{ps}$ |

"Register read" is the time needed after the rising clock edge for the new register value to appear on the output. This value applies to the PC only. "Register setup" is the amount of time a register's data input must be stable before the rising edge of the clock. This value applies to both the PC and Register File.

- 1. What is the latency of an R-type instruction (i.e., how long must the clock period be to ensure that this instruction works correctly)?
- 2. What is the latency of lw? (Check your answer carefully. Many students place extra muxes on the critical path.)
- 3. What is the latency of sw? (Check your answer carefully. Many students place extra muxes on the critical path.)
- 4. What is the latency of beq?
- 5. What is the latency of an arithmetic, logical, or shift I-type (non-load) instruction?
- 6. What is the minimum clock period for this CPU?

**Solution:** 1. R-type: 30ps+250ps+150ps+25ps+200ps+25ps+20ps=700ps

2. lw: 30ps+250ps+150ps+25ps+200ps+250ps+25ps+20ps=950ps

3. sw: 30ps+250ps+150ps+25ps+200ps+250ps=905ps

4. beq: 30ps+250ps+150ps+25ps+200ps+5ps+25ps+20ps=705ps

5. I-type: 30ps+250ps+150ps+25ps+200ps+25ps+20ps=700ps

6. The minimum clock cycle required is the largest time slot that a instruction takes, the amount of time of accessing memory via lw instruction ,950ps.

Exercise 4.8 Suppose you could build a CPU where the clock cycle time was different for each instruction. What would the speedup of this new CPU be over the CPU presented in Figure 4.21 given the instruction mix below?

| R-type / I-type (non-lw) | 1w  | sw  | beq |
|--------------------------|-----|-----|-----|
| 52%                      | 25% | 11% | 12% |

**Solution:** The average time per instruction for the instruction mix is:

 $52\% \times 700$ ps+25×950ps+11%×705ps+12%×705=785.65ps

While a CPUC executing different instruction for the same clock cycle time is 950ps.

The speedup factor:  $\frac{950ps}{785.65ps} = 1.209$ 

Exercise 4.9 Consider the addition of a multiplier to the CPU shown in Figure 4.21. This addition will add 300 ps to the latency of the ALU, but will reduce the number of instructions by 5% (because there will no longer be a need to emulate the multiply instruction).



Figure 4.21 The datapath in operation for a branch-on-equal instruction.

- 1. What is the clock cycle time with and without this improvement?
- 2. What is the speedup achieved by adding this improvement?
- 3. What is the slowest the new ALU can be and still result in improved performance?

**Solution:** 

- 1. Without the improvement the cycle time is 950ps and after the improvement is 1250, which the cycle time would 300ps longer.
- 2. The speedup factor is :  $\frac{950ps}{1250ps} = 0.8$
- 3. The worst case is the new one takes as much cycle time as the previous one, i.e. 950ps and given that the improvement would reduce the number of instructions by 5% so the maximum cycle time is

$$\frac{950ps}{95\%} = 1000ps$$

In other words, the time the new ALU takes should only increase at most 50ps, which means a slowest improved ALU should take 250ps to finish computation

Exercise 4.15 lw is the instruction with the longest latency on the CPU from Section 4.4. If we modified lw and sw so that there was no offset (i.e., the address to be loaded from/stored to must be calculated and placed in rs before calling lw/sw), then no instruction would use both the ALU and Data memory. This would allow us to reduce the clock cycle time. However, it would also increase the number of instructions, because many ld and sd instructions would need to be replaced with lw/add or sw/add combinations.

- 1. What would the new clock cycle time be?
- 2. Would a program with the instruction mix presented in Exercise 4.7 run faster or slower on this new CPU? By how much? (For simplicity, assume every lw and sw instruction is replaced with a sequence of two instructions.)
- 3. What is the primary factor that influences whether a program will run faster or slower on the new CPU?
- 4. Do you consider the original CPU (as shown in Figure 4.21) a better overall design; or do you consider the new CPU a better overall design? Why?

Solut

- **ion: 1.** The clock time is :30ps+250ps+150ps+max{25ps+200ps,25ps+250ps+25ps+20}=750ps
  - 2. The original version of the CPU takes 950ps to run cycle while the revised one take 750ps. Beside since Id and sd instructions are equivalent to two lws and sws, there will be an extra 25%+1 1%=36% load/store instruction, resulting in  $1.36 \times n \times 750\% = n \times 1012$ . 5ps and the speedup factor

$$\frac{n \times 950 \text{ps}}{n \times 1012.5 \text{ps}} = 0.94 < 1.$$

Hence, the program run slower on this new CPU

- 3. The number and effect of load/store instructions are the primary factors
- **4.** The new one is better. Despite increasing instruction count, this design streamlines memory access, enhances scalability, and facilitates future optimizations, masking it a superior choice for systems prioritizing memory latency reduction and overall performance.

**Exercise 4.16** In this exercise, we examine how pipelining affects the clock cycle time of the processor. Problems in this exercise assume that individual stages of the datapath have the following latencies:

| IF    | ID     | EX    | MEM   | WB    |
|-------|--------|-------|-------|-------|
| 250ps | 350 ps | 150ps | 300ps | 200ps |

Also, assume that instructions executed by the processor are broken down as follows:

| ALU/Logic | Jump/Branch | Load | Store |
|-----------|-------------|------|-------|
| 45%       | 20%         | 20%  | 15%   |

- 1. What is the clock cycle time in a pipelined and non-pipelined processor?
- 2. What is the total latency of an lw instruction in a pipelined and non-pipelined processor?
- 3. If we can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage would you split and what is the new clock cycle time of the processor?
- 4. Assuming there are no stalls or hazards, what is the utilization of the data memory?
- 5. Assuming there are no stalls or hazards, what is the utilization of the write-register port of the "Registers" unit?

soluti

on: 1. For non-pipelined process 250ps+350ps+150ps+300ps+200ps=1250ps, for pipelined t=350ps

The time cost for a non-pipelined is 1250ps, for the pipelines one It is equal to the number of stages times the cycle time:

```
3×350=1750ps
```

- 3. Split the ID stage since it is the stage with the longest running ti me and then the cycle, time becomes the format second longest running : 300 ps
- 4. 35%.
- 5. 65%

Exercise 4.18 Assume that \$s0 is initialized to 11 and \$s1 is initialized to 22. Suppose you executed the code below on a version of the pipeline from Section 4.6 that does not handle data hazards (i.e., the programmer is responsible for addressing data hazards by inserting NOP instructions where necessary). What would the final values of registers \$s2 and

```
addi $s0, $s1, 5
add $s2, $s0, $s1
addi $s3, $s0, 15
```

\$s3 be?

solut

ion: \$s2=22+5+22=49 and \$s3=22+5+15=42

Essential 4.00 All years in the state of the

Exercise 4.20 Add NOP instructions to the code below so that it will run correctly on a pipeline that does not handle data hazards.

```
addi $s0, $s1, 5
add $s2, $s0, $s1
addi $s3, $s0, 15
add $s4, $s2, $s1
```

```
      solution:
      addi
      $s0,$s1,5

      nop
      nop

      add
      $s2,$s0,$s1

      nop
      add
      $s4,$s2,$s1
```

Exercise 4.22 Consider the fragment of MIPS assembly below:

```
sd $s5, 12($s3)
1d $s5, 8($s3)
sub $s4, $s2, $s1
beqz $s4, label
add $s2, $s0, $s1
```

Suppose we modify the pipeline so that it has only one memory (that handles both instructions and data). In this case, there will be a structural hazard every time a program needs to fetch an instruction during the same cycle in which another instruction accesses data.

- 1. Draw a pipeline diagram to show were the code above will stall.
- 2. In general, is it possible to reduce the number of stalls/NOPs resulting from this structural hazard by reordering code?
- 3. Must this structural hazard be handled in hardware? We have seen that data hazards can be eliminated by adding NOPs to the code. Can you do the same with this structural hazard? If so, explain how. If not, explain why not.
- 4. Approximately how many stalls would you expect this structural hazard to generate in a typical program? (Use the instruction mix from Exercise 4.8.)

```
Solution:
             1.
                     $s5,12($s3)
                                   IF
                                            ΕX
                                                ME WB
                 sd
                                       ID
                 Ld $s5,8($s3)
                                       IF
                                            ID
                                                EX
                                                    ME WB
                 Sub $s4,$s2,$s1
                                            IF
                                                ID
                                                         ME
                                                             WB
                 Beqz$s4,label
                                                         IF
                                                                      ME WB
                                                             ID
                                                                  EX
                 Add $s2,$s0,$$1
                                                                  ID
                                                                      EX
                                                                          ME WB
                 Sub $s2,$s6,$s1
                                                                  IF
                                                                      ID
                                                                          EX ME WB
```

- There is no possibility in reordering the code to reduce the number of stalls since every instruction must be fetched before any stages later from the same memory where data comes from
- 3. It cannot be solved by inserting NOPs since NOP itself shall be fetched from instruction memory as well
- 4. 25%+11%=36% since every data access will lead to a stall

Exercise 4.24 Which of the two pipeline diagrams below better describes the operation of the pipeline's

Choice 1:

hazard detection unit? Why?

```
      ld
      $11, 0($12)
      IF ID EX ME WB

      add
      $13, $11, $14
      IF ID EX .. ME WB

      or
      $15, $16, $17
      IF ID .. EX ME WB

      Choice 2:
      If ID EX ME WB

      ld
      $11, 0($12)
      IF ID EX ME WB

      add
      $12, $11, $14
      $14
```

 1d
 \$11, 0(\$12)
 IF ID EX ME WB

 add
 \$13, \$11, \$14
 IF ID .. EX ME WB

 or
 \$15, \$16, \$17
 IF .. ID EX ME WB

**Solution**: choice 2 since stall demands are detected in the stage of instruction fetching

Exercise 4.27 Problems in this exercise refer to the following sequence of instructions, and assume that it is executed on a five-stage pipelined datapath:

```
add $s3, $s1, $s0
lw $s2, 4($s3)
```



Figure 4.59 Pipelined control overview, showing the two multiplexers for forwarding, the hazard detection unit, and the forwarding unit.

- 1. If there is no forwarding or hazard detection, insert NOPs to ensure correct execution.
- Now, change and/or rearrange the code to minimize the number of NOPs needed. You can assume register \$t0 can be used to hold temporary values in your modified code.
- 3. If the processor has forwarding, but we forgot to implement the hazard detection unit, what happens when the original code executes?
- 4. If there is forwarding, for the first seven cycles during the execution of this code, specify which signals are asserted in each cycle by hazard detection and forwarding units in Figure 4.59.
- 5. If there is no forwarding, what new input and output signals do we need for the hazard detection unit in Figure 4.59? Using this instruction sequence as an example, explain why each signal is needed.
- 6. For the new hazard detection unit from 4.26.5, specify which output signals it asserts in each of the first five cycles during the execution of this code.

### Solution:

```
1. add $s3,$s1,$s0
```

Nop

Nop

Lw \$s1,4(\$s3)

Lw \$s1,0(\$s4)

Nop

Or \$s2,\$s3,\$s2

Nop

Nop

Sw \$s2,0(\$s3)

- 2. There is no way to reduce the number of NOPs by rearranging the code
- It executes correctly. Only when a instruction uses the result from the preceding load instruction would hazard detection be needed, which is not the

4. The first seven cycle:

| Cycle | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |    |
|-------|----|----|----|----|----|----|----|----|----|
| add   | IF | ID | EX | ME | WB |    |    |    |    |
| ld    |    | IF | ID | EX | ME | WB |    |    |    |
| ld    |    |    | IF | ID | EX | ME | WB |    |    |
| or    |    |    |    | IF | ID | EX | ME | WB |    |
| sd    |    |    |    |    | IF | ID | EX | ME | WB |

- (1) Forward A = X; Forward B = X (no instruction in EX stage yet)
- (2) Forward A = X; Forward B = X (no instruction in EX stage yet)
- (3) Forward A = 0; Forward B = 0 (no forwarding; values taken from registers)
- (4) Forward A = 2; Forward B = 0 (base register taken from result of previous instruction)
- (5) Forward A = 1; Forward B = 1 (base register taken from result of two instructions previous )
- (6) Forward A = 0; Forward B = 2 (rs1 = x15 taken from register; rs2 = x13 taken from result of 1st ld—two instructions ago)
- (7) Forward A = 0; Forward B = 2 (base register taken from register file.

  Data to be written taken from previous instruction)
- 5. The hazard detection unit requires the values of \$rd from the MEM/WB register and there is no extra need for output signals. To detect potential data hazard between add and Id, the value \$rd from EX/MEM should be employed.

  Besides, the one between the first Id and the or should be detected by

using

the value of \$rd from MEM/WB

6. The first five cycle:

| Cycle              | 1  | 2  | 3  | 4  | 5  |
|--------------------|----|----|----|----|----|
| Add \$s3,\$s1,\$s0 | IF | ID | EX | ME | WB |
| Lw \$s2,4(\$S3)    |    | IF | ID | ** | ** |
| Lw \$s1,0(\$S4)    |    |    | IF | ** | ** |

- (1) PC Write = 1; IF/ID Write = 1; control mux = 0
- (2) PC Write = 1; IF/ID Write = 1; control mux = 0
- (3) PC Write = 1; IF/ID Write = 1; control mux = 0
- (4) PC Write = 0; IF/ID Write = 0; control mux = 1
- (5) PC Write = 0; IF/ID Write = 0; control mux = 1

Exercise 4.28 The importance of having a good branch predictor depends on how often conditional branches are executed. Together with branch predictor accuracy, this will determine how much time is spent stalling due to mispredicted branches. In this exercise, assume that the breakdown of dynamic instructions into various instruction categories is as follows:

| R-type | beqz/bnez | jal | 1w  | sw |
|--------|-----------|-----|-----|----|
| 40%    | 25%       | 5%  | 25% | 5% |

Also, assume the following branch predictor accuracies:

| Always-Taken | Always-Not-Taken | 2-Bit |
|--------------|------------------|-------|
| 45%          | 55%              | 85%   |

- Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI
  due to mispredicted branches with the always-taken predictor? Assume that branch
  outcomes are determined in the ID stage and applied in the EX stage that there
  are no data hazards, and that no delay slots are used.
- 2. Repeat 4.28.1 for the "always-not-taken" predictor.
- 3. Repeat 4.28.1 for the 2-bit predictor.
- 4. With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions to some ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.
- 5. With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions in a way that replaced each branch instruction with two ALU instructions? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.
- 6. Some branch instructions are much more predictable than others. If we know that 80% of all executed branch instructions are easy-to-predict loop-back branches that are always predicted correctly, what is the accuracy of the 2-bit predictor on the remaining 20% of the branch instructions?

Solution:

1. Since a mispredicted branch, which makes up 25% of the instruction mi, leas to a three-instruction penalty ,i.e. the instruction in the IF, ID, and EX stage and given that there are 100%-45%=55% iof instruction triggers the flush for the always-taken predictor, the CPI would reach

- 2. 1+25%×3×(100%-45%)=1.3375
- 3.  $1+25\%\times3\times(100\%-85\%)=1.1125$
- 4. Given that half of the branch instruction are converted into ALU instruction, the proportion of branch becomes 12.5%, while leads to a lower CPI:

and subsequently the speedup factor:

5. Given that half of the branch instructions are converted into two ALU instructions, 12.5% of the branch instructions will have an extra cycle since it would be replaced with ALU instruction and another 12,5% that is not replaced would then have a 100%-85%=15% chance of penalty so the CPI becomes  $1+12.5\%\times1+12.5\%\times15\%\times1=1.14375$  and thus we can obtain the speedup factor:  $1.1125\div1.14375\approx0.97$ 

6. Since 80% of the branch instruction are always predicted correctly, only the remaining 20% ave to predict depending on the 2-bit predictor accuracy. The overall accuracy, however, does not change, i.e. 85%. Thus the accuracy of the remaining ones is obtained by solving the equation  $80\%+20\%\times x=85\%$ , which is 25%.