## Homework 8

Chinese Name: 杨人一

Pinyin Name: YANG RENYI

Student ID: 2023533030

E-Mail (omit "@shanghaitech.edu.cn"): yangry2023

- 20 1. Put T (True) or F (False) for each of following statement. [2 points each]
  - (1) (F) On-chip cache has bigger capacity compared with the main memory because they are physically closer to the CPU.
  - (2) (T) In a Moore FSM, the output depends only on the current state; in a Mealy FSM, the output depends on both the current state and the inputs.
  - (3) ( F ) Assign the base address of array X to register t0. The instruction lw a0,4(t0) will always load X[1] into a0.
  - (4) (F) Register a0-a7 can not be changed during execution of a function.
  - (5) (T) Pseudo instructions could be output by the compiler.
  - (6) (T) The linker finalizes destination addresses for all jump instruction.
  - (7) (T) Spatial locality refers to the phenomenon where accessing a specific memory location makes accessing proximate storage positions more likely in the near future.
  - (8) (T) Assume that a processor has a two-level cache hierarchy. If L1 cache follows the write-through policy while the L2 cache follows the write-back policy, a write hit at L1 cache would always cause a write at L2 cache.
  - (9) (F) Instruction-level parallelism (ILP) techniques include Very Long Instruction Word (VLIW) and Single Instruction, Multiple Data (SIMD).
  - (10) (F) A page fault occurs if a page table entry can not be found in the TLB.

## 2. Number Rrepresentation [12 points]

(a) Assume the horizontal axis represents a bit string incrementing from 00000000 to 111111111, while the vertical axis represents a decimal number. There are four different number representation. Please draw the lines of the different number representation on the figure. (Unsigned has been provided as a sample.)



00000000 01111111 1000000 111111111

Unsigned

255

128
127

1
0
-1
-127
-128

One's complement

01111111 10000000

11111111

-255

00000000





Two's complement

(b) What is the smallest number greater than 2 that can be represented by the IEEE 754 single-precision floating-point format?

Answer:  $2 + 2^{-23+1} = 2 + 2^{-22} = 2.0000002384185791015625$ 

(c) What is the smallest number greater than 4 that can be represented by the IEEE 754 single-precision floating-point format?

Answer:  $4 + 2^{-23+2} = 4 + 2^{-21} = 4.000000476837158203125$ 

- Email: yangry2023
- 3. Logic [8 points]
- (a) (Multiple Choice) Which of the following statement(s) is(are) true about boolean algebra? ABCD

A. 
$$XYZ + XY + X = X$$
.

$$B. \ (X + \overline{Y})X = X + X\overline{Y}.$$

C. 
$$X + YZ = (X + Y)(X + Z)$$
.

D. 
$$(X+Y)(\overline{X}+Z)(Y+Z) = (X+Y)(\overline{X}+Z)$$
.

(b) (Multiple Choice) Which of the following statement(s) is(are) true about boolean algebra? ABCD

A. 
$$X + \overline{X}Y = X + Y$$
.

B. 
$$X\overline{Y}Z + \overline{X} + Y + \overline{Z} = 1$$
.

C. 
$$XY + \overline{X}Z + YZ = XY + \overline{X}Z$$
.

D. 
$$\overline{\overline{XZ} + \overline{X}YZ} + \overline{Y}Z + XY\overline{Z} = XYZ$$
.

4. FSM [6 points]



(a) Fill in the truth table for the FSM

| state bit 1 | state bit 0 | input | next state bit 1 | next state bit 0 | output |
|-------------|-------------|-------|------------------|------------------|--------|
| 0           | 0           | 0     | 0                | 0                | 0      |
| 0           | 0           | 1     | 0                | 1                | 0      |
| 0           | 1           | 0     | 0                | 0                | 0      |
| 0           | 1           | 1     | 1                | 0                | 1      |
| 1           | 0           | 0     | 0                | 0                | 0      |
| 1           | 0           | 1     | 1                | 0                | 1      |

(b) Using St1(state bit 1), St0(state bit 0) and Ip(Input) as the input and Output as the output, extract a boolean expression from the truth table

$$Output = Ip (St1 + St0)$$

## 5. C, Risc-V and Cache [23 points]

(a) Suppose we have the C struct node as follows:

```
struct node {
  int8_t a;
  uint8_t b;
  short c;
  int d;
};
```

Now given N is an array, and each element of N is a pointer to struct node. We need to implement the following function.

```
for(int i=0; i<length; i++) {
   N[i]->d=N[i]->a*N[i]->b+N[i]->c;
}
```

Complete the following RISC-V code to implement the function, assuming machine is 32-bit.

```
li x12, 0x00FF00000 // base address of N li x13, 0x3 // assume length = 3 li x14, 0x0 // int i = 0 j end loop:  \frac{\text{slli }t0, x14, 3}{\text{add }t1, t0, x12} \\ \frac{\text{lw }t2, 0(t1)}{\text{lw }t3, 1(t1)} \\ \frac{\text{lw }t4, 2(t1)}{\text{mul }t5, t2, t3} \\ \frac{\text{add }t6, t5, t4}{\text{sw }t6, 4(t1)}
```

```
addi x14, x14, 1
end:
blt x14, x13, loop
```

(b) Suppose we have the following memory space before running the above program. If the machine is little-endian, what does the memory space look like after running the program? If the machine is big-endian, what does the memory space look like after running the program?

Please fill in the results in the corresponding table locations.

| 0xFFFFFFF  |            | 0xFFFFFFF  |            |  |  |
|------------|------------|------------|------------|--|--|
|            |            |            |            |  |  |
| 0x00FF0000 | 0x00000C88 |            | 0x00000C88 |  |  |
|            | 0x00000000 |            | 0xCC000000 |  |  |
|            | 0x0000CC00 | 0x00FF0000 | 0x0000CC00 |  |  |
|            |            |            |            |  |  |
|            |            |            |            |  |  |
|            | 0xA608FA1D |            | 0xA608FA1D |  |  |
|            |            |            |            |  |  |
| 0x0000CC00 | 0x94AE4092 | 00000CC00  | 0x94AE4092 |  |  |
|            |            | 0x0000CC00 |            |  |  |
|            |            |            |            |  |  |
| 0x00000C88 | 0x1FB19515 | 0x00000C88 | 0x1FB19515 |  |  |
| 0x00000C88 |            | 0x00000C88 |            |  |  |
|            | г и        |            |            |  |  |
| Little-    | Endian     | Big-E      | Big-Endian |  |  |

(c) Now suppose we run the above code in the above memory space twice consecutively (starting from li x12, 0x00FF0000 and ending with blt x14, x13, loop). All below cance use LRU policy.

Suppose we now have a data cache with a size of 256 Bytes and a cache block of 8 Bytes that is direct mapped. what is the hit rate of the data cache?

 $\frac{7}{8}$ 

9

Suppose we now have a data cache with a size of 256 Bytes and a cache block of 8 Bytes that is 2-ways associative. what is the hit rate of the data cache?

 $\frac{7}{8}$ 

Suppose we now have a data cache with a size of 256 Bytes and a cache block of 16 Bytes that is 4-ways associative. what is the hit rate of the data cache?

 $\frac{11}{12}$ 

## 6. Pipeline [19 points]

Consider a five-stage pipelined RISC-V processor shown below:



5 stage pipelined Risc-V processor

Suppose that we have the following delay and setup times.

| Register clk-to-q | 25ps  | Register setup   | 25ps  | Register hold | 5ps  |
|-------------------|-------|------------------|-------|---------------|------|
| RegFile read      | 100ps | RegFile setup    | 20ps  | Mux           | 25ps |
| ALU               | 150ps | Branch comp.     | 75ps  | Imm. Gen.     | 15ps |
| Memory read       | 200ps | DMEM write setup | 150ps |               |      |

(a) What is the highest clock frequency for this processor? Show calculation details. The longest delay of the five stages is Memory read:

$$t_{max}=t_{clk-to-q}+t_{setup}+t_{memory-read}=25ps+25ps+200ps=250ps$$
 
$$f_{max}=\frac{1}{t_{clk}}=\frac{1}{250ps}=4GHz$$

- 4
- (b) If this 5 stage pipelined Risc-V processor is changed to a single-cycle datapath, what is the lowest possible  $t_{clk}$ ? Show calculation details.

$$t_{clk} \ge t_{clk-to-q} + t_{I-memory-read} + t_{regfile-read} + t_{mux} + t_{D-memory-read} + t_{mux} + t_{regfile-setup}$$
  
=  $25ps + 200ps + 100ps + 25ps + 200ps + 25ps + 20ps = 595ps$ 

- (c) Forwarding datapaths are added to this five-stage pipelined RISC-V processor (the red line in Figure). The given code is excuted on this processor.
  - 1 I1: lw a1, 4(a0)
    2 I2: add a2, a2, a0
    3 I3: sub a3, a4, a1
    4 I4: lw a5, 0(a1)
    5 I5: sub a6, a4, a3
    6 I6: lw a7, 0(a6)
    7 I7: add a2, a7, a3

add a5, a2, a4

Plese identify which instruction(s) activate the forwarding data path to reduce stall(s)? (Use  $I1 \rightarrow I3$ , etc. to represent the instruction I1 are forward to I3.)

$$\frac{I3 \to I5}{I7 \to I8}$$

8 I8:

- 3
  - (d) With the forwarding datapaths, there is still one unavoidable stall left to correctly execute the code above. Please locate the instructions and registers that cause the stall (Use the blanks below, use I1, I2, etc. to represent the instructions).
    - The register a7 in instruction I6 and I7 causes the unavoidable stall.
- (e) Suppose we have a program with 4000 lines instructions running on this processor. What is the running time of the program? Suppose we use the processor frequency in (a). And we assume that the CPI is the same as the CPI of the program in (c). Show calculation details.

(Hint 
$$\frac{time}{program} = \frac{instructions}{program} \times \frac{cycles}{instruction} \times \frac{time}{cycle}$$
)

$$time_{program} = 4000 \times \frac{9}{8} \times \frac{1}{4GHz}$$

$$= 4000 \times \frac{9}{8} \times \frac{1}{4 \times 10^{9}}$$

$$= 4000 \times \frac{9}{32} \times 10^{-9}$$

$$= 1.125 \times 10^{-6}$$

$$= 1.125 \mu s$$

- 7. AMAT and Performance [12 points]
- (a) We have a two-level cache where L1 has a hit rate of 80% and a hit latency of 2 cycles, and L2 has a hit latency of 15 cycles. Assume 100 total accesses to this two-level cache will cause five L2 misses on average, what is the L2 local miss rate? Assume access to main memory takes 100 cycles, what is the Average Memory Access Time (AMAT) of this system?

L2 local miss rate = 
$$\frac{\text{L2 misses}}{\text{L2 accesses}} = \frac{5}{100 - 80} = \frac{5}{20} = 0.25$$

$$AMAT = 2 + 0.2 \times (15 + 0.25 \times 100)$$
 
$$AMAT = 10 \quad cycles$$

(b) Now we add a L3 cache to the above two-levels cache. If the hit latency of L3 cache is 30 cycles, what is the maximal local miss rate of L3 cache if we want to reduce the AMAT of the system to 8 cycles?

$$AMAT = 2 + 0.2 \times (15 + 0.25 \times (30 + m \times 100))$$
  

$$8 = 2 + 0.2 \times (22.5 + 25m)$$
  

$$m = 0.3$$

(c) Assume that you are given a program, of which a fraction f (measured in percentage) can be optimized for parallel computing by employing N machines built as a cluster. Consider that f=80%. If you want to accelerate the program's performance to about 3 times, what is the minimal number of N you need to employ?

$$Speedup = \frac{1}{(1-f) + \frac{f}{N}}$$
 
$$3 = \frac{1}{0.2 + \frac{0.8}{N}}$$
 
$$N = \frac{0.8}{1/3 - 0.2}$$
 
$$N = 6$$