

German University in Cairo - GUC Faculty of Media Engineering and Technology- MET Department of Computer Sciences Assoc. Prof. Milad Ghantous

# CSEN702-Microprocessors Winter Semester 2021

# Midterm Exam - Solution

#### Bar Code

#### Instructions: Read Carefully Before Proceeding.

- 1- Non-programmable calculators are allowed
- 2- Write your solutions in the space provided
- 3- The exam consists of (XX) questions
- 4- This exam booklet contains (XX) pages including this page
- 5- Total time allowed for this exam is (120) minutes
- 6- When you are told that time is up, stop working on the test

Good Luck!

| Question       | 1  | 2  | 3  | 4  | 5  | Σ  |
|----------------|----|----|----|----|----|----|
| Possible Marks | 10 | 20 | 10 | 30 | 20 | 90 |
| Final Marks    |    |    |    |    |    |    |

# Question 1 (10 pts)

Mark each of the following statement with T or F. Ambiguous letters will not be considered.

| #  | Statement                                                                                                                                         | T/F |
|----|---------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| 1  | The first access to any array element A[i] will result in a compulsory miss.                                                                      |     |
| 2  | Using multi-level caches helps in reducing the miss penalty.                                                                                      |     |
| 3  | Way-prediction lowers the average memory access time by lowering the miss rate.                                                                   |     |
| 4  | The size of the blocks in the blocking optimization, employed by the compiled or the programmer, should be less or equal to the cache block size. |     |
| 5  | The only limitation of the loop unrolling is the growing code size.                                                                               |     |
| 6  | An output dependence is a WAW hazard.                                                                                                             |     |
| 7  | For a program with no loops at all, branch prediction will increase the execution time compared with not using prediction at all.                 |     |
| 8  | Leakage in microprocessor chips increases with lower voltages.                                                                                    |     |
| 9  | Thermal Design power (TDP) is the maximum heat generated by the chip.                                                                             |     |
| 10 | The global miss rate of L2 cache should, in theory, be always less than that of the L1 cache.                                                     |     |

## Question 2 (20 pts)

Answer all the following short questions.

A) Consider a pipelined processor with 50% ALU operations, 30% branches, and 20% memory operations. It uses a branch prediction scheme for its branch instructions with a 2 cycles penalty. Given that we have 1 stall cycle every 5 instructions due to data hazards and that no structural hazards are present, what's the minimum **accuracy** of the branch prediction scheme to keep the CPI less than or equal to 1.3? (5 pts)

B) Consider the following code:

```
for (i=0; i<9; i++)
  for (j=0; j<9; j++)
  {
     A[j][i]++;
}</pre>
```

Given that A is a double-precision floating point array of size 10x10, and that the cache is fully associative, starts empty, and it has a total size of 192 KB:

What effect does loop interchange on the overall performance compared with the original code performance shown above if the cache block size is 8 bytes? Discuss. 2 pts for effect and 3 for discussion

C) What's the difference between power gating (1 pt) and clock gating (1 pt)? Which one(s) affect the dynamic power and/or the static power? (3 pts)

D) Two processors A and B exhibit the same average power and the same clock frequency. They have a different instruction set and architecture but manage to get the same CPI. Both have no stalls. Will they have the same energy efficiency (1 pt)? Explain why or why not. (4 pts)

### Exercise 3 (10 pts)

Consider a cache system with a miss penalty of C cycles for a cache block of size C words. This is due to copying each word into the cache requiring 1 cycle.

Assume the following memory content organized in words.

| Word  | 0     |
|-------|-------|
| Word  | 1     |
| Word  | 2     |
| Word  | 3     |
| Word  | 4     |
| Word  | 5     |
| Word  | 6     |
| Word  | 7     |
| Word  | 8     |
| Word  | 9     |
| Word  | 10    |
| And s | so on |
|       |       |

i) For a cache that starts empty, with block sizes of 16 bytes, and a critical word first approach employed, compute the miss penalty cycles when word 2 is requested from memory. Also compute the number of cycles reduced compared the conventional approach. (3 pts)

- ii) For a cache that starts empty, with block sizes of 16 bytes, and a early restart approach employed, compute the miss penalty cycles when word 2 is requested from memory. Also compute the number of cycles reduced compared the conventional approach. (3 pts)
- iii) For which words, do both approaches perform the same 1pt? Discuss. 1 pts
- iv) For which words, does early restart perform its worst 1 pt? Discuss 1 pts

### Exercise 4 (30 pts)

Consider two memory hierarchy systems M1 and M2.

M1 has two-level caches with the following measurements:

- In 100 memory references, 80 hit in L1 and 16 hit in L2.
- Hit time of L1 is 1 cycle and doubles in L2.
- Penalty of L2 is 10 cycles.

M2 has three-level caches with the following measurements:

- In 150 memory references, 120 hit in L1.
- Global miss rate of L2 is 6%, while it's 3% in L3.
- Penalty of L3 is 8 clock cycles.
- Hit time in L1 is 1 cycle, 3 cycles in L2 and 4 cycles in L3.

A) Compute the **local and global** miss rates of all caches in M1 and local miss rates of L1, L2 and L3 in M2. Show your work. 7 pts, 1 for each miss rate.

B) Compare both systems in terms of average memory access time. Show your work. (7 pts)

| C) Given that 60% of   | the instructions are ALU and branch operations and the              |
|------------------------|---------------------------------------------------------------------|
| remaining are loads as | nd stores, Compute the misses per instruction in all caches in both |
| systems M1 and M2.     | (10 pts)                                                            |

D) compare both systems in terms of average memory stalls per instruction. Show your work. (6 pts)

## M1:

 $\begin{aligned} \text{Average memory stalls per instruction} &= \text{Misses per instruction}_{L1} \times \text{Hit time}_{L2} \\ &+ \text{Misses per instruction}_{L2} \times \text{Miss penalty}_{L2} \end{aligned}$ 

### Exercise 5 (20 pts)

Consider the following MIPS code and answer all the parts.

```
FO, 800(R1)
LOOP: L.D
      L.D
               F1, 800(R2)
      ADD.D
               FO, FO, F1
               FO, FO, F7
      MUL.D
               FO, 800(R1)
      S.D
                R1, R1, -8
      ADDUI
      ADDUI
                R2, R2, -8
      BEO
                R2, R3 LOOP
```

Assume the following latencies and that forwarding is applied and also assume branches execute in the **decode** stage normally and they don't use a delayed slot.

| Instruction producing result | Instruction using result | Latency in clock cycles |  |
|------------------------------|--------------------------|-------------------------|--|
| FP load                      | FP ALU op                | 1                       |  |
| FP add                       | FP store                 | 2                       |  |
| FP multiply                  | FP store                 | 4                       |  |
| FP add                       | FP ALU op                | 3                       |  |

3.1) Compute the number of cycles needed for 1 iteration with no scheduling or unrolling.

Show your work. 8 pts (6 for code, and 2 for cycles)

```
Solution:
           F0, 800(R1)
LOOP: L.D
L.D
      F1, 800(R2)
stall
ADD.D F0, F0, F1
stall stall stall
MUL.D F0, F0, F4
stall stall stall
S.D F0, 800(R1)
ADDUI
         R1, R1, -8
ADDUI
         R2, R2, -8
stall
BEQ R2, R3 LOOP
```

# of cycles per iteration = 17.

The code is re-given here for your reference.

```
LOOP: L.D
               F0, 800(R1)
               F1, 800(R2)
      L.D
                F0, F0, F1
       ADD.D
       MUL.D
                F0, F0, F7
       S.D
                F0, 800(R1)
       ADDUI
                 R1, R1, -8
       ADDUI
                 R2, R2, -8
       BEQ
                 R2, R3 LOOP
```

3.2) Unroll the loop 2 times  $\underline{\text{and}}$  schedule the loop to reduce the number of stalls to  $\underline{\text{minimum}}$ .

Re-compute the number of cycles needed per iteration. (Use the registers F2 and F3, in addition to F0 and F1).

Show your work 12 pts (8 for code, 4 for # of cycles per iteration) Solution:

```
F0, 800(R1)
LOOP: L.D
            F1, 800(R2)
      L.D
            F2, 792(R1)
      L.D
      L.D
            F3, 792(R2)
      ADD.D F0, F0, F1
      ADD.D F2, F2, F3
      Stall stall
      MUL.D F0, F0, F4
      MUL.D F2, F2, F4
      ADDUI
               R1, R1, -16
      ADDUI
               R2, R2, -16
      stall
      S.D
            F0, 816(R1)
      S.D
            F2, 808(R1)
      BEQ
               R2, R3 LOOP
```

Total number = 16 cycles per 2 iterations → 8 cycles per iteration.