# Final Exam Solution



EECS 370 Fall 2023: Introduction to Computer Organization

| You are to abide by the University of Michigan College of Engineering Honor Code. Please sign below to signify that you have kept the honor code pledge:  I have neither given nor received aid on this exam, nor have I concealed any violations of the Honor Code. |                                                               |  |  |  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|--|--|--|
| Signature:                                                                                                                                                                                                                                                           |                                                               |  |  |  |
| Name:                                                                                                                                                                                                                                                                | ANSWER KEY                                                    |  |  |  |
| Uniqname:                                                                                                                                                                                                                                                            | eecs370staff_                                                 |  |  |  |
|                                                                                                                                                                                                                                                                      | erson sitting to your <i>Right</i> are at the end of the row) |  |  |  |
|                                                                                                                                                                                                                                                                      | erson sitting to your <i>Left</i> are at the end of the row)  |  |  |  |

#### **Exam Directions:**

- You have **120 minutes** to complete the exam. Don't spend too much time on a problem.
- There are 6 questions in the exam on 13 pages (double-sided). Please flip through your exam to ensure you have all pages.
- You must show your work to be eligible for partial credit!
- Write *legibly* and *dark* enough for the scanners to read your answers.
- Write your uniquame on the line provided at the top of each page.

#### **Exam Materials:**

- You are allotted **one 8.5 x 11 double-sided** note sheet to bring into the exam room.
- You are allowed to use calculators that do not have an internet connection. All other
  electronic devices, such as cell phones or anything or calculators with an internet
  connection, are strictly forbidden.

| Problem     | 1  | 2   | 3  | 4  | 5    | 6     |
|-------------|----|-----|----|----|------|-------|
| Pages       | 2  | 3-5 | 6  | 7  | 8-10 | 11-13 |
| Point Value | 15 | 20  | 10 | 10 | 20   | 25    |

# Problem 1: I guess they sometimes miss, huh?

15 points

For this problem, consider a cache with the following specifications:

- Cache Size is 32 Bytes
- Addresses are 20 bits
- Cache is 2-way set associative
- Cache block size is 8 Bytes
- Write Back policy
- No Allocate on Write

| <ol> <li>How many sets does the cache have</li> </ol> |
|-------------------------------------------------------|
|-------------------------------------------------------|

- 2. How many bits of the address would be for the cache tag? 20-1-3 = 16
- 3. Say the cache begins empty, and the system makes the accesses below. For each row, fill in the bubble if the access to that address results in a cache hit or miss.

| Row | Instruction | Address | Cache Hit or Miss         |
|-----|-------------|---------|---------------------------|
| 1   | load        | 0x3B8D9 | ◯ Hit <mark>◯ Miss</mark> |
| 2   | load        | 0x3B8D7 | ○ Hit ○ Miss              |
| 3   | store       | 0x3967A | ○ Hit ○ Miss              |
| 4   | load        | 0x3967E | ○ Hit ○ Miss              |
| 5   | store       | 0x39678 | ○ Hit ○ Miss              |
| 6   | load        | 0xE1F3D | ○ Hit ○ Miss              |
| 7   | load        | 0x3B8D4 | ○ Hit ○ Miss              |
| 8   | store       | 0x3B8DC | ○ Hit ○ Miss              |
| 9   | store       | 0x3B8DF | ○ Hit ○ Miss              |

| 4. | If the cache were instead to have a write-through policy, which access would change |
|----|-------------------------------------------------------------------------------------|
|    | whether or not main memory is accessed? (Select and fill in the bubble for one.)    |

| $\overline{}$ |   | Rov                 | ٧, | 3 |
|---------------|---|---------------------|----|---|
| \             | , | $1 \times 0 \times$ | v  | • |

| O Row 5 | ○ Row 8 | ○ Row 9    |
|---------|---------|------------|
|         |         | ( ) INOW 3 |

| $\frown$ | Dave | C |
|----------|------|---|
| . )      | Row  | С |

| _             |     |   |
|---------------|-----|---|
| $\overline{}$ | DOW | C |

5. If the cache were instead to be fully associative, which access would turn from a cache miss into a hit? (Select and fill in the bubble for one.)

○ Row 2

○ Row 6

O Row 7

Row 8

# **Problem 2: Pipe Dreams**

20 points

Fill in the bubble for the best answer in each of the following questions.

| 1. | Given a multi-cycle processor and a pipelined processor implementing the same ISA and with the same clock frequency, we would expect:                                                                                          |
|----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    | <ul> <li>The multi-cycle would have a higher CPI than the pipelined processor.</li> </ul>                                                                                                                                      |
|    | <ul> <li>The multi-cycle and pipelined processor would have the same CPI.</li> </ul>                                                                                                                                           |
|    | <ul> <li>The multi-cycle would have a lower CPI than the pipelined processor.</li> </ul>                                                                                                                                       |
| 2. | Given 2 multi-cycle processors with the same clock period, but one which uses a RISC ISA and one which uses a CISC ISA, we would expect:                                                                                       |
|    | <ul> <li>The processor with the RISC ISA will have a lower CPI.</li> </ul>                                                                                                                                                     |
|    | <ul> <li>The two processors will have similar CPIs.</li> </ul>                                                                                                                                                                 |
|    | <ul> <li>The processor with the CISC ISA will have a lower CPI.</li> </ul>                                                                                                                                                     |
| 3. | Given 2 equivalent pipelined processors, one which requires avoidance, and one which uses detect-and-stall to resolve all hazards, we would expect:                                                                            |
|    | <ul> <li>The pipeline with detect-and-stall will have a lower CPI since it will execute fewer<br/>instructions.</li> </ul>                                                                                                     |
|    | <ul> <li>The two pipelines will have the same CPI since they have the same cycle and<br/>instruction counts.</li> </ul>                                                                                                        |
|    | <ul> <li>The pipeline with detect-and stall will have a higher CPI since it will run for more<br/>cycles.</li> </ul>                                                                                                           |
|    | <ul> <li>The pipeline with detect-and-stall will have a higher CPI since it will execute<br/>fewer instructions.</li> </ul>                                                                                                    |
| 4. | Given 2 equivalent pipelined processors, except that one which uses detect-and-stall to solve data hazards, and one which uses detect-and-forward where possible to solve data hazards, we would expect:                       |
|    | <ul> <li>The pipeline with detect-and-stall will have a lower CPI since it will execute more<br/>instructions.</li> </ul>                                                                                                      |
|    | <ul> <li>The two pipelines will have the same CPI because of control hazards.</li> </ul>                                                                                                                                       |
|    | <ul> <li>The pipeline with detect-and-stall will have a higher CPI since it will run for more<br/>cycles.</li> </ul>                                                                                                           |
|    | <ul> <li>The pipeline with detect-and-stall will have a higher CPI since it will execute<br/>fewer instructions.</li> </ul>                                                                                                    |
| 5. | Given 2 equivalent pipelined processors, one which uses detect-and-stall to solve control hazards, and one which uses speculate-and-squash and a branch predictor with 10% correct prediction rate, we would generally expect: |
|    | ○ The pipeline with detect-and-stall will have a higher CPI.                                                                                                                                                                   |
|    | ○ The two pipelines will have the same CPI.                                                                                                                                                                                    |
|    |                                                                                                                                                                                                                                |

O The pipeline with detect-and-stall will have a lower CPI.

| ıniqname: |  |  |
|-----------|--|--|
|           |  |  |

- 6. Say we took the 5-stage LC2K pipeline discussed in class and shortened the pipeline by combining the EX and MEM stages into 1 stage. We would generally expect:
  - O The shorter pipeline would have a higher CPI and higher or same clock period.
  - The shorter pipeline would have a higher CPI but a lower or same clock period.
  - The shorter pipeline would have a lower CPI but a higher or same clock period.
  - O The shorter pipeline would have a lower CPI and a lower or same clock period.
- 7. Say we added an instruction cache to a processor that didn't previously have one. Which of the following is most true?
  - A program with few branches and also a program with many loops would both benefit from the I cache primarily due to temporal locality.
  - A program with few branches and also a program with many loops would both benefit from the I cache primarily due to spatial locality.
  - A program with few branches would benefit from the I cache primarily due to temporal locality, and a program with many loops would benefit from the I cache primarily due to spatial locality.
  - A program with few branches would benefit from the I cache primarily due to spatial locality, and a program with many loops would benefit from the I cache primarily due to temporal locality.

For the next parts, consider the given LC2K program and pipeline. Assume we run the program on the pipeline. In the given table, indicate the opcode of the instructions that are in the **WriteBack** stage of the pipeline during the specified cycle until the processor halts. Use noop to indicate no instruction in the stage. Inclusion of the halt opcode is optional. You might not need all cycles. The first cycle is done for you.

8. Fill out the table for the below hazard-free program run on the 5-stage pipeline discussed in class.

| lw   | 0 | 1     | num1        | Each pro  | ogram will ha | ive 4 noops to | start   |
|------|---|-------|-------------|-----------|---------------|----------------|---------|
| nor  | 0 | 0     | 2           |           |               |                |         |
| noop |   | For h | nazard free | programs, | the instruct  | ions follow i  | n order |
| noop |   |       |             |           |               |                |         |
| add  | 1 | 1     | 3           |           |               |                |         |
| halt |   |       |             |           |               |                |         |

num1 .fill 100

| Cycle     | 1    | 2    | 3    | 4    | 5  | 6   | 7    | 8    | 9   | 10   |
|-----------|------|------|------|------|----|-----|------|------|-----|------|
| WB Opcode | поор | noop | noop | noop | lw | nor | noop | noop | add | halt |

9. Fill out the table for the below program run on the 5-stage pipeline from lecture which uses detect-and-stall to solve data hazards.

lw 0 4 num2

2 stalls per data hazard in Lecture D&S

sw 4 0 0

halt

num2 .fill 280

| Cycle     | 1    | 2    | 3    | 4    | 5  | 6    | 7    | 8  | 9    | 10 |
|-----------|------|------|------|------|----|------|------|----|------|----|
| WB Opcode | noop | noop | noop | noop | lw | noop | noop | SW | halt |    |

10. Fill out the table for the below program run on the 5-stage pipeline from class which uses detect-and-forward to solve data hazards, as far as possible.

lw 0 5 num3 2nd L

2nd LW RegA depends on 1st LW RegB

lw 5 6 0 SW RegB depends on 2nd LW regB

sw 0 6 num3 1 noop per LW dependency

halt

num3 .fill 370

| Cycle     | 1    | 2    | 3    | 4    | 5  | 6    | 7  | 8    | 9  | 10   |
|-----------|------|------|------|------|----|------|----|------|----|------|
| WB Opcode | поор | noop | noop | noop | lw | noop | lw | noop | SW | halt |

11. Fill out the table for the below program run on the 5-stage pipeline from class which uses detect-and-forward to solve data hazards as far as possible, and speculate-and-squash with a 1-bit branch predictor per branch initialized to Not Taken to solve control hazards. For simplicity, assume the BTB/branch predictor are read in the IF stage, and updated in the same cycle as when a branch is resolved.

add 0 0 7

b1 beq 0 7 skip Taken 1st time, not taken 2nd

halt

skip nor 0 0 7

beq 0 0 b1

| Cycle     | 1    | 2    | 3    | 4    | 5   | 6   | 7    | 8    | 9    | 10  |
|-----------|------|------|------|------|-----|-----|------|------|------|-----|
| WB Opcode | поор | noop | noop | noop | add | beq | noop | noop | noop | nor |
|           | -    | -    | -    |      |     | -   | -    | -    | -    |     |

Taken

| Cycle     | 11  | 12   | 13   | 14   | 15  | 16   | 17   | 18   | 19   | 20 |
|-----------|-----|------|------|------|-----|------|------|------|------|----|
| WB Opcode | beq | noop | noop | noop | beq | noop | noop | noop | halt |    |

# **Problem 3: Branch Predictor Matching**

## 10 points

Consider a program in C, with the corresponding LC2K code:

```
int main(void) {
                                                          0
                                                                 4
                                                                       mask
      int sum = 0;
                                                    add
                                                          0
                                                                       5
                                                                 0
      for(int i = 3;
                                                    lw
                                                          0
                                                                 2
                                                                       i
                       i <= 6;
                                                                 3
                                                    lw
                                                          0
                                                                       max
                               ++i) {
                                                    lw
                                                          0
                                                                 1
                                                                       one
         // Take branch if i == 7
                                             loop
                                                    bea
                                                          2
                                                                 3
                                                                       end
                                                    nor
                                                          2
                                                                 2
                                                                       6
             if (i % 4 != 0)
                                                    nor
                                                          4
                                                                 6
                                                                       6
         // Take branch if i % 4 == 0
                                                          6
                                                    beq
                                                                 0
                                                                       skip
                   sum += i;
                                                    add
                                                          5
                                                                 2
                                                                       5
      }
                                             skip
                                                    add
                                                          2
                                                                 1
                                                                       2
         // Take branch unconditionally
                                                    bea
                                                          0
                                                                       loop
}
                                             end
                                                    halt
                                                    .fill -4
                                             mask
                                             i
                                                    .fill 3
                                                    .fill 7
                                             max
                                             one
                                                    .fill 1
```

Say we run the above program on processors with different branch predictors. Match the below branch predictors to their misprediction counts by placing the corresponding letter into the blank next to the **number of mispredictions across all branches**. Each branch predictor will have a unique number of mispredicts. Assume the branch target buffer is preloaded with the target PC of all three branch instructions. You may use the table below for scratch work.

- A. Always Not Taken
- B. Forward branches predicted Not Taken, Backward branches predicted Taken
- C. Last-Time (1-Bit) Predictor per branch, initialized to Not Taken
- D. Saturated (2-Bit) Predictor per branch, initialized to Weakly Not Taken
- E. Saturated (2-Bit) Predictor per branch, initialized to Weakly Taken

| Branch<br>Predictor (A-E) | Total Misprediction Count |
|---------------------------|---------------------------|
| В                         | 2                         |
| D                         | 3                         |
| С                         | 4                         |
| Е                         | 5                         |
| А                         | 6                         |

(Branch outcome scratch work, not graded)

| NT | NT | NT | NT | Т |
|----|----|----|----|---|
| NT | Τ  | NT | NT |   |
| Т  | Т  | Т  | Т  |   |

#### **Problem 4: Wolverine Accesses**

### 10 points

The University has had trouble keeping Wolverine Access working efficiently on course registration days, and has tasked you to help optimize their code to avoid slowdowns and improve quality of life. Consider the following C program fragment run on a **64-bit**, **memory-aligned system** with a cache that has a **block size of 8 bytes**. Then answer the following questions. Assume UM\_catalog begins at address 0x1000.

| Assume the following data type sizes: |         |  |  |  |
|---------------------------------------|---------|--|--|--|
| bool                                  | 1 byte  |  |  |  |
| char                                  | 1 byte  |  |  |  |
| short                                 | 2 bytes |  |  |  |
| int                                   | 4 bytes |  |  |  |
| float                                 | 4 bytes |  |  |  |
| double                                | 8 bytes |  |  |  |

1. What is the size of the above course struct? (Select and fill in the bubble for one.)

total credits += UM catalog[i].credits;

○ 22 B

}

○ 24 B

for(unsigned int i = 0; i<CATALOG\_SIZE; ++i) {</pre>

if(UM\_catalog[i].graded) {

return grade / total credits;

○ 32 B

grade += UM\_catalog[i].median \* UM\_catalog[i].credits;

2. Provide a reordering of the course struct that minimizes its size **AND** minimizes the number of cache misses when the weighted grade function is run.

#### List the member identifiers in order

To minimize misses, graded, median, and credits must be in the same 8B segment. This means graded and credits must be adjacent to each other, and median can be adjacent to either.

To minimize size, in addition to the above, department and course\_number must be adjacent.

The minimum size is 24 bytes. There exist reorderings that minimize size but don't minimize cache misses.

3. Say the cache block size was decreased to 4 bytes. If weighted\_grade was run once, we would generally expect the new block size to: (Fill in the bubble for all that apply.)

Increase Hits

Increase Compulsory Misses

O Increase Other Misses

### **Problem 5: Virtually Correct**

## 20 points

1. In 1-2 sentences, give a disadvantage of using a physically addressed cache, and a disadvantage of using a virtually addressed cache. You should reference the TLB somewhere in your answer.

A physically addressed cache must access the TLB or page table before accessing the cache, increasing minimum latency. A virtually addressed cache needs some way to deal with homonyms, or the same address being used in different virtual address spaces but referring to different data, which can be solved by either storing only 1 process's data in the cache at a time, and flushing on context switch, or including the PID in the tag, which requires more storage.

You're working on developing a LC2K operating system and are trying to debug store instructions in virtual memory. Recall that LC2K is a word-addressable architecture. Since it's *Little* Computer 2000, there's not a lot of memory to work with. Virtual addresses are **16 bits** in length, while there are only **4 pages of physical memory**. Each page is **256 words**.

Current State of Physical Memory:

| Physical Page 0 | Reserved for OS and Page Tables                                       |
|-----------------|-----------------------------------------------------------------------|
| Physical Page 1 | Least Recently Used Data Page (VPN ?)                                 |
| Physical Page 2 | Unused/Free                                                           |
| Physical Page 3 | Most Recently Used Data Page (VPN 0, Including Text Section & .fills) |

After all physical pages are used, virtual pages are replaced based on true LRU policy, ignoring the reserved page.

The system uses a 2-level page table, with bits **15-12** of the virtual address representing the first level index and bits **11-8** representing the second level index. The page table base address is **0xE0**.

The 1st level entries are encoded as follows:

| Bits 31-11     | Bit 10 | Bits 9-0                              |
|----------------|--------|---------------------------------------|
| Unused (all 0) | Valid  | Start Address of 2nd Level Page Table |

The 2nd level entries are encoded as follows:

| Bits 31-3      | Bit 2 | Bits 1-0             |
|----------------|-------|----------------------|
| Unused (all 0) | Valid | Physical Page Number |

| 2          | How many      | entries are in the | 1st level page table? | 16 |
|------------|---------------|--------------------|-----------------------|----|
| <b>~</b> . | I IOW IIIaiiy |                    | 13t icvci page table: | 10 |

| Page | 9 |
|------|---|
|------|---|

uniqname: \_\_\_\_\_

Say we execute the following LC2K program in virtual memory on our machine:

|     | lw    | 0 | 1    | val   | // | Load val into r1       |
|-----|-------|---|------|-------|----|------------------------|
|     | SW    | 1 | 1    | 20480 | // | VMem[r1 + 0x5000] = r1 |
|     | add   | 1 | 1    | 2     | // | r2 = 2 * r1            |
|     | SW    | 2 | 1    | 20480 | // | VMem[r2 + 0x5000] = r1 |
|     | halt  |   |      |       |    |                        |
| val | .fill |   | 1801 |       | // | 0x709 in hex           |

The below table shows a small snippet of physical memory as the program begins, from 0xE0 to 0x10F.

| Addr  | +0  | +1  | +2  | +3  | +4  | +5  | +6  | +7  | +8  | +9  | +A  | +B  | +C  | +D  | #<br># | +F  |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|-----|
| 0x0E0 | 430 | 0FC | 164 | 2C3 | 106 | 4F0 | 387 | 12D | 077 | 26d | 1BA | 324 | 013 | 3B8 | 337    | 3EE |
| 0x0F0 | 003 | 002 | 000 | 002 | 001 | 003 | 002 | 005 | 003 | 001 | 000 | 003 | 002 | 003 | 001    | 002 |
| 0x100 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000    | 000 |

| 4. | For the first store, | what is the destination virtual address?                            |
|----|----------------------|---------------------------------------------------------------------|
|    | Ox0709               |                                                                     |
|    | Ox5000               |                                                                     |
|    | Ox5709               | 0x5000 + 0x709                                                      |
|    | Ox6801               |                                                                     |
| 5. | For the first store, | what is first level page table index?                               |
|    | Ox2                  |                                                                     |
|    | Ox5                  | First 4 bits of above address.                                      |
|    | Ox6                  |                                                                     |
|    | OxA                  |                                                                     |
| 6. | For the first store, | what are the initial contents of the second level page table entry? |
|    | Ox1                  | 1st level PTE is at 0xE5 (0x0E0 + 0x5). Contents are 0x4F0.         |
|    | Ox3                  | Thus the 2nd level is valid, and begins at 0xF0.                    |
|    | Ox5                  | 2nd level index is 0x7. Address is 0xF7. Contents are 0x5           |
|    | ○ Cannot be          | determined from the given information                               |
| 7. | For the first store, | what is the final destination physical address?                     |
|    | Ox009                |                                                                     |
|    | Ox109                | 2nd level contents of 0x5 means it is valid and PPN is 1            |
|    | Ox209                | Concatenation of 0x1 with page offset of 0x09 gives 0x109           |
|    | Ox309                |                                                                     |
|    |                      |                                                                     |

#### (Snippet repeated for ease.)

| Addr  | +0  | +1  | +2  | +3  | +4  | +5  | +6  | +7  | +8  | +9  | +A  | +B  | +C  | +D  | +E  | +F  |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0x0E0 | 430 | 0FC | 164 | 2C3 | 106 | 4F0 | 387 | 12D | 077 | 26d | 1BA | 324 | 013 | 3B8 | 337 | 3EE |
| 0x0F0 | 003 | 002 | 000 | 002 | 001 | 003 | 002 | 005 | 003 | 001 | 000 | 003 | 002 | 003 | 001 | 002 |
| 0x100 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 | 000 |

| 8.  | For the sec | ond store,  | what is the destination virtual address?                            |
|-----|-------------|-------------|---------------------------------------------------------------------|
|     | Ox5         | E0E         |                                                                     |
|     | Ox5         | E12         | 0x709 * 2 = 0xE12                                                   |
|     | Ox5         | E18         |                                                                     |
|     | Ox80        | 602         |                                                                     |
| 9.  | For the sec | ond store,  | what is the first level page table index?                           |
|     | Ox2         |             |                                                                     |
|     | O 0x4       |             |                                                                     |
|     | Ox5         |             | First 4 bits of above address.                                      |
|     | OxA         |             |                                                                     |
| 10. | For the sec | ond store,  | what are the initial contents of the second level page table entry? |
|     | Ox1         |             | 1st level PTE is at 0xE5. Contents are 0x4F0.                       |
|     | Ox3         |             | Thus the 2nd level is valid, and begins at 0xF0.                    |
|     | Ox5         |             | 2nd level index is 0xE. Address is 0xFE. Contents are 0x1           |
|     | ○ Can       | not be dete | ermined from the given information                                  |
| 11. | For the sec | ond store,  | what is the final destination physical address?                     |
|     | Ox1         | 12          | 2nd level contents of 0x1 means it is invalid                       |
|     | Ox2         | 12          | We allocate an empty page, PPN 2.                                   |
|     | Ox3         | 12          | Concatenation of 0x2 with page offset of 0x12 gives 0x212           |
|     | ○ Can       | not be dete | ermined from the given information                                  |
|     |             |             |                                                                     |

12. The store instructions in the program will result in exactly 2 updates to the memory values in the above snippet. Since the machine code is valid in physical memory, you may ignore page table updates from the load, and from instruction fetches. For each value in the above table that is modified, write below the physical address receiving the update and the new value. Put all values in Hexadecimal.

| Physical Address (in Hex, in the range 0xE0 to 0x10F) | New Value (in Hex)     |  |  |  |  |
|-------------------------------------------------------|------------------------|--|--|--|--|
| 0x109 (Final destination of 1st access)               | 0x709 (RegB value)     |  |  |  |  |
| 0xFE (2nd level PTE of 2nd access)                    | 0x6 (Valid with PPN 2) |  |  |  |  |

# **Problem 6: Cost-Benefit Analysis**

### 25 points

You are doing an internship as a computer architect and were asked to evaluate a set of architectural trade-offs. They want to implement the features that provide the best performance-cost trade-off for their system. Your task is to predict the benchmarks' performance and determine which configurations are optimal for their applications.

This baseline in-order pipelined processor was designed to have no stalls for data hazards of any type. You are given a simulator and collect the following measurements:

| Total instructions             | 35,000,000 |
|--------------------------------|------------|
| Memory instructions            | 16,000,000 |
| Branch instructions            | 7,100,000  |
| I-cache hit rate               | 98%        |
| D-cache hit rate               | 91%        |
| Branch predictor accuracy rate | 92%        |

Unfortunately, the simulator they gave you did not compute cycles or CPI. Your first task is to figure these out for the two benchmarks. After some digging, you will find the following facts from the simulator's configuration parameters:

- The branch misprediction penalty is 7 cycles
- I and D cache miss penalties are 20 cycles
- 1. You first compute the cycles spent on misprediction and miss penalties for the benchmark.

| Branch misprediction cycles | 7.1 M * 8% * 7 = 3.976 M |  |  |  |  |  |
|-----------------------------|--------------------------|--|--|--|--|--|
| I-cache miss cycles         | 35 M * 2% * 20 = 14 M    |  |  |  |  |  |
| D-cache miss cycles         | 16 M * 9% * 20 = 28.8 M  |  |  |  |  |  |

2. Then, produce a summary of total cycles and CPI (rounded to 2 decimal places). You may ignore the few cycles to empty or fill the pipeline.

| Total cycles | 35 M + above = 81.776 M |
|--------------|-------------------------|
| СРІ          | 81.776 / 35 = 2.336     |

(Original metrics repeated for ease)

| Total instructions             | 35,000,000 |
|--------------------------------|------------|
| Memory instructions            | 16,000,000 |
| Branch instructions            | 7,100,000  |
| I-cache hit rate               | 98%        |
| D-cache hit rate               | 91%        |
| Branch predictor accuracy rate | 92%        |

- The branch misprediction penalty is 7 cycles
- I and D cache miss penalties are 20 cycles
- 3. Your manager asks you to decide between two features, each adding equal processor area:
  - A branch predictor that increases prediction accuracy to 95%
  - A better data cache, which increases the D-cache hit rate to 96%

Each feature adds about 25% to the processor area. Since the company doesn't want to add too much extra cost to the processor (and cost is proportional to area), they would like to pick between the two options. You start eyeballing the benchmark breakdown from before and quickly realize that the choice is clear.

| O Better D-cache | O Better branch predictor |
|------------------|---------------------------|
|                  | O Better D-cache          |

Please explain your reasoning in 1-2 sentences. You do not need to make specific calculations, but your answer should include comparisons between relevant metrics.

The D cache is accessed by more instructions (since there are more memory instructions than

branches), costs more cycles on a miss, and it gets a greater % of misses eliminated (91% →

96% is better than 92%  $\rightarrow$  95%). These multiply out to a greater cycle reduction.

4. The better D-cache and branch predictor were deemed too expensive for the product, so you are asked to evaluate another idea instead. This new idea is an accelerator engine that can accelerate certain sets of instructions but which adds a 10% area cost. You look at the accelerator's new custom instructions and note that using them results in the execution of 8,000,000 fewer arithmetic instructions (the number of branch and memory instructions executed stays the same).

You produce the following results:

| Total instructions                           | 27,000,000                 |
|----------------------------------------------|----------------------------|
| Cycles reduced from the initial architecture | 8 M (1 + 2% * 20) = 11.2 M |

5. In a discussion with your team, it comes out that a decision has to be made between the initial architecture and the accelerator. The former two options were deemed to be too expensive for the product.

The main objective is to **minimize the cost** of the device, and due to system constraints, they can only run the processor at a maximum of 1.5 GHz. (1 GHz = 10<sup>9</sup> Hz)

They further explain that the use case they are designing this processor for is a sensor that must be able to run the benchmark at least 10 times per second. They want to leave headroom for future applications, so you **should plan** for 2x capacity (i.e., plan for enough performance headroom for the benchmark to be able to run 20 times per second).

| Choose the best and fill in the bubble: | O Initial Architecture | ○ Accelerator |
|-----------------------------------------|------------------------|---------------|
|-----------------------------------------|------------------------|---------------|

Please explain your reasoning in 1-2 sentences, using calculations to support it.

For the initial arch, 81.776 M cycles \* 20 times =  $1.63552 * 10^9$  cycles. We can only do  $1.5 * 10^9$  cycles in a second, so the accelerator is the only option.

Partial credit answer: the accelerator gets about at 13% reduction in runtime (cycles) for a 10% area cost, so the performance per dollar is better. However, since we want to minimize the cost, we would go with the initial architecture if it met the minimum requirement.

#### THIS PAGE INTENTIONALLY LEFT BLANK.