Architetture dei Sistemi di Elaborazione O2GOLOV Delivery date: October 30<sup>th</sup> 2024

Laboratory 4

Expected delivery of **lab\_4.zip** must include:

- Each configuration of the custom architecture (riscv\_o3\_custom.py) that you modify.
- This document, with all fields filled in, in PDF form.

## **Introduction and Background**

Simulating an Out-of-Order (OoO) CPU (O3CPU)



In this laboratory, you will configure an OoO CPU using a script called riscv\_o3\_custom.py. In short, the script configures an <u>Out-of-Order (O3) processor</u> based on the *DerivO3CPU*, a superscalar processor with a reduced number of features.

### **Pipeline**

The processor pipeline stages can be summarized as:

- **Fetch stage:** Instructions are fetched from the instruction cache. The fetchWidth parameter sets the maximum number of instructions fetched per clock cycle. This stage performs branch prediction and branch target prediction.
- **Decode stage:** This stage decodes instructions and handles the execution of unconditional branches. The decodeWidth parameter sets the maximum number of instructions processed per clock cycle.
- **Rename stage:** As the name suggests, registers are renamed, and the instruction is pushed to the IEW (Issue/Execute/Write Back) stage. The stage checks that the *Instruction Queue* (IQ) and the *Load and Store Queue* (LSQ) can hold the new instruction. The maximum number of instructions processed per clock cycle is set by the renameWidth parameter.



Figure 1: Understanding configurable OoO CPU parameters.

- **Dispatch stage:** Instructions whose renamed operands are available are dispatched to functional units (**FU**). Loads and stores are dispatched to the Load/Store Queue (**LSQ**). The maximum number of instructions processed per clock cycle is set by the dispatchWidth parameter.
- **Issue stage:** The simulated processor has a single instruction queue from which all instructions are issued. Ordinarily, <u>instructions are taken in-order from this queue</u>. An instruction is issued if it does not have any dependency.
- **Execute stage:** Each functional unit (FU) processes its instruction. Each functional unit can be configured with a different latency. Conditional branch <u>mispredictions are identified here</u>. The maximum number of instructions processed per clock cycle depends on the functional units configured and their latencies.
- **Writeback stage:** It sends the result of the instruction to the reorder buffer (ROB). The maximum number of instructions processed per clock cycle is set by the wbWidth parameter.
- **Commit stage:** It processes the reorder buffer, freeing up reorder buffer entries. The maximum number of instructions processed per clock cycle is set by the commitWidth parameter. Commit is done in order.
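Since every committed instruction passes through every stage, the narrowest per-cycle width caps the sustained throughput. The following is a minimal sketch of that reasoning in plain Python; it does not use gem5, and the width values are hypothetical numbers chosen for illustration, not the ones in riscv\_o3\_custom.py.

```python
# Hypothetical width values for illustration only; they are NOT taken from
# riscv_o3_custom.py. The keys reuse the parameter names from the text.
stage_widths = {
    "fetchWidth": 4,
    "decodeWidth": 4,
    "renameWidth": 4,
    "dispatchWidth": 4,
    "issueWidth": 2,
    "wbWidth": 4,
    "commitWidth": 4,
}


def narrowest_stage(widths):
    """Return (parameter name, width) of the stage with the smallest width."""
    name = min(widths, key=widths.get)
    return name, widths[name]


def cpi_lower_bound(widths):
    """Best-case CPI: at most min(width) instructions can complete per cycle."""
    return 1.0 / min(widths.values())


name, width = narrowest_stage(stage_widths)
print(f"bottleneck: {name} = {width}")
print(f"CPI lower bound: {cpi_lower_bound(stage_widths):.2f}")
```

This bound ignores dependencies, cache misses, and functional unit latencies, so a measured CPI will normally be well above it.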

In the event of a **branch misprediction**, trap, or other speculative execution event, "squashing" can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.



Figure 2: Example of a branch **misprediction** (transparent rows)
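The interplay between out-of-order completion, in-order commit, and squashing can be sketched with a toy reorder buffer. This is a simplified illustration in plain Python, not the gem5 implementation; the class `ToyROB` and its methods are hypothetical names introduced here.

```python
from collections import deque


class ToyROB:
    """Toy reorder buffer: entries stay in program order, oldest at the left."""

    def __init__(self, commit_width):
        self.commit_width = commit_width
        self.entries = deque()  # each entry: {"seq": int, "done": bool}

    def insert(self, seq):
        self.entries.append({"seq": seq, "done": False})

    def mark_done(self, seq):
        # Execution may finish in any order.
        for entry in self.entries:
            if entry["seq"] == seq:
                entry["done"] = True

    def commit_cycle(self):
        """Commit up to commit_width finished instructions from the head."""
        committed = []
        while (self.entries and self.entries[0]["done"]
               and len(committed) < self.commit_width):
            committed.append(self.entries.popleft()["seq"])
        return committed

    def squash_after(self, seq):
        """Drop every instruction younger than seq (e.g. after a misprediction)."""
        squashed = [e["seq"] for e in self.entries if e["seq"] > seq]
        self.entries = deque(e for e in self.entries if e["seq"] <= seq)
        return squashed


rob = ToyROB(commit_width=2)
for seq in range(1, 6):
    rob.insert(seq)

rob.mark_done(3)                # a younger instruction finishes first
first = rob.commit_cycle()      # nothing commits: seq 1 is still pending
rob.mark_done(1)
rob.mark_done(2)
second = rob.commit_cycle()     # seq 1 and 2 commit in the same cycle
squashed = rob.squash_after(3)  # pretend seq 3 is a mispredicted branch
third = rob.commit_cycle()      # seq 3 itself still commits
print(first, second, squashed, third)  # [] [1, 2] [4, 5] [3]
```

Instruction 3 finishes early but cannot leave the buffer until 1 and 2 have committed, and the instructions fetched after the mispredicted branch never commit at all.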

### **Pipeline Resources**

Additionally, the processor has the following structures:

- Branch predictor (BP)
  - Allows for selection between several branch predictors, including a local predictor, a global predictor, and a tournament predictor. Also has a branch target buffer (BTB) and a return address stack (RAS).
- Reorder buffer (ROB)
  - Holds instructions that have reached the back end. Handles squashing instructions and keeps instructions in program order.
- Instruction queue (IQ)
  - Handles dependencies between instructions and schedules ready instructions. Uses the **memory dependence predictor** to tell when memory operations are ready.
- Load-store queue (LSQ)
  - Holds loads and stores that have reached the back end. It hooks up to the d-cache and initiates accesses to the memory system once memory operations have been issued and executed. Also handles forwarding from stores to loads, replaying memory operations if the memory system is blocked, and detecting memory ordering violations.
- Functional units (FU)
  - Provide timing for instruction execution. Used to determine the latency of an executing instruction, as well as which instructions can issue each cycle.
  - Floating point units, floating point registers, and respective instructions are supported.



Figure 3: Pipeline example of FP instructions and FP registers

# **Laboratory: hands-on**

All the needed resources are in a GitHub repository:

https://github.com/cad-polito-it/ase\_riscv\_gem5\_sim

To create your simulation environment:

For HTTPS clone:

~/my\_gem5Dir\$ git clone https://github.com/cad-polito-it/ase\_riscv\_gem5\_sim.git

For SSH clone:

~/my\_gem5Dir\$ git clone git@github.com:cad-polito-it/ase\_riscv\_gem5\_sim.git

The environment is configured to run on the LABINF machines.

Follow the HOWTO instructions available on the GitHub Repository for simulating a program.

## **Exercise 1:**

Simulate the benchmark my\_c\_benchmark\_2 (main.c) with the gem5 simulator to obtain the trace.out file. Then visualize the pipeline by loading the trace.out file in Konata.

Based on the CPU architecture described in riscv\_o3\_custom.py, inspect the pipeline in Konata to find the following conditions:

1. Out-of-order execution (issue), in-order commit (commit)
2. Two commits in the same clock cycle
3. Flush of the pipeline
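As a rough aid for spotting the first two conditions, the checks can be written down over per-instruction timing records. This is only a sketch: the `(sequence number, issue cycle, commit cycle)` tuples below are hypothetical values typed in by hand, not parsed from trace.out, and the function names are introduced here for illustration.

```python
# Hypothetical records: (sequence number, issue cycle, commit cycle).
# The numbers are made up; they are NOT read from trace.out.
records = [
    (1, 10, 14),
    (2, 12, 15),
    (3, 11, 15),
    (4, 13, 16),
]


def out_of_order_issues(recs):
    """Adjacent pairs (older, younger) where the younger one issued first."""
    ordered = sorted(recs)
    return [(a[0], b[0]) for a, b in zip(ordered, ordered[1:]) if b[1] < a[1]]


def commits_in_order(recs):
    """True if commit cycles never decrease when following program order."""
    commits = [commit for _, _, commit in sorted(recs)]
    return all(x <= y for x, y in zip(commits, commits[1:]))


def multi_commit_cycles(recs):
    """Map each cycle with more than one commit to the committing instructions."""
    per_cycle = {}
    for seq, _, commit in recs:
        per_cycle.setdefault(commit, []).append(seq)
    return {cycle: seqs for cycle, seqs in per_cycle.items() if len(seqs) > 1}


print(out_of_order_issues(records))  # [(2, 3)]
print(commits_in_order(records))     # True
print(multi_commit_cycles(records))  # {15: [2, 3]}
```

In Konata the same patterns show up visually: an Is label further left on a lower row than on the row above it, and Cm labels aligned in the same column.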

For every condition, fill in the following tables.

| Condition | Out-of-order execution, in-order commit |
|-----------|-----------------------------------------|
| Screenshot from Konata | 656: s714 (t0: r415): 0x0000028c: addiw a5, a5, 0<br>657: s715 (t0: r416): 0x00000290: lui a4, 1<br>658: s716 (t0: r417): 0x000000294: addi a4, a4, 512 |
| <b>Explain the reason behind the condition</b> | The addi can reach the issue (Is) stage before the addiw (since the latter requires more clock cycles) because it needs a4, which is forwarded from the preceding lui; to commit, however, it must wait until the preceding lui has committed. |
| Briefly explain the advantages of the OoO execution in a CPU | OoO execution allows later instructions to execute before earlier ones that have not yet executed. Later operations whose operands are independent of the earlier instructions can execute before those instructions finish, which reduces the number of stalls and the clock cycles required. |

| Condition | Two or more commits in the same clock cycle |
|-----------|---------------------------------------------|
| Screenshot from Konata | 1476: s1534 (t0: r1011): 0x0000001a8: lw a5, -1560(s0)<br>1477: s1535 (t0: r1012): 0x0000001ac: addiw a5, a5, 0<br>1478: s1536 (t0: r1013): 0x0000001b0: lui a4, 1 |
| <b>Explain the reason behind the condition</b> | Because the CPU supports OoO execution, the two commits happen in the same cycle even though the issue (Is) of the later instruction is out of order with respect to the issue of the earlier one. This is also possible because commitWidth > 1. |
| Briefly explain the Commit functioning | At commit, the ROB is kept in the original program order. As soon as an instruction reaches the head of the ROB:<br>a. if it is a mispredicted branch, the buffer is flushed and execution restarts from the correct next instruction;<br>b. if it is a correctly predicted branch, the result is written to the register/memory.<br>In both cases, the ROB entry is marked as free. |

| Condition | Flush of the pipeline |
|-----------|-----------------------|
| Screenshot from Konata | 1586: s1644 (t0: r1121)<br>1587: s1645 (t0: r1122)<br>1588: s1646 (t0: r1123)<br>1589: s1647 (t0: r1124)<br>1590: s1648 (t0: r1125)<br>1591: s1649 (t0: r1126)<br>1592: s1650 (t0: r1127)<br>1593: s1651 (t0: r1128)<br>(instruction text and stage timeline not legible in the extracted screenshot) |
| <b>Explain the reason behind the condition</b> | The branch was mispredicted, so the subsequent instructions were flushed from the pipeline; the mispredicted branch was a bge at row 1587. |

## **Exercise 2:**

Given your benchmark (main.c in my\_c\_benchmark\_2), optimize the CPU architecture (i.e., modify the riscv\_o3\_custom.py file) and write down the improvements in terms of CPI and speedup.

To optimize the CPU architecture, open the configuration file of the CPU (i.e., riscv\_o3\_custom.py) and tune specific hardware-related parameters.

You have to change specific values in **one or more** stages of the pipeline:

- \# FETCH STAGE
  - Tune parameters such as fetchWidth, fetchBufferSize, and so on, and see the effects on your system.
- \# DECODE STAGE
- \# RENAME STAGE
  - Try changing some values, <u>but don't touch the "Phys" ones.</u>
- \# DISPATCH/ISSUE STAGE
- \# EXECUTE STAGE
  - Here you can optimize the functional units of your CPU, such as the INT ALU, the FP ALU, the FP Multiplier/Divider, and so on.
  - Tune the number of units (count) in the system, as well as their latency (opLat), to see how this affects the execution of your program.

In addition:

- You can create a different branch predictor. The predictors are defined in create\_predictor.py.
- You can also try to change the parameters of the L1 cache. Look for "class L1Cache" in the riscv\_o3\_custom.py file. The L1 cache, also referred to as the primary cache, is the smallest and fastest level of memory. It is located directly on the processor and stores data that the CPU accesses frequently, so the CPU saves time compared with accessing main memory.

<u>HINT:</u> To find the best hardware optimization and understand how to change the parameters, analyse the *stats.txt* file (in ase\_riscv\_gem5\_sim/results/my\_c\_benchmark\_2).

Find information regarding the workload profiling. In other words, look for lines such as "system.cpu.commitStats0.committedInstType::IntAlu", and the following ones to understand which kind of instructions are executed the most. In this way, you can target a specific functional unit and modify its specifications.
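One way to do this profiling is a small script that extracts the committedInstType counters and ranks them. The sketch below assumes each stats.txt line has the layout "stat name, value, optional extra columns, # description", and that a `::total` summary line may be present; the sample lines and their counts are made up for illustration and are not results from this benchmark.

```python
import re

# Made-up excerpt in the assumed stats.txt layout:
# "<stat name> <value> ... # <description>". The counts are NOT real results.
SAMPLE = """\
system.cpu.commitStats0.committedInstType::IntAlu   7000  70.00%  70.00% # Class of committed instruction
system.cpu.commitStats0.committedInstType::MemRead  2000  20.00%  90.00% # Class of committed instruction
system.cpu.commitStats0.committedInstType::FloatAdd 1000  10.00% 100.00% # Class of committed instruction
system.cpu.commitStats0.committedInstType::total   10000 # Class of committed instruction
"""

# Capture the instruction class after "::" and the first integer that follows.
PATTERN = re.compile(r"^\S*committedInstType::(\S+)\s+(\d+)")


def instruction_mix(text):
    """Return {instruction class: committed count}, skipping the total line."""
    counts = {}
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match and match.group(1) != "total":
            counts[match.group(1)] = int(match.group(2))
    return counts


def ranked(counts):
    """Sort instruction classes from most to least frequently committed."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


for name, count in ranked(instruction_mix(SAMPLE)):
    print(f"{name:10s} {count}")
```

To use it on a real run, pass the contents of stats.txt to `instruction_mix` instead of `SAMPLE`; the classes at the top of the ranking indicate which functional units are worth tuning first.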

Fill in the following tables with the CPI that you obtain with the old and the new architectures. Also compute the equivalent speedup that you obtain.

HINT: You can get the CPI and other useful information from the stats.txt file.

| Parameters               | Configuration 1           | Configuration 2        | Configuration 4         | Configuration 5     |
|--------------------------|---------------------------|------------------------|-------------------------|---------------------|
| First changed parameter  | the_cpu.fetchWidth = 2    | the_cpu.issueWidth = 2 | the_cpu.issueWidth = 5  | the_cpu.wbWidth = 1 |
| Second changed parameter | the_cpu.dispatchWidth = 1 | None                   | the_cpu.commitWidth = 1 | None                |

Original CPI (no hardware optimization): 2.083105

|                            | Configuration 1 | Configuration 2 | Configuration 4 | Configuration 5 |
|----------------------------|-----------------|-----------------|-----------------|-----------------|
| CPI                        | 2.257799        | 2.083105        | 2.085101        | 2.076865        |
| Speedup (wrt Original CPI) | 0.922626        | 1.000000        | 0.999043        | 1.003004        |
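The speedup row can be reproduced from the CPI values alone. The sketch below assumes that the instruction count and the clock period are the same in every configuration, so that speedup reduces to original CPI divided by new CPI; the CPI figures are the ones reported in the table.

```python
ORIGINAL_CPI = 2.083105

# CPI values copied from the table above.
cpi_by_config = {
    "Configuration 1": 2.257799,
    "Configuration 2": 2.083105,
    "Configuration 4": 2.085101,
    "Configuration 5": 2.076865,
}


def speedup(original_cpi, new_cpi):
    """Speedup > 1 means faster; assumes equal instruction count and clock."""
    return original_cpi / new_cpi


speedups = {name: speedup(ORIGINAL_CPI, cpi) for name, cpi in cpi_by_config.items()}
best = max(speedups, key=speedups.get)

for name, value in speedups.items():
    print(f"{name}: {value:.4f}")
print("best:", best)
```

A value below 1 means the change slowed the benchmark down, which is the case for Configurations 1 and 4.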

Which is the best optimization in terms of CPI and speedup, and why?

#### Your answer:

Configuration 5 is the best in terms of CPI and speedup, since it is the only configuration that decreases the CPI and increases the speedup. The best optimization I found was therefore reducing the writeback width from 2 to 1.

