## Advanced Computer Architectures - Notes - ${\rm v}0.3.0$

260236

March 2025

#### Preface

Every theory section in these notes has been taken from two sources:

- Computer Architecture: A Quantitative Approach. [1]
- Pipelining slides. [2]
- Course slides. [3]

#### About:

GitHub repository



These notes are an unofficial resource and shouldn't replace the course material or any other book on advanced computer architectures. It is not made for commercial purposes. I've made the following notes to help me improve my knowledge and maybe it can be helpful for everyone.

As I have highlighted, a student should choose the teacher's material or a book on the topic. These notes can only be a helpful material.

### Contents

| 1  | Pip            | elining                               | 4  |
|----|----------------|---------------------------------------|----|
|    | 1.1            | Basic Concepts                        | 4  |
|    | 1.2            |                                       | 9  |
|    |                |                                       | 12 |
|    |                |                                       | 4  |
|    | 1.3            |                                       | 16 |
|    |                |                                       | 18 |
|    |                |                                       | 21 |
|    | 1.4            |                                       | 23 |
| 2  | Con            | ntrol Hazards and Branch Prediction 2 | 26 |
|    | 2.1            | Conditional Branch Instructions       | 26 |
|    | 2.2            | Control Hazards                       | 28 |
|    | 2.3            |                                       | 30 |
|    | 2.4            |                                       | 33 |
|    | 2.5            |                                       | 34 |
|    |                | 2.5.1 Branch Always Not Taken         | 35 |
|    |                |                                       | 37 |
|    |                |                                       | 10 |
| In | $\mathbf{dex}$ | 4                                     | 12 |

#### 1 Pipelining

#### 1.1 Basic Concepts

Pipelining is a fundamental **technique** in computer architecture aimed at improving instruction throughput by overlapping the execution of multiple instructions. The main idea behind pipelining is to divide the execution of an instruction into distinct stages and process different instructions simultaneously in these stages. This approach significantly increases the efficiency of instruction execution in modern processors.

#### X Understanding the RISC-V instruction set

Before delving into pipelining, it is essential to understand the **basic instruction set** of the RISC-V architecture. The instruction set consists of three major categories:

- ALU Instructions (Arithmetic and Logic Operations)
  - Performs addition between registers:
  - add rd, rs1, rs2

Performs the addition between the values in registers rs1 and rs2 and stores the result in register rd.

$$\texttt{rd} \leftarrow \texttt{rs1} + \texttt{rs2}$$

- Performs an addition between a constant and a register:
- addi rd, rs1, 4

Performs the addition between the value in register rs1 and the value 4 and stores the result in register rd.

$$\mathtt{rd} \leftarrow \mathtt{rs1} + 4$$

- Load/Store Instructions (Memory Operations)
  - Loads data from memory:
  - 1 ld rd, offset(rs1)

Load data into register rd from an address formed by adding rs1 to a signed offset.

$$\mathtt{rd} \leftarrow M \left[ \mathtt{rs1} + \mathtt{offset} \right]$$

- Stores data in memory:
- sd rs2, offset(rs1)

Store data from register rs2 to an address formed by adding rs1 to a signed offset.

$$M \left[ \mathtt{rs1} + \mathtt{offset} \right] \leftarrow \mathtt{rs2}$$

#### • Branching Instructions (Control Flow Management)

#### - Conditional Branches

- \* Branch on equal:
- beq rs1, rs2, L1

Branch to the label L1 if the value in register rs1 is equal to the value in register rs2.

$$rs1 = rs1 \xrightarrow{go to} L1$$

- \* Branch on not equal:
- bne rs1, rs2, L1

Branch to the label L1 if the value in register rs1 is not equal to the value in register rs2.

$$rs1 \neq rs2 \xrightarrow{go to} L1$$

#### - Unconditional Jumps

- \* Jump to the label (jump):
- 1 j L1

Jump directly to the L1 label.

- \* Jump to the address stored in a register (jump register):
- ı jr ra

Take the value in register ra and use it as the address to jump to. So it is assumed that ra contains an address.

These basic instructions will be used throughout the course.

#### **Execution phases in RISC-V**

- 1. **IF** (**Instruction Fetch**): The instruction is **fetched** from memory.
- 2. **ID** (Instruction Decode): The instruction is decoded, and the required registers are read.
- 3. **EX** (Execution): The instruction is **executed**, typically involving ALU operations.
- 4. **ME** (Memory Access): For *load/store* instructions, this stage reads from or writes to memory.
- 5. WB (Write Back): The result is written back to the destination register.

These five stages form the basis of the RISC-V pipeline.

#### **✗** Implementation of the RISC-V Data Path

The RISC-V Data Path is a fundamental component of the processor's architecture, responsible for executing instructions efficiently by coordinating various hardware units. It defines how instructions flow through different stages of execution, interacting with memory, registers, and the Arithmetic Logic Unit (ALU).



Figure 1: Generic implementation of the RISC-V Data Path.



Figure 2: Specific implementation of the RISC-V Data Path.

Its fundamental components include:

- Instruction Memory and Data Memory Separation. RISC-V adopts a Harvard Architecture style, where the Instruction Memory (IM) and Data Memory (DM) are separate. This prevents structural hazards where instruction fetch and memory access could conflict in a single-memory design (this topic will be addressed later).
- General-Purpose Register File (RF). It consists of **32** registers, each **32-bit** wide. The register file has **two read ports** and **one write port** to support simultaneous read and write operations. This setup allows faster register access, which is crucial for pipelined execution.
- Program Counter (PC). It holds the address of the next instruction to be fetched. Automatically increments during execution, typically by 4 bytes (for 32-bit instructions).

• Arithmetic Logic Unit (ALU). Performs arithmetic and logical operations required by instructions. Inputs to the ALU come from registers or immediate values decoded from the instruction.

Other components that we can see in the general implementation of the RISC-V data path are:

- Register File. Stores temporary values used by instructions. Contains read ports (two registers can be read simultaneously for ALU operations) and write port (one register can be updated per clock cycle). The register file ensures high-speed execution of operations by reducing memory accesses.
- Instruction Fetch (IF). The PC (Program Counter) retrieves the next instruction from Instruction Memory. The PC is incremented using an adder (PC + 4), ensuring sequential instruction flow.
- Instruction Decode (ID). Extracts opcode (determines the instruction type), source and destination registers, immediate values (if present). It reads values from the Register File based on instruction requirements.
- Execution (EX). The ALU performs arithmetic and logical operations. A multiplexer (MUX) selects the second operand: a register value (for R-type instructions) or an immediate value (for I-type instructions like addi). The ALU result is forwarded to the next stage.
- Memory Access (ME). Load (1d) and Store (sd) instructions interact with data memory. Data is either loaded from memory into a register or stored from a register into memory.
- Write Back (WB). The result from ALU or memory is written back to the Register File.

#### Example 1: Data Path Execution Example

Let's consider a simple RISC-V load instruction (ld x10, 40(x1)) passing through the data path:

- 1. **IF Stage**: Instruction Fetch
  - ullet PC ightarrow Instruction Memory ightarrow 1d x10, 40(x1) fetched
  - $\bullet$  PC updated to PC + 4
- 2. **ID Stage**: Instruction Decode
  - Registers read: x1 (base register for memory access)
  - Immediate value extracted: 40
- 3. EX Stage: Execution
  - $\bullet$  ALU calculates memory address: x1 + 40
- 4. ME Stage: Memory Access
  - Data is loaded from M[x1 + 40]
- 5. **WB Stage**: Write Back
  - $\bullet$  Data stored in x10

#### 1.2 RISC-V Pipelining

Pipelining is analogous to an assembly line in a factory. Instead of waiting for one instruction to complete before starting the next, different instructions are executed simultaneously in different stages.

If we consider a **non-pipelined execution**:

- Each instruction completes all five stages sequentially before the next instruction starts.
- If each instruction stage (IF stage, ID stage, etc.) takes, say, 2 nanoseconds, executing all stages of an instruction (IF, ID, EX, MEM, WB) takes 5 times 2 nanoseconds, then 10 nanoseconds. If we also want to execute 5 instructions, we need 10 nanoseconds times 5, then 50 nanoseconds!

Now, we consider a **pipelined execution**:

- Once the first instruction moves to the second stage, the next instruction starts in the first stage.
- The **pipeline becomes fully utilized** after the first few cycles, significantly **improving throughput**.

In an **ideal scenario**, a 5-stage pipeline should provide a speedup of  $5 \times$  reducing execution time to:

$$(5+4) \times 2 \text{ ns} = 18 \text{ ns}$$

Where 5 are the steps of the first instruction, 5 are the steps of the last instruction, minus 1 because one step is already counted in the first instruction, so 4. Therefore, 9 is multiplied by 2 nanoseconds, the time taken by each stage. The result, 18 nanoseconds, is the time it takes the pipeline to execute 5 instructions in an ideal scenario.



Figure 3: Sequential vs. Pipelining execution.

#### Pipeline Performance and Speedup

The ideal performance improvement from pipelining is derived from the fact that once the pipeline is filled, a new instruction completes every cycle. The key performance metrics include:

- Latency (Execution Time): The total time to complete a single instruction does not change (sequential or pipeline).
- Throughput (Instructions per Unit Time): The number of completed instruction per unit time significantly increases.

#### • Speedup Calculation

- A non-pipelined CPU with 5 execution cycles of 2 ns would take 10 ns per instruction.
- A pipelined CPU with 5 stages of 2 ns results in 1 instruction completing every 2 ns.
- This gives a theoretical speedup of  $5 \times$  (ideal case).

Unfortunately, real-world implementations are subject to **pipeline hazards** that reduce efficiency.

#### **■** Understanding Pipelining Performance

Pipelining improves instruction throughput by allowing multiple instructions to be processed simultaneously in different stages. The execution of an instruction is divided into 5 pipeline stages:

- 1. IF (Instruction Fetch)
- 2. ID (Instruction Decode)
- 3. EX (Execution)
- 4. MEM (Memory Access)
- 5. WB (Write Back)

Each stage takes 2 ns (a *pipeline cycle*), meaning that an instruction moves from one stage to the next every 2 ns. Now, let's analyze the timeline of instruction execution:

| Clock Cycle    | IF | ID | EX | MEM | $\overline{\mid \mathbf{W} \mathbf{B}}$ |
|----------------|----|----|----|-----|-----------------------------------------|
| 1st (0-2 ns)   | I1 |    |    |     |                                         |
| 2nd (2-4 ns)   | 12 | I1 |    |     |                                         |
| 3rd (4-6 ns)   | 13 | 12 | I1 |     |                                         |
| 4th (6-8 ns)   | 14 | 13 | 12 | I1  |                                         |
| 5th (8-10 ns)  | 15 | 14 | 13 | 12  | I1                                      |
| 6th (10-12 ns) | 16 | 15 | 14 | 13  | 12                                      |
| 7th (12-14 ns) | 17 | 16 | 15 | 14  | 13                                      |
| 8th (14-16 ns) | 18 | 17 | 16 | I5  | 14                                      |
| 9th (16-18 ns) | 19 | 18 | 17 | 16  | 15                                      |

Table 1: Pipelining timeline execution in an ideal case.

- The first instruction I1 takes 5 (clock) cycles to complete, i.e., 10 ns.
- However, starting from cycle 5, a new instruction finishes every cycle (every 2 ns).
- In a non-pipelined system, each instruction would take 10 ns (5 stages  $\times$  2 ns each).
- In a pipelined system, once the pipeline is full, an instruction completes every cycle (every 2 ns), achieving a 5× speedup compared to the non-pipelined execution.

Thus, after an initial "fill" time (1st, 2nd, 3rd, 4th), a new instruction completes every 2 ns (from 5th to 6th, I1 is finished; from 6th to 7th, I2 is finished; from 7th to 8th, I3 is finished), which is the duration of a single pipeline stage.

#### 1.2.1 Pipelined execution of instructions

Each RISC-V instruction follows the five pipeline stages, but their interactions with the pipeline vary depending on the instruction type.

#### • ALU Instructions (e.g., op \$x, \$y, \$z)

These are register-based operations that do not require memory access. Since there is no memory operation, the instruction **bypasses the ME stage**.

| Stage                  | Description                                    |
|------------------------|------------------------------------------------|
| IF                     | Fetch instruction from memory                  |
| ID                     | Decode instruction, read registers $y$ and $z$ |
| $\mathbf{E}\mathbf{X}$ | Perform ALU operation ( $x = y + z$ )          |
| ME                     | No memory access (skipped)                     |
| WB                     | Write the ALU result to $x$                    |

#### • Load Instructions (e.g., lw \$x, offset(y))

These instructions retrieve data from memory and store it in a register. The **memory access stage** (ME) **is crucial** here since the instruction must fetch data from memory.

| Stage                  | Description                                |
|------------------------|--------------------------------------------|
| IF                     | Fetch instruction from memory              |
| ID                     | Decode instruction, read base register \$y |
| $\mathbf{E}\mathbf{X}$ | Compute memory address (\$y + offset)      |
| ME                     | Read data from memory                      |
| WB                     | Write data into destination register $x$   |

#### • Store Instructions (e.g., sw \$x, offset(\$y))

These instructions write data from a register into memory. Unlike lw, store instructions do not require the WB stage, as data is written directly into memory.

| Stage                  | Description                                                        |
|------------------------|--------------------------------------------------------------------|
| IF                     | Fetch instruction from memory                                      |
| ID                     | Decode instruction, read base register \$y and source register \$x |
| $\mathbf{E}\mathbf{X}$ | Compute memory address (\$y + offset)                              |
| ME                     | Write \$x into memory at the computed address                      |
| WB                     | No write-back stage (skipped)                                      |

#### • Conditional Branches (e.g., beq \$x, \$y, offset)

Branching introduces control hazards, as the pipeline needs to determine whether the branch is taken or not. Branches can introduce **stalls** due to dependencies on comparison results. This issue is typically mitigated using branch prediction.

| Stage                  | Description                                    |
|------------------------|------------------------------------------------|
| IF                     | Fetch instruction from memory                  |
| ID                     | Decode instruction, read registers \$x and \$y |
| $\mathbf{E}\mathbf{X}$ | Compare \$x and \$y, compute target address    |
| ME                     | No memory access (skipped)                     |
| WB                     | Update PC if branch is taken                   |

This section breaks down how different types of instructions behave in the pipeline:

- ALU Instructions complete in the EX stage and do not use memory.
- Load Instructions require a memory access in the ME stage.
- Store Instructions write to memory instead of registers.
- Branch Instructions introduce control hazards because they may change the PC.

This means that **not all instructions behave the same** in the pipeline. Some instructions **skip certain stages** (e.g., stores do not have WB, ALU instructions skip ME), and some instructions **introduce potential problems** (e.g., branches can cause delays).

In conclusion, this section sets the stage for understanding pipeline stalls, forwarding, and hazard resolution techniques that are essential for designing high-performance processors.

#### 1.2.2 Pipeline Implementation

The RISC-V pipeline implementation is designed to efficiently execute multiple instructions simultaneously, following the classical five-stage pipeline model:

- 1. IF (Instruction Fetch)
- 2. ID (Instruction Decode)
- 3. EX (Execution)
- 4. MEM (Memory Access)
- 5. WB (Write Back)

Each clock cycle, a new instruction enters the pipeline while previous instructions move to the next stage, allowing five different instructions to be in execution at the same time.



Figure 4: Structure of RISC-V pipeline.

#### \* Execution Stages and Pipeline Modules

Each stage of the pipeline corresponds to a specific hardware module in the CPU. The RISC-V pipeline is composed of five primary hardware modules:

- Instruction Fetch (IF) Module: Fetches instructions from instruction memory and updates the PC.
- Instruction Decode (ID) Module: Decodes the fetched instruction and reads register values.
- Execution (EX) Module: Performs arithmetic/logical operations in the ALU or computes memory addresses.
- Memory Access (MEM) Module: Reads from or writes data to memory.

• Write Back (WB) Module: Writes the computed result back into the register file.

Each module is responsible for a specific **stage of execution**, and together they allow overlapping execution of multiple instructions.

#### Pipeline Registers

To maintain separation between stages, pipeline registers are used (see Figure 4, page 14). These registers store intermediate results and ensure proper communication between stages:

- IF/ID Register: Holds fetched instruction and updated PC.
- ID/EX Register: Stores decoded instruction, read register values, and control signals.
- **EX/MEM Register**: Holds ALU results, destination register, and memory access information.
- MEM/WB Register: Stores memory data or ALU result to be written back to registers.

These pipeline registers eliminate the need for re-fetching or re-decoding instructions at each cycle, thus maintaining pipeline efficiency.

#### 1.3 Problem of Pipeline Hazards

#### Assumptions Made

Until now, our discussion on the RISC-V pipeline implementation has relied on several key assumptions to simplify the analysis and focus on fundamental concepts. These assumptions help in understanding the ideal case of pipelining before introducing complexities like hazards and optimizations.

- 1. All instructions are independent, so there are no dependencies between them.
- 2. No branches or jumps that change execution flow.

This is a theoretical idealization, because in real-world scenarios, hazards (structural, data, and control) interfere with smooth execution. Also, our second assumption ignores branch instructions (beq, bne, j, jr), which cause control hazards that require branch prediction or pipeline flushing.

#### **?** What is a Pipeline Hazard?

Now that we have understood the ideal execution of a RISC-V pipeline, we must discuss pipeline hazards, which are obstacles that prevent the pipeline from operating at maximum efficiency.

A Hazard (or conflict) is a phenomenon that occurs when the overlapping execution of instructions in the pipeline changes the expected order of instruction execution. This can lead to incorrect results or the need to insert stalls (pipeline bubbles), reducing performance.

In other words, hazards cause the next instruction in the pipeline to be delayed, which reduces the ideal throughput of 1 instruction per cycle. Thus, hazards disrupt the smooth flow of instructions and require techniques to resolve them.

#### **=** Classes of Pipeline Hazards

- Structural Hazards: Attempt to use the same resource from different instructions simultaneously.
  - **?** Example: Single memory for both instruction and data access.
- Data Hazards: Attempt to use a result before it is ready.
  - **?** Example: Instruction depending on a result of a previous instruction still in the pipeline.
- Control Hazards: Try to make a decision about the next statement to execute before the condition is evaluated.
  - **?** Example: Conditional branch execution.

#### Structural Hazards

A structural hazard occurs when multiple pipeline stages need to use the same hardware resource at the same time.

✓ Structural Hazard cannot be applied to RISC-V. This is a great thing, because thanks to the Harvard Architecture, RISC-V uses separate instruction and data memory, and this adoption avoids structural hazards.

#### **?** Control Hazards

A control hazard occurs when the pipeline does not know which instruction to fetch next, usually due to a branch or jump instruction. It is discussed in the following sections.

#### ② Data Hazards

A data hazard occurs when an instruction depends on the result of a previous instruction that is still in the pipeline.

There are several types of data hazards:

• RAW (Read After Write). An instruction tries to read a register before a previous instruction writes to it.

#### **?** Example:

```
1 lw x2, 0(x1)
2 add x3, x2, x4
```

The add instruction needs x2, but x2 is still being fetched from memory in the MEM stage. Without hazard resolution, the processor would get the wrong value for x2.

- WAR (Write After Read). A later instruction writes to a register before an earlier instruction reads it (rare in RISC).
- WAW (Write After Write). Two instructions try to write to the same register in the wrong order.

#### 1.3.1 RISC-V Optimized Pipeline

The RISC-V optimized pipeline introduces refinements that reduce stalls, improve data access, and enhance instruction throughput. The key optimizations include:

- ✓ Efficient Register File Access. In the standard RISC-V pipeline, register accesses happen in two stages:
  - ID (Instruction Decode)  $\rightarrow$  Reads register values.
  - $\bullet$  WB (Write Back)  $\to$  Writes computed values back to registers.

#### **22** Optimization: Read and Write in the Same Cycle

- In the optimized pipeline:
  - Register writing happens in the *first half* of the **clock cycle**;
  - While register reading happens in the second half of the clock cycle.
- This means an instruction can write its result to a register in WB, and the **next instruction can immediately read** that value in ID during the **same cycle**.

This optimization removes unnecessary stalls when an instruction immediately depends on a result written in the previous cycle.



Figure 5: Visual example of an optimized pipeline; here the result (WB stage) of I1 is written in the first half of the clock cycle and the read (ID stage) of I4 is done in the second half of the clock cycle. So there is no hazards!

- ✓ Forwarding (Bypassing) to Reduce Stalls. Forwarding (also called bypassing) is a hardware technique that eliminates stalls by providing ALU results directly to dependent instructions without waiting for the WB stage. It is a possible solution for Data Hazards.
  - Forwarding Paths: To support forwarding, the pipeline includes extra paths that allow instructions to fetch values from intermediate pipeline registers instead of waiting for WB.

• EX/EX Path. Allows ALU results to be forwarded from EX stage output to the next EX stage input. Used when an instruction depends on an arithmetic result of the previous instruction.

# Example 2: EX/EX Forwarding 1 sub x2, x1, x3 # Compute x2 = x1 - x3 2 and x12, x2, x5 # Use x2 immediately

| Cycle | sub x2, x1, x3 | and x12, x2, x5 |  |  |
|-------|----------------|-----------------|--|--|
| 1     | IF             |                 |  |  |
| 2     | ID             | IF              |  |  |
| 3     | EX             | ID              |  |  |
| 4     | MEM            | Stall           |  |  |
| 5     | WB             | Stall           |  |  |
| 6     | 6 EX           |                 |  |  |
| 7     |                | MEM             |  |  |
| 8     |                | WB              |  |  |

The and instruction must wait until WB writes x2 to the register file. Two stall cycles are introduced and this wastes execution time.

Instead of waiting for WB, we forward the ALU result from the EX stage of sub directly to the EX stage of and.

| Cycle | sub x2, x1, x3 | and x12, x2, x5           |  |  |
|-------|----------------|---------------------------|--|--|
| 1     | IF             |                           |  |  |
| 2     | ID             | IF                        |  |  |
| 3     | EX             | ID                        |  |  |
| 4     | MEM            | EX (forwarded x2 from EX) |  |  |
| 5     | WB             | MEM                       |  |  |
| 6     |                | WB                        |  |  |

In cycle 4, and x12, x2, x5 gets the forwarded x2 from the EX stage of sub, removing stalls.

This is  $\rm EX/EX$  forwarding, taking ALU results from one EX stage directly into the next EX stage.

• MEM/EX Path. Forwards the ALU result from MEM stage to EX stage. Used when an instruction depends on an ALU operation two cycles before.



Figure 6: Example of MEM/EX path.

• MEM/MEM Path. Forwarding directly between two memory operations in the MEM stage. It removes stalls in Load/Store dependencies.



Figure 7: Example of MEM/MEM path.



Figure 8: Implementation of RISC-V with Forwarding Unit.

#### 1.3.2 Solutions to RAW Hazards

To handle RAW hazards, we can **use both** static (compile-time) and dynamic (hardware-based) techniques. These include:

- Static (compile-time):
  - ✓ nop insertion: compiler adds empty instructions to delay execution.
  - Instruction Scheduling: compiler reorders instructions to avoid conflicts.
- Dynamic (hardware-based):
  - ✓ Pipeline Stalling (bubbles): inserts delay cycles when necessary.
  - ✓ Forwarding (bypassing): uses intermediate values from the pipeline instead of waiting.

#### Static (compile-time) solution: Inserting nops (naïve)

One simple way to handle RAW hazards is to insert nop instructions manually between dependent instructions. This gives the pipeline time to complete the write-back of the needed value.

Key takeaway of inserting nops:

- **X** Simple, but inefficient because it wastes clock cycles. It should be the very last solution considered. 

   The should be the very last solution considered.
- ✗ Instead of using useful instructions, the processor waits, reducing performance.

```
Example 3: nop insertion

1 sub x2, x1, x3
2 nop  # Delay slot (bubble)
3 and x12, x2, x5 # Now x2 is ready
```

#### Static (compile-time) solution: Instruction Scheduling

A more efficient technique is **instruction reordering**, also known as **compiler scheduling**. The **compiler reorders instructions to avoid data hazards without inserting nops**.

Key takeaway of instruction scheduling:

- Instruction reordering is a **compiler optimization**.
- ✓ It works well if independent instructions are available.
- ✗ In some cases, no independent instructions exist, so stalling or forwarding is needed.

# Example 4: Instruction Scheduling 1 sub x2, x1, x3 2 # Independent instruction 3 # (can execute while sub is completing) 4 add x4, x10, x11 5 and x12, x2, x5 # Now x2 is ready Instead of a nop, we insert add x4, x10, x11, which does not depend on x2. This keeps the pipeline utilized while avoiding RAW hazards.

#### **⊘** Dynamic (hardware-based): Pipeline Stalling (Bubble Insertion)

When no independent instructions can be scheduled, the **hardware must stall** the pipeline by inserting a **bubble** (stall cycle).

Key takeaway of pipeline stalling:

- X Stalling is simple but **reduces performance** (pipeline sits idle).
- ✓ We **prefer forwarding** (next solution) instead of stalling.

```
x2, x1, x3
                         ID
                              EX ME
                                        WB
                         IF
                                             EX
                                                        WB
                                        ID
                                                  ME
                             ID-Stallo ID-Stallo
and x12, x2, x5
                                              ID
                                                   EX
                                                        ME
                                                            WB
                                         IF
                             IF-Stallo IF-Stallo
or x13, x6, x2
                                                             ME
add x14, x2, x2
                                              IF
                                                   ID
                                                        EX
                                                                   W
sd x15,100(x2)
                                                   IF.
                                                        ID
                                                             EX
                                                                  ME
                                                                        WB
```

Figure 9: Example of inserting stalls.

#### **⊘** Dynamic (hardware-based): Forwarding (Bypassing)

Forwarding is an optimized hardware technique that avoids pipeline stalls by directly passing results between pipeline registers. The entire implementation has already been explained on page 18.

Key takeaway of forwarding:

- ✓ Forwarding is the best solution because it eliminates stalls and maximizes performance.
- It requires extra hardware (MUX and control logic), but it significantly improves throughput.

#### 1.4 Performance evaluation

Evaluating the performance of a pipelined processor is essential to understanding the impact of stalls, hazards, and instruction throughput. In an **ideal scenario**, a **pipeline achieves one instruction per cycle** ( $\mathtt{CPI} = 1$ ), but real-world execution includes pipeline stalls, which degrade performance.

Several **key metrics** are used to evaluate the efficiency of pipelining:

- Instruction Count (IC). Represents the total number of instructions executed. Used as a basis for performance calculations.
- Clocks Per Instruction (CPI). CPI measures the average number of clock cycles required to execute one instruction. Ideal CPI for a pipelined processor is 1, but hazards and stalls increase CPI.

$$CPI = \frac{\text{Total Clock Cycles}}{\text{Instruction Count (IC)}}$$
 (1)

Where the total clock cycles is:

Total Clock Cycles = 
$$IC + 4 + Stall Cycles$$
 (2)

Where +4 is the fill time of the first instruction. The +4 represents the initial pipeline fill time required before the pipeline reaches full execution throughput.

#### Example 5: Why is the Pipeline Startup Overhead +4?

A 5-stage pipeline (IF, ID, EX, MEM, WB) requires 4 extra cycles before the first instruction completes. Consider the following scenario:

| Clock Cycle | IF | ID | EX | MEM | WB |
|-------------|----|----|----|-----|----|
| 1           | I1 |    |    |     |    |
| 2           | 12 | I1 |    |     |    |
| 3           | 13 | 12 | I1 |     |    |
| 4           | 14 | 13 | 12 | I1  |    |
| 5           | 15 | 14 | 13 | 12  | I1 |
| 6           | 16 | 15 | 14 | 13  | 12 |
| 7           | 17 | 16 | 15 | 14  | 13 |

The first instruction (I1) requires 5 cycles to complete. The next instruction (I2) completes in cycle 6, and so on. After the first 5 cycles, the **pipeline reaches steady state**, completing 1 instruction per cycle (ideal scenario, no hazards).

• Instruction Per Clock (IPC). IPC is the inverse of CPI:

$$IPC = \frac{1}{CPI} \tag{3}$$

Measures how many instructions complete per clock cycle.

• Millions of Instructions Per Second (MIPS). Evaluates processor speed in terms of millions of instructions executed per second:

$$MIPS = \frac{f_{\text{clock}}}{\text{CPI} \times 10^6} \tag{4}$$

Higher clock frequency  $(f_{\text{clock}})$  and lower CPI result in better MIPS.

#### **Example 6: Performance Calculation**

Given:

- Instruction Count (IC) = 5
- Stall Cycles = 2
- Clock Frequency = 500 MHz

Metrics:

• Total Clock Cycles:

Clock Cycles = 
$$IC + Stall Cycles + 4 = 5 + 2 + 4 = 11$$

• CPI Calculation:

$$\mathtt{CPI} = \frac{11}{5} = 2.2$$

• MIPS Calculation:

$$\mathtt{MIPS} = \frac{500 \ \mathrm{MHz}}{2.2 \times 10^6} = 227$$

Without stalls, CPI would be 1 (ideal pipeline). But stalls increase CPI, reducing MIPS and overall efficiency.

#### Performance in Loops and Asymptotic Analysis

When evaluating **loops** or **long-running programs**, we use asymptotic performance metrics.

For a loop with:

- *m* instructions per iteration.
- k stall cycles per iteration.
- $\bullet$  *n* iterations.

We have:

• Clock Cycles per Iteration:

Clock Cycles per Iteration = 
$$m + k + 4$$
 (5)

• CPI per Iteration:

$$CPI_{iter} = \frac{(m+k+4)}{m} \tag{6}$$

• MIPS per Iteration:

$$MIPS_{iter} = \frac{f_{clock}}{CPI_{iter} \times 10^6}$$
 (7)

For large n, the impact of pipeline startup delay (+4 cycles) is reduced:

• CPI per Iteration:

$$\begin{aligned} \text{CPI}_{\text{AS}} &= \lim_{n \to \infty} \frac{(\text{IC}_{\text{AS}} + \text{Stall Cycles}_{\text{AS}} + 4)}{\text{IC}_{\text{AS}}} \\ &= \lim_{n \to \infty} \frac{(m \times n + k \times n + 4)}{(m \times n)} \\ &= \frac{(m + k)}{m} \end{aligned} \tag{8}$$

• Millions of Instructions Per Second (MIPS):

$$MIPS_{AS} = \frac{f_{clock}}{CPI_{AS} \times 10^6}$$
 (9)

For large programs, startup stalls become negligible, and performance depends mainly on stall cycles per iteration. Minimizing k (stalls per iteration) is crucial to achieving high efficiency.

#### Why CPI is Greater than 1 in Real Pipelines

Even with an **optimized pipeline**, **real-world execution is affected by hazards**. Thus, actual CPI is always greater than 1, even in well-optimized designs.

#### 2 Control Hazards and Branch Prediction

#### 2.1 Conditional Branch Instructions

In pipelined processor architectures, control flow is not always linear, and decisions about the next instruction to execute are often dependent on certain conditions. This introduces the necessity for conditional branch instructions, particularly relevant in RISC-V architectures, where typical instructions include:

- beq (branch if equal): beq rs1, rs2, L1
   Transfers execution to the label L1 if the contents of registers rs1 and rs2 are equal.
- bne (branch if not equal): bne rs1, rs2, L1
   Transfers control to L1 if rs1 and rs2 hold different values.

These branch instructions are essential in implementing control structures such as loops, conditionals (if/else), and function returns.

At the hardware level, the **Branch Target Address (BTA)** plays a central role. This **address** represents **where the processor should continue execution if the branch is taken** (i.e., if the condition specified by the branch instruction evaluates as true). When the condition is satisfied, the processor **updates the Program Counter (PC)** with the BTA, thus redirecting the flow of instruction fetch.

Conversely, if the **condition is not satisfied**, the **branch is not taken**, and the processor continues **sequential execution**. In RISC-V, since instructions are generally 32 bits (4 bytes) long, the next instruction is fetched from PC + 4.

Understanding whether a branch is taken or not is crucial for instruction fetch in pipelined architectures. **Mispredicting** this can introduce **performance penalties**, which are addressed in detail through the study of control hazards and branch prediction techniques in later sections.

#### X Execution of Branches in Pipelined Architectures

When executing a **branch instruction**, such as **beq rx**, **ry**, **L1**, the processor must **compare two registers** (**rx**, **ry**) to determine whether the branch should be **taken** (i.e., jump to label **L1**) or **not taken** (continue sequentially). In **RISC-V**, the **Branch Outcome** (Taken/Not Taken) and **Branch Target Address** (**BTA**) are **calculated during the EX stage** (when the ALU performs arithmetic and logical operations):

- EX Stage:
  - Compare rx and ry using the ALU.
  - Compute PC + offset to obtain the BTA.

- ME Stage:
  - Based on the comparison, update the PC to either PC + 4 (if not taken) or PC + offset (if taken).

MIPS follows a similar structure but emphasizes that branch decisions are finalized at the end of the EX stage, with the PC update happening at the ME stage. This introduces a delay in resolving the branch, which becomes critical for understanding control hazards.

#### **A** Implication for IF

Since new instructions are fetched every clock cycle, the processor needs to decide early which instruction to fetch next. However, with branches, this decision is not immediately clear because the branch condition hasn't yet been evaluated. The Program Counter (PC) cannot be updated correctly until the branch outcome is known, leading to potential pipeline stalls or incorrect instruction fetches.

#### 2.2 Control Hazards

In a pipelined architecture, one of the primary challenges in achieving high performance is dealing with Control Hazards, also known as branch hazards. These arise due to the presence of conditional branch instructions, where the processor must decide the next instruction to fetch before knowing whether the branch will be taken.

#### **?** What really causes Control Hazards?

To sustain the pipeline and avoid idle stages, a processor needs to fetch one instruction per clock cycle. However, with branch instructions, this becomes problematic because the branch decision (branch outcome) and the branch target address (BTA) are not immediately available. Specifically, during the IF stage, when the next instruction is fetched, the processor still does not know whether the branch will be taken or not, because this information typically becomes available later in the pipeline.

This leads to uncertainty:

- **?** Which instruction should be fetched after a branch?
- ? Should it be sequential instruction (PC + 4) or the instruction at the BTA?

If the processor fetches the wrong instruction, it might need to discard or "flush" it later, wasting valuable cycles. Alternatively, the processor may stall, delaying the fetching of any instruction until the branch decision is known, which also hurts performance.

The key issue is this: in a pipeline, the **instruction stream needs to continue**, but the **correct path is unclear** until the branch condition is evaluated. Thus:

- 1. Either wrong-path instructions are fetched (requiring flushing later)
- 2. Or **pipeline stalls** are introduced (causing delay and loss of ideal speedup)

#### Key Takeaways: Control Hazards

- Definition: Pipeline hazard due to uncertainty in branch outcome during instruction fetch.
- Cause: The branch condition is unresolved when the next instruction must be fetched (IF stage).
- Instructions Involved: Conditional branches (beq, bne) and jumps, all instructions modifying the PC.
- Pipeline Timing Conflict: BO and BTA known only in EX or later, but instruction fetch must occur every cycle.

- Main Problem: Cannot decide whether to fetch next sequential instruction (PC + 4) or BTA.
- Possible Outcomes:
  - Stall: Delay fetch until branch resolved.
  - Fetch wrong instruction  $\rightarrow$  flush.
- Performance Impact: Loss of ideal pipelining speedup; reduced throughput due to stalls or wasted fetches.
- Goal of Solutions: Mitigate stalls and improve fetch accuracy through early evaluation or prediction.

#### 2.3 Naïve Solutions to Control Hazards

To manage control hazards in a simple and reliable way, one of the **earliest approaches** developed was the **conservative solution** of introducing **branch stalls**. The idea is straightforward: when a branch instruction enters the pipeline, the **processor stalls** the **pipeline until the branch decision is known and the correct next instruction can be safely fetched.** 

#### ? How does the conservative solution work?

In the typical 5-stage pipeline, the Branch Outcome (i.e., whether the branch is taken or not) and the Branch Target Address (BTA) are usually resolved at the end of the EX stage. However, the Program Counter (PC) is actually updated at the end of the ME stage. This introduces a delay of multiple cycles between the branch instruction entering the pipeline and the point when the next instruction can be fetched with certainty.

To avoid fetching an incorrect instruction, the **processor simply pauses instruction fetch** for **3 clock cycles** after the branch instruction enters the pipeline. These are called **stalls**, essentially empty cycles where no new instructions enter the pipeline. Once the PC is correctly updated based on the branch outcome, instruction fetch resumes.



Figure 10: Example of stalls inserted in the pipeline to read the correct value after a branch condition.

#### Performance Impact

It's pretty obvious that if each branch introduces a penalty of 3 cycles, it will significantly degrade performance, especially in programs with frequent branching. Since pipelining aims to maximize instruction throughput, this solution sacrifices speed for correctness. In fact, it is called conservative because it does not attempt to guess or speculate about the branch outcome. Instead, it waits for certainty, favoring reliability over efficiency.

#### **?** Can optimized evaluation at the ID stage improve performance?

Although the conservative solution degrades performance, it can be relatively optimized thanks to hardware optimization. Processor designers have introduced hardware optimizations that allow the branch outcome (BO) and the branch target address (BTA) to be computed earlier in the pipeline. Specifically, these computations can be moved from the EX stage to the

**ID stage** (during the decoding phase). This optimization is often referred to as **Early Branch Evaluation**.

#### **✗** How Early Branch Evaluation Works

To achieve this, the Instruction Decode (ID) stage must be enhanced with additional hardware logic that allows it to:

- Compare register values (rx and ry) to determine the branch condition.
- 2. Compute the BTA using the sign-extended offset from the instruction and the current PC value.
- 3. Update the PC as soon as BO and BTA are known.

By doing this, both BO and BTA are available at the end of the ID stage, allowing the processor to update the PC immediately and fetch the correct instruction in the following cycle.



Figure 11: Hardware features modifications to allow early branch evaluation.

#### ★ Hardware Overhead: Complexity vs. Performance

This optimization requires **more complex hardware**, as the ID stage now includes:

- ALU logic for comparison and addition.
- Additional multiplexer and control signals to direct the PC update.
- Expanded data paths for handling the offset and register values.

#### **♥** Effect on Pipeline Execution

Let's consider an example. In a pipeline using early evaluation:

- The processor **only stalls for 1 cycle** after a branch instruction, as opposed to the 3 cycles required by the conservative approach.
- This **one-cycle stall** allows the processor to fetch the correct instruction **immediately after the branch** is resolved.

The following diagram illustrates that the instruction fetch after the branch is delayed by only one stall cycle, resulting in a smaller performance hit.

| beq x1, x3, L1  | lF | ID \     | EX   | ME | WB |    |    |    |    |
|-----------------|----|----------|------|----|----|----|----|----|----|
| and x12, x2, x5 |    | IF stall | \ IF | ID | EX | ME | WB |    |    |
| or x13, x6, x2  |    |          |      | IF | ID | EX | ME | WB |    |
| add x14, x2, x2 |    |          |      |    | IF | ID | EX | ME | WB |

#### Conclusion

In summary, by anticipating branch evaluation at the ID stage, the processor reduces the branch penalty to 1 cycle per branch. This is a significant improvement over the 3-cycle stall of conservative stalling. While it introduces hardware overhead, it offers a better balance between performance and correctness and serves as a stepping stone toward even more advanced techniques, such as branch prediction.

#### 2.4 Intro to Branch Prediction

In modern computer architectures, achieving high performance requires efficient instruction-level parallelism (ILP)<sup>1</sup>. However, one of the **major obstacles to ILP is the occurrence of branch hazards**, which happen when the processor encounters a branch instruction (e.g., if, for, while) and cannot immediately determine which instruction to execute next. To mitigate the performance loss caused by these hazards, branch prediction is employed.

Branch Prediction is essentially a speculative execution technique where the processor guesses the outcome of a branch instruction, whether the branch will be taken (control jumps to the branch target) or not taken (execution continues sequentially), before the actual result is known. Instead of stalling the pipeline and waiting for the branch condition to be resolved, the processor proceeds based on the predicted outcome. If the prediction:

- ✓ Is **correct**, performance is preserved.
- **X** Is wrong (a misprediction), the incorrectly fetched instructions are flushed, and execution restarts at the correct address, causing a performance penalty.

#### **≡** Branch Prediction categories

Branch prediction techniques are generally classified into two main categories:

- Static Branch Prediction Techniques. In this method, the branch direction (taken/untaken) is decided at compile time and remains fixed during the program's execution. Static prediction often relies on compiler heuristics or profiling data to guess likely outcomes. Since the behavior doesn't adapt to runtime changes, this technique works best when branch outcomes are highly predictable and consistent across executions.
- Dynamic Branch Prediction Techniques. Unlike static methods, dynamic prediction uses hardware mechanisms to observe past branch behavior at runtime and make predictions accordingly. The prediction adapts to actual program execution, making it more effective for applications with complex or data-dependent control flow. This method can dynamically switch its guess depending on the branch history.

It's important to note that in both static and dynamic techniques, the **processor must avoid updating its internal state** (registers, memory, etc.) **until the branch outcome is known with certainty**. This ensures speculative execution doesn't cause side effects in case of misprediction.

Additionally, **hybrid approaches** are possible, where static and dynamic predictions are combined to optimize performance further.

<sup>&</sup>lt;sup>1</sup>Instruction-Level Parallelism (ILP): A measure of how many of the operations in a computer program can be performed simultaneously. High ILP enables multiple instructions to be executed in parallel within a single processor cycle, exploiting the parallelism inherent in sequential instruction streams through techniques like pipelining, superscalar execution, and out-of-order execution.

#### 2.5 Static Branch Prediction

Static branch prediction represents one of the **simplest approaches** to handling branch hazards. In this technique, the **prediction** regarding whether a branch will be taken or not is **made at compile time** and **remains unchanged throughout program execution**. This method **relies heavily on heuristics or compiler-generated hints**, which estimate the likely behavior of each branch without any consideration of the program's actual runtime behavior.

#### ▼ When does static branch prediction work well?

This approach is particularly effective in scenarios where the **branch behavior** is stable and highly predictable, such as in embedded or domain-specific applications. In such cases, the overhead and complexity of dynamic prediction mechanisms may not justify the potential benefits, making static prediction a practical alternative.

#### A RISC-V assumption

A key architectural note here is the assumption that we are working with a RISC-V processor, which is optimized for early branch evaluation during the Instruction Decode (ID) stage (see more in section 2.3, page 31). This means that in RISC-V, the decision to predict a branch direction occurs early in the pipeline, minimizing the potential for instruction fetch delays if the prediction is accurate.

#### 2.5.1 Branch Always Not Taken

The Branch Always Not Taken strategy is the simplest form of static branch prediction. It operates under the assumption that the branch condition will never be satisfied, i.e., the control flow of the program will continue sequentially as if the branch is not taken. As a result, instructions immediately following the branch in program order are fetched and executed without any need to determine or access the Branch Target Address (BTA).

#### ♦ When is Branch Always Not Taken effective?

This approach is especially effective for **certain control flow patterns**, such as **if-then-else** structures where the **then** clause is more likely to be executed than the **else** clause. For example:

```
1 z = x + y;

2 if (z > 0)

3  w = x;

4 else

5  w = y;
```

Assuming z is typically positive, the branch is not taken because execution proceeds sequentially to w = x. This makes predict-not-taken a suitable and effective default strategy for such cases.

#### **X** Implementation Details

The prediction is made at the end of the Instruction Fetch (IF) stage, without calculating or knowing the BTA (since the branch is always not taken and the next instruction to execute is the PC + 4, as always). This makes the approach lightweight and efficient.

#### **A** Misprediction Case

If the actual **branch outcome** (BO) evaluated during the Instruction Decode (ID) stage is **not taken**, then the **prediction is correct**, and **no penalty cycles** are incurred. The pipeline proceeds as planned.

Otherwise, if the actual **branch outcome** (BO) turns out to be **taken**, then the **prediction was incorrect**, leading to a **misprediction penalty**. The processor must:

- 1. Flush the fetched instruction(s) after the branch (turned into NOPs).
- 2. Fetch the instruction at the Branch Target Address (BTA) and restart execution from there.

This results in a **one-cycle branch penalty**, which is minimal but still affects performance.



Figure 12: Branch Always Not Taken techniques failed and the processor must flush instruction i+1 and restart execution from the BTA.

#### 2.5.2 Branch Always Taken

This approach represents the dual case of the previous technique (page 35): it assumes that every branch will be taken, meaning the control flow will jump to the branch target address rather than continue sequentially. This method is especially useful for backward branches, which occur in loops such as for, while, and do-while, since these branches are typically taken repeatedly during loop iterations.

#### **?** Implementation Challenge

Unlike the not-taken strategy, where the processor simply continues to PC + 4, the taken strategy requires knowledge of the Branch Target Address (BTA) at the Instruction Fetch (IF) stage. This is non-trivial because:

- ? The BTA depends on the branch instruction's target, which typically requires decoding.
- ✓ To solve this, we introduce a Branch Target Buffer (BTB), a special hardware structure.

#### **?** What is BTB and why do we need it?

The Branch Target Buffer (BTB) is a specialized cache in the processor designed to predict the target address of a taken branch instruction before the branch condition is actually resolved.

In Branch Always Taken, we assume that the program will jump to a new address (the Branch Target Address, BTA). However, this **BTA** is not immediately known during Instruction Fetch (IF) because it typically requires decoding the branch instruction (Instruction Decode, ID). To avoid delays, the **BTB** remembers past branch target addresses, allowing the processor to quickly predict where to jump when encountering a branch.

#### \* How does BTB work? Quickest explanation

- BTB Structure, it is a kind of lookup table or cache where:
  - Key: address of the branch instruction (the PC value where the branch resides)
  - Value: Predicted Target Address (PTA), i.e., where to jump if the branch is taken
- BTB Lookup: when fetching a branch instruction, the processor simultaneously queries the BTB via the branch PC.
  - ✓ If a match is found (cache hit), the BTB immediately provides the Predicted Target Address (PTA), and the processor starts fetching from that address, before knowing if the branch is actually taken.

X If a no match (cache miss), the processor might default to sequential execution (PC + 4) or wait for the BTA to be calculated, which causes delay.

#### Example 1: BTB and Branch Always Taken technique

Let's say:

- A loop branch at address 0x100 typically jumps to 0x80.
- The BTB stores:  $0x100 \rightarrow 0x80$  (key  $\rightarrow$  value).

When the branch at 0x100 is fetched again:

- The BTB predicts the next instruction will be at 0x80 (taken).
- The processor starts fetching from 0x80, without waiting to evaluate the branch condition.

If it turns out the **branch was not taken**, the processor **flushes the incorrect fetch** from 0x80 and resumes at 0x104.

#### **♥** Correct Prediction Path

If the branch is indeed taken, and the BTB correctly supplies the BTA, the processor proceeds without penalty. Execution continues from the target address just as expected.

#### Misprediction Case

If the actual outcome is not taken, the prediction is incorrect:

- 1. The Instruction Fetched (IF) from the target address is flushed (NOP).
- 2. The processor must fetch the sequential instruction at PC + 4.
- 3. One-cycle penalty incurred, similar to the not-taken misprediction case.

#### **M** When is this technique effective?

This method is well-suited for loop constructs, where branches typically go backward and are taken with high probability. For example:

- In a do-while loop, the branch is taken almost every time except the last iteration.
- Conversely, in forward branches like if-then-else, the branch is less likely to be taken, making this technique less effective.

This underscores that **branch direction** (forward or backward) **can influence** the effectiveness of prediction strategies.



Figure 13: Branch Always Taken techniques failed and the processor must flush instruction i+1 and restart execution from the BTA.

#### 2.5.3 Backward Taken Forward Not Taken (BTFNT)

The Backward Taken Forward Not Taken (BTFNT) strategy represents a refinement of static prediction that uses a simple yet effective heuristic: the direction of the branch, whether it jumps backward or forward in memory, can be used to predict its outcome.

#### Prediction Rule

- Backward-going branches (i.e., branches where the target address is lower than the current PC) are predicted as taken.
  - These branches **often occur in loops**, where execution loops back to an earlier instruction (e.g., in for, while, or do-while constructs).
- Forward-going branches (i.e., target address is greater than the current PC) are predicted as **not taken**.
  - These branches typically correspond to if-then-else constructs, where the else path is less probable and control usually proceeds sequentially.

#### **?** Why does this work?

The rationale behind BTFNT lies in **empirical observations**:

- Loops tend to execute multiple times, hence backward branches are mostly taken.
- Conditional statements often have rarely taken else paths, hence forward branches are mostly not taken.

#### Pros and Cons

- ✓ Simple to implement because BTFNT requires just a comparison of the target address vs the current PC:
  - target address < PC  $\Rightarrow$  predict taken
  - target address > PC  $\Rightarrow$  predict <u>not</u> taken

Also, better accuracy than uniform always-taken or always-not-taken, especially for mixed codebases.

**X** Not adaptive; fails for atypical control flows where direction doesn't align with expected behavior.

#### References

- [1] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. ISSN. Elsevier Science, 2017.
- [2] Cristina Silvano. Lesson 1, pipelining. Slides from the HPC-E master's degree course on Politecnico di Milano, 2024.
- [3] Cristina Silvano. Advanced computer architecture. Slides from the HPC-E master's degree course on Politecnico di Milano, 2024-2025.

## Index

| В                                              |            |
|------------------------------------------------|------------|
| Backward Taken Forward Not Taken (BTFNT)       | 40         |
| Branch Always Not Taken                        | 35         |
| Branch Outcome                                 | 26         |
| Branch Prediction                              | 33         |
| Branch Target Address (BTA)                    | 26         |
| Branch Target Buffer (BTB)                     | 37         |
| $\mathbf{C}$                                   |            |
| Clocks Per Instruction (CPI)                   | 23         |
| Compiler Scheduling                            | 21         |
| Control Hazards                                | 16, 17, 28 |
| D                                              |            |
| Data Hazards                                   | 16, 17     |
| Dynamic Branch Prediction Techniques           | 33         |
| $\mathbf{E}$                                   |            |
| Early Branch Evaluation                        | 31         |
| EX (Execution)                                 | 5          |
| EX/EX Path                                     | 19         |
| $\mathbf{F}$                                   |            |
| Forwarding                                     | 18         |
| Н                                              |            |
| Hazard                                         | 16         |
|                                                |            |
| I D (Instruction December)                     | ٠          |
| ID (Instruction Decode) IF (Instruction Fetch) | 5<br>5     |
| Instruction Count (IC)                         | 23         |
| Instruction Per Clock (IPC)                    | 23         |
| Instruction-Level Parallelism (ILP)            | 33         |
| $\mathbf{M}$                                   |            |
| ME (Memory Access)                             | 5          |
| MEM/EX Path                                    | 20         |
| MEM/MEM Path                                   | 20         |
| Millions of Instructions Per Second (MIPS)     | 24         |
| Misprediction                                  | 33         |
| P                                              |            |
| Pipeline Registers                             | 15         |
| Pipelining                                     | 4          |
| R                                              |            |
| RAW (Read After Write)                         | 17         |
| RISC-V Data Path                               | 6          |

|                                     | Index  |
|-------------------------------------|--------|
|                                     |        |
| $\mathbf{S}$                        |        |
| Static Branch Prediction Techniques | 33     |
| Structural Hazards                  | 16, 17 |
| W                                   |        |
| WAR (Write After Read)              | 17     |
| WAW (Write After Write)             | 17     |
| WB (Write Back)                     | 5      |