
Data Bypass and Scoreboard


This page is maintained by Denis Los


Data bypassing

Data bypassing (or data forwarding) is an optimization in pipelined CPUs that limits the performance penalty of pipeline stalls caused by RAW (Read after Write) dependencies.

Let us look at the following example:

add R8, R2, R3 ; (R8 <- R2 + R3)
sub R5, R8, R1 ; (R5 <- R8 - R1) 

The second instruction has to wait for the first one to write its result back to the register file. However, the result of add is already produced while sub is still in the Decode stage (assuming there are no cache misses and that arithmetic instructions such as add need only one cycle to execute), so there would be no need to stall the pipeline if the data could be delivered to the second instruction before its actual execution.
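
For illustration only, a minimal sketch assuming the stage names FE, DEC, EXE, MEM, WB, a one-cycle ALU and no cache misses: with forwarding, the two instructions could proceed as follows.

add R8, R2, R3 :  FE   DEC  EXE  MEM  WB
sub R5, R8, R1 :       FE   DEC  EXE  MEM  WB   ; R8 is forwarded to EXE instead of being read from the register file

Without forwarding, sub would have to stall in DEC until add has written R8 back to the register file.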

Scoreboard for data forwarding

The control unit should be aware of the current state of the pipeline so that it can decide whether a stall is needed and identify the specific pipeline unit the data should be obtained from. This can be achieved by implementing the following structure, which has much in common with a scoreboard:

Register | EXE_0 | ... | EXE_N | MEM | WB | is in RF? | is ready for bypass?
R0       |       |     |       |     |    | no        | yes
R1       |       |     |       |     |    | yes       | no
...
RN       |       |     |       |     |    | no        | no

The first column lists the registers used by the simulated CPU, while the next ones (EXE_0, ..., EXE_N, MEM, WB) represent the corresponding pipeline stages. The flags indicate whether a value of a source register should be obtained from a bypass, whether the register file should be accessed to get valid data, or whether the pipeline has to be stalled because the sources are not ready yet.

Implementation

Control and bypass logic is mainly kept inside the DataBypass class, with some common interface features segregated into separate classes.

  • RegisterStage is the class used to represent the pipeline stage that currently holds the instruction writing its result back to a specific register. For example, if only the four pipeline stages fetch, decode, exec, and writeback are simulated for simple arithmetic instructions, the possible RegisterStage values for R1 in the pseudo-instruction add R1, R2, R3 would be the exec and writeback stages, since referring to the first two stages is not informative in this context.

  • BypassCommand is the class meant to serve as the simulation tool connecting the Decode stage with the first execution stage. Transferred through ports, its instances provide the execution stage with information such as which source registers should obtain their values from bypass and which ports of the simulator the corresponding data should be obtained from.

Constructing an object

An object of the DataBypass class can be created using the following constructor, with the latency of the long arithmetic logic unit passed as a parameter.

template <typename ISA>
explicit DataBypass( uint64 long_alu_latency);
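
A minimal construction sketch follows; the ISA type name MIPS32 and the latency value 3 are illustrative assumptions, not part of the documented interface.

// Hypothetical usage: create the bypass unit for a long ALU with a 3-cycle latency
DataBypass<MIPS32> data_bypass( 3);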

Entry in scoreboard

In our simulator a structure called RegisterInfo is used to implement an entry in the scoreboard.

struct RegisterInfo
{
    RegisterStage current_stage;      // stage the producing instruction is in right now
    RegisterStage ready_stage;        // stage at which the result becomes available for bypass
    RegisterStage next_stage_after_first_execution_stage; // stage that follows EXE_0 for this execution flow
    bool is_bypassible;               // the value can currently be obtained from a bypass
    bool is_traced;                   // the register is currently tracked by the scoreboard
};

For each register we keep accurate information about the specific stage it is being processed in at the moment and about the pipeline stage at which the functional unit is expected to produce the result; a register can therefore be considered ready for bypass as soon as the instruction that uses it as a destination has its result ready. Moreover, information about the execution flow is kept as well, so that the state of the scoreboard can be updated properly with respect to the particular type of an instruction. Different execution flows are discussed in detail in the section dedicated to the implementation of the non-unified pipeline.

Interface

DataBypass class

bool is_stall( const Instr& instr)

Determines whether the pipeline should be stalled due to data hazards, structural hazards, or the support for precise exceptions. This method returns true if the instruction instr cannot be processed in the pipeline yet and a stall is required, and false otherwise. Please refer to Stalling the pipeline for more details.

bool is_in_RF( const Instr& instr, uint8 src_index)

Determines whether the register file should be accessed to get a value of the source register with index src_index of the instruction instr. It returns true if the state of the source register is in_RF, and false otherwise.

bool is_bypassible( const Instr& instr, uint8 src_index)

Determines whether a value of the source register with index src_index of the instruction instr can be obtained from bypass. It compares the current stage of the source register with the stage at which it is expected to be ready, and returns true if the register is in that stage or further in the pipeline, and false otherwise.

void trace_new_instr( const Instr& instr)

Collects information about a new instruction instr and updates the scoreboard with its destination register. It sets the execution stage of the corresponding destination register to the first execution stage of the pipeline (EXE_0).

Note: This function should be invoked AFTER calling the update method, since the status of a freshly traced register does not need to be updated in the same cycle.

void update()

Updates the scoreboard and all concomitant structures. This method advances the current stages of all traced registers to the next ones in the pipeline, in accordance with the types of the respective instructions, and updates the information about the admissibility of data bypassing. Furthermore, the control status used to detect the need for stalling the pipeline because of structural hazards and the handling of precise exceptions is updated as well.

void handle_flush()

Removes all the information regarding the current state of the pipeline due to a flush caused by a branch misprediction. It resets the information about traced registers by changing their states to in_RF.

Note: Information about the destination registers of a mispredicted branch and of all instructions processed in the subsequent stages (the correctness of branch prediction is currently determined in the mem stage) is discarded as well, on the assumption that all of these instructions will retire before any of the newly fetched instructions reach the decode stage. This assumption might become invalid after future revisions of the simulated model.

BypassCommand get_bypass_command( const Instr& instr, uint8 src_index)

Returns an instance of the BypassCommand class carrying the information about which pipeline stage a value of the source register with index src_index of the instruction instr should be bypassed from.
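
The following sketch shows how the Decode stage might drive this interface each cycle. It is only an illustration under assumptions: Instr stands for the simulator's instruction type, while get_src_num, read_from_register_file, send_bypass_command and dispatch_to_execute are hypothetical helpers, not part of the documented interface.

// Hypothetical per-cycle flow in the Decode stage (simplified)
void DecodeStage::clock( const Instr& instr)
{
    data_bypass.update();                      // advance the scoreboard first

    if ( data_bypass.is_stall( instr))         // data/structural hazards or precise exceptions
        return;                                // keep the instruction for the next cycle

    for ( uint8 src = 0; src < instr.get_src_num(); ++src)
    {
        if ( data_bypass.is_in_RF( instr, src))
            read_from_register_file( instr, src);
        else if ( data_bypass.is_bypassible( instr, src))
            send_bypass_command( data_bypass.get_bypass_command( instr, src));
    }

    data_bypass.trace_new_instr( instr);       // record the new destination register
    dispatch_to_execute( instr);
}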

RegisterStage class

void set_to_stage( Latency v);
void set_to_first_execution_stage();
void set_to_mem_stage();
void set_to_writeback();
void set_to_in_RF();

bool is_same_stage( Latency v);
bool is_first_execution_stage();
bool is_mem_stage();
bool is_writeback();
bool is_in_RF();

void inc();

Note: An instance of the RegisterStage class should be initialized with the listed methods; without such initialization it should not be expected to work correctly. By default, the value is set to in_RF.
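
A minimal sketch of how the stage of a destination register might be advanced through the pipeline; the loop below is illustrative and assumes a strictly linear progression that ends in writeback.

// Hypothetical life cycle of a destination register's stage
RegisterStage stage;                    // default value is in_RF
stage.set_to_first_execution_stage();   // the producing instruction enters EXE_0
while ( !stage.is_writeback())
    stage.inc();                        // advance one pipeline stage per cycle
stage.set_to_in_RF();                   // the result has been written back to the register file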

BypassCommand class

uint8 get_bypass_direction()

Returns an integer number indicating the execution unit from which the bypassed data should be obtained.

Returned number | Type of bypass
0               | EXE_0 → EXE_0
1               | EXE_N → EXE_0
2               | MEM → EXE_0
3               | WB → EXE_0
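
A sketch of how an execution stage might select the bypass source from the returned code; the read_port_* helpers and the uint64 value type are assumptions used only for illustration.

// Hypothetical selection of the bypass source based on the direction code
uint64 value = 0;
switch ( bypass_command.get_bypass_direction())
{
    case 0: value = read_port_exe_0(); break;  // EXE_0 → EXE_0
    case 1: value = read_port_exe_n(); break;  // EXE_N → EXE_0
    case 2: value = read_port_mem();   break;  // MEM → EXE_0
    case 3: value = read_port_wb();    break;  // WB → EXE_0
}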

Non-unified pipeline

Non-unified pipeline is the term used in the current context to describe the architecture of a pipelined in-order machine with non-unified latencies of execution units. The exact latency of a functional unit is determined by the specifications of the simulated machine and the complexity of its operations.

In our model of the simulated CPU there are currently three major categories of instructions, which can be distinguished by the set of pipeline stages they pass through during execution. The following stage abbreviations are used below:

FE - fetch

DEC - decode

EXE_0 - first execution stage

EXE_N - last execution stage

MEM - mem

WB - writeback

Three categories

  1. The first figure shows the execution flow of simple arithmetic instructions. The result is produced in one of the arithmetic logic units (ALU) in the EXE_0 stage.

  2. Branch executions, memory accesses, etc. Note that EXE and MEM are only the names of pipeline stages, not a detailed description of the operations performed there. For instance, the EXE_0 stage is an address generation in one of the address generation units (AGU) for loads and stores, but an evaluation of the branch condition in one of the branch execution units (BEU) for branch instructions.

  3. Arithmetic such as multiplication and division is processed in a long arithmetic logic unit, which is simulated as fully pipelined. Data bypassing for the long arithmetic logic unit is supported only for the last execution stage (EXE_N).

Command line options

The latency of the arithmetic logic unit which operates on long arithmetic, such as multiplication and division of integer numbers, can be set from the command line with an integer number from 2 to 64 passed as a parameter:

--long-alu-latency - latency of a long arithmetic logic unit
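
For example, a hypothetical invocation might look as follows; the executable name is an assumption, only the documented option is taken from this page.

./simulator --long-alu-latency 3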

Stalling the pipeline

According to the currently simulated model of the in-order CPU, there are several reasons why the dispatch of an instruction to execution units may be stalled. Let us look at some of them closely:

  1. Data hazards: Data dependencies between operands of different instructions can cause data hazards, leading to the necessity to stall the pipeline until the corresponding dependency is resolved. Mishandling of data hazards can result in a totally incorrect output of a program. Let us look at different types of data dependencies and consider a few approaches used in the simulated model of the in-order CPU to tackle each of them.

    • RAW (Read after Write)

      In the example below the second instruction has to read a value of the register R8 in order to be dispatched to an execution unit. However, the valid value of the register R8 is the one that is going to be produced by the first instruction. Hence, the dispatch of the second instruction to an execution unit will be postponed until the first one has produced its result.

      add R8, R5, R4 ; (R8 <- R5 + R4)
      add R1, R8, R3 ; (R1 <- R8 + R3)

      As mentioned previously, the duration of a pipeline stall caused by some RAW data dependencies can be decreased by applying an optimization technique called data bypassing. Its general idea is to forward data produced by an execution unit to other execution units before the corresponding value is written to the register file. This optimization can significantly mitigate the performance drop for instruction sequences that consume a result back-to-back, within short intervals.

      Thus, in our model of the simulated in-order machine, RAW data dependencies are considered resolved for an instruction if valid values of all its source registers are available in the register file or can be bypassed from one of the execution units.

    • WAW (Write after Write)

      As can be seen from the example below, WAW data dependencies refer to the situation in which several instructions write their results back to the same architectural register. In the current architecture of the simulated in-order CPU, WAW hazards are mainly caused by several instructions with distinct latencies reaching the writeback stage out of program order.

      add R8, R2, R1 ; (R8 <- R2 + R1)
      add R8, R7, R6 ; (R8 <- R7 + R6)

      Hence, in order to resolve WAW data dependencies in the currently simulated machine, instructions must be prevented from writing to a specific register in any order different from the one defined by the program sequence.

    • WAR (Write after Read)

      These data dependencies are mainly caused by an instruction writing its result back to a register that is used as a source register in one of the previous instructions. In the example below the second instruction writes its result to the register R8, which is used as a source register in the first instruction.

      add R4, R8, R1 ; (R4 <- R8 + R1)
      add R8, R2, R3 ; (R8 <- R2 + R3)

      WAR data dependencies, however, do not cause hazards in the simulated in-order CPU, since all the corresponding reads of a particular architectural register are guaranteed to happen before any following write accesses to the register.

  2. Structural hazards: Situations in which a request for access to one of the machine resources cannot be granted, because no instance is available or the resource is already in use, are referred to as structural hazards. Denial of access to CPU resources caused by structural hazards usually results in postponement of instruction execution, leading to a performance drop. Let us look at some examples of structural hazards:

    • A lack of available write ports to the register file can cause structural hazards. In the simulated model of the in-order CPU this may happen when several instructions with distinct execution latencies try to access the writeback stage at the same time. Thus the availability of write ports to the register file should be checked before an instruction is dispatched to execution units.

    • Some long arithmetic logic units may not be fully pipelined, which means that only one instruction can be processed in such an execution unit at a time. Hence, an attempt to send an instruction to a not fully pipelined unit before the instructions dispatched to it earlier have completed will lead to a structural hazard. However, all arithmetic logic units are modeled as fully pipelined in the current version of the simulated CPU; this might change with future enhancements of the simulated model.

  3. Precise exceptions: Support for precise exceptions implies that the architectural state of the machine should be kept consistent when an exception (interrupt) is ready to be handled. It means that when an exception is raised by one of the instructions, all previous instructions, but no later ones, should be retired. Since the model of the simulated CPU is meant to fully support precise exception handling, all of the aforementioned conditions are checked.
