

# **BSV Training**

Eg06: Memory-to-Memory Mergesort

An IP block\* that sorts a vector in memory, using the "mergesort" algorithm.

06a uses an *ad hoc* test driver; 06b generalizes this to an "SoC" structure with an AXI4 interconnect; 06c replaces the test driver with a RISC-V CPU core



### Mergesort algorithm

Binary mergesort is a standard sorting algorithm, described in many textbooks and courses on algorithms. The basic idea is illustrated below.

1<sup>st</sup> pass: merge segments of span 1 into segments of span 2

2<sup>nd</sup> pass: merge segments of span 2 into segments of span 4

3<sup>rd</sup> pass: merge segments of span 4 into segments of span 8

... and so on, until segment span >= N, the length of the array



Invariant: every segment that is input to "merge" is already sorted

Some edge conditions we need to take care of:

- N is usually not a power of 2, so the last two segments in a pass may have unequal length (and the very last segment may have no partner at all)
- Depending on N, the final sorted array may be in B, and so may have to be copied back into the original array A

After completing a pass, we can swap arrays A and B.



#### Mergesort algorithm (contd.)

The "merge" step merges two already-sorted segments of length 'span' into a single sorted segment of length '2 x span'.
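The passes and the merge step can be sketched in software as follows. This is a plain Python model of the algorithm only, not the BSV implementation; the names merge and mergesort\_passes are illustrative assumptions, and the clipping against n handles the "N is not a power of 2" edge condition.

```python
def merge(src, dst, i, span, n):
    """Merge sorted src[i:i+span] and src[i+span:i+2*span] into dst, clipped to n."""
    mid = min(i + span, n)        # end of segment 0 (may be clipped)
    hi = min(i + 2 * span, n)     # end of segment 1 (may be short or empty)
    a, b, k = i, mid, i
    while a < mid and b < hi:
        if src[a] <= src[b]:
            dst[k] = src[a]
            a += 1
        else:
            dst[k] = src[b]
            b += 1
        k += 1
    # Drain whichever segment still has items left
    while a < mid:
        dst[k] = src[a]
        a += 1
        k += 1
    while b < hi:
        dst[k] = src[b]
        b += 1
        k += 1


def mergesort_passes(A):
    """Sort list A in place, using a scratch array B and one pass per span."""
    n = len(A)
    B = [0] * n
    p1, p2 = A, B                 # p1 = source of this pass, p2 = destination
    span = 1
    while span < n:
        for i in range(0, n, 2 * span):
            merge(p1, p2, i, span, n)
        p1, p2 = p2, p1           # swap arrays after each pass
        span *= 2
    if p1 is not A:               # final result ended up in B: copy it back
        A[:] = p1
    return A
```

With 7 elements, the model runs three passes (spans 1, 2, 4), so the result ends up in B and is copied back to A.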



#### Three variations in this example



Rather than create an *ad hoc* interface for our mergesort module, let us prepare it to be ready for plugging into an "SoC" (System on a Chip) as an "accelerator" module, illustrated to the right.

An SoC typically consists of CPUs, memories, an interconnect, and custom IP blocks ("Intellectual Property Blocks") that perform particular functions for reasons of greater speed (acceleration) and/or less power consumption (compared to executing the same function in software on a CPU).



The interconnect fabric carries memory requests (red paths in the figure) and responses (green paths).

- Initiator ports (like the CPU port and the IP block read/write port) send requests and receive responses
- Target ports (like the Memory port and the IP block config port) receive requests and send responses

Memory requests are routed to the memory or to the IP block config port based on the address contained in the request. I.e., the (usually small number of) configuration registers in the IP block appear, to the CPU, just like memory locations at a particular base address (these addresses are disjoint from addresses serviced by the Memory block). We also say that the config registers are "memory-mapped".

To operate the IP block, the CPU writes information needed for the operation to the config registers in the IP block, after which the IP block can perform its function (by reading and writing to memory). When the function is completed, the IP block may write a particular value to one of its config registers. The CPU can detect when the IP block has completed its function by "polling" (repeatedly reading) this register.





#### To perform the mergesort:

- The external environment must write the addresses and size of the vectors A and B to the config registers at offset 0x04, 0x08 and 0x0C, and finally write a "1" (meaning: "start running") to the config register at offset 0
- The mergesort module then does its work, reading and writing through its data access port; when completed, it writes "0" to the config reg at offset 0
- The external environment can "poll" (repeatedly read) the config reg at offset 0, to detect completion
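The write-then-poll protocol above can be sketched with a small Python model. This is only an illustration under stated assumptions: the class and function names are hypothetical, the assignment of A's address, B's address and the size to offsets 0x04, 0x08 and 0x0C (in that order) is assumed, the size is assumed to be a word count, and the block's internal work is replaced by a fixed 3-tick delay followed by a software sort.

```python
CSR_RUN, CSR_ADDR_A, CSR_ADDR_B, CSR_N = 0x00, 0x04, 0x08, 0x0C  # order assumed


class MergesortBlockModel:
    """Toy stand-in for the IP block: config registers plus a delayed sort."""

    def __init__(self, mem):
        self.mem = mem                      # list of words standing in for memory
        self.csr = {CSR_RUN: 0, CSR_ADDR_A: 0, CSR_ADDR_B: 0, CSR_N: 0}
        self.busy_ticks = 0

    def write_csr(self, offset, value):
        self.csr[offset] = value
        if offset == CSR_RUN and value == 1:
            self.busy_ticks = 3             # arbitrary delay (assumption)

    def read_csr(self, offset):
        return self.csr[offset]

    def tick(self):
        """Advance the model by one step; finish the sort when the delay expires."""
        if self.csr[CSR_RUN] == 1:
            self.busy_ticks -= 1
            if self.busy_ticks == 0:
                a = self.csr[CSR_ADDR_A] // 4          # byte address -> word index
                n = self.csr[CSR_N]
                self.mem[a:a + n] = sorted(self.mem[a:a + n])
                self.csr[CSR_RUN] = 0                  # announce completion


def run_sort(block, addr_a, addr_b, n, max_polls=100):
    """What the external environment does: configure, start, then poll."""
    block.write_csr(CSR_ADDR_A, addr_a)
    block.write_csr(CSR_ADDR_B, addr_b)
    block.write_csr(CSR_N, n)
    block.write_csr(CSR_RUN, 1)             # "start running"
    polls = 0
    while block.read_csr(CSR_RUN) != 0:     # poll the register at offset 0
        block.tick()
        polls += 1
        if polls > max_polls:
            raise TimeoutError("mergesort block did not complete")
    return polls
```

Note that the start command is written last, after all other config registers hold valid values.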



#### Time-out to reinforce some concepts

#### Please study the lectures:

- Lec\_Types to review types, which are used to define memory requests and responses
- Lec\_Interfaces\_TLM to review the concepts behind interfaces like Get, Put, Client and Server, which are used for most of the interfaces in this example.
- Lec\_Interfaces\_TLM and Lec\_Typeclasses to review the concepts behind the mkConnection abstraction, which is used in the testbench to connect all components together.
- Lec\_StmtFSM for the concepts behind structured rule-based processes, which are used both in the mergesort module and in the testbench.
- Lec\_Interop\_C for the concepts behind importing C code into BSV, which is used in the memory model in this example.



#### Memory requests: Resources/Req\_Rsp.bs

```
-- Operation requested
data RR_Op = RR_Op_R | RR_Op_W
    deriving (Eq, Bits, FShow)

-- Size requested
data RR_Size = RR_Size_8b | RR_Size_16b | RR_Size_32b | RR_Size_64b
    deriving (Eq, Bits, FShow)

-- Requests
-- Note: wdata is always in the least-significant bits (unlike AXI4!)
struct (RR_Req :: # -> # -> # -> *) wd_tid wd_addr wd_data = {
    tid   :: Bit wd_tid;     -- Transaction Id
    op    :: RR_Op;
    addr  :: Bit wd_addr;
    size  :: RR_Size;
    wdata :: Bit wd_data     -- write-data (not relevant for read-requests)
    }
    deriving (Bits, FShow)
```

Memory requests contain an op (READ/WRITE), an address, data (for WRITE commands), a spec of the size of data being transferred, and a "transaction id" (tid).

Transaction Ids (tids) are common on memory requests/responses in modern SoCs, because:

- There may be multiple initiators, and the tid can serve as a "return address" identifying where a response should go
- Responses may be in a different order from the original requests, and the tid can identify the original order

Note: parameterized on the bit-width of tid, addr, data
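To illustrate the second point, here is a minimal Python sketch of how an initiator could pair out-of-order responses with its original requests by tid. The Req and Rsp classes are simplified stand-ins for the BSV structs (the size field is omitted), match\_responses is a hypothetical helper, and the sketch assumes that no two outstanding requests share a tid.

```python
from dataclasses import dataclass


@dataclass
class Req:
    tid: int          # transaction id
    op: str           # "R" or "W"
    addr: int
    wdata: int = 0    # only relevant for writes


@dataclass
class Rsp:
    tid: int
    status: str
    rdata: int        # only relevant for reads


def match_responses(reqs, rsps):
    """Return (request, response) pairs in original request order."""
    by_tid = {r.tid: r for r in rsps}       # assumes tids are unique in flight
    return [(q, by_tid[q.tid]) for q in reqs]
```

With 4-bit tids, as chosen later for this example, at most 16 requests could be distinguished in flight at once under this scheme.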



### Memory requests and responses: Resources/Req\_Rsp.bs

```
-- Response status
data RR_Status = RR_Status_OKAY        -- = AXI4 OKAY
               | RR_Status_RESERVED    -- = AXI4 EXOKAY; here unused
               | RR_Status_TARGETERR   -- = AXI4 SLVERR (e.g., misaligned)
               | RR_Status_DECERR      -- = AXI4 DECERR (decode err: no such addr)
    deriving (Eq, Bits, FShow)

-- Responses
-- Note: rdata is always in the least-significant bits (unlike AXI4!)
struct (RR_Rsp :: # -> # -> *) wd_tid wd_data = {
    tid    :: Bit wd_tid;     -- Transaction Id
    status :: RR_Status;
    rdata  :: Bit wd_data;    -- read-data (not relevant for write-responses)
    op     :: RR_Op           -- For debugging only
    }
    deriving (Bits, FShow)
```

Memory responses contain the original op (READ/WRITE), rdata (for READ op), a status, and the original "transaction id" (tid).

Note: parameterized on the bit-width of tid, data



#### Specific choices for memory requests and responses in our Mergesort example

Resources/Req\_Rsp.bs contains generic definitions for the types of memory requests and responses.

In particular, they are parameterized by the bit-widths of addresses (addr\_sz), data (data\_sz) and transaction ids (tid\_sz), so that they can be used in various SoCs with various requirements.

In Eg06a\_Mergesort/src/Fabric\_Defs.bs, we make particular choices for these parameters for our mergesort example.

```
type Wd_Id = 4

type Wd_Addr = 32

type Wd_Data = 32
```

Tids are 4 bits wide

Addresses are 32 bits wide

Data are 32 bits wide

```
-- Names for types of certain request/response fields

type Fabric_Id = Bit Wd_Id

type Fabric_Addr = Bit Wd_Addr

type Fabric_Data = Bit Wd_Data

type Fabric_User = Bit Wd_User

-- Fabric requests, responses
-- (specializations of the generic RR_Req and RR_Rsp)

type Fabric_Req = RR_Req Wd_Id Wd_Addr Wd_Data

type Fabric_Rsp = RR_Rsp Wd_Id Wd_Data
```

Specializations of various types for chosen sizes (in Utils/Fabric\_Req\_Rsp.bs)



#### Specific choices for memory-mapping in our Mergesort example

In Eg06a\_Mergesort/src/SoC\_Map.bs, we also make particular choices for the "addresses" at which the memory lives, and at which the mergesort configuration port lives.



#### Interface for our mergesort module



```
mkMergesort :: Module Mergesort_IFC
mkMergesort =
  module
    -- Section: Configuration
    -- Vector of CSRs (Config and Status Regs)
    v_csr :: Vector N_CSRs (Reg Fabric_Addr) <- replicateM (mkReg 0)

    rules
        "rl_handle_configReq": when True
         ==> do
             ...

    -- Section: Merge sort behavior
    merge_engine :: Merge_Engine_IFC <- mkMerge_Engine

    -- 'span' starts at 1, and doubles on each merge pass
    rg_span :: Reg Fabric_Addr <- mkRegU

    ... FSM (in rules) to repeatedly invoke merge_engine with spans 1,2,4,8,...

    interface
        mem_bus_ifc = merge_engine.mem_bus_ifc
```

In Mergesort.bs: mkMergesort module structure

Notes on the code above, from top to bottom:

- v\_csr: instantiates the configuration registers
- rl\_handle\_configReq: this rule receives incoming config requests, reads/writes config regs, and sends responses
- merge\_engine: instantiates the module for the "merge" step
- mem\_bus\_ifc: the merge engine's memory interface is directly used as the memory interface of mkMergesort

The FSM implements the following pseudo-code:

```
while (True)
    wait for 'run' command; init span=1, p1=A, p2=B
    while (span < n)                  // do another pass:
        i = 0;
        while (i < n)
            merge (i, span, p1, p2);
            i += 2*span;
        swap p1,p2; span = 2*span
    if final array is B, copy it back to A
    config reg [0] = 0                // announce completion
```



In Mergesort.bs: mkMergeEngine

This module implements the "merge" step which we saw earlier:



#### mkMergeEngine module data flow



mkMergeEngine is highly concurrent (pipelined):

- rl\_req0, rl\_req1 and rl\_merge continuously stream requests to memory
- rl\_rsp0, rl\_rsp1 and rl\_drain... continuously handle the stream of responses

This is typical of high-performance accelerators which try to maximize utilization of available memory bandwidth. A software implementation on a CPU may not be able to generate such concurrent, pipelined memory accesses.



#### **There is a danger of deadlock.** Example:

- Suppose rl\_merge does not consume f\_data1, for example because the next segment 1 item is > many segment 0 items
- Then, f\_data1 may become full, and if the first item in f\_memRsps is from segment 1, then we get stuck (the segment 0 items we need may be behind it). This kind of deadlock is called "head-of-line blocking"

#### **Solution:**

- The code has a parameter: max\_n\_reqs\_in\_flight = 8
- f\_data0 and f\_data1 are sized to accommodate 8 responses. "Credit counters" crg\_credits0 and crg\_credits1 are also initialized to 8.
- rl\_req0 and rl\_req1 decrement crg\_credits0 and crg\_credits1, respectively, whenever they issue a memory request, and stop issuing requests when their credit goes to 0.
- rl\_merge increments crg\_credits0 (respectively, crg\_credits1) whenever it consumes a response from f\_data0 (respectively, f\_data1)
- Using a CReg (instead of a Reg) allows {rl\_req0, rl\_req1} and rl\_merge to operate concurrently (increment and decrement credits in the same cycle).

This prevents the above deadlock situation.
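Why credits help can be seen in a small cycle-by-cycle Python model. This is a sketch under simplifying assumptions, not the BSV code: memory is modeled as a single in-order response queue, each step issues at most one request per segment, routes at most one response, and consumes at most one item; the function name and the use\_credits switch are hypothetical. With credits, the number of segment-s items in the response queue plus the data FIFO never exceeds the FIFO depth, so the response at the head of the queue can always be routed.

```python
from collections import deque


def simulate_merge(seg0, seg1, depth=8, use_credits=True):
    """Return the merged list, or None if the model deadlocks."""
    segs = (seg0, seg1)
    credits = [depth, depth]          # one credit counter per segment
    next_req = [0, 0]                 # index of next item to request
    consumed = [0, 0]                 # items consumed by the merge step
    f_mem_rsps = deque()              # in-order responses from memory: (segment, value)
    f_data = (deque(), deque())       # per-segment data FIFOs, capacity = depth
    out = []
    total = len(seg0) + len(seg1)

    while len(out) < total:
        progress = False

        # Request step: issue one request per segment if credit allows
        for s in (0, 1):
            if next_req[s] < len(segs[s]) and (credits[s] > 0 or not use_credits):
                f_mem_rsps.append((s, segs[s][next_req[s]]))
                next_req[s] += 1
                credits[s] -= 1
                progress = True

        # Response step: route the head response if its data FIFO has room
        if f_mem_rsps:
            s, v = f_mem_rsps[0]
            if len(f_data[s]) < depth:
                f_mem_rsps.popleft()
                f_data[s].append(v)
                progress = True

        # Merge step: needs the head item of every segment that is not exhausted
        live = [s for s in (0, 1) if consumed[s] < len(segs[s])]
        if all(f_data[s] for s in live):
            s = min(live, key=lambda i: f_data[i][0])
            out.append(f_data[s].popleft())
            consumed[s] += 1
            credits[s] += 1           # consuming an item returns a credit
            progress = True

        if not progress:
            return None               # nothing can move: deadlock
    return out
```

For example, with segment 0 holding 0..19 and segment 1 holding 100..119, the model without credits stalls once f\_data1 holds 8 items and another segment 1 response sits at the head of the response queue, while the model with credits runs to completion.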



In Resources/Memory\_Model.bs: a memory model

To test our mergesort block, we need to provide a memory containing the vector A to be sorted and the vector B for its scratch working area.

Large memories (particularly those implemented in DRAM) are typically not expressed in a hardware design language. Hence we merely use a *model* of memory for testing our IP block in simulation.

This is provided in Resources/Memory\_Model.bs, which is excerpted below:

```
interface Memory_IFC =
   bus_ifc :: Server Fabric_Req Fabric_Rsp
...

mkMemory_Model :: Module Memory_IFC
mkMemory_Model =
   module
   ...
```

The interface is a "Server" where one can "put" Fabric\_Reqs and "get" Fabric\_Rsps.

The body of the module mkMemory\_Model is fairly straightforward. We use a Bluespec "Register File" to model memory, in particular a variant that will load its initial contents from a "Mem.hex" file:

```
let last_index :: Raw_Mem_Addr = 0x4000000 - 1 -- 16M Raw_Mem_Words

rf :: RegFile Raw_Mem_Addr Raw_Mem_Word <- mkRegFileLoad "Mem.hex" 0 last_index
```

The register file is 4 bytes wide (to accommodate the common access pattern), but the memory model also supports 1- and 2-byte accesses.

It also checks that access requests are naturally aligned.
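A read through such a model can be sketched in Python as follows. This is an illustration, not the code in Memory\_Model.bs: mem\_read is a hypothetical name, little-endian byte lanes are assumed, a misaligned access is reported as TARGETERR (following the "e.g., misaligned" comment on RR\_Status), and reporting an out-of-range address as DECERR from the memory model itself is an assumption. As in RR\_Rsp, the read data is returned in the least-significant bits.

```python
def mem_read(words, addr, size):
    """Read `size` bytes (1, 2 or 4) at byte address `addr` from 32-bit `words`."""
    if size not in (1, 2, 4) or addr % size != 0:
        return ("TARGETERR", 0)           # not naturally aligned
    idx = addr // 4                       # index of the 4-byte word
    if idx >= len(words):
        return ("DECERR", 0)              # no such address (assumed behavior)
    shift = 8 * (addr % 4)                # byte lane within the word (little-endian)
    mask = (1 << (8 * size)) - 1
    return ("OKAY", (words[idx] >> shift) & mask)
```

"Naturally aligned" here means that the address is a multiple of the access size, so a 2-byte access never straddles a word boundary.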



In Testbench.bs: a testbench

Our testbench is excerpted below:

```
interface Test_Driver_IFC =
    start :: Action
    busy :: Bool
    bus_ifc :: Client Fabric_Req Fabric_Rsp

mkTest_Driver :: Module Test_Driver_IFC
mkTest_Driver =
    module
    ...
    rules
        ... write mergesort's config regs ...
        ... loop, polling mergesort's config reg for completion ...
```

The testbench only performs a very small test (sort 29 words) so that you can easily inspect the outputs.



#### Build and run the 1st version

- In the Build directory, build and run using the 'make' commands, with Bluesim and/or with Verilog sim, as described earlier
- Observe the inputs and outputs and verify that they are reasonable (final memory contents are a sorted version of initial memory contents)



In this second version (Eg06b) we re-use components from Eg06a (Test Driver, Mergesort, Memory).

We only generalize the environment around them into an "SoC" model.

We will get the AXI4 interconnect from Bluespec's Piccolo RISC-V repository. Please clone this now:

\$ git clone https://github.com/bluespec/Piccolo



Eg06b\_Mergesort/

Use 'make copy\_files' to copy re-used components from Eg06a/src/ to Eg06b/src/ and to copy AXI4 files from the Piccolo repo:

\# Please edit the definition of PICCOLO\_REPO in Makefile to point at your Piccolo clone directory.

\$ make copy\_files



Please study: src/SoC\_Map.bs which describes the overall "address map" for the SoC, and the number of "initiators" and "targets" on the interconnect fabric.

- Initiators (a.k.a. Clients, Masters): send requests, receive responses
  - (Test Driver, Mergesort's memory-access port)
- Targets (a.k.a. Servers, Slaves): receive requests, send responses
  - (Memory, Mergesort's configuration port)

```
-- Count and initiator-numbers of initiators in the fabric.

type Num_Initiators = 2

test_driver_initiator_num :: Integer; test_driver_initiator_num = 0
accel_0_initiator_num :: Integer; accel_0_initiator_num = 1

-- Count and target-numbers of targets in the fabric.

type Num_Targets = 2
type Target_Num = Bit (TLog Num_Targets)

mem0_controller_target_num :: Integer; mem0_controller_target_num = 0
accel_0_target_num :: Integer; accel_0_target_num = 1
```



Note that we do some "type-level" arithmetic to derive the type of a target number, based on the number of targets.
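In the code above, TLog computes the ceiling of the base-2 logarithm of its numeric argument, i.e., the number of bits needed to number that many targets. The same arithmetic on ordinary integers can be sketched in Python (tlog is a hypothetical helper name, not a library function):

```python
def tlog(n):
    """Ceiling of log2(n): bits needed to give n items distinct numbers (n >= 1)."""
    return (n - 1).bit_length()
```

With Num\_Targets = 2, target numbers are 1 bit wide; expanding the fabric to 3 or 4 targets would widen them to 2 bits without any other change to the type definitions.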



[Figure: SoC block diagram showing the Test Driver, the AXI4 Interconnect, and Mergesort (Merge Engine) with its config and mem access ports]

Please study: src/AXI4\_Fabric.bsv

Its interface is just a vector of Servers facing the initiators, and a vector of Clients facing the targets.

The module mkAXI4\_Fabric is a "full crossbar" switch, i.e., there is a separate datapath from each of M initiators to each of S targets, and vice versa. These are represented by rules that are generated in for-loops.

The whole module is parameterized by a "routing function" that decides, based on the address in each request, which target (if any) it should be sent to.



Please study: src/SoC\_Fabric.bs to see how we specialize the general AXI4 Fabric definition for our particular SoC.

Inside the module we have an "address decoder" function that decides, based on the address in a request, which target (if any) it should be sent to.

This function is passed as an argument to the mkAXI4\_Fabric module constructor.

```
mkSoC_Fabric :: Module SoC_Fabric_IFC
mkSoC_Fabric =
  module
    soc_map :: SoC_Map_IFC <- mkSoC_Map

    -- Target address decoder.
    -- Identifies whether a given addr is legal and, if so, which target services it.
    let fn_addr_to_target_num :: Fabric_Addr -> (Bool, Target_Num)
        fn_addr_to_target_num addr =
            -- Mem 0
            if (   (soc_map.m_mem0_controller_addr_base <= addr)
                && (addr < soc_map.m_mem0_controller_addr_lim)) then
                (True, fromInteger mem0_controller_target_num)
            -- Accelerator 0
            else if (   (soc_map.m_accel_0_addr_base <= addr)
                     && (addr < soc_map.m_accel_0_addr_lim)) then
                (True, fromInteger accel_0_target_num)
            else
                (False, _ )

    soc_fabric :: SoC_Fabric_IFC <- mkAXI4_Fabric fn_addr_to_target_num
    return soc_fabric
```
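The decoder's behavior can be tried out with a small Python equivalent. The base and limit addresses below are hypothetical placeholders chosen only for illustration (the real values live in SoC\_Map.bs); the target numbers match those defined in SoC\_Map.bs above. Each range is half-open: the base is included and the limit is excluded.

```python
# Hypothetical address map, for illustration only
MEM0_BASE, MEM0_LIM = 0x8000_0000, 0x9000_0000
ACCEL0_BASE, ACCEL0_LIM = 0xC000_0000, 0xC000_0010

MEM0_TARGET_NUM = 0
ACCEL0_TARGET_NUM = 1


def addr_to_target_num(addr):
    """Return (legal, target_num); target_num is None for an illegal address."""
    if MEM0_BASE <= addr < MEM0_LIM:
        return (True, MEM0_TARGET_NUM)
    if ACCEL0_BASE <= addr < ACCEL0_LIM:
        return (True, ACCEL0_TARGET_NUM)
    return (False, None)
```

A request whose address falls in neither range has no target; this corresponds to the DECERR ("no such addr") response status.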



#### Please study:

- src/AXI4\_Types.bsv to see how we define industry-standard AXI4 bus types and interfaces in Bluespec code.
- src/AXI4\_Fabric.bsv to see how we define an AXI4 crossbar switch that is parameterized by number of initiators, number of targets, and width of the address, data, id and user buses.

These are written in BSV, not Bluespec Classic. AXI4\_Types is more easily written in BSV (because of the heavy use of Verilog signal-naming customization). AXI4\_Fabric could just as easily be written in Classic.



To re-use a component like the Test Driver, we use adapters defined in: Resources/Adapters\_Req\_Rsp\_AXI4.bs.

Then, in src/Top.bs, after instantiating Test\_Driver, we convert its original Req\_Rsp interface into an AXI4\_Master interface:

```
test_driver :: Test_Driver_IFC <- mkTest_Driver
...
test_driver_master :: SoC_Fabric_Initiator_IFC <- mkReq_Rsp_to_AXI4_Master test_driver.bus_ifc
```



Then, we connect the AXI4 Master interface to one of the fabric's "sockets":

```
mkConnection test_driver_master (soc_fabric.v_from_masters !! test_driver_initiator_num)
```

Similarly, we use adapters on mergesort's mem\_bus interface and config\_bus interface, and the memory's bus interface, and connect them all up to the fabric using 'mkConnection'.



#### Build and run the 2<sup>nd</sup> version

In the Build directory, build and run using the 'make' commands, with Bluesim and/or with Verilog sim, as before, and verify that the output is as expected.



#### In the 3<sup>rd</sup> version (Eg06c):

- We expand the AXI4 interconnect from 2x2 to 3x3
- We replace Test\_Driver with a Bluespec Piccolo RISC-V CPU core
  - It has two initiator ports into the fabric (for Instruction and Data access).
- We add a UART target to the fabric, so that the CPU can write messages to the console.

Then, we run programs on Piccolo by pre-loading the binary program code into memory from a "Mem.hex" file. The binary program code, in turn, is derived from the compiled ELF file for the program.



In this last version (Eg06c) we'll run two programs on Piccolo, compiled from C with gcc:

- "Hello World!" (writes to the UART)
- mergesort, which will sort an array twice, once with a C function, and once using the IP block "accelerator", and compare elapsed times, writing results to the UART.

[Figure: hello.c and mergesort.c are compiled by riscv-gcc into ELF executables]

© Bluespec, Inc., 2013-2019

Please clone Bluespec's Piccolo RISC-V repository if you have not already done so in Eg06b.

\$ git clone https://github.com/bluespec/Piccolo

Copy components from Eg06a into local copies:

\$ make copy\_files

Copy Piccolo files (this will copy into a new "src\_Piccolo" directory):

\# Please edit the definition of PICCOLO\_REPO in Makefile to point at your Piccolo clone directory.

\$ make copy\_files

Build (compile and link) as before:

\$ make b\_compile b\_link
\$ make b\_sim

Note: since we haven't provided a Mem.hex file to load into memory, the CPU sees an illegal instruction and falls into an infinite trap loop.

Study the source code of the two programs of interest: ../C\_programs\_RV32/hello/hello.c and .../mergesort/mergesort.c

Run the two programs of interest:

\$ make b\_sim\_hello

First copies ../C\_programs\_RV32/hello/hello\_Mem.hex to Mem.hex; then runs.

\$ make b\_sim\_mergesort

First copies ../C\_programs\_RV32/mergesort/mergesort\_Mem.hex to Mem.hex; then runs.



### Suggested exercises

- All three versions of the example sort 32-bit (4-byte) words of memory.
  - Modify the design to have a *static* parameter such that it will compile to a circuit that sorts memory in units of 1, 2, 4 or 8 bytes (static = the size is fixed at compile time). Note that Resources/Req\_Rsp.bs already defines an enum type RR\_Size to specify byte size.
    - The mergesort engine should issue memory requests with the selected unit size.
  - Modify the design so that the byte-size selection is done *dynamically*:
    - Add another config register in which the CPU can specify the size.
    - The mergesort engine should issue memory requests with the selected unit size.
  - Modify the last design so that memory requests are always for 8 bytes, even if the sort is on smaller units. E.g., if the sort is on 1-byte units:
    - Only 1 memory read is needed to fetch 8 units.
    - Only 1 memory write is needed to store 8 units.
- All the examples perform a binary (radix 2) merge sort, i.e., the basic merge step merges two spans.
  - Modify the program to perform a radix 4 merge sort, i.e., the basic merge step should merge *four* spans. Question: when sorting an array of length *n*, how many memory references does this perform, compared to the binary merge sort?
  - Parameterize the module for a radix k mergesort, where k is a static parameter that may take some chosen range of values (2, 3, 4, ...).
- mkMergesort currently instantiates 1 copy of mkMerge\_Engine. If your fabric and memory have more bandwidth, it could instantiate
  multiple mkMergeEngines running concurrently to do each pass faster. Make this modification: expand the fabric to 4x4; add a second
  memory bank as the extra target. Instantiate 2 mkMerge\_Engines, using the extra initiator port for the second one. Modify mkMergesort to
  use two engines concurrently. (Caution! having two memory banks can introduce a memory-ordering problem! Use a "CompletionBuffer"
  from the Bluespec library to solve this.)

#### Summary

This example has shown you key features of an IP block built for high-performance in an SoC context:

- Useful functionality (sorting, which is useful in *many* applications)
- Implementation using an efficient mathematical algorithm (mergesort)
- Key concepts of SoC structure: Fabrics, initiators, targets, memory mapping, ...
- Key concepts of high-performance: pipelining, task queue parallelism, memory bandwidth, managing out-of-order communication, ...
- Generality through parameterization on many dimensions (and hence capable of much re-use in other contexts)





# End




