## RISC-V XBitmanip Extension

Document Version 0.35-draft

Editor: Clifford Wolf Symbiotic GmbH clifford@symbioticeda.com April 26, 2018 Contributors to all versions of the spec in alphabetical order (please contact editors to suggest corrections): Allen Baum, Steven Braeger, Michael Clark, Po-wei Huang, Luke Kenneth Casson Leighton, Rex McCrary, and Clifford Wolf.

This document is released under a Creative Commons Attribution 4.0 International License.

# Contents

| 1 | Intr | roduction 1                                         |    |  |  |  |  |  |  |  |
|---|------|-----------------------------------------------------|----|--|--|--|--|--|--|--|
|   | 1.1  | ISA Extension Proposal Design Criteria              | 1  |  |  |  |  |  |  |  |
|   | 1.2  | B Extension Adoption Strategy                       | 2  |  |  |  |  |  |  |  |
|   | 1.3  | Next steps                                          | 2  |  |  |  |  |  |  |  |
| 2 | RIS  | C-V XBitmanip Extension                             | 3  |  |  |  |  |  |  |  |
|   | 2.1  | Count Leading/Trailing Zeros (clz, ctz)             | 3  |  |  |  |  |  |  |  |
|   | 2.2  | Count Bits Set (pcnt)                               | 4  |  |  |  |  |  |  |  |
|   | 2.3  | And-with-complement (andc)                          | 4  |  |  |  |  |  |  |  |
|   | 2.4  | Shift Ones (Left/Right) (slo, sloi, sro, sroi)      | 5  |  |  |  |  |  |  |  |
|   | 2.5  | Rotate (Left/Right) (rol, ror, rori)                | 6  |  |  |  |  |  |  |  |
|   | 2.6  | Generalized Reverse (grev, grevi)                   | 8  |  |  |  |  |  |  |  |
|   | 2.7  | Generalized zip/unzip (gzip)                        | 9  |  |  |  |  |  |  |  |
|   | 2.8  | Bit Extract/Deposit (bext, bdep)                    | 14 |  |  |  |  |  |  |  |
|   | 2.9  | Compressed instructions (c.not, c.neg, c.brev)      | 15 |  |  |  |  |  |  |  |
|   | 2.10 | Pseudo instructions                                 | 16 |  |  |  |  |  |  |  |
|   |      | 2.10.1 Bit-field extract pseudo instruction (bfext) | 16 |  |  |  |  |  |  |  |
|   |      | 2.10.2 Pseudo instructions using grevi and gzip     | 16 |  |  |  |  |  |  |  |
| 3 | Disc | cussion                                             | 19 |  |  |  |  |  |  |  |
|   | 3.1  | Frequently Asked Questions                          | 19 |  |  |  |  |  |  |  |
|   | 3.2  | Analysis of used encoding space                     | 20 |  |  |  |  |  |  |  |

| 4            | Eva    | duation 23                                |    |  |  |  |  |  |  |  |
|--------------|--------|-------------------------------------------|----|--|--|--|--|--|--|--|
|              | 4.1    | MIX/MUX pattern                           | 23 |  |  |  |  |  |  |  |
|              | 4.2    | Bit scanning and counting                 | 24 |  |  |  |  |  |  |  |
|              | 4.3    | Arbitrary (data-defined) bit permutations | 25 |  |  |  |  |  |  |  |
|              |        | 4.3.1 Using butterfly operations          | 25 |  |  |  |  |  |  |  |
|              |        | 4.3.2 Using omega-flip networks           | 26 |  |  |  |  |  |  |  |
|              |        | 4.3.3 Using baseline networks             | 26 |  |  |  |  |  |  |  |
|              |        | 4.3.4 Using sheep-and-goats               |    |  |  |  |  |  |  |  |
|              | 4.4    | Emulating x86 Bit Manipulation ISAs       |    |  |  |  |  |  |  |  |
|              | 4.5    | Emulating RI5CY Bit Manipulation ISA      | 29 |  |  |  |  |  |  |  |
|              | 4.6    | Decoding RISC-V Immediates                | 29 |  |  |  |  |  |  |  |
| 5            | XBi    | tfield Extension                          | 31 |  |  |  |  |  |  |  |
|              | 5.1    | Bit-field extract and place (bfxp)        | 32 |  |  |  |  |  |  |  |
|              | 5.2    | Evaluation: Decoding RISC-V Immediates    |    |  |  |  |  |  |  |  |
| $\mathbf{C}$ | hang   | e History S                               | 35 |  |  |  |  |  |  |  |
| Ri           | iblios | raphy                                     | 37 |  |  |  |  |  |  |  |

## Chapter 1

## Introduction

This is the RISC-V XBitmanip Extension draft spec. Originally it was the B-Extension draft spec, but the work group got dissolved for bureaucratic reasons in November 2017.

It is currently an independently maintained document. We'd happily donate it to the RISC-V foundation as starting point for a new B-Extension work group, if there will be one.

## 1.1 ISA Extension Proposal Design Criteria

Any proposed changes to the ISA should be evaluated according to the following criteria.

- Architecture Consistency: Decisions must be consistent with RISC-V philosophy. ISA changes should deviate as little as possible from existing RISC-V standards (such as instruction encodings), and should not re-implement features that are already found in the base specification or other extensions.
- Threshold Metric: The proposal should provide a *significant* savings in terms of clocks or instructions. As a heuristic, any proposal should replace at least four instructions. An instruction that only replaces three may be considered, but only if the frequency of use is very high.
- Data-Driven Value: Usage in real world applications, and corresponding benchmarks showing a performance increase, will contribute to the score of a proposal. A proposal will not be accepted on the merits of its *theoretical* value alone, unless it is used in the real world.
- Hardware Simplicity: Though instructions saved is the primary benefit, proposals that dramatically increase the hardware complexity and area, or are difficult to implement, should be penalized and given extra scrutiny. The final proposals should only be made if a test implementation can be produced.
- Compiler Support: ISA changes that can be natively detected by the compiler, or are already used as intrinsics, will score higher than instructions which do not fit that criteria.

### 1.2 B Extension Adoption Strategy

The overall goal of this extension is pervasive adoption by minimizing potential barriers and ensuring the instructions can mapped to the largest number of ops, either direct or pseudo, that are supported by the most popular processors and compilers. By adding generic instructions and taking advantage of the RISC-V base instruction that already operate on bits, the minimal set of instructions need to added while at the same time enabling a rich of operations.

The instructions cover the four major categories of bit manipulation: Count, Extract, Insert, Swap. The spec supports RV32, RV64, and RV128. "Clever" obscure and/or overly specific instructions are avoided in favor of more straightforward, fast, generic ones. Coordination with other emerging RISC-V ISA extensions groups is required to ensure our instruction sets are architecturally consistent.

### 1.3 Next steps

- Add support for this extension to processor cores and compilers so we can run quantitative evaluations on the instructions.
- Create assembler snippets for common operations that do not map 1:1 to any instruction in this spec, but can be implemented easily using clever combinations of the instructions. Add support for those snippets to compilers.

## Chapter 2

# RISC-V XBitmanip Extension

In the proposals provided in this section, the C code examples are for illustration purposes. They are not optimal implementations, but are intended to specify the desired functionality.

The sections on encodings are mere placeholders.

## 2.1 Count Leading/Trailing Zeros (clz, ctz)

The clz operation counts the number of 0 bits at the MSB end of the argument. That is, the number of 0 bits before the first 1 bit counting from the most significant bit. If the input is 0, the output is XLEN. If the input is -1, the output is 0.

The ctz operation counts the number of 0 bits at the LSB end of the argument. If the input is 0, the output is XLEN. If the input is -1, the output is 0.

```
uint_xlen_t clz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 << count) >> (XLEN - 1))
            return count;
    return XLEN;
}
uint_xlen_t ctz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 >> count) & 1)
            return count;
    return XLEN;
}
```

| 31 |             | 20 19                | $15 \ 14$ |        | 12 11                 | 7 6 |           | 0 |
|----|-------------|----------------------|-----------|--------|-----------------------|-----|-----------|---|
|    | imm[11:0]   | rs1                  |           | funct3 | rd                    |     | opcode    |   |
|    | 12          | 5                    |           | 3      | 5                     | •   | 7         |   |
|    | ??????????? | $\operatorname{src}$ |           | CLZ    | $\operatorname{dest}$ |     | OP-IMM    |   |
|    | ??????????? | $\operatorname{src}$ |           | CTZ    | dest                  |     | OP-IMM    |   |
|    | ??????????? | $\operatorname{src}$ |           | CLZW   | dest                  | (   | OP-IMM-32 |   |
|    | ??????????  | $\operatorname{src}$ |           | CTZW   | $\operatorname{dest}$ | (   | OP-IMM-32 |   |

One possible encoding for clz and ctz is as standard I-type opcodes somewhere in the brownfield surrounding the shift-immediate instructions.

### 2.2 Count Bits Set (pcnt)

This instruction counts the number of 1 bits in a register. This operations is known as population count, popcount, sideways sum, bit summation, or Hamming weight. [5, 4]

```
uint_xlen_t pcnt(uint_xlen_t rs1)
{
   int count = 0;
   for (int index = 0; index < XLEN; index++)
        count += (rs1 >> index) & 1;
   return count;
}
```

| 31 |            | 20 19                | $15 \ 14$ | 1      | 12 11                 | 7 6       | 0 |
|----|------------|----------------------|-----------|--------|-----------------------|-----------|---|
|    | imm[11:0]  | rs1                  |           | funct3 | rd                    | opcode    |   |
|    | 12         | 5                    |           | 3      | 5                     | 7         |   |
|    | ?????????? | $\operatorname{src}$ |           | PCNT   | $\operatorname{dest}$ | OP-IMM    |   |
|    | ?????????? | $\operatorname{src}$ |           | PCNTW  | $\operatorname{dest}$ | OP-IMM-32 |   |

One possible encoding for pcnt is as a standard I-type opcode somewhere in the brownfield surrounding the shift-immediate instructions.

## 2.3 And-with-complement (andc)

This instruction implements the and-with-complement operation.

```
uint_xlen_t andc(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 & ~rs2;
}
```

Other with-complement operations (orc, nand, nor, etc) can be implemented by combining not (c.not) with the base ALU operation. (Which can fit in 32 bit when using two compressed instructions.) Only and-with-complement occurs frequently enough to warrant a dedicated instruction.

| 31   | 25   | 24 2            | 0 19                  | 15 14  | 12 11 7 | 6 0    |
|------|------|-----------------|-----------------------|--------|---------|--------|
| fund | ct7  | rs2             | rs1                   | funct3 | rd      | opcode |
| •    | 7    | 5               | 5                     | 3      | 5       | 7      |
|      | ???? | $\mathrm{src}2$ | $\operatorname{src}1$ | ANDC   | dest    | OP     |
|      | ???? | $\mathrm{src}2$ | $\operatorname{src}1$ | ANDCW  | dest    | OP-32  |

## 2.4 Shift Ones (Left/Right) (slo, sloi, sro, sroi)

These instructions are similar to shift-logical operations from the base spec, except instead of shifting in zeros, they shifts in ones.

```
uint_xlen_t slo(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN - 1);
    return ~(~rs1 << shamt);
}
uint_xlen_t sro(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN - 1);
    return ~(~rs1 >> shamt);
}
```

| 3 | 1 2     | 5 24 20         | 19 15                 | 5 14 12 | 2 11 7                | 6 0    |
|---|---------|-----------------|-----------------------|---------|-----------------------|--------|
|   | funct7  | rs2             | rs1                   | funct3  | rd                    | opcode |
|   | 7       | 5               | 5                     | 3       | 5                     | 7      |
|   | 10????? | src2            | $\operatorname{src}1$ | SRO     | $\operatorname{dest}$ | OP     |
|   | 10????? | $\mathrm{src}2$ | $\operatorname{src}1$ | SLO     | $\operatorname{dest}$ | OP     |
|   | 10????? | $\mathrm{src}2$ | $\operatorname{src}1$ | SROW    | $\operatorname{dest}$ | OP-32  |
|   | 10????? | ${ m src}2$     | $\operatorname{src}1$ | SLOW    | $\operatorname{dest}$ | OP-32  |

| 3 | 1         | $27\ 26$ | 20 19         | 1                    | 5 14   | 12 11 | 7 6 | 0      |   |
|---|-----------|----------|---------------|----------------------|--------|-------|-----|--------|---|
|   | imm[11:7] | imm      | [6:0]         | rs1                  | funct3 | rd    |     | opcode | 7 |
|   | 5         | 7        |               | 5                    | 3      | 5     | ·   | 7      | _ |
|   | 10???     | sha      | $\mathrm{mt}$ | $\operatorname{src}$ | SLOI   | dest  | C   | P-IMM  |   |
|   | 10???     | sha      | $\mathrm{mt}$ | $\operatorname{src}$ | SROI   | dest  | C   | P-IMM  |   |

| 31 |          | $25\ 24$ | 20 19 | 15                   | 5 14   | 12 11                 | 7 6  | 0     |
|----|----------|----------|-------|----------------------|--------|-----------------------|------|-------|
| i  | mm[11:5] | imm[4    | 1:0]  | rs1                  | funct3 | rd                    | ot   | ocode |
|    | 7        | 5        |       | 5                    | 3      | 5                     |      | 7     |
|    | 10?????  | shan     | nt    | $\operatorname{src}$ | SLOIW  | $\operatorname{dest}$ | OP-I | MM-32 |
|    | 10?????  | shan     | nt    | $\operatorname{src}$ | SROIW  | $\operatorname{dest}$ | OP-I | MM-32 |

s(1/r)o(i) is encoded similarly to the logical shifts in the base spec. However, the spec of the entire family of instructions is changed so that the high bit of the instruction indicates the value to be inserted during a shift. This means that a sloi instruction can be encoded similarly to an slli instruction, but with a 1 in the highest bit of the encoded instruction. This encoding is backwards compatible with the definition for the shifts in the base spec, but allows for simple addition of a ones-insert.

When implementing this circuit, the only change in the ALU over a standard logical shift is that the value shifted in is not zero, but is a 1-bit register value that has been forwarded from the high bit of the instruction decode. This creates the desired behavior on both logical zero-shifts and logical ones-shifts.

## 2.5 Rotate (Left/Right) (rol, ror, rori)

These instructions are similar to shift-logical operations from the base spec, except they shift in the values from the opposite side of the register, in order. This is also called 'circular shift'.

```
uint_xlen_t rol(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN - 1);
    return (rs1 << shamt) | (rs1 >> (XLEN - shamt));
}
uint_xlen_t ror(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN - 1);
    return (rs1 >> shamt) | (rs1 << (XLEN - shamt));
}</pre>
```

| 31     | 25 24 | 20 19     | 15 14  | 12 11 | 7 6    | 0 |
|--------|-------|-----------|--------|-------|--------|---|
| funct' | 7 rs  | s2 rs1    | funct3 | rd    | opcode |   |
| 7      |       | 5 5       | 3      | 5     | 7      |   |
| 11???  | ?? sr | c2 $src1$ | ROR    | dest  | OP     |   |
| 11???  | ?? sr | c2 $src1$ | ROL    | dest  | OP     |   |
| 11???  | ?? sr | c2 $src1$ | RORW   | dest  | OP-32  |   |
| 11???  | ?? sr | c2 	 src1 | ROLW   | dest  | OP-32  |   |

|   | 31        | 27 26   | 20 19 | 15           | 14     | 12 11 | 7 6 |        | 0 |
|---|-----------|---------|-------|--------------|--------|-------|-----|--------|---|
|   | imm[11:7] | imm[6:0 | )] rs | 1            | funct3 | rd    |     | opcode |   |
| Ī | 5         | 7       | ļ     | •            | 3      | 5     |     | 7      | _ |
|   | 11???     | shamt   | Sl    | $\mathbf{c}$ | RORI   | dest  | -   | OP-IMM |   |

| 31        | $25\ 24$ | 20 19        | 9 15                 | 14 15  | 2 11 7                | 7 6 0     |
|-----------|----------|--------------|----------------------|--------|-----------------------|-----------|
| imm[11:5] | im       | m[4:0]       | rs1                  | funct3 | rd                    | opcode    |
| 7         |          | 5            | 5                    | 3      | 5                     | 7         |
| 11?????   | sł       | $_{ m namt}$ | $\operatorname{src}$ | RORIW  | $\operatorname{dest}$ | OP-IMM-32 |



Figure 2.1: rot permutation network

Rotate shift is implemented very similarly to the other shift instructions. One possible way to encode it is to re-use the way that bit 30 in the instruction encoding selects 'arithmetic shift' when bit 31 is zero (signalling a logical-zero shift). We can re-use this so that when bit 31 is set (signalling a logical-ones shift), if bit 31 is also set, then we are doing a rotate. The following table summarizes the behavior. The generalized reverse instructions can be encoded using the bit pattern that would otherwise encode an "Arithmetic Left Shift" (which is an operation that does not exist). Likewise, the generalized zip instruction can be encoded using the bit pattern that would otherwise encode an "Rotate left immediate".

Table 2.1: Rotate Encodings

| Bit 31 | Bit 30 | Meaning             |
|--------|--------|---------------------|
| 0      | 0      | Logical Shift-Zeros |
| 0      | 1      | Arithmetic Shift    |
| 1      | 0      | Logical Shift-Ones  |
| 1      | 1      | Rotate              |



Figure 2.2: grev permutation network

## 2.6 Generalized Reverse (grev, grevi)

This instruction provides a single hardware instruction that can implement all of byte-order swap, bitwise reversal, short-order-swap, word-order-swap (RV64), nibble-order swap, bitwise reversal in a byte, etc, all from a single hardware instruction. It takes in a single register value and an immediate that controls which function occurs, through controlling the levels in the recursive tree at which reversals occur.

This operation iteratively checks each bit i in rs2 from i = 0 to XLEN -1, and if the corresponding bit is set, swaps each adjacent pair of  $2^i$  bits.

```
uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 31;
    if (shamt & 1) x = ((x & 0x555555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
    if (shamt & 2) x = ((x & 0x333333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
    if (shamt & 4) x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
    if (shamt & 8) x = ((x & 0x00FF00FF) << 8) | ((x & 0xFFF00F00) >> 8);
    if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}
```

```
uint64_t grev64(uint64_t rs1, uint64_t rs2)
{
   uint64_t x = rs1;
   int shamt = rs2 & 63;
   ((x & OxAAAAAAAAAAAAAAAaall) >>
                                                1):
   ((x & 0xCCCCCCCCCCCCCull) >>
                                                2):
   if (shamt & 4) x = ((x \& 0x0F0F0F0F0F0F0F0Full) <<
                    ((x & 0xF0F0F0F0F0F0F0F0ull) >>
                                                4);
   if (shamt & 8) x = ((x \& 0x00FF00FF00FF00FFull) <<
                                                8) I
                    ((x & 0xFF00FF00FF00ull) >>
   if (shamt & 16) x = ((x \& 0x0000FFFF0000FFFFull) << 16) |
                    ((x & 0xFFFF0000FFFF0000ull) >> 16);
   if (shamt & 32) x = ((x \& 0x00000000FFFFFFFFF111) << 32) |
                    ((x & 0xFFFFFFF00000000ull) >> 32);
   return x;
}
```

The above pattern should be intuitive to understand in order to extend this definition in an obvious manner for RV128.

The grev operation can easily be implemented using a permutation network with  $log_2(XLEN)$  stages. Figure 2.1 shows the permutation network for rot for reference. Figure 2.2 shows the permutation network for grev.

|   | 31 25 24  |      | 24                    |       |                       |      | 14     | 12 11  |                       |     | 6                       | 0 |
|---|-----------|------|-----------------------|-------|-----------------------|------|--------|--------|-----------------------|-----|-------------------------|---|
|   | funct7    |      | rs2                   |       | rs1                   |      | funct3 | et3 rd |                       |     | $\operatorname{opcode}$ |   |
|   | 7         |      | 5                     |       | 5                     |      | 3      |        | 5                     |     | 7                       |   |
|   | ???????   |      | $\operatorname{src}2$ |       | $\operatorname{src}1$ |      | GREV   |        | $\operatorname{dest}$ |     | OP                      |   |
|   | ???????   |      | src2                  |       | $\operatorname{src}1$ |      | GREVW  | •      | $\operatorname{dest}$ |     | OP-32                   |   |
|   |           |      |                       |       |                       |      |        |        |                       |     |                         |   |
| _ | 31        | 27 2 | 6                     | 20 19 | )                     | 15 1 | 14     | 12 1   | 1                     | 7 6 |                         | 0 |
|   | imm[11:7] |      | imm[6:0]              |       | rs1                   |      | funct3 |        | $\operatorname{rd}$   |     | $\operatorname{opcode}$ |   |
| • | 5         |      | 7                     |       | 5                     |      | 3      |        | 5                     | ·   | 7                       |   |
|   | ?????     |      | mode                  |       | $\operatorname{src}$  |      | GREVI  |        | $\operatorname{dest}$ |     | OP-IMM                  |   |
|   |           |      |                       |       |                       |      |        |        |                       |     |                         |   |
|   | 31        | 25 2 | 4                     | 20 19 | )                     | 15 1 | 14     | 12 1   | 11                    | 7 6 |                         | 0 |
|   | imm[11:5] |      | imm[4:0]              |       | rs1                   |      | funct3 |        | rd                    |     | opcode                  |   |
| • | 7         |      | 5                     |       | 5                     |      | 3      |        | 5                     | ·   | 7                       |   |
|   | ???????   |      | mode                  |       | $\operatorname{src}$  |      | GREVIW |        | $\operatorname{dest}$ |     | OP-IMM-32               |   |

grev is encoded as standard R-type opcode and grevi is encoded as standard I-type opcode. grev and grevi can use the instruction encoding for "arithmetic shift left".

## 2.7 Generalized zip/unzip (gzip)

gzip is the third bit permutation instruction in XBitmanip, after rori and grevi.

The gzip instruction uses an I-type encoding similar to grevi. There are XLEN different generalized zip operations, some of which are reserved because they are no-ops, or equivalent to other modes, or encode for obscure combinations of other modes. The bit pattern for the non-reserved modes match the regular expression /^0\*(10+|11+0\*[01])\$/. See Table 2.2.

Reserving modes that encode for "obscure combinations of other modes" can help implementations that use different base permutations (or completely different mechanisms) to implement the gzip instruction. The reserved modes can be used to encode unary functions such as ctz, clz, and pcnt.

| mode                | Bit index rotations        | Pseudo-Instruction |
|---------------------|----------------------------|--------------------|
| 0000-0              | no-op                      | reserved           |
| 0000 - 1            | no-op                      | reserved           |
| $0001 \ 0$          | i[1] -> i[0]               | zip.n, unzip.n     |
| 0001 - 1            | $equivalent\ to\ 0001\ 0$  | reserved           |
| $0010 \ 0$          | i[2] -> i[1]               | zip2.b, unzip2.b   |
| 0010 - 1            | $equivalent\ to\ 0010\ 0$  | reserved           |
| $0011 \ 0$          | i[2] -> i[0]               | zip.b              |
| 0011 1              | i[2] <- i[0]               | unzip.b            |
| 0100 0              | i[3] -> i[2]               | zip4.h, unzip4.h   |
| 0100 - 1            | 1                          | reserved           |
| 0101 - 0            | i[3] -> i[2], i[1] -> i[0] | reserved           |
| 0101 - 1            | 1                          | reserved           |
| $0110 \ 0$          |                            | zip2.h             |
| $0110\ 1$           | i[3] <- i[1]               | unzip2.h           |
| $0111\ 0$           | i[3] -> i[0]               | zip.h              |
| 0111 1              |                            | unzip.h            |
| $1000\ 0$           |                            | zip8, unzip8       |
| $\frac{1000-1}{}$   | 1                          | reserved           |
| $\frac{1001 - 0}{}$ | i[4] -> i[3], i[1] -> i[0] | reserved           |
| <del>1001-1</del>   | 1                          | reserved           |
| <del>1010 0</del>   |                            | reserved           |
| <del>1010-1</del>   | <u> </u>                   | reserved           |
| <del>1011-0</del>   | i[4] -> i[3], i[2] -> i[0] | reserved           |
| 1011-1              |                            | reserved           |
| $1100 \ 0$          |                            | zip4               |
| $1100 \ 1$          |                            | unzip4             |
| <del>1101-0</del>   |                            |                    |
| <del>1101-1</del>   |                            | reserved           |
| $1110 \ 0$          |                            | zip2               |
| 1110 1              |                            | unzip2             |
| 1111 0              |                            | zip                |
| 1111 1              | i[4] <- i[0]               | unzip              |

Table 2.2: RV32 modes for gzip instruction

Like GREV and rotate shift, the gzip instruction can be implemented using a short sequence of atomic permutations, that are enabled or disabled by the mode (shamt) bits. But zip has one stage fewer than GREV and the LSB bit of mode controls the order in which the stages are applied:



Figure 2.3: gzip permutation network without "flip" stages

```
uint32_t gzip32_stage(uint32_t src, uint32_t maskL, uint32_t maskR, int N)
    uint32_t x = src & ~(maskL | maskR);
    x \mid = ((src \ll N) \& maskL) \mid ((src >> N) \& maskR);
    return x;
}
uint32_t gzip32(uint32_t rs1, uint32_t rs2)
    uint32_t x = rs1;
    int mode = rs2 & 31;
    if (mode & 1) {
        if (mode & 2) x = gzip32\_stage(x, 0x44444444, 0x22222222, 1);
        if (mode & 4) x = gzip32\_stage(x, 0x30303030, 0x0c0c0c0c, 2);
        if (mode & 8) x = gzip32_stage(x, 0x0f000f00, 0x00f000f0, 4);
        if (mode & 16) x = gzip32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    } else {
        if (mode & 16) x = gzip32_stage(x, 0x00ff0000, 0x0000ff00, 8);
        if (mode & 8) x = gzip32_stage(x, 0x0f000f00, 0x00f000f0, 4);
        if (mode & 4) x = gzip32\_stage(x, 0x30303030, 0x0c0c0c0c, 2);
        if (mode & 2) x = gzip32\_stage(x, 0x44444444, 0x22222222, 1);
    }
    return x;
}
```

Alternatively gzip can be implemented in a single network with one more stage than GREV, with the additional first and last stage executing a permutation that effectively reverses the order of the inner stages. However, since the inner stages only mux half of the bits in the word each, a hardware implementation using this additional "flip" stages might actually be more expensive than simply creating two networks.

```
uint32_t gzip32_flip(uint32_t src)
{
    uint32_t x = src \& 0x88224411;
    x = ((src << 6) \& 0x22001100) | ((src >> 6) \& 0x00880044);
    x = ((src << 9) \& 0x00440000) | ((src >> 9) \& 0x00002200);
    x = ((src \ll 15) \& 0x44110000) | ((src >> 15) \& 0x00008822);
    x = ((src \ll 21) \& 0x11000000) | ((src >> 21) \& 0x00000088);
    return x;
}
uint32_t gzip32alt(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int mode = rs2 \& 31;
    if (mode & 1)
        x = gzip32_flip(x);
    if ((mode & 17) == 16 || (mode & 3) == 3)
        x = gzip32\_stage(x, 0x00ff0000, 0x0000ff00, 8);
    if ((mode \& 9) == 8 | (mode \& 5) == 5)
        x = gzip32\_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if ((mode \& 5) == 4 \mid | (mode \& 9) == 9)
        x = gzip32\_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if ((mode & 3) == 2 || (mode & 17) == 17)
        x = gzip32\_stage(x, 0x44444444, 0x22222222, 1);
    if (mode & 1)
        x = gzip32_flip(x);
    return x;
}
```

Figure 2.4 shows the gzip permutation network with "flip" stages and Figure 2.3 shows the gzip permutation network without "flip" stages.

The **zip** instruction with the upper half of its input cleared performs the commonly needed "fanout" operation. (Equivalent to **bdep** with a 0x55555555 mask.) The **zip** instruction applied twice fans out the bits in the lower quarter of the input word by a spacing of 4 bits.



Figure 2.4: gzip permutation network with "flip" stages

For example, the following code calculates the bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32:

```
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
slli a1, a0, 4
add a0, a1
slli a1, a0, 8
add a0, a1
slli a1, a0, 16
add a0, a1
```

The final prefix sum is stored in the 8 nibbles of the a0 output word.

|   | 31        | 27 26    | 20    | 19                   | 15 | 14     | 12                  | 11                    | 7 | 6         | 0 |
|---|-----------|----------|-------|----------------------|----|--------|---------------------|-----------------------|---|-----------|---|
|   | imm[11:7] | imm      | [6:0] | rs1                  |    | funct3 | $\operatorname{rd}$ |                       |   | opcode    |   |
|   | 5         | ,        | 7     | 5                    |    | 3      |                     | 5                     |   | 7         |   |
|   | ?????     | mo       | ode   | $\operatorname{src}$ |    | GZIP   |                     | $\operatorname{dest}$ |   | OP-IMM    |   |
|   |           |          |       |                      |    |        |                     |                       |   |           |   |
|   | 31        | $25\ 24$ | 20    | 19                   | 15 | 14     | 12                  | 11                    | 7 | 6         | 0 |
|   | imm[11:5] | imm      | [4:0] | rs1                  |    | funct3 |                     | rd                    |   | opcode    |   |
| _ | 7         | ļ        | 5     | 5                    |    | 3      | •                   | 5                     | • | 7         |   |
|   | ???????   | mo       | ode   | $\operatorname{src}$ |    | GZIPW  |                     | $\operatorname{dest}$ |   | OP-IMM-32 |   |

There is no R-type instruction for gzip. It is an I-type only instruction. gzip can use the instruction encoding for "rotate left immediate".

### 2.8 Bit Extract/Deposit (bext, bdep)

This instructions implement the generic bit extract and bit deposit functions. This operation is also referred to as bit gather/scatter, bit pack/unpack, parallel extract/deposit, compress/expand, or right\_compress/right\_expand.

bext collects LSB justified bits to rd from rs1 using extract mask in rs2.

bdep writes LSB justified bits from rs1 to rd using deposit mask in rs2.

```
uint_xlen_t bext(uint_xlen_t rs1, uint_xlen_t rs2)
    uint_xlen_t c = 0, m = 1, mask = rs2;
    while (mask) {
        uint_xlen_t b = mask & -mask;
        if (rs1 & b)
            c \mid = m;
        mask -= b;
        m <<= 1;
    }
    return c;
}
uint_xlen_t bdep(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t c = 0, m = 1, mask = rs2;
    while (mask) {
        uint_xlen_t b = mask & -mask;
        if (rs1 & m)
            c |= b;
        mask -= b;
        m <<= 1;
    }
    return c;
}
```

Implementations may choose to use smaller multi-cycle implementations of bext and bdep, or even emulate the instructions in software.

Even though multi-cycle bext and bdep often are not fast enough to outperform algorithms that use sequences of shifts and bit masks, dedicated instructions for those operations can still be of great advantage in cases where the mask argument is not constant.

For example, the following code efficiently calculates the index of the tenth set bit in a0 using bdep:

li a1, 0x00000200 bdep a0, a1, a0 ctz a0, a0

For cases with a constant mask an optimizing compiler would decide when to use bext or bdep based on the optimization profile for the concrete processor it is optimizing for. This is similar to the decision whether to use MUL or DIV with a constant, or to perform the same operation using a longer sequence of much simpler operations.

| 31 | 25      | 5 24 | 20 19 1               | 15 14  | 12 11 7               | 6 0    |
|----|---------|------|-----------------------|--------|-----------------------|--------|
|    | funct7  | rs2  | rs1                   | funct3 | rd                    | opcode |
|    | 7       | 5    | 5                     | 3      | 5                     | 7      |
|    | ??????? | src2 | $\operatorname{src}1$ | BEXT   | $\operatorname{dest}$ | OP     |
|    | ??????? | src2 | $\operatorname{src}1$ | BDEP   | $\operatorname{dest}$ | OP     |
|    | ??????? | src2 | $\operatorname{src}1$ | BEXTW  | $\operatorname{dest}$ | OP-32  |
|    | ??????? | src2 | $\operatorname{src}1$ | BDEPW  | $\operatorname{dest}$ | OP-32  |

### 2.9 Compressed instructions (c.not, c.neg, c.brev)

The RISC-V ISA has no dedicated instructions for bitwise inverse (not) and arithmetic inverse (neg). Instead not is implemented as xori rd, rs, -1 and neg is implemented as sub rd, x0, rs.

In bitmanipulation code **not** and **neg** are very common operations. But there are no compressed encodings for those operations because there is no **c.xori** instruction and **c.sub** can not operate on **x0**.

Many bit manipulation operations that have dedicated opcodes in other ISAs must be constructed from smaller atoms in RISC-V XBitmanip code. But implementations might choose to implement them in a single micro-op using macro-op-fusion. For this it can be helpful when the fused sequences are short. **not** and **neg** are good candidates for macro-op-fusion, so it can be helpful to have compressed opcodes for them.

Likewise brev (an alias for grevi rd, rs, -1, i.e. bitwise reversal) is also a very common atom for building bit manipulation operations. So it is helpful to have a compressed opcode for this instruction as well.

The compressed instructions c.not, c.neg, c.brev must be supported by all implementations that support the C extension and XBitmanip.

This three instructions fit nicely in the reserved space in C.LUI/C.ADDI16SP. They only occupy 0.1% of the  $\approx 15.6$  bits wide RVC encoding space.

| 15 14 13 | 12        | 11 10             | 9 8               | 7   | 6    | 5 4   | 3    | 2    | 1 ( | 0                                |
|----------|-----------|-------------------|-------------------|-----|------|-------|------|------|-----|----------------------------------|
| 011      | nzimm[9]  |                   | 2                 |     | nzii | mm[4] | 6 8: | 7[5] | 01  | C.ADDI16SP $(RES, nzimm=\theta)$ |
| 011      | nzimm[17] | $\mathrm{rd}_{7}$ | $ \neq \{0, 2\} $ |     | nz   | zimm  | 16:1 | 2]   | 01  | C.LUI (RES, nzimm=0; HINT, rd=0) |
| 011      | 0         | 00                | rs2'/r            | rd' |      | 0     |      |      | 01  | C.NEG                            |
| 011      | 0         | 01                | 01   rs1'/r       |     |      | 0     |      |      | 01  | C.NOT                            |
| 011      | 0         | 10  rs1'/rd'      |                   |     |      | 0     |      |      | 01  | C.BREV                           |
| 011      | 0         | 11                |                   |     |      | 0     |      |      | 01  | Reserved                         |

#### 2.10 Pseudo instructions

#### 2.10.1 Bit-field extract pseudo instruction (bfext)

Extract the continous bit field starting at pos with length len from rs:

If possible, an implementation should fuse this two shift operations in a single macro-op. Some implementors have raised concerns about the lack of a dedicated bit field extract instruction with large immediate, especially for implementations that can not fuse instructions into macro-ops and/or implementations that do not support compressed instructions. See Chapter 5 if you are one of those people.

#### 2.10.2 Pseudo instructions using grevi and gzip

On RV32:

```
brev
                       grevi rd, rs, 31
        rd, rs
                 ->
                                           ; bitwise reverse
                 ->
                       grevi rd, rs, 15
                                           ; reverse bits in each 16 bit half-word
brev.h rd, rs
brev.b rd, rs
                 ->
                       grevi rd, rs, 7
                                           ; reverse bits in each 8 bit byte
bswap
        rd, rs
                 ->
                       grevi rd, rs, 24
                                           ; reverse the byte order
bswap.h rd, rs
                 ->
                       grevi rd, rs,
                                           ; swap bytes in each 16 bit half-word
hswap
        rd, rs
                 ->
                       grevi rd, rs, 16
                                           ; swap the two 16 bit half-words
```

On RV64:

```
brev
                      grevi rd, rs, 63
                                           ; bitwise reverse
        rd, rs
                 ->
brev.w rd, rs
                 ->
                      grevi rd, rs, 31
                                           ; reverse bits in each 32 bit word
brev.h rd, rs
                      grevi rd, rs, 15
                                           ; reverse bits in each 16 bit half-word
                 ->
brev.b rd, rs
                 ->
                      grevi rd, rs, 7
                                           ; reverse bits in each 8 bit byte
```

```
grevi rd, rs, 56 ; reverse the byte order
bswap rd, rs
               ->
               ->
                    grevi rd, rs, 24 ; reverse byte order in each 32 bit word
bswap.w rd, rs
                    grevi rd, rs, 8
bswap.h rd, rs
               ->
                                     ; swap bytes in each 16 bit half-word
hswap
     rd, rs
               -> grevi rd, rs, 48
                                     ; reverse order of 16 bit half-words
hswap.w rd, rs
                    grevi rd, rs, 16
                                      ; swap 16 bit half-words in each 32 bit word
               ->
wswap
      rd, rs -> grevi rd, rs, 32 ; swap the two 32 bit words
```

See Section 2.7 for the list of pseudo-instructions using gzip.

## Chapter 3

## Discussion

### 3.1 Frequently Asked Questions

grev seems to be overly complicated? Do we really need it?

The grev instruction can be used to build a wide range of common bit permutation instructions, such as endianess conversion or bit reversal.

If grev were removed from this spec we would need to add a few new instructions in its place for those operations.

#### Do we really need all the \*W opcodes for 32 bit ops on RV64?

I don't know. I think nobody does know at the moment. But they add very little complexity to the core. So the only question is if it is worth the encoding space. We need to run proper experiments with compilers that support those instructions. So they are in for now and if future evaluations show that they are not worth the encoding space then we can still throw them out.

#### Why only and and not any other complement operators?

Early versions of this spec also included other \*c operators. But experiments<sup>1</sup> have show that andc is much more common in bit manipulation code than any other operators. Especially because it is commonly used in mix and mux operations.

#### Why andc? It can easily be emulated using and and not.

Yes, and we did not include any other ALU+complement operators. But andc is so common (mostly because of the mix and mux patterns), and its implementation is so cheap, that we decided

 $<sup>^{1} \</sup>rm http://svn.clifford.at/handicraft/2017/bitcode/$ 

to dedicate an R-type instruction to the operation.

# The shift-ones instructions can be emulated using not and logical shift? Do we really need it?

Yes, a shift-ones instruction can easily be implemented using the logical shift instructions, with a bitwise invert before and after it. (This is literally the code we are using in the reference C implementation of rotate shift.)

We have decided to include it for now so that we can collect benchmark data before making a final decision on the inclusion or exclusion of those instructions.

#### BEXT/BDEP look like really expensive operations. Do we really need them?

Yes, they are expensive, but not as expensive as one might expect. A single-cycle 32 bit BEXT+BDEP+GREV core can be implemented in less space than a single-cycle 16x16 bit multiplier with 32 bit output.<sup>2</sup>

It is also important to keep in mind that implementing those operations in software is very expensive. Hacker's Delight contains a highly optimized software implementation of 32-bit BEXT that requires > 120 instructions. Their BDEP software implementation requires > 160 instructions. (Please disregard the "hardware-oriented algorithm" described in Hacker's Delight. It is extremely expensive compared to other implementations.<sup>3</sup>)

#### But do we really need 64-bit BEXT/BDEP?

Good question. A 64-bit BEXT/BDEP unit certainly is more than 2x the size of a 32-bit unit and in most cases 32-bit would be sufficient.

However, one solution here would be to still reserve the opcode for 64-bit BEXT/BDEP and leave it to the implementation to decide whether to implement the function in hardware or emulating it using a software trap.

## 3.2 Analysis of used encoding space

So how much encoding space is used by the XBitmanip extension?

<sup>&</sup>lt;sup>2</sup>https://github.com/cliffordwolf/bextdep

<sup>&</sup>lt;sup>3</sup>https://github.com/cliffordwolf/bextdep

| R               | V32      | RV       | 64       | Instruction                                      |
|-----------------|----------|----------|----------|--------------------------------------------------|
| 3x              | 0        | 6x       | 0        | CLZ, CLZW, CTZ, CTZW, PCNT, PCNTW                |
| $\frac{1}{2x}$  | 15<br>15 | 4x<br>2x | 15<br>16 | GREV, GREVW, GREVIW, ZIPW<br>GREVI, ZIP          |
| $\frac{1}{2x}$  | 15<br>15 | 6x<br>2x | 15<br>16 | SLO, SRO, SLOW, SROW, SLOIW, SROIW<br>SLOI, SROI |
| 2x<br>1x        | 15<br>15 | 5x<br>1x | 15<br>16 | ROR, ROL, RORW, ROLW, RORIW<br>RORI              |
| 3x              | 15       | 3x<br>3x | 15<br>15 | ANDC, BEXT, BDEP<br>ANDCW, BEXTW, BDEPW          |
| $\overline{3x}$ | 4        | 3x       | 4        | C.NEG, C.NOT, C.BREV                             |

Table 3.1: XBitmanip encoding space (log2, i.e. in equivalent number of bits)

We do not count any encoding space for the unary instructions clz, clzw, ctz, ctzw, pcnt, and pcntw because they can be implemented in the reserved modes in zip.

The compressed encoding space is  $\approx 15.6$  bits wide.

$$log_2(3 \cdot 2^{14}) \approx 15.585$$

The compressed XBitmanip instructions need the equivalent of a 5.6 bit encoding space, or  $\approx 0.1\%$  of the total  $\approx 15.6$  bits available.

$$log_2(3 \cdot 2^4) \approx 5.585$$
  
 $100/(2^{15.585 - 5.585}) \approx 0.098$ 

On RV32, XBitmanip requires the equivalent of a  $\approx$  18.8 bit encoding space in the uncompressed encoding space. For comparison: A single standard I-type instruction (such as ADDI or SLTIU) requires a 22 bit encoding space. I.e. the entire RV32 XBitmanip extension needs less than one-eighth of the encoding space of the SLTIU instruction.

$$log_2(14 \cdot 2^{15}) \approx 18.807$$

On RV64, XBitmanip requires the equivalent of a  $\approx 20.0$  bit encoding space in the uncompressed encoding space. I.e. the entire RV64 XBitmanip extension needs about one-quarter of the encoding space of the SLTIU instruction.

$$log_2(21 \cdot 2^{15} + 5 \cdot 2^{16}) \approx 19.954$$

## Chapter 4

## **Evaluation**

This chapter contains a collection of short code snippets and algorithms using the XBitmanip extension for evaluation purposes. For the sake of simplicity we assume RV32 for most examples in this chapter.

Most assembler routines in this chapter are written as if they were ABI functions, i.e. arguments are passed in a0, a1, ... and results are returned in a0. Registers t0, t1, ... and a0, a1, ... are used for spilling.

Some of the assembler routines below can not or should not overwrite their first argument. In those cases the arguments are passed in a1, a2, ... and results are returned in a0.

## 4.1 MIX/MUX pattern

A MIX pattern selects bits from a0 and a1 based on the bits in the control word a2.

```
c.and a0, a2
andc a1, a1, a2
c.or a0, a1
```

A MUX operation selects word a0 or a1 based on if the control word a2 is zero or nonzero, without branching.

```
snez a2, a2
c.neg a2
c.and a0, a2
andc a1, a1, a2
c.or a0, a1
```

Or when a2 is already either 0 or 1:

```
c.neg a2
c.and a0, a2
andc a1, a1, a2
c.or a0, a1
```

Alternatively, a core might fuse a conditional branch that just skips one instruction with that instruction to form a fused conditional macro-op.

## 4.2 Bit scanning and counting

Counting leading ones:

```
not a0, a0 clz a0, a0
```

Counting trailing ones:

```
not a0, a0 ctz a0, a0
```

Counting bits cleared:

```
not a0, a0 pcnt a0, a0
```

(This is better than XLEN-pcnt because RISC-V has no "reverse-subtract-immediate" operation, but not can even be implemented in a compressed instruction if rd = rs.)

Odd parity:

```
pcnt a0, a0
andi a0, a0, 1
```

Even parity:

```
pcnt a0, a0
addi a0, a0, 1
andi a0, a0, 1
```

(Using addi here is better than using xori, because there is a compressed opcode for addi but none for xori.)

### 4.3 Arbitrary (data-defined) bit permutations

This section lists code snippets for computing arbitrary bit permutations that are defined by data (as opposed to bit permutations that are known at compile time and can likely be compiled into shift-and-mask operations and/or a few instances of bext/bdep).

#### 4.3.1 Using butterfly operations

The following macro performs a stage-N butterfly operation on the word in a0 using the mask in a1.

```
grevi t0, a0, (1 << N)
and t0, t0, a1
andc a0, a0, a1
or a0, a0, t0
```

The bitmask in a1 must be preformatted correctly for the selected butterfly stage. A butterfly operation only has a XLEN/2 wide control word. The following macros format the mask assuming those XLEN/2 bits in the lower half of a1 on entry (preformatted mask in a1 on exit):

```
bfly_msk_0:
    zip a1, a1
    slli t0, a1, 1
    or a1, a1, t0

bfly_msk_1:
    zip2 a1, a1
    slli t0, a1, 2
    or a1, a1, t0

bfly_msk_2:
    zip4 a1, a1
    slli t0, a1, 4
    or a1, a1, t0
```

A sequence of  $2 \cdot log_2(XLEN) - 1$  butterfly operations can perform any arbitrary bit permutation (Beneš network):

```
butterfly(LOG2_XLEN-1)
butterfly(LOG2_XLEN-2)
...
```

```
butterfly(0)
...
butterfly(LOG2_XLEN-2)
butterfly(LOG2_XLEN-1)
```

Many permutations arising from real-world applications can be implemented using shorter sequences. For example, any sheep-and-goats operation with either the sheep or the goats bit reversed can be implemented in  $log_2(XLEN)$  butterfly operations.

Reversing a permutation implemented using butterfly operations is as simple as reversing the order of butterfly operations.

#### 4.3.2 Using omega-flip networks

The omega operation is a stage-0 butterfly preceded by a zip operation:

```
zip a0, a0
grevi t0, a0, 1
and t0, t0, a1
andc a0, a0, a1
or a0, a0, t0
```

The flip operation is a stage-0 butterfly followed by an unzip operation:

```
grevi t0, a0, 1
and t0, t0, a1
andc a0, a0, a1
or a0, a0, t0
unzip a0, a0
```

A sequence of  $log_2(XLEN)$  omega operations followed by  $log_2(XLEN)$  flip operations can implement any arbitrary 32 bit permutation.

As for butterfly networks, permutations arising from real-world applications can often be implemented using a shorter sequence.

#### 4.3.3 Using baseline networks

Another way of implementing arbitrary 32 bit permutations is using a baseline network followed by an inverse baseline network.

A baseline network is a sequence of  $log_2(XLEN)$  butterfly(0) operations interleaved with unzip operations. For example, a 32-bit baseline network:

```
butterfly(0)
unzip
butterfly(0)
unzip.h
butterfly(0)
unzip.b
butterfly(0)
unzip.n
butterfly(0)
```

An inverse baseline network is a sequence of  $log_2(XLEN)$  butterfly(0) operations interleaved with zip operations. The order is opposite to the order in a baseline network. For example, a 32-bit inverse baseline network:

```
butterfly(0)
zip.n
butterfly(0)
zip.b
butterfly(0)
zip.h
butterfly(0)
zip
butterfly(0)
```

A baseline network followed by an inverse baseline network can implement any arbitrary bit permutation.

#### 4.3.4 Using sheep-and-goats

The Sheep-and-goats (SAG) operation is a common operation for bit permutations. It moves all the bits selected by a mask (goats) to the LSB end of the word and all the remaining bits (sheep) to the MSB end of the word, without changing the order of sheep or goats.

The SAG operation can easily be performed using bext (data in a0 and mask in a1):

```
bext t0, a0, a1
not a1, a1
bext a0, a0, a1
pcnt a1, a1
ror a0, a0, a1
or a0, a0, t0
```

Any arbitrary bit permutation can be implemented in  $log_2(XLEN)$  SAG operations.

The Hacker's Delight describes an optimized standard C implementation of the SAG operation. Their algorithm takes 254 instructions (for 32 bit) or 340 instructions (for 64 bit) on their reference RISC instruction set. [2, p. 152f, 162f]

## 4.4 Emulating x86 Bit Manipulation ISAs

The following code snippets implement all instructions from the x86 bit manipulation ISA extensions ABM, BMI1, BMI2, and TBM using RISC-V code that does not spill any registers and thus could easily be implemented in a single instruction using macro-op fusion. (Some of them simply map directly to instructions in this spec and so no macro-op fusion is needed.) Note that shorter RISC-V code sequences are possible if we allow spilling to temporary registers.

Table 4.1: Emulating other Bit Manipulation ISAs using macro-op fusion

| x86 Ext | x86 Instruction           | $_{\mathrm{By}}$ | $	ext{tes}$ | RISC-V Code               |
|---------|---------------------------|------------------|-------------|---------------------------|
|         |                           | x86              | RV          |                           |
| ABM     | popcnt                    | 5                | 4           | pcnt a0, a0               |
|         | lzcnt                     | 5                | 4           | clz a0, a0                |
| BMI1    | andn                      | 5                | 4           | andc a0, a2, a1           |
|         | bextr (regs) <sup>1</sup> | 5                | 12          | c.add a0, a1              |
|         |                           |                  |             | slo a0, zero, a0          |
|         |                           |                  |             | c.and a0, a2              |
|         |                           |                  |             | srl a0, a0, a1            |
|         | blsi                      | 5                | 6           | neg a0, a1                |
|         |                           |                  |             | c.and a0, a1              |
|         | blsmsk                    | 5                | 6           | addi a0, a1, -1           |
|         |                           |                  |             | c.xor a0, a1              |
|         | blsr                      | 5                | 6           | addi a0, a1, -1           |
|         |                           |                  |             | c.and a0, a1              |
| BMI2    | bzhi                      | 5                | 6           | slo a0, zero, a2          |
|         |                           |                  |             | c.and a0, a1              |
|         | $\mathtt{mulx}^2$         | 5                | 4           | mul                       |
|         | pdep                      | 5                | 4           | bdep                      |
|         | pext                      | 5                | 4           | bext                      |
|         | $rorx^2$                  | 6                | 4           | rori                      |
|         | $\mathtt{sarx}^2$         | 5                | 4           | sra                       |
|         | $\mathtt{shrx}^2$         | 5                | 4           | srl                       |
|         | $shlx^2$                  | 5                | 4           | sll                       |
| TBM     | bextr (imm)               | 7                | 4           | c.slli a0, (32-START-LEN) |
|         |                           |                  |             | c.srli aO, (32-LEN)       |

<sup>&</sup>lt;sup>1</sup> The BMI1 bextr instruction expects the length and start position packed in one register operand. Our version expects the length in a0, start position in a1, and source value in a2.

 $<sup>^2</sup>$  The \*x BMI2 is nstructions just perform the indicated operation without changing any flags. RISC-V does not use flags, so this instructions trivially just map to their regular RISC-V counterparts.

| x86 Ext | x86 Instruction | Ву  | tes | RISC-V Code     |  |  |  |  |  |  |
|---------|-----------------|-----|-----|-----------------|--|--|--|--|--|--|
|         |                 | x86 | RV  |                 |  |  |  |  |  |  |
|         | blcfill         | 5   | 6   | addi a0, a1, 1  |  |  |  |  |  |  |
|         |                 |     |     | c.and a0, a1    |  |  |  |  |  |  |
|         | blci            | 5   | 8   | addi a0, a1, 1  |  |  |  |  |  |  |
|         |                 |     |     | c.not a0        |  |  |  |  |  |  |
|         |                 |     |     | c.or a0, a1     |  |  |  |  |  |  |
|         | blcic           | 5   | 10  | addi a0, a1, 1  |  |  |  |  |  |  |
|         |                 |     |     | andc a0, a1, a0 |  |  |  |  |  |  |
|         |                 |     |     | c.not a0        |  |  |  |  |  |  |
|         | blcmsk          | 5   | 6   | addi a0, a1, 1  |  |  |  |  |  |  |
|         |                 |     |     | c.xor a0, a1    |  |  |  |  |  |  |
|         | blcs            | 5   | 6   | addi a0, a1, 1  |  |  |  |  |  |  |
|         |                 |     |     | c.or a0, a1     |  |  |  |  |  |  |
|         | blsfill         | 5   | 6   | addi a0, a1, -1 |  |  |  |  |  |  |
|         |                 |     |     | c.or a0, a1     |  |  |  |  |  |  |
|         | blsic           | 5   | 10  | addi a0, a1, -1 |  |  |  |  |  |  |
|         |                 |     |     | andc a0, a1, a0 |  |  |  |  |  |  |
|         |                 |     |     | c.not a0        |  |  |  |  |  |  |
|         | t1mskc          | 5   | 10  | addi a0, a1, +1 |  |  |  |  |  |  |
|         |                 |     |     | andc a0, a1, a0 |  |  |  |  |  |  |
|         |                 |     |     | c.not a0        |  |  |  |  |  |  |
|         | t1msk           | 5   | 8   | addi a0, a1, -1 |  |  |  |  |  |  |
|         |                 |     |     | andc a0, a0, a1 |  |  |  |  |  |  |

There will be a separate RISC-V standard for recommended sequences for macro-op fusion. The macros listed here are merely for demonstrating that suitable sequences exist. We do not advocate for any of those sequences to become "standard sequences" for macro-op fusion.

## 4.5 Emulating RI5CY Bit Manipulation ISA

TBD

## 4.6 Decoding RISC-V Immediates

The following code snippets decode and sign-extend the immedate from RISC-V S-type, B-type, J-type, and CJ-type instructions. They are nice "nothing up my sleeve"-examples for real-world bit permutations.

| 31  | 27       | 26   | 25  | 24      | 20        | 19   | 15 | 14 | 12 | 11    | 7       | 6 | 0 |        |
|-----|----------|------|-----|---------|-----------|------|----|----|----|-------|---------|---|---|--------|
| im  | m[11:    | 5]   |     |         |           |      |    |    |    | imm   | n[4:0]  |   |   | S-type |
| imn | n[12 10] | ):5] |     |         |           |      |    |    |    | imm[4 | 4:1 11] |   |   | B-type |
|     |          |      | imn | n[20 1] | 0:1 11 19 | 9:12 |    |    |    |       |         |   |   | J-type |

```
decode_s:
                                             bdep a1, a1, t2
  li t0, 0xfe000f80
                                             rori a0, a0, 11
                                             bext a0, a0, t0
  bext a0, a0, t0
  c.slli a0, 20
                                             bdep a0, a0, t3
  c.srai a0, 20
                                             c.or a0, a1
                                             c.srai a0, 20
  ret
                                             ret
decode_b:
  li t0, 0xeaa800aa
                                           // variant 2 (without XBitmanip)
  rori a0, a0, 8
                                           decode_cj:
  grevi a0, a0, 8
                                             srli a5, a0, 2
  gzip a0, a0, 14
                                             srli a4, a0, 7
  bext a0, a0, t0
                                             c.andi a4, 16
                                             slli a3, a0, 3
  c.slli a0, 20
  c.srai a0, 19
                                             c.andi a5, 14
                                             c.add a5, a4
  ret
                                             andi a3, a3, 32
decode_j:
                                             srli a4, a0, 1
  li t0, 0x800003ff
                                             c.add a5, a3
                                             andi a4, a4, 64
  li t1, 0x800ff000
                                             slli a2, a0, 1
  bext a1, a0, t1
  c.slli a1, 23
                                             c.add a5, a4
  rori a0, a0, 21
                                             andi a2, a2, 128
  bext a0, a0, t0
                                             srli a3, a0, 1
  c.slli a0, 12
                                             slli a4, a0, 19
  c.or a0, a1
                                             c.add a5, a2
                                             andi a3, a3, 768
  c.srai a0, 11
  ret
                                             c.slli a0, 2
                                             c.add a5, a3
// variant 1 (with XBitmanip)
                                             andi a0, a0, 1024
decode_cj:
                                             c.srai a4, 31
  li t0, 0x28800001
                                             c.add a5, a0
  li t1, 0x000016b8
                                             slli a0, a4, 11
  li t2, 0xb4e00000
                                             c.add a0, a5
  li t3, 0x4b000000
                                             ret
  bext a1, a0, t1
```

## Chapter 5

## **XBitfield Extension**

The most asked-for feature in XBitmanip is an instruction for bit field extract. XBitmanip recommends the usage of slli followed by slri for this operation and asks implementors to fuse this sequence into a single macro-op. However, there are valid concerns regarding this approach:

- Not all processors that might want to implement a fast bitfield extract can fuse macro-ops.
- In some architectures it might be hard to fuse the two shift instructions if one of them is using a 32-bit instruction encoding. Therefore the instruction can not be fused if  $rd \neq rs1$ .

Therefore this chapter proposes the RISC-V XBitfield extension. Note that, unlike XBitmanip, we do <u>not</u> propose XBitfield for adoption as official RISC-V ISA extension.

The encoding for a proper 32-bit bitfield extract instruction, if encoded in a clean and future-safe way that scales all the way to RV128, would require a 14 bit immediate (7 bits for start and 7 bits for length), or the equivalent of 4 I-type instruction with full immediate length, or half of a major opcode, way too wasteful for an official RISC-V ISA extension, especially for something that can be encoded in the same space with two compressed instructions.

However, there is a tendency for implementors to fill the unused instruction encoding space with a myriad of instructions with large immediates. [3]

So we propose XBitfield, a custom, non-standard extension that uses the entire major opcode 1111011 (custom-3) for a single bit-field extract and place (bfxp) instruction, effectively packing 3 shift operations in one single instruction.

The goal here is that individual implementors can easily share patches for compiler toolchains. By standardising one powerful (but easy to implement) instruction instead of a whole range of simpler instructions we keep the complexity of the extension low. This extension might be of interest to every implementor who is building a product with focus on bit manipulation code and is not otherwise using the *custom-3* encoding space and does not want to spend a lot of hardware resources on making use of that instruction encoding space.

### 5.1 Bit-field extract and place (bfxp)

The bit-field extract and place instruction extracts a bitfield of length len starting at bit position start, and places it at the new offset dest. The other bits of the result are set to zero.

The immediate arguments start, len, and dest are unsigned 5 bit immediates. On RV64 and RV128 len = 0 encodes for len = 32.

The instruction returns 0 if start + len > XLEN or dest + len > XLEN.

```
uint_xlen_t bfxp(uint_xlen_t rs1, unsigned start, unsigned len, unsigned dest)
{
   assert(start < 32 && len < 32 && dest < 32);

   if (XLEN > 32 && len == 0)
        len = 32;

   if (start + len > XLEN || dest + len > XLEN || len == 0)
        return 0;

   uint_xlen_t x = rs1;
   x <<= XLEN-start-len;
   x >>= XLEN-len;
   x <<= dest;

   return x;
}</pre>
```

The assembler mnemonic for bfxp is as follows. (dest = 0 is assumed when dest is omitted.)

```
bfxp rd, rs, start, len[, dest]
```

bfxp occupies the *custom-3* major opcode:

| 31    | 27 26 |     | 22 | 21 20     | 19 |     | 15 | 14   | 12    | 11 | 7 6 |         | 0 |
|-------|-------|-----|----|-----------|----|-----|----|------|-------|----|-----|---------|---|
| start |       | len |    | dest[4:3] |    | rs1 |    | dest | [2:0] | rd |     | 1111011 |   |
| 5     | •     | 5   | ·  | 2         | •  | 5   |    | 3    |       | 5  |     | 7       |   |

## 5.2 Evaluation: Decoding RISC-V Immediates

The following code snippets decode and sign-extend the immedate from RISC-V S-type, B-type, J-type, and CJ-type instructions. They are nice "nothing up my sleeve"-examples for real-world bit permutations.

| 31 | 27       | 26   | 25  | 24       | 20       | 19    | 15 | 14 | 12 | 11   | 7       | 6 | 0 |        |
|----|----------|------|-----|----------|----------|-------|----|----|----|------|---------|---|---|--------|
| i  | mm[11:   | 5]   |     |          |          |       |    |    |    | imn  | n[4:0]  |   |   | S-type |
| im | m[12 10] | 0:5] |     |          |          |       |    |    |    | imm[ | 4:1 11] |   |   | B-type |
|    |          | j    | imn | n[20 10] | 0:1 11 1 | 9:12] |    |    |    |      |         |   |   | J-type |

| 15 | 14 | 13 | 12 |    |                          |     | 8     | •  |           | -     | _    | 3 | 2 | 1 | 0 |         |
|----|----|----|----|----|--------------------------|-----|-------|----|-----------|-------|------|---|---|---|---|---------|
|    |    |    |    | iı | $\overline{\mathrm{nm}}$ | [11 | 4 9:8 | 10 | $ 6 ^{7}$ | 7 3:1 | L[5] |   |   |   |   | CJ-type |

```
decode_s:
                                            c.or a0, a2
 bxfp a1, a0, 7, 5, 20
                                            c.or a0, a3
 bxfp a0, a0, 25, 5, 25
                                            c.srai a0, 11
 c.or a0, a1
                                            ret
  c.srai a0, 20
  ret
                                           decode_cj:
                                            bxfp a1, a0, 11, 1, 24
decode_b:
                                            bxfp a2, a0, 9, 2, 28
  bxfp a1, a0, 7, 1, 30
                                            bxfp a3, a0, 8, 1, 30
  bxfp a2, a0, 25, 4, 24
                                            c.or a1, a2
  bxfp a3, a0, 8, 4, 20
                                            c.or a1, a3
 bxfp a0, a0, 31, 1, 31
                                            bxfp a2, a0, 7, 1, 26
                                            bxfp a3, a0, 6, 1, 27
 c.or a0, a1
  c.or a0, a2
                                            bxfp a4, a0, 3, 3, 21
                                            bxfp a5, a0, 2, 1, 25
  c.or a0, a3
  c.srai a0, 19
                                            bxfp a0, a0, 12, 1, 31
                                            c.or a0, a1
  ret
                                            c.or a0, a2
decode_j:
                                            c.or a0, a3
  bxfp a1, a0, 21, 12, 11
                                            c.or a0, a4
 bxfp a2, a0, 20, 1, 22
                                            c.or a0, a5
 bxfp a3, a0, 12, 8, 23
                                            c.srai a0, 20
 bxfp a0, a0, 31, 1, 31
                                            ret
  c.or a0, a1
```

# Change History

Table 5.1: Summary of Changes

| Date       | Rev  | Changes                                                     |
|------------|------|-------------------------------------------------------------|
| 2017-07-17 | 0.10 | Initial Draft                                               |
| 2017-11-02 | 0.11 | Removed roli, assembler can convert it to use a rori        |
|            |      | Removed bitwise subset and replaced with andc               |
|            |      | Doc source text same base for study and spec.               |
|            |      | Fixed typos                                                 |
| 2017-11-30 | 0.32 | Jump rev number to be on par with associated Study          |
|            |      | Moved pdep/pext into spec draft and called it scattergather |
| 2018-04-07 | 0.33 | Move to github, throw out study, convert from .md to .tex   |
|            |      | Fixed typos and fixed some reference C implementations      |
|            |      | Rename bgat/bsca to bext/bdep                               |
|            |      | Remove post-add immediate from clz                          |
|            |      | Clean up encoding tables and code sections                  |
| 2018-04-20 | 0.34 | Add GREV, CTZ, and compressed instructions                  |
|            |      | Restructure document: Move discussions to extra sections    |
|            |      | Add FAQ, add analysis of used encoding space                |
|            |      | Add Pseudo-Ops, Macros, Algorithms                          |
|            |      | Add Generalized Bit Permutations (shuffle)                  |
| ????-??-?? | 0.35 | Replace shuffle with generalized zip (gzip)                 |
|            |      | Add additional XBitfield Extension                          |

# **Bibliography**

- [1] Bit scanning equivalencies. https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/. Accessed: 2017-04-24.
- [2] Sean Eron Anderson. Bit twiddling hacks. http://graphics.stanford.edu/~seander/bithacks.html. Accessed: 2017-04-24.
- [3] Pasquale Davide Schiavone Andreas Traber, Michael Gautschi. Ri5cy: User manual. https://pulp-platform.org//wp-content/uploads/2017/11/ri5cy\_user\_manual.pdf. Accessed: 2017-04-26.
- [4] Henry S. Warren. Hacker's Delight. Addison-Wesley Professional, 2nd edition, 2012.
- [5] Wikipedia. Hamming weight. https://en.wikipedia.org/wiki/Hamming\_weight. Accessed: 2017-04-24.